
Hive Complete Guide

Updated: Mar 20, 2020


Sections :


1 : Introduction & Basic Commands

2 : Functions in Hive

3 : Bucketing and Partitioning

4 : Joins

5 : Views and Indexes

6 : UDF

7 : Table Properties (ACID as well)

8 : File Formats and Loading Data

9 : Questions



1 - Introduction & Basic Commands :



  • Hive is a SQL-like querying tool for data stored in HDFS and other file systems that integrate with Hadoop.

  • Hive is not a database.

  • It can only process structured data stored in tables.

  • Developed by Facebook and now an Apache project.

  • Hive is efficient for batch processing, where the amount of data is large and there are no real-time data availability requirements.

  • Supports different file formats, e.g. Text, Parquet, SequenceFile, Avro, etc.

  • Most of the time a MapReduce job runs in the background as soon as we run a query in Hive. Hive was created to reduce the complexity of writing MapReduce programs.

  • Easy for non-Java analysts.

  • Hive is built on a write-once, read-many concept, whereas an RDBMS is based on write-many, read-many.




HIVE ARCHITECTURE




BASIC COMMANDS


1. Create Database & Table :


  • A database in Hive just stores metadata.

  • Table creation is also just creation of metadata for the table.

  • Every table in a database is stored as a sub-directory of the schema's directory.

  • The default warehouse directory is /user/hive/warehouse.


Commands :




hive> create database test1;
OK
Time taken: 1.297 seconds
## IF NOT EXISTS avoids an error in case the database already exists. 

hive> create database if not exists test1;
OK
Time taken: 0.044 seconds

hive> create database test1;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database test1 already exists

hive> describe database test1;
OK
test1           location/in/test        hadoop  USER
Time taken: 0.144 seconds, Fetched: 1 row(s)

## As a good coding practice, add a comment when creating a database, and record creator details, creation date, etc. as db properties. 

hive> create database if not exists test2 comment "This is a test database";
OK
Time taken: 0.496 seconds

hive> desc database test2;
OK
test2   This is a test database location/in/test        hadoop  USER
Time taken: 0.028 seconds, Fetched: 1 row(s)

hive> describe database extended test2;
OK
test2   This is a test database location/in/test        hadoop  USER
Time taken: 0.027 seconds, Fetched: 1 row(s)

hive> create database if not exists test3 with dbproperties ('creator'='test' ,'date' = '2020-01-02' );
OK
Time taken: 0.528 seconds

hive> describe database extended test3;
OK
test3           location/in/test        hadoop  USER    {creator=test, date=2020-01-02}
Time taken: 0.028 seconds, Fetched: 1 row(s)

## Command to display databases whose names start with 'test'.
hive> show databases like 'test*';
OK
test1
test2
test3
Time taken: 0.021 seconds, Fetched: 3 row(s)
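
The databases above all land under the default warehouse path. As a minimal sketch (syntax only, not run in this session; the test4 name and path are placeholders), a database can also be placed at an explicit location, and later removed together with its tables using CASCADE:

-- Sketch: create a database at an explicit location instead of the default warehouse path.
create database if not exists test4 location 's3://data/hive_warehouse/test4_custom/';
-- DROP DATABASE fails if the database still contains tables unless CASCADE is added.
drop database if exists test4 cascade;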


Tables created in Hive are of 2 types :


a : Internal Tables : By default, all tables created in Hive are internal (managed) tables unless specified otherwise. For these tables Hive is responsible for both metadata and data, which means that if we drop the table, both metadata and data are lost.


b : External Tables : They are created with the EXTERNAL keyword. Hive is only responsible for the table's metadata, i.e. if the table is dropped, the data remains in HDFS/S3 etc.
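
The session below only creates managed tables, so here is a minimal sketch of the external form (syntax only, not run in this session; the table name table1_ext is just illustrative, and the location matches the table1_ext_details directory that is listed with hadoop fs -ls later in this section):

-- Sketch: EXTERNAL keyword plus a LOCATION pointing at data Hive should not own.
create external table if not exists table1_ext (col1 string, col2 array<string>, col3 string, col4 int)
row format delimited fields terminated by ',' collection items terminated by ':' lines terminated by '\n'
stored as textfile
location 's3://data/hive_warehouse/table1_ext_details/';

Dropping table1_ext would remove only its metadata; the files under table1_ext_details/ would stay in place.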



## Before creating a table, switch to the target schema using the use command. 
hive> use test2;
OK
Time taken: 0.037 seconds

## Create a table with string, array and int data types that will accept data from a file with comma-separated columns and newline-separated rows. 
hive> create table if not exists table1 (col1 string , col2 array <string>, col3 string , col4 int) row format delimited fields terminated by ',' collection items terminated by ':' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.885 seconds
hive> describe table1;
OK
col1                    string
col2                    array<string>
col3                    string
col4                    int
Time taken: 0.104 seconds, Fetched: 4 row(s)

## Get details for a table using the show create table command. 
hive> show create table table1;
OK
CREATE TABLE `table1`(
  `col1` string,
  `col2` array<string>,
  `col3` string,
  `col4` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'colelction.delim'=':',
  'field.delim'=',',
  'line.delim'='\n',
  'serialization.format'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://data/hive_warehouse/test2.db/table1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1580491562')
Time taken: 0.146 seconds, Fetched: 25 row(s)

## We can create a table without explicitly giving storage details; defaults are used, i.e. columns separated by \u0001 (Ctrl+A), rows by \n, and the table is created in the schema sub-directory. 
hive> create table if not exists table2 (col1 string , col2 array <string>, col3 string , col4 int) ;
OK
Time taken: 0.688 seconds
hive> show create table table2;
OK
CREATE TABLE `table2`(
  `col1` string,
  `col2` array<string>,
  `col3` string,
  `col4` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://data/hive_warehouse/test2.db/table2'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1580491653')
Time taken: 0.104 seconds, Fetched: 20 row(s)

## Create a table whose location is an S3/HDFS directory. 
hive> create table if not exists table3 (col1 string , col2 array <string>, col3 string , col4 int) row format delimited fields terminated by ',' collection items terminated by ':' lines terminated by '\n' stored as textfile location 's3://data/hive_warehouse/table_3_details/';
OK
Time taken: 2.084 seconds
hive> desc table3;
OK
col1                    string
col2                    array<string>
col3                    string
col4                    int
Time taken: 0.103 seconds, Fetched: 4 row(s)
hive> show create table table3;
OK
CREATE TABLE `table3`(
  `col1` string,
  `col2` array<string>,
  `col3` string,
  `col4` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'colelction.delim'=':',
  'field.delim'=',',
  'line.delim'='\n',
  'serialization.format'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://data/hive_warehouse/table_3_details'
TBLPROPERTIES (
  'transient_lastDdlTime'='1580558814')
Time taken: 0.105 seconds, Fetched: 20 row(s)

## Loading data from a file on the local file system (OS) into a Hive table. Make sure the data is compatible with the defined schema, otherwise NULL/blank values will be stored, since Hive is schema-on-read. 
hive> load data local inpath '/home/hadoop/table1.txt' into table table1;
Loading data to table test2.table1
OK
Time taken: 1.648 seconds
hive> select * from table1;
OK
499     ["Poole","GBR"] England 141000
501     ["Blackburn","GBR"]     England 140000
500     ["Bolton","GBR"]        England 139020
502     ["Newport","GBR"]       Wales   139000
503     ["PrestON","GBR"]       England 135000
504     ["Stockport","GBR"]     England 132813
Time taken: 0.125 seconds, Fetched: 6 row(s)

## Loading into the table again (without OVERWRITE) always appends data. 
hive> load data local inpath '/home/hadoop/table1.txt' into table table1;
Loading data to table test2.table1
OK
Time taken: 1.027 seconds
hive> select * from table1;
OK
499     ["Poole","GBR"] England 141000
501     ["Blackburn","GBR"]     England 140000
500     ["Bolton","GBR"]        England 139020
502     ["Newport","GBR"]       Wales   139000
503     ["PrestON","GBR"]       England 135000
504     ["Stockport","GBR"]     England 132813
499     ["Poole","GBR"] England 141000
501     ["Blackburn","GBR"]     England 140000
500     ["Bolton","GBR"]        England 139020
502     ["Newport","GBR"]       Wales   139000
503     ["PrestON","GBR"]       England 135000
504     ["Stockport","GBR"]     England 132813
Time taken: 0.12 seconds, Fetched: 12 row(s)

## OVERWRITE deletes the existing data and then loads the new data into the table.
hive> load data local inpath '/home/hadoop/table1.txt' overwrite into table table1;
Loading data to table test2.table1
OK
Time taken: 1.167 seconds
hive> select * from table1;
OK
499     ["Poole","GBR"] England 141000
501     ["Blackburn","GBR"]     England 140000
500     ["Bolton","GBR"]        England 139020
502     ["Newport","GBR"]       Wales   139000
503     ["PrestON","GBR"]       England 135000
504     ["Stockport","GBR"]     England 132813
Time taken: 0.124 seconds, Fetched: 6 row(s)

## To check whether a file exists on HDFS/S3 from the Hive CLI, just prefix the shell command with a ! sign.
hive> !hadoop fs -ls s3://data/hive_warehouse/table1_ext_details/ ;
Found 1 items
-rw-rw-rw-   1        185 2020-02-01 12:12 s3://data/hive_warehouse/table1_ext_details/table1.txt

## Insert data into a table from another table. A where condition can be added as well (see the sketch after this example). 
hive> insert into table_test select * from table1;
Query ID = hadoop_20200201121721_28d407c1-1d45-4195-a628-162db09fc475
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.

Loading data to table test2.table_test
OK
Time taken: 15.622 seconds
hive> select * from table_test;
OK
499     ["Poole","GBR"] England 141000
501     ["Blackburn","GBR"]     England 140000
500     ["Bolton","GBR"]        England 139020
502     ["Newport","GBR"]       Wales   139000
503     ["PrestON","GBR"]       England 135000
504     ["Stockport","GBR"]     England 132813
Time taken: 0.117 seconds, Fetched: 6 row(s)
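
As mentioned above, a WHERE condition can be added to the insert-select. A minimal sketch against the same tables (not run in this session; the 139000 threshold is arbitrary):

-- Sketch: copy only the rows whose col4 value exceeds a threshold.
insert into table_test select * from table1 where col4 > 139000;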


## Multi-Insert from one table 

hive> create table developer_tab (id int,name string,desg string) stored as textfile;
OK
Time taken: 0.726 seconds
hive> create table manager_tab (id int,name string,desg string) stored as textfile;
OK
Time taken: 0.633 seconds

hive> create table emp_tab(col1 int,col2 string,col3 string,col4 int,col5 int ,col6 int,col7 string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.679 seconds
hive> load data local inpath '/home/hadoop/emp.txt' into table emp_tab;
Loading data to table test2.emp_tab
OK
Time taken: 0.994 seconds
hive> select * from emp_tab;
OK
1281    Shawn   Architect       7890    1481    10      IXZ
1381    Jacob   Admin   4560    1481    20      POI
1481    flink   Mgr     9580    1681    10      IXZ
1581    Richard Developer       1000    1681    40      LKJ
1681    Mira    Mgr     5098    1481    10      IKZ
1781    John    Developer       6500    1681    10      IXZ
Time taken: 0.116 seconds, Fetched: 6 row(s)

## Inserting data into 2 tables from 1 table in a single pass. 
hive> from emp_tab insert into table developer_tab select col1,col2,col3 where col3 ='Developer' insert into table manager_tab select col1,col2,col3 where col3='Mgr';
Query ID = hadoop_20200201122251_0046f323-da32-43fd-9a94-14da6fd75f8d
Total jobs = 1
Launching Job 1 out of 1

Loading data to table test2.developer_tab
Loading data to table test2.manager_tab
OK
Time taken: 12.56 seconds
hive> select * from manager_tab;
OK
1481    flink   Mgr
1681    Mira    Mgr
Time taken: 0.118 seconds, Fetched: 2 row(s)

hive> select * from developer_tab;
OK
1581    Richard Developer
1781    John    Developer
Time taken: 0.114 seconds, Fetched: 2 row(s)
hive>


ALTER TABLE COMMANDS :


  • It is possible to change the position of a column, but only after it has been added to the table, using CHANGE COLUMN.



hive> desc emp_tab;
OK
col1                    int
col2                    string
col3                    string
col4                    int
col5                    int
col6                    int
col7                    string
Time taken: 0.1 seconds, Fetched: 7 row(s)

## By default a new column is added at the end and contains NULL values for existing rows. 
hive> alter table emp_tab add columns (col8 int);
OK
Time taken: 0.422 seconds
hive> desc emp_tab;
OK
col1                    int
col2                    string
col3                    string
col4                    int
col5                    int
col6                    int
col7                    string
col8                    int
Time taken: 0.101 seconds, Fetched: 8 row(s)

hive> select * from emp_tab;
OK
1281    Shawn   Architect       7890    1481    10      IXZ     NULL
1381    Jacob   Admin   4560    1481    20      POI     NULL
1481    flink   Mgr     9580    1681    10      IXZ     NULL
1581    Richard Developer       1000    1681    40      LKJ     NULL
1681    Mira    Mgr     5098    1481    10      IKZ     NULL
1781    John    Developer       6500    1681    10      IXZ     NULL
Time taken: 0.116 seconds, Fetched: 6 row(s)


## Command to move col8 to the beginning. It fails with a data-type error: moving col8 to the first position shifts every other column by one, and the shifted positions have incompatible data types. Be careful with these commands. 
hive> alter table emp_tab change column col8 col8 string first;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
col1,col6
hive> alter table emp_tab change column col8 col8 int after col5;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
col6
hive> alter table emp_tab add columns (col9 int);
OK
Time taken: 0.413 seconds
hive> desc emp_tab;
OK
col1                    int
col2                    string
col3                    string
col4                    int
col5                    int
col6                    int
col7                    string
col8                    int
col9                    int
Time taken: 0.113 seconds, Fetched: 9 row(s)
hive> alter table emp_tab change column col9 col9 int after col7;
OK
Time taken: 0.462 seconds

hive> desc emp_tab;
OK
col1                    int
col2                    string
col3                    string
col4                    int
col5                    int
col6                    int
col7                    string
col9                    int
col8                    int
Time taken: 0.154 seconds, Fetched: 9 row(s)

hive> select * from emp_tab;
OK
1281    Shawn   Architect       7890    1481    10      IXZ     NULL    NULL
1381    Jacob   Admin   4560    1481    20      POI     NULL    NULL
1481    flink   Mgr     9580    1681    10      IXZ     NULL    NULL
1581    Richard Developer       1000    1681    40      LKJ     NULL    NULL
1681    Mira    Mgr     5098    1481    10      IKZ     NULL    NULL
1781    John    Developer       6500    1681    10      IXZ     NULL    NULL
Time taken: 0.116 seconds, Fetched: 6 row(s)

## Renaming the table. 
hive> alter table emp_tab rename to employee;
OK
Time taken: 0.73 seconds
hive> desc emp_tab;
FAILED: SemanticException [Error 10001]: Table not found emp_tab
hive> desc employee;
OK
col1                    int
col2                    string
col3                    string
col4                    int
col5                    int
col6                    int
col7                    string
col9                    int
col8                    int
Time taken: 0.111 seconds, Fetched: 9 row(s)

## Replacing the column list: old columns are removed and only the listed columns are kept. 
hive> alter table employee replace columns(col9 int, name string);
OK
Time taken: 0.415 seconds
hive> desc employee;
OK
col9                    int
name                    string
Time taken: 0.098 seconds, Fetched: 2 row(s)

hive> select * from employee;
OK
1281    Shawn
1381    Jacob
1481    flink
1581    Richard
1681    Mira
1781    John
Time taken: 0.115 seconds, Fetched: 6 row(s)

## Details regarding a table 
hive> describe extended employee;
OK
col9                    int
name                    string

Detailed Table Information      Table(tableName:employee, dbName:test2, owner:hadoop, createTime:1580559746, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col9, type:int, comment:null), FieldSchema(name:name, type:string, comment:null)], location:s3://data/hive_warehouse/test2.db/employee, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, line.delim=
, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{last_modified_time=1580560767, totalSize=212, numRows=0, rawDataSize=0, numFiles=1, transient_lastDdlTime=1580560767, last_modified_by=hadoop}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.142 seconds, Fetched: 5 row(s)

## Setting table properties . 
hive> alter table employee set tblproperties('auto.purge'='true');
OK
Time taken: 0.375 seconds
hive> describe extended employee;
OK
col9                    int
name                    string

Detailed Table Information      Table(tableName:employee, dbName:test2, owner:hadoop, createTime:1580559746, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col9, type:int, comment:null), FieldSchema(name:name, type:string, comment:null)], location:s3://data/hive_warehouse/test2.db/employee, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, line.delim=
, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{last_modified_time=1580560876, totalSize=212, numRows=0, rawDataSize=0, numFiles=1, transient_lastDdlTime=1580560876, auto.purge=true, last_modified_by=hadoop}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.119 seconds, Fetched: 5 row(s)


hive> alter table employee set fileformat avro;
OK
Time taken: 0.408 seconds
hive> describe extended employee;
OK
col9                    int
name                    string

Detailed Table Information      Table(tableName:employee, dbName:test2, owner:hadoop, createTime:1580559746, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col9, type:int, comment:null), FieldSchema(name:name, type:string, comment:null)], location:s3://data/hive_warehouse/test2.db/employee, inputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.avro.AvroSerDe, parameters:{serialization.format=,, line.delim=
, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{last_modified_time=1580560936, totalSize=212, numRows=0, rawDataSize=0, numFiles=1, transient_lastDdlTime=1580560936, auto.purge=true, last_modified_by=hadoop}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.116 seconds, Fetched: 5 row(s)
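
Note that alter table ... set fileformat changes only the table's metadata; the existing data files are not converted, so they may no longer be readable until the data is rewritten in the new format. Also, the describe extended output above is dense; describe formatted prints the same storage, SerDe and property details in a more readable tabular layout (shown only as a sketch, output omitted):

-- Sketch: tabular variant of describe extended.
describe formatted employee;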



ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY


ORDER BY sorts the complete data set, but involves a lot of shuffling since the final sort is performed by a single reducer.





hive> select * from s order by col2 ;
Query ID = hadoop_20200201124712_51a598d6-a6fd-495b-aecc-b0c3b74f93eb
Total jobs = 1
Launching Job 1 out of 1

OK
        NULL
aa      1
bb      1
teh     1
sr      1
ef      1
dth     1
dth     1
yj      1
cv      1
jfy     2
vyj     2
rht     2
vhk     2
gj      2
ub      2
ef      2
h       3
bjk     4
l       4
dth     4
hyj     4
bj      4
j       5
kf      5
g       5
h       5
dfh     5
dd      5
fh      6
s       6
dth     12
Time taken: 6.333 seconds, Fetched: 32 row(s)



SORT BY sorts within each reducer; the final result is not a complete sort but the union of each reducer's sorted output.



hive> select * from s sort by col2 ;
Query ID = hadoop_20200201124737_1436f9a7-deb8-4eee-905b-6c9903e5585e
Total jobs = 1
Launching Job 1 out of 1



OK
        NULL
aa      1
bb      1
teh     1
sr      1
ef      1
dth     1
dth     1
yj      1
cv      1
jfy     2
vyj     2
rht     2
vhk     2
gj      2
ub      2
ef      2
h       3
bjk     4
l       4
dth     4
hyj     4
bj      4
j       5
kf      5
g       5
h       5
dfh     5
dd      5
fh      6
s       6
dth     12
Time taken: 6.775 seconds, Fetched: 32 row(s)
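
The SORT BY output above happens to match the ORDER BY result because the job ran with a single reducer. To actually see per-reducer (partial) sorting, force more than one reducer first; a sketch, assuming the classic mapred.reduce.tasks setting is honoured in this Hive version:

-- Sketch: with two reducers, each sorts only its own share of the rows.
set mapred.reduce.tasks=2;
select * from s sort by col2;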

DISTRIBUTE BY does not sort; it decides which row goes to which reducer, so each reducer receives non-overlapping values of the distribute key.



hive> select * from s distribute by col2 ;
Query ID = hadoop_20200201124814_0646f56e-62b8-4a96-a516-cee8c5e98a4c
Total jobs = 1
Launching Job 1 out of 1


OK
aa      1
bb      1
dd      5
ef      2
teh     1
dth     4
dfh     5
h       3
sr      1
ef      1
dth     1
dth     1
dth     12
rht     2
vhk     2
gj      2
ub      2
l       4
bj      4
bjk     4
hyj     4
s       6
fh      6
g       5
j       5
h       5
kf      5
yj      1
cv      1
vyj     2
jfy     2
        NULL
Time taken: 6.272 seconds, Fetched: 32 row(s)

CLUSTER BY is a combination of DISTRIBUTE BY and SORT BY on the same column, effectively just an alias for that pattern. 
hive> select * from s distribute by col2 sort by col2;    (or equivalently: select * from s cluster by col2;)
Query ID = hadoop_20200201124841_80bb494c-11e4-48fd-a748-eb3156b312d6
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1579671009944_0132)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.85 s
----------------------------------------------------------------------------------------------
OK
        NULL
aa      1
bb      1
teh     1
sr      1
ef      1
dth     1
dth     1
yj      1
cv      1
jfy     2
vyj     2
rht     2
vhk     2
gj      2
ub      2
ef      2
h       3
bjk     4
l       4
dth     4
hyj     4
bj      4
j       5
kf      5
g       5
h       5
dfh     5
dd      5
fh      6
s       6
dth     12
Time taken: 6.35 seconds, Fetched: 32 row(s)







