Sections :
1 : Introduction & Basic Commands
2 : Functions in Hive
3 : Bucketing and Partitioning
4 : Joins
5 : Views and Indexes
6 : UDF
7 : Table properties (ACID as well)
8 : File Formats and Loading data
9 : Questions
1 - Introduction & Basic Commands :
Hive is a SQL-like querying tool for data stored in HDFS and other file systems that integrate with Hadoop.
Hive is not a Database.
It can only process structured data stored in tables.
Developed by Facebook, now an Apache project.
Hive is efficient for batch processing, where the amount of data is large and there is no real-time data availability requirement.
Supports different file formats, e.g. Text, Parquet, SequenceFile, Avro, etc.
Most of the time a MapReduce job runs at the back end as soon as we run a query in Hive. Hive was created to reduce the complexity of writing MapReduce programs.
Easy for non-Java analysts.
Hive is built on a Write-Once, Read-Many concept, whereas an RDBMS is based on Write-Many, Read-Many.
HIVE ARCHITECTURE
BASIC COMMANDS
1 . Create Database & Table :
A database is just a container for metadata.
Table creation is also only the creation of metadata for the table.
Every table in a database is stored as a sub-directory of the schema directory.
The default warehouse directory is /user/hive/warehouse.
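The warehouse location can differ per installation; assuming default configuration keys, it can be confirmed from the CLI (a sketch, no output shown):
## Show the configured warehouse directory .
hive> set hive.metastore.warehouse.dir;
## describe formatted also shows a specific table's storage location .
hive> describe formatted table1;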
Commands :
hive> create database test1;
OK
Time taken: 1.297 seconds
## IF NOT EXISTS suppresses the error when the database already exists .
hive> create database if not exists test1;
OK
Time taken: 0.044 seconds
hive> create database test1;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database test1 already exists
hive> describe database test1;
OK
test1 location/in/test hadoop USER
Time taken: 0.144 seconds, Fetched: 1 row(s)
## As a good coding practice, while creating a database add a comment, creator details, creation time, etc. for the db .
hive> create database if not exists test2 comment "This is a test database";
OK
Time taken: 0.496 seconds
hive> desc database test2;
OK
test2 This is a test database location/in/test hadoop USER
Time taken: 0.028 seconds, Fetched: 1 row(s)
hive> describe database extended test2;
OK
test2 This is a test database location/in/test hadoop USER
Time taken: 0.027 seconds, Fetched: 1 row(s)
hive> create database if not exists test3 with dbproperties ('creator'='test' ,'date' = '2020-01-02' );
OK
Time taken: 0.528 seconds
hive> describe database extended test3;
OK
test3 location/in/test hadoop USER {creator=test, date=2020-01-02}
Time taken: 0.028 seconds, Fetched: 1 row(s)
## Command to display schemas whose names start with test .
hive> show databases like 'test*';
OK
test1
test2
test3
Time taken: 0.021 seconds, Fetched: 3 row(s)
Tables created in Hive are of 2 types :
a : Internal (managed) tables : By default, all tables created in Hive are internal unless specified otherwise. Hive is responsible for both the metadata and the data, so dropping the table deletes both.
b : External tables : Created with the EXTERNAL keyword. Hive is responsible only for the table's metadata, so if the table is dropped the data remains in HDFS/S3 etc.
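An external table is created like a regular one, just with the EXTERNAL keyword and usually an explicit location. A sketch, reusing the table1_ext_details path that appears later in these notes:
hive> create external table if not exists table1_ext (col1 string , col2 array <string>, col3 string , col4 int) row format delimited fields terminated by ',' collection items terminated by ':' lines terminated by '\n' stored as textfile location 's3://data/hive_warehouse/table1_ext_details/';
## Dropping it removes only the metadata ; the files under the location remain .
hive> drop table table1_ext;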
## Before creating a table, switch to the schema using the use command .
hive> use test2;
OK
Time taken: 0.037 seconds
## Create a table with string, array<string>, and int columns, for data files with comma-separated columns, colon-separated collection items, and newline-separated rows .
hive> create table if not exists table1 (col1 string , col2 array <string>, col3 string , col4 int) row format delimited fields terminated by ',' collection items terminated by ':' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.885 seconds
hive> describe table1;
OK
col1 string
col2 array<string>
col3 string
col4 int
Time taken: 0.104 seconds, Fetched: 4 row(s)
## Get the full DDL for a table using the show create table command .
hive> show create table table1;
OK
CREATE TABLE `table1`(
`col1` string,
`col2` array<string>,
`col3` string,
`col4` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'colelction.delim'=':',
'field.delim'=',',
'line.delim'='\n',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://data/hive_warehouse/test2.db/table1'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1580491562')
Time taken: 0.146 seconds, Fetched: 25 row(s)
## We can create a table without explicitly giving storage details; the defaults apply: columns are separated by \u0001 (Ctrl+A), rows by \n, and the table is created in the schema sub-directory .
hive> create table if not exists table2 (col1 string , col2 array <string>, col3 string , col4 int) ;
OK
Time taken: 0.688 seconds
hive> show create table table2;
OK
CREATE TABLE `table2`(
`col1` string,
`col2` array<string>,
`col3` string,
`col4` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://data/hive_warehouse/test2.db/table2'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1580491653')
Time taken: 0.104 seconds, Fetched: 20 row(s)
## Create a table at an explicit S3/HDFS location .
hive> create table if not exists table3 (col1 string , col2 array <string>, col3 string , col4 int) row format delimited fields terminated by ',' collection items terminated by ':' lines terminated by '\n' stored as textfile location 's3://data/hive_warehouse/table_3_details/';
OK
Time taken: 2.084 seconds
hive> desc table3;
OK
col1 string
col2 array<string>
col3 string
col4 int
Time taken: 0.103 seconds, Fetched: 4 row(s)
hive> show create table table3;
OK
CREATE TABLE `table3`(
`col1` string,
`col2` array<string>,
`col3` string,
`col4` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'colelction.delim'=':',
'field.delim'=',',
'line.delim'='\n',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://data/hive_warehouse/table_3_details'
TBLPROPERTIES (
'transient_lastDdlTime'='1580558814')
Time taken: 0.105 seconds, Fetched: 20 row(s)
## Loading data from a file on the local file system (OS) into a Hive table. Make sure the data is compatible with the defined schema, else NULL/blank values will be stored, since Hive is schema-on-read .
hive> load data local inpath '/home/hadoop/table1.txt' into table table1;
Loading data to table test2.table1
OK
Time taken: 1.648 seconds
hive> select * from table1;
OK
499 ["Poole","GBR"] England 141000
501 ["Blackburn","GBR"] England 140000
500 ["Bolton","GBR"] England 139020
502 ["Newport","GBR"] Wales 139000
503 ["PrestON","GBR"] England 135000
504 ["Stockport","GBR"] England 132813
Time taken: 0.125 seconds, Fetched: 6 row(s)
## Loading again with INTO always appends, so the rows are duplicated .
hive> load data local inpath '/home/hadoop/table1.txt' into table table1;
Loading data to table test2.table1
OK
Time taken: 1.027 seconds
hive> select * from table1;
OK
499 ["Poole","GBR"] England 141000
501 ["Blackburn","GBR"] England 140000
500 ["Bolton","GBR"] England 139020
502 ["Newport","GBR"] Wales 139000
503 ["PrestON","GBR"] England 135000
504 ["Stockport","GBR"] England 132813
499 ["Poole","GBR"] England 141000
501 ["Blackburn","GBR"] England 140000
500 ["Bolton","GBR"] England 139020
502 ["Newport","GBR"] Wales 139000
503 ["PrestON","GBR"] England 135000
504 ["Stockport","GBR"] England 132813
Time taken: 0.12 seconds, Fetched: 12 row(s)
## LOAD DATA ... OVERWRITE always deletes the existing data first and then loads the file into the table .
hive> load data local inpath '/home/hadoop/table1.txt' overwrite into table table1;
Loading data to table test2.table1
OK
Time taken: 1.167 seconds
hive> select * from table1;
OK
499 ["Poole","GBR"] England 141000
501 ["Blackburn","GBR"] England 140000
500 ["Bolton","GBR"] England 139020
502 ["Newport","GBR"] Wales 139000
503 ["PrestON","GBR"] England 135000
504 ["Stockport","GBR"] England 132813
Time taken: 0.124 seconds, Fetched: 6 row(s)
## To run HDFS/shell commands (e.g. to check if a file exists) from the Hive CLI, prefix them with a ! sign .
hive> !hadoop fs -ls s3://data/hive_warehouse/table1_ext_details/ ;
Found 1 items
-rw-rw-rw- 1 185 2020-02-01 12:12 s3://data/hive_warehouse/table1_ext_details/table1.txt
## Insert data into a table from another table. A where condition can be added as well .
hive> insert into table_test select * from table1;
Query ID = hadoop_20200201121721_28d407c1-1d45-4195-a628-162db09fc475
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Loading data to table test2.table_test
OK
Time taken: 15.622 seconds
hive> select * from table_test;
OK
499 ["Poole","GBR"] England 141000
501 ["Blackburn","GBR"] England 140000
500 ["Bolton","GBR"] England 139020
502 ["Newport","GBR"] Wales 139000
503 ["PrestON","GBR"] England 135000
504 ["Stockport","GBR"] England 132813
Time taken: 0.117 seconds, Fetched: 6 row(s)
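As noted above, the insert-select can carry a filter; a sketch against the same tables, using col3 (the country column of table1):
## Copy only the rows for Wales .
hive> insert into table_test select * from table1 where col3 = 'Wales';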
## Multi-Insert from one table
hive> create table developer_tab (id int,name string,desg string) stored as textfile;
OK
Time taken: 0.726 seconds
hive> create table manager_tab (id int,name string,desg string) stored as textfile;
OK
Time taken: 0.633 seconds
hive> create table emp_tab(col1 int,col2 string,col3 string,col4 int,col5 int ,col6 int,col7 string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.679 seconds
hive> load data local inpath '/home/hadoop/emp.txt' into table emp_tab;
Loading data to table test2.emp_tab
OK
Time taken: 0.994 seconds
hive> select * from emp_tab;
OK
1281 Shawn Architect 7890 1481 10 IXZ
1381 Jacob Admin 4560 1481 20 POI
1481 flink Mgr 9580 1681 10 IXZ
1581 Richard Developer 1000 1681 40 LKJ
1681 Mira Mgr 5098 1481 10 IKZ
1781 John Developer 6500 1681 10 IXZ
Time taken: 0.116 seconds, Fetched: 6 row(s)
## Inserting data into 2 tables from 1 table in a single pass .
hive> from emp_tab insert into table developer_tab select col1,col2,col3 where col3 ='Developer' insert into table manager_tab select col1,col2,col3 where col3='Mgr';
Query ID = hadoop_20200201122251_0046f323-da32-43fd-9a94-14da6fd75f8d
Total jobs = 1
Launching Job 1 out of 1
Loading data to table test2.developer_tab
Loading data to table test2.manager_tab
OK
Time taken: 12.56 seconds
hive> select * from manager_tab;
OK
1481 flink Mgr
1681 Mira Mgr
Time taken: 0.118 seconds, Fetched: 2 row(s)
hive> select * from developer_tab;
OK
1581 Richard Developer
1781 John Developer
Time taken: 0.114 seconds, Fetched: 2 row(s)
hive>
ALTER TABLE COMMANDS :
It is possible to change the position of a column, but only after adding it to the table, using CHANGE COLUMN.
hive> desc emp_tab;
OK
col1 int
col2 string
col3 string
col4 int
col5 int
col6 int
col7 string
Time taken: 0.1 seconds, Fetched: 7 row(s)
## By default a new column is added at the end and contains NULL for all existing rows .
hive> alter table emp_tab add columns (col8 int);
OK
Time taken: 0.422 seconds
hive> desc emp_tab;
OK
col1 int
col2 string
col3 string
col4 int
col5 int
col6 int
col7 string
col8 int
Time taken: 0.101 seconds, Fetched: 8 row(s)
hive> select * from emp_tab;
OK
1281 Shawn Architect 7890 1481 10 IXZ NULL
1381 Jacob Admin 4560 1481 20 POI NULL
1481 flink Mgr 9580 1681 10 IXZ NULL
1581 Richard Developer 1000 1681 40 LKJ NULL
1681 Mira Mgr 5098 1481 10 IKZ NULL
1781 John Developer 6500 1681 10 IXZ NULL
Time taken: 0.116 seconds, Fetched: 6 row(s)
## Command to move col8 to the beginning. It fails with a data type error: moving col8 to the front shifts every other column by one position, and the shifted types are incompatible with the existing data. Be careful with these commands .
hive> alter table emp_tab change column col8 col8 string first;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
col1,col6
hive> alter table emp_tab change column col8 col8 int after col5;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
col6
hive> alter table emp_tab add columns (col9 int);
OK
Time taken: 0.413 seconds
hive> desc emp_tab;
OK
col1 int
col2 string
col3 string
col4 int
col5 int
col6 int
col7 string
col8 int
col9 int
Time taken: 0.113 seconds, Fetched: 9 row(s)
hive> alter table emp_tab change column col9 col9 int after col7;
OK
Time taken: 0.462 seconds
hive> desc emp_tab;
OK
col1 int
col2 string
col3 string
col4 int
col5 int
col6 int
col7 string
col9 int
col8 int
Time taken: 0.154 seconds, Fetched: 9 row(s)
hive> select * from emp_tab;
OK
1281 Shawn Architect 7890 1481 10 IXZ NULL NULL
1381 Jacob Admin 4560 1481 20 POI NULL NULL
1481 flink Mgr 9580 1681 10 IXZ NULL NULL
1581 Richard Developer 1000 1681 40 LKJ NULL NULL
1681 Mira Mgr 5098 1481 10 IKZ NULL NULL
1781 John Developer 6500 1681 10 IXZ NULL NULL
Time taken: 0.116 seconds, Fetched: 6 row(s)
## Renaming a table .
hive> alter table emp_tab rename to employee;
OK
Time taken: 0.73 seconds
hive> desc emp_tab;
FAILED: SemanticException [Error 10001]: Table not found emp_tab
hive> desc employee;
OK
col1 int
col2 string
col3 string
col4 int
col5 int
col6 int
col7 string
col9 int
col8 int
Time taken: 0.111 seconds, Fetched: 9 row(s)
## Replacing the column list: old columns are removed and only the new ones are kept .
hive> alter table employee replace columns(col9 int, name string);
OK
Time taken: 0.415 seconds
hive> desc employee;
OK
col9 int
name string
Time taken: 0.098 seconds, Fetched: 2 row(s)
hive> select * from employee;
OK
1281 Shawn
1381 Jacob
1481 flink
1581 Richard
1681 Mira
1781 John
Time taken: 0.115 seconds, Fetched: 6 row(s)
## Details regarding a table
hive> describe extended employee;
OK
col9 int
name string
Detailed Table Information Table(tableName:employee, dbName:test2, owner:hadoop, createTime:1580559746, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col9, type:int, comment:null), FieldSchema(name:name, type:string, comment:null)], location:s3://data/hive_warehouse/test2.db/employee, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, line.delim=
, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{last_modified_time=1580560767, totalSize=212, numRows=0, rawDataSize=0, numFiles=1, transient_lastDdlTime=1580560767, last_modified_by=hadoop}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.142 seconds, Fetched: 5 row(s)
## Setting table properties .
hive> alter table employee set tblproperties('auto.purge'='true');
OK
Time taken: 0.375 seconds
hive> describe extended employee;
OK
col9 int
name string
Detailed Table Information Table(tableName:employee, dbName:test2, owner:hadoop, createTime:1580559746, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col9, type:int, comment:null), FieldSchema(name:name, type:string, comment:null)], location:s3://data/hive_warehouse/test2.db/employee, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, line.delim=
, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{last_modified_time=1580560876, totalSize=212, numRows=0, rawDataSize=0, numFiles=1, transient_lastDdlTime=1580560876, auto.purge=true, last_modified_by=hadoop}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.119 seconds, Fetched: 5 row(s)
hive> alter table employee set fileformat avro;
OK
Time taken: 0.408 seconds
hive> describe extended employee;
OK
col9 int
name string
Detailed Table Information Table(tableName:employee, dbName:test2, owner:hadoop, createTime:1580559746, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col9, type:int, comment:null), FieldSchema(name:name, type:string, comment:null)], location:s3://data/hive_warehouse/test2.db/employee, inputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.avro.AvroSerDe, parameters:{serialization.format=,, line.delim=
, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{last_modified_time=1580560936, totalSize=212, numRows=0, rawDataSize=0, numFiles=1, transient_lastDdlTime=1580560936, auto.purge=true, last_modified_by=hadoop}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.116 seconds, Fetched: 5 row(s)
ORDER BY , SORT BY , DISTRIBUTE BY , CLUSTER BY .
ORDER BY sorts the complete data set, but involves heavy shuffling, since the final sort is performed on a single reducer .
hive> select * from s order by col2 ;
Query ID = hadoop_20200201124712_51a598d6-a6fd-495b-aecc-b0c3b74f93eb
Total jobs = 1
Launching Job 1 out of 1
OK
NULL
aa 1
bb 1
teh 1
sr 1
ef 1
dth 1
dth 1
yj 1
cv 1
jfy 2
vyj 2
rht 2
vhk 2
gj 2
ub 2
ef 2
h 3
bjk 4
l 4
dth 4
hyj 4
bj 4
j 5
kf 5
g 5
h 5
dfh 5
dd 5
fh 6
s 6
dth 12
Time taken: 6.333 seconds, Fetched: 32 row(s)
SORT BY sorts within each reducer; the final result is not a complete sort but the union of each reducer's sorted output .
hive> select * from s sort by col2 ;
Query ID = hadoop_20200201124737_1436f9a7-deb8-4eee-905b-6c9903e5585e
Total jobs = 1
Launching Job 1 out of 1
OK
NULL
aa 1
bb 1
teh 1
sr 1
ef 1
dth 1
dth 1
yj 1
cv 1
jfy 2
vyj 2
rht 2
vhk 2
gj 2
ub 2
ef 2
h 3
bjk 4
l 4
dth 4
hyj 4
bj 4
j 5
kf 5
g 5
h 5
dfh 5
dd 5
fh 6
s 6
dth 12
Time taken: 6.775 seconds, Fetched: 32 row(s)
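With a single reducer, SORT BY and ORDER BY look identical (as in the transcript above); to actually observe per-reducer ordering, force more than one reducer first. A sketch; the property name can vary by Hadoop version:
## Force 2 reducers , then sort within each .
hive> set mapreduce.job.reduces=2;
hive> select * from s sort by col2;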
DISTRIBUTE BY does not sort; it decides which row goes to which reducer, so each reducer receives non-overlapping values of the key .
hive> select * from s distribute by col2 ;
Query ID = hadoop_20200201124814_0646f56e-62b8-4a96-a516-cee8c5e98a4c
Total jobs = 1
Launching Job 1 out of 1
OK
aa 1
bb 1
dd 5
ef 2
teh 1
dth 4
dfh 5
h 3
sr 1
ef 1
dth 1
dth 1
dth 12
rht 2
vhk 2
gj 2
ub 2
l 4
bj 4
bjk 4
hyj 4
s 6
fh 6
g 5
j 5
h 5
kf 5
yj 1
cv 1
vyj 2
jfy 2
NULL
Time taken: 6.272 seconds, Fetched: 32 row(s)
CLUSTER BY is a combination of DISTRIBUTE BY and SORT BY on the same column, just an alias .
hive> select * from s distribute by col2 sort by col2; or select * from s cluster by col2;
Query ID = hadoop_20200201124841_80bb494c-11e4-48fd-a748-eb3156b312d6
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1579671009944_0132)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 5.85 s
----------------------------------------------------------------------------------------------
OK
NULL
aa 1
bb 1
teh 1
sr 1
ef 1
dth 1
dth 1
yj 1
cv 1
jfy 2
vyj 2
rht 2
vhk 2
gj 2
ub 2
ef 2
h 3
bjk 4
l 4
dth 4
hyj 4
bj 4
j 5
kf 5
g 5
h 5
dfh 5
dd 5
fh 6
s 6
dth 12
Time taken: 6.35 seconds, Fetched: 32 row(s)