Hive Partitioning:
Hive has been one of the preferred tools for running queries on large datasets, especially when a full table scan over the data is involved.
For tables that are not partitioned, every file in the table's data directory is read first, and the query filters are applied afterwards as a subsequent phase. This is slow and expensive, particularly for large tables, because all of the data has to be read before any of it can be filtered out.
Users very often need to filter data on specific column values. To apply partitioning in Hive effectively, users need to understand the domain of the data they are analysing.
With that knowledge, identifying the frequently queried or accessed columns becomes easy, and Hive's partitioning feature can then be applied to those columns.
Because partitions are horizontal slices of data, larger datasets can be separated into more manageable chunks.
The problem is that without partitioning, even a simple query with a WHERE clause reads the entire dataset.
This hurts efficiency and becomes a bottleneck when queries must run against large tables.
This issue can be overcome by implementing partitions in Hive.
When to use Hive Partitioning:
Partitioning is suggested whenever the data contained in a table should be split across multiple sections of the Hive table.
Rows are segregated and stored in their respective partitions according to the partition column values. When a query is run to fetch values from the table, only the required partitions are scanned, which reduces the time the query takes to yield its result.
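Concretely, each partition is stored as its own subdirectory under the table's data directory. As an illustration (the warehouse path below is the usual default, /user/hive/warehouse, and the table name is hypothetical), a table partitioned by country and state is laid out like this:

/user/hive/warehouse/user_partition/country=us/state=ca/
/user/hive/warehouse/user_partition/country=ind/state=ap/

A query that filters on country and state reads only the files under the matching subdirectory, which is why partition pruning is so effective.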
Let’s take a scenario:
Data from various e-commerce companies is landing in HDFS.
We need to run HiveQL queries to analyse user buying patterns.
We need to analyse the data from the last 10 days.
In this scenario, instead of running queries that scan the entire table, an approach should be followed where the query runs against only the last 10 days of data.
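One way to sketch this is to partition the table by date. The table and column names below are illustrative, not taken from the dataset used later in this post, and this assumes dt holds dates in yyyy-MM-dd format:

hive> create table user_activity(uid varchar(50), item varchar(50)) partitioned by (dt string) row format delimited fields terminated by ',' stored as textfile;
hive> select * from user_activity where dt >= date_sub(current_date(), 10);

With one partition per day, the second query touches only the last 10 daily partitions instead of scanning the whole table.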
hive> create table user_partition(fname varchar(50), lname varchar(50), id varchar(50)) partitioned by (country varchar(50), state varchar(50)) row format delimited fields terminated by ',' stored as textfile;
Classification of partitioning:
Static partitioning
Dynamic partitioning
When to use static partitioning:
Static partitioning should be applied when we know, before the data is inserted, which partition it belongs to.
hive> load data local inpath <path:file1> into table user_partition partition (country='us', state='ca');
hive> load data local inpath <path:file2> into table user_partition partition (country='ind', state='ap');
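After these two loads, the partitions created can be verified with SHOW PARTITIONS:

hive> show partitions user_partition;

This should list country=us/state=ca and country=ind/state=ap.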
Retrieving the user information:
To retrieve the user information belonging to the partition with country 'us' and state 'ca', the query is written as shown below.
hive> select * from user_partition where user_partition.country='us' and user_partition.state='ca';
hive> select * from user_partition where user_partition.country='ind' and user_partition.state='ap';
When to use dynamic partitioning:
In static partitioning, every partition needs to be backed by its own individual Hive statement, which is not feasible for a large number of partitions since it would require writing a lot of Hive statements.
In that scenario dynamic partitioning is suggested, because it can create any number of partitions from a single Hive statement.
Dynamic partitioning:
We need to create the partitioned table user_partition:
hive> create table user_partition(fname varchar(50), lname varchar(50), id varchar(50)) partitioned by (country varchar(50), state varchar(50)) row format delimited fields terminated by ',' stored as textfile;
In the next phase, a staging table user_partition1 needs to be created with all the columns, including country and state, as regular (non-partition) columns.
hive> create table user_partition1(fname varchar(50), lname varchar(50), id varchar(50), country string, state string) row format delimited fields terminated by ',' stored as textfile;
venkat,reddy,100,ind,ap
raju,reddy,101,ind,ap
kiran,rao,102,us,ca
siva,r,103,us,ca
The above dataset needs to be loaded into the table user_partition1.
hive> load data local inpath <path:file1> into table user_partition1;
Setting of parameters for dynamic partitioning:
To use dynamic partitioning in Hive, the parameters below need to be set in the Hive shell or in the hive-site.xml file.
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
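The first flag enables dynamic partition inserts; the second relaxes strict mode, which otherwise requires at least one static partition column in every dynamic insert, as a guard against accidentally creating a huge number of partitions. Even under strict mode, a mixed insert like the sketch below would be allowed, with country given statically and state derived dynamically:

hive> insert into table user_partition partition (country='ind', state) select fname, lname, id, state from user_partition1 where country='ind';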
Loading data into the partitioned table:
We need to copy the data from user_partition1 into the partitioned table user_partition using a single INSERT ... SELECT statement. Hive derives the partition values dynamically from the last columns of the SELECT list.
hive> insert into table user_partition partition (country, state) select fname, lname, id, country, state from user_partition1;
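Once the insert completes, the dynamically created partitions can be checked, and queries against the partitioned table will prune to the matching partition. Note that the partition columns appear as the last columns of the result set:

hive> show partitions user_partition;
hive> select * from user_partition where country='us' and state='ca';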