Wednesday, April 13, 2016

Hive Indexing

Index:

An Index acts as a reference to the records. Instead of searching all the records, we can refer to the index to search for a particular record. Indexes maintain the reference of the records. So that it is easy to search for a record with minimum overhead. Indexes also speed up the searching of data.

Indexing in Hive:

Hive is a data warehousing tool present on the top of Hadoop, which provides the SQL kind of interface to perform queries on large data sets. Since Hive deals with Big Data, the size of files is naturally large and can span up to Terabytes and Petabytes. Now if we want to perform any operation or a query on this huge amount of data it will take large amount of time.

In a Hive table, there are many numbers of rows and columns. If we want to perform queries only on some columns without indexing, it will take large amount of time because queries will be executed on all the columns present in the table.

The major advantage of using indexing is; whenever we perform a query on a table that has an index, there is no need for the query to scan all the rows in the table. Further, it checks the index first and then goes to the particular column and performs the operation.

So if we maintain indexes, it will be easier for Hive query to look into the indexes first and then perform the needed operations within less amount of time.

Eventually, time is the only factor that everyone focuses on.

When to use Indexing:

Indexing can be use under the following circumstances:

If the dataset is very large.
If the query execution is more amount of time than you expected.
If a speedy query execution is required.
When building a data model.
Indexes are maintained in a separate table in Hive so that it won’t affect the data inside the table, which contains the data. Another major advantage for indexing in Hive is that indexes can also be partitioned depending on the size of the data we have.

Types of Indexes in Hive:

Compact Indexing
Bitmap Indexing

Differences between Compact and Bitmap Indexing:

The main difference is the storing of the mapped values of the rows in the different blocks. When the data inside a Hive table is stored by default in the HDFS, they are distributed across the nodes in a cluster. There needs to be a proper identification of the data, like the data in block indexing. This data will be able to identity which row is present in which block, so that when a query is triggered it can go directly into that block. So, while performing a query, it will first check the index and then go directly into that block.

Compact indexing stores the pair of indexed column’s value and its blockid.

Bitmap indexing stores the combination of indexed column value and list of rows as a bitmap.

Bitmap Indexing:

A bitmap is is a type of memory organization or image file format used to store digital images so with this meaning of bitmap, we can redefine bitmap indexing as given below.

“Bitmap index stores the combination of value and list of rows as a digital image.”

The following are the different operations that can be performed on Hive indexes:

Creating index
Showing index
Alter index
Dropping index

Creating index:

CREATE INDEX index_name 
ON TABLE table_name (columns,....) 
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
WITH DEFERRED REBUILD;

The org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’ line specifies that a built in CompactIndexHandler will act on the created index, which means we are creating a compact index for the table.
The WITH DEFERRED REBUILD statement should be present in the created index because we need to alter the index in later stages using this statement.

This syntax will create an index for our table, but to complete the creation, we need to complete the REBUILD statement. For this to happen, we need to add one more alter statement. A MapReduce job will be launched and the index creation is now completed.

ALTER INDEX index_nam on table_name REBUILD;

This ALTER statement will complete our REBUILDED index creation for the table.

hive> create table employee(ename STRING,eage INT,country STRING,year STRING,edept STRING) row format delimited fields terminated by '\t' stored as textfile;
hive> load data local inpath ‘path of your file’ into table employee;

Let’s perform an Average operation on this ‘employee’ data. Let’s calculate the average age of the employees using the following command:

hive> SELECT AVG(eage) from employee;

hive> CREATE INDEX employee_index ON TABLE employee (eage) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
hive> ALTER INDEX employee_index on employee REBUILD;
hive> show formatted index on employee;

hive> SELECT AVG(eage) from employee;

create a Bitmap index for the same table:

hive> CREATE INDEX employee_index_bitmap ON TABLE employee (eage) AS 'BITMAP' WITH DEFERRED REBUILD; 
hive> ALTER INDEX employee_index_bitmap on employee REBUILD;

hive> DROP INDEX IF EXISTS employee_index ON employee;

Indexes decrease the time for executing the query.
We can have any number of indexes on the same table.
We can use the type of index depending on the data we have.
In some cases, Bitmap indexes work faster than the Compact indexes and vice versa.

When not to use indexing?

It is essential to know when and where indexing shouldn’t be used. They should not be used in the following scenarios:

Indexes are advised to build on the columns on which you frequently perform operations.
Building more number of indexes also degrade the performance of your query.
Type of index to be created should be identified prior to its creation (if your data requires bitmap you should not create compact).This leads to increase in time for executing your query.





No comments:

Post a Comment