Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Importing a Table from RDBMS to HDFS:
Sqoop: sqoop import --connect jdbc:mysql://localhost/TestDb --username root --password root --table employee -m 1
Here we connect to MySQL through the JDBC driver and use the database TestDb. The MySQL username, password, and table name must be specified.
The '-m' (or '--num-mappers') argument specifies the number of map tasks that run in parallel; '-m 1' means only one map task is used, so the import produces a single output file.
If we do not use -m 1, Sqoop launches four map tasks by default and splits the table among them, so we get up to four part files in HDFS instead of one.
Now the data in the RDBMS has been successfully imported into HDFS. By default, the files are stored under /user/$user_name/table_name/, e.g. the part-m-00000 file.
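Under the hood, when more than one mapper is used, Sqoop finds the minimum and maximum of the split column (the table's primary key by default) and divides that range evenly among the mappers; each mapper then scans its own slice and writes its own part-m-NNNNN file. A rough Python sketch of that splitting idea (illustrative only; the function name split_ranges is mine, not Sqoop's):

```python
# Sketch of Sqoop-style range splitting: divide [min_id, max_id]
# into num_mappers contiguous, non-overlapping ranges, one per mapper.
def split_ranges(min_id, max_id, num_mappers):
    """Return (lo, hi) inclusive bounds for each mapper's WHERE clause."""
    total = max_id - min_id + 1
    size, rem = divmod(total, num_mappers)
    ranges, lo = [], min_id
    for i in range(num_mappers):
        hi = lo + size - 1 + (1 if i < rem else 0)  # spread the remainder
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# With ids 1..100 and the default of 4 mappers, each mapper scans 25 rows:
print(split_ranges(1, 100, 4))   # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each range would become a query like `SELECT ... WHERE id >= lo AND id <= hi`, which is why multiple mappers yield multiple part files.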
Importing all the Tables in a Database to HDFS
Sqoop: sqoop import-all-tables --connect jdbc:mysql://localhost/TestDb --username root --password root -m 1
Importing table data into a Specific Directory in HDFS
Sqoop: sqoop import --connect jdbc:mysql://localhost/TestDb --username root --password root --table employee -m 1 --target-dir /sqoop_data/employee/
Importing Table as a Sequence File into HDFS
Sqoop: sqoop import --connect jdbc:mysql://localhost/TestDb --username root --password root --table employee -m 1 --target-dir /sqoop_data/employee/sequence/ --as-sequencefile
As a sequence file stores its contents in a binary format, the output file will be binary rather than plain text.
Importing Table as an Avro File into HDFS
Sqoop: sqoop import --connect jdbc:mysql://localhost/TestDb --username root --password root --table employee -m 1 --target-dir /sqoop_data/employee/avro/ --as-avrodatafile
When storing a file using the Avro file format, we get an output file with the .avro extension, and the contents inside the file are in binary format.
Export Data from HDFS to RDBMS
First, create a table in MySQL named employee1 whose columns match the data being exported.
Sqoop: sqoop export --connect jdbc:mysql://localhost/TestDb --username root --password root --table employee1 --export-dir /employee1.csv
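Conceptually, an export reads the delimited files under --export-dir and turns each input line into an INSERT against the target table. A minimal Python sketch of that transformation (not Sqoop's actual code; the sample data and the helper name rows_to_inserts are mine, and real exports handle types and batching):

```python
import csv
import io

def rows_to_inserts(csv_text, table):
    """Turn comma-delimited export data into INSERT statements,
    roughly what a Sqoop export does for each input line."""
    stmts = []
    for row in csv.reader(io.StringIO(csv_text)):
        values = ", ".join("'%s'" % v.strip() for v in row)
        stmts.append("INSERT INTO %s VALUES (%s);" % (table, values))
    return stmts

# Hypothetical contents of /employee1.csv:
data = "1,john,50000\n2,mary,60000\n"
for stmt in rows_to_inserts(data, "employee1"):
    print(stmt)   # e.g. INSERT INTO employee1 VALUES ('1', 'john', '50000');
```

Because each line becomes a row in the target table, the table's schema must already match the file's column layout, which is why employee1 is created first.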
Incremental Import in Sqoop To Load Data From Mysql To HDFS
Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.
You should specify append mode when importing a table where new rows are continually added with increasing row id values. Specify the column containing the row's id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
The alternate table update strategy supported by Sqoop is called lastmodified mode. It should be used when rows of the source table are updated, and each such update sets the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
At the end of an incremental import, the value that should be passed as --last-value on a subsequent import is printed to the screen. When running a subsequent import, specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically if you create the incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import.
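Both modes boil down to the same filter on the check column: keep only rows whose value exceeds --last-value, then remember the new maximum for next time. A Python sketch of that logic (the row data and the function name incremental_import are hypothetical, not Sqoop internals):

```python
def incremental_import(rows, check_column, last_value):
    """Select rows whose check column exceeds last_value.
    In append mode the column is a growing id; in lastmodified
    mode it is a timestamp -- the comparison is the same."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    # Sqoop prints the value to pass as --last-value on the next run:
    next_last = max((r[check_column] for r in rows), default=last_value)
    return new_rows, next_last

rows = [{"id": 1}, {"id": 2}, {"id": 3}]
imported, next_last = incremental_import(rows, "id", 1)
print(imported, next_last)   # [{'id': 2}, {'id': 3}] 3
```

A saved job simply stores next_last between runs so you never have to track --last-value by hand.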
Sqoop: sqoop import --connect jdbc:mysql://localhost/TestDb --username root --password root --table sales_table -m 1 --target-dir /sqoopout
At this point, two records have been imported successfully from MySQL to HDFS:
1, table, 1000
2, sofa, 5000
Insert three more records through the MySQL shell:
mysql> insert into sales_table values(3, 'window', 3000),(4, 'fan', 2000),(5, 'tv',8000);
Now five records are available in sales_table.
The following syntax is used for the incremental option in the Sqoop import command.
Sqoop: sqoop import --connect jdbc:mysql://localhost/TestDb --username root --password root --table sales_table -m 1 --target-dir /sqoopout/ --incremental append --check-column ord_id --last-value 2
The three new records have been retrieved, and the incremental import is now complete.
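Replaying that run in plain Python makes the filter visible: with --last-value 2, only rows whose ord_id exceeds 2 are pulled (row data taken from the example above):

```python
# The five rows now in sales_table: (ord_id, item, price)
sales_table = [
    (1, "table", 1000),
    (2, "sofa", 5000),
    (3, "window", 3000),
    (4, "fan", 2000),
    (5, "tv", 8000),
]

last_value = 2  # the value passed as --last-value
new_rows = [row for row in sales_table if row[0] > last_value]
print(new_rows)        # the three newly inserted records
print(len(new_rows))   # 3
```

The next run of the same job would use --last-value 5, so already-imported rows are never fetched twice.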