Friday, February 17, 2017

How to: Oozie + Sqoop + Hive


Importing data directly into Hive is a great way to quickly get data into your Hadoop cluster. Sometimes it’s useful to schedule such a job in an Oozie workflow. In this post, we’ll look at using Sqoop to import data from MySQL into Hive, executed by Oozie.

THE BASICS

Using Sqoop to import data into Hive via Oozie can be accomplished in four easy steps (concrete commands follow the list):
  1. Copy your hive-site.xml into HDFS
  2. Copy your MySQL JDBC jar into HDFS
  3. Create an Oozie workflow that has a Sqoop action, includes the hive-site.xml in a “file” element of the action, and includes the MySQL JDBC jar in an “archive” element of the action
  4. Run your Oozie workflow
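For example, steps 1, 2, and 4 might look something like this, assuming the HDFS paths used in the example below; oozie.example.com is a placeholder for your Oozie server:

# Step 1: copy hive-site.xml into HDFS
hdfs dfs -put hive-site.xml /tmp/hive-site.xml

# Step 2: copy the MySQL JDBC jar into HDFS
hdfs dfs -put mysql-connector-java-5.1.31-bin.jar /tmp/

# Step 4: submit and run the workflow
oozie job -oozie http://oozie.example.com:11000/oozie -config job.properties -run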

EXAMPLE

Here is an example workflow:
<workflow-app name="sqoop-to-hive" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop2hive"/>
    <action name="sqoop2hive">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table test --hive-import --hive-table test</command>
            <archive>/tmp/mysql-connector-java-5.1.31-bin.jar#mysql-connector-java-5.1.31-bin.jar</archive>
            <file>/tmp/hive-site.xml#hive-site.xml</file>
        </sqoop>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
And its corresponding hive-site.xml:
<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sqoop2.example.com:9083</value>
  </property>

</configuration>
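Since Sqoop can run on any worker node, it’s worth confirming that the metastore is reachable from the cluster before running the workflow. One quick check, using the hostname and port from the hive-site.xml above:

# Verify the Hive metastore thrift service is reachable
nc -zv sqoop2.example.com 9083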

KEY POINTS

  1. Hive needs to be configured to use a remote metastore. This is because Sqoop may be run from any node in your MapReduce cluster. At minimum, hive-site.xml must be included so Sqoop can tell Hive which metastore to write to. To do this, we set hive.metastore.uris in the hive-site.xml. In the above example, hive.metastore.uris is set to “thrift://sqoop2.example.com:9083”.
  2. hive-site.xml should be uploaded to HDFS and included in the workflow.xml. In the above example, the hive-site.xml file has been uploaded to /tmp/hive-site.xml in HDFS.
  3. The MySQL JDBC jar should be uploaded to HDFS and included in the workflow.xml. In the above example, the MySQL JDBC jar file has been uploaded to /tmp/mysql-connector-java-5.1.31-bin.jar.
  4. Sqoop first imports all data into HDFS, then loads that data into Hive. To do so, Sqoop generates a Hive script file and calls the Hive command line directly.
  5. Make sure to include the share lib by setting oozie.use.system.libpath to “true” in the Oozie job configuration (see the example job.properties after this list).
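For reference, a minimal job.properties covering the points above might look like this; every hostname, port, and path below is a placeholder for your own cluster:

# job.properties -- all values are placeholders
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8032
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/sqoop-to-hive

Submit it with “oozie job -config job.properties -run”.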

TROUBLESHOOTING

Q: FILE ALREADY EXISTS EXCEPTION IN TASK LOGS:

ERROR org.apache.sqoop.tool.ImportTool  - Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory <path> already exists
A: A Sqoop import into Hive first transfers the data to ‘/user/<username>/<table>’ in HDFS, then loads it into Hive. If a previous run failed partway through, the import directory may still exist. The solution is to remove that directory.
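For example, to clear a leftover import directory for the “test” table from the example (adjust the path to whichever user the job runs as):

hdfs dfs -rm -r /user/$USER/test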

Q: TABLE ALREADY EXISTS EXCEPTION IN TASK LOGS:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table test already exists)
A: The --create-hive-table option is being used and the table already exists in Hive. Either drop the table or remove the --create-hive-table option from the command.
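For example, to drop the example’s “test” table (note this also deletes the table’s data if it is a managed table):

hive -e 'DROP TABLE test;'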

Q: HIVE IMPORT STARTS, BUT FAILS BEFORE FINISHING.

Typically, a log message like:
INFO  org.apache.sqoop.hive.HiveImport  - Loading uploaded data into Hive
appears, followed by:
Intercepting System.exit(1)
A: Usually the hive-site.xml is missing, not referenced in the workflow.xml, or not correctly configured. See the example and “KEY POINTS” section above.
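One quick check is to confirm that the copy of hive-site.xml in HDFS actually points at your remote metastore (path from the example above):

hdfs dfs -cat /tmp/hive-site.xml | grep -A 1 hive.metastore.uris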

Q: CLASS NOT FOUND EXCEPTION IN THE LAUNCHER JOB LOGS:

java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SqoopMain not found
A: The share lib hasn’t been included! It can be included by setting oozie.use.system.libpath=true in the job configuration passed when submitting the job.
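The property can be set in job.properties (see above) or passed on the command line. On reasonably recent Oozie versions, the share lib contents can also be listed to confirm the Sqoop lib is installed; the Oozie URL below is a placeholder:

oozie job -oozie http://oozie.example.com:11000/oozie -Doozie.use.system.libpath=true -config job.properties -run
oozie admin -oozie http://oozie.example.com:11000/oozie -shareliblist sqoop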

Q: MISSING DRIVER CLASS IN LAUNCHER JOB LOGS:

ERROR org.apache.sqoop.Sqoop  - Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
A: The driver can’t be found. Including the JDBC driver jar in an “archive” element of your workflow should fix this.

SUMMARY

In this post we walked through a simple Sqoop import from MySQL into Hive scheduled by Oozie, covered the key points to get right, and looked at basic troubleshooting. Hopefully this helps you schedule your Sqoop-to-Hive jobs.
The code examples are available at https://github.com/ingesttips/examples.
