Tuesday, September 15, 2015

Configure Sqoop for Netezza and Import table to Hive Metastore

Our scenario is a below:

- DWH: Netezza (IBM Puredata)
- Hadoop : Cloudera (CDH5.4) and we are using Cloudera Manager. 
- Hadoop Cluster configuration consideration: All the worker nodes are behind a private network. Can not be accessed with our edge/management nodes.
- Objective: Import data from Netezza to a Hive table.

I followed below steps to meet the objective. 
As pr our current stats, we are now able to migrate 700 GB data in 1 hour (using 4 channels or mappers).

1. Install netezza connectors:

1.1 Install netezza connector for sqoop following steps in link: http://cloudera.com/content/cloudera/en/documentation/connectors/latest/Netezza/Cloudera-Connector-for-Netezza/ccnt_topic_3.html

1.2 (for Oozie)download "sqoop-nz-connector-1.2c5.tar" from below link and upload sqoop-nz-connector-1.2c5.jar to HDFS: /user/oozie/share/lib/lib_/sqoop/

      1.3 sudo -su oozie hadoop fs -put /tmp/sqoop-nz-connector-1.2c5.jar /user/oozie/share/lib/lib_/sqoop/

2. In addition to above, 
2.1 Netezza JDBC driver directory /var/lib/sqoop should have below permissions:
  This directory should have set permission as (chmod 755  /var/lib/sqoop)
  full path up to the driver file /var/lib/sqoop/nzjdbc3.jar should be accessible by others (chmod 444 /var/lib/sqoop/nzjdbc3.jar)
2.2 (for Oozie)upload netezza jdbc driver to   HDFS: /user/oozie/share/lib/lib_/sqoop/
sudo -su oozie hadoop fs -put /var/lib/sqoop/nzjdbc3.jar /user/oozie/share/lib/lib_/sqoop/

2.3 restart oozie

3. Configure proxy on edge node/s to connect to netezza for data nodes. (in case your datanodes are in private network).
[sqoop will launches several map jobs on data nodes and data nodes connect directly to netezza]

3.1 install haproxy on the edge node/s
  yum install haproxy
3.2 Configure the proxy for netezza.
Append below in '/etc/haproxy/haproxy.cfg'. Here the proxy will listen to 25501 on the edge node/s and bypass the incoming connection to netezza host:port.

# Proxy For Netezza
listen netezza : 25501
    mode tcp
    option tcplog
    balance leastconn

    server netezza :

3.3 service haproxy restart

4. To import use below
*** the user should have create external table privilege on the target database on netezza

- use username in upper case
- use direct ( Each map task of Netezza connector’s import job works on a subset of the Netezza partitions and transparently creates and uses an external table to transport data)
- use "--P" for password prompt
- we can pass mapreduce params like "-D mapreduce.task.timeout=0" just after import/export command
: netezza host/IP or host/IP of the edge node running the proxy for netezza.
: netezza port(default 5480) or port listening for netezza in the edge node running the proxy for netezza.

4.1     to import data on a HDFS directory use below:
sqoop import  \
--direct \
--connect jdbc:netezza://:/ \
--table D_ACCOUNT \
--fields-terminated-by '|' \
--target-dir /user/etl/temp/ \
--username \
--password '' \
-- --nz-logdir /tmp/

4.2     to import data on Hive table use below (it will prompt for password as we used “--P”):
sqoop import  \
--connect jdbc:netezza://:25501/  \
--table .  \
--where   \
--target-dir ""  \
--username   \
--P \
--hive-import  \
--hive-table ""  \
--hive-overwrite  \
--delete-target-dir  \
--num-mappers 4  \
--verbose \
--direct  \
-- --nz-logdir   --nz-uploaddir

4.3     Then use hive to copy data from temp table(text or serial format) to final table(parquet + snappy) (select ..insert). Sqoop has some limitation to load data directly to a parquet table.


  1. Hey, Great job here on NETEZZA. You know what! I read a lot of blog posts and I never come across such a helpful post. I like this post you made with great effort. Very ingenious.

  2. Needed to compose one simple word yet thanks for the suggestions that you are contributed here, please do keep updating us...
    DevOps Training in Chennai | DevOps Online Training in Chennai

  3. Good blog,thanks for sharing the valuable information...thank you.
    Best software Training institute in Bangalore

  4. It was really a nice post and i was really impressed by reading this Big Data Hadoop Online Course Bnagalore