Our scenario is as below:
- DWH: Netezza (IBM PureData)
- Hadoop: Cloudera (CDH 5.4), managed with Cloudera Manager.
- Hadoop cluster configuration consideration: All the worker nodes are behind a private network; they can only be reached through our edge/management nodes and cannot connect directly to systems outside the cluster (such as Netezza).
- Objective: Import data from Netezza to a Hive table.
I followed the below steps to meet the objective.
As per our current stats, we are now able to migrate 700 GB of data in 1 hour (using 4 channels or mappers).
1. Install Netezza connectors:
1.1 Install the Netezza connector for Sqoop following the steps in this link:
http://cloudera.com/content/cloudera/en/documentation/connectors/latest/Netezza/Cloudera-Connector-for-Netezza/ccnt_topic_3.html
1.2 (For Oozie) Download "sqoop-nz-connector-1.2c5.tar" from the link below and upload sqoop-nz-connector-1.2c5.jar to HDFS under /user/oozie/share/lib/lib_<timestamp>/sqoop/:
http://www.cloudera.com/content/cloudera/en/downloads/connectors/sqoop/netezza/1-2c5-for-cdh-5-x.html
1.3 sudo -su oozie
hadoop fs -put /tmp/sqoop-nz-connector-1.2c5.jar /user/oozie/share/lib/lib_<timestamp>/sqoop/
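As a quick sanity check (not part of the original steps; the sharelib subdirectory is the timestamped one specific to your cluster), we can list the Sqoop sharelib directory and confirm the connector jar landed there:
sudo -su oozie
hadoop fs -ls /user/oozie/share/lib/                                   # shows the current lib_<timestamp> directory
hadoop fs -ls /user/oozie/share/lib/lib_<timestamp>/sqoop/ | grep sqoop-nz-connector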
2. In addition to the above:
2.1 The Netezza JDBC driver directory /var/lib/sqoop should have the below permissions:
- the directory itself should be readable and executable by others (chmod 755 /var/lib/sqoop)
- the full path up to the driver file /var/lib/sqoop/nzjdbc3.jar should be accessible by others, and the jar itself readable (chmod 444 /var/lib/sqoop/nzjdbc3.jar)
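A minimal sketch of applying and verifying these permissions on the node/s where Sqoop runs (the paths are the ones used above):
chmod 755 /var/lib/sqoop
chmod 444 /var/lib/sqoop/nzjdbc3.jar
ls -ld /var/lib/sqoop                 # expected: drwxr-xr-x
ls -l /var/lib/sqoop/nzjdbc3.jar      # expected: -r--r--r--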
2.2 (For Oozie) Upload the Netezza JDBC driver to HDFS under /user/oozie/share/lib/lib_<timestamp>/sqoop/:
sudo -su oozie
hadoop fs -put /var/lib/sqoop/nzjdbc3.jar /user/oozie/share/lib/lib_<timestamp>/sqoop/
2.3 Restart Oozie.
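Since the cluster is managed by Cloudera Manager, the restart is done from the CM UI (Oozie service > Actions > Restart). As an alternative that avoids a full restart, Oozie 4.x (shipped with CDH 5.4) can refresh the sharelib in place; a hedged sketch, where the Oozie server host is a placeholder:
oozie admin -oozie http://<oozie-server>:11000/oozie -sharelibupdate
oozie admin -oozie http://<oozie-server>:11000/oozie -shareliblist sqoop   # confirm the new jars are picked up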
3. Configure a proxy on the edge node/s so that the data nodes can connect to Netezza (needed when your data nodes are in a private network).
[Sqoop launches several map tasks on the data nodes, and those data nodes connect directly to Netezza.]
3.1 Install HAProxy on the edge node/s:
yum install haproxy
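Optionally, enable the service at boot as well (assuming a RHEL/CentOS 6 style init, which the yum/service commands here suggest):
chkconfig haproxy on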
3.2 Configure the proxy for Netezza.
Append the below to '/etc/haproxy/haproxy.cfg'. Here the proxy listens on port 25501 on the edge node/s and forwards incoming connections to the Netezza host:port.
# Proxy For Netezza
listen netezza <edge-node-ip>:25501
    mode tcp
    option tcplog
    balance leastconn
    server netezza <netezza-host>:<netezza-port>
3.3 Restart HAProxy:
service haproxy restart
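To confirm the tunnel works before running Sqoop, a quick check from one of the data nodes (nc is assumed to be installed; the host name is a placeholder):
nc -vz <edge-node-host> 25501    # should report the connection as open/succeeded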
4. To import, use the below.
*** Notes:
- the user should have the CREATE EXTERNAL TABLE privilege on the target database on Netezza
- use the username in upper case
- use --direct (each map task of the Netezza connector's import job works on a subset of the Netezza partitions and transparently creates and uses an external table to transport data)
- use "-P" for a password prompt
- we can pass MapReduce params like "-D mapreduce.task.timeout=0" just after the import/export command
Here:
4.1 To import data into an HDFS directory use the below:
sqoop import \
--direct \
--connect jdbc:netezza://<netezza-host>:<port>/<database> \
--table D_ACCOUNT \
--fields-terminated-by '|' \
--target-dir /user/etl/temp/ \
--username <USERNAME> \
--password '<password>' \
-- --nz-logdir /tmp/
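Once the job finishes, the imported files can be spot-checked directly on HDFS (the target directory is the one used above; the part file names follow the usual MapReduce output naming):
hadoop fs -ls /user/etl/temp/
hadoop fs -cat /user/etl/temp/part* | head -5    # a few '|' delimited rows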
4.2 To import data into a Hive table use the below (it will prompt for the password as we used "-P"):
sqoop import \
--connect jdbc:netezza://<edge-node-host>:25501/<database> \
--table <SCHEMA>.<TABLE> \
--where "<filter condition>" \
--target-dir "<temp HDFS directory>" \
--username <USERNAME> \
-P \
--hive-import \
--hive-table "<hive_database>.<hive_table>" \
--hive-overwrite \
--delete-target-dir \
--num-mappers 4 \
--verbose \
--direct \
-- --nz-logdir <local log directory> --nz-uploaddir <local upload directory>
4.3 Then use Hive to copy the data from the temp table (text or serial format) to the final table (Parquet + Snappy) via INSERT ... SELECT. Sqoop has some limitations when loading data directly into a Parquet table.
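A minimal sketch of that copy step, assuming hypothetical database and table names (staging.d_account_temp as the Sqoop-loaded temp table and dwh.d_account as the final table, which already exists with a matching schema and STORED AS PARQUET):
hive -e "
SET parquet.compression=SNAPPY;
INSERT OVERWRITE TABLE dwh.d_account
SELECT * FROM staging.d_account_temp;
"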