Tuesday, June 2, 2015

Migrate a Large Table from Hive/Impala to Netezza

We can follow the steps below to migrate a table from Hive to Netezza/IBM PureData without Sqoop.
This is very useful when the table is large (more than 100 GB), as the whole operation doesn't require any local filesystem staging.


0. Create the target table in Netezza.
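
For example, a minimal target DDL might look like this (the columns here are hypothetical; they must mirror the Hive table's schema and column order):

CREATE TABLE my_table (
    id    INTEGER,
    name  VARCHAR(100),
    url   VARCHAR(1000)
)
DISTRIBUTE ON (id);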

1. Create the target directory in HDFS:
hadoop fs -mkdir /user/etl/temp/my_table

2. Log in to hive/beeline and export the target table/query to the HDFS location created above. Here we are using csv-serde to apply a customised quote string [https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar]


beeline -u jdbc:hive2://hiveSeverHost:10000/db_name -n etl

> add jar /var/lib/hive/jars/csv-serde-1.1.2-0.11.0-all.jar;

> create table exp_my_table 
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\,",
"quoteChar" = "\"",
"escapeChar" = "\\")
STORED AS textfile 
LOCATION '/user/etl/temp/my_table/' 
as select * from my_table;
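
Once the CTAS finishes, it's worth sanity-checking the exported files before loading, e.g.:

hadoop fs -du -s -h /user/etl/temp/my_table/          # total export size
hadoop fs -cat /user/etl/temp/my_table/* | head -5    # eyeball a few CSV rows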

3. Create a Unix named pipe:
mkfifo /stage/AIU/ETL/temp/mypipe1
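
Note that opening a FIFO blocks until both a reader and a writer have attached, which is why the nzload in the next step must run in the background or in another session. A quick check that the pipe was created:

ls -l /stage/AIU/ETL/temp/mypipe1     # file type 'p' means named pipe, e.g. prw-r--r--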

4. Issue the command below in the background or in a separate session, loading data from the named pipe created above (this command will block until data is written to the named pipe). Make sure -delim and -quotedValue match the previous step. Add other nzload options as per your requirements.

nzload -host nzhost_IP -u NZ_User -db DB_NAME -t my_table -delim "," -quotedValue DOUBLE -escapeChar '\' -df /stage/AIU/ETL/temp/mypipe1 -ignoreZero YES -ctrlChars -crInString &

To allow loading unprintable strings and to ignore null bytes, especially in URLs:
-ignoreZero YES : "|<nul>AB<nul>CDEF|" will become "|ABCDEF|"
-ctrlChars : accept control characters in char/varchar fields
-crInString : allow unescaped carriage returns (CR); line feed (LF) becomes the default end-of-row indicator

ref: https://www-304.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.load.doc/c_load_handle_legal_characters.html

5. Now, from a separate session (if step 4 is not in the background), copy the Hive table's HDFS files (step 2) to the named pipe (created in step 3).

hadoop fs -cat /user/etl/temp/my_table/* > /stage/AIU/ETL/temp/mypipe1
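
If Hive output compression was enabled during the CTAS in step 2, the files under the export directory will be compressed (e.g. gzip) and "hadoop fs -cat" would stream compressed bytes into the pipe. In that case use "hadoop fs -text", which decompresses known codecs, instead:

hadoop fs -text /user/etl/temp/my_table/* > /stage/AIU/ETL/temp/mypipe1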

6. Data now starts flowing through the named pipe, where nzload has been waiting for input (step 4), so at this point rows stream into Netezza. When the load completes, or any error is encountered, both the nzload and HDFS copy commands will exit. Check the nzload log and any on-screen Unix errors for further troubleshooting.
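
The whole pipeline (steps 3 to 5) can also be wrapped in a single shell script; a minimal sketch, using the same placeholder paths, host, and credentials as above:

#!/bin/bash
# Minimal sketch of steps 3-5; paths, host, and credentials are the
# placeholder values used throughout this post.
PIPE=/stage/AIU/ETL/temp/mypipe1
mkfifo "$PIPE"

# Step 4: start the loader in the background, reading from the pipe
nzload -host nzhost_IP -u NZ_User -db DB_NAME -t my_table \
    -delim "," -quotedValue DOUBLE -escapeChar '\' \
    -df "$PIPE" -ignoreZero YES -ctrlChars -crInString &
NZLOAD_PID=$!

# Step 5: stream the exported files from HDFS into the pipe
hadoop fs -cat /user/etl/temp/my_table/* > "$PIPE"
CAT_RC=$?

# Wait for nzload to drain the pipe, then clean up and report both exit codes
wait "$NZLOAD_PID"
NZLOAD_RC=$?
rm -f "$PIPE"
echo "hdfs copy rc=$CAT_RC, nzload rc=$NZLOAD_RC"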

