Tuesday, September 15, 2015

Configure Sqoop for Netezza and Import table to Hive Metastore



Our scenario is as below:

- DWH: Netezza (IBM Puredata)
- Hadoop: Cloudera (CDH 5.4), managed with Cloudera Manager.
- Hadoop cluster configuration consideration: all the worker nodes are behind a private network and can only be reached through our edge/management nodes.
- Objective: Import data from Netezza to a Hive table.

I followed the steps below to meet the objective.
As per our current stats, we are now able to migrate 700 GB of data in 1 hour (using 4 channels or mappers).

1. Install Netezza connectors:

1.1 Install the Netezza connector for Sqoop following the steps in this link: http://cloudera.com/content/cloudera/en/documentation/connectors/latest/Netezza/Cloudera-Connector-for-Netezza/ccnt_topic_3.html

1.2 (for Oozie) Download "sqoop-nz-connector-1.2c5.tar" from the link below and upload sqoop-nz-connector-1.2c5.jar to HDFS: /user/oozie/share/lib/lib_<timestamp>/sqoop/
http://www.cloudera.com/content/cloudera/en/downloads/connectors/sqoop/netezza/1-2c5-for-cdh-5-x.html

      1.3 sudo -su oozie hadoop fs -put /tmp/sqoop-nz-connector-1.2c5.jar /user/oozie/share/lib/lib_<timestamp>/sqoop/


2. In addition to the above,
2.1 The Netezza JDBC driver directory /var/lib/sqoop should have the below permissions:
  the directory should be set as chmod 755 /var/lib/sqoop
  the full path up to the driver file /var/lib/sqoop/nzjdbc3.jar should be accessible by others (chmod 444 /var/lib/sqoop/nzjdbc3.jar)
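A quick way to verify (the Sqoop map tasks may run as other OS users, so "others" need read access):

ls -ld /var/lib/sqoop /var/lib/sqoop/nzjdbc3.jar
# expect drwxr-xr-x for the directory and -r--r--r-- for the jar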
 
2.2 (for Oozie) Upload the Netezza JDBC driver to HDFS: /user/oozie/share/lib/lib_<timestamp>/sqoop/
sudo -su oozie hadoop fs -put /var/lib/sqoop/nzjdbc3.jar /user/oozie/share/lib/lib_<timestamp>/sqoop/

2.3 Restart Oozie.

3. Configure a proxy on the edge node(s) so the data nodes can connect to Netezza (needed when your data nodes are on a private network).
[Sqoop launches several map tasks on the data nodes, and the data nodes connect directly to Netezza.]

3.1 Install haproxy on the edge node(s):
  yum install haproxy
 
3.2 Configure the proxy for Netezza.
Append the below to '/etc/haproxy/haproxy.cfg'. Here the proxy will listen on port 25501 on the edge node(s) and forward the incoming connections to the Netezza host:port.

# Proxy For Netezza
listen netezza <edge_node_ip>:25501
    mode tcp
    option tcplog
    balance leastconn

    server netezza <netezza_host>:<netezza_port>

3.3 service haproxy restart
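
Before kicking off Sqoop, we can sanity-check that the proxy is reachable from a data node (assuming nc/netcat is installed there):

nc -vz <edge_node_host> 25501

A successful TCP connect here means the map tasks should be able to reach Netezza through the proxy.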

4. To import, use the below.
*** The user should have the CREATE EXTERNAL TABLE privilege on the target database on Netezza.

- use the username in upper case
- use --direct (each map task of the Netezza connector's import job works on a subset of the Netezza partitions and transparently creates and uses an external table to transport data)
- use "-P" for a password prompt
- we can pass MapReduce params like "-D mapreduce.task.timeout=0" just after the import/export command
Here:
<netezza_host>: the Netezza host/IP, or the host/IP of the edge node running the proxy for Netezza.
<netezza_port>: the Netezza port (default 5480), or the port listening for Netezza on the edge node running the proxy.

4.1     To import data into an HDFS directory use the below:
sqoop import  \
--direct \
--connect jdbc:netezza://<netezza_host>:<netezza_port>/<database> \
--table D_ACCOUNT \
--fields-terminated-by '|' \
--target-dir /user/etl/temp/ \
--username <USERNAME> \
--password '<password>' \
-- --nz-logdir /tmp/
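
A quick sanity check after the import finishes (we expect roughly one output file per mapper under the target dir):

hadoop fs -ls /user/etl/temp/
hadoop fs -cat /user/etl/temp/* | head -5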

4.2     To import data into a Hive table use the below (it will prompt for the password as we used "-P"):
sqoop import  \
--connect jdbc:netezza://<edge_node_host>:25501/<database>  \
--table <SCHEMA>.<TABLE>  \
--where <condition>  \
--target-dir "<temp_hdfs_dir>"  \
--username <USERNAME>  \
-P \
--hive-import  \
--hive-table "<hive_db>.<hive_table>"  \
--hive-overwrite  \
--delete-target-dir  \
--num-mappers 4  \
--verbose \
--direct  \
-- --nz-logdir <log_dir>  --nz-uploaddir <upload_dir>


4.3     Then use Hive to copy data from the temp table (text or serial format) to the final table (Parquet + Snappy) with an insert ... select. Sqoop has some limitations loading data directly into a Parquet table.
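
A minimal sketch of that copy step, with hypothetical table names d_account_txt and d_account_pq (parquet.compression is a standard Hive table property for Parquet compression):

beeline -u jdbc:hive2://hiveServerHost:10000/db_name -n etl -e "
CREATE TABLE d_account_pq
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM d_account_txt;"

An INSERT ... SELECT into a pre-created Parquet table works the same way.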

Tuesday, June 2, 2015

Migrate Large table from Hive/Impala to Netezza

We can follow the steps below to migrate a table from Hive to Netezza/IBM Puredata (without Sqoop).
It is very useful when the table is large (more than 100 GB), as the full operation doesn't require the local filesystem.


0. Create the target table in netezza.

1. create the target directory in HDFS
hadoop fs -mkdir /user/etl/temp/my_table

2. Log in to hive/beeline and export the target table/query to the HDFS location created above. Here we are using csv-serde to set a customised quote string [https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar]


beeline -u jdbc:hive2://hiveServerHost:10000/db_name -n etl

> add jar /var/lib/hive/jars/csv-serde-1.1.2-0.11.0-all.jar;

> create table exp_my_table 
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\,",
"quoteChar" = "\"",
"escapeChar" = "\\")
STORED AS textfile 
LOCATION '/user/etl/temp/my_table/' 
as select * from my_table;
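
Back at the shell, a quick check that the export landed in the location created in step 1:

hadoop fs -ls /user/etl/temp/my_table/
hadoop fs -cat /user/etl/temp/my_table/* | head -3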

3. Create a unix named pipe:
mkfifo /stage/AIU/ETL/temp/mypipe1

4. Issue the below command in the background or in a separate session, loading data from the named pipe created above (this command will block until data is inserted into the named pipe). Make sure -delim & -quotedValue match the previous step. Add other nzload options as per requirements.

nzload -host nzhost_IP -u NZ_User -db DB_NAME -t my_table -delim "," -quotedValue DOUBLE -escapeChar '\' -df /stage/AIU/ETL/temp/mypipe1 -ignoreZero YES -ctrlChars -crInString &

To allow loading un-printable strings and ignore "null bytes", especially in URLs:
-ignoreZero YES : "|<nul>AB<nul>CDEF|" will become "|ABCDEF|"
-ctrlChars : accept control chars in char/varchar fields
-crInString : allow unescaped carriage returns (CR); line feed (LF) becomes the default end-of-row indicator

ref: https://www-304.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.load.doc/c_load_handle_legal_characters.html

5. Now, from a separate session (if step 4 is not in the background), copy the hive table HDFS location (step 2) to the named pipe (created in step 3).

hadoop fs -cat /user/etl/temp/my_table/* > /stage/AIU/ETL/temp/mypipe1

6. Data will start flowing into the named pipe, and nzload has been waiting for data in the named pipe (step 4), so at this point data starts flowing to Netezza. When the data load is done or any error is encountered, both the nzload and HDFS copy commands will exit. Please check the nzload log and any on-screen Unix errors for further troubleshooting.
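
For reference, below is a minimal sketch combining steps 3 to 5 into one script (same paths and options as above; adjust the nzload connection parameters to your environment):

#!/bin/bash
# Steps 3-5 in one go: create the pipe, start nzload reading from it,
# then stream the exported table from HDFS into the pipe.
PIPE=/stage/AIU/ETL/temp/mypipe1
mkfifo $PIPE

# Reader: nzload blocks until data arrives in the pipe (step 4)
nzload -host nzhost_IP -u NZ_User -db DB_NAME -t my_table \
  -delim "," -quotedValue DOUBLE -escapeChar '\' \
  -df $PIPE -ignoreZero YES -ctrlChars -crInString &

# Writer: copy the Hive table's HDFS files into the pipe (step 5)
hadoop fs -cat /user/etl/temp/my_table/* > $PIPE

wait          # wait for nzload to exit
rm -f $PIPE   # remove the pipe when done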


HDFS Directories Under Metastore(hive) Location (/user/hive/warehouse/) Without Corresponding Metastore Tables

Sometimes it happens that for some reason we drop tables from hive/impala but the corresponding HDFS directories are not removed.
As per my experience, and also per the link below, the issue is mainly the Trash directory: either the permissions on the directory, or no HDFS home directory exists for the user dropping the table.

https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/8J0IZf8E1yc

In our case we suspect it was a HUE user without an HDFS home directory.

Whatever the cause, we have to keep track of these orphan directories.
Manually identifying these HDFS locations is too tiring for a lazy guy like me. So, I prepared the simple python script below. Hope it will help others too.

Please comment if you have a better idea or there is something wrong with my script.
Please test properly before you proceed with any action based on this script.

#!/usr/bin/python

# Script to find HDFS directories under the metastore location (/user/hive/warehouse/) without corresponding tables
# Just call the script with the target database name as a mandatory parameter; with no parameter it will show an error.
# If the db name is wrong it will show "No such file or directory" & "Database does not exist".
# Prerequisites: hive CLI (NOT beeline), python 2.6, direct access to the target hadoop cluster

import sys

# Convert a string having words separated by new line (\n) to a list
# Example input : convertNewLineStringsToList('abc\n123\n')
# Example output: ['abc', '123']
def convertNewLineStringsToList(stringWithNewLine):
    out = []
    buff = []
    for c in stringWithNewLine:
        if c == '\n':
            out.append(''.join(buff))
            buff = []
        else:
            buff.append(c)
    else:
        # flush whatever is left in the buffer after the loop ends
        if buff and buff[0]:
            out.append(''.join(buff))
    return out

# Get the list of all directory names (for corresponding tables) from HDFS for a metastore database
# Example input : getHDFSDirForDB("dbname")
# Example output: ['table_a', 'table_b']
def getHDFSDirForDB(dbname):
    import os
    # get the HDFS list by running the hadoop fs command from the OS
    hdfs_list = os.popen("hadoop fs -ls /user/hive/warehouse/" + dbname + ".db/|awk '{print $8}'|cut -d'/' -f6")
    hdfs_list = hdfs_list.read()
    # collect HDFS dirs as a list
    out = convertNewLineStringsToList(hdfs_list)
    # return HDFS dirs as a list of non-empty names, keeping the case intact, as the HDFS location should match exactly
    return [x for x in out if x]


# Get the list of all table names for a metastore database
# Example input : getTablesForDB("dbname")
# Example output: ['table_a', 'table_b']
def getTablesForDB(dbname):
    import os
    # get the table list by running the hive command from the OS
    # here we are using silent mode (-S) to get rid of unwanted text
    table_list = os.popen("hive --database " + dbname + " -S -e \"show tables\"|sed -n '1!p'")
    table_list = table_list.read()
    # collect table names as a list
    out = convertNewLineStringsToList(table_list)
    # return table names as a list of non-empty names, made lower case
    return [x.lower() for x in out if x]


# Main program, called with the target DB name for which we want to find HDFS directories without corresponding tables
# Example input: main("dbname")
# Example output:
# table_a
# table_b
def main(dbname):
    HDFS_LIST = getHDFSDirForDB(dbname)
    DB_LIST = getTablesForDB(dbname)

    # directories that do not have corresponding tables,
    # i.e. elements of HDFS_LIST that are not present in DB_LIST
    # (make the HDFS name lowercase while comparing with the DB table list, which is already lowercase)
    diff = [x for x in HDFS_LIST if x.lower() not in DB_LIST]

    print '\n'.join(diff)


#________________main________________
if __name__ == "__main__":
    main(sys.argv[1])
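
Usage (assuming we save the script as find_orphan_dirs.py; the name is mine):

$ chmod +x find_orphan_dirs.py
$ ./find_orphan_dirs.py my_database

It prints one orphan directory name per line; empty output means every directory under /user/hive/warehouse/my_database.db/ has a matching table.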


Thanks for the idea of the set-based minus operation: http://stackoverflow.com/questions/13169725/how-to-convert-a-string-that-has-newline-characters-in-it-into-a-list-in-python

Monday, January 19, 2015

Configure Flume to Collect Twitter Streams Using Cloudera Manager


Below are the steps I followed to collect twitter streams to HDFS with Flume:


1. Before you start, please make sure you have the below 4 items generated for the Twitter streaming API. For more info please check the Twitter streaming API documentation.
consumerKey
consumerSecret
accessToken
accessTokenSecret

2. From Cloudera Manager, stop the currently running Flume service.

3. Create a dedicated Flume agent role group [named: "AgentTweets"] under the "Flume" service:
- Cloudera manager-> flume-> configuration-> Role Groups > "Create new group..." > provide the name "AgentTweets"

4. Assign a host to role group AgentTweets:
- Cloudera manager->flume->configuration-> Role Groups > select "Agent Default Group" > check "agent (sthdmgt1-pvt)" > "Action for Selected" > Move to Different Role Group... > AgentTweets > Move

5. Check the Flume plugin directories:
Cloudera manager->flume->configuration->AgentTweets > "Plugin directories"

In our case plugin directories are :
/usr/lib/flume-ng/plugins.d
/var/lib/flume-ng/plugins.d

*** Note: As we are using host "sthdmgt1-pvt" for agent role group AgentTweets, we are making these and the rest of the changes on host "sthdmgt1-pvt" only.

6. Create the plugin directories, as they do not exist:
mkdir -p /usr/lib/flume-ng/plugins.d
mkdir -p /var/lib/flume-ng/plugins.d

-- Also create the location for the twitter plugin:
mkdir -p /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/
mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib/

chown -R flume:flume /usr/lib/flume-ng/
chown -R flume:flume /var/lib/flume-ng/

7. Download the custom Flume source and copy it to both of the Flume plugin locations:
cd /tmp 
wget "http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar"
cp flume-sources-1.0-SNAPSHOT.jar /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/
cp flume-sources-1.0-SNAPSHOT.jar /var/lib/flume-ng/plugins.d/twitter-streaming/lib/

8. If we use twitter4j-* jars with version 3 or above, we may end up with the below error in the Flume agent log:

Unable to start EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:IDLE} } - Exception follows.
java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;
at com.cloudera.flume.source.TwitterSource.start(TwitterSource.java:139)

This is because both twitter4j-stream-3.0.3.jar & flume-sources-1.0-SNAPSHOT.jar contain the same class "TwitterSource". To resolve the above error, do the below steps:

8.1 Go to the Cloudera default location for the Flume library, take a backup, and remove all the twitter4j-*-3.0.3.jar files:

[root@sthdmgt1-pvt ~]# ll /opt/cloudera/parcels/CDH/lib/flume-ng/lib/|grep twitter4j
lrwxrwxrwx 1 root root 38 Nov 18 11:09 twitter4j-core-3.0.3.jar -> ../../../jars/twitter4j-core-3.0.3.jar
lrwxrwxrwx 1 root root 47 Nov 18 11:09 twitter4j-media-support-3.0.3.jar -> ../../../jars/twitter4j-media-support-3.0.3.jar
lrwxrwxrwx 1 root root 40 Nov 18 11:09 twitter4j-stream-3.0.3.jar -> ../../../jars/twitter4j-stream-3.0.3.jar
[root@sthdmgt1-pvt ~]#

*** Note: Please make sure to take a note of the above links for future reference.
8.2 In this case, we just need to remove the links as below:
# rm -rf /opt/cloudera/parcels/CDH/lib/flume-ng/lib/twitter4j-*-3.0.3.jar

8.3 Download an older version of twitter4j and copy it to both of the Flume plugin locations. We are choosing the latest 2.x version, which is 2.2.6:
# cd /tmp

# wget http://twitter4j.org/maven2/org/twitter4j/twitter4j-stream/2.2.6/twitter4j-stream-2.2.6.jar
# wget http://twitter4j.org/maven2/org/twitter4j/twitter4j-core/2.2.6/twitter4j-core-2.2.6.jar
# wget http://twitter4j.org/maven2/org/twitter4j/twitter4j-media-support/2.2.6/twitter4j-media-support-2.2.6.jar

# cp twitter4j-*.jar /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
# cp twitter4j-*.jar /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/

# chown -R flume:flume /usr/lib/flume-ng/
# chown -R flume:flume /var/lib/flume-ng/
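
At this point the plugin directories should contain the connector jar plus the three 2.2.6 jars:

# ls -l /var/lib/flume-ng/plugins.d/twitter-streaming/lib/

We expect flume-sources-1.0-SNAPSHOT.jar and the twitter4j core/stream/media-support 2.2.6 jars, all owned by flume:flume.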

Ref: http://stackoverflow.com/questions/19189979/cannot-run-flume-because-of-jar-conflict

9. *** Note: make sure the system time zone and time are in sync with your Twitter settings, to avoid 401 errors.

10. Configure the flume agent role group:
10.1 Set the agent name:
Cloudera manager->flume->configuration->AgentTweets > Agent Name > set the agent name to "AgentTweets" [make sure the agent name here and in the config in the next step are the same]
10.2 Set the Flume configuration; make sure all the config lines are prefixed with the agent name set in step 10.1:
Cloudera manager->flume->configuration->AgentTweets > Configuration File > copy the below config and replace the entire content of "Configuration File":
--------------------twitter flume conf start -------------
AgentTweets.sources = Twitter
AgentTweets.channels = MemChannel
AgentTweets.sinks = HDFS

AgentTweets.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
AgentTweets.sources.Twitter.channels = MemChannel
AgentTweets.sources.Twitter.consumerKey = consumerKey from step 1
AgentTweets.sources.Twitter.consumerSecret = consumerSecret from step 1
AgentTweets.sources.Twitter.accessToken = accessToken from step 1
AgentTweets.sources.Twitter.accessTokenSecret = accessTokenSecret from step 1
AgentTweets.sources.Twitter.keywords = malaysia, msia

AgentTweets.sinks.HDFS.channel = MemChannel
AgentTweets.sinks.HDFS.type = hdfs
AgentTweets.sinks.HDFS.hdfs.path = hdfs://namenodeHostnameOrIP:8020/user/flume/tweets/malaysia/%Y/%m/%d/%H/
AgentTweets.sinks.HDFS.hdfs.fileType = DataStream
AgentTweets.sinks.HDFS.hdfs.writeFormat = Text
AgentTweets.sinks.HDFS.hdfs.batchSize = 1000
AgentTweets.sinks.HDFS.hdfs.rollSize = 0
AgentTweets.sinks.HDFS.hdfs.rollCount = 10000

AgentTweets.channels.MemChannel.type = memory
AgentTweets.channels.MemChannel.capacity = 10000
AgentTweets.channels.MemChannel.transactionCapacity = 100
--------------------twitter flume conf end -------------

Note: in the above config, please take note of hdfs.path; it is set to hdfs://namenodeHostnameOrIP:8020/user/flume/tweets/malaysia/%Y/%m/%d/%H/
It points to the NameNode host and port 8020.

11. Take note of the HDFS location under "AgentTweets.sinks.HDFS.hdfs.path"; these locations will be created by Flume.
Create the location up to /user/flume/. Make sure that on target host sthdmgt1-pvt.aiu.axiata, OS user "flume" has read/write on the HDFS directory /user/flume/:
# su - hdfs
$ hadoop fs -mkdir /user/flume
$ hadoop fs -chown flume:flume /user/flume

12. (If you have a proxy) Create a twitter4j property file for it, or handle it at the OS level; there is no proxy option in Flume itself. A sketch is below.
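
A minimal sketch of such a property file (an assumption on my side: twitter4j reads http.proxyHost and http.proxyPort from a twitter4j.properties file on its classpath; the host and port values below are hypothetical):

# twitter4j.properties (place it on the Flume agent's classpath)
http.proxyHost=proxy.example.com
http.proxyPort=3128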

13. Start/restart the Flume service from Cloudera Manager.

14. Monitor the log:
Cloudera manager->flume->instances->AgentTweets > click "Agent" for host "sthdmgt1-pvt.aiu.axiata" > click "Log File"

15. If the log looks OK, then check the HDFS location:
$ su - hdfs
$ hadoop fs -ls /user/flume/tweets/malaysia

Note: There will be data under /user/flume/tweets/malaysia/<year>/<month>/<day>/<hour>/ (from the %Y/%m/%d/%H pattern in hdfs.path)
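
A couple of quick checks on the collected data (FlumeData is the HDFS sink's default file prefix; the four wildcards match the %Y/%m/%d/%H layout):

$ hadoop fs -ls -R /user/flume/tweets/malaysia | tail -5
$ hadoop fs -cat /user/flume/tweets/malaysia/*/*/*/*/FlumeData.* | head -1

Each line should be one tweet in raw JSON, as emitted by the custom TwitterSource.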


Patching Cloudera Hadoop from 5.0.2 to 5.2.0 (Cloudera Hadoop Patching/Version upgrade part 2 of 2)


Please complete part 1 (upgrade Cloudera Manager) before starting the below steps.

Ref 1:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_upgrade_to_cdh52_using_parcels.html


Ref 2: For rolling upgrade :
[We had some issues, as in step 13, so we are not following this; we will follow the above link. Anyhow, we would need to stop some important services like Impala & Hive for a rolling upgrade too. To me both are the same and need downtime (not that different).]
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_rolling_upgrade.html


-----------------------------
Upgrading CDH 5 Using Parcels
-----------------------------

*** First we need to upgrade Cloudera Manager to 5.2.0 and then follow the below steps to upgrade CDH.
*** Review Ref 1 again before starting.
-- ----------------
1. Before You Begin
-- ----------------
1.1 Make sure there are no Oozie workflows in RUNNING or SUSPENDED status; otherwise the Oozie database upgrade will fail and you will have to reinstall CDH 4 to complete or kill those running workflows.
We can use the Oozie web GUI to check it:
http://10.202.225.102:11000/oozie/
1.2 Run the "Host Inspector" and fix every issue.
1.2.1 On Cloudera manager Click the Hosts tab.
1.2.2 Click Host Inspector. Cloudera Manager begins several tasks to inspect the managed hosts.
1.2.3 After the inspection completes, click Download Result Data or Show Inspector Results to review the results.
1.2.4 Click "Show Inspector Results" to check the result
1.2.5 If there are any validation errors, please consult with Cloudera Support before proceeding further.
1.3 Run hdfs fsck / and hdfs dfsadmin -report and fix any issues
# su - hdfs
$ hdfs fsck /
$ hdfs dfsadmin -report
$ hbase hbck
1.4 enable maintenance mode on your cluster
1.4.1 Click (down arrow) to the right of the cluster name and select Enter Maintenance Mode.
1.4.2 Confirm that you want to do this.

-- -----------------------------------------
2. Back up the HDFS Metadata on the NameNode
-- -----------------------------------------
   2.1 Stop the cluster. It is particularly important that the NameNode role process is not running so that you can make a consistent backup.
   2.2 CM > HDFS > Configuration 
   2.3 In the Search field, search for "NameNode Data Directories". This locates the NameNode Data Directories property.
   2.4 From the command line on the NameNode host, back up the directory listed in the NameNode Data Directories property. 
For example, if the data directory is /mnt/hadoop/hdfs/name, do the following as root:
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
Note:  If you see a file containing the word lock, the NameNode is probably still running. Repeat the preceding steps, starting by shutting down the CDH services.
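
Before taking the tar backup, a quick way to check for a leftover lock file (in_use.lock is the lock file HDFS keeps in its storage directories):
# find /mnt/hadoop/hdfs/name -name in_use.lock
If this prints anything, the NameNode is probably still running, as per the note above.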
-- ---------------------------------------------
3. Download, Distribute, and Activate the Parcel
-- ---------------------------------------------
Note: Before starting below, enable internet access on the node where Cloudera Manager is installed.
3.1 In the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar
3.2 Click Download for the version(s) you want to download.
3.3 When the download has completed, click Distribute for the version you downloaded.
3.4 When the parcel has been distributed and unpacked, the button will change to say Activate.
3.5 Click Activate. You are asked if you want to restart the cluster. *** Do not restart the cluster at this time.
    3.6 Click Close
    - If some service fails to start, ignore it for the time being.
    - Follow the steps below, then check again.

-- ---------------------
4. Upgrade HDFS Metadata
-- ---------------------
4.1 Start the ZooKeeper service.
4.2 Go to the HDFS service.
4.3 Select Actions > Upgrade HDFS Metadata.

-- -----------------------------------
5. Upgrade the Hive Metastore Database
-- -----------------------------------
   5.1 Back up the Hive metastore database.
   5.2 Go to the Hive service.
   5.3 Select Actions > Upgrade Hive Metastore Database Schema and click Upgrade Hive Metastore Database Schema to confirm.
   5.4 If you have multiple instances of Hive, perform the upgrade on each metastore database.

-- --------------------------
6. Upgrade the Oozie ShareLib
-- --------------------------
   6.1 Go to the Oozie service.
   6.2 Select Actions > Install Oozie ShareLib and click Install Oozie ShareLib to confirm.
-- -------------
7. Upgrade Sqoop
-- -------------
   7.1 Go to the Sqoop service.
   7.2 Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
 
-- -----------------------
8. Upgrade Sentry Database
-- -----------------------
Required if you are updating from CDH 5.0 to 5.1 or later.

   8.1 Back up the Sentry database.
   8.2 Go to the Sentry service.
   8.3 Select Actions > Upgrade Sentry Database Tables and click Upgrade Sentry Database Tables to confirm.

-- -------------
9. Upgrade Spark
-- -------------
Required if you are updating from CDH 5.0 to 5.1 or later.

9.1 Go to the Spark service.
9.2 Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
9.3 Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.

--- -------------------
10. Restart the Cluster
--- -------------------
  - CM > Cluster Name > Start

--- ---------------------------------
11. Deploy Client Configuration Files
--- ---------------------------------
   - On the Home page, click (down arrow) to the right of the cluster name and select Deploy Client Configuration.
   - Click the Deploy Client Configuration button in the confirmation pop-up that appears.
  
--- ----------------------------------
12. Finalize the HDFS Metadata Upgrade
--- ----------------------------------
After ensuring that the CDH 5 upgrade has succeeded and that everything is running smoothly, finalize the HDFS metadata upgrade. It is not unusual to wait days or even weeks before finalizing the upgrade.
   - Go to the HDFS service.
   - Click the Instances tab.
   - Click the NameNode instance.
   - Select Actions > Finalize Metadata Upgrade and click Finalize Metadata Upgrade to confirm.

--- -------------
13. Common issues
--- -------------
13.1 If the HDFS NameNode fails to start with the below error in the log:
"File system image contains an old layout version -55."
- CM > HDFS > Action > Stop
- CM > HDFS > Action > Upgrade HDFS Metadata
13.2 Impala showing "This Catalog Server is not connected to its StateStore":
- CM > Hue > Action > Stop
- CM > Impala > Action > Stop
- CM > Hive > Action > Stop
- CM > Hive > Action > Upgrade Metastore Schema
- CM > Hive > Action > Upgrade Metastore Namenodes
- CM > Hive > Action > Start
- CM > Impala > Action > Start
- CM > Hue > Action > Start
13.3 Hue not showing the databases list:
- Use Internet Explorer; it seems it doesn't work on Chrome.

--- --------------------------------------
14. Test the upgraded CDH working properly
--- --------------------------------------
    We can follow the steps we did in "Step 4 & 5" of install_CM_CDH_5.0.2.txt
Note: now the example jar file is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.2.0.jar
14.1 Check from Cloudera Manager; there should not be any alarms or warnings.
14.2 Run the host inspector and check for any alarms or warnings. If possible, fix them.
14.3 Check that existing data on Impala is accessible.
  - Also check an analytical query like the below (for Impala 2.0 and above):
SELECT qtr_hour_id, date_id,
 count(*) OVER (PARTITION BY date_id, qtr_hour_id) AS how_many_qtr
FROM f_ntw_actvty_http;
14.4 Check that importing data from Netezza works properly.
14.5 Check that exporting data to Netezza works properly.
14.6 Run example YARN jobs and check that they are successful.
14.7 Run terasort and check the output.
14.8 Check that Mahout works properly & can use data stored on HDFS.
14.9 Check that R Hadoop works properly & can use data stored on HDFS.
14.10 Wait for 1 or 2 days and monitor that daily jobs work fine.
14.11 Change all the env variables to point to the latest CDH.


--- ---------------------
15. Rollback to CDH 5.0.2
--- ---------------------
Below is the feedback from Cloudera Support regarding rollback.
We can prepare manual rollback steps while doing a sample upgrade on the test environment.

There is no true "rollback" for CDH.  While it is true that you can deactivate the new parcel and reactivate the old, or remove the new packages and re-install the old, an upgrade does not only constitute a change in the binaries and libraries for CDH.  Some components store metadata in databases, and the upgrade process will usually modify the database schema in these databases -- for example, the Hive Metastore, the Hue database, the Sqoop2 metastore, and the Oozie database.  If an upgrade of CM is involved, the CM database is also upgraded.

As such, there is no "rollback" option.  What Cloudera recommends is that all databases be backed up prior to the upgrade taking place (you will note this warning in various places in the upgrade documentation).  If necessary, a point-in-time restore can be performed, but there is no automated way to do this -- it is a highly manual process.

This is why we recommend thoroughly testing the upgrade process in an environment closely matching your production system.  Then, during the actual production upgrade, take backups of metadata stores as noted in the upgrade documentation, and if an issue does occur during the upgrade, the backups can be used to roll-back and then retry the failed upgrade steps for that particular component.


Patching Cloudera Manager from 5.0.2 to 5.2.0 (Cloudera Hadoop Patching/Version upgrade part 1 of 2)

Let's say we have a Hadoop cluster with the below nodes running CDH 5.0.2. We want to upgrade the cluster to CDH 5.2.0.

192.168.56.201 dvhdmgt1.example.com dvhdmgt1  # Management node hosting Cloudera Manager
192.168.56.202 dvhdnn1.example.com  dvhdnn1 # Name node
192.168.56.203 dvhdjt1.example.com  dvhdjt1 # Jobtracker/Resource Manager
192.168.56.101 dvhddn01.example.com dvhddn01 # Datanode1
192.168.56.102 dvhddn02.example.com dvhddn02 # Datanode2
192.168.56.103 dvhddn03.example.com dvhddn03 # Datanode3

This is a two-step process:
Part 1: Upgrade Cloudera Manager

