Monday, January 19, 2015

Configure Flume to Collect Twitter Streams Using Cloudera Manager


Below are the steps I followed to collect Twitter streams into HDFS with Flume:


1. Before you start, please make sure you have the below four items generated for the Twitter streaming API. For more info, please check the Twitter streaming API documentation:
consumerKey
consumerSecret
accessToken
accessTokenSecret

2. From Cloudera Manager, stop the currently running Flume service.

3. Create a dedicated Flume agent role group (named "AgentTweets") under the "Flume" service:
- Cloudera manager-> flume-> configuration-> Role Groups > "Create new group..." > provide the name "AgentTweets"

4. Assign a host to the role group AgentTweets:
- Cloudera manager->flume->configuration-> Role Groups > select "Agent Default Group" > check "agent (sthdmgt1-pvt)" > "Action for Selected" > Move To Different Role Group... > AgentTweets > Move

5. Check for Flume plugins directory:
Cloudera manager->flume->configuration->AgentTweets > "Plugin directories"

In our case the plugin directories are:
/usr/lib/flume-ng/plugins.d
/var/lib/flume-ng/plugins.d

*** Note: As we are using host "sthdmgt1-pvt" for the agent role group AgentTweets, these and the rest of the changes are done on host "sthdmgt1-pvt" only.

6. Create the plugin directories, as they do not exist:
mkdir -p /usr/lib/flume-ng/plugins.d
mkdir -p /var/lib/flume-ng/plugins.d

-- Also create the locations for the Twitter plugin:
mkdir -p /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/
mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib/

chown -R flume:flume /usr/lib/flume-ng/
chown -R flume:flume /var/lib/flume-ng/

7. Download the custom Flume source and copy it to both of the Flume plugin locations:
cd /tmp 
wget "http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar"
cp flume-sources-1.0-SNAPSHOT.jar /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/
cp flume-sources-1.0-SNAPSHOT.jar /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
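(Optional) A quick sanity check -- a sketch, assuming unzip is available on the host -- to confirm the copied jar really contains the custom TwitterSource class:
# unzip -l /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar | grep TwitterSource
# unzip -l /var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar | grep TwitterSource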

8. If we use twitter4j-* jars with version 3 or above, we may end up with the below error in the Flume agent log:

Unable to start EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:IDLE} } - Exception follows.
java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;
at com.cloudera.flume.source.TwitterSource.start(TwitterSource.java:139)

It is due to a conflict between twitter4j-stream-3.0.3.jar and flume-sources-1.0-SNAPSHOT.jar, which both provide the twitter4j classes used by "TwitterSource" but with incompatible versions. To resolve the above error, do the below steps:

8.1 Go to the Cloudera default location for the Flume libraries, take a backup, and remove all the twitter4j-*-3.0.3.jar files:

[root@sthdmgt1-pvt ~]# ll /opt/cloudera/parcels/CDH/lib/flume-ng/lib/|grep twitter4j
lrwxrwxrwx 1 root root 38 Nov 18 11:09 twitter4j-core-3.0.3.jar -> ../../../jars/twitter4j-core-3.0.3.jar
lrwxrwxrwx 1 root root 47 Nov 18 11:09 twitter4j-media-support-3.0.3.jar -> ../../../jars/twitter4j-media-support-3.0.3.jar
lrwxrwxrwx 1 root root 40 Nov 18 11:09 twitter4j-stream-3.0.3.jar -> ../../../jars/twitter4j-stream-3.0.3.jar
[root@sthdmgt1-pvt ~]#

*** Note: Please make sure to take a note of the above links for future reference.
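For example, one way (a sketch; the backup file path is arbitrary) to record the existing symlinks before removing them:
# ls -l /opt/cloudera/parcels/CDH/lib/flume-ng/lib/ | grep twitter4j > /root/twitter4j_symlinks_backup.txt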
8.2 In this case, we just need to remove the symlinks as below:
# rm -rf /opt/cloudera/parcels/CDH/lib/flume-ng/lib/twitter4j-*-3.0.3.jar

8.3 Download an older version of twitter4j and copy it to both of the Flume plugin locations. We are choosing the latest version before 3.x, which is 2.2.6:
# cd /tmp

# wget http://twitter4j.org/maven2/org/twitter4j/twitter4j-stream/2.2.6/twitter4j-stream-2.2.6.jar
# wget http://twitter4j.org/maven2/org/twitter4j/twitter4j-core/2.2.6/twitter4j-core-2.2.6.jar
# wget http://twitter4j.org/maven2/org/twitter4j/twitter4j-media-support/2.2.6/twitter4j-media-support-2.2.6.jar

# cp twitter4j-*.jar /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
# cp twitter4j-*.jar /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/

# chown -R flume:flume /usr/lib/flume-ng/
# chown -R flume:flume /var/lib/flume-ng/

Ref: http://stackoverflow.com/questions/19189979/cannot-run-flume-because-of-jar-conflict

9. *** Note: Make sure the system time zone and time are in sync with your Twitter settings to avoid 401 errors.
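A quick check/fix sketch, assuming ntpdate/ntpd are available and pool.ntp.org is reachable (any NTP server of your choice works):
# date                                      # verify current system time and time zone
# ntpdate -u pool.ntp.org                   # one-off sync against a public NTP server
# service ntpd start && chkconfig ntpd on   # keep the clock in sync going forward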

10. Configure the flume agent role group:
10.1 Set the agent name:
Cloudera manager->flume->configuration->AgentTweets > Agent Name > set the agent name to "AgentTweets" [make sure the agent name here and in the config in the next step are the same]
10.2 Set the Flume configuration; make sure every config line is prefixed with the agent name set in step 10.1:
Cloudera manager->flume->configuration->AgentTweets > "Configuration File" > copy the below config and replace the entire content of "Configuration File":
--------------------twitter flume conf start -------------
AgentTweets.sources = Twitter
AgentTweets.channels = MemChannel
AgentTweets.sinks = HDFS

AgentTweets.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
AgentTweets.sources.Twitter.channels = MemChannel
AgentTweets.sources.Twitter.consumerKey = consumerKey from step 1
AgentTweets.sources.Twitter.consumerSecret = consumerSecret from step 1
AgentTweets.sources.Twitter.accessToken = accessToken from step 1
AgentTweets.sources.Twitter.accessTokenSecret = accessTokenSecret from step 1
AgentTweets.sources.Twitter.keywords = malaysia, msia

AgentTweets.sinks.HDFS.channel = MemChannel
AgentTweets.sinks.HDFS.type = hdfs
AgentTweets.sinks.HDFS.hdfs.path = hdfs://namenodeHostnameOrIP:8020/user/flume/tweets/malaysia/%Y/%m/%d/%H/
AgentTweets.sinks.HDFS.hdfs.fileType = DataStream
AgentTweets.sinks.HDFS.hdfs.writeFormat = Text
AgentTweets.sinks.HDFS.hdfs.batchSize = 1000
AgentTweets.sinks.HDFS.hdfs.rollSize = 0
AgentTweets.sinks.HDFS.hdfs.rollCount = 10000

AgentTweets.channels.MemChannel.type = memory
AgentTweets.channels.MemChannel.capacity = 10000
AgentTweets.channels.MemChannel.transactionCapacity = 100
--------------------twitter flume conf end -------------

Note: in the above config, take note of hdfs.path; it is set to hdfs://namenodeHostnameOrIP:8020/user/flume/tweets/malaysia/%Y/%m/%d/%H/
It points to the NameNode host and port 8020; adjust these to match your cluster.
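If you are not sure of the NameNode URI/port for your cluster, one way to check is the below sketch (run on a host with the HDFS client configuration deployed):
# su - hdfs
$ hdfs getconf -confKey fs.defaultFS    # e.g. hdfs://namenodeHostnameOrIP:8020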

11. Take note of the HDFS location under "AgentTweets.sinks.HDFS.hdfs.path"; these locations will be created by Flume.
Create the location up to /user/flume/. Make sure that on target host sthdmgt1-pvt.aiu.axiata, OS user "flume" has read/write access to the HDFS directory /user/flume/:
# su - hdfs
$ hadoop fs -mkdir /user/flume
$ hadoop fs -chown flume:flume /user/flume

12. (If you are behind a proxy) Create a twitter4j properties file for the proxy settings, or handle it at the OS level; there is no proxy option in Flume itself.
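A minimal sketch of such a twitter4j.properties file (the property names are from the twitter4j documentation; the host/port values are placeholders). The directory holding this file needs to be on the Flume agent's classpath, e.g. the Flume configuration directory, for twitter4j to pick it up:
http.proxyHost=your.proxy.host
http.proxyPort=3128
# optional, only if the proxy requires authentication:
# http.proxyUser=proxyuser
# http.proxyPassword=proxypassword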

13. Start/restart the Flume service from Cloudera Manager.

14. Monitor the log:
Cloudera manager->flume->instances->AgentTweets > click "Agent" for host "sthdmgt1-pvt.aiu.axiata" > click "Log File"

15. If the log looks OK, then check the HDFS location:
# su - hdfs
$ hadoop fs -ls /user/flume/tweets/malaysia

Note: There will be data under /user/flume/tweets/malaysia/<year>/<month>/<day>/<hour>/
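To peek at the collected tweets, something like the below sketch should work (by default the HDFS sink writes files prefixed with "FlumeData"; adjust the date path to one listed by the -ls above):
$ hadoop fs -ls /user/flume/tweets/malaysia/2015/01/19/10/
$ hadoop fs -cat /user/flume/tweets/malaysia/2015/01/19/10/FlumeData.* | head -1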


Patching Cloudera Hadoop from 5.0.2 to 5.2.0 (Cloudera Hadoop Patching/Version upgrade part 2 of 2)


Please complete part 1 (upgrade Cloudera Manager) before starting the below steps.

Ref 1:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_upgrade_to_cdh52_using_parcels.html


Ref 2: For rolling upgrade:
[We had some issues as in step 13, so we are not following this; we will follow the above link. Anyhow, we would need to stop some important services like Impala and Hive for a rolling upgrade too. To me both approaches are the same and need downtime (not that different.)]
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_rolling_upgrade.html


-----------------------------
Upgrading CDH 5 Using Parcels
-----------------------------

*** First we need to upgrade Cloudera Manager to 5.2.0 and then follow the below steps to upgrade CDH.
*** Review Ref 1 again before starting.
-- ----------------
1. Before You Begin
-- ----------------
1.1 Make sure there are no Oozie workflows in RUNNING or SUSPENDED status; otherwise the Oozie database upgrade will fail and you will have to go back to the previous CDH to complete or kill those running workflows.
We can use the Oozie web GUI to check this:
http://10.202.225.102:11000/oozie/
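Alternatively, if the Oozie client is installed, a command-line check like the below sketch can list RUNNING/SUSPENDED workflows:
$ oozie jobs -oozie http://10.202.225.102:11000/oozie -jobtype wf -filter status=RUNNING
$ oozie jobs -oozie http://10.202.225.102:11000/oozie -jobtype wf -filter status=SUSPENDED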
1.2 Run the "Host Inspector" and fix every issue.
1.2.1 On Cloudera manager Click the Hosts tab.
1.2.2 Click Host Inspector. Cloudera Manager begins several tasks to inspect the managed hosts.
1.2.3 After the inspection completes, click Download Result Data or Show Inspector Results to review the results.
1.2.4 Click "Show Inspector Results" to check the result
1.2.5 If there are any validation errors, please consult with Cloudera Support before proceeding further.
1.3 Run hdfs fsck /, hdfs dfsadmin -report and hbase hbck, and fix any issues:
# su - hdfs
$ hdfs fsck /
$ hdfs dfsadmin -report
$ hbase hbck
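It also helps to save these reports to files so they can be compared with the post-upgrade state; a sketch (file locations are arbitrary):
$ hdfs fsck / > /tmp/fsck_before_upgrade.txt 2>&1
$ hdfs dfsadmin -report > /tmp/dfsadmin_report_before_upgrade.txt 2>&1
$ hbase hbck > /tmp/hbck_before_upgrade.txt 2>&1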
1.4 enable maintenance mode on your cluster
1.4.1 Click (down arrow) to the right of the cluster name and select Enter Maintenance Mode.
1.4.2 Confirm that you want to do this.

-- -----------------------------------------
2. Back up the HDFS Metadata on the NameNode
-- -----------------------------------------
   2.1 Stop the cluster. It is particularly important that the NameNode role process is not running so that you can make a consistent backup.
   2.2 CM > HDFS > Configuration 
   2.3 In the Search field, search for "NameNode Data Directories". This locates the NameNode Data Directories property.
   2.4 From the command line on the NameNode host, back up the directory listed in the NameNode Data Directories property. 
For example, if the data directory is /mnt/hadoop/hdfs/name, do the following as root:
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
Note:  If you see a file containing the word lock, the NameNode is probably still running. Repeat the preceding steps, starting by shutting down the CDH services.
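A quick way (sketch) to confirm the backup tar is readable and actually contains the fsimage/edits files:
# tar -tvf /root/nn_backup_data.tar | head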
-- ---------------------------------------------
3. Download, Distribute, and Activate the Parcel
-- ---------------------------------------------
Note: Before starting the below, enable internet access on the node where Cloudera Manager is installed.
3.1 In the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar
3.2 Click Download for the version(s) you want to download.
3.3 When the download has completed, click Distribute for the version you downloaded.
3.4 When the parcel has been distributed and unpacked, the button will change to say Activate.
3.5 Click Activate. You are asked if you want to restart the cluster. *** Do not restart the cluster at this time.
    3.6 Click Close.
    - If some services fail to start, ignore it for the time being.
    - Follow the steps below, then check again.

-- ---------------------
4. Upgrade HDFS Metadata
-- ---------------------
4.1 Start the ZooKeeper service.
4.2 Go to the HDFS service.
4.3 Select Actions > Upgrade HDFS Metadata.

-- -----------------------------------
5. Upgrade the Hive Metastore Database
-- -----------------------------------
   5.1 Back up the Hive metastore database (a sample backup command is sketched after this list).
   5.2 Go to the Hive service.
   5.3 Select Actions > Upgrade Hive Metastore Database Schema and click Upgrade Hive Metastore Database Schema to confirm.
   5.4 If you have multiple instances of Hive, perform the upgrade on each metastore database.
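A sample backup command (a sketch, assuming the Hive metastore lives in a MySQL database named "metastore" as in our installation notes; adjust the host, credentials and DB name to your setup):
# mysqldump -u root -p metastore > /root/hive_metastore_backup_$(date +%Y%m%d).sql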

-- --------------------------
6. Upgrade the Oozie ShareLib
-- --------------------------
   6.1 Go to the Oozie service.
   6.2 Select Actions > Install Oozie ShareLib and click Install Oozie ShareLib to confirm.
-- -------------
7. Upgrade Sqoop
-- -------------
   7.1 Go to the Sqoop service.
   7.2 Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
 
-- -----------------------
8. Upgrade Sentry Database
-- -----------------------
Required if you are updating from CDH 5.0 to 5.1 or later.

   8.1 Back up the Sentry database.
   8.2 Go to the Sentry service.
   8.3 Select Actions > Upgrade Sentry Database Tables and click Upgrade Sentry Database Tables to confirm.

-- -------------
9. Upgrade Spark
-- -------------
Required if you are updating from CDH 5.0 to 5.1 or later.

9.1 Go to the Spark service.
9.2 Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
9.3 Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.

--- -------------------
10. Restart the Cluster
--- -------------------
  - CM > Cluster Name > Start

--- ---------------------------------
11. Deploy Client Configuration Files
--- ---------------------------------
   - On the Home page, click  to the right of the cluster name and select Deploy Client Configuration.
   - Click the Deploy Client Configuration button in the confirmation pop-up that appears.
  
--- ----------------------------------
12. Finalize the HDFS Metadata Upgrade
--- ----------------------------------
After ensuring that the CDH 5 upgrade has succeeded and that everything is running smoothly, finalize the HDFS metadata upgrade. It is not unusual to wait days or even weeks before finalizing the upgrade.
   - Go to the HDFS service.
   - Click the Instances tab.
   - Click the NameNode instance.
   - Select Actions > Finalize Metadata Upgrade and click Finalize Metadata Upgrade to confirm.

--- -------------
13. Common issues
--- -------------
13.1 If the HDFS NameNode fails to start with the below error in the log:
"File system image contains an old layout version -55."
   - CM > HDFS > Action > Stop
- CM > HDFS > Action > Upgrade HDFS Metadata
13.2 Impala showing "This Catalog Server is not connected to its StateStore":
- CM > Hue > Action > Stop
- CM > Impala > Action > Stop
- CM > Hive > Action > Stop
- CM > Hive > Action > Upgrade Hive Metastore Database Schema
- CM > Hive > Action > Update Hive Metastore NameNodes
- CM > Hive > Action > Start
- CM > Impala > Action > Start
- CM > Hue > Action > Start
13.3 Hue not showing the databases list:
- Use Internet Explorer; it seems that it doesn't work on Chrome.

--- --------------------------------------
14. Test the upgraded CDH working properly
--- --------------------------------------
    We can follow the steps we did in "Step 4 & 5" of install_CM_CDH_5.0.2.txt.
Note: the example jar file is now /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.2.0.jar
14.1 Check from Cloudera Manager; there should not be any alarms or warnings.
14.2 Run the Host Inspector and check for any alarms or warnings. If possible, fix them.
14.3 Check that existing data in Impala is accessible.
  - Also check an analytic query like the below (for Impala 2.0 and above):
SELECT qtr_hour_id, date_id,
 count(*) OVER (PARTITION BY date_id, qtr_hour_id) AS how_many_qtr
FROM f_ntw_actvty_http;
14.4 Check that importing data from Netezza works properly.
14.5 Check that exporting data to Netezza works properly.
14.6 Run example YARN jobs and check that they are successful.
14.7 Run terasort and check the output (a sample run is sketched after this list).
14.8 Check that Mahout works properly and can use data stored on HDFS.
14.9 Check that R Hadoop works properly and can use data stored on HDFS.
14.10 Wait for 1 or 2 days and monitor that the daily jobs work fine.
14.11 Change all the env variables to point to the latest CDH.
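A sample terasort run for 14.7 (a sketch using the examples jar path noted above; the row count and HDFS paths are arbitrary):
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.2.0.jar teragen 1000000 /tmp/teragen_out
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.2.0.jar terasort /tmp/teragen_out /tmp/terasort_out
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.2.0.jar teravalidate /tmp/terasort_out /tmp/teravalidate_out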


--- ---------------------
15. Rollback to CDH 5.0.2
--- ---------------------
Below is the feedback from Cloudera Support regarding rollback.
We can prepare manual rollback steps while doing a sample upgrade on a test environment.

There is no true "rollback" for CDH.  While it is true that you can deactivate the new parcel and reactivate the old, or remove the new packages and re-install the old, an upgrade does not only constitute a change in the binaries and libraries for CDH.  Some components store metadata in databases, and the upgrade process will usually modify the database schema in these databases -- for example, the Hive Metastore, the Hue database, the Sqoop2 metastore, and the Oozie database.  If an upgrade of CM is involved, the CM database is also upgraded.

As such, there is no "rollback" option.  What Cloudera recommends is that all databases be backed up prior to the upgrade taking place (you will note this warning in various places in the upgrade documentation).  If necessary, a point-in-time restore can be performed, but there is no automated way to do this -- it is a highly manual process.

This is why we recommend thoroughly testing the upgrade process in an environment closely matching your production system.  Then, during the actual production upgrade, take backups of metadata stores as noted in the upgrade documentation, and if an issue does occur during the upgrade, the backups can be used to roll-back and then retry the failed upgrade steps for that particular component.


Patching Cloudera Manager from 5.0.2 to 5.2.0 (Cloudera Hadoop Patching/Version upgrade part 1 of 2)

Let's say we have a Hadoop cluster with the below nodes running CDH 5.0.2. We want to upgrade the cluster to CDH 5.2.0.

192.168.56.201 dvhdmgt1.example.com dvhdmgt1  # Management node hosting Cloudera Manager
192.168.56.202 dvhdnn1.example.com  dvhdnn1 # Name node
192.168.56.203 dvhdjt1.example.com  dvhdjt1 # Jobtracker/Resource Manager
192.168.56.101 dvhddn01.example.com dvhddn01 # Datanode1
192.168.56.102 dvhddn02.example.com dvhddn02 # Datanode2
192.168.56.103 dvhddn03.example.com dvhddn03 # Datanode3

This is a two step process:
Part 1: Upgrade Cloudera Manager




Installing Cloudera Manager & Cloudera Hadoop (Install Cloudera Hadoop part 2 of 2)

Here I am sharing the activity cookbook I followed to install a development environment for a Cloudera Hadoop cluster using VirtualBox. Using almost the same method you can easily install Cloudera Hadoop in production.

Here is part 2 of 2: Cloudera Manager & Cloudera Hadoop installation and test.
This also includes HA configuration, gateway configuration and R Hadoop installation.

For the hosts/nodes configuration steps, please check part 1.


         
-- ------------------------
-- 1. Pre Requisites Checks
-- ------------------------
1.1 OS : RHEL 6.4 or CentOS 6.5
1.2 MySql : 5.x or later
1.3 Python : 2.4
1.4 RAM : 2 GB
1.5 Disk : - 5 GB on the partition hosting /var.
 - 500 MB on the partition hosting /usr
1.6 Network : - ssh access to all the nodes/hosts
         - Name resolution either by /etc/hosts or by DNS
     - The /etc/hosts file must not have duplicate IP addresses
1.7 Security: - root access, as the CM agent runs as root
 - No blocking by Security-Enhanced Linux (SELinux)
     - Disable IPv6 on all hosts
     - Make sure the required ports are open (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Installation-Guide/cm5ig_config_ports.html#concept_k5z_vwy_4j)
 - For RHEL, /etc/sysconfig/network should contain the hostname of the corresponding system
     - Requires root/sudo access

1.8 On RHEL and CentOS 5, install Python 2.6 or 2.7:
1.8.1 In order to install packages from the EPEL repository, first download the appropriate repository rpm package to your machine and then install Python using yum.
# su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'
#  yum install python26

-- --------------------------------
-- 2. Install Cloudera Manager (CM)
-- --------------------------------

2.1 Establish Your Cloudera Manager Repository Strategy.
- We will have an internet connection on the CM node
- Other nodes will use the CM node as a proxy
- We have already configured this in the prepare_hadoop_hosts steps.
 
2.2 Install the Oracle JDK on CM node:
2.2.1 The JDK is included in the Cloudera Manager 5 repositories. Once you have the repo or list file in the correct place, you can install the JDK as follows:
# yum install oracle-j2sdk1.7

2.2.2 Better to also install "rpm -Uvh jdk-7u51-linux-x64.rpm"; otherwise it sometimes creates problems with 3rd party apps.

2.3 Install the Cloudera Manager Server packages (** make sure to install 5.2; we made a change at step 2.1.3 for this):
# yum install cloudera-manager-daemons cloudera-manager-server
 
2.4 Prepare the external database (we will use MySQL):
ref: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Installation-Guide/cm5ig_mysql.html#cmig_topic_5_5_1_unique_1

2.4.1 Install MySQL 5.5.37. We are installing the below rpms:

- First install rpm -Uvh mysql-community-release-el6-5.noarch.rpm:
# wget http://dev.mysql.com/get/mysql-community-release-el6-5.noarch.rpm
# rpm -Uvh mysql-community-release-el6-5.noarch.rpm

- Then use yum to install Mysql:
# yum install mysql-community-server


- So, at the end of the installation, the below MySQL packages will be there:
mysql-community-release-el6-5.noarch
mysql-community-client-5.5.37-4.el6.x86_64
mysql-community-common-5.5.37-4.el6.x86_64
mysql-community-libs-5.5.37-4.el6.x86_64   
mysql-community-server-5.5.37-4.el6.x86_64

2.4.2 Configuring and starting the MySQL server:
 a. Stop the MySQL server if it is running
  # service mysqld stop
 b. Move the old InnoDB log files /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1 out of /var/lib/mysql/ to a backup location
 c. Determine the location of the option file, my.cnf (normally /etc/my.cnf)
 d. Update my.cnf so that it conforms to the following requirements:
- To prevent deadlocks, Cloudera Manager requires the isolation level to be set to read committed.
- Configure the InnoDB engine. Cloudera Manager will not start if its tables are configured with the MyISAM engine. It can be checked using the below:
mysql> show table status;
- Cloudera recommends that you set the innodb_flush_method property to O_DIRECT
- Set the max_connections property according to the size of your cluster. Clusters with fewer than 50 hosts can be considered small clusters.
- Ours is a small cluster; we will put all databases on the same host where CM is installed.
- Allow a maximum of 100 connections for each database and then add 50 extra connections. So, for 2 DBs it would be 2x100+50=250.
- For our case it is a very small installation with 6-7 DBs; we are setting it to 550, which should be good enough.
- So, typically, our MySQL config (my.cnf) will be as below:

Note: Need to create the bin log location and change ownership to mysql user as below:
mkdir -p /opt/mysql/binlog/
chown -R mysql:mysql /opt/mysql/binlog/

------------start my.cnf---------------
[mysqld]
transaction-isolation=READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
# symbolic-links=0

key_buffer              = 8M
key_buffer_size         = 16M
max_allowed_packet      = 16M
thread_stack            = 64K
thread_cache_size       = 32
query_cache_limit       = 8M
query_cache_size        = 16M
query_cache_type        = 1

max_connections         = 550

# log_bin should be on a disk with enough free space
# NOTE: replace '/x/home/mysql/logs/binary' below with
#       an appropriate path for your system.
log_bin=/opt/mysql/binlog/mysql_binary_log

# For MySQL version 5.1.8 or later. Comment out binlog_format for older versions.
binlog_format           = mixed

read_buffer_size = 2M
read_rnd_buffer_size = 8M
sort_buffer_size = 8M
join_buffer_size = 8M

# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit  = 2
innodb_log_buffer_size          = 16M
innodb_buffer_pool_size         = 120M
innodb_thread_concurrency       = 8
innodb_flush_method             = O_DIRECT
innodb_log_file_size = 512M

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
------------end my.cnf---------------


2.4.3 Ensure the MySQL server starts at boot
# /sbin/chkconfig mysqld on
# /sbin/chkconfig --list mysqld
mysqld          0:off   1:off   2:on    3:on    4:on    5:on    6:off

2.4.4 Start the MySQL server:
# service mysqld start

2.4.5 Set the MySQL root password:
# /usr/bin/mysql_secure_installation
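At this point, a quick sanity check (sketch) that the settings required in step 2.4.2 took effect:
# mysql -u root -p -e "SELECT @@global.tx_isolation, @@global.innodb_flush_method, @@global.max_connections;"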

2.4.6 Installing the MySQL JDBC Connector:
Note: Do not use the yum install command to install the MySQL connector package, because it installs the openJDK, and then uses Linux alternatives command to set the system JDK to be the openJDK.

- Install the JDBC connector on the Cloudera Manager Server host, as well as hosts to which you assign the Activity Monitor, Reports Manager, Hive Metastore, Sentry Server, and Cloudera Navigator Audit Server roles. In our case, all are on the same host.

- It is better to use this process to avoid errors like "MySQLSyntaxErrorException" in Impala/Hive. If you installed using yum, do this step again.

- download it from http://dev.mysql.com/downloads/connector/j/
# wget "http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.34.tar.gz"
- Extract the JDBC driver JAR file from the downloaded file; for example:
# tar -xvf mysql-connector-java-5.1.34.tar.gz
- Add the JDBC driver, renamed, to the relevant server; for example:
# mkdir /usr/share/java/
# cp mysql-connector-java-5.1.34/mysql-connector-java-5.1.34-bin.jar /usr/share/java/
# ln -s /usr/share/java/mysql-connector-java-5.1.34-bin.jar /usr/share/java/mysql-connector-java.jar
(the symlink should now show: /usr/share/java/mysql-connector-java.jar -> /usr/share/java/mysql-connector-java-5.1.34-bin.jar)

2.4.7 Create databases & users for Activity Monitor, Reports Manager, Hive Metastore, Sentry Server, and Cloudera Navigator Audit Server. The databases must be configured to support UTF-8 character set encoding.

For Activity Monitor:
mysql> create database amon DEFAULT CHARACTER SET utf8;
mysql> grant all on amon.* TO 'amon'@'localhost' IDENTIFIED BY 'password'; 
mysql> grant all on amon.* TO 'amon'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server

For Reports Manager:
mysql> create database rman DEFAULT CHARACTER SET utf8;
mysql> grant all on rman.* TO 'rman'@'localhost' IDENTIFIED BY 'password';
mysql> grant all on rman.* TO 'rman'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';

     -- here using local host, as all the services running in the same server

For Hive Metastore Server:
mysql> create database metastore DEFAULT CHARACTER SET utf8;
mysql> grant all on metastore.* TO 'hive'@'localhost' IDENTIFIED BY 'password';
mysql> grant all on metastore.* TO 'hive'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server

For Sentry Server:
mysql> create database sentry DEFAULT CHARACTER SET utf8;
mysql> grant all on sentry.* TO 'sentry'@'localhost' IDENTIFIED BY 'password';
mysql> grant all on sentry.* TO 'sentry'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server

For Cloudera Navigator Audit Server:
mysql> create database nav DEFAULT CHARACTER SET utf8;
mysql> grant all on nav.* TO 'nav'@'localhost' IDENTIFIED BY 'password'; 
mysql> grant all on nav.* TO 'nav'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server  
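A quick verification sketch that the databases and grants are in place:
mysql> show databases;
mysql> show grants for 'hive'@'localhost';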
2.4.8 Back up all the DBs:
# mysqldump -u root -p --all-databases > alldb_backup.sql

2.4.9 Run the scm_prepare_database.sh script (for installer or package installs)
on the host where the Cloudera Manager Server package is installed. The script prepares the database by:
- Creating the Cloudera Manager Server database configuration file.
    - Creating a database for the Cloudera Manager Server to use. This is optional and is only completed if options are specified.
- Setting up a user account for the Cloudera Manager Server. This is optional and is only completed if options are specified.
mysql > grant all on *.* to 'temp'@'%' identified by 'temp' with grant option;
# /usr/share/cmf/schema/scm_prepare_database.sh mysql -h localhost -utemp -ptemp --scm-host localhost scm scm scm
 -- The log4j errors appear but don't seem to be harmful.

mysql> drop user 'temp'@'%';
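If the script succeeds, it writes the Cloudera Manager database settings to /etc/cloudera-scm-server/db.properties; a quick look (sketch) confirms it now points to the MySQL "scm" database:
# cat /etc/cloudera-scm-server/db.properties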

2.4.10 Remove the embedded PostgreSQL properties file. For installer or package installs, do the below:
# rm /etc/cloudera-scm-server/db.mgmt.properties

2.4.11 ** We must create the databases before running the Cloudera Manager installation wizard if we choose the external database option.


2.4.12 ** External databases for Hue and Oozie:
      - Hue and Oozie are automatically configured with databases, but you can configure these services to use external databases after Cloudera Manager is installed.
  http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Managing-Clusters/cm5mc_hue_service.html#cmig_topic_15_unique_1
  http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Managing-Clusters/cm5mc_oozie_service.html#cmig_topic_14_unique_1

2.5 Start Cloudera Manager and search for target hosts:
2.5.1 Run this command on the Cloudera Manager Server host to start Cloudera Manager:
# service cloudera-scm-server start

2.5.2 Wait several minutes for the Cloudera Manager Server to complete its startup and monitor log as below:
# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log #wait until you see "Started Jetty server."

2.5.3 In a web browser, enter http://<CM server host>:7180

2.5.4 Log into Cloudera Manager Admin Console. The default credentials are: Username: admin Password: admin


2.6 Choose Cloudera Manager Edition and Hosts
2.6.1 When you start the Cloudera Manager Admin Console, the install wizard starts up. Click Continue to get started
2.6.2 Choose which edition to install.
- For our case we will install "Cloudera Enterprise Data Hub Edition Trial, which does not require a license, but expires after 60 days and cannot be renewed"
- "Continue"
2.6.3 The cluster configuration page appears.
- (optional) Click on the "Cloudera Manager" logo to skip the default installation
- Go to "Administration > Settings > Parcels"
- Add the desired "Remote Parcel Repository URLs". For us, we are going to install CDH 5.0.2. We will add the below:
http://archive.cloudera.com/cdh5/parcels/5.0.2/
- "Save Changes"


2.6.4 Search for and choose hosts as below:
- Cloudera Manager Home > hosts > Add new hosts to cluster. Add hosts option will appear.
- To enable Cloudera Manager to automatically discover hosts on which to install CDH and managed services, enter the cluster hostnames or IP addresses. You can also specify hostname and IP address ranges:
a. IP range "10.1.1.[1-4]" or hostname "host[1-3].company.com"; for our case: 192.168.56.[101-103],192.168.56.[201-203]
b. The scan results will include all addresses scanned, but only scans that reach hosts running SSH will be selected for inclusion in your cluster by default. 
c. Click Search. Cloudera Manager identifies the hosts on your cluster to allow you to configure them for services.
d. Verify that the number of hosts shown matches the number of hosts where you want to install services. 
e. Click Continue. The Select Repository page displays.
2.6.5 To avoid getting stuck at "Acquiring installation lock..." during installation, do the below on all the nodes:
# rm /tmp/.scm_prepare_node.lock
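For example, a one-liner sketch to clear it on all the other nodes from the CM node, assuming passwordless ssh as configured in part 1:
# for h in dvhdnn1 dvhdjt1 dvhddn01 dvhddn02 dvhddn03; do ssh $h "rm -f /tmp/.scm_prepare_node.lock"; done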

3. Install Cloudera Manager Agent, CDH, and Managed Service Software

mysql bug: http://bugs.mysql.com/bug.php?id=63085

3.1 Select how CDH and managed service software is installed: packages or parcels. We will use parcels.
3.2 Choose the parcels to install. The choices you see depend on the repositories you have chosen – a repository may contain multiple parcels. Only the parcels for the latest supported service versions are configured by default.
3.3 Choose "CDH-5.0.2-1.cdh5.0.2.p0.13" and keep the rest as default.

3.5 Install the Cloudera Manager Agent
3.5.1 Select the release of Cloudera Manager Agent to install.
3.5.2 Click Continue.
3.5.3 Leave Install Oracle Java SE Development Kit (JDK) checked to allow Cloudera Manager to install the JDK on each cluster host or uncheck if you plan to install it yourself. Click Continue.
3.5.4 Provide SSH login credentials.
3.5.5 Click Continue. If you did not install packages manually, Cloudera Manager installs the Oracle JDK, Cloudera Manager Agent packages, and CDH and managed service packages or parcels.
3.5.6 When the Continue button appears at the bottom of the screen, the installation process is completed. Click Continue.
3.5.7 The Host Inspector runs to validate the installation, and provides a summary of what it finds, including all the versions of the installed components. 
     If the validation is successful, click Finish. The Cluster Setup page displays.

3.6 Add Services
3.6.1 In the first page of the Add Services wizard you choose the combination of services to install and whether to install Cloudera Navigator. Click the radio button next to the combination of services to install.
Some services depend on other services; for example, HBase requires HDFS and ZooKeeper. Cloudera Manager tracks dependencies and installs the correct combination of services.
3.6.2 The Flume service can be added only after your cluster has been set up.
3.6.3 If you have chosen Data Hub Edition Trial or Cloudera Enterprise, optionally check the Include Cloudera Navigator checkbox to enable Cloudera Navigator.
3.6.4 Click Continue. The Customize Role Assignments page displays.
3.6.5 Customize the assignment of role instances to hosts (datanodes, namenodes, resource manager, etc.). Hosts can be chosen similarly to step 2.6.4.a.
3.6.6 When you are satisfied with the assignments, click Continue. The Database Setup page displays.
3.6.7 Enter the database host, database type, database name, username, and password for the database that you created when you set up the database.
3.6.8 Click Test Connection to confirm that Cloudera Manager can communicate with the database using the information you have supplied. If the test succeeds in all cases, click Continue; 
3.6.9 Review the configuration changes to be applied. 
   - Confirm the settings entered for file system paths for HDFS and others. 
- Make sure to add 3 nodes for Zookeeper.
- Do not make the namenode the HBase master.
The file paths required vary based on the services to be installed. 
Click Continue. The wizard starts the services.
3.6.10 When all of the services are started, click Continue. You will see a success message indicating that your cluster has been successfully started.
3.6.11 There will be some configuration alarms, as we have installed with low resources. Fix them as much as possible;
something like the below:
a. If needed, delete services in the below order:
Oozie 
impala
Hive 
HBase
SPARK
sqoop2 
YARN 
HDFS
zookeeper
 - Adding the services back is done in the reverse order.
b. While reinstalling HDFS, make sure the name directories of the NameNode are empty (default /dfs/nn; on the secondary namenode /dfs/snn; on datanodes /data/0[1-3..]/).
For our case:
   ssh dvhdnn1  "rm -rf /dfs/nn/*"
ssh dvhdjt1  "rm -rf /dfs/snn/*"
ssh dvhddn01  "rm -rf /data/01/*"
ssh dvhddn01  "rm -rf /data/02/*"
ssh dvhddn01  "rm -rf /data/03/*"
ssh dvhddn02  "rm -rf /data/01/*"
ssh dvhddn02  "rm -rf /data/02/*"
ssh dvhddn02  "rm -rf /data/03/*"
ssh dvhddn03  "rm -rf /data/01/*"
ssh dvhddn03  "rm -rf /data/02/*"
ssh dvhddn03  "rm -rf /data/03/*"

c. While re-installing hbase (to avoid "TableExistsException: hbase:namespace")
- Stop existing hbase service
- Do the below from one of the servers running an HBase service (for our case we can use the CM node, dvhdmgt1):
# hbase zkcli
[zk: localhost:2181(CONNECTED) 0] rmr /hbase   # << this command to remove existing znode
- delete the existing hbase service
- try adding hbase again
d. Eliminate "Failed to access Hive warehouse: /user/hive/warehouse" on hue or beeswax:
- # su - hdfs
- # hadoop fs -mkdir /user/hive
- # hadoop fs -mkdir /user/hive/warehouse
- # hadoop fs -chown -R hive:hive /user/hive
- # hadoop fs -chmod -R 1775 /user/hive/
- restart hue service
-- -----------------------
-- 4 Test the Installation
-- -----------------------
4.1 Log in to the CM web console.
4.2 All the services should be running with Good Health on CM console.
4.3 Click the Hosts tab where you can see a list of all the Hosts along with the value of their Last Heartbeat. By default, every Agent must heartbeat successfully every 15 seconds. 
4.4 Running a MapReduce Job
4.4.1 Log into a host in the cluster.
4.4.2 Run mapreduce jobs as below; they should run successfully.
a. Run pi example:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
b. Wordcount example:
# su - hdfs
$ echo 'Hello World, Bye World!' > /tmp/file01
$ echo 'Hello Hadoop, Goodbye to hadoop.' > /tmp/file02
$ hadoop fs -mkdir /tmp/input/
$ hadoop fs -put /tmp/file01 /tmp/input/file01
$ hadoop fs -put /tmp/file02 /tmp/input/file02
$ hadoop fs -cat /tmp/input/file01
Hello World, Bye World!

$ hadoop fs -cat /tmp/input/file02
Hello Hadoop, Goodbye to hadoop.

$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar wordcount /tmp/input/ /tmp/output/
 
$ hadoop fs -ls /tmp/output/
Found 2 items
-rw-r--r--   3 hdfs supergroup          0 2014-11-08 06:31 /tmp/output/_SUCCESS
-rw-r--r--   3 hdfs supergroup         32 2014-11-08 06:31 /tmp/output/part-r-00000

$ hadoop fs -cat /tmp/output/part-r-00000
Bye     1
Goodbye 1
Hadoop, 1
Hello   2
World!  1
World,  1
hadoop. 1
to      1
4.4.3 Monitor the above mapreduce jobs under "Clusters > ClusterName > YARN Applications"
4.4.4 Testing Impala
- create the datafile locally
$ cat /tmp/tab1.csv
1,true,123.123,2012-10-24 08:55:00 
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33

- copy the file to hdfs
# su - hdfs
$ hadoop fs -mkdir /tmp/tab1/
$ hadoop fs -put /tmp/tab1.csv /tmp/tab1/tab1.csv

- Log in to the impala shell:
# impala-shell -i dvhddn01

- Create a text based table
CREATE EXTERNAL TABLE TMP_TAB1
(
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE,
  col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1';

select * from TMP_TAB1;

- create a PARQUET table
create table TAB1 (
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE,
  col_3 TIMESTAMP
) STORED AS PARQUET;

insert into TAB1 select * from TMP_TAB1;

select * from TAB1;

4.4.5 Test with Hue
- Log into the Hue web console http://dvhdmgt1:8888
- Access the tables created in step 4.4.4 using Hive
- Access the tables created in step 4.4.4 using Impala
-- ------------------
-- 5 Install R Hadoop
-- ------------------
http://ashokharnal.wordpress.com/2014/01/16/installing-r-rhadoop-and-rstudio-over-cloudera-hadoop-ecosystem-revised/
https://github.com/RevolutionAnalytics/RHadoop/wiki
*** should have internet access
5.1 On the same node where CM is installed, install R & R-devel:
# wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# rpm -Uvh epel-release-6-8.noarch.rpm
# yum clean all
# yum install R R-devel
5.2 Use the R shell to install packages as below:
# R
> install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2","caTools"))

5.3 Download rhdfs and rmr2 packages to your local Download folder from 'https://github.com/RevolutionAnalytics/RHadoop/wiki"
cd /tmp
wget "https://raw.githubusercontent.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz" \
or curl -O https://raw.githubusercontent.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz

wget "https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/master/build/rmr2_3.2.0.tar.gz"
or curl -O https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/master/build/rmr2_3.2.0.tar.gz

5.4 Make sure the below env variables are set in the .bash_profile of root or the sudo user:

export PATH

export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar
export JAVA_HOME=/usr/java/jdk1.7.0_51

export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/opt/cloudera/parcels/CDH/lib64:/usr/java/jdk1.7.0_45-cloudera/jre/lib/amd64/server



5.5 Install the downloaded rhdfs & rmr2 packages in the R shell by specifying their location on your machine:
# R
> install.packages("/tmp/rhdfs_1.0.8.tar.gz", repos = NULL, type="source")
> install.packages("/tmp/rmr2_3.2.0.tar.gz", repos = NULL, type="source")
5.6 Now RHadoop is installed

5.7 Test R Hadoop (if using a new non-root/non-sudo user, make sure to set the hadoop & rhadoop related env variables)
5.8 Test the installation as below:

Note: if you find an error like "Unable to find JAAS classes", make sure you installed the native JDK (step 2.2.2) and set the env variables (step 5.4).

# su - hdfs
$R
> library(rmr2)
> library(rJava)
> library(rhdfs)
> hdfs.init() 
> hdfs.ls("/") 
> ints = to.dfs(1:100)
> calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
> from.dfs(calc)


[You will get a long series of output something as below]

$key
NULL

$val
v
 [1,]   1   2
 [2,]   2   4
 [3,]   3   6
 [4,]   4   8
 [5,]   5  10
 [6,]   6  12
.............
.............
 [98,]  98 196
 [99,]  99 198
[100,] 100 200


-- --------------
 6. Install Mahout
-- --------------
   6.1 Install Mahout on the CM node; here we are using yum:
# yum install mahout

   6.2 Mahout will be accessible using the below:
# /usr/bin/mahout

   6.3 The above command should not give any errors.


-- ----------------------------
7. Configure High Availability
-- ----------------------------

http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-High-Availability-Guide/CDH5-High-Availability-Guide.html

6.1 Configuring HDFS High Availability (using wizard)
6.1.1 CM Home > HDFS > instances > Click "Enable High Availability" button
6.1.2 select 3 JournalNodes(dvhdmgt1, dvhdnn1 & dvhdjt1) and a standby node(dvhdjt1). Continue
6.1.3 Select your nameservice name. I kept the default "nameservice1". Continue
6.1.4 Review the changes and provide a value for dfs.journalnode.edits.dir (I provided /dfs/jn)
 Keep the rest default (most of them are about cleaning & re-initializing existing services). Continue
6.1.5 It will fail while formatting the "Name directories of the current NameNode"; this failure is expected. Just ignore it.
6.1.6 The following manual steps must be performed after completing this wizard:
- CM Home > Hive > Action > "Stop"
- (optionally) Backup Hive metastore
- CM Home > Hive > Action > "Update Hive Metastore NameNodes"
- CM Home > Hive > Action > "Restart"
- CM Home > impala > Action > "Restart"
- CM Home > Hue > Action > "Restart"
- Test on previous MR, hive & impala data
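After the restarts, the active/standby state of the two NameNodes can be verified; a sketch (the NameNode service IDs are assigned by CM and can be listed from the client configuration):
# su - hdfs
$ hdfs getconf -confKey dfs.ha.namenodes.nameservice1    # lists the two NameNode IDs
$ hdfs haadmin -getServiceState <one-of-the-ids-from-above>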

6.2 Configuring High Availability for ResourceManager (MRv2/YARN)
    6.2.1 Stop all YARN daemons
- CM > YARN > Action > Stop

6.2.2 Update the configuration on yarn-site.xml (use CM)
-- dvhdjt1, we will name it resource manager 1(rm1)
-- dvhdnn1, we will name it resource manager 2(rm2)
-- Append below in /etc/hadoop/conf/yarn-site.xml on dvhdjt1 and copy it to all nodes (except CM node)



 
 <property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>2000</value>
 </property>
 <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
 </property>
 <property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
 </property>
 <property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
 </property>
 <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarnRM</value>
 </property>
 <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
 </property>
 <property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
 </property>
 <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
 </property>
 <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
 </property>
 <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
 </property>
 <property>
    <name>yarn.resourcemanager.zk.state-store.address</name>
    <value>localhost:2181</value>
 </property>
 <property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
    <value>5000</value>
 </property>

 <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>dvhdjt1:23140</value>
 </property>
 <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>dvhdjt1:23130</value>
 </property>
 <property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>dvhdjt1:23189</value>
 </property>
 <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>dvhdjt1:23188</value>
 </property>
 <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>dvhdjt1:23125</value>
 </property>
 <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>dvhdjt1:23141</value>
 </property>

 <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>dvhdnn1:23140</value>
 </property>
 <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>dvhdnn1:23130</value>
 </property>
 <property>
    <name>yarn.resourcemanager.webapp.https.address.rm2</name>
    <value>dvhdnn1:23189</value>
 </property>
 <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>dvhdnn1:23188</value>
 </property>
 <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>dvhdnn1:23125</value>
 </property>
 <property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>dvhdnn1:23141</value>
 </property>

 <property>
    <description>Address where the localizer IPC is.</description>
    <name>yarn.nodemanager.localizer.address</name>
    <value>0.0.0.0:23344</value>
 </property>
 <property>
    <description>NM Webapp address.</description>
    <name>yarn.nodemanager.webapp.address</name>
    <value>0.0.0.0:23999</value>
 </property>
 <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
 <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/tmp/pseudo-dist/yarn/local</value>
 </property>
 <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/tmp/pseudo-dist/yarn/log</value>
 </property>
 <property>
    <name>mapreduce.shuffle.port</name>
    <value>23080</value>
 </property>


6.2.3 Re-start the YARN daemons
- CM > YARN > Instances > Add > for Resourcemanager select "dvhdnn1" > Continue
- CM > YARN > Instances > Select all > Action for Selected > "Restart"

6.2.4 Using yarn rmadmin to Administer ResourceManager HA
- yarn rmadmin has the following options related to RM HA:
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help [cmd]]
where <serviceId> is the rm-id (for our case rm1 & rm2)
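For example, to check which ResourceManager is currently active (sketch):
# yarn rmadmin -getServiceState rm1
# yarn rmadmin -getServiceState rm2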


-- ----------------
7. Install gateways
-- ----------------

    7.1. Impala proxy: 

Ref: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_proxy.html

7.1.1 Install haproxy
# yum install haproxy

7.1.2 Set up the configuration file: /etc/haproxy/haproxy.cfg as below
--------------------------Start Config File-------------------------------
global
    # To have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local0
    log         127.0.0.1 local1 notice
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    maxconn                 3000
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms



#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats :25002
    balance
    mode http
    stats enable
    stats auth username:password

# This is the setup for Impala. Impala client connect to load_balancer_host:25003.
# HAProxy will balance connections among the list of servers listed below.
# The list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
# For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
listen impala :25004
    mode tcp
    option tcplog
    balance leastconn

    server impald1 dvhddn01.example.com:21000
    server impald2 dvhddn02.example.com:21000
    server impald3 dvhddn03.example.com:21000


# For impala JDBC or ODBC
listen impala :25003
    mode tcp
    option tcplog
    balance leastconn

    server impald1 dvhddn01.example.com:21050
    server impald2 dvhddn02.example.com:21050
    server impald3 dvhddn03.example.com:21050


--------------------------End Config File-------------------------------
     ** we have configured 25003 for JDBC/ODBC Impala connections
** we have configured 25004 for impalad/impala-shell connections
 
7.1.3 Run the load balancer (preferably on hosts not running impalad; in our case dvhdmgt1, dvhdnn1 & dvhdjt1):

# service haproxy start

- ignore below warning:
Starting haproxy: [WARNING] 322/090925 (15196) : config : 'option forwardfor' ignored for proxy 'impala' as it requires HTTP mode.
or 
Starting haproxy: [WARNING] 329/052137 (32754) : Parsing [/etc/haproxy/haproxy.cfg:73]: proxy 'impala' has same name as another proxy (declared at /etc/haproxy/haproxy.cfg:62).
[WARNING] 329/052137 (32754) : config : 'option forwardfor' ignored for proxy 'impala' as it requires HTTP mode.
[WARNING] 329/052137 (32754) : config : 'option forwardfor' ignored for proxy 'impala' as it requires HTTP mode.


7.1.4 Connect to impala from any of the haproxy nodes as below:
# impala-shell -i dvhdmgt1:25004
> use axdb;
> select count(*) from f_ntw_actvty_http;

7.1.5 Enable haproxy to start on boot:
# chkconfig haproxy on


    7.2 HttpFS gateway: 
   The HttpFS gateway is normally installed with the Cloudera Hadoop parcel installation (step 3.6, Add Services).
7.2.1 Just check CM > HDFS > Instances > check whether any HttpFS role instances are there or not.
7.2.2 If not, then CM > HDFS > Instances > Add Role Instances > HttpFS > Add your hosts > Follow the next instructions
7.2.3 After the installation is completed, check with the below (from the IP of each of the nodes where HttpFS is installed):
curl "http://192.168.56.201:14000/webhdfs/v1?op=gethomedirectory&user.name=hdfs"
curl 'http://192.168.56.202:14000/webhdfs/v1/?user.name=hdfs&op=open'
curl 'http://192.168.56.203:14000/webhdfs/v1/tmp/tab1/tab1.csv?user.name=hdfs&op=open'



8. Install RImpala
   ref: http://blog.cloudera.com/blog/2013/12/how-to-do-statistical-analysis-with-impala-and-r/
   - On a node connected to the cluster and preferably not running an Impala daemon (in our case the CM/mgt node):
8.1 mkdir -p /usr/lib/impala/lib
8.2 cd /usr/lib/impala/lib
8.3 wget "https://downloads.cloudera.com/impala-jdbc/impala-jdbc-0.5-2.zip"
8.4 unzip impala-jdbc-0.5-2.zip
it will extract to ./impala-jdbc-0.5-2; take a note of the full path "/usr/lib/impala/lib/impala-jdbc-0.5-2"
8.5 # R
> install.packages("RImpala")
 - select the mirror
 - A successful installation will show logs like below:
 * DONE (RImpala)
Making 'packages.html' ... done

The downloaded source packages are in
‘/tmp/RtmpIayO6J/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
8.6 Initialize impala for R
> library(RImpala)
Loading required package: rJava
> rimpala.init(libs="/usr/lib/impala/lib/impala-jdbc-0.5-2/")  # Path from step 8.4
[1] "Classpath added successfully"
> rimpala.connect("dvhdmgt1", "25003") # here we are using the Impala gateway host & port for Impala JDBC (step 7)
[1] TRUE
> rimpala.invalidate()
[1] TRUE
> rimpala.showdatabases()
 name
1 _impala_builtins
2          default
> rimpala.usedatabase(db="default")
> rimpala.showtables()
  name
1 sample_07
2      tab1
3  tmp_tab1
> rimpala.describe("tab1")
   name      type comment
1    id       int
2 col_1   boolean
3 col_2    double
4 col_3 timestamp
> data = rimpala.query("Select * from tab1")
> data
id col_1      col_2                     col_3
1  1  true    123.123     2012-10-24 08:55:00.0
2  2 false   1243.500     2012-10-25 13:40:00.0
3  3 false  24453.325   2008-08-22 09:33:21.123
4  4 false 243423.325 2007-05-12 22:32:21.33454
> rimpala.close()
[1] TRUE
>



Ref: 
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-0-0/Cloudera-Manager-Installation-Guide/Cloudera-Manager-Installation-Guide.html