Monday, January 19, 2015

Installing Cloudera Manager & Cloudera Hadoop (Install Cloudera Hadoop part 2 of 2)

Here I am sharing the cookbook I followed to install a development Cloudera Hadoop cluster using VirtualBox. Using almost the same method you can easily install Cloudera Hadoop in production.

Here is part 2 of 2: Cloudera Manager & Cloudera Hadoop installation and test.
This also includes HA configuration, gateway configuration, and RHadoop installation.

For host/node configuration steps, please check part 1.


         
-- ------------------------
-- 1. Prerequisite Checks
-- ------------------------
1.1 OS : RHEL 6.4 or CentOS 6.5
1.2 MySql : 5.x or later
1.3 Python : 2.4
1.4 RAM : 2 GB
1.5 Disk : - 5 GB on the partition hosting /var
           - 500 MB on the partition hosting /usr
1.6 Network : - ssh access to all the nodes/hosts
              - Name resolution either via /etc/hosts or DNS
              - The /etc/hosts file must not have duplicate IP addresses
1.7 Security: - root access, as the CM agent runs as root
              - No blocking by Security-Enhanced Linux (SELinux)
              - Disable IPv6 on all hosts
              - Make sure the required ports are open (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Installation-Guide/cm5ig_config_ports.html#concept_k5z_vwy_4j)
              - For RHEL, /etc/sysconfig/network should contain the hostname of the corresponding system
              - Requires root/sudo access

1.8 On RHEL and CentOS 5, install Python 2.6 or 2.7:
1.8.1 To install packages from the EPEL repository, first download the appropriate repository rpm package to your machine and then install Python using yum.
# su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'
#  yum install python26
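The checks in 1.4-1.7 can be scripted. This is a minimal sketch assuming standard Linux tools (awk, df; getenforce may be absent on non-SELinux hosts) and the thresholds listed above; adjust paths and limits for your environment.

```shell
# Rough prerequisite check for RAM, disk space, SELinux, and /etc/hosts
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
[ "$mem_kb" -ge 2097152 ] && echo "RAM: OK (${mem_kb} kB)" || echo "RAM: below 2 GB (${mem_kb} kB)"

# Free space (kB) on the partitions hosting /var and /usr
var_free=$(df -Pk /var | awk 'NR==2 {print $4}')
usr_free=$(df -Pk /usr | awk 'NR==2 {print $4}')
[ "$var_free" -ge 5242880 ] && echo "/var: OK" || echo "/var: less than 5 GB free"
[ "$usr_free" -ge 512000 ] && echo "/usr: OK" || echo "/usr: less than 500 MB free"

# SELinux should not be enforcing (silently skipped if getenforce is absent)
command -v getenforce >/dev/null 2>&1 && echo "SELinux: $(getenforce)"

# Duplicate IP addresses in /etc/hosts
dups=$(awk '!/^#/ && NF {print $1}' /etc/hosts | sort | uniq -d)
[ -z "$dups" ] && echo "/etc/hosts: no duplicate IPs" || echo "/etc/hosts: duplicate IPs: $dups"
```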

-- --------------------------------
-- 2. Install Cloudera Manager (CM)
-- --------------------------------

2.1 Establish Your Cloudera Manager Repository Strategy.
- We will have an internet connection on the CM node
- Other nodes will use the CM node as a proxy
- We have already configured this in the prepare_hadoop_hosts steps (part 1).
 
2.2 Install the Oracle JDK on CM node:
2.2.1 The JDK is included in the Cloudera Manager 5 repositories. Once you have the repo or list file in the correct place, you can install the JDK as follows:
# yum install oracle-j2sdk1.7

2.2.2 It is better to also install the native JDK rpm ("rpm -Uvh jdk-7u51-linux-x64.rpm"); sometimes the packaged one causes problems with 3rd party apps

2.3 Install the Cloudera Manager Server packages (** make sure to install 5.2; we made a change at step 2.1.3 for this)
# yum install cloudera-manager-daemons cloudera-manager-server
 
2.4 Prepare the external database (we will use MySQL):
ref: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Installation-Guide/cm5ig_mysql.html#cmig_topic_5_5_1_unique_1

2.4.1 Install MySQL 5.5.37. We are installing the rpms below:

- First install the mysql-community-release rpm:
# wget http://dev.mysql.com/get/mysql-community-release-el6-5.noarch.rpm
# rpm -Uvh mysql-community-release-el6-5.noarch.rpm

- Then use yum to install Mysql:
# yum install mysql-community-server


- So, at the end of the installation, the mysql packages below will be there:
mysql-community-release-el6-5.noarch
mysql-community-client-5.5.37-4.el6.x86_64
mysql-community-common-5.5.37-4.el6.x86_64
mysql-community-libs-5.5.37-4.el6.x86_64   
mysql-community-server-5.5.37-4.el6.x86_64

2.4.2 Configuring and Starting the MySQL Server:
 a. Stop the MySQL server if it is running
  $ service mysqld stop
 b. Move old InnoDB log files /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1 out of /var/lib/mysql/ to a backup location
 c. Determine the location of the option file, my.cnf (normally /etc/my.cnf)
 d. Update my.cnf so that it conforms to the following requirements:
- To prevent deadlocks, Cloudera Manager requires the isolation level to be set to read committed.
- Configure the InnoDB engine. Cloudera Manager will not start if its tables are configured with the MyISAM engine. This can be checked using:
mysql> show table status;
- Cloudera recommends that you set the innodb_flush_method property to O_DIRECT
- Set the max_connections property according to the size of your cluster. Clusters with fewer than 50 hosts can be considered small clusters.
- Ours is a small cluster, so we will put all databases on the same host where CM is installed
- Allow 100 maximum connections per database, then add 50 extra. So, for 2 DBs it would be 2x100+50=250.
- For our case it is a very small installation with 6-7 DBs; we are setting it to 550, which should be good enough.
- So, typically, our MySQL config (my.cnf) will be as below:

Note: Need to create the bin log location and change ownership to mysql user as below:
mkdir -p /opt/mysql/binlog/
chown -R mysql:mysql /opt/mysql/binlog/

------------start my.cnf---------------
[mysqld]
transaction-isolation=READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
# symbolic-links=0

key_buffer              = 8M
key_buffer_size         = 16M
max_allowed_packet      = 16M
thread_stack            = 64K
thread_cache_size       = 32
query_cache_limit       = 8M
query_cache_size        = 16M
query_cache_type        = 1

max_connections         = 550

# log_bin should be on a disk with enough free space
# NOTE: replace '/x/home/mysql/logs/binary' below with
#       an appropriate path for your system.
log_bin=/opt/mysql/binlog/mysql_binary_log

# For MySQL version 5.1.8 or later. Comment out binlog_format for older versions.
binlog_format           = mixed

read_buffer_size = 2M
read_rnd_buffer_size = 8M
sort_buffer_size = 8M
join_buffer_size = 8M

# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit  = 2
innodb_log_buffer_size          = 16M
innodb_buffer_pool_size         = 120M
innodb_thread_concurrency       = 8
innodb_flush_method             = O_DIRECT
innodb_log_file_size = 512M

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
------------end my.cnf---------------


2.4.3 Ensure the MySQL server starts at boot
# /sbin/chkconfig mysqld on
# /sbin/chkconfig --list mysqld
mysqld          0:off   1:off   2:on    3:on    4:on    5:on    6:off

2.4.4 Start the MySQL server:
# service mysqld start

2.4.5 Set the MySQL root password:
# /usr/bin/mysql_secure_installation

2.4.6 Installing the MySQL JDBC Connector:
Note: Do not use the yum install command to install the MySQL connector package, because it installs openJDK and then uses the Linux alternatives command to set the system JDK to openJDK.

- Install the JDBC connector on the Cloudera Manager Server host, as well as on hosts to which you assign the Activity Monitor, Reports Manager, Hive Metastore, Sentry Server, and Cloudera Navigator Audit Server roles. In our case they are all on the same host.

- It is better to use this process to avoid errors like "MySQLSyntaxErrorException" on impala/hive. If you installed using yum, do this step again.

- download it from http://dev.mysql.com/downloads/connector/j/
# wget "http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.34.tar.gz"
- Extract the JDBC driver JAR file from the downloaded file; for example:
# tar -xvf mysql-connector-java-5.1.34.tar.gz
- Add the JDBC driver, renamed, to the relevant server; for example:
# mkdir /usr/share/java/
# cp mysql-connector-java-5.1.34/mysql-connector-java-5.1.34-bin.jar /usr/share/java/
# ln -s /usr/share/java/mysql-connector-java-5.1.34-bin.jar /usr/share/java/mysql-connector-java.jar
# /usr/share/java/mysql-connector-java.jar -> /usr/share/java/mysql-connector-java-5.1.34-bin.jar

2.4.7 Create databases & users for Activity Monitor, Reports Manager, Hive Metastore, Sentry Server, and Cloudera Navigator Audit Server. The databases must be configured to support UTF-8 character set encoding.

For Activity Monitor:
mysql> create database amon DEFAULT CHARACTER SET utf8;
mysql> grant all on amon.* TO 'amon'@'localhost' IDENTIFIED BY 'password'; 
mysql> grant all on amon.* TO 'amon'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server

For Reports Manager:
mysql> create database rman DEFAULT CHARACTER SET utf8;
mysql> grant all on rman.* TO 'rman'@'localhost' IDENTIFIED BY 'password';
mysql> grant all on rman.* TO 'rman'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';

     -- here using local host, as all the services running in the same server

For Hive Metastore Server:
mysql> create database metastore DEFAULT CHARACTER SET utf8;
mysql> grant all on metastore.* TO 'hive'@'localhost' IDENTIFIED BY 'password';
mysql> grant all on metastore.* TO 'hive'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server

For Sentry Server:
mysql> create database sentry DEFAULT CHARACTER SET utf8;
mysql> grant all on sentry.* TO 'sentry'@'localhost' IDENTIFIED BY 'password';
mysql> grant all on sentry.* TO 'sentry'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server

For Cloudera Navigator Audit Server:
mysql> create database nav DEFAULT CHARACTER SET utf8;
mysql> grant all on nav.* TO 'nav'@'localhost' IDENTIFIED BY 'password'; 
mysql> grant all on nav.* TO 'nav'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server  
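The five blocks of per-service statements above can also be generated in one go. This is a hypothetical helper, not part of the original procedure; 'dvhdmgt1.example.com' and the 'password' literal mirror the examples above. Review the output, then pipe it into: mysql -u root -p

```shell
# Emit the CREATE DATABASE / GRANT statements for all five service databases
CM_HOST="dvhdmgt1.example.com"   # CM/server host used in the grants above
PASS="password"                  # replace with a real password
# db:user pairs matching the per-service statements above
SQL=$(for pair in amon:amon rman:rman metastore:hive sentry:sentry nav:nav; do
  db=${pair%%:*}; user=${pair##*:}
  echo "create database ${db} DEFAULT CHARACTER SET utf8;"
  echo "grant all on ${db}.* TO '${user}'@'localhost' IDENTIFIED BY '${PASS}';"
  echo "grant all on ${db}.* TO '${user}'@'${CM_HOST}' IDENTIFIED BY '${PASS}';"
done)
echo "$SQL"
```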
2.4.8 Back up all the DBs:
# mysqldump -u root -p --all-databases > alldb_backup.sql

2.4.9 Run the scm_prepare_database.sh script (for an Installer or package install)
on the host where the Cloudera Manager Server package is installed. The script prepares the database by:
- Creating the Cloudera Manager Server database configuration file.
- Creating a database for the Cloudera Manager Server to use. This is optional and is only completed if options are specified.
- Setting up a user account for the Cloudera Manager Server. This is optional and is only completed if options are specified.
mysql > grant all on *.* to 'temp'@'%' identified by 'temp' with grant option;
# /usr/share/cmf/schema/scm_prepare_database.sh mysql -h localhost -utemp -ptemp --scm-host localhost scm scm scm
 -- The log4j errors appear but don't seem to be harmful.

mysql> drop user 'temp'@'%';

2.4.10 Remove the embedded PostgreSQL properties file. For an Installer or package install, do the below:
# rm /etc/cloudera-scm-server/db.mgmt.properties

2.4.11 ** We must create the databases before running the Cloudera Manager installation wizard if we chose the external database option.


2.4.12 ** External Databases for Hue and Oozie.
      - Hue and Oozie are automatically configured with databases, but you can configure these services to use external databases after Cloudera Manager is installed.
  http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Managing-Clusters/cm5mc_hue_service.html#cmig_topic_15_unique_1
  http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Managing-Clusters/cm5mc_oozie_service.html#cmig_topic_14_unique_1

2.5 Start Cloudera Manager and search for target hosts:
2.5.1 Run this command on the Cloudera Manager Server host to start Cloudera Manager:
# service cloudera-scm-server start

2.5.2 Wait several minutes for the Cloudera Manager Server to complete its startup and monitor log as below:
# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log #wait until you see "Started Jetty server."

2.5.3 In a web browser, enter http://<Server host>:7180

2.5.4 Log into Cloudera Manager Admin Console. The default credentials are: Username: admin Password: admin


2.6 Choose Cloudera Manager Edition and Hosts
2.6.1 When you start the Cloudera Manager Admin Console, the install wizard starts up. Click Continue to get started
2.6.2 Choose which edition to install.
- For our case we will install the "Cloudera Enterprise Data Hub Edition Trial", which does not require a license but expires after 60 days and cannot be renewed
- "Continue"
2.6.3 The cluster configuration page appears.
- (optional) Click the "Cloudera Manager" logo to skip the default installation
- Go to "Administration > Settings > Parcels"
- Add the desired "Remote Parcel Repository URLs". We are going to install CDH 5.0.2, so we will add the below:
http://archive.cloudera.com/cdh5/parcels/5.0.2/
- "Save Changes"


2.6.4 Search for and choose hosts as below:
- Cloudera Manager Home > hosts > Add new hosts to cluster. Add hosts option will appear.
- To enable Cloudera Manager to automatically discover hosts on which to install CDH and managed services, enter the cluster hostnames or IP addresses. You can also specify hostname and IP address ranges:
a. An IP range like "10.1.1.[1-4]" or a hostname pattern like "host[1-3].company.com"; for our case 192.168.56.[101-103],192.168.56.[201-203]
b. The scan results will include all addresses scanned, but only scans that reach hosts running SSH will be selected for inclusion in your cluster by default. 
c. Click Search. Cloudera Manager identifies the hosts on your cluster to allow you to configure them for services.
d. Verify that the number of hosts shown matches the number of hosts where you want to install services. 
e. Click Continue. The Select Repository page displays.
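The bracketed range patterns above expand much like a shell loop; if you want to preview which addresses a pattern will cover before pasting it into the wizard, a quick sketch:

```shell
# Preview the IPs covered by 192.168.56.[101-103] and 192.168.56.[201-203]
for i in $(seq 101 103) $(seq 201 203); do
  echo "192.168.56.$i"
done
```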
2.6.5 To avoid getting stuck at "Acquiring installation lock..." during installation, do the below on all the nodes:
# rm /tmp/.scm_prepare_node.lock

3. Install Cloudera Manager Agent, CDH, and Managed Service Software

mysql bug: http://bugs.mysql.com/bug.php?id=63085

3.1 Select how CDH and managed service software is installed: packages or parcels. We will use parcels
3.2 Choose the parcels to install. The choices you see depend on the repositories you have chosen – a repository may contain multiple parcels. Only the parcels for the latest supported service versions are configured by default.
3.3 Choose "CDH-5.0.2-1.cdh5.0.2.p0.13" and keep the rest as default

3.5 Install the Cloudera Manager Agent
3.5.1 Select the release of Cloudera Manager Agent to install.
3.5.2 Click Continue.
3.5.3 Leave Install Oracle Java SE Development Kit (JDK) checked to allow Cloudera Manager to install the JDK on each cluster host or uncheck if you plan to install it yourself. Click Continue.
3.5.4 Provide SSH login credentials.
3.5.5 Click Continue. If you did not install packages manually, Cloudera Manager installs the Oracle JDK, Cloudera Manager Agent packages, and CDH and managed service packages or parcels.
3.5.6 When the Continue button appears at the bottom of the screen, the installation process is completed. Click Continue.
3.5.7 The Host Inspector runs to validate the installation, and provides a summary of what it finds, including all the versions of the installed components. 
     If the validation is successful, click Finish. The Cluster Setup page displays.

3.6 Add Services
3.6.1 In the first page of the Add Services wizard you choose the combination of services to install and whether to install Cloudera Navigator. Click the radio button next to the combination of services to install.
Some services depend on other services; for example, HBase requires HDFS and ZooKeeper. Cloudera Manager tracks dependencies and installs the correct combination of services.
3.6.2 The Flume service can be added only after your cluster has been set up.
3.6.3 If you have chosen Data Hub Edition Trial or Cloudera Enterprise, optionally check the Include Cloudera Navigator checkbox to enable Cloudera Navigator.
3.6.4 Click Continue. The Customize Role Assignments page displays.
3.6.5 Customize the assignment of role instances to hosts (datanodes, namenodes, resource manager, etc.). Hosts can be chosen similar to step 2.6.4 a.
3.6.6 When you are satisfied with the assignments, click Continue. The Database Setup page displays.
3.6.7 Enter the database host, database type, database name, username, and password for the database that you created when you set up the database.
3.6.8 Click Test Connection to confirm that Cloudera Manager can communicate with the database using the information you have supplied. If the test succeeds in all cases, click Continue; 
3.6.9 Review the configuration changes to be applied. 
   - Confirm the settings entered for file system paths for HDFS and others. 
- Make sure to add 3 nodes for ZooKeeper.
- Do not make the namenode the HBase master.
The file paths required vary based on the services to be installed. 
Click Continue. The wizard starts the services.
3.6.10 When all of the services are started, click Continue. You will see a success message indicating that your cluster has been successfully started.
3.6.11 There will be some configuration alarms since we installed with low resources. Fix them as much as possible.
Some useful fixes:
a. - If needed, delete services in the order below:
Oozie
Impala
Hive
HBase
Spark
Sqoop2
YARN
HDFS
ZooKeeper
 - Add services back in the reverse order
b. While reinstalling HDFS, make sure the name directories of the NameNode are empty (default /dfs/nn; on the secondary namenode /dfs/snn; on datanodes /data/0[1-3..])
for our case:
   ssh dvhdnn1  "rm -rf /dfs/nn/*"
ssh dvhdjt1  "rm -rf /dfs/snn/*"
ssh dvhddn01  "rm -rf /data/01/*"
ssh dvhddn01  "rm -rf /data/02/*"
ssh dvhddn01  "rm -rf /data/03/*"
ssh dvhddn02  "rm -rf /data/01/*"
ssh dvhddn02  "rm -rf /data/02/*"
ssh dvhddn02  "rm -rf /data/03/*"
ssh dvhddn03  "rm -rf /data/01/*"
ssh dvhddn03  "rm -rf /data/02/*"
ssh dvhddn03  "rm -rf /data/03/*"
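The per-host cleanup commands above can be generated with a short loop. This sketch prints the commands for review rather than executing them (hostnames and paths follow this post); pipe the output into sh once it looks right.

```shell
# Print (rather than run) the NameNode/datanode directory cleanup commands
CLEANUP=$(
  echo 'ssh dvhdnn1 "rm -rf /dfs/nn/*"'
  echo 'ssh dvhdjt1 "rm -rf /dfs/snn/*"'
  for host in dvhddn01 dvhddn02 dvhddn03; do
    for d in 01 02 03; do
      echo "ssh ${host} \"rm -rf /data/${d}/*\""
    done
  done
)
echo "$CLEANUP"
```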

c. While re-installing HBase (to avoid "TableExistsException: hbase:namespace"):
- Stop the existing HBase service
- Do the below from one of the servers running an HBase service (for our case we can use the CM node, dvhdmgt1)
# hbase zkcli
[zk: localhost:2181(CONNECTED) 0] rmr /hbase   # << this command removes the existing znode
- Delete the existing HBase service
- Try adding HBase again
d. Eliminate "Failed to access Hive warehouse: /user/hive/warehouse" errors in Hue or Beeswax:
- # su - hdfs
- # hadoop fs -mkdir /user/hive
- # hadoop fs -mkdir /user/hive/warehouse
- # hadoop fs -chown -R hive:hive /user/hive
- # hadoop fs -chmod -R 1775 /user/hive/
- restart hue service
-- -----------------------
-- 4 Test the Installation
-- -----------------------
4.1 login to CM web console
4.2 All the services should be running with Good Health on CM console.
4.3 Click the Hosts tab where you can see a list of all the Hosts along with the value of their Last Heartbeat. By default, every Agent must heartbeat successfully every 15 seconds. 
4.4 Running a MapReduce Job
4.4.1 Log into a host in the cluster.
4.4.2 Run MapReduce jobs as below; they should run successfully
a. Run pi example:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
b. Wordcount example:
# su - hdfs
$ echo 'Hello World, Bye World!' > /tmp/file01
$ echo 'Hello Hadoop, Goodbye to hadoop.' > /tmp/file02
$ hadoop fs -mkdir /tmp/input/
$ hadoop fs -put /tmp/file01 /tmp/input/file01
$ hadoop fs -put /tmp/file02 /tmp/input/file02
$ hadoop fs -cat /tmp/input/file01
Hello World, Bye World!

$ hadoop fs -cat /tmp/input/file02
Hello Hadoop, Goodbye to hadoop.

$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar wordcount /tmp/input/ /tmp/output/
 
$ hadoop fs -ls /tmp/output/
Found 2 items
-rw-r--r--   3 hdfs supergroup          0 2014-11-08 06:31 /tmp/output/_SUCCESS
-rw-r--r--   3 hdfs supergroup         32 2014-11-08 06:31 /tmp/output/part-r-00000

$ hadoop fs -cat /tmp/output/part-r-00000
Bye     1
Goodbye 1
Hadoop, 1
Hello   2
World!  1
World,  1
hadoop. 1
to      1
4.4.3 Monitor the above MapReduce job at "Clusters > ClusterName > YARN Applications"
4.4.4 Testing Impala
- create the datafile locally
$ cat /tmp/tab1.csv
1,true,123.123,2012-10-24 08:55:00 
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33

- copy the file to hdfs
# su - hdfs
$ hadoop fs -mkdir /tmp/tab1/
$ hadoop fs -put /tmp/tab1.csv /tmp/tab1/tab1.csv

- login to the impala shell
# impala-shell -i dvhddn01

- Create a text based table
CREATE EXTERNAL TABLE TMP_TAB1
(
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE,
  col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1';

select * from TMP_TAB1;

- create a PARQUET table
create table TAB1 (
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE,
  col_3 TIMESTAMP
) STORED AS PARQUET;

insert into TAB1 select * from TMP_TAB1;

select * from TAB1;

4.4.5 Test with Hue
- Log into the Hue web console http://dvhdmgt1:8888
- Access the tables created in step 4.4.4 using Hive
- Access the tables created in step 4.4.4 using Impala
-- ------------------
-- 5 Install RHadoop
-- ------------------
http://ashokharnal.wordpress.com/2014/01/16/installing-r-rhadoop-and-rstudio-over-cloudera-hadoop-ecosystem-revised/
https://github.com/RevolutionAnalytics/RHadoop/wiki
*** should have internet access
5.1 On the same node where CM is installed, install R & R-devel
# wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# rpm -Uvh epel-release-6-8.noarch.rpm
# yum clean all
# yum install R R-devel
5.2 Use the R shell to install the prerequisite packages as below:
# R
> install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2","caTools"))

5.3 Download the rhdfs and rmr2 packages to your local download folder from "https://github.com/RevolutionAnalytics/RHadoop/wiki"
cd /tmp
wget "https://raw.githubusercontent.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz"
or curl -O https://raw.githubusercontent.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz

wget "https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/master/build/rmr2_3.2.0.tar.gz"
or curl -O https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/master/build/rmr2_3.2.0.tar.gz

5.4 Make sure the env variables below are set in the .bash_profile of root or the sudo user:

export PATH

export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar
export JAVA_HOME=/usr/java/jdk1.7.0_51

export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/opt/cloudera/parcels/CDH/lib64:/usr/java/jdk1.7.0_45-cloudera/jre/lib/amd64/server
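Before starting R, it is worth confirming the variables above are actually set in the current shell. A small sketch, safe to run as any user:

```shell
# Report which of the RHadoop-related environment variables are set
for v in HADOOP_CMD HADOOP_STREAMING JAVA_HOME LD_LIBRARY_PATH; do
  eval "val=\$$v"
  if [ -n "$val" ]; then echo "$v=$val"; else echo "$v is NOT set"; fi
done
```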



5.5 Install the downloaded rhdfs & rmr2 packages in the R shell by specifying their location on your machine.
# R
> install.packages("/tmp/rhdfs_1.0.8.tar.gz", repos = NULL, type="source")
> install.packages("/tmp/rmr2_3.2.0.tar.gz", repos = NULL, type="source")
5.6 Now RHadoop is installed

5.7 Test RHadoop (if using a new non-root/non-sudo user, make sure the env variables related to hadoop & rhadoop are set)
5.8 Test the installation as below:

Note: if you get an error like "Unable to find JAAS classes", make sure you installed the native JDK (step 2.2.2) and set the env variables (step 5.4)

# su - hdfs
$R
> library(rmr2)
> library(rJava)
> library(rhdfs)
> hdfs.init() 
> hdfs.ls("/") 
> ints = to.dfs(1:100)
> calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
> from.dfs(calc)


[You will get a long series of output something as below]

$key
NULL

$val
v
 [1,]   1   2
 [2,]   2   4
 [3,]   3   6
 [4,]   4   8
 [5,]   5  10
 [6,]   6  12
.............
.............
 [98,]  98 196
 [99,]  99 198
[100,] 100 200


-- --------------
 6. Install Mahout
-- --------------
   6.1 Install Mahout on the CM node; here we are using yum
# yum install mahout
   
   6.2 Mahout will be accessible using the below:
# /usr/bin/mahout
  
   6.3 The above command should not give any error


-- ----------------------------
7. Configure High Availability
-- ----------------------------

http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-High-Availability-Guide/CDH5-High-Availability-Guide.html

7.1 Configuring HDFS High Availability (using the wizard)
7.1.1 CM Home > HDFS > Instances > click the "Enable High Availability" button
7.1.2 Select 3 JournalNodes (dvhdmgt1, dvhdnn1 & dvhdjt1) and a standby node (dvhdjt1). Continue
7.1.3 Select your nameservice name. I kept the default "nameservice1". Continue
7.1.4 Review Changes and provide a value for dfs.journalnode.edits.dir (I provided /dfs/jn)
 Keep the rest default (most of them clean & re-initialize existing services). Continue
7.1.5 It will fail formatting the "Name directories of the current NameNode"; this failure is expected. Just ignore it.
7.1.6 The following manual steps must be performed after completing this wizard:
- CM Home > Hive > Action > "Stop"
- (optionally) Backup Hive metastore
- CM Home > Hive > Action > "Update Hive Metastore NameNodes"
- CM Home > Hive > Action > "Restart"
- CM Home > impala > Action > "Restart"
- CM Home > Hue > Action > "Restart"
- Test on previous MR, hive & impala data

7.2 Configuring High Availability for ResourceManager (MRv2/YARN)
    7.2.1 Stop all YARN daemons
- CM > YARN > Action > Stop

7.2.2 Update the configuration in yarn-site.xml (use CM)
-- dvhdjt1, we will name it resource manager 1 (rm1)
-- dvhdnn1, we will name it resource manager 2 (rm2)
-- Append the below to /etc/hadoop/conf/yarn-site.xml on dvhdjt1 and copy it to all nodes (except the CM node)



 
<property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>2000</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarnRM</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
</property>
<!-- on the rm2 host (dvhdnn1) set this value to rm2 -->
<property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
    <name>yarn.resourcemanager.zk.state-store.address</name>
    <value>localhost:2181</value>
</property>
<property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
    <value>5000</value>
</property>

<!-- rm1 (dvhdjt1) addresses -->
<property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>dvhdjt1:23140</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>dvhdjt1:23130</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>dvhdjt1:23189</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>dvhdjt1:23188</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>dvhdjt1:23125</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>dvhdjt1:23141</value>
</property>

<!-- rm2 (dvhdnn1) addresses -->
<property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>dvhdnn1:23140</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>dvhdnn1:23130</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.https.address.rm2</name>
    <value>dvhdnn1:23189</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>dvhdnn1:23188</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>dvhdnn1:23125</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>dvhdnn1:23141</value>
</property>

<!-- NodeManager settings -->
<property>
    <description>Address where the localizer IPC is.</description>
    <name>yarn.nodemanager.localizer.address</name>
    <value>0.0.0.0:23344</value>
</property>
<property>
    <description>NM Webapp address.</description>
    <name>yarn.nodemanager.webapp.address</name>
    <value>0.0.0.0:23999</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/tmp/pseudo-dist/yarn/local</value>
</property>
<property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/tmp/pseudo-dist/yarn/log</value>
</property>
<property>
    <name>mapreduce.shuffle.port</name>
    <value>23080</value>
</property>
 


7.2.3 Restart the YARN daemons
- CM > YARN > Instances > Add > for ResourceManager select "dvhdnn1" > Continue
- CM > YARN > Instances > Select All > Actions for Selected > "Restart"

7.2.4 Using yarn rmadmin to Administer ResourceManager HA
- yarn rmadmin has the following options related to RM HA:
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]
where <serviceId> is the rm-id (for our case rm1 & rm2), e.g.:
# yarn rmadmin -getServiceState rm1


-- ----------------
8. Install gateways
-- ----------------

    8.1 Impala proxy:

Ref: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_proxy.html

8.1.1 Install haproxy
# yum install haproxy

8.1.2 Set up the configuration file /etc/haproxy/haproxy.cfg as below
--------------------------Start Config File-------------------------------
global
    # To have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local0
    log         127.0.0.1 local1 notice
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    maxconn                 3000
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms



#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats :25002
    balance
    mode http
    stats enable
    stats auth username:password

# This is the setup for Impala. impala-shell/beeswax clients connect to load_balancer_host:25004.
# HAProxy will balance connections among the list of servers listed below.
# Each impalad listens on port 21000 for beeswax (impala-shell) or the original ODBC driver.
# For JDBC or ODBC version 2.x drivers, use port 21050 instead of 21000 (second listen block below).
listen impala :25004
    mode tcp
    option tcplog
    balance leastconn

    server impald1 dvhddn01.example.com:21000
    server impald2 dvhddn02.example.com:21000
    server impald3 dvhddn03.example.com:21000


# For impala JDBC or ODBC
listen impala :25003
    mode tcp
    option tcplog
    balance leastconn

    server impald1 dvhddn01.example.com:21050
    server impald2 dvhddn02.example.com:21050
    server impald3 dvhddn03.example.com:21050


--------------------------End Config File-------------------------------
     ** we have configured 25003 for the JDBC/ODBC Impala connection
     ** we have configured 25004 for the impalad/impala-shell connection
 
8.1.3 Run the load balancer (on a single host, preferably one not running impalad; in our case dvhdmgt1, dvhdnn1 & dvhdjt1):

# haproxy -c -f /etc/haproxy/haproxy.cfg   # optional: validate the config first
# service haproxy start

- Ignore warnings like the below:
Starting haproxy: [WARNING] 322/090925 (15196) : config : 'option forwardfor' ignored for proxy 'impala' as it requires HTTP mode.
or 
Starting haproxy: [WARNING] 329/052137 (32754) : Parsing [/etc/haproxy/haproxy.cfg:73]: proxy 'impala' has same name as another proxy (declared at /etc/haproxy/haproxy.cfg:62).
[WARNING] 329/052137 (32754) : config : 'option forwardfor' ignored for proxy 'impala' as it requires HTTP mode.
[WARNING] 329/052137 (32754) : config : 'option forwardfor' ignored for proxy 'impala' as it requires HTTP mode.


8.1.4 Connect to Impala from any of the haproxy nodes as below:
# impala-shell -i dvhdmgt1:25004
> use axdb;
> select count(*) from f_ntw_actvty_http;

8.1.5 Enable haproxy to start on boot:
# chkconfig haproxy on


    8.2 HttpFS gateway:
   The HttpFS gateway is normally installed with the Cloudera Hadoop parcel installation (step 3.6 Add Services).
8.2.1 Check CM > HDFS > Instances to see whether any HttpFS nodes are there or not.
8.2.2 If not, then CM > HDFS > Instances > Add Role Instances > HttpFS > add your hosts > follow the next instructions
8.2.3 After installation completes, check with the below (from each of the IPs of the nodes where HttpFS is installed):
curl "http://192.168.56.201:14000/webhdfs/v1?op=gethomedirectory&user.name=hdfs"
curl 'http://192.168.56.202:14000/webhdfs/v1/?user.name=hdfs&op=open'
curl 'http://192.168.56.203:14000/webhdfs/v1/tmp/tab1/tab1.csv?user.name=hdfs&op=open'
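A healthy gethomedirectory call typically returns a small JSON document naming the user's home path. A minimal sketch of checking such a reply (canned response below, assuming the hdfs user; on a real node, pipe the curl output in instead):

```shell
# Canned reply of the kind gethomedirectory typically returns (assumption:
# the calling user is hdfs, so the home path would be /user/hdfs).
resp='{"Path":"\/user\/hdfs"}'

# A reply containing a Path field means HttpFS is serving requests.
echo "$resp" | grep -q '"Path"' && echo "HttpFS reply looks OK"
```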



8. Install RImpala
   ref: http://blog.cloudera.com/blog/2013/12/how-to-do-statistical-analysis-with-impala-and-r/
   - On a node connected to the cluster, preferably one not running an Impala daemon (in our case the CM/mgt node):
8.1 mkdir -p /usr/lib/impala/lib
8.2 cd /usr/lib/impala/lib
8.3 wget "https://downloads.cloudera.com/impala-jdbc/impala-jdbc-0.5-2.zip"
8.4 unzip impala-jdbc-0.5-2.zip
It will extract to ./impala-jdbc-0.5-2; take note of the full path "/usr/lib/impala/lib/impala-jdbc-0.5-2"
8.5 # R
> install.packages("RImpala")
 - select the mirror
 - A successful installation shows logs like below:
 * DONE (RImpala)
Making 'packages.html' ... done

The downloaded source packages are in
‘/tmp/RtmpIayO6J/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
8.6 Initialize impala for R
> library(RImpala)
Loading required package: rJava
> rimpala.init(libs="/usr/lib/impala/lib/impala-jdbc-0.5-2/")  # Path noted in 8.4
[1] "Classpath added successfully"
> rimpala.connect("dvhdmgt1", "25003") # here we use the Impala gateway host & port for Impala JDBC (step 7)
[1] TRUE
> rimpala.invalidate()
[1] TRUE
> rimpala.showdatabases()
 name
1 _impala_builtins
2          default
> rimpala.usedatabase(db="default")
> rimpala.showtables()
  name
1 sample_07
2      tab1
3  tmp_tab1
> rimpala.describe("tab1")
   name      type comment
1    id       int
2 col_1   boolean
3 col_2    double
4 col_3 timestamp
> data = rimpala.query("Select * from tab1")
> data
id col_1      col_2                     col_3
1  1  true    123.123     2012-10-24 08:55:00.0
2  2 false   1243.500     2012-10-25 13:40:00.0
3  3 false  24453.325   2008-08-22 09:33:21.123
4  4 false 243423.325 2007-05-12 22:32:21.33454
> rimpala.close()
[1] TRUE
>



Ref: 
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-0-0/Cloudera-Manager-Installation-Guide/Cloudera-Manager-Installation-Guide.html

Preparing Hosts(VMs) to Install Cloudera Hadoop (Install Cloudera Hadoop part 1 of 2):

Here I am sharing the activity cookbook I followed to install a development environment for a Cloudera Hadoop cluster using VirtualBox. Using almost the same method, you can easily install Cloudera Hadoop in production.

Here is part 1 of 2, hosts/nodes preparation. For Cloudera Manager & Cloudera Hadoop installation, please check part 2.


1. Target:

Hadoop:
Hadoop Cluster with below:
- 1 management node: acting as gateway to the cluster and hosting Cloudera Manager, with 4G RAM, 2 vcores, 25G local disk
- 1 name node: acting as primary namenode and standby resource manager, with 2G RAM, 1 vcore, 25G local disk
- 1 resource manager node: acting as primary resource manager and standby namenode, with 2G RAM, 1 vcore, 25G local disk
- 3 datanodes: running all Hadoop worker processes, each with 3 disk volumes, 2G RAM, 1 vcore, 25G local disk
- Only the management node will have an internet connection; the other 5 will not. This emulates a production data center environment.

VMs for hadoop:
First configure 1 base virtual guest with:
- can ssh from host
- can connect to internet
- can interact with other guest on the same host
- having 25GB local storage
- having 1 GB ram
- 1 virtual CPU

2. Target virtual machine:
We used VirtualBox as our virtualization software.
Make sure virtualization support is activated on the host. If not enabled, please enable it from the BIOS.

3. Create a virtual host with the below network config:
Adapter 1: hostonly (on eth0)
Adapter 2: NAT (on eth1)

4. Install CentOS/RHEL with the required partitions. In our case we use only the 3 partitions below; this helps use all the available space:
- /
- /boot
- /home

5. Configure network as below:

a. On the host-only network (eth0):
- On the VirtualBox guest configuration: the default IP for the host-only virtual interface on the host machine is 192.168.56.1 with no gateway; keep it untouched.

- In the eth0 configuration file:
- Keep DEVICE, HWADDR & UUID untouched.
- Change ONBOOT=yes
- Set BOOTPROTO=static
- Add IPADDR=<the node's IP>
- No gateway or other settings; if any are present, remove them.
- This interface will be used to communicate with the host and the other guests.

- For example, the ifcfg-eth0 file looks like below; for the other machines only IPADDR changes:
[root@base ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
HWADDR=08:00:27:C6:2A:31
TYPE=Ethernet
UUID=e8e7aafe-2033-4cc5-ae93-bea49b5b3528
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.56.201
[root@base ~]#

b. On NAT (eth1):
- change ONBOOT=yes 
- add BOOTPROTO=dhcp (if not there)
- keep the rest untouched
- this interface will be used to access the internet
- For example ifcfg-eth1 will look like below:
[root@base ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
HWADDR=08:00:27:AC:01:6D
TYPE=Ethernet
UUID=5ba5cf93-7526-4acb-88d9-9ce14439df80
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=dhcp

c. Change your hostname as below:
[root@base ~]# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=base.example.com
[root@base ~]#


d. No need to change anything on DNS (resolv.conf)

e. (Optional) Set the swappiness to 0 to avoid swapping, as per the recommendation of Cloudera Manager:
# vi /etc/sysctl.conf
vm.swappiness = 0
# sysctl -p
# cat /proc/sys/vm/swappiness
0

Ref: https://blogs.oracle.com/fatbloke/entry/networking_in_virtualbox1

6. Disable SELinux:
vi /etc/selinux/config
SELINUX=disabled

7. Disable IPv6 by issuing the below commands as root:
# vi /etc/sysctl.conf and add the below two lines:
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.all.disable_ipv6 = 1

then run "sysctl -p" as root
# sysctl -p

9. Disable the firewall:
chkconfig iptables off

10. Disable the yum fastestmirror plugin:
vi /etc/yum/pluginconf.d/fastestmirror.conf
enabled=0

11. Reboot and check that all changes are persistent.


12. Setup SSH

To simplify access between hosts, install SSH, set up SSH keys, and mark them as already authorized.
- Do the below on the base node. Since we will clone this base to create the other nodes, the keys will already be there; no need to copy them again.

$ yum -y install perl openssh-clients
$ ssh-keygen (type enter, enter, enter)
$ cd ~/.ssh
$ cp id_rsa.pub authorized_keys
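The same key setup can be done non-interactively; a sketch (using a temporary directory here for safety — on the base node the directory would be ~/.ssh):

```shell
# Generate a passphrase-less key pair and authorize it for ourselves.
# KEYDIR is a temp dir for this demo; use ~/.ssh on the real base node.
KEYDIR=$(mktemp -d)
chmod 700 "$KEYDIR"
ssh-keygen -q -t rsa -N '' -f "$KEYDIR/id_rsa"
cp "$KEYDIR/id_rsa.pub" "$KEYDIR/authorized_keys"
chmod 600 "$KEYDIR/authorized_keys"
```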

- Modify the SSH configuration file: uncomment the following line and change the value to no; this prevents the host-key confirmation question when connecting to a host with SSH.

# vi /etc/ssh/ssh_config
StrictHostKeyChecking no

13. Edit the /etc/hosts file as per your need:
vi /etc/hosts
192.168.56.201 dvhdmgt1.example.com dvhdmgt1  # Management node hosting Cloudera Manager
192.168.56.202 dvhdnn1.example.com dvhdnn1 # Name node
192.168.56.203 dvhdjt1.example.com dvhdjt1 # Jobtracker/Resource Manager
192.168.56.101 dvhddn01.example.com dvhddn01 # Datanode1
192.168.56.102 dvhddn02.example.com dvhddn02 # Datanode2
192.168.56.103 dvhddn03.example.com dvhddn03 # Datanode3
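Prerequisite 1.6 says the hosts file must not have duplicate IP addresses. A quick check, shown on a throwaway copy here (point HOSTS_FILE at /etc/hosts on a real node):

```shell
# Build a sample hosts file; on a real node use HOSTS_FILE=/etc/hosts instead.
HOSTS_FILE=$(mktemp)
cat > "$HOSTS_FILE" <<'EOF'
192.168.56.201 dvhdmgt1.example.com dvhdmgt1
192.168.56.202 dvhdnn1.example.com dvhdnn1
192.168.56.203 dvhdjt1.example.com dvhdjt1
EOF

# Print any IP that appears on more than one non-comment line.
dups=$(awk '!/^#/ && NF {print $1}' "$HOSTS_FILE" | sort | uniq -d)
[ -z "$dups" ] && echo "no duplicate IPs" || echo "duplicates: $dups"
```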

14. Clone the base, considering the below for each of the nodes:

a. If the NICs do not come up, select both adapters in VirtualBox and, from the advanced options of each, refresh the MAC; then update HWADDR in eth0 & eth1 with the corresponding MACs and reboot.

b. When the system is up and both NICs (eth0 & eth1) are up, do as below:
- Open "/etc/udev/rules.d/*-persistent-net.rules", check which MAC (ATTR{address}==) is matched with which interface, rename the entries to the corresponding eth0/eth1, then remove/comment out any other line naming eth0/eth1 with a non-matching MAC.
[ref: http://xmodulo.com/2013/04/how-to-clone-or-copy-virtual-machine-on-virtualbox.html]
c. Assign the IP & MAC for the corresponding node in ifcfg-eth0; change only the MAC in ifcfg-eth1 (both under /etc/sysconfig/network-scripts/).
d. Change the hostname in /etc/sysconfig/network as per point 13.
e. Reboot to make it effective, and check.
f. Following the above steps, create 6 VMs and configure CPU & RAM as listed in step 1.
g. Except on the management node (dvhdmgt1), shut down the NAT network (ifdown eth1) and set "ONBOOT=no" in /etc/sysconfig/network-scripts/ifcfg-eth1.
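Step 14c can be partly scripted. A sketch of stamping per-node IPADDR values into copies of a base ifcfg-eth0 (the template below is illustrative, not the real file from the base VM; HWADDR/UUID still need the per-clone fixes from 14a-b):

```shell
# Illustrative base template in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/ifcfg-eth0" <<'EOF'
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.56.201
EOF

# Stamp each node's IP into its own copy of the file.
for node in dvhdnn1:192.168.56.202 dvhdjt1:192.168.56.203 dvhddn01:192.168.56.101; do
    host=${node%%:*}; ip=${node#*:}
    sed "s/^IPADDR=.*/IPADDR=$ip/" "$workdir/ifcfg-eth0" > "$workdir/ifcfg-eth0.$host"
done
grep '^IPADDR=' "$workdir/ifcfg-eth0.dvhdnn1"
```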

  
15. Mount JBOD (only for data disks)
a. From VirtualBox, assign 3 disks to each of the 3 datanodes.
b. Mount your data disks with noatime (e.g. /dev/sdc1 /mnt/disk3 ext4 defaults,noatime 1 2, which, by the way, implies nodiratime).
c. (Optional) By default 5% of an HDD is reserved in ext filesystems so that critical processes can still write some data when the disk is full (check by running tune2fs -l /dev/sdc1 and looking at the reserved block count). Lower it to 1% by running tune2fs -m 1 on all your data disks (i.e. tune2fs -m 1 /dev/sdc1).
http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/

For example, do as below for each data node:
- # shutdown
- add 3 virtual storage HDD on data nodes
- start the machine
- use "fdisk -l" to identify the unpartitioned disks (they should not appear in any partition table)
- partition each whole disk as a single primary partition, with partition number "1" for all, using fdisk:
# fdisk /dev/sdb
then follow the steps to make the desired partition (typically n > p > 1 > enter > enter > w)
After that, each device will have a partition with an additional 1 appended to the device name (e.g. /dev/sdb will have /dev/sdb1)
- Format the disks with ext4:
mkfs.ext4 /dev/sdb1
mkfs.ext4 /dev/sdc1
mkfs.ext4 /dev/sdd1
- (Optional) Tune as per point c above:
tune2fs -m 1 /dev/sdb1
tune2fs -m 1 /dev/sdc1
tune2fs -m 1 /dev/sdd1

tune2fs -l /dev/sdb1 |grep "Reserved block count:"
tune2fs -l /dev/sdc1 |grep "Reserved block count:"
tune2fs -l /dev/sdd1 |grep "Reserved block count:"
- Mount data partitions
# mkdir -p /data/01
# mkdir -p /data/02
# mkdir -p /data/03

   # vi /etc/fstab
/dev/sdb1 /data/01 ext4 defaults,noatime 1 2
/dev/sdc1 /data/02 ext4 defaults,noatime 1 2
/dev/sdd1 /data/03 ext4 defaults,noatime 1 2

# mount -a
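To see what point c above buys you: on a 25 GB data disk, dropping the ext4 reserve from 5% to 1% reclaims about 1 GB per disk. The arithmetic:

```shell
# Space reclaimed per disk when lowering the ext4 reserve from 5% to 1%
# (25 GB disks as in our setup; sizes in MiB).
disk_mib=$((25 * 1024))
reclaimed_mib=$((disk_mib * (5 - 1) / 100))
echo "${reclaimed_mib} MiB reclaimed per 25 GB disk"   # 1024 MiB
```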


16. Prepare the Cloudera Manager (CM) server node (dvhdmgt1) as a proxy for yum (as only the CM node will have an internet connection):

16.1 install squid and enable local caching:
# yum install squid
16.2 specify the caching directory (here we are caching about 7000 MB):
# grep cache_dir /etc/squid/squid.conf
#cache_dir ufs /var/spool/squid 100 16 256
cache_dir ufs /var/spool/squid 7000 16 256

16.3 You will also have to allow connections to port 3128 or stop the firewall. In our case the firewall is not running.
16.4 start squid server (on CM node):
#service squid start
init_cache_dir /var/spool/squid... Starting squid: .       [  OK  ]
 
16.5 Add squid to chkconfig:
# chkconfig squid on
# chkconfig --list squid
squid           0:off   1:off   2:on    3:on    4:on    5:on    6:off

17. Create repo file for cloudera manager with proper version (on CM node):

17.1 Cloudera recommends installing products using package management tools such as yum for Red Hat compatible systems. We will follow this recommendation.
17.2  (on CM node) download repo file "http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/cloudera-manager.repo" and copy to the "/etc/yum.repos.d/" directory.
17.3  (on CM node) Edit the file to change the baseurl to point to the specific version of Cloudera Manager you want to download. For us, we want to install Cloudera Manager version 5.0.2. So our final "/etc/yum.repos.d/cloudera-manager.repo" file will be as below.

[cloudera-manager]
# Packages for Cloudera Manager, Version 5, on RedHat or CentOS 6 x86_64            
name=Cloudera Manager
baseurl=http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5.0.2/
gpgkey = http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/RPM-GPG-KEY-cloudera    
gpgcheck = 1

17.4 Do the above on the CM node (dvhdmgt1), then distribute the repo file to all nodes:
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhdnn1:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhdjt1:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhddn01:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhddn02:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhddn03:/etc/yum.repos.d
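The five scp commands above can be written as a loop; shown dry-run here (remove the echo to actually copy):

```shell
# Dry-run: print the scp command for each non-CM node.
for h in dvhdnn1 dvhdjt1 dvhddn01 dvhddn02 dvhddn03; do
    echo scp cloudera-manager.repo "$h:/etc/yum.repos.d/"
done
```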


18. Point yum proxy to the CM node(dvhdmgt1) (on all node except dvhdmgt1):
18.1 On all the servers that need to use the cache, set the proxy configuration in their /etc/yum.conf file to be the cache server on port 3128.
18.2 for our case we will use CM server IP:
# grep proxy /etc/yum.conf
 proxy=http://192.168.56.201:3128
18.3 Test with "yum info jdk"; it should successfully load the info from the "cloudera-manager" repo, as set up on the CM node in step 17:
# yum info jdk
base                                                                                                                                             | 3.7 kB     00:00
base/primary_db                                                                                                                                  | 4.6 MB     00:01
extras                                                                                                                                           | 3.3 kB     00:00
extras/primary_db                                                                                                                                |  19 kB     00:00
updates                                                                                                                                          | 3.4 kB     00:00
updates/primary_db                                                                                                                               | 171 kB     00:00
Installed Packages
Name        : jdk
Arch        : x86_64
Epoch       : 2000
Version     : 1.6.0_31
Release     : fcs
Size        : 143 M
Repo        : installed
From repo   : cloudera-manager
Summary     : Java(TM) Platform Standard Edition Development Kit
URL         : http://java.sun.com/
License     : Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved. Also under other license(s) as shown at the Description field.
Description : The Java Platform Standard Edition Development Kit (JDK) includes both
: the runtime environment (Java virtual machine, the Java platform classes
: and supporting files) and development tools (compilers, debuggers,
: tool libraries and other tools).
:
: The JDK is a development environment for building applications, applets
: and components that can be deployed with the Java Platform Standard
: Edition Runtime Environment.


# yum list available|grep -i cloudera-manager
cloudera-manager-server.x86_64            5.0.2-1.cm502.p0.297.el6       cloudera-manager
cloudera-manager-server-db-2.x86_64       5.0.2-1.cm502.p0.297.el6       cloudera-manager
enterprise-debuginfo.x86_64               5.0.2-1.cm502.p0.297.el6       cloudera-manager


19. At this point all the VM hosts are ready to install Cloudera Hadoop.

20. Please follow the next post to install Cloudera Manager.

Thursday, November 6, 2014

OLTP vs MPP vs Hadoop


Some of my friends asked me about OLTP, MPP and Hadoop. I tried to explain them as below.
This reflects the state of things at the time of writing; things are changing so fast :).

OLTP Databases (Oracle, DB2) vs MPP (Netezza, Teradata, Vertica etc.):

                1. - Oracle or DB2 needs to read data from disk into memory before it starts processing, so it is very fast at in-memory calculation.
                   - MPP takes the processing as close as possible to the data, so there is less data movement.

                2. - Oracle or DB2 is good for smaller OLTP (transaction) operations. It also maintains a very high level of data integrity.
                   - MPP is good for batch processing. Some MPP systems (Netezza, Vertica) overlook integrity constraints, like enforcing unique keys, for the sake of batch performance.


Hadoop(without impala or EMC HAWQ) vs MPP:

                1. - A conventional MPP database stores data in a mature internal structure, so data loading and SQL processing are efficient.
                   - There is no such structured architecture for data stored on Hadoop, so accessing and loading data is not as efficient as in conventional MPP systems.
                2. - Conventional MPP supports only the relational model (row-column).
                   - Hadoop supports virtually any kind of data.

                * However, the main objective of MPP and Hadoop is the same: process data in parallel, near the storage.
             

Cloudera impala(or pivotal HAWQ) vs MPP:

                1. - MPP supports advanced in-database analytics.
                   - As of now, Impala (2.0) has started supporting "SQL 2003", which may lead to the introduction of in-database analytics.
                2. - MPP databases have industry-standard security features and a well-defined user schema structure.
                   - Impala has a very immature security system and virtually no user schema.
                3. - MPP supports only vendor-specific filesystems and needs to load data using a specific loading tool.
                   - Impala supports most open file formats (text, Parquet).

                * However, Impala seems set to become an MPP & columnar database system like Vertica, but cheap & open, in the near future. It just needs to implement security and advanced in-database analytics.


How to choose what (in general, and my personal opinion):


1. OLTP Databases (Oracle,DB2, MySQL, MS SQL, Exadata):
                - Transaction based application
                - Smaller DWH
                * However, Exadata is a hybrid system, and I have experience handling a DWH with ~20 TB of data on it.

2. MPP (Netezza, Teradata, Vertica)
                - Bigger data warehouse (may have tables larger than 4-5 TB)
                - Needs no or little pre-processing
                - Needs faster batch processing speed
                - In database analytics

3. Only Hadoop:
                - All data is heavily unstructured (documents, audio, video etc.)
                - Needs batch processing

4. Hadoop, mainly using Impala (or EMC HAWQ):
                - Need a DWH at low cost
                - No need for advanced analytics features
                - Can utilize open source tools
                - Not concerned about security, or has a limited number of users
             
5. Hadoop (with Impala or HAWQ) + MPP:
                - Some data needs heavy pre-processing before it is ready for advanced analytics.
                - Need a cheaper, queryable archive or backup for older data.


References:
http://www.quora.com/Is-Impala-aiming-to-be-an-open-source-alternative-to-existing-MPP-solutions
http://www.quora.com/What-features-of-a-relational-database-are-most-useful-when-building-a-data-warehouse
http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduc