Monday, January 19, 2015

Installing Cloudera Manager & Cloudera Hadoop (Install Cloudera Hadoop part 2 of 2)

Here I am sharing the cookbook I followed to install a development Cloudera Hadoop cluster using VirtualBox. Using almost the same method you can easily install Cloudera Hadoop in production.

Here is part 2 of 2: Cloudera Manager & Cloudera Hadoop installation and test.
This also includes HA configuration, gateway configuration, and RHadoop installation.

For host/node configuration steps, please check part 1.


         
-- ------------------------
-- 1. Prerequisite Checks
-- ------------------------
1.1 OS : RHEL 6.4 or CentOS 6.5
1.2 MySql : 5.x or later
1.3 Python : 2.4
1.4 RAM : 2 GB
1.5 Disk : - 5 GB on the partition hosting /var
           - 500 MB on the partition hosting /usr
1.6 Network : - ssh access to all the nodes/hosts
              - Name resolution either via /etc/hosts or DNS
              - The /etc/hosts file must not have duplicate IP addresses
1.7 Security: - root access, as the CM agent runs as root
              - No blocking by Security-Enhanced Linux (SELinux)
              - Disable IPv6 on all hosts
              - Make sure the required ports are open (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Installation-Guide/cm5ig_config_ports.html#concept_k5z_vwy_4j)
              - For RHEL, /etc/sysconfig/network should contain the hostname of the corresponding system
              - Requires root/sudo access

1.8 On RHEL and CentOS 5, install Python 2.6 or 2.7:
1.8.1 To install packages from the EPEL repository, first download the appropriate repository rpm package to your machine and then install Python using yum.
# su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'
#  yum install python26
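The checks in 1.4-1.7 can be scripted. This is a minimal sketch assuming standard Linux tools (awk, df; getenforce may be absent on non-SELinux hosts) and the thresholds listed above; adjust paths and limits for your environment.

```shell
# Rough prerequisite check for RAM, disk space, SELinux, and /etc/hosts
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
[ "$mem_kb" -ge 2097152 ] && echo "RAM: OK (${mem_kb} kB)" || echo "RAM: below 2 GB (${mem_kb} kB)"

# Free space (kB) on the partitions hosting /var and /usr
var_free=$(df -Pk /var | awk 'NR==2 {print $4}')
usr_free=$(df -Pk /usr | awk 'NR==2 {print $4}')
[ "$var_free" -ge 5242880 ] && echo "/var: OK" || echo "/var: less than 5 GB free"
[ "$usr_free" -ge 512000 ] && echo "/usr: OK" || echo "/usr: less than 500 MB free"

# SELinux should not be enforcing (silently skipped if getenforce is absent)
command -v getenforce >/dev/null 2>&1 && echo "SELinux: $(getenforce)"

# Duplicate IP addresses in /etc/hosts
dups=$(awk '!/^#/ && NF {print $1}' /etc/hosts | sort | uniq -d)
[ -z "$dups" ] && echo "/etc/hosts: no duplicate IPs" || echo "/etc/hosts: duplicate IPs: $dups"
```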

-- --------------------------------
-- 2. Install Cloudera Manager (CM)
-- --------------------------------

2.1 Establish Your Cloudera Manager Repository Strategy.
- We will have an internet connection on the CM node
- Other nodes will use the CM node as a proxy
- We have already configured this in the prepare_hadoop_hosts steps (part 1).
 
2.2 Install the Oracle JDK on CM node:
2.2.1 The JDK is included in the Cloudera Manager 5 repositories. Once you have the repo or list file in the correct place, you can install the JDK as follows:
# yum install oracle-j2sdk1.7

2.2.2 It is better to also install the native JDK rpm ("rpm -Uvh jdk-7u51-linux-x64.rpm"); sometimes the packaged one causes problems with 3rd party apps

2.3 Install the Cloudera Manager Server packages (** make sure to install 5.2; we made a change at step 2.1.3 for this)
# yum install cloudera-manager-daemons cloudera-manager-server
 
2.4 Prepare the external database (we will use MySQL):
ref: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Installation-Guide/cm5ig_mysql.html#cmig_topic_5_5_1_unique_1

2.4.1 Install MySQL 5.5.37. We are installing the rpms below:

- First install the mysql-community-release rpm:
# wget http://dev.mysql.com/get/mysql-community-release-el6-5.noarch.rpm
# rpm -Uvh mysql-community-release-el6-5.noarch.rpm

- Then use yum to install Mysql:
# yum install mysql-community-server


- So, at the end of the installation, the mysql packages below will be there:
mysql-community-release-el6-5.noarch
mysql-community-client-5.5.37-4.el6.x86_64
mysql-community-common-5.5.37-4.el6.x86_64
mysql-community-libs-5.5.37-4.el6.x86_64   
mysql-community-server-5.5.37-4.el6.x86_64

2.4.2 Configuring and Starting the MySQL Server:
 a. Stop the MySQL server if it is running
  $ service mysqld stop
 b. Move old InnoDB log files /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1 out of /var/lib/mysql/ to a backup location
 c. Determine the location of the option file, my.cnf (normally /etc/my.cnf)
 d. Update my.cnf so that it conforms to the following requirements:
- To prevent deadlocks, Cloudera Manager requires the isolation level to be set to read committed.
- Configure the InnoDB engine. Cloudera Manager will not start if its tables are configured with the MyISAM engine. This can be checked using:
mysql> show table status;
- Cloudera recommends that you set the innodb_flush_method property to O_DIRECT
- Set the max_connections property according to the size of your cluster. Clusters with fewer than 50 hosts can be considered small clusters.
- Ours is a small cluster, so we will put all databases on the same host where CM is installed
- Allow 100 maximum connections per database, then add 50 extra. So, for 2 DBs it would be 2x100+50=250.
- For our case it is a very small installation with 6-7 DBs; we are setting it to 550, which should be good enough.
- So, typically, our MySQL config (my.cnf) will be as below:

Note: Need to create the bin log location and change ownership to mysql user as below:
mkdir -p /opt/mysql/binlog/
chown -R mysql:mysql /opt/mysql/binlog/

------------start my.cnf---------------
[mysqld]
transaction-isolation=READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
# symbolic-links=0

key_buffer              = 8M
key_buffer_size         = 16M
max_allowed_packet      = 16M
thread_stack            = 64K
thread_cache_size       = 32
query_cache_limit       = 8M
query_cache_size        = 16M
query_cache_type        = 1

max_connections         = 550

# log_bin should be on a disk with enough free space
# NOTE: replace '/x/home/mysql/logs/binary' below with
#       an appropriate path for your system.
log_bin=/opt/mysql/binlog/mysql_binary_log

# For MySQL version 5.1.8 or later. Comment out binlog_format for older versions.
binlog_format           = mixed

read_buffer_size = 2M
read_rnd_buffer_size = 8M
sort_buffer_size = 8M
join_buffer_size = 8M

# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit  = 2
innodb_log_buffer_size          = 16M
innodb_buffer_pool_size         = 120M
innodb_thread_concurrency       = 8
innodb_flush_method             = O_DIRECT
innodb_log_file_size = 512M

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
------------end my.cnf---------------


2.4.3 Ensure the MySQL server starts at boot
# /sbin/chkconfig mysqld on
# /sbin/chkconfig --list mysqld
mysqld          0:off   1:off   2:on    3:on    4:on    5:on    6:off

2.4.4 Start the MySQL server:
# service mysqld start

2.4.5 Set the MySQL root password:
# /usr/bin/mysql_secure_installation

2.4.6 Installing the MySQL JDBC Connector:
Note: Do not use the yum install command to install the MySQL connector package, because it installs openJDK and then uses the Linux alternatives command to set the system JDK to openJDK.

- Install the JDBC connector on the Cloudera Manager Server host, as well as on hosts to which you assign the Activity Monitor, Reports Manager, Hive Metastore, Sentry Server, and Cloudera Navigator Audit Server roles. In our case they are all on the same host.

- It is better to use this process to avoid errors like "MySQLSyntaxErrorException" on impala/hive. If you installed using yum, do this step again.

- download it from http://dev.mysql.com/downloads/connector/j/
# wget "http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.34.tar.gz"
- Extract the JDBC driver JAR file from the downloaded file; for example:
# tar -xvf mysql-connector-java-5.1.34.tar.gz
- Add the JDBC driver, renamed, to the relevant server; for example:
# mkdir /usr/share/java/
# cp mysql-connector-java-5.1.34/mysql-connector-java-5.1.34-bin.jar /usr/share/java/
# ln -s /usr/share/java/mysql-connector-java-5.1.34-bin.jar /usr/share/java/mysql-connector-java.jar
# /usr/share/java/mysql-connector-java.jar -> /usr/share/java/mysql-connector-java-5.1.34-bin.jar

2.4.7 Create databases & users for Activity Monitor, Reports Manager, Hive Metastore, Sentry Server, and Cloudera Navigator Audit Server. The databases must be configured to support UTF-8 character set encoding.

For Activity Monitor:
mysql> create database amon DEFAULT CHARACTER SET utf8;
mysql> grant all on amon.* TO 'amon'@'localhost' IDENTIFIED BY 'password'; 
mysql> grant all on amon.* TO 'amon'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server

For Reports Manager:
mysql> create database rman DEFAULT CHARACTER SET utf8;
mysql> grant all on rman.* TO 'rman'@'localhost' IDENTIFIED BY 'password';
mysql> grant all on rman.* TO 'rman'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';

     -- here using local host, as all the services running in the same server

For Hive Metastore Server:
mysql> create database metastore DEFAULT CHARACTER SET utf8;
mysql> grant all on metastore.* TO 'hive'@'localhost' IDENTIFIED BY 'password';
mysql> grant all on metastore.* TO 'hive'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server

For Sentry Server:
mysql> create database sentry DEFAULT CHARACTER SET utf8;
mysql> grant all on sentry.* TO 'sentry'@'localhost' IDENTIFIED BY 'password';
mysql> grant all on sentry.* TO 'sentry'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server

For Cloudera Navigator Audit Server:
mysql> create database nav DEFAULT CHARACTER SET utf8;
mysql> grant all on nav.* TO 'nav'@'localhost' IDENTIFIED BY 'password'; 
mysql> grant all on nav.* TO 'nav'@'dvhdmgt1.example.com' IDENTIFIED BY 'password';
     -- here using local host, as all the services running in the same server  
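The five blocks of per-service statements above can also be generated in one go. This is a hypothetical helper, not part of the original procedure; 'dvhdmgt1.example.com' and the 'password' literal mirror the examples above. Review the output, then pipe it into: mysql -u root -p

```shell
# Emit the CREATE DATABASE / GRANT statements for all five service databases
CM_HOST="dvhdmgt1.example.com"   # CM/server host used in the grants above
PASS="password"                  # replace with a real password
# db:user pairs matching the per-service statements above
SQL=$(for pair in amon:amon rman:rman metastore:hive sentry:sentry nav:nav; do
  db=${pair%%:*}; user=${pair##*:}
  echo "create database ${db} DEFAULT CHARACTER SET utf8;"
  echo "grant all on ${db}.* TO '${user}'@'localhost' IDENTIFIED BY '${PASS}';"
  echo "grant all on ${db}.* TO '${user}'@'${CM_HOST}' IDENTIFIED BY '${PASS}';"
done)
echo "$SQL"
```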
2.4.8 Back up all the DBs:
# mysqldump -u root -p --all-databases > alldb_backup.sql

2.4.9 Run the scm_prepare_database.sh script (for an Installer or package install)
on the host where the Cloudera Manager Server package is installed. The script prepares the database by:
- Creating the Cloudera Manager Server database configuration file.
- Creating a database for the Cloudera Manager Server to use. This is optional and is only completed if options are specified.
- Setting up a user account for the Cloudera Manager Server. This is optional and is only completed if options are specified.
mysql > grant all on *.* to 'temp'@'%' identified by 'temp' with grant option;
# /usr/share/cmf/schema/scm_prepare_database.sh mysql -h localhost -utemp -ptemp --scm-host localhost scm scm scm
 -- The log4j errors appear but don't seem to be harmful.

mysql> drop user 'temp'@'%';

2.4.10 Remove the embedded PostgreSQL properties file. For an Installer or package install, do the below:
# rm /etc/cloudera-scm-server/db.mgmt.properties

2.4.11 ** We must create the databases before running the Cloudera Manager installation wizard if we chose the external database option.


2.4.12 ** External Databases for Hue and Oozie.
      - Hue and Oozie are automatically configured with databases, but you can configure these services to use external databases after Cloudera Manager is installed.
  http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Managing-Clusters/cm5mc_hue_service.html#cmig_topic_15_unique_1
  http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Managing-Clusters/cm5mc_oozie_service.html#cmig_topic_14_unique_1

2.5 Start Cloudera Manager and search for target hosts:
2.5.1 Run this command on the Cloudera Manager Server host to start Cloudera Manager:
# service cloudera-scm-server start

2.5.2 Wait several minutes for the Cloudera Manager Server to complete its startup and monitor log as below:
# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log #wait until you see "Started Jetty server."

2.5.3 In a web browser, enter http://<Server host>:7180

2.5.4 Log into Cloudera Manager Admin Console. The default credentials are: Username: admin Password: admin


2.6 Choose Cloudera Manager Edition and Hosts
2.6.1 When you start the Cloudera Manager Admin Console, the install wizard starts up. Click Continue to get started
2.6.2 Choose which edition to install.
- For our case we will install the "Cloudera Enterprise Data Hub Edition Trial", which does not require a license but expires after 60 days and cannot be renewed
- "Continue"
2.6.3 The cluster configuration page appears.
- (optional) Click the "Cloudera Manager" logo to skip the default installation
- Go to "Administration > Settings > Parcels"
- Add the desired "Remote Parcel Repository URLs". We are going to install CDH 5.0.2, so we will add the below:
http://archive.cloudera.com/cdh5/parcels/5.0.2/
- "Save Changes"


2.6.4 Search for and choose hosts as below:
- Cloudera Manager Home > hosts > Add new hosts to cluster. Add hosts option will appear.
- To enable Cloudera Manager to automatically discover hosts on which to install CDH and managed services, enter the cluster hostnames or IP addresses. You can also specify hostname and IP address ranges:
a. An IP range like "10.1.1.[1-4]" or a hostname pattern like "host[1-3].company.com"; for our case 192.168.56.[101-103],192.168.56.[201-203]
b. The scan results will include all addresses scanned, but only scans that reach hosts running SSH will be selected for inclusion in your cluster by default. 
c. Click Search. Cloudera Manager identifies the hosts on your cluster to allow you to configure them for services.
d. Verify that the number of hosts shown matches the number of hosts where you want to install services. 
e. Click Continue. The Select Repository page displays.
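The bracketed range patterns above expand much like a shell loop; if you want to preview which addresses a pattern will cover before pasting it into the wizard, a quick sketch:

```shell
# Preview the IPs covered by 192.168.56.[101-103] and 192.168.56.[201-203]
for i in $(seq 101 103) $(seq 201 203); do
  echo "192.168.56.$i"
done
```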
2.6.5 To avoid getting stuck at "Acquiring installation lock..." during installation, do the below on all the nodes:
# rm /tmp/.scm_prepare_node.lock

3. Install Cloudera Manager Agent, CDH, and Managed Service Software

mysql bug: http://bugs.mysql.com/bug.php?id=63085

3.1 Select how CDH and managed service software is installed: packages or parcels. We will use parcels
3.2 Choose the parcels to install. The choices you see depend on the repositories you have chosen – a repository may contain multiple parcels. Only the parcels for the latest supported service versions are configured by default.
3.3 Choose "CDH-5.0.2-1.cdh5.0.2.p0.13" and keep the rest as default

3.5 Install the Cloudera Manager Agent
3.5.1 Select the release of Cloudera Manager Agent to install.
3.5.2 Click Continue.
3.5.3 Leave Install Oracle Java SE Development Kit (JDK) checked to allow Cloudera Manager to install the JDK on each cluster host or uncheck if you plan to install it yourself. Click Continue.
3.5.4 Provide SSH login credentials.
3.5.5 Click Continue. If you did not install packages manually, Cloudera Manager installs the Oracle JDK, Cloudera Manager Agent packages, and CDH and managed service packages or parcels.
3.5.6 When the Continue button appears at the bottom of the screen, the installation process is completed. Click Continue.
3.5.7 The Host Inspector runs to validate the installation, and provides a summary of what it finds, including all the versions of the installed components. 
     If the validation is successful, click Finish. The Cluster Setup page displays.

3.6 Add Services
3.6.1 In the first page of the Add Services wizard you choose the combination of services to install and whether to install Cloudera Navigator. Click the radio button next to the combination of services to install.
Some services depend on other services; for example, HBase requires HDFS and ZooKeeper. Cloudera Manager tracks dependencies and installs the correct combination of services.
3.6.2 The Flume service can be added only after your cluster has been set up.
3.6.3 If you have chosen Data Hub Edition Trial or Cloudera Enterprise, optionally check the Include Cloudera Navigator checkbox to enable Cloudera Navigator.
3.6.4 Click Continue. The Customize Role Assignments page displays.
3.6.5 Customize the assignment of role instances to hosts (datanodes, namenodes, resource manager, etc.). Hosts can be chosen similar to step 2.6.4 a.
3.6.6 When you are satisfied with the assignments, click Continue. The Database Setup page displays.
3.6.7 Enter the database host, database type, database name, username, and password for the database that you created when you set up the database.
3.6.8 Click Test Connection to confirm that Cloudera Manager can communicate with the database using the information you have supplied. If the test succeeds in all cases, click Continue; 
3.6.9 Review the configuration changes to be applied. 
   - Confirm the settings entered for file system paths for HDFS and others. 
- Make sure to add 3 nodes for ZooKeeper.
- Do not make the namenode the HBase master.
The file paths required vary based on the services to be installed. 
Click Continue. The wizard starts the services.
3.6.10 When all of the services are started, click Continue. You will see a success message indicating that your cluster has been successfully started.
3.6.11 There will be some configuration alarms since we installed with low resources. Fix them as much as possible.
Some useful fixes:
a. - If needed, delete services in the order below:
Oozie
Impala
Hive
HBase
Spark
Sqoop2
YARN
HDFS
ZooKeeper
 - Add services back in the reverse order
b. While reinstalling HDFS, make sure the name directories of the NameNode are empty (default /dfs/nn; on the secondary namenode /dfs/snn; on datanodes /data/0[1-3..])
for our case:
   ssh dvhdnn1  "rm -rf /dfs/nn/*"
ssh dvhdjt1  "rm -rf /dfs/snn/*"
ssh dvhddn01  "rm -rf /data/01/*"
ssh dvhddn01  "rm -rf /data/02/*"
ssh dvhddn01  "rm -rf /data/03/*"
ssh dvhddn02  "rm -rf /data/01/*"
ssh dvhddn02  "rm -rf /data/02/*"
ssh dvhddn02  "rm -rf /data/03/*"
ssh dvhddn03  "rm -rf /data/01/*"
ssh dvhddn03  "rm -rf /data/02/*"
ssh dvhddn03  "rm -rf /data/03/*"
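The per-host cleanup commands above can be generated with a short loop. This sketch prints the commands for review rather than executing them (hostnames and paths follow this post); pipe the output into sh once it looks right.

```shell
# Print (rather than run) the NameNode/datanode directory cleanup commands
CLEANUP=$(
  echo 'ssh dvhdnn1 "rm -rf /dfs/nn/*"'
  echo 'ssh dvhdjt1 "rm -rf /dfs/snn/*"'
  for host in dvhddn01 dvhddn02 dvhddn03; do
    for d in 01 02 03; do
      echo "ssh ${host} \"rm -rf /data/${d}/*\""
    done
  done
)
echo "$CLEANUP"
```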

c. While re-installing HBase (to avoid "TableExistsException: hbase:namespace"):
- Stop the existing HBase service
- Do the below from one of the servers running an HBase service (for our case we can use the CM node, dvhdmgt1)
# hbase zkcli
[zk: localhost:2181(CONNECTED) 0] rmr /hbase   # << this command removes the existing znode
- Delete the existing HBase service
- Try adding HBase again
d. Eliminate "Failed to access Hive warehouse: /user/hive/warehouse" errors in Hue or Beeswax:
- # su - hdfs
- # hadoop fs -mkdir /user/hive
- # hadoop fs -mkdir /user/hive/warehouse
- # hadoop fs -chown -R hive:hive /user/hive
- # hadoop fs -chmod -R 1775 /user/hive/
- restart hue service
-- -----------------------
-- 4 Test the Installation
-- -----------------------
4.1 login to CM web console
4.2 All the services should be running with Good Health on CM console.
4.3 Click the Hosts tab where you can see a list of all the Hosts along with the value of their Last Heartbeat. By default, every Agent must heartbeat successfully every 15 seconds. 
4.4 Running a MapReduce Job
4.4.1 Log into a host in the cluster.
4.4.2 Run MapReduce jobs as below; they should run successfully
a. Run pi example:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
b. Wordcount example:
# su - hdfs
$ echo 'Hello World, Bye World!' > /tmp/file01
$ echo 'Hello Hadoop, Goodbye to hadoop.' > /tmp/file02
$ hadoop fs -mkdir /tmp/input/
$ hadoop fs -put /tmp/file01 /tmp/input/file01
$ hadoop fs -put /tmp/file02 /tmp/input/file02
$ hadoop fs -cat /tmp/input/file01
Hello World, Bye World!

$ hadoop fs -cat /tmp/input/file02
Hello Hadoop, Goodbye to hadoop.

$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar wordcount /tmp/input/ /tmp/output/
 
$ hadoop fs -ls /tmp/output/
Found 2 items
-rw-r--r--   3 hdfs supergroup          0 2014-11-08 06:31 /tmp/output/_SUCCESS
-rw-r--r--   3 hdfs supergroup         32 2014-11-08 06:31 /tmp/output/part-r-00000

$ hadoop fs -cat /tmp/output/part-r-00000
Bye     1
Goodbye 1
Hadoop, 1
Hello   2
World!  1
World,  1
hadoop. 1
to      1
4.4.3 Monitor the above MapReduce job at "Clusters > ClusterName > YARN Applications"
4.4.4 Testing Impala
- create the datafile locally
$ cat /tmp/tab1.csv
1,true,123.123,2012-10-24 08:55:00 
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33

- copy the file to hdfs
# su - hdfs
$ hadoop fs -mkdir /tmp/tab1/
$ hadoop fs -put /tmp/tab1.csv /tmp/tab1/tab1.csv

- login to the impala shell
# impala-shell -i dvhddn01

- Create a text based table
CREATE EXTERNAL TABLE TMP_TAB1
(
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE,
  col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1';

select * from TMP_TAB1;

- create a PARQUET table
create table TAB1 (
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE,
  col_3 TIMESTAMP
) STORED AS PARQUET;

insert into TAB1 select * from TMP_TAB1;

select * from TAB1;

4.4.5 Test with Hue
- Log into the Hue web console http://dvhdmgt1:8888
- Access the tables created in step 4.4.4 using Hive
- Access the tables created in step 4.4.4 using Impala
-- ------------------
-- 5 Install RHadoop
-- ------------------
http://ashokharnal.wordpress.com/2014/01/16/installing-r-rhadoop-and-rstudio-over-cloudera-hadoop-ecosystem-revised/
https://github.com/RevolutionAnalytics/RHadoop/wiki
*** should have internet access
5.1 On the same node where CM is installed, install R & R-devel
# wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# rpm -Uvh epel-release-6-8.noarch.rpm
# yum clean all
# yum install R R-devel
5.2 Use the R shell to install the prerequisite packages as below:
# R
> install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2","caTools"))

5.3 Download the rhdfs and rmr2 packages to your local download folder from "https://github.com/RevolutionAnalytics/RHadoop/wiki"
cd /tmp
wget "https://raw.githubusercontent.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz"
or curl -O https://raw.githubusercontent.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz

wget "https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/master/build/rmr2_3.2.0.tar.gz"
or curl -O https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/master/build/rmr2_3.2.0.tar.gz

5.4 Make sure the env variables below are set in the .bash_profile of root or the sudo user:

export PATH

export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar
export JAVA_HOME=/usr/java/jdk1.7.0_51

export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/opt/cloudera/parcels/CDH/lib64:/usr/java/jdk1.7.0_45-cloudera/jre/lib/amd64/server
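Before starting R, it is worth confirming the variables above are actually set in the current shell. A small sketch, safe to run as any user:

```shell
# Report which of the RHadoop-related environment variables are set
for v in HADOOP_CMD HADOOP_STREAMING JAVA_HOME LD_LIBRARY_PATH; do
  eval "val=\$$v"
  if [ -n "$val" ]; then echo "$v=$val"; else echo "$v is NOT set"; fi
done
```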



5.5 Install the downloaded rhdfs & rmr2 packages in the R shell by specifying their location on your machine.
# R
> install.packages("/tmp/rhdfs_1.0.8.tar.gz", repos = NULL, type="source")
> install.packages("/tmp/rmr2_3.2.0.tar.gz", repos = NULL, type="source")
5.6 Now RHadoop is installed

5.7 Test RHadoop (if using a new non-root/non-sudo user, make sure the env variables related to hadoop & rhadoop are set)
5.8 Test the installation as below:

Note: if you get an error like "Unable to find JAAS classes", make sure you installed the native JDK (step 2.2.2) and set the env variables (step 5.4)

# su - hdfs
$R
> library(rmr2)
> library(rJava)
> library(rhdfs)
> hdfs.init() 
> hdfs.ls("/") 
> ints = to.dfs(1:100)
> calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
> from.dfs(calc)


[You will get a long series of output something as below]

$key
NULL

$val
v
 [1,]   1   2
 [2,]   2   4
 [3,]   3   6
 [4,]   4   8
 [5,]   5  10
 [6,]   6  12
.............
.............
 [98,]  98 196
 [99,]  99 198
[100,] 100 200


-- --------------
 6. Install Mahout
-- --------------
   6.1 Install Mahout on the CM node; here we are using yum
# yum install mahout
   
   6.2 Mahout will be accessible using the below:
# /usr/bin/mahout
  
   6.3 The above command should not give any error


-- ----------------------------
7. Configure High Availability
-- ----------------------------

http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-High-Availability-Guide/CDH5-High-Availability-Guide.html

7.1 Configuring HDFS High Availability (using the wizard)
7.1.1 CM Home > HDFS > Instances > click the "Enable High Availability" button
7.1.2 Select 3 JournalNodes (dvhdmgt1, dvhdnn1 & dvhdjt1) and a standby node (dvhdjt1). Continue
7.1.3 Select your nameservice name. I kept the default "nameservice1". Continue
7.1.4 Review Changes and provide a value for dfs.journalnode.edits.dir (I provided /dfs/jn)
 Keep the rest default (most of them clean & re-initialize existing services). Continue
7.1.5 It will fail formatting the "Name directories of the current NameNode"; this failure is expected. Just ignore it.
7.1.6 The following manual steps must be performed after completing this wizard:
- CM Home > Hive > Action > "Stop"
- (optionally) Backup Hive metastore
- CM Home > Hive > Action > "Update Hive Metastore NameNodes"
- CM Home > Hive > Action > "Restart"
- CM Home > impala > Action > "Restart"
- CM Home > Hue > Action > "Restart"
- Test on previous MR, hive & impala data

7.2 Configuring High Availability for ResourceManager (MRv2/YARN)
    7.2.1 Stop all YARN daemons
- CM > YARN > Action > Stop

7.2.2 Update the configuration in yarn-site.xml (use CM)
-- dvhdjt1, we will name it resource manager 1 (rm1)
-- dvhdnn1, we will name it resource manager 2 (rm2)
-- Append the below to /etc/hadoop/conf/yarn-site.xml on dvhdjt1 and copy it to all nodes (except the CM node)



 
<property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>2000</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarnRM</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
</property>
<!-- on the rm2 host (dvhdnn1) set this value to rm2 -->
<property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
    <name>yarn.resourcemanager.zk.state-store.address</name>
    <value>localhost:2181</value>
</property>
<property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
    <value>5000</value>
</property>

<!-- rm1 (dvhdjt1) addresses -->
<property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>dvhdjt1:23140</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>dvhdjt1:23130</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>dvhdjt1:23189</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>dvhdjt1:23188</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>dvhdjt1:23125</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>dvhdjt1:23141</value>
</property>

<!-- rm2 (dvhdnn1) addresses -->
<property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>dvhdnn1:23140</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>dvhdnn1:23130</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.https.address.rm2</name>
    <value>dvhdnn1:23189</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>dvhdnn1:23188</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>dvhdnn1:23125</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>dvhdnn1:23141</value>
</property>

<!-- NodeManager settings -->
<property>
    <description>Address where the localizer IPC is.</description>
    <name>yarn.nodemanager.localizer.address</name>
    <value>0.0.0.0:23344</value>
</property>
<property>
    <description>NM Webapp address.</description>
    <name>yarn.nodemanager.webapp.address</name>
    <value>0.0.0.0:23999</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/tmp/pseudo-dist/yarn/local</value>
</property>
<property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/tmp/pseudo-dist/yarn/log</value>
</property>
<property>
    <name>mapreduce.shuffle.port</name>
    <value>23080</value>
</property>
 


7.2.3 Restart the YARN daemons
- CM > YARN > Instances > Add > for ResourceManager select "dvhdnn1" > Continue
- CM > YARN > Instances > Select All > Actions for Selected > "Restart"

7.2.4 Using yarn rmadmin to Administer ResourceManager HA
- yarn rmadmin has the following options related to RM HA:
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]
where <serviceId> is the rm-id (for our case rm1 & rm2), e.g.:
# yarn rmadmin -getServiceState rm1


-- ----------------
8. Install gateways
-- ----------------

    8.1 Impala proxy:

Ref: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_proxy.html

8.1.1 Install haproxy
# yum install haproxy

8.1.2 Set up the configuration file /etc/haproxy/haproxy.cfg as below
--------------------------Start Config File-------------------------------
global
    # To have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local0
    log         127.0.0.1 local1 notice
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    maxconn                 3000
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms



#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats :25002
    balance
    mode http
    stats enable
    stats auth username:password

# This is the setup for Impala. impala-shell/beeswax clients connect to load_balancer_host:25004.
# HAProxy will balance connections among the list of servers listed below.
# Each impalad listens on port 21000 for beeswax (impala-shell) or the original ODBC driver.
# For JDBC or ODBC version 2.x drivers, use port 21050 instead of 21000 (second listen block below).
listen impala :25004
    mode tcp
    option tcplog
    balance leastconn

    server impald1 dvhddn01.example.com:21000
    server impald2 dvhddn02.example.com:21000
    server impald3 dvhddn03.example.com:21000


# For impala JDBC or ODBC
listen impala :25003
    mode tcp
    option tcplog
    balance leastconn

    server impald1 dvhddn01.example.com:21050
    server impald2 dvhddn02.example.com:21050
    server impald3 dvhddn03.example.com:21050


--------------------------End Config File-------------------------------
     ** we have configured 25003 for the JDBC/ODBC Impala connection
     ** we have configured 25004 for the impalad/impala-shell connection
 
8.1.3 Run the load balancer (on a single host, preferably one not running impalad; in our case dvhdmgt1, dvhdnn1 & dvhdjt1):

# haproxy -c -f /etc/haproxy/haproxy.cfg   # optional: validate the config first
# service haproxy start

- Ignore warnings like the below:
Starting haproxy: [WARNING] 322/090925 (15196) : config : 'option forwardfor' ignored for proxy 'impala' as it requires HTTP mode.
or 
Starting haproxy: [WARNING] 329/052137 (32754) : Parsing [/etc/haproxy/haproxy.cfg:73]: proxy 'impala' has same name as another proxy (declared at /etc/haproxy/haproxy.cfg:62).
[WARNING] 329/052137 (32754) : config : 'option forwardfor' ignored for proxy 'impala' as it requires HTTP mode.
[WARNING] 329/052137 (32754) : config : 'option forwardfor' ignored for proxy 'impala' as it requires HTTP mode.


8.1.4 Connect to Impala from any of the haproxy nodes as below:
# impala-shell -i dvhdmgt1:25004
> use axdb;
> select count(*) from f_ntw_actvty_http;

8.1.5 Enable haproxy to start on boot:
# chkconfig haproxy on


    8.2 HttpFS gateway:
   The HttpFS gateway is normally installed with the Cloudera Hadoop parcel installation (step 3.6 Add Services).
8.2.1 Check CM > HDFS > Instances to see whether any HttpFS nodes are there or not.
8.2.2 If not, then CM > HDFS > Instances > Add Role Instances > HttpFS > add your hosts > follow the next instructions
8.2.3 After installation completes, check with the below (from each of the IPs of the nodes where HttpFS is installed):
curl "http://192.168.56.201:14000/webhdfs/v1?op=gethomedirectory&user.name=hdfs"
curl 'http://192.168.56.202:14000/webhdfs/v1/?user.name=hdfs&op=open'
curl 'http://192.168.56.203:14000/webhdfs/v1/tmp/tab1/tab1.csv?user.name=hdfs&op=open'
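A healthy gethomedirectory call typically returns a small JSON document naming the user's home path. A minimal sketch of checking such a reply (canned response below, assuming the hdfs user; on a real node, pipe the curl output in instead):

```shell
# Canned reply of the kind gethomedirectory typically returns (assumption:
# the calling user is hdfs, so the home path would be /user/hdfs).
resp='{"Path":"\/user\/hdfs"}'

# A reply containing a Path field means HttpFS is serving requests.
echo "$resp" | grep -q '"Path"' && echo "HttpFS reply looks OK"
```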



8. Install RImpala
   ref: http://blog.cloudera.com/blog/2013/12/how-to-do-statistical-analysis-with-impala-and-r/
   - On a node connected to the cluster, preferably one not running an Impala daemon (in our case the CM/mgt node):
8.1 mkdir -p /usr/lib/impala/lib
8.2 cd /usr/lib/impala/lib
8.3 wget "https://downloads.cloudera.com/impala-jdbc/impala-jdbc-0.5-2.zip"
8.4 unzip impala-jdbc-0.5-2.zip
It will extract to ./impala-jdbc-0.5-2; take note of the full path "/usr/lib/impala/lib/impala-jdbc-0.5-2"
8.5 # R
> install.packages("RImpala")
 - select the mirror
 - A successful installation shows logs like below:
 * DONE (RImpala)
Making 'packages.html' ... done

The downloaded source packages are in
‘/tmp/RtmpIayO6J/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
8.6 Initialize impala for R
> library(RImpala)
Loading required package: rJava
> rimpala.init(libs="/usr/lib/impala/lib/impala-jdbc-0.5-2/")  # Path noted in 8.4
[1] "Classpath added successfully"
> rimpala.connect("dvhdmgt1", "25003") # here we use the Impala gateway host & port for Impala JDBC (step 7)
[1] TRUE
> rimpala.invalidate()
[1] TRUE
> rimpala.showdatabases()
 name
1 _impala_builtins
2          default
> rimpala.usedatabase(db="default")
> rimpala.showtables()
  name
1 sample_07
2      tab1
3  tmp_tab1
> rimpala.describe("tab1")
   name      type comment
1    id       int
2 col_1   boolean
3 col_2    double
4 col_3 timestamp
> data = rimpala.query("Select * from tab1")
> data
id col_1      col_2                     col_3
1  1  true    123.123     2012-10-24 08:55:00.0
2  2 false   1243.500     2012-10-25 13:40:00.0
3  3 false  24453.325   2008-08-22 09:33:21.123
4  4 false 243423.325 2007-05-12 22:32:21.33454
> rimpala.close()
[1] TRUE
>



Ref: 
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-0-0/Cloudera-Manager-Installation-Guide/Cloudera-Manager-Installation-Guide.html

Preparing Hosts(VMs) to Install Cloudera Hadoop (Install Cloudera Hadoop part 1 of 2):

Here I am sharing the activity cookbook I followed to install a development environment for a Cloudera Hadoop cluster using VirtualBox. Using almost the same method, you can easily install Cloudera Hadoop in production.

Here is part 1 of 2, hosts/nodes preparation. For Cloudera Manager & Cloudera Hadoop installation, please check part 2.


1. Target:

Hadoop:
Hadoop Cluster with below:
- 1 management node: acting as gateway to the cluster and hosting Cloudera Manager, with 4G RAM, 2 vcores, 25G local disk
- 1 name node: acting as primary namenode and standby resource manager, with 2G RAM, 1 vcore, 25G local disk
- 1 resource manager node: acting as primary resource manager and standby namenode, with 2G RAM, 1 vcore, 25G local disk
- 3 datanodes: running all Hadoop worker processes, each with 3 disk volumes, 2G RAM, 1 vcore, 25G local disk
- Only the management node will have an internet connection; the other 5 will not. This emulates a production data center environment.

VMs for hadoop:
First configure 1 base virtual guest with:
- can ssh from host
- can connect to internet
- can interact with other guest on the same host
- having 25GB local storage
- having 1 GB ram
- 1 virtual CPU

2. Target virtual machine:
We used VirtualBox as our virtualization software.
Make sure virtualization support is activated on the host. If not enabled, please enable it from the BIOS.

3. Create a virtual host with the below network config:
Adapter 1: hostonly (on eth0)
Adapter 2: NAT (on eth1)

4. Install CentOS/RHEL with the required partitions. In our case we use only the 3 partitions below; this helps use all the available space:
- /
- /boot
- /home

5. Configure network as below:

a. On the host-only network (eth0):
- On the VirtualBox guest configuration: the default IP for the host-only virtual interface on the host machine is 192.168.56.1 with no gateway; keep it untouched.

- In the eth0 configuration file:
- Keep DEVICE, HWADDR & UUID untouched.
- Change ONBOOT=yes
- Set BOOTPROTO=static
- Add IPADDR=<the node's IP>
- No gateway or other settings; if any are present, remove them.
- This interface will be used to communicate with the host and the other guests.

- For example, the ifcfg-eth0 file looks like below; for the other machines only IPADDR changes:
[root@base ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
HWADDR=08:00:27:C6:2A:31
TYPE=Ethernet
UUID=e8e7aafe-2033-4cc5-ae93-bea49b5b3528
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.56.201
[root@base ~]#

b. On NAT (eth1):
- change ONBOOT=yes 
- add BOOTPROTO=dhcp (if not there)
- keep the rest untouched
- this interface will be used to access the internet
- For example ifcfg-eth1 will look like below:
[root@base ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
HWADDR=08:00:27:AC:01:6D
TYPE=Ethernet
UUID=5ba5cf93-7526-4acb-88d9-9ce14439df80
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=dhcp

c. Change your hostname as below:
[root@base ~]# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=base.example.com
[root@base ~]#


d. No need to change anything on DNS (resolv.conf)

e. (Optional) Set the swappiness to 0 to avoid swapping, as per the recommendation of Cloudera Manager:
# vi /etc/sysctl.conf
vm.swappiness = 0
# sysctl -p
# cat /proc/sys/vm/swappiness
0

Ref: https://blogs.oracle.com/fatbloke/entry/networking_in_virtualbox1

6. Disable SELinux:
vi /etc/selinux/config
SELINUX=disabled

7. Disable IPv6 by issuing the below commands as root:
# vi /etc/sysctl.conf and add the below two lines:
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.all.disable_ipv6 = 1

then run "sysctl -p" as root
# sysctl -p

9. Disable the firewall:
chkconfig iptables off

10. Disable the yum fastestmirror plugin:
vi /etc/yum/pluginconf.d/fastestmirror.conf
enabled=0

11. Reboot and check that all changes are persistent.


12. Setup SSH

To simplify access between hosts, install SSH, set up SSH keys, and mark them as already authorized.
- Do the below on the base node. Since we will clone this base to create the other nodes, the keys will already be there; no need to copy them again.

$ yum -y install perl openssh-clients
$ ssh-keygen (type enter, enter, enter)
$ cd ~/.ssh
$ cp id_rsa.pub authorized_keys
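The same key setup can be done non-interactively; a sketch (using a temporary directory here for safety — on the base node the directory would be ~/.ssh):

```shell
# Generate a passphrase-less key pair and authorize it for ourselves.
# KEYDIR is a temp dir for this demo; use ~/.ssh on the real base node.
KEYDIR=$(mktemp -d)
chmod 700 "$KEYDIR"
ssh-keygen -q -t rsa -N '' -f "$KEYDIR/id_rsa"
cp "$KEYDIR/id_rsa.pub" "$KEYDIR/authorized_keys"
chmod 600 "$KEYDIR/authorized_keys"
```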

- Modify the SSH configuration file: uncomment the following line and change the value to no; this prevents the host-key confirmation question when connecting to a host with SSH.

# vi /etc/ssh/ssh_config
StrictHostKeyChecking no

13. Edit the /etc/hosts file as per your need:
vi /etc/hosts
192.168.56.201 dvhdmgt1.example.com dvhdmgt1  # Management node hosting Cloudera Manager
192.168.56.202 dvhdnn1.example.com dvhdnn1 # Name node
192.168.56.203 dvhdjt1.example.com dvhdjt1 # Jobtracker/Resource Manager
192.168.56.101 dvhddn01.example.com dvhddn01 # Datanode1
192.168.56.102 dvhddn02.example.com dvhddn02 # Datanode2
192.168.56.103 dvhddn03.example.com dvhddn03 # Datanode3
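Prerequisite 1.6 says the hosts file must not have duplicate IP addresses. A quick check, shown on a throwaway copy here (point HOSTS_FILE at /etc/hosts on a real node):

```shell
# Build a sample hosts file; on a real node use HOSTS_FILE=/etc/hosts instead.
HOSTS_FILE=$(mktemp)
cat > "$HOSTS_FILE" <<'EOF'
192.168.56.201 dvhdmgt1.example.com dvhdmgt1
192.168.56.202 dvhdnn1.example.com dvhdnn1
192.168.56.203 dvhdjt1.example.com dvhdjt1
EOF

# Print any IP that appears on more than one non-comment line.
dups=$(awk '!/^#/ && NF {print $1}' "$HOSTS_FILE" | sort | uniq -d)
[ -z "$dups" ] && echo "no duplicate IPs" || echo "duplicates: $dups"
```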

14. Clone the base, considering the below for each of the nodes:

a. If the NICs do not come up, select both adapters in VirtualBox and, from the advanced options of each, refresh the MAC; then update HWADDR in eth0 & eth1 with the corresponding MACs and reboot.

b. When the system is up and both NICs (eth0 & eth1) are up, do as below:
- Open "/etc/udev/rules.d/*-persistent-net.rules", check which MAC (ATTR{address}==) is matched with which interface, rename the entries to the corresponding eth0/eth1, then remove/comment out any other line naming eth0/eth1 with a non-matching MAC.
[ref: http://xmodulo.com/2013/04/how-to-clone-or-copy-virtual-machine-on-virtualbox.html]
c. Assign the IP & MAC for the corresponding node in ifcfg-eth0; change only the MAC in ifcfg-eth1 (both under /etc/sysconfig/network-scripts/).
d. Change the hostname in /etc/sysconfig/network as per point 13.
e. Reboot to make it effective, and check.
f. Following the above steps, create 6 VMs and configure CPU & RAM as listed in step 1.
g. Except on the management node (dvhdmgt1), shut down the NAT network (ifdown eth1) and set "ONBOOT=no" in /etc/sysconfig/network-scripts/ifcfg-eth1.
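Step 14c can be partly scripted. A sketch of stamping per-node IPADDR values into copies of a base ifcfg-eth0 (the template below is illustrative, not the real file from the base VM; HWADDR/UUID still need the per-clone fixes from 14a-b):

```shell
# Illustrative base template in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/ifcfg-eth0" <<'EOF'
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.56.201
EOF

# Stamp each node's IP into its own copy of the file.
for node in dvhdnn1:192.168.56.202 dvhdjt1:192.168.56.203 dvhddn01:192.168.56.101; do
    host=${node%%:*}; ip=${node#*:}
    sed "s/^IPADDR=.*/IPADDR=$ip/" "$workdir/ifcfg-eth0" > "$workdir/ifcfg-eth0.$host"
done
grep '^IPADDR=' "$workdir/ifcfg-eth0.dvhdnn1"
```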

  
15. Mount JBOD (only for data disks)
a. From VirtualBox, assign 3 disks to each of the 3 datanodes.
b. Mount your data disks with noatime (e.g. /dev/sdc1 /mnt/disk3 ext4 defaults,noatime 1 2, which, by the way, implies nodiratime).
c. (Optional) By default 5% of an HDD is reserved in ext filesystems so that critical processes can still write some data when the disk is full (check by running tune2fs -l /dev/sdc1 and looking at the reserved block count). Lower it to 1% by running tune2fs -m 1 on all your data disks (i.e. tune2fs -m 1 /dev/sdc1).
http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/

For example, do as below for each data node:
- # shutdown
- add 3 virtual storage HDD on data nodes
- start the machine
- use "fdisk -l" to identify the unpartitioned disks (they should not appear in any partition table)
- partition each whole disk as a single primary partition, with partition number "1" for all, using fdisk:
# fdisk /dev/sdb
then follow the steps to make the desired partition (typically n > p > 1 > enter > enter > w)
After that, each device will have a partition with an additional 1 appended to the device name (e.g. /dev/sdb will have /dev/sdb1)
- Format the disks with ext4:
mkfs.ext4 /dev/sdb1
mkfs.ext4 /dev/sdc1
mkfs.ext4 /dev/sdd1
- (Optional) Tune as per point c above:
tune2fs -m 1 /dev/sdb1
tune2fs -m 1 /dev/sdc1
tune2fs -m 1 /dev/sdd1

tune2fs -l /dev/sdb1 |grep "Reserved block count:"
tune2fs -l /dev/sdc1 |grep "Reserved block count:"
tune2fs -l /dev/sdd1 |grep "Reserved block count:"
- Mount data partitions
# mkdir -p /data/01
# mkdir -p /data/02
# mkdir -p /data/03

   # vi /etc/fstab
/dev/sdb1 /data/01 ext4 defaults,noatime 1 2
/dev/sdc1 /data/02 ext4 defaults,noatime 1 2
/dev/sdd1 /data/03 ext4 defaults,noatime 1 2

# mount -a
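To see what point c above buys you: on a 25 GB data disk, dropping the ext4 reserve from 5% to 1% reclaims about 1 GB per disk. The arithmetic:

```shell
# Space reclaimed per disk when lowering the ext4 reserve from 5% to 1%
# (25 GB disks as in our setup; sizes in MiB).
disk_mib=$((25 * 1024))
reclaimed_mib=$((disk_mib * (5 - 1) / 100))
echo "${reclaimed_mib} MiB reclaimed per 25 GB disk"   # 1024 MiB
```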


16. Prepare the Cloudera Manager (CM) server node (dvhdmgt1) as a proxy for yum (as only the CM node will have an internet connection):

16.1 install squid and enable local caching:
# yum install squid
16.2 specify the caching directory (here we are caching about 7000 MB):
# grep cache_dir /etc/squid/squid.conf
#cache_dir ufs /var/spool/squid 100 16 256
cache_dir ufs /var/spool/squid 7000 16 256

16.3 You will also have to allow connections to port 3128 or stop the firewall. In our case the firewall is not running.
16.4 start squid server (on CM node):
#service squid start
init_cache_dir /var/spool/squid... Starting squid: .       [  OK  ]
 
16.5 Add squid to chkconfig:
# chkconfig squid on
# chkconfig --list squid
squid           0:off   1:off   2:on    3:on    4:on    5:on    6:off

17. Create repo file for cloudera manager with proper version (on CM node):

17.1 Cloudera recommends installing products using package management tools such as yum for Red Hat compatible systems. We will follow this recommendation.
17.2  (on CM node) download repo file "http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/cloudera-manager.repo" and copy to the "/etc/yum.repos.d/" directory.
17.3  (on CM node) Edit the file to change the baseurl to point to the specific version of Cloudera Manager you want to download. For us, we want to install Cloudera Manager version 5.0.2. So our final "/etc/yum.repos.d/cloudera-manager.repo" file will be as below.

[cloudera-manager]
# Packages for Cloudera Manager, Version 5, on RedHat or CentOS 6 x86_64            
name=Cloudera Manager
baseurl=http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5.0.2/
gpgkey = http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/RPM-GPG-KEY-cloudera    
gpgcheck = 1

17.4 Do the above on the CM node (dvhdmgt1), then distribute the repo file to all nodes:
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhdnn1:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhdjt1:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhddn01:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhddn02:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhddn03:/etc/yum.repos.d
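The five scp commands above can be written as a loop; shown dry-run here (remove the echo to actually copy):

```shell
# Dry-run: print the scp command for each non-CM node.
for h in dvhdnn1 dvhdjt1 dvhddn01 dvhddn02 dvhddn03; do
    echo scp cloudera-manager.repo "$h:/etc/yum.repos.d/"
done
```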


18. Point yum proxy to the CM node(dvhdmgt1) (on all node except dvhdmgt1):
18.1 On all the servers that need to use the cache, set the proxy configuration in their /etc/yum.conf file to be the cache server on port 3128.
18.2 for our case we will use CM server IP:
# grep proxy /etc/yum.conf
 proxy=http://192.168.56.201:3128
18.3 Test with "yum info jdk"; it should successfully load the info from the "cloudera-manager" repo, as set up on the CM node in step 17:
# yum info jdk
base                                                                                                                                             | 3.7 kB     00:00
base/primary_db                                                                                                                                  | 4.6 MB     00:01
extras                                                                                                                                           | 3.3 kB     00:00
extras/primary_db                                                                                                                                |  19 kB     00:00
updates                                                                                                                                          | 3.4 kB     00:00
updates/primary_db                                                                                                                               | 171 kB     00:00
Installed Packages
Name        : jdk
Arch        : x86_64
Epoch       : 2000
Version     : 1.6.0_31
Release     : fcs
Size        : 143 M
Repo        : installed
From repo   : cloudera-manager
Summary     : Java(TM) Platform Standard Edition Development Kit
URL         : http://java.sun.com/
License     : Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved. Also under other license(s) as shown at the Description field.
Description : The Java Platform Standard Edition Development Kit (JDK) includes both
: the runtime environment (Java virtual machine, the Java platform classes
: and supporting files) and development tools (compilers, debuggers,
: tool libraries and other tools).
:
: The JDK is a development environment for building applications, applets
: and components that can be deployed with the Java Platform Standard
: Edition Runtime Environment.


# yum list available|grep -i cloudera-manager
cloudera-manager-server.x86_64            5.0.2-1.cm502.p0.297.el6       cloudera-manager
cloudera-manager-server-db-2.x86_64       5.0.2-1.cm502.p0.297.el6       cloudera-manager
enterprise-debuginfo.x86_64               5.0.2-1.cm502.p0.297.el6       cloudera-manager


19. At this point all the VM hosts are ready to install Cloudera Hadoop.

20. Please follow the next post to install Cloudera Manager.

Thursday, November 6, 2014

OLTP vs MPP vs Hadoop


Some of my friends asked me about OLTP, MPP and Hadoop. I tried to explain them as below.
This reflects the state of things at the time of writing; things are changing so fast :).

OLTP Databases (Oracle, DB2) vs MPP (Netezza, Teradata, Vertica etc.):

                1. - Oracle or DB2 needs to read data from disk into memory before it starts processing, so it is very fast at in-memory calculation.
                   - MPP takes the processing as close as possible to the data, so there is less data movement.

                2. - Oracle or DB2 is good for smaller OLTP (transaction) operations. It also maintains a very high level of data integrity.
                   - MPP is good for batch processing. Some MPP systems (Netezza, Vertica) overlook integrity constraints, like enforcing unique keys, for the sake of batch performance.


Hadoop(without impala or EMC HAWQ) vs MPP:

                1. - A conventional MPP database stores data in a mature internal structure, so data loading and SQL processing are efficient.
                   - There is no such structured architecture for data stored on Hadoop, so accessing and loading data is not as efficient as in conventional MPP systems.
                2. - Conventional MPP supports only the relational model (row-column).
                   - Hadoop supports virtually any kind of data.

                * However, the main objective of MPP and Hadoop is the same: process data in parallel, near the storage.
             

Cloudera impala(or pivotal HAWQ) vs MPP:

                1. - MPP supports advanced in-database analytics.
                   - As of now, Impala (2.0) has started supporting "SQL 2003", which may lead to the introduction of in-database analytics.
                2. - MPP databases have industry-standard security features and a well-defined user schema structure.
                   - Impala has a very immature security system and virtually no user schema.
                3. - MPP supports only vendor-specific filesystems and needs to load data using a specific loading tool.
                   - Impala supports most open file formats (text, Parquet).

                * However, Impala seems set to become an MPP & columnar database system like Vertica, but cheap & open, in the near future. It just needs to implement security and advanced in-database analytics.


How to choose what (in general, and my personal opinion):


1. OLTP Databases (Oracle,DB2, MySQL, MS SQL, Exadata):
                - Transaction based application
                - Smaller DWH
                * However, Exadata is a hybrid system, and I have experience handling a DWH with ~20 TB of data on it.

2. MPP (Netezza, Teradata, Vertica)
                - Bigger data warehouse (may have tables larger than 4-5 TB)
                - Needs no or little pre-processing
                - Needs faster batch processing speed
                - In database analytics

3. Only Hadoop:
                - All data is heavily unstructured (documents, audio, video etc.)
                - Needs batch processing

4. Hadoop, mainly using Impala (or EMC HAWQ):
                - Need a DWH at low cost
                - No need for advanced analytics features
                - Can utilize open source tools
                - Not concerned about security, or has a limited number of users
             
5. Hadoop (with Impala or HAWQ) + MPP:
                - Some data needs heavy pre-processing before it is ready for advanced analytics.
                - Need a cheaper, queryable archive or backup for older data.


References:
http://www.quora.com/Is-Impala-aiming-to-be-an-open-source-alternative-to-existing-MPP-solutions
http://www.quora.com/What-features-of-a-relational-database-are-most-useful-when-building-a-data-warehouse
http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduc