Monday, January 19, 2015

Preparing Hosts (VMs) to Install Cloudera Hadoop (Install Cloudera Hadoop part 1 of 2):

Here I am sharing the cookbook of activities I followed to set up a development environment for a Cloudera Hadoop cluster using VirtualBox. Using almost the same method, you can easily install Cloudera Hadoop in production.

Here is part 1 of 2: host/node preparation. For the Cloudera Manager and Cloudera Hadoop installation, please check part 2.

1. Target:

Hadoop cluster with the below nodes:
- 1 management node: acting as a gateway (GW) to the cluster and hosting Cloudera Manager, with 4 GB RAM, 2 vcores, 25 GB local disk
- 1 name node: acting as primary namenode and standby resource manager, with 2 GB RAM, 1 vcore, 25 GB local disk
- 1 resource manager node: acting as primary resource manager and standby namenode, with 2 GB RAM, 1 vcore, 25 GB local disk
- 3 datanodes: running all Hadoop worker processes and having 3 data disk volumes each, with 2 GB RAM, 1 vcore, 25 GB local disk
- Only the management node will have an internet connection; the other 5 will not. This is to emulate a production data center environment.

VMs for Hadoop:
First configure 1 base virtual guest that:
- can be reached via SSH from the host
- can connect to the internet
- can interact with other guests on the same host
- has 25 GB local storage
- has 1 GB RAM
- has 1 virtual CPU

2. Target virtual machine:
We used VirtualBox as our virtualization software.
Make sure virtualization support is activated for the host. If not enabled, please enable it from the BIOS.

3. Create a virtual host with the below network config:
Adapter 1: hostonly (on eth0)
Adapter 2: NAT (on eth1)

4. Install CentOS/RHEL with the required partitions. For our case we are using only the below 3 partitions. This helps use all the available space:
- /
- /boot
- /home

5. Configure network as below:

a. On the host-only network (eth0):
- On the VirtualBox guest configuration: the host-only virtual interface on the host machine keeps its default IP and has no gateway; leave it untouched.

- On the eth0 configuration file:
- Keep DEVICE, HWADDR & UUID untouched.
- Change ONBOOT=yes
- Change BOOTPROTO to "static"
- Add an IPADDR= line with the static IP for this guest
- No gateway, and nothing else.
- If any other entries exist, just remove them
- This interface will be used to communicate with the host and other guests

- For example, the ifcfg-eth0 file looks like below; for the other machines only IPADDR will change:
[root@base ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
[root@base ~]#
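As a sketch, a minimal host-only ifcfg-eth0 might look like the below. The IP 192.168.56.101 is an assumed example value (VirtualBox host-only networks commonly use the 192.168.56.0/24 subnet); keep your own machine's DEVICE, HWADDR and UUID lines.

```shell
# Hypothetical ifcfg-eth0 sketch -- IPADDR/NETMASK are assumed example values;
# the real DEVICE, HWADDR and UUID lines come from your own system.
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
```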

b. On NAT (eth1):
- Change ONBOOT=yes
- Add BOOTPROTO=dhcp (if not there)
- Keep the rest untouched
- This interface will be used to access the internet
- For example, ifcfg-eth1 will look like below:
[root@base ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
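A corresponding sketch for the NAT interface, under the same assumptions (your own HWADDR and UUID lines are kept as-is):

```shell
# Hypothetical ifcfg-eth1 sketch -- DHCP from the VirtualBox NAT;
# keep your machine's own HWADDR and UUID lines.
DEVICE=eth1
ONBOOT=yes
BOOTPROTO=dhcp
```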

c. Change your hostname as below:
[root@base ~]# cat /etc/sysconfig/network
[root@base ~]#
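For illustration, on the base machine (named "base" in the prompts above) the file might contain the following; the HOSTNAME value is whatever you chose for this guest:

```shell
# Hypothetical /etc/sysconfig/network sketch for the base guest.
NETWORKING=yes
```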

d. No need to make any change to DNS (resolv.conf).

e. (Optional) Set the swappiness to 0 to avoid swapping, as per the Cloudera Manager recommendation:
# vi /etc/sysctl.conf
vm.swappiness = 0
# sysctl -p
# cat /proc/sys/vm/swappiness


6. Disable SELinux by setting SELINUX=disabled:
# vi /etc/selinux/config
SELINUX=disabled

7. Disable IPv6 by issuing the below commands as root:
# vi /etc/sysctl.conf and add the below two lines:
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.all.disable_ipv6 = 1

Then apply the settings by running "sysctl -p" as root:
# sysctl -p

9. Disable the firewall:
# chkconfig iptables off
# service iptables stop

10. Disable the yum fastestmirror plugin (most nodes will have no direct internet access) by setting enabled=0:
# vi /etc/yum/pluginconf.d/fastestmirror.conf
enabled=0

11. Reboot and check that all changes are persistent.

12. Setup SSH

To simplify access between hosts, install SSH, set up SSH keys, and mark them as already authorized.
- Do the below on the base node only. Since we will clone this base to create the other nodes, the keys will already be there; no need to copy them again.

$ yum -y install perl openssh-clients
$ ssh-keygen (press enter, enter, enter to accept the defaults)
$ cd ~/.ssh
$ cp authorized_keys

- Modify the SSH client configuration file. Uncomment the following line and change the value to no; this will prevent the host-key question when connecting with SSH to a host.

# vi /etc/ssh/ssh_config
StrictHostKeyChecking no

13. Edit the /etc/hosts file as per your needs, one entry per node, using each node's host-only (eth0) IP:
# vi /etc/hosts
<ip> dvhdmgt1   # Management node hosting Cloudera Manager
<ip> dvhdnn1    # Name node
<ip> dvhdjt1    # Jobtracker/Resource Manager
<ip> dvhddn01   # Datanode 1
<ip> dvhddn02   # Datanode 2
<ip> dvhddn03   # Datanode 3
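Assuming the host-only subnet is 192.168.56.0/24 (a common VirtualBox default; the actual addresses are whatever you assigned in step 5), a hypothetical /etc/hosts could look like:

```shell
# Hypothetical /etc/hosts entries -- all IPs are assumed example values.  dvhdmgt1   # Management node hosting Cloudera Manager  dvhdnn1    # Name node  dvhdjt1    # Jobtracker/Resource Manager  dvhddn01   # Datanode 1  dvhddn02   # Datanode 2  dvhddn03   # Datanode 3
```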

14. Clone the base, considering the below for each of the nodes:

a. If the NICs are not up, select both adapters and, for each, refresh the MAC from the advanced options; update HWADDR with the corresponding MAC for eth0 & eth1 and reboot.

b. When the system is up and both NICs (eth0 & eth1) are up, do as below:
- Open "/etc/udev/rules.d/*-persistent-net.rules" and check which MAC (ATTR{address}==) matches which of eth0/eth1; rename the entries to the corresponding eth0/eth1, then remove or comment out any other line that names eth0/eth1 but has a non-matching MAC.
c. Assign the IP & MAC for the corresponding node on eth0 only; change only the MAC for eth1, in the files /etc/sysconfig/network-scripts/ifcfg-eth0 & ifcfg-eth1.
d. Change the hostname in /etc/sysconfig/network as per point 13.
e. Reboot to make it effective, and check.
f. Following the above steps, create 6 VMs and configure CPU & RAM as listed in step 1.
g. Except on the management node (dvhdmgt1), shut down the NAT interface (ifdown eth1) and set "ONBOOT=no" in the file /etc/sysconfig/network-scripts/ifcfg-eth1.

15. Mount JBOD (only for the data disks)
a. From VirtualBox assign 3 disks to each of the 3 datanodes.
b. Mount your data disks with noatime (e.g. /dev/sdc1 /mnt/disk3 ext4 defaults,noatime 1 2, which, by the way, implies nodiratime).
c. (Optional) By default 5% of an HDD is reserved in ext filesystems so that critical processes can still write some data when the disk is full (check by running tune2fs -l /dev/sdc1 and look at the "Reserved block count"). Reduce it to 1% by running tune2fs -m 1 on all your data disks (i.e. tune2fs -m 1 /dev/sdc1).

For example, do as below on each of the data nodes:
- # shutdown
- Add 3 virtual HDDs to the data node
- Start the machine
- Use "fdisk -l" to check for unformatted disks (they should not be included in the partition tables)
- Partition each whole disk as a single primary partition, with partition number "1" for all, using fdisk:
# fdisk /dev/sdb
then follow all the steps to make the desired partition (typically n > p > 1 > enter > enter > w).
After that, each device will have a partition with an additional 1 appended to the corresponding device name (e.g. /dev/sdb will have /dev/sdb1).
- Format disks with ext4 
mkfs.ext4 /dev/sdb1
mkfs.ext4 /dev/sdc1
mkfs.ext4 /dev/sdd1
- (optionally) tune as per point c above:
tune2fs -m 1 /dev/sdb1
tune2fs -m 1 /dev/sdc1
tune2fs -m 1 /dev/sdd1

tune2fs -l /dev/sdb1 |grep "Reserved block count:"
tune2fs -l /dev/sdc1 |grep "Reserved block count:"
tune2fs -l /dev/sdd1 |grep "Reserved block count:"
- Mount data partitions
# mkdir -p /data/01
# mkdir -p /data/02
# mkdir -p /data/03

   # vi /etc/fstab
/dev/sdb1 /data/01 ext4 defaults,noatime 1 2
/dev/sdc1 /data/02 ext4 defaults,noatime 1 2
/dev/sdd1 /data/03 ext4 defaults,noatime 1 2

# mount -a
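The per-disk commands above follow one pattern, so a small loop can generate the three fstab lines; the device names sdb1..sdd1 and mount points /data/01../data/03 are taken from the listing above:

```shell
# Print the fstab entries for the three data disks (sdb1..sdd1 -> /data/01..03).
i=1
for dev in sdb1 sdc1 sdd1; do
  printf '/dev/%s /data/%02d ext4 defaults,noatime 1 2\n' "$dev" "$i"
  i=$((i+1))
done
```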

16. Prepare the Cloudera Manager (CM) server node (dvhdmgt1) as a proxy for yum (as only the CM node will have an internet connection):

16.1 install squid and enable local caching:
# yum install squid
16.2 specify the caching directory (here we are caching about 7000 MB):
# grep cache_dir /etc/squid/squid.conf
#cache_dir ufs /var/spool/squid 100 16 256
cache_dir ufs /var/spool/squid 7000 16 256

16.3 You will also have to allow connections to port 3128 or stop the firewall. In our case the firewall is not running (see step 9).
16.4 start squid server (on CM node):
#service squid start
init_cache_dir /var/spool/squid... Starting squid: .       [  OK  ]
16.5 Add squid to chkconfig:
# chkconfig squid on
# chkconfig --list squid
squid           0:off   1:off   2:on    3:on    4:on    5:on    6:off

17. Create the repo file for Cloudera Manager with the proper version (on the CM node):

17.1 Cloudera recommends installing its products using package management tools such as yum on Red Hat-compatible systems. We will follow this recommendation.
17.2 (On the CM node) download the repo file "" and copy it to the "/etc/yum.repos.d/" directory.
17.3 (On the CM node) edit the file to change the baseurl to point to the specific version of Cloudera Manager you want to download. In our case we want to install Cloudera Manager version 5.0.2, so our final "/etc/yum.repos.d/cloudera-manager.repo" file will be as below.

[cloudera-manager]
# Packages for Cloudera Manager, Version 5, on RedHat or CentOS 6 x86_64
name=Cloudera Manager
baseurl =
gpgkey =
gpgcheck = 1

17.4 Do the above on the CM node (dvhdmgt1), then distribute the repo file to all nodes:
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhdnn1:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhdjt1:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhddn01:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhddn02:/etc/yum.repos.d
[root@dvhdmgt1 yum.repos.d]# scp cloudera-manager.repo dvhddn03:/etc/yum.repos.d
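The five scp commands follow one pattern; here is the same distribution as a loop, echoed as a sketch so the generated commands can be checked first (drop the echo to actually copy):

```shell
# Sketch: distribute the repo file to every other node (echo just prints the commands).
for h in dvhdnn1 dvhdjt1 dvhddn01 dvhddn02 dvhddn03; do
  echo scp /etc/yum.repos.d/cloudera-manager.repo "$h:/etc/yum.repos.d/"
done
```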

18. Point the yum proxy to the CM node (dvhdmgt1) (on all nodes except dvhdmgt1):
18.1 On all the servers that need to use the cache, set the proxy configuration in their /etc/yum.conf file to the cache server on port 3128.
18.2 In our case we will use the CM server's IP:
# grep proxy /etc/yum.conf
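For illustration, the proxy line in /etc/yum.conf has the following form; the IP below is a hypothetical host-only address for dvhdmgt1 (use your CM node's actual eth0 IP from step 5):

```shell
# Hypothetical yum.conf proxy line -- the IP is an assumed example value.
```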
18.3 Test with "yum info jdk"; it should successfully load info from the repo "cloudera-manager" as configured on the CM node (step 17 above):
# yum info jdk
base                                                                                                                                             | 3.7 kB     00:00
base/primary_db                                                                                                                                  | 4.6 MB     00:01
extras                                                                                                                                           | 3.3 kB     00:00
extras/primary_db                                                                                                                                |  19 kB     00:00
updates                                                                                                                                          | 3.4 kB     00:00
updates/primary_db                                                                                                                               | 171 kB     00:00
Installed Packages
Name        : jdk
Arch        : x86_64
Epoch       : 2000
Version     : 1.6.0_31
Release     : fcs
Size        : 143 M
Repo        : installed
From repo   : cloudera-manager
Summary     : Java(TM) Platform Standard Edition Development Kit
URL         :
License     : Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved. Also under other license(s) as shown at the Description field.
Description : The Java Platform Standard Edition Development Kit (JDK) includes both
: the runtime environment (Java virtual machine, the Java platform classes
: and supporting files) and development tools (compilers, debuggers,
: tool libraries and other tools).
: The JDK is a development environment for building applications, applets
: and components that can be deployed with the Java Platform Standard
: Edition Runtime Environment.

# yum list available|grep -i cloudera-manager
cloudera-manager-server.x86_64            5.0.2-1.cm502.p0.297.el6       cloudera-manager
cloudera-manager-server-db-2.x86_64       5.0.2-1.cm502.p0.297.el6       cloudera-manager
enterprise-debuginfo.x86_64               5.0.2-1.cm502.p0.297.el6       cloudera-manager

19. At this point all the VM hosts are ready to install Cloudera Hadoop.

20. Please follow the next post to install Cloudera Manager.
