Wednesday, February 4, 2015

HDFS (Hadoop 2.4.1) POC Lab on Ubuntu 14.04 via VirtualBox - Note

I had always been thinking about setting up HDFS to understand the logic better, rather than just reading the papers. I remember my friend always told me that the best way to learn a new skill is to "Try It".

Regarding Hadoop, there are three major papers that lay out the whole concept, and they were all actually published by Google engineers.
  1. The Google File System (2003)
  2. MapReduce: Simplified Data Processing on Large Clusters (2004)
  3. Bigtable: A Distributed Storage System for Structured Data (2006)
PS: These papers are very easy to google and download in pdf format.

Splitting a file into 64 MB blocks and saving them across distributed nodes is similar to my research concept (data deduplication), but serves a different purpose. Before we jump in deep, let me give a rough idea of the Distributed File System concept.


Distributed File System : eg GFS/HDFS

Features

  • Global File Namespace
  • Redundancy
  • Availability

Typical Usage Pattern


  • Huge files (100s of GB to TB)
    • Data is rarely updated in place
    • Files keep growing by appends; whole file contents are almost never removed or replaced
    • Reads and appends are the common operations
  • Data is kept in "chunks" spread across machines
  • Each chunk is replicated on different machines with multiple copies (at least two) to ensure persistence and availability

PS: Replicas are never placed on the same chunk server (the quick check right below shows this once the lab is running).
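As a way to see chunks (blocks) and replica placement in practice, HDFS can report how a stored file is split up and where each replica lives. This is a minimal sketch that assumes the single-node lab below is already running and that the sample file /user/hduser/hdfsTest.txt from the Testing section exists; on a single node you will only ever see one replica per block.

#hdfs fsck /user/hduser/hdfsTest.txt -files -blocks -locations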


Architecture:

Chunk Server (datanode)

  • Each file is split into contiguous chunks ( 16 ~ 64 MB )
  • Each chunk is replicated ( 2x ~ 3x ), ideally on different racks

Master Node (aka namenode in HDFS)

  • Store "Metadata" about where files are stored (might be replicated as well)

Client has "Library" for file access

  • Talks to master to find chunk servers
  • "Connects directly" to chunk server to access data




Interestingly, I notice a distributed file system is sort of like a Dropbox-like sync solution (client file sync and share), since the basic concept is similar.

Also, I would like to give my personal thoughts on how it differs from deduplication, since that was my research topic before.

Here are the bullet items for the feature differences, as I understand them.

  • First of all, a Dropbox-like sync client (File Sync and Share) doesn't care about replicas, since its backend repository (such as Amazon S3) usually takes care of replication for HA; a distributed file system does care, since replication is its major HA feature.
  • Second, a Dropbox-like sync client pulls data from all kinds of endpoint devices but pushes it all to a single endpoint (e.g. Amazon S3 or OpenStack Swift), whereas a distributed file system spreads data across multiple nodes, racks or clusters connected by gigabit switches.
  • Third, the metadata in the master (name) node keeps the address (location / directories) of each chunk, while a Dropbox-like sync client, being a single-endpoint design, usually just uses a hash as the key reference.
  • Last, let's talk about MapReduce and dedup here.
  • The MapReduce concept is similar to dedup in that both use keys to collapse redundant values in a data stream. However, MapReduce does this because the collected data must be reduced to unique results, while dedup does it to keep a single copy of each data chunk and save disk space. To sum up, MapReduce collects data from servers, reduces duplicated information, and lands the result on the client; dedup is roughly the opposite: it reduces the input written to disk from the clients (nodes) and ends up with unique chunks in the repository (server). A rough shell analogy of this key-based reduction is sketched right after this list.
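To make the "collapse redundant values by key" idea a bit more concrete, here is a rough shell analogy (not real MapReduce, just an illustration of the key/value reduction): each word in the stream acts as a key, sort groups identical keys together, and uniq -c plays the reduce role by collapsing duplicates into one entry with a count.

#echo "to be or not to be" | tr ' ' '\n' | sort | uniq -c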

I would like to play around and see what else I can learn from this hot topic. Here I note the steps I used to set it up and share them. I outline the steps in a couple of sections, which might make it easier for people to follow along.
  1. Preparation
    1. Install Ubuntu 14.04 VirtualBox VM
    2. Install Java
    3. Add UserGroup and User
    4. SSH setup and Key ( Single Node on VM doesn't matter )
  2. Installation
    1. Download tar
    2. unzip
    3. configuration
    4. format hdfs
    5. Start all or Stop all
    6. Double Check Features
  3. Testing
    1. Create Directory
    2. Copy File (copyFromLocal)
    3. Put File
    4. List Directory/List
    5. Display File Content
    6. Delete

1 Preparation

Ubuntu 14.04 VirtualBox VM



The easier way I did it was to prepare a fresh Ubuntu installation VM and export it as an OVA for various kinds of testing purposes. I import the OVA and change the hostname for each different installation / testing lab.

PS: remember to update your apt-get
#sudo apt-get update


Install Java



#sudo apt-get install default-jdk
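To confirm the JDK landed correctly (on Ubuntu 14.04, default-jdk pulls in OpenJDK 7, which is the version the JAVA_HOME path used later assumes):

#java -version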


Add UserGroup and User for Hadoop

#sudo addgroup hadoop
#sudo adduser --ingroup hadoop hduser
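A quick check that the new account and group exist. Optionally, you can also add hduser to the sudo group so the later sudo commands can be run while logged in as hduser; skip this if you prefer to run those steps as your original admin user.

#id hduser
#sudo adduser hduser sudo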

SSH setup and Key ( Single Node on VM doesn't matter )
#sudo apt-get install ssh

Generate Key

#su hduser
#ssh-keygen -t rsa -P ""
#cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
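Before moving on, verify that the passwordless login actually works, since start-all.sh uses ssh to launch the daemons even on a single node; the first connection will ask you to accept the host key, then exit back out.

#ssh localhost
#exit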

2 Installation

Download tar

#wget http://sourceforge.net/projects/hadoop.mirror/files/Hadoop%202.4.1/hadoop-2.4.1.tar.gz

unzip

#tar xvzf hadoop-2.4.1.tar.gz



Move to /usr/local/hadoop
#sudo mv hadoop-2.4.1 /usr/local/hadoop 

Give ownership of the hadoop folder to hduser and the hadoop group
#sudo chown -R hduser:hadoop /usr/local/hadoop

Five Major Configurations

1. ~/.bashrc

Edit hduser's .bashrc (sudo is not needed for your own file)
#vim ~/.bashrc



add

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
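The new variables only take effect in a new shell, so reload the file as hduser and run a quick check; hadoop version should report 2.4.1 if the PATH entries above are correct.

#source ~/.bashrc
#hadoop version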








2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh

#sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh



Comment out the existing line and add the new one as below

#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
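If you are not sure the JDK path on your VM matches the one above, resolving the real location of the java binary and stripping the trailing /jre/bin/java gives the value to use for JAVA_HOME; it should print something like /usr/lib/jvm/java-7-openjdk-amd64.

#readlink -f /usr/bin/java | sed "s:/jre/bin/java::"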


3. /usr/local/hadoop/etc/hadoop/core-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/core-site.xml

add 
<configuration>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>

 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
 </property>
</configuration>
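The hadoop.tmp.dir path above does not exist yet, so create it and hand ownership to hduser before formatting HDFS, otherwise the NameNode and DataNode will fail to start. This assumes the /app/hadoop/tmp value from the config above.

#sudo mkdir -p /app/hadoop/tmp
#sudo chown -R hduser:hadoop /app/hadoop/tmp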



4. /usr/local/hadoop/etc/hadoop/mapred-site.xml

Hadoop only reads mapred-site.xml (not the .template file), so copy the template first and then edit the copy:
#sudo cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/mapred-site.xml

add
<configuration>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
 </property>
</configuration>



5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml

add
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
 </property>
</configuration>
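Same idea for the dfs.namenode.name.dir and dfs.datanode.data.dir paths above: they need to exist and be owned by hduser before the format step.

#sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
#sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
#sudo chown -R hduser:hadoop /usr/local/hadoop_store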



***Format HDFS***

#hadoop namenode -format
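Run the format as hduser and only once; reformatting later wipes the HDFS metadata. In Hadoop 2.x the hadoop namenode command is marked deprecated, so the equivalent newer form is:

#hdfs namenode -format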

Start / Stop Hadoop

Go to the sbin folder
#cd /usr/local/hadoop/sbin

Start
#start-all.sh

Stop
#stop-all.sh
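start-all.sh / stop-all.sh still work in 2.4.1 but are marked deprecated; the scripts they wrap can also be run separately, which makes it easier to see which daemon fails to come up.

#start-dfs.sh
#start-yarn.sh
#stop-yarn.sh
#stop-dfs.sh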


Double Check Features

localhost:50070 - Web UI for the NameNode daemon

localhost:50090 - Secondary NameNode

localhost:50075 - DataNode


List all the ports
#netstat -plten | grep java
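Another quick check is jps (it ships with the JDK), which lists the running Java daemons; after start-all.sh on this single-node lab you should see roughly NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.

#jps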



3 Testing

Try the bundled MapReduce example (run it from /usr/local/hadoop, since the jar path below is relative):
#hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 2 5

You can use dd or echo with redirection to create a test file:
#echo "hdfs test" > hdfsTest.txt

Create Directory

#hadoop fs -mkdir -p /user/hduser


Copy File (copyFromLocal)

#hadoop fs -copyFromLocal hdfsTest.txt hdfsTest.txt


Put File

Since hdfsTest.txt already exists in HDFS from the previous step, put (which behaves like copyFromLocal) gets a different target name here:
#hadoop fs -put hdfsTest.txt hdfsTest2.txt


List Directory/List

#hadoop fs -ls

Display File Content

#hadoop fs -cat /user/hduser/hdfsTest.txt


Delete

#hadoop fs -rm hdfsTest.txt

PS: You can always review the results via the Web UI and simply double-check the transaction logs.


Movie Quote: 

When you're lost, follow your dreams. They know the way.
- (Begin Again) 2014
