Regarding Hadoop, there are three major papers that lay out the whole concept, and they were all published by people at Google.
- The Google File System (2003)
- MapReduce: Simplified Data Processing on Large Clusters (2004)
- Bigtable: A Distributed Storage System for Structured Data (2006)
PS: These papers are very easy to google and download in pdf format.
Splitting files into 64 MB blocks and saving them across distributed nodes is similar to my research topic (data deduplication), but it serves a different purpose. Before we jump in too deep, let me give a rough idea of the distributed file system concept.
Distributed File System: e.g. GFS/HDFS
Features
- Global File Namespace
- Redundancy
- Availability
Typical Usage Pattern
- Huge files (100s of GB to TB)
- (data keeps being appended, but whole file contents are rarely removed or replaced)
- Data is rarely updated in place
- Reads and appends are common
- Data is kept in "chunks" spread across machines
- Each chunk is replicated on different machines - multiple copies (at least two) - to ensure persistence and availability
PS: replicas are never placed on the same chunk server
Architecture:
Chunk Server (datanode)
- Files are split into contiguous chunks ( 16 ~ 64 MB )
- Each chunk is replicated ( 2x ~ 3x ), on different racks
Master Node (aka namenode in HDFS)
- Store "Metadata" about where files are stored (might be replicated as well)
Client has "Library" for file access
- Talks to master to find chunk servers
- "Connects directly" to chunk server to access data
Interestingly, I notice that a distributed file system is somewhat like a Dropbox-style sync solution (client file sync and share), since the basic concept is similar.
Furthermore, I would like to share my personal thoughts on how it differs from deduplication, since that was my research topic before.
Here are the bullet items for the feature differences as I understand them.
- First of all, a Dropbox-like sync client (file sync and share) doesn't care about replicas, since its backend repository (such as Amazon S3) usually takes care of replication for HA; a distributed file system does care, since replication is one of its major HA features.
- Second, a Dropbox-like sync client pulls data from all kinds of endpoint devices but pushes it all to a single endpoint (e.g. Amazon S3 or OpenStack Swift), while a distributed file system pulls from multiple nodes, racks, or clusters connected by gigabit switches.
- Third, the metadata in the master (name) node keeps the address ( location / directories ) of each chunk, but a Dropbox-like sync client, being a single node, usually just uses a hash as the key reference.
- Last, let's talk about MapReduce and dedup here.
- The MapReduce concept is similar to dedup in that both use a key to remove redundant values in a data stream. However, MapReduce does it because of the distributed file system's HA design and because the collected data must be reduced to unique results, while dedup does it to keep a single copy of each data chunk and save disk space. To sum up, MapReduce collects data from servers, reduces the duplicated information, and lands the result on the client. Dedup is roughly the opposite: it reduces the input written to disk from the clients ( nodes ) and ends up with unique chunks in the repository ( server ).
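To make the key-based reduction idea concrete, here is a rough sketch with plain Linux tools (not Hadoop, and the file names are just examples): the first pipeline groups a word stream by key and reduces it to counts, which is the same shape as a word count MapReduce job; the second keeps only the first file per content hash, which is the dedup direction.
#cat input.txt | tr ' ' '\n' | sort | uniq -c
#md5sum *.bin | sort | uniq -w32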
I would like to play around with it and see what else I can learn from this hot topic. Here I note the steps I took to set it up and share them here. I outline the steps in a couple of sections, which might make it easier for people to follow along.
- Preparation
- Install Ubuntu 14.04 VirtualBox VM
- Install Java
- Add UserGroup and User
- SSH setup and Key ( Single Node on VM doesn't matter )
- Installation
- Download tar
- unzip
- configuration
- format hdfs
- Start all or Stop all
- Double Check Features
- Testing
- Create Directory
- Copy File (copyFromLocal)
- Put File
- List Directory/List
- Display File Content
- Delete
1 Preparation
Ubuntu 14.04 VirtualBox VM
The easiest way I found is to prepare a fresh Ubuntu installation VM and export it as an OVA for various kinds of testing purposes. I then import the OVA and change the hostname for each installation / testing lab.
PS: remember to update apt-get first
#sudo apt-get update
Install Java
#sudo apt-get install default-jdk
Add UserGroup and User for Hadoop
#sudo addgroup hadoop
#sudo adduser --ingroup hadoop hduser
SSH setup and Key ( Single Node on VM doesn't matter )
#sudo apt-get install ssh
Generate Key
#su hduser
#ssh-keygen -t rsa -P ""
#cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
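PS: before moving on, it's worth a quick check that passwordless SSH actually works now:
#ssh localhost
#exit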
2 Installation
Download tar
#wget http://sourceforge.net/projects/hadoop.mirror/files/Hadoop%202.4.1/hadoop-2.4.1.tar.gz
unzip
#tar xvzf hadoop-2.4.1.tar.gz
Move to /usr/local/hadoop
#sudo mv hadoop-2.4.1 /usr/local/hadoop
Give ownership of the hadoop folder to the hduser user and hadoop group
#sudo chown -R hduser:hadoop /usr/local/hadoop
Five Major Configurations
1. ~/.bashrc
#sudo vim ~/.bashrc
add
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
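Reload the profile so the new variables take effect in the current shell:
#source ~/.bashrc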
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
#sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
comment out the existing line and add a new one as below
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
3. /usr/local/hadoop/etc/hadoop/core-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/core-site.xml
add
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
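PS: hadoop.tmp.dir points to /app/hadoop/tmp here, so that directory needs to exist and be owned by hduser before formatting HDFS:
#sudo mkdir -p /app/hadoop/tmp
#sudo chown hduser:hadoop /app/hadoop/tmp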
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
#sudo vim /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
add
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
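PS: as far as I know, Hadoop only reads mapred-site.xml and ignores the .template file, so I copy the template first and put the property into the copy:
#cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml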
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
add
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
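PS: the namenode and datanode paths above also need to exist and be owned by hduser:
#sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
#sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
#sudo chown -R hduser:hadoop /usr/local/hadoop_store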
***Format HDFS***
#hadoop namenode -format
Start / Stop Hadoop
Start
#start-all.sh
Stop
#stop-all.sh
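A quick sanity check is jps; on a healthy single node you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager listed:
#jps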
Double Check Features
localhost:50070 - Web UI for NameNode Daemon
localhost:50090 - Secondary NameNode
localhost:50075 - DataNode
List all the ports
#netstat -plten | grep java
3 Testing
Try to use example:
#hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 2 5
You can use dd or echo > to create a test file
#echo "hdfs test" > hdfsTest.txt
Create Directory
#hadoop fs -mkdir -p /user/hduser
Copy File (copyFromLocal)
#hadoop fs -copyFromLocal hdfsTest.txt hdfsTest.txt
Put File
#hadoop fs -put hdfsTest.txt
List Directory/List
#hadoop fs -ls
Display File Content
#hadoop fs -cat /user/hduser/hdfsTest.txt
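Since hdfsTest.txt is now in HDFS, you can also feed it to the bundled wordcount example before deleting it (wcOut is just a made-up output directory name; it must not exist yet):
#hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount hdfsTest.txt wcOut
#hadoop fs -cat wcOut/part-r-00000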
Delete
#hadoop fs -rm hdfsTest.txt
PS: You can always review the results via the Web UI to double-check the transaction log.
Movie Quote:
When you're lost, follow your dreams. They know the way.
- (Begin Again) 2014