Regarding Hadoop, there are three major papers that lay out the whole concept, and they were all published by people at Google.
- The Google File System (2003)
- MapReduce: Simplified Data Processing on Large Clusters (2004)
- Bigtable: A Distributed Storage System for Structured Data (2006)
PS: These papers are very easy to google and download in pdf format.
Splitting files into 64 MB blocks and saving them across distributed nodes is similar to my research topic (data deduplication), but it serves a different purpose. Before we jump in too deep, let me give a rough idea of the distributed file system concept.
Distributed File System: e.g. GFS/HDFS
Features
- Global File Namespace
- Redundancy
- Availability
Typical Usage Pattern
- Huge files (100s of GB to TB)
- (data keeps being appended, but whole file contents are rarely removed or replaced)
- Data is rarely updated in place
- Reads and appends are common
- Data is kept in "chunks" spread across machines
- Each chunk is replicated on different machines - multiple copies (at least two) - to ensure persistence and availability
PS: replicas are never placed on the same chunk server
Architecture:
Chunk Server (datanode)
- Files are split into contiguous chunks ( 16 ~ 64 MB )
- Each chunk is replicated ( 2x ~ 3x ), on different racks
Master Node (aka namenode in HDFS)
- Store "Metadata" about where files are stored (might be replicated as well)
Client has "Library" for file access
- Talks to master to find chunk servers
- "Connects directly" to chunk server to access data
Interestingly, I notice that a distributed file system is somewhat like a Dropbox-style sync solution (client file sync and share), since the basic concept is similar.
Furthermore, I would like to share my personal thoughts on how it differs from deduplication, since that was my research topic before.
Here are the bullet items for the feature differences as I understand them.
- First of all, a Dropbox-like sync client (file sync and share) doesn't care about replicas, since its backend repository (such as Amazon S3) usually takes care of replication for HA; a distributed file system does care, since replication is one of its major HA features.
- Second, a Dropbox-like sync client pulls data from all kinds of endpoint devices but pushes it all to a single endpoint (e.g. Amazon S3 or OpenStack Swift), while a distributed file system pulls from multiple nodes, racks, or clusters connected by gigabit switches.
- Third, the metadata in the master (name) node keeps the address ( location / directories ) of each chunk, but a Dropbox-like sync client, being a single node, usually just uses a hash as the key reference.
- Last, let's talk about MapReduce and dedup here.
- The MapReduce concept is similar to dedup in that both use a key to remove redundant values in a data stream. However, MapReduce does it because of the distributed file system's HA design and because the collected data must be reduced to unique results, while dedup does it to keep a single copy of each data chunk and save disk space. To sum up, MapReduce collects data from servers, reduces the duplicated information, and lands the result on the client. Dedup is roughly the opposite: it reduces the input written to disk from the clients ( nodes ) and ends up with unique chunks in the repository ( server ).
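To make the key-based reduction idea concrete, here is a rough sketch with plain Linux tools (not Hadoop, and the file names are just examples): the first pipeline groups a word stream by key and reduces it to counts, which is the same shape as a word count MapReduce job; the second keeps only the first file per content hash, which is the dedup direction.
#cat input.txt | tr ' ' '\n' | sort | uniq -c
#md5sum *.bin | sort | uniq -w32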
I would like to play around with it and see what else I can learn from this hot topic. Here I note the steps I took to set it up and share them here. I outline the steps in a couple of sections, which might make it easier for people to follow along.
- Preparation
- Install Ubuntu 14.04 VirtualBox VM
- Install Java
- Add UserGroup and User
- SSH setup and Key ( Single Node on VM doesn't matter )
- Installation
- Download tar
- unzip
- configuration
- format hdfs
- Start all or Stop all
- Double Check Features
- Testing
- Create Directory
- Copy File (copyFromLocal)
- Put File
- List Directory/List
- Display File Content
- Delete
1 Preparation
Ubuntu 14.04 VirtualBox VM
The easiest way I found is to prepare a fresh Ubuntu installation VM and export it as an OVA for various kinds of testing purposes. I then import the OVA and change the hostname for each installation / testing lab.
PS: remember to update apt-get first
#sudo apt-get update
Install Java
#sudo apt-get install default-jdk
Add UserGroup and User for Hadoop
#sudo addgroup hadoop
#sudo adduser --ingroup hadoop hduser
SSH setup and Key ( Single Node on VM doesn't matter )
#sudo apt-get install ssh
Generate Key
#su hduser
#ssh-keygen -t rsa -P ""
#cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
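PS: before moving on, it's worth a quick check that passwordless SSH actually works now:
#ssh localhost
#exit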
2 Installation
Download tar
#wget http://sourceforge.net/projects/hadoop.mirror/files/Hadoop%202.4.1/hadoop-2.4.1.tar.gz
unzip
#tar xvzf hadoop-2.4.1.tar.gz
Move to /usr/local/hadoop
#sudo mv hadoop-2.4.1 /usr/local/hadoop
Give ownership of the hadoop folder to the hduser user and hadoop group
#sudo chown -R hduser:hadoop /usr/local/hadoop
Five Major Configurations
1. ~/.bashrc
#sudo vim ~/.bashrc
add
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
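Reload the profile so the new variables take effect in the current shell:
#source ~/.bashrc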
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
#sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
comment out the existing line and add a new one as below
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
3. /usr/local/hadoop/etc/hadoop/core-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/core-site.xml
add
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
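PS: hadoop.tmp.dir points to /app/hadoop/tmp here, so that directory needs to exist and be owned by hduser before formatting HDFS:
#sudo mkdir -p /app/hadoop/tmp
#sudo chown hduser:hadoop /app/hadoop/tmp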
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
#sudo vim /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
add
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
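PS: as far as I know, Hadoop only reads mapred-site.xml and ignores the .template file, so I copy the template first and put the property into the copy:
#cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml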
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
add
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
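PS: the namenode and datanode paths above also need to exist and be owned by hduser:
#sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
#sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
#sudo chown -R hduser:hadoop /usr/local/hadoop_store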
***Format HDFS***
#hadoop namenode -format
Start / Stop Hadoop
Start
#start-all.sh
Stop
#stop-all.sh
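A quick sanity check is jps; on a healthy single node you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager listed:
#jps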
Double Check Features
localhost:50070 - Web UI for NameNode Daemon
localhost:50090 - Secondary NameNode
localhost:50075 - DataNode
List all the ports
#netstat -plten | grep java
3 Testing
Try to use example:
#hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 2 5
You can use dd or echo > to create a test file
#echo "hdfs test" > hdfsTest.txt
Create Directory
#hadoop fs -mkdir -p /user/hduser
Copy File (copyFromLocal)
#hadoop fs -copyFromLocal hdfsTest.txt hdfsTest.txt
Put File
#hadoop fs -put hdfsTest.txt
List Directory/List
#hadoop fs -ls
Display File Content
#hadoop fs -cat /user/hduser/hdfsTest.txt
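Since hdfsTest.txt is now in HDFS, you can also feed it to the bundled wordcount example before deleting it (wcOut is just a made-up output directory name; it must not exist yet):
#hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount hdfsTest.txt wcOut
#hadoop fs -cat wcOut/part-r-00000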
Delete
#hadoop fs -rm hdfsTest.txt
PS: You can always review the results via the Web UI to double-check the transaction log.
Movie Quote:
When you're lost, follow your dreams. They know the way.
- (Begin Again) 2014