Sunday, February 8, 2015

Docker POC Lab Ubuntu 14.04 via VirtualBox

I heard about Docker when I was at PuppetConf 2014. There was a session about Docker and Puppet working together. I already knew about application virtualization from my work on VM image deduplication, but I was surprised it came to market so early and turned out so popular. I would like to share the Docker Proof of Concept Lab I built on an Ubuntu VM via VirtualBox. It really shows how easily DevOps can adopt the CI/CD concept. Here I list the major sections of this post and hope you enjoy it.

Outline

  • 1. What's Docker.
  • 2. Installation
  • 3. Permission
  • 4. Docker Commands
  • 5. More Run Commands
  • 6. Build Docker Image

1. What's Docker

Docker is an open-source tool that makes creating and managing Linux containers easy. Here is the standard statement from Docker: "Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Consisting of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows, Docker enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments. As a result, IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud." I quote the Docker explanation for people who are already familiar with VMs; I think it makes it easier to get the point!


What's a Linux container?

Containers are like extremely lightweight VMs (virtual machines) - they allow code to run in isolation from other containers but safely share the machine's resources, all without the overhead of a hypervisor.
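
A quick way to see this sharing for yourself (a tiny sketch, assuming Docker and the busybox image from later in this post are available): the kernel reported inside a container is the host's kernel, which a hypervisor-based VM would never show.
check the kernel on the host
#uname -r
check the kernel inside a throwaway container
#docker run --rm busybox uname -r
both commands print the same kernel version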

Linux Containers (LXC)

LXC combines "cgroups" and "namespace" support to provide an isolated environment for applications. Docker can also use LXC as one of its execution drivers, enabling image management and providing deployment services.

Docker Registry

DOCKER REGISTRY - REPOSITORIES OF DOCKER IMAGES - We need a disk image to make the virtualization work. The disk image represents the system we're running on. A Docker registry is a collection of existing images that we can use to run and create containerized applications.

Why set up via Ubuntu first?

Ubuntu Trusty comes with a 3.13.0 Linux kernel, and a docker.io package which installs Docker 0.9.1 and all its prerequisites from Ubuntu's repository.

Ubuntu (and Debian) contain a much older KDE3/GNOME2 package called docker, so the package and the executable are called docker.io.

2. Installation

Generally, I follow the instructions here (https://docs.docker.com/installation/ubuntulinux/). apt-get update downloads the package lists from the repositories and "updates" them to get information on the newest versions of packages and their dependencies, synchronizing the package index files defined in /etc/apt/sources.list.

preparation

#sudo apt-get update
#sudo apt-get install docker.io

To make the shell easier to use, we need to create a soft link since /usr/local/bin is for normal user programs not managed by the distribution package manager. The following command overwrites the link (/usr/local/bin/docker):

#sudo ln -sf /usr/bin/docker.io /usr/local/bin/docker

To enable tab-completion of Docker commands in BASH, either restart BASH or:
#source /etc/bash_completion.d/docker.io

check docker process and docker version
#ps aux | grep docker

#docker -v
#sudo docker.io version


install the latest version

To install the latest version, we need to add the Docker repository key to our local keychain:
#sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 36A1D7869245C8950F966E92D8576A8BA88D21E9

Add the Docker repository to our apt sources list:
#sudo sh -c "echo deb http://get.docker.io/ubuntu docker main > /etc/apt/sources.list.d/docker.list"

check
#cat /etc/apt/sources.list.d/docker.list

Then, update and install the lxc-docker package:
#sudo apt-get update

install
#sudo apt-get install lxc-docker

check docker version and info again, now you see it is 1.4.1
#sudo docker -v
#sudo docker info


To verify that everything has worked as expected, we can check whether Docker downloads the ubuntu image and then starts bash in a container:
#sudo docker run -i -t ubuntu /bin/bash
or
#sudo docker run -it ubuntu /bin/bash

you will find you are logged into the latest ubuntu image as root
log out
#exit

We were able to start bash in a container. Now we can check that the ubuntu image is there:
#sudo docker images

3. Permission

Without the right permissions, running docker as a normal user will fail with an error message, so let's fix that.

check docker group first
#cat /etc/group

add user to docker group
#sudo gpasswd -a devdocker docker
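
To double-check that the user really landed in the docker group (devdocker is just the example user name above):
#id devdocker
you should see docker listed among the groups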

Restart the Docker daemon:
#sudo service docker restart

log out and log back in so the group change takes effect
#exit

***PS: the Docker daemon still runs as root; adding the user to the docker group just lets you use the docker CLI without sudo***

4. Docker Commands

If you are a beginner, you can just type docker and it will list all the commands for you, or you can go to the Docker site to study those commands. (https://docs.docker.com/reference/commandline/cli/)
#docker


docker search

#docker search --help
#docker search -s 10 jenkins
If we get too many results, the -s flag filters them to show only items with at least 10 stars:
#docker search -s 10 tomcat

docker pull

The pull command goes out to the registry, grabs the image, and downloads it to our local machine.
#docker pull --help
#docker pull ubuntu

docker images

To check which Docker images are available on our machine, we use docker images:
#docker images
Try downloading CentOS
#docker pull centos:latest
check again
#docker images

docker rmi ( remove images )

remove an image from our local machine via its image id
#docker rmi 4986bf8c1536
remove all images
#docker rmi $(docker images -q)
PS: force remove images

[root@xxx ~]# docker images
REPOSITORY                    TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
hello-world                   latest              0a6ba66e537a        5 months ago        960 B
localhost:5000/myfirstimage   latest              0a6ba66e537a        5 months ago        960 B

[root@xxx ~]# docker rmi -f 0a6ba66e537a
Untagged: hello-world:latest
Untagged: localhost:5000/myfirstimage:latest
Deleted: 0a6ba66e537a53a5ea94f7c6a99c534c6adb12e3ed09326d4bf3b38f7c3ba4e7
Deleted: b901d36b6f2fd759c362819c595301ca981622654e7ea8a1aac893fbd2740b4c

docker run

#docker run --help
check current local machine os
#cat /etc/issue

docker run centos - This will create a container based upon the image and execute the /bin/bash command. Then it takes us into a shell in that container where we can continue to do things:
#docker run -it centos:latest /bin/bash
now you are in the centos bash, log out from your centos
#exit

try busybox - "stripped-down Unix tools in a single executable file" - very tiny linux distribution, it runs in a variety of POSIX environments such as Linux, Android, FreeBSD, etc. 
#docker run -it busybox sh

***you can't change the command a container was created with***
for example, echo Johnny Wang via docker run -it:
#docker run -it busybox echo 'Johnny Wang'

docker just runs the echo command and exits; you can see the status via docker ps -a.
Even if you restart it and try to attach to it, it will fail, since the container terminates right after running the echo command.
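
A minimal sketch of that behavior (the container id is a placeholder - use the one docker ps -a shows you):
#docker run -it busybox echo 'Johnny Wang'
#docker ps -a
the STATUS column shows the container already exited
#docker restart <container id>
#docker attach <container id>
the attach fails (or detaches immediately) because the container just re-ran echo and exited again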

***PS: if you run -it with an image you don't have locally, docker will download the image automatically***

docker list container - ps

show running containers
#docker ps
show all containers, including stopped ones
#docker ps -a

docker restart

we can restart the container via container id
#docker restart 6edc94951379 (container id)

docker attach

we can attach to a running container
#docker attach 6edc94951379 (container id)

docker remove

#docker rm 6edc94951379 (container id)
remove all containers
#docker rm $(docker ps -aq)

remove the container automatically after every run with -it --rm (one-time execution)
#docker run -it --rm busybox sh

#docker run -it --rm busybox echo "I will not staying around forever"
PS: afterwards there is no leftover container. By passing in --rm, the container is automatically deleted once it exits.

docker kill

when you run a container that won't stop on its own via docker run -t
#docker run -t busybox sh
you can attach to it but nothing happens there (no stdin was allocated), and now you would like to stop it.
#docker kill fd18db6a9223 (container id)
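
As a side note, docker stop is the gentler option: it sends SIGTERM first and only falls back to SIGKILL after a grace period, while docker kill sends SIGKILL right away. A small sketch with the same placeholder container id:
#docker ps
#docker stop fd18db6a9223
or, if it refuses to die
#docker kill fd18db6a9223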



5. More Run Commands

run -v to share files with the local machine

binding a folder on our local machine (/home/k/myDocker) to a folder in the Docker container (/k) so that they can share files: ***-v /home/k/myDocker:/k busybox***

#docker run -it --rm -v /home/k/myDocker:/k busybox sh
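
A quick check that the binding works (a sketch, reusing the /home/k/myDocker path from above):
on the host, drop a file into the shared folder
#echo hello > /home/k/myDocker/test.txt
inside the container started with -v /home/k/myDocker:/k
#cat /k/test.txt
it prints hello - the same file is visible on both sides, and changes made in the container show up on the host too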

run -p to map a container port to the host

Mapping a container port to the host: expose port 80 of nginx on an auto-assigned host port.
#docker run -d -p 80 nginx
You can check the auto-assigned host port via the docker ps command.
Then check the web app via browser.
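
For example (a sketch - the auto-assigned port below is made up, yours will differ):
#docker ps
the PORTS column shows something like 0.0.0.0:49153->80/tcp
or ask docker directly for the mapping of container port 80
#docker port <container id> 80
then hit it from the host
#curl http://localhost:49153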

run on a specific host port on request
#docker run -p 8099:80 -d nginx
Check the web app via the specific port.
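
For instance, from the host shell (assuming nothing else is using port 8099):
#curl http://localhost:8099
you should get the nginx welcome page back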

run -e to set an environment variable

run -e DOCK_VAR=devdocker
#docker run -it --rm -e DOCK_VAR=devdocker busybox sh
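
Inside the container you can confirm the variable is set:
#echo $DOCK_VAR
devdocker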


6. Build Docker Images - MongoDB demo

There are two basic ways to build your own Docker image. One is to download an image from the Docker registry and run it as a container; you can then customize the container and either push it back to the registry or store it in your own repository. The demos above have already shown how to download an image and run a container.
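
The command behind that first way is docker commit: run a container, change it, and save the result as a new image. A rough sketch (the repository and image names below are made up):
#docker run -it ubuntu /bin/bash
install or configure whatever you need inside the container, then exit
#docker ps -a
#docker commit <container id> myrepo/mycustomimage
#docker images
to share it, log in and push it to the registry
#docker login
#docker push myrepo/mycustomimage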

The other is to use a Dockerfile as a script to customize and build your Docker image automatically.


docker build

Each Dockerfile is a script, composed of various commands (instructions) and arguments listed successively, that automatically performs actions on a base image in order to create (or form) a new one. A Dockerfile begins by defining the image FROM which the build process starts. Here is the basic command.

#vim Dockerfile

#cat Dockerfile
FROM ubuntu

#docker build -t (image name) .

***watch out for the "." at the end - it tells docker build to use the current directory as the build context.***

For me, a Dockerfile is more like a script for building an image, a similar concept to a DSL. There are various other instructions, commands, and arguments (or conditions) which, in return, produce a new image to be used for creating Docker containers. You can find the scripting details in this link. (https://docs.docker.com/reference/builder/)
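
Just to make the DSL idea concrete, here is a small illustrative Dockerfile (not the MongoDB one used below) with a few of the common instructions - a base image, a package install, an exposed port, and a default command:

FROM ubuntu:14.04
RUN apt-get update && apt-get install -y nginx
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]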

Here I would like to show a demo that uses a Dockerfile to build a MongoDB image automatically. I actually follow the steps on this site to simulate the process. (https://www.digitalocean.com/community/tutorials/docker-explained-using-dockerfiles-to-automate-building-of-images)

First of all, you need to prepare a Dockerfile for MongoDB. You can find a copy in the link above.

#sudo vim Dockerfile
#docker build -t my_mongodb .

#docker run --name my_first_mdb_instance -i -t my_mongodb

If you would like to reuse the image, you can run it again and assign another container name to it, or restart the existing container.

Check the container id
#docker ps -a

Restart the container
#docker restart 46fd9933236d

Run another container
#docker run --name my_second_mdb_instance -i -t my_mongodb

PS: Even though Docker fully leverages Linux containers, Docker can still run on Windows by leveraging a VM. Here is the link from the standard Docker doc site (https://docs.docker.com/installation/windows/)

Last, I must share this funny pic with you. It is the best picture to describe what Docker can do. Look at the last one, Leo's face is so confused!

Why use Docker (containers)?


  • Application-centric management: raises the level of abstraction from running an OS on virtual hardware to running an application on a virtual OS; the simplicity of PaaS with the flexibility of IaaS.
  • Dev and Ops separation of concerns: provides separation of build and deployment; decouples applications from infrastructure.
  • Agile application creation and deployment: increased ease and efficiency of container image creation compared to VM image use.
  • Continuous development, integration, and deployment: facilitates reliable, frequent container image build/deployment with quick and easy rollbacks; image immutability.
  • Loosely coupled, distributed, elastic, liberated micro-services: applications are broken into smaller, independent pieces that can be deployed and managed dynamically - not a fat monolithic stack running on one big single-purpose machine.
  • Environmental consistency across development, testing, and production: runs the same on a laptop as it does in the cloud.
  • Cloud and OS distribution portability: distros (Ubuntu, RHEL, SUSE, ...) and clouds (AWS, Google Container Engine, Azure, Rackspace, ...).
  • Resource isolation: predictable application performance.
  • Efficient resource utilization: high efficiency and density.

Movie Quote: 

Nobody knows where we will go; just follow your instinct.

Cobb: You're waiting for a train. A train that'll take you far away. You know where you hope this train will take you. But you can't know for sure. Yet it doesn't matter. Now, tell me why?
Mal: Because you'll be together! - (Inception) 2010

Your dream always awaits you. It's up to you to decide when you'll go for it. - (Chef), 2014

Wednesday, February 4, 2015

HDFS (Hadoop 2.4.1) POC Lab on Ubuntu 14.04 via VirtualBox - Note

I was always thinking about setting up HDFS to understand the logic better, rather than just reading the papers. I remember my friend always told me the best way to learn a new skill is to "Try It".

Regarding Hadoop, there are three major papers that lay out the whole concept, and they were all actually published by Google folks.
  1. The Google File System (2003)
  2. MapReduce: Simplified Data Processing on Large Clusters (2004)
  3. Bigtable: A Distributed Storage System for Structured Data (2006)
PS: These papers are very easy to google and download in pdf format.

Splitting a file into 64MB blocks and saving them on distributed nodes is similar to my research topic (data deduplication), but serves a different purpose. Before we jump in deep, let me give a rough idea of the Distributed File System concept.


Distributed File System : eg GFS/HDFS

Features

  • Global File Namespace
  • Redundancy
  • Availability

Typical Usage Pattern


  • Huge files (100s of GB to TB)
    • Data keeps being appended, but the whole file content is never removed or replaced
    • Data is rarely updated in place
    • Reads and appending updates are common
  • Data is kept in "chunks" spread across machines
  • Each chunk is replicated on different machines - multiple copies (at least two) - to ensure persistence and availability

PS: replicas are never placed on the same chunk server
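
A quick back-of-the-envelope example with made-up numbers: a 1 GB file split into 64 MB chunks gives 16 chunks; with 3x replication that is 48 chunk copies spread across different chunk servers (and ideally different racks).
#echo $(( 1024 / 64 ))
16 chunks
#echo $(( 1024 / 64 * 3 ))
48 chunk copies in total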


Architecture:

Chunk Server (datanode)

  • File is split into contiguous chunks ( 16 ~ 64 MB )
  • Each chunk is replicated (2x ~ 3x) across different racks

Master Node (aka namenode in HDFS)

  • Store "Metadata" about where files are stored (might be replicated as well)

Client has "Library" for file access

  • Talks to master to find chunk servers
  • "Connects directly" to chunk server to access data




Interestingly, I notice that a distributed file system is sort of like a Dropbox-like sync solution (client file sync and share), since the basic concept is similar.

Furthermore, I would like to give my personal thoughts on how it differs from deduplication, since that was my research topic before.

Here are the bullet items for the feature differences, as I understand them.

  • First of all, a Dropbox-like sync client (file sync and share) doesn't care about replicas, since its backend repository (such as Amazon S3) usually takes care of replication for HA, but a distributed file system does care, since replication is its major HA feature.
  • Second, a Dropbox-like sync client pulls data from all kinds of endpoint devices but pushes it all to a single endpoint (e.g. Amazon S3 or OpenStack Swift), while a distributed file system pulls from multiple nodes, racks, or clusters connected by GB switches.
  • Third, the metadata in the master (name) node keeps the address (location / directories) of each chunk, while a Dropbox-like sync client, being a single node, usually just uses a hash as the key reference.
  • Last, let's talk about MapReduce and dedup here.
  • The MapReduce concept is similar to dedup in that both rely on a key to remove redundant values from the data stream. However, MapReduce is used because of the distributed file system's HA design and the need to reduce collected data to unique results, while dedup is used to keep a single copy of each data chunk and reduce disk space. To sum up, MapReduce collects data from servers, reduces duplicated information, and lands the result on the client. Dedup is half the opposite: it reduces the input written to disk from the clients (nodes) and ends up with unique chunks in the repository (server).

I would like to play around to see what else I can learn from this hot topic. Here I note the steps I used to set it up and share them. I outline the steps in a couple of sections, which might make it easier for people to follow along.
  1. Preparation
    1. Install Ubuntu 14.04 VirtualBox VM
    2. Install Java
    3. Add UserGroup and User
    4. SSH setup and Key ( Single Node on VM doesn't matter )
  2. Installation
    1. Download tar
    2. unzip
    3. configuration
    4. format hdfs
    5. Start all or Stop all
    6. Double Check Features
  3. Testing
    1. Create Directory
    2. Copy File (copyFromLocal)
    3. Put File
    4. List Directory/List
    5. Display File Content
    6. Delete

1 Preparation

Ubuntu 14.04 VirtualBox VM



The easy way I did it is to prepare a fresh Ubuntu installation VM and export an OVA for various kinds of testing purposes. I import the OVA and change the hostname for each different installation / testing lab.

PS: remember to update your apt-get
#sudo apt-get update


Install Java



#sudo apt-get install default-jdk


Add UserGroup and User for Hadoop

#sudo addgroup hadoop
#sudo adduser --ingroup hadoop hduser

SSH setup and Key ( Single Node on VM doesn't matter )
#sudo apt-get install ssh

Generate Key

#su hduser
#ssh-keygen -t rsa -P ""
#cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
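
A quick check that the key really allows passwordless login (accept the host key the first time):
#ssh localhost
#exit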

2 Installation

Download tar

#wget http://sourceforge.net/projects/hadoop.mirror/files/Hadoop%202.4.1/hadoop-2.4.1.tar.gz

unzip

#tar xvzf hadoop-2.4.1.tar.gz



Move to /usr/local/hadoop
#sudo mv hadoop-2.4.1 /usr/local/hadoop 

Change the owner user and group of the hadoop folder
#sudo chown -R hduser:hadoop /usr/local/hadoop

Five Major Configurations

1. ~/.bashrc

#sudo vim ~/.bashrc



add

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
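
Reload the shell configuration so the new variables take effect in the current session, and sanity-check one of them:
#source ~/.bashrc
#echo $HADOOP_INSTALL
/usr/local/hadoop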








2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh

#sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh



comment out the existing line and add the new one as below

#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64


3. /usr/local/hadoop/etc/hadoop/core-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/core-site.xml

add 
<configuration>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>

 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
 </property>
</configuration>
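
Note that hadoop.tmp.dir points at /app/hadoop/tmp, which doesn't exist on a fresh VM. Assuming you keep that path, create it and hand it to hduser:
#sudo mkdir -p /app/hadoop/tmp
#sudo chown hduser:hadoop /app/hadoop/tmp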



4. /usr/local/hadoop/etc/hadoop/mapred-site.xml (copy it from the template first)
#sudo cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/mapred-site.xml

add
<configuration>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
 </property>
</configuration>



5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
#sudo vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml

add
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
 </property>
</configuration>
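
Same idea here: the namenode and datanode directories must exist and be owned by hduser (assuming the /usr/local/hadoop_store paths above):
#sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
#sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
#sudo chown -R hduser:hadoop /usr/local/hadoop_store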



***Format HDFS***

#hadoop namenode -format

Start / Stop Hadoop

go to the sbin folder
#cd /usr/local/hadoop/sbin

Start
#start-all.sh

Stop
#stop-all.sh
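
After start-all.sh, jps (shipped with the JDK we installed earlier) is a quick way to confirm the daemons came up:
#jps
you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager listed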


Double Check Features

localhost:50070 - Web UI for the NameNode daemon


localhost:50090 - Secondary NameNode


localhost:50075 - Data Node


List all the ports
#netstat -plten | grep java



3 Testing

Try to use example:
#hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 2 5

you can use dd or echo > to create a file
#echo "hdfs test" > hdfsTest.txt

Create Directory

#hadoop fs -mkdir -p /user/hduser


Copy File (copyFromLocal)

#hadoop fs -copyFromLocal hdfsTest.txt hdfsTest.txt


Put File

#hadoop fs -put hdfsTest.txt


List Directory/List

#hadoop fs -ls

Display File Content

#hadoop fs -cat /user/hduser/hdfsTest.txt


Delete

#hadoop fs -rm hdfsTest.txt

PS: You can always review the results via the Web UI to simply double-check the transaction log.


Movie Quote: 

When you're lost, follow your dreams. They know the way.
- (Begin Again) 2014