Johnny ( Chianing ) Wang : June 2015

Monday, June 22, 2015

Openstack Swift - Ring

In my previous post (openstack -swfit dev box - saio), I show a little bit about Openstack explain the concept of the swift and talk about SAIO ( Swift All In One ). This post I would like to deep further to take a look at the Ring concept in swfit.

Storage Location ( Physical )

But before we start to talk about ring in swift, I would like to talk about storage location first since swift is a software build on top of JBOD. So it is physical disk inside an account, container or object server. Where account databases, container databases, and object are stored.

Zone

For isolated set of storage location. A "zone" could represent a single disk, server, cabinet switch, or even a physical part of the datacenter. If I could use the similar concept, I would say SAN's zoning is pretty much doing the same thing. But in swift, zone present a server "Rack" usually.

PS: Swift system design REGION on top of zone in most of the case, which means each region can has multiple zones.

PS: It might be easier to picture,

zone = rack of servers/drives

region = "server room or data center"

However, this is just example and may be not the true always.

Region > Zone
Server Room or DC (Region) > Rack of Servers(Zone)

Sum

Storage Location: Region (DC) > Zone (Rack)

The Ring ( Virtual )

Last, we would like to discuss how does swift place data ? It's all rely on "Ring".

The Ring determines the storage locations of any given "account database", "container database", or "object". It's a "static" data structure, and locations are deterministically derived.

Where should this object be stored --> The Ring --> Store this object in these location --> disk 1 ... disk m ... disk n

Ring Partition

Concept: Ring > (1:N) > Ring Partition > (1:N) > "A set of Storage Locations" ~ "The Number of Desired Replicas"

A ring is segmented into a number of ring partitions.
Each ring partition maps to "a set of storage locations".
The number of storage locations per ring partition is determined by "the number of desired replicas".

PS: group of directories on disk.

Partition Power

Partition Power determines the total number of ring partitions in the form of 2^n. Generally speaking, larger clusters will require a larger partition power. The optimal partition power is derived from the max number of storage locations in the cluster.

Logic : eg: 50 HDD

50 ( Max Number of Storage Locations = Disk ) * 100 ( Desired Partitions per Storage Location ) / 3 ( Replica Count ) = 1667 ( Target Partitions )

Log2 ( 1667 ) = 10.7 ( Binary Logarithm of Target PArtitions ) ~ 11 ( get integer : Partition Power )

The equation can be look like this:

PS: the binary logarithm of x is the power to which the number 2, must be raised to obtain the value x.
eg: log2 (1024) = 10 since 2 ^ 10 = 1024
eg: 1og2 (2048) = 11 since 2 ^ 11 = 2048

Here is an example to show you how's ring builder commands you can try when your max number of drives setup is 50 and min is 10. Except partition power you can see in commands is 11, there has max per drive is 614 which means if your min drive setup is 10, the max per drive might be 614, vise versa for min per drive is 123 when total drive number is 50.

Calculating Partition Power

Number of partitions / drive:

(2^[partition power] x [Replica Count]) / [Disks] ~ 100 partition/disk

#e.g - 1:

Partition Power = 16
Replicas = 3
Disks = 2000
Number of Partition/Drive : ( 2^16 * 3 ) / 2000 = 98.3

#e.g. - 2

Partition Power = 16
Replicas = 3
Disks = 100 ( each disk has 100 partitions )
Number of Partition/Drive : ( 2^16 * 3 ) / 100 = 1966

if we assume 3TB drives, the following table lists some partition power values and how they would translate to disk count, raw storage, and usable storage once the cluster cluster grown as large as it can ( ~ 100 partition per drive ).

Partition Power	Target Disk Count	Cluster Raw (3 TB/disk)	Cluster Usable (Raw / 3 replicas)
16	1966	5.90 PB	1.97 PB
17	3932	11.80 PB	3.93 PB
18	7864	23.59 PB	7.86 PB
19	15728	47.18 PB	15.73 PB
20	31457	94.37 PB	31.46 PB
21	62914	188.74 PB	62.91 PB

Object Hash

Object Hash is unique identifier for every object stored in the cluster. It is used to map objects to ring partitions.

We can use RESTful API - http put to store an object in swift. When we manipulate the object, the "PATH" is the key since the object path is extracted from the request, and the prefix and suffix salts are added. eg: PREFIX_SALT + '/Account/Container/Object' + SUFFIX_SALT. When the process completed, the MD5 is computed, and the object hash is returned. The hash is calculated from the object's "PATH", not the object's contents. The prefix and suffix salts are configured at the cluster level.

eg: MD5 hash of "johnny", and we would like to store it to object.
d0f59baadadd3349e4a9b2674bcceae8

Like we mentioned, object hash is a mapping between object and ring partitions. Here is an example is using partition power and physical MD5 of string to get your ring partition to map your object.

1. swift only uses the first 4 bytes of the hash.
d0 f5 9b aa da dd 33 49e 4a 9b 26 74 bc ce ae8

2. the first 4 bytes (32 bits) of the hash converted to binary. With a partition power of 11, the top 11 bits are significant.

a. convert each Hex to Binary

eg: d0 = 1101 0000

1101 0000 1111 0101 1001 1101 1010 1010

b. Since the previous setting we assume Max =50 drives and partition power = 11, we collect top 11 bits.

1101 0000 1111 0101 1001 1101 1010 1010

3. The value is bit shifted to the right 21 bits since total bits = 32, then 32 bits - 11 bits ( partition power ) = 21.
0000 0000 0000 0000 0000 0 1101 0000 111

4. Converting the binary value to decimal, we get ring partition 1575.
0000 0000 0000 0000 0000 0110 1000 0111
= 2^0*1 + 2^1*1 + 2^2*1 + 2^7*1 + 2^9*1 + 2^10*1
= 1575

Replicas

For HA purpose, the replicas is required to copy of account databases, container databases, and objects to guaranteed it(replicas) to reside in a different zone.

PS: Commit the write should get 2 quorum acks always, it doesn't need to wait 3rd ack always.

Handoffs - how swift handle failure

Handoff is standby storage locations, ready in the event a primary storage location is full or not available.
It's a locations for any given partition will be primary storage location for other partition.
Handoff is always waiting for failure primary down, then copy to handoff. And when primary back and copy from handoff again and clean handoff.

PS: ring includes 3 primary replica and plus 1 handoff

Swift Replicators

It ensures data is present on all primary storage locations. If a primary storage location fails, data is replicated to handoff storage locations.
Handoff storage locations participate in replication in a failure scenario.
Replication is pushed based.

Swift Auditors

It identifies corrupted data, and moves it into quarantine. Uses MD5 hashes of the data, which are stored in the XFS xattrs of each files. Replicators restore a good copy from other primary nodes. Replication is pushed base.

PS: XFS vs EXT4, XFS is the best for SWIFT now.

Sum

Cluster
- Region ( Data Center )
- Zone ( Rack )
- Node ( Server )
- Disk ( Drive )

Reference:

https://rackerlabs.github.io/swift-ppc/
https://github.com/openstack/swift

Tuesday, June 9, 2015

How to install OpenStack kilo on VirtualBox via DevStack

In my previous post, I share How to Install OpenStack kilo on VirtualBox via DevStack. I aware the openstack release kilo at early of 2015. Here, I would like to share how to set up the newest openstack release - kilo. All the steps I took is exactly same with juno version as my previous post.

I skip virtualbox setup and jump into the kilo installation first.

# sudo apt-get update
# sudo apt-get upgrade
# sudo apt-get dist-upgrade
# sudo reboot
# sudo apt-get install git
# git clone -b stable/kilo https://github.com/openstack-dev/devstack.git

I was trying to customize the installation via local.conf but it's not working properly in kilo. just fyi the cli as below. You can skip if you have concern.

But, if you remove local.conf you just have default installation which includes Nova, Glance, Cinder, Horizon and Keystone.

For other modules, such as Swift, ~~Neutron~~, Heat, Celometer, Tempest ... etc won't be there.
-------------------
# cd devstack
# wget https://dl.dropboxusercontent.com/u/44260569/local.conf
# wget https://dl.dropboxusercontent.com/u/44260569/localrc_kilo
~~# mv localrc_kio localrc~~

-------------------

PS: I comment neutron in local.conf first since it fail on creating private IPv4 subnet. I'm not sure what's happen. I'm working on it right now. If you know how to solve it, please drop any comment.
Ref: https://bugs.launchpad.net/devstack/+bug/1417410

# ./stack.sh

The installation is done and display the login credential in output.

# source openrc

try display nova

# nova list

try horizon

First awareness from my observation is menu bar is different from juno which is all located at left hand side.

try to display system information

w/o local.conf

with local.conf( w/ neutron, celometer, heat and swift )

try create a vm.

display vm result

Last, here is the youtube video fyi. Again, this is only general installation w/o additional components such as swift, neutron ...etc.

Reference:

https://github.com/openstack-dev/devstack/tree/stable/kilo

https://www.openstack.org/software/kilo/

https://www.openstack.org/software/juno/

https://ask.openstack.org/en/question/44505/devstack-install-gateway-is-not-valid-on-subnet/

Monday, June 8, 2015

Storage Performance Evaluation using Tools

I mentioned about the storage performance evaluation concept in previous blog post, and introduce 3 tools I familiar with to the people. Today, I would like to share more detail regarding these three tools. Hopefully this post can help people who would like to star the storage performance evaluation.

===dd===

dd can be used for simplified copying of data at the low level, and giving the brief. In doing this, device files are often access directly. Since additional queries do not arrive while the device is being accessed, erroneous usage of dd can quickly lead to data loss. I recommend performing the steps described below on test systems.

Modern OS do not normally write files immediately to RAID systems or HDD (JBOD). Temporary memory will be used to cache writes and read. I/O performance measurements will not be effected by these caches, the oflag parameter can be used. Here are couple oflag examples.

direct - use direct I/O for data

dsync - use synchronized I/O for data

sync - likewise, but also for metadata (i-node)

Throughput (streaming I/O)

ceph@ceph-VirtualBox:~/dd$ dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 32.474 s, 33.1 MB/s

Clean the cache

ceph@ceph-VirtualBox:~/dd$ dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=sync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 123.37 s, 8.7 MB/s

Latency

ceph@ceph-VirtualBox:~/dd$ dd if=/dev/zero of=/root/testfile bs=512 count=1000 oflag=direct
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 0.36084 s, 1.4 MB/s

Clean the cache

ceph@ceph-VirtualBox:~/dd$ dd if=/dev/zero of=/root/testfile bs=512 count=1000 oflag=sync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 11.1865 s, 45.8 kB/s

===fio===

fio is an I/O tool meant to be used both for benchmark and stress/hardware verification. It has support for 19 different types of I/O engines ( sync, mmap, libaio, posixaio, SG v3, splice, null, network, syslet, guasi, solarisaio, and more ), I/O priorities (for newer Linux kernels), rate I/O, forked or threaded jobs, and much more.

It can work on block devices as well as files, fio accepts job descriptions in a simple-to-understand text format. Several example job files are included. fio displays all sorts of I/O performance information, including complete IO latencies and percentiles. fio is in wide use in many places, for both benchmarking, QA, and verification purpose. It support Linux, FreeBSD, NetBSD, OpenBSD, OS X, OpenSolaris, AIX, HP-UX, Android, and Windows.

As current most the ubuntu version, fio is default package pre-installed. Even though, I still listed the required packages for setting up the fio environment.

Other than ubuntu, I realized RHEL might be different. such as the requirements as below might be needed for setup the test environment.

---RHEL Installation---

In a Virtual Machine, install and configure the fio.

# yum install libaio
# yum install blktrace
# yum install fio
or
# (rpm -ivh fio-2.1.11-1.el7.x86_64-2.rpm) - option

ps: If you need the btrace for debugging performance, install the blktrace package to get the btrace utility.

---Ubuntu Installation---

apt_upgrade: true

1. packages: ksh, fio ,sysstat
#sudo apt-get install ksh
#sudo apt-get install fio
#sudo apt-get install sysstat

or you can

2. download fio

--- How fio works---

Two ways to trigger fio run, 1. command line and 2, job file

1. command line, eg:

#fio --name=global --rw=randread --size=129M --runtime=120
or
#fio --name=random-writers --ioengine=libaio --iodepth=4 --rw=randwrite --bs=32k --direct=0 --size=64m --numjobs=4

2. Job file

The first step in getting fio to simulate a desired IO workload, is writing a job file describing that specific setup. A job file may contain any number of threads and/or files - the typical contents of the job file is a global section defining shared parameters, and one or more job sections descripbing the jobs involved. When run, fio parses this file and sets everything up as described. If we break down a job from top to bottom, it contains the following basic parameters as below:
command line directory
[job1]
[job2]

Same command with previous cli exampe
fio --name=random-writers --ioengine=libaio --iodepth=4 --rw=randwrite --bs=32k --direct=0 --size=64m --numjobs=4

equal

[random-writers]
ioengine=libaio
iodepth=4
rw=randwrite
bs=32k
direct=0
size=64m
numjobs=4

3. command line combine with Job file

Here is a example for general outline for fio configuration file, but like what's we explain, the cli can be combined with Job file and cli can overwrite Job file configuration if there has any need.

eg: sample
#fio --name=global --rw=randread --size=128m --name=job1 --name=job2
; -- start job file --
[global]
rw=randread
size=128m
[job1]
[job2]

eg: real example
#fio --name=global --rw=randread --size=128m --name=job1 --name=job2
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
job2: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 2 processes
job1: Laying out IO file(s) (1 file(s) / 128MB)
job2: Laying out IO file(s) (1 file(s) / 128MB)
Jobs: 2 (f=2): [rr] [11.9% done] [427KB/0KB/0KB /s] [106/0/0 iops] [eta 07m:33s]

===vdbench===

Vdbench is a command line utility specifically created to help engineers and customers generate disk I/O workloads to be used for validating storage performance and storage data integrity. Vdbench execution parameters may also specified via an input text file.

It is open source and is provided by Oracle. You can download the code from here. http://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html

Here are the general steps you can make the vdbench working on your box.
It includes all the required binary for both windows and linux.

download vdbench*.zip to your box
unzip vdbench*.zip
setup environment

a: windows needs java
b: linux need csh, java

prepare parmfile
vdbench test - #./vdbench -t or #./vdbench -f
run vdbench with parameter file

---run with parmfile---

Run command
-f : with parmfile location
-o : with output log directory
#./vdbench -f parmfile -o ./output

PS: You can do a very quick simple test without even having to create a parameter file:
# ./vdbench –t (for a raw I/O workload)
# ./vdbench -tf (for a file system workload)

---run with dymanic lun w/ cli---

Variable substitution allows you to code variables like $lun in your parameter file which then can be overridden from the command line. For example: sd=sd1,lun=$lun, $lun must be overridden from the command line: ./vdbench -f parmfile lun=/dev/x.

In case your parameter file is embedded in a shell script, you may also specify a '!' to prevent accidental substitution by the scripting language, e.g. sd=sd1,lun=!lun

---remote to slave ---

vdbench usually run controller against with multiple slaves, each slaves can attached single or multiple volumes depends on design.

---Sum---

How 's vdbench controller to communicate slaves ? It's using shell and shell can be rsh, ssh or vdbench ( the slave have to run daemon first via command line ./vdbench rsh )
What's kinds of workload that vdbench can provided ? It can provide raw Disk IO and File System workload.
Not only the general workload, vdbench can provide deduplication and compression workload which provide you more info regarding your disk performance in specific circumstance. eg: dedupratio=2,dedupunit=4k,dedupsets=33 in parmfile

raw disk IO

---hd: host definition---

This is how you design your master/slaves architecture.

eg: general in default and each row
hd=default,user=ubuntu,shell=vdbench,vdbench=/home/ubuntu/vdbench
hd=fio-driver-1,system=192.168.2.74
...
eg: each row (slaves) has all unique user name for ssh connection.
hd=drive-1,system=192.168.2.74,user=ubuntu,shell=ssh
hd=drive-2,system=192.168.2.75,user=ubuntu,shell=ssh
...
eg: general in default and specific slaves in each row
host=default,user=ubuntu,shell=ssh
hd=one,system=192.168.1.30
host=(localhost,one)

PS: Remote Raw Disk IO Working - add root Privilege w/o password for vdbench
#sudo visudo
add ceph ALL=NOPASSWD: /home/ceph/vdbench/shutdown -f parmfile at the last line
ctrl-X + enter for exit

---sd: storage definition---

sd : storage definition (use any: sd1, sd2 ...sdtest...) lun=/dev/vdb : I use RAW device (that mounted from storage, create LUN or Volume on Storage system and mount it to testing server. There are many kind of storage if you want to stress, disk, raw device, file system etc.)

threads: maximum number of concurrent outstanding I/O that we want to flush.

PS: 'seekpct=nn': Percentage of Random Seeks

---wd: workload definition (use any)---

xfersize: data transfer size (1M,70, 10M, 30): Generate xfersize as a random value between 1 Megabyte and 10 Megabyte with weight for random value is 70%.
rdpct: read percentage (70% is read and 30% is write).

---rd: run definition (use name any)---

iorate=max: Run an uncontrolled workload. (iorate=100 : Run a workload of 100 I/Os per second)
elapsed: time to run this test (second)
interval: report interval to your screen in second.

Sum as eg:
sd=sd1,lun=/dev/vdb,openflags=o_direct,threads=200
wd=wd1,sd=sd1,xfersize=(1M,70,10M,30),rdpct=70
rd=run1,wd=wd1,iorate=max,elapsed=600,interval=1

File System IO

---fsd: file system definition---

fsd=fsd1,anchor=/home/ubuntu/vdbench/TEST,shared=yes,depth=1,width=8,files=4,size=8k

or

fsd=fsd1,anchor=/media/20GB/TEST,depth=1,width=8,files=4,size=8k

---fwd: file workload definition---

fwd=fwd1,fsd=fsd1,operation=read,xfersize=4k,fileio=sequential,fileselect=random,threads=4
rd=rd1,fwd=fwd1,fwdrate=100,format=yes,elapsed=10,interval=1

*anchor: starting point for generate the files
*depth: directory levels
*width: folders number in the directory
*files: files under the folder
*size: each file size
*operation: mkdir, rmdir, create, delete, open, close, read, write, getattr and setattr
*xfersize: data transfer size
*fileio: sequential, random
*fileselect: how to select file name, directory name
*threads: how many concurrent threads

*fwdrate: how many file system operations per second, eg: 100: run a workload of 100 operations per second.
*format: during run, if needed, crate the complete file structure no

PS: Vdbench will first delete the current file structure and then will create the file structure again. It will then execute the run you requested in the current RD.

*interval: per second

PS: there has a pdf manual to explain how to run the vdbench. It's a good resource to guide you through the execution.

---compression and deduplication---

compression

Compratio=n
Ratio between the original data and the actually written data,
e.g.compratio=2 for a 2:1 ratio. Default: compratio=1

deduplication

Data Deduplication is built into Vdbench with the understanding that the dedup logic included in the target storage device looks at each n-byte data block to see if a block with identical content already exists. When there is a match the block no longer needs to be written to storage and a
pointer to the already existing block is stored instead. Since it is possible for dedup and data compression algorithms to be used at the same time, dedup by default generates data patterns that do not compress.

Ratio between the original data and the actually written data, eg: dedupratio=2 for a 2:1 ratio. Default: no dedup, or dedupratio=1

The size of a data block that dedup tries to match with already existing data. Default dedupunit=128k

How many different sets or groups of duplicate blocks to have. See below. Default: dedupsets=5% (You can also just code a numeric value, e.g. dedupsets=100)

For a File System Definition (FSD) dedup is controlled on an FSD level, so not on a file level.

These blocks are unique and will always be unique, even when they are rewritten. In other words, a unique block will be rewritten with a different content than all its previous versions.

All blocks within a set are duplicates of each other. How many sets are there? I have set the default to dedupsets=5%, or 5% of the estimated total amount of dedupunit=nn blocks.

A 128m SD file, for 1024 128k blocks. dedupratio = 2 and dedupset = 5%

dedupratio = 2 = 50% = 1024 * 0.5 = 512
unique blocks = 512 - 51 = 461

Dedupratio=2 ultimately will result in wanting 512 data blocks to be written to disk and 512 blocks that are duplicates of other blocks.

all read and write operations must be a multiple of that dedupunit= size. dedupunit = 8k, all data transfer will be 8k, 16k, 25k ... etc

Of course, if 48 hours is not enough elapsed time Vdbench will terminate BEFORE the last block has been written.

PS: the deduplication work load is fix chunk size but it provides two different patterns, Unique blocks and Duplicate blocks.

There are ‘nn’ sets or groups of duplicate blocks.
Example:

dedupset = 1024 * 0.05 = 51 sets
duplicate blocks = 512 + 51 = 563
total = duplicate + unique = 563 + 461 = 1024
dedupset=5%; There will be (5% of 1024) 51 sets of duplicate blocks.

===Sum of Performance Evaluation===

For measuring write performance, the data to be written should be read from /dev/zero and ideally written it to an empty RAID arrayy, hard disk or partition (such as using of=/dev/sda for the first HDD or of=/dev/sda2 for the 2nd partition on the first HDD.) If this is not possible, a normal file in the file system (such as using of=/root/testfile) can be written. PS: You should only use empty RAID arrays, HDD or partitions.

When using if=/dev/zero and bs=1G, OS will need 1GB of free space in RAM. In order to get results closer to real-life, performin the tests described several times(3 ~ 10 times)

Last, below figure shows an example conceptually of the type of graph the performance evaluation should end up with if you plot IOPS on X- and Latency on Y. The saturation point is the point where latency increases exponentially with additional load. The inflection point in the right hand portion of figure is the performance saturation point.

Here is a general phase of a performance test plan should have. This is just an example but give us a overview of key tasks with time should includes.

Reference:

https://www.thomas-krenn.com/en/wiki/Linux_I/O_Performance_Tests_using_dd
https://github.com/axboe/fio/blob/master/HOWTO
https://community.oracle.com/community/server_%26_storage_systems/storage/vdbench
http://info.purestorage.com/rs/purestorage/images/IDC_Report_AFA_Performance_Testing_Framework.pdf
http://www.brendangregg.com/sysperfbook.html