Saturday, March 4, 2017

Basic Machine Learning ( Linear Regression Model ) for a Deduplication Dataset

My First Deep Learning Practice: Adopting a Linear Regression Model

Environment Preparation.

Prepare Three Dependencies

# pandas lets us read the data set
$ pip install pandas

# scikit-learn is the machine learning library we use for regression
$ pip install -U scikit-learn

# matplotlib lets us visualize the model and the data
$ brew install freetype
$ brew install libpng
$ git clone git://github.com/matplotlib/matplotlib.git
$ cd matplotlib/
$ python setup.py install

Here is a quicker way to get the preparation done:
$ cat requirements.txt
matplotlib==1.5.1
numpy==1.11.0
pandas==0.18.0
scikit-learn==0.17.1

Then

$ pip install -r requirements.txt
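Optionally, you can confirm the installed versions match requirements.txt. A minimal sketch (the file name check_versions.py is just a suggestion):

$ cat check_versions.py
# print the versions of the four dependencies installed above
import matplotlib
import numpy
import pandas
import sklearn

print("matplotlib  ", matplotlib.__version__)
print("numpy       ", numpy.__version__)
print("pandas      ", pandas.__version__)
print("scikit-learn", sklearn.__version__)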

Prepare a Python script that reads the dedup dataset, trains a linear regression model on it, and displays the result in a matplotlib window.
$ cat demo.py
#The data is labeled, so we use a supervised approach
#The machine learning task we perform is regression ( linear regression )

import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

#read data
dataframe = pd.read_fwf('dedup_seq_var_1_17.txt')
x_values = dataframe[['dedup']]
y_values = dataframe[['time']]

#train model on data
body_reg = linear_model.LinearRegression()
body_reg.fit(x_values, y_values)

#visualize results
plt.scatter(x_values, y_values)
plt.plot(x_values, body_reg.predict(x_values))
#plt.interactive(True)
plt.show()
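Since the fitted model is just y = mx + b, you can also print the learned slope and intercept directly from scikit-learn. A small optional addition to demo.py above (coef_ and intercept_ are the fitted parameters of LinearRegression):

# optional: print the fitted slope (m) and intercept (b)
# body_reg, x_values and y_values come from demo.py above
print("slope m     =", body_reg.coef_)        # array of coefficients, one per feature
print("intercept b =", body_reg.intercept_)   # y-intercept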

Prepare your dataset, e.g., dedup files of different sizes, so we can compare the dedup ratio against the dedup process time.
file_name          dedup        time
seq_377K.0.0       96.25409276  2512
seq_610K.0.0       96.76181481  1742
seq_987K.0.0       97.22268395  18379
seq_1597K.0.0      97.28971803  4174
seq_2584K.0.0      97.51585024  3796
seq_4181K.0.0      97.53909053  8360
seq_6765K.0.0      97.57731661  9649
seq_10946K.0.0     97.70243732  12128
seq_17711K.0.0     97.73371995  26217
seq_28657K.0.0     97.75899063  36044
seq_46368K.0.0     97.77537807  56288
seq_75025K.0.0     97.77286868  89911
seq_121393K.0.0    97.7728039   130598
seq_196418K.0.0    97.76114378  231925
seq_317811K.0.0    97.76875575  365581
seq_514229K.0.0    97.77152402  587564
seq_832040K.0.0    97.77271286  946887
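The data.csv used by the NumPy script later in this post is just the dedup and time columns of this table, without the file_name column. If your dataset lives in the fixed-width text file read by demo.py, you can generate data.csv from it with pandas. A minimal sketch, assuming the same dedup_seq_var_1_17.txt and the same column names as in demo.py:

$ cat export_csv.py
# write the dedup/time columns out as data.csv for the NumPy demo below
import pandas as pd

dataframe = pd.read_fwf('dedup_seq_var_1_17.txt')
# keep only the two numeric columns; no header, no index, to match data.csv
dataframe[['dedup', 'time']].to_csv('data.csv', header=False, index=False)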

Final Output Example 

Linear regression ( y = mx + b ) plot, where x is the dedup ratio and y is the process time.

[Figure: scatter of dedup ratio vs. process time with the fitted regression line]

The result is roughly linear when we use variable chunking with a 128K modulo, but when the file size is larger than 196418K ( about 196.418 MB ), the process time appears to grow exponentially in this case.
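One way to probe that claim is to fit a line to log(time) against log(file size) taken from the table above: a slope near 1 would suggest time grows roughly linearly with size, while a clearly larger slope would point to super-linear growth. A minimal sketch, with the sizes and times copied from the table:

$ cat loglog_check.py
# rough scaling check: slope of log(time) vs log(file size in K)
import numpy as np

sizes_K = np.array([377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711,
                    28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040])
times = np.array([2512, 1742, 18379, 4174, 3796, 8360, 9649, 12128, 26217,
                  36044, 56288, 89911, 130598, 231925, 365581, 587564, 946887])

slope, intercept = np.polyfit(np.log(sizes_K), np.log(times), 1)
print("log-log slope =", slope)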


Double-check the linear regression equation via NumPy ( gradient descent )


$ cat ./demo.py
from numpy import *
# y = mx + b
# m is slope, b is y-intercept 
# this is the error function: the mean squared error of the line over all points
def compute_error_for_line_given_points(b, m, points):
    #measure how far each point is from the line; the best-fit line minimizes this error
    #error = average of (y - (m * x + b)) ** 2 over the n points
    #initialize it at 0
    totalError = 0
    #for every point
    for i in range(0, len(points)):
        #get x value
        x = points[i, 0]
        #get y value
        y = points[i, 1]
        #get the difference, square it, add it to the total
        totalError += (y - (m * x + b)) ** 2
    #get the average
    return totalError / float(len(points))

#one gradient descent step, using the partial derivatives of the error function
def step_gradient(b_current, m_current, points, learningRate):
    #initial, starting points for gradients
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        #direction with respect to b and m
        #computing partial derivatives of our error function ( partial derivative equations )
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    #update b and m values using partial derivatives
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    #starting b and m
    b = starting_b
    m = starting_m
    #gradient descent
    for i in range(num_iterations):
        #update b and m with the new more accurate b and m by performing
        #this gradient step
        b, m = step_gradient(b, m, array(points), learning_rate)
    #once the for loop finishes, return the final b and m
    return [b, m]

def run():
    #step 1 - collect data, two columns: column 1 is the dedup ratio (x), column 2 is the process time (y)
    points = genfromtxt("data.csv", delimiter=",")
    #step 2 - define hyperparameters ( it's a balance )
    #hyper-parameter 1: learning rate = how fast our model converges
    #convergence means reaching the optimal result (model): the line of best fit ( the best answer )
    #if the learning rate is too low we get slow convergence; if it is too big, the error function might not decrease
    learning_rate = 0.0001
    #hyper-parameter 2: initial values for the slope formula y = mx + b
    initial_b = 0 # initial y-intercept guess
    initial_m = 0 # initial slope guess
    #hyper-parameter 3: number of iterations
    num_iterations = 10000

    #Step 3 - train ML model
    print("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
    print("Running...")
    [b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
    print("After {0} iterations b = {1}, m = {2}, error = {3}".format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))

if __name__ == '__main__':

    run()


$ cat data.csv
96.25409276,2512
96.76181481,1742
97.22268395,18379
97.28971803,4174
97.51585024,3796
97.53909053,8360
97.57731661,9649
97.70243732,12128
97.73371995,26217
97.75899063,36044
97.77537807,56288
97.77286868,89911
97.7728039,130598
97.76114378,231925
97.76875575,365581
97.77152402,587564
97.77271286,946887

$ python demo.py
Starting gradient descent at b = 0, m = 0, error = 85896989939.5
Running...
After 1000 iterations b = -58.9851802482, m = 1531.65706095, error = 63605989287.1
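Gradient descent should approach the closed-form least-squares solution, so one sanity check is to compare against NumPy's own fit. A minimal sketch, assuming the same data.csv (np.polyfit returns the least-squares m and b directly):

$ cat lstsq_check.py
# closed-form least-squares fit to compare with the gradient descent result
import numpy as np

points = np.genfromtxt("data.csv", delimiter=",")
m, b = np.polyfit(points[:, 0], points[:, 1], 1)
print("closed-form m = {0}, b = {1}".format(m, b))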

If we drop the files larger than seq_121393K from the data set, the error goes down:
$ cat data.csv
96.25409276,2512
96.76181481,1742
97.22268395,18379
97.28971803,4174
97.51585024,3796
97.53909053,8360
97.57731661,9649
97.70243732,12128
97.73371995,26217
97.75899063,36044
97.77537807,56288
97.77286868,89911
97.7728039,130598

$ python demo.py
Starting gradient descent at b = 0, m = 0, error = 2383362378.46
Running...
After 1000 iterations b = -13.93366184, m = 316.652164833, error = 1432258731.3

The interesting thing is that if we remove seq_10946K and every larger file from the data set, we get the lowest error of the three runs.
$ cat data.csv
96.25409276,2512
96.76181481,1742
97.22268395,18379
97.28971803,4174
97.51585024,3796
97.53909053,8360
97.57731661,9649

$ python demo.py
Starting gradient descent at b = 0, m = 0, error = 77422434.5714
Running...

After 1000 iterations b = -1.28801270757, m = 71.5887947793, error = 29053565.2771

In sum, when we use variable chunking with a 128K modulo ( 0.85 * 128K ( min ) ~ 2 * 128K ( max ) ) as the chunking boundary, the process time roughly follows a linear pattern as long as the file is smaller than 196.418 MB. However, once the file size exceeds 196.418 MB, the process time no longer follows a linear pattern relative to the dedup ratio and appears to grow exponentially. Moreover, with the same variable chunking ( mod 128K ) across different file sizes, keeping only the files under 10946K gives the lowest error in this data set.

Extra Bonus:

Here is the practice I did with three parameters in linear regression. In the Jupyter notebook you can easily follow the code and generate the linear regression graphics.


https://github.com/chianingwang/DeepLearningPractice/blob/master/linear_regression/linear-regression_sklearn/linear_regression.ipynb
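For reference, the multi-parameter case is the same sklearn call with more than one feature column. A minimal sketch of the idea (the file multi.csv and the column names x1, x2 and y are hypothetical placeholders, not the notebook's actual data):

$ cat multi_feature.py
# linear regression with two features: y = m1*x1 + m2*x2 + b
import pandas as pd
from sklearn import linear_model

# hypothetical data set with two feature columns and one target column
dataframe = pd.read_csv('multi.csv')     # columns: x1, x2, y (placeholder names)
x_values = dataframe[['x1', 'x2']]
y_values = dataframe[['y']]

reg = linear_model.LinearRegression()
reg.fit(x_values, y_values)

print("coefficients m1, m2 =", reg.coef_)
print("intercept b         =", reg.intercept_)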


Here is the 2D plot and the 2D prediction from the notebook.

Here are the 3D plot diagrams.

Here is the prediction based on the existing data.

不要害怕犯錯。只有當你停止犯錯的時候,才需要警惕。因為這代表著你已經不再學習,或停止進步。
Don't be afraid to make mistakes. The only time you should be worried is when you stopped making mistakes; it only means that you've stopped learning or making progress.

進擊的鼓手 (Whiplash), 2014

Reference:

https://github.com/chianingwang/DeepLearningPractice/tree/master/linear_regression
https://github.com/llSourcell/linear_regression_demo
https://github.com/llSourcell/linear_regression_live