Wednesday, April 12, 2017

How to write and debug grok patterns in ELK.

ELK stands for Elasticsearch, Logstash, and Kibana, three open source key components from Elastic. Logstash processes the logs, Elasticsearch provides indexing, storage, search, and related features, and Kibana displays the results in a web UI.

I introduced how to set up an ELK 5 server in my previous post, so I won't repeat the installation and setup here, but you are welcome to check that post before you start.

If you are just starting with grok patterns, I recommend checking out the Elastic website to pick up the basic grok pattern concepts.

In this post, I will show you how to write and debug a grok pattern with different tools, and how to give your logs proper tags that enable searching, indexing, and even further utilization down the road, such as machine learning.


0. Outline


  1. Preparation
  2. Design
  3. Implementation
    1. Regular Expression - debugging
    2. Grok Pattern
      1. Method 1: online debug tools
        1. First Steps
        2. Second Steps
        3. Third Steps
        4. Last review result
      2. Method 2: Check Logs


1. Preparation


After you finish the installation, you should be able to see the folders/files under /etc/logstash/ as below.

$ ls -larth
total 44K
-rw-rw-r--   1 root       root       1.7K Nov 11 14:25 startup.options
-rw-rw-r--   1 root       root       1.7K Nov 11 14:25 jvm.options
-rw-rw-r--   1 root       root       1.4K Feb 24 09:41 log4j2.properties
-rw-rw-r--   1 root       root       2.8K Feb 24 23:56 logstash.yml.dpkg-old
lrwxrwxrwx   1 root       root         86 Feb 25 00:04 patterns -> /usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-patterns-core-4.0.2/patterns
-rw-rw-r--   1 root       root       4.4K Mar 13 16:49 logstash.yml
drwxrwxr-x   4 root       root       4.0K Mar 13 17:01 .
drwxr-xr-x 110 root       root       4.0K Apr  5 10:16 ..
drwxrwxr-x   2 swiftstack swiftstack 4.0K Apr 12 17:39 conf.d
drwxr-xr-x   2 swiftstack swiftstack 4.0K Apr 12 18:05 extra_patterns

There are three major folders you should be aware of.

a. patterns:
As you can see, it's a soft link into your logstash installation directory, since logstash ships native patterns for you to leverage.

b. extra_patterns:
Here you can prepare your customized patterns, i.e. named regular-expression mappings to leverage in your grok patterns.

c. conf.d:
You will keep your grok pattern configuration under this folder, e.g.:
$ ls ./conf.d/ -larth
total 48K
-rw-r--r-- 1 swiftstack swiftstack  271 Feb 27 16:18 30-elk5-output.conf
-rw-r--r-- 1 swiftstack swiftstack  409 Feb 27 20:41 10-rsyslog-input.conf
drwxrwxr-x 4 root       root       4.0K Mar 13 17:01 ..
-rw-rw-r-- 1 swiftstack swiftstack  29K Apr 12 17:39 20-rsyslog-filter.conf
drwxrwxr-x 2 swiftstack swiftstack 4.0K Apr 12 17:39 .


2. Design

The design all depends on you. In my design I separate input, filter, and output into three different *.conf files, but you can either put them together or split each grok pattern into its own *.conf, e.g. input, filter_1, filter_2, ... and output.


In my case, I leverage rsyslog to forward the logs to logstash; you should be able to find the details here.

a. input:
The input .conf takes the logs from syslog and identifies the format, which is JSON, line by line.

e.g: $ cat 10-rsyslog-input.conf
# This input block will listen on port 10514 for logs to come in.
# host should be an IP on the Logstash server.
# codec => "json" indicates that we expect the lines we're receiving to be in JSON format
# type => "rsyslog" is an optional identifier to help identify messaging streams in the pipeline.

input {
  udp {
    port => 10514
    codec => json_lines
    type => "rsyslog"
  }
}
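Before pointing rsyslog at it, you can sanity-check this input contract yourself: json_lines simply means newline-delimited JSON over the socket. Below is a minimal stdlib-Python sketch (the event fields are borrowed from this post's logs; the listener here is a stand-in for logstash, not logstash itself) that plays both sides on localhost:

```python
import json
import socket

# A fake log event shaped like the fields rsyslog forwards in this post.
event = {"programname": "object-expirer", "severity": "info",
         "message": " Pass beginning; 0 possible containers; 0 possible objects"}

# Listener standing in for the logstash udp input (json_lines = newline-delimited JSON).
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))      # ephemeral port here; logstash would use 10514
listener.settimeout(5)
port = listener.getsockname()[1]

# Sender standing in for rsyslog's UDP forwarding.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto((json.dumps(event) + "\n").encode("utf-8"), ("127.0.0.1", port))

data, _ = listener.recvfrom(65535)
decoded = json.loads(data.decode("utf-8").rstrip("\n"))
sender.close()
listener.close()
print(decoded["programname"])
```

If the round trip decodes cleanly, the same payload shape will decode in logstash's json_lines codec.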

b. filter
The filter .conf is the key piece where we put the grok patterns. I will elaborate on it in a later section.

e.g: $ cat 20-rsyslog-filter.conf
# This is an empty filter block.  You can later add other filters here to further process
# your log lines

filter {
################################
# Object Expirer patterns
################################
#
# EXAMPLE LOG ENTRIES
################################
# OBJ_EXP_START Oct 15 13:13:00 localhost.localdomain object-expirer: Pass beginning; 0 possible containers; 0 possible objects
# OBJ_EXP_FINISH Oct 15 13:13:00 localhost.localdomain object-expirer: Pass completed in 0s; 0 objects expired
################################
  if [programname] == "object-expirer" {
    if "beginning" in [message] {
      grok {
        patterns_dir => ["/etc/logstash/patterns", "/etc/logstash/extra_patterns"]
        match => { "message" => " Pass beginning; %{BASE10NUM:PossibleContainers} possible containers; %{BASE10NUM:PossibleObject} possible objects" }
        add_tag => ["swift", "object", "expirer", "OBJ_EXP_START"]
        remove_tag => ["_grokparsefailure"]
        remove_field => ["PossibleContainers", "PossibleObject"]
      }
    }
    else {
      grok {
        match => { "message" => " Pass completed in %{NUMBER:CompletedSeconds}s; %{BASE10NUM:ObjectsExpired} objects expired" }
        remove_tag => ["_grokparsefailure"]
        add_tag => ["swift", "object", "expirer", "OBJ_EXP_FINISH"]
        remove_field => ["CompletedSeconds", "ObjectsExpired"]
      }
    }
  }
}
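Once the %{...} macros are expanded, each match line above is an ordinary regular expression with named captures. As a rough dry-run (assuming a simplified %{BASE10NUM}, not grok's exact definition), the "Pass beginning" branch can be checked with stdlib Python:

```python
import re

# Rough expansion of the first grok expression above; (?P<name>...) plays the
# role of %{PATTERN:name}, and %{BASE10NUM} is simplified to a signed decimal.
NUM = r"(?:[+-]?\d+(?:\.\d+)?)"
pattern = re.compile(
    r" Pass beginning; (?P<PossibleContainers>" + NUM + r") possible containers;"
    r" (?P<PossibleObject>" + NUM + r") possible objects")

line = ("Oct 15 13:13:00 localhost.localdomain object-expirer:"
        " Pass beginning; 0 possible containers; 0 possible objects")
m = pattern.search(line)
print(m.groupdict())   # {'PossibleContainers': '0', 'PossibleObject': '0'}
```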

c. output
The output .conf takes the events parsed by logstash and forwards them to Elasticsearch; you can see the output points at the Elasticsearch server ( localhost ) and its port.

e.g: $ cat 30-elk5-output.conf
# This output block will send all events of type "rsyslog" to elasticsearch at the configured
# host and port into daily indices of the pattern, "rsyslog-YYYY.MM.DD"

output {
  if [type] == "rsyslog" {
    elasticsearch {
      hosts => [ "localhost:9200" ]
    }
  }
}


3. Implementation

In this section I will show you how to write a grok pattern and how to debug it. I struggled with grok pattern implementation for a while, and these steps / tools helped me a lot.

Let's take an example from your un-parsed logs. As you can see, most of the data was matched by logstash's defaults, but some of the message wasn't; no pattern could be found, and it errored out with the tag _grokparsefailure.
{
  "_index": "logstash-2017.04.12",
  "_type": "rsyslog",
  "_id": "AVtkdWQiSkb40H9LtWNZ",
  "_score": null,
  "_source": {
    "severity": "info",
    "@timestamp": "2017-04-12T23:17:46.000Z",
    "@version": "1",
    "programname": "proxy-server",
    "procid": "-",
    "host": "127.0.0.1",
    "sysloghost": "cloud-pac01",
    "message": " Copying object from /v1/AUTH_sme/stepan/Folders_pack/03 with special characters ~`!@#$%^&()_+-={}[];',._over_PHP.zzz to /v1/AUTH_sme/stepan/.trash/org/Folders_pack/03%20with%20special%20characters%20%7E%60%21%40%23%24%25%5E%26%28%29_%2B-%3D%7B%7D%5B%5D%3B%27%2C._over_PHP.zzz (txn: txae6773cb5efd47ae86c1b-0058eeb59a)",
    "type": "rsyslog",
    "facility": "local0",
    "tags": [
      "_grokparsefailure"
    ]
  },
  "fields": {
    "@timestamp": [
      1492039066000
    ]
  },
  "highlight": {
    "message": [
      " @kibana-highlighted-field@Copying@/kibana-highlighted-field@ @kibana-highlighted-field@object@/kibana-highlighted-field@ @kibana-highlighted-field@from@/kibana-highlighted-field@ /v1/AUTH_sme/stepan/Folders_pack/03 with special characters ~`!@#$%^&()_+-={}[];',._over_PHP.zzz to /v1/AUTH_sme/stepan/.trash/org/Folders_pack/03%20with%20special%20characters%20%7E%60%21%40%23%24%25%5E%26%28%29_%2B-%3D%7B%7D%5B%5D%3B%27%2C._over_PHP.zzz (txn: txae6773cb5efd47ae86c1b-0058eeb59a)"
    ]
  },
  "sort": [
    1492039066000
  ]
}





The message above is basically a string telling us "Copying object from "A" to "B" (txn: TXN)"; that's it. However, we notice the features in the bullet items below.

a. There are special characters in the message, e.g. "(" and ")", which we need to escape.
b. There are special characters in the message, e.g. "...$%^...".
c. There might be foreign letters, so our patterns need to cover Unicode.


3.1 Regular Expression

First of all, we need to design a regular expression pattern to match what we want, one we can re-use in later mappings.

Before we start, I have to introduce an easy online tool for testing your regex ( regular expression ): https://regex101.com

a. Leverage an existing pattern - I call it eating your own dog food.
# TXN is the transaction number
# In OpenStack Swift we use it a lot, so I defined an extra pattern to reuse it all the time
# This example is very straightforward: just two words concatenated with a hyphen.
TXN %{WORD}[-]%{WORD}
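Since grok's %{WORD} is \b\w+\b, the TXN pattern expands to a plain regex you can try immediately. A quick stdlib-Python check against a transaction id from this post's logs:

```python
import re

# grok's %{WORD} is \b\w+\b, so TXN (%{WORD}[-]%{WORD}) expands to roughly:
TXN = re.compile(r"\b\w+\b-\b\w+\b")

txn = "txae6773cb5efd47ae86c1b-0058eeb59a"
m = TXN.fullmatch(txn)
print(m.group(0) if m else "no match")
```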


# The a + b + c bullet items above can be summarized into the pattern below, which
# matches a copy-from object path and allows Unicode
#\w matches any word character (equal to [a-zA-Z0-9_])
#\s matches any whitespace character (equal to [\r\n\t\f\v ])
#\/ matches the character / literally (case sensitive)
#\. matches the character . literally (case sensitive)
#\p{L} matches any kind of letter from any language
#\p{N} matches any kind of numeric character in any script
#\p{S} matches any math symbols, currency signs, dingbats, box-drawing characters, etc
#\p{P} matches any kind of punctuation character
COPYFROMUNICODE [\w\s\/\.\p{L}\p{N}\p{S}\p{P}]+

With the online regex tool we can practice and debug whether our regex matches the requirement or not.

The tool has a detailed explanation of how the regex maps to your test string.
/
[\w\s\/\.\p{L}\p{N}\p{S}\p{P}]+
/
Match a single character present in the list below 
[\w\s\/\.\p{L}\p{N}\p{S}\p{P}]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\w matches any word character (equal to [a-zA-Z0-9_])
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\/ matches the character / literally (case sensitive)
\. matches the character . literally (case sensitive)
\p{L} matches any kind of letter from any language
\p{N} matches any kind of numeric character in any script
\p{S} matches any math symbols, currency signs, dingbats, box-drawing characters, etc
\p{P} matches any kind of punctuation character

This is how you debug the special string. Try this link.

After we sort out the predefined mapping patterns, we put them under /etc/logstash/extra_patterns/...

e.g: $ cat ../extra_patterns/swift
# Swift logs documentation: http://docs.openstack.org/developer/swift/logs.html

# TXN in transaction numbers
TXN %{WORD}[-]%{WORD}

# Copy from unicode object path which allow unicode
#\w matches any word character (equal to [a-zA-Z0-9_])
#\s matches any whitespace character (equal to [\r\n\t\f\v ])
#\/ matches the character / literally (case sensitive)
#\. matches the character . literally (case sensitive)
#\p{L} matches any kind of letter from any language
#\p{N} matches any kind of numeric character in any script
#\p{S} matches any math symbols, currency signs, dingbats, box-drawing characters, etc
#\p{P} matches any kind of punctuation character
COPYFROMUNICODE [\w\s\/\.\p{L}\p{N}\p{S}\p{P}]+
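Each line in this file is just `NAME regex`, and grok resolves %{NAME:field} references against those names. To make the mechanics concrete, here is a toy expander (a deliberate simplification, not logstash's actual compiler) over a two-entry dictionary standing in for the file:

```python
import re

# Stand-ins for two "NAME regex" lines from a patterns file; WORD is the
# definition shipped with logstash, TXN is our extra pattern above.
definitions = {
    "WORD": r"\b\w+\b",
    "TXN": r"%{WORD}[-]%{WORD}",
}

def expand(pattern):
    """Recursively replace %{NAME} / %{NAME:field} with the named definition."""
    def repl(m):
        name, field = m.group(1), m.group(2)
        body = expand(definitions[name])
        return "(?P<%s>%s)" % (field, body) if field else "(?:%s)" % body
    return re.sub(r"%\{(\w+)(?::(\w+))?\}", repl, pattern)

regex = re.compile(expand(r"txn: %{TXN:txn}"))
m = regex.search("(txn: txae6773cb5efd47ae86c1b-0058eeb59a)")
print(m.group("txn"))
```

The %{NAME:field} reference becomes a named capture group, which is exactly where the field names you see in Kibana come from.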


3.2 Grok Patterns

After we prepare extra_patterns, we can add grok patterns to the filter .conf (section b above).

Before you start, you should know the general structure of your filter, so below is an example to reference. In my case, if the pattern matches, I add the tags "swift", "proxy" and "PROXY_SER_SIDE_COPY".

e.g: $ cat 20-rsyslog-filter.conf
filter {
################################
# proxy server patterns info
################################
  if [programname] == "proxy-server" and [severity] == "info" {
    if "Copying object from" in [message] and "to" in [message] {
      grok {
        patterns_dir => ["/etc/logstash/patterns", "/etc/logstash/extra_patterns"]
        match => { "message" => " Copying object from %{COPYFROMUNICODE:copy_from_obj_path} to %{NOTSPACE:copy_to_obj_path} \(txn: %{NOTSPACE:txn}\)" }
        remove_tag => ["_grokparsefailure"]
        add_tag => ["swift", "proxy", "PROXY_SER_SIDE_COPY"]
      }
    }
  }
}
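You can also dry-run this match outside logstash. Python's stdlib re has no \p{...} classes, so in the sketch below a greedy .+ stands in for %{COPYFROMUNICODE} (good enough for this sample message, though the real class is stricter), while \S+ matches %{NOTSPACE} exactly:

```python
import re

# The sample message from the _grokparsefailure event earlier in this post.
message = (" Copying object from /v1/AUTH_sme/stepan/Folders_pack/"
           "03 with special characters ~`!@#$%^&()_+-={}[];',._over_PHP.zzz"
           " to /v1/AUTH_sme/stepan/.trash/org/Folders_pack/"
           "03%20with%20special%20characters%20%7E%60%21%40%23%24%25%5E%26%28%29"
           "_%2B-%3D%7B%7D%5B%5D%3B%27%2C._over_PHP.zzz"
           " (txn: txae6773cb5efd47ae86c1b-0058eeb59a)")

# Greedy .+ approximates %{COPYFROMUNICODE}; \S+ matches %{NOTSPACE} exactly.
pattern = re.compile(r" Copying object from (?P<copy_from_obj_path>.+)"
                     r" to (?P<copy_to_obj_path>\S+)"
                     r" \(txn: (?P<txn>\S+)\)")
m = pattern.search(message)
print(m.group("txn"))
```

The three captured groups line up with the copy_from_obj_path / copy_to_obj_path / txn fields the online debugger reports below.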

However, how do we know the grok pattern is correct? There are two major ways to test it. Several online grok debuggers exist; here is an example.

Method 1: leverage an online debug tool.
The first field is your source, which is the message content from my example log:

 Copying object from /v1/AUTH_sme/stepan/Folders_pack/03 with special characters ~`!@#$%^&()_+-={}[];',._over_PHP.zzz to /v1/AUTH_sme/stepan/.trash/org/Folders_pack/03%20with%20special%20characters%20%7E%60%21%40%23%24%25%5E%26%28%29_%2B-%3D%7B%7D%5B%5D%3B%27%2C._over_PHP.zzz (txn: txae6773cb5efd47ae86c1b-0058eeb59a)

The second field is the grok pattern I wrote:

 Copying object from %{COPYFROMUNICODE:copy_from_obj_path} to %{NOTSPACE:copy_to_obj_path} \(txn: %{NOTSPACE:txn}\)

Since I leverage an extra pattern, I check "Add custom patterns"
and put my extra patterns in the text box:

COPYFROMUNICODE [\w\/\.\p{L}\p{N}\p{S}\p{P}\s]+
TXN %{WORD}[-]%{WORD}

The third step is adding the extra patterns; then you can click Go.

Finally, you should see the result as below.


{
  "copy_from_obj_path": [
    [
      "/v1/AUTH_sme/stepan/Folders_pack/03 with special characters ~`!@#$%^&()_+-={}[];',._over_PHP.zzz"
    ]
  ],
  "copy_to_obj_path": [
    [
      "/v1/AUTH_sme/stepan/.trash/org/Folders_pack/03%20with%20special%20characters%20%7E%60%21%40%23%24%25%5E%26%28%29_%2B-%3D%7B%7D%5B%5D%3B%27%2C._over_PHP.zzz"
    ]
  ],
  "txn": [
    [
      "txae6773cb5efd47ae86c1b-0058eeb59a"
    ]
  ]
}

The whole debug page looks as below.


Method 2: check the logstash log to trace grok pattern bugs.
$ sudo service logstash restart
Then tail the logstash log to see whether there is a clue for you:
$ sudo tail -f /var/log/logstash/logstash-plain.log

If you see no error and the restart log shows success as below, then you are done!
[2017-04-12T18:03:19,124][INFO ][logstash.pipeline        ] Starting pipeline {"id"=>"main", "pipeline.workers"=>2, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>250}
[2017-04-12T18:03:19,133][INFO ][logstash.pipeline        ] Pipeline main started
[2017-04-12T18:03:19,133][INFO ][logstash.inputs.udp      ] Starting UDP listener {:address=>"0.0.0.0:10514"}
[2017-04-12T18:03:19,150][INFO ][logstash.inputs.udp      ] UDP listener started {:address=>"0.0.0.0:10514", :receive_buffer_bytes=>"106496", :queue_size=>"2000"}
[2017-04-12T18:03:19,188][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}

That's it! You should now have correctly mapped logs with the tags, which in my case are "swift", "proxy" and "PROXY_SER_SIDE_COPY".




Reference:
https://regex101.com
https://grokdebug.herokuapp.com/
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html

Saturday, March 4, 2017

Basic Machine Learning ( Linear Regression Model ) for a Deduplication DataSet

My First Deep Learning Practice via Adopting Linear Regression Model

Environment Preparation.

Prepare Three Dependencies

# Pandas allows reading the data set
$ pip install pandas

# scikit learn is Machine Learning Library for Regression
$ pip install -U scikit-learn

# matplotlib allows visualizing the model and data
$ brew install freetype
$ brew install libpng
$ git clone git://github.com/matplotlib/matplotlib.git
$ cd matplotlib/
$ python setup.py install

Here is a quicker way to get your preparation done:
$ cat requirements.txt
matplotlib==1.5.1
numpy==1.11.0
pandas==0.18.0
scikit-learn==0.17.1

Then

$ pip install -r requirements.txt

Prepare a Python script that reads the dedup dataset, learns with a linear regression model, and displays the result in the matplotlib visualization UI.
$ cat demo.py
#Data is labeled, so we use a supervised approach
#The type of machine learning task performed is called regression ( linear regression )

import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

#read data
dataframe = pd.read_fwf('dedup_seq_var_1_17.txt')
x_values = dataframe[['dedup']]
y_values = dataframe[['time']]

#train model on data
body_reg = linear_model.LinearRegression()
body_reg.fit(x_values, y_values)

#visualize results
plt.scatter(x_values, y_values)
plt.plot(x_values, body_reg.predict(x_values))
#plt.interactive(True)
plt.show()

Prepare your dataset, e.g. different dedup files (sizes), to compare the dedup ratio with the dedup process time.
file_name          Dedup (%)     Time
seq_377K.0.0       96.25409276   2512
seq_610K.0.0       96.76181481   1742
seq_987K.0.0       97.22268395   18379
seq_1597K.0.0      97.28971803   4174
seq_2584K.0.0      97.51585024   3796
seq_4181K.0.0      97.53909053   8360
seq_6765K.0.0      97.57731661   9649
seq_10946K.0.0     97.70243732   12128
seq_17711K.0.0     97.73371995   26217
seq_28657K.0.0     97.75899063   36044
seq_46368K.0.0     97.77537807   56288
seq_75025K.0.0     97.77286868   89911
seq_121393K.0.0    97.7728039    130598
seq_196418K.0.0    97.76114378   231925
seq_317811K.0.0    97.76875575   365581
seq_514229K.0.0    97.77152402   587564
seq_832040K.0.0    97.77271286   946887

Final Output Example 

Linear regression ( y = mx + b ) diagram, where x is the dedup ratio, y is the process time, m is the slope, and b is the intercept

As the result shows, the relationship is linear when we use variable chunking with modulo 128K, but when the file size grows larger than 196418K ( 196.418 MB ), the process time grows exponentially in this case.


Double-check with the linear regression equation via NumPy


$ cat ./demo.py
from numpy import *
# y = mx + b
# m is slope, b is y-intercept 
# this is the error function, which collects the error during different iterations
def compute_error_for_line_given_points(b, m, points):
    #compute the distance from each point to the line; the line minimizing it is the best fit for all the points
    #y - (mx + b), but we square the difference, sum it, and average over n
    #initialize it at 0
    totalError = 0
    #for every points
    for i in range(0, len(points)):
        #get x value
        x = points[i, 0]
        #get y value
        y = points[i, 1]
        #get the difference, square it, add it to the total
        totalError += (y - (m * x + b)) ** 2
    #get the average
    return totalError / float(len(points))

#gradient follows this partial derivative equation
def step_gradient(b_current, m_current, points, learningRate):
    #initial, starting points for gradients
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        #direction with respect to b and m
        #computing partial derivatives of our error function ( partial derivative equations )
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    #update b and m values using partial derivatives
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    #starting b and m
    b = starting_b
    m = starting_m
    #gradient descent
    for i in range(num_iterations):
        #update b and m with the new more accurate b and m by performing
        #this gradient step
        b, m = step_gradient(b, m, array(points), learning_rate)
    #once the for loop finishes, return the final b and m result
    return [b, m]

def run():
    #step 1 - collect data: two columns, column 1: dedup ratio (x), column 2: process time (y)
    points = genfromtxt("data.csv", delimiter=",")
    #step 2 - define hyperparameters ( it's a balance )
    #hyper-parameter 1: learning rate = how fast our model converges
    #convergence = when you reach the optimal result (model), the line of best fit ( the best answer )
    #if the learning rate is too low you get slow convergence; if too big, the error function might not decrease
    learning_rate = 0.0001
    #hyper-parameter 2: equation ( slope formula ): y = mx + b
    initial_b = 0 # initial y-intercept guess
    initial_m = 0 # initial slope guess
    #hyper-parameter 3: iterations
    num_iterations = 1000

    #Step 3 - train ML model
    print("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
    print("Running...")
    [b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
    print("After {0} iterations b = {1}, m = {2}, error = {3}".format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))

if __name__ == '__main__':
    run()
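To watch these functions converge, here is a dependency-free rerun of the same update rule on a toy y = 2x dataset (pure Python instead of numpy, with a hand-picked learning rate; the arithmetic mirrors step_gradient above):

```python
def step_gradient(b_current, m_current, points, learning_rate):
    # Same partial-derivative update as the numpy version above.
    b_gradient = m_gradient = 0.0
    n = float(len(points))
    for x, y in points:
        b_gradient += -(2 / n) * (y - (m_current * x + b_current))
        m_gradient += -(2 / n) * x * (y - (m_current * x + b_current))
    return (b_current - learning_rate * b_gradient,
            m_current - learning_rate * m_gradient)

points = [(x, 2 * x) for x in range(1, 11)]   # toy data: y = 2x exactly
b = m = 0.0
for _ in range(1000):
    b, m = step_gradient(b, m, points, 0.01)
print(round(m, 2))   # the slope should settle near the true value 2
```

On clean data like this the slope converges close to 2 and the intercept close to 0, which is a quick sanity check that the update rule is implemented correctly.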


$ cat data.csv
96.25409276,2512
96.76181481,1742
97.22268395,18379
97.28971803,4174
97.51585024,3796
97.53909053,8360
97.57731661,9649
97.70243732,12128
97.73371995,26217
97.75899063,36044
97.77537807,56288
97.77286868,89911
97.7728039,130598
97.76114378,231925
97.76875575,365581
97.77152402,587564
97.77271286,946887

$ python demo.py
Starting gradient descent at b = 0, m = 0, error = 85896989939.5
Running...
After 1000 iterations b = -58.9851802482, m = 1531.65706095, error = 63605989287.1

$ cat data.csv
96.25409276,2512
96.76181481,1742
97.22268395,18379
97.28971803,4174
97.51585024,3796
97.53909053,8360
97.57731661,9649
97.70243732,12128
97.73371995,26217
97.75899063,36044
97.77537807,56288
97.77286868,89911
97.7728039,130598

$ python demo.py
Starting gradient descent at b = 0, m = 0, error = 2383362378.46
Running...
After 1000 iterations b = -13.93366184, m = 316.652164833, error = 1432258731.3

The interesting thing is that if we remove files larger than seq_10946K from the data set, we get the lowest error.
$ cat data.csv
96.25409276,2512
96.76181481,1742
97.22268395,18379
97.28971803,4174
97.51585024,3796
97.53909053,8360
97.57731661,9649

$ python demo.py
Starting gradient descent at b = 0, m = 0, error = 77422434.5714
Running...

After 1000 iterations b = -1.28801270757, m = 71.5887947793, error = 29053565.2771
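These gradient-descent numbers can be cross-checked against the closed-form least-squares solution, which by construction achieves the minimum possible mean squared error on the same points. A stdlib-only version over the same seven rows:

```python
# Closed-form ordinary least squares: m = cov(x, y) / var(x), b = mean_y - m * mean_x.
points = [(96.25409276, 2512), (96.76181481, 1742), (97.22268395, 18379),
          (97.28971803, 4174), (97.51585024, 3796), (97.53909053, 8360),
          (97.57731661, 9649)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
m = (sum((x - mean_x) * (y - mean_y) for x, y in points)
     / sum((x - mean_x) ** 2 for x, _ in points))
b = mean_y - m * mean_x

# Mean squared error of the closed-form fit; it can only be at or below the
# 29053565.2771 that 1000 gradient-descent iterations reached above.
mse = sum((y - (m * x + b)) ** 2 for x, y in points) / n
print(mse)
```

If the closed-form error is noticeably lower, the gradient descent simply had not converged yet at 1000 iterations.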

In sum, when we use variable chunking with modulo 128K ( 0.85 * 128K ( min ) ~ 2 * 128K ( max ) ) as the chunking boundary, the process time roughly follows a linear pattern for files smaller than 196.418MB. However, when the file size is larger than 196.418MB, the process time no longer follows the linear pattern relative to the dedup ratio and apparently starts growing exponentially. Moreover, with variable chunking at mod 128K across different file sizes, keeping file sizes under 10946K gives the lowest error rate in this data set.

Extra Bonus:

Here is the practice I did with 3 parameters in linear regression. In the Jupyter notebook you can easily track through the code and generate those linear graphics.


https://github.com/chianingwang/DeepLearningPractice/blob/master/linear_regression/linear-regression_sklearn/linear_regression.ipynb


Here is the 2D fit and the 2D prediction:

Here are the 3D plot diagrams:

Here is the prediction based on the existing data:


不要害怕犯錯。只有當你停止犯錯的時候,才需要警惕。因為這代表著你已經不再學習,或停止進步。
Don't be afraid to make mistakes. The only time you should be worried is when you stopped making mistakes; it only means that you've stopped learning or making progress.

進擊的鼓手 (Whiplash), 2014

Reference:

https://github.com/chianingwang/DeepLearningPractice/tree/master/linear_regression
https://github.com/llSourcell/linear_regression_demo
https://github.com/llSourcell/linear_regression_live