Constructing a supervised learning model

With machine learning the dataset that is used to train the learning program is critical. Feeding a dataset of correct observations is what defines supervised learning. What it means in this project’s context, is that we have a set of log data that is already labeled correctly.

I have found wikipedia to be a good resource for the basic theory, and from the supervised learning article, we’ll go through the general steps needed to create such a model with some thoughts and a small experiment of data processing. I also found a book containing a collection of scientific studies regarding machine learning and cyber security:

https://www.crcpress.com/Machine-Learning-for-Computer-and-Cyber-Security-Principle-Algorithms/Gupta-Sheng/p/book/9781138587304

1. Define the training set

In the very first prototype, the training set is going to be Linux log messages and their respective logging levels. Thus, the data set is defined as LOGLEVEL:MESSAGE. With supervised learning, the wikipedia article gives a set of issues to consider, of which bias-variance dilemma looks quite interesting. Questions of how to evaluate and define a good dataset have to be answered.

2. Get the training set

We have lots of logs. I wanted to test how to work on the data itself, and created an experimental script to work with systemd-journald generated log data:

$ cat datatest.sh
#!/bin/bash
sudo journalctl -n 5 -o json --output-fields=PRIORITY,MESSAGE |
sed -E 's/"__CURSOR":"[^"]*"//;
	s/"_BOOT_ID":"[^"]*"//;
	s/"__MONOTONIC_TIMESTAMP":"[^"]*"//;
	s/"__REALTIME_TIMESTAMP":"[^"]*"//' > testdataset.json

The script parses extra metadata that is included by default journalctl’s json output. The script above runs the parser only for 5 journal lines. I  removed the ‘-n 5’ selector, and took a full dump of my test server’s logs:

$ time ./datatest.sh
real	0m17.953s
user	0m17.192s
sys	0m0.607s
$ stat testdataset.json 
  File: testdataset.json
  Size: 78178485  	Blocks: 152696     IO Block: 4096   regular file

~80M of parsed logs in 18secs on a simple cloud machine. Efficient, huh? TBH I don’t know. Some further smoothing anyway to be done, I heard CSV is the format… Here is a screenshot how it looks like now:

Screenshot_2020-02-10_15-34-34

3. Determinte the input features

The ‘features’ in a machine learning context are essentially different variables on a given input. In step 2 there were only two such variables: priority and the message. This can be of course expanded, but it can also be too much. Key here is to identify relevant features that make most sense in terms of desired output. Looking at the script, the extra data can be easily thrown in with the ‘–output–fields=’ -selector.

4. The structure of the learning function and the algorithms

Choosing the right algorithmic models that can do the job itself. There are many of them, and finding the right one to classify the data is being looked into. Neural networks, decision trees, support vector machines, Bayesian networks… this is where the MATH happens.

5. Train the model

Once the model is complete, the data is fed into the machine, and it learns from it. Data format and volumes have to be considered, as well as the data quality.

6. Evaluate the accuracy

The working piece of machine learning software has to be tested. There are 2 more datasets to make: validation dataset and testing dataset.

The way forward

The initial program is going to perform what is called ‘multiclass classification’. A closer look into the suitable algorithmic models is a priority. There are many options to choose from, but the scope of the first prototype has to be tight so we don’t get overwhelmed with the options. For example the unsupervised learning methods and models can find patterns out of the data, and from this project’s point of view can definetely be looked into. But not just yet…

Project presentation ahead!

We are finalizing the project plan and making it presentable. The project presentation will be held on Wednesday 12th February in Haaga-Helia University of Applied Sciences, 12:00 onwards in classroom 5009. There will be other interesting projects too 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s