Define the problem: Getting started with machine learning

When you are new to machine learning, finding the right tutorial to start with can be difficult, as there is a multitude to choose from. The University of Helsinki offers one of the best online courses, The Elements of AI, which has been immensely popular. A sequel to the course, Building AI, is due this spring.

I spent some time over the weekend studying the basics, and again saw that Wikipedia knows it all. Then I found a recent bachelor’s thesis by Killian Duay from our school, in which our peer had also started learning AI/ML from scratch. His concise background study gave me some ideas about where to head next.

The basic question: what is the problem?

Defining and categorizing the problem is the first thing we have to do. Once the problem is known, suitable algorithm(s) can be chosen for the machine learning model.

Without going too deep into details and theory, machine learning techniques are divided into two main categories: supervised and unsupervised learning. Briefly and broadly explained, supervised learning uses labeled historical data and mathematical models to make predictions, while unsupervised learning finds structures in unlabeled data. It seems to me that if you can combine different machine learning techniques in an efficient way, you are likely to get good results.

Classification is the problem!

Classification is one of the central ideas in machine learning, and many of the programming tutorials start with it. Feed the AI enough cat pictures, and it learns to recognize one. In our meeting last week with Joni we looked at a dataset which contained movie reviews classified as positive or negative. Part of this dataset is used to train the ML program and the rest to test it, with the goal of recognizing from any review text whether the review is positive or negative.

The basic idea of our project is to make a machine learning based program capable of recognizing interesting, meaningful or anomalous Linux log entries. Classification looks like a good starting point, since the same approach can be applied to log processing.

Classifying logs

With Linux kernel logs, there are 8 different severity levels defined: from emergency (system is unusable) to debug. Can we make a program that can tell this level from the log message text alone?
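To get training labels, the severity has to be separated from the message text first. Here is a minimal sketch of that step in Python, assuming the log lines carry a syslog-style `<N>` priority prefix (as far as I know, `dmesg --raw` keeps that prefix). The function name `split_severity` and the sample line are made up for illustration, and this is not necessarily how our project will end up doing it.

```python
import re
from typing import Optional, Tuple

# The eight kernel log levels, from 0 (most severe) to 7 (least severe).
SEVERITY_NAMES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

# Matches a priority prefix such as "<6>" at the start of a line.
PRI_PREFIX = re.compile(r"^<(\d+)>(.*)$")

# Matches a kernel timestamp such as "[   12.345678] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def split_severity(raw_line: str) -> Optional[Tuple[str, str]]:
    """Return (severity_name, message), or None if the line has no prefix."""
    match = PRI_PREFIX.match(raw_line)
    if match is None:
        return None
    priority = int(match.group(1))
    # The low three bits hold the severity; any higher bits hold the facility.
    level = priority & 7
    # Drop the timestamp so that only the human-written text remains.
    message = TIMESTAMP.sub("", match.group(2))
    return SEVERITY_NAMES[level], message


# Hypothetical sample line, not taken from a real log.
sample = "<4>[   12.345678] example: something looks off"
print(split_severity(sample))
```

Each log line then becomes a (label, text) pair, which is exactly the shape a classifier needs for training.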

We already have the data and the means to process it. Log messages labeled with their severity level would act as training data, and after the ML program is trained we can take another set of log messages and see if the program can classify them correctly.
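The train-then-test idea can be sketched with a very small classifier. The example below uses a bag-of-words naive Bayes model written with the Python standard library only. This is my own choice of a simple technique for illustration, not a decision about what the project will use. The class name `NaiveBayes`, the two labels and all the log messages are invented toy data.

```python
import math
from collections import Counter, defaultdict


def tokenize(message):
    """Split a message into lowercase words."""
    return message.lower().split()


class NaiveBayes:
    """Bag-of-words naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # messages seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()                  # all words seen in training

    def train(self, examples):
        for message, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(message):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, message):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Start from the log prior probability of the label.
            score = math.log(count / total)
            words_in_label = sum(self.word_counts[label].values())
            for word in tokenize(message):
                # Add-one smoothing keeps unseen words from zeroing the score.
                seen = self.word_counts[label][word] + 1
                score += math.log(seen / (words_in_label + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: (message, severity label) pairs.
training_set = [
    ("disk read error on device", "err"),
    ("failed to load firmware", "err"),
    ("i/o error while reading sector", "err"),
    ("device registered successfully", "info"),
    ("link is up", "info"),
    ("firmware loaded successfully", "info"),
]
test_set = [
    ("read error on sector", "err"),
    ("device link is up", "info"),
]

model = NaiveBayes()
model.train(training_set)

correct = sum(model.predict(text) == label for text, label in test_set)
print(f"accuracy: {correct / len(test_set):.2f}")
```

The important part is that the test messages are kept separate from the training messages, so the accuracy says something about messages the program has not seen before. With real logs the sets would of course be much larger.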

Already my mind races ahead: if the system log is written by some random application instead of the kernel, there might not be any classification data, or the classification could be ambiguous. The log message itself is fundamentally defined by humans. A good example: take a look at your Linux machine’s system log:

$ dmesg | grep Yama
[    0.260756] Yama: becoming mindful.

…is this a meaningful log message? Is this Yama-dude some new-age guru meditating inside my laptop?
