Milestone: a working log classifier ML model

We had an eye-opening meeting earlier this week, and I started to get confident we can make our first defined milestone ready by next week. But it turned out to be closer! What we have, is a rough machine learning model that can determine log severity level from the message body only.

Parser Panda – training data

I had 3 large log files, and wanted to create a small example dataset for testing purposes. Initially, the data was not too well distributed among the levels, so I wrote some code to even it out.

Initial situation - total distribution by log level:
1: 9
2: 574
3: 602
4: 829751
5: 148710
6: 590698
7: 6317

Parser-Panda reads all the 3 log files, and takes maximum of 1000 random entries from each log level and file. It does everything as panda dataframes. As pandas do. Right? Well, it gives it back as a json file. Here is the output from the program:

Generated dataframe value counts:
6.0    3000
5.0    3000
4.0    3000
7.0    2344
3.0     602
2.0     572
1.0       9

This is inside a 1.8M json file, with only the essentials needed: loglevel and the message. Note that everytime PP does it’s thing, the file ends up being different regarding the parts that are overpopulated.

Natural language processing – word tokenizer

The messages have to be converted into numeric format for the ML algorithms to understand them. I found an article which covers the NLP basics well. A word is encoded to a number, but this is not cryptography…

The words-and-sentences-into-numbers -part affects greatly to the outcome, and there are many possibilities here. On the prototype, Keras -library tokenizer is used. The tokenizer gives a sequence of numbers as it’s output, and this is essentially the feature vector for the neural network. In other words, the input layer of the neural network consists of a number-coded sequence of the log message. What I’m trying to explain is here on the left side of the picture.

tokenization_pic

The model

The inputs ready in right format, the model has to be defined. Jeff Heaton’s course on neural networks helped tons, this is the part I used to start with. There are many parameters to tune with the models. Essentially the picture I drew above is somewhat accurate. There are 2 hidden layers, but many more nodes: 10 input nodes in the input layer to start with.

Another article from towardsdatascience.com gave insights on how to choose neural network parameters, but there is a lot room for experiment, and I can’t wait to get my hands dirty 🙂

Test run

If you want to try this at home, you have to provide with journactl generated json logs, process them with the mighty Parser Panda, and run the log classifier model code. Training data had 12529 samples with 50 epochs. Took about a minute to run.

Accuracy score was calculated against the training data itself, and ~80% is really not that good, but this is the first working version, and a milestone in our project!

Screenshot_2020-03-13_19-06-08

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s