Even if our first log classifier was not the most accurate, we managed to turn it into a working command-line version. Due to special circumstances, we published a demonstration video in Finnish, aimed at our supervisor, Tero Karvinen, so he could evaluate and guide our progress. The demo itself is available on the project GitHub with instructions on how to run it.
There would be numerous ways to improve the first demo model, but we decided to focus on our primary objective: anomaly detection from logs.
Next steps: we need a dataset!
By now we know what is needed at the fundamental level: data. But what data? To keep things simple, we decided to start with Apache access logs and try to detect anomalies there. Depending on who you ask, Apache is either the most popular web server or #2, competing with Nginx for the top spot.
Let the neural network analyze the data?
We are going to teach the model how things look when they are fine. The logs will be monitored live, and if something unexpected happens, the model notices it. So we need logs, and we decided on what happens in /var/log/apache2/access.log.
This is actually a tomcat8 log, which is what I happened to have at hand, but an Apache access log looks almost the same. Also note the IPv6 localhost address!
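For readers who have not stared at these files before, Apache's default "combined" format packs the client address, timestamp, request line, status code, response size, referer and user agent into one line. A minimal sketch of pulling those fields apart in Python (the sample line and its values are made up for illustration, but it uses the IPv6 localhost address mentioned above):

```python
import re

# A hypothetical access-log line in Apache's "combined" format,
# with ::1 (IPv6 localhost) as the client address.
LINE = '::1 - - [12/Feb/2020:14:32:01 +0200] "GET /index.html HTTP/1.1" 200 3456 "-" "curl/7.68.0"'

# One named group per field of the combined log format.
PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

match = PATTERN.match(LINE)
if match:
    fields = match.groupdict()
    print(fields["host"], fields["status"], fields["request"])
    # -> ::1 200 GET /index.html HTTP/1.1
```

Turning raw lines into named fields like this is the first step before any of them can be fed to a model.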
The model we are going to construct will use unsupervised learning. An autoencoder neural network is well suited to this task, as it is geared towards finding patterns and regularities in the data and compressing the data into a set of new variables. Anomalies are then detected when new data fails to fit the regularities and patterns learned from the baseline data.
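The core idea can be sketched without a deep-learning library. Below we use PCA, which is the closed-form optimum of a *linear* autoencoder, just to illustrate the detection principle: learn a compressed representation of normal data, then flag inputs that reconstruct poorly. The toy data, the bottleneck size and the anomalous point are all assumptions made up for this sketch, not our actual log features:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" data: 4 features that really live in a 2-D subspace,
# mimicking the regularities a model would learn from healthy logs.
a, b = rng.uniform(-1, 1, (2, 500))
normal = np.column_stack([a, b, a + b, a - b])

# "Train": centre the data and keep the top-2 principal components,
# playing the role of the autoencoder's bottleneck.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]

def reconstruction_error(x):
    """Encode into the bottleneck, decode back, measure the miss."""
    code = (x - mean) @ components.T      # encode
    decoded = code @ components + mean    # decode
    return float(np.sum((x - decoded) ** 2))

# Threshold: the worst error seen on the normal (training) data.
threshold = max(reconstruction_error(x) for x in normal)

anomaly = np.array([1.0, 1.0, 5.0, 5.0])  # violates the learned pattern
print(reconstruction_error(anomaly) > threshold)  # -> True
```

A real autoencoder replaces the linear projection with a trained neural network, but the anomaly test stays the same: compare each input's reconstruction error against a threshold learned from normal traffic.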
Once we can detect anomalies in Apache access logs, we will try to broaden the view. But let's not get ahead of ourselves. Some solutions seem to exist already…
Sources and approaches
A commercial log detector is available from Zebrium Inc., and their blogs reveal some of the techniques they use. In one of their blog posts, they describe a ‘log dictionary’, which for the Apache web server could possibly be obtained directly from its source code. Jeff Heaton’s deep learning course has time and again proven a very good and instructive resource, so we will definitely be going through autoencoder neural networks and LSTM (Long Short-Term Memory) basics from there in the near future.
We might have to put a human in the loop as well; how to teach the model to take user input into account is yet another subject to look into at some point.