Machine Learning or Adapting Adaptive Thresholds?
Posted by: Eric Olsen
Machine Learning seems to be everywhere. Your phone can recognize faces, and Gmail has figured out the nuances of your writing. So why is security data so different? To start, the data behind those features is tangible, accessible from the cloud, trained alongside enormous pools of similar data, and deployed in a manner that can receive continual updates and training. Security incidents, on the other hand, are ambiguous, private, unique, and depend heavily on local context.
Consequently, one of the largest challenges of deploying machine learning algorithms in a security solution is the locality of the security data. Machine learning models typically assume that they are observing a complete set of data, and their entire context relies on this assumption. Most centralized logging solutions aim to reduce waste and efficiently store security data. For example, filtering out unneeded Windows event logs is good. Collecting a VPN session’s start and stop records while dropping the intermediary records may be ideal. Oftentimes, optimizing the performance of security solutions requires that we not log everything. However, this means that machine learning algorithms no longer have a complete data set for training and alerting.
For example, network scanners are often filtered out of alerts for “hosts establishing a high number of connections with other hosts.” This filter is useful for detecting rogue devices conducting network scans, but it can make overall network traffic calculations misleading. Additionally, it is common to filter out failed authentication attempts from the phone of a user who recently reset their domain account password. While this is useful for reducing noise, it can also skew the calculation of a normal number of failed authentications.
With this concept of incomplete data in mind, I’d like to take a look at numerical outlier detection using the Splunk Machine Learning Toolkit (MLTK). Out of the box, Splunk Enterprise Security will utilize the MLTK to detect these outliers. In this example, we’ll focus on the authentication datamodel, a broad collection of authentication events from various data sources, in order to create alerts when failed authentications rise above a calculated threshold. The nuance is that Splunk has little context for how complete the data is. For example, a host landing in the 99th percentile of failed authentications is significant if all vulnerability scanners have been filtered out. However, this same host may slip into the noise of its surrounding data if vulnerability scanners are included in the calculations.
The picture below shows a search generated from a streamstats query; streamstats operates by looking backward over a given time interval to calculate a moving average across that timeframe. The upper threshold was created by combining this average with the standard deviation of all values and a multiplier. This specific data is from the authentication datamodel and concerns web authentication.
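For readers outside Splunk, this style of moving-average threshold can be sketched in plain Python. This is a minimal sketch, not the post’s actual search: the window, multiplier, and the avg-plus-deviation formulation are illustrative assumptions, and the data is hypothetical.

```python
# Sketch (not the post's search): a streamstats-style trailing moving average
# with an upper threshold built from that average, the standard deviation of
# all values, and a multiplier. Window/multiplier values are assumptions.
from statistics import mean, stdev

def upper_thresholds(values, window=24, multiplier=2.0):
    """Return (moving average, upper bound) for each point, using the
    previous `window` values (like streamstats with current=false)."""
    overall_std = stdev(values)  # standard deviation of all values, per the post
    bounds = []
    for i, _ in enumerate(values):
        past = values[max(0, i - window):i]
        avg_mov = mean(past) if past else 0.0
        bounds.append((avg_mov, avg_mov + multiplier * overall_std))
    return bounds
```

An alert would then fire whenever a point exceeds its upper bound.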
This data is very complete. There are no gaps, and we can see an obvious rhythm to the highs and lows of these events. If you are fortunate enough to have authentication data that looks like this, you may be able to deploy an effective numerical outlier model with minimal training. However, it is also worth noting that this data did not change on Saturday or Sunday, and it never bottoms out. Perhaps no one takes days off in this organization? This data may not paint the entire picture for authentication events.
Alternatively, what if your authentication data looks more like this? This may be more accurate in an environment where noisy hosts are filtered out. Like above, this is also web authentication data. The majority of events are at zero, and they shift to high values with little transition.
At this point, we may feel tempted to make a few corrections to improve this data. We could add some static values to all the zero fields and then utilize the Splunk command “makecontinuous” in order to connect these values into a smoother line. We could rely on the MLTK’s density function, which would calculate a complex polynomial function and draw a threshold line that conforms to these data spikes. The problem is that we’re forcing the data to fit our model rather than building a model that represents the data.
To address this problem, I propose that we put our thoughts on Machine Learning algorithms aside for a moment and just focus on the basics. My hypothesis is that we care about small spikes over the weekend, and we care about abnormally high spikes during workdays. We’ll build a search sensitive enough to detect both but avoid plugging the dataset into a built-in MLTK algorithm. All we need to do this is the old-fashioned Splunk eventstats, streamstats, and some mathematical elbow grease.
First, a word of warning about streamstats: as mentioned earlier, we’re looking backward in order to predict the upper threshold. If we make this window too small, we make some ambitious assumptions about how high the threshold should go. In this sample output, I used streamstats with an eight-hour window. It looks like our adaptive threshold is over-adapting.
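The effect of window length is easy to demonstrate with a toy example. The counts below are hypothetical, not the data pictured; a short window chases every spike, while a long one blurs toward the overall mean:

```python
# Sketch: how a streamstats-style trailing window's length affects the
# moving average. Hypothetical hourly counts, not data from the post.
from statistics import mean

def moving_avg(values, window):
    """Trailing moving average over the previous `window` points
    (the current point excluded, like streamstats current=false)."""
    return [mean(values[max(0, i - window):i]) if i else 0.0
            for i in range(len(values))]

spiky = [0, 0, 0, 100, 100, 0, 0, 0, 100, 100, 0, 0]  # hypothetical counts
short = moving_avg(spiky, window=2)  # over-adapts: jumps right after each spike
long_ = moving_avg(spiky, window=8)  # smoother, closer to the overall average
```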
If we make the streamstats window too long, we end up copying the entire data set. This window is set for an entire week and there is too much data fluctuation for this to fit a clear pattern.
I would like for this model to alternate between an adaptive and static threshold. Static thresholds will suit this data well when monitoring events near zero, and we’ll use eventstats to calculate an overall average. We’ll use streamstats with a window of 24 hours to generate our adaptive threshold. This will overcorrect during upticks but is also a very intuitive window. I will temporarily skip over the mathematical work so we can see the visual output:
The two remaining challenges are calculating the upper threshold and building a way of transitioning models into this calculation. Let’s start by briefly visiting the concept of exponential distribution.
Before you run off in terror at memories of solving exponential equations, the main point is simply that most of our data’s values sit near zero. They do not fit the more familiar bell curve associated with normal distributions and standard deviation. In this case, we’ll need a natural log when drawing our upper threshold.
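Here is the reasoning behind that natural log, sketched in Python (the rate and tail-probability values are illustrative). For an exponential distribution with rate lambda, the tail probability is P(X > t) = exp(-lambda * t); setting that equal to a small threshold probability and solving for t gives t = -ln(threshold) / lambda:

```python
# Sketch: deriving the upper bound for exponentially distributed data.
#   P(X > t) = exp(-lambda * t) = threshold  =>  t = -ln(threshold) / lambda
import math

def exponential_upper_bound(lam, tail_prob=0.0001):
    """Value t such that P(X > t) = tail_prob for an Exp(lam) variable."""
    return -math.log(tail_prob) / lam
```

Events landing beyond that bound are rarer than the chosen tail probability, which is what we will alert on.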
The final challenge is to calculate this threshold with a multiplier that favors eventstats when the data is near zero and streamstats when it rises above zero. Consider that when values are near zero, the moving average follows them down. Therefore, checking whether our fixed average is greater than our moving average is sufficient to switch between the two.
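That switching rule reduces to a one-liner. This is a sketch in Python rather than SPL, with avg_mov and avg_fixed standing in for the moving and fixed averages described above:

```python
# Sketch of the model-switching rule: take the rate parameter from whichever
# average is larger, so the fixed average dominates near zero and the moving
# average takes over during sustained activity.
def pick_lambda(avg_mov, avg_fixed):
    """Use 1/avg_mov when the moving average exceeds the fixed average,
    otherwise fall back to 1/avg_fixed. Assumes nonzero averages."""
    return 1.0 / avg_mov if avg_mov > avg_fixed else 1.0 / avg_fixed
```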
With all this in mind, this is the final search:
The generic search to produce this is:
…large search query
| streamstats window=24 current=false avg(Auth_Events) as avg_mov
| eventstats avg(Auth_Events) as avg_fixed
| eval threshold=.0001
| eval lambda = if(avg_mov>avg_fixed,1/avg_mov,1/avg_fixed)
| eval expon_dist = lambda*exp(-lambda*Auth_Events)
| makecontinuous expon_dist
| eval lowerBound=(0), upperBound=-1*(ln(threshold)/lambda)
| eval isOutlier=if('Auth_Events' > upperBound, 1, 0)
| fields _time, Auth_Events, lowerBound, upperBound, isOutlier, *
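As a rough cross-check, the search can be translated into plain Python. This is a sketch with hypothetical hourly Auth_Events counts; it omits the expon_dist and makecontinuous lines, which shape the visualization but do not feed the isOutlier verdict, and it assumes the averages are nonzero:

```python
# Sketch: a plain-Python translation of the SPL pipeline above.
import math
from statistics import mean

def detect_outliers(auth_events, window=24, threshold=0.0001):
    avg_fixed = mean(auth_events)                  # eventstats avg(Auth_Events)
    results = []
    for i, x in enumerate(auth_events):
        past = auth_events[max(0, i - window):i]   # streamstats current=false
        avg_mov = mean(past) if past else avg_fixed
        lam = 1.0 / max(avg_mov, avg_fixed)        # same as the SPL if()
        upper = -math.log(threshold) / lam         # exponential tail bound
        results.append({"Auth_Events": x,
                        "lowerBound": 0,
                        "upperBound": upper,
                        "isOutlier": 1 if x > upper else 0})
    return results
```

A steady series never alerts, while a sudden spike well above both averages does.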
This approach would not generalize well at a massive scale, and I would certainly not recommend embedding complicated math in every Splunk search query. However, it is a useful example of the insights machine learning can provide without truly using machine learning.
All too often, we may feel that we simply do not have enough clean and well-processed data to use machine learning in our enterprise effectively. Rather than get wrapped up in whether it can be done, I’d suggest we start with a hypothesis and a model. It may turn out that we understand the problem well enough not to need machine learning.
Eric Olsen, GuidePoint Security
Eric Olsen is a security professional with 10 years of experience in the field of information security. He began his career in the US Army as a Linguist and Intelligence Analyst, and after serving for 12 years, worked as a Security Analyst within the DoD and as a Security Engineer for various commercial and government entities at GuidePoint Security. He holds several certifications from SANS (GSEC, GCIH, GCIA), is a Splunk Core Certified Consultant, and holds the CISSP from (ISC)².