Accedian is now part of Cisco

By Michael Rezek

The proper use of machine learning: the cybersecurity edition

Let the machines take over …or perhaps not always?

The rise of the machine has been apparent this past decade. I remember taking many statistics classes as a minor in my Master’s program at Georgia Tech, and all of the grinding, mind-bendingly terrible work that had to be performed by hand. Now it’s so easy: just let the machines take over … or perhaps not always?

There are many companies that flex their machine learning (ML) muscles in marketing how many algorithms they have for performing threat detection, but is that really the standard-bearing measure of the “best” cybersecurity solution? 

When is ML necessary, and when are statistical methods satisfactory for cybersecurity threat detection? 

Can too much ML actually be a liability? 

I think it’s important to first understand when statistical methods are effective and when machine learning methods are needed. Statistics and ML/AI are both about relationships and the correlation of variables between data sets.

So, let’s take a fun example. If we wanted to look at the relationship between children and ice cream flavors, the first questions are, “what are the variables related to children?” and “what are the variables related to ice cream flavors?”

These variables are generally referred to as the “dimensions” of data sets. If I only look at the gender of the children, that is one dimension, but if I look at the gender and age, then that’s 2 dimensions. If I consider gender, age, and economic class, then that’s 3 dimensions … you get the picture. 

The same goes with ice cream flavors. The color (and supposed flavor) is one dimension, and perhaps we add the number of scoops as a second dimension. 

Essentially, any variable that could impact the relationship between children and ice cream flavor selection can be analyzed and correlated.

So if we want to know the relationship between child gender and ice cream flavor, would I need to train an ML model to make that determination, or would statistical methods be enough?

In this case, because we are only looking at 2 dimensions of data, and we know which variables we are relating (male/female and, say, 2 colors of ice cream), it’s pretty obvious that statistical models are satisfactory. We might find that perhaps 75% of the girls choose pink ice cream and 75% of the boys choose the green ice cream.
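
As a rough sketch of what "statistical methods are enough" means here, the snippet below counts (gender, flavor) pairs and turns them into per-gender shares. The observations are invented to match the 75% figures in the example, and `flavor_shares` is a hypothetical helper name, not part of any product or library.

```python
from collections import Counter

# Hypothetical observations, invented to mirror the 75% example in the text.
observations = (
    [("girl", "pink")] * 75 + [("girl", "green")] * 25
    + [("boy", "green")] * 75 + [("boy", "pink")] * 25
)

def flavor_shares(pairs):
    """Return {gender: {flavor: share}} from a list of (gender, flavor) pairs."""
    pair_counts = Counter(pairs)
    gender_totals = Counter(gender for gender, _ in pairs)
    shares = {}
    for (gender, flavor), count in pair_counts.items():
        # Share of this flavor among all children of this gender.
        shares.setdefault(gender, {})[flavor] = count / gender_totals[gender]
    return shares

baseline = flavor_shares(observations)
print(baseline["girl"]["pink"])   # 0.75
print(baseline["boy"]["green"])   # 0.75
```

With only two dimensions, a simple table of counts like this is all the "model" we need.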

So now we have a baseline, and if we wanted to “alert on a deviation” from this norm, we could. No ML is really required. ML can be computationally costly and interpreting ML results is not always easy, so ML should only be used when necessary.
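
To make "alert on a deviation" concrete, here is a minimal sketch that compares a new sample against the 75% baseline from the example. It assumes the normal approximation to the binomial distribution, and the threshold of 3 standard deviations is an illustrative choice on my part, not a recommended setting. The function name `deviates_from_baseline` is hypothetical.

```python
import math

def deviates_from_baseline(observed_hits, sample_size, baseline_share, z_threshold=3.0):
    """Flag a sample whose count is far from what the baseline share predicts.

    Uses the normal approximation to the binomial distribution; the
    default threshold is an illustrative assumption.
    """
    expected = sample_size * baseline_share
    std_dev = math.sqrt(sample_size * baseline_share * (1 - baseline_share))
    if std_dev == 0:
        # Degenerate baseline (0% or 100%): any difference is a deviation.
        return observed_hits != expected
    z_score = (observed_hits - expected) / std_dev
    return abs(z_score) > z_threshold

# Baseline from the article's example: 75% of girls choose pink.
print(deviates_from_baseline(74, 100, 0.75))  # False: close to the norm
print(deviates_from_baseline(40, 100, 0.75))  # True: far from the norm
```

Nothing is trained here: the baseline is a single number, and the result is easy to explain to whoever receives the alert.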

But if I had no idea whether there is a correlation between multiple variables of the children (gender, age, socioeconomic class) and the color and number of scoops of ice cream, statistical methods might be inadequate, because we have introduced a higher number of dimensions and we aren’t sure what the relationship is, if there is any.

The benefit of machine learning is that it requires few prior assumptions about the underlying relationships between the variables and can model vast amounts of data. Load the data you have, and the models learn the patterns, which can then be used to make predictions.

Machine learning algorithms are often described as a black box. ML is generally applied to data sets with a higher number of dimensions, and prediction accuracy tends to improve with the amount of data you have. But you still have to be able to interpret the results, because there is always a risk of false positives, especially when the dimensionality of the data set is low or the data is not rich and clean enough.

Statistical modeling is adequate when you have a lower number of dimensions for your data and you know what relationships you want to understand.

Dimensionality and how it impacts cybersecurity

In cybersecurity, we often know the relationships between data sets because we know what the behavioral characteristics are that we are looking for, and we can easily apply statistical standard deviation techniques to detect an anomaly. 
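
A standard deviation check of this kind can be sketched in a few lines. The numbers below are invented (think of them as failed logins per hour for one host), the threshold of 3 standard deviations is an assumption, and `zscore_anomalies` is a hypothetical name used only for this illustration.

```python
import statistics

def zscore_anomalies(history, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` standard
    deviations away from the mean of the historical values."""
    mean = statistics.mean(history)
    std_dev = statistics.stdev(history)
    if std_dev == 0:
        # Perfectly flat history: anything different is anomalous.
        return [value for value in new_values if value != mean]
    return [value for value in new_values if abs(value - mean) / std_dev > threshold]

# Hypothetical history, e.g. failed logins per hour for one host.
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]
print(zscore_anomalies(history, [6, 5, 42]))  # [42]
```

This works because we already know which behavior we are watching and what "normal" looks like for it.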

However, if we have a high number of variables and do not know what relationship, if any, exists among the variables we have available to analyze, we need to train machine learning models to identify anomalous behaviors.
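
For contrast, here is a deliberately small sketch of a learned approach: a nearest-neighbor anomaly score over several features at once. It is a simple stand-in for the kind of model a real product would train, not a description of any vendor's method. The feature set and every value are invented for illustration, and the helper names (`fit_scaler`, `scale`, `knn_score`) are hypothetical.

```python
import math
import statistics

def fit_scaler(rows):
    """Learn per-feature mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(column) for column in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    std_devs = [statistics.pstdev(column) or 1.0 for column in columns]
    return means, std_devs

def scale(row, scaler):
    """Standardize one row so every feature contributes on a similar scale."""
    means, std_devs = scaler
    return [(x - m) / s for x, m, s in zip(row, means, std_devs)]

def knn_score(row, train_scaled, scaler, k=3):
    """Mean distance to the k nearest training rows (higher = more unusual)."""
    point = scale(row, scaler)
    distances = sorted(math.dist(point, other) for other in train_scaled)
    return sum(distances[:k]) / k

# Hypothetical per-connection features, invented for illustration:
# (KB sent, duration in seconds, distinct ports, packets per second)
normal_traffic = [
    (120, 30, 2, 40), (110, 28, 2, 38), (130, 33, 3, 42), (125, 31, 2, 41),
    (115, 29, 2, 39), (128, 32, 3, 43), (118, 30, 2, 40), (122, 31, 2, 41),
]
scaler = fit_scaler(normal_traffic)
train_scaled = [scale(row, scaler) for row in normal_traffic]

typical = knn_score((121, 30, 2, 40), train_scaled, scaler)
unusual = knn_score((119, 30, 40, 41), train_scaled, scaler)
print(typical < unusual)
```

Notice that nobody told the code which feature matters or how the features relate; it only learned what typical traffic looks like. The trade-off is that the score on its own does not say why a connection is unusual, which is the interpretation problem mentioned earlier.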

ML and statistical models are both excellent tools, but like any tool, knowing which one to use for a given job is as important as the way they are used.

Behavior-based intrusion detection, like effective machine learning algorithms, allows security analytics teams to sift through the noise and act on the data and alerts that can mean the difference between a costly data breach and an effective security defense. See how Accedian Skylight powered Security provides full visibility for all network connections, North-South and East-West, for complete threat detection, including signatures, anomalies, and behaviors in the cloud.