4.1 Unsupervised Learning
In collaboration with https://www.secrepo.com
Introduction
PCA/Clustering
This marks the first unsupervised learning lab. There are several aspects to unsupervised learning:
Data has no labels
The goal is to find structure
The most "popular" aspect is clustering
It also includes dimensionality reduction and feature extraction
This lab focuses on dimensionality reduction via PCA (Principal Component Analysis) and gives an introduction to K-means clustering.
Data
Lab
In this lab you, as an analyst, have a list of domains and the related blacklists they appear on. In addition, some of these domains were responsible for sending a file to the client. These files have been run through VirusTotal, and the AV results are also available with the domains. The goal is to explore the data, find some structure, and gain confidence in which domains are more malicious as a means of prioritization. As with any type of data exploration, it's not a silver bullet, but perhaps you'll gain an understanding of the data.
File Input - Blacklist Data
The data for the lab is contained in host_detections.csv and has columns: host, detections, and detection_count.
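A minimal sketch of the loading step follows. The inline sample is a hypothetical stand-in for host_detections.csv: the host names, the blacklist name listA, and the comma-separated layout of the detections column are assumptions made for illustration, since the real file is not shown here.

```python
import io

import pandas as pd

# Hypothetical stand-in for host_detections.csv. The real file's formatting
# of the detections column may differ from this comma-separated layout.
sample_csv = io.StringIO(
    "host,detections,detection_count\n"
    'a.example,"hpHosts,listA",2\n'
    "b.example,,0\n"
    'a.example,"hpHosts,listA",2\n'
)

# In the lab itself this would likely be pd.read_csv("host_detections.csv").
df = pd.read_csv(sample_csv)

print(df.columns.tolist())
print(df.shape)
```
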
Cleanup - Blacklist Data
Drop the duplicates on the df dataframe for the host column.
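One way to do this is with `drop_duplicates`, shown here on a small made-up dataframe (the hosts and values are illustrative, not taken from the lab data). Resetting the index afterwards is optional but keeps later joins by position straightforward.

```python
import pandas as pd

# Made-up rows; a.example appears twice on purpose.
df = pd.DataFrame({
    "host": ["a.example", "b.example", "a.example"],
    "detections": ["hpHosts,listA", "", "hpHosts,listA"],
    "detection_count": [2, 0, 2],
})

# Keep the first row seen for each host and renumber the index.
df = df.drop_duplicates(subset="host").reset_index(drop=True)

print(len(df))
```
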
This next section cleans up the detections column. It strips the text formatting, converts each entry into a Python list, and stores that list back in the dataframe in place of the text. It also builds a multi-dimensional list that represents the various blacklists and whether the domain had a hit (1) or not (0).
Join the resulting multi-dimensional list to the "side" of the existing dataframe.
You can see that the host 02b123c.netsolhost.com has 0 detections and 0s for all of the blacklist values, whereas 0lilioo0l0o00lilil.info has 7 detections and a 1 in place of each of its detections (e.g. hpHosts).
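The cleanup and join described above could look roughly like the sketch below. It assumes the raw detections text is a comma-separated string, which may not match the real file; the helper `to_list`, the blacklist name listA, and the hosts are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "host": ["a.example", "b.example"],
    "detections": ["hpHosts,listA", None],  # assumed comma-separated text
    "detection_count": [2, 0],
})


def to_list(text):
    """Turn a raw detections string into a list of blacklist names."""
    if not isinstance(text, str) or not text.strip():
        return []
    return [name.strip() for name in text.split(",") if name.strip()]


# Replace the text with a Python list.
df["detections"] = df["detections"].apply(to_list)

# Build the multi-dimensional list: one column per blacklist,
# 1 where the host was listed and 0 where it was not.
names = sorted({name for hits in df["detections"] for name in hits})
rows = [[1 if name in hits else 0 for name in names] for hits in df["detections"]]
indicators = pd.DataFrame(rows, columns=names, index=df.index)

# Join the indicator columns to the "side" of the existing dataframe.
df = df.join(indicators)
print(df)
```
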
File Input - VirusTotal
The data is in a file named mal_domains.csv and has columns: host, count, and detections. This data has been pre-processed to save some pain on parsing and assembling massive amounts of JSON data.
Cleanup - VirusTotal
Similar to the above, we clean up the detections column.
A little massaging is necessary here because some blacklists and AV engines share the same name. This renames the AV columns with an av_ prefix, which ensures there are no duplicates and has the extra advantage of allowing easy distinction in analysis.
Also, join the AV dataframe to the blacklist one created above.
This is where the expansion, and then filling in of values, 1 for detection and 0 for no detection, happens.
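As a rough sketch of the rename, join, and fill steps, the example below uses two tiny made-up dataframes. The names VendorX and EngineY are hypothetical, and the left join on host is an assumption about how the lab combines the two tables.

```python
import pandas as pd

# Blacklist side (already expanded). VendorX is a hypothetical name that is
# both a blacklist and an AV engine.
bl = pd.DataFrame({
    "host": ["a.example", "b.example", "c.example"],
    "detection_count": [0, 1, 1],
    "VendorX": [0, 1, 1],
})

# AV side, with detections already parsed into lists.
av = pd.DataFrame({
    "host": ["a.example", "c.example"],
    "count": [1, 2],
    "detections": [["VendorX"], ["VendorX", "EngineY"]],
})

# Expand the AV detections into indicator columns with an av_ prefix.
engines = sorted({engine for hits in av["detections"] for engine in hits})
for engine in engines:
    av["av_" + engine] = [int(engine in hits) for hits in av["detections"]]
av = av.rename(columns={"count": "av_count", "detections": "av_detections"})

# Assumed left join: keep every blacklist host, even without AV results.
merged = bl.merge(av, on="host", how="left")

# Hosts with no AV data get 0 for the count and for every engine.
num_av = [c for c in merged.columns if c.startswith("av_") and c != "av_detections"]
merged = merged.fillna({c: 0 for c in num_av}).astype({c: int for c in num_av})
print(merged)
```

Because of the prefix, the blacklist column VendorX and the AV column av_VendorX can sit side by side without colliding.
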
For consistency's sake, set all of the columns except host, detections, and av_detections to type int.
Take a look at the resulting dataframe; you'll see a similar structure to the one above.
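A small sketch of the type conversion, using a made-up dataframe whose numeric columns start out as floats (as they often do after a join introduces missing values). The column av_EngineY is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "host": ["a.example", "b.example"],
    "detections": [["hpHosts"], []],
    "av_detections": [[], ["EngineY"]],
    "detection_count": [1.0, 0.0],
    "av_count": [0.0, 1.0],
    "hpHosts": [1.0, 0.0],
    "av_EngineY": [0.0, 1.0],
})

# Everything except the three text/list columns becomes an integer column.
text_cols = ["host", "detections", "av_detections"]
num_cols = [c for c in df.columns if c not in text_cols]
df = df.astype({c: int for c in num_cols})

print(df.dtypes)
```
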
The cell below shows how to print the dimensions of the dataframe. In this case it has 346 rows and 97 columns (i.e. dimensions). This is due to the selection clause, which looks for domains that have zero AV results against them and more than one blacklist hit.
Try reversing the query: av_count > 0 and detection_count == 0.
In your exploration you might have run across an IP address or two. Let's split these up into two different dataframes. This will allow an apples-to-apples comparison.
How many elements (rows) are in each dataframe (domains, ips)?
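For illustration, a selection like the one described can be written with `DataFrame.query`, with `.shape` giving the (rows, columns) pair. The data below is made up, and the threshold `detection_count > 1` is an interpretation of "more than one blacklist hit" rather than the lab's confirmed clause.

```python
import pandas as pd

df = pd.DataFrame({
    "host": ["a.example", "b.example", "c.example", "d.example"],
    "detection_count": [3, 0, 2, 0],
    "av_count": [0, 4, 1, 0],
})

# Zero AV results and more than one blacklist hit.
subset = df.query("av_count == 0 and detection_count > 1")

# shape is (number of rows, number of columns).
print(subset.shape)
```
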
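One possible way to do the split is to test each host with the standard library's `ipaddress` module, as sketched below. The helper name `is_ip` and the sample hosts are hypothetical; the addresses come from ranges reserved for documentation.

```python
import ipaddress

import pandas as pd


def is_ip(host):
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


df = pd.DataFrame({
    "host": ["a.example", "192.0.2.10", "b.example", "2001:db8::1"],
})

# Boolean mask: True for rows whose host is an IP address.
mask = df["host"].apply(is_ip)
ips = df[mask].reset_index(drop=True)
domains = df[~mask].reset_index(drop=True)
```
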
Analysis
The cell below pulls out the list of features that we want to use. In this case it's all of the columns except those that don't (or appear not to) add any value to the analysis. The hostname is what is being analyzed, the detections and av_detections columns are sparse text that can't be used in this lab, and the counts should already be accounted for by the presence or absence of each qualifying detection event (AV or blacklist).
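A sketch of that selection on a made-up dataframe is shown below. The excluded column names follow the ones mentioned in this lab; the list name `not_features` and the column av_EngineY are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "host": ["a.example", "b.example"],
    "detections": [["hpHosts"], []],
    "av_detections": [[], ["EngineY"]],
    "detection_count": [1, 0],
    "av_count": [0, 1],
    "hpHosts": [1, 0],
    "av_EngineY": [0, 1],
})

# Columns that identify the row, hold text, or duplicate information.
not_features = ["host", "detections", "av_detections", "detection_count", "av_count"]

# Everything else is a 0/1 indicator and is kept as a feature.
features = [c for c in df.columns if c not in not_features]
print(features)
```
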
K-Means Clustering
K-Means works on a fairly simple idea. You provide the algorithm with K, the number of clusters you think are in the dataset. The algorithm then places K centroids, assigns each point to its nearest centroid, and moves each centroid to the mean of the points assigned to it, repeating until the assignments settle. Each centroid marks the center of its cluster.
Below, the K for K-means was set to two. There are many ways to determine an optimal K, but for this exercise we're only interested in two labels, good and bad. By doing this we can guide the algorithm into picking two centers and giving us a "good" group and a "bad" group of domains.
The data is clustered two times. One time with both the blacklist and AV features, and another time with just the blacklist features. The labels for the clusters are stored in bl_vt_labels and bl_labels respectively. This allows an easy way to reference the labels without re-clustering the data later on.
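The two clustering runs might look like the sketch below, which uses scikit-learn's `KMeans` on a tiny made-up indicator matrix. The column names listA, listB, av_EngineY, and av_EngineZ are hypothetical, and the `n_init` and `random_state` values are choices made here for repeatability, not settings taken from the lab.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "host": ["a.example", "b.example", "c.example",
             "d.example", "e.example", "f.example"],
    "listA": [1, 1, 1, 0, 0, 0],
    "listB": [1, 1, 0, 0, 0, 0],
    "av_EngineY": [1, 1, 1, 0, 0, 0],
    "av_EngineZ": [1, 0, 1, 0, 0, 0],
})

features = [c for c in df.columns if c != "host"]
bl_features = [c for c in features if not c.startswith("av_")]

# Cluster once on blacklist + AV features, once on blacklist features only.
bl_vt_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df[features])
bl_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df[bl_features])

print(bl_vt_labels)
print(bl_labels)
```

Which group ends up as 0 and which as 1 depends on the initialization, so only the grouping itself carries meaning.
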
You should add a third cluster section that stores the labels in vt_labels, and is only a cluster of columns from the AV set. Remember the AV results are prefixed with av_ making the columns easy to pick out.
Check your work! Make sure to print out at least a few elements of vt_labels.
Remember, the algorithm doesn't know what's malicious or not, so don't place any inherent value in a label of 1 or 0. It's only a label for the group the algorithm thinks the data belongs in. You, as an analyst, might still be able to infer whether a group is the malicious or the benign cluster.
Below is a way to spot check domains; explore a couple more on your own. You can see what blacklists and AV engines, if any, are associated with the domain.
PCA
PCA is used for dimensionality reduction. One of its major advantages is being able to visualize data. Our current dataset has 92 features/dimensions, which, unless you have super powers, is pretty hard to visualize. One awesome use of PCA is to reduce these dimensions down into something that we as mortals can see.
The first exercise is reducing all 92 dimensions down to three for easy and pretty graphing. The colors in the graph are set by the labels from the K-Means clustering above.
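The reduction itself can be sketched with scikit-learn's `PCA`, as below. The input is a random 0/1 matrix standing in for the real feature columns, so the shapes and variance figures are illustrative only, and the plotting step is left out.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 0/1 matrix as a stand-in for the indicator features:
# 50 hosts, 10 features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 10))

# Project the data onto its first three principal components.
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X)

print(X_3d.shape)
# Fraction of the total variance captured by each kept component.
print(pca.explained_variance_ratio_)
```

The three columns of `X_3d` can then be used as x, y, and z coordinates in a 3D scatter plot, with the K-Means labels supplied as the color values.
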
Do the same as the cell below, but with one set of graphs for the blacklist only data and one set of graphs for the VirusTotal only data. What kinds of patterns emerge?
Hint: don't forget to use the right labels for the right columns.
2D
Now that you're a whiz at reducing various dimensions to three, it's possible to reduce down to two and graph that. Perhaps some more or different structure will pop out at you.
Once again the combined blacklist and VirusTotal scenario is done for you. Do the same as above and examine the blacklist only and VirusTotal only cases in 2D.
1D
Our last stop on this journey is 1D. The insights gained by visualizing the data in both three and two dimensions can be pretty helpful. As the beginning of the lab stated, our goal is to create some kind of ranking or prioritization of the domains, which is just a one-dimensional task. Since looking at a list of numbers isn't that pretty, we'll cheat a bit for the graphing and plot our points along the X-axis with a Y value of 0 for each point.
The case of all the features has been provided for you; repeat the process for blacklist only and AV only.
Scaled Data
One of the final things we can do with this information is scale the feature returned by PCA. This rescales the data so all values are between zero and one, giving a really nice scale.
The case of both AV and blacklist is once again provided; perform the same operation/graph for AV only and blacklist only.
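A sketch of the 1D reduction followed by scaling is shown below. It assumes min-max scaling via scikit-learn's `MinMaxScaler`, which matches the description of values between zero and one but is not confirmed as the lab's method. The input is again a random 0/1 matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Random 0/1 matrix as a stand-in for the indicator features.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(40, 8))

# Reduce to a single component: one number per host.
one_d = PCA(n_components=1).fit_transform(X)  # shape (40, 1)

# Rescale so the smallest value maps to 0 and the largest to 1.
scaled = MinMaxScaler().fit_transform(one_d).ravel()

# Constant Y value so the points can be drawn along the X-axis.
y = np.zeros_like(scaled)

print(scaled.min(), scaled.max())
```
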
Putting It All Together
After doing all that work to attempt to order and group data, it's time to make use of the results. Remember that the labels 0 and 1 are arbitrary, so you will need to assign the values back to the domains and interpret the data yourself to understand what's going on.
Here's one of the ways to assign and look at domains. This is just for the AV and blacklist results, so you should do the same with the other labels/values.
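As an illustration, the labels and scaled values can be attached as new columns and then sorted or grouped. All values below are made up, and the column names bl_vt_label and bl_vt_score are hypothetical. Note that the direction of a PCA axis is arbitrary, so a high score is not automatically the malicious end of the scale.

```python
import pandas as pd

df = pd.DataFrame({
    "host": ["a.example", "b.example", "c.example", "d.example"],
    "detection_count": [5, 0, 7, 1],
})

# Made-up stand-ins for the K-Means labels and the scaled 1D PCA values.
bl_vt_labels = [1, 0, 1, 0]
bl_vt_scaled = [0.9, 0.1, 1.0, 0.0]

df["bl_vt_label"] = bl_vt_labels
df["bl_vt_score"] = bl_vt_scaled

# Rank hosts by score, highest first.
ranked = df.sort_values("bl_vt_score", ascending=False)
print(ranked)

# Compare the clusters to judge which one looks more malicious.
mean_hits = df.groupby("bl_vt_label")["detection_count"].mean()
print(mean_hits)
```
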
When does this seem to work, and when does it seem to fail? How valuable do you think this kind of technique is?