There are many types of machine learning, as you explored in the class and lab exercises. The general context of this project is “unsupervised learning,” with a specific focus on K-means clustering. This technique works with unlabeled data, that is, data without predefined categories or groups. The K-means algorithm creates the missing categories by building clusters based on feature similarity; it has many uses across industries and domains. In this assignment, you will explore applications of K-means clustering in cybersecurity. You will analyze a publicly available packet capture (PCAP) file from a widely used repository, NETRESEC. This assignment is open-ended in the sense that you have the freedom (and responsibility) to choose a dataset and identify the type of information sought. The only requirements are that the dataset is cybersecurity related (specifically a PCAP file), that it comes from NETRESEC, and that you use the K-means clustering algorithm as detailed below. Read all steps below before attempting any work on this assignment.
K-means clustering is a method that finds k clusters from a collection of M objects with n attributes. For a given cluster of m points (\(m \le M\)), the point that corresponds to the cluster’s mean is called a centroid. In mathematics, a centroid refers to a point that corresponds to the center of mass for an object.
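As a short worked form of this definition, and assuming the cluster's points are written \(x_1, \dots, x_m\) with each \(x_i\) a vector of the n attributes (this notation is introduced here for illustration), the centroid \(c\) is the attribute-wise mean:

\[
c = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad x_i \in \mathbb{R}^n .
\]

For example, a cluster holding the two points (2, 4) and (4, 8) has its centroid at (3, 6).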
1. Choose the value of k and k initial guesses for the centroids.
2. Compute the distance from each data point to each centroid. Assign each point to the closest centroid. This association defines the first k clusters.
3. Compute the centroid, the center of mass, of each newly defined cluster from Step 2.
4. Repeat steps 2 and 3 until the algorithm converges to an answer:
   a. Assign each point to the closest centroid computed in Step 3.
   b. Compute the centroid of the newly defined clusters.
   c. Repeat until the algorithm reaches the final answer, that is, until the assignments stop changing.
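The steps above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the implementation behind R's built-in `kmeans` (which uses a different algorithm by default): `simple_kmeans`, its arguments, and the synthetic `pts` matrix are hypothetical names introduced here, distances are Euclidean, and the initial centroids are taken to be k randomly chosen data points.

```r
# Minimal sketch of the iteration described above.
# simple_kmeans and its arguments are hypothetical names, not part of base R.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: use k randomly chosen data points as the initial centroid guesses.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid,
    # then assign each point to its closest centroid.
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once no point changes cluster.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: move each centroid to the mean of the points assigned to it.
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

# Usage on synthetic data: two well-separated blobs of 20 points each.
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)
```

With blobs this far apart, each blob would be expected to end up in its own cluster; on less clearly separated data the result can depend on the initial guesses.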
The data used in this project is the Ruspini data set: “The Ruspini data set, consisting of 75 points in four groups that is popular for illustrating clustering techniques.” (https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/ruspini.html)
The raw Ruspini data are plotted below.
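set.seed(1)                        # fix the random seed so the K-means result is reproducible
ruspini = read.csv("ruspini.csv")  # load the raw data
ruspini = ruspini[,2:3]            # keep columns 2 and 3 (x and y)
plot(ruspini)                      # scatter plot of the raw points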
The K-means algorithm is applied to the data below. k is set to four because the scatter plot in the previous section appears to show four clusters.
ruspiniKM = kmeans(ruspini, 4)
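Because `kmeans` starts from randomly chosen centroids, a single run can settle on a poorer clustering depending on the seed. One common safeguard is the `nstart` argument of `kmeans`, which tries several random starts and keeps the best. The sketch below shows this on synthetic data; the `demo` matrix and the `centers_true` values are made up for illustration and are not the Ruspini data.

```r
# Synthetic stand-in with four groups of 20 points; not the Ruspini data.
set.seed(1)
centers_true <- rbind(c(20, 60), c(45, 145), c(100, 115), c(70, 20))
demo <- do.call(rbind, lapply(seq_len(4), function(i) {
  cbind(x = rnorm(20, centers_true[i, 1], 5),
        y = rnorm(20, centers_true[i, 2], 5))
}))

single <- kmeans(demo, centers = 4)               # one random start
multi  <- kmeans(demo, centers = 4, nstart = 25)  # best of 25 random starts

# A lower total within-cluster sum of squares indicates a tighter clustering.
c(single = single$tot.withinss, multi = multi$tot.withinss)
```

If the two values agree, the single start already found the same solution; if not, the `nstart` run is the one to prefer.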
The data are plotted below, with each point colored by its assigned cluster.
colors = c("red", "green", "orange", "blue")
plot(ruspini[,1:2], pch = 19, col = colors[ruspiniKM$cluster])
The information within the K-means model is displayed below.
ruspiniKM
## K-means clustering with 4 clusters of sizes 20, 23, 17, 15
##
## Cluster means:
##          x        y
## 1 20.15000  64.9500
## 2 43.91304 146.0435
## 3 98.17647 114.8824
## 4 68.93333  19.4000
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
## [71] 4 4 4 4 4
##
## Within cluster sum of squares by cluster:
## [1] 3689.500 3176.783 4558.235 1456.533
## (between_SS / total_SS = 94.7 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"
## [5] "tot.withinss" "betweenss"    "size"         "iter"
## [9] "ifault"
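The sums of squares in this output can be recomputed by hand, which helps in reading the `between_SS / total_SS` percentage: it is the share of the total spread that is explained by the cluster assignment. The sketch below uses a small synthetic matrix (`demo`, an assumption for illustration, not the Ruspini data) and compares the hand-computed values with the `totss`, `withinss`, and `betweenss` components.

```r
# Synthetic stand-in with two groups; not the Ruspini data.
set.seed(1)
demo <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
              matrix(rnorm(40, mean = 10), ncol = 2))
km <- kmeans(demo, centers = 2, nstart = 10)

# Total sum of squares: squared distances of all points to the overall mean.
total_ss <- sum(scale(demo, center = TRUE, scale = FALSE)^2)

# Within-cluster sum of squares: squared distances to each cluster's own center.
within_ss <- sapply(seq_len(2), function(j) {
  pts <- demo[km$cluster == j, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[j, ])^2)
})

# Between-cluster share, the quantity printed as between_SS / total_SS.
ratio <- (total_ss - sum(within_ss)) / total_ss
c(by_hand = ratio, from_model = km$betweenss / km$totss)
```

A ratio close to 1 means most of the variation lies between clusters rather than inside them.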
The cluster means are the four points that represent the mean of their respective clusters. Each one lies at the center of one of the colored clusters in the plot from the previous section.
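One way to check this reading is to average the points of each cluster directly and compare the result with the `centers` component. The sketch below does so on a small synthetic matrix (`demo` is an assumption for illustration, not the Ruspini data).

```r
# Synthetic stand-in with two groups; not the Ruspini data.
set.seed(1)
demo <- rbind(cbind(x = rnorm(20, 20, 5), y = rnorm(20, 60, 5)),
              cbind(x = rnorm(20, 100, 5), y = rnorm(20, 115, 5)))
km <- kmeans(demo, centers = 2, nstart = 10)

# Per-cluster column means, computed directly from the cluster assignments.
manual_means <- apply(demo, 2, function(col) tapply(col, km$cluster, mean))

manual_means  # means computed by hand
km$centers    # means reported by kmeans
```

On the plot from the previous section, a call such as `points(ruspiniKM$centers, pch = 4, cex = 2)` would be one way to overlay the four means on the colored clusters.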
These clusters hold no specific meaning: the data set was created to illustrate clustering techniques and does not actually represent anything.
Ruspini data. (n.d.). Retrieved December 05, 2017, from https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/ruspini.html
Dietrich, D., Heller, B., & Yang, B. (2015). Data science & big data analytics: discovering, analyzing, visualizing and presenting data. Indianapolis, IN: Wiley.