Machine Learning in Network Traffic

Intro

Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without explicit instructions, relying on patterns and inference instead. Machine learning can be divided into two types based on how the algorithm learns, namely supervised learning and unsupervised learning. Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. Unsupervised learning helps find previously unknown patterns in a data set without pre-existing labels.

Supervised learning can be divided into two types based on the type of target, namely classification and regression. When the predicted value is numerical (think oil prices, rainfall, quarterly sales, blood pressure, etc.), it is generally referred to as a “regression” problem. This is in contrast with “classification” problems, a general term for when the value we are trying to predict is categorical (loan defaults, email spam detection, handwriting recognition, etc.).

Machine learning can be implemented in many sectors, including network traffic. In network traffic, we can classify websites based on their traffic patterns. This post discusses the application of classification to network traffic; the case taken is classifying YouTube versus Spotify traffic.

Data collection

The data used is web browsing data that represents communication between the computer and the website. One standard way to capture network traffic is to use Wireshark, a free and open source packet analyzer. This tool records several values for each packet, such as time, source, destination, protocol, and info. The data from Wireshark is stored in CSV format and later processed using the R programming language.

[Figure: Wireshark packet capture]

Data Preparation

For each item (packet) in the dataset, we calculate the number of packets received by the laptop in the 10 milliseconds after that packet's time of arrival. The values computed are the mean, standard deviation, sum, and number of packets for every 10 ms window when first entering the website. After computing the per-10-ms statistics, the data is converted into sequence data. The sequences are easier to understand from the plot below: a black point (Spotify/YouTube) is predicted based on the 10 * 10 ms of traffic that follows it (red line).

The data that has been aggregated every 10 ms and turned into sequences is stored in a file called traffic.csv; it can be loaded directly with the read.csv() function.
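A minimal sketch of how this can be done (the object name traffic is just an illustrative choice):

# load the aggregated traffic data from the working directory
traffic <- read.csv("traffic.csv", stringsAsFactors = TRUE)

# inspect the structure: 184 observations of 21 variables
str(traffic)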

The traffic dataset has 21 variables (columns) and 184 observations (rows). The following is a description of each variable:
website: the name of the website being accessed (target class, Spotify/Youtube)
mean1 : average size of data in the first 10 ms
mean2 : average size of data in the second 10 ms
mean3 : average size of data in the third 10 ms
mean4 : average size of data in the fourth 10 ms
mean5 : average size of data in the fifth 10 ms

packets1: the number of packets in the first 10 ms
packets2: the number of packets in the second 10 ms
packets3: the number of packets in the third 10 ms
packets4: the number of packets in the fourth 10 ms
packets5: the number of packets in the fifth 10 ms

sd1 : standard deviation of the data size in the first 10 ms
sd2 : standard deviation of the data size in the second 10 ms
sd3 : standard deviation of the data size in the third 10 ms
sd4 : standard deviation of the data size in the fourth 10 ms
sd5 : standard deviation of the data size in the fifth 10 ms

sum1 : total size of data in the first 10 ms
sum2 : total size of data in the second 10 ms
sum3 : total size of data in the third 10 ms
sum4 : total size of data in the fourth 10 ms
sum5 : total size of data in the fifth 10 ms

From the plot below it can be seen that the proportions of YouTube and Spotify data are roughly 0.54 and 0.46, so the data labeled YouTube is slightly more than Spotify.
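The class proportions can be checked and plotted with a short sketch like the following (assuming the traffic object from above):

# proportion of each class in the target variable
prop.table(table(traffic$website))

# simple bar plot of the class proportions
barplot(prop.table(table(traffic$website)),
        ylab = "proportion", main = "Proportion of website classes")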

Data modeling

cross validation

Before doing any machine learning modeling, the data should be split into train and test sets. A related idea is known as cross-validation. The goal of cross-validation is to obtain an unbiased estimate of the model's out-of-sample performance. This matters because the in-sample error (the error you obtain from running the algorithm on the dataset it was trained on) is usually optimistic: the model is tuned to minimize the error on the training sample. The in-sample error is therefore not a good indication of how the model will perform on unseen data. Here, 80% of the data is used as the train dataset and the remainder as the test dataset.
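One common way to do the 80/20 split in base R (a sketch; the seed and object names are arbitrary):

set.seed(100)  # for reproducibility

# sample 80% of the row indices for the training set
index <- sample(nrow(traffic), size = 0.8 * nrow(traffic))

data_train <- traffic[index, ]
data_test  <- traffic[-index, ]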

create classification model

We will create a classification model using the random forest algorithm. Random forest is an ensemble-based, state-of-the-art algorithm built on decision trees and is known for its versatility and performance. I'm going to go ahead and create our random forest model using 5-fold cross-validation with 3 repeats.
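A sketch of how this can be set up with the caret package (the object names and seed are illustrative; the varImp() output shown later in the post follows caret's format):

library(caret)

# 5-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# train a random forest that predicts website from all other variables
set.seed(100)
model_rf <- train(website ~ ., data = data_train,
                  method = "rf", trControl = ctrl)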

Model Evaluation

From the model output we can see that mtry equal to 2 produces the highest accuracy, which is 0.90. mtry is the number of variables available for splitting at each tree node.

From the plot below, it can be seen that the larger mtry gets, the lower the accuracy; this shows that mtry equal to 2 is the best choice.
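The tuning results can be inspected directly from the caret object (a sketch, assuming the model_rf object above):

# cross-validated accuracy for each candidate mtry
model_rf$results

# plot accuracy against mtry
plot(model_rf)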

Variable Importance

By using the varImp() function we can get a list of the most influential variables in our random forest model. The score reflects the total decrease in node impurity from splitting on a variable, averaged across the 500 trees. For our model the most important variable is mean1.
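The table below is what caret's varImp() prints for a random forest model (a sketch, with the assumed model name):

# variable importance, scaled so the top variable is 100
varImp(model_rf)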

## rf variable importance
## 
##           Overall
## mean1    100.0000
## sum1      65.4989
## packets1  33.7943
## mean5     29.7005
## sd1       28.8722
## sum5      22.6691
## sd5       13.5076
## mean4     12.4429
## sum3       8.2232
## sum4       8.2171
## mean3      7.5905
## mean2      6.7661
## packets5   6.3980
## packets3   6.3171
## sum2       6.2264
## packets4   4.1679
## sd4        3.5040
## packets2   2.1028
## sd3        0.4484
## sd2        0.0000

Knowing that mean1 is the most important variable, we can visualize its distribution against the target variable (website). It can be seen that the mean1 values for YouTube tend to gather on the right, while those for Spotify gather on the left.
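One way to draw this distribution is with ggplot2 (a sketch, assuming the traffic object from above):

library(ggplot2)

# density of mean1 for each website class
ggplot(traffic, aes(x = mean1, fill = website)) +
  geom_density(alpha = 0.5) +
  labs(title = "Distribution of mean1 by website")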

A random forest model also produces a value called the out-of-bag (OOB) error estimate, which can be used as a reliable estimate of its true accuracy on unseen examples. From this plot it can be seen that the Spotify error (green line) is relatively higher than the YouTube error (red line).
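The per-class OOB error curves can be drawn from the randomForest object that caret stores as the final model (a sketch; the legend placement is arbitrary):

# plot OOB and per-class error against the number of trees
plot(model_rf$finalModel)
legend("topright", legend = colnames(model_rf$finalModel$err.rate),
       col = 1:3, lty = 1:3)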

Confusion Matrix

To evaluate the model we can use a confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. Before building the confusion matrix we need to predict the test data.
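A sketch of predicting the test set and tabulating the result (names as assumed above):

# predict the class of each observation in the test set
pred_test <- predict(model_rf, newdata = data_test)

# confusion matrix: predictions against the actual labels
table(prediction = pred_test, actual = data_test$website)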

##           actual
## prediction Spotify Youtube
##    Spotify      23       1
##    Youtube       2      29

The table above is known as the confusion matrix.

Observe from the confusion matrix that:
- Out of the 30 actual YouTube cases, we classified 29 correctly
- Out of the 25 actual Spotify cases, we classified 23 correctly
- Out of the 55 cases in our test set, we classified 52 correctly

From the confusion matrix, we can calculate the accuracy by dividing the number of correctly classified cases by the total number of cases.
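A one-line sketch of that calculation:

# proportion of test observations classified correctly
mean(pred_test == data_test$website)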

## [1] 0.9454545

The accuracy of the model is about 0.945, which means the classification model correctly classifies roughly 94% of the test data. To check whether the model can properly distinguish between Spotify and YouTube, we can look at the AUC (Area Under the Curve) value.

AUC

The AUC is the area under the ROC curve. The ROC curve shows the classification model's performance across all thresholds. It has two axes:
x : False Positive Rate (1 - specificity)
y : True Positive Rate (recall / sensitivity)

An AUC value of 0.976 indicates that the model distinguishes between the Spotify and YouTube classes well.
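One way to compute the AUC is with the pROC package (a sketch; it assumes class probabilities are obtained with type = "prob" and that Youtube is treated as the positive class):

library(pROC)

# predicted probability of the Youtube class on the test set
prob_test <- predict(model_rf, newdata = data_test, type = "prob")

# ROC curve and AUC
roc_test <- roc(response = data_test$website, predictor = prob_test$Youtube)
as.numeric(auc(roc_test))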

## [1] 0.976

As an addition, we can look at how the Spotify and YouTube data group together using an unsupervised learning technique, in this case a biplot. The YouTube data tends to spread out, while the Spotify data is more concentrated in one place.
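A sketch of a PCA biplot on the numeric predictors, using base R's prcomp (scaling the variables is an assumption):

# principal component analysis on everything except the target column
pca_traffic <- prcomp(traffic[, names(traffic) != "website"], scale. = TRUE)

# biplot of the first two principal components
biplot(pca_traffic)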

Conclusion

The classification model created to predict whether traffic comes from Spotify or YouTube performs well, as shown by several metrics such as the OOB error, accuracy, and AUC. This indicates that machine learning can be applied in the networking sector.

david

8/16/2019