The goal of this project is to construct a model that predicts whether a packet of network traffic is anomalous or normal.
For this project I will build a logistic regression model using the UNSW-NB15 data set. UNSW-NB15 is a well-known network intrusion data set whose raw traffic was generated with the IXIA PerfectStorm tool at UNSW Canberra. A partition of the original data set was configured as a training set and a testing set, and I use these two sets throughout the project.
I start by reading the data set files into their respective variables: TESTING, TRAINING, and FEATURES.
TESTING = read.csv('testing.csv')
TRAINING = read.csv('training.csv')
FEATURES = read.csv('features.csv')
FEATURES
## No. Name Type
## 1 1 srcip nominal
## 2 2 sport integer
## 3 3 dstip nominal
## 4 4 dsport integer
## 5 5 proto nominal
## 6 6 state nominal
## 7 7 dur Float
## 8 8 sbytes Integer
## 9 9 dbytes Integer
## 10 10 sttl Integer
## 11 11 dttl Integer
## 12 12 sloss Integer
## 13 13 dloss Integer
## 14 14 service nominal
## 15 15 Sload Float
## 16 16 Dload Float
## 17 17 Spkts integer
## 18 18 Dpkts integer
## 19 19 swin integer
## 20 20 dwin integer
## 21 21 stcpb integer
## 22 22 dtcpb integer
## 23 23 smeansz integer
## 24 24 dmeansz integer
## 25 25 trans_depth integer
## 26 26 res_bdy_len integer
## 27 27 Sjit Float
## 28 28 Djit Float
## 29 29 Stime Timestamp
## 30 30 Ltime Timestamp
## 31 31 Sintpkt Float
## 32 32 Dintpkt Float
## 33 33 tcprtt Float
## 34 34 synack Float
## 35 35 ackdat Float
## 36 36 is_sm_ips_ports Binary
## 37 37 ct_state_ttl Integer
## 38 38 ct_flw_http_mthd Integer
## 39 39 is_ftp_login Binary
## 40 40 ct_ftp_cmd integer
## 41 41 ct_srv_src integer
## 42 42 ct_srv_dst integer
## 43 43 ct_dst_ltm integer
## 44 44 ct_src_ ltm integer
## 45 45 ct_src_dport_ltm integer
## 46 46 ct_dst_sport_ltm integer
## 47 47 ct_dst_src_ltm integer
## 48 48 attack_cat nominal
## 49 49 Label binary
## Description
## 1 Source IP address
## 2 Source port number
## 3 Destination IP address
## 4 Destination port number
## 5 Transaction protocol
## 6 Indicates to the state and its dependent protocol, e.g. ACC, CLO, CON, ECO, ECR, FIN, INT, MAS, PAR, REQ, RST, TST, TXD, URH, URN, and (-) (if not used state)
## 7 Record total duration
## 8 Source to destination transaction bytes
## 9 Destination to source transaction bytes
## 10 Source to destination time to live value
## 11 Destination to source time to live value
## 12 Source packets retransmitted or dropped
## 13 Destination packets retransmitted or dropped
## 14 http, ftp, smtp, ssh, dns, ftp-data ,irc and (-) if not much used service
## 15 Source bits per second
## 16 Destination bits per second
## 17 Source to destination packet count
## 18 Destination to source packet count
## 19 Source TCP window advertisement value
## 20 Destination TCP window advertisement value
## 21 Source TCP base sequence number
## 22 Destination TCP base sequence number
## 23 Mean of the flow packet size transmitted by the src
## 24 Mean of the flow packet size transmitted by the dst
## 25 Represents the pipelined depth into the connection of http request/response transaction
## 26 Actual uncompressed content size of the data transferred from the server's http service.
## 27 Source jitter (mSec)
## 28 Destination jitter (mSec)
## 29 record start time
## 30 record last time
## 31 Source interpacket arrival time (mSec)
## 32 Destination interpacket arrival time (mSec)
## 33 TCP connection setup round-trip time, the sum of 'synack' and 'ackdat'.
## 34 TCP connection setup time, the time between the SYN and the SYN_ACK packets.
## 35 TCP connection setup time, the time between the SYN_ACK and the ACK packets.
## 36 If source (1) and destination (3)IP addresses equal and port numbers (2)(4) equal then, this variable takes value 1 else 0
## 37 No. for each state (6) according to specific range of values for source/destination time to live (10) (11).
## 38 No. of flows that has methods such as Get and Post in http service.
## 39 If the ftp session is accessed by user and password then 1 else 0.
## 40 No of flows that has a command in ftp session.
## 41 No. of connections that contain the same service (14) and source address (1) in 100 connections according to the last time (26).
## 42 No. of connections that contain the same service (14) and destination address (3) in 100 connections according to the last time (26).
## 43 No. of connections of the same destination address (3) in 100 connections according to the last time (26).
## 44 No. of connections of the same source address (1) in 100 connections according to the last time (26).
## 45 No of connections of the same source address (1) and the destination port (4) in 100 connections according to the last time (26).
## 46 No of connections of the same destination address (3) and the source port (2) in 100 connections according to the last time (26).
## 47 No of connections of the same source (1) and the destination (3) address in in 100 connections according to the last time (26).
## 48 The name of each attack category. In this data set , nine categories e.g. Fuzzers, Analysis, Backdoors, DoS Exploits, Generic, Reconnaissance, Shellcode and Worms
## 49 0 for normal and 1 for attack records
First we assess the “FEATURES” file, which lists the variables we can expect to find in the data sets.
str(FEATURES)
## 'data.frame': 49 obs. of 4 variables:
## $ No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "srcip" "sport" "dstip" "dsport" ...
## $ Type : chr "nominal" "integer" "nominal" "integer" ...
## $ Description: chr "Source IP address" "Source port number" "Destination IP address" "Destination port number" ...
head(FEATURES)
## No. Name Type
## 1 1 srcip nominal
## 2 2 sport integer
## 3 3 dstip nominal
## 4 4 dsport integer
## 5 5 proto nominal
## 6 6 state nominal
## Description
## 1 Source IP address
## 2 Source port number
## 3 Destination IP address
## 4 Destination port number
## 5 Transaction protocol
## 6 Indicates to the state and its dependent protocol, e.g. ACC, CLO, CON, ECO, ECR, FIN, INT, MAS, PAR, REQ, RST, TST, TXD, URH, URN, and (-) (if not used state)
dim(FEATURES)
## [1] 49 4
dim(TESTING)
## [1] 82332 45
dim(TRAINING)
## [1] 175341 45
The FEATURES file has 4 variables; 3 of them are of type ‘chr’. These 3 character variables hold the name, data type, and description of each variable in the original data set.
However, FEATURES has 49 rows, which does not match the 45 variables in the “TESTING” and “TRAINING” data sets. This is likely because the training and testing sets are missing some of the original variables. We will have to account for this later.
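One way to see exactly which variables differ is to compare the two sets of names. The sketch below is only illustrative: `feature_names` and `data_columns` are small hypothetical stand-ins, not the real contents of the files. With the real data, one would presumably pass `FEATURES$Name` and `names(TRAINING)` instead. Names are lower-cased first because the features list mixes cases (e.g. "Dpkts").

```r
# Hypothetical stand-ins for FEATURES$Name and names(TRAINING)
feature_names <- c("srcip", "sport", "Dpkts", "smeansz", "Label")
data_columns  <- c("id", "dpkts", "smean", "label")

# Compare case-insensitively, since the two sources use different casing
listed_names <- tolower(feature_names)
actual_names <- tolower(data_columns)

# Features that are documented but have no column of the same name
missing_from_data <- setdiff(listed_names, actual_names)

# Columns that are present but not documented under that name
extra_in_data <- setdiff(actual_names, listed_names)

missing_from_data
extra_in_data
```

A name can land in both lists when a column was merely renamed, so the two results are best read side by side.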
I will now use the “table()” function on the “label” feature to display a tally of the 0s and 1s in the training set.
label_table = table(TRAINING$label)
label_table
##
## 0 1
## 56000 119341
## With a little math we can see the proportions of the 0s and 1s.
proportions = label_table / nrow(TRAINING)
proportions
##
## 0 1
## 0.3193777 0.6806223
Let’s also do this for the “TESTING” data set.
label_table = table(TESTING$label)
label_table
##
## 0 1
## 37000 45332
## With a little math we can see the proportions of the 0s and 1s.
proportions = label_table / nrow(TESTING)
proportions
##
## 0 1
## 0.4494 0.5506
The proportions are somewhat similar but, reassuringly, not identical: attack records (label 1) make up about 68% of the training set and about 55% of the testing set.
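The same proportions can also be obtained with base R's `prop.table()`, which divides each count by the total. The sketch below re-enters the counts printed above as named vectors so that it runs on its own; with the data loaded, `prop.table(table(TRAINING$label))` should give the same result.

```r
# Label counts copied from the table() output in this report
training_counts <- c("0" = 56000, "1" = 119341)
testing_counts  <- c("0" = 37000, "1" = 45332)

# prop.table() with no margin divides every entry by the overall sum
training_props <- prop.table(training_counts)
testing_props  <- prop.table(testing_counts)

# Put both sets side by side for comparison
round(rbind(training = training_props, testing = testing_props), 4)
```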
I create the first model using the “glm()” function. I specify the following features:
#desired_values <- c("Dpkts", "sbytes", "dbytes", "synack", "dur")
#FEATURES[FEATURES$Name %in% desired_values, ]
Below I created the model.
glm.fit = glm(TRAINING$label ~ dpkts + sbytes + dbytes + synack + sqrt(dur), data = TRAINING, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)
##
## Call:
## glm(formula = TRAINING$label ~ dpkts + sbytes + dbytes + synack +
## sqrt(dur), family = "binomial", data = TRAINING)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.297e+00 7.609e-03 170.42 <2e-16 ***
## dpkts -1.197e-01 8.513e-04 -140.57 <2e-16 ***
## sbytes 2.633e-05 5.967e-07 44.12 <2e-16 ***
## dbytes 8.696e-05 6.474e-07 134.32 <2e-16 ***
## synack 5.225e+00 1.764e-01 29.61 <2e-16 ***
## sqrt(dur) 2.112e-01 9.326e-03 22.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 219667 on 175340 degrees of freedom
## Residual deviance: 182669 on 175335 degrees of freedom
## AIC: 182681
##
## Number of Fisher Scoring iterations: 9
I make predictions on the “TESTING” set. Setting na.action = na.pass keeps rows with missing values in the input; they are not skipped, they simply receive an NA prediction.
predictions = predict(glm.fit, newdata = TESTING, type = "response", na.action=na.pass)
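To illustrate what `na.action = na.pass` does in `predict()`, here is a minimal sketch on synthetic data. The data frames `train` and `test` and their columns `x` and `y` are invented for the example and have nothing to do with UNSW-NB15. The point is that a row with a missing predictor is kept and gets an `NA` prediction, and that `table()` then silently leaves that row out unless `useNA` is set.

```r
set.seed(1)

# Synthetic training data: one numeric predictor and a binary response
train <- data.frame(x = rnorm(50))
train$y <- rbinom(50, 1, plogis(train$x))
fit <- glm(y ~ x, data = train, family = "binomial")

# Synthetic test data with one missing predictor value
test <- data.frame(x = c(0.5, NA, -1), y = c(1, 0, 0))

# na.pass keeps all three rows; the second prediction is NA
p <- predict(fit, newdata = test, type = "response", na.action = na.pass)
is.na(p)

# Thresholding an NA probability gives NA as well
cls <- ifelse(p > 0.5, 1, 0)

# By default table() drops the NA row; useNA = "ifany" makes it visible
counted_default <- table(Actual = test$y, Predicted = cls)
counted_with_na <- table(Actual = test$y, Predicted = cls, useNA = "ifany")
sum(counted_default)
sum(counted_with_na)
```

In this report the three confusion matrices each sum to 82332, the full size of the testing set, which suggests that no predictions were NA.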
I convert the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5.
binary_predictions = ifelse(predictions > 0.5, 1, 0)
Below I create the confusion matrix.
confusion_matrix = table(Actual = TESTING$label, Predicted = binary_predictions)
confusion_matrix
## Predicted
## Actual 0 1
## 0 9295 27705
## 1 2160 43172
I can then calculate the accuracy from the confusion matrix: the correct predictions on the diagonal divided by the total number of predictions.
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.6372613
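To put this accuracy in context, it can help to compare it with a trivial baseline that always predicts the most common class in the testing set. The sketch below re-enters the confusion matrix printed above so that it runs on its own; the baseline comparison is my addition and was not part of the original analysis.

```r
# Confusion matrix for the first model, copied from the output above
# Rows are the actual labels, columns are the predicted labels
cm <- matrix(c(9295, 27705,
               2160, 43172),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

# Accuracy: correct predictions (diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Baseline: always predict whichever actual class is most common
baseline <- max(rowSums(cm)) / sum(cm)

round(c(accuracy = accuracy, baseline = baseline), 4)
```

The gap between the two numbers is a fairer measure of what the model has learned than the accuracy alone.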
I create the second model using the “glm()” function. I specify the following features:
#desired_values <- c("sttl", "ct_dst_src_ltm", "spkts", "dload", "sloss", "dloss", "ct_src_ltm", "ct_srv_dst")
#FEATURES[FEATURES$Name %in% desired_values, ]
Below I created the model.
glm.fit = glm(TRAINING$label ~ sttl + ct_dst_src_ltm + spkts + dload + sloss + dloss + ct_src_ltm + ct_srv_dst, data = TRAINING, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)
##
## Call:
## glm(formula = TRAINING$label ~ sttl + ct_dst_src_ltm + spkts +
## dload + sloss + dloss + ct_src_ltm + ct_srv_dst, family = "binomial",
## data = TRAINING)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.263e-01 1.737e-02 -53.32 <2e-16 ***
## sttl 1.241e-02 8.804e-05 140.98 <2e-16 ***
## ct_dst_src_ltm 2.301e-01 5.086e-03 45.23 <2e-16 ***
## spkts -2.126e-02 5.569e-04 -38.18 <2e-16 ***
## dload -2.810e-06 4.808e-08 -58.45 <2e-16 ***
## sloss 5.004e-02 1.314e-03 38.08 <2e-16 ***
## dloss 2.632e-02 5.710e-04 46.08 <2e-16 ***
## ct_src_ltm 8.622e-02 1.978e-03 43.58 <2e-16 ***
## ct_srv_dst -2.676e-01 5.059e-03 -52.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 219667 on 175340 degrees of freedom
## Residual deviance: 112114 on 175332 degrees of freedom
## AIC: 112132
##
## Number of Fisher Scoring iterations: 9
I make predictions on the “TESTING” set. Setting na.action = na.pass keeps rows with missing values in the input; they are not skipped, they simply receive an NA prediction.
predictions = predict(glm.fit, newdata = TESTING, type = "response", na.action=na.pass)
I convert the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5.
binary_predictions = ifelse(predictions > 0.5, 1, 0)
Below I create the confusion matrix.
confusion_matrix = table(Actual = TESTING$label, Predicted = binary_predictions)
confusion_matrix
## Predicted
## Actual 0 1
## 0 21223 15777
## 1 6484 38848
I can then calculate the accuracy from the confusion matrix: the correct predictions on the diagonal divided by the total number of predictions.
accuracy = sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.7296191
I create the third model using the “glm()” function. I specify the following features:
#desired_values <- c("service", "sbytes", "sttl", "smean", "ct_dst_sport_ltm")
#FEATURES[FEATURES$Name %in% desired_values, ]
Below I create the model.
glm.fit = glm(TRAINING$label ~ service + sbytes + sttl + smean + smean + ct_dst_sport_ltm, data = TRAINING, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)
##
## Call:
## glm(formula = TRAINING$label ~ service + sbytes + sttl + smean +
## smean + ct_dst_sport_ltm, family = "binomial", data = TRAINING)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.080e+00 3.216e-02 -126.897 < 2e-16 ***
## servicedhcp 1.612e+01 7.054e+01 0.228 0.81927
## servicedns 6.904e-01 4.340e-02 15.905 < 2e-16 ***
## serviceftp 2.707e+00 4.895e-02 55.299 < 2e-16 ***
## serviceftp-data 2.567e+00 4.367e-02 58.771 < 2e-16 ***
## servicehttp 3.056e+00 3.165e-02 96.569 < 2e-16 ***
## serviceirc 1.561e+01 1.363e+02 0.115 0.90884
## servicepop3 8.143e+00 5.015e-01 16.237 < 2e-16 ***
## serviceradius -1.424e-03 7.768e-01 -0.002 0.99854
## servicesmtp 4.036e+00 4.750e-02 84.965 < 2e-16 ***
## servicesnmp 2.594e+00 1.007e+00 2.577 0.00996 **
## servicessh -2.156e+00 3.354e-01 -6.429 1.28e-10 ***
## servicessl 1.540e+01 9.401e+01 0.164 0.86990
## sbytes 1.150e-05 5.485e-07 20.972 < 2e-16 ***
## sttl 2.284e-02 1.386e-04 164.820 < 2e-16 ***
## smean -1.798e-03 4.305e-05 -41.777 < 2e-16 ***
## ct_dst_sport_ltm 2.034e-01 5.646e-03 36.027 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 219667 on 175340 degrees of freedom
## Residual deviance: 101523 on 175324 degrees of freedom
## AIC: 101557
##
## Number of Fisher Scoring iterations: 13
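I make predictions on the “TESTING” set. Setting na.action = na.pass keeps rows with missing values in the input; they are not skipped, they simply receive an NA prediction.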
predictions = predict(glm.fit, newdata = TESTING, type = "response", na.action=na.pass)
I convert the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5.
binary_predictions = ifelse(predictions > 0.5, 1, 0)
Below I create the confusion matrix.
confusion_matrix = table(Actual = TESTING$label, Predicted = binary_predictions)
confusion_matrix
## Predicted
## Actual 0 1
## 0 19064 17936
## 1 1897 43435
I can then calculate the accuracy from the confusion matrix: the correct predictions on the diagonal divided by the total number of predictions.
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.7591095
I built model 1 from the example structure provided with the assignment information. I only had to remove a few features before I could use it: the full example produced errors, so I experimented with it and removed “state”, “rate”, and “service”.
The collection of features from model 1 led to an accuracy of 63.72613%.
I created models 2 and 3 after reading an online research paper, available at https://shura.shu.ac.uk/15662/1/Feature%20Selection%20TJ-SZ-ISIE%202017-Camera%20Ready1.pdf (Janarthanan and Zargari, 2017). The paper lists the most commonly used features, which make up my model 2.
The collection of features from model 2 led to an accuracy of 72.96191%.
I created model 3 from the same research paper mentioned above. According to the paper, these features were selected with the aid of machine learning algorithms, including Random Forest.
The features are as follows:
* service = Service type
* sbytes = Source to destination bytes
* sttl = Source to destination time to live
* smean = Mean of packet size transmitted by the srcip
* ct_dst_sport_ltm = No. of rows of the same dstip and the sport in 100 rows
The collection of features from model 3 proved to be the most useful and led to an accuracy of 75.91095%, the highest of the three models.
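Accuracy alone hides how each model treats the two classes, so a per-class summary of the three confusion matrices may be a useful complement. The sketch below re-enters the counts from the three matrices reported above. The helper names `make_cm` and `summarise_cm` are hypothetical, and sensitivity and specificity are metrics I am adding here; they were not computed in the original analysis.

```r
# Build a 2x2 confusion matrix (rows = actual, columns = predicted)
make_cm <- function(tn, fp, fn, tp) {
  matrix(c(tn, fp, fn, tp), nrow = 2, byrow = TRUE,
         dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))
}

# Counts copied from the confusion matrices printed in this report
models <- list(
  model1 = make_cm(9295, 27705, 2160, 43172),
  model2 = make_cm(21223, 15777, 6484, 38848),
  model3 = make_cm(19064, 17936, 1897, 43435)
)

# Accuracy, sensitivity (attacks caught) and specificity (normal kept)
summarise_cm <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),
    specificity = cm["0", "0"] / sum(cm["0", ]))
}

# One row per model, one column per metric
comparison <- t(sapply(models, summarise_cm))
round(comparison, 4)
```

Sensitivity is the share of actual attack records that a model flags, and specificity is the share of normal records it leaves alone, so a low specificity means many false alarms on normal traffic.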
Resources: JANARTHANAN, Tharmini and ZARGARI, Shahrzad (2017). Feature Selection in UNSW-NB15 and KDDCUP’99 datasets. In: 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE). IEEE.