The Goal

The goal of this project is to construct a model that predicts whether a packet of network traffic is anomalous or normal.

For this project I will build a logistic regression model using the UNSW-NB15 data set, a well-known network intrusion data set whose raw traffic was generated with the IXIA PerfectStorm tool. A partition of the original data set was configured as a training set and a testing set, and I use these two sets throughout the project.

Step 1-3:

I start by reading the data set files into their respective variables: TESTING, TRAINING, and FEATURES.

TESTING = read.csv('testing.csv')
TRAINING = read.csv('training.csv')
FEATURES = read.csv('features.csv')
FEATURES
##    No.             Name      Type
## 1    1            srcip   nominal
## 2    2            sport   integer
## 3    3            dstip   nominal
## 4    4           dsport   integer
## 5    5            proto   nominal
## 6    6            state   nominal
## 7    7              dur     Float
## 8    8           sbytes   Integer
## 9    9           dbytes   Integer
## 10  10             sttl   Integer
## 11  11             dttl   Integer
## 12  12            sloss   Integer
## 13  13            dloss   Integer
## 14  14          service   nominal
## 15  15            Sload     Float
## 16  16            Dload     Float
## 17  17            Spkts   integer
## 18  18            Dpkts   integer
## 19  19             swin   integer
## 20  20             dwin   integer
## 21  21            stcpb   integer
## 22  22            dtcpb   integer
## 23  23          smeansz   integer
## 24  24          dmeansz   integer
## 25  25      trans_depth   integer
## 26  26      res_bdy_len   integer
## 27  27             Sjit     Float
## 28  28             Djit     Float
## 29  29            Stime Timestamp
## 30  30            Ltime Timestamp
## 31  31          Sintpkt     Float
## 32  32          Dintpkt     Float
## 33  33           tcprtt     Float
## 34  34           synack     Float
## 35  35           ackdat     Float
## 36  36  is_sm_ips_ports    Binary
## 37  37     ct_state_ttl   Integer
## 38  38 ct_flw_http_mthd   Integer
## 39  39     is_ftp_login    Binary
## 40  40       ct_ftp_cmd   integer
## 41  41       ct_srv_src   integer
## 42  42       ct_srv_dst   integer
## 43  43       ct_dst_ltm   integer
## 44  44      ct_src_ ltm   integer
## 45  45 ct_src_dport_ltm   integer
## 46  46 ct_dst_sport_ltm   integer
## 47  47   ct_dst_src_ltm   integer
## 48  48       attack_cat   nominal
## 49  49            Label    binary
##                                                                                                                                                           Description
## 1                                                                                                                                                   Source IP address
## 2                                                                                                                                                  Source port number
## 3                                                                                                                                              Destination IP address
## 4                                                                                                                                             Destination port number
## 5                                                                                                                                                Transaction protocol
## 6      Indicates to the state and its dependent protocol, e.g. ACC, CLO, CON, ECO, ECR, FIN, INT, MAS, PAR, REQ, RST, TST, TXD, URH, URN, and (-) (if not used state)
## 7                                                                                                                                               Record total duration
## 8                                                                                                                            Source to destination transaction bytes 
## 9                                                                                                                             Destination to source transaction bytes
## 10                                                                                                                          Source to destination time to live value 
## 11                                                                                                                           Destination to source time to live value
## 12                                                                                                                           Source packets retransmitted or dropped 
## 13                                                                                                                       Destination packets retransmitted or dropped
## 14                                                                                         http, ftp, smtp, ssh, dns, ftp-data ,irc  and (-) if not much used service
## 15                                                                                                                                             Source bits per second
## 16                                                                                                                                        Destination bits per second
## 17                                                                                                                                Source to destination packet count 
## 18                                                                                                                                 Destination to source packet count
## 19                                                                                                                              Source TCP window advertisement value
## 20                                                                                                                         Destination TCP window advertisement value
## 21                                                                                                                                    Source TCP base sequence number
## 22                                                                                                                               Destination TCP base sequence number
## 23                                                                                                               Mean of the flow packet size transmitted by the src 
## 24                                                                                                               Mean of the flow packet size transmitted by the dst 
## 25                                                                            Represents the pipelined depth into the connection of http request/response transaction
## 26                                                                           Actual uncompressed content size of the data transferred from the server’s http service.
## 27                                                                                                                                               Source jitter (mSec)
## 28                                                                                                                                          Destination jitter (mSec)
## 29                                                                                                                                                  record start time
## 30                                                                                                                                                   record last time
## 31                                                                                                                             Source interpacket arrival time (mSec)
## 32                                                                                                                        Destination interpacket arrival time (mSec)
## 33                                                                                            TCP connection setup round-trip time, the sum of ’synack’ and ’ackdat’.
## 34                                                                                       TCP connection setup time, the time between the SYN and the SYN_ACK packets.
## 35                                                                                       TCP connection setup time, the time between the SYN_ACK and the ACK packets.
## 36                                        If source (1) and destination (3)IP addresses equal and port numbers (2)(4)  equal then, this variable takes value 1 else 0
## 37                                                        No. for each state (6) according to specific range of values for source/destination time to live (10) (11).
## 38                                                                                                No. of flows that has methods such as Get and Post in http service.
## 39                                                                                                If the ftp session is accessed by user and password then 1 else 0. 
## 40                                                                                                                     No of flows that has a command in ftp session.
## 41                                   No. of connections that contain the same service (14) and source address (1) in 100 connections according to the last time (26).
## 42                              No. of connections that contain the same service (14) and destination address (3) in 100 connections according to the last time (26).
## 43                                                         No. of connections of the same destination address (3) in 100 connections according to the last time (26).
## 44                                                              No. of connections of the same source address (1) in 100 connections according to the last time (26).
## 45                                  No of connections of the same source address (1) and the destination port (4) in 100 connections according to the last time (26).
## 46                                  No of connections of the same destination address (3) and the source port (2) in 100 connections according to the last time (26).
## 47                                    No of connections of the same source (1) and the destination (3) address in in 100 connections according to the last time (26).
## 48 The name of each attack category. In this data set , nine categories e.g. Fuzzers, Analysis, Backdoors, DoS Exploits, Generic, Reconnaissance, Shellcode and Worms
## 49                                                                                                                              0 for normal and 1 for attack records

First I inspect the “FEATURES” file, which lists the variables we can expect in the data sets.

str(FEATURES)
## 'data.frame':    49 obs. of  4 variables:
##  $ No.        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : chr  "srcip" "sport" "dstip" "dsport" ...
##  $ Type       : chr  "nominal" "integer" "nominal" "integer" ...
##  $ Description: chr  "Source IP address" "Source port number" "Destination IP address" "Destination port number" ...
head(FEATURES)
##   No.   Name    Type
## 1   1  srcip nominal
## 2   2  sport integer
## 3   3  dstip nominal
## 4   4 dsport integer
## 5   5  proto nominal
## 6   6  state nominal
##                                                                                                                                                      Description
## 1                                                                                                                                              Source IP address
## 2                                                                                                                                             Source port number
## 3                                                                                                                                         Destination IP address
## 4                                                                                                                                        Destination port number
## 5                                                                                                                                           Transaction protocol
## 6 Indicates to the state and its dependent protocol, e.g. ACC, CLO, CON, ECO, ECR, FIN, INT, MAS, PAR, REQ, RST, TST, TXD, URH, URN, and (-) (if not used state)
dim(FEATURES)
## [1] 49  4
dim(TESTING)
## [1] 82332    45
dim(TRAINING)
## [1] 175341     45

The FEATURES file has 4 variables; 3 of them are of type ‘chr’ (character). These 3 character variables hold the name, data type, and description of each variable in the original data set.

However, FEATURES has 49 rows, which does not match the 45 variables in the “TESTING” and “TRAINING” data sets. This is likely because the partitioned sets are missing some of the original variables. We will have to compensate for this later.
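One way to see exactly which variables differ is to compare the names listed in FEATURES with the column names of a data set. The sketch below uses small hypothetical stand-ins (features_demo and training_demo) rather than the real files. It assumes the names differ only by letter case and stray spaces (as in “ct_src_ ltm”); columns that were genuinely renamed would still show up as mismatches.

```r
# Hypothetical miniature stand-ins for FEATURES and TRAINING
features_demo <- data.frame(
  Name = c("srcip", "sport", "Dpkts", "sbytes", "ct_src_ ltm", "Label")
)
training_demo <- data.frame(
  id = 1:2, dpkts = c(3, 4), sbytes = c(100, 200),
  ct_src_ltm = c(1, 2), rate = c(0.5, 0.7), label = c(0, 1)
)

# Normalise the documented names: remove all whitespace, then lower-case
feature_names <- tolower(gsub("\\s+", "", features_demo$Name))

# Documented features that have no matching column in the data set
missing_from_data <- setdiff(feature_names, names(training_demo))

# Columns in the data set that are not documented in the feature list
extra_in_data <- setdiff(names(training_demo), feature_names)

print(missing_from_data)
print(extra_in_data)
```

Running the same comparison with FEATURES and TRAINING in place of the stand-ins would list the variables to account for.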

Step 4:

TRAINING

I will now use the “table()” function and the “label” feature to display a table with a tally of the number of 0s and 1s in the training set.

label_table = table(TRAINING$label)
label_table
## 
##      0      1 
##  56000 119341
## With a little math we can see the proportions of the 0s and 1s.
proportions = label_table / nrow(TRAINING)
proportions
## 
##         0         1 
## 0.3193777 0.6806223

TESTING

Let’s also do this for the “TESTING” data set.

label_table = table(TESTING$label)
label_table
## 
##     0     1 
## 37000 45332
## With a little math we can see the proportions of the 0s and 1s.
proportions = label_table / nrow(TESTING)
proportions
## 
##      0      1 
## 0.4494 0.5506

The proportions are somewhat similar but, gladly, not the same: attacks (label 1) make up about 68% of the training set and about 55% of the testing set.
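Because both sets contain more attacks than normal records, it helps to know what accuracy a trivial classifier would reach. A minimal sketch, using the TESTING counts printed above: a model that always predicts 1 is right for every attack record and wrong for every normal one. The variable names are illustrative and not part of the original analysis.

```r
# Label counts copied from the TESTING table above
test_counts <- c("0" = 37000, "1" = 45332)

# prop.table() divides each count by the total
test_proportions <- prop.table(test_counts)

# Accuracy of always predicting the most frequent class
baseline_accuracy <- max(test_proportions)
print(baseline_accuracy)
```

Any model evaluated on the testing set should be compared against this baseline rather than against 50%.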


Model 1:

I create a model using the “glm()” function. I specify the following features:

#desired_values <- c("Dpkts", "sbytes", "dbytes", "synack", "dur")
#FEATURES[FEATURES$Name %in% desired_values, ]

Below I create the model.

glm.fit = glm(TRAINING$label ~ dpkts + sbytes + dbytes + synack + sqrt(dur), data = TRAINING, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)
## 
## Call:
## glm(formula = TRAINING$label ~ dpkts + sbytes + dbytes + synack + 
##     sqrt(dur), family = "binomial", data = TRAINING)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.297e+00  7.609e-03  170.42   <2e-16 ***
## dpkts       -1.197e-01  8.513e-04 -140.57   <2e-16 ***
## sbytes       2.633e-05  5.967e-07   44.12   <2e-16 ***
## dbytes       8.696e-05  6.474e-07  134.32   <2e-16 ***
## synack       5.225e+00  1.764e-01   29.61   <2e-16 ***
## sqrt(dur)    2.112e-01  9.326e-03   22.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 219667  on 175340  degrees of freedom
## Residual deviance: 182669  on 175335  degrees of freedom
## AIC: 182681
## 
## Number of Fisher Scoring iterations: 9

I make predictions on the “TESTING” set. Setting na.action = na.pass keeps any rows with “NA” values, so they yield NA predictions instead of being dropped.

predictions = predict(glm.fit, newdata = TESTING, type = "response", na.action=na.pass)

I convert the probabilities to binary predictions (0 or 1) using a threshold of 0.5.

binary_predictions = ifelse(predictions > 0.5, 1, 0)
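A small sketch of how missing values flow through this step, using made-up probabilities rather than real model output. An NA probability stays NA after thresholding, and table() leaves NA entries out of the tally by default, so it is worth counting them explicitly.

```r
# Made-up probabilities, including one missing value
demo_probabilities <- c(0.10, 0.50, 0.80, NA)

# Values strictly above 0.5 become 1, everything else 0; NA stays NA
demo_predictions <- ifelse(demo_probabilities > 0.5, 1, 0)
print(demo_predictions)

# Number of records that received no prediction
n_missing <- sum(is.na(demo_predictions))

# table() excludes NA by default, so the counts cover only 3 of the 4 records
demo_counts <- table(demo_predictions)
print(sum(demo_counts))
```

If n_missing is greater than zero, the confusion matrix covers fewer records than the testing set contains.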

Below I create the confusion matrix.

confusion_matrix = table(Actual = TESTING$label, Predicted = binary_predictions)
confusion_matrix
##       Predicted
## Actual     0     1
##      0  9295 27705
##      1  2160 43172

I then calculate the accuracy from the confusion matrix: the correct predictions on the diagonal divided by the total number of predictions.

accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.6372613
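Accuracy alone hides how the errors are split between the two classes. As a hedged illustration, the sketch below rebuilds the Model 1 confusion matrix from the counts printed above and also computes the share of attacks that were caught and the share of normal records that were recognised as normal. The names sensitivity and specificity are standard terms, but this extra breakdown is my addition to the analysis.

```r
# Counts copied from the Model 1 confusion matrix (filled column by column)
cm <- matrix(c(9295, 2160, 27705, 43172), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

# Correct predictions (diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual attacks (label 1) predicted as attacks
sensitivity <- cm["1", "1"] / sum(cm["1", ])

# Share of actual normal records (label 0) predicted as normal
specificity <- cm["0", "0"] / sum(cm["0", ])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

For Model 1 most attacks are caught, but only about a quarter of the normal records are classified as normal.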

Model 2:

I create a model using the “glm()” function. I specify the following features:

#desired_values <- c("sttl", "ct_dst_src_ltm", "spkts", "dload", "sloss", "dloss", "ct_src_ltm", "ct_srv_dst")
#FEATURES[FEATURES$Name %in% desired_values, ]

Below I create the model.

glm.fit = glm(TRAINING$label ~ sttl + ct_dst_src_ltm + spkts + dload + sloss + dloss + ct_src_ltm + ct_srv_dst, data = TRAINING, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)
## 
## Call:
## glm(formula = TRAINING$label ~ sttl + ct_dst_src_ltm + spkts + 
##     dload + sloss + dloss + ct_src_ltm + ct_srv_dst, family = "binomial", 
##     data = TRAINING)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -9.263e-01  1.737e-02  -53.32   <2e-16 ***
## sttl            1.241e-02  8.804e-05  140.98   <2e-16 ***
## ct_dst_src_ltm  2.301e-01  5.086e-03   45.23   <2e-16 ***
## spkts          -2.126e-02  5.569e-04  -38.18   <2e-16 ***
## dload          -2.810e-06  4.808e-08  -58.45   <2e-16 ***
## sloss           5.004e-02  1.314e-03   38.08   <2e-16 ***
## dloss           2.632e-02  5.710e-04   46.08   <2e-16 ***
## ct_src_ltm      8.622e-02  1.978e-03   43.58   <2e-16 ***
## ct_srv_dst     -2.676e-01  5.059e-03  -52.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 219667  on 175340  degrees of freedom
## Residual deviance: 112114  on 175332  degrees of freedom
## AIC: 112132
## 
## Number of Fisher Scoring iterations: 9

I make predictions on the “TESTING” set. Setting na.action = na.pass keeps any rows with “NA” values, so they yield NA predictions instead of being dropped.

predictions = predict(glm.fit, newdata = TESTING, type = "response", na.action=na.pass)

I convert the probabilities to binary predictions (0 or 1) using a threshold of 0.5.

binary_predictions = ifelse(predictions > 0.5, 1, 0)

Below I create the confusion matrix.

confusion_matrix = table(Actual = TESTING$label, Predicted = binary_predictions)
confusion_matrix
##       Predicted
## Actual     0     1
##      0 21223 15777
##      1  6484 38848

I then calculate the accuracy from the confusion matrix: the correct predictions on the diagonal divided by the total number of predictions.

accuracy = sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.7296191

Model 3:

I create a model using the “glm()” function. I specify the following features:

#desired_values <- c("service", "sbytes", "sttl", "smean", "ct_dst_sport_ltm")
#FEATURES[FEATURES$Name %in% desired_values, ]

Below I create the model.

glm.fit = glm(TRAINING$label ~ service + sbytes + sttl + smean + ct_dst_sport_ltm, data = TRAINING, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)
## 
## Call:
## glm(formula = TRAINING$label ~ service + sbytes + sttl + smean + 
##     ct_dst_sport_ltm, family = "binomial", data = TRAINING)
## 
## Coefficients:
##                    Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)      -4.080e+00  3.216e-02 -126.897  < 2e-16 ***
## servicedhcp       1.612e+01  7.054e+01    0.228  0.81927    
## servicedns        6.904e-01  4.340e-02   15.905  < 2e-16 ***
## serviceftp        2.707e+00  4.895e-02   55.299  < 2e-16 ***
## serviceftp-data   2.567e+00  4.367e-02   58.771  < 2e-16 ***
## servicehttp       3.056e+00  3.165e-02   96.569  < 2e-16 ***
## serviceirc        1.561e+01  1.363e+02    0.115  0.90884    
## servicepop3       8.143e+00  5.015e-01   16.237  < 2e-16 ***
## serviceradius    -1.424e-03  7.768e-01   -0.002  0.99854    
## servicesmtp       4.036e+00  4.750e-02   84.965  < 2e-16 ***
## servicesnmp       2.594e+00  1.007e+00    2.577  0.00996 ** 
## servicessh       -2.156e+00  3.354e-01   -6.429 1.28e-10 ***
## servicessl        1.540e+01  9.401e+01    0.164  0.86990    
## sbytes            1.150e-05  5.485e-07   20.972  < 2e-16 ***
## sttl              2.284e-02  1.386e-04  164.820  < 2e-16 ***
## smean            -1.798e-03  4.305e-05  -41.777  < 2e-16 ***
## ct_dst_sport_ltm  2.034e-01  5.646e-03   36.027  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 219667  on 175340  degrees of freedom
## Residual deviance: 101523  on 175324  degrees of freedom
## AIC: 101557
## 
## Number of Fisher Scoring iterations: 13

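I make predictions on the “TESTING” set. Setting na.action = na.pass keeps any rows with “NA” values, so they yield NA predictions instead of being dropped.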

predictions = predict(glm.fit, newdata = TESTING, type = "response", na.action=na.pass)

I convert the probabilities to binary predictions (0 or 1) using a threshold of 0.5.

binary_predictions = ifelse(predictions > 0.5, 1, 0)

Below I create the confusion matrix.

confusion_matrix = table(Actual = TESTING$label, Predicted = binary_predictions)
confusion_matrix
##       Predicted
## Actual     0     1
##      0 19064 17936
##      1  1897 43435

I then calculate the accuracy from the confusion matrix: the correct predictions on the diagonal divided by the total number of predictions.

accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.7591095
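The fit, predict, threshold, and score steps are repeated for each of the three models, so they could be wrapped in one helper. The sketch below is one possible way to do that, not the code used for the results above. The function name evaluate_glm, its arguments, and the simulated toy data are all hypothetical, and the helper assumes the response column is called label. It fixes the factor levels to 0 and 1 so the confusion matrix stays 2 x 2 even when a model never predicts one of the classes, which keeps the diag() based accuracy valid.

```r
# Hypothetical helper: fit a logistic model, predict on a test set, score it
evaluate_glm <- function(formula, train, test, threshold = 0.5) {
  fit <- glm(formula, data = train, family = binomial)
  probs <- predict(fit, newdata = test, type = "response")
  # Fixed levels guarantee a 2 x 2 table even if one class is never predicted
  predicted <- factor(ifelse(probs > threshold, 1, 0), levels = c(0, 1))
  actual <- factor(test$label, levels = c(0, 1))
  cm <- table(Actual = actual, Predicted = predicted)
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}

# Simulated toy data standing in for TRAINING and TESTING
set.seed(1)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$label <- rbinom(n, 1, plogis(1.5 * toy$x1 - toy$x2))
train_demo <- toy[1:300, ]
test_demo <- toy[301:400, ]

result <- evaluate_glm(label ~ x1 + x2, train_demo, test_demo)
print(result$confusion)
print(result$accuracy)
```

With the real data the call would look like evaluate_glm(label ~ sttl + sbytes, TRAINING, TESTING), with the formula swapped for each model.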

CONCLUSION:

Model 1:

I created model 1 from the example structure provided in the assignment information. I only had to remove a few of the features before I could use it: the full example produced errors, so I experimented with it and removed “state”, “rate”, and “service”.

The collection of features from Model 1 led to an accuracy of 63.73%.
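The cause of the errors mentioned above is not recorded here. One common cause with categorical predictors such as “state” or “service” is that the testing set contains categories that never appear in the training set, which makes predict() fail. That is an assumption on my part and not something verified for this data. The sketch below shows how to check for it, using small made-up data frames.

```r
# Made-up stand-ins for TRAINING and TESTING with one categorical predictor
train_demo <- data.frame(state = c("FIN", "INT", "CON", "FIN"),
                         label = c(0, 1, 1, 0))
test_demo <- data.frame(state = c("FIN", "ACC", "INT"),
                        label = c(0, 1, 1))

# Categories present in the test data but absent from the training data
unseen <- setdiff(unique(test_demo$state), unique(train_demo$state))
print(unseen)

# One simple option: keep only test rows whose category was seen in training
test_known <- test_demo[!(test_demo$state %in% unseen), ]
print(nrow(test_known))
```

Dropping rows changes the evaluation set, so the number of removed records should be reported alongside the accuracy.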

Model 2:

I created models 2 and 3 after reading an online research paper, available at https://shura.shu.ac.uk/15662/1/Feature%20Selection%20TJ-SZ-ISIE%202017-Camera%20Ready1.pdf. The paper lists the most commonly used features, and these make up my model 2.

The collection of features from Model 2 led to an accuracy of 72.96%.

Model 3 (Most accurate Model):

I created model 3 from the same research paper mentioned above. According to the paper, these features were selected with the aid of machine learning algorithms, including the Random Forest algorithm.

The features are as follows:

* service = Service type
* sbytes = Source to destination bytes
* sttl = Source to destination time to live
* smean = Mean of packet size transmitted by the srcip
* ct_dst_sport_ltm = No. of rows of the same dstip and the sport in 100 rows

The collection of features from Model 3 proved to be the most useful and led to an accuracy of 75.91%.
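The three results can be collected in one small table for comparison. The accuracy and AIC values below are copied from the outputs earlier in this report. The data frame and its column names are only an illustration.

```r
# Accuracy (testing set) and AIC (training fit) copied from the model outputs
results <- data.frame(
  model = c("Model 1", "Model 2", "Model 3"),
  accuracy = c(0.6372613, 0.7296191, 0.7591095),
  aic = c(182681, 112132, 101557),
  stringsAsFactors = FALSE
)

# Sort from most to least accurate
results <- results[order(-results$accuracy), ]
print(results)

# Model with the highest testing accuracy
best_model <- results$model[1]
print(best_model)
```

Model 3 also has the lowest AIC of the three fits, so the training fit and the testing accuracy point in the same direction.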

Resources: JANARTHANAN, Tharmini and ZARGARI, Shahrzad (2017). Feature Selection in UNSW-NB15 and KDDCUP’99 datasets. In: 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE). IEEE.