Introduction of Dataset

The UNSW-NB15 dataset is a network intrusion detection dataset created by the University of New South Wales (UNSW) in Australia. It contains network traffic data that simulates a real-world environment and includes various types of attacks, such as DoS (Denial of Service), reconnaissance, backdoors, and others. The dataset consists of both normal traffic and attack traffic, and it is divided into training and testing sets. It has been widely used as a benchmark for evaluating the performance of intrusion detection systems and for developing new machine learning algorithms to detect network attacks.
Objective of Project

The main objective of this project is to develop a machine learning-based intrusion detection system that can detect network intrusions using the publicly available UNSW-NB15 dataset. The dataset contains normal and attack records spanning nine attack categories: Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Fuzzers (anomalous activity), Shellcode, and Worms. The dataset also has a binary attack label (1/0), where a "label" of 0 represents normal (non-attack) traffic and 1 represents any type of attack. Secondly, different models were compared to find the best model for predicting an intrusion on this dataset.
To achieve this objective, I used a generalized linear model (GLM) with a binomial family (logistic regression) to predict intrusions. The experiments conducted in this work were as follows:
I conducted an exploratory analysis of the independent variables to gain insight into the dataset’s structure and identify any outliers or anomalies.
I determined if there was a significant association between the input features and the response variable. This involved running statistical tests to identify the most significant features that contribute to the model’s prediction accuracy.
I performed data pre-processing to clean and scale the data for accurate prediction.
I evaluated the accuracy of all input features in predicting the label. I trained the models on the training dataset and used the test set to evaluate their performance.
I determined how accurately the most significant features, obtained in step (b), predict intrusion. I used these features to train the model and evaluated its performance on the test set.
I compared the performance of the model using the most significant features and using all features in the dataset to determine which model performs better.
I also compared the logistic regression model to the K-nearest neighbor model to determine which performs better at classifying attacks/intrusions.
The first thing we carried out was loading the libraries required for this project.
Load the necessary libraries
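Based on the attach messages that follow, the libraries loaded appear to be the following; a minimal sketch (the exact loading order is assumed):

```r
library(tidyverse)  # dplyr, ggplot2, readr, tidyr, purrr, etc.
library(here)       # project-relative file paths
library(psych)      # descriptive statistics
library(caret)      # model training and cross-validation (loads lattice)
library(pROC)       # ROC curves and AUC
library(ROSE)       # utilities for imbalanced data
library(AUC)        # alternative ROC/AUC functions (masks pROC::auc/roc)
```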
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## here() starts at /Users/swanky/Downloads
##
##
## Attaching package: 'psych'
##
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
##
##
## Loading required package: lattice
##
##
## Attaching package: 'caret'
##
##
## The following object is masked from 'package:purrr':
##
## lift
##
##
## Type 'citation("pROC")' for a citation.
##
##
## Attaching package: 'pROC'
##
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
##
##
## Loaded ROSE 0.0-4
##
##
## AUC 0.3.2
##
## Type AUCNews() to see the change log and ?AUC to get an overview.
##
##
## Attaching package: 'AUC'
##
##
## The following objects are masked from 'package:pROC':
##
## auc, roc
##
##
## The following objects are masked from 'package:caret':
##
## sensitivity, specificity
(a) I conducted an exploratory analysis of the independent variables to gain insight into the dataset’s structure and identify any outliers or anomalies.
Read the training and testing datasets from CSV files in the current directory.
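A sketch of the loading step; the file names are assumptions, and judging by the dimensions printed below, the 82,332-row file serves as the training set here and the 175,341-row file as the test set:

```r
# Read both CSV files from the current directory (file names assumed)
train <- read_csv("UNSW_NB15_testing-set.csv")   # used as training set: 82,332 x 45
test  <- read_csv("UNSW_NB15_training-set.csv")  # used as test set: 175,341 x 45
dim(train)
dim(test)
```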
## [1] 82332 45
## [1] 175341 45
The dimensions above show 82,332 rows for the training set and 175,341 rows for the test set, each with 45 columns. Check the class distribution of the training set and also the distribution of attack categories:
##
## Analysis Backdoor DoS Exploits Fuzzers
## 677 583 4089 11132 6062
## Generic Normal Reconnaissance Shellcode Worms
## 18871 37000 3496 378 44
Check the class distribution of the testing set and also the distribution of attack categories:
##
## Analysis Backdoor DoS Exploits Fuzzers
## 2000 1746 12264 33393 18184
## Generic Normal Reconnaissance Shellcode Worms
## 40000 56000 10491 1133 130
#Check the class distribution of attack category for both training and test set
Using bar charts, I illustrate the distribution of data in each of the attack categories, as shown in the sketch below.
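A minimal sketch of such a bar chart with ggplot2, assuming the category column is named attack_cat as in UNSW-NB15:

```r
# Count of records per attack category in the training set
ggplot(train, aes(x = attack_cat, fill = attack_cat)) +
  geom_bar() +
  labs(x = "Attack category", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```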
The bar charts for both the training and testing datasets reveal that there are nine categories of attacks, with "Normal" representing non-attacks. However, the data is highly imbalanced: "Normal" is the single largest category, far outnumbering any individual attack category. Among the training data, the most frequently occurring attack categories are "Generic," "Exploits," "Fuzzers," "DoS," and "Reconnaissance."
#Exclude ID and attack category from training set
Below, I excluded the ID column and the attack-category column, keeping the binary class label that will be used for classification. This is done twice because the dataset is divided into a training set and a test set. This is also part of the pre-processing of the dataset; a sketch follows.
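A sketch of this step, assuming dplyr and the UNSW-NB15 column names (id, attack_cat, is_ftp_login); the name train_plot is hypothetical, a copy retaining attack_cat for the exploratory plots later on:

```r
train_plot <- train %>% select(-id)           # copy keeping attack_cat for plots
train <- train %>% select(-id, -attack_cat)   # drop id and the multi-class label
# Remove rows with the invalid is_ftp_login value of 2 (binary field)
train <- train[which(train$is_ftp_login != 2), ]
dim(train)  # 82328 x 43 after filtering
```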
`test <- test[which(test$is_ftp_login != 2), ]`: the test data frame is subset to exclude rows where the is_ftp_login column equals 2 (an invalid value for a binary field), using the which() function.
## [1] 82328 43
#Exclude ID and attack category from test set
## [1] 175335 42
##Compare the class label distribution of train and test dataset
Here, using bar plots, I compared the class labels (1/0) of the train and test datasets, with 1 representing attack and 0 representing non-attack. This class label will be used for the classification.
Here, we explored the distribution of the binary class label, represented as 1/0 in the dataset.
In Train: 44.94% of records are class "normal" and 55.06% are class "attack". In Test: 31.94% are class "normal" and 68.06% are class "attack".
CATEGORICAL VARIABLES IN THE DATASET
We will now investigate the distribution of Attack categories across the Predictor variables. This analysis will demonstrate how each predictor is distributed among the attack categories.
The service field in the dataset provides information about the network protocol or service associated with the recorded network traffic. This field identifies the type of service or protocol used for communication. Examples of services in the dataset include HTTP, FTP, SMTP, DNS, SSH, among others.
In the normal traffic, "-" (no identified service) is prevalent, alongside many instances of "dns" and a few rare values. In the attack data, "dns" occurs more frequently than any other named service, with only a few instances of other protocols such as HTTP, which therefore have a low distribution. Overall, the named services are spread thinly across normal traffic, while the attacks concentrate heavily on a few values.
##For State Variable
The "state" field represents the state and activity of the protocol, such as TCP, UDP, or ICMP, indicating the state of the connection, such as establishing, closing, or maintaining a connection. The values of the "state" field vary depending on the protocol and its activity. For TCP connections, the "state" field may contain values such as "FIN", "SYN", "ACK", "RST", "URG", and "PSH". In the normal data, "fin" is the most frequently occurring value, followed by "con", which is about half as frequent as "fin", plus a few instances of "int". In the attack data, by contrast, "int" occurs more frequently than any value in the normal category, making it a potentially important feature for detection; "fin" and the other TCP states appear only rarely there, giving them a low distribution. Based on my understanding of the data, the prevalence of "FIN" suggests that most normal TCP connections are properly closed after packets are sent, so this feature can be used to monitor network activity and detect potential issues.
##For is_ftp_login Variable
The “is_ftp_login” field in the dataset indicates whether a transaction is an FTP login or not. It has a binary value, with 1 indicating that the transaction is an FTP login and 0 indicating that it is not. FTP (File Transfer Protocol) is a standard network protocol used for transferring files from one host to another over a TCP-based network, such as the Internet. The distribution of “is_ftp_login” is comparatively higher in the normal category than in the attack category. We observe a significant difference in the representation of “is_ftp_login” between the two categories.
The feature "is_sm_ips_ports" in the dataset is binary and, in UNSW-NB15, indicates whether the source and destination IP addresses are equal and the source and destination port numbers are equal. A value of 1 means the source and destination IP/port pairs match, while a value of 0 means they differ. Our observation reveals a significant number of records with value 0 in the attack category.
#Separate categorical from training data
In this chunk, we separated the categorical variables from the training dataset.
## [1] 82328 6
#Separate continuous from training data
After separating the CATEGORICAL variables, we separate the CONTINUOUS variables in the dataset for exploration.
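A sketch of the split, assuming the five categorical fields identified later in the Chi-square test (proto, service, state, is_ftp_login, is_sm_ips_ports):

```r
cate_cols <- c("proto", "service", "state", "is_ftp_login", "is_sm_ips_ports")
training_cate <- train %>% select(all_of(cate_cols), label)  # 82328 x 6
training_cont <- train %>% select(-all_of(cate_cols))        # 82328 x 38
dim(training_cont)
```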
## [1] 82328 38
## dur spkts dpkts sbytes
## Min. : 0.00000 Min. : 1.00 Min. : 0.00 Min. : 24
## 1st Qu.: 0.00001 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 114
## Median : 0.01412 Median : 6.00 Median : 2.00 Median : 534
## Mean : 1.00678 Mean : 18.67 Mean : 17.55 Mean : 7994
## 3rd Qu.: 0.71936 3rd Qu.: 12.00 3rd Qu.: 10.00 3rd Qu.: 1280
## Max. :59.99999 Max. :10646.00 Max. :11018.00 Max. :14355774
## dbytes rate sttl dttl
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.00
## 1st Qu.: 0 1st Qu.: 28.6 1st Qu.: 62 1st Qu.: 0.00
## Median : 178 Median : 2651.2 Median :254 Median : 29.00
## Mean : 13234 Mean : 82414.9 Mean :181 Mean : 95.71
## 3rd Qu.: 956 3rd Qu.: 111111.1 3rd Qu.:254 3rd Qu.:252.00
## Max. :14657531 Max. :1000000.0 Max. :255 Max. :253.00
## sload dload sloss dloss
## Min. :0.000e+00 Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.:1.120e+04 1st Qu.: 0 1st Qu.: 0.000 1st Qu.: 0.000
## Median :5.771e+05 Median : 2113 Median : 1.000 Median : 0.000
## Mean :6.455e+07 Mean : 630577 Mean : 4.754 Mean : 6.309
## 3rd Qu.:6.514e+07 3rd Qu.: 15858 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :5.268e+09 Max. :20821108 Max. :5319.000 Max. :5507.000
## sinpkt dinpkt sjit djit
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.01 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.56 Median : 0.01 Median : 17.6 Median : 0.0
## Mean : 755.43 Mean : 121.71 Mean : 6363.4 Mean : 535.1
## 3rd Qu.: 63.41 3rd Qu.: 63.14 3rd Qu.: 3219.5 3rd Qu.: 128.5
## Max. :60009.99 Max. :57739.24 Max. :1483830.9 Max. :463199.2
## swin stcpb dtcpb dwin
## Min. : 0.0 Min. :0.000e+00 Min. :0.000e+00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.:0.000e+00 1st Qu.:0.000e+00 1st Qu.: 0.0
## Median :255.0 Median :2.778e+07 Median :2.831e+07 Median :255.0
## Mean :133.5 Mean :1.085e+09 Mean :1.073e+09 Mean :128.3
## 3rd Qu.:255.0 3rd Qu.:2.171e+09 3rd Qu.:2.144e+09 3rd Qu.:255.0
## Max. :255.0 Max. :4.295e+09 Max. :4.295e+09 Max. :255.0
## tcprtt synack smean dmean
## Min. :0.000000 Min. :0.000000 Min. : 24.0 Min. : 0.0
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.: 57.0 1st Qu.: 0.0
## Median :0.000552 Median :0.000441 Median : 65.0 Median : 44.0
## Mean :0.055928 Mean :0.029257 Mean : 139.5 Mean : 116.3
## 3rd Qu.:0.105547 3rd Qu.:0.052604 3rd Qu.: 100.0 3rd Qu.: 87.0
## Max. :3.821465 Max. :3.226788 Max. :1504.0 Max. :1500.0
## trans_depth response_body_len ct_srv_src ct_state_ttl
## Min. : 0.00000 Min. : 0 Min. : 1.000 Min. :0.000
## 1st Qu.: 0.00000 1st Qu.: 0 1st Qu.: 2.000 1st Qu.:1.000
## Median : 0.00000 Median : 0 Median : 5.000 Median :1.000
## Mean : 0.09428 Mean : 1595 Mean : 9.547 Mean :1.369
## 3rd Qu.: 0.00000 3rd Qu.: 0 3rd Qu.:11.000 3rd Qu.:2.000
## Max. :131.00000 Max. :5242880 Max. :63.000 Max. :6.000
## ct_dst_ltm ct_src_dport_ltm ct_dst_sport_ltm ct_dst_src_ltm
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 2.000 Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 5.745 Mean : 4.929 Mean : 3.663 Mean : 7.457
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 3.000 3rd Qu.: 6.000
## Max. :59.000 Max. :59.000 Max. :38.000 Max. :63.000
## ct_ftp_cmd ct_flw_http_mthd ct_src_ltm ct_srv_dst
## Min. :0.000000 Min. : 0.0000 Min. : 1.000 Min. : 1.000
## 1st Qu.:0.000000 1st Qu.: 0.0000 1st Qu.: 1.000 1st Qu.: 2.000
## Median :0.000000 Median : 0.0000 Median : 3.000 Median : 5.000
## Mean :0.008284 Mean : 0.1297 Mean : 6.468 Mean : 9.165
## 3rd Qu.:0.000000 3rd Qu.: 0.0000 3rd Qu.: 7.000 3rd Qu.:11.000
## Max. :2.000000 Max. :16.0000 Max. :60.000 Max. :62.000
## attack_cat label
## Length:82328 0:37000
## Class :character 1:45328
## Mode :character
##
##
##
In the following plots, I use boxplots to visualize the distribution of the continuous variables in the dataset. For some of the plots, I use geom_point in ggplot2 to show the distribution of the data and to identify possible outliers in the attack categories. For other plots, I apply a logarithmic transformation to the variables to better capture both the small and large values, and to show the distribution of each variable across the attack categories.
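For example, a log-scaled boxplot of duration by attack category might look like this (a sketch; train_plot is the hypothetical copy of the training data that still contains attack_cat):

```r
# Boxplot of log-transformed duration per attack category
# (small offset added because dur can be 0)
ggplot(train_plot, aes(x = attack_cat, y = log(dur + 1e-6), fill = attack_cat)) +
  geom_boxplot() +
  labs(x = "Attack category", y = "log(duration)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```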
The Duration column represents the time duration of the connection in seconds, indicating how long the connection between the source and destination lasted. For example, for a network connection, the duration would be the length of time that the connection was open. We applied natural logarithm to the Duration variable to capture the small values, and observed a low distribution of both attack and normal traffic with relatively high outliers.
Similarly, when we applied log2 to reduce the variance of the data and capture the underlying distribution of the "Source to Destination Packet Count" variable, we observed a relatively higher number of attacks, with high outliers.
Similarly, by applying a logarithm to the "dpkts" variable, we observed the distribution of the attack and non-attack traffic in the dataset. We found that there are outliers in the destination-to-source packet data relative to the attack categories. There is also a moderate distribution of both attack and normal traffic.
The natural logarithm was also applied to the Source to Destination transaction bytes, and the resulting distribution in the attack categories was observed. It was found that the size of the attacks was relatively similar to that of normal traffic. However, there were also a significant number of outliers in the Sbytes variable.
The observation is that the distribution of attacks is higher than that of normal packets. Additionally, there are significant outliers for the DNS service and the Generic category in the boxplot.
The Rate feature in the dataset indicates the rate of the connection, measured as the number of packets per second (pps) during the last two seconds of the connection. This feature provides important information on the rate of traffic for a particular connection and can be used to detect abnormal behavior or attacks with high traffic rates. In our analysis, we observed a significant outlier in the normal traffic category, while there is a relatively higher number of attacks with high traffic rates in the attack categories.
The source inter-packet arrival time (sinpkt) shows a roughly equal distribution between attacks and normal traffic. The natural logarithm was applied to capture the relatively small values in the sinpkt variable. Additionally, there are observed outliers in the Analysis, Backdoor, and Generic categories.
The destination inter-packet arrival time (dinpkt) is very similar to the source inter-packet arrival time; the distributions of both variables are comparatively equal across the attack categories.
Source jitter refers to the variation or fluctuation in the timing of packets sent from the source IP address, measured in milliseconds (msec). Jitter can occur due to network congestion or other factors that cause delay or loss of packets. High jitter can result in poor quality of service (QoS) for real-time applications such as voice or video streaming. We observed that the distribution of source jitter is higher for attacks, with considerable overlap from normal traffic within the attack range; the attacks appear more prominent than the non-attacks.
From the plot, we observed that there is a higher distribution of Source Jitter than Destination Jitter. Additionally, there are some outliers in the Destination Jitter.
The source TCP window advertisement value (swin) describes the window size advertised by the source during the connection. From the boxplot, we observed relatively higher values for attacks compared to normal traffic, where the normal-traffic values are very small. This indicates that the source window size can be a good predictor of an attack in this dataset.
#Plot Bar diagram of mean for different features by attack category
Below, I display the mean variation of the features by attack category to aid understanding of how these variables are distributed across the attack categories. For variables with smaller values, a logarithm may be applied to help capture the data for more insight.
The mean variation of the variables refers to the differences in the average values of the variables between the attack and normal traffic categories. By examining the mean variation, we can gain insight into how the variables are associated with the different attack types and use this information to develop models that accurately classify network traffic as either normal or malicious. The mean variation of the features can also help in selecting an accurate model by indicating which features have the most significant impact on the target variable: features with high mean variation are likely to be better predictors than features with low mean variation, so selecting the features with the highest mean variation is likely to result in a more accurate model. Additionally, the mean variation can help in identifying outliers and detecting unusual patterns in the data, which can inform model selection and feature engineering.

The mean-variation plots below show how the data is distributed: some features have a higher mean for attacks, while others have a higher mean for normal traffic. The legends show the exact colour mapping used in the plot; by matching the bar colours to the legend, the plots are easy to interpret, and any bar whose colour is not that of "Normal" in the legend is an attack category.
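A sketch of how such mean bar plots can be produced with dplyr and ggplot2 (train_plot as above; the choice of features is illustrative):

```r
train_plot %>%
  group_by(attack_cat) %>%
  summarise(dur = mean(dur), spkts = mean(spkts)) %>%       # mean per category
  pivot_longer(-attack_cat, names_to = "feature", values_to = "mean_value") %>%
  ggplot(aes(x = attack_cat, y = mean_value, fill = attack_cat)) +
  geom_col() +
  facet_wrap(~ feature, scales = "free_y") +                # one panel per feature
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```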
The mean paints a slightly different picture from the boxplot of the Dur variable: the mean duration varies strongly across the attack categories compared to the duration of normal network packets.
The mean value of Source to Destination packet count (Spkts) indicates a higher number of attacks compared to normal traffic. This suggests that the mean can be a good predictor of the attack categories.
(b) I determined if there was a significant association between the input features and the response variable. This involved running statistical tests to identify the most significant features that contribute to the model’s prediction accuracy.
The training dataset consists of 36 continuous and 5 categorical variables/features. To investigate the association between input features and labels, I performed independent t-tests for continuous variables and Chi-square tests for categorical variables. These tests yield a P-value, which indicates the statistical significance of a predictor in explaining the data and can be used to accept or reject the null hypothesis.
###t-test
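A sketch of the per-feature t-tests against the binary label (object names assumed from the separation step above):

```r
# Welch two-sample t-test of each continuous feature by label (0/1)
cont_vars <- setdiff(names(training_cont), "label")
ttest_p <- sapply(cont_vars, function(v) {
  t.test(training_cont[[v]] ~ training_cont$label)$p.value
})
sig_features <- names(ttest_p)[ttest_p < 0.05]  # 35 of the 36 features
```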
We observed that 35 of the 36 continuous features, all except "dur", are significantly associated with the label at the 5% level of significance. Therefore, we select the features with p-values less than 0.05 for further analysis and prediction.
#Take the 35 significant features whose p-values are less than 0.05
We extracted the features that have p-values less than 0.05.
Apply the Chi-square test to the categorical variables to show the association between each categorical feature and the study variable.
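A sketch of the Chi-square loop matching the output below (the warnings arise from sparse contingency-table cells):

```r
cate_vars <- c("proto", "service", "state", "is_ftp_login", "is_sm_ips_ports")
chi_res <- t(sapply(cate_vars, function(v) {
  ct <- chisq.test(training_cate[[v]], training_cate$label)
  c(CS_test_statistic = unname(ct$statistic), pvalue_CS_test = ct$p.value)
}))
chi_res
```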
## Warning in chisq.test(training_cate[, i], training_cate$label): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(training_cate[, i], training_cate$label): Chi-squared
## approximation may be incorrect
## CS_test_statistic pvalue_CS_testtest
## proto 18655.69308 0.000000e+00
## service 12441.80864 0.000000e+00
## state 27075.54108 0.000000e+00
## is_ftp_login 24.44615 7.641607e-07
## is_sm_ips_ports 1132.55131 2.780579e-248
Below, I am going to do some encoding and data pre-processing.
Encode all categorical variables and convert them into numeric variables.
Encode all categorical variables into numeric for the training set:
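A sketch of simple label encoding for the remaining character columns (an alternative would be one-hot encoding):

```r
for (v in c("proto", "service", "state")) {
  train[[v]] <- as.numeric(as.factor(train[[v]]))
  test[[v]]  <- as.numeric(as.factor(test[[v]]))
}
# Note: in practice the factor levels should be fixed jointly so that
# the same category maps to the same code in both sets.
```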
## [1] 82328 43
Encode all categorical variables into numeric for Test Set
## [1] 175335 42
Standardize the continuous variables for both the training and test datasets.
In this dataset, several continuous variables have different properties and are measured on different scales. To make these variables more comparable and homogeneous, we scaled the dataset, transforming the data so that the values fall within a similar range, here between 0 and 1 (min-max normalization). Scaling helps to give equal importance to all the features, making them easier to compare and analyze. It is particularly useful for datasets with many features, such as this one. By scaling the data, we prevent features with larger values from dominating the analysis and ensure that all features contribute comparably to the model.
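A sketch of min-max scaling with caret::preProcess, fitting the scaling parameters on the training set and applying them to both sets:

```r
num_cols <- setdiff(names(train), "label")
pp <- preProcess(train[, num_cols], method = "range")  # maps each feature to [0, 1]
train[, num_cols] <- predict(pp, train[, num_cols])
test[, num_cols]  <- predict(pp, test[, num_cols])
```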
## [1] 82328 42
## [1] 175335 42
Here is where the model comparison and fitting process begins. First, we train the models using all the features, and then using the 20 most significant features. Next, I fit the models and compare the classification accuracy of logistic regression and KNN.
To begin, a logistic regression model was trained using all the features in the dataset. The glmnet method was used to train the model on all 41 predictors.
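A sketch of the training call with caret (the seed is an assumption; caret's default glmnet grid matches the alpha/lambda values shown below):

```r
set.seed(123)  # assumed seed
lr_all <- train(label ~ ., data = train,
                method = "glmnet",            # elastic-net logistic regression
                family = "binomial",
                trControl = trainControl(method = "cv", number = 5))
lr_all
```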
## glmnet
##
## 82328 samples
## 41 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 65862, 65863, 65862, 65863, 65862
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0005016511 0.9026941 0.8031727
## 0.10 0.0050165109 0.8854461 0.7678748
## 0.10 0.0501651087 0.8498203 0.6964789
## 0.55 0.0005016511 0.9053056 0.8085086
## 0.55 0.0050165109 0.8814863 0.7596450
## 0.55 0.0501651087 0.7777669 0.5407855
## 1.00 0.0005016511 0.9089861 0.8160772
## 1.00 0.0050165109 0.8779759 0.7523598
## 1.00 0.0501651087 0.7596322 0.5051752
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0005016511.
Here, we plot the cross-validation results of the logistic regression model trained on all features.
The best tuning parameters are alpha = 1 and lambda = 0.0005016511, which give an accuracy of 0.9089861. We will therefore use these parameters for predicting an attack.
Furthermore, extracting the best tuned parameters of the logistic regression model provides more evidence to support the assertion above.
Predict the class and its probability on the test dataset using the logistic regression model.
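A sketch of the prediction step (model object name as above):

```r
pred_class <- predict(lr_all, newdata = test)                 # predicted class (0/1)
pred_prob  <- predict(lr_all, newdata = test, type = "prob")  # class probabilities
confusionMatrix(pred_class, test$label)
```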
We created a confusion matrix summary for our logistic regression model to visualize and evaluate its accuracy, and to observe any instances of misclassification.
In the summary of our model, the accuracy on the test data using all variables is 0.8526.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 43438 13293
## 1 12560 106044
##
## Accuracy : 0.8526
## 95% CI : (0.8509, 0.8542)
## No Information Rate : 0.6806
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.662
##
## Mcnemar's Test P-Value : 5.3e-06
##
## Sensitivity : 0.7757
## Specificity : 0.8886
## Pos Pred Value : 0.7657
## Neg Pred Value : 0.8941
## Prevalence : 0.3194
## Detection Rate : 0.2477
## Detection Prevalence : 0.3236
## Balanced Accuracy : 0.8322
##
## 'Positive' Class : 0
##
Draw the confusion matrix plot of logistic regression on the test dataset for all features.
After analyzing the model and the confusion matrix, we observed that our model classified attack and normal traffic with good accuracy: the numbers of false negatives and false positives were relatively small compared to the correctly classified instances. Additionally, we varied the number of cross-validation folds and found that five-fold cross-validation (N = 5) gave the best results.
Extract accuracy, sensitivity, specificity, positive predictive value, and negative predictive value
Check the ROC curve for all features using the logistic regression model.
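A sketch using pROC (namespaced, because the AUC package masks pROC::roc and pROC::auc, as the attach messages above show):

```r
roc_lr_all <- pROC::roc(response = test$label, predictor = pred_prob[["1"]])
plot(roc_lr_all)
pROC::auc(roc_lr_all)
```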
## Area under the curve (AUC): 0.949
From the area under the ROC curve, we deduce a very encouraging classification accuracy using this model on all features.
Combine all results for the test dataset across all predictor variables.
## ACC_test SE_test SP_test PPV_test NPV_test
## Accuracy 0.8525508 0.7757063 0.8886096 0.7656837 0.8941014
Here, I train and fit a KNN model on all features using five-fold cross-validation.
#Fit KNN-based model considering all features
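A sketch of the KNN training call (the seed is an assumption; the grid of k values is taken from the output below):

```r
set.seed(123)  # assumed seed
knn_all <- train(label ~ ., data = train,
                 method = "knn",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = data.frame(k = c(5, 7, 9)))
knn_all
```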
## k-Nearest Neighbors
##
## 82328 samples
## 41 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 65862, 65863, 65862, 65863, 65862
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9351253 0.8695803
## 7 0.9336192 0.8666848
## 9 0.9334977 0.8665166
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
Plot the cross-validation results of the KNN model.
From the plot, the optimal value for the tuned model is k = 5, as can plainly be seen, with an accuracy of 0.9351253.
Also, using the function for obtaining the best parameters, we find the best tuned parameters for the KNN model on all features below.
Predict the class and its probability on the test dataset using KNN and the best tuned model.
Create the confusion matrix summary for the KNN model on all variables. From the summary, we deduce the model's test accuracy to be 0.8526, which appears to be the same as that of the logistic regression.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 43438 13293
## 1 12560 106044
##
## Accuracy : 0.8526
## 95% CI : (0.8509, 0.8542)
## No Information Rate : 0.6806
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.662
##
## Mcnemar's Test P-Value : 5.3e-06
##
## Sensitivity : 0.7757
## Specificity : 0.8886
## Pos Pred Value : 0.7657
## Neg Pred Value : 0.8941
## Prevalence : 0.3194
## Detection Rate : 0.2477
## Detection Prevalence : 0.3236
## Balanced Accuracy : 0.8322
##
## 'Positive' Class : 0
##
We draw a confusion matrix of the KNN predictions for all variables.
The classification is very similar, with a minor difference in favour of the logistic regression. This will become apparent when we plot the ROC curve for KNN below.
Plot the ROC curve for KNN for all features
## Area under the curve (AUC): 0.937
Based on the ROC curve of the K-nearest neighbor model, we observed that its performance is very similar to that of the logistic regression model, with only minor differences. However, when we compared the two models on all features, logistic regression showed slightly better performance, with an AUC difference of only 0.012. Although I had expected KNN to perform better, we cannot manipulate the data to favor our biases.
#Determine the Important Features from training set and choose only top 20 features
I extracted the 20 most important features from our first model, then trained and fit logistic regression and KNN on them to study which of the two models is more accurate.
Extract the index and feature names
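A sketch using caret::varImp on the all-features glmnet model:

```r
imp <- varImp(lr_all)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
top20
```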
## [1] "dttl" "swin" "ct_dst_src_ltm" "is_sm_ips_ports"
## [5] "ct_dst_sport_ltm" "ct_state_ttl" "dmean" "state"
## [9] "ct_src_dport_ltm" "ct_srv_src" "synack" "service"
## [13] "spkts" "ct_srv_dst" "tcprtt" "sinpkt"
## [17] "dload" "dwin" "sttl" "ct_dst_ltm"
Extract the top 20 features from the test dataset.
Now we take the top 20 most significant features obtained and implement the models to check their performance using logistic regression and K-nearest neighbor.
##Fit LR model again for 20 most significant features
## glmnet
##
## 82328 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 65862, 65862, 65863, 65862, 65863
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0005016511 0.8895515 0.7758404
## 0.10 0.0050165109 0.8780973 0.7523978
## 0.10 0.0501651087 0.8372120 0.6687031
## 0.55 0.0005016511 0.8907540 0.7782981
## 0.55 0.0050165109 0.8757166 0.7474437
## 0.55 0.0501651087 0.7726411 0.5306174
## 1.00 0.0005016511 0.8923331 0.7815427
## 1.00 0.0050165109 0.8761418 0.7482439
## 1.00 0.0501651087 0.7579316 0.5018521
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0005016511.
The plot shows that the optimal parameters are alpha = 1 and lambda = 0.0005016511. Next we check the prediction accuracy of the model.
Predict Class and also its class probability on Test Dataset considering 20 significant features
Create Confusion Matrix summary for test dataset for considering 20 significant features
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 48325 27125
## 1 7673 92212
##
## Accuracy : 0.8015
## 95% CI : (0.7997, 0.8034)
## No Information Rate : 0.6806
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.582
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8630
## Specificity : 0.7727
## Pos Pred Value : 0.6405
## Neg Pred Value : 0.9232
## Prevalence : 0.3194
## Detection Rate : 0.2756
## Detection Prevalence : 0.4303
## Balanced Accuracy : 0.8178
##
## 'Positive' Class : 0
##
Here the accuracy of the model is 0.8015, which is lower than the accuracy of logistic regression using all features.
The confusion matrix for this logistic regression model shows a significant increase in false positives and a decrease in false negatives compared to the model using all features. The model performed better when using all features rather than the reduced set. We will further check this assertion by examining the ROC curve.
This is the ROC curve for LR, considering top 20 features:
## Area under the curve (AUC): 0.925
The ROC curve shows a decrease in the AUC, which supports our conclusion that the logistic regression model using all features has higher predictive accuracy compared to the logistic regression model using only the 20 most significant features.
Extract accuracy, sensitivity, specificity, positive predictive value, and negative predictive value for considering 20 significant features
##Combine all the results for the testing dataset considering the 20 significant features
## ACC_test_sig SE_test_sig SP_test_sig PPV_test_sig NPV_test_sig
## Accuracy 0.8015342 0.8629772 0.7727025 0.6404904 0.9231817
##Combine all training and test results considering the 20 significant features
## LR: All Features LR: Limited (20) Features
## ACC 85.26 80.15
## SE 77.57 86.30
## SP 88.86 77.27
## PPV 76.57 64.05
## NPV 89.41 92.32
The final model (model 4) uses KNN on the 20 most significant features.
##Fit KNN model again for 20 significant features
## k-Nearest Neighbors
##
## 82328 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 65862, 65862, 65862, 65863, 65863
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9494826 0.8980221
## 7 0.9488388 0.8967735
## 9 0.9481100 0.8953205
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
The optimal parameter is obtained at k = 5. It's worth noting that I also ran cross-validation for higher values of k, but the best tuned parameter remained k = 5, which is why I consistently used it. I excluded higher values of k because cross-validation at k = 20 would take much longer and require more computational power. We will check the predictive accuracy below.
Predict the class and its probability on the test dataset considering the 20 significant features.
Create Confusion Matrix summary for test dataset for considering 20 significant features
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 50451 27097
## 1 5547 92240
##
## Accuracy : 0.8138
## 95% CI : (0.812, 0.8156)
## No Information Rate : 0.6806
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6114
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9009
## Specificity : 0.7729
## Pos Pred Value : 0.6506
## Neg Pred Value : 0.9433
## Prevalence : 0.3194
## Detection Rate : 0.2877
## Detection Prevalence : 0.4423
## Balanced Accuracy : 0.8369
##
## 'Positive' Class : 0
##
Based on the summary of the confusion matrix, we observed that the accuracy of the KNN model is higher than that of Logistic Regression for the 20 most significant features.
Draw the KNN confusion matrix plot for the test dataset considering the 20 significant features.
This is the ROC curve for KNN, considering top 20 features:
## Area under the curve (AUC): 0.889
Surprisingly, the predictive accuracy (AUC) of KNN with the 20 most significant features turned out to be the lowest, even lower than that of logistic regression using only the 20 features, despite KNN having higher model accuracy than logistic regression.
Extract accuracy, sensitivity, specificity, positive predictive value, and negative predictive value for considering 20 significant features
Combine all the results for the testing dataset considering the 20 significant features.
## ACC_test_sig_knn SE_test_sig_knn SP_test_sig_knn PPV_test_sig_knn
## Accuracy 0.8138193 0.9009429 0.7729371 0.6505777
## NPV_test_sig_knn
## Accuracy 0.9432747
Combine all the LR and KNN results considering the 20 significant features.
## LR: All Features KNN: Limited (20) Features
## ACC 85.26 81.38
## SE 77.57 90.09
## SP 88.86 77.29
## PPV 76.57 65.06
## NPV 89.41 94.33
Finally, we plot the ROC curves of all four models together to determine which performs better, comparing all features against the 20 most predictive features.
Now we compare the ROC curves of LR and KNN for all features and for only 20 features.
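A sketch of the overlay, assuming roc objects for the other three models were created analogously to roc_lr_all (the names below are hypothetical):

```r
plot(roc_lr_all, col = "blue", legacy.axes = TRUE)   # LR, all features
plot(roc_knn_all, col = "red", add = TRUE)           # KNN, all features
plot(roc_lr_20, col = "darkgreen", add = TRUE)       # LR, top 20 features
plot(roc_knn_20, col = "purple", add = TRUE)         # KNN, top 20 features
legend("bottomright",
       legend = c("LR: all features", "KNN: all features",
                  "LR: top 20", "KNN: top 20"),
       col = c("blue", "red", "darkgreen", "purple"), lwd = 2)
```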
## Area under the curve (AUC): 0.949
## Area under the curve (AUC): 0.937
## Area under the curve (AUC): 0.925
## Area under the curve (AUC): 0.889
Conclusion
The final ROC curves for the four models show that the logistic regression model on all features has the highest predictive accuracy, with an AUC of 0.949. We also observed that the predictive accuracy using the 20 most significant features is low compared to that of all features. Individually, logistic regression performed better both with the 20 features and with all features. In conclusion, the logistic regression model is better at predicting an attack, with very high accuracy on unseen data from the UNSW-NB15 dataset.
Impact of my analysis:
The impact of this analysis is that it provides insight into the performance of different machine learning models, specifically logistic regression and K-nearest neighbors (KNN), at predicting attacks in the UNSW-NB15 dataset. The analysis shows that logistic regression with all features has the highest predictive accuracy, while KNN with the 20 most significant features has the lowest. This information can be useful for organizations and security professionals looking to implement machine learning-based intrusion detection systems: by understanding the strengths and weaknesses of different models, they can make informed decisions about which models to use and how to configure them for optimal performance. I also learnt that the most predictive feature is dttl, followed by sttl. I am not surprised, because I had already deduced this from the distribution of the two variables in the data exploration, i.e., the mean variation of both. I also learnt that this model can be used to predict intrusions and can serve as a test parameter for real-world security or intrusion detection designs.
Trade-off
There is a trade-off between predictive accuracy and the computational resources required to train and test the models. Some models, such as K-nearest neighbors, can be computationally expensive when working with large datasets or high-dimensional feature spaces, as in this dataset. Additionally, more complex models like deep neural networks may require significant computational power and time to train. Therefore, choosing the right model for a given problem involves a trade-off between accuracy and resource requirements. There is also a possibility that a model has inherent biases that can lead to unfair predictions. This can be observed in the KNN on the 20 most predictive features, where the model has higher accuracy than logistic regression but lower predictive accuracy (AUC) on the test set.