R Markdown

Introduction of the Dataset

The UNSW-NB15 dataset is a network intrusion detection dataset created by the University of New South Wales (UNSW), Australia. It contains network traffic data that simulates a real-world environment and includes various types of attacks, such as DoS (Denial of Service), reconnaissance, backdoors, and others. The dataset consists of both normal and attack traffic and is divided into training and testing sets. It has been widely used as a benchmark for evaluating the performance of intrusion detection systems and for developing new machine learning algorithms that detect network attacks.

Objective of the Project

The main objective of this project is to develop a machine-learning-based intrusion detection system that can predict network intrusions using the publicly available UNSW-NB15 dataset. The dataset contains normal and attack records spanning nine attack categories: Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Fuzzers (anomalous activity), Shellcode, and Worms. It also provides a binary label, with 0 representing normal (non-attack) traffic and 1 representing any type of attack. Secondly, different models were compared to find the best model for predicting an intrusion on this dataset.

To achieve this objective, I used generalized linear models (GLMs) with a binomial family (logistic regression) to predict intrusions. The experiments conducted in this work were as follows:

  1. I conducted an exploratory analysis of the independent variables to gain insight into the dataset’s structure and identify any outliers or anomalies.

  2. I determined if there was a significant association between the input features and the response variable. This involved running statistical tests to identify the most significant features that contribute to the model’s prediction accuracy.

  3. I performed data pre-processing to clean and scale the data for accurate prediction.

  4. I evaluated the accuracy of all input features in predicting the label. I trained the models on the training dataset and used the test set to evaluate their performance.

  5. I determined how accurately the most significant features, obtained from step (b), predict intrusion. I used these features to train the model and evaluated its performance on the test set.

  6. I compared the performance of the model using the most significant features and using all features in the dataset to determine which model performs better.

  7. I also compared the logistic regression model to the K-nearest neighbors model to determine which model performs better at classifying attacks and intrusions.

The first thing we carried out was loading the libraries required for this project.

Load the necessary libraries.
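Since the chunk itself is not echoed, here is a minimal sketch of the likely library calls, inferred from the attach messages below (the exact set and order are assumptions):

```r
library(tidyverse)  # dplyr, ggplot2, tidyr, etc. for wrangling and plotting
library(here)       # project-relative file paths
library(psych)      # descriptive statistics
library(caret)      # model training, tuning, and confusion matrices (loads lattice)
library(pROC)       # ROC curves and AUC
library(ROSE)       # utilities for imbalanced classification data
library(AUC)        # alternative roc()/auc() implementations (masks pROC's)
```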

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
## here() starts at /Users/swanky/Downloads
## 
## 
## Attaching package: 'psych'
## 
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## 
## 
## Loading required package: lattice
## 
## 
## Attaching package: 'caret'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## 
## Type 'citation("pROC")' for a citation.
## 
## 
## Attaching package: 'pROC'
## 
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## 
## 
## Loaded ROSE 0.0-4
## 
## 
## AUC 0.3.2
## 
## Type AUCNews() to see the change log and ?AUC to get an overview.
## 
## 
## Attaching package: 'AUC'
## 
## 
## The following objects are masked from 'package:pROC':
## 
##     auc, roc
## 
## 
## The following objects are masked from 'package:caret':
## 
##     sensitivity, specificity

(a) I conducted an exploratory analysis of the independent variables to gain insight into the dataset’s structure and identify any outliers or anomalies.

Read the training dataset from a CSV file in the current directory. Read the testing dataset from a CSV file in the current directory.
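A minimal sketch of the loading step; the file names are assumptions based on the standard UNSW-NB15 release:

```r
# Read the two partitions from the project directory
train <- read.csv(here("UNSW_NB15_training-set.csv"), stringsAsFactors = FALSE)
test  <- read.csv(here("UNSW_NB15_testing-set.csv"),  stringsAsFactors = FALSE)
dim(train)  # expected: 82332 x 45
dim(test)   # expected: 175341 x 45
```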

## [1] 82332    45
## [1] 175341     45

The training dataset has 82332 rows and the testing dataset has 175341 rows; each has 45 columns. Next, check the class distribution of the training set, including the distribution of attack categories.
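A sketch of the tabulation (column names follow the dataset's codebook):

```r
table(train$attack_cat)          # counts per attack category
prop.table(table(train$label))   # share of normal (0) vs attack (1)
```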

## 
##       Analysis       Backdoor            DoS       Exploits        Fuzzers 
##            677            583           4089          11132           6062 
##        Generic         Normal Reconnaissance      Shellcode          Worms 
##          18871          37000           3496            378             44

Check the class distribution of the testing set, including the distribution of attack categories.

## 
##       Analysis       Backdoor            DoS       Exploits        Fuzzers 
##           2000           1746          12264          33393          18184 
##        Generic         Normal Reconnaissance      Shellcode          Worms 
##          40000          56000          10491           1133            130

# Check the class distribution of attack category for both training and test sets

Using bar charts, I illustrate the distribution of data in each of the attack categories. The bar charts for both the training and testing datasets reveal nine categories of attacks, with “Normal” representing non-attacks. The data is highly imbalanced: “Normal” is the single most frequent category, far outnumbering most individual attack categories. Among the training data, the most frequently occurring attack categories are “Generic,” “Exploits,” “Fuzzers,” “DoS,” and “Reconnaissance.”

# Exclude ID and attack category from the training set

Below, I excluded the ID column and the attack-category column that labels the attack types. This is done twice because the dataset is divided into a training set and a test set. This step is also part of the data pre-processing.
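Below is a minimal sketch of the column removal, assuming dplyr and the codebook's column names:

```r
# Drop the row identifier and the multi-class attack label;
# the binary `label` column is kept as the response
train <- train %>% select(-id, -attack_cat)
test  <- test  %>% select(-id, -attack_cat)
```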

`test <- test[which(test$is_ftp_login != 2), ]`: the test data frame is subset by excluding rows where the is_ftp_login column equals 2, using the which() function.

## [1] 82328    43

# Exclude ID and attack category from the test set

## [1] 175335     42

## Compare the class label distribution of the train and test datasets

Here, using bar plots, I compared the class labels (1/0) of the train and test datasets, with 1 representing attack and 0 representing non-attack. This class label will be used for the classification.

Here, we explored the distribution of the binary class label, which is represented as 1/0 in the dataset.
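A sketch of the comparison plot:

```r
# Proportion of each class label in train vs test
par(mfrow = c(1, 2))
barplot(prop.table(table(train$label)), main = "Train",
        names.arg = c("0 (normal)", "1 (attack)"))
barplot(prop.table(table(test$label)),  main = "Test",
        names.arg = c("0 (normal)", "1 (attack)"))
```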

In Train: 44.94% of records belong to class “normal” and 55.06% to class “attack.” In Test: 31.94% belong to class “normal” and 68.06% to class “attack.”

CATEGORICAL VARIABLES IN THE DATASET

We will now investigate the distribution of Attack categories across the Predictor variables. This analysis will demonstrate how each predictor is distributed among the attack categories.

The service field in the dataset provides information about the network protocol or service associated with the recorded network traffic. This field identifies the type of service or protocol used for communication. Examples of services in the dataset include HTTP, FTP, SMTP, DNS, SSH, among others.

For normal traffic, “dns” appears frequently alongside a few rare values, and the placeholder value “-” is the most prevalent entry in the dataset. In the attack traffic, “dns” occurs more frequently than any other value, with only a few instances of other services such as HTTP, which therefore have a low distribution. Overall, the service field shows a low spread of values for normal traffic, while attacks are concentrated in a few services.

## For the State Variable

The “state” field represents the state and activity of the protocol, such as TCP, UDP, or ICMP, and indicates the state of the connection, such as establishing, closing, or maintaining it. The values of the “state” field vary with the protocol and its activity. For TCP connections, the “state” field may contain values such as “FIN”, “SYN”, “ACK”, “RST”, “URG”, and “PSH”. In the normal data, “fin” is the most frequently occurring value, followed by “con,” which is about half as frequent, plus a few instances of “int.” In the attack data, by contrast, “int” occurs more frequently than any value in the normal category, suggesting a potentially important feature for detection. There are very few instances of “fin” and low frequencies of the other TCP states in the attack category, indicating a narrow distribution of the field. Based on my understanding of the data, the prevalence of “FIN” suggests that most normal TCP connections complete and close after packets are sent; deviations from this pattern can be used to monitor network activity and detect potential issues.

## For the is_ftp_login Variable

The “is_ftp_login” field in the dataset indicates whether a transaction is an FTP login or not. It has a binary value, with 1 indicating that the transaction is an FTP login and 0 indicating that it is not. FTP (File Transfer Protocol) is a standard network protocol used for transferring files from one host to another over a TCP-based network, such as the Internet. The distribution of “is_ftp_login” is comparatively higher in the normal category than in the attack category. We observe a significant difference in the representation of “is_ftp_login” between the two categories.

The feature “is_sm_ips_ports” in the dataset is binary and indicates if the source or destination port of a network connection is a well-known port assigned by the Internet Assigned Numbers Authority (IANA) for specific services. A value of 1 indicates that at least one end of the network connection involves a well-known port, while a value of 0 indicates that there is no involvement of any well-known port. Our observation reveals that there are a significant number of non-well-known ports in the attack category.

# Separate categorical variables from the training data

In this chunk, we separated the categorical variables from the training dataset.
## [1] 82328     6

# Separate continuous variables from the training data

After separating the categorical variables, we separate the continuous variables in the dataset for exploration.

## [1] 82328    38
##       dur               spkts              dpkts              sbytes        
##  Min.   : 0.00000   Min.   :    1.00   Min.   :    0.00   Min.   :      24  
##  1st Qu.: 0.00001   1st Qu.:    2.00   1st Qu.:    0.00   1st Qu.:     114  
##  Median : 0.01412   Median :    6.00   Median :    2.00   Median :     534  
##  Mean   : 1.00678   Mean   :   18.67   Mean   :   17.55   Mean   :    7994  
##  3rd Qu.: 0.71936   3rd Qu.:   12.00   3rd Qu.:   10.00   3rd Qu.:    1280  
##  Max.   :59.99999   Max.   :10646.00   Max.   :11018.00   Max.   :14355774  
##      dbytes              rate                sttl          dttl       
##  Min.   :       0   Min.   :      0.0   Min.   :  0   Min.   :  0.00  
##  1st Qu.:       0   1st Qu.:     28.6   1st Qu.: 62   1st Qu.:  0.00  
##  Median :     178   Median :   2651.2   Median :254   Median : 29.00  
##  Mean   :   13234   Mean   :  82414.9   Mean   :181   Mean   : 95.71  
##  3rd Qu.:     956   3rd Qu.: 111111.1   3rd Qu.:254   3rd Qu.:252.00  
##  Max.   :14657531   Max.   :1000000.0   Max.   :255   Max.   :253.00  
##      sload               dload              sloss              dloss         
##  Min.   :0.000e+00   Min.   :       0   Min.   :   0.000   Min.   :   0.000  
##  1st Qu.:1.120e+04   1st Qu.:       0   1st Qu.:   0.000   1st Qu.:   0.000  
##  Median :5.771e+05   Median :    2113   Median :   1.000   Median :   0.000  
##  Mean   :6.455e+07   Mean   :  630577   Mean   :   4.754   Mean   :   6.309  
##  3rd Qu.:6.514e+07   3rd Qu.:   15858   3rd Qu.:   3.000   3rd Qu.:   2.000  
##  Max.   :5.268e+09   Max.   :20821108   Max.   :5319.000   Max.   :5507.000  
##      sinpkt             dinpkt              sjit                djit         
##  Min.   :    0.00   Min.   :    0.00   Min.   :      0.0   Min.   :     0.0  
##  1st Qu.:    0.01   1st Qu.:    0.00   1st Qu.:      0.0   1st Qu.:     0.0  
##  Median :    0.56   Median :    0.01   Median :     17.6   Median :     0.0  
##  Mean   :  755.43   Mean   :  121.71   Mean   :   6363.4   Mean   :   535.1  
##  3rd Qu.:   63.41   3rd Qu.:   63.14   3rd Qu.:   3219.5   3rd Qu.:   128.5  
##  Max.   :60009.99   Max.   :57739.24   Max.   :1483830.9   Max.   :463199.2  
##       swin           stcpb               dtcpb                dwin      
##  Min.   :  0.0   Min.   :0.000e+00   Min.   :0.000e+00   Min.   :  0.0  
##  1st Qu.:  0.0   1st Qu.:0.000e+00   1st Qu.:0.000e+00   1st Qu.:  0.0  
##  Median :255.0   Median :2.778e+07   Median :2.831e+07   Median :255.0  
##  Mean   :133.5   Mean   :1.085e+09   Mean   :1.073e+09   Mean   :128.3  
##  3rd Qu.:255.0   3rd Qu.:2.171e+09   3rd Qu.:2.144e+09   3rd Qu.:255.0  
##  Max.   :255.0   Max.   :4.295e+09   Max.   :4.295e+09   Max.   :255.0  
##      tcprtt             synack             smean            dmean       
##  Min.   :0.000000   Min.   :0.000000   Min.   :  24.0   Min.   :   0.0  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:  57.0   1st Qu.:   0.0  
##  Median :0.000552   Median :0.000441   Median :  65.0   Median :  44.0  
##  Mean   :0.055928   Mean   :0.029257   Mean   : 139.5   Mean   : 116.3  
##  3rd Qu.:0.105547   3rd Qu.:0.052604   3rd Qu.: 100.0   3rd Qu.:  87.0  
##  Max.   :3.821465   Max.   :3.226788   Max.   :1504.0   Max.   :1500.0  
##   trans_depth        response_body_len   ct_srv_src      ct_state_ttl  
##  Min.   :  0.00000   Min.   :      0   Min.   : 1.000   Min.   :0.000  
##  1st Qu.:  0.00000   1st Qu.:      0   1st Qu.: 2.000   1st Qu.:1.000  
##  Median :  0.00000   Median :      0   Median : 5.000   Median :1.000  
##  Mean   :  0.09428   Mean   :   1595   Mean   : 9.547   Mean   :1.369  
##  3rd Qu.:  0.00000   3rd Qu.:      0   3rd Qu.:11.000   3rd Qu.:2.000  
##  Max.   :131.00000   Max.   :5242880   Max.   :63.000   Max.   :6.000  
##    ct_dst_ltm     ct_src_dport_ltm ct_dst_sport_ltm ct_dst_src_ltm  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 2.000   Median : 1.000   Median : 1.000   Median : 3.000  
##  Mean   : 5.745   Mean   : 4.929   Mean   : 3.663   Mean   : 7.457  
##  3rd Qu.: 6.000   3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.: 6.000  
##  Max.   :59.000   Max.   :59.000   Max.   :38.000   Max.   :63.000  
##    ct_ftp_cmd       ct_flw_http_mthd    ct_src_ltm       ct_srv_dst    
##  Min.   :0.000000   Min.   : 0.0000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:0.000000   1st Qu.: 0.0000   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median :0.000000   Median : 0.0000   Median : 3.000   Median : 5.000  
##  Mean   :0.008284   Mean   : 0.1297   Mean   : 6.468   Mean   : 9.165  
##  3rd Qu.:0.000000   3rd Qu.: 0.0000   3rd Qu.: 7.000   3rd Qu.:11.000  
##  Max.   :2.000000   Max.   :16.0000   Max.   :60.000   Max.   :62.000  
##   attack_cat        label    
##  Length:82328       0:37000  
##  Class :character   1:45328  
##  Mode  :character            
##                              
##                              
## 

In the following plots, I use boxplots to visualize the distribution of the continuous variables in the dataset. For some plots, I use geom_point in ggplot2 to show the distribution of the data and to identify possible outliers in the attack categories. For other plots, I apply a logarithmic transformation to the variables to better capture both the small and large values, and the distribution of each variable across the attack categories.

The duration column represents the time duration of the connection in seconds, indicating how long the connection between source and destination lasted. For a network connection, the duration is the length of time the connection was open. We applied the natural logarithm to the duration variable to capture the small values, and observed a low distribution of both attack and normal traffic, with relatively large outliers.
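A representative sketch of one such boxplot, assuming `train_plot` is a copy of the training data that still carries the attack_cat column (the small offset avoids log(0)):

```r
ggplot(train_plot, aes(x = attack_cat, y = log(dur + 1e-6), fill = attack_cat)) +
  geom_boxplot(outlier.alpha = 0.3) +                       # fade outlier points
  labs(x = "Attack category", y = "log(duration)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # readable category labels
        legend.position = "none")
```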

Similarly, when we applied log2 to reduce the variance of the data and capture the underlying distribution of the source-to-destination packet count variable, we observed a relatively higher number of attacks, with large outliers.

Similarly, by applying a logarithm to the “dpkts” variable, we observed the distribution of attack and non-attack traffic in the dataset. There are outliers in the destination-to-source packet data relative to the attack categories, and a moderate distribution of both attack and normal traffic.

The natural logarithm was also applied to the Source to Destination transaction bytes, and the resulting distribution in the attack categories was observed. It was found that the size of the attacks was relatively similar to that of normal traffic. However, there were also a significant number of outliers in the Sbytes variable.

The observation is that the distribution of attacks is higher than that of normal packets. Additionally, there are significant outliers for the DNS and Generic categories in the boxplot.

The Rate feature in the dataset indicates the rate of the connection, measured as the number of packets per second (pps) during the last two seconds of the connection. This feature provides important information on the rate of traffic for a particular connection and can be used to detect abnormal behavior or attacks with high traffic rates. In our analysis, we observed a significant outlier in the normal traffic category, while there is a relatively higher number of attacks with high traffic rates in the attack categories.

The source inter-packet arrival time (sinpkt) shows an equal distribution between attacks and normal traffic. The natural logarithm was applied to capture the relatively small values of the sinpkt variable. Additionally, there are outliers in the Analysis, Backdoor, and Generic categories.

The destination inter-packet arrival time (dinpkt) behaves very similarly to the source inter-packet arrival time; the distributions of the two variables are comparatively equal across the attack categories.

Source jitter refers to the variation or fluctuation in the timing of packets sent from the source IP address, measured in milliseconds (ms). Jitter can occur due to network congestion or other factors that cause delay or loss of packets. High jitter can result in poor quality of service (QoS) for real-time applications such as voice or video streaming. We observed that the distribution of source jitter is higher for attacks, with a noticeable amount of normal traffic within the attack category; the attacks appear more prominent than the non-attacks.

From the plot, we observed that there is a higher distribution of Source Jitter than Destination Jitter. Additionally, there are some outliers in the Destination Jitter.

This field relates to source-to-destination traffic during the last measured time period (two-way traffic), including both the payload and the TCP/IP headers. From the boxplot, we observed relatively higher values for attacks in the source window size compared to normal traffic, where the values are very small. This indicates that the source window size can be a good predictor of an attack in this dataset.

# Plot bar diagrams of the mean of different features by attack category

Below, I display the mean variation of the features by attack category to aid in understanding how these variables are distributed across the attack categories. For variables with smaller values, a logarithm may be applied to help capture the data for more insight.

The mean variation of the variables refers to the differences in the average values of the variables between the attack and normal traffic categories. By examining it, we can gain insight into how the variables are associated with the different attack types and use this information to develop models that accurately classify network traffic as either normal or malicious.

The mean variation of the features can also guide model selection by indicating which features have the most significant impact on the target variable. Features with high mean variation are likely to be better predictors than features with low mean variation, so selecting the features with the highest mean variation is likely to yield a more accurate model. Additionally, the mean variation can help identify outliers and unusual patterns in the data, which can inform model selection and feature engineering.

The mean variation plots below show how the data is distributed: some features have a higher distribution of attacks, while others have a higher distribution of normal traffic. The legends show exactly how each color is represented in the main plot, so the charts are easy to interpret by matching colors; any bar whose color is not the legend’s “normal” color is an attack. A sketch of one such plot follows.
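A sketch of one such mean plot, again assuming a `train_plot` copy that retains attack_cat:

```r
train_plot %>%
  group_by(attack_cat) %>%
  summarise(mean_spkts = mean(spkts)) %>%          # mean packet count per category
  ggplot(aes(x = attack_cat, y = mean_spkts, fill = attack_cat)) +
  geom_col() +
  labs(x = "Attack category", y = "Mean source-to-destination packet count")
```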

The mean tells a slightly different story from the boxplot of the dur variable: the mean duration varies strongly across the attack categories compared to the normal network packet duration.

The mean value of the source-to-destination packet count (spkts) indicates a higher number for attacks compared to normal traffic. This suggests that the mean can be a good predictor of the attack categories.

(b) I determined if there was a significant association between the input features and the response variable. This involved running statistical tests to identify the most significant features that contribute to the model’s prediction accuracy.

The training dataset consists of 36 continuous and 5 categorical variables/features. To investigate the association between input features and labels, I performed independent t-tests for continuous variables and Chi-square tests for categorical variables. These tests yield a P-value, which indicates the statistical significance of a predictor in explaining the data and can be used to accept or reject the null hypothesis.

### t-test
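A minimal sketch of the per-feature tests, assuming `training_cont` holds the continuous predictors plus the factor `label`:

```r
num_vars <- setdiff(names(training_cont), c("attack_cat", "label"))
ttest_p <- sapply(num_vars, function(v) {
  # Welch two-sample t-test of each feature against the binary label
  t.test(training_cont[[v]] ~ training_cont$label)$p.value
})
```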

We observed that, at the 5% level of significance, all continuous features except “dur” (35 of the 36) are significantly associated with the label. Therefore, we select the features with p-values less than 0.05 for further analysis and prediction.

# Take the 35 significant features whose p-values are less than 0.05

We extracted the features with p-values less than 0.05.
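A one-line selection sketch using the p-values computed above:

```r
sig_features <- names(ttest_p)[ttest_p < 0.05]   # 35 of the 36 continuous features
```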

Apply the Chi-square test to the categorical variables to show the association between the categorical features and the study variable.
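A sketch of the loop behind the output below; `training_cate` is the name the warnings reveal:

```r
cat_vars <- c("proto", "service", "state", "is_ftp_login", "is_sm_ips_ports")
chisq_res <- t(sapply(cat_vars, function(v) {
  ct <- chisq.test(training_cate[[v]], training_cate$label)
  c(CS_test_statistic = unname(ct$statistic), pvalue_CS_test = ct$p.value)
}))
chisq_res
```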

## Warning in chisq.test(training_cate[, i], training_cate$label): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(training_cate[, i], training_cate$label): Chi-squared
## approximation may be incorrect
##                 CS_test_statistic     pvalue_CS_test
## proto                 18655.69308       0.000000e+00
## service               12441.80864       0.000000e+00
## state                 27075.54108       0.000000e+00
## is_ftp_login             24.44615       7.641607e-07
## is_sm_ips_ports        1132.55131      2.780579e-248

Below, I am going to do some encoding and data pre-processing

Encode All categorical variables and convert into numeric variables

Encode all categorical variables into numeric for the training set.
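A sketch of one common integer encoding; note that the factor levels are shared across train and test so the codes agree:

```r
for (v in c("proto", "service", "state")) {
  lev <- union(unique(train[[v]]), unique(test[[v]]))   # shared level set
  train[[v]] <- as.numeric(factor(train[[v]], levels = lev))
  test[[v]]  <- as.numeric(factor(test[[v]],  levels = lev))
}
```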

## [1] 82328    43

Encode all categorical variables into numeric for the test set.

## [1] 175335     42
  3. We performed data pre-processing to clean and scale the data.

Standardize the continuous variables for both the training and test datasets.

In this dataset, several continuous variables have different properties and are measured on different scales. To make them more comparable and homogeneous, we standardized the dataset: each feature is transformed to a common scale, here centered to mean 0 with standard deviation 1 (min-max normalization, which maps values to between 0 and 1, is a common alternative). Scaling gives equal importance to all the features, making them easier to compare and analyze, and is particularly useful for datasets with many features such as this one. By scaling the data, we prevent features with larger values from dominating the analysis and ensure that all features contribute equally to the model.
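A sketch using caret's preProcess; `cont_vars` (the continuous column names) is an assumption:

```r
pre <- preProcess(train[, cont_vars], method = c("center", "scale"))  # mean 0, sd 1
train[, cont_vars] <- predict(pre, train[, cont_vars])
test[, cont_vars]  <- predict(pre, test[, cont_vars])   # reuse training parameters
```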

## [1] 82328    42
## [1] 175335     42


  4. We evaluated the accuracy of all input features in predicting the label. I trained the models on the training dataset and used the test set to evaluate their performance.

Here the model comparison and fitting process begins. First, we train the models using all the features, and then using the 20 most significant features. Next, I fit the models and compare the classification accuracy of logistic regression and KNN.

To begin model comparison and fitting, a logistic regression model was trained using all the features in the dataset. The glmnet method was used to train the model on all 41 predictors.
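A sketch of the training call; the seed and reliance on caret's default tuning grid are assumptions:

```r
set.seed(123)
fit_lr_all <- caret::train(
  label ~ ., data = train,
  method    = "glmnet",              # penalized logistic regression
  family    = "binomial",
  trControl = trainControl(method = "cv", number = 5)
)
fit_lr_all
```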

## glmnet 
## 
## 82328 samples
##    41 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 65862, 65863, 65862, 65863, 65862 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa    
##   0.10   0.0005016511  0.9026941  0.8031727
##   0.10   0.0050165109  0.8854461  0.7678748
##   0.10   0.0501651087  0.8498203  0.6964789
##   0.55   0.0005016511  0.9053056  0.8085086
##   0.55   0.0050165109  0.8814863  0.7596450
##   0.55   0.0501651087  0.7777669  0.5407855
##   1.00   0.0005016511  0.9089861  0.8160772
##   1.00   0.0050165109  0.8779759  0.7523598
##   1.00   0.0501651087  0.7596322  0.5051752
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0005016511.

Here, we plot the cross-validation results of the logistic regression model trained on all features.

The best tuned parameters are alpha = 1 and lambda = 0.0005016511, which give an accuracy of 0.9089861. Therefore, we will use these parameters for predicting an attack.

Furthermore, by extracting the best tuned parameters of the logistic regression model, we can gain more evidence to support the assertion above.

Predict the class and its probability using the logistic regression model on the test dataset.

We created a confusion matrix summary for our logistic regression model to visualize and evaluate its accuracy, and to observe any instances of misclassification.
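A sketch of the prediction and summary, using `fit_lr_all` from the sketch above:

```r
pred_class <- predict(fit_lr_all, newdata = test)                  # hard class labels
pred_prob  <- predict(fit_lr_all, newdata = test, type = "prob")   # class probabilities
confusionMatrix(pred_class, test$label)
```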

In the summary of our model, the accuracy is 0.8526 on the test data using all variables.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0  43438  13293
##          1  12560 106044
##                                           
##                Accuracy : 0.8526          
##                  95% CI : (0.8509, 0.8542)
##     No Information Rate : 0.6806          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.662           
##                                           
##  Mcnemar's Test P-Value : 5.3e-06         
##                                           
##             Sensitivity : 0.7757          
##             Specificity : 0.8886          
##          Pos Pred Value : 0.7657          
##          Neg Pred Value : 0.8941          
##              Prevalence : 0.3194          
##          Detection Rate : 0.2477          
##    Detection Prevalence : 0.3236          
##       Balanced Accuracy : 0.8322          
##                                           
##        'Positive' Class : 0               
## 

Draw the confusion matrix plot of logistic regression on the test dataset for all features.

After analyzing the model and the confusion matrix, we observed that our model classifies attack and normal traffic with good accuracy. The numbers of false positives and false negatives are relatively small compared to the correctly classified instances. Additionally, we increased the number of cross-validation folds and found that N = 5 (five-fold cross-validation) remained the best choice.

Extract accuracy, sensitivity, specificity, positive predictive value, and negative predictive value

Check the ROC curve for all the features using the logistic regression model.
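A sketch of the ROC computation; pROC is namespaced because the AUC package masks roc()/auc():

```r
roc_lr_all <- pROC::roc(response = test$label, predictor = pred_prob[, "1"])
plot(roc_lr_all)
pROC::auc(roc_lr_all)
```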

## Area under the curve (AUC): 0.949

From the area under the ROC curve, we deduce a very encouraging classification accuracy using this model with all features.

Combine all results for the testing dataset across all predictive variables.

##           ACC_test   SE_test   SP_test  PPV_test  NPV_test
## Accuracy 0.8525508 0.7757063 0.8886096 0.7656837 0.8941014

Here, I train and fit a KNN model on all features using k = 5 and five-fold cross-validation.

# Fit a KNN-based model considering all features
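A sketch of the KNN training call; the k grid mirrors the resampling table below:

```r
set.seed(123)
fit_knn_all <- caret::train(
  label ~ ., data = train,
  method    = "knn",
  tuneGrid  = data.frame(k = c(5, 7, 9)),   # candidate neighborhood sizes
  trControl = trainControl(method = "cv", number = 5)
)
```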

## k-Nearest Neighbors 
## 
## 82328 samples
##    41 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 65862, 65863, 65862, 65863, 65862 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9351253  0.8695803
##   7  0.9336192  0.8666848
##   9  0.9334977  0.8665166
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

Plot cross-validation results of KNN model

From the plot, the optimal value for the tuned model is k = 5, which is plainly visible, with an accuracy of 0.9351253.

Also, using the function for obtaining the best parameters, we find the best tuned parameters for the KNN model on all features below.

Predict the class and its probability using KNN on the test dataset and the best tuned model.

Create the confusion matrix summary for the KNN model on all variables. From the summary, we deduced the model accuracy to be 0.8526, which appears to be the same as that of the logistic regression.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0  43438  13293
##          1  12560 106044
##                                           
##                Accuracy : 0.8526          
##                  95% CI : (0.8509, 0.8542)
##     No Information Rate : 0.6806          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.662           
##                                           
##  Mcnemar's Test P-Value : 5.3e-06         
##                                           
##             Sensitivity : 0.7757          
##             Specificity : 0.8886          
##          Pos Pred Value : 0.7657          
##          Neg Pred Value : 0.8941          
##              Prevalence : 0.3194          
##          Detection Rate : 0.2477          
##    Detection Prevalence : 0.3236          
##       Balanced Accuracy : 0.8322          
##                                           
##        'Positive' Class : 0               
## 

We draw a confusion matrix of the KNN predictions for all variables.

Confusion Matrix of K-Nearest Neighbour for all features.

The classification is very similar, with a minor difference in favor of the logistic regression. This will become apparent when we plot the ROC curve for KNN below.

Plot the ROC curve for KNN for all features

## Area under the curve (AUC): 0.937

Based on the ROC curve of the K-nearest neighbors model, we observed that its performance is very similar to that of the logistic regression model, with only minor differences. However, when we compared the two models on all features, logistic regression showed slightly better performance, with an AUC difference of only 0.012. Although I had expected KNN to perform better, we cannot manipulate the data to favor our biases.

  5. I determined how accurately the most significant features, obtained from step (b), predict intrusion. I used these features to train the model and evaluated its performance on the test set.
  6. I compared the performance of the model using the most significant features against the model using all features in the dataset to determine which performs better.

# Determine the important features from the training set and choose only the top 20 features

I extracted the 20 most important features from our first model, then trained and fitted logistic regression and KNN models on them to study which of the two is more accurate.

Extract the index and feature names.
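A sketch of the importance extraction from the all-features glmnet fit:

```r
imp   <- varImp(fit_lr_all)$importance          # scaled 0-100 importance scores
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
top20
```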

##  [1] "dttl"             "swin"             "ct_dst_src_ltm"   "is_sm_ips_ports" 
##  [5] "ct_dst_sport_ltm" "ct_state_ttl"     "dmean"            "state"           
##  [9] "ct_src_dport_ltm" "ct_srv_src"       "synack"           "service"         
## [13] "spkts"            "ct_srv_dst"       "tcprtt"           "sinpkt"          
## [17] "dload"            "dwin"             "sttl"             "ct_dst_ltm"

Extract the top 20 features from the test dataset.

Now take the top 20 most significant features obtained and implement the model to check its performance using logistic regression and K-nearest neighbors.

## Fit the LR model again for the 20 most significant features

## glmnet 
## 
## 82328 samples
##    20 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 65862, 65862, 65863, 65862, 65863 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa    
##   0.10   0.0005016511  0.8895515  0.7758404
##   0.10   0.0050165109  0.8780973  0.7523978
##   0.10   0.0501651087  0.8372120  0.6687031
##   0.55   0.0005016511  0.8907540  0.7782981
##   0.55   0.0050165109  0.8757166  0.7474437
##   0.55   0.0501651087  0.7726411  0.5306174
##   1.00   0.0005016511  0.8923331  0.7815427
##   1.00   0.0050165109  0.8761418  0.7482439
##   1.00   0.0501651087  0.7579316  0.5018521
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0005016511.

Plot cross-validation results for 20 significant features

The plot shows that the optimal parameters are alpha = 1 and lambda = 0.0005016511. Next, we check the prediction accuracy of the model.

Predict the class and its class probability on the test dataset considering the 20 significant features.

Create the confusion matrix summary for the test dataset considering the 20 significant features.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 48325 27125
##          1  7673 92212
##                                           
##                Accuracy : 0.8015          
##                  95% CI : (0.7997, 0.8034)
##     No Information Rate : 0.6806          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.582           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8630          
##             Specificity : 0.7727          
##          Pos Pred Value : 0.6405          
##          Neg Pred Value : 0.9232          
##              Prevalence : 0.3194          
##          Detection Rate : 0.2756          
##    Detection Prevalence : 0.4303          
##       Balanced Accuracy : 0.8178          
##                                           
##        'Positive' Class : 0               
## 

Here the accuracy of the model is 0.8015, which is lower than the accuracy of logistic regression using all features.

Draw the confusion matrix plot for the test dataset considering the 20 significant features.

The confusion matrix for this logistic regression model shows a significant increase in false positives and a decrease in false negatives compared to the model using all features. Overall, the model performed better when using all features rather than the reduced set. We will further check this assertion by examining the ROC curve.

This is the ROC curve for LR, considering top 20 features:

## Area under the curve (AUC): 0.925

The ROC curve shows a decrease in the AUC, which supports our conclusion that the logistic regression model using all features has higher predictive accuracy compared to the logistic regression model using only the 20 most significant features.

Extract accuracy, sensitivity, specificity, positive predictive value, and negative predictive value considering the 20 significant features.

## Combine all the results for the testing dataset considering the 20 significant features

##          ACC_test_sig SE_test_sig SP_test_sig PPV_test_sig NPV_test_sig
## Accuracy    0.8015342   0.8629772   0.7727025    0.6404904    0.9231817

## Combine all training and test results for the 20 significant features

##     LR: All Features LR: Limited (20) Features
## ACC            85.26                     80.15
## SE             77.57                     86.30
## SP             88.86                     77.27
## PPV            76.57                     64.05
## NPV            89.41                     92.32

The final model (model 4) uses KNN on the 20 most significant features.

## Fit the KNN model again for the 20 significant features

## k-Nearest Neighbors 
## 
## 82328 samples
##    20 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 65862, 65862, 65862, 65863, 65863 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9494826  0.8980221
##   7  0.9488388  0.8967735
##   9  0.9481100  0.8953205
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

Plot cross-validation results for 20 significant features

The optimal parameter is obtained when k = 5. It’s worth noting that I also conducted cross-validation for higher values of k, but the best-tuned parameter remained at k = 5, which is why I consistently used it. I excluded higher values of k because performing cross-validation up to k = 20 would take much longer and require more computational power. We will check the predictive accuracy below.

Predict the class and its class probability on the test dataset considering the 20 significant features.

Create Confusion Matrix summary for test dataset for considering 20 significant features

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 50451 27097
##          1  5547 92240
##                                          
##                Accuracy : 0.8138         
##                  95% CI : (0.812, 0.8156)
##     No Information Rate : 0.6806         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6114         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9009         
##             Specificity : 0.7729         
##          Pos Pred Value : 0.6506         
##          Neg Pred Value : 0.9433         
##              Prevalence : 0.3194         
##          Detection Rate : 0.2877         
##    Detection Prevalence : 0.4423         
##       Balanced Accuracy : 0.8369         
##                                          
##        'Positive' Class : 0              
## 

Based on the summary of the confusion matrix, we observed that the accuracy of the KNN model is higher than that of Logistic Regression for the 20 most significant features.

Draw the KNN confusion matrix plot for the test dataset considering the 20 significant features.

This is the ROC curve for KNN, considering top 20 features:

## Area under the curve (AUC): 0.889

Surprisingly, the predictive performance (AUC) of KNN with the 20 most significant features turned out to be the lowest, even lower than that of logistic regression using only the 20 features, despite KNN having a higher cross-validated model accuracy than logistic regression.

Extract accuracy, sensitivity, specificity, positive predictive value, and negative predictive value considering the 20 significant features.

Combine all the results for the testing dataset considering the 20 significant features.

##          ACC_test_sig_knn SE_test_sig_knn SP_test_sig_knn PPV_test_sig_knn
## Accuracy        0.8138193       0.9009429       0.7729371        0.6505777
##          NPV_test_sig_knn
## Accuracy        0.9432747

Combine all LR and KNN results for the 20 significant features.

##     LR: All Features KNN: Limited (20) Features
## ACC            85.26                      81.38
## SE             77.57                      90.09
## SP             88.86                      77.29
## PPV            76.57                      65.06
## NPV            89.41                      94.33
  7. I also compared the logistic regression model to the K-nearest neighbors model to determine which model performs better at classifying attacks and intrusions.

Finally, we plotted the ROC curves of all four models together to determine which model performs best, comparing all features against the 20 most predictive features.

Now we compare the ROC curves of LR and KNN for all features and for only the 20 features.

## Area under the curve (AUC): 0.949
## Area under the curve (AUC): 0.937
## Area under the curve (AUC): 0.925
## Area under the curve (AUC): 0.889

Conclusion

The final ROC curves for the four models show that the logistic regression model with all features has the highest predictive accuracy, with an AUC of 0.949. We also observed that the predictive accuracy with the 20 most significant features is low compared to that with all features. Within each feature set, logistic regression performed better, both with the 20 features and with all features. In conclusion, the logistic regression model is better at predicting an attack, with very high accuracy on unseen data in the UNSW_NB15 dataset.

Impact of my analysis:

The impact of this analysis is that it provides insights into the performance of different machine learning models, specifically logistic regression and K-nearest neighbors (KNN), at predicting attacks in the UNSW_NB15 dataset. The analysis shows that logistic regression with all features has the highest predictive accuracy, while KNN with the 20 most significant features has the lowest. This information can be useful for organizations and security professionals looking to implement machine-learning-based intrusion detection systems: by understanding the strengths and weaknesses of different models, they can make informed decisions about which models to use and how to configure them for optimal performance. I also learned that the most predictive feature is dttl, followed by sttl. I am not surprised, because I had already deduced this from the distribution of the two variables in the data exploration, i.e., their mean variation. Finally, I learned that this model can be used to predict intrusions and can serve as a test benchmark for real-world security or intrusion detection designs.

Trade-off

There is a trade-off between predictive accuracy and the computational resources required to train and test the models. Some models, such as K-nearest neighbors, can be computationally expensive when working with large datasets or high-dimensional feature spaces, as in this dataset. Additionally, more complex models such as deep neural networks may require significant computational power and time to train. Therefore, choosing the right model for a given problem involves a trade-off between accuracy and resource requirements. There is also a possibility that a model has inherent biases that lead to unfair predictions. This can be observed with KNN on the 20 most predictive features, where the model has higher accuracy than logistic regression but the predictive performance (AUC) is the reverse.