library(knitr)
knitr::opts_chunk$set(cache=FALSE,message=FALSE)
# error=FALSE stops knitting at the first error; error=TRUE reports errors and continues
knitr::opts_chunk$set(error=TRUE)
knitr::opts_knit$set(progress=FALSE)
library(h2o)
h2o.init(min_mem_size = "16g")
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/RtmpusjwGc/h2o_rstudio_started_from_r.out
##     /tmp/RtmpusjwGc/h2o_rstudio_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: .. Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 40 milliseconds 
##     H2O cluster timezone:       Etc/UTC 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.18.0.7 
##     H2O cluster version age:    11 days  
##     H2O cluster name:           H2O_started_from_R_rstudio_ymr419 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   15.33 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.0 (2017-04-21)

Directory set up

Source Functions File

# While a single notebook is in some ways preferable for reproducible code, once the file became unmanageable it was wiser to split it into separate function files. For larger works this is a much cleaner approach; chunk-style notebooks for development suit only small works.

#source("networkAnomaliesFunctions.R")

install_latest_h2o <- function() {
  if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
  if ("h2o" %in% rownames(installed.packages())) {remove.packages("h2o") }
  
  pkgs <- c("RCurl","jsonlite")
  for (pkg in pkgs) {
    if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }}
  
  install.packages("h2o", type="source",
                   repos=(c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
  
  # If clustering, all versions should match. Install specific h2o version to match cluster nodes.
  # install.packages(file.path(h2oDir,"R/h2o_3.18.0.5.tar.gz"), repos = NULL, type = "source")
}

h2o.importFileFromProccesedOrData <- function (dataDir,fileName,dest_frame) {
  # "Here is one I prepared earlier!"
  # Load the preprocessed file if it exists,
  # otherwise load the raw file from the given dataDir.
  filePath <- file.path(dataDir,"processed", fileName)
  if (file.exists(filePath)) {
    print(paste("Reading processed file from ",filePath))
  } else {
    filePath <- file.path(dataDir, fileName)
    print(paste("Reading data file from ",filePath))
  }
  # dest_frame names the resulting frame on the H2O cluster.
  return( h2o.importFile(path=filePath, destination_frame=dest_frame, sep=",",na.strings="-") )
}

exportPreProcessedFiles <- function(Dir="data", processedDir="processed" ){
  file_name="trainWithLabels.csv"
  try(dir.create(file.path(Dir, processedDir), showWarnings = FALSE))
  file_path= file.path(Dir, processedDir,file_name)
  if (file.exists(file_path)) {print("preprocessed file exists")
  } else {
    try(h2o.exportFile(h2o.data.all,path = file_path))
  }
  file_name="testWithLabels.csv"
  file_path= file.path(Dir, processedDir,file_name)
  if (file.exists(file_path)) {print("preprocessed file exists")
  } else {
    try(h2o.exportFile(h2o.data.test,path = file_path))
  } 
  print("Pre processed files Saved For next Time.")
}

copyFromS3 <- function(bucket=bucketname,filen=file_path) {
  from_path <- file.path("s3:/",bucket,filen)
  to_path <- file.path('.',filen)
  command <- paste("aws s3 cp ",from_path," ",to_path,sep='')
  system(command)
}

copyToS3 <- function(bucket=bucketname,filen=filename) {
  from_path <- filen
  to_path <- file.path("s3:/",bucket,filen)
  command <- paste("aws s3 cp ",from_path," ",to_path,sep='')
  system(command)
}

NewestFileInDir <- function(pathDir=".",pattern=NULL) {
  # Return the most recently modified file in pathDir whose name matches
  # the regular expression `pattern` (NULL matches every file).
  files <- list.files(pathDir, pattern, full.names = TRUE)
  details <- file.info(files)
  details <- details[order(details$mtime, decreasing = TRUE), ]
  newestfile <- rownames(details)[1]
  return(newestfile)
}

Note - This analysis was implemented using h2o version 3.18.0.5 on RStudio 1.1.442. An installation of h2o is needed to run all of the functionality. Knitr does not always render the output of h2o calls as it appears on screen, so this notebook may not look the same as when running the program.

Install necessary R Packages

# Load Packages
pkgs <- c(
           "knitr",
           "rmarkdown",
           "tictoc",
           "pROC",
           "ggridges",
           "beepr",
           "data.table",
           "lemon",
           "iptools",
           "caret",
           "h2o",
           "kableExtra",
           "gbm",
           "tidyr",
           "dplyr",
           "ggplot2",
           "scales",
           "readr"
          # "ff",
          # "statmod",
          # "stats",
          # "graphics",
          # "jsonlite",
          # "tools",
          # "utils",
          # "tidyverse",
          # "devtools",
          # "RCurl",
          # "methods",
          # "xgboost",
          # "randomForest",
)

# Load each package, installing it first if it is missing
for (pkg in pkgs) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

options("h2o.use.data.table"=TRUE)
knit_print.data.frame <- lemon_print
h2oSourceDir <- "C:/Users/D/Downloads/installers/h2o-3.18.0.5"
## install_latest_h2o () 

Description

Network flow: A network flow is a network traffic stream defined by a common set of identifiers, typically the same combination of source IP address, source port, destination IP address, destination port and protocol.

If any of these variables change then a new flow is started. As a result, it's possible to have several flows between the same client and server nodes at a given time. In addition to the above primary features, as shown in your dataset, several other useful features can be defined to describe the characteristics of a network flow.
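To make the definition concrete, here is a minimal sketch in base R that groups packets into flows. It assumes the usual five identifiers (source IP, source port, destination IP, destination port, protocol); the `packets` data frame and its values are invented for illustration and are not part of the dataset.

```r
# Hypothetical packet log: each row is one packet (values are made up).
packets <- data.frame(
  srcip  = c("10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"),
  sport  = c(40000, 40000, 40001, 40000),
  dstip  = c("10.0.0.9", "10.0.0.9", "10.0.0.9", "10.0.0.9"),
  dsport = c(80, 80, 80, 80),
  proto  = c("tcp", "tcp", "tcp", "tcp"),
  bytes  = c(100, 250, 60, 500),
  stringsAsFactors = FALSE
)

# Packets that share all five identifiers belong to the same flow;
# summing their bytes gives one per-flow feature (like sbytes).
flows <- aggregate(bytes ~ srcip + sport + dstip + dsport + proto,
                   data = packets, FUN = sum)
names(flows)[names(flows) == "bytes"] <- "sbytes"
print(flows)
```

Changing only the source port (40000 to 40001) starts a new flow, which is why the same client and server can have several flows at once.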

Dataset: This third-party dataset includes network flow data produced by nine different attack types within realistic normal user activity. The following attack types are included in the dataset.

  1. Reconnaissance: a probe in which the attacker gathers information about a computer network in order to evade its security controls.
  2. Fuzzers: the attacker attempts to discover security loopholes in a program, operating system, or network by feeding it massive amounts of random data to make it crash.
  3. Analysis: a type of intrusion that penetrates web applications via ports (e.g., port scans), emails (e.g., spam), and web scripts (e.g., HTML files).
  4. Backdoor: a stealthy technique that bypasses normal authentication in order to secure unauthorised remote access to a device.
  5. Exploit: a sequence of instructions that takes advantage of a bug or vulnerability to cause unintended or unanticipated behaviour on a host or network.
  6. Generic: a technique that works against every block cipher, using a hash function to cause a collision, regardless of the configuration of the block cipher.
  7. Shellcode: the attacker injects a small piece of code, starting from a shell, to control the compromised machine.
  8. Worm: the attack replicates itself in order to spread to other computers. Often it uses a computer network to spread, relying on security failures on the target computer to gain access.
  9. DoS: an intrusion that keeps computer resources, often memory, extremely busy in order to prevent authorised requests from accessing a device.

GOALS:

G1. Employ (or develop) an unsupervised machine learning technique to spot malicious network flows in the dataset. Then extend your analysis to divide the malicious group into nine different clusters (groups). At this stage you may not be able to assign labels to each group you obtain.

G2. Employ a one-class modelling approach (including novelty/anomaly detection) to spot malicious network flows in the dataset. The idea here is to build the model using only benign samples and then use the trained model to identify new or unknown data that it was not trained on and was not previously aware of, with the help of either statistical or machine-learning-based approaches.

G3. Employ supervised learning techniques to spot malicious network flows in the dataset. Labels for each class will be provided at this stage.

Data description:
Train.csv - the training set,
Test.csv - the test set and
Features.txt - description of each feature defined in the dataset.

Evaluation metric: the accuracy (true/false positive rates) and computational cost for testing.

Submission format: Submission files should contain two columns: Flow_id and Label.
The values in each row are separated by a comma (",").
Labels contain benign connection (0) and malicious connection (1) for G1 and G2.
For G3, labels contain benign connection (0) and nine types of attacks (1-9).
The file should contain a header and may have the following format:

Flow_id,Label
1,1
2,1
3,2
4,0
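
As a rough sketch of what the brief asks for, the following base R snippet computes true and false positive rates and writes a file in the submission format. The predictions and ground truth are made up, and `detection_rates` is a hypothetical helper, not something supplied with the brief.

```r
# Hypothetical predictions for four flows (0 = benign, 1 = malicious).
submission <- data.frame(Flow_id = 1:4, Label = c(1L, 1L, 0L, 0L))
truth <- c(1L, 0L, 0L, 1L)  # made-up ground truth, for illustration only

# True positive rate: share of malicious flows that were flagged.
# False positive rate: share of benign flows that were flagged.
detection_rates <- function(truth, predicted) {
  tp <- sum(predicted == 1 & truth == 1)
  fp <- sum(predicted == 1 & truth == 0)
  c(tpr = tp / sum(truth == 1), fpr = fp / sum(truth == 0))
}
rates <- detection_rates(truth, submission$Label)
print(rates)

# Write a comma-separated file with a header and no row names or quotes.
out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE, quote = FALSE)
readLines(out_file)
```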

20th March 2018


Approach

A challenging brief: use unsupervised anomaly detection to spot network attacks.

While decision tree classifiers can quite easily be used to build models in a supervised fashion, I decided to tackle the unsupervised learning challenge, using Deep Learning autoencoders to attempt to spot anomalies in the data flow.
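
The idea behind autoencoder anomaly detection is reconstruction error: fit a model that compresses and rebuilds benign data, then flag records it rebuilds badly. The sketch below illustrates only that principle, and it swaps in a linear stand-in (PCA via base R `prcomp`) for the deep autoencoder so that it runs without an H2O cluster. The data, the 99th-percentile threshold and the helper `reconstruction_mse` are assumptions made for illustration. In h2o the corresponding calls would be `h2o.deeplearning(..., autoencoder = TRUE)` and `h2o.anomaly()`.

```r
set.seed(42)

# Synthetic "benign" data (made up): two strongly correlated features.
n <- 500
x1 <- rnorm(n)
benign <- cbind(x1 = x1, x2 = 2 * x1 + rnorm(n, sd = 0.1))

# Fit a one-component linear encoder on benign data only.
pca <- prcomp(benign, center = TRUE, scale. = TRUE)
k <- 1

# Encode to k components, decode back, and measure the squared error per row.
reconstruction_mse <- function(pca, newdata, k) {
  scaled <- scale(newdata, center = pca$center, scale = pca$scale)
  codes  <- scaled %*% pca$rotation[, 1:k, drop = FALSE]    # encode
  recon  <- codes %*% t(pca$rotation[, 1:k, drop = FALSE])  # decode
  rowMeans((scaled - recon)^2)
}

# Assumed rule: anything above the 99th percentile of benign error is anomalous.
threshold <- quantile(reconstruction_mse(pca, benign, k), 0.99)

new_flows <- rbind(c(x1 = 1, x2 = 2),    # follows the benign pattern
                   c(x1 = 1, x2 = -2))   # breaks it
errors <- reconstruction_mse(pca, new_flows, k)
is_anomaly <- errors > threshold
print(is_anomaly)
```

A deep autoencoder plays the same role as the PCA here but can learn non-linear structure, which is why it is the better fit for the mixed network flow features.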

Implementing this on h2o running on an Amazon Web Services virtual machine sped up model processing, but brought a host of configuration challenges of its own.

Let's examine the data set.


Inspect the data features

features  <- read.csv("Features.txt" ,sep=",",na.strings="-")
features
| Name | DataType | Description |
|---|---|---|
| srcip | nominal | Source IP address |
| sport | integer | Source port number |
| dstip | nominal | Destination IP address |
| dsport | integer | Destination port number |
| proto | nominal | Transaction protocol |
| state | nominal | Indicates to the state and its dependent protocol |
| dur | Float | Record total duration |
| sbytes | Integer | Source to destination transaction bytes |
| dbytes | Integer | Destination to source transaction bytes |
| sttl | Integer | Source to destination time to live value |
| dttl | Integer | Destination to source time to live value |
| sloss | Integer | Source packets retransmitted or dropped |
| dloss | Integer | Destination packets retransmitted or dropped |
| service | nominal | http, ftp, smtp, ssh, dns, ftp-data, irc |
| Sload | Float | Source bits per second |
| Dload | Float | Destination bits per second |
| Spkts | integer | Source to destination packet count |
| Dpkts | integer | Destination to source packet count |
| swin | integer | Source TCP window advertisement value |
| dwin | integer | Destination TCP window advertisement value |
| stcpb | integer | Source TCP base sequence number |
| dtcpb | integer | Destination TCP base sequence number |
| smeansz | integer | Mean of the flow packet size transmitted by the src |
| dmeansz | integer | Mean of the flow packet size transmitted by the dst |
| trans_depth | integer | Represents the pipelined depth into the connection of http request/response transaction |
| res_bdy_len | integer | Actual uncompressed content size of the data transferred from the server’s http service. |
| Sjit | Float | Source jitter (mSec) |
| Djit | Float | Destination jitter (mSec) |
| Stime | Timestamp | record start time |
| Ltime | Timestamp | record last time |
| Sintpkt | Float | Source interpacket arrival time (mSec) |
| Dintpkt | Float | Destination interpacket arrival time (mSec) |
| tcprtt | Float | TCP connection setup round-trip time, the sum of synack and ackdat |
| synack | Float | TCP connection setup time, the time between the SYN and the SYN_ACK packets. |
| ackdat | Float | TCP connection setup time, the time between the SYN_ACK and the ACK packets. |
| is_sm_ips_ports | Binary | If source and destination IP addresses equal and port numbers equal then, this variable takes value 1 else 0 |
| ct_state_ttl | Integer | No. for each state according to specific range of values for source/destination time to live. |
| ct_flw_http_mthd | Integer | No. of flows that has methods such as Get and Post in http service. |
| is_ftp_login | Binary | If the ftp session is accessed by user and password then 1 else 0. |
| ct_ftp_cmd | integer | No of flows that has a command in ftp session. |
| ct_srv_src | integer | No. of connections that contain the same service and source address in 100 connections according to the last time. |
| ct_srv_dst | integer | No. of connections that contain the same service and destination address in 100 connections according to the last time. |
| ct_dst_ltm | integer | No. of connections of the same destination address in 100 connections according to the last time. |
| ct_src_ltm | integer | No. of connections of the same source address in 100 connections according to the last time. |
| ct_src_dport_ltm | integer | No of connections of the same source address and the destination port in 100 connections according to the last time. |
| ct_dst_sport_ltm | integer | No of connections of the same destination address and the source port in 100 connections according to the last time. |
| ct_dst_src_ltm | integer | No of connections of the same source and the destination address in in 100 connections according to the last time. |

# Load some test data - standard read
# Short test read
if(!exists("dfL1000")) {
  try(dfL1000 <- read.csv("data/trainWithLabels.csv",nrows = 1000,sep=",",na.strings="-"))
} else{print("dfL1000 Already loaded")}
col_names <- names(dfL1000)
feature_names <- setdiff(col_names, c("Label","attack_cat"))
tic()
if(!exists("train_data_all")) {
  print("Loading train_data csv.")
  train_data_all <- read.csv(train_data_file ,sep=",",na.strings="-")
}else{
  print("train_data_all already loaded")
} 
## [1] "Loading train_data csv."
toc()
## 85.009 sec elapsed

A regular read of the training data file took about 200 seconds on a standard PC; it is much faster on AWS.

tic()
if (!exists("test_data")) {
  print("Loading test_data csv.")
  test_data <- read.csv(test_data_file ,sep=",",na.strings="-")
}else
  {print("test_data Already loaded")
}
## [1] "Loading test_data csv."
toc()
## 36.216 sec elapsed
# Was 105 sec on PC, 27 sec on AWS.

A regular read of the test data file took 105 seconds on a standard PC. On the AWS EC2 virtual machine the read took around a quarter of that. Model processing times should be similarly faster.


Dealing with a large data set

Inspecting the files showed that train.csv and the first 47 columns of trainWithLabels.csv are identical. As I was facing PC memory issues with the large data files, I removed the code that loaded the train.csv and test.csv data sets. We simply take the first 47 columns of the WithLabels versions to avoid duplication in memory.

By referencing the columns we want to use using the setdiff function, we ensure that the same code will handle processing of the 47 column train.csv file and the 49 column trainWithLabels.csv without modification.
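
A small illustration of why `setdiff` makes the code indifferent to the label columns. The three-column name vectors are shortened stand-ins for the real 47 and 49 column headers.

```r
# Shortened stand-ins for the real column names.
cols_unlabelled <- c("srcip", "sport", "dur")                  # like train.csv
cols_labelled   <- c(cols_unlabelled, "attack_cat", "Label")   # like trainWithLabels.csv

label_cols <- c("Label", "attack_cat")

# setdiff drops the label columns when present and is a no-op when they are absent.
features_a <- setdiff(cols_unlabelled, label_cols)
features_b <- setdiff(cols_labelled, label_cols)

identical(features_a, features_b)
```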


File splitting approach

A first approach to dealing with the memory-hungry size of the data set was to use raw file-splitting tools:

CSVSplitter filename="data/trainWithLabels.csv" outputfolder="data/splitdata" rowcount=100000 firstrowheader=1 repeatheader=1

which will split the csv file into smaller files to process with model checkpoints.

This was a decent approach, however I researched better alternatives and decided to implement in h2o.
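
For reference, the same kind of split can be sketched in base R without an external tool. `split_csv` is a hypothetical helper written for this note, and it assumes no field contains an embedded newline, since it splits on raw lines.

```r
# Split a CSV into files of at most rows_per_file data rows,
# repeating the header line at the top of every part.
split_csv <- function(path, out_dir, rows_per_file = 100000) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  part <- 0
  out_files <- character(0)
  repeat {
    chunk <- readLines(con, n = rows_per_file)
    if (length(chunk) == 0) break
    part <- part + 1
    out_path <- file.path(out_dir, sprintf("part_%03d.csv", part))
    writeLines(c(header, chunk), out_path)
    out_files <- c(out_files, out_path)
  }
  out_files
}

# Usage on a small made-up file: 25 rows split into parts of 10.
src <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, value = (1:25)^2), src, row.names = FALSE)
parts <- split_csv(src, file.path(tempdir(), "splitdata"), rows_per_file = 10)
print(basename(parts))
```

Reading through a connection keeps only one chunk in memory at a time, which is the point of splitting in the first place.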


H2O

H2O is open-source software for big-data analysis. I installed the H2O platform, which allows faster in-memory processing of big data sets, to better handle the large dataset. H2O runs outside of R, and R connects to its service for processing.


After initially installing on a cluster and running models on 3 of my own local PCs, I decided to go further and took a deeper than expected dive into creating and configuring a virtual machine. My adventures in installing and configuring an Amazon Web Services Elastic Compute Cloud (AWS EC2 - for the acronymically inclined) Linux virtual machine running R, RStudio Server and an h2o server are perhaps beyond the scope of this report. Getting everything to mesh and install was challenging but ultimately rewarding and worthwhile.

Elastic Compute Cloud is a web service that provides resizable compute capacity in the cloud, designed to make web-scale cloud computing easier for developers.

The processing power will allow the further build out of the trained models.

h2o.start <- function(type='local') {
  # Start H2O locally, or connect to a local cluster or a remote AWS server.
  library(h2o)
  if (type %in% c('cluster', 'AWS')) {
    # The launcher batch files live in the H2O install directory (h2oSourceDir);
    # restore the previous working directory when the function exits.
    old_wd <- setwd(h2oSourceDir)
    on.exit(setwd(old_wd))
  }
  if (type=='cluster') {
    # Requires configuration files loaded by the batch file on each local cluster machine.
    system("starth2oname.bat")
    h2o.connect(ip="192.168.0.10",port=54321)
  } else if (type=='AWS') {
    # Connecting to the remote AWS server requires the server to be launched from the config file.
    system("starth2oAWS.bat")
    h2o.connect(ip=aws_ip,port=54321)
  } else {
    h2o.init(min_mem_size = "16g")
  }
} #endfunction h2o.start

h2o.start('local')
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 minutes 7 seconds 
##     H2O cluster timezone:       Etc/UTC 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.18.0.7 
##     H2O cluster version age:    11 days  
##     H2O cluster name:           H2O_started_from_R_rstudio_ymr419 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   15.16 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.0 (2017-04-21)
Sys.sleep(3)
# h2o.start('cluster')
#h2o.start('AWS')
#h2o.shutdown()

Loading the test data into h2o on an AWS VM is almost 3 times faster again, having gone from 200 seconds on a standalone PC to just over 30 seconds.

H2o Loading of data files.

h2o.loadData <- function(dataDirectory) {
  tic()
  if (!exists("h2o.data.all")){
    h2o.data.all <- h2o.importFileFromProccesedOrData(dataDirectory,"trainWithLabels.csv","train")
    }
  toc()
  assign("h2o.data.all", h2o.data.all, envir = .GlobalEnv)
  tic()
  if (!exists("h2o.data.test")){
    h2o.data.test <- h2o.importFileFromProccesedOrData(dataDirectory,"testWithLabels.csv","test")
  }
  toc()
   assign("h2o.data.test", h2o.data.test, envir = .GlobalEnv)
  }
h2o.loadData(dataDir)
## [1] "Reading processed file from  data/processed/trainWithLabels.csv"
## 
## 12.535 sec elapsed
## [1] "Reading processed file from  data/processed/testWithLabels.csv"
## 
## 6.034 sec elapsed
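# Label appears to be numeric at this point, so compare against the number 1.
# Comparing against the string '1' matched no rows and reported 0 % malicious;
# the summary of Label further down shows a mean of 0.0715, i.e. roughly 7 %.
numMalicious <- h2o.sum(h2o.data.all$Label == 1)
numBenign <- h2o.sum(h2o.data.all$Label == 0)
rateMalicious <- numMalicious/h2o.nrow(h2o.data.all)
cat(rateMalicious*100,"% of samples are malicious")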

Check levels of Attack_Cat in Training Data

h2o.levels(h2o.data.all$attack_cat)
##  [1] "Analysis"       "Backdoor"       "DoS"            "Exploits"      
##  [5] "Fuzzers"        "Generic"        "NA"             "Reconnaissance"
##  [9] "Shellcode"      "Worms"

Before any preprocessing there appear to be two categories, "Backdoor" and "Backdoors". We will merge these into a single category for Backdoor attacks. On subsequent runs, which read the preprocessed file, we should see just one "Backdoor" category.

Preprocessing Data.

Encode IPs & Labels.
IP addresses are encoded as factors, which implies that all are equally dissimilar. However, numerically close IP addresses may carry some information (for example, hosts in the same address range), so we will encode them as numeric instead.
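
To show what the numeric encoding means, here is a base R sketch of the dotted-quad to number conversion. `ip_to_number` is a hypothetical helper written for this note; the notebook itself relies on `ip_to_numeric` from the iptools package. The sketch assumes well-formed IPv4 strings and does no validation.

```r
# Treat the four octets as digits of a base-256 number.
ip_to_number <- function(ip) {
  octets <- strsplit(ip, ".", fixed = TRUE)
  vapply(octets, function(o) sum(as.numeric(o) * 256^(3:0)), numeric(1))
}

ips <- c("59.166.0.0", "59.166.0.1", "149.171.126.4")
nums <- ip_to_number(ips)

# Neighbouring hosts end up one apart; distant networks end up far apart.
diff(nums[1:2])
abs(nums[3] - nums[1]) > abs(nums[2] - nums[1])
```

As a factor, all three addresses would be equally different from one another. As numbers, the first two are adjacent while the third is far away.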

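h2o_ip_to_numeric <- function(ipcol){
  # Remember whether the input was an H2OFrame before converting it,
  # so the result can be returned in the same form.
  was_h2o <- inherits(ipcol, "H2OFrame")
  if (was_h2o) { ipcol <- as.vector(ipcol) }
  ipcol <- ip_to_numeric(as.character(ipcol))
  if (was_h2o) { ipcol <- as.h2o(ipcol) }
  return (ipcol)
}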
fixDataTypes <- function(df) {
  # Convert the IP address columns to numeric, unless they already are
  # (as in the preprocessed files).
  for (ipcol in c("dstip", "srcip")) {
    print(paste("encoding", ipcol))
    if (!h2o.isnumeric(df[, ipcol])) {
      dfcol <- as.data.frame(df[, ipcol])
      dfcolconverted <- ip_to_numeric(as.character(dfcol[, 1]))
      df[, ipcol] <- as.h2o(dfcolconverted)
      print(df[3:6, ipcol])
    }
  }

  print("Encoding sport")
  df$sport <- h2o.asnumeric(df$sport)
  print(df$sport[3:6])

  print("Encoding Label as factor")
  if ("Label" %in% names(df)){ df$Label <- as.factor(df$Label)  }

  print("Encoding Attack Cat")
  if ("attack_cat" %in% names(df)){
    # Merge the duplicate "Backdoors" level into "Backdoor".
    df$attack_cat <- h2o.ascharacter(df$attack_cat)
    df$attack_cat[df$attack_cat=="Backdoors"] <- "Backdoor"
    df$attack_cat <- h2o.asfactor(df$attack_cat)
    print(h2o.table(df$attack_cat))
  }

  return(df)
}

print("Preprocess training Data")
## [1] "Preprocess training Data"
response_name <- 'Label'
response <- 'Label'
response_vars <- c("Label","attack_cat")
skipped_vars <- c("srcip","dstip","Ltime","Stime")
predictor_vars <- setdiff(names(h2o.data.all), c(response_vars,skipped_vars))
print("Preprocess test Data")
## [1] "Preprocess test Data"
tic()
h2o.data.all <- fixDataTypes(h2o.data.all)
## [1] "encoding dstip"
## [1] "encoding srcip"
## [1] "Encoding sport"
##   sport
## 1  1464
## 2  3593
## 3 49664
## 4 32119
## 
## [4 rows x 1 column] 
## [1] "Encoding Label as factor"
## [1] "Encoding Attack Cat"
##   attack_cat Count
## 1   Analysis  1218
## 2   Backdoor   939
## 3        DoS  6091
## 4   Exploits 18102
## 5    Fuzzers 10912
## 6    Generic 65483
## 
## [10 rows x 2 columns]
h2o.data.test <- fixDataTypes(h2o.data.test)
## [1] "encoding dstip"
## [1] "encoding srcip"
## [1] "Encoding sport"
##   sport
## 1  1043
## 2  1043
## 3     0
## 4     0
## 
## [4 rows x 1 column] 
## [1] "Encoding Label as factor"
## [1] "Encoding Attack Cat"
##   attack_cat  Count
## 1   Analysis   1459
## 2   Backdoor   1390
## 3        DoS  10262
## 4   Exploits  26423
## 5    Fuzzers  13334
## 6    Generic 149998
## 
## [10 rows x 2 columns]
toc()
## 8.755 sec elapsed

Inspecting data Values

summary(train_data_all)
##         srcip            sport                   dstip       
##  59.166.0.0:134436   1043   :  66843   149.171.126.4:133914  
##  59.166.0.2:134223   47439  :  63285   149.171.126.3:133839  
##  59.166.0.4:133701   0      :  26483   149.171.126.2:133816  
##  59.166.0.5:133592   138    :   1140   149.171.126.1:133719  
##  59.166.0.1:133490   80     :    374   149.171.126.0:133653  
##  59.166.0.3:132621   53     :    218   149.171.126.5:133289  
##  (Other)   :721965   (Other):1365685   (Other)      :721798  
##      dsport              proto            state             dur      
##  Min.   :        0   tcp    :992483   FIN    :981503   Min.   :   0  
##  1st Qu.:       53   udp    :504825   CON    :378233   1st Qu.:   0  
##  Median :      143   arp    :  7634   INT    :158493   Median :   0  
##  Mean   :    13296   unas   :  6362   REQ    :  5112   Mean   :   1  
##  3rd Qu.:    20472   ospf   :  4555   RST    :   220   3rd Qu.:   0  
##  Max.   :538989345   sctp   :   556   CLO    :   161   Max.   :8787  
##  NA's   :7           (Other):  7613   (Other):   306                 
##      sbytes             dbytes              sttl            dttl    
##  Min.   :       0   Min.   :       0   Min.   :  0.0   Min.   :  0  
##  1st Qu.:     264   1st Qu.:     178   1st Qu.: 31.0   1st Qu.: 29  
##  Median :    1684   Median :    2468   Median : 31.0   Median : 29  
##  Mean   :    4591   Mean   :   41467   Mean   : 48.8   Mean   : 31  
##  3rd Qu.:    3614   3rd Qu.:   16734   3rd Qu.: 31.0   3rd Qu.: 29  
##  Max.   :14355774   Max.   :14657531   Max.   :255.0   Max.   :254  
##                                                                     
##      sloss          dloss          service           Sload           
##  Min.   :   0   Min.   :   0   dns     :365313   Min.   :         0  
##  1st Qu.:   0   1st Qu.:   0   http    :128630   1st Qu.:     99483  
##  Median :   3   Median :   5   ftp-data: 81646   Median :    554080  
##  Mean   :   6   Mean   :  19   smtp    : 52142   Mean   :  20370332  
##  3rd Qu.:   7   3rd Qu.:  15   ftp     : 32369   3rd Qu.:   1429449  
##  Max.   :5319   Max.   :5507   (Other) : 32906   Max.   :5468000256  
##                                NA's    :831022                       
##      Dload               Spkts           Dpkts            swin    
##  Min.   :        0   Min.   :    0   Min.   :    0   Min.   :  0  
##  1st Qu.:    66242   1st Qu.:    2   1st Qu.:    2   1st Qu.:  0  
##  Median :   649043   Median :   14   Median :   18   Median :255  
##  Mean   :  2754003   Mean   :   37   Mean   :   48   Mean   :166  
##  3rd Qu.:  3631981   3rd Qu.:   48   3rd Qu.:   46   3rd Qu.:255  
##  Max.   :128761904   Max.   :10646   Max.   :11018   Max.   :255  
##                                                                   
##       dwin         stcpb                dtcpb               smeansz    
##  Min.   :  0   Min.   :         0   Min.   :         0   Min.   :   0  
##  1st Qu.:  0   1st Qu.:         0   1st Qu.:         0   1st Qu.:  61  
##  Median :255   Median : 991503414   Median : 990879062   Median :  73  
##  Mean   :166   Mean   :1395989787   Mean   :1395608278   Mean   : 126  
##  3rd Qu.:255   3rd Qu.:2642617073   3rd Qu.:2642564840   3rd Qu.: 131  
##  Max.   :255   Max.   :4294953347   Max.   :4294953724   Max.   :1504  
##                                                                        
##     dmeansz      trans_depth      res_bdy_len           Sjit        
##  Min.   :   0   Min.   :  0.00   Min.   :      0   Min.   :      0  
##  1st Qu.:  80   1st Qu.:  0.00   1st Qu.:      0   1st Qu.:      0  
##  Median : 103   Median :  0.00   Median :      0   Median :     24  
##  Mean   : 311   Mean   :  0.09   Mean   :   4652   Mean   :   1630  
##  3rd Qu.: 565   3rd Qu.:  0.00   3rd Qu.:      0   3rd Qu.:    578  
##  Max.   :1500   Max.   :131.00   Max.   :5242880   Max.   :1460480  
##                                                                     
##       Djit            Stime                Ltime           
##  Min.   :     0   Min.   :1421927377   Min.   :1421927414  
##  1st Qu.:     0   1st Qu.:1421943357   1st Qu.:1421943357  
##  Median :    16   Median :1421958678   Median :1421958679  
##  Mean   :   829   Mean   :1422602492   Mean   :1422602493  
##  3rd Qu.:    72   3rd Qu.:1424223184   3rd Qu.:1424223184  
##  Max.   :781221   Max.   :1424233663   Max.   :1424233663  
##                                                            
##     Sintpkt         Dintpkt          tcprtt          synack      
##  Min.   :    0   Min.   :    0   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:    0   1st Qu.:    0   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :    1   Median :    1   Median :0.001   Median :0.0005  
##  Mean   :  240   Mean   :   95   Mean   :0.004   Mean   :0.0023  
##  3rd Qu.:    9   3rd Qu.:    8   3rd Qu.:0.001   3rd Qu.:0.0006  
##  Max.   :60010   Max.   :59485   Max.   :3.864   Max.   :2.1049  
##                                                                  
##      ackdat      is_sm_ips_ports   ct_state_ttl  ct_flw_http_mthd
##  Min.   :0.000   Min.   :0.0000   Min.   :0.00   0      :986791  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00   NA     :404027  
##  Median :0.000   Median :0.0000   Median :0.00   1      :119301  
##  Mean   :0.002   Mean   :0.0021   Mean   :0.15   6      :  7880  
##  3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.00   4      :  3918  
##  Max.   :3.519   Max.   :1.0000   Max.   :6.00   3      :   735  
##                                                  (Other):  1376  
##  is_ftp_login   ct_ftp_cmd        ct_srv_src      ct_srv_dst   
##  0 :1066593   0      :1066498   Min.   : 1.00   Min.   : 1.00  
##  1 :  27716   NA     : 429667   1st Qu.: 2.00   1st Qu.: 2.00  
##  2 :     10   1      :  24404   Median : 5.00   Median : 5.00  
##  4 :     42   2      :   1244   Mean   : 7.54   Mean   : 7.29  
##  NA: 429667   4      :    846   3rd Qu.: 9.00   3rd Qu.: 9.00  
##               3      :    729   Max.   :63.00   Max.   :62.00  
##               (Other):    640                                  
##    ct_dst_ltm     ct_src_ltm   ct_src_dport_ltm ct_dst_sport_ltm
##  Min.   : 1.0   Min.   : 1.0   Min.   : 1.00    Min.   : 1.00   
##  1st Qu.: 2.0   1st Qu.: 2.0   1st Qu.: 1.00    1st Qu.: 1.00   
##  Median : 3.0   Median : 4.0   Median : 1.00    Median : 1.00   
##  Mean   : 5.1   Mean   : 5.6   Mean   : 3.02    Mean   : 2.35   
##  3rd Qu.: 5.0   3rd Qu.: 6.0   3rd Qu.: 2.00    3rd Qu.: 1.00   
##  Max.   :60.0   Max.   :60.0   Max.   :60.00    Max.   :60.00   
##                                                                 
##  ct_dst_src_ltm           attack_cat          Label       
##  Min.   : 1.00   NA            :1415027   Min.   :0.0000  
##  1st Qu.: 1.00   Generic       :  65483   1st Qu.:0.0000  
##  Median : 2.00   Exploits      :  18102   Median :0.0000  
##  Mean   : 4.35   Fuzzers       :  10912   Mean   :0.0715  
##  3rd Qu.: 3.00   DoS           :   6091   3rd Qu.:0.0000  
##  Max.   :63.00   Reconnaissance:   5561   Max.   :1.0000  
##                  (Other)       :   2852

The variables dsport and dur appear to have some odd values that we will need to investigate further.

# Let's select the highest dsport values.
# max() returns NA while dsport contains NAs, so this first attempt matches nothing:
which(train_data_all$dsport==max(train_data_all$dsport))
## integer(0)
# Ignoring the NAs finds the rows that hold the maximum:
maxPortRowNums <- which(train_data_all$dsport==
          max(train_data_all$dsport, na.rm = TRUE))

# Show the matching rows with one row of context on either side.
train_data_all[(min(maxPortRowNums)-1):(max(maxPortRowNums)+1),]
srcip sport dstip dsport proto state dur sbytes dbytes sttl dttl sloss dloss service Sload Dload Spkts Dpkts swin dwin stcpb dtcpb smeansz dmeansz trans_depth res_bdy_len Sjit Djit Stime Ltime Sintpkt Dintpkt tcprtt synack ackdat is_sm_ips_ports ct_state_ttl ct_flw_http_mthd is_ftp_login ct_ftp_cmd ct_srv_src ct_srv_dst ct_dst_ltm ct_src_ltm ct_src_dport_ltm ct_dst_sport_ltm ct_dst_src_ltm attack_cat Label
82531 59.166.0.7 52475 149.171.126.8 53 udp CON 0.0010 146 178 31 29 0 0 dns 581095 708458 2 2 0 0 0 0 73 89 0 0 0 0 1421930680 1421930680 0.001 0.011 0 0 0 0 0 0 0 0 2 3 2 3 1 1 2 NA 0
82532 175.45.176.1 NA 149.171.126.12 538989345 esp INT 0.0000 200 0 254 0 0 0 NA 200000000 0 2 0 0 0 0 0 100 0 0 0 0 0 1421930680 1421930680 0.004 0.000 0 0 0 0 2 0 0 0 12 12 4 4 2 1 4 NA 0
82533 175.45.176.1 0 149.171.126.12 538989345 esp INT 0.0000 200 0 254 0 0 0 NA 200000000 0 2 0 0 0 0 0 100 0 0 0 0 0 1421930680 1421930680 0.004 0.000 0 0 0 0 2 0 0 0 12 12 4 4 2 3 4 NA 0
82534 59.166.0.2 47418 149.171.126.7 53 udp CON 0.0011 146 178 31 29 0 0 dns 539741 658041 2 2 0 0 0 0 73 89 0 0 0 0 1421930680 1421930680 0.011 0.008 0 0 0 0 0 0 0 0 3 2 2 3 2 1 1 NA 0
train_data_all$dsport[!is.na(train_data_all$dsport)][
  train_data_all$dsport[!is.na(train_data_all$dsport)] >
    quantile(train_data_all$dsport[!is.na(train_data_all$dsport)],prob=0.99999)]
##  [1]     65535 538989345 538989345     65535     65535     65535     65535
##  [8]     65535     65535     65535     65535     65535     65535     65535
plot(as.vector(train_data_all[,"dsport"]),col=rgb(0, 0.7, 0.3, 0.1))

The highest possible port number is 65535, so these very high port values are clearly erroneous. We will remove those rows and replot the chart to check.
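The `integer(0)` returned by the first `which()` call above comes from the missing values: `max()` returns `NA` when its input contains `NA`, so the equality test never matches. A minimal sketch of an NA-safe version of the same check follows. The vector `dsport_toy` is invented for illustration and is not the real column; only the valid port range of 0 to 65535 is taken from the text.

```r
# Toy stand-in for train_data_all$dsport (values invented for illustration)
dsport_toy <- c(53, 80, NA, 65535, 538989345, 538989345, 21)

# With NAs present, max() is NA and the equality test matches nothing
naive_hits <- which(dsport_toy == max(dsport_toy))

# na.rm = TRUE ignores the missing values
max_port_seen <- max(dsport_toy, na.rm = TRUE)

# Flag every row outside the valid port range instead of only the maximum
invalid_rows <- which(!is.na(dsport_toy) & (dsport_toy < 0 | dsport_toy > 65535))
```

Filtering on the valid range rather than on the single largest value would also catch any other impossible ports, should the data contain them.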

train_data_all <- train_data_all[-maxPortRowNums,]
plot(as.vector(train_data_all[sample.int(nrow(train_data_all),10000),"dsport"]),col=rgb(0, 0.7, 0.3, 0.1))

That looks more reasonably distributed, and all ports are now within a possible range of values.

Now to check dsport values in the h2o version of the data frame.

h2o.max(h2o.data.all$dsport[!is.na(h2o.data.all$dsport)])
## [1] 65535
h2o.data.all$dsport[!is.na(h2o.data.all$dsport)][
  h2o.data.all$dsport[!is.na(h2o.data.all$dsport)] >
    h2o.quantile(h2o.data.all$dsport[!is.na(h2o.data.all$dsport)],probs=0.99999)]
##   dsport
## 1  65535
## 2  65535
## 3  65535
## 4  65535
## 5  65535
## 6  65535
## 
## [12 rows x 1 column]

The h2o frame does not appear to have the same error in the dsport values as the base R data frame had.

#plot(as.vector(h2o.data.all[,"dsport"]),col=rgb(0, 0.7, 0.3, 0.3))

No overly large values for dsport in the h2o Dataframe.


Checking similarly for dur

which(train_data_all$dur==max(train_data_all$dur))
## [1] 687663 687664
maxDurRowNums <- which(train_data_all$dur==
          max(train_data_all$dur[!is.na(train_data_all$dur)]))
# maxDurRowNums holds two row numbers, so build the window from its min and max
kable(train_data_all[(min(maxDurRowNums)-1):(max(maxDurRowNums)+1),], "html") %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "400px")
srcip sport dstip dsport proto state dur sbytes dbytes sttl dttl sloss dloss service Sload Dload Spkts Dpkts swin dwin stcpb dtcpb smeansz dmeansz trans_depth res_bdy_len Sjit Djit Stime Ltime Sintpkt Dintpkt tcprtt synack ackdat is_sm_ips_ports ct_state_ttl ct_flw_http_mthd is_ftp_login ct_ftp_cmd ct_srv_src ct_srv_dst ct_dst_ltm ct_src_ltm ct_src_dport_ltm ct_dst_sport_ltm ct_dst_src_ltm attack_cat Label
687664 59.166.0.2 30239 149.171.126.2 53991 tcp FIN 0.0691 424 8824 31 29 1 4 ftp-data 42951.0039 936470.7500 8 12 255 255 3964727548 4277903059 53 735 0 0 810.4 640.83 1421954430 1421954430 9.827 6.238 0.0006 0.0005 0.0001 0 0 0 0 0 9 3 2 4 1 1 2 NA 0
687665 10.40.182.1 0 10.40.182.3 0 arp CON 8786.6377 644 1058 0 0 0 0 NA 0.5609 0.9214 23 23 0 0 0 0 28 46 0 0 0.0 0.00 1421945643 1421954430 0.000 0.000 0.0000 0.0000 0.0000 0 0 0 0 0 2 2 2 2 2 2 2 NA 0
687666 10.40.182.1 0 10.40.182.3 0 arp CON 8786.6377 644 1058 0 0 0 0 NA 0.5609 0.9214 23 23 0 0 0 0 28 46 0 0 0.0 0.00 1421945643 1421954430 0.000 0.000 0.0000 0.0000 0.0000 0 0 0 0 0 2 2 2 2 2 2 2 NA 0
687667 59.166.0.4 64187 149.171.126.3 21 tcp FIN 0.9167 2934 3742 31 29 11 15 ftp 25116.0117 32053.8965 52 54 255 255 1459339333 1460380764 56 69 0 0 1630.0 55.21 1421954429 1421954430 17.967 17.287 0.0006 0.0005 0.0001 0 0 0 1 1 1 2 1 1 1 1 1 NA 0

Now to check the odd values in the duration variable, dur.

train_data_all$dur[train_data_all$dur > quantile(train_data_all$dur,prob=0.99999)]
##  [1] 8787 8787 8761 8761 8761 8761   60   60   60   60   60   60   60   60
## [15]   60   60
plot(as.vector(train_data_all[,"dur"]),col=rgb(0, 0.7, 0.3, 0.5))

Again there seem to be one or more very odd outlier values. A closer look at the rows with the maximum duration shows that their source and destination IP addresses (10.40.182.1 and 10.40.182.3) are only 2 apart, so these are likely long-running exchanges between neighbouring hosts on the local network. Whilst they may still be genuinely suspicious, they are so far outside the range of the rest of the data as to be unhelpful for modelling.

We will remove the dur outliers and replot.

train_data_all <- train_data_all[-which(train_data_all$dur>8000),]
plot(as.vector(train_data_all[,"dur"]),col=rgb(0, 0.7, 0.3, 0.5))

That looks more reasonably distributed now.
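One caution about the `-which(...)` idiom used above: if no row matches the condition, `which()` returns an empty vector, and the negative index then selects zero rows rather than all of them. The sketch below uses a small invented data frame (`toy` is hypothetical) to show the difference from a logical filter.

```r
# Invented data: no duration exceeds the 8000 second threshold
toy <- data.frame(dur = c(0.02, 0.25, 60, NA))

# -which() with no matches becomes toy[integer(0), ], which drops every row
dropped_all <- toy[-which(toy$dur > 8000), , drop = FALSE]

# A logical filter keeps rows at or below the threshold; NAs are kept explicitly
keep <- is.na(toy$dur) | toy$dur <= 8000
kept <- toy[keep, , drop = FALSE]
```

In this notebook the condition does match rows, so the original line behaves as intended; the logical form is simply safer when the script is rerun on data that has already been cleaned.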

We will now check the dur values in the h2o version of our data frame and remove any wildly extreme values.

# We can use the h2o.quantile function to return only the very highest values in the large data set.

h2o.data.all$dur[!is.na(h2o.data.all$dur)][
  h2o.data.all$dur[!is.na(h2o.data.all$dur)] >
    h2o.quantile(h2o.data.all$dur[!is.na(h2o.data.all$dur)],probs=0.99999)]
##   dur
## 1  60
## 2  60
## 3  60
## 4  60
## 5  60
## 6  60
## 
## [16 rows x 1 column]

Remove outliers and recheck top values in h2o data frame

h2o.data.all <- h2o.data.all[h2o.data.all$dur<8000,]

h2o.data.all$dur[!is.na(h2o.data.all$dur)][
  h2o.data.all$dur[!is.na(h2o.data.all$dur)] >
    h2o.quantile(h2o.data.all$dur[!is.na(h2o.data.all$dur)],probs=0.999995)]
##   dur
## 1  60
## 2  60
## 3  60
## 4  60
## 5  60
## 6  60
## 
## [8 rows x 1 column]
summary(train_data_all)
##         srcip            sport                   dstip       
##  59.166.0.0:134436   1043   :  66843   149.171.126.4:133914  
##  59.166.0.2:134223   47439  :  63285   149.171.126.3:133839  
##  59.166.0.4:133701   0      :  26476   149.171.126.2:133816  
##  59.166.0.5:133592   138    :   1140   149.171.126.1:133719  
##  59.166.0.1:133490   80     :    374   149.171.126.0:133653  
##  59.166.0.3:132621   53     :    218   149.171.126.5:133289  
##  (Other)   :721957   (Other):1365684   (Other)      :721790  
##      dsport          proto            state             dur       
##  Min.   :    0   tcp    :992483   FIN    :981503   Min.   : 0.00  
##  1st Qu.:   53   udp    :504825   CON    :378231   1st Qu.: 0.00  
##  Median :  143   arp    :  7632   INT    :158491   Median : 0.02  
##  Mean   :12589   unas   :  6362   REQ    :  5108   Mean   : 0.67  
##  3rd Qu.:20472   ospf   :  4551   RST    :   220   3rd Qu.: 0.25  
##  Max.   :65535   sctp   :   556   CLO    :   161   Max.   :60.00  
##  NA's   :7       (Other):  7611   (Other):   306                  
##      sbytes             dbytes              sttl            dttl    
##  Min.   :       0   Min.   :       0   Min.   :  0.0   Min.   :  0  
##  1st Qu.:     264   1st Qu.:     178   1st Qu.: 31.0   1st Qu.: 29  
##  Median :    1684   Median :    2468   Median : 31.0   Median : 29  
##  Mean   :    4591   Mean   :   41467   Mean   : 48.8   Mean   : 31  
##  3rd Qu.:    3614   3rd Qu.:   16734   3rd Qu.: 31.0   3rd Qu.: 29  
##  Max.   :14355774   Max.   :14657531   Max.   :255.0   Max.   :254  
##                                                                     
##      sloss          dloss          service           Sload           
##  Min.   :   0   Min.   :   0   dns     :365313   Min.   :         0  
##  1st Qu.:   0   1st Qu.:   0   http    :128630   1st Qu.:     99485  
##  Median :   3   Median :   5   ftp-data: 81646   Median :    554080  
##  Mean   :   6   Mean   :  19   smtp    : 52142   Mean   :  20370176  
##  3rd Qu.:   7   3rd Qu.:  15   ftp     : 32369   3rd Qu.:   1429449  
##  Max.   :5319   Max.   :5507   (Other) : 32906   Max.   :5468000256  
##                                NA's    :831014                       
##      Dload               Spkts           Dpkts            swin    
##  Min.   :        0   Min.   :    0   Min.   :    0   Min.   :  0  
##  1st Qu.:    66242   1st Qu.:    2   1st Qu.:    2   1st Qu.:  0  
##  Median :   649043   Median :   14   Median :   18   Median :255  
##  Mean   :  2754018   Mean   :   37   Mean   :   48   Mean   :166  
##  3rd Qu.:  3631999   3rd Qu.:   48   3rd Qu.:   46   3rd Qu.:255  
##  Max.   :128761904   Max.   :10646   Max.   :11018   Max.   :255  
##                                                                   
##       dwin         stcpb                dtcpb               smeansz    
##  Min.   :  0   Min.   :         0   Min.   :         0   Min.   :   0  
##  1st Qu.:  0   1st Qu.:         0   1st Qu.:         0   1st Qu.:  61  
##  Median :255   Median : 991520167   Median : 990893074   Median :  73  
##  Mean   :166   Mean   :1395997115   Mean   :1395615604   Mean   : 126  
##  3rd Qu.:255   3rd Qu.:2642621606   3rd Qu.:2642573682   3rd Qu.: 131  
##  Max.   :255   Max.   :4294953347   Max.   :4294953724   Max.   :1504  
##                                                                        
##     dmeansz      trans_depth      res_bdy_len           Sjit        
##  Min.   :   0   Min.   :  0.00   Min.   :      0   Min.   :      0  
##  1st Qu.:  80   1st Qu.:  0.00   1st Qu.:      0   1st Qu.:      0  
##  Median : 103   Median :  0.00   Median :      0   Median :     24  
##  Mean   : 311   Mean   :  0.09   Mean   :   4652   Mean   :   1630  
##  3rd Qu.: 565   3rd Qu.:  0.00   3rd Qu.:      0   3rd Qu.:    578  
##  Max.   :1500   Max.   :131.00   Max.   :5242880   Max.   :1460480  
##                                                                     
##       Djit            Stime                Ltime           
##  Min.   :     0   Min.   :1421927377   Min.   :1421927414  
##  1st Qu.:     0   1st Qu.:1421943357   1st Qu.:1421943357  
##  Median :    16   Median :1421958679   Median :1421958679  
##  Mean   :   829   Mean   :1422602495   Mean   :1422602496  
##  3rd Qu.:    72   3rd Qu.:1424223184   3rd Qu.:1424223184  
##  Max.   :781221   Max.   :1424233663   Max.   :1424233663  
##                                                            
##     Sintpkt         Dintpkt          tcprtt          synack      
##  Min.   :    0   Min.   :    0   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:    0   1st Qu.:    0   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :    1   Median :    1   Median :0.001   Median :0.0005  
##  Mean   :  240   Mean   :   95   Mean   :0.004   Mean   :0.0023  
##  3rd Qu.:    9   3rd Qu.:    8   3rd Qu.:0.001   3rd Qu.:0.0006  
##  Max.   :60010   Max.   :59485   Max.   :3.864   Max.   :2.1049  
##                                                                  
##      ackdat      is_sm_ips_ports   ct_state_ttl  ct_flw_http_mthd
##  Min.   :0.000   Min.   :0.0000   Min.   :0.00   0      :986783  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00   NA     :404027  
##  Median :0.000   Median :0.0000   Median :0.00   1      :119301  
##  Mean   :0.002   Mean   :0.0021   Mean   :0.15   6      :  7880  
##  3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.00   4      :  3918  
##  Max.   :3.519   Max.   :1.0000   Max.   :6.00   3      :   735  
##                                                  (Other):  1376  
##  is_ftp_login   ct_ftp_cmd        ct_srv_src      ct_srv_dst   
##  0 :1066585   0      :1066490   Min.   : 1.00   Min.   : 1.00  
##  1 :  27716   NA     : 429667   1st Qu.: 2.00   1st Qu.: 2.00  
##  2 :     10   1      :  24404   Median : 5.00   Median : 5.00  
##  4 :     42   2      :   1244   Mean   : 7.54   Mean   : 7.29  
##  NA: 429667   4      :    846   3rd Qu.: 9.00   3rd Qu.: 9.00  
##               3      :    729   Max.   :63.00   Max.   :62.00  
##               (Other):    640                                  
##    ct_dst_ltm     ct_src_ltm   ct_src_dport_ltm ct_dst_sport_ltm
##  Min.   : 1.0   Min.   : 1.0   Min.   : 1.00    Min.   : 1.00   
##  1st Qu.: 2.0   1st Qu.: 2.0   1st Qu.: 1.00    1st Qu.: 1.00   
##  Median : 3.0   Median : 4.0   Median : 1.00    Median : 1.00   
##  Mean   : 5.1   Mean   : 5.6   Mean   : 3.02    Mean   : 2.35   
##  3rd Qu.: 5.0   3rd Qu.: 6.0   3rd Qu.: 2.00    3rd Qu.: 1.00   
##  Max.   :60.0   Max.   :60.0   Max.   :60.00    Max.   :60.00   
##                                                                 
##  ct_dst_src_ltm           attack_cat          Label       
##  Min.   : 1.00   NA            :1415019   Min.   :0.0000  
##  1st Qu.: 1.00   Generic       :  65483   1st Qu.:0.0000  
##  Median : 2.00   Exploits      :  18102   Median :0.0000  
##  Mean   : 4.35   Fuzzers       :  10912   Mean   :0.0715  
##  3rd Qu.: 3.00   DoS           :   6091   3rd Qu.:0.0000  
##  Max.   :63.00   Reconnaissance:   5561   Max.   :1.0000  
##                  (Other)       :   2852

Save prepared data files

We will now save these files after preprocessing. On the next run, this notebook will load the saved versions if they exist; otherwise it will load the original files and preprocess them again.

One of the challenges in this work was re-running functions on the large data set. Saving progress and using if statements to check whether certain steps had already been done let me rerun the script without repeating the same processing steps over and over.
Most of these checks were written to test whether something already exists and to create it if not. This also matters because when I shut down my AWS virtual machine and rerun at a later date, I start with a new VM and a clean disk, and need to be able to reload the already processed files from backup S3 storage.
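The check-then-load pattern described here can be sketched in base R as follows. The helper name `load_or_build` and the file path are hypothetical, and `saveRDS`/`readRDS` are used only for illustration; the notebook's own `exportPreProcessedFiles` function may work differently.

```r
# Return the cached object if the file exists; otherwise build it, save it, return it.
load_or_build <- function(path, build_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- build_fn()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(obj, path)
  obj
}

# Example: the expensive step runs only on the first call
cache_file <- file.path(tempdir(), "processed", "toy_data.rds")
unlink(cache_file)  # start from a clean state for the demonstration
build_count <- 0
build_toy <- function() {
  build_count <<- build_count + 1
  data.frame(x = 1:3)
}
first  <- load_or_build(cache_file, build_toy)
second <- load_or_build(cache_file, build_toy)
```

The second call reads the file written by the first, so `build_toy` runs only once.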

try(exportPreProcessedFiles(Dir="data", processedDir="processed" ))
## [1] "preprocessed file exists"
## [1] "preprocessed file exists"
## [1] "Pre processed files Saved For next Time."

Investigate the time stamp variables

sampleset=seq(1,1500000,10)
plot(as.vector(h2o.data.all$Ltime[sampleset]))

plot(as.vector(h2o.data.all$Stime[sampleset]))

plot(as.vector(h2o.data.all$Ltime[sampleset]-h2o.data.all$Stime[sampleset]))

plot(as.vector(h2o.data.all$Ltime[sampleset]))

The time stamp variables will likely relate to the attacks in a statistically relevant way, yet to predict future attacks we need to identify attacks independently of when they arrive. One option was to create an interval column holding the difference between the start and last times, but since there is already a duration column, I decided not to use this approach.

Create a Time interval column - finally decided against this.

#h2o.data.all$Itime <- h2o.data.all$Ltime - h2o.data.all$Stime
#h2o.data.test$Itime <- h2o.data.test$Ltime - h2o.data.test$Stime

Similarly to the time stamps, whilst IP addresses will certainly be related to attacks, they will not be useful in our modelling if our goal is to detect attacks from any IP. For example, an IP address associated with 1,000,000 benign transactions could easily still be the source of our next attack, yet a model trained on past IP addresses could associate that address with a high likelihood of being safe.
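In practice this means leaving the address and time stamp columns out of the predictor list. A minimal sketch, assuming the predictors are chosen by name: how `predictor_vars` is actually built is not shown in this section, so the exclusion list below is an assumption based on the column names in the summary output.

```r
# A few of the column names from the summary output, for illustration
all_cols <- c("srcip", "sport", "dstip", "dsport", "proto", "dur",
              "Stime", "Ltime", "attack_cat", "Label")

# Identifier, time stamp and target columns to exclude (assumed list)
excluded <- c("srcip", "dstip", "Stime", "Ltime", "attack_cat", "Label")

# Everything that is left is a candidate predictor
predictor_vars_sketch <- setdiff(all_cols, excluded)
```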

# Variable Distributions
Plots of Variables distributions
```r
lastcol <- ncol(h2o.data.all)

# Split the 47 columns into smaller groups of columns.
dfsample = h2o.data.all[h2o.runif(h2o.data.all) < 0.1, ]

# For Non h2o frames we can use:
# samplesize = 100000
# dfsample = sample_n(h2o.data.all,samplesize)

gatheredsample1 <- as.data.frame(dfsample[,c(1:15,lastcol)]) %>%
  gather(variable, value, -Label)
gatheredsample2 <- as.data.frame(dfsample[,c(16:30,lastcol)]) %>%
  gather(variable, value, -Label)
gatheredsample3 <- as.data.frame(dfsample[,31:lastcol]) %>%
  gather(variable, value, -Label)
```

```r
highlightnames <- feature_names[c(1,4,9,10,11,12,13,17,18,24,32,35,37,41,42,43,lastcol)]

gatheredsample_highlight <- rbind(
  gatheredsample1[gatheredsample1$variable %in% highlightnames,],
  gatheredsample2[gatheredsample2$variable %in% highlightnames,],
  gatheredsample3[gatheredsample3$variable %in% highlightnames,])
```

Visualisations of the distributions of variables.

```r
tic()
density_plot <- function(gatheredsample_highlight) {
  sample_frac(gatheredsample_highlight,0.01) %>%
    ggplot() +
    geom_density(aes(x = value, col = Label), alpha = 0.2) +
    labs(title = "Network Flow Variables Distribution\n",
         x = "Variables", y = "Density", color = "Attack Label\n") +
    scale_color_manual(labels = c("Benign", "Malicious"),
                       values = c("green", "red")) +
    facet_wrap(~variable, ncol = 4)
}
toc()
```
## 0.003 sec elapsed

```r
if (!file.exists("densityploth.png")){
  density_plot(gatheredsample_highlight)
  dev.copy(png,'densityploth.png')
  invisible(dev.off())
}

if (!file.exists("densityplot1.png")){
  density_plot(gatheredsample1)
  dev.copy(png,'densityplot1.png')
  invisible(dev.off())
}

if (!file.exists("densityplot2.png")){
  density_plot(gatheredsample2)
  dev.copy(png,'densityplot2.png')
  invisible(dev.off())
}

if (!file.exists("densityplot3.png")){
  density_plot(gatheredsample3)
  dev.copy(png,'densityplot3.png')
  invisible(dev.off())
}

toc()
```

densityplot1.png

densityplot3.png

densityplot2.png

Categorical or boolean variables with mostly zero values

Plot for some of the more important variables

```r
sample_frac(gatheredsample_highlight,0.01) %>%
  ggplot() +
  geom_density(aes(x = value, col = Label), alpha = 0.2) +
  labs(title = "Network Flow Variables Distribution\n",
       x = "Variables", y = "Density", color = "Attack Label\n") +
  scale_color_manual(labels = c("Benign", "Malicious"),
                     values = c("green", "red")) +
  facet_wrap(~variable, ncol = 4)
```

This distribution plot can be quite difficult to interpret. There appears to be a wide variation in values, with few obvious patterns separating attacks (red) from benign samples (green).

Ridges Plots

We can look at a smoothed distribution of the variables to make any differences easier to visualise. The plots below show the distribution of our sample variables, contrasting those identified as malicious (Label 1) with those not (Label 0). Some variables are binary but will still appear as smoothed ridges on the distribution plots.

We will take a 10% sub sample of the data to speed plotting.

plot_ridges <- function(df, output_filename = NULL) {
  draw_plot <- ggplot(df, aes(y = as.factor(variable),
                              fill = as.factor(Label),
                              x = percent_rank(value))) +
    geom_density_ridges() +
    theme(text = element_text(size = 12)) +
    scale_fill_manual(values = alpha(c("green", "red"), .3))

  if (is.null(output_filename)) {
    draw_plot
  } else {
    # for hi res print output
    # png(filename=output_filename , width=1600, height=1600)
    # print(draw_plot)
    # invisible(dev.off())
    print(draw_plot)
    dev.copy(png, output_filename)
    invisible(dev.off())
  }
}
if (!file.exists("ridges1.png")){
  plot_ridges(gatheredsample1,'ridges1.png')
}
if (!file.exists("ridges2.png")){
  plot_ridges(gatheredsample2,'ridges2.png')
}
if (!file.exists("ridges3.png")){
  plot_ridges(gatheredsample3,'ridges3.png')
}
if (!file.exists("ridgeshighlights.png")){
plot_ridges(
  rbind(
   gatheredsample1[gatheredsample1$variable %in% highlightnames,],
   gatheredsample2[gatheredsample2$variable %in% highlightnames,],
   gatheredsample3[gatheredsample3$variable %in% highlightnames,]),'ridgeshighlights.png')
  #gatheredsample_highlight)
  invisible(dev.off())
}
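
The ridge plots place `percent_rank(value)` on the x axis, which rescales every variable to the range 0 to 1 by rank so that variables with very different units can share one axis. A base R sketch of that calculation follows, assuming dplyr's documented definition (minimum rank minus one, divided by the number of non-missing values minus one). The name `percent_rank_base` is hypothetical.

```r
percent_rank_base <- function(x) {
  # ties.method = "min" gives tied values the smallest rank among them;
  # na.last = "keep" leaves missing values as NA
  ranks <- rank(x, ties.method = "min", na.last = "keep")
  (ranks - 1) / (sum(!is.na(x)) - 1)
}

# Heavily skewed values are spread evenly between 0 and 1
skewed <- c(0, 1, 10, 1000, 1e6)
ranked <- percent_rank_base(skewed)

# Tied values share the lowest rank of the tie
tied <- percent_rank_base(c(5, 5, 7))
```

Because only the ordering survives, the plots compare where benign and malicious samples sit within each variable's range, not their raw magnitudes.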

Variable Distributions 1 highlighted

Variable Distributions 2 two


Variable Distributions 1 highlighted

Variable Distributions 2 two

# rm("gatheredsample1","gatheredsample2","gatheredsample3")

Table of attack categories

attack_cat_table0 <- sort(decreasing =TRUE,
                         table(as.data.frame(
                           h2o.data.all["attack_cat"])))

attack_cat_table <- sort(decreasing =TRUE,
                         table(as.data.frame(
                           h2o.data.all[
                             as.vector(h2o.which(h2o.data.all$Label=='1')),"attack_cat"])))

attack_cat_table
## 
##        Generic       Exploits        Fuzzers            DoS Reconnaissance 
##          65483          18102          10912           6091           5561 
##       Analysis       Backdoor      Shellcode          Worms 
##           1218            939            625             70

There are over 65,000 Generic attacks and only 70 Worms. It may prove difficult to distinguish some of the less well represented categories.
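
To put the imbalance in numbers, the counts from the table above can be converted to percentages of all attack records. This only re-expresses the figures already shown.

```r
# Attack counts copied from the attack_cat_table output above
attack_counts <- c(Generic = 65483, Exploits = 18102, Fuzzers = 10912,
                   DoS = 6091, Reconnaissance = 5561, Analysis = 1218,
                   Backdoor = 939, Shellcode = 625, Worms = 70)

# Share of each category among all attack records, in percent
attack_share <- round(100 * attack_counts / sum(attack_counts), 2)
attack_share
```

Generic attacks make up roughly 60% of the attack records, while Worms account for well under 0.1%.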

# Plot of attack categories

Variable Distributions 1

barplot0 <- barplot(attack_cat_table0,
     main = "Attack Categories",
     col= 19:1,
     cex.names=.8,
     las=2
     )

There are vastly more benign samples than attacks.

Variable Distributions 2

barplot1 <- barplot(attack_cat_table,
     main = "Attack Categories",
     col= 18:1,
     cex.names=.8,
     las=2
)

Some attacks have very few occurrences in the data, which may make predicting for those categories difficult.

Split Data.

We will split our original training data into a random 80% Training split and 20% Validation Split.
Models can be built using the training split and verified on the Validation Split.
For the deep learning, we can pass in both data sets, so that the model learned from the training data is evaluated against the validation data within the model building step.
Having the validation set performance evaluation included in the model building step is quite a nice feature.

Create an 80% training split and 20% validation split

randomsplits <- h2o.runif(h2o.data.all)
h2o.data.train <- h2o.data.all[randomsplits <0.8,]
h2o.data.validation <- h2o.data.all[randomsplits >=0.8,]
# Recreate our h2o data splits as standard R dataframes
train_data <- as.data.frame(h2o.data.train)
validation_data <- as.data.frame(h2o.data.validation)
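
The same idea in base R, for readers without an H2O cluster: draw one uniform random number per row and split on a 0.8 threshold. Because the assignment is random, the split is only approximately 80/20. The data frame `toy` and the seed are invented for the example.

```r
set.seed(42)  # arbitrary seed so the example is repeatable
toy <- data.frame(id = 1:10000)

# One uniform draw per row, as h2o.runif does for an H2OFrame
u <- runif(nrow(toy))
toy_train      <- toy[u <  0.8, , drop = FALSE]
toy_validation <- toy[u >= 0.8, , drop = FALSE]

# Fraction of rows that landed in the training split
train_fraction <- nrow(toy_train) / nrow(toy)
```

Using a single vector of draws for both comparisons guarantees that every row lands in exactly one of the two splits.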

Principal Components Analysis.

Learn how many combinations of variables are required to explain the variance in the data.

h2o.data.all.pca <- h2o.prcomp(h2o.data.all, x = predictor_vars,
                               transform = "NORMALIZE",
                               k = 20,
                               max_iterations = 1000,
                               compute_metrics = TRUE,
                               impute_missing = FALSE,
                               max_runtime_secs = 60
                               )
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |==                                                               |   4%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=================================================================| 100%
## Warning in doTryCatch(return(expr), name, parentenv, handler): _train:
## Dataset used may contain fewer number of rows due to removal of rows with
## NA/missing values. If this is not desirable, set impute_missing argument in
## pca call to TRUE/True/true/... depending on the client language.
summary(h2o.data.all.pca,plot=TRUE)
## Model Details:
## ==============
## 
## H2ODimReductionModel: pca
## Model Key:  PCA_model_R_1524748087009_7 
## Importance of components: 
##                             pc1      pc2      pc3      pc4      pc5
## Standard deviation     1.323251 0.940512 0.330133 0.295593 0.271781
## Proportion of Variance 0.522008 0.263707 0.032492 0.026048 0.022021
## Cumulative Proportion  0.522008 0.785716 0.818207 0.844256 0.866276
##                             pc6      pc7      pc8      pc9     pc10
## Standard deviation     0.262425 0.242044 0.228554 0.214266 0.197184
## Proportion of Variance 0.020531 0.017466 0.015573 0.013687 0.011591
## Cumulative Proportion  0.886807 0.904273 0.919846 0.933532 0.945124
##                            pc11     pc12     pc13     pc14     pc15
## Standard deviation     0.190681 0.170244 0.150804 0.135906 0.128240
## Proportion of Variance 0.010840 0.008640 0.006780 0.005506 0.004903
## Cumulative Proportion  0.955963 0.964604 0.971384 0.976890 0.981793
##                            pc16     pc17     pc18     pc19     pc20
## Standard deviation     0.116508 0.094400 0.085330 0.078867 0.060079
## Proportion of Variance 0.004047 0.002657 0.002171 0.001854 0.001076
## Cumulative Proportion  0.985840 0.988496 0.990667 0.992521 0.993597
## 
## H2ODimReductionMetrics: pca
## 
## No model metrics available for PCA
## 
## 
## 
## Scoring History for GramSVD: 
##             timestamp   duration iterations
## 1 2018-04-26 13:19:03  4.769 sec          0

It takes 15 principal components to account for 98% of the sample variation.
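
That count can be read straight off the cumulative proportions in the summary. A short check using the values printed above (`components_needed` is a hypothetical helper):

```r
# Cumulative proportion of variance for pc1..pc20, copied from the PCA summary
cum_prop <- c(0.522008, 0.785716, 0.818207, 0.844256, 0.866276,
              0.886807, 0.904273, 0.919846, 0.933532, 0.945124,
              0.955963, 0.964604, 0.971384, 0.976890, 0.981793,
              0.985840, 0.988496, 0.990667, 0.992521, 0.993597)

# Smallest number of components whose cumulative proportion reaches a target
components_needed <- function(target) which(cum_prop >= target)[1]

components_needed(0.98)  # 15
components_needed(0.90)  # 7
```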

Autoencoder

We will apply an unsupervised deep learning Autoencoder.

An autoencoder is a neural network that learns a representation or encoding of a set of data, typically for the purpose of dimensionality reduction.

For the first task we will proceed as if we have no labels on the data and simply try to identify suspicious anomalies in the data.

After building the models we will use the labels to discover if this approach will be useful.

FUNCTIONS

FUNCTION Get Reconstruction Errors

getReconErrors <- function(dl_model, dataF) {
  reconstructionMses = h2o.anomaly(dl_model, dataF, per_feature=FALSE)
  reconMses <- as.data.frame(reconstructionMses)
  return(reconMses)
}

FUNCTION Plot reconstruction errors.

plotReconstructionErrors <- function(dataF, errorsList,type='raw',sampleRate=1) {
  quants <- quantile(errorsList$Reconstruction.MSE, c(.70, .75, .80)) 
  quants
# Plot anomaly MSE level
  dataF <- dataF[h2o.runif(dataF)<sampleRate,]
  library(scales)
  labels1 <- as.vector(dataF$Label)
  all <- TRUE
  raw <- TRUE
  benign <- labels1==0
  malicious <- labels1==1
  min_MSE<- min(errorsList$Reconstruction.MSE)
  max_MSE<- max(errorsList$Reconstruction.MSE)
  if (type == 'raw') { plotData<-errorsList$Reconstruction.MSE
  } else {
    plotData<-sort(errorsList$Reconstruction.MSE[get(type)])
  }
  plot(plotData,
       main=paste(c('Reconstruction Error Percentiles',c(names(quants))), collapse='. '),
       log="y",
       ylim=c(min_MSE,max_MSE),
       ylab="Reconstruction MSE error level",
       sub = paste("Sorted Reconstruction Errors", collapse='. '),
       col=alpha((3-as.numeric(labels1[get(type)])),0.25))
  abline(h=quants[1],col='yellow')
  abline(h=quants[2],col='orange')
  abline(h=quants[3],col='red')
}

FUNCTION Output Model Performance Information

output_dl_model_performance <- function(model, h2o_data) {
  
  #Plot anomaly MSE level
  recon_errors <- getReconErrors(model, h2o_data)
  plotAll <- plotReconstructionErrors(h2o_data, recon_errors)
  
  #Plot benign anomaly MSE level
  plotBenign <- plotReconstructionErrors(h2o_data, recon_errors,type='benign')
  
  #Plot malicious anomaly MSE level
  plotMalicious <- plotReconstructionErrors(h2o_data, recon_errors,type='malicious')
  
  malicious_indexes_logical <- as.vector(h2o.which(as.character(h2o_data$Label)=='1'))
  minimumMaliciousMSE <- min(recon_errors$Reconstruction.MSE[malicious_indexes_logical])
  
  percentileRank <- function(vec, testval) {length(vec[vec <= testval])/length(vec)*100}
  
  mal_below_percentile <- function(rate){
    sort(recon_errors$Reconstruction.MSE[malicious_indexes_logical])[floor(length(recon_errors$Reconstruction.MSE[malicious_indexes_logical])*rate)]}
  
  cat("We could consider all Samples below a certain Reconstruction error as \"Safe\"\n\n")
  miss<-0.1
  perc<-percentileRank(recon_errors$Reconstruction.MSE,mal_below_percentile(miss))
  cat(perc,"% cut off would miss",miss*100,"% of attacks\n")
  
  miss<-0.05
  perc<-percentileRank(recon_errors$Reconstruction.MSE,mal_below_percentile(miss))
  cat(perc,"% cut off would miss",miss*100,"% of attacks\n")
  
  miss<-0.025
  perc<-percentileRank(recon_errors$Reconstruction.MSE,mal_below_percentile(miss))
  cat(perc,"% cut off would miss",miss*100,"% of attacks\n")
  
  cat(percentileRank(recon_errors$Reconstruction.MSE,minimumMaliciousMSE),"% Cut off is where the lowest Malicious sample Reconstruction error was, at ", minimumMaliciousMSE,".\n The samples with reconstruction error below this are all Benign samples")
  }
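
The cut-off logic in `output_dl_model_performance` can be followed on a small invented example. The sketch below mirrors the two helper calculations on plain vectors; the names `percentile_rank` and `malicious_error_at`, the guard against a zero index, and the data are all assumptions made for illustration.

```r
# Percentage of values in vec that are at or below testval
percentile_rank <- function(vec, testval) mean(vec <= testval) * 100

# Reconstruction error below which a given fraction of malicious samples fall
malicious_error_at <- function(errors, is_malicious, miss_rate) {
  mal_sorted <- sort(errors[is_malicious])
  idx <- max(1, floor(length(mal_sorted) * miss_rate))  # guard: index 0 selects nothing
  mal_sorted[idx]
}

# Invented data: 20 samples, the 10 largest errors belong to attacks
errors       <- 1:20
is_malicious <- errors > 10

cutoff <- malicious_error_at(errors, is_malicious, miss_rate = 0.1)
share_at_or_below <- percentile_rank(errors, cutoff)
```

Here the cut-off is 11, and 55% of all samples have an error at or below it; treating those samples as safe would let through one attack in ten.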

1. Autoencoder H2o.deeplearning

We will use h2o.deeplearning with autoencoder set to TRUE:

tic()
TrainDlModel1 <- function(predictor_vars, datafr) {
  deepmodel = h2o.deeplearning (x = predictor_vars,
                   training_frame = datafr[predictor_vars],  # use the frame passed in, not the global
                   ignore_const_cols = FALSE,
                   autoencoder = TRUE,
               #    standardize = TRUE,
                   seed = 42,
                   hidden = c(20),
                   epochs = 50000,
                   max_w2 = 10,
                   sparse = FALSE,
                   activation = "TanhWithDropout",
                   loss = "Automatic",
                   l1 =1e-5,
                   stopping_metric =  "MSE",
                   stopping_rounds = 3,
                   stopping_tolerance =0.0001,
                   model_id = "DL_model_R_1a"
                   )
   return(deepmodel)
}
toc()
## 0.007 sec elapsed
if (file.exists(NewestFileInDir("./models/hidden_20")) ) {
  print("loading saved model")
  h2o.data.train.dl <- h2o.loadModel(NewestFileInDir("./models/hidden_20"))
} else {
  h2o.data.train.dl <- TrainDlModel1(predictor_vars, h2o.data.train)
  h2o.saveModel(h2o.data.train.dl, path="./models/hidden_20", force = TRUE)
}
## [1] "loading saved model"
cat("model Id is :",h2o.data.train.dl@model_id)
## model Id is : DeepLearning_model_R_1524433485776_56
#hidden (20) layer model id was saved as "DeepLearning_model_R_1524353906773_25"

After the model is trained, we will call the h2o.anomaly function, which reconstructs the original data from the model's reduced set of features and returns the mean squared error between the original and reconstructed rows. This error measures how well the model could reconstruct the data.
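
To make the idea of reconstruction error concrete without an H2O cluster, the sketch below uses PCA as a stand-in for the autoencoder: it compresses a small built-in data set to two components, reconstructs it, and computes a per-row mean squared error. This is a substitute technique chosen for illustration, not what `h2o.anomaly` does internally, and the variable names are hypothetical.

```r
# Built-in numeric data, standardised so each column has mean 0 and sd 1
x <- scale(as.matrix(iris[, 1:4]))

pca <- prcomp(x, center = FALSE, scale. = FALSE)
k <- 2  # size of the compressed representation

# Encode to k components, then decode back to the original 4 columns
encoded       <- x %*% pca$rotation[, 1:k]
reconstructed <- encoded %*% t(pca$rotation[, 1:k])

# Per-row mean squared error between original and reconstruction
recon_mse <- rowMeans((x - reconstructed)^2)

# Rows with the largest error are the ones the compression represents worst
most_anomalous <- order(recon_mse, decreasing = TRUE)[1:5]
```

Rows that resemble the bulk of the data reconstruct well and get a small error; unusual rows get a large one, which is the property the anomaly detection relies on.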

Single 20-Node Hidden Layer Model performance on training Data

To discover if the reconstruction error may prove useful in identifying attacks, we can plot the reconstruction errors of the model, and here we will highlight the malicious instances in red to see if they do indeed show up as more anomalous than the benign samples.

We will then replot, sorting by reconstruction error and separating into benign and malicious plots, still highlighting quantiles of the whole set. We will examine which quantiles of error each category predominantly falls into.

output_dl_model_performance(h2o.data.train.dl, h2o.data.train)

## We could consider all Samples below a certain Reconstruction error as "Safe"
## 
## 87.8 % cut off would miss 10 % of attacks
## 84.05 % cut off would miss 5 % of attacks
## 74.88 % cut off would miss 2.5 % of attacks
## 65.6 % Cut off is where the lowest Malicious sample Reconstruction error was, at  0.004665 .
##  The samples with reconstruction error below this are all Benign samples

We can see most of the red Attacks are picked up with higher reconstruction errors from the AutoEncoder model, with ALL of the attacks above the 65th percentile of reconstruction errors. The implications are that this model could quickly identify 65% of this training data as "SAFE", by passing this data through a fairly simple autoencoder.
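The cut-off figures printed above can be reproduced in spirit with a few lines of base R. This is only a sketch on simulated errors: `errors`, `labels` and `cutoffForMissRate` are hypothetical names, and the notebook's own output_dl_model_performance function may compute its figures differently.

```r
set.seed(42)
# Simulated reconstruction errors: 900 benign (low) and 100 malicious (higher).
errors <- c(rexp(900, rate = 200), 0.004 + rexp(100, rate = 100))
labels <- c(rep(0, 900), rep(1, 100))

# Share of ALL samples that falls below the error threshold at which
# `miss_rate` of the attacks would be missed (treated as "safe").
cutoffForMissRate <- function(errors, labels, miss_rate) {
  threshold <- quantile(errors[labels == 1], probs = miss_rate, names = FALSE)
  mean(errors < threshold)
}

for (miss in c(0.10, 0.05, 0.025, 0)) {
  cat(sprintf("%.2f %% cut off would miss %.1f %% of attacks\n",
              100 * cutoffForMissRate(errors, labels, miss), 100 * miss))
}
```

With a miss rate of 0 the threshold is the lowest malicious reconstruction error, so every sample below it is benign by construction.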

Single 20-Node Hidden Layer Model performance on validation Data

We will now evaluate on the held-back validation data.

output_dl_model_performance(h2o.data.train.dl, h2o.data.validation)

## We could consider all Samples below a certain Reconstruction error as "Safe"
## 
## 87.68 % cut off would miss 10 % of attacks
## 83.84 % cut off would miss 5 % of attacks
## 74.86 % cut off would miss 2.5 % of attacks
## 65.84 % Cut off is where the lowest Malicious sample Reconstruction error was, at  0.004718 .
##  The samples with reconstruction error below this are all Benign samples

Detection thresholds proved very similar for the validation set.

2. GRID SEARCH for Autoencoder hidden layer options

We will try to do better by running a grid search: the same process, but trying a few autoencoders with more hidden layers. If the relationships in the data are more complex, these may perform better. Our code will train and evaluate the various models and select the best one for us. To speed up training during the grid search, I loosened the stopping criteria, so model performance could be a little worse. Early stopping let us search more models in a shorter space of time and compare them. For an autoencoder the "AUTO" stopping metric is MSE, so we can stop building a model when its MSE is no longer improving by more than the tolerance we specify.

hyper_params <- list(
                           # 3 Grid options commented out below for faster running. To rerun entire grid search, uncomment here.
  hidden=list(c(15,10,15),c(32,32),c(64,64),c(20),c(20,20))

#  rate=c(0.01,0.02),
#  rate_annealing=c(1e-8,1e-7)

)
hyper_params
## $hidden
## $hidden[[1]]
## [1] 15 10 15
## 
## $hidden[[2]]
## [1] 32 32
## 
## $hidden[[3]]
## [1] 64 64
## 
## $hidden[[4]]
## [1] 20
## 
## $hidden[[5]]
## [1] 20 20
doGridSearch <- function() {
  h2o.grid(
    algorithm="deeplearning",
    grid_id="dl_grid", 
    training_frame=h2o.data.train,
    validation_frame=h2o.data.validation  , 
    x=predictor_vars, 
    y= NULL,
    autoencoder = TRUE,
    epochs=50,
    stopping_metric="AUTO",
    stopping_tolerance=3e-2,        ## Stop when we improve by less than 3% for 2 scoring events, using AUTO which will be MSE score
    stopping_rounds=2,
    score_validation_samples=100000, ## Set size of validation set for faster scoring
    l1=1e-3,
    activation=c("TanhWithDropout"),
    max_w2=10,                      ## Set weight limit to improve model stability
    hyper_params=hyper_params
  )
}


if (file.exists(NewestFileInDir("./models/grid_best")) ) {
  print("loading saved model")
  h2o.best.dl.model <- h2o.loadModel(NewestFileInDir("./models/grid_best"))
} else {
  print("Performing Grid Search on models")
  h2o.grid.dl <- doGridSearch()
  h2o.grid.dl
  print("Finished Grid Search on models")
  h2o.grid.results <- h2o.getGrid("dl_grid",sort_by="mse",decreasing=FALSE)
  h2o.best.dl.model.id <-h2o.grid.results@model_ids[[1]]
  h2o.best.dl.model <- h2o.getModel(h2o.best.dl.model.id)
  h2o.saveModel(h2o.best.dl.model, path="./models/grid_best", force = TRUE)
  print(h2o.grid.results)
}
## [1] "loading saved model"
cat("model Id is :",h2o.data.train.dl@model_id)  
## model Id is : DeepLearning_model_R_1524433485776_56
cat("model MSE is: ",h2o.mse(h2o.best.dl.model))
## model MSE is:  0.005363

The model with a single hidden layer of 20 nodes has the lowest MSE reconstruction error on the validation set.

Copy of H2O Grid Details

Grid ID: dl_grid

Used hyper parameters: - hidden

Number of models: 5

Number of failed models: 0

Hyper-Parameter Search Summary: ordered by increasing mse
        hidden       model_ids                  mse
1         [20] dl_grid_model_3 0.005309525216384074
2     [32, 32] dl_grid_model_1 0.005460826923559729
3 [15, 10, 15] dl_grid_model_0 0.006091622073189775
4     [20, 20] dl_grid_model_4 0.006243060895154623
5     [64, 64] dl_grid_model_2 0.007167924403874439
[1] "dl_grid_model_3"
[1] 0.005363

h2o.best.dl.model
## Model Details:
## ==============
## 
## H2OAutoEncoderModel: deeplearning
## Model ID:  dl_grid_model_3 
## Status of Neuron Layers: auto-encoder, gaussian distribution, Quadratic loss, 8,507 weights/biases, 119.2 KB, 1,880,013 training samples, mini-batch size 1
##   layer units        type dropout       l1       l2 mean_rate rate_rms
## 1     1   207       Input  0.00 %                                     
## 2     2    20 TanhDropout 50.00 % 0.001000 0.000000  0.037511 0.137523
## 3     3   207        Tanh         0.001000 0.000000  0.059986 0.035958
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1                                                   
## 2 0.000000   -0.004208   0.211915 -0.106799 4.464722
## 3 0.000000   -0.000134   0.033201  0.016737 0.117233
## 
## 
## H2OAutoEncoderMetrics: deeplearning
## ** Reported on training data. **
## 
## Training Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.005363
## RMSE: (Extract with `h2o.rmse`) 0.07323
## 
## H2OAutoEncoderMetrics: deeplearning
## ** Reported on validation data. **
## 
## Validation Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.00531
## RMSE: (Extract with `h2o.rmse`) 0.07287


  h2o.saveModel(h2o.best.dl.model, path="./models/grid_best", force = TRUE)
## [1] "/home/rstudio/networkflows/models/grid_best/dl_grid_model_3"
output_dl_model_performance(h2o.best.dl.model, h2o.data.validation)

## We could consider all Samples below a certain Reconstruction error as "Safe"
## 
## 88.07 % cut off would miss 10 % of attacks
## 85.65 % cut off would miss 5 % of attacks
## 80.69 % cut off would miss 2.5 % of attacks
## 61.12 % Cut off is where the lowest Malicious sample Reconstruction error was, at  0.005202 .
##  The samples with reconstruction error below this are all Benign samples

The lowest-error attack now sits at a lower percentile than before (61.12% against 65.84% on the validation data). Since we specified looser stopping criteria in the grid search to compare more models faster, this is to be expected. However, the 80th percentile is catching a similar number of attacks as before.


3. BENIGN samples only autoencoder training

Though we do have the luxury of labelled benign data, we didn't use that information to train our autoencoder above, but simply looked for anomalies and compared the labels to see how we did. Since we do have classified data, we can do better than just looking for anomalies in all the data.

We can train our autoencoder ONLY on the samples that we know to be benign. Then, when we reconstruct all the data, we should see higher reconstruction errors for the malicious data. Since the data was already heavily skewed towards benign samples, the improvement here may not be large.

Train Autoencoder on benign samples only

trainBenignModel <- function() {
   h2o.deeplearning(
                   x = predictor_vars,
                   autoencoder = TRUE,
                   standardize = TRUE,
                   epochs = 150,
                   seed = 42,
                   l1=1e-3,
                   max_w2 = 10,
                   stopping_rounds = 3,
                   stopping_tolerance =0.005,
                   stopping_metric = "AUTO",
                   hidden = c(20),
                   
                   training_frame=h2o.data.train[as.character(h2o.data.train$Label)=='0',],
                   
                   validation_frame=h2o.data.validation[as.character(h2o.data.validation$Label)=='0',],
                   
                   activation=c("TanhWithDropout")
                   )
 }
  

if (file.exists(NewestFileInDir("./models/benign")) ) {
  print("loading saved model")
  h2o.benign.dl.model <- h2o.loadModel(NewestFileInDir("./models/benign"))
} else {
  h2o.benign.dl.model <- trainBenignModel()  # takes no arguments; uses predictor_vars and the H2O frames from the global environment
  h2o.saveModel(h2o.benign.dl.model, path="./models/benign", force = TRUE)
}
## [1] "loading saved model"
benign_model_id <- h2o.benign.dl.model@model_id
cat("model Id is :",h2o.benign.dl.model@model_id)
## model Id is : DeepLearning_model_R_1524353906773_27
#"DeepLearning_model_R_1524353906773_27"

We will calculate reconstruction errors for all the validation data and examine performance.

output_dl_model_performance(h2o.benign.dl.model, h2o.data.validation)

## We could consider all Samples below a certain Reconstruction error as "Safe"
## 
## 85.72 % cut off would miss 10 % of attacks
## 81.99 % cut off would miss 5 % of attacks
## 77.72 % cut off would miss 2.5 % of attacks
## 60.73 % Cut off is where the lowest Malicious sample Reconstruction error was, at  0.005036 .
##  The samples with reconstruction error below this are all Benign samples

We will now calculate reconstruction errors on the full data set (h2o.data.all) and look at the rows around the largest error.

  recon_errors_all <- getReconErrors(h2o.benign.dl.model, h2o.data.all)
  h2o.data.all[which(recon_errors_all==max(recon_errors_all)):(which(recon_errors_all==max(recon_errors_all))+3),]
##        srcip sport      dstip dsport proto state      dur sbytes dbytes
## 1 2939006978 38224 2511044107     80   tcp   FIN 2.002472 193156   1128
## 2 2939006976     0 2511044109      0   hmp   INT 0.000005    200      0
## 3 2939006976     0 2511044109      0   hmp   INT 0.000005    200      0
## 4 2939006976     0 2511044109      0   hmp   INT 0.000005    200      0
##   sttl dttl sloss dloss service     Sload Dload Spkts Dpkts swin dwin
## 1  254  252    73     1    http    766596  4335   152    26  255  255
## 2  254    0     0     0       - 160000000     0     2     0    0    0
## 3  254    0     0     0       - 160000000     0     2     0    0    0
## 4  254    0     0     0       - 160000000     0     2     0    0    0
##        stcpb      dtcpb smeansz dmeansz trans_depth res_bdy_len Sjit  Djit
## 1 1767075271 2729791068    1271      43         131           0 1567 122.2
## 2          0          0     100       0           0           0    0   0.0
## 3          0          0     100       0           0           0    0   0.0
## 4          0          0     100       0           0           0    0   0.0
##        Stime      Ltime Sintpkt Dintpkt tcprtt  synack ackdat
## 1 1424223168 1424223170  12.969   78.09 0.1693 0.05015 0.1192
## 2 1424223170 1424223170   0.005    0.00 0.0000 0.00000 0.0000
## 3 1424223170 1424223170   0.005    0.00 0.0000 0.00000 0.0000
## 4 1424223170 1424223170   0.005    0.00 0.0000 0.00000 0.0000
##   is_sm_ips_ports ct_state_ttl ct_flw_http_mthd is_ftp_login ct_ftp_cmd
## 1               0            1                1          NaN        NaN
## 2               0            2              NaN          NaN        NaN
## 3               0            2              NaN          NaN        NaN
## 4               0            2              NaN          NaN        NaN
##   ct_srv_src ct_srv_dst ct_dst_ltm ct_src_ltm ct_src_dport_ltm
## 1          7          5          2          2                2
## 2          3          3          1          1                1
## 3          3          3          1          1                1
## 4          3          3          1          1                1
##   ct_dst_sport_ltm ct_dst_src_ltm attack_cat Label
## 1                1              5   Exploits     1
## 2                1              3        DoS     1
## 3                1              3        DoS     1
## 4                1              3   Exploits     1
## 
## [4 rows x 49 columns]

One single attack instance had a reconstruction error 10 times that of other samples.

We can remove this single instance to better visualise the distribution of errors.

h2o.data.all <- h2o.data.all[-which(recon_errors_all==max(recon_errors_all)),]  

recon_errors_val <- getReconErrors(h2o.benign.dl.model, h2o.data.validation)
h2o.data.validation <- h2o.data.validation[-which(recon_errors_val==max(recon_errors_val)),]  
output_dl_model_performance(h2o.benign.dl.model, h2o.data.validation)

## We could consider all Samples below a certain Reconstruction error as "Safe"
## 
## 85.72 % cut off would miss 10 % of attacks
## 81.99 % cut off would miss 5 % of attacks
## 77.72 % cut off would miss 2.5 % of attacks
## 60.73 % Cut off is where the lowest Malicious sample Reconstruction error was, at  0.005036 .
##  The samples with reconstruction error below this are all Benign samples

Now almost all the malicious samples are above an 80% detection threshold, with only a small tail below this. Setting a detection threshold of 70% will include a few more true positives, but at the expense of a lot (10% of total) more false positives. Similarly, we could detect ALL of the malicious samples in this dataset with a threshold of 60%, but at the expense of labelling 40% of all data as suspicious. Interestingly, while our LOWEST reconstruction error for a malicious sample is lower than before, our 97.5% recall rate is still around the 78th percentile. Trying to catch every single odd outlier is likely not wise, and choosing a 97.5% or 99% recall rate is wiser than overfitting our threshold to single outlier values.


4. Checkpointed continued training of DL model

We can continue training with a stricter MSE stopping tolerance. At this stage we will also feed our model ALL the data, both training and validation splits, before moving on to consider test data performance.

We will:
- reload the last model,
- retrain on new data, with a stricter (higher accuracy) stopping criterion,
- save the updated model.

A simple for loop on the following section could feed in smaller segments of data to continue training. If reducing the memory requirement for large data is an issue then this approach would be easily implemented.

Also it would be a small step to put this into a function to update the model as new data becomes available.
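The code for this step is not shown in this section. A minimal sketch of how it might look is given below, assuming the benign model trained above is the starting point. `continueTraining` is a hypothetical helper, not the author's function. `checkpoint` is a parameter of h2o.deeplearning; H2O generally expects the architecture-related settings (hidden, activation, autoencoder and so on) to match the checkpointed model, and `epochs` to exceed the epochs already trained. The values 300 and 0.001 are placeholders, not the settings used to produce the results below.

```r
# Hypothetical helper: resume training of an existing autoencoder on a new frame.
continueTraining <- function(model, frame, predictors,
                             epochs = 300, tolerance = 0.001) {
  h2o::h2o.deeplearning(
    x = predictors,
    training_frame = frame,
    checkpoint = model@model_id,     # resume from the existing model's state
    autoencoder = TRUE,
    hidden = c(20),                  # kept identical to the checkpointed model
    activation = "TanhWithDropout",  # kept identical to the checkpointed model
    standardize = TRUE,
    l1 = 1e-3,
    max_w2 = 10,
    seed = 42,
    epochs = epochs,                 # total epochs, larger than before (placeholder)
    stopping_rounds = 3,
    stopping_metric = "AUTO",
    stopping_tolerance = tolerance   # stricter than the 0.005 used earlier (placeholder)
  )
}

# Possible usage (not run here; needs a live H2O cluster and the frames above).
# Whether to pass all rows or only the benign rows is left to the reader.
# h2o.benign.dl.model.continued <-
#   continueTraining(h2o.benign.dl.model, h2o.data.all, predictor_vars)
# h2o.saveModel(h2o.benign.dl.model.continued, path = "./models/benign", force = TRUE)
```

Wrapping the call in a function like this is also one way to update the model as new data becomes available, or to loop over smaller segments of a large frame.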

Performance of our further trained model

output_dl_model_performance(h2o.benign.dl.model.continued, h2o.data.validation) 

## We could consider all Samples below a certain Reconstruction error as "Safe"
## 
## 85.72 % cut off would miss 10 % of attacks
## 81.99 % cut off would miss 5 % of attacks
## 77.72 % cut off would miss 2.5 % of attacks
## 60.73 % Cut off is where the lowest Malicious sample Reconstruction error was, at  0.005036 .
##  The samples with reconstruction error below this are all Benign samples

Performance on unseen test data

output_dl_model_performance(h2o.benign.dl.model.continued, h2o.data.test)

## We could consider all Samples below a certain Reconstruction error as "Safe"
## 
## 67.59 % cut off would miss 10 % of attacks
## 61.88 % cut off would miss 5 % of attacks
## 59.09 % cut off would miss 2.5 % of attacks
## 42.44 % Cut off is where the lowest Malicious sample Reconstruction error was, at  0.00484 .
##  The samples with reconstruction error below this are all Benign samples

Let's also evaluate our very first model on the test data.

output_dl_model_performance(h2o.data.train.dl, h2o.data.test)

## We could consider all Samples below a certain Reconstruction error as "Safe"
## 
## 70.11 % cut off would miss 10 % of attacks
## 65.64 % cut off would miss 5 % of attacks
## 57.42 % cut off would miss 2.5 % of attacks
## 46.36 % Cut off is where the lowest Malicious sample Reconstruction error was, at  0.004535 .
##  The samples with reconstruction error below this are all Benign samples

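On the test data the two models perform similarly: the benign-trained model gives the higher cut-off at the 2.5% miss rate (59.09% against 57.42%), while the first model is higher at the 10% and 5% miss rates. We will use the benign-trained, further-trained model (h2o.benign.dl.model.continued) as our anomaly detector model.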


Relative importances of variables in reconstruction

We can also examine which variables have the biggest effect on our model. There does not seem to be a single dominant factor; rather, several variables all have an effect.

h2o.varimp_plot(h2o.benign.dl.model.continued, num_of_features = 20)

Comparing Reconstruction error thresholds.

So a question is where to put our anomaly detection threshold. Setting the threshold to catch every malicious sample in our data would lead to a huge number of false positives. Using the F1 metric, which weighs false positives and false negatives equally, would improve "accuracy", but mainly by reducing false positives at the expense of too many false negatives.

F2 Metric

By increasing our beta value we can weight our scoring towards recall (correctly predicted positives / actual positives). The F2 metric, with a beta of 2, rewards better recall even at the expense of more false positives. F2 is a good metric for selecting a threshold when a lower false negative rate is more important than a lower false positive rate.
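To make the effect of beta concrete, here is a small base R sketch of the F-beta score applied to two candidate thresholds. The helper `fbeta` and the counts in `strict` and `loose` are invented for illustration and are not taken from this data set.

```r
# F-beta score from counts of true positives, false positives, false negatives.
fbeta <- function(tp, fp, fn, beta = 1) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

# Hypothetical counts at two candidate thresholds.
strict <- list(tp = 80, fp = 10, fn = 20)   # fewer alerts, more missed attacks
loose  <- list(tp = 95, fp = 60, fn = 5)    # more alerts, fewer missed attacks

f1 <- c(strict = do.call(fbeta, c(strict, list(beta = 1))),
        loose  = do.call(fbeta, c(loose,  list(beta = 1))))
f2 <- c(strict = do.call(fbeta, c(strict, list(beta = 2))),
        loose  = do.call(fbeta, c(loose,  list(beta = 2))))

print(round(rbind(f1, f2), 3))
```

With these made-up counts F1 prefers the strict threshold while F2 prefers the loose one, which is the behaviour we want when missing an attack is costlier than raising a false alarm.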

Plot Detection threshold effects.

h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         27 minutes 56 seconds 
##     H2O cluster timezone:       Etc/UTC 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.18.0.7 
##     H2O cluster version age:    11 days  
##     H2O cluster name:           H2O_started_from_R_rstudio_ymr419 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   14.01 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.0 (2017-04-21)
if (!exists("h2o.benign.dl.model.continued.errors")){
  h2o.benign.dl.model.continued.errors <- getReconErrors(h2o.benign.dl.model.continued, h2o.data.validation)
}  
  validationLabels <- as.data.frame(h2o.data.validation[,"Label"])
  ErrorLabelFrame <- data.frame( errors=h2o.benign.dl.model.continued.errors,
                                 Labels=validationLabels)
  
  colnames(ErrorLabelFrame)[colnames(ErrorLabelFrame) == 'Reconstruction.MSE'] <- 'errors'
  quantlist <- seq(0.6, 1, 0.01)
  errorquants <- quantile(ErrorLabelFrame$errors,probs = quantlist)
  totalframennegatives <- h2o.nrow(h2o.data.validation$Label[h2o.data.validation$Label=="0",])
  positivesAboveQuant <- c(0)
  negativesAboveQuant <- c(0) 
  totalAboveQuant <- c(0) 
    
  for ( i in 1:length(errorquants)) {
    positivesAboveQuant[i] <- nrow(ErrorLabelFrame[(ErrorLabelFrame$errors > errorquants[[i]])&(ErrorLabelFrame$Label=="1"),])
    negativesAboveQuant[i] <- nrow(ErrorLabelFrame[(ErrorLabelFrame$errors > errorquants[[i]])&(ErrorLabelFrame$Label=="0"),])
    totalAboveQuant[i] <- nrow(ErrorLabelFrame[(ErrorLabelFrame$errors > errorquants[[i]]),])
  }

  actualpositives<- max(positivesAboveQuant)
  falsePositiverate <- negativesAboveQuant/totalframennegatives
  truePositivePercent <- positivesAboveQuant/totalAboveQuant
  truePositiverate <- positivesAboveQuant/actualpositives
  falsetrueratio <- positivesAboveQuant/negativesAboveQuant
  

  plot(x=quantlist,
       y=c(truePositiverate),
       type='l',
       ylim=c(0.3,1),
       xlim=c(0.65,0.95),
       col='red',
       ylab = "Percentage found",
       xlab = '',
       main = "Malicious attacks found vs. benign allowed past",
        sub = "85% Threshold catches 90% of Attacks, with only 10% False negative rate. \nTo maximise F2 score 80% threshold will catch over 95% of Attacks."
        
     )
lines(x=quantlist,y=c(1-falsePositiverate),type='l',col="green")
grid(col = "lightgray", lty = "dotted",
     lwd = par("lwd"), equilogs = TRUE)

Detection threshold.

We will set the anomaly detection threshold at the 80th quantile and consider all samples beyond this MSE error threshold suspicious.


h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         28 minutes 29 seconds 
##     H2O cluster timezone:       Etc/UTC 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.18.0.7 
##     H2O cluster version age:    11 days  
##     H2O cluster name:           H2O_started_from_R_rstudio_ymr419 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   14.01 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.0 (2017-04-21)
errorquants[16:25]
##      75%      76%      77%      78%      79%      80%      81%      82% 
## 0.008749 0.008861 0.008977 0.009101 0.009238 0.009378 0.009543 0.009725 
##      83%      84% 
## 0.009956 0.010250
errorquants[21]
##      80% 
## 0.009378
suspicionThreshold = errorquants[[21]]

All MSEs above this are to be considered suspicious.
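Index 21 picks out the 80th percentile only because quantlist starts at 0.60 and advances in steps of 0.01. As a sketch on simulated errors (the `errors` vector and the `by_index`, `by_name` and `suspicious` names are hypothetical), the same value can be looked up by the name that quantile() attaches, which does not break if quantlist is later changed:

```r
set.seed(1)
errors <- rexp(1000, rate = 100)          # simulated reconstruction errors
quantlist <- seq(0.6, 1, 0.01)
errorquants <- quantile(errors, probs = quantlist)

# Position 21 is the 80th percentile: 0.60 + 20 * 0.01 = 0.80.
by_index <- errorquants[[21]]
# The same value, looked up by the name that quantile() attaches.
by_name <- errorquants[["80%"]]

suspicious <- errors > by_name            # flag samples above the threshold
cat("share flagged:", mean(suspicious), "\n")
```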

Selecting the Suspicious subset

if (!exists("h2o.data.train.errors")){
  h2o.data.train.errors <- getReconErrors(h2o.benign.dl.model.continued, h2o.data.train)
}

if (!exists("h2o.data.all.errors")){
  h2o.data.all.errors <- getReconErrors(h2o.benign.dl.model.continued, h2o.data.all)
}

if (!exists("h2o.data.test.errors")){
h2o.data.test.errors <- getReconErrors(h2o.benign.dl.model.continued, h2o.data.test)
}

if (!exists("h2o.data.all.suspicious")){
h2o.data.all.suspicious <- h2o.data.all[which(as.vector(h2o.data.all.errors[1] > suspicionThreshold)),]
}

if (!exists("h2o.data.test.suspicious")){
h2o.data.test.suspicious <- h2o.data.test[which(as.vector(h2o.data.test.errors[1] > suspicionThreshold)),]
}

We can now try to subclassify the attack categories.
Let's try out h2o GBM on this data.
Run h2oGBM

#GBM
gbm_parameters <- list(learn_rate = 0.1,        #c(0.1,0.01),
                       max_depth = 5,           #c(5,7),
                       sample_rate = 0.8,       #c(0.3,0.5,0.7),
                       col_sample_rate = 0.9    # c(0.5,0.7)
)
tic()
if (!exists("h2o.gbm.grid")){
  h2o.gbm.grid <- h2o.grid("gbm",
                           training_frame = h2o.data.train,
                           validation_frame = h2o.data.validation,
                           #nfolds=3,  # commented out for faster running for printing.
                           x = predictor_vars,
                           y = "Label",
                           hyper_params=gbm_parameters,
                           ntrees = 100,
                           stopping_rounds = 3,
                           stopping_tolerance = 0.1,
                           stopping_metric = "misclassification",
                           model_id = "gbm_response4",
                           seed = 42)
}
toc()
## 191.001 sec elapsed
#47 secs for stopping tol 0.1 auc grid search,
#166sec for 0.1 misclassification
#237 FOR auc WITH GRID FOR SAMPLE RATE 0.3,0.5,0.7
# 89 FOR STOPPING_metric lift_top_group
#56s for 0.2 sample rate
#60sec col sample 0.9
#112S MPC
print(h2o.gbm.grid)
## H2O Grid Details
## ================
## 
## Grid ID: Grid_GBM_RTMP_sid_b01f_79_model_R_1524748087009_8 
## Used hyper parameters: 
##   -  col_sample_rate 
##   -  learn_rate 
##   -  max_depth 
##   -  sample_rate 
## Number of models: 1 
## Number of failed models: 0 
## 
## Hyper-Parameter Search Summary: ordered by increasing logloss
##   col_sample_rate learn_rate max_depth sample_rate
## 1             0.9        0.1         5         0.8
##                                                   model_ids
## 1 Grid_GBM_RTMP_sid_b01f_79_model_R_1524748087009_8_model_0
##                logloss
## 1 0.009242757067122489
h2o.gbm.model.1 <- h2o.getModel(h2o.gbm.grid@model_ids[[1]])
h2o.gbm.model.1
## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  Grid_GBM_RTMP_sid_b01f_79_model_R_1524748087009_8_model_0 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             100                      100               38276         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000          7         32    24.07000
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.002757
## RMSE:  0.05251
## LogLoss:  0.008854
## Mean Per-Class Error:  0.0162
## AUC:  0.9998
## Gini:  0.9997
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##              0     1    Error           Rate
## 0      1130007  2130 0.001881  =2130/1132137
## 1         2662 84572 0.030516    =2662/87234
## Totals 1132669 86702 0.003930  =4792/1219371
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.509902 0.972450 208
## 2                       max f2  0.280356 0.981252 305
## 3                 max f0point5  0.742980 0.981642 129
## 4                 max accuracy  0.509902 0.996070 208
## 5                max precision  0.997702 1.000000   0
## 6                   max recall  0.086182 1.000000 371
## 7              max specificity  0.997702 1.000000   0
## 8             max absolute_mcc  0.509902 0.970339 208
## 9   max min_per_class_accuracy  0.284447 0.994349 303
## 10 max mean_per_class_accuracy  0.130351 0.995873 361
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.002925
## RMSE:  0.05408
## LogLoss:  0.009243
## Mean Per-Class Error:  0.01506
## AUC:  0.9998
## Gini:  0.9996
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##             0     1    Error          Rate
## 0      282188   696 0.002460   =696/282884
## 1         602 21164 0.027658    =602/21766
## Totals 282790 21860 0.004261  =1298/304650
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.476244 0.970247 227
## 2                       max f2  0.265718 0.980209 316
## 3                 max f0point5  0.794345 0.980148 113
## 4                 max accuracy  0.479206 0.995739 226
## 5                max precision  0.997702 1.000000   0
## 6                   max recall  0.080384 1.000000 374
## 7              max specificity  0.997702 1.000000   0
## 8             max absolute_mcc  0.476244 0.967955 227
## 9   max min_per_class_accuracy  0.279336 0.994008 310
## 10 max mean_per_class_accuracy  0.080384 0.995760 374
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
if (!exists("h2o.gbm.model.1.probs.valid")){
  h2o.gbm.model.1.probs.valid <-h2o.predict(object = h2o.gbm.model.1, newdata = h2o.data.validation)
  h2o.gbm.model.1.predictions.valid <- as.vector(h2o.gbm.model.1.probs.valid[,1])
  gbm_predicts_validation <- as.factor(h2o.gbm.model.1.predictions.valid)
}
So at the max-F1 threshold, our supervised GBM correctly identifies 21164 of the 21766 actual malicious instances in the validation set. It misses 602 of them, a false negative rate of about 2.8%, and raises only 696 false positives among the 282884 actual negatives. Using the max-F2 threshold improves detection to a level of just under 1.98% false negatives.
This is much more effective than the unsupervised autoencoder model.
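The rates quoted here can be checked directly from the validation confusion matrix in the output above. The sketch below copies those four counts into a base R matrix; the variable names are illustrative only.

```r
# Validation confusion matrix at the F1-optimal threshold, copied from the
# model output above (rows: actual class, columns: predicted class).
cm <- matrix(c(282188,   696,
                  602, 21164),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

false_negative_rate <- cm["1", "0"] / sum(cm["1", ])   # missed attacks / actual attacks
false_positive_rate <- cm["0", "1"] / sum(cm["0", ])   # false alarms / actual benign
recall <- cm["1", "1"] / sum(cm["1", ])                # attacks caught / actual attacks

cat(sprintf("FNR %.4f, FPR %.4f, recall %.4f\n",
            false_negative_rate, false_positive_rate, recall))
```

The two error rates agree with the Error column that H2O prints for each row of the matrix.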
h2o.auc(h2o.gbm.model.1,train=TRUE,valid=TRUE)
##  train  valid 
## 0.9998 0.9998
Let's create a data frame containing only the samples that this model predicts to be malicious.
if (!exists("h2o.data.gbmpredictedMalicious")){
  h2o.data.gbmpredictedMalicious <- h2o.data.validation[as.h2o(gbm_predicts_validation==1),]
}
## Test Data Performance
r if (!exists("h2o.gbm.model.1.perf.test")) { h2o.gbm.model.1.perf.test <- h2o.performance( model = h2o.gbm.model.1, newdata = h2o.data.test) } h2o.auc(h2o.gbm.model.1.perf.test,train=TRUE,valid=TRUE)
## [1] 0.9996
# Look at the hyperparameters for the best model
print(h2o.gbm.model.1@model[["model_summary"]])
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             100                      100               38276         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000          7         32    24.07000
Grid search of GBM - commented out for faster execution

gbm_parameters <- list(learn_rate = c(0.1),        # c(0.1, 0.01)
                       max_depth = c(7),           # c(5, 7)
                       sample_rate = c(0.7),       # c(0.5, 0.7)
                       col_sample_rate = c(0.5))   # c(0.5, 0.7)
tic()
gbm_gridperf1 <- h2o.grid("gbm",
                training_frame = h2o.data.train,
                validation_frame = h2o.data.validation,
                #nfolds=3,
                x = predictor_vars,
                y = response,
                hyper_params = gbm_parameters,
                ntrees = 30,
                stopping_rounds = 2,
                stopping_tolerance = 1e-2, # loose tolerance for fast notebook runs; initially 1e-6
                #stopping_metric = "AUC",
                model_id = "gbm_response2",
                seed = 42)
toc()
# 1627 secs for grid search

Grid search returned best parameters of
#Best parameters
# col_sample_rate learn_rate max_depth sample_rate
# 0.5 0.1 7 0.7

Display h2o GBM MSEs

#print(gbm_gridperf1)

# Grab the top GBM model, chosen by validation AUC
gbm_model1 <- h2o.gbm.model.1

gbm_validationdata_predictions <-h2o.predict(object = gbm_model1,
                                newdata = h2o.data.validation)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
gbm_predicts_validation_class <- as.vector(gbm_validationdata_predictions[,1]) # take first column (predicted class)
gbm_predicts_validation <- as.factor(gbm_predicts_validation_class)
h2o.auc(gbm_model1,train=TRUE,valid=TRUE)
##  train  valid 
## 0.9998 0.9998

## Model summary
print(gbm_model1@model[["model_summary"]])
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             100                      100               38276         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000          7         32    24.07000

Model performance on test data

AUC and MSE

gbm_model1_testdata_performance <- h2o.performance(
                                    model = gbm_model1,
                                    newdata = h2o.data.test)

gbm_testdata_predictions <- h2o.predict(object = gbm_model1,
                                newdata = h2o.data.test)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
gbm_predicts_test_class <- as.factor(as.vector(gbm_testdata_predictions[,1]))
gbm_predicts_test <- as.factor(as.vector(gbm_testdata_predictions[,1]))
h2o.auc(gbm_model1_testdata_performance,train=TRUE,valid=TRUE)
## [1] 0.9996
#str(gbm_model1_testdata_performance)
gbm_model1_testdata_performance@metrics$MSE
## [1] 0.006625
#gbm_model1_testdata_performance$

Plot AUC from best gbm

Plot h2o AUC using pROC

Let's use the pROC library to calculate our AUC score (remember, an AUC of 0.5 is random and 1 is perfect) and plot a chart:

library(pROC)

validation_actual <- as.numeric(as.vector(h2o.data.validation$Label))
# Convert via character so the factor yields the 0/1 labels rather than level codes 1/2
validation_predictions <- as.numeric(as.character(gbm_predicts_validation))

PlotAUC <- function(response, predictor) {
  roc_curve <- roc(response = response,
                   predictor = predictor)
  plot(roc_curve, print.thres = "best",
       main = paste('Validation data AUC:', round(roc_curve$auc[[1]], 3)))
  abline(h = 1, col = 'blue')
  abline(h = 0, col = 'green')
}
PlotAUC(validation_actual, validation_predictions)

example code:
F_meas(as.factor(validation_predictions), as.factor(as.vector(validation_actual)), relevant = '1',
beta = 1, na.rm = TRUE)

F_meas(as.factor(validation_predictions), as.factor(as.vector(validation_actual)), relevant = '1',
beta = 5, na.rm = TRUE)

For cybersecurity, sensitivity (the true positive rate, or more importantly here, a low false negative rate) is perhaps more important than specificity (a low false positive rate). We are perhaps more interested in catching attacks than in avoiding wrongly suspected flows.

Increasing the beta of the F-measure adds weight to recall (sensitivity) over precision, so, as a metric, it penalises false negatives more than false positives.
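
To make the effect of beta concrete, here is a small sketch of the F-beta score written in terms of raw counts, F_beta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP). The helper `f_beta` is hypothetical (it is not part of caret or h2o), and the counts are the ones from the validation confusion matrix at the max-F1 threshold.

```r
# Hypothetical helper (not from caret or h2o): F-beta from raw confusion counts
f_beta <- function(tp, fp, fn, beta = 1) {
  b2 <- beta^2
  (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
}

# Counts from the validation confusion matrix at the max-F1 threshold
tp <- 21164; fp <- 696; fn <- 602

f1 <- f_beta(tp, fp, fn, beta = 1)
f2 <- f_beta(tp, fp, fn, beta = 2)

# With beta = 2, 100 extra false negatives cost more than 100 extra false positives
f2_more_fp <- f_beta(tp, fp + 100, fn, beta = 2)
f2_more_fn <- f_beta(tp, fp, fn + 100, beta = 2)
c(F1 = f1, F2 = f2, F2_more_FP = f2_more_fp, F2_more_FN = f2_more_fn)
```

With beta = 1 the result reproduces the `max f1` value reported by H2O; with beta = 2 each false negative is weighted four times as heavily as a false positive in the denominator.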

Now we will run an h2o gradient boosting machine to try to learn to predict the different attack categories (attack_cat) on our training data set.

We will use balance_classes to get even samples of the different types. We will train this ONLY on the preselected suspicious flows, i.e. those labelled malicious by our previous GBM model.
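
As a rough illustration of what class balancing does, the sketch below oversamples every class of a toy data frame to the size of the largest class. It is only a conceptual analogue, not H2O's implementation of `balance_classes`; the helper `balance_by_oversampling` and the toy columns are assumptions made for this example.

```r
# Hypothetical helper: resample each class (with replacement) up to the size
# of the largest class. Conceptual analogue only, not H2O's own algorithm.
balance_by_oversampling <- function(df, class_col, seed = 42) {
  set.seed(seed)
  groups <- split(df, df[[class_col]], drop = TRUE)
  target_n <- max(vapply(groups, nrow, integer(1)))
  balanced <- lapply(groups, function(g) {
    g[sample(nrow(g), target_n, replace = TRUE), , drop = FALSE]
  })
  do.call(rbind, balanced)
}

# Toy data: 8 "Generic" flows and 2 "Worms" flows
toy <- data.frame(attack_cat = c(rep("Generic", 8), rep("Worms", 2)),
                  sbytes = 1:10)
balanced_toy <- balance_by_oversampling(toy, "attack_cat")
table(balanced_toy$attack_cat)
```

After balancing, rare categories carry as much weight during training as frequent ones, which is why the training confusion matrix later shows roughly equal row totals per class.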

#GBM
gbm_parameters <- list(learn_rate = 0.1, #c(0.1,0.01),
                   max_depth = 5, #c(5,7),
                   sample_rate = 0.8, #c(0.3,0.5,0.7),
                   col_sample_rate = 0.9 #  c(0.5,0.7)
                   
                   )
tic()

if (!exists("h2o.gbm.grid.cat")){
  h2o.gbm.grid.cat<- h2o.grid("gbm",
                training_frame = h2o.data.gbmpredictedMalicious,
                validation_frame = h2o.data.gbmpredictedMalicious,
                #nfolds=3,   # commented out for faster running for printing.
                x = predictor_vars,
                y = "attack_cat",
                balance_classes = TRUE,
                hyper_params=gbm_parameters,
                ntrees = 100,
                stopping_rounds =  3,
                stopping_tolerance =  0.05,
                stopping_metric = "MSE",
                model_id = "gbm_response4",
                seed = 42)
}
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
toc()
## 93.254 sec elapsed
print(h2o.gbm.grid.cat)
## H2O Grid Details
## ================
## 
## Grid ID: Grid_GBM_RTMP_sid_a52c_151_model_R_1524748087009_34 
## Used hyper parameters: 
##   -  col_sample_rate 
##   -  learn_rate 
##   -  max_depth 
##   -  sample_rate 
## Number of models: 1 
## Number of failed models: 0 
## 
## Hyper-Parameter Search Summary: ordered by increasing logloss
##   col_sample_rate learn_rate max_depth sample_rate
## 1             0.9        0.1         5         0.8
##                                                     model_ids
## 1 Grid_GBM_RTMP_sid_a52c_151_model_R_1524748087009_34_model_0
##              logloss
## 1 0.2917082773410527
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         39 minutes 55 seconds 
##     H2O cluster timezone:       Etc/UTC 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.18.0.7 
##     H2O cluster version age:    11 days  
##     H2O cluster name:           H2O_started_from_R_rstudio_ymr419 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   14.25 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.0 (2017-04-21)
h2o.gbm.model.cat1 <- h2o.getModel(h2o.gbm.grid.cat@model_ids[[1]])
h2o.gbm.model.cat1
## Model Details:
## ==============
## 
## H2OMultinomialModel: gbm
## Model ID:  Grid_GBM_RTMP_sid_a52c_151_model_R_1524748087009_34_model_0 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              52                      520              239805         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000         10         32    26.64038
## 
## 
## H2OMultinomialMetrics: gbm
## ** Reported on training data. **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("RTMP_sid_a52c_151")`
## MSE: (Extract with `h2o.mse`) 0.2785
## RMSE: (Extract with `h2o.rmse`) 0.5277
## Logloss: (Extract with `h2o.logloss`) 0.7832
## Mean Per-Class Error: 0.3335
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##                Analysis Backdoor  DoS Exploits Fuzzers Generic   NA
## Analysis           2248        0 1047     7048     573       0    0
## Backdoor            363     2728  731     6552     487      60    0
## DoS                  71       45 3276     7155     171     138   18
## Exploits             51       54  451     9904     100      80   27
## Fuzzers              73       40  202     1408    8949       7  254
## Generic               0        0   27      163      11   10761    0
## NA                    0        0    0      621    4377      48 5825
## Reconnaissance        0        0  200     1482       0      18    0
## Shellcode             0        0    0      181       0       0    0
## Worms                 0        0    0     1725       0       0    0
## Totals             2806     2867 5934    36239   14668   11112 6124
##                Reconnaissance Shellcode Worms  Error               Rate
## Analysis                    0         0     0 0.7941 =   8,668 / 10,916
## Backdoor                    0         0     0 0.7502 =   8,193 / 10,921
## DoS                        51         0     0 0.7001 =   7,649 / 10,925
## Exploits                  249         3     0 0.0930 =   1,015 / 10,919
## Fuzzers                     0         0     0 0.1815 =   1,984 / 10,933
## Generic                     0         0     0 0.0183 =     201 / 10,962
## NA                         63         0     0 0.4673 =   5,109 / 10,934
## Reconnaissance           9226         0     0 0.1556 =   1,700 / 10,926
## Shellcode                   0     10729     0 0.0166 =     181 / 10,910
## Worms                       0         0  9196 0.1580 =   1,725 / 10,921
## Totals                   9589     10732  9196 0.3334 = 36,425 / 109,267
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-10 Hit Ratios: 
##     k hit_ratio
## 1   1  0.666642
## 2   2  0.852371
## 3   3  0.940943
## 4   4  0.975894
## 5   5  0.995571
## 6   6  0.997190
## 7   7  1.000000
## 8   8  1.000000
## 9   9  1.000000
## 10 10  1.000000
## 
## 
## H2OMultinomialMetrics: gbm
## ** Reported on validation data. **
## 
## Validation Set Metrics: 
## =====================
## 
## Extract validation frame with `h2o.getFrame("RTMP_sid_a52c_151")`
## MSE: (Extract with `h2o.mse`) 0.1019
## RMSE: (Extract with `h2o.rmse`) 0.3192
## Logloss: (Extract with `h2o.logloss`) 0.2917
## Mean Per-Class Error: 0.3335
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##                Analysis Backdoor DoS Exploits Fuzzers Generic  NA
## Analysis             43        0  20      135      11       0   0
## Backdoor              6       45  12      108       8       1   0
## DoS                   8        5 374      817      20      16   2
## Exploits             17       18 147     3241      33      26   9
## Fuzzers              11        6  31      212    1357       1  39
## Generic               0        0  36      203      13   12747   0
## NA                    0        0   0       39     276       3 367
## Reconnaissance        0        0  21      156       0       2   0
## Shellcode             0        0   0        2       0       0   0
## Worms                 0        0   0        3       0       0   0
## Totals               85       74 641     4916    1718   12796 417
##                Reconnaissance Shellcode Worms  Error             Rate
## Analysis                    0         0     0 0.7943 =      166 / 209
## Backdoor                    0         0     0 0.7500 =      135 / 180
## DoS                         6         0     0 0.7003 =    874 / 1,248
## Exploits                   81         1     0 0.0929 =    332 / 3,573
## Fuzzers                     0         0     0 0.1811 =    300 / 1,657
## Generic                     0         0     0 0.0194 =   252 / 12,999
## NA                          4         0     0 0.4673 =      322 / 689
## Reconnaissance            971         0     0 0.1557 =    179 / 1,150
## Shellcode                   0       119     0 0.0165 =        2 / 121
## Worms                       0         0    16 0.1579 =         3 / 19
## Totals                   1062       120    16 0.1174 = 2,565 / 21,845
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
## =======================================================================
## Top-10 Hit Ratios: 
##     k hit_ratio
## 1   1  0.882582
## 2   2  0.960998
## 3   3  0.982650
## 4   4  0.993362
## 5   5  0.997528
## 6   6  0.999497
## 7   7  1.000000
## 8   8  1.000000
## 9   9  1.000000
## 10 10  1.000000

We got an overall validation error rate of about 12% (0.1174) when predicting across all categories.

h2o.data.test.malicious <-  h2o.data.test[as.h2o(gbm_predicts_test==1),]
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Predict attack categories for our suspicious data

if (!exists("h2o.gbm.model.cat1.probs.sus")){
  h2o.gbm.model.cat1.probs.sus <-h2o.predict(object = h2o.gbm.model.cat1,
                                            newdata = h2o.data.test.malicious)
}
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
h2o.gbm.model.cat1.predictions.sus <- h2o.predict(object = h2o.gbm.model.cat1,
                                            newdata = h2o.data.test.malicious,
                                            type = 'response')
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
h2o.gbm.model.cat1.predicted.class.sus <- h2o.gbm.model.cat1.predictions.sus[,1]
predicted_cat <- as.factor(as.vector(h2o.gbm.model.cat1.predicted.class.sus))
actual_cat <- as.factor(as.vector(h2o.data.test.malicious$attack_cat))


confusionMatrix(predicted_cat,actual_cat)
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Analysis Backdoor    DoS Exploits Fuzzers Generic     NA
##   Analysis              1        2     10       19       2       2      4
##   Backdoor              0       41      5        7       0       1      0
##   DoS                 215      217   1741     1939     220     268     12
##   Exploits            950     1078   8019    22792    1389    1863    769
##   Fuzzers              14       10    101      209    6233     106   3588
##   Generic               0       19    188      388     298  147666     47
##   NA                   15        4     45      150    1032      25   1314
##   Reconnaissance        2       10     74      642      13      19      1
##   Shellcode             0        9     17       64       0       9      6
##   Worms                 0        0      0        3       0       0      0
##                 Reference
## Prediction       Reconnaissance Shellcode  Worms
##   Analysis                    3         0      0
##   Backdoor                    0         0      0
##   DoS                       274         0      1
##   Exploits                 1632       222     68
##   Fuzzers                     1         5      2
##   Generic                    32        27     18
##   NA                          1         3      0
##   Reconnaissance           6478         0      2
##   Shellcode                   0       629      0
##   Worms                       0         0     11
## 
## Overall Statistics
##                                         
##                Accuracy : 0.876         
##                  95% CI : (0.875, 0.878)
##     No Information Rate : 0.703         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.744         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: Analysis Class: Backdoor Class: DoS
## Sensitivity               0.00083542        0.029496    0.17069
## Specificity               0.99980198        0.999939    0.98451
## Pos Pred Value            0.02325581        0.759259    0.35625
## Neg Pred Value            0.99439164        0.993674    0.95941
## Prevalence                0.00561192        0.006517    0.04782
## Detection Rate            0.00000469        0.000192    0.00816
## Detection Prevalence      0.00020160        0.000253    0.02291
## Balanced Accuracy         0.50031870        0.514718    0.57760
##                      Class: Exploits Class: Fuzzers Class: Generic
## Sensitivity                    0.869         0.6785          0.985
## Specificity                    0.915         0.9802          0.984
## Pos Pred Value                 0.588         0.6070          0.993
## Neg Pred Value                 0.980         0.9855          0.965
## Prevalence                     0.123         0.0431          0.703
## Detection Rate                 0.107         0.0292          0.692
## Detection Prevalence           0.182         0.0481          0.697
## Balanced Accuracy              0.892         0.8293          0.984
##                      Class: NA Class: Reconnaissance Class: Shellcode
## Sensitivity            0.22888                0.7693          0.70993
## Specificity            0.99386                0.9963          0.99951
## Pos Pred Value         0.50753                0.8946          0.85695
## Neg Pred Value         0.97899                0.9906          0.99879
## Prevalence             0.02692                0.0395          0.00415
## Detection Rate         0.00616                0.0304          0.00295
## Detection Prevalence   0.01214                0.0339          0.00344
## Balanced Accuracy      0.61137                0.8828          0.85472
##                      Class: Worms
## Sensitivity             0.1078431
## Specificity             0.9999859
## Pos Pred Value          0.7857143
## Neg Pred Value          0.9995733
## Prevalence              0.0004782
## Detection Rate          0.0000516
## Detection Prevalence    0.0000656
## Balanced Accuracy       0.5539145

With an overall accuracy of 87.6% when assigning categories to the test-set items labelled as malicious, our model is performing somewhat satisfactorily. However, it is clear from the per-class sensitivity and detection rates that this success is dominated by the much more frequently occurring categories, and that classification is poor for categories with few samples in the training set.
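
One way to see how much the frequent categories dominate is to compare the prevalence-weighted recall (which is approximately the overall accuracy) with the unweighted mean recall across classes. This is a sketch using the sensitivity and prevalence values copied from the caret output above; the variable names are illustrative.

```r
# Per-class sensitivity (recall) and prevalence, copied from the caret output
sensitivity <- c(Analysis = 0.00083542, Backdoor = 0.029496, DoS = 0.17069,
                 Exploits = 0.869, Fuzzers = 0.6785, Generic = 0.985,
                 "NA" = 0.22888, Reconnaissance = 0.7693,
                 Shellcode = 0.70993, Worms = 0.1078431)
prevalence <- c(0.00561192, 0.006517, 0.04782, 0.123, 0.0431, 0.703,
                0.02692, 0.0395, 0.00415, 0.0004782)

macro_recall    <- mean(sensitivity)               # every class counts equally
weighted_recall <- sum(sensitivity * prevalence)   # close to overall accuracy
round(c(macro = macro_recall, weighted = weighted_recall), 3)
```

The unweighted mean recall comes out at roughly 0.45, well below the 0.876 accuracy, because the Generic class alone accounts for about 70% of the flows.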


Save our Label predictions to csv

flow_id <- seq_along(gbm_predicts_test)
# Convert the factor via character so the CSV holds 0/1 labels, not factor level codes
rows <- cbind(flow_id, as.numeric(as.character(gbm_predicts_test)))
colnames(rows) <- c("Flow_id","Label")
write.csv(rows,file="GBMPredictiedLabels.csv", row.names=FALSE)

Checking and saving attack Category predictions to CSV in a numeric format.

length(gbm_predicts_test)
## [1] 1016019
length(predicted_cat)
## [1] 213296
length(gbm_predicts_test[gbm_predicts_test==1])
## [1] 213296
predicted_cat_numeric <- predicted_cat

# encode as numbers
# the level must be "Worms" (as in the data); "Worm" would silently map those rows to NA
predicted_cat_numeric <- factor(predicted_cat,
               levels = c("NA","Reconnaissance","Fuzzers","Analysis","Backdoor","Exploits","Generic","Shellcode","Worms","DoS"),
               labels = c(0,1, 2, 3,4,5,6,7,8,9))

predicted_cat_numeric <- as.character(predicted_cat_numeric)
predicted_cat_numeric <- as.numeric(predicted_cat_numeric)


table(predicted_cat)
## predicted_cat
##       Analysis       Backdoor            DoS       Exploits        Fuzzers 
##             43             54           4887          38782          10269 
##        Generic             NA Reconnaissance      Shellcode          Worms 
##         148683           2589           7241            734             14
table(predicted_cat_numeric)
## predicted_cat_numeric
##      0      1      2      3      4      5      6      7      8      9 
##   2589   7241  10269     43     54  38782 148683    734     14   4887
# add categories to all test data labels
predictedTestCats <- as.numeric(as.character(gbm_predicts_test))
sum(predictedTestCats)
## [1] 213296
length(predicted_cat_numeric)
## [1] 213296
predictedTestCats[predictedTestCats==1] <- predicted_cat_numeric

length(predictedTestCats)
## [1] 1016019
flow_id <- seq_along(predictedTestCats)
# Write the numeric category codes built above, not the binary labels
rows2 <- cbind(flow_id, predictedTestCats)
colnames(rows2) <- c("Flow_id","Label")
write.csv(rows2,file="GBMPredictiedAttackLabels.csv", row.names=FALSE)

Predicted categories of the predicted attacks

table(predicted_cat)
## predicted_cat
##       Analysis       Backdoor            DoS       Exploits        Fuzzers 
##             43             54           4887          38782          10269 
##        Generic             NA Reconnaissance      Shellcode          Worms 
##         148683           2589           7241            734             14

Actual test Categories for all test data below

table(as.vector(h2o.data.test$attack_cat))
## 
##       Analysis       Backdoor            DoS       Exploits        Fuzzers 
##           1459           1390          10262          26423          13334 
##        Generic             NA Reconnaissance      Shellcode          Worms 
##         149998         803737           8426            886            104

Conclusion

We had moderate success predicting the attack categories.
The Autoencoder was also moderately successful, at least letting us quickly discard more than 60% of flows as "benign". There may always be benign samples that look very similar to attacks on these data fields, so a considerable number of false positives from this type of data may be inevitable.

Next Steps

  • Tighter stopping criteria. I had set fairly easy early stopping criteria in model training, to speed up runs whilst producing this notebook. I was stopping model building when errors were not improving by more than 1% over a few scoring rounds. I had previously run individual models with higher required accuracy, and I will go back and rerun all the models with the stopping tolerance set below a tenth of a percent. This will run for a lot longer, but it should return a more accurate model of the benign data, and we can assess whether this improves the anomaly detection. It is possible that it may NOT perform any better, in spite of longer model training.

  • Optimising on the F2 metric in code. Coding to extract the detection threshold that maximises the F2 metric. We could also code a customisable beta for the F-measure to set our detection threshold programmatically that way.

  • Smarter synthetic columns. The various synthetic columns of "instances similar in the last 100" are quite rudimentary and a machine learning informed approach to creating this kind of synthetic data column may lead to much greater success and insight into the patterns of attacks. A model that is "keeping score" of attack patterns happening as they come in may also have better success in using current state of recent flows to inform new classification decisions.
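
The F-beta threshold step mentioned in the list above could be sketched roughly as follows. This is a base-R illustration under stated assumptions, not the notebook's method: the helper `best_fbeta_threshold`, the threshold grid, and the simulated scores are all hypothetical stand-ins for the model's predicted probabilities and the true labels.

```r
# Hypothetical helper: pick the cut-off that maximises F-beta on labelled scores
best_fbeta_threshold <- function(scores, truth, beta = 2,
                                 thresholds = seq(0.01, 0.99, by = 0.01)) {
  b2 <- beta^2
  f_values <- vapply(thresholds, function(th) {
    predicted <- scores >= th
    tp <- sum(predicted & truth == 1)
    fp <- sum(predicted & truth == 0)
    fn <- sum(!predicted & truth == 1)
    denom <- (1 + b2) * tp + b2 * fn + fp
    if (denom == 0) 0 else (1 + b2) * tp / denom
  }, numeric(1))
  best <- which.max(f_values)
  list(threshold = thresholds[best], f_beta = f_values[best])
}

# Simulated data standing in for predicted probabilities and true labels
set.seed(42)
truth  <- rbinom(1000, 1, 0.1)
scores <- ifelse(truth == 1, rbeta(1000, 4, 2), rbeta(1000, 2, 4))

f1_choice <- best_fbeta_threshold(scores, truth, beta = 1)
f2_choice <- best_fbeta_threshold(scores, truth, beta = 2)
c(F1_threshold = f1_choice$threshold, F2_threshold = f2_choice$threshold)
```

Raising beta moves the chosen threshold down (or leaves it unchanged), which flags more flows and trades extra false positives for fewer missed attacks.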

DavidMcCann April 2018