library(knitr)
knitr::opts_chunk$set(cache=FALSE,message=FALSE)
# error=FALSE stops knitting at the first error; error=TRUE records the error in the output and carries on
knitr::opts_chunk$set(error=TRUE)
knitr::opts_knit$set(progress=FALSE)
library(h2o)
h2o.init(min_mem_size = "16g")
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /tmp/RtmpusjwGc/h2o_rstudio_started_from_r.out
## /tmp/RtmpusjwGc/h2o_rstudio_started_from_r.err
##
##
## Starting H2O JVM and connecting: .. Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 40 milliseconds
## H2O cluster timezone: Etc/UTC
## H2O data parsing timezone: UTC
## H2O cluster version: 3.18.0.7
## H2O cluster version age: 11 days
## H2O cluster name: H2O_started_from_R_rstudio_ymr419
## H2O cluster total nodes: 1
## H2O cluster total memory: 15.33 GB
## H2O cluster total cores: 2
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.0 (2017-04-21)
Directory set up
Source Functions File
# Whilst in some ways one notebook for reproducible code is preferable, once the file became unmanageable it was wiser to split the helpers into a separate functions file. For larger works this is a much cleaner approach; chunk-style notebooks suit the development of small works only.
#source("networkAnomaliesFunctions.R")
install_latest_h2o <- function() {
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) {remove.packages("h2o") }
pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }}
install.packages("h2o", type="source",
repos=(c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
# If clustering, all versions should match. Install specific h2o version to match cluster nodes.
# install.packages(file.path(h2oDir,"R/h2o_3.18.0.5.tar.gz"), repos = NULL, type = "source")
}
h2o.importFileFromProccesedOrData <- function (dataDir,fileName,dest_frame) {
#'Here is one I prepared earlier!"
# Load the preprocessed copy of the file if it exists,
# otherwise load the raw file from the given dataDir.
filePath <- file.path(dataDir,"processed", fileName)
if (file.exists(filePath)) {
print(paste("Reading processed file from ",filePath))
} else {
filePath <- file.path(dataDir, fileName)
print(paste("Reading data file from ",filePath))}
# dest_frame is used as the key of the resulting frame on the H2O cluster
return( h2o.importFile(path=filePath, destination_frame=dest_frame, sep=",",na.strings="-") )
}
exportPreProcessedFiles <- function(Dir="data", processedDir="processed" ){
file_name="trainWithLabels.csv"
try(dir.create(file.path(Dir, processedDir), showWarnings = FALSE))
file_path= file.path(Dir, processedDir,file_name)
if (file.exists(file_path)) {print("preprocessed file exists")
} else {
try(h2o.exportFile(h2o.data.all,path = file_path))
}
file_name="testWithLabels.csv"
file_path= file.path(Dir, processedDir,file_name)
if (file.exists(file_path)) {print("preprocessed file exists")
} else {
try(h2o.exportFile(h2o.data.test,path = file_path))
}
print("Pre processed files Saved For next Time.")
}
copyFromS3 <- function(bucket=bucketname,filen=file_path) {
from_path <- file.path("s3:/",bucket,filen)
to_path <- file.path('.',filen)
command <- paste("aws s3 cp ",from_path," ",to_path,sep='')
system(command)
}
copyToS3 <- function(bucket=bucketname,filen=filename) {
from_path <- filen
to_path <- file.path("s3:/",bucket,filen)
command <- paste("aws s3 cp ",from_path," ",to_path,sep='')
system(command)
}
NewestFileInDir <- function(pathDir=".",pattern=NULL) {
# List only the files matching the regular-expression pattern (NULL matches every file)
files <- list.files(pathDir, pattern, full.names = TRUE)
details <- file.info(files)
# Sort by modification time, newest first
details <- details[order(details$mtime, decreasing = TRUE), ]
newestfile <- rownames(details)[1]
return(newestfile)
}
Note - This analysis was implemented using h2o version 3.18.0.5 on RStudio 1.1.442. An installation of h2o is needed to run all of the functionality. Knitr does not always render the output of h2o calls as it appears on screen, so this notebook may not look the same as when the program is run interactively.
Install necessary R Packages
# Load Packages
pkgs <- c(
"knitr",
"rmarkdown",
"tictoc",
"pROC",
"ggridges",
"beepr",
"data.table",
"lemon",
"tictoc",
"iptools",
"caret",
"h2o",
"kableExtra",
"gbm",
"ggridges",
"tidyr",
"dplyr",
"ggplot2",
"scales",
"readr"
# "ff",
# "statmod",
# "stats",
# "graphics",
# "jsonlite",
# "tools",
# "utils",
# "tidyverse",
# "devtools",
# "RCurl",
# "methods",
# "xgboost",
# "randomForest",
)
# Check the packages, install any that are missing, then load them all
for (pkg in pkgs) {
if (!(pkg %in% rownames(installed.packages()))) {
install.packages(pkg)}
require(pkg, character.only=TRUE)
}
options("h2o.use.data.table"=TRUE)
knit_print.data.frame <- lemon_print
h2oSourceDir <- "C:/Users/D/Downloads/installers/h2o-3.18.0.5"
## install_latest_h2o ()
Network flow: A network flow is a network traffic stream defined with a common set of identifiers, typically, the same
If any of these variables change then a new flow is started. As a result, it's possible to have several flows between the same client and server nodes at a given time. In addition to the above primary features, as shown in your dataset, several other useful features can be defined to describe the characteristics of a network flow.
Dataset: This third party dataset includes network flow data produced by nine different attack types within realistic normal user activities. Following attack types are included in the dataset.
GOALS:
G1. Employ (or develop) an unsupervised machine learning technique to spot malicious network flows in the dataset. Then further extend your analysis to divide malicious group into nine different clusters (groups). At this stage you may not be able to assign labels for each group you obtained.
G2. Employ a one-class modelling approach (including novelty/anomaly detection) to spot malicious network flows in the dataset. The idea here is to build the model using only benign samples and then use the trained model to identify new/unknown data that it has not been trained with and was not previously aware of, with the help of either statistical or machine-learning-based approaches.
G3. Employ supervised learning techniques to spot malicious network flows in the dataset.
Labels for each class will be provided at this stage.
Data description:
Train.csv - the training set,
Test.csv - the test set and
Features.txt - description of each feature defined in the dataset.
Evaluation metric: the accuracy (true/false positive rates) and computational cost for testing.
Submission format: Submission files should contain two columns: Flow_id and Label.
The fields in every row are separated by a comma ( "," ).
Labels contain benign connection (0) and malicious connection (1) for G1 and G2.
For G3, labels contain benign connection (0) and nine types of attacks (1-9).
The file should contain a header and may have the following format:
Flow_id,Label
1,1
2,1
3,2
4,0
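A minimal sketch of writing a file in this layout from R. The prediction values and the output path are hypothetical placeholders, not results from this analysis.

```r
# Hypothetical predictions: flow ids with their predicted labels
submission <- data.frame(
  Flow_id = 1:4,
  Label   = c(1L, 1L, 2L, 0L)
)

# Write a comma-separated file with a header row, no row names and no quoting
submission_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, submission_file, row.names = FALSE, quote = FALSE)

# Read the file back as raw lines to confirm the layout
submission_lines <- readLines(submission_file)
print(submission_lines)
```

Setting `row.names = FALSE` matters here: by default `write.csv` adds an unnamed first column of row numbers, which would break the two-column format.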
20th March 2018
A challenging brief to use unsupervised anomaly detection to spot network attacks.
While some decision tree classifiers can be used to build models in a supervised fashion quite easily, I decided to tackle the unsupervised learning challenge, using deep learning autoencoders to attempt to spot anomalies in the data flow.
Implementing this on h2o built on an Amazon Web Services Virtual machine brought speed to the model processing, but brought a host of its own configuration challenges.
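The reconstruction-error idea behind autoencoder anomaly detection can be illustrated without an H2O cluster. The sketch below is not the deep learning model used in this report: it swaps in PCA from base R as a linear stand-in for an autoencoder, and the data, feature names and 99% threshold are all made up for illustration. The model is fitted on "benign" rows only, and new rows that it reconstructs poorly are flagged.

```r
set.seed(42)

# Synthetic "benign" flows: three features driven by one underlying factor
n <- 500
f1 <- rnorm(n)
benign <- cbind(
  f1 = f1,
  f2 = 2 * f1 + rnorm(n, sd = 0.1),
  f3 = -f1 + rnorm(n, sd = 0.1)
)

# Fit PCA on benign data only; the first k components act as the bottleneck
pca <- prcomp(benign, center = TRUE, scale. = TRUE)
k <- 1

reconstruction_error <- function(x, pca, k) {
  z <- scale(x, center = pca$center, scale = pca$scale)  # standardise with training statistics
  rot <- pca$rotation[, 1:k, drop = FALSE]
  recon <- (z %*% rot) %*% t(rot)                        # encode, then decode
  rowMeans((z - recon)^2)                                # mean squared error per row
}

# Flag anything reconstructed worse than 99% of the benign training rows
threshold <- quantile(reconstruction_error(benign, pca, k), probs = 0.99)

new_flows <- rbind(
  typical = c(f1 = 1, f2 = 2, f3 = -1),   # follows the benign pattern
  unusual = c(f1 = 1, f2 = -2, f3 = -1)   # breaks the f1/f2 relationship
)
flagged <- reconstruction_error(new_flows, pca, k) > threshold
print(flagged)
```

A non-linear autoencoder follows the same recipe, replacing the PCA projection with a trained encoder and decoder.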
Let's examine the data set.
features <- read.csv("Features.txt" ,sep=",",na.strings="-")
features
| Name | DataType | Description |
|---|---|---|
| srcip | nominal | Source IP address |
| sport | integer | Source port number |
| dstip | nominal | Destination IP address |
| dsport | integer | Destination port number |
| proto | nominal | Transaction protocol |
| state | nominal | Indicates to the state and its dependent protocol |
| dur | Float | Record total duration |
| sbytes | Integer | Source to destination transaction bytes |
| dbytes | Integer | Destination to source transaction bytes |
| sttl | Integer | Source to destination time to live value |
| dttl | Integer | Destination to source time to live value |
| sloss | Integer | Source packets retransmitted or dropped |
| dloss | Integer | Destination packets retransmitted or dropped |
| service | nominal | http, ftp, smtp, ssh, dns, ftp-data ,irc |
| Sload | Float | Source bits per second |
| Dload | Float | Destination bits per second |
| Spkts | integer | Source to destination packet count |
| Dpkts | integer | Destination to source packet count |
| swin | integer | Source TCP window advertisement value |
| dwin | integer | Destination TCP window advertisement value |
| stcpb | integer | Source TCP base sequence number |
| dtcpb | integer | Destination TCP base sequence number |
| smeansz | integer | Mean of the flow packet size transmitted by the src |
| dmeansz | integer | Mean of the flow packet size transmitted by the dst |
| trans_depth | integer | Represents the pipelined depth into the connection of http request/response transaction |
| res_bdy_len | integer | Actual uncompressed content size of the data transferred from the server’s http service. |
| Sjit | Float | Source jitter (mSec) |
| Djit | Float | Destination jitter (mSec) |
| Stime | Timestamp | record start time |
| Ltime | Timestamp | record last time |
| Sintpkt | Float | Source interpacket arrival time (mSec) |
| Dintpkt | Float | Destination interpacket arrival time (mSec) |
| tcprtt | Float | TCP connection setup round-trip time, the sum of synack and ackdat |
| synack | Float | TCP connection setup time, the time between the SYN and the SYN_ACK packets. |
| ackdat | Float | TCP connection setup time, the time between the SYN_ACK and the ACK packets. |
| is_sm_ips_ports | Binary | If source and destination IP addresses equal and port numbers equal then, this variable takes value 1 else 0 |
| ct_state_ttl | Integer | No. for each state according to specific range of values for source/destination time to live. |
| ct_flw_http_mthd | Integer | No. of flows that has methods such as Get and Post in http service. |
| is_ftp_login | Binary | If the ftp session is accessed by user and password then 1 else 0. |
| ct_ftp_cmd | integer | No of flows that has a command in ftp session. |
| ct_srv_src | integer | No. of connections that contain the same service and source address in 100 connections according to the last time. |
| ct_srv_dst | integer | No. of connections that contain the same service and destination address in 100 connections according to the last time. |
| ct_dst_ltm | integer | No. of connections of the same destination address in 100 connections according to the last time. |
| ct_src_ltm | integer | No. of connections of the same source address in 100 connections according to the last time. |
| ct_src_dport_ltm | integer | No of connections of the same source address and the destination port in 100 connections according to the last time. |
| ct_dst_sport_ltm | integer | No of connections of the same destination address and the source port in 100 connections according to the last time. |
| ct_dst_src_ltm | integer | No of connections of the same source and the destination address in 100 connections according to the last time. |
# Loading some test data - Standard
# Short test read
if(!exists("dfL1000")) {
try(dfL1000 <- read.csv("data/trainWithLabels.csv",nrows = 1000,sep=",",na.strings="-"))
} else{print("dfL1000 Already loaded")}
col_names <- names(dfL1000)
feature_names <- setdiff(col_names, c("Label","attack_cat"))
tic()
if(!exists("train_data_all")) {
print("Loading train_data csv.")
train_data_all <- read.csv(train_data_file ,sep=",",na.strings="-")
}else{
print("train_data_all already loaded")
}
## [1] "Loading train_data csv."
toc()
## 85.009 sec elapsed
A regular read of the training data file took about 200 seconds on a standard PC; it is much faster on AWS.
tic()
if (!exists("test_data")) {
print("Loading test_data csv.")
test_data <- read.csv(test_data_file ,sep=",",na.strings="-")
}else
{print("test_data Already loaded")
}
## [1] "Loading test_data csv."
toc()
## 36.216 sec elapsed
# Was 105 sec on the PC, 27 sec on AWS.
A regular read of the test data file took 105 seconds on a standard PC. On the AWS EC2 virtual machine the read took around a quarter of that. Model processing times should be similarly faster.
Inspecting the files, it turned out that the train.csv file and the first 47 columns of trainWithLabels.csv were identical. As I was facing PC memory issues with the big data files, I have removed the code that loaded the train.csv and test.csv data sets. We will simply take the first 47 columns of the WithLabels versions to avoid duplication in memory.
By referencing the columns we want to use using the setdiff function, we ensure that the same code will handle processing of the 47 column train.csv file and the 49 column trainWithLabels.csv without modification.
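A small illustration of why `setdiff` makes the column selection independent of the file layout. The two miniature data frames are hypothetical stand-ins for the 47-column and 49-column files.

```r
# Hypothetical miniature versions of the unlabelled and labelled layouts
unlabelled <- data.frame(sport = c(80, 443), dsport = c(53, 22))
labelled   <- cbind(unlabelled, attack_cat = c(NA, "DoS"), Label = c(0, 1))

response_cols <- c("Label", "attack_cat")

# setdiff() drops the response columns when they are present
# and leaves the names untouched when they are not
features_unlabelled <- setdiff(names(unlabelled), response_cols)
features_labelled   <- setdiff(names(labelled), response_cols)

print(features_unlabelled)
print(identical(features_unlabelled, features_labelled))
```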
A first approach to dealing with the memory hungry size of the data set was using raw file splitting tools;
CSVSplitter filename="data/trainWithLabels.csv" outputfolder="data/splitdata" rowcount=100000 firstrowheader=1 repeatheader=1
which will split the csv file into smaller files to process with model checkpoints.
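The same kind of split can be sketched in base R. This is not the CSVSplitter tool itself; `split_csv` and its arguments are hypothetical names, and the approach assumes that no quoted field contains a line break, since it works line by line.

```r
# Split a CSV file into parts of at most rows_per_file data rows,
# repeating the header line at the top of every part
split_csv <- function(path, out_dir, rows_per_file = 100000) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  part <- 0
  out_files <- character(0)
  repeat {
    chunk <- readLines(con, n = rows_per_file)   # continues where the last read stopped
    if (length(chunk) == 0) break
    part <- part + 1
    out_file <- file.path(out_dir, sprintf("part_%03d.csv", part))
    writeLines(c(header, chunk), out_file)
    out_files <- c(out_files, out_file)
  }
  out_files
}

# Demonstration on a small temporary file: 5 rows split into parts of 2
demo_dir <- file.path(tempdir(), "split_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_csv <- file.path(demo_dir, "demo.csv")
write.csv(data.frame(id = 1:5, value = letters[1:5]), demo_csv, row.names = FALSE)

parts <- split_csv(demo_csv, file.path(demo_dir, "parts"), rows_per_file = 2)
print(basename(parts))
```

Because only one chunk is held in memory at a time, the full file never has to fit in RAM.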
This was a decent approach; however, I researched better alternatives and decided to implement the analysis in h2o.
H2O is open-source software for big-data analysis. I installed the h2o platform to better handle the large dataset, as it allows faster in-memory processing of big data sets.
H2o runs outside of R; R connects to its service for processing.
H2o can run on a standalone PC, improving performance.
H2o can run on a cluster of local machines, allowing an RStudio instance to harness the processing power of the cluster.
H2o can run on a cloud-hosted virtual machine, or virtual machine cluster, for even faster performance.
After initially installing on a cluster and running models on 3 of my own local PCs, I then decided to go further and entered into a deeper than expected dive into creating and configuring a virtual machine. My adventures in installing and configuring an Amazon Web Services Elastic Compute Cloud (AWS EC2 - for the acronymically inclined) Linux virtual machine running R, RStudio Server and an h2o server are perhaps beyond the scope of this report. Getting everything to mesh and install was challenging but ultimately rewarding and worthwhile.
Elastic Compute Cloud is a web service that provides resizable compute capacity in the cloud, designed to make web-scale cloud computing easier for developers.
The processing power will allow the further build out of the trained models.
h2o.start <- function(type='local') {
# Start function with options for local/cluster/AWS Server launching.
library(h2o)
if (type=='cluster') {
# Requires configuration files loaded by batch file on each local cluster machine
setwd("C:/Users/D/Downloads/installers/h2o-3.18.0.5")
system("starth2oname.bat")
h2o.connect(ip="192.168.0.10",port=54321)
setwd(mainDir)
} else if (type=='AWS') {
# Connecting to a remote AWS server requires the server to be launched from a config file.
setwd("C:/Users/D/Downloads/installers/h2o-3.18.0.5")
getwd()
system("starth2oAWS.bat")
awsip=aws_ip
h2o.connect(ip=awsip,port=54321)
} else {
h2o.init(min_mem_size = "16g")
}
} #endfunction h2o.start
h2o.start('local')
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 minutes 7 seconds
## H2O cluster timezone: Etc/UTC
## H2O data parsing timezone: UTC
## H2O cluster version: 3.18.0.7
## H2O cluster version age: 11 days
## H2O cluster name: H2O_started_from_R_rstudio_ymr419
## H2O cluster total nodes: 1
## H2O cluster total memory: 15.16 GB
## H2O cluster total cores: 2
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.0 (2017-04-21)
Sys.sleep(3)
# h2o.start('cluster')
#h2o.start('AWS')
#h2o.shutdown()
Loading the test data into h2o on an AWS VM is almost 3 times faster again, having gone from 200 seconds on a standalone PC to just over 30 seconds.
h2o.loadData <- function(dataDirectory) {
tic()
if (!exists("h2o.data.all")){
h2o.data.all <- h2o.importFileFromProccesedOrData(dataDirectory,"trainWithLabels.csv","train")
}
toc()
assign("h2o.data.all", h2o.data.all, envir = .GlobalEnv)
tic()
if (!exists("h2o.data.test")){
h2o.data.test <- h2o.importFileFromProccesedOrData(dataDirectory,"testWithLabels.csv","test")
}
toc()
assign("h2o.data.test", h2o.data.test, envir = .GlobalEnv)
}
h2o.loadData(dataDir)
## [1] "Reading processed file from data/processed/trainWithLabels.csv"
## 12.535 sec elapsed
## [1] "Reading processed file from data/processed/testWithLabels.csv"
## 6.034 sec elapsed
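# Assuming Label is parsed as numeric at this point (fixDataTypes() converts it to a
# factor later), compare against the number 1: a comparison with the string '1'
# matches no rows and reports 0 %.
numMalicious <- h2o.sum(h2o.data.all$Label == 1)
numBenign <- h2o.sum(h2o.data.all$Label == 0)
rateMalicious <- numMalicious/h2o.nrow(h2o.data.all)
cat(rateMalicious*100,"% of samples are malicious")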
## 0 % of samples are malicious
Check levels of Attack_Cat in Training Data
h2o.levels(h2o.data.all$attack_cat)
## [1] "Analysis" "Backdoor" "DoS" "Exploits"
## [5] "Fuzzers" "Generic" "NA" "Reconnaissance"
## [9] "Shellcode" "Worms"
Before any preprocessing there appear to be categories of both Backdoor and Backdoors. We will merge these into a single category for Backdoor attacks. On subsequent runs, which load the preprocessed file, we should see just one "Backdoor" category.
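For a plain R factor (as opposed to an H2O frame), the merge can be sketched by renaming one level so that it matches the other. The example values are made up.

```r
attack_cat <- factor(c("Backdoor", "Backdoors", "DoS", "Backdoors", "Generic"))

# Renaming a level to the name of an existing level merges the two
levels(attack_cat)[levels(attack_cat) == "Backdoors"] <- "Backdoor"

print(table(attack_cat))
```

Working on the levels rather than the values avoids converting the whole column to character and back.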
Encode IPs & Labels.
IP addresses are encoded as factors, which implies that all are equally dissimilar. However, very similar IP addresses may well carry some information that is lost when all are treated as equally dissimilar, so we will encode them as numeric.
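To show what the numeric encoding does, here is a base-R sketch of an IPv4-to-number conversion. It is an illustrative stand-in for `ip_to_numeric` from the iptools package used in this notebook; `ipv4_to_numeric` is a hypothetical helper name.

```r
# Convert dotted-quad IPv4 strings to numbers: a.b.c.d -> a*256^3 + b*256^2 + c*256 + d
ipv4_to_numeric <- function(ip) {
  octets <- strsplit(ip, ".", fixed = TRUE)
  vapply(octets, function(parts) {
    if (length(parts) != 4) return(NA_real_)   # not a dotted quad
    sum(as.numeric(parts) * 256^(3:0))
  }, numeric(1))
}

ips <- c("59.166.0.0", "59.166.0.2", "149.171.126.4")
ip_num <- ipv4_to_numeric(ips)
print(ip_num)

# Addresses on the same subnet end up numerically close together
print(diff(ip_num[1:2]))
```

The first two addresses differ by 2 after encoding, whereas as factor levels they would be no closer to each other than to any other address.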
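h2o_ip_to_numeric <- function(ipcol){
# Remember the input type so that an H2OFrame goes back to the cluster after conversion
wasH2O <- inherits(ipcol, "H2OFrame")
ipcol <- ip_to_numeric(as.vector(as.character(ipcol)))
if (wasH2O) {
ipcol <- as.h2o(ipcol)}
return (ipcol)
}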
fixDataTypes <- function(df) {
print("encoding dstip")
if (h2o.isnumeric(df$dstip[1])){NULL
} else {
dfcol <- as.data.frame(df[, "dstip"])
dfcolconverted <- apply(dfcol, 1, ip_to_numeric)
df[, "dstip"] <- as.h2o(dfcolconverted)
print(df$dstip[3:6])
}
print("encoding srcip")
if (h2o.isnumeric(df$srcip)) { NULL
} else {
dfcol <- as.data.frame(df[, "srcip"])
dfcolconverted <- apply(dfcol, 1, ip_to_numeric)
df[, "srcip"] <- as.h2o(dfcolconverted)
print(df$srcip[3:6])
}
print("Encoding sport")
df$sport <- h2o.asnumeric(df$sport)
print(df$sport[3:6])
print("Encoding Label as factor")
if ("Label" %in% names(df)){ df$Label <- as.factor(df$Label) }
print("Encoding Attack Cat")
if ("attack_cat" %in% names(df)){
df$attack_cat <- h2o.ascharacter(df$attack_cat)
df$attack_cat[df$attack_cat=="Backdoors"] <- "Backdoor"
df$attack_cat <- h2o.asfactor(df$attack_cat)
}
if( class(df$attack_cat)=="H2OFrame") {
print(h2o.table(df$attack_cat))
} else {print(table(df$attack_cat))}
return(df)
}
print("Preprocess training Data")
## [1] "Preprocess training Data"
response_name <- 'Label'
response <- 'Label'
response_vars <- c("Label","attack_cat")
skipped_vars <- c("srcip","dstip","Ltime","Stime")
predictor_vars <- setdiff(names(h2o.data.all), c(response_vars,skipped_vars))
print("Preprocess test Data")
## [1] "Preprocess test Data"
tic()
h2o.data.all <- fixDataTypes(h2o.data.all)
## [1] "encoding dstip"
## [1] "encoding srcip"
## [1] "Encoding sport"
## sport
## 1 1464
## 2 3593
## 3 49664
## 4 32119
##
## [4 rows x 1 column]
## [1] "Encoding Label as factor"
## [1] "Encoding Attack Cat"
## attack_cat Count
## 1 Analysis 1218
## 2 Backdoor 939
## 3 DoS 6091
## 4 Exploits 18102
## 5 Fuzzers 10912
## 6 Generic 65483
##
## [10 rows x 2 columns]
h2o.data.test <- fixDataTypes(h2o.data.test)
## [1] "encoding dstip"
## [1] "encoding srcip"
## [1] "Encoding sport"
## sport
## 1 1043
## 2 1043
## 3 0
## 4 0
##
## [4 rows x 1 column]
## [1] "Encoding Label as factor"
## [1] "Encoding Attack Cat"
## attack_cat Count
## 1 Analysis 1459
## 2 Backdoor 1390
## 3 DoS 10262
## 4 Exploits 26423
## 5 Fuzzers 13334
## 6 Generic 149998
##
## [10 rows x 2 columns]
toc()
## 8.755 sec elapsed
summary(train_data_all)
## srcip sport dstip
## 59.166.0.0:134436 1043 : 66843 149.171.126.4:133914
## 59.166.0.2:134223 47439 : 63285 149.171.126.3:133839
## 59.166.0.4:133701 0 : 26483 149.171.126.2:133816
## 59.166.0.5:133592 138 : 1140 149.171.126.1:133719
## 59.166.0.1:133490 80 : 374 149.171.126.0:133653
## 59.166.0.3:132621 53 : 218 149.171.126.5:133289
## (Other) :721965 (Other):1365685 (Other) :721798
## dsport proto state dur
## Min. : 0 tcp :992483 FIN :981503 Min. : 0
## 1st Qu.: 53 udp :504825 CON :378233 1st Qu.: 0
## Median : 143 arp : 7634 INT :158493 Median : 0
## Mean : 13296 unas : 6362 REQ : 5112 Mean : 1
## 3rd Qu.: 20472 ospf : 4555 RST : 220 3rd Qu.: 0
## Max. :538989345 sctp : 556 CLO : 161 Max. :8787
## NA's :7 (Other): 7613 (Other): 306
## sbytes dbytes sttl dttl
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 264 1st Qu.: 178 1st Qu.: 31.0 1st Qu.: 29
## Median : 1684 Median : 2468 Median : 31.0 Median : 29
## Mean : 4591 Mean : 41467 Mean : 48.8 Mean : 31
## 3rd Qu.: 3614 3rd Qu.: 16734 3rd Qu.: 31.0 3rd Qu.: 29
## Max. :14355774 Max. :14657531 Max. :255.0 Max. :254
##
## sloss dloss service Sload
## Min. : 0 Min. : 0 dns :365313 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 http :128630 1st Qu.: 99483
## Median : 3 Median : 5 ftp-data: 81646 Median : 554080
## Mean : 6 Mean : 19 smtp : 52142 Mean : 20370332
## 3rd Qu.: 7 3rd Qu.: 15 ftp : 32369 3rd Qu.: 1429449
## Max. :5319 Max. :5507 (Other) : 32906 Max. :5468000256
## NA's :831022
## Dload Spkts Dpkts swin
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 66242 1st Qu.: 2 1st Qu.: 2 1st Qu.: 0
## Median : 649043 Median : 14 Median : 18 Median :255
## Mean : 2754003 Mean : 37 Mean : 48 Mean :166
## 3rd Qu.: 3631981 3rd Qu.: 48 3rd Qu.: 46 3rd Qu.:255
## Max. :128761904 Max. :10646 Max. :11018 Max. :255
##
## dwin stcpb dtcpb smeansz
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 61
## Median :255 Median : 991503414 Median : 990879062 Median : 73
## Mean :166 Mean :1395989787 Mean :1395608278 Mean : 126
## 3rd Qu.:255 3rd Qu.:2642617073 3rd Qu.:2642564840 3rd Qu.: 131
## Max. :255 Max. :4294953347 Max. :4294953724 Max. :1504
##
## dmeansz trans_depth res_bdy_len Sjit
## Min. : 0 Min. : 0.00 Min. : 0 Min. : 0
## 1st Qu.: 80 1st Qu.: 0.00 1st Qu.: 0 1st Qu.: 0
## Median : 103 Median : 0.00 Median : 0 Median : 24
## Mean : 311 Mean : 0.09 Mean : 4652 Mean : 1630
## 3rd Qu.: 565 3rd Qu.: 0.00 3rd Qu.: 0 3rd Qu.: 578
## Max. :1500 Max. :131.00 Max. :5242880 Max. :1460480
##
## Djit Stime Ltime
## Min. : 0 Min. :1421927377 Min. :1421927414
## 1st Qu.: 0 1st Qu.:1421943357 1st Qu.:1421943357
## Median : 16 Median :1421958678 Median :1421958679
## Mean : 829 Mean :1422602492 Mean :1422602493
## 3rd Qu.: 72 3rd Qu.:1424223184 3rd Qu.:1424223184
## Max. :781221 Max. :1424233663 Max. :1424233663
##
## Sintpkt Dintpkt tcprtt synack
## Min. : 0 Min. : 0 Min. :0.000 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0 1st Qu.:0.000 1st Qu.:0.0000
## Median : 1 Median : 1 Median :0.001 Median :0.0005
## Mean : 240 Mean : 95 Mean :0.004 Mean :0.0023
## 3rd Qu.: 9 3rd Qu.: 8 3rd Qu.:0.001 3rd Qu.:0.0006
## Max. :60010 Max. :59485 Max. :3.864 Max. :2.1049
##
## ackdat is_sm_ips_ports ct_state_ttl ct_flw_http_mthd
## Min. :0.000 Min. :0.0000 Min. :0.00 0 :986791
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00 NA :404027
## Median :0.000 Median :0.0000 Median :0.00 1 :119301
## Mean :0.002 Mean :0.0021 Mean :0.15 6 : 7880
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.00 4 : 3918
## Max. :3.519 Max. :1.0000 Max. :6.00 3 : 735
## (Other): 1376
## is_ftp_login ct_ftp_cmd ct_srv_src ct_srv_dst
## 0 :1066593 0 :1066498 Min. : 1.00 Min. : 1.00
## 1 : 27716 NA : 429667 1st Qu.: 2.00 1st Qu.: 2.00
## 2 : 10 1 : 24404 Median : 5.00 Median : 5.00
## 4 : 42 2 : 1244 Mean : 7.54 Mean : 7.29
## NA: 429667 4 : 846 3rd Qu.: 9.00 3rd Qu.: 9.00
## 3 : 729 Max. :63.00 Max. :62.00
## (Other): 640
## ct_dst_ltm ct_src_ltm ct_src_dport_ltm ct_dst_sport_ltm
## Min. : 1.0 Min. : 1.0 Min. : 1.00 Min. : 1.00
## 1st Qu.: 2.0 1st Qu.: 2.0 1st Qu.: 1.00 1st Qu.: 1.00
## Median : 3.0 Median : 4.0 Median : 1.00 Median : 1.00
## Mean : 5.1 Mean : 5.6 Mean : 3.02 Mean : 2.35
## 3rd Qu.: 5.0 3rd Qu.: 6.0 3rd Qu.: 2.00 3rd Qu.: 1.00
## Max. :60.0 Max. :60.0 Max. :60.00 Max. :60.00
##
## ct_dst_src_ltm attack_cat Label
## Min. : 1.00 NA :1415027 Min. :0.0000
## 1st Qu.: 1.00 Generic : 65483 1st Qu.:0.0000
## Median : 2.00 Exploits : 18102 Median :0.0000
## Mean : 4.35 Fuzzers : 10912 Mean :0.0715
## 3rd Qu.: 3.00 DoS : 6091 3rd Qu.:0.0000
## Max. :63.00 Reconnaissance: 5561 Max. :1.0000
## (Other) : 2852
The variables dsport and dur appear to have some odd values that we will need to investigate further.
# Let's select the high dsport values
which(train_data_all$dsport==max(train_data_all$dsport))
## integer(0)
# Ignore missing values when finding the maximum port number
maxPortRowNums <- which(train_data_all$dsport==
max(train_data_all$dsport, na.rm=TRUE))
# Show the offending rows together with one neighbouring row on either side
train_data_all[(min(maxPortRowNums)-1):(max(maxPortRowNums)+1),]
| | srcip | sport | dstip | dsport | proto | state | dur | sbytes | dbytes | sttl | dttl | sloss | dloss | service | Sload | Dload | Spkts | Dpkts | swin | dwin | stcpb | dtcpb | smeansz | dmeansz | trans_depth | res_bdy_len | Sjit | Djit | Stime | Ltime | Sintpkt | Dintpkt | tcprtt | synack | ackdat | is_sm_ips_ports | ct_state_ttl | ct_flw_http_mthd | is_ftp_login | ct_ftp_cmd | ct_srv_src | ct_srv_dst | ct_dst_ltm | ct_src_ltm | ct_src_dport_ltm | ct_dst_sport_ltm | ct_dst_src_ltm | attack_cat | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82531 | 59.166.0.7 | 52475 | 149.171.126.8 | 53 | udp | CON | 0.0010 | 146 | 178 | 31 | 29 | 0 | 0 | dns | 581095 | 708458 | 2 | 2 | 0 | 0 | 0 | 0 | 73 | 89 | 0 | 0 | 0 | 0 | 1421930680 | 1421930680 | 0.001 | 0.011 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 2 | 3 | 1 | 1 | 2 | NA | 0 |
| 82532 | 175.45.176.1 | NA | 149.171.126.12 | 538989345 | esp | INT | 0.0000 | 200 | 0 | 254 | 0 | 0 | 0 | NA | 200000000 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 1421930680 | 1421930680 | 0.004 | 0.000 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 12 | 12 | 4 | 4 | 2 | 1 | 4 | NA | 0 |
| 82533 | 175.45.176.1 | 0 | 149.171.126.12 | 538989345 | esp | INT | 0.0000 | 200 | 0 | 254 | 0 | 0 | 0 | NA | 200000000 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 1421930680 | 1421930680 | 0.004 | 0.000 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 12 | 12 | 4 | 4 | 2 | 3 | 4 | NA | 0 |
| 82534 | 59.166.0.2 | 47418 | 149.171.126.7 | 53 | udp | CON | 0.0011 | 146 | 178 | 31 | 29 | 0 | 0 | dns | 539741 | 658041 | 2 | 2 | 0 | 0 | 0 | 0 | 73 | 89 | 0 | 0 | 0 | 0 | 1421930680 | 1421930680 | 0.011 | 0.008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 2 | 3 | 2 | 1 | 1 | NA | 0 |
train_data_all$dsport[!is.na(train_data_all$dsport)][
train_data_all$dsport[!is.na(train_data_all$dsport)] >
quantile(train_data_all$dsport[!is.na(train_data_all$dsport)],prob=0.99999)]
## [1] 65535 538989345 538989345 65535 65535 65535 65535
## [8] 65535 65535 65535 65535 65535 65535 65535
plot(as.vector(train_data_all[,"dsport"]),col=rgb(0, 0.7, 0.3, 0.1))
The highest possible port number is 65535, so these very high values for ports are clearly erroneous rows. We will remove them and replot the chart to check.
train_data_all <- train_data_all[-maxPortRowNums,]
plot(as.vector(train_data_all[sample.int(nrow(train_data_all),10000),"dsport"]),col=rgb(0, 0.7, 0.3, 0.1))
That looks more reasonably distributed, and all ports are now within a possible range of values.
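The range check on port numbers can also be written explicitly. The sketch below uses a small made-up vector rather than the real data, and shows why the first `which()` call earlier returned `integer(0)`: with a missing value present, `max()` itself is `NA`.

```r
ports <- c(53, 80, NA, 538989345, 65535, 443)

# max() returns NA when any value is missing, so the comparison matches nothing
no_match <- which(ports == max(ports))

# na.rm = TRUE ignores the missing value and finds the oversized entry
max_idx <- which(ports == max(ports, na.rm = TRUE))

# Keep missing values, but drop anything outside the valid range 0..65535
in_range <- is.na(ports) | (ports >= 0 & ports <= 65535)
clean_ports <- ports[in_range]

print(no_match)
print(max_idx)
print(clean_ports)
```

Filtering on the valid range rather than on the observed maximum also catches any other impossible values, not just the largest one.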
Now to check dsport values in the h2o version of the data frame.
h2o.max(h2o.data.all$dsport[!is.na(h2o.data.all$dsport)])
## [1] 65535
h2o.data.all$dsport[!is.na(h2o.data.all$dsport)][
h2o.data.all$dsport[!is.na(h2o.data.all$dsport)] >
h2o.quantile(h2o.data.all$dsport[!is.na(h2o.data.all$dsport)],probs=0.99999)]
## dsport
## 1 65535
## 2 65535
## 3 65535
## 4 65535
## 5 65535
## 6 65535
##
## [12 rows x 1 column]
The h2o frame does not appear to have the same error in the dsport values as the base R data frame had.
#plot(as.vector(h2o.data.all[,"dsport"]),col=rgb(0, 0.7, 0.3, 0.3))
No overly large values for dsport in the h2o Dataframe.
Checking similarly for dur
which(train_data_all$dur==max(train_data_all$dur))
## [1] 687663 687664
maxDurRowNums <- which(train_data_all$dur==
max(train_data_all$dur, na.rm=TRUE))
# Show the offending rows together with one neighbouring row on either side
kable(train_data_all[(min(maxDurRowNums)-1):(max(maxDurRowNums)+1),], "html") %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "400px")
| | srcip | sport | dstip | dsport | proto | state | dur | sbytes | dbytes | sttl | dttl | sloss | dloss | service | Sload | Dload | Spkts | Dpkts | swin | dwin | stcpb | dtcpb | smeansz | dmeansz | trans_depth | res_bdy_len | Sjit | Djit | Stime | Ltime | Sintpkt | Dintpkt | tcprtt | synack | ackdat | is_sm_ips_ports | ct_state_ttl | ct_flw_http_mthd | is_ftp_login | ct_ftp_cmd | ct_srv_src | ct_srv_dst | ct_dst_ltm | ct_src_ltm | ct_src_dport_ltm | ct_dst_sport_ltm | ct_dst_src_ltm | attack_cat | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 687664 | 59.166.0.2 | 30239 | 149.171.126.2 | 53991 | tcp | FIN | 0.0691 | 424 | 8824 | 31 | 29 | 1 | 4 | ftp-data | 42951.0039 | 936470.7500 | 8 | 12 | 255 | 255 | 3964727548 | 4277903059 | 53 | 735 | 0 | 0 | 810.4 | 640.83 | 1421954430 | 1421954430 | 9.827 | 6.238 | 0.0006 | 0.0005 | 0.0001 | 0 | 0 | 0 | 0 | 0 | 9 | 3 | 2 | 4 | 1 | 1 | 2 | NA | 0 |
| 687665 | 10.40.182.1 | 0 | 10.40.182.3 | 0 | arp | CON | 8786.6377 | 644 | 1058 | 0 | 0 | 0 | 0 | NA | 0.5609 | 0.9214 | 23 | 23 | 0 | 0 | 0 | 0 | 28 | 46 | 0 | 0 | 0.0 | 0.00 | 1421945643 | 1421954430 | 0.000 | 0.000 | 0.0000 | 0.0000 | 0.0000 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | NA | 0 |
| 687666 | 10.40.182.1 | 0 | 10.40.182.3 | 0 | arp | CON | 8786.6377 | 644 | 1058 | 0 | 0 | 0 | 0 | NA | 0.5609 | 0.9214 | 23 | 23 | 0 | 0 | 0 | 0 | 28 | 46 | 0 | 0 | 0.0 | 0.00 | 1421945643 | 1421954430 | 0.000 | 0.000 | 0.0000 | 0.0000 | 0.0000 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | NA | 0 |
| 687667 | 59.166.0.4 | 64187 | 149.171.126.3 | 21 | tcp | FIN | 0.9167 | 2934 | 3742 | 31 | 29 | 11 | 15 | ftp | 25116.0117 | 32053.8965 | 52 | 54 | 255 | 255 | 1459339333 | 1460380764 | 56 | 69 | 0 | 0 | 1630.0 | 55.21 | 1421954429 | 1421954430 | 17.967 | 17.287 | 0.0006 | 0.0005 | 0.0001 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | NA | 0 |
Now to check the odd values in the duration variable, dur.
train_data_all$dur[train_data_all$dur > quantile(train_data_all$dur,prob=0.99999)]
## [1] 8787 8787 8761 8761 8761 8761 60 60 60 60 60 60 60 60
## [15] 60 60
plot(as.vector(train_data_all[,"dur"]),col=rgb(0, 0.7, 0.3, 0.5))
Again there seem to be one or more very odd outlier values. A closer look at these rows shows that the source and destination IP addresses are only 2 apart, so these are likely very large local file transfers. Whilst they may still be genuinely suspicious values, they are so far outside the range of the rest of the data as to be unhelpful for modelling.
We will remove the dur outliers and replot.
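# Keep rows at or below the cut-off. A logical mask, unlike -which(), cannot drop every row when nothing matches.
train_data_all <- train_data_all[is.na(train_data_all$dur) | train_data_all$dur <= 8000, ]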
plot(as.vector(train_data_all[,"dur"]),col=rgb(0, 0.7, 0.3, 0.5))
That looks more reasonably distributed now.
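A minimal sketch, on made-up toy data rather than the notebook's frame, of removing rows above a cut-off. The helper name `drop_above` is hypothetical. It uses a logical mask, which keeps rows whose value is NA and still returns every row when nothing exceeds the cut-off, whereas a negated `which()` index returns zero rows in that case.

```r
# Hypothetical helper: keep rows whose `col` value is NA or at most `cutoff`.
drop_above <- function(df, col, cutoff) {
  keep <- is.na(df[[col]]) | df[[col]] <= cutoff
  df[keep, , drop = FALSE]
}

# Toy data (illustrative values only).
toy <- data.frame(id = 1:5, dur = c(0.02, 60, 8787, NA, 0.25))

# Row 3 (dur = 8787) is removed; the NA row is kept.
cleaned <- drop_above(toy, "dur", 8000)
print(cleaned$id)

# Pitfall: when no row exceeds the cut-off, which() returns integer(0),
# and indexing with -integer(0) selects zero rows instead of all of them.
no_outliers <- toy[-which(toy$dur > 1e6), ]
print(nrow(no_outliers))

# The logical mask keeps all five rows in the same situation.
print(nrow(drop_above(toy, "dur", 1e6)))
```

The same idea applies to any threshold-based row filter; only the column name and cut-off change.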
We will now check the dur values in the h2o version of our data frame and remove any wildly extreme values.
# We can use the h2o.quantile function to return only the very highest values in the large data set.
h2o.data.all$dur[!is.na(h2o.data.all$dur)][
h2o.data.all$dur[!is.na(h2o.data.all$dur)] >
h2o.quantile(h2o.data.all$dur[!is.na(h2o.data.all$dur)],probs=0.99999)]
## dur
## 1 60
## 2 60
## 3 60
## 4 60
## 5 60
## 6 60
##
## [16 rows x 1 column]
Remove outliers and recheck top values in h2o data frame
h2o.data.all <- h2o.data.all[h2o.data.all$dur<8000,]
h2o.data.all$dur[!is.na(h2o.data.all$dur)][
h2o.data.all$dur[!is.na(h2o.data.all$dur)] >
h2o.quantile(h2o.data.all$dur[!is.na(h2o.data.all$dur)],probs=0.999995)]
## dur
## 1 60
## 2 60
## 3 60
## 4 60
## 5 60
## 6 60
##
## [8 rows x 1 column]
summary(train_data_all)
## srcip sport dstip
## 59.166.0.0:134436 1043 : 66843 149.171.126.4:133914
## 59.166.0.2:134223 47439 : 63285 149.171.126.3:133839
## 59.166.0.4:133701 0 : 26476 149.171.126.2:133816
## 59.166.0.5:133592 138 : 1140 149.171.126.1:133719
## 59.166.0.1:133490 80 : 374 149.171.126.0:133653
## 59.166.0.3:132621 53 : 218 149.171.126.5:133289
## (Other) :721957 (Other):1365684 (Other) :721790
## dsport proto state dur
## Min. : 0 tcp :992483 FIN :981503 Min. : 0.00
## 1st Qu.: 53 udp :504825 CON :378231 1st Qu.: 0.00
## Median : 143 arp : 7632 INT :158491 Median : 0.02
## Mean :12589 unas : 6362 REQ : 5108 Mean : 0.67
## 3rd Qu.:20472 ospf : 4551 RST : 220 3rd Qu.: 0.25
## Max. :65535 sctp : 556 CLO : 161 Max. :60.00
## NA's :7 (Other): 7611 (Other): 306
## sbytes dbytes sttl dttl
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 264 1st Qu.: 178 1st Qu.: 31.0 1st Qu.: 29
## Median : 1684 Median : 2468 Median : 31.0 Median : 29
## Mean : 4591 Mean : 41467 Mean : 48.8 Mean : 31
## 3rd Qu.: 3614 3rd Qu.: 16734 3rd Qu.: 31.0 3rd Qu.: 29
## Max. :14355774 Max. :14657531 Max. :255.0 Max. :254
##
## sloss dloss service Sload
## Min. : 0 Min. : 0 dns :365313 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 http :128630 1st Qu.: 99485
## Median : 3 Median : 5 ftp-data: 81646 Median : 554080
## Mean : 6 Mean : 19 smtp : 52142 Mean : 20370176
## 3rd Qu.: 7 3rd Qu.: 15 ftp : 32369 3rd Qu.: 1429449
## Max. :5319 Max. :5507 (Other) : 32906 Max. :5468000256
## NA's :831014
## Dload Spkts Dpkts swin
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 66242 1st Qu.: 2 1st Qu.: 2 1st Qu.: 0
## Median : 649043 Median : 14 Median : 18 Median :255
## Mean : 2754018 Mean : 37 Mean : 48 Mean :166
## 3rd Qu.: 3631999 3rd Qu.: 48 3rd Qu.: 46 3rd Qu.:255
## Max. :128761904 Max. :10646 Max. :11018 Max. :255
##
## dwin stcpb dtcpb smeansz
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 61
## Median :255 Median : 991520167 Median : 990893074 Median : 73
## Mean :166 Mean :1395997115 Mean :1395615604 Mean : 126
## 3rd Qu.:255 3rd Qu.:2642621606 3rd Qu.:2642573682 3rd Qu.: 131
## Max. :255 Max. :4294953347 Max. :4294953724 Max. :1504
##
## dmeansz trans_depth res_bdy_len Sjit
## Min. : 0 Min. : 0.00 Min. : 0 Min. : 0
## 1st Qu.: 80 1st Qu.: 0.00 1st Qu.: 0 1st Qu.: 0
## Median : 103 Median : 0.00 Median : 0 Median : 24
## Mean : 311 Mean : 0.09 Mean : 4652 Mean : 1630
## 3rd Qu.: 565 3rd Qu.: 0.00 3rd Qu.: 0 3rd Qu.: 578
## Max. :1500 Max. :131.00 Max. :5242880 Max. :1460480
##
## Djit Stime Ltime
## Min. : 0 Min. :1421927377 Min. :1421927414
## 1st Qu.: 0 1st Qu.:1421943357 1st Qu.:1421943357
## Median : 16 Median :1421958679 Median :1421958679
## Mean : 829 Mean :1422602495 Mean :1422602496
## 3rd Qu.: 72 3rd Qu.:1424223184 3rd Qu.:1424223184
## Max. :781221 Max. :1424233663 Max. :1424233663
##
## Sintpkt Dintpkt tcprtt synack
## Min. : 0 Min. : 0 Min. :0.000 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0 1st Qu.:0.000 1st Qu.:0.0000
## Median : 1 Median : 1 Median :0.001 Median :0.0005
## Mean : 240 Mean : 95 Mean :0.004 Mean :0.0023
## 3rd Qu.: 9 3rd Qu.: 8 3rd Qu.:0.001 3rd Qu.:0.0006
## Max. :60010 Max. :59485 Max. :3.864 Max. :2.1049
##
## ackdat is_sm_ips_ports ct_state_ttl ct_flw_http_mthd
## Min. :0.000 Min. :0.0000 Min. :0.00 0 :986783
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00 NA :404027
## Median :0.000 Median :0.0000 Median :0.00 1 :119301
## Mean :0.002 Mean :0.0021 Mean :0.15 6 : 7880
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.00 4 : 3918
## Max. :3.519 Max. :1.0000 Max. :6.00 3 : 735
## (Other): 1376
## is_ftp_login ct_ftp_cmd ct_srv_src ct_srv_dst
## 0 :1066585 0 :1066490 Min. : 1.00 Min. : 1.00
## 1 : 27716 NA : 429667 1st Qu.: 2.00 1st Qu.: 2.00
## 2 : 10 1 : 24404 Median : 5.00 Median : 5.00
## 4 : 42 2 : 1244 Mean : 7.54 Mean : 7.29
## NA: 429667 4 : 846 3rd Qu.: 9.00 3rd Qu.: 9.00
## 3 : 729 Max. :63.00 Max. :62.00
## (Other): 640
## ct_dst_ltm ct_src_ltm ct_src_dport_ltm ct_dst_sport_ltm
## Min. : 1.0 Min. : 1.0 Min. : 1.00 Min. : 1.00
## 1st Qu.: 2.0 1st Qu.: 2.0 1st Qu.: 1.00 1st Qu.: 1.00
## Median : 3.0 Median : 4.0 Median : 1.00 Median : 1.00
## Mean : 5.1 Mean : 5.6 Mean : 3.02 Mean : 2.35
## 3rd Qu.: 5.0 3rd Qu.: 6.0 3rd Qu.: 2.00 3rd Qu.: 1.00
## Max. :60.0 Max. :60.0 Max. :60.00 Max. :60.00
##
## ct_dst_src_ltm attack_cat Label
## Min. : 1.00 NA :1415019 Min. :0.0000
## 1st Qu.: 1.00 Generic : 65483 1st Qu.:0.0000
## Median : 2.00 Exploits : 18102 Median :0.0000
## Mean : 4.35 Fuzzers : 10912 Mean :0.0715
## 3rd Qu.: 3.00 DoS : 6091 3rd Qu.:0.0000
## Max. :63.00 Reconnaissance: 5561 Max. :1.0000
## (Other) : 2852
We will now save these files after preprocessing. On its next run, this notebook script will load these saved versions if they exist; otherwise it will load the original files and preprocess them again.
One of the challenges in this work was re-running functions on the large data set. Saving our progress, and using if statements to check whether certain steps had already been done, enabled me to rerun the script without repeating the same processing steps over and over again.
Most of these steps were written to check whether something already exists and to create it only if not. This also matters because, when I shut down my AWS virtual machine and rerun at a later date, I start with a new VM's clean hard drive and need to reload the already-processed files from a backup S3 drive.
try(exportPreProcessedFiles(Dir="data", processedDir="processed" ))
## [1] "preprocessed file exists"
## [1] "preprocessed file exists"
## [1] "Pre processed files Saved For next Time."
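The check-then-create pattern described above can be sketched in base R as follows. This is only an illustration under stated assumptions: `load_or_build` and `expensive_step` are hypothetical names, the cache lives in a temporary directory, and the notebook's own `exportPreProcessedFiles` function (defined elsewhere) may work differently.

```r
# Hypothetical helper: return the cached object if `path` exists,
# otherwise build it with `build_fn`, save it to `path`, and return it.
load_or_build <- function(path, build_fn) {
  if (file.exists(path)) {
    message("loading cached file: ", path)
    return(readRDS(path))
  }
  result <- build_fn()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(result, path)
  result
}

# Stand-in for an expensive preprocessing step; counts how often it runs.
build_count <- 0
expensive_step <- function() {
  build_count <<- build_count + 1
  data.frame(x = 1:3, y = c(2, 4, 6))
}

cache_file <- file.path(tempdir(), "processed", "toy_preprocessed.rds")
unlink(cache_file)  # start the demo from a clean state

first  <- load_or_build(cache_file, expensive_step)  # builds and saves
second <- load_or_build(cache_file, expensive_step)  # loads from disk
print(build_count)
```

Because the second call finds the file on disk, the expensive step runs only once. Pointing `path` at a directory that is synced to backup storage would give the restart behaviour described above.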
Investigate the time stamp variables
sampleset=seq(1,1500000,10)
plot(as.vector(h2o.data.all$Ltime[sampleset]))
plot(as.vector(h2o.data.all$Stime[sampleset]))
plot(as.vector(h2o.data.all$Ltime[sampleset]-h2o.data.all$Stime[sampleset]))
plot(as.vector(h2o.data.all$Ltime[sampleset]))
The time stamp variable will likely relate to the attacks that have happened in a statistically relevant way, yet for our use in predicting future attacks, we need to identify attacks independent of when they arrive. One option was to create an interval time column, of the difference in start and last time, but since there is already a duration column, I decided not to use this approach.
Create a Time interval column - finally decided against this.
#h2o.data.all$Itime <- h2o.data.all$Ltime - h2o.data.all$Stime
#h2o.data.test$Itime <- h2o.data.test$Ltime - h2o.data.test$Stime
Similarly to the time stamps, whilst IP addresses will certainly be related to attacks, if our goal is to detect attacks from any IP then they are not going to be useful in our modelling. For example, an IP address associated with 1,000,000 benign transactions COULD easily still be the source of our next attack; however, a model built on past IP addresses could associate that address with a high likelihood of being safe.
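As a hedged illustration of leaving identifying columns out of a model, the sketch below builds a predictor list with `setdiff()` on a small made-up data frame. The exclusion list shown here is an assumption for the example only; the notebook's actual `predictor_vars` is defined elsewhere and may differ.

```r
# Toy frame with a few of the column names used in this data set
# (values are illustrative only).
toy <- data.frame(
  srcip  = c("59.166.0.2", "10.40.182.1"),
  dstip  = c("149.171.126.2", "10.40.182.3"),
  Stime  = c(1421954430, 1421945643),
  Ltime  = c(1421954430, 1421954430),
  dur    = c(0.0691, 0.9167),
  sbytes = c(424, 2934),
  Label  = c(0, 0)
)

# Assumed exclusion list: addresses, time stamps and the target itself.
excluded <- c("srcip", "dstip", "Stime", "Ltime", "Label")

# Everything not excluded is offered to the model as a predictor.
predictors <- setdiff(names(toy), excluded)
print(predictors)
```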
# Variable Distributions
Plots of Variables distributions
lastcol <- ncol(h2o.data.all)
# Split the 47 columns into smaller groups of columns.
dfsample = h2o.data.all[h2o.runif(h2o.data.all) < 0.1,]
# For Non h2o frames we can use:
# samplesize = 100000
# dfsample = sample_n(h2o.data.all,samplesize)
gatheredsample1 <- as.data.frame(dfsample[,c(1:15,lastcol)]) %>% gather(variable, value, -Label)
gatheredsample2 <- as.data.frame(dfsample[,c(16:30,lastcol)]) %>% gather(variable, value, -Label)
gatheredsample3 <- as.data.frame(dfsample[,31:lastcol]) %>% gather(variable, value, -Label)
highlightnames <- feature_names[c(1,4,9,10,11,12,13,17,18,24,32,35,37,41,42,43,lastcol)]
gatheredsample_highlight <- rbind(
gatheredsample1[gatheredsample1$variable %in% highlightnames,],
gatheredsample2[gatheredsample2$variable %in% highlightnames,],
gatheredsample3[gatheredsample3$variable %in% highlightnames,])
Visualisations of the distributions of variables.
tic()
density_plot <- function(gatheredsample_highlight) {
sample_frac(gatheredsample_highlight,0.01) %>%
ggplot() +
geom_density(aes(x = value, col = Label), alpha = 0.2) +
labs(title = "Network Flow Variables Distribution\n",
x = "Variables", y = "Density", color = "Attack Label\n") +
scale_color_manual(labels = c("Benign", "Malicious"), values = c("green", "red")) +
facet_wrap(~variable, ncol = 4)
}
toc()
## 0.003 sec elapsed
if (!file.exists("densityploth.png")){
density_plot(gatheredsample_highlight)
dev.copy(png,'densityploth.png')
invisible(dev.off())
}
if (!file.exists("densityplot1.png")){
density_plot(gatheredsample1)
dev.copy(png,'densityplot1.png')
invisible(dev.off())
}
if (!file.exists("densityplot2.png")){
density_plot(gatheredsample2)
dev.copy(png,'densityplot2.png')
invisible(dev.off())
}
if (!file.exists("densityplot3.png")){
density_plot(gatheredsample3)
dev.copy(png,'densityplot3.png')
invisible(dev.off())
}
toc()
Plot for some of the more important variables
sample_frac(gatheredsample_highlight,0.01) %>%
ggplot() +
geom_density(aes(x = value, col = Label), alpha = 0.2) +
labs(title = "Network Flow Variables Distribution\n",
x = "Variables", y = "Density", color = "Attack Label\n") +
scale_color_manual(labels = c("Benign", "Malicious"), values = c("green", "red")) +
facet_wrap(~variable, ncol = 4)
It can be quite difficult to interpret this distribution of data and see what information is contained. There appears to be a wide variation in values, with few obvious patterns separating attacks (in red) from benign samples (in green).
We can look at a smoothed distribution of the variables to make any differences easier to visualise. The plots below show the distribution of our sample variables, contrasting those identified as malicious (Label 1) with those not (Label 0). Some variables are binary but will still appear as smoothed ridges on the distribution plots.
We will take a 10% sub sample of the data to speed plotting.
plot_ridges <- function (df,output_filename=NULL){
draw_plot <- ggplot(df,aes(y = as.factor(variable),
fill = (as.factor(Label)),
x = percent_rank(value))) +
geom_density_ridges() +
theme(text = element_text(size=12))+
scale_fill_manual(values = alpha(c( "green","red"), .3))
if (is.null(output_filename)) {
draw_plot
} else {
# for hi res print output
# png(filename=output_filename , width=1600, height=1600)
# print(draw_plot)
# invisible(dev.off())
print(draw_plot)
dev.copy(png,output_filename)
invisible(dev.off())
}
}
if (!file.exists("ridges1.png")){
plot_ridges(gatheredsample1,'ridges1.png')
}
if (!file.exists("ridges2.png")){
plot_ridges(gatheredsample2,'ridges2.png')
}
if (!file.exists("ridges3.png")){
plot_ridges(gatheredsample3,'ridges3.png')
}
if (!file.exists("ridgeshighlights.png")){
plot_ridges(
rbind(
gatheredsample1[gatheredsample1$variable %in% highlightnames,],
gatheredsample2[gatheredsample2$variable %in% highlightnames,],
gatheredsample3[gatheredsample3$variable %in% highlightnames,]),'ridgeshighlights.png')
#gatheredsample_highlight)
invisible(dev.off())
}
Variable Distributions 1
Variable Distributions 2
# rm("gatheredsample1","gatheredsample2","gatheredsample3")
attack_cat_table0 <- sort(decreasing =TRUE,
table(as.data.frame(
h2o.data.all["attack_cat"])))
attack_cat_table <- sort(decreasing =TRUE,
table(as.data.frame(
h2o.data.all[
as.vector(h2o.which(h2o.data.all$Label=='1')),"attack_cat"])))
attack_cat_table
##
## Generic Exploits Fuzzers DoS Reconnaissance
## 65483 18102 10912 6091 5561
## Analysis Backdoor Shellcode Worms
## 1218 939 625 70
There are over 65,000 generic attacks and only 70 worms. It may prove difficult to distinguish some of the less well-represented categories.
# Plot of attack categories
Variable Distributions 1
barplot0 <- barplot(attack_cat_table0,
main = "Attack Categories",
col= 19:1,
cex.names=.8,
las=2
)
There are vastly more benign samples than attacks.
Variable Distributions 2
barplot1 <- barplot(attack_cat_table,
main = "Attack Categories",
col= 18:1,
cex.names=.8,
las=2
)
Some attacks have very few occurrences in the data, which may make predicting for those categories difficult.
We will split our original training data into a random 80% Training split and 20% Validation Split.
Models can be built using the training split and verified on the Validation Split.
For the deep learning, we can pass in both data sets, so that what the model learns from the training data is evaluated against the validation data within the model building itself.
This is quite a nice feature to have the validation set performance evaluation included in the model building step.
Create an 80% training split and 20% validation split
randomsplits <- h2o.runif(h2o.data.all)
h2o.data.train <- h2o.data.all[randomsplits <0.8,]
h2o.data.validation <- h2o.data.all[randomsplits >=0.8,]
# Recreate our h2o data splits as standard R dataframes
train_data <- as.data.frame(h2o.data.train)
validation_data <- as.data.frame(h2o.data.validation)
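The split above draws one uniform random number per row and reuses it for both filters, so every row lands in exactly one split. A minimal base R sketch of the same idea on made-up data follows; the seed, the toy frame, and the 7% attack rate are arbitrary choices for the illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Toy frame standing in for the full data set.
toy <- data.frame(id = 1:1000, Label = rbinom(1000, 1, 0.07))

# One uniform draw per row, reused for both filters.
u <- runif(nrow(toy))
train_split      <- toy[u <  0.8, ]
validation_split <- toy[u >= 0.8, ]

# The two splits are disjoint and together cover every row.
print(nrow(train_split) + nrow(validation_split) == nrow(toy))
print(length(intersect(train_split$id, validation_split$id)))

# The proportions are only approximately 80/20, because each row is assigned independently.
print(nrow(train_split) / nrow(toy))
```

Drawing the random vector once matters: calling the random generator separately for each filter would let some rows appear in both splits and others in neither.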
Learn how many combinations of variables are required to explain the variance in the data.
h2o.data.all.pca <- h2o.prcomp(h2o.data.all, x = predictor_vars,
transform = "NORMALIZE",
k = 20,
max_iterations = 1000,
compute_metrics = TRUE,
impute_missing = FALSE,
max_runtime_secs = 60
)
## Warning in doTryCatch(return(expr), name, parentenv, handler): _train:
## Dataset used may contain fewer number of rows due to removal of rows with
## NA/missing values. If this is not desirable, set impute_missing argument in
## pca call to TRUE/True/true/... depending on the client language.
summary(h2o.data.all.pca,plot=TRUE)
## Model Details:
## ==============
##
## H2ODimReductionModel: pca
## Model Key: PCA_model_R_1524748087009_7
## Importance of components:
## pc1 pc2 pc3 pc4 pc5
## Standard deviation 1.323251 0.940512 0.330133 0.295593 0.271781
## Proportion of Variance 0.522008 0.263707 0.032492 0.026048 0.022021
## Cumulative Proportion 0.522008 0.785716 0.818207 0.844256 0.866276
## pc6 pc7 pc8 pc9 pc10
## Standard deviation 0.262425 0.242044 0.228554 0.214266 0.197184
## Proportion of Variance 0.020531 0.017466 0.015573 0.013687 0.011591
## Cumulative Proportion 0.886807 0.904273 0.919846 0.933532 0.945124
## pc11 pc12 pc13 pc14 pc15
## Standard deviation 0.190681 0.170244 0.150804 0.135906 0.128240
## Proportion of Variance 0.010840 0.008640 0.006780 0.005506 0.004903
## Cumulative Proportion 0.955963 0.964604 0.971384 0.976890 0.981793
## pc16 pc17 pc18 pc19 pc20
## Standard deviation 0.116508 0.094400 0.085330 0.078867 0.060079
## Proportion of Variance 0.004047 0.002657 0.002171 0.001854 0.001076
## Cumulative Proportion 0.985840 0.988496 0.990667 0.992521 0.993597
##
## H2ODimReductionMetrics: pca
##
## No model metrics available for PCA
##
##
##
## Scoring History for GramSVD:
## timestamp duration iterations
## 1 2018-04-26 13:19:03 4.769 sec 0
It takes 15 principal components to account for 98% of the sample variation.
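The component count can be read off the cumulative proportions programmatically. The sketch below copies the "Cumulative Proportion" row from the summary output above into a plain vector; `components_needed` is a hypothetical helper name.

```r
# Cumulative proportion of variance for pc1..pc20, copied from the summary above.
cum_prop <- c(0.522008, 0.785716, 0.818207, 0.844256, 0.866276,
              0.886807, 0.904273, 0.919846, 0.933532, 0.945124,
              0.955963, 0.964604, 0.971384, 0.976890, 0.981793,
              0.985840, 0.988496, 0.990667, 0.992521, 0.993597)

# Hypothetical helper: smallest number of components whose cumulative
# proportion reaches `target`, or NA if the fitted components never reach it.
components_needed <- function(cum_prop, target) {
  hits <- which(cum_prop >= target)
  if (length(hits) == 0) NA_integer_ else hits[1]
}

print(components_needed(cum_prop, 0.98))   # 15
print(components_needed(cum_prop, 0.90))   # 7
print(components_needed(cum_prop, 0.999))  # NA: 20 components reach only 0.9936
```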
We will apply an unsupervised deep learning Autoencoder.
An autoencoder is a neural network that learns a representation or encoding of a set of data, typically for the purpose of dimensionality reduction.
For the first task we will proceed as if we have no labels on the data and simply try to identify suspicious anomalies in the data.
After building the models we will use the labels to discover if this approach will be useful.
getReconErrors <- function(dl_model, dataF) {
reconstructionMses = h2o.anomaly(dl_model, dataF, per_feature=FALSE)
reconMses <- as.data.frame(reconstructionMses)
return(reconMses)
}
plotReconstructionErrors <- function(dataF, errorsList,type='raw',sampleRate=1) {
quants <- quantile(errorsList$Reconstruction.MSE, c(.70, .75, .80))
quants
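# Plot anomaly MSE level
# Note: errorsList is not subsampled alongside dataF, so sampleRate must stay at 1
# for the labels and the errors to stay aligned.
dataF <- dataF[h2o.runif(dataF)<sampleRate,]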
library(scales)
labels1 <- as.vector(dataF$Label)
all <- TRUE
raw <- TRUE
benign <- labels1==0
malicious <- labels1==1
min_MSE<- min(errorsList$Reconstruction.MSE)
max_MSE<- max(errorsList$Reconstruction.MSE)
if (type == 'raw') { plotData<-errorsList$Reconstruction.MSE
} else {
plotData<-sort(errorsList$Reconstruction.MSE[get(type)])
}
plot(plotData,
main=paste(c('Reconstruction Error Percentiles',c(names(quants))), collapse='. '),
log="y",
ylim=c(min_MSE,max_MSE),
ylab="Reconstruction MSE error level",
sub = paste("Sorted Reconstruction Errors", collapse='. '),
col=alpha((3-as.numeric(labels1[get(type)])),0.25))
abline(h=quants[1],col='yellow')
abline(h=quants[2],col='orange')
abline(h=quants[3],col='red')
}
output_dl_model_performance <- function(model, h2o_data) {
#Plot anomaly MSE level
recon_errors <- getReconErrors(model, h2o_data)
plotAll <- plotReconstructionErrors(h2o_data, recon_errors)
#Plot benign anomaly MSE level
plotBenign <- plotReconstructionErrors(h2o_data, recon_errors,type='benign')
#Plot malicious anomaly MSE level
plotMalicious <- plotReconstructionErrors(h2o_data, recon_errors,type='malicious')
malicious_indexes_logical <- as.vector(h2o.which(as.character(h2o_data$Label)=='1'))
minimumMaliciousMSE <- min(recon_errors$Reconstruction.MSE[malicious_indexes_logical])
percentileRank <- function(vec, testval) {length(vec[vec <= testval])/length(vec)*100}
mal_below_percentile <- function(rate){
sort(recon_errors$Reconstruction.MSE[malicious_indexes_logical])[floor(length(recon_errors$Reconstruction.MSE[malicious_indexes_logical])*rate)]}
cat("We could consider all Samples below a certain Reconstruction error as \"Safe\"\n\n")
miss<-0.1
perc<-percentileRank(recon_errors$Reconstruction.MSE,mal_below_percentile(miss))
cat(perc,"% cut off would miss",miss*100,"% of attacks\n")
miss<-0.05
perc<-percentileRank(recon_errors$Reconstruction.MSE,mal_below_percentile(miss))
cat(perc,"% cut off would miss",miss*100,"% of attacks\n")
miss<-0.025
perc<-percentileRank(recon_errors$Reconstruction.MSE,mal_below_percentile(miss))
cat(perc,"% cut off would miss",miss*100,"% of attacks\n")
cat(percentileRank(recon_errors$Reconstruction.MSE,minimumMaliciousMSE),"% Cut off is where the lowest Malicious sample Reconstruction error was, at ", minimumMaliciousMSE,".\n The samples with reconstruction error below this are all Benign samples")
}
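The cut-off arithmetic used in the reporting function can be illustrated on toy numbers. This is a sketch, not the notebook's code: the helper names are hypothetical, the data are made up so that the 20 highest errors are the attacks, and a `max(1, ...)` guard is added so that a very small miss rate cannot produce an empty index.

```r
# Percentage of all samples with an error at or below `testval`.
percentile_rank <- function(vec, testval) {
  sum(vec <= testval) / length(vec) * 100
}

# Error level below which roughly a fraction `rate` of malicious samples fall.
malicious_error_at <- function(errors, is_malicious, rate) {
  mal <- sort(errors[is_malicious])
  mal[max(1, floor(length(mal) * rate))]  # guard added for this sketch
}

# Toy data: 100 errors from 0.01 to 1.00; the top 20 are attacks.
errors       <- (1:100) / 100
is_malicious <- c(rep(FALSE, 80), rep(TRUE, 20))

# Error level that would let about 10% of the attacks slip through as "safe".
cutoff_error <- malicious_error_at(errors, is_malicious, 0.10)

# Where that error level sits among all samples.
cutoff_pct <- percentile_rank(errors, cutoff_error)
print(c(cutoff_error = cutoff_error, cutoff_pct = cutoff_pct))
```

In this toy case, treating everything at or below the 82nd percentile as safe would miss 2 of the 20 attacks, which is the same reading as the "cut off would miss" lines printed by the reporting function.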
We will use h2o.deeplearning with autoencoder set to TRUE:
tic()
TrainDlModel1 <- function(predictor_vars, datafr) {
deepmodel = h2o.deeplearning (x = predictor_vars,
training_frame = datafr[predictor_vars], # use the frame passed in rather than the global
ignore_const_cols = FALSE,
autoencoder = TRUE,
# standardize = TRUE,
seed = 42,
hidden = c(20),
epochs = 50000,
max_w2 = 10,
sparse = FALSE,
activation = "TanhWithDropout",
loss = "Automatic",
l1 =1e-5,
stopping_metric = "MSE",
stopping_rounds = 3,
stopping_tolerance =0.0001,
model_id = "DL_model_R_1a"
)
return(deepmodel)
}
toc()
## 0.007 sec elapsed
if (file.exists(NewestFileInDir("./models/hidden_20")) ) {
print("loading saved model")
h2o.data.train.dl <- h2o.loadModel(NewestFileInDir("./models/hidden_20"))
} else {
h2o.data.train.dl <- TrainDlModel1(predictor_vars, h2o.data.train)
h2o.saveModel(h2o.data.train.dl, path="./models/hidden_20", force = TRUE)
}
## [1] "loading saved model"
cat("model Id is :",h2o.data.train.dl@model_id)
## model Id is : DeepLearning_model_R_1524433485776_56
#hidden (20) layer model id was saved as "DeepLearning_model_R_1524353906773_25"
After the model is trained, we call the h2o.anomaly function, which reconstructs each row of the original data from the model's reduced set of features and returns the mean squared error between the original and reconstructed values. This error measures how well the model could reconstruct the data.
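To make the reconstruction error concrete, here is a minimal base R sketch of a per-row mean squared error between an original matrix and its reconstruction. The numbers are made up and `reconstruction_mse` is a hypothetical helper; H2O computes its error on its own internally encoded and scaled inputs, which this sketch does not reproduce.

```r
# Per-row mean squared error between original and reconstructed values.
reconstruction_mse <- function(original, reconstructed) {
  stopifnot(all(dim(original) == dim(reconstructed)))
  rowMeans((original - reconstructed)^2)
}

# Two toy rows with three features each.
original <- matrix(c(0.1, 0.2, 0.3,
                     0.9, 0.8, 0.7), nrow = 2, byrow = TRUE)

# Row 1 is reconstructed perfectly; row 2 has one feature badly off.
reconstructed <- matrix(c(0.1, 0.2, 0.3,
                          0.5, 0.8, 0.7), nrow = 2, byrow = TRUE)

errors <- reconstruction_mse(original, reconstructed)
print(errors)
```

The second row has the larger error, so it is the one an anomaly detector built on this score would flag first.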
Grid ID: dl_grid
Used hyper parameters: - hidden
Number of models: 5
Number of failed models: 0
Hyper-Parameter Search Summary: ordered by increasing mse
hidden model_ids mse
1 [20] dl_grid_model_3 0.005309525216384074
2 [32, 32] dl_grid_model_1 0.005460826923559729
3 [15, 10, 15] dl_grid_model_0 0.006091622073189775
4 [20, 20] dl_grid_model_4 0.006243060895154623
5 [64, 64] dl_grid_model_2 0.007167924403874439
[1] "dl_grid_model_3"
[1] 0.005363
h2o.best.dl.model
## Model Details:
## ==============
##
## H2OAutoEncoderModel: deeplearning
## Model ID: dl_grid_model_3
## Status of Neuron Layers: auto-encoder, gaussian distribution, Quadratic loss, 8,507 weights/biases, 119.2 KB, 1,880,013 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 207 Input 0.00 %
## 2 2 20 TanhDropout 50.00 % 0.001000 0.000000 0.037511 0.137523
## 3 3 207 Tanh 0.001000 0.000000 0.059986 0.035958
## momentum mean_weight weight_rms mean_bias bias_rms
## 1
## 2 0.000000 -0.004208 0.211915 -0.106799 4.464722
## 3 0.000000 -0.000134 0.033201 0.016737 0.117233
##
##
## H2OAutoEncoderMetrics: deeplearning
## ** Reported on training data. **
##
## Training Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.005363
## RMSE: (Extract with `h2o.rmse`) 0.07323
##
## H2OAutoEncoderMetrics: deeplearning
## ** Reported on validation data. **
##
## Validation Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.00531
## RMSE: (Extract with `h2o.rmse`) 0.07287
h2o.saveModel(h2o.best.dl.model, path="./models/grid_best", force = TRUE)
## [1] "/home/rstudio/networkflows/models/grid_best/dl_grid_model_3"
output_dl_model_performance(h2o.best.dl.model, h2o.data.validation)
## We could consider all Samples below a certain Reconstruction error as "Safe"
##
## 88.07 % cut off would miss 10 % of attacks
## 85.65 % cut off would miss 5 % of attacks
## 80.69 % cut off would miss 2.5 % of attacks
## 61.12 % Cut off is where the lowest Malicious sample Reconstruction error was, at 0.005202 .
## The samples with reconstruction error below this are all Benign samples
Our lowest reconstruction error for an attack is lower than before. Since we specified a less strict stopping criterion in the grid search, in order to compare more models faster, this is to be expected. However, the 80th percentile is catching a similar number of attacks as before.
Although we have the luxury of labelled benign data, we didn't use that information to train the autoencoder above; we simply looked for anomalies and compared them with the labels to see how we did. Since we do have labelled data, we can do better than just looking for anomalies in all the data.
We can train our autoencoder ONLY on the samples that we know to be benign. Then, when we reconstruct all the data, we should see higher reconstruction errors for the malicious data. Since the data was already heavily skewed towards benign samples, the improvement here may not be large.
trainBenignModel <- function() {
h2o.deeplearning(
x = predictor_vars,
autoencoder = TRUE,
standardize = TRUE,
epochs = 150,
seed = 42,
l1=1e-3,
max_w2 = 10,
stopping_rounds = 3,
stopping_tolerance =0.005,
stopping_metric = "AUTO",
hidden = c(20),
training_frame=h2o.data.train[as.character(h2o.data.train$Label)=='0',],
validation_frame=h2o.data.validation[as.character(h2o.data.validation$Label)=='0',],
activation=c("TanhWithDropout")
)
}
if (file.exists(NewestFileInDir("./models/benign")) ) {
print("loading saved model")
h2o.benign.dl.model <- h2o.loadModel(NewestFileInDir("./models/benign"))
} else {
h2o.benign.dl.model <- trainBenignModel() # trainBenignModel is defined without arguments
h2o.saveModel(h2o.benign.dl.model, path="./models/benign", force = TRUE)
}
## [1] "loading saved model"
benign_model_id <- h2o.benign.dl.model@model_id
cat("model Id is :",h2o.benign.dl.model@model_id)
## model Id is : DeepLearning_model_R_1524353906773_27
#"DeepLearning_model_R_1524353906773_27"
output_dl_model_performance(h2o.benign.dl.model, h2o.data.validation)
## We could consider all Samples below a certain Reconstruction error as "Safe"
##
## 85.72 % cut off would miss 10 % of attacks
## 81.99 % cut off would miss 5 % of attacks
## 77.72 % cut off would miss 2.5 % of attacks
## 60.73 % Cut off is where the lowest Malicious sample Reconstruction error was, at 0.005036 .
## The samples with reconstruction error below this are all Benign samples
We will now plot reconstruction errors on the full training data set.
recon_errors_all <- getReconErrors(h2o.benign.dl.model, h2o.data.all)
h2o.data.all[which(recon_errors_all==max(recon_errors_all)):(which(recon_errors_all==max(recon_errors_all))+3),]
## srcip sport dstip dsport proto state dur sbytes dbytes
## 1 2939006978 38224 2511044107 80 tcp FIN 2.002472 193156 1128
## 2 2939006976 0 2511044109 0 hmp INT 0.000005 200 0
## 3 2939006976 0 2511044109 0 hmp INT 0.000005 200 0
## 4 2939006976 0 2511044109 0 hmp INT 0.000005 200 0
## sttl dttl sloss dloss service Sload Dload Spkts Dpkts swin dwin
## 1 254 252 73 1 http 766596 4335 152 26 255 255
## 2 254 0 0 0 - 160000000 0 2 0 0 0
## 3 254 0 0 0 - 160000000 0 2 0 0 0
## 4 254 0 0 0 - 160000000 0 2 0 0 0
## stcpb dtcpb smeansz dmeansz trans_depth res_bdy_len Sjit Djit
## 1 1767075271 2729791068 1271 43 131 0 1567 122.2
## 2 0 0 100 0 0 0 0 0.0
## 3 0 0 100 0 0 0 0 0.0
## 4 0 0 100 0 0 0 0 0.0
## Stime Ltime Sintpkt Dintpkt tcprtt synack ackdat
## 1 1424223168 1424223170 12.969 78.09 0.1693 0.05015 0.1192
## 2 1424223170 1424223170 0.005 0.00 0.0000 0.00000 0.0000
## 3 1424223170 1424223170 0.005 0.00 0.0000 0.00000 0.0000
## 4 1424223170 1424223170 0.005 0.00 0.0000 0.00000 0.0000
## is_sm_ips_ports ct_state_ttl ct_flw_http_mthd is_ftp_login ct_ftp_cmd
## 1 0 1 1 NaN NaN
## 2 0 2 NaN NaN NaN
## 3 0 2 NaN NaN NaN
## 4 0 2 NaN NaN NaN
## ct_srv_src ct_srv_dst ct_dst_ltm ct_src_ltm ct_src_dport_ltm
## 1 7 5 2 2 2
## 2 3 3 1 1 1
## 3 3 3 1 1 1
## 4 3 3 1 1 1
## ct_dst_sport_ltm ct_dst_src_ltm attack_cat Label
## 1 1 5 Exploits 1
## 2 1 3 DoS 1
## 3 1 3 DoS 1
## 4 1 3 Exploits 1
##
## [4 rows x 49 columns]
One single attack instance had a reconstruction error 10 times that of other samples.
We can remove this single instance to better visualise the distribution of errors.
h2o.data.all <- h2o.data.all[-which(recon_errors_all==max(recon_errors_all)),]
recon_errors_val <- getReconErrors(h2o.benign.dl.model, h2o.data.validation)
h2o.data.validation <- h2o.data.validation[-which(recon_errors_val==max(recon_errors_val)),]
output_dl_model_performance(h2o.benign.dl.model, h2o.data.validation)
## We could consider all Samples below a certain Reconstruction error as "Safe"
##
## 85.72 % cut off would miss 10 % of attacks
## 81.99 % cut off would miss 5 % of attacks
## 77.72 % cut off would miss 2.5 % of attacks
## 60.73 % Cut off is where the lowest Malicious sample Reconstruction error was, at 0.005036 .
## The samples with reconstruction error below this are all Benign samples
Now almost all the malicious samples sit above the 80th percentile of reconstruction error, with only a small tail below it. Lowering the detection threshold to the 70th percentile would catch a few more true positives, but at the expense of many more false positives (roughly 10% of all samples). Similarly, we could detect ALL of the malicious samples in this dataset with a threshold at the 60th percentile, but only by labelling about 40% of all data as suspicious. Interestingly, while our LOWEST reconstruction error for a malicious sample is lower than before, the 97.5% recall point is still around the 78th percentile. Trying to catch every single odd outlier is likely not wise; choosing a 97.5% or 99% recall rate is wiser than overfitting our threshold to single outlier values.
We can continue on training with a lower MSE stopping metric. At this stage we will also feed our model ALL the data, both training and validation splits, before moving on to consider test data performance.
We will: reload the last model; retrain on new data with a stricter stopping criterion; and save the updated model.
A simple for loop on the following section could feed in smaller segments of data to continue training. If reducing the memory requirement for large data is an issue then this approach would be easily implemented.
Also it would be a small step to put this into a function to update the model as new data becomes available.
Performance of our further trained model
output_dl_model_performance(h2o.benign.dl.model.continued, h2o.data.validation)
## We could consider all Samples below a certain Reconstruction error as "Safe"
##
## 85.72 % cut off would miss 10 % of attacks
## 81.99 % cut off would miss 5 % of attacks
## 77.72 % cut off would miss 2.5 % of attacks
## 60.73 % Cut off is where the lowest Malicious sample Reconstruction error was, at 0.005036 .
## The samples with reconstruction error below this are all Benign samples
output_dl_model_performance(h2o.benign.dl.model.continued, h2o.data.test)
## We could consider all Samples below a certain Reconstruction error as "Safe"
##
## 67.59 % cut off would miss 10 % of attacks
## 61.88 % cut off would miss 5 % of attacks
## 59.09 % cut off would miss 2.5 % of attacks
## 42.44 % Cut off is where the lowest Malicious sample Reconstruction error was, at 0.00484 .
## The samples with reconstruction error below this are all Benign samples
Let's also evaluate our very first model on the test data.
output_dl_model_performance(h2o.data.train.dl, h2o.data.test)
## We could consider all Samples below a certain Reconstruction error as "Safe"
##
## 70.11 % cut off would miss 10 % of attacks
## 65.64 % cut off would miss 5 % of attacks
## 57.42 % cut off would miss 2.5 % of attacks
## 46.36 % Cut off is where the lowest Malicious sample Reconstruction error was, at 0.004535 .
## The samples with reconstruction error below this are all Benign samples
So the further-trained benign model was judged to be performing best on validation and test data; it is the one used as our anomaly detector model in the code that follows.
We can also examine which variables have the biggest effect on our model. There does not seem to be a single dominant factor; instead, several variables all have an effect.
h2o.varimp_plot(h2o.benign.dl.model.continued, num_of_features = 20)
So a question is where to put our anomaly detection threshold. Setting the threshold to catch every malicious sample in our data would lead to a huge number of false positives. Using the F1 metric, which weighs false positives and false negatives equally, would improve "accuracy", but mainly by reducing false positives at the expense of too many false negatives.
F2 Metric
By increasing our beta value we can weight our scoring towards favouring recall (correctly predicted positives / actual positives). The F2 metric, with a beta of 2, scores higher with better recall, even at the expense of more false positives. F2 is a good metric for selecting a threshold when a lower false negative rate is more important than a lower false positive rate.
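As a minimal sketch of how beta shifts the score, the F-beta formula can be written directly from confusion-matrix counts. The function name `f_beta` and the counts below are hypothetical and only for illustration; they are not taken from our models.

```r
# F-beta from confusion-matrix counts.
# tp = true positives, fp = false positives, fn = false negatives (hypothetical inputs).
f_beta <- function(tp, fp, fn, beta = 1) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

# Hypothetical detector: 90 attacks caught, 30 false alarms, 10 attacks missed.
f1 <- f_beta(tp = 90, fp = 30, fn = 10, beta = 1)
f2 <- f_beta(tp = 90, fp = 30, fn = 10, beta = 2)
print(c(F1 = f1, F2 = f2))
```

Here recall (0.90) is higher than precision (0.75), so F2 comes out above F1: the higher beta rewards the detector for missing few attacks and is more forgiving of its false alarms.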
Plot Detection threshold effects.
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 27 minutes 56 seconds
## H2O cluster timezone: Etc/UTC
## H2O data parsing timezone: UTC
## H2O cluster version: 3.18.0.7
## H2O cluster version age: 11 days
## H2O cluster name: H2O_started_from_R_rstudio_ymr419
## H2O cluster total nodes: 1
## H2O cluster total memory: 14.01 GB
## H2O cluster total cores: 2
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.0 (2017-04-21)
if (!exists("h2o.benign.dl.model.continued.errors")){
h2o.benign.dl.model.continued.errors <- getReconErrors(h2o.benign.dl.model.continued, h2o.data.validation)
}
validationLabels <- as.data.frame(h2o.data.validation[,"Label"])
ErrorLabelFrame <- data.frame( errors=h2o.benign.dl.model.continued.errors,
Labels=validationLabels)
colnames(ErrorLabelFrame)[colnames(ErrorLabelFrame) == 'Reconstruction.MSE'] <- 'errors'
quantlist <- seq(0.6, 1, 0.01)
errorquants <- quantile(ErrorLabelFrame$errors,probs = quantlist)
totalframennegatives <- h2o.nrow(h2o.data.validation$Label[h2o.data.validation$Label=="0",])
positivesAboveQuant <- c(0)
negativesAboveQuant <- c(0)
totalAboveQuant <- c(0)
for ( i in 1:length(errorquants)) {
positivesAboveQuant[i] <- nrow(ErrorLabelFrame[(ErrorLabelFrame$errors > errorquants[[i]])&(ErrorLabelFrame$Label=="1"),])
negativesAboveQuant[i] <- nrow(ErrorLabelFrame[(ErrorLabelFrame$errors > errorquants[[i]])&(ErrorLabelFrame$Label=="0"),])
totalAboveQuant[i] <- nrow(ErrorLabelFrame[(ErrorLabelFrame$errors > errorquants[[i]]),])
}
actualpositives<- max(positivesAboveQuant)
falsePositiverate <- negativesAboveQuant/totalframennegatives
truePositivePercent <- positivesAboveQuant/totalAboveQuant
truePositiverate <- positivesAboveQuant/actualpositives
falsetrueratio <- positivesAboveQuant/negativesAboveQuant
plot(x=quantlist,
y=c(truePositiverate),
type='l',
ylim=c(0.3,1),
xlim=c(0.65,0.95),
col='red',
ylab = "Percentage found",
xlab = '',
main = "Malicious attacks found vs. benign allowed past",
sub = "85% Threshold catches 90% of Attacks, with only 10% False negative rate. \nTo maximise F2 score 80% threshold will catch over 95% of Attacks."
)
lines(x=quantlist,y=c(1-falsePositiverate),type='l',col="green")
grid(col = "lightgray", lty = "dotted",
lwd = par("lwd"), equilogs = TRUE)
We will set the anomaly detection threshold at the 80th percentile of reconstruction error and consider all samples beyond this MSE threshold suspicious.
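The threshold could also be picked programmatically rather than read off the plot. The sketch below scans the same kind of quantile cut-offs and keeps the one with the highest F2 score. The data are synthetic stand-ins for the reconstruction errors and labels, and `f_beta_at` is a hypothetical helper, so the threshold it finds says nothing about our real data.

```r
set.seed(42)
# Synthetic stand-ins: benign errors are mostly low, attack errors mostly higher.
errors <- c(rexp(900, rate = 200), rexp(100, rate = 50) + 0.01)
labels <- c(rep(0, 900), rep(1, 100))

# F-beta score obtained when everything above `threshold` is flagged as an attack.
f_beta_at <- function(threshold, errors, labels, beta = 2) {
  flagged <- errors > threshold
  tp <- sum(flagged & labels == 1)
  fp <- sum(flagged & labels == 0)
  fn <- sum(!flagged & labels == 1)
  if (tp == 0) return(0)
  (1 + beta^2) * tp / ((1 + beta^2) * tp + beta^2 * fn + fp)
}

# Candidate thresholds: error quantiles from the 60th to the 99th percentile.
probs   <- seq(0.6, 0.99, 0.01)
cutoffs <- quantile(errors, probs = probs)
scores  <- sapply(cutoffs, f_beta_at, errors = errors, labels = labels, beta = 2)

best           <- unname(which.max(scores))
best_threshold <- cutoffs[[best]]
print(names(cutoffs)[best])
print(best_threshold)
```

Changing `beta` in the `sapply` call moves the chosen cut-off: a larger beta tends to favour a lower threshold that lets fewer attacks through.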
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 28 minutes 29 seconds
## H2O cluster timezone: Etc/UTC
## H2O data parsing timezone: UTC
## H2O cluster version: 3.18.0.7
## H2O cluster version age: 11 days
## H2O cluster name: H2O_started_from_R_rstudio_ymr419
## H2O cluster total nodes: 1
## H2O cluster total memory: 14.01 GB
## H2O cluster total cores: 2
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.0 (2017-04-21)
errorquants[16:25]
## 75% 76% 77% 78% 79% 80% 81% 82%
## 0.008749 0.008861 0.008977 0.009101 0.009238 0.009378 0.009543 0.009725
## 83% 84%
## 0.009956 0.010250
errorquants[21]
## 80%
## 0.009378
suspicionThreshold = errorquants[[21]]
All MSEs above this are to be considered suspicious.
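Looking the threshold up by position (element 21) works because the quantile list starts at 0.60 and steps by 0.01. A slightly more robust sketch is to look it up by name, so the code still picks the 80% quantile if the list changes. The errors below are random numbers used only to make the example runnable.

```r
set.seed(1)
errors <- runif(1000)                 # hypothetical reconstruction errors
quantlist   <- seq(0.6, 1, 0.01)
errorquants <- quantile(errors, probs = quantlist)

by_position <- errorquants[[21]]      # relies on where 80% sits in the list
by_name     <- errorquants[["80%"]]   # relies only on the quantile's label

# Flag everything above the threshold as suspicious.
suspicious <- errors > by_name
print(mean(suspicious))
```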
if (!exists("h2o.data.train.errors")){
h2o.data.train.errors <- getReconErrors(h2o.benign.dl.model.continued, h2o.data.train)
}
if (!exists("h2o.data.all.errors")){
h2o.data.all.errors <- getReconErrors(h2o.benign.dl.model.continued, h2o.data.all)
}
if (!exists("h2o.data.test.errors")){
h2o.data.test.errors <- getReconErrors(h2o.benign.dl.model.continued, h2o.data.test)
}
if (!exists("h2o.data.all.suspicious")){
h2o.data.all.suspicious <- h2o.data.all[which(as.vector(h2o.data.all.errors[1] > suspicionThreshold)),]
}
if (!exists("h2o.data.test.suspicious")){
h2o.data.test.suspicious <- h2o.data.test[which(as.vector(h2o.data.test.errors[1] > suspicionThreshold)),]
}
We can now try to subclassify the attack categories.
Let's try out H2O GBM on this data.
Run h2oGBM
#GBM
gbm_parameters <- list(learn_rate = 0.1, #c(0.1,0.01),
max_depth = 5, #c(5,7),
sample_rate = 0.8, #c(0.3,0.5,0.7),
col_sample_rate = 0.9 # c(0.5,0.7)
)
tic()
if (!exists("h2o.gbm.grid")){
h2o.gbm.grid<- h2o.grid("gbm",
training_frame = h2o.data.train,
validation_frame = h2o.data.validation,
#nfolds=3, # commented out for faster running for printing.
x = predictor_vars,
y = "Label",
hyper_params=gbm_parameters,
ntrees = 100,
stopping_rounds = 3,
stopping_tolerance = 0.1,
stopping_metric = "misclassification",
model_id = "gbm_response4",
seed = 42)
}
##
|
| | 0%
|
|=================================================================| 100%
toc()
## 191.001 sec elapsed
#47 secs for stopping tol 0.1 auc grid search,
#166sec for 0.1 misclassification
#237 FOR auc WITH GRID FOR SAMPLE RATE 0.3,0.5,0.7
# 89 FOR STOPPING_metric lift_top_group
#56s for 0.2 sample rate
#60sec col sample 0.9
#112S MPC
print(h2o.gbm.grid)
## H2O Grid Details
## ================
##
## Grid ID: Grid_GBM_RTMP_sid_b01f_79_model_R_1524748087009_8
## Used hyper parameters:
## - col_sample_rate
## - learn_rate
## - max_depth
## - sample_rate
## Number of models: 1
## Number of failed models: 0
##
## Hyper-Parameter Search Summary: ordered by increasing logloss
## col_sample_rate learn_rate max_depth sample_rate
## 1 0.9 0.1 5 0.8
## model_ids
## 1 Grid_GBM_RTMP_sid_b01f_79_model_R_1524748087009_8_model_0
## logloss
## 1 0.009242757067122489
h2o.gbm.model.1 <- h2o.getModel(h2o.gbm.grid@model_ids[[1]])
h2o.gbm.model.1
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: Grid_GBM_RTMP_sid_b01f_79_model_R_1524748087009_8_model_0
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 100 100 38276 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 7 32 24.07000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.002757
## RMSE: 0.05251
## LogLoss: 0.008854
## Mean Per-Class Error: 0.0162
## AUC: 0.9998
## Gini: 0.9997
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1130007 2130 0.001881 =2130/1132137
## 1 2662 84572 0.030516 =2662/87234
## Totals 1132669 86702 0.003930 =4792/1219371
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.509902 0.972450 208
## 2 max f2 0.280356 0.981252 305
## 3 max f0point5 0.742980 0.981642 129
## 4 max accuracy 0.509902 0.996070 208
## 5 max precision 0.997702 1.000000 0
## 6 max recall 0.086182 1.000000 371
## 7 max specificity 0.997702 1.000000 0
## 8 max absolute_mcc 0.509902 0.970339 208
## 9 max min_per_class_accuracy 0.284447 0.994349 303
## 10 max mean_per_class_accuracy 0.130351 0.995873 361
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.002925
## RMSE: 0.05408
## LogLoss: 0.009243
## Mean Per-Class Error: 0.01506
## AUC: 0.9998
## Gini: 0.9996
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 282188 696 0.002460 =696/282884
## 1 602 21164 0.027658 =602/21766
## Totals 282790 21860 0.004261 =1298/304650
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.476244 0.970247 227
## 2 max f2 0.265718 0.980209 316
## 3 max f0point5 0.794345 0.980148 113
## 4 max accuracy 0.479206 0.995739 226
## 5 max precision 0.997702 1.000000 0
## 6 max recall 0.080384 1.000000 374
## 7 max specificity 0.997702 1.000000 0
## 8 max absolute_mcc 0.476244 0.967955 227
## 9 max min_per_class_accuracy 0.279336 0.994008 310
## 10 max mean_per_class_accuracy 0.080384 0.995760 374
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
if (!exists("h2o.gbm.model.1.probs.valid")){
h2o.gbm.model.1.probs.valid <-h2o.predict(object = h2o.gbm.model.1,
newdata = h2o.data.validation)
h2o.gbm.model.1.predictions.valid <- as.vector(h2o.gbm.model.1.probs.valid[,1])
gbm_predicts_validation <- as.factor(h2o.gbm.model.1.predictions.valid)
}
##
|
| | 0%
|
|=================================================================| 100%
So at max F1 on the validation data, our supervised GBM correctly predicts 21164 of the 21766 actual malicious instances. That is 602 false negatives, a false negative rate of about 2.8%, with only 696 false positives from the 282884 actual negatives. Using the max F2 threshold (0.2657) instead gives an F2 score of 0.9802 and trades some additional false positives for fewer false negatives.
This is much more effective than the unsupervised autoencoder model.
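As a quick check of the arithmetic, the rates can be recomputed from the four cells of the validation confusion matrix printed above (rows are actual classes, columns are predicted classes). This is only a worked example in base R, not part of the modelling pipeline.

```r
# Validation confusion matrix counts at the F1-optimal threshold.
tn <- 282188; fp <- 696     # actual benign:    predicted 0 / predicted 1
fn <- 602;    tp <- 21164   # actual malicious: predicted 0 / predicted 1

false_negative_rate <- fn / (fn + tp)   # share of actual attacks that were missed
false_positive_rate <- fp / (fp + tn)   # share of actual benign flows that were flagged
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

print(round(c(FNR = false_negative_rate, FPR = false_positive_rate, F1 = f1), 6))
```

The two rates agree with the Error column of the matrix, and the F1 value agrees with the "max f1" row of the validation metrics.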
h2o.auc(h2o.gbm.model.1,train=TRUE,valid=TRUE)
## train valid
## 0.9998 0.9998
Let's create a data frame of only the predicted malicious samples from that model.
if (!exists("h2o.data.gbmpredictedMalicious")){
h2o.data.gbmpredictedMalicious <- h2o.data.validation[as.h2o(gbm_predicts_validation==1),]
}
##
|
| | 0%
|
|=================================================================| 100%
Test Data Performance
if (!exists("h2o.gbm.model.1.perf.test")) {
h2o.gbm.model.1.perf.test <- h2o.performance(
model = h2o.gbm.model.1,
newdata = h2o.data.test)
}
h2o.auc(h2o.gbm.model.1.perf.test,train=TRUE,valid=TRUE)
## [1] 0.9996
# Look at the hyperparameters for the best model
print(h2o.gbm.model.1@model[["model_summary"]])
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 100 100 38276 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 7 32 24.07000
Grid search of GBM - commented out for faster execution
# chunk option eval=FALSE: grid search not run when knitting
gbm_parameters <- list(learn_rate = c(0.1), #c(0.1,0.01),
max_depth = c(7), #c(5,7),
sample_rate = c(0.7), #c(0.5,0.7),
col_sample_rate = c(0.5) #c(0.5,0.7)
)
tic()
gbm_gridperf1<- h2o.grid("gbm",
training_frame = h2o.data.train,
validation_frame = h2o.data.validation,
#nfolds=3,
x = predictor_vars,
y = response,
hyper_params=gbm_parameters,
ntrees = 30,
stopping_rounds = 2,
stopping_tolerance = 1e-2, #set to low tolerance for fast running to produce notebook. initially 1e-6
#stopping_metric = "AUC",
model_id = "gbm_response2",
seed = 42)
toc()
#1627 secs for grid search,
Grid search returned best parameters of
#Best parameters
# col_sample_rate learn_rate max_depth sample_rate
# 0.5 0.1 7 0.7
#print(gbm_gridperf1)
# Grab the top GBM model, chosen by validation AUC
gbm_model1 <- h2o.gbm.model.1
gbm_validationdata_predictions <-h2o.predict(object = gbm_model1,
newdata = h2o.data.validation)
##
|
| | 0%
|
|=================================================================| 100%
gbm_predicts_validation_class <- as.vector(gbm_validationdata_predictions[,1]) # take first column
gbm_predicts_validation <- as.factor(gbm_predicts_validation_class)
h2o.auc(gbm_model1,train=TRUE,valid=TRUE)
## train valid
## 0.9998 0.9998
gbm_predicts_validation <- as.factor((gbm_predicts_validation_class))
## Model summary
print(gbm_model1@model[["model_summary"]])
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 100 100 38276 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 7 32 24.07000
accuracy and MSE
gbm_model1_testdata_performance <- h2o.performance(
model = gbm_model1,
newdata = h2o.data.test)
gbm_testdata_predictions <- h2o.predict(object = gbm_model1,
newdata = h2o.data.test)
##
|
| | 0%
|
|=================================================================| 100%
gbm_predicts_test_class <- as.factor(as.vector(gbm_testdata_predictions[,1]))
gbm_predicts_test <- as.factor(as.vector(gbm_testdata_predictions[,1]))
h2o.auc(gbm_model1_testdata_performance)
## [1] 0.9996
#str(gbm_model1_testdata_performance)
gbm_model1_testdata_performance@metrics$MSE
## [1] 0.006625
#gbm_model1_testdata_performance$
Let's use the pROC library to calculate our AUC score (remember, an AUC of 0.5 is random and 1 is perfect) and plot a chart:
library(pROC)
validation_actual = as.numeric(as.vector(h2o.data.validation$Label))
# Convert the factor via character so the values are the 0/1 labels, not the internal level codes 1/2
validation_predictions = as.numeric(as.character(gbm_predicts_validation))
PlotAUC <- function(response, predictor) {
auc = roc(response=response,
predictor=predictor)
plot(auc, print.thres = "best",
main=paste('Validation data AUC:',round(auc$auc[[1]],3)))
abline(h=1,col='blue')
abline(h=0,col='green')
}
PlotAUC(validation_actual, validation_predictions)
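To make the "0.5 is random, 1 is perfect" reading concrete, AUC can be sketched in base R as the probability that a randomly chosen positive is scored above a randomly chosen negative. The helper `auc_from_scores` and the toy vectors are hypothetical; this is the rank-based (Mann-Whitney) formulation, not necessarily how pROC computes it internally.

```r
# Rank-based AUC: share of (positive, negative) pairs where the positive scores higher,
# counting ties as half.
auc_from_scores <- function(actual, score) {
  r     <- rank(score)                 # average ranks handle ties
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

actual <- c(0, 0, 0, 1, 1)
perfect  <- auc_from_scores(actual, c(0.1, 0.2, 0.3, 0.8, 0.9))  # positives always ranked highest
constant <- auc_from_scores(actual, rep(0.5, 5))                 # scores carry no information
print(c(perfect = perfect, constant = constant))
```

Note that the notebook passes hard 0/1 class predictions to `roc`, which gives a curve with a single operating point; passing the predicted probabilities instead would trace the full curve.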
example code:
F_meas(as.factor(validation_predictions), as.factor(as.vector(validation_actual)), relevant = '1',
beta = 1, na.rm = TRUE)
F_meas(as.factor(validation_predictions), as.factor(as.vector(validation_actual)), relevant = '1',
beta = 5, na.rm = TRUE)
Increasing the beta level of the F-measure adds weight to recall (sensitivity) over precision, so as a metric it penalises false negatives more than false positives.
We will use balance_classes to get even samples of the different attack categories. We will train this ONLY on the validation samples that our previous GBM model predicted as malicious.
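The idea behind balancing classes can be sketched in base R by resampling every category up to the size of the largest one. This is only an illustration of the concept on made-up category counts; it is an assumption that it resembles what H2O does internally, and H2O's actual sampling procedure may differ.

```r
set.seed(42)
# Hypothetical, heavily imbalanced attack categories.
attack_cat <- c(rep("Generic", 500), rep("Exploits", 100), rep("Worms", 5))

# Oversample each class (with replacement) up to the size of the largest class.
target       <- max(table(attack_cat))
idx_by_class <- split(seq_along(attack_cat), attack_cat)
balanced_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), target, replace = TRUE)]
}))
balanced <- attack_cat[balanced_idx]

print(table(attack_cat))
print(table(balanced))
```

After balancing, rare categories are made up of repeated copies of the same few rows, so the model sees them more often but does not gain any new information about them.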
#GBM
gbm_parameters <- list(learn_rate = 0.1, #c(0.1,0.01),
max_depth = 5, #c(5,7),
sample_rate = 0.8, #c(0.3,0.5,0.7),
col_sample_rate = 0.9 # c(0.5,0.7)
)
tic()
if (!exists("h2o.gbm.grid.cat")){
h2o.gbm.grid.cat<- h2o.grid("gbm",
training_frame = h2o.data.gbmpredictedMalicious,
validation_frame = h2o.data.gbmpredictedMalicious,
#nfolds=3, # commented out for faster running for printing.
x = predictor_vars,
y = "attack_cat",
balance_classes = TRUE,
hyper_params=gbm_parameters,
ntrees = 100,
stopping_rounds = 3,
stopping_tolerance = 0.05,
stopping_metric = "MSE",
model_id = "gbm_response4",
seed = 42)
}
##
|
| | 0%
|
|=================================================================| 100%
toc()
## 93.254 sec elapsed
print(h2o.gbm.grid.cat)
## H2O Grid Details
## ================
##
## Grid ID: Grid_GBM_RTMP_sid_a52c_151_model_R_1524748087009_34
## Used hyper parameters:
## - col_sample_rate
## - learn_rate
## - max_depth
## - sample_rate
## Number of models: 1
## Number of failed models: 0
##
## Hyper-Parameter Search Summary: ordered by increasing logloss
## col_sample_rate learn_rate max_depth sample_rate
## 1 0.9 0.1 5 0.8
## model_ids
## 1 Grid_GBM_RTMP_sid_a52c_151_model_R_1524748087009_34_model_0
## logloss
## 1 0.2917082773410527
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 39 minutes 55 seconds
## H2O cluster timezone: Etc/UTC
## H2O data parsing timezone: UTC
## H2O cluster version: 3.18.0.7
## H2O cluster version age: 11 days
## H2O cluster name: H2O_started_from_R_rstudio_ymr419
## H2O cluster total nodes: 1
## H2O cluster total memory: 14.25 GB
## H2O cluster total cores: 2
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.0 (2017-04-21)
h2o.gbm.model.cat1 <- h2o.getModel(h2o.gbm.grid.cat@model_ids[[1]])
h2o.gbm.model.cat1
## Model Details:
## ==============
##
## H2OMultinomialModel: gbm
## Model ID: Grid_GBM_RTMP_sid_a52c_151_model_R_1524748087009_34_model_0
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 52 520 239805 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 10 32 26.64038
##
##
## H2OMultinomialMetrics: gbm
## ** Reported on training data. **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("RTMP_sid_a52c_151")`
## MSE: (Extract with `h2o.mse`) 0.2785
## RMSE: (Extract with `h2o.rmse`) 0.5277
## Logloss: (Extract with `h2o.logloss`) 0.7832
## Mean Per-Class Error: 0.3335
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## Analysis Backdoor DoS Exploits Fuzzers Generic NA
## Analysis 2248 0 1047 7048 573 0 0
## Backdoor 363 2728 731 6552 487 60 0
## DoS 71 45 3276 7155 171 138 18
## Exploits 51 54 451 9904 100 80 27
## Fuzzers 73 40 202 1408 8949 7 254
## Generic 0 0 27 163 11 10761 0
## NA 0 0 0 621 4377 48 5825
## Reconnaissance 0 0 200 1482 0 18 0
## Shellcode 0 0 0 181 0 0 0
## Worms 0 0 0 1725 0 0 0
## Totals 2806 2867 5934 36239 14668 11112 6124
## Reconnaissance Shellcode Worms Error Rate
## Analysis 0 0 0 0.7941 = 8,668 / 10,916
## Backdoor 0 0 0 0.7502 = 8,193 / 10,921
## DoS 51 0 0 0.7001 = 7,649 / 10,925
## Exploits 249 3 0 0.0930 = 1,015 / 10,919
## Fuzzers 0 0 0 0.1815 = 1,984 / 10,933
## Generic 0 0 0 0.0183 = 201 / 10,962
## NA 63 0 0 0.4673 = 5,109 / 10,934
## Reconnaissance 9226 0 0 0.1556 = 1,700 / 10,926
## Shellcode 0 10729 0 0.0166 = 181 / 10,910
## Worms 0 0 9196 0.1580 = 1,725 / 10,921
## Totals 9589 10732 9196 0.3334 = 36,425 / 109,267
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-10 Hit Ratios:
## k hit_ratio
## 1 1 0.666642
## 2 2 0.852371
## 3 3 0.940943
## 4 4 0.975894
## 5 5 0.995571
## 6 6 0.997190
## 7 7 1.000000
## 8 8 1.000000
## 9 9 1.000000
## 10 10 1.000000
##
##
## H2OMultinomialMetrics: gbm
## ** Reported on validation data. **
##
## Validation Set Metrics:
## =====================
##
## Extract validation frame with `h2o.getFrame("RTMP_sid_a52c_151")`
## MSE: (Extract with `h2o.mse`) 0.1019
## RMSE: (Extract with `h2o.rmse`) 0.3192
## Logloss: (Extract with `h2o.logloss`) 0.2917
## Mean Per-Class Error: 0.3335
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## Analysis Backdoor DoS Exploits Fuzzers Generic NA
## Analysis 43 0 20 135 11 0 0
## Backdoor 6 45 12 108 8 1 0
## DoS 8 5 374 817 20 16 2
## Exploits 17 18 147 3241 33 26 9
## Fuzzers 11 6 31 212 1357 1 39
## Generic 0 0 36 203 13 12747 0
## NA 0 0 0 39 276 3 367
## Reconnaissance 0 0 21 156 0 2 0
## Shellcode 0 0 0 2 0 0 0
## Worms 0 0 0 3 0 0 0
## Totals 85 74 641 4916 1718 12796 417
## Reconnaissance Shellcode Worms Error Rate
## Analysis 0 0 0 0.7943 = 166 / 209
## Backdoor 0 0 0 0.7500 = 135 / 180
## DoS 6 0 0 0.7003 = 874 / 1,248
## Exploits 81 1 0 0.0929 = 332 / 3,573
## Fuzzers 0 0 0 0.1811 = 300 / 1,657
## Generic 0 0 0 0.0194 = 252 / 12,999
## NA 4 0 0 0.4673 = 322 / 689
## Reconnaissance 971 0 0 0.1557 = 179 / 1,150
## Shellcode 0 119 0 0.0165 = 2 / 121
## Worms 0 0 16 0.1579 = 3 / 19
## Totals 1062 120 16 0.1174 = 2,565 / 21,845
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
## =======================================================================
## Top-10 Hit Ratios:
## k hit_ratio
## 1 1 0.882582
## 2 2 0.960998
## 3 3 0.982650
## 4 4 0.993362
## 5 5 0.997528
## 6 6 0.999497
## 7 7 1.000000
## 8 8 1.000000
## 9 9 1.000000
## 10 10 1.000000
We got an overall error rate of about 11.7% on the validation frame when predicting across all categories.
h2o.data.test.malicious <- h2o.data.test[as.h2o(gbm_predicts_test==1),]
##
|
| | 0%
|
|=================================================================| 100%
Predict attack categories for our suspicious data
if (!exists("h2o.gbm.model.cat1.probs.sus")){
h2o.gbm.model.cat1.probs.sus <-h2o.predict(object = h2o.gbm.model.cat1,
newdata = h2o.data.test.malicious)
}
##
|
| | 0%
|
|=================================================================| 100%
h2o.gbm.model.cat1.predictions.sus <- h2o.predict(object = h2o.gbm.model.cat1,
newdata = h2o.data.test.malicious,
type = 'response')
##
|
| | 0%
|
|=================================================================| 100%
h2o.gbm.model.cat1.predicted.class.sus <- h2o.gbm.model.cat1.predictions.sus[,1]
predicted_cat <- as.factor(as.vector(h2o.gbm.model.cat1.predicted.class.sus))
actual_cat <- as.factor(as.vector(h2o.data.test.malicious$attack_cat))
confusionMatrix(predicted_cat,actual_cat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Analysis Backdoor DoS Exploits Fuzzers Generic NA
## Analysis 1 2 10 19 2 2 4
## Backdoor 0 41 5 7 0 1 0
## DoS 215 217 1741 1939 220 268 12
## Exploits 950 1078 8019 22792 1389 1863 769
## Fuzzers 14 10 101 209 6233 106 3588
## Generic 0 19 188 388 298 147666 47
## NA 15 4 45 150 1032 25 1314
## Reconnaissance 2 10 74 642 13 19 1
## Shellcode 0 9 17 64 0 9 6
## Worms 0 0 0 3 0 0 0
## Reference
## Prediction Reconnaissance Shellcode Worms
## Analysis 3 0 0
## Backdoor 0 0 0
## DoS 274 0 1
## Exploits 1632 222 68
## Fuzzers 1 5 2
## Generic 32 27 18
## NA 1 3 0
## Reconnaissance 6478 0 2
## Shellcode 0 629 0
## Worms 0 0 11
##
## Overall Statistics
##
## Accuracy : 0.876
## 95% CI : (0.875, 0.878)
## No Information Rate : 0.703
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.744
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Analysis Class: Backdoor Class: DoS
## Sensitivity 0.00083542 0.029496 0.17069
## Specificity 0.99980198 0.999939 0.98451
## Pos Pred Value 0.02325581 0.759259 0.35625
## Neg Pred Value 0.99439164 0.993674 0.95941
## Prevalence 0.00561192 0.006517 0.04782
## Detection Rate 0.00000469 0.000192 0.00816
## Detection Prevalence 0.00020160 0.000253 0.02291
## Balanced Accuracy 0.50031870 0.514718 0.57760
## Class: Exploits Class: Fuzzers Class: Generic
## Sensitivity 0.869 0.6785 0.985
## Specificity 0.915 0.9802 0.984
## Pos Pred Value 0.588 0.6070 0.993
## Neg Pred Value 0.980 0.9855 0.965
## Prevalence 0.123 0.0431 0.703
## Detection Rate 0.107 0.0292 0.692
## Detection Prevalence 0.182 0.0481 0.697
## Balanced Accuracy 0.892 0.8293 0.984
## Class: NA Class: Reconnaissance Class: Shellcode
## Sensitivity 0.22888 0.7693 0.70993
## Specificity 0.99386 0.9963 0.99951
## Pos Pred Value 0.50753 0.8946 0.85695
## Neg Pred Value 0.97899 0.9906 0.99879
## Prevalence 0.02692 0.0395 0.00415
## Detection Rate 0.00616 0.0304 0.00295
## Detection Prevalence 0.01214 0.0339 0.00344
## Balanced Accuracy 0.61137 0.8828 0.85472
## Class: Worms
## Sensitivity 0.1078431
## Specificity 0.9999859
## Pos Pred Value 0.7857143
## Neg Pred Value 0.9995733
## Prevalence 0.0004782
## Detection Rate 0.0000516
## Detection Prevalence 0.0000656
## Balanced Accuracy 0.5539145
With an overall accuracy of 87.6% in assigning categories to the test-set items labelled as malicious, our model is performing somewhat satisfactorily. However, it is clear from the per-class sensitivity and detection rates that the success is dominated by the much more frequently occurring categories, and classification of categories with few samples in the training set is poor.
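A small made-up confusion matrix shows how a dominant class can carry the overall accuracy while rare classes are mostly missed. The counts below are hypothetical and laid out the way caret prints them, with predictions in rows and actual classes in columns.

```r
# Hypothetical 3-class confusion matrix: rows = predicted, columns = actual.
cm <- matrix(c(950, 40, 8,
                30, 55, 6,
                20,  5, 6),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("Generic", "Exploits", "Worms"),
                             Reference  = c("Generic", "Exploits", "Worms")))

accuracy         <- sum(diag(cm)) / sum(cm)   # dominated by the largest class
sensitivity      <- diag(cm) / colSums(cm)    # per-class recall
mean_sensitivity <- mean(sensitivity)         # treats every class equally

print(round(accuracy, 3))
print(round(sensitivity, 3))
print(round(mean_sensitivity, 3))
```

In this toy case accuracy is about 90% although only 30% of the rarest class is found, which is the same pattern as in our test results.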
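flow_id <- seq_along(gbm_predicts_test)
# cbind() on a factor would write its internal level codes (1/2), so convert to the 0/1 labels first
rows <- cbind(flow_id,as.numeric(as.character(gbm_predicts_test)))
colnames(rows) <- c("Flow_id","Label")
write.csv(rows,file="GBMPredictiedLabels.csv", row.names=FALSE)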
Checking and saving attack Category predictions to CSV in a numeric format.
length(gbm_predicts_test)
## [1] 1016019
length(predicted_cat)
## [1] 213296
length(gbm_predicts_test[gbm_predicts_test==1])
## [1] 213296
predicted_cat_numeric <- predicted_cat
#encode as numbers
# the level must be spelled "Worms" to match the data; a non-matching level is silently turned into NA
predicted_cat_numeric <- factor(predicted_cat,
levels = c("NA","Reconnaissance","Fuzzers","Analysis","Backdoor","Exploits","Generic","Shellcode","Worms","DoS"),
labels = c(0,1, 2, 3,4,5,6,7,8,9))
predicted_cat_numeric <- as.character(predicted_cat_numeric)
predicted_cat_numeric <- as.numeric(predicted_cat_numeric)
table(predicted_cat)
## predicted_cat
## Analysis Backdoor DoS Exploits Fuzzers
## 43 54 4887 38782 10269
## Generic NA Reconnaissance Shellcode Worms
## 148683 2589 7241 734 14
table(predicted_cat_numeric)
## predicted_cat_numeric
## 0 1 2 3 4 5 6 7 8 9
## 2589 7241 10269 43 54 38782 148683 734 14 4887
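One way to make this encoding harder to get wrong is a named lookup vector with an explicit check, so that a category name missing from the lookup stops the script instead of quietly becoming NA. The code numbers follow the mapping used above; the `predicted` vector is a hypothetical handful of predictions.

```r
# Category name -> numeric code, following the mapping used in this notebook.
codes <- c("NA" = 0, Reconnaissance = 1, Fuzzers = 2, Analysis = 3, Backdoor = 4,
           Exploits = 5, Generic = 6, Shellcode = 7, Worms = 8, DoS = 9)

predicted     <- c("Generic", "Worms", "DoS", "NA")   # hypothetical predictions
numeric_codes <- unname(codes[predicted])

# Fails loudly if any category name is not in the lookup.
stopifnot(!anyNA(numeric_codes))
print(numeric_codes)

# A misspelled name is not found and comes back as NA.
print(is.na(codes["Worm"]))
```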
# add cats to all test data labels
predictedTestCats <- as.numeric(as.character(gbm_predicts_test))
sum(predictedTestCats)
## [1] 213296
length(predicted_cat_numeric)
## [1] 213296
predictedTestCats[predictedTestCats==1] <- predicted_cat_numeric
length(predictedTestCats)
## [1] 1016019
flow_id <- seq_along(predictedTestCats)
# write the attack category codes, not the binary labels again
rows2 <- cbind(flow_id,predictedTestCats)
colnames(rows2) <- c("Flow_id","Label")
write.csv(rows2,file="GBMPredictiedAttackLabels.csv", row.names=FALSE)
Predicted categories of the predicted attacks
table(predicted_cat)
## predicted_cat
## Analysis Backdoor DoS Exploits Fuzzers
## 43 54 4887 38782 10269
## Generic NA Reconnaissance Shellcode Worms
## 148683 2589 7241 734 14
Actual test Categories for all test data below
table(as.vector(h2o.data.test$attack_cat))
##
## Analysis Backdoor DoS Exploits Fuzzers
## 1459 1390 10262 26423 13334
## Generic NA Reconnaissance Shellcode Worms
## 149998 803737 8426 886 104
We had moderate success predicting the attack categories.
The autoencoder was also moderately successful, at least letting us quickly discard more than 60% of flows as "benign". There may always be benign samples that are very similar to attacks on these data fields, so a considerable number of false positives from this type of data may be inevitable.
Lowered stopping criteria metrics. I had set fairly easy early stopping criteria in model training, to speed up running whilst producing this notebook. I was stopping model building when errors were not improving by more than 1% over a few scoring rounds. I had previously run individual models with higher required accuracy, and I will go back and rerun all the models with the stopping criteria set below a tenth of a percent. This will run for a lot longer, but it should return a more accurate model of the benign data, and we can assess whether this improves the anomaly detection. It is possible that it may NOT perform any better, in spite of longer model training.
Optimising on F2 metric in code. Coding to extract the detection threshold on an optimised F2 metric. We could also code a customisable beta for the F metric to set our detection threshold programmatically that way.
Smarter synthetic columns. The various synthetic columns of "instances similar in the last 100" are quite rudimentary and a machine learning informed approach to creating this kind of synthetic data column may lead to much greater success and insight into the patterns of attacks. A model that is "keeping score" of attack patterns happening as they come in may also have better success in using current state of recent flows to inform new classification decisions.
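For reference, a rudimentary "similar in the last 100" column of the kind described above can be sketched as a rolling count. The function `count_recent_matches`, its `key` argument and the toy data are hypothetical; the existing columns in the dataset may have been built differently.

```r
# For each flow, count how many of the previous `window` flows share the same key
# (for example, the same service or destination address).
count_recent_matches <- function(key, window = 100) {
  n <- length(key)
  vapply(seq_len(n), function(i) {
    if (i == 1) return(0L)
    start <- max(1, i - window)
    sum(key[start:(i - 1)] == key[i])
  }, integer(1))
}

key <- c("a", "b", "a", "a", "b")
recent <- count_recent_matches(key, window = 2)
print(recent)
```

A learned approach would replace this fixed window and exact-match rule with something informed by the data, but the rolling count is a useful baseline to compare against.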
DavidMcCann April 2018