MATH2349 Semester 1, 2018

Required packages

# Libraries
library(pryr)
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(outliers)
library(knitr)
library(ggplot2)

Executive Summary

For this assignment, I have chosen two data sets typically used in any modern Security Operation Centres (SOC). Data-driven analysis is a key skill in a SOC’s defence capability. The data comprises of a reputation database maintained by AlienVault, a cyber security threat management platform, as well as anonymised web application firewall (WAF) logs recording ‘blocked’ traffic attempting to navigate an internet-facing application.

The goal of this assignment, through data wrangling, is to compare the nature of both an active defence control (in this case the WAF) and a threat intelligence source. In today’s world, SOCs are significantly disadvantaged when it comes to the fight against cyber adversaries. Often, security decisions are made on risk appetite with inherent probabilities of events considered. This in mind, assessing the delta of these data sets may unveil critial threat information and enable further hardening of existing defence controls.

The activities below show the stages of data preprocessing that will enable analysis of a WAF log against a Reputation database:

Ingestion of data
Understanding the data
Tidy and manipulation of the data
Outlier detection

It was discovered, via an inner join, that the resulting relationship table of the data-sets contained only 96 records. Limiting the amount of analysis possible. However, this relationship table is featured in the Transform section, and shows a benefit of this analysis. A common problem is that security analysts are often inundated by security events each and every day. Through data preprocessing we have distilled some 423962 records down 96 records of which can be used for targeted investigation into potential threats. This is an actionable list made up of qualified signals.

Data

The following data sets will be used for the assignment:

AlienVault reputation database

AlienVault provides a database of potentially compromised IPs accompanied by severity scores and geolocation data. Security professionals may use the database as a reference against public IPs associated with their organisation to better understand potential threats and support investigation activities such as threat hunting and forensics. Not to mention, the IPs may also be added to a blacklist.

The reputation database comprises of the following variables:

IP: A potentially compromised IP address (IPv4)
Reliability: A reliability score indicating the validity of the intel
Risk: A risk score indicating the potential threat that a IP may pose
Type: The type of behaviour or state of the IP i.e. scanner, malicious host, etc.
Country: Country code associated with the IP
Locale: Location associated with the IP
Coords: Geo-coordinates of the IP
X: Number of community submissions against this IP

Data profile:

54069 records
8 variables
6.67MB

Anonymous Web Application Firewall logs

The second set of data consists of ‘blocked’ traffic attempting to communicate with an Organisations internet-facing application. Typically, organisations will establish a web application firewall as a shield between a web application and the internet. ‘Out-of-the-box’ WAFs will come with a set of policies and rules to detect and block traffic in real-time. However, the default rules are not enough, WAFs must be continually improved so that they are able to deal with new and emerging attack techniques.

The WAF log comprises of the following variables:

time: The date and time of the event
geo.city: The city associated with the event
geo.country: The country associated with the event
geo.lat: The latitude reference associated with the event
geo.long: The longitude reference associated with the event
geo.region: The region associated with the event
message.UA: A URL-encoded character string that allows the network to identify the application type, OS, or software associated with a request.
message.bytes: Byte size of the request
message.cliIP: IP address associated with the event

Data profile:

423962 records
9 variables
33.2MB

The following initial steps were taken to setup the analysis:

Import libraries: Specific libraries are imported so that we have the required tools for our analysis.
Data ingestion: The AlienVault reputation data is sourced, via URL, from the Data-driven Security text’s code repository. Whilst, the WAF data is sourced from an Organisation with permission for research and has since been sanitised.
Data profiling: Base R functions are used to get a high-level understanding of the size and nature of the data.
Data merging: The two data-sets are merged using both a left and inner join as optional data sources in the analysis.

# 1 - AlienVault Reputation data
avReputationURL <- "http://datadrivensecurity.info/book/ch03/data/reputation.data"
rep <- read_delim(avReputationURL, delim = "#", col_names = c("IP", "Reliability", "Risk", "Type", "Country", "Locale", "Coords", "x"))

## Parsed with column specification:
## cols(
##   IP = col_character(),
##   Reliability = col_integer(),
##   Risk = col_integer(),
##   Type = col_character(),
##   Country = col_character(),
##   Locale = col_character(),
##   Coords = col_character(),
##   x = col_character()
## )

# 2 - Web Application Firewall data
waf <- read_delim("~/Data Science Projects/assignment3/data/webProxy2017Clean.csv", delim = ",")

## Parsed with column specification:
## cols(
##   `_time` = col_datetime(format = ""),
##   geo.city = col_character(),
##   geo.country = col_character(),
##   geo.lat = col_double(),
##   geo.long = col_double(),
##   geo.region = col_character(),
##   message.UA = col_character(),
##   message.bytes = col_integer(),
##   message.cliIP = col_character()
## )

# Check dimensions of the data 
dim(rep)

## [1] 258626      8

dim(waf)

## [1] 423962      9

# Check size of the data
object_size(rep)

## 29.3 MB

object_size(waf)

## 33.2 MB

# Merge data
df <- left_join(x = waf, y = rep, by = c("message.cliIP" = "IP"))

# Show sample
head(df)

Understand

Taking a deeper look at the variables and data structures associated with the data the following was observed:

A number of variables have been stored as string values (chr)
It appears there are a number of nominal categorical variables (i.e. Country, City, Region, etc.)
In addition, Risk and Reliability appear to be ordinal categorical variables
The primary data (WAF) is of temporal nature with the time variable stored as a POSIXct type

This in mind, each ordinal and nominal categorical variable was converted to a factor data type. This is so that R’s functions treat these variables appropriately.

Furthermore, given Risk and Reliability are of ordinal nature, it was decided that labels be added as descriptors for each of the variables levels.

# Check variables of the data
str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    423962 obs. of  16 variables:
##  $ _time        : POSIXct, format: "2017-08-31 02:17:33" "2017-08-31 02:15:24" ...
##  $ geo.city     : chr  "WASHINGTON" "SACRAMENTO" "LOSANGELES" "MELBOURNE" ...
##  $ geo.country  : chr  "US" "US" "US" "AU" ...
##  $ geo.lat      : num  38.9 38.6 34 -37.8 -37.8 ...
##  $ geo.long     : num  -77 -121 -118 145 145 ...
##  $ geo.region   : chr  "DC" "CA" "CA" "VIC" ...
##  $ message.UA   : chr  "Mozilla%2f5.0%20(Windows%3b%20U%3b%20Windows%20NT%205.1%3b%20en-US%3b%20rv%3a1.9.0.1)%20Gecko%2f2008070208%20Firefox%2f3.0.1" "Mozilla%2f5.0%20(Windows%20NT%206.1%3b%20WOW64%3b%20rv%3a40.0)%20Gecko%2f20100101%20Firefox%2f40.1" "Mozilla%2f5.0%20(compatible%3b%20Googlebot%2f2.1%3b%20http%3a%2f%2frobotto.org%2frobotto-bot%2f)" "Microsoft%20Office%20Protocol%20Discovery" ...
##  $ message.bytes: int  347 263 281 216 216 216 216 213 213 216 ...
##  $ message.cliIP: chr  "69.41.14.233" "107.141.168.103" "23.224.64.12" "202.67.98.35" ...
##  $ Reliability  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Risk         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Type         : chr  NA NA NA NA ...
##  $ Country      : chr  NA NA NA NA ...
##  $ Locale       : chr  NA NA NA NA ...
##  $ Coords       : chr  NA NA NA NA ...
##  $ x            : chr  NA NA NA NA ...

# Conduct type conversions
df$geo.country <- as.factor(df$geo.country)
df$geo.city <- as.factor(df$geo.city)
df$geo.region <- as.factor(df$geo.region)
df$Country <- as.factor(df$Country)
df$Locale <- as.factor(df$Locale)
df$Type <- as.factor(df$Type)

df$x <- as.integer(df$x)

# Let's add some labels to the Risk and Reliability data
df$Reliability <- ordered(df$Reliability, levels = c(2,3,4,6), labels = c("Low", "Medium", "High", "Very High"))
df$Risk <- ordered(df$Risk, levels = c(2,3,4), labels = c("Low", "Medium", "High"))

# Show sample
head(df)

Tidy & Manipulate Data I

The data is predominantly tidy, however, the following steps were taken to better represent the data and conform to the Tidy data principles:

Clearer column names to distinguish between WAF and Reputation data (by adding a prefix to distinguish between origin data sets)
The Coords variable, residing from the AlienVault data, is more useful as separate longitude and latitude variables

Note, variables in the data related to location from each of the original data sets have been retained so that comparative analysis may be conducted for particular IPs and their origins. In addition, the User Agent string variable could possibly be separated into sub-variables such as the Operating System, Web Browser, and more, associated with an IP.

# Tidy up the columns names so they are easier to understand
colnames(df) <- c("waf.time", "waf.city", "waf.country", "waf.latitude", 
             "waf.longitude", "waf.region", "waf.userAgent", 
             "waf.bytes", "waf.IP", "av.reliability", "av.risk", 
             "av.type", "av.country", "av.locale", "av.coords", "av.x")

# Seperate coords out into longitude and latitude
df <- separate(df, av.coords, c("av.longitude", "av.latitude"), ",", extra = "drop", fill = "right")

# Check data types
str(df$av.longitude)

##  chr [1:423962] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...

str(df$av.latitude)

##  chr [1:423962] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...

# Fix the data types of the seperated coordinate variables
df$av.longitude <- as.double(df$av.longitude)
df$av.latitude <- as.double(df$av.latitude)

# Show sample
head(df)

Tidy & Manipulate Data II

The User Agent string variable contains some information that we can derive a new variable from. A common format for browsers is ‘User-Agent: Mozilla/ () () ’. In this case we will extract the % boxplot(main=“Box Plot of Bytes”, ylab=“bytes”, col=“grey”) ```

# Using the plot we can infer the outlers using a filter
df_no_outliers <- df %>% 
  filter(waf.bytes < 10000)

# Summarise the mean of bytes without the outlier
df_no_outliers %>% summarise(mean = mean(df_no_outliers$waf.bytes))

Transform

Using inner_join, we derive an enriched subset of data where we can compare the origins of each IP according to both the WAF and Reputation data-sets. Furthermore, this is a manageable list of IPs that a security analyst may pick up and investigate further to understand potential threats to the organisation’s application.

# Inner join of the data
df_inner <- inner_join(x = waf, y = rep, by = c("message.cliIP" = "IP"))

# Show subset
df_inner