# Libraries
library(pryr)
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(outliers)
library(knitr)
library(ggplot2)
For this assignment, I have chosen two data sets typically used in any modern Security Operation Centres (SOC). Data-driven analysis is a key skill in a SOC’s defence capability. The data comprises of a reputation database maintained by AlienVault, a cyber security threat management platform, as well as anonymised web application firewall (WAF) logs recording ‘blocked’ traffic attempting to navigate an internet-facing application.
The goal of this assignment, through data wrangling, is to compare the nature of both an active defence control (in this case the WAF) and a threat intelligence source. In today’s world, SOCs are significantly disadvantaged when it comes to the fight against cyber adversaries. Often, security decisions are made on risk appetite with inherent probabilities of events considered. This in mind, assessing the delta of these data sets may unveil critial threat information and enable further hardening of existing defence controls.
The activities below show the stages of data preprocessing that will enable analysis of a WAF log against a Reputation database:
It was discovered, via an inner join, that the resulting relationship table of the data-sets contained only 96 records. Limiting the amount of analysis possible. However, this relationship table is featured in the Transform section, and shows a benefit of this analysis. A common problem is that security analysts are often inundated by security events each and every day. Through data preprocessing we have distilled some 423962 records down 96 records of which can be used for targeted investigation into potential threats. This is an actionable list made up of qualified signals.
The following data sets will be used for the assignment:
AlienVault provides a database of potentially compromised IPs accompanied by severity scores and geolocation data. Security professionals may use the database as a reference against public IPs associated with their organisation to better understand potential threats and support investigation activities such as threat hunting and forensics. Not to mention, the IPs may also be added to a blacklist.
The reputation database comprises of the following variables:
Data profile:
The second set of data consists of ‘blocked’ traffic attempting to communicate with an Organisations internet-facing application. Typically, organisations will establish a web application firewall as a shield between a web application and the internet. ‘Out-of-the-box’ WAFs will come with a set of policies and rules to detect and block traffic in real-time. However, the default rules are not enough, WAFs must be continually improved so that they are able to deal with new and emerging attack techniques.
The WAF log comprises of the following variables:
Data profile:
The following initial steps were taken to setup the analysis:
# 1 - AlienVault Reputation data
avReputationURL <- "http://datadrivensecurity.info/book/ch03/data/reputation.data"
rep <- read_delim(avReputationURL, delim = "#", col_names = c("IP", "Reliability", "Risk", "Type", "Country", "Locale", "Coords", "x"))
## Parsed with column specification:
## cols(
## IP = col_character(),
## Reliability = col_integer(),
## Risk = col_integer(),
## Type = col_character(),
## Country = col_character(),
## Locale = col_character(),
## Coords = col_character(),
## x = col_character()
## )
# 2 - Web Application Firewall data
waf <- read_delim("~/Data Science Projects/assignment3/data/webProxy2017Clean.csv", delim = ",")
## Parsed with column specification:
## cols(
## `_time` = col_datetime(format = ""),
## geo.city = col_character(),
## geo.country = col_character(),
## geo.lat = col_double(),
## geo.long = col_double(),
## geo.region = col_character(),
## message.UA = col_character(),
## message.bytes = col_integer(),
## message.cliIP = col_character()
## )
# Check dimensions of the data
dim(rep)
## [1] 258626 8
dim(waf)
## [1] 423962 9
# Check size of the data
object_size(rep)
## 29.3 MB
object_size(waf)
## 33.2 MB
# Merge data
df <- left_join(x = waf, y = rep, by = c("message.cliIP" = "IP"))
# Show sample
head(df)
Taking a deeper look at the variables and data structures associated with the data the following was observed:
This in mind, each ordinal and nominal categorical variable was converted to a factor data type. This is so that R’s functions treat these variables appropriately.
Furthermore, given Risk and Reliability are of ordinal nature, it was decided that labels be added as descriptors for each of the variables levels.
# Check variables of the data
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 423962 obs. of 16 variables:
## $ _time : POSIXct, format: "2017-08-31 02:17:33" "2017-08-31 02:15:24" ...
## $ geo.city : chr "WASHINGTON" "SACRAMENTO" "LOSANGELES" "MELBOURNE" ...
## $ geo.country : chr "US" "US" "US" "AU" ...
## $ geo.lat : num 38.9 38.6 34 -37.8 -37.8 ...
## $ geo.long : num -77 -121 -118 145 145 ...
## $ geo.region : chr "DC" "CA" "CA" "VIC" ...
## $ message.UA : chr "Mozilla%2f5.0%20(Windows%3b%20U%3b%20Windows%20NT%205.1%3b%20en-US%3b%20rv%3a1.9.0.1)%20Gecko%2f2008070208%20Firefox%2f3.0.1" "Mozilla%2f5.0%20(Windows%20NT%206.1%3b%20WOW64%3b%20rv%3a40.0)%20Gecko%2f20100101%20Firefox%2f40.1" "Mozilla%2f5.0%20(compatible%3b%20Googlebot%2f2.1%3b%20http%3a%2f%2frobotto.org%2frobotto-bot%2f)" "Microsoft%20Office%20Protocol%20Discovery" ...
## $ message.bytes: int 347 263 281 216 216 216 216 213 213 216 ...
## $ message.cliIP: chr "69.41.14.233" "107.141.168.103" "23.224.64.12" "202.67.98.35" ...
## $ Reliability : int NA NA NA NA NA NA NA NA NA NA ...
## $ Risk : int NA NA NA NA NA NA NA NA NA NA ...
## $ Type : chr NA NA NA NA ...
## $ Country : chr NA NA NA NA ...
## $ Locale : chr NA NA NA NA ...
## $ Coords : chr NA NA NA NA ...
## $ x : chr NA NA NA NA ...
# Conduct type conversions
df$geo.country <- as.factor(df$geo.country)
df$geo.city <- as.factor(df$geo.city)
df$geo.region <- as.factor(df$geo.region)
df$Country <- as.factor(df$Country)
df$Locale <- as.factor(df$Locale)
df$Type <- as.factor(df$Type)
df$x <- as.integer(df$x)
# Let's add some labels to the Risk and Reliability data
df$Reliability <- ordered(df$Reliability, levels = c(2,3,4,6), labels = c("Low", "Medium", "High", "Very High"))
df$Risk <- ordered(df$Risk, levels = c(2,3,4), labels = c("Low", "Medium", "High"))
# Show sample
head(df)
The data is predominantly tidy, however, the following steps were taken to better represent the data and conform to the Tidy data principles:
Note, variables in the data related to location from each of the original data sets have been retained so that comparative analysis may be conducted for particular IPs and their origins. In addition, the User Agent string variable could possibly be separated into sub-variables such as the Operating System, Web Browser, and more, associated with an IP.
# Tidy up the columns names so they are easier to understand
colnames(df) <- c("waf.time", "waf.city", "waf.country", "waf.latitude",
"waf.longitude", "waf.region", "waf.userAgent",
"waf.bytes", "waf.IP", "av.reliability", "av.risk",
"av.type", "av.country", "av.locale", "av.coords", "av.x")
# Seperate coords out into longitude and latitude
df <- separate(df, av.coords, c("av.longitude", "av.latitude"), ",", extra = "drop", fill = "right")
# Check data types
str(df$av.longitude)
## chr [1:423962] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
str(df$av.latitude)
## chr [1:423962] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
# Fix the data types of the seperated coordinate variables
df$av.longitude <- as.double(df$av.longitude)
df$av.latitude <- as.double(df$av.latitude)
# Show sample
head(df)
The User Agent string variable contains some information that we can derive a new variable from. A common format for browsers is ‘User-Agent: Mozilla/
# Using the plot we can infer the outlers using a filter
df_no_outliers <- df %>%
filter(waf.bytes < 10000)
# Summarise the mean of bytes without the outlier
df_no_outliers %>% summarise(mean = mean(df_no_outliers$waf.bytes))
Using inner_join, we derive an enriched subset of data where we can compare the origins of each IP according to both the WAF and Reputation data-sets. Furthermore, this is a manageable list of IPs that a security analyst may pick up and investigate further to understand potential threats to the organisation’s application.
# Inner join of the data
df_inner <- inner_join(x = waf, y = rep, by = c("message.cliIP" = "IP"))
# Show subset
df_inner