Today’s cyber threat environment is persistent and ever-changing, and organizations defending against these threats need tools to mitigate potential intrusions. I analyzed cyber threats to Saint Martin’s University using datasets obtained from the University’s existing networking equipment, then processed the data into analysis that can be used to further mitigate those threats.
With this analysis, the University has gained valuable insights and techniques for its defense against cybersecurity threats. These insights can be used to tighten the University’s cybersecurity or serve as evidence in criminal investigations and prosecutions. We, as a community, are safer and more secure as a result.
Questions
1. Which country is the top source of attacks against Saint Martin’s University?
2. Which techniques are used to try to attack the University?
Data Source
Saint Martin’s University uses a Palo Alto Networks firewall that generates extensive, detailed data about network traffic and threats inbound to and outbound from the Main Campus. This data is sent to a logging server, where it is categorized and stored for further analysis. The repository currently holds over 4 TB of data, but I can extract only the attributes necessary for the analysis. To limit the dataset to a usable size, I extracted only the last 30 days of data.
We begin our analysis with the setup: loading the libraries and reading in the CSV file. The raw data is then filtered for the “critical” entries. “Critical” entries are the most destructive, usually a code execution vulnerability or a virus. The firewall blocks these.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
threat_raw <- read_csv("C:\\linux\\CSC530 Data Analysis\\Projects\\Week 1\\AprilThreats.csv", col_types = cols(timestamp = col_character()))
# filter() already drops rows where the condition is NA, so no na.rm argument is needed
threats <- filter(threat_raw, Severity == "critical")
summary(threats)
## timestamp Action Application Destination_Address
## Length:2768 Length:2768 Length:2768 Length:2768
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Destination_Country Destination_Port IP_Protocol Severity
## Length:2768 Min. : 25 Length:2768 Length:2768
## Class :character 1st Qu.: 9034 Class :character Class :character
## Mode :character Median : 9034 Mode :character Mode :character
## Mean : 7320
## 3rd Qu.: 9034
## Max. :63759
## Source_Address Source_Country Source_Port Threat_Content_Type
## Length:2768 Length:2768 Min. : 123 Length:2768
## Class :character Class :character 1st Qu.:39056 Class :character
## Mode :character Mode :character Median :46443 Mode :character
## Mean :45747
## 3rd Qu.:54403
## Max. :64557
## Threat_ID
## Length:2768
## Class :character
## Mode :character
##
##
##
str(threats)
## spec_tbl_df [2,768 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ timestamp : chr [1:2768] "2022-03-24T20:26:24.000Z" "2022-03-24T22:03:37.000Z" "2022-03-24T23:39:01.000Z" "2022-03-25T00:08:55.000Z" ...
## $ Action : chr [1:2768] "reset-both" "reset-both" "reset-both" "reset-both" ...
## $ Application : chr [1:2768] "web-browsing" "web-browsing" "web-browsing" "web-browsing" ...
## $ Destination_Address: chr [1:2768] "199.245.238.31" "199.245.238.31" "199.245.238.31" "199.245.238.74" ...
## $ Destination_Country: chr [1:2768] "United States" "United States" "United States" "United States" ...
## $ Destination_Port : num [1:2768] 81 81 81 80 80 80 80 81 81 80 ...
## $ IP_Protocol : chr [1:2768] "tcp" "tcp" "tcp" "tcp" ...
## $ Severity : chr [1:2768] "critical" "critical" "critical" "critical" ...
## $ Source_Address : chr [1:2768] "194.31.98.144" "194.31.98.144" "194.31.98.144" "115.60.193.14" ...
## $ Source_Country : chr [1:2768] "United States" "United States" "United States" "China" ...
## $ Source_Port : num [1:2768] 40022 36634 59636 1897 34623 ...
## $ Threat_Content_Type: chr [1:2768] "vulnerability" "vulnerability" "vulnerability" "vulnerability" ...
## $ Threat_ID : chr [1:2768] "Wireless IP Camera Pre-Auth Info Leak Vulnerability(33556)" "Wireless IP Camera Pre-Auth Info Leak Vulnerability(33556)" "Wireless IP Camera Pre-Auth Info Leak Vulnerability(33556)" "Netgear DGN Device Remote Command Execution Vulnerability(40741)" ...
## - attr(*, "spec")=
## .. cols(
## .. timestamp = col_character(),
## .. Action = col_character(),
## .. Application = col_character(),
## .. Destination_Address = col_character(),
## .. Destination_Country = col_character(),
## .. Destination_Port = col_double(),
## .. IP_Protocol = col_character(),
## .. Severity = col_character(),
## .. Source_Address = col_character(),
## .. Source_Country = col_character(),
## .. Source_Port = col_double(),
## .. Threat_Content_Type = col_character(),
## .. Threat_ID = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
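Because the timestamp column is read as character, any date arithmetic needs it parsed first. The following is a minimal sketch of one way to do that with lubridate. The three rows reuse timestamps shown in the str() output above, and the names sample_threats, event_time, and event_date are illustrative assumptions rather than part of the original analysis.

```r
library(dplyr)
library(lubridate)

# Illustrative rows only; the timestamps are copied from the str() output above.
sample_threats <- tibble(
  timestamp = c("2022-03-24T20:26:24.000Z",
                "2022-03-24T22:03:37.000Z",
                "2022-03-25T00:08:55.000Z")
)

parsed <- sample_threats %>%
  mutate(
    event_time = ymd_hms(timestamp, tz = "UTC"),  # ISO 8601 string -> POSIXct
    event_date = as_date(event_time)              # calendar date in UTC
  )

# First and last calendar dates covered by these rows
date_range <- range(parsed$event_date)
date_range
```

Applied to the full dataset, the same range check would confirm which calendar dates the 30-day extract actually covers.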
The first chart answers the first question, “Which countries are attacking Saint Martin’s University the most?”
#
# Threats per source country chart
#
threat_countries <- threats %>%
  group_by(Source_Country) %>%
  summarize(count = n()) %>%
  ungroup()
threat_countries %>%
  ggplot(aes(x = count, y = reorder(Source_Country, count))) +
  geom_col() +
  labs(title = "Threat Count By Country", y = "Country")
As you can see, China has the most, followed by the Netherlands, Iran, Russia, and then the U.S.
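A ranking is easier to compare when each country’s count is also expressed as a share of the total. The sketch below shows one way to compute that with dplyr. The rows in toy_threats are hypothetical and are not taken from the firewall logs; only the column name Source_Country comes from the dataset.

```r
library(dplyr)

# Hypothetical rows for illustration; not taken from the firewall logs.
toy_threats <- tibble(
  Source_Country = c("China", "China", "China", "Netherlands", "Iran")
)

country_share <- toy_threats %>%
  count(Source_Country, name = "count", sort = TRUE) %>%  # largest count first
  mutate(share = count / sum(count))                      # fraction of all rows

country_share
```

Here count() replaces the group_by(), summarize(), and ungroup() sequence in a single call.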
The second chart answers the second question, “What vulnerabilities are being attacked the most?” We can use the Threat_ID to find this out. The firewall generates this Threat ID and adds it to the log entry.
#
# Threat IDs Plot
#
threat_IDs <- threats %>%
  group_by(Threat_ID) %>%
  summarize(count = n()) %>%
  ungroup()
head(threat_IDs)
threat_IDs %>%
  ggplot(aes(x = count, y = reorder(Threat_ID, count))) +
  geom_col() +
  labs(title = "Threat Id Count", y = "Threat Id")
We find that the most used critical vulnerability is the “Realtek Jungle SDK Remote Code Execution Vulnerability.” This vulnerability exploits a flaw in the Realtek Jungle SDK, software used in routers and other embedded network devices built on Realtek chipsets, to gain unauthorized remote control of a device.
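Each Threat_ID value combines a descriptive name with a numeric identifier in parentheses, which makes long axis labels and exact lookups awkward. The sketch below shows one way to split the two parts with stringr. The two sample values are copied from the str() output above, while the column names threat_name and threat_code are assumptions introduced for illustration.

```r
library(dplyr)
library(stringr)

# Sample values copied from the str() output above.
ids <- tibble(
  Threat_ID = c(
    "Wireless IP Camera Pre-Auth Info Leak Vulnerability(33556)",
    "Netgear DGN Device Remote Command Execution Vulnerability(40741)"
  )
)

split_ids <- ids %>%
  mutate(
    # Drop the trailing "(digits)" to keep only the descriptive name
    threat_name = str_trim(str_remove(Threat_ID, "\\(\\d+\\)$")),
    # Capture the digits inside the trailing parentheses as an integer
    threat_code = as.integer(str_match(Threat_ID, "\\((\\d+)\\)$")[, 2])
  )

split_ids
```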
Next, we want to find out how often China attacks Saint Martin’s University. Is it on a schedule?
#
# Time-series Charts of China
#
china <- filter(threats, Source_Country == "China")
china_threats <- china %>%
  # Extract the day of the month from the leading YYYY-MM-DD part of the timestamp;
  # refer to the column directly rather than through china$ inside mutate()
  mutate(day = as.integer(format(as.Date(timestamp, format = "%Y-%m-%d"), format = "%d"))) %>%
  group_by(day) %>%
  summarize(threat_count = n()) %>%
  ungroup()
ggplot(data = china_threats, aes(x = day, y = threat_count)) +
  geom_col() +
  labs(title = "April 2022 China Threat Count Per Day", x = "Day Of the Month", y = "Threat Count")
We find that China attacked us with “critical” vulnerabilities every day except for the 24th. Note that the 1st and 5th of April each saw a large attack involving many vulnerabilities.
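Grouping by day of the month combines any dates that share a day number across months, and the sample timestamps in the str() output begin on 2022-03-24, so the extract may include March dates. A hedged alternative is to count by full calendar date and fill in days with no events explicitly. The sketch below assumes hypothetical timestamps (toy_china) and illustrative column names (event_date, threat_count); it is not the analysis that produced the chart above.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical timestamps for illustration; not taken from the firewall logs.
toy_china <- tibble(
  timestamp = c("2022-03-31T23:10:00.000Z",
                "2022-04-01T01:15:00.000Z",
                "2022-04-01T09:30:00.000Z",
                "2022-04-03T12:00:00.000Z")
)

daily <- toy_china %>%
  mutate(event_date = as_date(ymd_hms(timestamp, tz = "UTC"))) %>%
  count(event_date, name = "threat_count") %>%
  # Add a row for every calendar day in the range, with 0 where nothing was logged
  complete(
    event_date = seq(min(event_date), max(event_date), by = "day"),
    fill = list(threat_count = 0L)
  )

daily
```

Plotting daily with event_date on the x-axis keeps March and April separate and makes quiet days visible as zero-height bars.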
We would then like to know what vulnerabilities were used the most by China. This chart shows the number of threats as well as the Threat ID.
china_ids <- china %>%
  group_by(Threat_ID) %>%
  summarize(threats = n()) %>%
  ungroup()
str(china_ids)
## tibble [11 x 2] (S3: tbl_df/tbl/data.frame)
## $ Threat_ID: chr [1:11] "Apache Shiro Improper Authentication Vulnerability(58132)" "Apache Struts Content-Type Remote Code Execution Vulnerability(33196)" "D-Link DSL Soap Authorization Remote Command Execution Vulnerability(58483)" "GPON Home Routers Remote Code Execution Vulnerability(37264)" ...
## $ threats : int [1:11] 3 2 6 49 59 3 1027 3 2 2 ...
ggplot(data = china_ids, aes(x = threats, y = reorder(Threat_ID,threats))) +
geom_col() +
labs(title = "April 2022 China Threat IDs", y = "Threat ID", x = "Threat Count")
We find that China indeed used the Realtek Jungle SDK Remote Code Execution Vulnerability the most, followed by the Netgear DGN Device Remote Command Execution Vulnerability and the GPON Home Routers Remote Code Execution Vulnerability (49 instances). Several others were caught, but only a few times each. These are small in comparison.
The result of our analysis is that China has been attacking us the most, primarily with the Realtek Jungle SDK Remote Code Execution Vulnerability. These attempts were blocked and recorded by our Palo Alto Networks firewall. Without this firewall in place, several people on campus could have been compromised.