Introduction

The issue I want to investigate is cybersecurity threats. This area interests me because of my background, because I want to work in the cybersecurity field in the future, and because of the current moment: new vulnerabilities and exploits keep appearing, along with new ways to prevent and defend against them.

With this project I want to explore questions such as which kinds of attacks are the most common, which are the most effective, and which vulnerabilities are most widely left unaddressed. There is a lot that could be analyzed, but given this project's structure and requirements I will focus on two main questions whose answers together cover both the red team and the blue team.

Questions I want to answer

1.) What kind of vulnerability is the worst to have, measured by financial loss, number of affected users, and resolution time?

2.) What kind of attack seems to be the most effective by the same measures?

Setup

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# ggplot2 is already attached by library(tidyverse), so only scales is loaded here
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Dataset

df <- read_csv("CS_Dataset.csv") %>%
  clean_names()

## Rows: 3000 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Country, Attack Type, Target Industry, Attack Source, Security Vuln...
## dbl (4): Year, Financial Loss (in Million $), Number of Affected Users, Inci...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

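# Make sure the numeric fields are stored as numbers. read_csv() already
# reports four dbl columns above, so this step is a safeguard.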
df <- df %>%
  mutate(
    financial_loss_in_million = as.numeric(financial_loss_in_million),
    number_of_affected_users = as.numeric(number_of_affected_users),
    incident_resolution_time_in_hours = as.numeric(incident_resolution_time_in_hours)
  )

summary(df)

##    country               year      attack_type        target_industry   
##  Length:3000        Min.   :2015   Length:3000        Length:3000       
##  Class :character   1st Qu.:2017   Class :character   Class :character  
##  Mode  :character   Median :2020   Mode  :character   Mode  :character  
##                     Mean   :2020                                        
##                     3rd Qu.:2022                                        
##                     Max.   :2024                                        
##  financial_loss_in_million number_of_affected_users attack_source     
##  Min.   : 0.50             Min.   :   424           Length:3000       
##  1st Qu.:25.76             1st Qu.:255805           Class :character  
##  Median :50.80             Median :504513           Mode  :character  
##  Mean   :50.49             Mean   :504684                             
##  3rd Qu.:75.63             3rd Qu.:758088                             
##  Max.   :99.99             Max.   :999635                             
##  security_vulnerability_type defense_mechanism_used
##  Length:3000                 Length:3000           
##  Class :character            Class :character      
##  Mode  :character            Mode  :character      
##                                                    
##                                                    
##                                                    
##  incident_resolution_time_in_hours
##  Min.   : 1.00                    
##  1st Qu.:19.00                    
##  Median :37.00                    
##  Mean   :36.48                    
##  3rd Qu.:55.00                    
##  Max.   :72.00
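
summary() prints an NA's row for any numeric column that contains missing values, and none appears above, which suggests the conversion did not create any. A minimal sketch of making that check explicit is shown below. The helper name count_coercion_nas and the toy vector are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: count values that are NA only after as.numeric(),
# i.e. entries that were present but could not be parsed as numbers.
count_coercion_nas <- function(x) {
  converted <- suppressWarnings(as.numeric(x))
  sum(is.na(converted) & !is.na(x))
}

# Toy input (not from CS_Dataset.csv): "n/a" cannot be parsed as a number,
# while the trailing NA was already missing and is not counted.
toy_loss <- c("80.5", "62.2", "n/a", NA)
count_coercion_nas(toy_loss)
```

The same helper could be applied to a raw column before the mutate() step, for example count_coercion_nas(df$financial_loss_in_million).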

head(df)

## # A tibble: 6 × 10
##   country  year attack_type       target_industry    financial_loss_in_million
##   <chr>   <dbl> <chr>             <chr>                                  <dbl>
## 1 China    2019 Phishing          Education                               80.5
## 2 China    2019 Ransomware        Retail                                  62.2
## 3 India    2017 Man-in-the-Middle IT                                      38.6
## 4 UK       2024 Ransomware        Telecommunications                      41.4
## 5 Germany  2018 Man-in-the-Middle IT                                      74.4
## 6 Germany  2017 Man-in-the-Middle Retail                                  98.2
## # ℹ 5 more variables: number_of_affected_users <dbl>, attack_source <chr>,
## #   security_vulnerability_type <chr>, defense_mechanism_used <chr>,
## #   incident_resolution_time_in_hours <dbl>



Start Investigating Vulnerabilities

vulnerability_summary <- df %>%
  group_by(security_vulnerability_type) %>%
  summarize(
    avg_loss = mean(financial_loss_in_million),
    avg_users = mean(number_of_affected_users),
    avg_resolution = mean(incident_resolution_time_in_hours),
    count = n()
  ) %>%
  arrange(desc(avg_loss))

vulnerability_summary

## # A tibble: 4 × 5
##   security_vulnerability_type avg_loss avg_users avg_resolution count
##   <chr>                          <dbl>     <dbl>          <dbl> <int>
## 1 Social Engineering              50.9   500802.           36.5   747
## 2 Weak Passwords                  50.5   519339.           35.6   730
## 3 Zero-day                        50.4   504836.           36.0   785
## 4 Unpatched Software              50.2   493956.           37.9   738
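
The average losses in the table above differ by less than one million USD (50.2 to 50.9), so the ordering may reflect sampling noise rather than a real difference. One way to check is to report a standard error next to each mean and run a rank-based test across groups. The sketch below assumes the cleaned column names used above. The helper summarise_loss and the toy values are hypothetical and only illustrate the idea.

```r
library(dplyr)

# Hypothetical helper: group means with a standard error, so that gaps between
# groups can be compared against sampling noise.
summarise_loss <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(
      avg_loss = mean(financial_loss_in_million, na.rm = TRUE),
      sd_loss  = sd(financial_loss_in_million, na.rm = TRUE),
      count    = n(),
      se_loss  = sd_loss / sqrt(count),
      .groups  = "drop"
    ) %>%
    arrange(desc(avg_loss))
}

# Toy data for illustration only (values are made up, not from CS_Dataset.csv).
toy <- tibble(
  security_vulnerability_type = rep(c("Zero-day", "Weak Passwords"), each = 4),
  financial_loss_in_million   = c(10, 20, 30, 40, 15, 25, 35, 45)
)

toy_summary <- summarise_loss(toy, security_vulnerability_type)
toy_summary

# Kruskal-Wallis test: do the loss distributions differ between groups?
kruskal.test(financial_loss_in_million ~ security_vulnerability_type, data = toy)
```

On the real data the call would be summarise_loss(df, security_vulnerability_type). A large p-value from the test would mean the data do not single out one vulnerability type as worse than the others on financial loss.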

Visual

ggplot(vulnerability_summary, 
       aes(x = reorder(security_vulnerability_type, avg_loss), y = avg_loss)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Average Financial Loss by Vulnerability Type",
    x = "Vulnerability Type",
    y = "Average Loss (Million USD)"
  )
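
A bar chart of means hides how widely the individual losses vary within each group. A box plot is one alternative that shows the spread directly. The sketch below uses made-up toy values rather than CS_Dataset.csv, and the object name loss_boxplot is hypothetical.

```r
library(ggplot2)

# Toy data for illustration only (values are made up, not from CS_Dataset.csv).
toy <- data.frame(
  security_vulnerability_type = rep(c("Zero-day", "Weak Passwords"), each = 4),
  financial_loss_in_million   = c(10, 20, 30, 40, 15, 25, 35, 45)
)

# One box per vulnerability type: median, quartiles, and whiskers of the losses.
loss_boxplot <- ggplot(
  toy,
  aes(x = security_vulnerability_type, y = financial_loss_in_million)
) +
  geom_boxplot(fill = "steelblue", alpha = 0.6) +
  coord_flip() +
  labs(
    title = "Distribution of Financial Loss by Vulnerability Type (toy data)",
    x = "Vulnerability Type",
    y = "Loss (Million USD)"
  )

loss_boxplot
```

Replacing toy with df would draw the same plot for the full dataset.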

Start Investigating Attacks

attack_summary <- df %>%
  group_by(attack_type) %>%
  summarize(
    avg_loss = mean(financial_loss_in_million),
    avg_users = mean(number_of_affected_users),
    avg_resolution = mean(incident_resolution_time_in_hours),
    count = n()
  ) %>%
  arrange(desc(avg_loss))

attack_summary

## # A tibble: 6 × 5
##   attack_type       avg_loss avg_users avg_resolution count
##   <chr>                <dbl>     <dbl>          <dbl> <int>
## 1 DDoS                  52.0   499437.           35.7   531
## 2 Man-in-the-Middle     51.3   520064.           36.9   459
## 3 Phishing              50.5   487180.           35.9   529
## 4 SQL Injection         50.0   512470.           36.9   503
## 5 Ransomware            49.7   502825.           36.5   493
## 6 Malware               49.4   508780.           37.1   485
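
"Most effective" can be defined in more than one way. Average loss per incident is the measure used above, but the incident counts differ between attack types (459 to 531), so total loss can give a different ranking. The sketch below compares the two rankings side by side. The helper rank_attacks and the toy values are hypothetical, and the choice of measures is an assumption, not the only reasonable one.

```r
library(dplyr)

# Hypothetical helper: rank attack types under two candidate definitions of
# "effective": average loss per incident and total loss across incidents.
rank_attacks <- function(data) {
  data %>%
    group_by(attack_type) %>%
    summarise(
      avg_loss   = mean(financial_loss_in_million, na.rm = TRUE),
      total_loss = sum(financial_loss_in_million, na.rm = TRUE),
      count      = n(),
      .groups    = "drop"
    ) %>%
    mutate(
      rank_avg   = min_rank(desc(avg_loss)),
      rank_total = min_rank(desc(total_loss))
    ) %>%
    arrange(rank_total)
}

# Toy data for illustration only (values are made up, not from CS_Dataset.csv):
# few costly DDoS incidents versus many cheaper phishing incidents.
toy_attacks <- tibble(
  attack_type = c("DDoS", "DDoS", "Phishing", "Phishing", "Phishing", "Phishing"),
  financial_loss_in_million = c(60, 40, 30, 30, 30, 30)
)

toy_ranks <- rank_attacks(toy_attacks)
toy_ranks
```

In the toy data, DDoS ranks first on average loss while Phishing ranks first on total loss, which is why the definition needs to be stated before naming a "most effective" attack.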

Visual

ggplot(attack_summary, 
       aes(x = reorder(attack_type, avg_loss), y = avg_loss)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(
    title = "Average Financial Loss by Attack Type",
    x = "Attack Type",
    y = "Average Loss (Million USD)"
  )

Trends with Frequency of Attacks

# Count incidents per year and attack type. Setting .groups = "drop" returns an
# ungrouped tibble and avoids the summarise() regrouping message.
trend <- df %>%
  group_by(year, attack_type) %>%
  summarize(incidents = n(), .groups = "drop")

# linewidth replaces the size argument for lines, which was deprecated in ggplot2 3.4.0
ggplot(trend, aes(x = year, y = incidents, color = attack_type)) +
  geom_line(linewidth = 1.1) +
  labs(
    title = "Cyberattack Frequency Over Time",
    x = "Year",
    y = "Number of Incidents",
    color = "Attack Type"
  )
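
Counting rows per group only returns year and attack type combinations that occur at least once. If a combination had no incidents, its line would jump straight across that year instead of dropping to zero. The sketch below fills such gaps with tidyr::complete(). The toy incidents are made up, and whether CS_Dataset.csv has any empty combinations is not something the output above shows.

```r
library(dplyr)
library(tidyr)

# Toy data for illustration only (not from CS_Dataset.csv):
# there is no Phishing incident in 2016.
toy_incidents <- tibble(
  year        = c(2015, 2015, 2016, 2017, 2017),
  attack_type = c("DDoS", "Phishing", "DDoS", "DDoS", "Phishing")
)

# Count incidents, then add the missing year/type combinations with a count of 0.
toy_trend <- toy_incidents %>%
  count(year, attack_type, name = "incidents") %>%
  complete(year, attack_type, fill = list(incidents = 0L))

toy_trend
```

The completed table has one row for every year and attack type pair, so each line in the frequency plot covers every year.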

CSC360Project
