Businesses are facing unprecedented cybersecurity threats from AI-powered attacks, unprepared employees, and insider vulnerabilities that could devastate their bottom line.
Source: Forbes. (2025). 5 Mistakes Companies Will Make This Year With Cybersecurity

Introduction

This report explores the relationship between the industry targeted and the type of cyberattack using a global cybersecurity dataset from 2015 to 2024. The aim is to identify whether certain industries are more susceptible to specific types of attacks, which can inform future cybersecurity strategies.

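The dataset contains 3,000 observations (rows) and 10 variables (columns). Each observation represents a reported cybersecurity incident across various countries, industries, and years. The key variables for this analysis are target_industry (the industry targeted), attack_type (the type of cyberattack), and financial_loss_in_million (the financial loss, in millions of USD).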

The remaining variables include country and year, which help contextualize when and where each incident occurred.

Dataset Source: Global Cybersecurity Threats 2015–2024 (hypothetical dataset for educational use).
Access: https://www.kaggle.com/datasets/atharvasoundankar/global-cybersecurity-threats-2015-2024
Direct file name: “Global_Cybersecurity_Threats_2015-2024.csv”

Research Question

Is there a significant association between the industry targeted and the type of cyberattack?

Data Analysis

To explore the relationship between targeted industries and cyberattack types, I applied exploratory data analysis (EDA) and then conducted a Chi-Squared Test of Independence to assess whether a statistically significant association exists between the two variables. The analysis began by filtering out rows with missing values in either variable and converting both variables to factors.

For the EDA, I generated three plots: a bar plot of attack-type frequencies across industries, a stacked proportion chart showing how attack types are distributed within each industry, and a bar chart of average financial loss per industry to capture the impact of the attacks. These visualizations help surface patterns or imbalances that might suggest an association between specific sectors and attack types. The Chi-Squared Test then formally evaluates whether the observed differences are statistically significant at the 0.05 level.

Hypotheses

- Null Hypothesis (H₀): There is no association between the targeted industry and the type of cyberattack.
- Alternative Hypothesis (H₁): There is a significant association between the targeted industry and the type of cyberattack.
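
For reference, the statistic that decides between these two hypotheses can be written out explicitly. The notation below ($O_{ij}$ for observed counts, $E_{ij}$ for expected counts, $R_i$ and $C_j$ for row and column totals, $N$ for the total number of incidents) is introduced here for illustration and does not appear in the original analysis; the dimensions assume the 7 industries and 6 attack types present in the dataset.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
df = (r - 1)(c - 1) = (7 - 1)(6 - 1) = 30
```

Under H₀ the statistic approximately follows a chi-squared distribution with $df$ degrees of freedom, so a large value relative to that distribution counts as evidence against independence.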

Load Libraries and Dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) #used for cleaning and formatting column names
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(scales) #enhances visualizations with formatted scales
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(reshape2)  #used for melting matrices into data frames
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
#Set working directory
setwd("~/Desktop/DATA/Data 101/Final Project")

#load dataset and clean column names
df <- read.csv("Global_Cybersecurity_Threats_2015-2024.csv") %>%
  clean_names()

# view first rows
head(df)
##   country year       attack_type    target_industry financial_loss_in_million
## 1   China 2019          Phishing          Education                     80.53
## 2   China 2019        Ransomware             Retail                     62.19
## 3   India 2017 Man-in-the-Middle                 IT                     38.65
## 4      UK 2024        Ransomware Telecommunications                     41.44
## 5 Germany 2018 Man-in-the-Middle                 IT                     74.41
## 6 Germany 2017 Man-in-the-Middle             Retail                     98.24
##   number_of_affected_users attack_source security_vulnerability_type
## 1                   773169  Hacker Group          Unpatched Software
## 2                   295961  Hacker Group          Unpatched Software
## 3                   605895  Hacker Group              Weak Passwords
## 4                   659320  Nation-state          Social Engineering
## 5                   810682       Insider          Social Engineering
## 6                   285201       Unknown          Social Engineering
##   defense_mechanism_used incident_resolution_time_in_hours
## 1                    VPN                                63
## 2               Firewall                                71
## 3                    VPN                                20
## 4     AI-based Detection                                 7
## 5                    VPN                                68
## 6              Antivirus                                25

Data Cleaning and Preparation

#filter relevant rows with complete data
df_filtered <- df %>%
  filter(!is.na(target_industry), !is.na(attack_type)) %>%
  mutate(target_industry = as.factor(target_industry),
         attack_type = as.factor(attack_type))

#view unique industries and attack types
unique(df_filtered$target_industry)
## [1] Education          Retail             IT                 Telecommunications
## [5] Government         Banking            Healthcare        
## 7 Levels: Banking Education Government Healthcare IT ... Telecommunications
unique(df_filtered$attack_type)
## [1] Phishing          Ransomware        Man-in-the-Middle DDoS             
## [5] SQL Injection     Malware          
## Levels: DDoS Malware Man-in-the-Middle Phishing Ransomware SQL Injection
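
One caveat with this filter: with default settings, read.csv() does not turn empty character fields into NA, so !is.na() alone would let blank industry or attack-type cells through. The following is a minimal sketch of a stricter check in base R. The data frame toy and the helper is_blank are hypothetical names, and the rows are invented for illustration rather than taken from the dataset.

```r
# Toy data standing in for the real file (values invented for illustration)
toy <- data.frame(
  target_industry = c("Banking", "", "IT", NA, "  "),
  attack_type     = c("Phishing", "DDoS", "", "Malware", "Ransomware"),
  stringsAsFactors = FALSE
)

# Treat NA, empty strings, and whitespace-only strings as missing
is_blank <- function(x) is.na(x) | trimws(x) == ""

keep <- !is_blank(toy$target_industry) & !is_blank(toy$attack_type)
toy_clean <- toy[keep, ]

# Convert to factors only after blanks are gone, so "" never becomes a level
toy_clean$target_industry <- factor(toy_clean$target_industry)
toy_clean$attack_type     <- factor(toy_clean$attack_type)

nrow(toy_clean)
```

In the unique() output above, no blank level appears among the 7 industries and 6 attack types, so the simpler filter seems to have been sufficient for this file.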

Exploratory Data Analysis (EDA)

1. Count of Cyberattack Types by Industry

ggplot(df_filtered, aes(x = target_industry, fill = attack_type)) +
  geom_bar(position = "dodge") +
  labs(title = "Types of Cyberattacks by Industry",
       x = "Target Industry", y = "Count", fill = "Attack Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

2. Proportion of Attack Types (Stacked)

ggplot(df_filtered, aes(x = target_industry, fill = attack_type)) +
  geom_bar(position = "fill") +
  labs(title = "Proportional Breakdown of Attack Types by Industry",
       x = "Target Industry", y = "Proportion", fill = "Attack Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
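
The stacked chart can be backed by a table of row proportions. The sketch below hard-codes the observed counts from this report's contingency table (the names obs and row_props are assumptions for illustration) and uses base R's prop.table() to compute the share of each attack type within each industry.

```r
# Observed counts (industry x attack type), copied from the report's contingency table
obs <- matrix(c(
  71, 61, 77, 96, 69, 71,
  73, 70, 65, 73, 71, 67,
  71, 64, 53, 68, 72, 75,
  78, 81, 58, 63, 77, 72,
  91, 67, 80, 89, 74, 77,
  62, 68, 70, 89, 71, 63,
  85, 74, 56, 51, 59, 78
), nrow = 7, byrow = TRUE, dimnames = list(
  c("Banking", "Education", "Government", "Healthcare", "IT", "Retail", "Telecommunications"),
  c("DDoS", "Malware", "Man-in-the-Middle", "Phishing", "Ransomware", "SQL Injection")
))

# margin = 1 divides each cell by its row total, so every row sums to 1
row_props <- prop.table(obs, margin = 1)
round(row_props, 3)
```

Read this way, the share of Phishing ranges from about 13% of incidents in Telecommunications (51 of 403) to about 22% in Banking (96 of 445), which is the kind of difference the Chi-Squared Test evaluates formally.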

3. Average Financial Loss by Industry

df_filtered %>%
  filter(!is.na(financial_loss_in_million)) %>%
  group_by(target_industry) %>%
  summarize(mean_loss = mean(financial_loss_in_million, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(target_industry, -mean_loss), y = mean_loss)) +
  geom_col(fill = "steelblue") +
  labs(title = "Average Financial Loss by Industry",
       x = "Target Industry", y = "Mean Financial Loss (Million USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Chi-Squared Test of Independence

Contingency Table and Test

#contingency table
contingency_table <- table(df_filtered$target_industry, df_filtered$attack_type)

#test
chi_result <- chisq.test(contingency_table)

chi_result
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 38.455, df = 30, p-value = 0.1384
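
As a cross-check on the output above, the statistic can be rebuilt from its definition. This sketch hard-codes the observed counts from this report's contingency table and uses only base R; the names obs, expected, stat, df, and p_value are assumptions for illustration.

```r
# Observed counts (industry x attack type), copied from the report's contingency table
obs <- matrix(c(
  71, 61, 77, 96, 69, 71,
  73, 70, 65, 73, 71, 67,
  71, 64, 53, 68, 72, 75,
  78, 81, 58, 63, 77, 72,
  91, 67, 80, 89, 74, 77,
  62, 68, 70, 89, 71, 63,
  85, 74, 56, 51, 59, 78
), nrow = 7, byrow = TRUE)

n        <- sum(obs)
expected <- outer(rowSums(obs), colSums(obs)) / n    # row total * column total / N
stat     <- sum((obs - expected)^2 / expected)        # Pearson chi-squared statistic
df       <- (nrow(obs) - 1) * (ncol(obs) - 1)         # (7 - 1) * (6 - 1)
p_value  <- pchisq(stat, df = df, lower.tail = FALSE) # upper-tail probability

round(c(statistic = stat, df = df, p_value = p_value), 4)
```

The result should agree with the chisq.test() output above, up to rounding.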

Expected Counts

chi_result$expected
##                     
##                        DDoS  Malware Man-in-the-Middle Phishing Ransomware
##   Banking            78.765 71.94167            68.085 78.46833   73.12833
##   Education          74.163 67.73833            64.107 73.88367   68.85567
##   Government         71.331 65.15167            61.659 71.06233   66.22633
##   Healthcare         75.933 69.35500            65.637 75.64700   70.49900
##   IT                 84.606 77.27667            73.134 84.28733   78.55133
##   Retail             74.871 68.38500            64.719 74.58900   69.51300
##   Telecommunications 71.331 65.15167            61.659 71.06233   66.22633
##                     
##                      SQL Injection
##   Banking                 74.61167
##   Education               70.25233
##   Government              67.56967
##   Healthcare              71.92900
##   IT                      80.14467
##   Retail                  70.92300
##   Telecommunications      67.56967
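
Each expected count is the row total times the column total divided by the number of incidents. The sketch below rebuilds the table from the marginal totals of the observed counts (the totals were summed from this report's contingency table; the names row_totals, col_totals, and expected are assumptions for illustration). It also checks the common rule of thumb that every expected count should be at least 5 for the chi-squared approximation to be reasonable.

```r
# Marginal totals summed from the observed contingency table in this report
row_totals <- c(Banking = 445, Education = 419, Government = 403, Healthcare = 429,
                IT = 478, Retail = 423, Telecommunications = 403)
col_totals <- c(DDoS = 531, Malware = 485, "Man-in-the-Middle" = 459,
                Phishing = 529, Ransomware = 493, "SQL Injection" = 503)
n <- sum(row_totals)  # total number of incidents

# Expected count under independence: row total * column total / N
expected <- outer(row_totals, col_totals) / n

expected["Banking", "DDoS"]  # 445 * 531 / 3000
min(expected)                # smallest expected count in the table
all(expected >= 5)           # rule-of-thumb check for the approximation
```

The smallest expected count is well above 5, so the approximation used by the test appears appropriate for this table.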

Observed vs Expected Counts

#expected counts taken directly from the test result (avoids retyping the values by hand)
expected_counts <- chi_result$expected

#convert to data frame for ggplot
expected_df <- melt(expected_counts)
colnames(expected_df) <- c("Industry", "AttackType", "ExpectedCount")

#plot heatmap
ggplot(expected_df, aes(x = AttackType, y = Industry, fill = ExpectedCount)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(ExpectedCount, 1)), size = 3) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_minimal() +
  labs(title = "Expected Cyberattack Counts by Industry and Type",
       x = "Attack Type", 
       y = "Target Industry") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#contingency table for observed counts
observed_counts <- table(df_filtered$target_industry, df_filtered$attack_type)

#observed counts
observed_counts
##                     
##                      DDoS Malware Man-in-the-Middle Phishing Ransomware
##   Banking              71      61                77       96         69
##   Education            73      70                65       73         71
##   Government           71      64                53       68         72
##   Healthcare           78      81                58       63         77
##   IT                   91      67                80       89         74
##   Retail               62      68                70       89         71
##   Telecommunications   85      74                56       51         59
##                     
##                      SQL Injection
##   Banking                       71
##   Education                     67
##   Government                    75
##   Healthcare                    72
##   IT                            77
##   Retail                        63
##   Telecommunications            78
#convert the observed table to a data frame for ggplot (no need to retype the counts)
observed_df <- melt(observed_counts)
colnames(observed_df) <- c("Industry", "AttackType", "ObservedCount")

#heatmap for observed counts
ggplot(observed_df, aes(x = AttackType, y = Industry, fill = ObservedCount)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(ObservedCount, 1)), size = 3) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_minimal() +
  labs(title = "Observed Cyberattack Counts by Industry and Type",
       x = "Attack Type", y = "Target Industry") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
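
Comparing the two heatmaps by eye is difficult, so one option is to compute Pearson residuals, (observed - expected) / sqrt(expected), which put every cell on a common scale. The sketch below hard-codes the observed counts from this report's contingency table; the names obs, pearson_res, and largest are assumptions for illustration. The same values are also available as the residuals element of the object returned by chisq.test().

```r
# Observed counts (industry x attack type), copied from the report's contingency table
obs <- matrix(c(
  71, 61, 77, 96, 69, 71,
  73, 70, 65, 73, 71, 67,
  71, 64, 53, 68, 72, 75,
  78, 81, 58, 63, 77, 72,
  91, 67, 80, 89, 74, 77,
  62, 68, 70, 89, 71, 63,
  85, 74, 56, 51, 59, 78
), nrow = 7, byrow = TRUE, dimnames = list(
  c("Banking", "Education", "Government", "Healthcare", "IT", "Retail", "Telecommunications"),
  c("DDoS", "Malware", "Man-in-the-Middle", "Phishing", "Ransomware", "SQL Injection")
))

expected    <- outer(rowSums(obs), colSums(obs)) / sum(obs)
pearson_res <- (obs - expected) / sqrt(expected)  # signed distance from independence
round(pearson_res, 2)

# Locate the cell that departs most from its expected count
idx <- which(abs(pearson_res) == max(abs(pearson_res)), arr.ind = TRUE)
largest <- c(industry = rownames(obs)[idx[1, "row"]],
             attack   = colnames(obs)[idx[1, "col"]])
largest
```

The largest departure is Phishing in Telecommunications (51 observed against about 71 expected, a residual of roughly -2.4). Because the overall test is not significant, individual residuals like this are best read as descriptive rather than as evidence of a real effect.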

Interpretation of Results

  • Chi-squared statistic = 38.46
  • Degrees of freedom = 30
  • p-value = 0.1384

Since the p-value (0.1384) is greater than the conventional significance level of 0.05, we fail to reject the null hypothesis. The data do not provide statistically significant evidence of an association between the targeted industry and the type of cyberattack.
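
A p-value alone does not describe how strong an association is. One common effect-size measure for contingency tables is Cramér's V, which is not part of the original analysis and is added here as a hedged supplement. The sketch plugs in the statistic, sample size, and table dimensions reported above; the variable names are assumptions for illustration.

```r
# Values taken from the test output and dataset description in this report
chi_sq <- 38.455  # Pearson chi-squared statistic
n      <- 3000    # number of incidents
n_rows <- 7       # industries
n_cols <- 6       # attack types

# Cramér's V = sqrt(chi^2 / (N * min(r - 1, c - 1))); ranges from 0 to 1
cramers_v <- sqrt(chi_sq / (n * min(n_rows - 1, n_cols - 1)))
round(cramers_v, 3)
```

A value this close to 0 (about 0.05) indicates that any association present in the sample is very weak, which is consistent with the non-significant test result.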

Conclusion

While the exploratory analysis suggested that some industries, such as IT and Healthcare, experience higher financial losses or more frequent cyberattacks, the Chi-Squared Test found no statistically significant association between the targeted industry and the type of attack (p = 0.1384). In other words, the mix of attack types does not differ detectably from one industry to another in this dataset. These findings are consistent with the view that all industries should maintain broad and proactive cybersecurity measures, since no single sector appears to be disproportionately exposed to a particular attack type. Future research could build on this analysis by using time series modeling to examine regional trends, applying predictive models to anticipate high-risk attack scenarios, and investigating financial impacts in more depth with more granular cost data.

References