Source: Forbes. (2025). 5 Mistakes Companies Will Make This Year With Cybersecurity
This report explores the relationship between the industry targeted and the type of cyberattack using a global cybersecurity dataset from 2015 to 2024. The aim is to identify whether certain industries are more susceptible to specific types of attacks, which can inform future cybersecurity strategies.
The dataset contains 3,000 observations (rows) and 10 variables (columns). Each observation represents a reported cybersecurity incident across various countries, industries, and years. The key variables include:
The remaining variables include country and year, which help contextualize when and where each incident occurred.
Dataset Source: Global Cybersecurity Threats 2015–2024. (Hypothetical dataset for educational use). Access: https://www.kaggle.com/datasets/atharvasoundankar/global-cybersecurity-threats-2015-2024 Direct file name: “Global_Cybersecurity_Threats_2015-2024.csv”
Is there a significant association between the industry targeted and the type of cyberattack?
To explore the relationship between targeted industries and attack types, I applied exploratory data analysis (EDA) and conducted a Chi-Squared Test of Independence to assess whether a statistically significant association exists between the two variables. The analysis began by removing rows with missing values in either variable and converting both to categorical (factor) format. To address the research question, I generated three plots: a grouped bar plot of attack-type counts by industry, a stacked proportion chart showing how attack types are distributed within each industry, and a bar chart of average financial loss per industry to capture the impact of the attacks. These visualizations help reveal patterns or imbalances that might suggest associations between specific sectors and attack types. The Chi-Squared Test then formally evaluates whether the observed differences are statistically significant.
Hypotheses
- Null Hypothesis (H₀): There is no association between the targeted industry and the type of cyberattack.
- Alternative Hypothesis (H₁): There is a significant association between the targeted industry and the type of cyberattack.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) #used for cleaning and formatting column names
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales) #enhances visualizations with formatted scales
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(reshape2) #used for melting matrices into data frames
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
#Set working directory
setwd("~/Desktop/DATA/Data 101/Final Project")
#load dataset and clean column names
df <- read.csv("Global_Cybersecurity_Threats_2015-2024.csv") %>%
  clean_names()
# view first rows
head(df)
## country year attack_type target_industry financial_loss_in_million
## 1 China 2019 Phishing Education 80.53
## 2 China 2019 Ransomware Retail 62.19
## 3 India 2017 Man-in-the-Middle IT 38.65
## 4 UK 2024 Ransomware Telecommunications 41.44
## 5 Germany 2018 Man-in-the-Middle IT 74.41
## 6 Germany 2017 Man-in-the-Middle Retail 98.24
## number_of_affected_users attack_source security_vulnerability_type
## 1 773169 Hacker Group Unpatched Software
## 2 295961 Hacker Group Unpatched Software
## 3 605895 Hacker Group Weak Passwords
## 4 659320 Nation-state Social Engineering
## 5 810682 Insider Social Engineering
## 6 285201 Unknown Social Engineering
## defense_mechanism_used incident_resolution_time_in_hours
## 1 VPN 63
## 2 Firewall 71
## 3 VPN 20
## 4 AI-based Detection 7
## 5 VPN 68
## 6 Antivirus 25
# keep rows where both variables of interest are present, then convert them to factors
df_filtered <- df %>%
  filter(!is.na(target_industry), !is.na(attack_type)) %>%
  mutate(target_industry = as.factor(target_industry),
         attack_type = as.factor(attack_type))
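One caveat about the filter above: by default `read.csv()` reads blank cells in text columns as empty strings rather than `NA`, so `!is.na()` alone would not drop them. The following is a minimal sketch of one way to treat blanks as missing before filtering. The data frame `toy` and its three rows are made up for illustration and are not part of the dataset; the sketch uses base R only so that it runs without extra packages.

```r
# Hypothetical toy data: the empty string mimics a blank CSV cell
toy <- data.frame(
  target_industry = c("IT", "", "Retail"),
  attack_type     = c("Phishing", "DDoS", NA),
  stringsAsFactors = FALSE
)

# Recode blank strings as NA in the two columns of interest
for (col in c("target_industry", "attack_type")) {
  blank <- !is.na(toy[[col]]) & toy[[col]] == ""
  toy[[col]][blank] <- NA
}

# Keep only rows where both variables are present, then convert to factors
keep <- complete.cases(toy[, c("target_industry", "attack_type")])
toy_clean <- toy[keep, ]
toy_clean$target_industry <- factor(toy_clean$target_industry)
toy_clean$attack_type     <- factor(toy_clean$attack_type)

nrow(toy_clean)
```

Whether this matters for the actual file depends on whether it contains blank cells, which the output shown here does not reveal.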
#view unique industries and attack types
unique(df_filtered$target_industry)
## [1] Education Retail IT Telecommunications
## [5] Government Banking Healthcare
## 7 Levels: Banking Education Government Healthcare IT ... Telecommunications
unique(df_filtered$attack_type)
## [1] Phishing Ransomware Man-in-the-Middle DDoS
## [5] SQL Injection Malware
## Levels: DDoS Malware Man-in-the-Middle Phishing Ransomware SQL Injection
ggplot(df_filtered, aes(x = target_industry, fill = attack_type)) +
  geom_bar(position = "dodge") +
  labs(title = "Types of Cyberattacks by Industry",
       x = "Target Industry", y = "Count", fill = "Attack Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(df_filtered, aes(x = target_industry, fill = attack_type)) +
  geom_bar(position = "fill") +
  labs(title = "Proportional Breakdown of Attack Types by Industry",
       x = "Target Industry", y = "Proportion", fill = "Attack Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
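The proportions drawn in the stacked chart can also be read as numbers. The sketch below computes within-industry shares with `prop.table()` for two rows only, Banking and Telecommunications, using the observed counts reported in the contingency table of this analysis. The object names `observed_subset` and `row_props` are introduced here for illustration.

```r
# Observed counts for two industries, copied from the report's contingency table
observed_subset <- matrix(
  c(71, 61, 77, 96, 69, 71,
    85, 74, 56, 51, 59, 78),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Banking", "Telecommunications"),
    c("DDoS", "Malware", "Man-in-the-Middle",
      "Phishing", "Ransomware", "SQL Injection")
  )
)

# margin = 1 divides each cell by its row total, so every row sums to 1
row_props <- prop.table(observed_subset, margin = 1)
round(row_props, 3)
```

Each row of `row_props` corresponds to one stacked bar in the chart, which makes differences such as the Phishing share in Banking versus Telecommunications easier to compare directly.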
df_filtered %>%
  filter(!is.na(financial_loss_in_million)) %>%
  group_by(target_industry) %>%
  summarize(mean_loss = mean(financial_loss_in_million, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(target_industry, -mean_loss), y = mean_loss)) +
  geom_col(fill = "steelblue") +
  labs(title = "Average Financial Loss by Industry",
       x = "Target Industry", y = "Mean Financial Loss (Million USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
#contingency table
contingency_table <- table(df_filtered$target_industry, df_filtered$attack_type)
#test
chi_result <- chisq.test(contingency_table)
chi_result
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 38.455, df = 30, p-value = 0.1384
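To make the test output less of a black box, the sketch below recomputes the statistic by hand from the observed counts reported in this analysis. It builds the expected counts as (row total × column total) / n, sums (O − E)² / E over all cells, and compares the result with the chi-squared distribution on (7 − 1)(6 − 1) = 30 degrees of freedom. The variable names and the 0.05 significance level used for the critical value are assumptions of this sketch.

```r
# Observed counts copied from the report's contingency table
observed <- matrix(
  c(71, 61, 77, 96, 69, 71,
    73, 70, 65, 73, 71, 67,
    71, 64, 53, 68, 72, 75,
    78, 81, 58, 63, 77, 72,
    91, 67, 80, 89, 74, 77,
    62, 68, 70, 89, 71, 63,
    85, 74, 56, 51, 59, 78),
  nrow = 7, byrow = TRUE,
  dimnames = list(
    c("Banking", "Education", "Government", "Healthcare",
      "IT", "Retail", "Telecommunications"),
    c("DDoS", "Malware", "Man-in-the-Middle",
      "Phishing", "Ransomware", "SQL Injection")
  )
)

n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic and its degrees of freedom
x2  <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value and the critical value at the assumed 0.05 level
p_value  <- pchisq(x2, df = dof, lower.tail = FALSE)
critical <- qchisq(0.95, df = dof)

round(c(statistic = x2, df = dof, p_value = p_value, critical_value = critical), 4)
```

The hand-computed statistic should agree with the `chisq.test()` output above, and it falls below the critical value, which is the same decision as comparing the p-value with 0.05.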
chi_result$expected
##
## DDoS Malware Man-in-the-Middle Phishing Ransomware
## Banking 78.765 71.94167 68.085 78.46833 73.12833
## Education 74.163 67.73833 64.107 73.88367 68.85567
## Government 71.331 65.15167 61.659 71.06233 66.22633
## Healthcare 75.933 69.35500 65.637 75.64700 70.49900
## IT 84.606 77.27667 73.134 84.28733 78.55133
## Retail 74.871 68.38500 64.719 74.58900 69.51300
## Telecommunications 71.331 65.15167 61.659 71.06233 66.22633
##
## SQL Injection
## Banking 74.61167
## Education 70.25233
## Government 67.56967
## Healthcare 71.92900
## IT 80.14467
## Retail 70.92300
## Telecommunications 67.56967
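The expected counts also matter for the validity of the test: a common rule of thumb is that the chi-squared approximation is reasonable when every expected count is at least 5. The sketch below checks this from the table margins alone. The row and column totals were summed from the observed counts reported in this analysis, and the threshold of 5 is the conventional rule of thumb rather than something stated in the original report.

```r
# Industry (row) and attack-type (column) totals, summed from the observed table
row_totals <- c(Banking = 445, Education = 419, Government = 403,
                Healthcare = 429, IT = 478, Retail = 423,
                Telecommunications = 403)
col_totals <- c(DDoS = 531, Malware = 485, "Man-in-the-Middle" = 459,
                Phishing = 529, Ransomware = 493, "SQL Injection" = 503)

n <- sum(row_totals)

# Expected count for each cell under independence
expected <- outer(row_totals, col_totals) / n

# Smallest expected count and the rule-of-thumb check
min(expected)
all(expected >= 5)
```

Because the smallest expected count is far above 5, the standard chi-squared test is a reasonable choice here and no exact or simulation-based alternative appears necessary.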
#matrix of expected counts
expected_counts <- matrix(c(
78.765, 71.94167, 68.085, 78.46833, 73.12833, 74.61167,
74.163, 67.73833, 64.107, 73.88367, 68.85567, 70.25233,
71.331, 65.15167, 61.659, 71.06233, 66.22633, 67.56967,
75.933, 69.35500, 65.637, 75.64700, 70.49900, 71.92900,
84.606, 77.27667, 73.134, 84.28733, 78.55133, 80.14467,
74.871, 68.38500, 64.719, 74.58900, 69.51300, 70.92300,
71.331, 65.15167, 61.659, 71.06233, 66.22633, 67.56967
), nrow = 7, byrow = TRUE)
#row and column names
rownames(expected_counts) <- c("Banking", "Education", "Government", "Healthcare", "IT", "Retail", "Telecommunications")
colnames(expected_counts) <- c("DDoS", "Malware", "Man-in-the-Middle", "Phishing", "Ransomware", "SQL Injection")
#convert to data frame for ggplot
expected_df <- melt(expected_counts)
colnames(expected_df) <- c("Industry", "AttackType", "ExpectedCount")
#plot heatmap
ggplot(expected_df, aes(x = AttackType, y = Industry, fill = ExpectedCount)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(ExpectedCount, 1)), size = 3) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_minimal() +
  labs(title = "Expected Cyberattack Counts by Industry and Type",
       x = "Attack Type",
       y = "Target Industry") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
#contingency table for observed counts
observed_counts <- table(df_filtered$target_industry, df_filtered$attack_type)
#observed counts
observed_counts
##
## DDoS Malware Man-in-the-Middle Phishing Ransomware
## Banking 71 61 77 96 69
## Education 73 70 65 73 71
## Government 71 64 53 68 72
## Healthcare 78 81 58 63 77
## IT 91 67 80 89 74
## Retail 62 68 70 89 71
## Telecommunications 85 74 56 51 59
##
## SQL Injection
## Banking 71
## Education 67
## Government 75
## Healthcare 72
## IT 77
## Retail 63
## Telecommunications 78
#matrix of observed counts
observed_counts <- matrix(c(
71, 61, 77, 96, 69, 71,
73, 70, 65, 73, 71, 67,
71, 64, 53, 68, 72, 75,
78, 81, 58, 63, 77, 72,
91, 67, 80, 89, 74, 77,
62, 68, 70, 89, 71, 63,
85, 74, 56, 51, 59, 78
), nrow = 7, byrow = TRUE)
#row and column names
rownames(observed_counts) <- c("Banking", "Education", "Government", "Healthcare", "IT", "Retail", "Telecommunications")
colnames(observed_counts) <- c("DDoS", "Malware", "Man-in-the-Middle", "Phishing", "Ransomware", "SQL Injection")
#convert to data frame for ggplot
observed_df <- melt(observed_counts)
colnames(observed_df) <- c("Industry", "AttackType", "ObservedCount")
#heatmap for observed counts
ggplot(observed_df, aes(x = AttackType, y = Industry, fill = ObservedCount)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(ObservedCount, 1)), size = 3) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_minimal() +
  labs(title = "Observed Cyberattack Counts by Industry and Type",
       x = "Attack Type", y = "Target Industry") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Because the p-value (0.1384) is greater than 0.05, we fail to reject the null hypothesis (X-squared = 38.455, df = 30). In other words, the data do not provide statistically significant evidence of an association between the targeted industry and the type of cyberattack.
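A p-value says nothing about how strong an association would be, so an effect size can help put the result in context. The sketch below computes Cramér's V from the reported test statistic. Cramér's V is not part of the original analysis; it is offered here as one common effect-size measure for contingency tables, and the variable names are introduced for illustration.

```r
# Values taken from the reported test output and table dimensions
x2     <- 38.455   # Pearson chi-squared statistic
n      <- 3000     # number of incidents
n_rows <- 7        # industries
n_cols <- 6        # attack types

# Cramér's V = sqrt(X^2 / (n * min(rows - 1, cols - 1)))
cramers_v <- sqrt(x2 / (n * min(n_rows - 1, n_cols - 1)))
round(cramers_v, 3)
```

Cramér's V ranges from 0 (no association) to 1 (perfect association), so a value close to 0 is consistent with the conclusion that industry and attack type are at most weakly related in this dataset.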
While the exploratory analysis suggested that some industries, such as IT and Healthcare, experience higher financial losses or more frequent cyberattacks, the statistical test indicates that the mix of attack types does not differ significantly across industries. These findings suggest that all industries should maintain broad and proactive cybersecurity measures, since this dataset gives no evidence that any single sector is disproportionately targeted by a particular type of attack. Future research could build on this analysis by incorporating time series modeling to examine regional trends over time, using predictive models to anticipate high-risk attack scenarios, and conducting deeper investigations into financial impacts through more granular cost data.