# Loading Libraries
library(tidyverse)
library(ggfortify)
library(plotly)
library(viridis)
DATA110_Project1_Ransomware
Introduction
Ransomware has emerged as one of the most financially devastating forms of cybercrime in the modern digital economy. Unlike traditional data breaches that silently steal records, ransomware holds entire systems hostage: it encrypts files and data, then demands payment, often in cryptocurrency. For governments, hospitals, technology companies, and financial institutions, a ransomware attack can mean millions in ransom costs, regulatory consequences, and prolonged operational paralysis.
This project analyzes one of the world's biggest ransomware datasets, which is curated and updated by journalists at Information is Beautiful. The dataset tracks over 630 notable ransomware attacks from 2013 to 2023, documenting the targeted enterprises, the ransomware responsible, the sector affected, the ransom cost, and whether the ransom was paid.
Variable definition
sector: Industry of the target: government, tech, healthcare, finance, academic, etc. (categorical)
organisation_size: Size of the organisation on a scale of 1, 5, 10, 25, 100, 300 (ordinal)
revenue: Estimated revenue of the organisation in USD millions (quantitative)
ransom_cost: Ransom amount in USD millions (quantitative)
ransom_paid: Payment outcome: “ransom paid”, “refused”, or “unknown” (categorical)
year: Year the attack occurred, 2013–2023 (quantitative)
Setting the working directory to load the dataset
getwd()
[1] "/Users/darrenabou/Desktop/Spring 26/Data110/Project 1"
# skip = 1 drops the first line of the file, which sits above the real column headers
Ransomware_Attack <- read_csv("Ransomware Attacks - Ransomware Attacks.csv", skip = 1)
Rows: 636 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (23): Target, AKA, description, sector, revenue, cost, ransom cost, data...
dbl (6): YEAR code, YEAR, #ID, month as code, date code, date code 2
num (1): organisation size
lgl (2): interesting story (edited), Source Name
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Ransomware_Attack)
# A tibble: 6 × 32
Target AKA description sector `organisation size` revenue cost
<chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 "display label" <NA> <NA> <NA> 151025100300 $USD m… disp…
2 "\"Cryptolocker\"" 250,0… <NA> misc 100 1 27
3 "\"Cryptowall\"" Multi… <NA> misc 25 1 18
4 "Apple devices" <NA> <NA> misc 1 1 $100…
5 "New Hampshire PD" <NA> <NA> gover… 1 1 $200…
6 "\"TeslaCrypt\"" Multi… <NA> misc 5 1 76,5…
# ℹ 25 more variables: `ransom cost` <chr>, `data note` <chr>,
# `ransom paid` <chr>, `YEAR code` <dbl>, YEAR <dbl>, month <chr>,
# location <chr>, `interesting story (edited)` <lgl>,
# `interesting story (long)` <chr>, `interesting story?` <chr>,
# Ransomware <chr>, `stock symbol` <chr>, `revenue as of` <chr>,
# `no of employees` <chr>, `Data Note` <chr>, `Source Name` <lgl>, URL <chr>,
# `URL 2` <chr>, `URL 3` <chr>, `URL 4` <chr>, `URL 5` <chr>, `#ID` <dbl>, …
Cleaning Datasets
# Renaming all the Variables
names(Ransomware_Attack) <- tolower(names(Ransomware_Attack))
names(Ransomware_Attack) <- gsub(" ","_",names(Ransomware_Attack))
head(Ransomware_Attack)
# A tibble: 6 × 32
target aka description sector organisation_size revenue cost ransom_cost
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 "display… <NA> <NA> <NA> 151025100300 $USD m… disp… <NA>
2 "\"Crypt… 250,… <NA> misc 100 1 27 27
3 "\"Crypt… Mult… <NA> misc 25 1 18 18
4 "Apple d… <NA> <NA> misc 1 1 $100… 5
5 "New Ham… <NA> <NA> gover… 1 1 $200… 0.003
6 "\"Tesla… Mult… <NA> misc 5 1 76,5… 0.07652
# ℹ 24 more variables: data_note <chr>, ransom_paid <chr>, year_code <dbl>,
# year <dbl>, month <chr>, location <chr>,
# `interesting_story_(edited)` <lgl>, `interesting_story_(long)` <chr>,
# `interesting_story?` <chr>, ransomware <chr>, stock_symbol <chr>,
# revenue_as_of <chr>, no_of_employees <chr>, data_note <chr>,
# source_name <lgl>, url <chr>, url_2 <chr>, url_3 <chr>, url_4 <chr>,
# url_5 <chr>, `#id` <dbl>, month_as_code <dbl>, date_code <dbl>, …
# Delete the row directly below the column names: it holds display labels, not attack data
Ransomware_Attack <- Ransomware_Attack[-1, ]
head(Ransomware_Attack)
# A tibble: 6 × 32
target aka description sector organisation_size revenue cost ransom_cost
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 "\"Crypt… 250,… <NA> misc 100 1 27 27
2 "\"Crypt… Mult… <NA> misc 25 1 18 18
3 "Apple d… <NA> <NA> misc 1 1 $100… 5
4 "New Ham… <NA> <NA> gover… 1 1 $200… 0.003
5 "\"Tesla… Mult… <NA> misc 5 1 76,5… 0.07652
6 "\"Tesla… Mult… <NA> misc 5 1 $100… 5
# ℹ 24 more variables: data_note <chr>, ransom_paid <chr>, year_code <dbl>,
# year <dbl>, month <chr>, location <chr>,
# `interesting_story_(edited)` <lgl>, `interesting_story_(long)` <chr>,
# `interesting_story?` <chr>, ransomware <chr>, stock_symbol <chr>,
# revenue_as_of <chr>, no_of_employees <chr>, data_note <chr>,
# source_name <lgl>, url <chr>, url_2 <chr>, url_3 <chr>, url_4 <chr>,
# url_5 <chr>, `#id` <dbl>, month_as_code <dbl>, date_code <dbl>, …
names(Ransomware_Attack)
[1] "target" "aka"
[3] "description" "sector"
[5] "organisation_size" "revenue"
[7] "cost" "ransom_cost"
[9] "data_note" "ransom_paid"
[11] "year_code" "year"
[13] "month" "location"
[15] "interesting_story_(edited)" "interesting_story_(long)"
[17] "interesting_story?" "ransomware"
[19] "stock_symbol" "revenue_as_of"
[21] "no_of_employees" "data_note"
[23] "source_name" "url"
[25] "url_2" "url_3"
[27] "url_4" "url_5"
[29] "#id" "month_as_code"
[31] "date_code" "date_code_2"
# This code pulls out all the duplicated column names
names(Ransomware_Attack)[duplicated(names(Ransomware_Attack))]
[1] "data_note"
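To make the duplicate check and the renaming step concrete, here is a minimal sketch on a few of the column names listed above. The vector `example_names` is a hypothetical stand-in for `names(Ransomware_Attack)`, not the full set of columns.

```r
# Hypothetical subset of the column names shown above
example_names <- c("data_note", "ransom_paid", "data_note", "#id",
                   "interesting_story_(edited)")

# duplicated() flags the second and later occurrences of a value
dup_names <- example_names[duplicated(example_names)]

# make.names() replaces characters that are invalid in R names with "."
# (prepending "X" where needed) and, with unique = TRUE,
# appends .1, .2, ... to repeated names
fixed_names <- make.names(example_names, unique = TRUE)

print(dup_names)
print(fixed_names)
```

This is why the cleaned data later shows names such as data_note.1, X.id, and interesting_story_.edited.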
# Make all column names unique: make.names() adds .1, .2, etc. to any duplicates
# and replaces characters that are not valid in R names with "."
Ransomware_Attack <- Ransomware_Attack |>
  setNames(make.names(names(Ransomware_Attack), unique = TRUE))
# Convert ransom_cost and revenue to numeric
# The raw file stores these as character because of mixed text entries;
# as.numeric() converts valid numbers and coerces non-numeric strings to NA
Ransomware_Attack <- Ransomware_Attack |>
  mutate(ransom_cost = as.numeric(ransom_cost),
         revenue = as.numeric(revenue))
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `ransom_cost = as.numeric(ransom_cost)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
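The coercion warning above does not say which entries became NA. A minimal sketch of how one might inspect them, using a hypothetical character vector `raw_cost` (the strings are illustrative, not taken from the dataset):

```r
# Hypothetical raw values: two valid numbers and one text entry
raw_cost <- c("27", "0.003", "undisclosed")

# as.numeric() returns NA for strings it cannot parse;
# suppressWarnings() silences the "NAs introduced by coercion" message
parsed_cost <- suppressWarnings(as.numeric(raw_cost))

# Entries that were present as text but failed to convert
failed <- raw_cost[is.na(parsed_cost) & !is.na(raw_cost)]
print(failed)
```

Running the same check on the real ransom_cost and revenue columns before the mutate() step would show which text entries are being dropped.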
# Clean the sector column
# str_trim() removes leading and trailing whitespace ("tech ") and newline characters
# str_to_lower() standardizes all values to lowercase
Ransomware_Attack <- Ransomware_Attack |>
  mutate(sector = str_trim(str_to_lower(sector)))
# Convert categorical variables to factors, which are required for correct handling in lm() models and ggplot2
Ransomware_Attack <- Ransomware_Attack |>
  mutate(
    sector = factor(sector),
    ransom_paid = factor(ransom_paid,
                         levels = c("ransom paid", "refused", "unknown")),
    organisation_size = factor(organisation_size,
                               levels = c("1", "5", "10", "25", "100", "300"),
                               ordered = TRUE), # ordered = TRUE keeps the small → large hierarchy
    year = as.integer(year),
    location = factor(location),
    ransomware = factor(ransomware)
  )
# Apply inclusion / exclusion criteria
# Keep only rows with a valid numeric ransom_cost greater than zero
# Remove rows missing year, sector, organisation_size, ransom_paid, or revenue: all are needed for regression and visualization
Ransomware_Attack <- Ransomware_Attack |>
  filter(!is.na(ransom_cost) & ransom_cost > 0) |>
  filter(!is.na(year)) |>
  filter(!is.na(sector)) |>
  filter(!is.na(organisation_size)) |>
  filter(!is.na(ransom_paid)) |>
  filter(!is.na(revenue))
# Log-transform ransom_cost
# ransom_cost ranges from $0.003M to $670M, which is a severe right skew
# log10 compresses this into a near-normal range for regression
Ransomware_Attack <- Ransomware_Attack |>
  mutate(log_ransom = log10(ransom_cost))
# Count occurrences of each sector
Ransomware_Attack |>
  count(sector, sort = TRUE) # sort = TRUE puts the most frequent at the top
# A tibble: 12 × 2
sector n
<fct> <int>
1 government 33
2 healthcare 16
3 academic 13
4 misc 10
5 tech 7
6 finance 4
7 media & sport 4
8 retail 3
9 energy 2
10 food & beverage 2
11 industrial 2
12 legal 2
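The sector counts above depend on the earlier trimming and lowercasing step: without it, variants of the same label would be counted as separate sectors. A small sketch of the effect, using hypothetical raw labels (not values copied from the file):

```r
library(stringr)

# Hypothetical raw sector labels with inconsistent case and whitespace
raw_sector <- c("Tech ", "tech", "Government\n", "government")

# Lowercase first, then trim leading/trailing whitespace (including newlines)
clean_sector <- str_trim(str_to_lower(raw_sector))

# Before cleaning there are four distinct labels; after cleaning, two
print(table(raw_sector))
print(table(clean_sector))
```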
# Create a finance flag variable
# Binary variable to highlight finance-sector attacks in visualizations
Ransomware_Attack <- Ransomware_Attack |>
  mutate(finance_sector = ifelse(sector == "finance",
                                 "Finance", "Other Sector"))
# Final check on the cleaned dataset
glimpse(Ransomware_Attack)
Rows: 98
Columns: 34
$ target <chr> "\"Cryptolocker\"", "\"Cryptowall\"", "Appl…
$ aka <chr> "250,000 systems", "Multiple systems", NA, …
$ description <chr> NA, NA, NA, NA, NA, NA, "global streaming m…
$ sector <fct> misc, misc, misc, government, misc, misc, m…
$ organisation_size <ord> 100, 25, 1, 1, 5, 5, 5, 1, 10, 10, 10, 1, 1…
$ revenue <dbl> 1, 1, 1, 1, 1, 1, 14, 1, 289, 201, 665, 1, …
$ cost <chr> "27", "18", "$100 per device", "$2000-$3000…
$ ransom_cost <dbl> 27.000000, 18.000000, 5.000000, 0.003000, 0…
$ data_note <chr> NA, NA, NA, "cost", NA, NA, NA, NA, NA, NA,…
$ ransom_paid <fct> unknown, unknown, unknown, unknown, unknown…
$ year_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2…
$ year <int> 2013, 2014, 2014, 2014, 2015, 2015, 2015, 2…
$ month <chr> "SEP", "JAN", "MAY", "JUN", "MAR", "MAY", "…
$ location <fct> "Worldwide", "Worldwide", "Australia", "USA…
$ interesting_story_.edited. <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ interesting_story_.long. <chr> "CryptoLocker was one of the most profitble…
$ interesting_story. <chr> "y", "y", "y", NA, "y", "y", NA, "y", NA, N…
$ ransomware <fct> "CryptoLocker", "CryptoWall", "Not revealed…
$ stock_symbol <chr> NA, NA, NA, NA, NA, NA, "PRIVATE", NA, NA, …
$ revenue_as_of <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ no_of_employees <chr> NA, NA, NA, NA, NA, NA, "51-100", NA, NA, N…
$ data_note.1 <chr> NA, NA, "$100 per device or $50 per device.…
$ source_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url <chr> "https://digitalguardian.com/blog/history-r…
$ url_2 <chr> NA, NA, NA, NA, NA, NA, "https://finance.ya…
$ url_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ X.id <dbl> 1, 2, 3, 4, 5, 6, 7, 10, 12, 15, 16, 18, 20…
$ month_as_code <dbl> 9, 1, 5, 6, 3, 5, 7, 4, 2, 6, 11, 12, 6, 5,…
$ date_code <dbl> 0.75, 1.08, 1.42, 1.50, 2.25, 2.42, 2.58, 3…
$ date_code_2 <dbl> 0.00, 0.25, 0.25, 0.25, 0.50, 0.50, 0.50, 0…
$ log_ransom <dbl> 1.43136376, 1.25527251, 0.69897000, -2.5228…
$ finance_sector <chr> "Other Sector", "Other Sector", "Other Sect…
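The log_ransom column in the output above is the base-10 logarithm of ransom_cost. A short sketch of the transformation and its inverse, using the two extremes quoted in the cleaning comments ($0.003M and $670M) plus $1M as a reference point:

```r
# Ransom costs in USD millions: quoted minimum, a reference value, quoted maximum
ransom_cost <- c(0.003, 1, 670)

# log10 turns multiplicative differences into additive ones
log_ransom <- log10(ransom_cost)

# 10^x undoes the transform and recovers the original values
back_transformed <- 10^log_ransom

print(round(log_ransom, 3))
print(back_transformed)
```

On the log scale the values run from about -2.5 to about 2.8, and a one-unit change corresponds to a tenfold change in ransom.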
# Subset the variables used for exploration and correlation
Ransomware_Clean <- Ransomware_Attack |>
  select(ransom_cost, log_ransom, year, revenue, organisation_size) |>
  filter(!is.na(ransom_cost) & !is.na(log_ransom) &
           !is.na(year) & !is.na(revenue) & !is.na(organisation_size))
Method 2: Correlation Heatmap with DataExplorer
This correlation plot shows the pairwise relationships as a heatmap of correlation values.
# Correlation among log_ransom, year, organisation_size, and ransom_cost
library(DataExplorer)
Warning: package 'DataExplorer' was built under R version 4.5.2
plot_correlation(Ransomware_Clean |>
  select(log_ransom, year, organisation_size, ransom_cost))
Method 3: Distribution and Correlation Plot with psych
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
pairs.panels(Ransomware_Clean |>
               select(log_ransom, year, organisation_size, ransom_cost),
             gap = 0,
             pch = 21,
             lm = TRUE) # plot distributions and correlations for the selected variables
Multiple linear regression model
# Full Model
# Predict log_ransom using year, sector, organisation_size, and revenue
model_full <- lm(log_ransom ~ year + sector + organisation_size + revenue,
data = Ransomware_Attack)
summary(model_full)
Call:
lm(formula = log_ransom ~ year + sector + organisation_size +
revenue, data = Ransomware_Attack)
Residuals:
Min 1Q Median 3Q Max
-1.8188 -0.4441 -0.1110 0.4853 2.1772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.517e+02 8.635e+01 -4.074 0.000109 ***
year 1.742e-01 4.278e-02 4.072 0.000110 ***
sectorenergy 5.685e-01 7.033e-01 0.808 0.421353
sectorfinance 9.796e-01 5.045e-01 1.942 0.055737 .
sectorfood & beverage 3.248e-01 6.774e-01 0.479 0.632962
sectorgovernment 4.749e-01 3.154e-01 1.505 0.136189
sectorhealthcare 5.423e-01 3.330e-01 1.629 0.107362
sectorindustrial 2.227e-01 7.116e-01 0.313 0.755152
sectorlegal 1.807e+00 6.706e-01 2.694 0.008620 **
sectormedia & sport 3.446e-01 5.062e-01 0.681 0.498068
sectormisc 1.105e+00 4.395e-01 2.513 0.013991 *
sectorretail 6.307e-01 5.998e-01 1.051 0.296244
sectortech 8.414e-01 4.251e-01 1.979 0.051296 .
organisation_size.L 2.011e+00 6.986e-01 2.878 0.005144 **
organisation_size.Q 1.461e-01 6.336e-01 0.231 0.818229
organisation_size.C -6.965e-01 5.789e-01 -1.203 0.232504
organisation_size^4 1.391e-01 5.288e-01 0.263 0.793239
organisation_size^5 9.705e-01 6.322e-01 1.535 0.128712
revenue 7.343e-04 7.248e-04 1.013 0.314122
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8709 on 79 degrees of freedom
Multiple R-squared: 0.3777, Adjusted R-squared: 0.236
F-statistic: 2.664 on 18 and 79 DF, p-value: 0.001499
autoplot(model_full, 1:4, nrow=2, ncol=2)
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Removed 19 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).
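The coefficients named organisation_size.L, .Q, .C, ^4, and ^5 in the summary above come from the polynomial contrasts that R applies by default to ordered factors. A minimal sketch of where those names come from, rebuilding the same six-level ordered factor on hypothetical values:

```r
# Hypothetical sizes using the same six levels as organisation_size
size <- factor(c("1", "5", "10", "25", "100", "300"),
               levels = c("1", "5", "10", "25", "100", "300"),
               ordered = TRUE)

# Ordered factors use orthogonal polynomial contrasts by default:
# one column each for the linear, quadratic, cubic,
# 4th-degree and 5th-degree trends across the levels
size_contrasts <- contrasts(size)

print(colnames(size_contrasts))
print(round(size_contrasts, 3))
```

So organisation_size.L tests for a linear trend across the ordered levels (treated as equally spaced), not the effect of one particular size category.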
Backward regression removing sector
# Backward Step 1
# Inspect the p-values in the full model above and remove a predictor that is not significant at 0.05
# sector is removed first: most of its level coefficients have p-values above 0.05
model_step1 <- lm(log_ransom ~ year + organisation_size + revenue,
data = Ransomware_Attack)
summary(model_step1)
Call:
lm(formula = log_ransom ~ year + organisation_size + revenue,
data = Ransomware_Attack)
Residuals:
Min 1Q Median 3Q Max
-1.89902 -0.38490 -0.07727 0.42241 2.08474
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.558e+02 8.111e+01 -4.386 3.12e-05 ***
year 1.766e-01 4.019e-02 4.394 3.04e-05 ***
organisation_size.L 2.633e+00 6.223e-01 4.231 5.58e-05 ***
organisation_size.Q 1.311e-01 6.431e-01 0.204 0.8390
organisation_size.C -9.554e-01 5.738e-01 -1.665 0.0994 .
organisation_size^4 1.554e-01 5.355e-01 0.290 0.7723
organisation_size^5 1.170e+00 6.313e-01 1.854 0.0670 .
revenue 6.434e-04 6.494e-04 0.991 0.3245
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8871 on 90 degrees of freedom
Multiple R-squared: 0.2645, Adjusted R-squared: 0.2073
F-statistic: 4.623 on 7 and 90 DF, p-value: 0.0001871
autoplot(model_step1, 1:4, nrow=2, ncol=2)
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).
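Because a factor such as sector or organisation_size contributes several coefficients, it can be easier to judge each term as a whole rather than level by level. One way to do this in base R is drop1() with an F test. The sketch below uses the built-in mtcars data as a stand-in, so the variables are not those of this project:

```r
# Stand-in model on the built-in mtcars data (not the ransomware data)
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# drop1() refits the model with each term removed in turn and
# reports one F test per term, even when a term has several coefficients
term_tests <- drop1(fit, test = "F")
print(term_tests)
```

Applied to the models in this report, the same call would give one p-value for sector and one for organisation_size, which could guide which term to remove at each backward step.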
Backward regression removing organisation_size
# Backward Step 2
# Remove organisation_size and refit the model with year and revenue only
model_step2 <- lm(log_ransom ~ year + revenue,
data = Ransomware_Attack)
summary(model_step2)
Call:
lm(formula = log_ransom ~ year + revenue, data = Ransomware_Attack)
Residuals:
Min 1Q Median 3Q Max
-2.0280 -0.4260 -0.1373 0.6053 2.2565
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.073e+02 8.076e+01 -2.566 0.0118 *
year 1.026e-01 3.999e-02 2.565 0.0119 *
revenue 2.604e-04 4.611e-04 0.565 0.5735
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9689 on 95 degrees of freedom
Multiple R-squared: 0.07369, Adjusted R-squared: 0.05418
F-statistic: 3.778 on 2 and 95 DF, p-value: 0.02637
autoplot(model_step2, 1:4, nrow=2, ncol=2)
Backward regression removing revenue
# Backward Step 3
# Check whether all remaining predictors are now p < 0.05
# If yes, stop here: this is the final model
# If no, remove the next least significant predictor and repeat
# revenue (p = 0.5735) is not significant, so it is removed
model_final <- lm(log_ransom ~ year,
data = Ransomware_Attack)
summary(model_final)
Call:
lm(formula = log_ransom ~ year, data = Ransomware_Attack)
Residuals:
Min 1Q Median 3Q Max
-2.0326 -0.4593 -0.1342 0.6025 2.2482
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -214.55386 79.44119 -2.701 0.00818 **
year 0.10618 0.03933 2.700 0.00820 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9655 on 96 degrees of freedom
Multiple R-squared: 0.07057, Adjusted R-squared: 0.06089
F-statistic: 7.29 on 1 and 96 DF, p-value: 0.008197
autoplot(model_final, 1:4, nrow=2, ncol=2)
The final model retained only year as a predictor of ransom cost; sector, organisation_size, and revenue were removed during backward elimination. The variable year was statistically significant (p = 0.00820 < 0.05), indicating that ransom demands have grown meaningfully over time.
The equation is log10(ransom_cost) = 0.10618(year) - 214.55386
However, the Adjusted R² of 0.06089 means that only about 6.1% of the variation in log ransom cost is explained by the final model. The remaining roughly 94% is variation that the model cannot account for, attributable to factors that year alone does not capture. For comparison, the full model had an Adjusted R² of 0.236.
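To read the fitted equation in dollar terms, the coefficients reported in the summary can be plugged in directly. This sketch hard-codes the rounded estimates from the output above, so the results are approximate:

```r
# Rounded estimates copied from summary(model_final)
intercept <- -214.55386
slope <- 0.10618

# Predicted log10 ransom (USD millions) for the first and last years in the data
years <- c(2013, 2023)
pred_log <- intercept + slope * years

# Back-transform from the log10 scale to USD millions
pred_cost <- 10^pred_log

# Multiplicative change in predicted ransom per additional year
yearly_factor <- 10^slope

print(round(pred_cost, 2))
print(round(yearly_factor, 3))
```

With these rounded coefficients the predicted ransom rises from roughly $0.15M in 2013 to about $1.8M in 2023, which corresponds to a factor of about 1.28 per year.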
Residuals vs Fitted
This plot indicates some non-linearity in the relationship: the model is not capturing the full pattern in the data.
Normal Q-Q
The plot suggests that the residuals are approximately normally distributed for the bulk of the data.
# This chunk compares the two models to see which one fits the data better.
anova(model_step2, model_step1)
Analysis of Variance Table
Model 1: log_ransom ~ year + revenue
Model 2: log_ransom ~ year + organisation_size + revenue
Res.Df RSS Df Sum of Sq F Pr(>F)
1 95 89.190
2 90 70.821 5 18.369 4.6688 0.0007781 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
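The F statistic in the table above can be reproduced by hand from the two residual sums of squares, which makes clear what the comparison tests. The numbers below are copied from the ANOVA output:

```r
# Values copied from the ANOVA table
rss_small <- 89.190   # log_ransom ~ year + revenue
rss_large <- 70.821   # log_ransom ~ year + organisation_size + revenue
df_small <- 95
df_large <- 90

# F = (drop in RSS per extra parameter) / (residual variance of the larger model)
df_diff <- df_small - df_large
f_stat <- ((rss_small - rss_large) / df_diff) / (rss_large / df_large)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_diff, df_large, lower.tail = FALSE)

print(round(f_stat, 4))
print(signif(p_value, 3))
```

A small p-value here indicates that the model including organisation_size fits significantly better than the model without it.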
Scatterplot Visualization
# Scatterplot 1: Year vs Ransom Cost
# Are ransomware demands growing year over year?
# The lm trend line directly answers this
p_s1 <- ggplot(Ransomware_Attack,
aes(x = year, y = log_ransom,
color = sector)) +
geom_jitter(alpha = 0.55, width = 0.25, size = 2.5) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "gray20", linetype = "dashed") +
scale_y_continuous(
labels = function(x) paste0("$", round(10^x, 2), "M") #found this on google
) +
scale_color_viridis_d(option = "turbo", name = "Sector") +
labs(
title = "Are Ransomware Demands Growing Over Time?",
x = "Year",
y = "Ransom Cost (USD millions)",
color = "Sector",
caption = "Source: Information is Beautiful — Ransomware Attacks"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
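The labels function in the plot definition above converts axis breaks from the log10 scale back to dollar amounts. A minimal sketch of what it returns for a few breaks; `log_to_dollar_label` is a hypothetical name given here to the same anonymous function:

```r
# Same logic as the anonymous function passed to scale_y_continuous()
log_to_dollar_label <- function(x) paste0("$", round(10^x, 2), "M")

# Example breaks on the log10 scale
breaks <- c(-2, 0, 2)
labels <- log_to_dollar_label(breaks)
print(labels)
```

The points are still positioned by log_ransom; only the tick labels are shown in USD millions.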
p_s1
ggplotly(p_s1)
# Scatterplot 2: Organisation Size vs Ransom Cost
# Do ransomware attackers demand more from larger organisations?
# organisation_size is an ordered factor: 1 (smallest) → 300 (largest)
p_s2 <- ggplot(Ransomware_Attack,
aes(x = organisation_size,
y = log_ransom,
color = ransom_paid)) +
geom_jitter(alpha = 0.6, width = 0.2, size = 2.5) +
scale_color_manual(
values = c("ransom paid" = "#0307fc",
"refused" = "#fc0398",
"unknown" = "#03fc1c"),
name = "Ransom Paid?"
) +
labs(
title = "Organisation Size vs. Ransom Cost",
x = "Organisation Size",
y = "Ransom Cost (USD millions)",
caption = "Source: Information is Beautiful — Ransomware Attacks"
) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
p_s2
ggplotly(p_s2)
Closing Essay
How the Data Was Cleaned
The raw dataset from the Information is Beautiful Google Sheet presented several structural challenges that required cleaning decisions. The file contained an extraneous title row (“Top ransomware attacks”) at the very top, which I handled with skip = 1 inside read_csv() so that the actual column headers loaded correctly. I then standardized the column names with names(Ransomware_Attack) <- tolower(names(Ransomware_Attack)) and names(Ransomware_Attack) <- gsub(" ", "_", names(Ransomware_Attack)). I also used Ransomware_Attack <- Ransomware_Attack[-1, ] to remove the row directly below the column names, which contained no attack data. However, the data was still not ready: it contained a large number of missing values. I therefore filtered out rows with missing values in the variables I used, and then applied log10 to reduce the skewness of the ransom amounts.
What the Visualizations Show
The two scatterplots together reveal a clear and escalating ransomware threat. The Organisation Size vs. Ransom Cost plot challenges the assumption that larger organisations are always hit harder: smaller organisations (size 1) show the widest spread of ransom costs, while the largest (size 100–300) face fewer but consistently higher demands, reaching tens of millions of dollars. This might be because bigger organisations have invested more in sophisticated security measures. The Year vs. Ransom Cost plot tells the more alarming story: ransom demands have grown dramatically over time, with the dashed trend line climbing from roughly $0.1M in 2013 to $1M–$10M by 2023.
What I wish I could have done
I would have liked to include a geographic map using rnaturalearth and sf showing total ransom costs by country, connecting the financial damage directly to specific regions. However, the location column in this dataset contains many multi-country strings (e.g., “USA, Canada, Ireland”), which would require extensive string parsing to decompose cleanly.
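A first step toward that map would be splitting each multi-country string into one row per country. The sketch below is only an illustration in base R on a hypothetical two-row data frame. How to divide a ransom among several countries is left open; here the full amount is simply repeated for each country, which is an assumption.

```r
# Hypothetical attacks with single- and multi-country locations
attacks <- data.frame(
  location = c("USA", "USA, Canada, Ireland"),
  ransom_cost = c(5, 27),
  stringsAsFactors = FALSE
)

# Split each location string on commas (and any spaces that follow)
country_list <- strsplit(attacks$location, ",\\s*")

# Build one row per country, repeating the ransom for each (an assumption)
attacks_long <- data.frame(
  country = unlist(country_list),
  ransom_cost = rep(attacks$ransom_cost, times = lengths(country_list))
)

print(attacks_long)
```

The resulting country column could then be matched against the country names in a map layer, although entries such as “Worldwide” would still need separate handling.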