DATA110_Project1_Ransomware

Author

Gamaliel Ngouafon

Introduction

Ransomware has emerged as one of the most financially devastating forms of cybercrime in the modern digital economy. Unlike traditional data breaches that silently steal records, ransomware holds entire systems hostage by encrypting files and data and then demanding payment, often in cryptocurrency. For governments, hospitals, technology companies, and financial institutions, a ransomware attack can mean millions in ransom costs, regulatory consequences, and prolonged operational paralysis.

This project analyzes one of the world's biggest ransomware datasets, curated and updated by journalists at Information is Beautiful. The dataset tracks over 630 notable ransomware attacks from 2013 to 2023, documenting the targeted enterprises, the ransomware responsible, the sector affected, the ransom cost, and whether the ransom was paid.

Variable definition

sector: Industry of the target: government, tech, healthcare, finance, academic, etc. (categorical)
organisation_size: Size of the organisation on a scale of 1, 5, 10, 25, 100, 300 (ordinal)
revenue: Estimated revenue of the organisation in USD millions (quantitative)
ransom_cost: Ransom amount in USD millions (quantitative)
ransom_paid: Payment outcome: “ransom paid”, “refused”, or “unknown” (categorical)
year: Year the attack occurred, 2013–2023 (quantitative)

# Loading Libraries 
library(tidyverse)
library(ggfortify)
library(plotly)
library(viridis)

Checking the working directory before loading the dataset

getwd()

[1] "/Users/darrenabou/Desktop/Spring 26/Data110/Project 1"

# skip = 1 skips the first line of the file (a title row), so the real column headers are read as the column names
Ransomware_Attack <- read_csv("Ransomware Attacks - Ransomware Attacks.csv", skip = 1)

Rows: 636 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (23): Target, AKA, description, sector, revenue, cost, ransom cost, data...
dbl  (6): YEAR code, YEAR, #ID, month as code, date code, date code 2
num  (1): organisation size
lgl  (2): interesting story (edited), Source Name

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
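To make the effect of skip concrete, here is a minimal sketch on a made-up four-line CSV (the text, the "Example Org" row and the object names are assumptions for illustration, not the real file). It uses base read.csv() rather than read_csv() so that it runs without extra packages; the skip argument behaves the same way for this purpose.

```r
# Toy CSV: a title line, the real header, a display-label row, one attack record
csv_text <- paste(
  "Top ransomware attacks,,",
  "Target,sector,ransom cost",
  "display label,,",
  "Example Org,tech,5",
  sep = "\n"
)

# skip = 1 discards the title line, so the second line becomes the header
raw <- read.csv(text = csv_text, skip = 1,
                check.names = FALSE, stringsAsFactors = FALSE)
names(raw)

# The first data row still holds display labels, so it is dropped separately
attacks <- raw[-1, ]
nrow(attacks)
```

The same two-step pattern (skip the title line on read, then drop the label row) is what the chunks in this report do on the full file.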

head(Ransomware_Attack)

# A tibble: 6 × 32
  Target             AKA    description sector `organisation size` revenue cost 
  <chr>              <chr>  <chr>       <chr>                <dbl> <chr>   <chr>
1 "display label"    <NA>   <NA>        <NA>          151025100300 $USD m… disp…
2 "\"Cryptolocker\"" 250,0… <NA>        misc                   100 1       27   
3 "\"Cryptowall\""   Multi… <NA>        misc                    25 1       18   
4 "Apple devices"    <NA>   <NA>        misc                     1 1       $100…
5 "New Hampshire PD" <NA>   <NA>        gover…                   1 1       $200…
6 "\"TeslaCrypt\""   Multi… <NA>        misc                     5 1       76,5…
# ℹ 25 more variables: `ransom cost` <chr>, `data note` <chr>,
#   `ransom paid` <chr>, `YEAR code` <dbl>, YEAR <dbl>, month <chr>,
#   location <chr>, `interesting story (edited)` <lgl>,
#   `interesting story (long)` <chr>, `interesting story?` <chr>,
#   Ransomware <chr>, `stock symbol` <chr>, `revenue as of` <chr>,
#   `no of employees` <chr>, `Data Note` <chr>, `Source Name` <lgl>, URL <chr>,
#   `URL 2` <chr>, `URL 3` <chr>, `URL 4` <chr>, `URL 5` <chr>, `#ID` <dbl>, …

Cleaning Datasets

# Renaming all the Variables
names(Ransomware_Attack) <- tolower(names(Ransomware_Attack))
names(Ransomware_Attack) <- gsub(" ","_",names(Ransomware_Attack))
head(Ransomware_Attack)

# A tibble: 6 × 32
  target    aka   description sector organisation_size revenue cost  ransom_cost
  <chr>     <chr> <chr>       <chr>              <dbl> <chr>   <chr> <chr>      
1 "display… <NA>  <NA>        <NA>        151025100300 $USD m… disp… <NA>       
2 "\"Crypt… 250,… <NA>        misc                 100 1       27    27         
3 "\"Crypt… Mult… <NA>        misc                  25 1       18    18         
4 "Apple d… <NA>  <NA>        misc                   1 1       $100… 5          
5 "New Ham… <NA>  <NA>        gover…                 1 1       $200… 0.003      
6 "\"Tesla… Mult… <NA>        misc                   5 1       76,5… 0.07652    
# ℹ 24 more variables: data_note <chr>, ransom_paid <chr>, year_code <dbl>,
#   year <dbl>, month <chr>, location <chr>,
#   `interesting_story_(edited)` <lgl>, `interesting_story_(long)` <chr>,
#   `interesting_story?` <chr>, ransomware <chr>, stock_symbol <chr>,
#   revenue_as_of <chr>, no_of_employees <chr>, data_note <chr>,
#   source_name <lgl>, url <chr>, url_2 <chr>, url_3 <chr>, url_4 <chr>,
#   url_5 <chr>, `#id` <dbl>, month_as_code <dbl>, date_code <dbl>, …

# Delete the row directly below the column names: it holds display labels (target = "display label"), not an attack record.
Ransomware_Attack <- Ransomware_Attack[-1, ]

head(Ransomware_Attack)

# A tibble: 6 × 32
  target    aka   description sector organisation_size revenue cost  ransom_cost
  <chr>     <chr> <chr>       <chr>              <dbl> <chr>   <chr> <chr>      
1 "\"Crypt… 250,… <NA>        misc                 100 1       27    27         
2 "\"Crypt… Mult… <NA>        misc                  25 1       18    18         
3 "Apple d… <NA>  <NA>        misc                   1 1       $100… 5          
4 "New Ham… <NA>  <NA>        gover…                 1 1       $200… 0.003      
5 "\"Tesla… Mult… <NA>        misc                   5 1       76,5… 0.07652    
6 "\"Tesla… Mult… <NA>        misc                   5 1       $100… 5          
# ℹ 24 more variables: data_note <chr>, ransom_paid <chr>, year_code <dbl>,
#   year <dbl>, month <chr>, location <chr>,
#   `interesting_story_(edited)` <lgl>, `interesting_story_(long)` <chr>,
#   `interesting_story?` <chr>, ransomware <chr>, stock_symbol <chr>,
#   revenue_as_of <chr>, no_of_employees <chr>, data_note <chr>,
#   source_name <lgl>, url <chr>, url_2 <chr>, url_3 <chr>, url_4 <chr>,
#   url_5 <chr>, `#id` <dbl>, month_as_code <dbl>, date_code <dbl>, …

names(Ransomware_Attack)

 [1] "target"                     "aka"                       
 [3] "description"                "sector"                    
 [5] "organisation_size"          "revenue"                   
 [7] "cost"                       "ransom_cost"               
 [9] "data_note"                  "ransom_paid"               
[11] "year_code"                  "year"                      
[13] "month"                      "location"                  
[15] "interesting_story_(edited)" "interesting_story_(long)"  
[17] "interesting_story?"         "ransomware"                
[19] "stock_symbol"               "revenue_as_of"             
[21] "no_of_employees"            "data_note"                 
[23] "source_name"                "url"                       
[25] "url_2"                      "url_3"                     
[27] "url_4"                      "url_5"                     
[29] "#id"                        "month_as_code"             
[31] "date_code"                  "date_code_2"

# This code pulls out all the duplicated column names
names(Ransomware_Attack)[duplicated(names(Ransomware_Attack))]

[1] "data_note"

# Make all column names unique and syntactically valid: duplicates get .1, .2, etc., and characters such as ( ) ? # become dots
Ransomware_Attack <- Ransomware_Attack |>
  setNames(make.names(names(Ransomware_Attack), unique = TRUE))
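As a small illustration of what this step does to the names, the sketch below applies the same lower-casing, underscore substitution and make.names() call to four of the raw column names from this dataset. The vector names (raw_names, tidy_names, unique_names) are introduced here only for the example.

```r
# Four of the raw column names, including the two "data note" variants
raw_names <- c("data note", "interesting story (edited)", "#ID", "Data Note")

# Same renaming steps as above: lower-case, then spaces to underscores
tidy_names <- gsub(" ", "_", tolower(raw_names))

# make.names() replaces invalid characters with dots, prepends "X" when a
# name does not start with a letter or dot, and numbers any duplicates
unique_names <- make.names(tidy_names, unique = TRUE)
unique_names
```

This is why the later glimpse() output shows names such as interesting_story_.edited., X.id and data_note.1.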

# Convert ransom_cost and revenue to numeric.
# The raw file stores these as character because of mixed text entries.
# as.numeric() converts valid numbers and coerces non-numeric strings to NA.
Ransomware_Attack <- Ransomware_Attack |>
  mutate(ransom_cost = as.numeric(ransom_cost),
         revenue     = as.numeric(revenue))

Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `ransom_cost = as.numeric(ransom_cost)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
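The coercion warning above is expected. A minimal sketch of what as.numeric() does with mixed text, using a short made-up vector (raw_cost and the other names are illustrative; "$100 per device" mirrors an entry visible in the cost column):

```r
# Made-up character values of the kind found in a text-typed numeric column
raw_cost <- c("27", "0.003", "unknown", "$100 per device", NA)

# Valid numbers convert; anything else becomes NA (warning silenced here)
parsed <- suppressWarnings(as.numeric(raw_cost))
parsed

# Positions that were text but failed to parse, worth inspecting by hand
failed <- which(is.na(parsed) & !is.na(raw_cost))
raw_cost[failed]
```

Listing the failed entries before filtering is one way to check that no usable ransom amounts are being discarded silently.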

# Clean the sector column.
# str_trim() removes leading and trailing whitespace ("tech ") and newline characters.
# str_to_lower() standardizes all values to lowercase.
Ransomware_Attack <- Ransomware_Attack |>
  mutate(sector = str_trim(str_to_lower(sector)))

# Convert categorical variables to Factors which are required for correct handling in lm() models and ggplot2
Ransomware_Attack <- Ransomware_Attack|>
  mutate(
    sector      = factor(sector),
    ransom_paid = factor(ransom_paid,
                         levels = c("ransom paid", "refused", "unknown")),
    organisation_size    = factor(organisation_size,
                         levels  = c("1", "5", "10", "25", "100", "300"),
                         ordered = TRUE),  # ordered = TRUE keeps the small → large hierarchy
    year        = as.integer(year),
    location    = factor(location),
    ransomware        = factor(ransomware)
  )
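The ordered factor is also the reason the regression output further down shows terms named .L, .Q, .C, ^4 and ^5 instead of one coefficient per size: R uses polynomial contrasts for ordered factors by default. A minimal sketch on a made-up vector of sizes (sizes and size_f are names assumed for the example):

```r
# Made-up sizes drawn from the dataset's six-point scale
sizes <- c(1, 300, 25, 5, 100, 10, 25)

size_f <- factor(sizes,
                 levels  = c("1", "5", "10", "25", "100", "300"),
                 ordered = TRUE)

# Ordered factors support comparisons between levels
sum(size_f >= "100")

# as.integer() returns level positions (1 to 6), not the sizes themselves;
# go through as.character() to recover the original numbers
as.integer(size_f)
as.numeric(as.character(size_f))

# Default contrasts for a six-level ordered factor: linear, quadratic, cubic, ...
colnames(contr.poly(nlevels(size_f)))
```

If one coefficient per size category is preferred, the factor could be created with ordered = FALSE; that is a modelling choice rather than a correction.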

# Apply inclusion / exclusion criteria.
# Keep only rows with a valid numeric ransom_cost greater than zero.
# Remove rows missing year, sector, organisation_size, ransom_paid, or revenue: all are needed for regression and visualization.
Ransomware_Attack <- Ransomware_Attack |>
  filter(!is.na(ransom_cost) & ransom_cost > 0) |>
  filter(!is.na(year)) |>
  filter(!is.na(sector))|>
  filter(!is.na(organisation_size))|>
  filter(!is.na(ransom_paid)) |>
  filter(!is.na(revenue))

# Log-transform ransom_cost.
# ransom_cost ranges from $0.003M to $670M, a severe right skew.
# log10 compresses this into a near-normal range for regression.
Ransomware_Attack <- Ransomware_Attack |>
  mutate(log_ransom = log10(ransom_cost))
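A quick sketch of how much the transform compresses the scale, using a handful of ransom values that appear in this report (0.003, 0.07652, 5, 27 and the $670M maximum); the object names are illustrative:

```r
# Ransom amounts in USD millions, spanning the reported range
ransom_cost <- c(0.003, 0.07652, 5, 27, 670)

log_ransom <- log10(ransom_cost)

# Raw values span more than five orders of magnitude ...
range(ransom_cost)
# ... while the log10 values sit within a few units of each other
range(log_ransom)

# The transform is reversible: 10^x recovers the original amounts
back_transformed <- 10^log_ransom
```

Each unit on the log10 scale is a tenfold change in ransom, which is worth keeping in mind when reading the regression coefficients later.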

# Count occurrences of each sector
Ransomware_Attack |>
  count(sector, sort = TRUE)  # sort = TRUE puts the most frequent at the top

# A tibble: 12 × 2
   sector              n
   <fct>           <int>
 1 government         33
 2 healthcare         16
 3 academic           13
 4 misc               10
 5 tech                7
 6 finance             4
 7 media & sport       4
 8 retail              3
 9 energy              2
10 food & beverage     2
11 industrial          2
12 legal               2

# Create a finance flag variable: a binary variable to highlight finance-sector attacks in visualizations
Ransomware_Attack <- Ransomware_Attack |>
  mutate(finance_sector = ifelse(sector == "finance",
                           "Finance", "Other Sector"))

# Final check on the cleaned dataset
glimpse(Ransomware_Attack)

Rows: 98
Columns: 34
$ target                     <chr> "\"Cryptolocker\"", "\"Cryptowall\"", "Appl…
$ aka                        <chr> "250,000 systems", "Multiple systems", NA, …
$ description                <chr> NA, NA, NA, NA, NA, NA, "global streaming m…
$ sector                     <fct> misc, misc, misc, government, misc, misc, m…
$ organisation_size          <ord> 100, 25, 1, 1, 5, 5, 5, 1, 10, 10, 10, 1, 1…
$ revenue                    <dbl> 1, 1, 1, 1, 1, 1, 14, 1, 289, 201, 665, 1, …
$ cost                       <chr> "27", "18", "$100 per device", "$2000-$3000…
$ ransom_cost                <dbl> 27.000000, 18.000000, 5.000000, 0.003000, 0…
$ data_note                  <chr> NA, NA, NA, "cost", NA, NA, NA, NA, NA, NA,…
$ ransom_paid                <fct> unknown, unknown, unknown, unknown, unknown…
$ year_code                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2…
$ year                       <int> 2013, 2014, 2014, 2014, 2015, 2015, 2015, 2…
$ month                      <chr> "SEP", "JAN", "MAY", "JUN", "MAR", "MAY", "…
$ location                   <fct> "Worldwide", "Worldwide", "Australia", "USA…
$ interesting_story_.edited. <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ interesting_story_.long.   <chr> "CryptoLocker was one of the most profitble…
$ interesting_story.         <chr> "y", "y", "y", NA, "y", "y", NA, "y", NA, N…
$ ransomware                 <fct> "CryptoLocker", "CryptoWall", "Not revealed…
$ stock_symbol               <chr> NA, NA, NA, NA, NA, NA, "PRIVATE", NA, NA, …
$ revenue_as_of              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ no_of_employees            <chr> NA, NA, NA, NA, NA, NA, "51-100", NA, NA, N…
$ data_note.1                <chr> NA, NA, "$100 per device or $50 per device.…
$ source_name                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url                        <chr> "https://digitalguardian.com/blog/history-r…
$ url_2                      <chr> NA, NA, NA, NA, NA, NA, "https://finance.ya…
$ url_3                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url_4                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url_5                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ X.id                       <dbl> 1, 2, 3, 4, 5, 6, 7, 10, 12, 15, 16, 18, 20…
$ month_as_code              <dbl> 9, 1, 5, 6, 3, 5, 7, 4, 2, 6, 11, 12, 6, 5,…
$ date_code                  <dbl> 0.75, 1.08, 1.42, 1.50, 2.25, 2.42, 2.58, 3…
$ date_code_2                <dbl> 0.00, 0.25, 0.25, 0.25, 0.50, 0.50, 0.50, 0…
$ log_ransom                 <dbl> 1.43136376, 1.25527251, 0.69897000, -2.5228…
$ finance_sector             <chr> "Other Sector", "Other Sector", "Other Sect…

# cleaning the dataset for exploration and correlation
Ransomware_Clean <- Ransomware_Attack |>
  select(ransom_cost, log_ransom, year, revenue, organisation_size) |>
  filter(!is.na(ransom_cost) & !is.na(log_ransom) &
         !is.na(year) & !is.na(revenue) & !is.na(organisation_size))

Method 2: Correlation Heatmap with DataExplorer

This correlation plot shows the pairwise relationships among the selected variables as a heatmap of correlation values.

#using only numerical values to find correlation

library(DataExplorer)

Warning: package 'DataExplorer' was built under R version 4.5.2

plot_correlation(Ransomware_Clean |> 
                   select(log_ransom, year,organisation_size,ransom_cost))
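For readers who want the numbers behind a heatmap like this, a correlation matrix can be computed directly with base R's cor(). The sketch below uses a tiny made-up data frame (toy and its values are assumptions, not the project data) and Pearson correlation, which is cor()'s default.

```r
# Made-up numeric data with the same kind of columns as the project
toy <- data.frame(
  year       = c(2014, 2016, 2018, 2020, 2022),
  log_ransom = c(-1.2, -0.5, 0.1, 0.4, 1.3)
)
toy$ransom_cost <- 10^toy$log_ransom

# Pairwise Pearson correlations; each cell is what a heatmap tile would colour
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
round(cor_matrix, 2)
```

Note that ransom_cost and log_ransom are two versions of the same variable, so a strong correlation between them is expected and says nothing about the data.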

Method 3: Distribution and Correlation Plot with psych

library(psych)


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

pairs.panels(Ransomware_Clean |>
               select(log_ransom, year, organisation_size, ransom_cost),
             gap = 0,
             pch = 21,
             lm  = TRUE) # plot distributions and correlations for the selected variables

Multiple Linear Regression Model

# Full Model 
# Predict log_ransom using year, sector, organisation_size, and revenue
model_full <- lm(log_ransom ~ year + sector + organisation_size+ revenue,
                 data = Ransomware_Attack)

summary(model_full)


Call:
lm(formula = log_ransom ~ year + sector + organisation_size + 
    revenue, data = Ransomware_Attack)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8188 -0.4441 -0.1110  0.4853  2.1772 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -3.517e+02  8.635e+01  -4.074 0.000109 ***
year                   1.742e-01  4.278e-02   4.072 0.000110 ***
sectorenergy           5.685e-01  7.033e-01   0.808 0.421353    
sectorfinance          9.796e-01  5.045e-01   1.942 0.055737 .  
sectorfood & beverage  3.248e-01  6.774e-01   0.479 0.632962    
sectorgovernment       4.749e-01  3.154e-01   1.505 0.136189    
sectorhealthcare       5.423e-01  3.330e-01   1.629 0.107362    
sectorindustrial       2.227e-01  7.116e-01   0.313 0.755152    
sectorlegal            1.807e+00  6.706e-01   2.694 0.008620 ** 
sectormedia & sport    3.446e-01  5.062e-01   0.681 0.498068    
sectormisc             1.105e+00  4.395e-01   2.513 0.013991 *  
sectorretail           6.307e-01  5.998e-01   1.051 0.296244    
sectortech             8.414e-01  4.251e-01   1.979 0.051296 .  
organisation_size.L    2.011e+00  6.986e-01   2.878 0.005144 ** 
organisation_size.Q    1.461e-01  6.336e-01   0.231 0.818229    
organisation_size.C   -6.965e-01  5.789e-01  -1.203 0.232504    
organisation_size^4    1.391e-01  5.288e-01   0.263 0.793239    
organisation_size^5    9.705e-01  6.322e-01   1.535 0.128712    
revenue                7.343e-04  7.248e-04   1.013 0.314122    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8709 on 79 degrees of freedom
Multiple R-squared:  0.3777,    Adjusted R-squared:  0.236 
F-statistic: 2.664 on 18 and 79 DF,  p-value: 0.001499
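Because the response is log10(ransom_cost), the year coefficient is easier to read after back-transforming. The sketch below is one common way to interpret it, assuming the other predictors are held fixed; the coefficient is copied from the summary above and the variable names are illustrative.

```r
# Year coefficient from summary(model_full), on the log10 scale
year_coef <- 0.1742

# A one-year step adds year_coef to log10(ransom), i.e. multiplies the
# typical ransom by 10^year_coef
yearly_multiplier <- 10^year_coef

# Expressed as a percentage change per year
yearly_pct_change <- (yearly_multiplier - 1) * 100
round(yearly_pct_change, 1)
```

This describes the fitted trend in this sample of 98 attacks only; it is not a forecast.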

autoplot(model_full, 1:4, nrow=2, ncol=2)

Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Warning: Removed 19 rows containing missing values or values outside the scale range
(`geom_line()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

Backward regression removing sector

# Backward Step 1: inspect the p-values in the full model above.
# sector is dropped here: most of its levels have p-values above 0.05.

model_step1 <- lm(log_ransom ~ year + organisation_size + revenue ,
                  data = Ransomware_Attack)

summary(model_step1)


Call:
lm(formula = log_ransom ~ year + organisation_size + revenue, 
    data = Ransomware_Attack)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89902 -0.38490 -0.07727  0.42241  2.08474 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -3.558e+02  8.111e+01  -4.386 3.12e-05 ***
year                 1.766e-01  4.019e-02   4.394 3.04e-05 ***
organisation_size.L  2.633e+00  6.223e-01   4.231 5.58e-05 ***
organisation_size.Q  1.311e-01  6.431e-01   0.204   0.8390    
organisation_size.C -9.554e-01  5.738e-01  -1.665   0.0994 .  
organisation_size^4  1.554e-01  5.355e-01   0.290   0.7723    
organisation_size^5  1.170e+00  6.313e-01   1.854   0.0670 .  
revenue              6.434e-04  6.494e-04   0.991   0.3245    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8871 on 90 degrees of freedom
Multiple R-squared:  0.2645,    Adjusted R-squared:  0.2073 
F-statistic: 4.623 on 7 and 90 DF,  p-value: 0.0001871

autoplot(model_step1, 1:4, nrow=2, ncol=2)

Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_line()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

Backward regression removing Organisation_size

# Backward Step 2: refit without organisation_size, leaving year and revenue.
# Only the linear term (.L) was significant in model_step1; the other polynomial terms were above 0.05.

model_step2 <- lm(log_ransom ~ year + revenue ,
                  data = Ransomware_Attack)

summary(model_step2)


Call:
lm(formula = log_ransom ~ year + revenue, data = Ransomware_Attack)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0280 -0.4260 -0.1373  0.6053  2.2565 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -2.073e+02  8.076e+01  -2.566   0.0118 *
year         1.026e-01  3.999e-02   2.565   0.0119 *
revenue      2.604e-04  4.611e-04   0.565   0.5735  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9689 on 95 degrees of freedom
Multiple R-squared:  0.07369,   Adjusted R-squared:  0.05418 
F-statistic: 3.778 on 2 and 95 DF,  p-value: 0.02637

autoplot(model_step2, 1:4, nrow=2, ncol=2)

Backward regression removing revenue

# Backward Step 3: revenue has the highest p-value in model_step2 (0.5735 > 0.05), so it is removed.
# If all remaining predictors are now p < 0.05, stop here: this is the final model.
# If not, remove the next least significant predictor and repeat.

model_final <- lm(log_ransom ~ year,
                  data = Ransomware_Attack)

summary(model_final)


Call:
lm(formula = log_ransom ~ year, data = Ransomware_Attack)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0326 -0.4593 -0.1342  0.6025  2.2482 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -214.55386   79.44119  -2.701  0.00818 **
year           0.10618    0.03933   2.700  0.00820 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9655 on 96 degrees of freedom
Multiple R-squared:  0.07057,   Adjusted R-squared:  0.06089 
F-statistic:  7.29 on 1 and 96 DF,  p-value: 0.008197

autoplot(model_final, 1:4, nrow=2, ncol=2)

The final model retained only year as a predictor of log ransom cost; sector, organisation size, and revenue were removed during backward elimination. The year coefficient is statistically significant (p = 0.00820 < 0.05), indicating that ransom demands have grown over time.

The fitted equation is log10(ransom_cost) = -214.55386 + 0.10618 × year, where ransom_cost is in USD millions.
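To see what this equation implies in dollars, the coefficients from summary(model_final) can be plugged in by hand. This is a minimal sketch: predict_ransom_musd is a helper name made up for the example, and the results are fitted trend values, not observed ransoms.

```r
# Coefficients copied from summary(model_final)
intercept <- -214.55386
slope     <- 0.10618

# Fitted ransom in USD millions for a given year: back-transform the
# predicted log10 value with 10^x
predict_ransom_musd <- function(year) 10^(intercept + slope * year)

pred_2013 <- predict_ransom_musd(2013)
pred_2023 <- predict_ransom_musd(2023)
round(c(pred_2013, pred_2023), 2)
```

With the fitted model object available, predict(model_final, newdata = data.frame(year = c(2013, 2023))) would give the same log10 values without retyping coefficients.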

The adjusted R² of the final model is 0.061, so year alone explains only about 6% of the variation in log ransom cost. The remaining 94% or so is variation that the model cannot account for, attributable to factors this one-predictor model does not capture. For comparison, the full model with year, sector, organisation size, and revenue had an adjusted R² of 0.236.
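The adjusted R² values printed by summary() can be reproduced from the multiple R², the sample size and the number of estimated slopes, which is a useful check when reading them as percentages. A minimal sketch, with adj_r2 as a helper name introduced for the example and all inputs copied from the outputs above:

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)
# n = number of rows, p = number of estimated slopes (excluding the intercept)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Final model: R2 = 0.07057, 98 rows, 1 slope (year)
adj_final <- adj_r2(0.07057, n = 98, p = 1)

# Full model: R2 = 0.3777, 98 rows, 18 slopes
adj_full <- adj_r2(0.3777, n = 98, p = 18)

round(c(final = adj_final, full = adj_full), 4)
```

Both results match the printed summaries (0.06089 and 0.236), confirming that the final model's value is about 6 percent.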

Residuals vs Fitted

This plot indicates some non-linearity in the relationship: the model does not capture the full pattern in the data.

Normal Q-Q

The plot suggests that the residuals are approximately normally distributed for the bulk of the data.

# This chunk compares the two nested models to test whether adding organisation_size improves the fit.
anova(model_step2, model_step1)

Analysis of Variance Table

Model 1: log_ransom ~ year + revenue
Model 2: log_ransom ~ year + organisation_size + revenue
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     95 89.190                                  
2     90 70.821  5    18.369 4.6688 0.0007781 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
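The F statistic in this table can be rebuilt from the residual sums of squares, which shows what the comparison is measuring. A minimal sketch, with every input copied from the ANOVA table above and variable names chosen for the example:

```r
# Residual sum of squares and residual df for each model
rss_reduced <- 89.190; df_reduced <- 95   # log_ransom ~ year + revenue
rss_larger  <- 70.821; df_larger  <- 90   # ... + organisation_size

# Extra variation explained per added parameter, relative to the
# larger model's residual variance
extra_df <- df_reduced - df_larger
f_stat   <- ((rss_reduced - rss_larger) / extra_df) / (rss_larger / df_larger)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, extra_df, df_larger, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

A small p-value here means the model that includes organisation_size fits significantly better than the one without it, which is worth weighing against the decision to drop that predictor in the backward steps.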

Scatterplot Visualization

# Scatterplot 1: Year vs Ransom Cost.
# Are ransomware demands growing year over year? The lm trend line directly answers this.
p_s1 <- ggplot(Ransomware_Attack,
               aes(x = year, y = log_ransom,
                   color = sector)) +
  geom_jitter(alpha = 0.55, width = 0.25, size = 2.5) +
  geom_smooth(method   = "lm", formula  = y ~ x, se = FALSE, color = "gray20", linetype = "dashed") +
  scale_y_continuous(
    labels = function(x) paste0("$", round(10^x, 2), "M") # back-transform log10 values to USD millions for the axis labels
  ) +
  scale_color_viridis_d(option= "turbo", name = "Sector") +
  labs(
    title   = "Are Ransomware Demands Growing Over Time?",
    x       = "Year",
    y       = "Ransom Cost (USD millions)",
    color   = "Sector",
    caption = "Source: Information is Beautiful — Ransomware Attacks"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

p_s1

ggplotly(p_s1)
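The y-axis labelling function used in this plot can be tried on its own to see how log10 positions turn into dollar labels. A minimal sketch; label_usd_millions is a name given to the anonymous function here purely for illustration.

```r
# Same expression as the labels argument in scale_y_continuous() above:
# undo the log10 transform, round to 2 decimals, wrap as "$...M"
label_usd_millions <- function(x) paste0("$", round(10^x, 2), "M")

# Axis positions -2, 0 and 1 on the log10 scale
axis_labels <- label_usd_millions(c(-2, 0, 1))
axis_labels
```

So equal steps along the axis correspond to tenfold steps in ransom cost, even though the labels read in ordinary dollars.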

# Scatterplot 2: Organisation Size vs Ransom Cost.
# Do ransomware attackers demand more from larger organisations?
# organisation_size is an ordered factor: 1 (smallest) → 300 (largest).
p_s2 <- ggplot(Ransomware_Attack,
               aes(x    = organisation_size,
                   y    = log_ransom,
                   color = ransom_paid)) +
  geom_jitter(alpha = 0.6, width = 0.2, size = 2.5) +
  scale_color_manual(
    values = c("ransom paid" = "#0307fc",
               "refused"     = "#fc0398",
               "unknown"     = "#03fc1c"),
    name   = "Ransom Paid?"
  ) +
  labs(
    title    = "Organisation Size vs. Ransom Cost",
    x        = "Organisation Size",
    y        = "Ransom Cost (USD millions)",
    caption  = "Source: Information is Beautiful — Ransomware Attacks"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

p_s2

ggplotly(p_s2)

Closing Essay

How the Data Was Cleaned

The raw dataset from the Information is Beautiful Google Sheet presented several structural challenges that required cleaning decisions. The file had a stray title row (“Top ransomware attacks”) at the very top, which I handled with skip = 1 inside read_csv() so that the actual column headers loaded as the column names. I then standardized the names with names(Ransomware_Attack) <- tolower(names(Ransomware_Attack)) and names(Ransomware_Attack) <- gsub(" ", "_", names(Ransomware_Attack)). I also used Ransomware_Attack <- Ransomware_Attack[-1, ] to drop the row directly below the column names, which held display labels rather than an attack record. Even then, the data was not ready: it still had a large number of missing values. I filtered out rows with missing values in the variables I used, which left 98 attacks, and applied log10 to the ransom cost to reduce its right skew.

What the Visualizations Show

The two scatterplots together point to a clear and escalating ransomware threat. The Organisation Size vs. Ransom Cost plot complicates the assumption that larger organisations are always hit harder: the smallest organisations (size 1) show the widest spread of ransom costs, while the largest (size 100–300) face fewer but consistently higher demands, reaching tens of millions of dollars. One possible reason is that bigger organisations have invested more in sophisticated security measures. The year plot tells the more alarming story: ransom demands have grown dramatically over time, with the dashed trend line climbing from roughly $0.1M in 2013 to $1M–$10M by 2023.

What I wish I could have done

I would have liked to include a geographic map, built with rnaturalearth and sf, showing total ransom costs by country and connecting the financial damage directly to specific regions. However, the location column in this dataset contains many multi-country strings (e.g., “USA, Canada, Ireland”), which would require extensive string parsing to decompose cleanly.
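As a starting point for that parsing, the sketch below splits comma-separated location strings into one entry per country with base R. The location vector is a made-up sample built from values seen in the dataset, and the approach assumes countries are always separated by commas.

```r
# Sample location values, including one multi-country string
location <- c("USA", "USA, Canada, Ireland", "Worldwide")

# Split on commas, flatten to one vector, and trim the leftover spaces
countries <- trimws(unlist(strsplit(location, ",", fixed = TRUE)))
countries

# Count how often each country appears
country_counts <- table(countries)
country_counts
```

This only handles the string splitting. Entries such as "Worldwide" and the question of how to divide one ransom amount across several countries would still need a decision before mapping.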