Project_2

Author

Nseyo O

Splunk SIEM dashboard Example: https://www.dnsstuff.com/siem-tools

For my second project, I chose to use an already visualized dataset. As I am not creating anything new, I will be trying to use basic cybersecurity principles to find important breaches, a method, and maybe a visualization that can show a possible solution.

Massive breaches have been visualized by David McCandless & Tom Evans on Information is Beautiful. Regardless of your cybersecurity system, breaches and response times indicate an organization’s cybersecurity strength. This dataset includes 489 publicly reported data-breach incidents from the early 2000s to 2024 across various industries worldwide.

What concerns us about these breaches? Based on the levels of breaches provided, I will examine what is most important according to cybersecurity standards.

Original Visualization: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/ Dataset: https://docs.google.com/spreadsheets/d/1i0oIJJMRG-7t1GT-mr4smaTTU7988yXVz8nPlwaJ8Xk/edit?gid=2#gid=2

First, load the required libraries and data set.

#| message: false
#| warning: false
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

setwd("/Users/oworenibanseyo/Desktop/Data 110 2025/Datasets")
breaches <- read_csv("breaches_101.csv")

New names:
Rows: 489 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(12): organisation, alternative name, records lost, date, story, sector,... dbl
(3): year, data sensitivity, ID lgl (1): ...12
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...12`

head(breaches)

# A tibble: 6 × 16
  organisation `alternative name` `records lost`  year date  story sector method
  <chr>        <chr>              <chr>          <dbl> <chr> <chr> <chr>  <chr> 
1 <NA>         <NA>               <NA>              NA <NA>  <NA>  <NA>   <NA>  
2 Kaiser Perm… <NA>               13,400,000      2024 Apr … A le… health oops! 
3 Ticketmaster <NA>               560,000,000     2024 Jun … Hack… media  hacked
4 Stanford Un… <NA>               27,000          2023 May … The … acade… hacked
5 Cooler Mast… <NA>               500,000         2024 May … Thre… tech   hacked
6 Financial B… FBCS               3,200,000       2024 Feb … A U.… tech   hacked
# ℹ 8 more variables: `interesting story` <chr>, `data sensitivity` <dbl>,
#   `displayed records` <chr>, ...12 <lgl>, `source name` <chr>,
#   `1st source link` <chr>, `2nd source link` <chr>, ID <dbl>

Looking at the dataset, you can see that it includes the organization, method of the breach, records stolen, sources, and more.

We have 489 observations and 16 variables in the dataset to work with.

First i’ll remove any spaces from the column names.

names(breaches) <- gsub(" ","",names(breaches))

Then i filter to drop the variables we are not going to use.

breaches_df <- breaches |>
  select(
    organisation, recordslost, year, sector,  method,datasensitivity,sourcename
  ) |>
  drop_na() 

glimpse(breaches_df)

Rows: 487
Columns: 7
$ organisation    <chr> "Kaiser Permanente", "Ticketmaster", "Stanford Univers…
$ recordslost     <chr> "13,400,000", "560,000,000", "27,000", "500,000", "3,2…
$ year            <dbl> 2024, 2024, 2023, 2024, 2024, 2024, 2024, 2024, 2023, …
$ sector          <chr> "health", "media", "academic", "tech", "tech", "financ…
$ method          <chr> "oops!", "hacked", "hacked", "hacked", "hacked", "hack…
$ datasensitivity <dbl> 3, 3, 2, 2, 2, 3, 1, 2, 2, 2, 2, 4, 2, 2, 4, 4, 3, 2, …
$ sourcename      <chr> "Bleeping Computer", "BBC", "Slashdot", "Bleeping Comp…

Now, in our newly cleaned table, you can see that our dataset is more understandable. Now, let’s filter by sensitivity. This original CSV is labeled with a sensitivity scale from 1 to 5.

I will be grouping these into PII and SPII categories. Email and personal information fall under personally identifiable information, while health records and full details will fall under sensitive personally identifiable information.

Sensitivity levels 1. Email address/Online information 2. Personal details 3. Credit card information 4. Health and other personal records 5. Full details

Remember

PII: (sensitivity numbers 1,2,3): Personally Identifiable Information, known as PII, is any info used to identify (infer) on an individual’s identity. Includes full name, date of birth, physical address, phone number, email, IP address and similar info.

SPII: (sensitivity numbers 4 and 5): Sensitive Personally Identifiable information is handled or supposed to be handled with stricter guidelines. These include social security number, financial information, medical information, and biometrics such as fingerprints and facial recognition. These types of information can cause severe damage.

A lot of the sources are viable sources for reporting information, plus I note some specifically used by those in the cybersecurity domain such as “Krebs on Security”

Here, I’ll isolate the most common method for these data breaches, introducing a function called the get mode function, which didn’t give me character error issues.

get_mode <- function(x) {
  ux <- unique(x)
  ux[ which.max(tabulate(match(x, ux))) ]
}


company_summary <- breaches_df |>

  mutate(datasensitivity = as.numeric(datasensitivity)) |>
  group_by(organisation) |>
  summarise(
    breach_count         = n(),
    most_common_method   = get_mode(method)
  ) |>
  arrange(desc(breach_count))


company_summary |> slice_max(breach_count, n = 10)

# A tibble: 24 × 3
   organisation breach_count most_common_method
   <chr>               <int> <chr>             
 1 Facebook                5 hacked            
 2 T-Mobile                4 hacked            
 3 Twitter                 4 hacked            
 4 AOL                     3 hacked            
 5 AT&T                    3 hacked            
 6 Citigroup               3 oops!             
 7 Microsoft               3 hacked            
 8 Uber                    3 hacked            
 9 Yahoo                   3 hacked            
10 Acer                    2 hacked            
# ℹ 14 more rows

By the breach count, I will plot a bar graph to show the top 10 organizations and their methods of data loss, by the breach count given on the Y-axis. It’s on the bottom, as I will flip the coordinates to have the names fit the plot.

top10 <- company_summary |>
  slice_max(breach_count, n = 10)

ggplot(top10, aes(
      x = reorder(organisation, breach_count),
      y = breach_count,
      fill = most_common_method
    )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Organizations by Number of Breaches",
    x     = "",
    y     = "Breach Count",
    caption = "Source:Various networks & https://informationisbeautiful.net/",
    fill  = "Common Method"
  ) +
  theme_minimal()

The “hacking” method seems to be prevalent, and so many ways at that too.

Now to look by records lost.

records_summary <- breaches_df |>
  mutate(
    recordslost = parse_number(recordslost) 
  ) |>
  group_by(organisation) |>
  summarise(
    total_lost = sum(recordslost, na.rm = TRUE)
  ) |>
  arrange(desc(total_lost))


records_summary |>
  slice_max(total_lost, n = 10) |>
  ggplot(aes(
    x = reorder(organisation, total_lost),
    y = total_lost / 1e6
  )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Organizations by Total Records Lost",
    x     = "",
    y     = "Records Lost (Millions)",
  caption = "Source:Various networks & https://informationisbeautiful.net/"
  ) +
  theme_minimal()

Will i be able to find association with the breach count by records lost? here i will create a linear model to attempt to find a relationship

raw_summary <- breaches_df |>
  mutate(recordslost = readr::parse_number(recordslost)) |>
  group_by(organisation) |>
  summarise(
    breach_count = n(),
    total_lost   = sum(recordslost, na.rm = TRUE)
  )


company_summary <- breaches_df |>
  mutate(datasensitivity = as.numeric(datasensitivity)) |>
  group_by(organisation) |>
  summarise(
    mode_sensitivity = get_mode(datasensitivity)
  )


model_df <- raw_summary |>
  inner_join(company_summary, by = "organisation")

model_1 <- lm(total_lost ~ breach_count, data = model_df)
summary(model_1)


Call:
lm(formula = total_lost ~ breach_count, data = model_df)

Residuals:
       Min         1Q     Median         3Q        Max 
-293549299  -27424625  -25230978  -12730978  972269022 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -99208741   14811079  -6.698 6.34e-11 ***
breach_count 126939719   12858959   9.872  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 107600000 on 448 degrees of freedom
Multiple R-squared:  0.1787,    Adjusted R-squared:  0.1768 
F-statistic: 97.45 on 1 and 448 DF,  p-value: < 2.2e-16

The R-squared shows that only 18% of the variability in records lost is explained by the breach count.

hist(residuals(model_1))

A histogram of the residuals shows quite severe outliers.

Now i will attempt to fit a line to find assciation between the two variables.

ggplot(model_df, aes(x = breach_count, y = total_lost/1e6)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Number of Breaches",
    y = "Total Records Lost (Millions)",
    caption = "Source:Various networks & https://informationisbeautiful.net/",
    title = "Records Lost vs Breach Count"
  ) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

According to this linear model, there is little relationship between the amount of records lost and breach count. The model is filled with outliers and there isn’t a consistent spread. It is possible that with one breach, an organization might lose more records than another organization that has suffered multiple breaches.

Now I will attempt to look at the breaches over time. What will the plot show? I will group sensitivity as mentioned earlier as PII and SPII

breaches_classified <- breaches_df |>
  mutate(
    datasensitivity = as.numeric(datasensitivity),
    classification = case_when(
      datasensitivity %in% c(1, 2, 3) ~ "SPII",
      datasensitivity %in% c(4, 5) ~ "PII",
      TRUE ~ NA_character_ 
    )
  ) |>
  filter(!is.na(classification))

breaches_classified |>
  group_by(year, classification)  |>
  summarise(records = sum(readr::parse_number(recordslost), na.rm = TRUE)) |>
  ggplot(aes(x = year, y = records / 1e6, fill = classification)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::comma_format(suffix = "M")) +
  labs(
    title = "PII vs SPII Records Lost Over Time",
    x = "Year", y = "Records Lost (Millions)",
    fill = "Classification",
    caption = "Source:Various networks & https://informationisbeautiful.net/"
  ) +
  theme_minimal(base_size = 13)

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Now to see how our sensitive data has been treated by larde organizations over the years.

ts_df <- breaches_classified |>
  filter(!is.na(year)) |>
  group_by(year, classification) |>
  summarise(breach_count = n(), .groups = "drop")


ggplot(ts_df, aes(x = year, y = breach_count, color = classification)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(min(ts_df$year), max(ts_df$year), by = 1)) +
  labs(
    title    = "Time Series: Number of Breaches per Year",
    subtitle = "Grouped by PII vs SPII",
    x        = "Year",
    y        = "Breach Count",
     caption = "Source:Various networks & https://informationisbeautiful.net/",
    color    = "Classification"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x     = element_text(angle = 45, hjust = 1),
    plot.title      = element_text(face = "bold"),
    legend.position = "bottom"
  )

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Somehow, it seems to be getting worse over the years, with more availability of sensitive data being targeted for hacks.

Here i create a vector focusing on the most important sectors

important_sectors <- c("health", "finance", "web", "government", "tech")

Here I will group by the sectors, summing up the total records lost and breaches.

sector_summary2 <- breaches_df |>
  filter(sector %in% important_sectors) |>
  mutate(records = parse_number(recordslost)) |>
  group_by(sector) |>
  summarise(
    total_breaches = n(),
    total_records_lost = sum(records, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total_breaches))

print(sector_summary2)

# A tibble: 5 × 3
  sector     total_breaches total_records_lost
  <chr>               <int>              <dbl>
1 web                   118         7496799873
2 government             52         1230757573
3 health                 51          220064929
4 finance                49         1686043228
5 tech                   33         1622807720

plot_df <- sector_summary2 |>
  mutate(records_billion = total_records_lost / 1e9)

ggplot(plot_df, aes(
      x    = reorder(sector, records_billion),
      y    = records_billion,
      fill = sector
    )) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(
    labels = label_number(scale = 1, suffix = " B", accuracy = 0.1),
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = "Total Records Lost by Sector",
    x     = NULL,
    y     = "Records Lost (Billions)",
    caption = "Source:Various networks & https://informationisbeautiful.net/"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.y = element_text(face = "bold"),
    plot.title  = element_text(face = "bold", hjust = 0.5)
  )

This data set, however collected, does not represent these companies’ security postures over the years fully, but it does show something, as we might fear getting important health or financial information stolen. This plot simply shows that a lot of these breaches were web-based. Nothing done online is ever 100% safe according to leading cyber security experts. For this, I recommend an advanced SIEM dashboard and tools to catch attacks in real-time thanks to machine learning. SIEM is an application that collects and analyzes log data to monitor critical activities in an organization. The dashboard can analyze trends and catch reported attacks to provide proper mitigating parameters. Unlike normal physical attacks, web-based breaches can take hours to days for an organization to realize they’ve had a data breach.

Also a caution to details put online as the social media apps breached, such as Facebook and Twitter, fall in the web-based sector. At one point, my Facebook account had enough information for one to effortlessly find my home, past childhood details, and more information that can be used for harm.

Sources

https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode