For my second project, I chose to work with a dataset that has already been visualized. Since I am not creating anything new, I will apply basic cybersecurity principles to identify the most important breaches and their methods, and perhaps produce a visualization that points toward a possible solution.
Massive breaches have been visualized by David McCandless & Tom Evans on Information is Beautiful. Regardless of your cybersecurity system, breaches and response times indicate an organization’s cybersecurity strength. This dataset includes 489 publicly reported data-breach incidents from the early 2000s to 2024 across various industries worldwide.
What concerns us about these breaches? Based on the sensitivity levels provided, I will examine which breaches matter most according to cybersecurity standards.
Original Visualization: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
Dataset: https://docs.google.com/spreadsheets/d/1i0oIJJMRG-7t1GT-mr4smaTTU7988yXVz8nPlwaJ8Xk/edit?gid=2#gid=2
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
New names:
• `` -> `...12`
Rows: 489 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): organisation, alternative name, records lost, date, story, sector,...
dbl  (3): year, data sensitivity, ID
lgl  (1): ...12
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(breaches)
# A tibble: 6 × 16
organisation `alternative name` `records lost` year date story sector method
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 <NA> <NA> <NA> NA <NA> <NA> <NA> <NA>
2 Kaiser Perm… <NA> 13,400,000 2024 Apr … A le… health oops!
3 Ticketmaster <NA> 560,000,000 2024 Jun … Hack… media hacked
4 Stanford Un… <NA> 27,000 2023 May … The … acade… hacked
5 Cooler Mast… <NA> 500,000 2024 May … Thre… tech hacked
6 Financial B… FBCS 3,200,000 2024 Feb … A U.… tech hacked
# ℹ 8 more variables: `interesting story` <chr>, `data sensitivity` <dbl>,
# `displayed records` <chr>, ...12 <lgl>, `source name` <chr>,
# `1st source link` <chr>, `2nd source link` <chr>, ID <dbl>
Looking at the dataset, you can see that it includes the organization, method of the breach, records stolen, sources, and more.
We have 489 observations and 16 variables in the dataset to work with.
First, I'll remove any spaces from the column names.
names(breaches) <- gsub(" ", "", names(breaches))
Then I drop the variables we are not going to use.
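The column-dropping step itself is not shown, so here is a minimal sketch of what it might look like. It assumes the cleaned table is stored as `breaches_df` (the name used in later code) and that the columns kept are the ones the later code refers to. The toy rows and the exact column list are illustrative assumptions, not the real data.

```r
library(dplyr)

# Toy stand-in for the imported sheet (made-up rows, not the real file).
breaches <- tibble(
  organisation       = c("Org A", "Org B", "Org C"),
  `records lost`     = c("13,400,000", "27,000", "500,000"),
  year               = c(2024, 2023, 2024),
  sector             = c("health", "tech", "tech"),
  method             = c("oops!", "hacked", "hacked"),
  `data sensitivity` = c(4, 1, 2),
  `source name`      = c("Example News", "Example Blog", "Example Wire")
)

# Same cleaning step as in the report: strip spaces from the column names.
names(breaches) <- gsub(" ", "", names(breaches))

# Keep only the columns used later (assumed list).
breaches_df <- breaches |>
  select(organisation, recordslost, year, sector, method, datasensitivity)

names(breaches_df)
```

`select()` is used here rather than `filter()` because the goal is to drop columns, not rows.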
The cleaned table is now easier to read. Next, let's look at sensitivity. The original CSV labels each breach on a sensitivity scale from 1 to 5.
I will group these levels into PII and SPII categories. Email addresses, personal details, and credit card information (levels 1 to 3) fall under personally identifiable information, while health records and full details (levels 4 and 5) fall under sensitive personally identifiable information.
Sensitivity levels
1. Email address/Online information
2. Personal details
3. Credit card information
4. Health and other personal records
5. Full details
Remember
PII (sensitivity levels 1, 2, and 3): Personally Identifiable Information, known as PII, is any information that can be used to identify or infer an individual's identity. It includes full name, date of birth, physical address, phone number, email, IP address, and similar information.
SPII (sensitivity levels 4 and 5): Sensitive Personally Identifiable Information is handled, or is supposed to be handled, under stricter guidelines. It includes social security numbers, financial information, medical information, and biometrics such as fingerprints and facial recognition. Exposure of this kind of information can cause severe damage.
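The grouping described above could be coded along these lines. This is a sketch under assumptions: the sensitivity column is taken to be `datasensitivity` (the original name with its space removed), the output table and column are named `breaches_classified` and `classification` to match the later plotting code, and the rows are made up for illustration.

```r
library(dplyr)

# Toy sensitivity scores (illustrative, not rows from the real dataset).
breaches_df <- tibble(
  organisation    = c("Org A", "Org B", "Org C", "Org D"),
  datasensitivity = c(1, 3, 4, 5)
)

# Levels 1-3 -> PII, levels 4-5 -> SPII; a missing score stays NA.
breaches_classified <- breaches_df |>
  mutate(
    classification = case_when(
      datasensitivity %in% 1:3 ~ "PII",
      datasensitivity %in% 4:5 ~ "SPII"
    )
  )

breaches_classified
```

Leaving unmatched values as `NA` keeps rows without a sensitivity score from being silently counted in either group.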
Many of the sources are credible outlets for reporting this kind of information, and some are used specifically by people in the cybersecurity domain, such as “Krebs on Security”.
Here, I'll isolate the most common method for these data breaches using a small helper, the get mode function, which works on character columns without throwing errors.
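The helper is not shown in the report, so the following is only a sketch of one way it could be written. The name `get_mode` and the toy rows are assumptions. The `company_summary` table and its `breach_count` and `most_common_method` columns are named to match the plotting code that follows.

```r
library(dplyr)

# Hypothetical helper: most frequent value of a vector (works for characters).
# NAs are ignored; ties go to the value that appears first.
get_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy data (made up for illustration).
breaches_df <- tibble(
  organisation = c("Org A", "Org A", "Org A", "Org B"),
  method       = c("hacked", "hacked", "oops!", "oops!")
)

# One row per organisation: number of breaches and the most common method.
company_summary <- breaches_df |>
  group_by(organisation) |>
  summarise(
    breach_count       = n(),
    most_common_method = get_mode(method),
    .groups = "drop"
  ) |>
  arrange(desc(breach_count))

company_summary
```

Base R's `mode()` returns an object's storage type rather than its most frequent value, which is why a custom helper is needed here.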
Next, I will plot a bar graph of the top 10 organizations by breach count, colored by their most common method of data loss. Breach count is mapped to the y-axis, but it appears along the bottom because I flip the coordinates so the organization names fit the plot.
top10 <- company_summary |>
  slice_max(breach_count, n = 10)

ggplot(top10, aes(
  x = reorder(organisation, breach_count),
  y = breach_count,
  fill = most_common_method
)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Organizations by Number of Breaches",
    x = "",
    y = "Breach Count",
    caption = "Source:Various networks & https://informationisbeautiful.net/",
    fill = "Common Method"
  ) +
  theme_minimal()
The “hacked” method appears to be the most prevalent, and it covers many different attack techniques.
Now let's look at records lost.
records_summary <- breaches_df |>
  mutate(recordslost = parse_number(recordslost)) |>
  group_by(organisation) |>
  summarise(total_lost = sum(recordslost, na.rm = TRUE)) |>
  arrange(desc(total_lost))

records_summary |>
  slice_max(total_lost, n = 10) |>
  ggplot(aes(
    x = reorder(organisation, total_lost),
    y = total_lost / 1e6
  )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Organizations by Total Records Lost",
    x = "",
    y = "Records Lost (Millions)",
    caption = "Source:Various networks & https://informationisbeautiful.net/"
  ) +
  theme_minimal()
Is there an association between breach count and records lost? Here I will fit a linear model to try to find a relationship.
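The model uses a table called `model_df`, whose construction is not shown. One plausible way to build it, sketched here as an assumption, is to join the per-organization breach counts with the per-organization record totals. The toy values are made up.

```r
library(dplyr)

# Toy per-organisation summaries (illustrative values only).
company_summary <- tibble(
  organisation = c("Org A", "Org B", "Org C"),
  breach_count = c(3L, 1L, 1L)
)
records_summary <- tibble(
  organisation = c("Org A", "Org B", "Org C"),
  total_lost   = c(5e6, 2e8, 1e5)
)

# Assumed construction: one row per organisation with both measures.
model_df <- company_summary |>
  inner_join(records_summary, by = "organisation")

model_df
```

An inner join keeps only organizations present in both summaries, so every row has both a breach count and a record total for the model.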
model_1 <- lm(total_lost ~ breach_count, data = model_df)
summary(model_1)
Call:
lm(formula = total_lost ~ breach_count, data = model_df)

Residuals:
       Min         1Q     Median         3Q        Max
-293549299  -27424625  -25230978  -12730978  972269022

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -99208741   14811079  -6.698 6.34e-11 ***
breach_count 126939719   12858959   9.872  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 107600000 on 448 degrees of freedom
Multiple R-squared:  0.1787, Adjusted R-squared:  0.1768
F-statistic: 97.45 on 1 and 448 DF,  p-value: < 2.2e-16
The R-squared shows that only 18% of the variability in records lost is explained by the breach count.
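For readers who want to see where that number comes from, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this by hand on made-up numbers, not on the breach data. The object names are illustrative.

```r
# Toy data (made up for illustration; not the breach dataset).
toy <- data.frame(
  breach_count = c(1, 1, 2, 3, 4),
  total_lost   = c(2, 9, 4, 12, 8)
)

fit <- lm(total_lost ~ breach_count, data = toy)

# R-squared = 1 - SS_residual / SS_total
ss_res    <- sum(residuals(fit)^2)
ss_tot    <- sum((toy$total_lost - mean(toy$total_lost))^2)
r_squared <- 1 - ss_res / ss_tot

# The hand-computed value should agree with the one reported by summary().
all.equal(r_squared, summary(fit)$r.squared)
```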
hist(residuals(model_1))
A histogram of the residuals shows quite severe outliers.
Now I will fit a line to look for an association between the two variables.
ggplot(model_df, aes(x = breach_count, y = total_lost / 1e6)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Number of Breaches",
    y = "Total Records Lost (Millions)",
    caption = "Source:Various networks & https://informationisbeautiful.net/",
    title = "Records Lost vs Breach Count"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
According to this linear model, the relationship between total records lost and breach count is weak: the slope is statistically significant, but breach count explains only about 18% of the variation. The data are full of outliers, and the spread is not consistent. A single breach can cost one organization more records than another organization loses across multiple breaches.
Now I will look at the breaches over time. What will the plot show? I will group sensitivity into PII and SPII, as described earlier.
breaches_classified |>
  group_by(year, classification) |>
  summarise(records = sum(readr::parse_number(recordslost), na.rm = TRUE)) |>
  ggplot(aes(x = year, y = records / 1e6, fill = classification)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::comma_format(suffix = "M")) +
  labs(
    title = "PII vs SPII Records Lost Over Time",
    x = "Year", y = "Records Lost (Millions)",
    fill = "Classification",
    caption = "Source:Various networks & https://informationisbeautiful.net/"
  ) +
  theme_minimal(base_size = 13)
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Now let's see how our sensitive data has been treated by large organizations over the years.
ts_df <- breaches_classified |>
  filter(!is.na(year)) |>
  group_by(year, classification) |>
  summarise(breach_count = n(), .groups = "drop")

ggplot(ts_df, aes(x = year, y = breach_count, color = classification)) +
  geom_line(linewidth = 1.2) +  # `linewidth` replaces `size` for lines (deprecated in ggplot2 3.4.0)
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(min(ts_df$year), max(ts_df$year), by = 1)) +
  labs(
    title = "Time Series: Number of Breaches per Year",
    subtitle = "Grouped by PII vs SPII",
    x = "Year",
    y = "Breach Count",
    caption = "Source:Various networks & https://informationisbeautiful.net/",
    color = "Classification"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold"),
    legend.position = "bottom"
  )
The situation appears to be getting worse over the years, with more sensitive data available to be targeted by attackers.
Here I create a vector focusing on the most important sectors.
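The vector and the plot built from it are not shown, so this is only a sketch of the idea. The sector labels in `key_sectors` are assumptions chosen to match the sectors discussed in the text (web, health, finance). The table name `key_breaches` and the toy rows are hypothetical.

```r
library(dplyr)

# Assumed sector labels; the author's actual vector is not shown.
key_sectors <- c("web", "health", "finance")

# Toy data (made up for illustration).
breaches_df <- tibble(
  organisation = c("Org A", "Org B", "Org C", "Org D"),
  sector       = c("web", "health", "retail", "finance")
)

# Keep only breaches in the sectors of interest, then count per sector.
key_breaches <- breaches_df |>
  filter(sector %in% key_sectors)

key_breaches |>
  count(sector, sort = TRUE)
```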
This dataset, however it was collected, does not fully represent these companies' security postures over the years, but it does tell us something, especially given how much we fear having important health or financial information stolen. This plot simply shows that many of these breaches were web-based. According to leading cybersecurity experts, nothing done online is ever 100% safe. For this, I recommend an advanced SIEM (security information and event management) dashboard, with tools that use machine learning to catch attacks in real time. A SIEM is an application that collects and analyzes log data to monitor critical activity in an organization. The dashboard can analyze trends and flag reported attacks so that appropriate mitigations can be applied. Unlike typical physical attacks, web-based breaches can take an organization hours or even days to detect.
A word of caution, too, about the details we put online: breached social media apps such as Facebook and Twitter fall in the web-based sector. At one point, my Facebook account held enough information for someone to effortlessly find my home, details of my childhood, and more that could be used to cause harm.