DATA 110 Week 7 Project 1: Have Data Breaches Increased in Size Over the Last Few Decades?

Author

Emilio Difilippantonio

Data Breaches Data Set

This data set is about select data breaches from 2004 to 2019. It gives information such as the entity whose data was breached, the number of records lost, the year in which the breach occurred, the story behind the breach, the sector in which the breach occurred (retail, government, financial, etc.), the method of the data breach, the data sensitivity level, and the source of the information about the data breach (with links). I will be exploring the correlation between several of these variables, such as the year of the breach, the amount of data breached, the sensitivity level of the breached information, and the sector in which the breach occurred. Data sensitivity is rated on a scale from 1 to 3, with the following meanings:

Level 1 data is public data. How can you breach level 1 data? An example of a level 1 data breach is the first entry in this data set. In June of 2004, a former America Online software engineer sold 92 million screen names and email addresses to spammers, who sent billions of emails to these people. Although people’s screen names and email addresses, when not paired with other information such as real names, are public information, they can still be compiled and used maliciously. In this case, a list of all the information was taken and sold to spammers who misused it.
Level 2 data is data intended for internal use that, while not devastating to individuals or corporations, isn’t intended to be viewed by the public. This can include communications and plans within a company or between individuals.
Level 3 data is the most sensitive data, and includes Social Security numbers, birth dates, financial records, bank account numbers, log-in information, addresses, phone numbers, and other data that can be used to access a person or a corporation or their assets and resources.

Source of data sensitivity level information: DryvIQ

Source of the data set: Information is Beautiful

# Reading in the data set and loading in the necessary packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(GGally)

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

library(ggfortify)
breaches <- read_csv("biggest_data_breaches.csv")

New names:
• `` -> `...15`
• `` -> `...16`
• `` -> `...17`
• `` -> `...18`
• `` -> `...19`
• `` -> `...20`
• `` -> `...21`
• `` -> `...22`
• `` -> `...23`
• `` -> `...24`
• `` -> `...25`
• `` -> `...26`
• `` -> `...27`
• `` -> `...28`
Rows: 349 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (22): Entity, alternative name, story, SECTOR, METHOD, interesting story...
dbl  (2): YEAR, ...24
num  (1): records lost
lgl  (3): ...23, ...25, ...26

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

options(scipen = 0)

# First, I'll clean the data set by making the column names lowercase and removing spaces from them, as well as removing NAs in important columns and removing any objects with a data sensitivity level other than 1, 2, or 3 (the data was entered improperly, so the data frame has inputs in the wrong columns, including numbers in the sensitivity level column that shouldn't be there).
names(breaches) <- tolower(names(breaches))
names(breaches) <- gsub(" ", "_", names(breaches))
breaches$data_sensitivity <- as.numeric(breaches$data_sensitivity)

Warning: NAs introduced by coercion

breaches1 <- breaches |>
  filter(!is.na(year) & !is.na(records_lost) & data_sensitivity %in% c(1,2,3))

# I'm also going to add a "month" column and use it to create an "adjusted year" column. Since two stories in each of June and July had the full name of the month instead of just the three-letter abbreviation, I removed the space after those two abbreviations in the grepl calls. I kept the space after all of the other abbreviations to ensure that grepl doesn't pick up another month mentioned in the story and assign that row the incorrect month (e.g. a story is dated February but also contains the word January; because the grepl calls check for "Jan" before any other month, that row would incorrectly be assigned the value 1 instead of 2).
breaches1 <- breaches1 |> mutate(month =
 ifelse(grepl("Jan ", breaches1$story), 1,
  ifelse(grepl("Feb ", breaches1$story), 2,
   ifelse(grepl("Mar ", breaches1$story), 3,
    ifelse(grepl("Apr ", breaches1$story), 4,
     ifelse(grepl("May ", breaches1$story), 5,
      ifelse(grepl("Jun", breaches1$story), 6,
       ifelse(grepl("Jul", breaches1$story), 7,
        ifelse(grepl("Aug ", breaches1$story), 8,
         ifelse(grepl("Sep ", breaches1$story), 9,
          ifelse(grepl("Oct ", breaches1$story), 10,
           ifelse(grepl("Nov ", breaches1$story), 11,
            ifelse(grepl("Dec ", breaches1$story), 12, NA)
            ))))))))))))
# The year_adjusted column is just the year and month as a decimal value. For example, February of 2008 would be 2008.083, because each month is 0.083 years (rounded) and in February of 2008, one month (January) has already passed.
breaches1 <- breaches1 |> mutate(year_adjusted = year + ((month - 1) / 12))

# Next, I'll remove the extra columns and the "interesting story" column, as it isn't remotely necessary for our analyses and doesn't seem that useful overall
breaches1 <- breaches1[, -c(8, 11, 15:28)]

# Lastly, I'll reorder the columns to my liking
breaches1 <- breaches1[, c("entity", "alternative_name", "records_lost", "displayed_records", "data_sensitivity", "year", "month", "year_adjusted", "story", "sector", "method", "source_name", "1st_source_link", "2nd_source_link")]

# Data analysis
## R-values between year, records lost, and data sensitivity
cor(breaches1$year_adjusted, breaches1$records_lost)

[1] 0.09706118

cor(breaches1$year_adjusted, breaches1$data_sensitivity)

[1] -0.2130653

cor(breaches1$data_sensitivity, breaches1$records_lost)

[1] 0.03847179

# Graphing the relationship between year and records lost in data breaches, with a linear regression line and its standard error band
breaches_graph1 <- breaches1 |> ggplot(aes(x = year_adjusted, y = records_lost)) +
  xlim(2004, 2020) + ylim(0, 3000000000) +
  labs(title = "Correlation between Year and Records Lost\nin Data Breaches",
  caption = "Source: Information is Beautiful") +
  xlab("Year") +
  ylab ("Records Lost in Data Breach") +
  theme_minimal(base_size = 12) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, color = "blue")
breaches_graph1

Warning: Removed 4 rows containing missing values (`geom_smooth()`).

It looks like we have an outlier: the Yahoo data breach that occurred in 2013 (I used the mouse-over tooltip in ggplotly to find what the outlier was, though I had to remove ggplotly from my graphs to render the document).

In December of 2016, the BBC and the New York Times revealed that in 2013, Yahoo had been hacked and 3 billion records containing user names, phone numbers, DOBs, passwords, and security question answers were stolen.
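
The outlier could also be located in code rather than with a tooltip. The following is a minimal sketch in base R, assuming the outlier is simply the row with the largest records_lost value. The data frame toy and its values are hypothetical stand-ins for breaches1, not the real data.

```r
# Hypothetical stand-in for breaches1; the values are illustrative only
toy <- data.frame(
  entity = c("A", "B", "C", "D"),
  year = c(2005, 2013, 2016, 2018),
  records_lost = c(4.0e7, 3.0e9, 1.5e8, 3.8e8)
)

# Index of the row with the most records lost
outlier_row <- which.max(toy$records_lost)

# Inspect the row before dropping it
toy[outlier_row, c("entity", "year", "records_lost")]

# Drop the outlier by its computed position instead of a hard-coded row number
toy_no_outlier <- toy[-outlier_row, ]
```

Computing the position this way keeps working if the row order changes, for example after an earlier filtering step is edited.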

# Let's remove the outlier (row 101 is the 2013 Yahoo breach) and redo the graph
breaches2 <- breaches1[-101,]
breaches_graph2 <- breaches2 |> ggplot(aes(x = year_adjusted, y = records_lost)) +
  xlim(2004, 2020) + ylim(0, 1500000000) +
  labs(title = "Correlation between Year and Records Lost\nin Data Breaches",
  caption = "Source: Information is Beautiful") +
  xlab("Year") +
  ylab ("Records Lost in Data Breach") +
  theme_minimal(base_size = 12) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, color = "blue")
breaches_graph2

Warning: Removed 17 rows containing missing values (`geom_smooth()`).

That’s better!

According to the linear regression line, the average size of the data breaches hasn’t increased much over the years, but we can see on the graph that recently there have been a few major data breaches, the sizes of which are unrivaled by those of previous years.

Let’s explore the math behind this correlation.

# We'll find the equation of the linear regression model, the p-value, and the adjusted r squared value
summary(lm(records_lost ~ year_adjusted, data = breaches1))


Call:
lm(formula = records_lost ~ year_adjusted, data = breaches1)

Residuals:
       Min         1Q     Median         3Q        Max 
 -78220000  -54474513  -34256880  -12042105 2951891806 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.101e+10  7.176e+09  -1.535    0.126
year_adjusted  5.493e+06  3.562e+06   1.542    0.124

Residual standard error: 223600000 on 250 degrees of freedom
Multiple R-squared:  0.009421,  Adjusted R-squared:  0.005459 
F-statistic: 2.378 on 1 and 250 DF,  p-value: 0.1243

The equation that best matches this information is as follows: records lost = -11014448108 + 5493056(year). That would mean that the predicted size of a data breach grows by roughly 5.5 million records per year. Note that this model was fit to breaches1, which still includes the Yahoo outlier.

The p-value for the slope is 0.1243, which isn’t statistically significant at the usual 0.05 level.

The adjusted r squared value is 0.005459, which is exceptionally small. It means that only about 0.5% of the variation in records lost can be explained by our linear regression model.
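
To see what the fitted equation implies, the coefficients reported above can be plugged in directly. This is a small worked sketch; predict_records and zero_year are names introduced here for illustration.

```r
# Coefficients as reported in the equation above
intercept <- -11014448108
slope <- 5493056

# Predicted records lost for a given (decimal) year
predict_records <- function(year) intercept + slope * year

predict_records(2005)  # -870828: a negative number of records
predict_records(2019)  # 76031956: about 76 million records

# Year at which the fitted line crosses zero
zero_year <- -intercept / slope
zero_year              # about 2005.16
```

A line that predicts negative breach sizes for the earliest years is one sign that a straight line is a rough description of this data at best.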

Let’s see what else we can do!

# Next, we'll use a pairs plot (scatterplot matrix) to look at the correlations between all three variables at once
ggpairs(breaches2, columns = c(3, 5, 8))

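There isn’t a correlation between data sensitivity and records lost. The correlations between year and records lost and between year and data sensitivity are larger, but going by the r-values computed above on breaches1 (0.097 and -0.213), both are still weak, and the second is negative: more recent breaches in this data set tend to involve less sensitive data.

Despite this, we’ll keep looking at the relationship between the year and the number of records lost because they are both truly numeric values, whereas data sensitivity is really a three-level category, even though it is stored as a number here.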
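
Because data sensitivity only takes the ordered values 1, 2, and 3, a rank-based correlation is one way to compare it with a numeric variable. The sketch below uses base R's cor() with method = "spearman" on made-up numbers, not on the breach data, and is not part of the original analysis.

```r
# Made-up values for illustration only (not taken from the breach data)
sensitivity <- c(1, 1, 2, 2, 3, 3)
records <- c(10, 20, 300, 400, 5000, 60000)

# Pearson measures linear association on the raw values
pearson <- cor(sensitivity, records)

# Spearman measures monotone association on the ranks
spearman <- cor(sensitivity, records, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

In this toy example the Spearman value is higher than the Pearson value because one very large number dominates the raw values but not the ranks.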

# Next, we'll use some diagnostic plots
autoplot(lm(records_lost ~ year_adjusted, data = breaches2))

There aren’t many outliers, but we can see in the residuals vs fitted graph that as the fitted values increase, many of the residuals increase as well. The normal Q-Q graph shows that the higher theoretical quantiles seem to correlate with higher standardized residuals in a clear pattern. This new information, combined with the fact that several more recent data breaches were very large and that according to our previous graphs, the predicted size of data breaches prior to 2007 or 2008 is negative, makes me think that a linear model may not be the best fit.

# We'll zoom in on the graph we made at the beginning to see if we can see an increase in data breach sizes in the 2000s, which would be hard to see on the other graph due to the large y-axis scale.
breaches_graph3 <- breaches2 |> ggplot(aes(x = year_adjusted, y = records_lost)) +
  xlim(2004, 2020) + ylim(0, 500000000) +
  labs(title = "Correlation between Year and Records Lost\nin Data Breaches",
  caption = "Source: Information is Beautiful") +
  xlab("Year") +
  ylab ("Records Lost in Data Breach") +
  theme_minimal(base_size = 12) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, color = "blue")
breaches_graph3

Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 2 rows containing missing values (`geom_point()`).

Warning: Removed 12 rows containing missing values (`geom_smooth()`).

There does indeed seem to be some sort of increase in data breach sizes as the years go on that we couldn’t see in the previous graph.

Official Visualization (to be graded)

# We'll zoom back out and change the method of the smoothing line to "loess", which stands for "locally estimated scatterplot smoothing". This means that it won't force the line to fit a certain model, and will instead allow us to see which model, if any, best fits the data. We'll see what happens...
# But first, I have to change data sensitivity to a factor in order to color-code the data points
breaches2$data_sensitivity <- as.factor(breaches2$data_sensitivity)
# Ok, we're ready. Without further ado, I give you my final graph.........
breaches_graph4 <- breaches2 |> ggplot(aes(x = year_adjusted, y = records_lost, color = data_sensitivity)) +
  xlim(2004, 2020) + ylim(0, 1500000000) +
  labs(title = "Correlation between Year and Records Lost\nin Data Breaches",
    caption = "Source: Information is Beautiful",
    color = "Data Sensitivity Level") +
  xlab("Year of the Data Breach") +
  ylab("Records Lost in the Data Breach") +
  theme_minimal(base_size = 12) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  theme(legend.position = c(0.2, 0.8)) +
  scale_color_discrete(labels = c("Level 1 (low)", "Level 2 (medium)", "Level 3 (high)")) +
  guides(color = guide_legend(override.aes = list(alpha = 1)))

# Source for the guides() line that sets the legend alpha: https://stackoverflow.com/questions/5290003/how-to-set-legend-alpha-with-ggplot2
breaches_graph4

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Umm………

Ok.

I really thought that an exponential equation would best fit this data, but based on the loess curve in the graph above, that isn’t the case. I find that really surprising: I thought that as the amount of compiled data increased exponentially in recent years, so would the size of data breaches. It seems, however, that I was mistaken. That’s okay, though. The goal of investigating data isn’t to be right; it’s to learn new things. I learned that the size of data breaches in recent years hasn’t increased as much as I previously thought. This might be due to better security or the difficulty and impracticality of stealing exorbitantly large amounts of data. It is also possible that as the amount of data being stored increased, it started being stored in separate locations. This might mean that any hackers would have to hack into several data banks.

To clean the data set, I first made all the column names lowercase and replaced spaces in the column names with underscores. This standardizes the names and makes the data set easier to work with. I then changed the data sensitivity variable to numeric in order to analyze it in my linear regression models. I changed it to a factor for the last graph in order to color the data points based on the sensitivity level of the information that was breached. Next, I filtered out rows that didn’t have a year or the number of records lost, or didn’t have a data sensitivity level of 1, 2, or 3.

The data set wasn’t entered cleanly, so some of the information from columns on the left spilled over into columns on the right. I suspect that commas in the “story” column caused later data in those rows to be displaced towards the right; after all, a csv file is a comma-separated values file. This was likely an oversight by the compiler of the data, and I fixed it easily, though not ideally, by simply removing the rows in which it had occurred.

I also added a “month” column and used it to create an “adjusted year” column. I used the grepl function in conjunction with the mutate and ifelse functions to assign each row the number of the month in which the “story” column says that the breach happened. Since two stories in each of June and July had the full name of the month instead of just the three-letter abbreviation, I removed the space after those two abbreviations in the grepl calls so that those rows are identified as belonging to June or July. I kept the space after all of the other abbreviations to ensure that grepl doesn’t pick up another month mentioned in the story and assign that row the incorrect month (e.g. a story is dated February but also contains the word January; because the grepl calls check for “Jan” before any other month, that row would incorrectly be assigned the value 1 instead of 2).

The year_adjusted column is just the year and month as a decimal value. I computed it by taking the month, subtracting 1 (to indicate the number of months that have passed), and dividing by 12 (to put it in terms of years), then adding that to the year in which the data breach occurred. For example, February of 2008 is 2008.083, because one month (January) has already passed, which is 0.083 years (rounded). I used this in the graphs to place each data breach more accurately in time and to reduce the clutter caused by an abundance of points in a few columns (the points are spread out between the years on the graph instead of all the data points in one year aligning vertically).

I then removed the blank columns (11 and 15 through 28) and the “interesting story” column (8), which simply had a “y” if the story was deemed interesting and was left blank if it wasn’t. Lastly, I reordered the columns for ease of use and my own personal satisfaction.
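
The month and adjusted-year steps described above can also be sketched with a single regular expression that finds the first month named in each story, so the result does not depend on the order in which months are checked. This is a hedged alternative in base R, not the code used in this project: extract_month, adjusted_year, and the example stories are all hypothetical.

```r
# Hypothetical helper: number of the first month named in each story
extract_month <- function(story) {
  # Full names and abbreviations; \\b requires whole words,
  # so a name like "Marriott" is not read as "Mar"
  pattern <- paste0("\\b(", paste(c(month.name, month.abb), collapse = "|"), ")\\b")
  hit <- regexpr(pattern, story, perl = TRUE)  # earliest match, -1 if none
  found <- rep(NA_character_, length(story))
  ok <- !is.na(hit) & hit > 0
  found[ok] <- regmatches(story, hit)
  # The first three letters identify the month
  match(substr(found, 1, 3), month.abb)
}

# Year plus the fraction of the year that has already passed
adjusted_year <- function(year, month) year + (month - 1) / 12

# Made-up stories for illustration
stories <- c(
  "Feb 2014. Disclosed after a January audit.",
  "June 2004. An engineer sold a list of email addresses.",
  "Marriott reported a breach.",
  NA
)
months <- extract_month(stories)
months                  # 2 6 NA NA
adjusted_year(2008, 2)  # 2008.083 (rounded)
```

One caveat of this sketch is that the word "May" used in its ordinary sense would still be read as a month, so the results would need a spot check against the stories.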

I wanted to fit the data to other, non-linear models, but I was running out of time and realized that I was in way over my head. I’m used to being confused as to what is happening with certain pieces of code and their error messages, but trying to find out how to fit data to non-linear models had me completely and utterly lost. I also wish I could have made the last graph interactive, but after over an hour (possibly over 2 hours) of trying to figure out why I couldn’t adjust the legend, I realized that ggplotly was preventing me from doing so. I decided that having a customized legend that said exactly what I wanted it to (and whose alpha value was set to 1, instead of the alpha value of 0.5 that is used for the graph) was more important to me than having an interactive graph. I’ve learned several new functions and deepened my knowledge of others. I’m still surprised that the curve that best fit the data wasn’t exponential, but I think it’s just as important to share when you were wrong as it is to share when you were right.
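
For reference, one common way to test an exponential trend without special non-linear tools is to fit a straight line to the logarithm of the response, since y = a * exp(b * x) becomes log(y) = log(a) + b * x. The sketch below runs on simulated data with a known growth rate. It is an assumption-laden illustration, not a result about the breach data, and every name and number in it is made up.

```r
set.seed(110)

# Simulated data with a known exponential growth rate (illustrative only)
n <- 200
year <- runif(n, 2004, 2020)
true_growth <- 0.3
records <- exp(10 + true_growth * (year - 2004) + rnorm(n, sd = 1))

# An exponential trend is linear on the log scale, so lm() can fit it
fit <- lm(log(records) ~ I(year - 2004))

# The slope estimates the growth rate; exp(slope) is the yearly multiplier
growth <- unname(coef(fit)[2])
yearly_multiplier <- exp(growth)

summary(fit)$adj.r.squared
```

Applied to real data, the same idea would mean comparing the adjusted r squared and residual plots of the log-scale model with those of the plain linear model.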

P.S.

I just finished the project, but when rendering it, the console told me that I could not render with ggplotly, so I had to remove it from my graphs. This isn’t a big deal, but it’s important to note that I used ggplotly to identify the 2013 Yahoo data breach outlier, which I subsequently removed. That is why there isn’t any other code used to identify the outlier.