Google Data Analytics Professional Certificate - Capstone Project

Data Breaches chronology

This is an R Markdown document that is part of the Capstone Project from the Google Data Analytics Professional Certificate. It is originally based on the case study “Data Breaches 2004-2017 (EN)”, written by Carlos E. Jimenez-Gomez (found here: https://www.kaggle.com/estratic). It covers breaches at public and private organizations between 2004 and 2017, focusing on the method, sector, and year of each leak. The dataset used in this case study is updated to 2022.

Scenario

I am a hypothetical junior data analyst working for a business intelligence consultant. I have been at my job for six months, and my boss feels I am ready for more responsibility. He has asked me to lead a project for a brand-new client: a cyber security company.

First phase: Ask

I am exploring data breaches between 2004 and 2022. Some initial questions I might ask are: 1) What kinds of data breaches occurred? 2) Are there any trends we can see?

We want to help companies prepare for cyber attacks. One way to do that is to understand what data breaches occurred, when they happened, and how much damage they caused.

Key metrics include the date of each breach, the number of records leaked, the sector of the business, and the type of attack.

The stakeholders are my boss and the cyber security company’s managers.

My insights can help my client focus on how companies can protect themselves from data breaches, since we can map the breaches that occurred over the last 18 years.

Second phase: Prepare

The data used in this case study was obtained from https://www.estratic.com/2018/02/09/visualizing-data-breaches-2004-2017-2/ and from David McCandless, InformationIsBeautiful.net (https://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/).

How is the data organized? The dataset is a .CSV file. It has 16 variables (columns) and 417 observations (rows). The main variables are:

  • Entity: name of the organization (public or private) that had the breach. String
  • Alternative Name: other known names of the entity. String
  • Story: a summary of what happened. String
  • Year: year of the breach. Date
  • Records Lost: number of records that the breach compromised. Integer
  • Sector: organization’s main sector (or field of business). String
  • Method of Leak: main cause of the breach. String
  • 1st source (link): first URL with more information about the breach. String
  • 2nd source (link): second URL with more information about the breach. String
  • 3rd source (link): third URL with more information about the breach. String
  • Source name: name of the source (news outlet, official report, blog, etc.). Note that some of these changed after I replaced broken links present in the original dataset. String
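
Once the CSV is read into R (as done in the Analyze phase below, where the data frame is called breaches_df), dplyr’s glimpse() and base R’s dim() offer a quick way to confirm this structure. This is only a sketch of that check:

glimpse(breaches_df)  # one line per column: name, type, and the first few values
dim(breaches_df)      # expected: 417 rows and 16 columns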

The data has credibility. I skimmed the dataset and did not find any obvious bias. The data is reliable, original, comprehensive, current, and cited.

I contacted Mr. Carlos Jimenez, from https://www.kaggle.com/estratic, by e-mail, and he allowed me to use his work as long as I cite him. I am very grateful to Mr. Carlos Jimenez.

Since this case study is used for educational purposes, I did not need to contact Mr. David McCandless, according to the licensing terms on his website (https://informationisbeautiful.net/licensing/).

I ran the data cleanup tools in Google Sheets on the CSV file. Google Sheets trimmed whitespace and checked for duplicates; it did not suggest any modifications to the content of the CSV.
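
For reproducibility, the same two checks can be approximated in R once the data is loaded. This is a sketch that assumes the data frame is called breaches_df (as in the Analyze phase) and that the tidyverse is attached:

breaches_df <- breaches_df %>%
  mutate(across(where(is.character), trimws))  # trim leading/trailing whitespace in text columns
sum(duplicated(breaches_df))                   # number of fully duplicated rows (0 means none)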

Third phase: Process

I chose Google Sheets to clean up the data because it has automated and reliable tools for that. I then imported the CSV into RStudio Cloud because I am comfortable using R to work with data, having learned it during my PhD in Applied Linguistics.

Fourth phase: Analyze

I decided to leave the data the way I received it, because it looks clean to me.

Let’s go through some important steps.

First, we installed the required packages

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

Then, we loaded the packages

library(tidyverse)  #helps wrangle data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)  #helps wrangle date attributes
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Let’s read the data into a data frame

breaches_df <- read_csv("breaches.csv")
## New names:
## Rows: 417 Columns: 16
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (11): organisation, alternative_name, date, story, sector, method, inter... dbl
## (3): year, data_sensitivity, ID lgl (1): ...12
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...12`
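
The import message also shows that readr renamed an unnamed column to ...12 and parsed it as logical. A hedged sketch for checking whether that column is empty and, if so, dropping it (assuming it carries no real data):

summary(breaches_df[["...12"]])               # inspect the automatically renamed column
if (all(is.na(breaches_df[["...12"]]))) {     # drop it only if every value is missing
  breaches_df <- select(breaches_df, -any_of("...12"))
}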

Let’s check the mean, median, minimum, and maximum number of records breached.

mean(breaches_df$records_lost)
## [1] 34387245
median(breaches_df$records_lost)
## [1] 3100000
min(breaches_df$records_lost)
## [1] 30
max(breaches_df$records_lost)
## [1] 7e+08
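
As a more compact alternative, base R’s summary() reports the minimum, quartiles, median, mean, and maximum of the same column in a single call:

summary(breaches_df$records_lost)  # min, quartiles, median, mean, and max at once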

The code below was supposed to show the average records lost by year, but it actually counts the number of distinct records_lost values in each year (roughly, the number of breaches per year).

year_count <- breaches_df %>%                              
  group_by(year) %>%    
  summarise(count = n_distinct(records_lost))
year_count    
## # A tibble: 19 × 2
##     year count
##    <dbl> <int>
##  1  2004     1
##  2  2005     3
##  3  2006     5
##  4  2007    10
##  5  2008    19
##  6  2009     9
##  7  2010    13
##  8  2011    34
##  9  2012    22
## 10  2013    28
## 11  2014    18
## 12  2015    24
## 13  2016    33
## 14  2017    22
## 15  2018    37
## 16  2019    40
## 17  2020    24
## 18  2021    29
## 19  2022    11
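
Below is a sketch of the calculation that was originally intended: the average number of records lost per breach in each year (na.rm = TRUE guards against possible missing values):

avg_by_year <- breaches_df %>%
  group_by(year) %>%
  summarise(avg_records_lost = mean(records_lost, na.rm = TRUE))  # mean records lost per breach, per year
avg_by_year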

Let’s try to compute the average records lost by year

summarize(breaches_df, sum(year))
## # A tibble: 1 × 1
##   `sum(year)`
##         <dbl>
## 1      840345

Here I tried to create a table showing which hacking method occurred in each year, but my code simply summed all the years, treating them as numbers rather than dates.
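
A hedged sketch of the table that was intended here, counting how many breaches used each method in each year and then spreading the methods into columns:

method_by_year <- breaches_df %>%
  count(year, method, name = "breaches")  # one row per year/method combination
method_by_year %>%
  pivot_wider(names_from = method,
              values_from = breaches,
              values_fill = 0)            # years as rows, methods as columns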

The code below counts in how many distinct years each method occurred.

year_hacker_count <- breaches_df %>%                        
  group_by(method) %>%    
  summarise(count = n_distinct(year))  
year_hacker_count   
## # A tibble: 5 × 2
##   method        count
##   <chr>         <int>
## 1 hacked           18
## 2 inside job       11
## 3 lost device      11
## 4 oops!            13
## 5 poor security    10
  • How will these insights help answer the business questions? Basically, “hacked” is the most common method and the one that compromises the most records; the sketch below checks this directly.
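
A hedged sketch that checks this claim, tallying for each method the number of breaches and the total records lost, sorted from most to least records:

records_by_method <- breaches_df %>%
  group_by(method) %>%
  summarise(breaches = n(),
            total_records_lost = sum(records_lost, na.rm = TRUE)) %>%
  arrange(desc(total_records_lost))  # methods ranked by total records compromised
records_by_method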

Fifth phase: Share

Let’s generate some charts

The first chart shows the method of data breach (on x-axis) by year (on y-axis)

ggplot(breaches_df, aes(x = method, y = year)) + geom_point()
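
Because many breaches share the same method and year, points in this scatter plot overlap. One possible refinement (a sketch, not part of the original analysis) is ggplot2’s geom_count(), which sizes each point by the number of overlapping observations:

ggplot(breaches_df, aes(x = method, y = year)) +
  geom_count()  # point size reflects how many breaches share that method and year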

This second chart counts the number of breaches by method

ggplot(data = breaches_df) +
  geom_bar(mapping = aes(x = method))

This third chart counts the number of breaches by method, with colors indicating the sector

ggplot(data = breaches_df) +
  geom_bar(mapping = aes(x = method, fill=sector))

This fourth chart shows the number of breaches per year, faceted by method.

ggplot(data = breaches_df) +
  geom_bar(mapping = aes(x = year)) +
  facet_wrap(~method )

Sixth phase: Act

  • What is your final conclusion based on your analysis? Since hacking is the most common breach method, we need to invest money in firewalls.