This is an R Markdown document that is part of the Capstone Project for the Google Data Analytics Professional Certificate. It is based on the case study “Data Breaches 2004-2017 (EN)” by Carlos E. Jimenez-Gomez (https://www.kaggle.com/estratic). It covers breaches at public and private organizations between 2004 and 2017, focusing on the method, sector, and year of each leak. The dataset used in this case study is updated through 2022.
I am a hypothetical junior data analyst working for a business intelligence consultant. I have been at my job for six months, and my boss feels I am ready for more responsibility. He has asked me to lead a project for a brand new client — a cyber security company.
I am exploring data breaches between 2004 and 2022. Some initial questions I might ask are: 1) What kinds of data breaches occurred? 2) Are there any trends we can see?
We want to help companies prepare for cyber attacks. One way to do that is to understand which data breaches occurred, when they happened, and how much damage they caused.
Some key metrics are the breach date, the number of records leaked, the business sector, and the type of attack.
The stakeholders are my boss and the cyber security company’s managers.
My insights can help the client focus on how companies can protect themselves from data breaches, since we can map the breaches of the last 18 years.
The data used in this case study was obtained from https://www.estratic.com/2018/02/09/visualizing-data-breaches-2004-2017-2/ and from David McCandless, InformationIsBeautiful.net (https://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/).
How is the data organized? The dataset is a .CSV file with 16 variables (columns) and 417 observations (rows). The variable names appear in the column specification printed when the file is read into R below.
The data has credibility. I skimmed the dataset and did not find obvious biases. This data is reliable, original, comprehensive, current, and cited.
I contacted Mr. Carlos Jimenez-Gomez (https://www.kaggle.com/estratic) by e-mail, and he allowed me to use his work on the condition that I cite him. I am very grateful to Mr. Jimenez-Gomez.
Since this case study is for educational purposes, I did not need to contact Mr. David McCandless, according to the licensing terms on his website (https://informationisbeautiful.net/licensing/).
I ran Google Sheets' data cleanup tools on the CSV file. They trimmed white space and checked for duplicates, and suggested no changes to the content of the CSV.
I chose Google Sheets to clean up the data because it has reliable, automated tools for the job. I then imported the CSV into RStudio Cloud, because I feel comfortable working with data in R, which I learned during my PhD in Applied Linguistics.
I decided to leave the data the way I received it, because it looks clean; a quick sanity check in R after loading it (below) backs this up.
Let’s go through some important steps.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse) #helps wrangle data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate) #helps wrangle date attributes
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
breaches_df <- read_csv("breaches.csv")
## New names:
## • `` -> `...12`
## Rows: 417 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): organisation, alternative_name, date, story, sector, method, inter...
## dbl  (3): year, data_sensitivity, ID
## lgl  (1): ...12
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
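Before any analysis, a quick sanity check can back up the earlier claim that the data arrived clean. This is a minimal sketch using base R and the dplyr functions already attached; it assumes only the breaches_df tibble loaded above.
# Sanity check: column overview, duplicate rows, and missing values.
glimpse(breaches_df)                    # one-line overview of every column and its type
breaches_df %>% distinct() %>% nrow()   # equals nrow(breaches_df) when there are no duplicate rows
colSums(is.na(breaches_df))             # count of missing values per column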
Let’s check the mean, median, minimum, and maximum number of records breached.
mean(breaches_df$records_lost)
## [1] 34387245
median(breaches_df$records_lost)
## [1] 3100000
min(breaches_df$records_lost)
## [1] 30
max(breaches_df$records_lost)
## [1] 7e+08
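These four statistics can also be computed in a single summarise() call. A sketch, assuming records_lost has no missing values (otherwise add na.rm = TRUE to each function):
# All four summary statistics of records_lost in one pipeline.
breaches_df %>%
  summarise(
    mean_lost   = mean(records_lost),
    median_lost = median(records_lost),
    min_lost    = min(records_lost),
    max_lost    = max(records_lost)
  )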
The code below was supposed to show the average records lost by year; instead, it counts the distinct values of records_lost within each year, which is roughly the number of breaches recorded per year.
year_count <- breaches_df %>%
group_by(year) %>%
summarise(count = n_distinct(records_lost))
year_count
## # A tibble: 19 × 2
## year count
## <dbl> <int>
## 1 2004 1
## 2 2005 3
## 3 2006 5
## 4 2007 10
## 5 2008 19
## 6 2009 9
## 7 2010 13
## 8 2011 34
## 9 2012 22
## 10 2013 28
## 11 2014 18
## 12 2015 24
## 13 2016 33
## 14 2017 22
## 15 2018 37
## 16 2019 40
## 17 2020 24
## 18 2021 29
## 19 2022 11
Let’s try to compute the average records lost by year.
summarize(breaches_df, sum(year))
## # A tibble: 1 × 1
## `sum(year)`
## <dbl>
## 1 840345
Here I tried to create a table showing the average records lost by year, but my code simply summed the year values, treating them as plain numbers rather than dates.
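For reference, the average records lost per year only needs year inside group_by() and records_lost inside summarise(). A sketch of what this step set out to compute, with na.rm = TRUE as a precaution against missing values:
# Mean records lost in each year.
breaches_df %>%
  group_by(year) %>%
  summarise(avg_records_lost = mean(records_lost, na.rm = TRUE))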
The code below shows in how many distinct years each breach method occurred.
year_hacker_count <- breaches_df %>%
group_by(method) %>%
summarise(count = n_distinct(year))
year_hacker_count
## # A tibble: 5 × 2
## method count
## <chr> <int>
## 1 hacked 18
## 2 inside job 11
## 3 lost device 11
## 4 oops! 13
## 5 poor security 10
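To get closer to the original question of which method occurred how often in each year, the year and method columns can be cross-counted. A sketch using dplyr’s count():
# Number of breaches for every year/method combination.
breaches_df %>%
  count(year, method, name = "breaches") %>%
  arrange(year, desc(breaches))
Since ggplot2 is already attached with the tidyverse, the same breakdown can be drawn as a stacked bar chart, one possible view among many:
# Breaches per year, with bars segmented by method.
ggplot(breaches_df, aes(x = year, fill = method)) +
  geom_bar() +
  labs(x = "Year", y = "Number of breaches", fill = "Method")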