Introduction
Purpose of Analysis: This analysis provides summaries and data visualizations of healthcare data breaches reported by the HHS Office for Civil Rights (OCR).
Explanation of Data: The OCR collects breach data including the organization responsible for the protected health information (PHI) and its organization type, the state, the number of individuals affected, the date, type, and location of the breach, and whether a business associate was involved. Additionally, some entries have a longer explanation of the situation and its resolution.
Proposed Approach: The analysis will focus on graphs and other visualizations to more deeply understand breaches over time, numbers of individuals affected, and any trends in the type of breaches.
How this Analysis Helps: This analysis gives users an easily digestible way to understand key trends and insights related to PHI data breaches.
Packages
Several packages will be critical for our analysis.
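tidyverse: a collection of packages that share a common design philosophy and grammar, which makes for a more seamless data science workflow.
dplyr: part of the tidyverse; its verbs (such as filter, select, group_by, and summarise) are comparable to SQL operations and help users manipulate datasets easily.
DT: an R interface to the JavaScript DataTables library, used to create interactive data tables.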
## -- Attaching packages -------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Import Data and Create Table
## Parsed with column specification:
## cols(
## `Name of Covered Entity` = col_character(),
## State = col_character(),
## `Covered Entity Type` = col_character(),
## `Individuals Affected` = col_double(),
## `Breach Submission Date` = col_character(),
## `Type of Breach` = col_character(),
## `Location of Breached Information` = col_character(),
## `Business Associate Present` = col_character(),
## `Web Description` = col_character()
## )
## Parsed with column specification:
## cols(
## `Name of Covered Entity` = col_character(),
## State = col_character(),
## `Covered Entity Type` = col_character(),
## `Individuals Affected` = col_double(),
## `Breach Submission Date` = col_character(),
## `Type of Breach` = col_character(),
## `Location of Breached Information` = col_character(),
## `Business Associate Present` = col_character(),
## `Web Description` = col_logical()
## )
Cleaning Data
1. Check for missing values. No variable other than Web Description has many missing values, so we keep all rows.
2. Remove the duplicate Anthem row.
3. Handle entries that list multiple values for ‘Type of Breach’ or ‘Location of Breached Information’: the str_detect function is used to detect each value, and dummy columns are created to separate them.
## Name of Covered Entity State
## 0 3
## Covered Entity Type Individuals Affected
## 3 1
## Breach Submission Date Type of Breach
## 0 1
## Location of Breached Information Business Associate Present
## 0 0
## Web Description breach_status
## 742 0
## DatesFormatted
## 0
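The three cleaning steps can be sketched as follows. This is a minimal illustration on invented toy rows, not the report's actual code: the dummy column names (type_theft, type_hacking, type_unauthorized) are hypothetical instances of the type_x pattern, and using distinct() assumes the duplicate row is an exact copy.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for the breach data; column names follow the report,
# the values are invented for illustration only.
breaches <- tibble(
  `Name of Covered Entity` = c("Clinic A", "Plan B", "Plan B", "Hospital C"),
  `Type of Breach` = c("Theft", "Hacking/IT Incident", "Hacking/IT Incident",
                       "Theft, Unauthorized Access/Disclosure"),
  `Individuals Affected` = c(600, 12000, 12000, NA)
)

# 1. Count missing values per column
na_counts <- colSums(is.na(breaches))

# 2. Drop exact duplicate rows (the first occurrence is kept)
breaches <- distinct(breaches)

# 3. One 0/1 dummy column per breach type. A row that lists several
#    types gets a 1 in each matching column; a missing type stays NA.
breaches <- breaches %>%
  mutate(
    type_theft        = as.integer(str_detect(`Type of Breach`, "Theft")),
    type_hacking      = as.integer(str_detect(`Type of Breach`, "Hacking")),
    type_unauthorized = as.integer(str_detect(`Type of Breach`, "Unauthorized"))
  )
```

If the duplicate differs in any column (for example, a slightly different entity name), distinct() will not catch it and the row has to be filtered out explicitly.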
About the Data
Here are the variables included in the data set.
-Name of Covered Entity: organization responsible for the PHI
-State: US state where the breach was reported
-Covered Entity Type: type of organization responsible for the PHI
-Individuals Affected: number of records affected by the breach
-Breach Submission Date: date the breach was reported by the covered entity (CE)
-Type of Breach: how unauthorized access to the PHI was obtained
-Location of Breached Information: where the PHI was when unauthorized access was obtained
-Business Associate Present: whether a business associate, such as a consultant or contractor, was involved in the breach
-Web Description: an optional statement explaining what happened and the resolution
-breach_status: whether the investigation is still active or closed
-DatesFormatted: date the breach was reported by the CE, formatted for R
-type_x: indicates whether the breach was of each type. 1 is yes, 0 is no.
-location_x: indicates whether the breached information was in each location. 1 is yes, 0 is no.
-year: year of the data breach
-IndAffRank: rank of the breach by number of individuals affected (1 means the greatest number of individuals affected)
-day_of_week: the day of the week the breach was reported by the CE
-regions: the US region that the state of the breach is in
There are 2,454 total observations.
Note that missing data is indicated as “NA” in the data.
Data Table
Here is a data table with the most important variables.
Summary
In which states do most breaches occur?
## # A tibble: 53 x 2
## State n
## <chr> <int>
## 1 CA 282
## 2 TX 201
## 3 FL 162
## 4 NY 136
## 5 IL 120
## 6 PA 84
## 7 OH 76
## 8 IN 72
## 9 GA 71
## 10 MA 69
## # ... with 43 more rows
The five states with the most reported breaches are CA, TX, FL, NY, and IL.
In which Covered Entity Types do most breaches occur?
## # A tibble: 5 x 2
## `Covered Entity Type` n
## <chr> <int>
## 1 Healthcare Provider 1767
## 2 Business Associate 355
## 3 Health Plan 325
## 4 Healthcare Clearing House 4
## 5 <NA> 3
Healthcare Providers account for the most breaches (1,767 of 2,454).
Summarize Individuals Affected
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 500 981 2261 77038 7784 78800000 1
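min: 500, median: 2,261, mean: 77,038, max: 78,800,000. The mean is far above the median because a small number of very large breaches skew the distribution.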
How many investigations are closed?
## # A tibble: 2 x 2
## breach_status n
## <chr> <int>
## 1 Closed 2048
## 2 Investigation 406
2,048 closed, 406 still under investigation
Required Analysis
Question 1: Number of Reported Breaches
Question 2: Average healthcare data breach size by year
Question 3: Largest healthcare data breaches
## # A tibble: 25 x 6
## IndAffRank `Name of Covere~ year `Covered Entity~ `Individuals Af~
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Anthem, Inc. Af~ 2015 Health Plan 78800000
## 2 2 Premera Blue Cr~ 2015 Health Plan 11000000
## 3 3 Excellus Health~ 2015 Health Plan 10000000
## 4 4 Science Applica~ 2011 Business Associ~ 4900000
## 5 6 University of C~ 2015 Healthcare Prov~ 4500000
## 6 6 Community Healt~ 2014 Business Associ~ 4500000
## 7 6 Community Healt~ 2014 Business Associ~ 4500000
## 8 8 Advocate Health~ 2013 Healthcare Prov~ 4029530
## 9 9 Medical Informa~ 2015 Business Associ~ 3900000
## 10 10 Banner Health 2016 Healthcare Prov~ 3620000
## # ... with 15 more rows, and 1 more variable: `Type of Breach` <chr>
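The table shows rank 6 three times and no rank 5 or 7. That pattern is consistent with the default tie handling of base R's rank(), which gives tied values the average of the ranks they span; whether the report computed IndAffRank this way is an assumption. A small sketch using the first eight values from the table:

```r
# Individuals affected for the first eight rows of the table above
individuals <- c(78800000, 11000000, 10000000, 4900000,
                 4500000, 4500000, 4500000, 4029530)

# Default ties.method = "average": the three tied breaches occupy
# positions 5, 6 and 7, so each receives (5 + 6 + 7) / 3 = 6
rank_average <- rank(-individuals)

# ties.method = "min": tied breaches all receive the lowest position, 5
rank_min <- rank(-individuals, ties.method = "min")
```

With ties.method = "min" the ranking reads 1, 2, 3, 4, 5, 5, 5, 8, which may be easier to interpret in a "largest breaches" table.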
Question 4: Hacking/IT breaches by year
Question 5: Breaches by Entity Type and Year
## # A tibble: 35 x 3
## # Groups: Covered Entity Type [5]
## `Covered Entity Type` year count
## <chr> <chr> <int>
## 1 Business Associate 2009 3
## 2 Business Associate 2010 44
## 3 Business Associate 2011 45
## 4 Business Associate 2012 40
## 5 Business Associate 2013 64
## 6 Business Associate 2014 77
## 7 Business Associate 2015 12
## 8 Business Associate 2016 20
## 9 Business Associate 2017 20
## 10 Business Associate 2018 30
## # ... with 25 more rows
Question 6: Fridays are the most common day of reporting
## # A tibble: 7 x 2
## day_of_week n
## <chr> <int>
## 1 Friday 767
## 2 Thursday 434
## 3 Tuesday 407
## 4 Monday 394
## 5 Wednesday 384
## 6 Saturday 42
## 7 Sunday 26
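The year and day_of_week variables can be derived from the submission date with base R. The sketch below assumes the dates arrive as "MM/DD/YYYY" text (the column is parsed as character on import); the sample dates are invented.

```r
# Assumption: Breach Submission Date is text in MM/DD/YYYY form
submitted <- c("10/21/2009", "03/13/2015", "03/14/2015")

# Convert to Date, then extract the pieces used in the analysis
dates_formatted <- as.Date(submitted, format = "%m/%d/%Y")
year <- format(dates_formatted, "%Y")     # character, e.g. "2009"
day_of_week <- weekdays(dates_formatted)  # day names in the session's locale

# Count reports per weekday, most common first
day_counts <- sort(table(day_of_week), decreasing = TRUE)
```

Note that weekdays() returns names in the current locale, so the labels will differ on a non-English system.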
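Question 7: Type of Breach trends. Note: the table below does not yet show per-type counts. Every column repeats the same yearly row count, because each column counts rows instead of counting the “1”s in the corresponding dummy variable column.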
## # A tibble: 10 x 8
## year hacking improper_disposal loss theft unauthorized unknown other
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 2009 18 18 18 18 18 18 18
## 2 2010 198 198 198 198 198 198 198
## 3 2011 200 200 200 200 200 200 200
## 4 2012 218 218 218 218 218 218 218
## 5 2013 278 278 278 278 278 278 278
## 6 2014 314 314 314 314 314 314 314
## 7 2015 268 268 268 268 268 268 268
## 8 2016 327 327 327 327 327 327 327
## 9 2017 359 359 359 359 359 359 359
## 10 2018 273 273 273 273 273 273 273
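Every column in the table above holds the same value for a given year, which is what results when each column is computed as a row count (for example with n()). One way to count only the 1s is to sum each dummy column within the year. This is a sketch on invented data; the column names type_hacking and type_theft are hypothetical instances of the type_x pattern.

```r
library(dplyr)
library(tibble)

# Toy data: one row per breach, 0/1 dummy columns per breach type
breaches <- tibble(
  year         = c("2009", "2009", "2010", "2010", "2010"),
  type_hacking = c(1, 0, 1, 1, 1),
  type_theft   = c(0, 1, 0, 1, 0)
)

# Summing a 0/1 column counts the rows where it equals 1
type_by_year <- breaches %>%
  group_by(year) %>%
  summarise(
    hacking = sum(type_hacking, na.rm = TRUE),
    theft   = sum(type_theft, na.rm = TRUE)
  )
```

Because a single breach can have several types, the per-type counts in a year can add up to more than the number of breaches in that year.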
Exploratory Analysis
Question 1: Is there a relationship between covered entity and average number of individuals affected?
Health Plans have the greatest average number of individuals affected.
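A comparison like this can be computed by grouping on the entity type. The sketch below uses invented values and adds the median next to the mean, since a single very large breach can dominate a group's mean; the output column names are illustrative.

```r
library(dplyr)
library(tibble)

# Toy data with one outsized breach in the Health Plan group
breaches <- tibble(
  `Covered Entity Type` = c("Health Plan", "Health Plan", "Health Plan",
                            "Healthcare Provider", "Healthcare Provider"),
  `Individuals Affected` = c(500, 700, 1000000, 800, 1200)
)

# Mean and median individuals affected per entity type
avg_by_type <- breaches %>%
  group_by(`Covered Entity Type`) %>%
  summarise(
    mean_affected   = mean(`Individuals Affected`, na.rm = TRUE),
    median_affected = median(`Individuals Affected`, na.rm = TRUE)
  )
```

If the mean and median rank the entity types differently, the difference in means is probably driven by a few extreme breaches rather than by typical ones.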
Question 2: How long ago did the open cases begin?
## # A tibble: 3 x 2
## year n
## <chr> <int>
## 1 2018 236
## 2 2017 150
## 3 2016 20
Most of the open cases were reported in 2018 (236). There are 20 still outstanding from 2016.
Question 3: In which years were the closed cases reported?
## # A tibble: 10 x 2
## year n
## <chr> <int>
## 1 2014 314
## 2 2016 307
## 3 2013 278
## 4 2015 268
## 5 2012 218
## 6 2017 209
## 7 2011 200
## 8 2010 199
## 9 2018 37
## 10 2009 18
Of the closed cases, 37 were reported in 2018 and 209 in 2017.
Question 4: Does the number of individuals affected differ when a business associate was present?
There is not a large difference.
Question 5: Are some regions in the US more likely to be breached?
## # A tibble: 4 x 2
## regions n
## <chr> <int>
## 1 South 904
## 2 West 589
## 3 Midwest 555
## 4 Northeast 403
The South has the most breaches. This likely reflects the region's large population.
Question 6: Average individuals affected by portable device vs. laptop breaches
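One way to derive a region from a state abbreviation is the state.abb and state.region lookup that ships with base R. Whether the report built its regions variable this way is an assumption. The built-in data labels the Midwest "North Central" and covers only the 50 states, so the sketch renames that level and leaves other codes as NA.

```r
# Sample state codes; "PR" is not one of the 50 states in state.abb
states <- c("CA", "TX", "NY", "IL", "PR")

# Look up each code's region; unmatched codes become NA
regions <- as.character(state.region[match(states, state.abb)])

# Rename the built-in "North Central" label to "Midwest"
regions[which(regions == "North Central")] <- "Midwest"

# Breaches per region, most first (NA rows are dropped by table())
region_counts <- sort(table(regions), decreasing = TRUE)
```

Codes such as DC or PR need to be assigned a region by hand if they should be included in the counts.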
## # A tibble: 1 x 1
## avgIndAffPortable
## <dbl>
## 1 22669.
## # A tibble: 1 x 1
## avgIndAffLaptop
## <dbl>
## 1 21334.
laptop = 21,334 and portable = 22,669. Laptop breaches and breaches from other portable devices affect roughly the same number of people on average.
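The two averages can be computed by matching on the location text. This sketch assumes the location field contains the strings "Laptop" and "Portable" (as in "Other Portable Electronic Device"); the rows are invented, and the output names mirror those in the tables above.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy data; a row may list more than one location
breaches <- tibble(
  `Location of Breached Information` = c(
    "Laptop", "Other Portable Electronic Device",
    "Laptop, Paper/Films", "Network Server"
  ),
  `Individuals Affected` = c(1000, 3000, 2000, 50000)
)

# Average individuals affected for rows mentioning each device type
avg_by_device <- breaches %>%
  summarise(
    avgIndAffLaptop = mean(
      `Individuals Affected`[str_detect(`Location of Breached Information`, "Laptop")],
      na.rm = TRUE
    ),
    avgIndAffPortable = mean(
      `Individuals Affected`[str_detect(`Location of Breached Information`, "Portable")],
      na.rm = TRUE
    )
  )
```

A breach that lists both a laptop and another portable device would count toward both averages under this matching rule.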