Practice Exam

Introduction

Purpose of Analysis: This analysis aims to provide summaries and data visualizations pertaining to healthcare data breaches reported to the U.S. Department of Health and Human Services Office for Civil Rights (OCR).

Explanation of Data: The OCR collects breach data including: the organization responsible for the PHI and its organization type, state, number of individuals affected, the date, type, and location of the breach, and business associate involvement. Additionally, some entries have a longer explanation of the situation and resolution.

Proposed Approach: The analysis will focus on graphs and other visualizations to more deeply understand breaches over time, numbers of individuals affected, and any trends in the type of breaches.

How this Analysis Helps: This analysis provides an easily-digestible way for users to understand some key trends and insights related to PHI data breaches.

Packages

Several packages will be critical for our analysis.

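Tidyverse: The tidyverse is a collection of packages that share a common grammar and data structures, which makes the steps of a data science workflow fit together more seamlessly.

Dplyr: dplyr provides verbs comparable to SQL operations (filter, select, group, summarise, join) that help users manipulate datasets easily.

DT: an R interface to the JavaScript DataTables library, used to render interactive data tables.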

## -- Attaching packages -------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Import Data and Create Table

## Parsed with column specification:
## cols(
##   `Name of Covered Entity` = col_character(),
##   State = col_character(),
##   `Covered Entity Type` = col_character(),
##   `Individuals Affected` = col_double(),
##   `Breach Submission Date` = col_character(),
##   `Type of Breach` = col_character(),
##   `Location of Breached Information` = col_character(),
##   `Business Associate Present` = col_character(),
##   `Web Description` = col_character()
## )
## Parsed with column specification:
## cols(
##   `Name of Covered Entity` = col_character(),
##   State = col_character(),
##   `Covered Entity Type` = col_character(),
##   `Individuals Affected` = col_double(),
##   `Breach Submission Date` = col_character(),
##   `Type of Breach` = col_character(),
##   `Location of Breached Information` = col_character(),
##   `Business Associate Present` = col_character(),
##   `Web Description` = col_logical()
## )
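The two column specifications above suggest that two CSV files were read and then combined, presumably one for closed cases and one for cases still under investigation. The following is a minimal sketch of that step, not the original code. The helper `read_breaches()`, the stand-in files, and the month/day/year date format are assumptions; the real file names are not given here.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read one export and tag every row with its status.
# Reading text columns explicitly as character keeps the types consistent
# across files (above, `Web Description` was guessed as logical in one file).
read_breaches <- function(path, status) {
  read_csv(
    path,
    col_types = cols(
      .default = col_character(),
      `Individuals Affected` = col_double()
    )
  ) %>%
    mutate(breach_status = status)
}

# Stand-in files with made-up rows so the sketch runs on its own.
header <- "Name of Covered Entity,State,Individuals Affected,Breach Submission Date,Web Description"
closed_path <- tempfile(fileext = ".csv")
open_path <- tempfile(fileext = ".csv")
writeLines(c(header, "Example Clinic,CA,1200,02/05/2015,Laptop stolen and later recovered"), closed_path)
writeLines(c(header, "Example Plan,TX,5000,03/01/2018,"), open_path)

breaches <- bind_rows(
  read_breaches(closed_path, "Closed"),
  read_breaches(open_path, "Investigation")
) %>%
  # Assumed format: month/day/year, e.g. "02/05/2015".
  mutate(DatesFormatted = as.Date(`Breach Submission Date`, format = "%m/%d/%Y"))
```

The status labels "Closed" and "Investigation" match the values of breach_status reported later in the summary.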

Cleaning Data

  1. Check for missing values. There are no variables with a lot of missing values besides Web Description, so we will keep all values.

  2. Remove the duplicate Anthem row.

  3. Handle entries that list multiple values for ‘Type of Breach’ or ‘Location of Breached Information’: the str_detect function is used to find each value, and dummy columns are created to separate the values.

##           Name of Covered Entity                            State 
##                                0                                3 
##              Covered Entity Type             Individuals Affected 
##                                3                                1 
##           Breach Submission Date                   Type of Breach 
##                                0                                1 
## Location of Breached Information       Business Associate Present 
##                                0                                0 
##                  Web Description                    breach_status 
##                              742                                0 
##                   DatesFormatted 
##                                0
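The cleaning steps can be sketched as follows. This is an illustration under stated assumptions rather than the original code: the rows are made up, the helper `flag()` is hypothetical, and the dummy column names simply follow the type_x / location_x pattern used in this report.

```r
library(dplyr)
library(stringr)

# Toy rows standing in for the real data; all values are made up.
breaches <- tibble(
  `Name of Covered Entity` = c("Clinic A", "Clinic A", "Plan B", "Provider C"),
  `Type of Breach` = c("Theft", "Theft", "Hacking/IT Incident, Theft", NA),
  `Location of Breached Information` = c("Laptop", "Laptop", "Network Server, Email", "Paper/Films")
)

# Step 1: number of missing values per column.
missing_counts <- colSums(is.na(breaches))

# Step 2: drop exact duplicate rows. This removes a duplicate only if the
# two rows match in every column.
breaches <- distinct(breaches)

# Step 3: one 0/1 dummy per category. `%in% TRUE` turns an NA match into 0.
flag <- function(x, pattern) {
  as.integer(str_detect(x, fixed(pattern)) %in% TRUE)
}

breaches <- breaches %>%
  mutate(
    type_hacking = flag(`Type of Breach`, "Hacking"),
    type_theft = flag(`Type of Breach`, "Theft"),
    location_laptop = flag(`Location of Breached Information`, "Laptop"),
    location_email = flag(`Location of Breached Information`, "Email")
  )
```

Because a single entry can list several types or locations, one row can have a 1 in more than one dummy column.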

About the Data

  1. Here are the variables included in the data set.
     - Name of Covered Entity: organization responsible for the PHI
     - State: US state where the breach was reported
     - Covered Entity Type: type of organization responsible for the PHI
     - Individuals Affected: number of records affected by the breach
     - Breach Submission Date: date the breach was reported by the covered entity (CE)
     - Type of Breach: how unauthorized access to the PHI was obtained
     - Location of Breached Information: where the PHI was when unauthorized access was obtained
     - Business Associate Present: whether a business associate, such as a consultant or contractor, was involved in the breach
     - Web Description: an optional statement explaining what happened and the resolution
     - breach_status: whether the investigation is still active or closed
     - DatesFormatted: date the breach was reported by the CE, formatted for R
     - type_x: indicates whether the breach was of each type. 1 is yes, 0 is no.
     - location_x: indicates whether the breached information was in each location. 1 is yes, 0 is no.
     - year: year of the data breach
     - IndAffRank: breach ranked by the number of individuals affected (1 means it has the greatest number of individuals affected)
     - day_of_week: the day of the week the breach was reported by the CE
     - regions: the US region containing the state of the breach

  2. There are 2,454 total observations

  3. Note that missing data is indicated as “NA” in the data.
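The derived variables in the list above (year, day_of_week, IndAffRank, regions) could be built roughly as follows. This is a sketch on made-up rows, not the original code. Using the built-in `state.abb` / `state.region` vectors for regions is an assumption; they label the Midwest as "North Central" and return NA for anything outside the 50 states, so the original region mapping may well differ.

```r
library(dplyr)

# Toy rows; values are made up for illustration.
breaches <- tibble(
  State = c("CA", "TX", "IL", "NY"),
  `Individuals Affected` = c(4500000, 78800000, 4500000, 500),
  `Breach Submission Date` = c("02/05/2015", "03/13/2015", "08/20/2014", "10/21/2009")
)

breaches <- breaches %>%
  mutate(
    # Assumed input format: month/day/year.
    DatesFormatted = as.Date(`Breach Submission Date`, format = "%m/%d/%Y"),
    year = format(DatesFormatted, "%Y"),       # character, e.g. "2015"
    day_of_week = weekdays(DatesFormatted),    # day names depend on the locale
    # rank() averages ties by default, so tied breaches share one rank.
    IndAffRank = rank(-`Individuals Affected`),
    regions = as.character(state.region[match(State, state.abb)]),
    regions = if_else(regions == "North Central", "Midwest", regions)
  )
```

Averaged ties would explain a rank value that repeats for breaches of equal size, as seen for rank 6 in the largest-breaches table later in this report.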

Data Table

Here is a data table with the most important variables.

Summary

In which states do most breaches occur?

## # A tibble: 53 x 2
##    State     n
##    <chr> <int>
##  1 CA      282
##  2 TX      201
##  3 FL      162
##  4 NY      136
##  5 IL      120
##  6 PA       84
##  7 OH       76
##  8 IN       72
##  9 GA       71
## 10 MA       69
## # ... with 43 more rows

The five states with the most reported breaches are CA (282), TX (201), FL (162), NY (136), and IL (120).

In which Covered Entity Types do most breaches occur?
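A frequency table like the one above can be produced with dplyr's `count()`. The snippet below is a minimal sketch on made-up rows; the object name `state_counts` is hypothetical.

```r
library(dplyr)

# Toy data: one row per breach, with one missing state.
breaches <- tibble(State = c("CA", "TX", "CA", "FL", "TX", "CA", NA))

# count() tallies rows per value; sort = TRUE orders by descending n.
state_counts <- breaches %>%
  count(State, sort = TRUE)

head(state_counts, 5)
```

Note that `count()` keeps missing values as their own group, so an `<NA>` row can appear in the result.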

## # A tibble: 5 x 2
##   `Covered Entity Type`         n
##   <chr>                     <int>
## 1 Healthcare Provider        1767
## 2 Business Associate          355
## 3 Health Plan                 325
## 4 Healthcare Clearing House     4
## 5 <NA>                          3

Healthcare Providers account for the most breaches (1,767 of 2,454).

Summarize Individuals Affected

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      500      981     2261    77038     7784 78800000        1

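min: 500, median: 2,261, mean: 77,038, max: 78,800,000. The mean is far above the median because a small number of very large breaches skew the distribution.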

How many investigations are closed?

## # A tibble: 2 x 2
##   breach_status     n
##   <chr>         <int>
## 1 Closed         2048
## 2 Investigation   406

2,048 investigations are closed; 406 are still under investigation.

Required Analysis

Question 1: Number of Reported Breaches

Question 2: Average healthcare data breach size by year
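The per-year figures behind Questions 1 and 2 can be computed with a grouped summary. This is a minimal sketch on made-up rows; the output column names `n_breaches` and `avg_individuals` are hypothetical.

```r
library(dplyr)

# Toy data; values are made up for illustration.
breaches <- tibble(
  year = c("2014", "2014", "2015", "2015", "2015"),
  `Individuals Affected` = c(1000, 3000, 500, NA, 2500)
)

by_year <- breaches %>%
  group_by(year) %>%
  summarise(
    n_breaches = n(),  # number of reported breaches in the year
    # na.rm = TRUE skips breaches with no recorded count.
    avg_individuals = mean(`Individuals Affected`, na.rm = TRUE)
  )
```

Since the mean is sensitive to a few very large breaches, reporting `median()` alongside it may give a more representative picture of a typical year.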

Question 3: Largest healthcare data breaches

## # A tibble: 25 x 6
##    IndAffRank `Name of Covere~ year  `Covered Entity~ `Individuals Af~
##         <dbl> <chr>            <chr> <chr>                       <dbl>
##  1          1 Anthem, Inc. Af~ 2015  Health Plan              78800000
##  2          2 Premera Blue Cr~ 2015  Health Plan              11000000
##  3          3 Excellus Health~ 2015  Health Plan              10000000
##  4          4 Science Applica~ 2011  Business Associ~          4900000
##  5          6 University of C~ 2015  Healthcare Prov~          4500000
##  6          6 Community Healt~ 2014  Business Associ~          4500000
##  7          6 Community Healt~ 2014  Business Associ~          4500000
##  8          8 Advocate Health~ 2013  Healthcare Prov~          4029530
##  9          9 Medical Informa~ 2015  Business Associ~          3900000
## 10         10 Banner Health    2016  Healthcare Prov~          3620000
## # ... with 15 more rows, and 1 more variable: `Type of Breach` <chr>

Question 4: Hacking/IT breaches by year

Question 5: Breaches by Entity Type and Year

## # A tibble: 35 x 3
## # Groups:   Covered Entity Type [5]
##    `Covered Entity Type` year  count
##    <chr>                 <chr> <int>
##  1 Business Associate    2009      3
##  2 Business Associate    2010     44
##  3 Business Associate    2011     45
##  4 Business Associate    2012     40
##  5 Business Associate    2013     64
##  6 Business Associate    2014     77
##  7 Business Associate    2015     12
##  8 Business Associate    2016     20
##  9 Business Associate    2017     20
## 10 Business Associate    2018     30
## # ... with 25 more rows

Question 6: Fridays are the most common day of reporting

## # A tibble: 7 x 2
##   day_of_week     n
##   <chr>       <int>
## 1 Friday        767
## 2 Thursday      434
## 3 Tuesday       407
## 4 Monday        394
## 5 Wednesday     384
## 6 Saturday       42
## 7 Sunday         26

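Question 7: Type of Breach trends. Note that the table below is not yet correct: I have not worked out how to make each column count only the “1”s in the corresponding dummy variable column, so every column currently repeats the same per-year row count.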

## # A tibble: 10 x 8
##    year  hacking improper_disposal  loss theft unauthorized unknown other
##    <chr>   <int>             <int> <int> <int>        <int>   <int> <int>
##  1 2009       18                18    18    18           18      18    18
##  2 2010      198               198   198   198          198     198   198
##  3 2011      200               200   200   200          200     200   200
##  4 2012      218               218   218   218          218     218   218
##  5 2013      278               278   278   278          278     278   278
##  6 2014      314               314   314   314          314     314   314
##  7 2015      268               268   268   268          268     268   268
##  8 2016      327               327   327   327          327     327   327
##  9 2017      359               359   359   359          359     359   359
## 10 2018      273               273   273   273          273     273   273
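Identical values in every column are what `n()` produces, since it counts rows in the group regardless of the dummy values. One way to count only the 1s is to `sum()` each dummy column instead. The sketch below uses made-up rows, and the column names (`type_hacking`, `type_theft`, `type_loss`) are assumed from the type_x naming pattern.

```r
library(dplyr)

# Toy data: one row per breach, with 0/1 dummy columns per breach type.
breaches <- tibble(
  year = c("2016", "2016", "2016", "2017", "2017"),
  type_hacking = c(1, 0, 1, 1, 0),
  type_theft = c(0, 1, 1, 0, 0),
  type_loss = c(0, 0, 0, 0, 1)
)

type_by_year <- breaches %>%
  group_by(year) %>%
  summarise(
    # Summing a 0/1 column counts the rows where it equals 1.
    hacking = sum(type_hacking, na.rm = TRUE),
    theft = sum(type_theft, na.rm = TRUE),
    loss = sum(type_loss, na.rm = TRUE)
  )
```

A breach that lists several types adds 1 to more than one column, so the row totals can exceed the number of breaches in a year.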

Exploratory Analysis

Question 1: Is there a relationship between covered entity type and the average number of individuals affected? Health Plans have the greatest average number of individuals affected.

Question 2: How long ago did the open cases begin?

## # A tibble: 3 x 2
##   year      n
##   <chr> <int>
## 1 2018    236
## 2 2017    150
## 3 2016     20

Most of the open cases were reported in 2018 (236), followed by 2017 (150). There are 20 outstanding from 2016.

Question 3: In which years were the closed cases reported?

## # A tibble: 10 x 2
##    year      n
##    <chr> <int>
##  1 2014    314
##  2 2016    307
##  3 2013    278
##  4 2015    268
##  5 2012    218
##  6 2017    209
##  7 2011    200
##  8 2010    199
##  9 2018     37
## 10 2009     18

37 cases from 2018 have been closed, 209 from 2017.

Question 4: Does the number of individuals affected differ when a business associate was present? There is not a large difference.

Question 5: Are some regions in the US more likely to be breached?

## # A tibble: 4 x 2
##   regions       n
##   <chr>     <int>
## 1 South       904
## 2 West        589
## 3 Midwest     555
## 4 Northeast   403

The South has the most reported breaches (904). This is likely related to the region's large population.

Question 6: Average individuals affected by portable objects v laptop breaches

## # A tibble: 1 x 1
##   avgIndAffPortable
##               <dbl>
## 1            22669.
## # A tibble: 1 x 1
##   avgIndAffLaptop
##             <dbl>
## 1          21334.

Laptop = 21,334 and portable = 22,669. Laptop breaches and breaches from other portable devices affect roughly the same number of people on average.
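The two averages can be computed in one step by subsetting on the location dummy columns. This is a sketch on made-up rows; the column names `location_laptop` and `location_portable` are assumed from the location_x naming pattern.

```r
library(dplyr)

# Toy data; values are made up for illustration.
breaches <- tibble(
  `Individuals Affected` = c(1000, 3000, 500, 5500, NA),
  location_laptop = c(1, 1, 0, 0, 1),
  location_portable = c(0, 1, 1, 1, 0)
)

device_avgs <- breaches %>%
  summarise(
    # Average over only the rows flagged for each location.
    avgIndAffLaptop = mean(`Individuals Affected`[location_laptop == 1], na.rm = TRUE),
    avgIndAffPortable = mean(`Individuals Affected`[location_portable == 1], na.rm = TRUE)
  )
```

A breach that involves both a laptop and another portable device contributes to both averages, so the two groups are not mutually exclusive.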