The Office for Civil Rights (OCR) is responsible for reporting breaches of unsecured Protected Health Information (PHI). All breaches are reported to the Secretary of the U.S. Department of Health and Human Services, and the data includes information on breaches that affect 500 or more individuals. The purpose is to generate statistics that identify trends in data breaches, inform the general public about how often such breaches occur, and notify consumers who may not yet realize that their PHI has been compromised.
The compiled data includes information on data breach investigations which have been completed as well as those breaches currently under investigation. The data contains 9 key pieces of information:
The organization responsible for the PHI
The state where the breach was reported
The type of organization responsible for the PHI
The number of affected individuals
The date the breach was reported by the covered entity
The type of breach and how unauthorized access to PHI was gained
Where the PHI was when unauthorized access was obtained
Whether a business associate of the covered entity was involved in the breach
A statement explaining what happened and the resolution
My approach begins with clearing the data of empty values and duplicate entries. I also reshape the data for later statistical analysis, introducing new variables that extract and summarize key information. I use the functions mutate, summarize, and filter to investigate the figures. I also include visualizations, such as scatterplots and bar graphs, to highlight hidden aspects of the discrete data.
Consumers can use this information to determine whether their PHI was compromised by a breach and begin the process of further protecting their information. If consumers suspect they have been a victim of identity theft or fraud, this information can also be used to pursue restitution from the responsible entities. For example, in past breaches where the federal government was the responsible entity, it provided victims with credit and fraud monitoring services for an extended period to ensure consumers were protected to the fullest extent possible.
The following packages are necessary to view the breach information.
| Package | Explanation |
|---|---|
| tidyverse | For all things tidy |
| DT | To display some data using Data tables |
| knitr | For introducing R and HTML together |
| rmdformats | For ready-to-use R Markdown |
| lubridate | To manipulate date formats |
The first data set, “Breach_Archive”, contains data on 2,049 breach investigations that were investigated and ultimately closed.
The second data set, “Breach_Investigation”, contains data on 406 breach investigations currently ongoing.
To identify which records are closed and which are under investigation, a new column “Investigated” was created. This column uses a value of ‘1’ if the data breach investigation is complete, and a ‘0’ if the data breach is currently under investigation.
To combine the files into a single data set, both data sets MUST have the same number of columns, and the column names must ALSO match. The function rbind() is then used to stack the data from the Investigation Complete data set directly above the data from the Under Investigation data set.
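A minimal sketch of this step, assuming two toy data frames in place of the real files. The object names `breach_archive`, `breach_investigation`, and `breaches` are illustrative, and only two of the original columns are shown.

```r
# Toy stand-ins for the two files (the real data sets have more columns).
breach_archive <- data.frame(
  State = c("MD", "OH"),
  `Individuals Affected` = c(1200, 560),
  check.names = FALSE
)
breach_investigation <- data.frame(
  State = "VA",
  `Individuals Affected` = 3400,
  check.names = FALSE
)

# Flag closed (1) versus ongoing (0) investigations before combining.
breach_archive$Investigated <- 1
breach_investigation$Investigated <- 0

# rbind() on data frames needs matching column names, so check first.
stopifnot(identical(names(breach_archive), names(breach_investigation)))

# Closed investigations end up stacked above the ongoing ones.
breaches <- rbind(breach_archive, breach_investigation)
```

Checking the names explicitly before the call makes a mismatch fail with a clear message instead of a less obvious rbind() error.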
row 522 - identified as a duplicate by the ‘duplicated’ function in base R
row 794 - identified as a duplicated entry by previous data
## Name of Covered Entity State
## 0 3
## Covered Entity Type Individuals Affected
## 3 1
## Breach Submission Date Type of Breach
## 0 1
## Location of Breached Information Business Associate Present
## 0 0
## Web Description Investigated
## 742 0
## [1] 522
## [1] 794
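Output like the missing-value counts and duplicate row numbers above can be produced along these lines. This is a sketch on toy data, not the original code; the object names are assumptions, and only three of the columns are included.

```r
breaches <- data.frame(
  `Name of Covered Entity` = c("Clinic A", "Clinic B", "Clinic A", "Clinic C"),
  State = c("MD", NA, "MD", "OH"),
  `Individuals Affected` = c(800, 1500, 800, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_counts <- colSums(is.na(breaches))

# Row numbers of entries that exactly repeat an earlier row.
dup_rows <- which(duplicated(breaches))

# Keep only the first occurrence of each row.
breaches_unique <- breaches[!duplicated(breaches), ]
```

Note that duplicated() only catches rows that match an earlier row in every column, so near-duplicates still have to be found by inspection.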
I also add a new variable to the data: a column that displays the year the breach was reported. Here is a sample view of the field, which will be useful in the analysis.
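One way to derive the year column is sketched below in base R. The month/day/year date format and the column name `Year` are assumptions. Since lubridate is among the listed packages, its `mdy()` and `year()` functions would do the same job.

```r
breaches <- data.frame(
  `Breach Submission Date` = c("03/14/2011", "11/02/2015"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Parse the text dates (assumed month/day/year) into Date objects.
submitted <- as.Date(breaches[["Breach Submission Date"]], format = "%m/%d/%Y")

# Extract the four-digit year as an integer column.
breaches$Year <- as.integer(format(submitted, "%Y"))
```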
Next, I located the row numbers for the observations with missing values for State and missing values for Individuals Affected…
## [1] 731 1804 2266
## [1] 1246
…and for the missing values in the Covered Entity Type and Type of Breach fields.
## [1] 948 1003 2182
## [1] 1988
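Row numbers like those above can be found with which() and is.na(). The sketch below uses toy data and illustrative object names. Dropping the incomplete rows afterwards is an assumption about the next step, not something the text states.

```r
breaches <- data.frame(
  State = c("MD", NA, "OH", NA),
  `Individuals Affected` = c(800, 1500, NA, 600),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Row numbers with a missing value in each field.
missing_state <- which(is.na(breaches$State))
missing_affected <- which(is.na(breaches[["Individuals Affected"]]))

# Assumed follow-up: keep only rows that are complete in both fields.
keep <- complete.cases(breaches[, c("State", "Individuals Affected")])
breaches_clean <- breaches[keep, ]
```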
I also create new columns that recognize multiple entries under the Type of Breach category. Each column holds a 1 or 0, which accounts for a single reported breach having multiple breach types.
A similar situation occurs in the Location of Breached Information field, and I repeat this process for those options.
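The indicator columns can be sketched as follows. The sample strings and the new column names `Theft`, `Hacking`, and `Loss` are assumptions, and the real field may separate multiple types differently.

```r
breaches <- data.frame(
  `Type of Breach` = c(
    "Theft",
    "Hacking/IT Incident, Unauthorized Access/Disclosure",
    "Loss, Theft"
  ),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

type <- breaches[["Type of Breach"]]

# One 0/1 column per breach type; a record can score 1 in several columns.
breaches$Theft   <- as.integer(grepl("Theft", type, fixed = TRUE))
breaches$Hacking <- as.integer(grepl("Hacking", type, fixed = TRUE))
breaches$Loss    <- as.integer(grepl("Loss", type, fixed = TRUE))
```

The same pattern applies to the Location of Breached Information field, with one grepl() call per location keyword.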
After this initial cleaning, there are 2,445 total observations of data breaches that are under investigation or have been investigated. Below is a table of the data in alphabetical order by State.
Over 188 million individuals were affected by a healthcare breach between 2011 and 2016.
### 3.5 The Data

Each record is identified as having at least 1 of 7 types of breach:
Hacking/IT
Improper_Disposal
Loss
Theft
Unauthorized Access/Disclosure
Unknown
Other
Below is a table with the count of each type of breach. The top 3 are:
Total Thefts (897)
Unauthorized Disclosures (736)
Hacking/IT incidents (539)
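With 0/1 indicator columns in place, counts per breach type are column sums. This is a sketch on toy indicators, so the numbers are illustrative rather than the report's figures.

```r
# Toy 0/1 indicator columns, one row per reported breach.
indicators <- data.frame(
  Theft   = c(1, 0, 1, 1),
  Hacking = c(0, 1, 0, 1),
  Loss    = c(0, 0, 1, 0)
)

# Sum each column to count breaches of that type, largest first.
type_counts <- sort(colSums(indicators), decreasing = TRUE)
```

Because one breach can carry several types, these counts can add up to more than the number of records.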
On average, about 77,000 individuals were affected per healthcare data breach. The median is far lower, at 2,261, because a few very large breaches pull the mean upward.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 500 981 2261 77264 7784 78800000
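The gap between the mean and the median in the summary above is typical of heavily right-skewed data. A small sketch with made-up values shows the effect:

```r
# Made-up breach sizes: four modest breaches and one very large one.
affected <- c(500, 900, 2000, 8000, 1000000)

# Five-number summary plus the mean, as in the output above.
size_summary <- summary(affected)

avg_size <- mean(affected)    # pulled up by the single large breach
mid_size <- median(affected)  # unaffected by how large the maximum is
```

Here the mean is roughly 100 times the median, which is why the median is often the better description of a typical breach.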
These are the 5 data breaches with the largest number of individuals affected:
This plot shows the previous 5 cases as outliers in the data. The output below lists the values flagged as outliers.
## [1] 63551 30799 70320 30000 43000 27113 47000
## [8] 26000 66000 28012 19114 266123 33877 31120
## [15] 25848 22000 46632 18790 18637 93323 55447
## [22] 279663 21665 697800 19564 24809 75000 381504
## [29] 749017 65000 29969 18854 81122 36496 29514
## [36] 33698 64000 300000 28000 25000 18399 87069
## [43] 21880 651971 882590 3466120 23015 3620000 201000
## [50] 29153 31000 22000 27393 40491 19776 400000
## [57] 68631 87314 19898 23341 59000 19397 23000
## [64] 26588 205748 2213597 43961 52076 24188 42372
## [71] 483063 91187 30972 113528 28209 20764 29156
## [78] 84681 54203 10000000 160000 69246 3900000 4500000
## [85] 18213 50000 306789 1100000 20512 90060 39000
## [92] 24967 81463 43068 50000 151626 11000000 78800000
## [99] 697586 38351 355127 557779 63325 18000 19000
## [106] 56694 79000 43890 41000 160000 30000 26115
## [113] 25764 47683 31980 30000 74944 76258 20000
## [120] 35357 307528 82601 33136 2000000 82601 4500000
## [127] 4500000 49714 60582 28300 60582 31677 36400
## [134] 38906 50918 63325 1062509 42713 97000 33702
## [141] 56853 26162 342197 46473 75026 214000 55900
## [148] 27839 55207 405000 41437 398000 22511 25513
## [155] 48752 48752 839711 59000 44000 76183 49000
## [162] 729000 37000 25461 32000 4029530 32151 21000
## [169] 277014 187533 189489 22000 18162 28187 109000
## [176] 18000 43549 29021 56500 19178 56820 27800
## [183] 28893 35488 28187 18000 18000 116506 27799
## [190] 65700 64846 55000 105646 66601 19100 42000
## [197] 228435 315000 780000 27098 20000 50000 943434
## [204] 4900000 55000 1055489 19651 32008 63425 25330
## [211] 78042 400000 32390 24361 22001 1900000 132940
## [218] 84000 93500 514330 20744 1700000 37000 18871
## [225] 231400 156000 24600 398000 115000 475000 1023209
## [232] 19200 33000 19222 22642 24750 21000 25000
## [239] 31700 27000 23753 800000 105470 29000 130495
## [246] 1220000 60998 40000 180111 22012 54165 344579
## [253] 83945 21000 83000 26942 21311 20015 40800
## [260] 502416 38000 31151 18500 417000 301000 1421107
## [267] 33821 105309 19807 19101 44979 205434 44600
## [274] 276057 55947 42625 566236 42200 538127 64487
## [281] 40621 81550 29528 582174 34637 35136 18436
## [288] 63627 134512 63049 24000 53173 279865 29579
## [295] 24000 22000 43563 32000 128000 51232 21856
## [302] 19203 106008 77337 22000 18580 300000 176295
## [309] 56075 500000 20431 19727 80270 65000 85995
## [316] 79930 55700 26873 34055 19000 531000
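The listed values are consistent with the usual boxplot rule, which flags anything more than 1.5 times the interquartile range above the third quartile. Using the quartiles from the summary above, the upper fence is 7784 + 1.5 × (7784 − 981) = 17988.5, and the smallest listed value is 18000. The sketch below applies that rule to a few sample values. It is an assumption that the original used this rule (for example via `boxplot.stats()`), which computes the quartiles slightly differently.

```r
# Quartiles taken from the summary output above.
q1 <- 981
q3 <- 7784

# Upper fence of the 1.5 * IQR rule.
upper_fence <- q3 + 1.5 * (q3 - q1)

# A few sample breach sizes; only values above the fence are outliers.
affected <- c(500, 981, 2261, 7784, 18000, 78800000)
outliers <- affected[affected > upper_fence]
```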
For the following analysis, the top 3 breaches (78.8 million, 11 million, and 10 million individuals) are removed so that the remaining data can be compared without these extreme values distorting the results.
## # A tibble: 1 x 4
## `Name of Covered Entity` State `Covered Entity T… `Individuals Affec…
## <chr> <chr> <chr> <dbl>
## 1 Anthem, Inc. Affiliated Cov… IN Health Plan 78800000
## # A tibble: 1 x 4
## `Name of Covered Entity` State `Covered Entity Typ… `Individuals Affecte…
## <chr> <chr> <chr> <dbl>
## 1 Premera Blue Cross WA Health Plan 11000000
## # A tibble: 1 x 4
## `Name of Covered Entity` State `Covered Entity Typ… `Individuals Affect…
## <chr> <chr> <chr> <dbl>
## 1 Excellus Health Plan, In… NY Health Plan 10000000
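Removing the three largest breaches can be sketched as below. The toy data frame and the names `ranked`, `top3`, and `trimmed` are illustrative, and only the three sizes come from the output above.

```r
breaches <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  `Individuals Affected` = c(78800000, 2500, 11000000, 10000000, 900),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Sort from largest to smallest breach.
ranked <- breaches[order(breaches[["Individuals Affected"]], decreasing = TRUE), ]

# Set the three largest aside and keep the rest for analysis.
top3 <- head(ranked, 3)
trimmed <- ranked[-(1:3), ]
```

Keeping `top3` as its own object makes it easy to add those rows back for tables that should include them.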
After removing the outliers, we see that 2011 and each year from 2013 to 2015 had over 4 million affected individuals.
In 2011, the average breach affected over 60,000 individuals.
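Yearly totals and averages like these can be computed with a grouped summary. The sketch uses base R tapply() on toy data with assumed column names `Year` and `Affected`. With the tidyverse, group_by() followed by summarize() is the equivalent.

```r
breaches <- data.frame(
  Year = c(2011, 2011, 2013, 2013, 2013),
  Affected = c(50000, 70000, 1000, 2000, 3000)
)

# Total individuals affected in each year.
total_by_year <- tapply(breaches$Affected, breaches$Year, sum)

# Average breach size in each year.
mean_by_year <- tapply(breaches$Affected, breaches$Year, mean)
```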
Below is a list of the top 25 breaches by size, including the 3 outliers that were omitted earlier. Those outliers are removed again in later analysis.
Since 2009, the total number of hacking incidents increased every year until 2018, when the number of hacks declined for the first time. This may be due to new classifications for data breach attempts or new data breach sources.
Health Care Providers were the most likely targets of data breach attempts. This suggests that they may be soft targets and need additional resources for securing data.
Friday is the day of the week with the most breaches reported.
##
## Sun Mon Tue Wed Thu Fri Sat
## 26 394 402 380 434 764 42
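A day-of-week count like the one above can be sketched as follows. The sample dates are made up, and the weekday labels are built by hand so that the result does not depend on the system locale.

```r
# Made-up submission dates: two Fridays, one Sunday, one Monday.
dates <- as.Date(c("2015-03-20", "2015-03-20", "2015-03-22", "2015-03-23"))

# POSIXlt stores the weekday as 0 (Sunday) through 6 (Saturday).
labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day <- factor(labels[as.POSIXlt(dates)$wday + 1], levels = labels)

# Count breaches per weekday, keeping days with zero reports.
day_counts <- table(day)
```

Using a factor with all seven levels keeps the columns in calendar order, as in the output above.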
Here is a graphical representation by day of the week:
The following view shows the top 3 breach types and their trends year over year. As mentioned before, Hacking/IT incidents increased every year except for 2018. Similarly, the number of Unauthorized Disclosures increased year over year. However, data breaches by theft have decreased year over year.
Of the 233 observations associated with hacking, most descriptions pointed to phishing attacks on unsuspecting employees or users of the information system.
In the 765 instances of theft, a stolen laptop was a common theme, as was the removal of hard drives or the information stored on them.
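Themes such as phishing or stolen laptops can be picked out of the Web Description text with a keyword search. This is a sketch on invented descriptions; the keywords and object names are assumptions rather than the original method.

```r
# Invented descriptions standing in for the Web Description field.
descriptions <- c(
  "An employee responded to a phishing email and disclosed credentials.",
  "An unencrypted laptop was stolen from a workforce member's vehicle.",
  NA,
  "A hard drive was removed from a storage room."
)

# Case-insensitive keyword flags; grepl() returns FALSE for missing text.
has_phishing <- grepl("phish", descriptions, ignore.case = TRUE)
has_laptop   <- grepl("laptop", descriptions, ignore.case = TRUE)

# Number of descriptions mentioning each theme.
n_phishing <- sum(has_phishing)
n_laptop   <- sum(has_laptop)
```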
I want to compare breaches in 3 states during this same time period: Maryland, Ohio, and Virginia.
Maryland had the largest breach, at CareFirst BlueCross, with 1.1 million individuals affected in 2015.
In Maryland, hacking-related breaches were the most common type of incident, followed by thefts. In Ohio, hacking-related breaches and thefts occurred a similar number of times. In Virginia, theft of data was by far the most common type of breach affecting healthcare organizations.
In total, 2.1 million individuals in Ohio, Maryland, and Virginia were affected by healthcare data breaches. Similar to the national trend in the previous analysis, breaches by theft were the most common type, followed by unauthorized disclosures and breaches by hacking or IT-related incidents.
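The state comparison can be sketched as a filter followed by a cross-tabulation. The toy rows, the two-letter state codes, and the column names `Type` and `Affected` are assumptions.

```r
breaches <- data.frame(
  State = c("MD", "MD", "OH", "VA", "VA", "NY"),
  Type = c("Hacking", "Theft", "Theft", "Theft", "Theft", "Loss"),
  Affected = c(1100000, 900, 2000, 1500, 700, 5000),
  stringsAsFactors = FALSE
)

# Keep only the three states being compared.
three_states <- breaches[breaches$State %in% c("MD", "OH", "VA"), ]

# Breach counts by state (rows) and breach type (columns).
type_by_state <- table(three_states$State, three_states$Type)

# Total individuals affected across the three states.
total_affected <- sum(three_states$Affected)
```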
In a previous analysis, 700+ web descriptions most commonly referenced laptops as an initial source of data breach. The data for Maryland, Ohio, and Virginia is consistent with this trend, as seen in the following chart. This strongly supports migration to a cloud-based system for many of these organizations.