Introduction:

1.1 Purpose:

The Office for Civil Rights (OCR) is responsible for reporting breaches of unsecured Protected Health Information (PHI). Breaches are reported to the Secretary of the U.S. Department of Health and Human Services, and the data used here covers breaches that affect 500 or more individuals. The purpose of this analysis is to generate relevant statistics from that data, identify trends in data breaches, inform the general public about the frequency of such breaches, and notify consumers who may not yet realize that their PHI has been compromised.

1.2 The Data Used

The compiled data includes breach investigations that have been completed as well as breaches currently under investigation. Each record contains 9 key pieces of information:

  1. The organization responsible for the PHI

  2. The state where the breach was reported

  3. The type of organization responsible for the PHI

  4. The number of affected individuals

  5. The date the breach was reported by the covered entity

  6. The type of breach and how unauthorized access to PHI was gained

  7. Where the PHI was when unauthorized access was obtained

  8. Whether a business associate of the covered entity was involved in the breach

  9. A statement explaining what happened and the resolution

1.3 Proposed Analytic Approach

My approach begins with cleaning the data of empty values and duplicate entries. I also reshape the data for later statistical analysis, introducing new variables that extract and summarise key information. I use the functions mutate, summarize, and filter to investigate the figures. I also include visualizations, such as scatterplots and bar graphs, to highlight aspects of the discrete data that are not obvious from the raw tables.

1.4 How This Helps Consumers

Consumers can use this information to determine whether their PHI was compromised by a breach and begin the process of further protecting their information. If consumers suspect they have been a victim of identity theft or fraud, this information can also be used to pursue restitution from the responsible entities. For example, in past breaches for which the federal government was responsible, it provided victims with credit and fraud monitoring services for an extended period to ensure consumers were protected to the fullest extent possible.

Required Packages:

2.1 Packages Used

The following packages are necessary to view the Breach information.

Package       Explanation
tidyverse     For all things tidy
DT            To display some data using data tables
knitr         For bringing R and HTML together
rmdformats    For ready-to-use R Markdown formats
lubridate     To manipulate date formats

Data Preparation:

3.1 Import Data Set

The first data set, “Breach_Archive”, contains data on 2,049 breach investigations that were investigated and ultimately closed.

The second data set, “Breach_Investigation”, contains data on 406 breach investigations currently ongoing.

To identify which records are closed and which are under investigation, a new column “Investigated” was created. This column uses a value of ‘1’ if the data breach investigation is complete, and a ‘0’ if the data breach is currently under investigation.

3.2 Combining Data Sets

To combine the files into a single data set, both data sets must have the same number of columns, and the columns must use the same names. Once they do, the function rbind() is used to stack the data from the Investigation Complete data set directly above the data from the Under Investigation data set.
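The flagging and stacking steps described above can be sketched as follows. This is a minimal illustration on toy data, not the report's actual code: the data frame names and the columns `entity`, `state`, and `individuals_affected` are assumptions chosen for the example.

```r
# Toy stand-ins for the two source files; names and columns are assumptions.
breach_archive <- data.frame(
  entity = c("Clinic A", "Hospital B"),
  state = c("OH", "MD"),
  individuals_affected = c(1200, 56000),
  stringsAsFactors = FALSE
)
breach_investigation <- data.frame(
  entity = "Health Plan C",
  state = "VA",
  individuals_affected = 800,
  stringsAsFactors = FALSE
)

# Flag closed (1) versus ongoing (0) investigations before stacking.
breach_archive$Investigated <- 1
breach_investigation$Investigated <- 0

# rbind() on data frames needs matching column names, so check first.
stopifnot(identical(names(breach_archive), names(breach_investigation)))

# Closed investigations end up above the ongoing ones.
breaches <- rbind(breach_archive, breach_investigation)
print(breaches)
```

Checking the column names explicitly before calling rbind() turns a mismatch into an early, readable failure instead of an error from inside rbind().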

3.3 Cleaning the Data

  • There are 750 blank values within the data. The majority of NA values (742) are in the Web Description field, and these will be addressed in later analysis. The remaining 8 observations with missing values will be removed from the data. First, I will remove two known duplicate rows of data:
  1. row 522 - identified as a duplicate by the ‘duplicated’ function in base R

  2. row 794 - identified as a duplicate entry based on previous data

##           Name of Covered Entity                            State 
##                                0                                3 
##              Covered Entity Type             Individuals Affected 
##                                3                                1 
##           Breach Submission Date                   Type of Breach 
##                                0                                1 
## Location of Breached Information       Business Associate Present 
##                                0                                0 
##                  Web Description                     Investigated 
##                              742                                0
## [1] 522
## [1] 794
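Counts and row numbers like those above could be produced along these lines. This is a hedged sketch on a five-row toy table; the column names are assumptions, and the report's own code may differ.

```r
# Toy table; column names are illustrative assumptions.
toy <- data.frame(
  State = c("OH", NA, "MD", "OH", "VA"),
  Individuals_Affected = c(500, 1200, NA, 500, 900),
  Web_Description = c("Laptop stolen", NA, NA, "Laptop stolen", NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Row numbers that repeat an earlier row exactly.
dup_rows <- which(duplicated(toy))

# Row numbers with a missing State.
missing_state <- which(is.na(toy$State))

# Drop duplicates, then drop rows missing a required field.
# Web_Description is allowed to stay NA, as in the report.
cleaned <- toy[!duplicated(toy), ]
required <- c("State", "Individuals_Affected")
cleaned <- cleaned[complete.cases(cleaned[, required]), ]
print(cleaned)
```

Note that duplicated() only flags exact repeats of an earlier row, which would explain why a second duplicate (row 794 in the report) had to be identified by other means.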

I also add a new variable to the data: a column that displays the year the breach was reported. Here is a sample view of the field, which will be useful in the analysis.

Next, I locate the row numbers of the observations with missing values for State and for Individuals Affected:

## [1]  731 1804 2266
## [1] 1246

…and for the missing values in the Covered Entity Type and Type of Breach fields.

## [1]  948 1003 2182
## [1] 1988

I also create new indicator columns that recognize multiple entries under the Type of Breach category. Each column holds a 1 or a 0, which accounts for a single reported breach involving more than one type of breach.

A similar situation occurs in the Location of Breached Information field, and I repeat this process for those options.
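One way to build such indicator columns is sketched below. The sample strings, the comma-separated layout of the field, and the derived column names are assumptions for illustration only.

```r
# Toy Type of Breach field; the comma-separated layout is an assumption.
toy <- data.frame(
  Type_of_Breach = c(
    "Theft",
    "Hacking/IT Incident, Unauthorized Access/Disclosure",
    "Loss, Theft"
  ),
  stringsAsFactors = FALSE
)

breach_types <- c("Hacking/IT", "Loss", "Theft")

for (type in breach_types) {
  # make.names() turns "Hacking/IT" into the valid column name "Hacking.IT".
  col_name <- make.names(type)
  # fixed = TRUE matches the label literally instead of as a regular expression.
  toy[[col_name]] <- as.integer(grepl(type, toy$Type_of_Breach, fixed = TRUE))
}

print(toy)
```

Because each breach type gets its own 0/1 column, a record such as "Loss, Theft" is counted once under each type, and column sums give the per-type totals.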

3.4 The Combined Data

After this initial cleaning, there are 2,445 total observations of data breaches that are under investigation or have been investigated. Below is a table of the data in alphabetical order by State.

Over 188 million individuals were affected by a HealthCare breach between 2011 and 2016.

3.5 The Data

Each record is identified as having at least 1 of 7 types of breach:

  1. Hacking/IT

  2. Improper_Disposal

  3. Loss

  4. Theft

  5. Unauthorized Access/Disclosure

  6. Unknown

  7. Other

Below is a table with the count of each type of breach. The top 3 are:

  1. Total Thefts (897)

  2. Unauthorized Disclosures (736)

  3. Hacking/IT incidents (539)

Data Analysis

4.1 Number of Reported Breaches

On average, about 77,000 individuals were affected per HealthCare data breach. The median, however, is only 2,261, so the mean is driven by a small number of very large breaches.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      500      981     2261    77264     7784 78800000

These are the 5 data breaches with the largest number of individuals affected:

This plot shows the previous 5 cases as outliers in the data.

##   [1]    63551    30799    70320    30000    43000    27113    47000
##   [8]    26000    66000    28012    19114   266123    33877    31120
##  [15]    25848    22000    46632    18790    18637    93323    55447
##  [22]   279663    21665   697800    19564    24809    75000   381504
##  [29]   749017    65000    29969    18854    81122    36496    29514
##  [36]    33698    64000   300000    28000    25000    18399    87069
##  [43]    21880   651971   882590  3466120    23015  3620000   201000
##  [50]    29153    31000    22000    27393    40491    19776   400000
##  [57]    68631    87314    19898    23341    59000    19397    23000
##  [64]    26588   205748  2213597    43961    52076    24188    42372
##  [71]   483063    91187    30972   113528    28209    20764    29156
##  [78]    84681    54203 10000000   160000    69246  3900000  4500000
##  [85]    18213    50000   306789  1100000    20512    90060    39000
##  [92]    24967    81463    43068    50000   151626 11000000 78800000
##  [99]   697586    38351   355127   557779    63325    18000    19000
## [106]    56694    79000    43890    41000   160000    30000    26115
## [113]    25764    47683    31980    30000    74944    76258    20000
## [120]    35357   307528    82601    33136  2000000    82601  4500000
## [127]  4500000    49714    60582    28300    60582    31677    36400
## [134]    38906    50918    63325  1062509    42713    97000    33702
## [141]    56853    26162   342197    46473    75026   214000    55900
## [148]    27839    55207   405000    41437   398000    22511    25513
## [155]    48752    48752   839711    59000    44000    76183    49000
## [162]   729000    37000    25461    32000  4029530    32151    21000
## [169]   277014   187533   189489    22000    18162    28187   109000
## [176]    18000    43549    29021    56500    19178    56820    27800
## [183]    28893    35488    28187    18000    18000   116506    27799
## [190]    65700    64846    55000   105646    66601    19100    42000
## [197]   228435   315000   780000    27098    20000    50000   943434
## [204]  4900000    55000  1055489    19651    32008    63425    25330
## [211]    78042   400000    32390    24361    22001  1900000   132940
## [218]    84000    93500   514330    20744  1700000    37000    18871
## [225]   231400   156000    24600   398000   115000   475000  1023209
## [232]    19200    33000    19222    22642    24750    21000    25000
## [239]    31700    27000    23753   800000   105470    29000   130495
## [246]  1220000    60998    40000   180111    22012    54165   344579
## [253]    83945    21000    83000    26942    21311    20015    40800
## [260]   502416    38000    31151    18500   417000   301000  1421107
## [267]    33821   105309    19807    19101    44979   205434    44600
## [274]   276057    55947    42625   566236    42200   538127    64487
## [281]    40621    81550    29528   582174    34637    35136    18436
## [288]    63627   134512    63049    24000    53173   279865    29579
## [295]    24000    22000    43563    32000   128000    51232    21856
## [302]    19203   106008    77337    22000    18580   300000   176295
## [309]    56075   500000    20431    19727    80270    65000    85995
## [316]    79930    55700    26873    34055    19000   531000
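The values listed above are consistent with the usual boxplot rule. With a first quartile of 981 and a third quartile of 7,784, the upper fence is 7,784 + 1.5 × 6,803 ≈ 17,989, and every listed value is at least 18,000. Assuming that rule was used (the report does not say so), the summary and the outlier list could be reproduced along these lines on a toy vector:

```r
# Toy breach sizes; the values are illustrative, not the real data.
affected <- c(500, 650, 981, 1500, 2261, 3000, 7784, 9000, 250000)

# Six-number summary: min, quartiles, median, mean, max.
print(summary(affected))

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges.
outliers <- boxplot.stats(affected)$out
print(outliers)

# A single very large value pulls the mean far above the median.
print(mean(affected) > median(affected))
```

boxplot.stats() computes its fences from the hinges returned by fivenum(), which can differ slightly from the quartiles printed by summary() on some data sets.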

4.2 Directed Analysis

For the following analysis, the top 3 breaches (78.8 million, 11 million, and 10 million individuals affected) are removed so that these extreme values do not dominate the averages and totals being compared.

## # A tibble: 1 x 4
##   `Name of Covered Entity`     State `Covered Entity T… `Individuals Affec…
##   <chr>                        <chr> <chr>                            <dbl>
## 1 Anthem, Inc. Affiliated Cov… IN    Health Plan                   78800000
## # A tibble: 1 x 4
##   `Name of Covered Entity` State `Covered Entity Typ… `Individuals Affecte…
##   <chr>                    <chr> <chr>                                <dbl>
## 1 Premera Blue Cross       WA    Health Plan                       11000000
## # A tibble: 1 x 4
##   `Name of Covered Entity`  State `Covered Entity Typ… `Individuals Affect…
##   <chr>                     <chr> <chr>                               <dbl>
## 1 Excellus Health Plan, In… NY    Health Plan                      10000000
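Removing the three largest breaches before further analysis could look like the sketch below. It uses base R on toy data, with assumed column names, rather than the filter() call the report mentions.

```r
# Toy data; entity labels and column names are illustrative assumptions.
toy <- data.frame(
  entity = c("A", "B", "C", "D", "E"),
  individuals_affected = c(78800000, 1200, 11000000, 10000000, 56000),
  stringsAsFactors = FALSE
)

# Row positions of the three largest breaches.
top3_rows <- order(toy$individuals_affected, decreasing = TRUE)[1:3]

# Keep everything except those rows.
trimmed <- toy[-top3_rows, ]

# Compare the mean before and after trimming.
print(mean(toy$individuals_affected))
print(mean(trimmed$individuals_affected))
```

Selecting the rows by rank rather than by a hard-coded threshold keeps the step valid if the underlying data is refreshed.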

After removing the outliers, we see that 2011 and each year from 2013 to 2015 had over 4 million affected individuals.

In 2011, the average breach affected over 60,000 individuals.
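Per-year totals and averages like those quoted above could be derived as follows. This is a base R sketch on toy data with assumed column names. The report lists lubridate, whose year() function would be an equivalent way to extract the year.

```r
# Toy data; dates, counts, and column names are illustrative assumptions.
toy <- data.frame(
  submission_date = as.Date(c(
    "2011-03-04", "2011-09-15", "2013-01-20", "2014-06-02", "2014-11-30"
  )),
  individuals_affected = c(1000, 3000, 500, 2000, 6000)
)

# Derive the reporting year from the submission date.
toy$year <- as.integer(format(toy$submission_date, "%Y"))

# Total and mean individuals affected per year.
totals <- tapply(toy$individuals_affected, toy$year, sum)
means <- tapply(toy$individuals_affected, toy$year, mean)

by_year <- data.frame(
  year = as.integer(names(totals)),
  total = as.vector(totals),
  mean = as.vector(means)
)
print(by_year)
```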

Below is a list of the top 25 breaches by size, including the first 3 outliers that were omitted earlier. Those outliers are again removed in later analysis.

Since 2009, the total number of hacking incidents increased every year until 2018, when the first decline occurred. This may be due to new classifications for data breach attempts or new data breach sources.

Health Care Providers were the most likely targets of data breach attempts. This suggests that they may be soft targets and need additional resources for securing data.

4.3 Additional Analysis

Friday is the day of the week with the most reported breaches.

## 
## Sun Mon Tue Wed Thu Fri Sat 
##  26 394 402 380 434 764  42
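A count by day of the week like the one above can be sketched in base R as shown here. The dates are toy values. The report lists lubridate, whose wday() function with label = TRUE would give similar labels.

```r
# Toy submission dates; illustrative values only.
dates <- as.Date(c(
  "2015-02-13", "2015-02-13", "2015-02-16", "2015-02-18", "2015-02-20"
))

day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday,
# independent of the session locale.
weekday <- factor(day_labels[as.POSIXlt(dates)$wday + 1], levels = day_labels)

# Tabulate, keeping days with zero breaches in the output.
counts <- table(weekday)
print(counts)

# Day with the most reported breaches.
print(names(counts)[which.max(counts)])
```

Using a factor with all seven levels keeps empty days in the table, so the columns always appear in Sunday-to-Saturday order.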

Here is a graphical representation by day of the week:

The following view shows the top 3 breach types and their trends year over year. As mentioned before, Hacking/IT incidents increased every year except for 2018. Similarly, the number of Unauthorized Disclosures increased year over year. However, data breaches by theft have decreased year over year.

Of the 233 observations associated with hacking, most causes involved phishing attacks on unsuspecting employees or users of the information system.

Across 765 instances of theft, a stolen laptop was a common theme, as was the removal of hard drives or the information stored on them.

Exploratory Analysis

5.1 Breaches in Ohio, Maryland, and Virginia

I want to compare breaches in 3 states during this same time period:

  1. Ohio (my current state of residence),
  2. Maryland (my home state), and
  3. Virginia (my birth state)

Maryland had the largest breach of the three states: the CareFirst BlueCross breach, which affected 1.1 million individuals in 2015.

5.2 Type of Breach by State

In Maryland, hacking-related breaches were the most common type of incident, followed by thefts. In Ohio, hacking-related breaches and thefts occurred a similar number of times. In Virginia, theft of data was by far the most common type of breach that impacted HealthCare organizations.

5.3 Type of Breach by State (cont.)

In total, 2.1 million individuals in Ohio, Maryland, and Virginia were affected by data breaches in HealthCare. Similar to the national trend in previous analysis, breaches by theft were the most common type followed by unauthorized disclosures and breaches by hacking or IT-related incidents.

5.4 Support for a cloud-based system

In the earlier analysis of theft-related breaches, 700+ web descriptions most commonly referenced laptops as an initial source of the data breach. The data for Maryland, Ohio, and Virginia is consistent with this trend, as seen in the following chart. This supports migration to a cloud-based system for many of these organizations.