Introduction

The Office for Civil Rights (OCR), which resides within the US Department of Health and Human Services (HHS), is responsible for collecting and reporting disclosures of protected health information (PHI) as mandated by law. Part of the law requires that the OCR report cases where covered entities (CEs, the organizations responsible for protecting health information) have a breach that affects 500 or more individuals. The data reported for each of these breaches include the variables listed in the table below.

Variable                          Description
Name of Covered Entity            Organization responsible for the PHI
State                             US state where the breach was reported
Covered Entity Type               Type of organization responsible for the PHI
Individuals Affected              Number of records affected by the breach
Breach Submission Date            Date the breach was reported by the CE
Type of Breach                    How unauthorized access to the PHI was obtained
Location of Breached Information  Where was the PHI when unauthorized access was obtained
Business Associate Present        Was a business associate such as a consultant or contractor involved in the breach
Web Description                   An optional statement explaining what happened and the resolution

Data Preparation

Before we can begin our analysis, we must complete the data cleaning steps necessary to create effective visualizations. We first remove all rows with any missing data, then delete one of the duplicate observations of the largest breach (by individuals affected). Next, we create a dummy variable for each of the 7 types of breach, which will help with visualizations moving forward. Finally, we convert the Breach Submission Date variable to a proper date format to support later analyses.
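The four cleaning steps can be sketched as follows. This is an illustrative Python sketch, not the R code behind this report: the sample rows, the field names, and the `%m/%d/%Y` date format are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical sample rows; the field names and date format are assumptions.
rows = [
    {"name": "A", "individuals": 1000, "date": "02/13/2015", "type": "Theft"},
    {"name": "A", "individuals": 1000, "date": "02/13/2015", "type": "Theft"},  # duplicate of the largest breach
    {"name": "B", "individuals": 700, "date": "06/01/2014", "type": "Loss"},
    {"name": "C", "individuals": None, "date": "03/09/2012", "type": "Theft"},  # missing value
]

def clean(rows):
    # Step 1: drop every row that has a missing value.
    complete = [r for r in rows if all(v is not None for v in r.values())]

    # Step 2: keep only one copy of the largest breach.
    largest = max(complete, key=lambda r: r["individuals"])
    deduped, seen_largest = [], False
    for r in complete:
        if r == largest:
            if seen_largest:
                continue  # skip the repeated copy
            seen_largest = True
        deduped.append(r)

    # Step 3: one 0/1 dummy column per breach type found in the data.
    # Step 4: parse the submission date string into a date object.
    breach_types = sorted({r["type"] for r in deduped})
    cleaned = []
    for r in deduped:
        row = dict(r)
        for t in breach_types:
            row["is_" + t] = int(r["type"] == t)
        row["date"] = datetime.strptime(r["date"], "%m/%d/%Y").date()
        cleaned.append(row)
    return cleaned

cleaned = clean(rows)
```

Deriving the dummy columns from the distinct values in the data avoids hard-coding the list of breach types.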

Summary Statistics

This table shows the mean number of individuals affected per breach for each type of covered entity, along with the most common type of breach and the state in which breaches were most commonly reported for each entity type.

## # A tibble: 4 × 4
##   `Covered Entity Type`     Average_Number_Of_Individuals_Affe…¹ Most_…² Most_…³
##   <chr>                                                    <dbl> <chr>   <chr>  
## 1 Business Associate                                      59113. Unknown WY     
## 2 Health Plan                                            430358. Unknown WY     
## 3 Healthcare Clearing House                                4438. Unauth… WA     
## 4 Healthcare Provider                                     17470. Unknown WY     
## # … with abbreviated variable names ¹​Average_Number_Of_Individuals_Affected,
## #   ²​Most_Likely_Type_Of_Breach, ³​Most_Common_State_Breach_Occurred_In
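A per-group summary of this kind (a mean plus two most-frequent values) can be sketched as below. This is a Python illustration with hypothetical rows and names, not the code that produced the table above.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical rows: (entity type, individuals affected, breach type, state).
records = [
    ("Health Plan", 1200, "Theft", "TX"),
    ("Health Plan", 800, "Theft", "TX"),
    ("Health Plan", 1000, "Loss", "CA"),
    ("Healthcare Provider", 600, "Loss", "NY"),
]

def most_common(values):
    """Return the most frequent value (the mode) of an iterable."""
    return Counter(values).most_common(1)[0][0]

def summarize(records):
    # Group the rows by entity type.
    groups = defaultdict(list)
    for entity, affected, breach_type, state in records:
        groups[entity].append((affected, breach_type, state))

    # Compute the mean and the two modes for each group.
    summary = {}
    for entity, items in groups.items():
        summary[entity] = {
            "mean_affected": mean(a for a, _, _ in items),
            "most_common_type": most_common(t for _, t, _ in items),
            "most_common_state": most_common(s for _, _, s in items),
        }
    return summary

summary = summarize(records)
```

Counting occurrences is what makes a value "most common"; taking the maximum of a text column would instead return the alphabetically last value.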

Data Visualization

The barplot below shows the number of healthcare data breaches grouped by breach submission year. The most breaches were reported in 2014 and the fewest in 2018.

The table below shows the 25 largest healthcare data breaches in this dataset. For each breach it lists the Name of Covered Entity, Covered Entity Type, Breach Submission Year, Individuals Affected, and Type of Breach.

## # A tibble: 25 × 5
##    `Name of Covered Entity`                      Cover…¹ Breac…² Indiv…³ Type …⁴
##    <chr>                                         <chr>   <fct>     <dbl> <chr>  
##  1 Anthem, Inc. Affiliated Covered Entity        Health… 2015     7.88e7 Hackin…
##  2 Science Applications International Corporati… Busine… 2011     4.9 e6 Loss   
##  3 Advocate Health and Hospitals Corporation, d… Health… 2013     4.03e6 Theft  
##  4 21st Century Oncology                         Health… 2016     2.21e6 Hackin…
##  5 Xerox State Healthcare, LLC                   Busine… 2014     2   e6 Unauth…
##  6 IBM                                           Busine… 2011     1.9 e6 Unknown
##  7 GRM Information Management Services           Busine… 2011     1.7 e6 Theft  
##  8 AvMed, Inc.                                   Health… 2010     1.22e6 Theft  
##  9 Montana Department of Public Health & Human … Health… 2014     1.06e6 Hackin…
## 10 The Nemours Foundation                        Health… 2011     1.06e6 Loss   
## # … with 15 more rows, and abbreviated variable names ¹​`Covered Entity Type`,
## #   ²​`Breach Submission Year`, ³​`Individuals Affected`, ⁴​`Type of Breach`

The table below shows the 10 states with the highest number of individuals affected by data breaches.

## # A tibble: 10 × 2
##    State `Individuals Affected`
##    <chr>                  <dbl>
##  1 IN                  79576765
##  2 FL                   6001825
##  3 VA                   5158001
##  4 IL                   4692107
##  5 TX                   4040208
##  6 CA                   3052133
##  7 NJ                   3051796
##  8 NY                   2782138
##  9 TN                   1724277
## 10 PR                   1704916
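A ranking like the one above comes from summing Individuals Affected within each state and keeping the largest totals. The Python sketch below illustrates that idea with hypothetical rows. It is not the code used for this report.

```python
from collections import defaultdict
from heapq import nlargest

# Hypothetical rows: (state, individuals affected) for each breach.
breaches = [("IN", 500), ("FL", 900), ("IN", 700), ("TX", 650)]

def top_states(breaches, n=10):
    """Sum individuals affected per state and return the n largest totals."""
    totals = defaultdict(int)
    for state, affected in breaches:
        totals[state] += affected
    return nlargest(n, totals.items(), key=lambda item: item[1])

top_two = top_states(breaches, n=2)
```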

The barplot below shows the number of healthcare hacking incidents by month. The highest number of breaches occurred in April, whereas June had a notably low number.

The table below shows the number of breaches by covered entity type. Healthcare Providers experienced the highest volume, with 1,220 breaches. Business Associates experienced 935 fewer, at 285, but still earned second place in this unfortunate ranking. In third place were Health Plans, with 200 separate data breaches, and in last place were Healthcare Clearing Houses, with only 4 breaches.

## # A tibble: 4 × 2
##   `Covered Entity Type`     `Count of Breaches`
##   <chr>                                   <int>
## 1 Business Associate                        285
## 2 Health Plan                               200
## 3 Healthcare Clearing House                   4
## 4 Healthcare Provider                      1220
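The ranking and the gaps between entity types can be checked directly from the four counts in the table. The short Python sketch below does that arithmetic. The variable names are illustrative.

```python
# Counts copied from the table above.
counts = {
    "Business Associate": 285,
    "Health Plan": 200,
    "Healthcare Clearing House": 4,
    "Healthcare Provider": 1220,
}

# Rank entity types from most to fewest breaches.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

total = sum(counts.values())                         # all breaches in the table
gap = ranked[0][1] - ranked[1][1]                    # first place minus second place
provider_share = counts["Healthcare Provider"] / total
```

Here `total` is 1,709 and `gap` is 935, so Healthcare Providers account for roughly 71% of the breaches in the table.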

The graph below shows how many data breaches were reported on each day of the week (Sunday, Monday, etc.), based on the Breach Submission Date. Breaches were most often reported on a Friday and least often on a Sunday.
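Counting by day of the week only requires mapping each submission date to a weekday name. The following Python sketch shows the idea with hypothetical dates. It is an illustration, not the code behind the graph.

```python
from collections import Counter
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Hypothetical submission dates, for illustration only.
submission_dates = [date(2015, 2, 13), date(2015, 2, 20), date(2015, 2, 15)]

def breaches_per_weekday(dates):
    """Return a Counter mapping weekday name to number of submissions."""
    return Counter(WEEKDAYS[d.weekday()] for d in dates)

per_weekday = breaches_per_weekday(submission_dates)
busiest_day = max(per_weekday, key=per_weekday.get)
```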

From the output below we learn that in 2013 and 2014 there were at least 50 breaches from a “Business Associate” covered entity type and at least 150 breaches from a “Healthcare Provider” covered entity type.

## # A tibble: 2 × 1
##   `Breach Submission Year`
##   <fct>                   
## 1 2013                    
## 2 2014
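A filter of this kind counts breaches per year and entity type, then keeps the years that clear both thresholds. The Python sketch below illustrates it. The function name and the generated sample records are hypothetical, and the thresholds are the ones stated above.

```python
from collections import Counter

def years_meeting_thresholds(records, min_associate=50, min_provider=150):
    """Return the years with enough breaches for both entity types.

    `records` is a list of (year, entity_type) pairs, one per breach.
    """
    counts = Counter(records)
    years = sorted({year for year, _ in records})
    return [
        y for y in years
        if counts[(y, "Business Associate")] >= min_associate
        and counts[(y, "Healthcare Provider")] >= min_provider
    ]

# Hypothetical data: 2013 clears both thresholds, 2012 misses the first by one.
records = (
    [(2013, "Business Associate")] * 50
    + [(2013, "Healthcare Provider")] * 150
    + [(2012, "Business Associate")] * 49
    + [(2012, "Healthcare Provider")] * 200
)

qualifying_years = years_meeting_thresholds(records)
```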

Our final [Joel-recommended] visualization raises the question: “How has the type of breach changed for each year?” After grouping by breach submission year, we see that the number of Theft data breaches has decreased drastically over the years, while Unauthorized Access and Hacking/IT breaches have grown over time.

Self-Directed Analytical Questions

Which states saw the highest volume of individuals affected from data breaches?

As we can see from the graph above, Indiana had by far the highest number of individuals affected, roughly 73.6 million more than the second-highest state, Florida.
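The size of the gap between Indiana and Florida follows from the two state totals reported in the top-10 table earlier in this report. A quick Python check of the arithmetic, with illustrative variable names:

```python
# State totals copied from the top-10 states table.
indiana = 79_576_765
florida = 6_001_825

difference = indiana - florida   # additional individuals affected in Indiana
ratio = indiana / florida        # how many times larger Indiana's total is
```

The difference is 73,574,940 individuals, which makes Indiana's total about 13 times Florida's.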

How many Healthcare Provider entity type breaches did the state of Indiana have? How many Health Plan entity type breaches? Business Associate?

It appears that Indiana experienced over 30 data breaches involving Healthcare Providers, over 10 involving Business Associates, and around 7-8 involving Health Plans. We can also see that among the Healthcare Provider breaches, the most common type of breach was Theft. For Business Associates, the most common type of breach was Other, and for Health Plans it was Unauthorized Access/Disclosure.

Which year did Indiana see the largest number of total data breaches?

Indiana experienced its highest number of data breaches in 2012 and 2013, and most of the breaches in these two years were Theft breaches.

For curiosity’s sake, let’s create a table that shows how many individuals were affected by each breach the state of Indiana experienced.

## # A tibble: 51 × 3
##    State `Individuals Affected` `Web Description`                               
##    <chr>                  <dbl> <chr>                                           
##  1 IN                  78800000 "On February 4, 2015, Anthem, Inc. disclosed th…
##  2 IN                    205748 "On January 4, 2016, the covered entity (CE), P…
##  3 IN                    187533 "\\N"                                           
##  4 IN                     63325 "The covered entity (CE), St. Vincent Health, m…
##  5 IN                     63325 "\\N"                                           
##  6 IN                     55000 "$750,000 HIPAA settlement emphasizes the impor…
##  7 IN                     31700 "\\N"                                           
##  8 IN                     28893 "A laptop computer containing the electronic pr…
##  9 IN                     22001 "An unencrypted, password protected laptop comp…
## 10 IN                     20000 "A laptop computer that contained the electroni…
## # … with 41 more rows

This clears up my lingering confusion. At some point in an earlier question, I noticed that the 51 data breaches that occurred in Indiana affected nearly 80,000,000 individuals. I wanted to find the reason why the number of people affected was so high, and it is because the Anthem healthcare data breach occurred in Indiana. The Anthem breach alone affected 78,800,000 people, making it the largest breach in this data set.
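To put the Anthem breach in proportion, its size can be compared with Indiana's state total from the top-10 states table. The Python sketch below does that arithmetic, using illustrative variable names and the figures printed in the tables above.

```python
# Figures copied from the tables in this report.
indiana_total = 79_576_765   # all 51 Indiana breaches combined
anthem = 78_800_000          # the Anthem breach alone

others = indiana_total - anthem          # individuals affected by the remaining 50 rows
anthem_share = anthem / indiana_total    # fraction of the state total due to Anthem
```

The remaining 50 rows account for 776,765 individuals, so the Anthem breach makes up about 99% of Indiana's total.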