Introduction
The Office for Civil Rights (OCR), part of the US Department of Health and Human Services (HHS), is responsible for collecting and reporting breaches of protected health information (PHI) as mandated by law. Part of the law requires the OCR to report cases in which a covered entity (CE, an organization responsible for protecting health information) has a breach that affects 500 or more individuals. The data reported for each of these breaches include the variables listed in the table below.
| Variable | Description |
| --- | --- |
| Name of Covered Entity | Organization responsible for the PHI |
| State | US state where the breach was reported |
| Covered Entity Type | Type of organization responsible for the PHI |
| Individuals Affected | Number of records affected by the breach |
| Breach Submission Date | Date the breach was reported by the CE |
| Type of Breach | How unauthorized access to the PHI was obtained |
| Location of Breached Information | Where was the PHI when unauthorized access was obtained |
| Business Associate Present | Was a business associate such as a consultant or contractor involved in the breach |
| Web Description | An optional statement explaining what happened and the resolution |
Data Preparation
Before we can begin our analysis, we must complete the data cleaning steps needed to create effective visualizations. We first remove all rows with any missing data, then delete one of the two duplicate observations of the largest breach (by individuals affected). Next, we create seven dummy variables, one for each type of breach, which will help with the visualizations that follow. Finally, we convert the Breach Submission Date variable to a proper date format to support the later analyses.
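The four cleaning steps can be sketched in code. The outputs in this report are R tibbles, so the plain Python below is only an illustration of the logic, not the code that was actually run. It assumes the data is a list of dictionaries keyed by the column names in the variable table, that missing values appear as `None` or an empty string, that the duplicate of the largest breach is an exact copy, that multiple breach types in one record are comma-separated, and that dates look like `MM/DD/YYYY`. The function name `prepare` is hypothetical.

```python
from datetime import datetime


def prepare(rows, date_format="%m/%d/%Y"):
    """Clean a list of row dicts (hypothetical helper; the date format is an assumption)."""
    # 1. Keep only rows with no missing values (None or empty string).
    complete = [r for r in rows if all(v not in (None, "") for v in r.values())]
    if not complete:
        return []

    # 2. Keep a single copy of the largest breach if it appears more than once.
    #    Assumes the duplicate is an exact copy of the row.
    largest = max(complete, key=lambda r: r["Individuals Affected"])
    deduped, kept_largest = [], False
    for r in complete:
        if r == largest:
            if kept_largest:
                continue
            kept_largest = True
        deduped.append(r)

    # 3. Build one 0/1 dummy column per breach type found in the data.
    #    Assumes several types in one record are separated by commas.
    breach_types = sorted(
        {t.strip() for r in deduped for t in r["Type of Breach"].split(",")}
    )

    cleaned = []
    for r in deduped:
        row = dict(r)
        present = {t.strip() for t in r["Type of Breach"].split(",")}
        for t in breach_types:
            row[t] = int(t in present)
        # 4. Parse the submission date string into a date object.
        row["Breach Submission Date"] = datetime.strptime(
            r["Breach Submission Date"], date_format
        ).date()
        cleaned.append(row)
    return cleaned
```

Deriving the dummy columns from the values present in the data, instead of hard-coding seven names, keeps the sketch independent of the exact category labels.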
Summary Statistics
This table shows, for each covered entity type, the mean number of individuals affected per breach, the most common type of breach, and the state in which breaches were most often reported.
## # A tibble: 4 × 4
## `Covered Entity Type` Average_Number_Of_Individuals_Affe…¹ Most_…² Most_…³
## <chr> <dbl> <chr> <chr>
## 1 Business Associate 59113. Unknown WY
## 2 Health Plan 430358. Unknown WY
## 3 Healthcare Clearing House 4438. Unauth… WA
## 4 Healthcare Provider 17470. Unknown WY
## # … with abbreviated variable names ¹Average_Number_Of_Individuals_Affected,
## # ²Most_Likely_Type_Of_Breach, ³Most_Common_State_Breach_Occurred_In
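A summary like this could be computed as in the Python sketch below, which is illustrative only and assumes rows stored as dictionaries keyed by the column names; `summarize_by_entity` is a hypothetical name. One point that may be worth checking when reproducing the table: "Unknown" and "WY" sort alphabetically last among the likely values, which is what taking the maximum of a text column would return, whereas the most common value has to be found by counting occurrences, as done here.

```python
from collections import Counter, defaultdict
from statistics import mean


def summarize_by_entity(rows):
    """Mean individuals affected, modal breach type and modal state per entity type."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["Covered Entity Type"]].append(r)

    summary = {}
    for entity, members in sorted(groups.items()):
        # Counter.most_common(1) returns the most frequent value, not the
        # alphabetically largest one.
        breach_counts = Counter(m["Type of Breach"] for m in members)
        state_counts = Counter(m["State"] for m in members)
        summary[entity] = {
            "mean_affected": mean(m["Individuals Affected"] for m in members),
            "most_common_breach": breach_counts.most_common(1)[0][0],
            "most_common_state": state_counts.most_common(1)[0][0],
        }
    return summary
```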
Data Visualization
The bar plot below shows the number of healthcare data breaches by submission year. The most breaches were reported in 2014 and the fewest in 2018.
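The counts behind such a plot can be sketched as follows. This is an illustrative Python version, assuming each row is a dictionary whose `Breach Submission Date` has already been parsed into a `date`; the function name and the sample rows are made up for the example.

```python
from collections import Counter
from datetime import date


def breaches_per_year(rows):
    """Count breaches by the year of their submission date."""
    return Counter(r["Breach Submission Date"].year for r in rows)


# Made-up sample rows, only to show the shape of the input.
sample = [
    {"Breach Submission Date": date(2014, 3, 1)},
    {"Breach Submission Date": date(2014, 7, 9)},
    {"Breach Submission Date": date(2018, 1, 2)},
]
counts = breaches_per_year(sample)
busiest_year = max(counts, key=counts.get)   # year with the most breaches
quietest_year = min(counts, key=counts.get)  # year with the fewest breaches
```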
The table below lists the 25 largest healthcare data breaches in this dataset, showing the Name of Covered Entity, Covered Entity Type, Breach Submission Year, Individuals Affected, and Type of Breach for each.
## # A tibble: 25 × 5
## `Name of Covered Entity` Cover…¹ Breac…² Indiv…³ Type …⁴
## <chr> <chr> <fct> <dbl> <chr>
## 1 Anthem, Inc. Affiliated Covered Entity Health… 2015 7.88e7 Hackin…
## 2 Science Applications International Corporati… Busine… 2011 4.9 e6 Loss
## 3 Advocate Health and Hospitals Corporation, d… Health… 2013 4.03e6 Theft
## 4 21st Century Oncology Health… 2016 2.21e6 Hackin…
## 5 Xerox State Healthcare, LLC Busine… 2014 2 e6 Unauth…
## 6 IBM Busine… 2011 1.9 e6 Unknown
## 7 GRM Information Management Services Busine… 2011 1.7 e6 Theft
## 8 AvMed, Inc. Health… 2010 1.22e6 Theft
## 9 Montana Department of Public Health & Human … Health… 2014 1.06e6 Hackin…
## 10 The Nemours Foundation Health… 2011 1.06e6 Loss
## # … with 15 more rows, and abbreviated variable names ¹`Covered Entity Type`,
## # ²`Breach Submission Year`, ³`Individuals Affected`, ⁴`Type of Breach`
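A ranking like this one can be sketched in a few lines. The Python below is illustrative and assumes rows stored as dictionaries keyed by the column names; `largest_breaches` and the default column selection are hypothetical.

```python
import heapq

COLUMNS = ("Name of Covered Entity", "Covered Entity Type",
           "Individuals Affected", "Type of Breach")


def largest_breaches(rows, n=25, columns=COLUMNS):
    """Return the n rows with the most individuals affected, largest first."""
    top = heapq.nlargest(n, rows, key=lambda r: r["Individuals Affected"])
    # Keep only the columns of interest for display.
    return [{c: r[c] for c in columns} for r in top]
```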
The table below shows the 10 states with the highest total number of individuals affected by data breaches.
## # A tibble: 10 × 2
## State `Individuals Affected`
## <chr> <dbl>
## 1 IN 79576765
## 2 FL 6001825
## 3 VA 5158001
## 4 IL 4692107
## 5 TX 4040208
## 6 CA 3052133
## 7 NJ 3051796
## 8 NY 2782138
## 9 TN 1724277
## 10 PR 1704916
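The state totals come from summing Individuals Affected within each state and keeping the largest sums. A minimal Python sketch, assuming rows stored as dictionaries keyed by the column names (the function name `top_states` is hypothetical):

```python
from collections import Counter


def top_states(rows, n=10):
    """Sum individuals affected per state and return the n largest totals."""
    totals = Counter()
    for r in rows:
        totals[r["State"]] += r["Individuals Affected"]
    # most_common sorts by total, largest first.
    return totals.most_common(n)
```

Because this is a sum, not a count, a single very large breach can place a state at the top of the ranking regardless of how many breaches it had.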
The bar plot below shows the number of healthcare hacking incidents by month. The highest number of incidents occurred in April, whereas June had a noticeably low number.
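The monthly counts could be produced as in the Python sketch below. It is illustrative only, and it assumes that hacking incidents can be identified by the word "Hacking" appearing in Type of Breach and that the submission date is already a `date` object. The function name is hypothetical.

```python
from collections import Counter
from datetime import date


def hacking_by_month(rows):
    """Count hacking-related breaches per calendar month (1 = January)."""
    counts = Counter(
        r["Breach Submission Date"].month
        for r in rows
        if "Hacking" in r["Type of Breach"]  # assumed label for hacking incidents
    )
    # Include months with zero incidents so the plot has all twelve bars.
    return {month: counts.get(month, 0) for month in range(1, 13)}
```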
The table below shows the number of breaches by covered entity type. Healthcare Providers experienced the highest volume, with 1,220 breaches. Business Associates came second with 285, which is 935 fewer. Health Plans were third with 200 breaches, and Healthcare Clearing Houses were last with only 4.
## # A tibble: 4 × 2
## `Covered Entity Type` `Count of Breaches`
## <chr> <int>
## 1 Business Associate 285
## 2 Health Plan 200
## 3 Healthcare Clearing House 4
## 4 Healthcare Provider 1220
The graph below shows how many data breaches were reported on each day of the week (Sunday, Monday, etc.). Breaches were most often reported on a Friday and least often on a Sunday.
The output below shows that 2013 and 2014 each had at least 50 breaches from the “Business Associate” covered entity type and at least 150 breaches from the “Healthcare Provider” covered entity type.
## # A tibble: 2 × 1
## `Breach Submission Year`
## <fct>
## 1 2013
## 2 2014
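A filter of this kind counts breaches per year and entity type, then keeps the years that meet every threshold. The Python sketch below is illustrative; it assumes rows stored as dictionaries with a parsed `Breach Submission Date`, and `years_meeting_thresholds` is a hypothetical name.

```python
from collections import Counter
from datetime import date


def years_meeting_thresholds(rows, thresholds):
    """Return the years in which every entity type reaches its minimum count."""
    counts = Counter(
        (r["Breach Submission Date"].year, r["Covered Entity Type"]) for r in rows
    )
    years = sorted({year for year, _ in counts})
    # A missing (year, entity) pair counts as zero in a Counter.
    return [
        y for y in years
        if all(counts[(y, entity)] >= minimum for entity, minimum in thresholds.items())
    ]
```

For the question above, the thresholds would be `{"Business Associate": 50, "Healthcare Provider": 150}`.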
Our final [Joel-recommended] visualization asks: “How has the type of breach changed from year to year?” After grouping by breach submission year, we see that the number of Theft breaches has decreased sharply over the years, while Unauthorized Access and Hacking/IT breaches have grown.
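The year-by-type counts behind such a chart can be sketched as a small cross-tabulation. The Python below is illustrative; it assumes rows stored as dictionaries with a parsed `Breach Submission Date`, and that a record listing several breach types separates them with commas, in which case it is counted once under each type. The function name is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def breach_types_by_year(rows):
    """Cross-tabulate breach counts as {year: {breach type: count}}."""
    table = defaultdict(Counter)
    for r in rows:
        year = r["Breach Submission Date"].year
        for breach_type in r["Type of Breach"].split(","):
            table[year][breach_type.strip()] += 1
    return {year: dict(counts) for year, counts in sorted(table.items())}
```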
Self-Directed Analytical Questions
Which states saw the highest volume of individuals affected from data breaches?
As shown above, Indiana had by far the highest number of individuals affected, roughly 73.6 million more than the second-highest state, Florida.
How many Healthcare Provider entity type breaches did the state of Indiana have? How many Health Plan entity type breaches? Business Associate?
Indiana appears to have experienced over 30 data breaches involving Healthcare Providers, over 10 involving Business Associates, and around 7 or 8 involving Health Plans. Among the Healthcare Provider breaches, the most common type of breach was Theft. For Business Associates, the most common type was Other, and for Health Plans it was Unauthorized Access/Disclosure.
Which year did Indiana see the largest number of total data breaches?
Indiana experienced its highest numbers of data breaches in 2012 and 2013, and most of the breaches in those two years were Theft.
For curiosity’s sake, let’s create a table showing how many individuals were affected by each breach that Indiana experienced.
## # A tibble: 51 × 3
## State `Individuals Affected` `Web Description`
## <chr> <dbl> <chr>
## 1 IN 78800000 "On February 4, 2015, Anthem, Inc. disclosed th…
## 2 IN 205748 "On January 4, 2016, the covered entity (CE), P…
## 3 IN 187533 "\\N"
## 4 IN 63325 "The covered entity (CE), St. Vincent Health, m…
## 5 IN 63325 "\\N"
## 6 IN 55000 "$750,000 HIPAA settlement emphasizes the impor…
## 7 IN 31700 "\\N"
## 8 IN 28893 "A laptop computer containing the electronic pr…
## 9 IN 22001 "An unencrypted, password protected laptop comp…
## 10 IN 20000 "A laptop computer that contained the electroni…
## # … with 41 more rows
This clears up my lingering confusion. Earlier, I noticed that the 51 data breaches in Indiana affected nearly 80,000,000 individuals, and I wanted to find out why that number was so high. The reason is that the Anthem breach is recorded in Indiana: it alone affected 78,800,000 people, making it the largest breach in this data set.
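The size of the Anthem effect can be checked with a short calculation using the figures from the tables above (79,576,765 individuals across Indiana's 51 breaches, of which 78,800,000 are from Anthem). The Python below is just a worked version of that arithmetic; the variable names are made up.

```python
indiana_total = 79_576_765   # Indiana total from the top-10 states table
anthem = 78_800_000          # Anthem row in the Indiana breakdown
indiana_breaches = 51        # number of rows in the Indiana table

anthem_share = anthem / indiana_total
others_total = indiana_total - anthem
others_mean = others_total / (indiana_breaches - 1)

print(f"Anthem share of Indiana total: {anthem_share:.2%}")  # 99.02%
print(f"Affected by the other 50 breaches: {others_total}")  # 776765
print(f"Mean per remaining breach: {others_mean:.1f}")       # 15535.3
```

Without Anthem, Indiana's total would be about 0.78 million, which would place it well outside the top 10 states listed earlier.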