Introduction

The Office for Civil Rights (OCR) within the US Department of Health and Human Services (HHS) is responsible for collecting and reporting breaches of protected health information (PHI), as mandated by law. Part of the law requires the OCR to report cases where a covered entity (CE, an organization responsible for protecting health information) has a breach that affects 500 or more individuals. The data reported for each of these breaches include:

• Name of the covered entity (Organization responsible for the PHI)

• State (US State where the breach was reported)

• Covered Entity Type (Type of organization responsible for the PHI)

• Individuals Affected (Number of records affected by the breach)

• Breach submission date (Date the breach was reported by the CE)

• Type of breach (how unauthorized access to the PHI was obtained)

• Location of breached information (Where was the PHI when unauthorized access was obtained)

• Business associate present (Was a business associate such as a consultant or contractor involved in the breach)

• Web description (An optional statement explaining what happened and the resolution)

This data was used in an assignment for my R Capstone class. The goal of the assignment was to use dplyr and ggplot2 to visualize and summarize the data given to us. At the end of this article is a self-directed analysis in which I pose two questions that I found interesting and answer them.

Summary Statistics

Before we dive into the data visualizations, here are some summary statistics to further examine the data within the data set:

  1. Total incidents reported and total individuals affected, grouped by type of breach and arranged from highest to lowest number of incidents.

| Type of Breach | Total Incidents | Total Individuals Affected |
|---|---|---|
| Theft | 712 | 18552308 |
| Unauthorized Access/Disclosure | 424 | 5566337 |
| Hacking/IT Incident | 220 | 87385368 |
| Loss | 129 | 7821407 |
| Other | 74 | 923010 |
| Improper Disposal | 57 | 889249 |
| Theft, Unauthorized Access/Disclosure | 25 | 242368 |
| Loss, Theft | 14 | 98965 |
| Unknown | 9 | 1915690 |
| Hacking/IT Incident, Unauthorized Access/Disclosure | 8 | 181253 |
| Other, Unauthorized Access/Disclosure | 7 | 140544 |
| Improper Disposal, Loss | 3 | 5690 |
| Improper Disposal, Loss, Theft | 3 | 53338 |
| Loss, Unauthorized Access/Disclosure | 3 | 3210 |
| Other, Theft | 3 | 10259 |
| Hacking/IT Incident, Other | 2 | 3720 |
| Hacking/IT Incident, Theft, Unauthorized Access/Disclosure | 2 | 13800 |
| Loss, Other | 2 | 34534 |
| Other, Theft, Unauthorized Access/Disclosure | 2 | 28396 |
| Other, Unknown | 2 | 317082 |
| Hacking/IT Incident, Other, Unauthorized Access/Disclosure | 1 | 4354 |
| Hacking/IT Incident, Theft | 1 | 27800 |
| Improper Disposal, Theft | 1 | 501 |
| Improper Disposal, Theft, Unauthorized Access/Disclosure | 1 | 17300 |
| Improper Disposal, Unauthorized Access/Disclosure | 1 | 727 |
| Loss, Other, Theft | 1 | 2600 |
| Loss, Unauthorized Access/Disclosure, Unknown | 1 | 2533 |
| Loss, Unknown | 1 | 7335 |

Based on the table, Theft accounts for the most incidents (712), affecting 18,552,308 individuals in total. Hacking/IT Incident, however, affected far more individuals (87,385,368) across only 220 incidents.
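A summary like this can be sketched with dplyr's `group_by()`, `summarise()`, and `arrange()`. The data frame below is invented toy data, and the column names `type_of_breach` and `individuals_affected` are assumptions rather than the names used in the actual assignment.

```r
library(dplyr)

# Toy stand-in for the OCR data; values and column names are hypothetical.
breaches <- data.frame(
  type_of_breach = c("Theft", "Theft", "Hacking/IT Incident", "Loss", "Theft"),
  individuals_affected = c(1200, 800, 50000, 600, 2500)
)

# One row per breach type: incident count and total individuals affected,
# sorted from most to fewest incidents.
breach_summary <- breaches %>%
  group_by(type_of_breach) %>%
  summarise(
    total_incidents = n(),
    total_individuals = sum(individuals_affected, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_incidents))

print(breach_summary)
```

Even in the toy data, the type with the most incidents is not the type with the most individuals affected, which mirrors the Theft versus Hacking/IT Incident contrast in the table.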

  2. Top 10 breaches by number of individuals affected.

| Name of Covered Entity | Individuals Affected |
|---|---|
| Anthem, Inc. Affiliated Covered Entity | 78800000 |
| Science Applications International Corporation (SA | 4900000 |
| Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Group | 4029530 |
| 21st Century Oncology | 2213597 |
| Xerox State Healthcare, LLC | 2000000 |
| IBM | 1900000 |
| GRM Information Management Services | 1700000 |
| AvMed, Inc. | 1220000 |
| Montana Department of Public Health & Human Services | 1062509 |
| The Nemours Foundation | 1055489 |

The table lists the 10 breaches with the highest number of individuals affected. The largest is the Anthem breach, with 78.8 million individuals affected.

  3. Total incidents reported and total individuals affected, by whether a business associate was present.

| Business Associate Present | Total Incidents | Individuals Affected |
|---|---|---|
| No | 1356 | 105460995 |
| Yes | 353 | 18788683 |

The table shows that more incidents occurred when a business associate was not present: 1,356 versus 353, a difference of about 1,000. These incidents also affected more people in total, 105,460,995 versus 18,788,683.

With these summary statistics in hand, we can start answering the main questions about the data.

Data Visualizations

Number of healthcare data breaches by year

The first question we want to answer is how many healthcare data breaches were reported each year. A bar chart is a good way to visualize this.

The graph shows that the number of breaches peaked in 2013 and 2014, then dropped in 2015. It rose somewhat in 2016 without reaching the 2013 or 2014 levels, fell sharply in 2017, and reached its lowest point in 2018. This matters because it suggests that the protection of health information has improved overall since 2014, although the very low 2018 count may partly reflect incomplete data for that year.
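A bar chart of breaches per year could be built roughly as follows. This is a minimal sketch on invented dates, assuming a date column named `breach_submission_date` (a hypothetical name); it is not the exact code behind the chart discussed above.

```r
library(dplyr)
library(ggplot2)

# Toy data; dates and the column name are hypothetical.
breaches <- data.frame(
  breach_submission_date = as.Date(c(
    "2013-02-01", "2013-07-15", "2014-03-09",
    "2014-11-30", "2014-12-01", "2015-05-20"
  ))
)

# Extract the year from each submission date and count breaches per year.
breaches_by_year <- breaches %>%
  mutate(year = as.integer(format(breach_submission_date, "%Y"))) %>%
  count(year, name = "n_breaches")

# Build the bar chart; call print(year_plot) to draw it.
year_plot <- ggplot(breaches_by_year, aes(x = factor(year), y = n_breaches)) +
  geom_col() +
  labs(x = "Year", y = "Number of breaches")
```

Counting first and then using `geom_col()` keeps the yearly counts available as a data frame, which makes it easy to check the numbers behind the bars.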

Top 25 largest healthcare data breaches

The next question is which healthcare data breaches are the 25 largest, as measured by the number of individuals affected. The table below lists, for each of these breaches, the name of the covered entity, the breach submission date, the covered entity type, the number of individuals affected, and the type of breach.

| Name of Covered Entity | Breach Submission Date | Covered Entity Type | Individuals Affected | Type of Breach |
|---|---|---|---|---|
| Anthem, Inc. Affiliated Covered Entity | 2015-03-13 | Health Plan | 78800000 | Hacking/IT Incident |
| Science Applications International Corporation (SA | 2011-11-04 | Business Associate | 4900000 | Loss |
| Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Group | 2013-08-23 | Healthcare Provider | 4029530 | Theft |
| 21st Century Oncology | 2016-03-04 | Healthcare Provider | 2213597 | Hacking/IT Incident |
| Xerox State Healthcare, LLC | 2014-09-10 | Business Associate | 2000000 | Unauthorized Access/Disclosure |
| IBM | 2011-04-14 | Business Associate | 1900000 | Unknown |
| GRM Information Management Services | 2011-02-11 | Business Associate | 1700000 | Theft |
| AvMed, Inc. | 2010-06-03 | Health Plan | 1220000 | Theft |
| Montana Department of Public Health & Human Services | 2014-07-07 | Health Plan | 1062509 | Hacking/IT Incident |
| The Nemours Foundation | 2011-10-07 | Healthcare Provider | 1055489 | Loss |
| BlueCross BlueShield of Tennessee, Inc. | 2010-11-01 | Health Plan | 1023209 | Theft |
| Sutter Medical Foundation | 2011-11-17 | Healthcare Provider | 943434 | Theft |
| Valley Anesthesiology Consultants, Inc. d/b/a Valley Anesthesiology and Pain Consultants | 2016-08-12 | Healthcare Provider | 882590 | Hacking/IT Incident |
| Horizon Healthcare Services, Inc., doing business as Horizon Blue Cross Blue Shield of New Jersey, and its affiliates | 2014-01-03 | Business Associate | 839711 | Theft |
| Iron Mountain Data Products, Inc. (now known as | 2010-07-19 | Business Associate | 800000 | Loss |
| Utah Department of Technology Services | 2012-04-11 | Business Associate | 780000 | Hacking/IT Incident |
| AHMC Healthcare Inc. and affiliated Hospitals | 2013-10-25 | Healthcare Provider | 729000 | Theft |
| EISENHOWER MEDICAL CENTER | 2011-03-30 | Healthcare Provider | 514330 | Theft |
| Radiology Regional Center, PA | 2016-02-12 | Healthcare Provider | 483063 | Loss |
| Puerto Rico Department of Health - Triple S Management Corp. | 2010-11-04 | Health Plan | 475000 | Unauthorized Access/Disclosure |
| St Joseph Health System | 2014-02-05 | Healthcare Provider | 405000 | Hacking/IT Incident |
| Spartanburg Regional Healthcare System | 2011-05-27 | Healthcare Provider | 400000 | Theft |
| Triple-S Salud, Inc. - Breach Case#2 | 2014-01-24 | Health Plan | 398000 | Theft |
| Triple-S Salud, Inc. | 2010-11-18 | Health Plan | 398000 | Theft |
| Community Health Plan of Washington | 2016-12-21 | Health Plan | 381504 | Hacking/IT Incident |

According to this table, the largest data breach was at Anthem, Inc., where 78.8 million individuals were affected. The breach was reported in 2015; Anthem is a health plan and was the victim of a Hacking/IT Incident. These results give a more in-depth look at the earlier summary statistic on the top 10 breaches by individuals affected. The data matters because it shows which entities were breached, when each breach was reported, and what type of breach it was. It could be used to look for patterns or common trends within this group.

Total healthcare records (individuals affected) exposed by state for the top 10 states

The next visual shows the total number of healthcare records exposed in the top 10 states. The bar graph has one bar per state, and each bar's height is the total number of individuals affected across all breaches in that state.

In this graph, Indiana has by far the highest number of individuals affected, while the other nine states appear to have broadly similar totals. This is because the Anthem breach, which affected about 78.8 million individuals, was reported in Indiana. This information is important because it shows which states have been hurt the most by breaches and prompts the question of what can be done to prevent more.
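The per-state totals behind a chart like this can be sketched with `summarise()` and `slice_max()`. The data and the column names `state` and `individuals_affected` are assumptions for illustration, and the sketch keeps the top 3 states rather than the top 10 so that the toy data stays small.

```r
library(dplyr)

# Toy data; one row per breach. Values and column names are hypothetical.
breaches <- data.frame(
  state = c("IN", "CA", "CA", "TX", "IL"),
  individuals_affected = c(78800000, 500000, 729000, 405000, 4029530)
)

# Sum individuals affected per state, then keep the states with the largest totals.
top_states <- breaches %>%
  group_by(state) %>%
  summarise(total_individuals = sum(individuals_affected), .groups = "drop") %>%
  slice_max(total_individuals, n = 3)  # use n = 10 for a top-10 list

print(top_states)
```

A single very large breach dominates its state's total here, just as the Anthem breach dominates Indiana's bar.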

Number of healthcare hacking incidents by month

Next is a bar graph showing the number of hacking incidents by month, with 1 representing January and 12 representing December. Based on this graph, month 4 (April) has the most Hacking/IT incidents, with March not far behind. These are the only two months that stand out; the counts for the remaining months are relatively consistent. This information is useful for anyone looking for patterns in Hacking/IT breaches, but the month does not offer much insight here, since the bar chart shows little difference between most months.
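The monthly counts could be computed along these lines. This sketch uses invented data and hypothetical column names, and it assumes that any breach whose type mentions "Hacking/IT Incident" counts as a hacking incident, including combined types; the original analysis may have filtered differently.

```r
library(dplyr)

# Toy data; values and column names are hypothetical.
breaches <- data.frame(
  breach_submission_date = as.Date(c(
    "2015-03-13", "2016-04-02", "2016-04-20", "2014-04-07", "2016-08-12"
  )),
  type_of_breach = c(
    "Hacking/IT Incident", "Hacking/IT Incident", "Theft",
    "Hacking/IT Incident, Other", "Hacking/IT Incident"
  )
)

# Keep rows whose type mentions hacking, extract the month number (1-12),
# and count incidents per month.
hacking_by_month <- breaches %>%
  filter(grepl("Hacking/IT Incident", type_of_breach, fixed = TRUE)) %>%
  mutate(month = as.integer(format(breach_submission_date, "%m"))) %>%
  count(month, name = "n_incidents")

print(hacking_by_month)
```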

Number of breaches by covered entity type

The next question looks at the number of breaches for each covered entity type. The entity types in this data set are business associate, health plan, healthcare clearing house, and healthcare provider.

| Covered Entity Type | Number of Breaches |
|---|---|
| Business Associate | 285 |
| Health Plan | 200 |
| Healthcare Clearing House | 4 |
| Healthcare Provider | 1220 |

From the table, healthcare providers have the highest number of breaches, 1,220 in total. Healthcare clearing houses have the lowest, with 4 breaches. This is useful information because it shows the public which types of organization report the most breaches.

Day of the week breaches are most often reported

The next table summarizes how many breaches were reported on each day of the week. Day 1 represents Monday and day 7 represents Sunday.

| Day | Number of Breaches |
|---|---|
| 1 | 286 |
| 2 | 281 |
| 3 | 282 |
| 4 | 300 |
| 5 | 512 |
| 6 | 29 |
| 7 | 19 |

Based on the table, breaches are reported most often on day 5, which is Friday. This kind of information helps show whether the day of the week has any relationship to the number of breaches reported. Few breaches are reported on weekends; most are reported during the week, especially at the end of the working week.
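One way to get a Monday = 1 through Sunday = 7 numbering is base R's `format()` with the `"%u"` specifier. The sketch below uses invented dates and a hypothetical `breach_submission_date` column; the original analysis may have derived the day differently, for example with a date-handling package.

```r
library(dplyr)

# Toy data; dates and the column name are hypothetical.
breaches <- data.frame(
  breach_submission_date = as.Date(c(
    "2015-03-13", "2016-03-04", "2014-07-07", "2016-03-05"
  ))
)

# "%u" gives the ISO weekday number: 1 = Monday, ..., 7 = Sunday.
breaches_by_day <- breaches %>%
  mutate(day = as.integer(format(breach_submission_date, "%u"))) %>%
  count(day, name = "n_breaches")

print(breaches_by_day)
```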

Business Associate and Healthcare Provider

The next question asks in which year(s) business associates had at least 50 breaches and healthcare providers had at least 150 breaches. The table below shows every combination of year and covered entity type that met its threshold.

| Year | Covered Entity Type | Number of Breaches |
|---|---|---|
| 2013 | Business Associate | 64 |
| 2013 | Healthcare Provider | 187 |
| 2014 | Business Associate | 67 |
| 2014 | Healthcare Provider | 179 |
| 2015 | Healthcare Provider | 155 |
| 2016 | Healthcare Provider | 182 |

In 2013 and 2014, there were at least 50 breaches from business associates and at least 150 breaches from healthcare providers. In 2015 and 2016, only healthcare providers met their threshold. This is a very specific question that shows how R can be used to filter values to give a targeted answer rather than a broad or general visual.
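A threshold filter of this kind can be sketched by counting per year and entity type and then filtering on the counts. The data, the column names, and the small thresholds `ba_min` and `hp_min` are all hypothetical; the article's thresholds are 50 and 150.

```r
library(dplyr)

# Toy data; one row per breach. Values and column names are hypothetical.
breaches <- data.frame(
  year = c(rep(2013, 5), rep(2014, 4), rep(2015, 3)),
  covered_entity_type = c(
    rep("Business Associate", 2), rep("Healthcare Provider", 3),
    rep("Business Associate", 1), rep("Healthcare Provider", 3),
    rep("Healthcare Provider", 3)
  )
)

ba_min <- 2  # stand-in for the article's threshold of 50
hp_min <- 3  # stand-in for the article's threshold of 150

# Count breaches per year and entity type, then keep rows meeting their threshold.
qualifying <- breaches %>%
  count(year, covered_entity_type, name = "n_breaches") %>%
  filter(
    (covered_entity_type == "Business Associate" & n_breaches >= ba_min) |
      (covered_entity_type == "Healthcare Provider" & n_breaches >= hp_min)
  )

# Years in which both entity types met their thresholds.
both_years <- qualifying %>%
  count(year) %>%
  filter(n == 2) %>%
  pull(year)

print(qualifying)
print(both_years)
```

Keeping `qualifying` and `both_years` separate reflects the two readings of the question: which rows pass a threshold, and which years pass both.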

Type of Breach Over the Years

The next question we want to answer is how the type of breach (hacking, improper disposal, loss, etc.) has changed from year to year. The table assesses whether Hacking/IT Incident, Theft, or any other type of breach was more prevalent in 2014 and whether this trend was maintained in 2015. The rightmost column shows the total number of breaches for each year.

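| Year | Unauthorized Access/Disclosure | Theft | Hacking/IT Incident | Improper Disposal | Loss | Unknown | Other | Total |
|---|---|---|---|---|---|---|---|---|
| 2009 | 0 | 15 | 0 | 0 | 1 | 0 | 2 | 18 |
| 2010 | 10 | 135 | 8 | 10 | 20 | 0 | 23 | 206 |
| 2011 | 34 | 122 | 17 | 7 | 18 | 7 | 2 | 207 |
| 2012 | 40 | 124 | 17 | 8 | 20 | 2 | 20 | 231 |
| 2013 | 73 | 131 | 27 | 13 | 24 | 3 | 19 | 290 |
| 2014 | 98 | 111 | 37 | 11 | 30 | 1 | 28 | 316 |
| 2015 | 80 | 64 | 25 | 6 | 23 | 0 | 0 | 198 |
| 2016 | 96 | 46 | 71 | 7 | 12 | 0 | 0 | 232 |
| 2017 | 43 | 17 | 32 | 4 | 9 | 0 | 0 | 105 |
| 2018 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |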

In 2014, the total number of breaches was at its highest, but in 2015 it dropped significantly, from 316 to 198. One reason could be that Theft, one of the most common types of breach, dropped from 111 in 2014 to 64 in 2015. From 2015 onward, no breaches of type Unknown or Other were recorded. From 2017 to 2018, the total dropped from 105 to 1. This is important information because it shows how the total number of breaches has developed over the years and how each type of breach has increased or decreased.
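A year-by-type table can be sketched with tidyr's `separate_rows()` and `pivot_wider()`. The sketch assumes that a breach recorded with a combined type such as "Theft, Unauthorized Access/Disclosure" is counted once under each of its types. That is one plausible reading, consistent with the yearly totals summing to 1,804 against the 1,709 incidents in the summary statistics, but it is not confirmed as the original method. The data and column names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy data; values and column names are hypothetical.
breaches <- data.frame(
  year = c(2014, 2014, 2014, 2015, 2015),
  type_of_breach = c(
    "Theft", "Theft, Unauthorized Access/Disclosure",
    "Hacking/IT Incident", "Theft", "Loss"
  )
)

type_by_year <- breaches %>%
  # Split combined types so each type gets its own row (assumption, see above).
  separate_rows(type_of_breach, sep = ", ") %>%
  count(year, type_of_breach) %>%
  # One column per breach type; missing combinations become 0.
  pivot_wider(names_from = type_of_breach, values_from = n, values_fill = 0) %>%
  # Row total across all type columns.
  mutate(total = rowSums(across(-year)))

print(type_by_year)
```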

Self-Directed Analysis

Question 1: In terms of Location of Breached Information (email, laptop, paper/films, desktop computer, network server, electronic medical record, other portable electronic device, other location), which location had the most breaches?

I find this question interesting because it shows where the breaches occurred specifically. I am curious if there is a pattern or if there is a location where the breach occurs more often.

| Email | Laptop | Other Location | Network Server | Paper/Films | Desktop Computer | Electronic Medical Record | Other Portable Electronic Device |
|---|---|---|---|---|---|---|---|
| 177 | 349 | 385 | 273 | 451 | 210 | 104 | 197 |

Based on the table, the location with the most breaches is paper/films (451), followed by other location (385) and laptop (349). I am honestly a bit surprised that paper/films has the most breaches. Since everything is online nowadays, I assumed there would be fewer paper documents and that it would be easier to hack into emails or computers.

Question 2: Which state had the highest number of breaches and across all states, what was the distribution of breach types like? Which type of breach was the most prevalent?

I find this question interesting because it shows which states have many breaches and which have very few. I am also curious to see the distribution of the types of breaches across states. In this graph, the x-axis shows the states, the y-axis shows the number of breaches, and the colors represent the types of breaches.

Based on the graph, California has the most breaches, followed by Texas and Florida, but California has a significant lead. Out of all the types of breaches, Theft and Unauthorized Access/Disclosure are the most consistent types of breaches across the states with Theft being one of the largest values. In both California and Texas, Theft and Unauthorized Access/Disclosure were the highest values, with Hacking/IT Incident being more prevalent in Texas than in California.
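A stacked bar chart of this kind could be built roughly as follows, with one bar per state and one fill color per breach type. The data and column names are invented for illustration, and the sketch treats each recorded type string as its own category; it is not the exact code behind the graph described above.

```r
library(dplyr)
library(ggplot2)

# Toy data; one row per breach. Values and column names are hypothetical.
breaches <- data.frame(
  state = c("CA", "CA", "CA", "TX", "TX", "FL"),
  type_of_breach = c(
    "Theft", "Theft", "Hacking/IT Incident",
    "Theft", "Unauthorized Access/Disclosure", "Loss"
  )
)

# Breaches per state and type: the heights of the stacked segments.
state_type_counts <- breaches %>%
  count(state, type_of_breach, name = "n_breaches")

# Total breaches per state, largest first.
state_totals <- state_type_counts %>%
  group_by(state) %>%
  summarise(total = sum(n_breaches), .groups = "drop") %>%
  arrange(desc(total))

# Stacked bars, with states ordered from most to fewest breaches.
# Call print(state_plot) to draw it.
state_plot <- ggplot(
  state_type_counts,
  aes(x = reorder(state, -n_breaches, FUN = sum), y = n_breaches, fill = type_of_breach)
) +
  geom_col() +
  labs(x = "State", y = "Number of breaches", fill = "Type of breach")
```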