Introduction
The Office for Civil Rights (OCR) within the US Department of Health and Human Services (HHS) is responsible for collecting and reporting breaches of protected health information (PHI) as mandated by law. Part of the law requires that the OCR report cases where covered entities (CEs, the organizations responsible for protecting health information) have a breach that affects 500 or more individuals. The data reported for each of these breaches include:
• Name of the covered entity (Organization responsible for the PHI)
• State (US State where the breach was reported)
• Covered Entity Type (Type of organization responsible for the PHI)
• Individuals Affected (Number of records affected by the breach)
• Breach submission date (Date the breach was reported by the CE)
• Type of breach (how unauthorized access to the PHI was obtained)
• Location of breached information (Where was the PHI when unauthorized access was obtained)
• Business associate present (Was a business associate such as a consultant or contractor involved in the breach)
• Web description (An optional statement explaining what happened and the resolution)
This data was used in an assignment for my R Capstone class. The goal of the assignment was to use dplyr and ggplot2 to summarize and visualize the data given to us. At the end of this article is a self-directed analysis in which I pose two questions that I found interesting and answer them.
Summary Statistics
Before we dive into the data visualizations, here are some summary statistics to further examine the data within the data set:
- Total incidents reported and the total individuals affected, grouped by the type of breach. Arranged from highest to lowest number of incidents.
| Type of Breach | Total Incidents | Total number of Individuals Affected |
|---|---|---|
| Theft | 712 | 18552308 |
| Unauthorized Access/Disclosure | 424 | 5566337 |
| Hacking/IT Incident | 220 | 87385368 |
| Loss | 129 | 7821407 |
| Other | 74 | 923010 |
| Improper Disposal | 57 | 889249 |
| Theft, Unauthorized Access/Disclosure | 25 | 242368 |
| Loss, Theft | 14 | 98965 |
| Unknown | 9 | 1915690 |
| Hacking/IT Incident, Unauthorized Access/Disclosure | 8 | 181253 |
| Other, Unauthorized Access/Disclosure | 7 | 140544 |
| Improper Disposal, Loss | 3 | 5690 |
| Improper Disposal, Loss, Theft | 3 | 53338 |
| Loss, Unauthorized Access/Disclosure | 3 | 3210 |
| Other, Theft | 3 | 10259 |
| Hacking/IT Incident, Other | 2 | 3720 |
| Hacking/IT Incident, Theft, Unauthorized Access/Disclosure | 2 | 13800 |
| Loss, Other | 2 | 34534 |
| Other, Theft, Unauthorized Access/Disclosure | 2 | 28396 |
| Other, Unknown | 2 | 317082 |
| Hacking/IT Incident, Other, Unauthorized Access/Disclosure | 1 | 4354 |
| Hacking/IT Incident, Theft | 1 | 27800 |
| Improper Disposal, Theft | 1 | 501 |
| Improper Disposal, Theft, Unauthorized Access/Disclosure | 1 | 17300 |
| Improper Disposal, Unauthorized Access/Disclosure | 1 | 727 |
| Loss, Other, Theft | 1 | 2600 |
| Loss, Unauthorized Access/Disclosure, Unknown | 1 | 2533 |
| Loss, Unknown | 1 | 7335 |
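Based on the table, Theft accounts for the highest number of incidents (712), with a total of 18,552,308 individuals affected. Hacking/IT Incident affected the most individuals (87,385,368) despite having only 220 incidents.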
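A summary like the one above can be sketched with `group_by()`, `summarise()`, and `arrange()`. This is a minimal illustration rather than the code used for the assignment: the data frame is a toy stand-in, and the column names `type_of_breach` and `individuals_affected` are assumptions.

```r
library(dplyr)

# Toy stand-in for the OCR data; column names and values are assumed.
breaches <- data.frame(
  type_of_breach = c("Theft", "Theft", "Hacking/IT Incident", "Loss"),
  individuals_affected = c(1000, 2500, 78800000, 600),
  stringsAsFactors = FALSE
)

by_type <- breaches %>%
  group_by(type_of_breach) %>%
  summarise(
    total_incidents = n(),                                       # rows per type
    total_individuals = sum(individuals_affected, na.rm = TRUE), # people per type
    .groups = "drop"
  ) %>%
  arrange(desc(total_incidents))                                 # most incidents first
```

Sorting by `total_incidents` rather than `total_individuals` is what puts Theft ahead of Hacking/IT Incident in the table.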
- Top 10 breaches with the highest sum of individuals affected
| Name of Covered Entity | Individuals Affected |
|---|---|
| Anthem, Inc. Affiliated Covered Entity | 78800000 |
| Science Applications International Corporation (SA | 4900000 |
| Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Group | 4029530 |
| 21st Century Oncology | 2213597 |
| Xerox State Healthcare, LLC | 2000000 |
| IBM | 1900000 |
| GRM Information Management Services | 1700000 |
| AvMed, Inc. | 1220000 |
| Montana Department of Public Health & Human Services | 1062509 |
| The Nemours Foundation | 1055489 |
The table lists the 10 covered entities with the most individuals affected. The largest by far is the Anthem breach, with 78.8 million individuals affected.
- Total incidents reported and total individuals affected, split by whether a business associate was present.
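One way to pull out the largest breaches is `slice_max()` (available in dplyr 1.0.0 and later). The sketch below uses four rows copied from the table and keeps the top 3 instead of the top 10; the column names are assumptions.

```r
library(dplyr)

# Four rows copied from the table above; column names are assumed.
breaches <- data.frame(
  name_of_covered_entity = c("IBM", "Anthem, Inc. Affiliated Covered Entity",
                             "AvMed, Inc.", "21st Century Oncology"),
  individuals_affected = c(1900000, 78800000, 1220000, 2213597),
  stringsAsFactors = FALSE
)

# Keep the n largest rows, returned in descending order.
# The article's table would correspond to n = 10.
top_breaches <- breaches %>%
  slice_max(individuals_affected, n = 3)
```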
| Business Associate Present | Total Incidents | Individuals Affected |
|---|---|---|
| No | 1356 | 105460995 |
| Yes | 353 | 18788683 |
The table shows that more incidents occurred when no business associate was present (1,356 versus 353). Those incidents also affected far more individuals (105,460,995 versus 18,788,683).
With these summary statistics in hand, we can start answering the main questions surrounding the data.
Data Visualizations
Number of healthcare data breaches by year
The first question we want to answer is how many healthcare data breaches there are by year. A bar chart is a good way to visualize this.
From the graph, the number of breaches was at its highest in 2013 and
2014, then dropped in 2015. It rose somewhat in 2016 without reaching
the 2013 or 2014 levels, fell significantly in 2017, and reached its
lowest point in 2018. This is crucial because it suggests that the
protection of health information has generally improved since 2014,
with reported breaches recently at their lowest level.
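A bar chart of breaches per year could be built roughly as follows. This is a sketch under assumptions: the dates are toy values, and `breach_submission_date` is an assumed column name holding `Date` values.

```r
library(dplyr)
library(ggplot2)

# Toy submission dates; the column name is assumed.
breaches <- data.frame(
  breach_submission_date = as.Date(c("2013-08-23", "2014-09-10",
                                     "2014-07-07", "2015-03-13"))
)

by_year <- breaches %>%
  mutate(year = as.integer(format(breach_submission_date, "%Y"))) %>%
  count(year, name = "n_breaches")   # one row per year

# factor(year) gives one discrete bar per year; print(p) renders the chart.
p <- ggplot(by_year, aes(x = factor(year), y = n_breaches)) +
  geom_col() +
  labs(x = "Year", y = "Number of breaches")
```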
Top 25 largest healthcare data breaches
The next question we want to answer with this data set is what the top 25 largest healthcare data breaches are. Size is determined by the number of individuals affected. For this question, a table was created listing the name of each entity with one of the largest breaches, the breach submission date, the covered entity type, the number of individuals affected, and the type of breach.
| Name of Covered Entity | Breach Submission Date | Covered Entity Type | Individuals Affected | Type of Breach |
|---|---|---|---|---|
| Anthem, Inc. Affiliated Covered Entity | 2015-03-13 | Health Plan | 78800000 | Hacking/IT Incident |
| Science Applications International Corporation (SA | 2011-11-04 | Business Associate | 4900000 | Loss |
| Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Group | 2013-08-23 | Healthcare Provider | 4029530 | Theft |
| 21st Century Oncology | 2016-03-04 | Healthcare Provider | 2213597 | Hacking/IT Incident |
| Xerox State Healthcare, LLC | 2014-09-10 | Business Associate | 2000000 | Unauthorized Access/Disclosure |
| IBM | 2011-04-14 | Business Associate | 1900000 | Unknown |
| GRM Information Management Services | 2011-02-11 | Business Associate | 1700000 | Theft |
| AvMed, Inc. | 2010-06-03 | Health Plan | 1220000 | Theft |
| Montana Department of Public Health & Human Services | 2014-07-07 | Health Plan | 1062509 | Hacking/IT Incident |
| The Nemours Foundation | 2011-10-07 | Healthcare Provider | 1055489 | Loss |
| BlueCross BlueShield of Tennessee, Inc. | 2010-11-01 | Health Plan | 1023209 | Theft |
| Sutter Medical Foundation | 2011-11-17 | Healthcare Provider | 943434 | Theft |
| Valley Anesthesiology Consultants, Inc. d/b/a Valley Anesthesiology and Pain Consultants | 2016-08-12 | Healthcare Provider | 882590 | Hacking/IT Incident |
| Horizon Healthcare Services, Inc., doing business as Horizon Blue Cross Blue Shield of New Jersey, and its affiliates | 2014-01-03 | Business Associate | 839711 | Theft |
| Iron Mountain Data Products, Inc. (now known as | 2010-07-19 | Business Associate | 800000 | Loss |
| Utah Department of Technology Services | 2012-04-11 | Business Associate | 780000 | Hacking/IT Incident |
| AHMC Healthcare Inc. and affiliated Hospitals | 2013-10-25 | Healthcare Provider | 729000 | Theft |
| EISENHOWER MEDICAL CENTER | 2011-03-30 | Healthcare Provider | 514330 | Theft |
| Radiology Regional Center, PA | 2016-02-12 | Healthcare Provider | 483063 | Loss |
| Puerto Rico Department of Health - Triple S Management Corp. | 2010-11-04 | Health Plan | 475000 | Unauthorized Access/Disclosure |
| St Joseph Health System | 2014-02-05 | Healthcare Provider | 405000 | Hacking/IT Incident |
| Spartanburg Regional Healthcare System | 2011-05-27 | Healthcare Provider | 400000 | Theft |
| Triple-S Salud, Inc. - Breach Case#2 | 2014-01-24 | Health Plan | 398000 | Theft |
| Triple-S Salud, Inc. | 2010-11-18 | Health Plan | 398000 | Theft |
| Community Health Plan of Washington | 2016-12-21 | Health Plan | 381504 | Hacking/IT Incident |
From this table, the entity with the largest data breach was Anthem, Inc., where 78.8 million individuals were impacted. The breach was submitted in 2015, and Anthem is a health plan entity that was the victim of a Hacking/IT incident. These results give a more in-depth look at the earlier summary statistic on the top 10 breaches with the most individuals impacted. This data matters because it shows which entities have had a breach, when it was reported, and what kind of breach each entity suffered. It is a useful starting point for looking for patterns or common trends within this group.
Total healthcare records (individuals affected) exposed by state for the top 10 states
This next visual looks at the total healthcare records exposed in the top 10 states. The bar graph is divided by state and shows the total number of individuals affected across all breaches in each state.
From this graph, Indiana has by far the highest number of individuals
affected, while the other nine states have broadly similar totals. This
is because the Anthem data breach, which affected about 78.8 million
individuals, occurred in Indiana. This is crucial information because
it shows which states have been hurt the most by breaches and prompts
us to consider what may be done moving forward to prevent more.
Number of healthcare hacking incidents by month
Next is a bar graph showing the number of hacking incidents by month.
The data is divided by month, with 1 representing January and 12
representing December.
Based on this graph, month 4 (April) has the most Hacking/IT incidents,
with March not far behind. March and April are the only two months that
stand out; the counts for the remaining months are relatively
consistent. This information is useful for anyone looking for patterns
in Hacking/IT breaches. As the graph shows, the month provides little
insight, since the bar chart shows no large difference between most
months.
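The monthly view can be sketched by filtering to hacking incidents before counting. The rows below are toy data with assumed column names, and the sketch also assumes that a breach listing several types (for example "Hacking/IT Incident, Unauthorized Access/Disclosure") should count as a hacking incident; the original analysis may have treated such rows differently.

```r
library(dplyr)
library(ggplot2)

# Toy rows; column names and values are assumed.
breaches <- data.frame(
  breach_submission_date = as.Date(c("2015-03-13", "2016-03-04",
                                     "2014-07-07", "2013-08-23")),
  type_of_breach = c("Hacking/IT Incident",
                     "Hacking/IT Incident",
                     "Hacking/IT Incident, Unauthorized Access/Disclosure",
                     "Theft"),
  stringsAsFactors = FALSE
)

hacking_by_month <- breaches %>%
  # fixed = TRUE matches the label literally instead of as a regex
  filter(grepl("Hacking/IT Incident", type_of_breach, fixed = TRUE)) %>%
  mutate(month = as.integer(format(breach_submission_date, "%m"))) %>%
  count(month, name = "n_incidents")

# print(p) renders the chart.
p <- ggplot(hacking_by_month, aes(x = factor(month), y = n_incidents)) +
  geom_col() +
  labs(x = "Month (1 = January)", y = "Number of hacking incidents")
```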
Number of breaches by covered entity type
The next question looks at the number of breaches for each covered entity type. The entity types within this data set are business associate, health plan, healthcare clearing house, and healthcare provider.
| Covered Entity Type | Number of Breaches |
|---|---|
| Business Associate | 285 |
| Health Plan | 200 |
| Healthcare Clearing House | 4 |
| Healthcare Provider | 1220 |
From the table, healthcare providers have the highest number of breaches at 1,220 in total. The entity type with the lowest number of breaches is healthcare clearing houses, with a total of 4. This is useful information because it shows the public which types of organization report the most breaches.
Day of the week breaches are most often reported
This next table summarizes how many breaches were reported on each day of the week. The table is grouped by day, where day 1 represents Monday and day 7 represents Sunday.
| Day | Number of Breaches |
|---|---|
| 1 | 286 |
| 2 | 281 |
| 3 | 282 |
| 4 | 300 |
| 5 | 512 |
| 6 | 29 |
| 7 | 19 |
Based on the table, breaches are reported most often on day 5, which is Friday, with 512 reports. This kind of information is important for seeing whether the day of the week has any relationship to the number of breaches reported. Very few breaches are reported on weekends (29 on Saturday and 19 on Sunday); most are reported during the week, especially at the end of the working week.
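A Monday-first day number like the one in the table can be derived with the `%u` date format, which gives the weekday as 1 (Monday) through 7 (Sunday). The sketch below uses three toy dates and an assumed column name.

```r
library(dplyr)

# Toy submission dates; the column name is assumed.
# 2015-03-13 and 2016-03-04 are Fridays, 2014-07-07 is a Monday.
breaches <- data.frame(
  breach_submission_date = as.Date(c("2015-03-13", "2016-03-04", "2014-07-07"))
)

by_weekday <- breaches %>%
  mutate(day = as.integer(format(breach_submission_date, "%u"))) %>%  # 1 = Monday
  count(day, name = "n_breaches")
```

Using `%u` avoids depending on locale-specific weekday names or on a Sunday-first numbering.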
Business Associate and Healthcare Provider
This next question asks for the year(s) in which the number of breaches from business associates was at least 50 and the year(s) in which the number of breaches from healthcare providers was at least 150. Below is a table that shows this data, grouped by year.
| Year | Covered Entity Type | Number of Breaches |
|---|---|---|
| 2013 | Business Associate | 64 |
| 2013 | Healthcare Provider | 187 |
| 2014 | Business Associate | 67 |
| 2014 | Healthcare Provider | 179 |
| 2015 | Healthcare Provider | 155 |
| 2016 | Healthcare Provider | 182 |
In 2013 and 2014, there were at least 50 breaches from business associates and at least 150 breaches from healthcare providers. In 2015 and 2016, only the healthcare provider threshold was met. This is a very specific question that shows how R can be used to filter values to give a precise answer instead of a broad or general visual.
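The filtering step can be sketched as a count per year and entity type followed by a `filter()` with one threshold per type. The data is a toy example with assumed column names, and the thresholds are scaled down to 2 so that the small example produces output; the article's values would be 50 and 150.

```r
library(dplyr)

# Toy rows; column names and values are assumed.
breaches <- data.frame(
  year = rep(c(2013, 2015), times = c(5, 3)),
  covered_entity_type = c("Business Associate", "Business Associate",
                          "Healthcare Provider", "Healthcare Provider",
                          "Healthcare Provider",
                          "Business Associate",
                          "Healthcare Provider", "Healthcare Provider"),
  stringsAsFactors = FALSE
)

ba_min <- 2  # article: 50
hp_min <- 2  # article: 150

over_threshold <- breaches %>%
  count(year, covered_entity_type, name = "n_breaches") %>%
  filter(
    (covered_entity_type == "Business Associate" & n_breaches >= ba_min) |
      (covered_entity_type == "Healthcare Provider" & n_breaches >= hp_min)
  )
```

In this toy data, 2013 keeps both rows while 2015 keeps only the healthcare provider row, mirroring the shape of the table above.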
Type of Breach Over the Years
The next question we want to answer is how the type of breach (hacking, improper disposal, loss, etc.) has changed from year to year. This table shows whether Hacking/IT Incident, Theft, or any other type of breach was more prevalent in 2014 and whether this trend was maintained in 2015. The rightmost column shows the total number of breaches for each year.
| Year | Unauthorized Access/Disclosure | Theft | Hacking/IT Incident | Improper Disposal | Loss | Unknown | Other | total |
|---|---|---|---|---|---|---|---|---|
| 2009 | 0 | 15 | 0 | 0 | 1 | 0 | 2 | 18 |
| 2010 | 10 | 135 | 8 | 10 | 20 | 0 | 23 | 206 |
| 2011 | 34 | 122 | 17 | 7 | 18 | 7 | 2 | 207 |
| 2012 | 40 | 124 | 17 | 8 | 20 | 2 | 20 | 231 |
| 2013 | 73 | 131 | 27 | 13 | 24 | 3 | 19 | 290 |
| 2014 | 98 | 111 | 37 | 11 | 30 | 1 | 28 | 316 |
| 2015 | 80 | 64 | 25 | 6 | 23 | 0 | 0 | 198 |
| 2016 | 96 | 46 | 71 | 7 | 12 | 0 | 0 | 232 |
| 2017 | 43 | 17 | 32 | 4 | 9 | 0 | 0 | 105 |
| 2018 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
In 2014, the total number of breaches was at its highest, but by 2015 the total dropped significantly, from 316 to 198. One reason for this could be that Theft, which is one of the most common types of breach, dropped from 111 in 2014 to 64 in 2015. From 2015 onward, no breaches were recorded as Unknown or Other. From 2017 to 2018, the total number of breaches dropped from 105 to 1, a drop so sharp that it probably reflects incomplete data for 2018 rather than a real change. This is important information because it shows how the total number of breaches has developed over the years, along with how the individual types of breaches have increased or decreased.
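The yearly totals in this table sum to 1,804, while the summary tables earlier count 1,709 incidents, which suggests that a breach listing several types is counted once under each type. A minimal sketch of that counting approach follows; the data is a toy example, the column names are assumptions, and `count_type()` is a hypothetical helper.

```r
library(dplyr)

# Toy rows; column names and values are assumed.
breaches <- data.frame(
  year = c(2014, 2014, 2014, 2015),
  type_of_breach = c("Theft", "Loss, Theft", "Hacking/IT Incident", "Theft"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of rows whose type string mentions `label`.
count_type <- function(x, label) sum(grepl(label, x, fixed = TRUE))

by_year_type <- breaches %>%
  group_by(year) %>%
  summarise(
    theft   = count_type(type_of_breach, "Theft"),
    hacking = count_type(type_of_breach, "Hacking/IT Incident"),
    loss    = count_type(type_of_breach, "Loss"),
    .groups = "drop"
  ) %>%
  mutate(total = theft + hacking + loss)  # multi-type rows add to several columns
```

Here the three 2014 rows produce a total of 4, because the "Loss, Theft" breach contributes to both the Loss and Theft columns.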
Self Directed Analysis
Question 1: In terms of Location of Breached Information (email, laptop, paper/films, desktop computer, network server, electronic medical record, other portable electronic device, other location), which location had the most breaches?
I find this question interesting because it shows specifically where the breached information was located. I am curious whether there is a pattern, or a location where breaches occur more often.
| Email | Laptop | Other Location | Network Server | Paper/Films | Desktop Computer | Electronic Medical Record | Other Portable Electronic Device |
|---|---|---|---|---|---|---|---|
| 177 | 349 | 385 | 273 | 451 | 210 | 104 | 197 |
Based on the table, the location with the most breaches is paper/films. Behind paper/films are other location and laptop, in that order. I am honestly a bit surprised that paper/films has the most breaches. Since everything is online nowadays, I assumed there would be fewer paper documents and that it would be easier to hack into emails or computers.
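Counting locations can be sketched in base R by splitting each location string and tabulating the pieces. The sketch assumes that a breach spanning several locations stores them as one comma-separated string, in the same way the type of breach is stored; the values below are toy data.

```r
# Toy location strings; the comma-separated layout is an assumption.
location <- c("Laptop", "Email, Laptop", "Paper/Films",
              "Paper/Films, Other", "Paper/Films")

# Split on commas, flatten to one vector, and strip surrounding spaces.
parts <- trimws(unlist(strsplit(location, ",", fixed = TRUE)))

# Tabulate and sort so the most frequent location comes first.
location_counts <- sort(table(parts), decreasing = TRUE)
```

Under this approach a breach with two locations adds one to each of them, so the counts can sum to more than the number of breaches.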
Question 2: Which state had the highest number of breaches and across all states, what was the distribution of breach types like? Which type of breach was the most prevalent?
I find this question interesting because it shows which states have a lot of breaches and which have very few. I am also curious to see the distribution of the types of breaches across states. For this graph, the x axis shows the states, the y axis shows the number of breaches, and the colors represent the types of breaches.
Based on the graph, California has the most breaches, followed by Texas
and Florida, but California has a significant lead. Out of all the types
of breaches, Theft and Unauthorized Access/Disclosure are the most
consistent types of breaches across the states with Theft being one of
the largest values. In both California and Texas, Theft and Unauthorized
Access/Disclosure were the highest values, with Hacking/IT Incident
being more prevalent in Texas than in California.
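A stacked bar chart of this kind could be sketched as below, with one bar per state and one fill color per type of breach. The rows are toy data and the column names are assumptions.

```r
library(dplyr)
library(ggplot2)

# Toy rows; column names and values are assumed.
breaches <- data.frame(
  state = c("CA", "CA", "CA", "TX", "TX", "FL"),
  type_of_breach = c("Theft", "Theft", "Unauthorized Access/Disclosure",
                     "Theft", "Hacking/IT Incident", "Loss"),
  stringsAsFactors = FALSE
)

by_state_type <- breaches %>%
  count(state, type_of_breach, name = "n_breaches")

# geom_col() stacks the fill groups within each state by default;
# print(p) renders the chart.
p <- ggplot(by_state_type,
            aes(x = state, y = n_breaches, fill = type_of_breach)) +
  geom_col() +
  labs(x = "State", y = "Number of breaches", fill = "Type of breach")
```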