This document is designed to update the layout of the US Department of Health and Human Services’ page which contains information about data breaches. The previous setup was unappealing and difficult to interpret, so we are redesigning this data’s presentation to facilitate better understanding of breaches.
The data within this document reflects all data breaches in which the protected health information of at least 500 people was compromised. The data available includes the name, State, and type of the entity in question, the number of individuals affected per breach, the date of the breach, and facts surrounding the nature of the breach such as the type, location, and a description of the breach.
I propose using the dplyr and ggplot2 packages to create more aesthetically pleasing visuals. dplyr can help us manipulate the data to uncover insights that are not apparent in the raw data, and ggplot2 can help us display these findings in a way that is visually pleasing and easy to interpret.
Consumers of this analysis can rest assured that they will find the information they seek far more quickly. Data of this nature tends to be dry and difficult to understand, but thanks to our refresh of its presentation, people will be able to find clear, comprehensible visuals relevant to them at a moment’s notice.
| Package | Reason |
|---|---|
| tidyverse | Loads dplyr, ggplot2, and other packages that let us transform and visualize our data |
| readxl | Allows us to read in Excel files |
| knitr | Allows us to display our results in an R Markdown file |
| splitstackshape | Allows us to split columns at a delimiter |
Below is a visualization of the number of breach reports per year. The number of reports grew rapidly between 2009 and 2013 and remained high in the following years, possibly because more sophisticated technology became available to hackers. The most breach reports came in 2017.
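A yearly count like this can be sketched with dplyr and ggplot2. This is a minimal illustration, not the report's actual code: the sample rows are made up, and the column name `submission_date` is an assumption about how the breach date is stored.

```r
library(dplyr)
library(ggplot2)

# Hypothetical sample rows; the column name `submission_date` is an assumption.
breaches <- data.frame(
  submission_date = as.Date(c("2010-03-15", "2010-11-02", "2011-06-20",
                              "2011-07-04", "2011-12-30", "2012-01-09"))
)

# Extract the calendar year and count the reports that fall in each one.
reports_per_year <- breaches %>%
  mutate(year = as.integer(format(submission_date, "%Y"))) %>%
  count(year, name = "reports")

# One bar per year; the bar height is the number of reports.
year_plot <- ggplot(reports_per_year, aes(x = factor(year), y = reports)) +
  geom_col() +
  labs(x = "Year", y = "Breach reports")
```

Printing `year_plot` (or ending an R Markdown chunk with it) renders the chart.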
This visual shows the average size of a breach per year, excluding outliers. It loosely follows the previous graph: breach sizes grew in the early years and then leveled off at a higher level in more recent years. 2014 saw the highest average breach size.
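The report does not say how outliers were identified, so the sketch below assumes one common rule: drop any breach larger than the third quartile plus 1.5 times the interquartile range. The sample rows and column names are hypothetical.

```r
library(dplyr)

# Hypothetical sample rows; column names are assumptions.
breaches <- data.frame(
  year = c(2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011),
  individuals_affected = c(600, 800, 1000, 1200, 500000,
                           900, 1100, 1500, 2000, 900000)
)

# Assumed outlier rule (not stated in the report):
# anything above Q3 + 1.5 * IQR of all breach sizes is treated as an outlier.
upper_fence <- unname(quantile(breaches$individuals_affected, 0.75)) +
  1.5 * IQR(breaches$individuals_affected)

# Drop the outliers, then average the remaining breach sizes within each year.
avg_size_per_year <- breaches %>%
  filter(individuals_affected <= upper_fence) %>%
  group_by(year) %>%
  summarise(avg_size = mean(individuals_affected))
```

A different cutoff (for example a fixed percentile) would change which breaches are excluded and therefore the averages, so the rule is worth stating alongside the chart.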
Below is a table detailing the 20 largest breaches that have taken place, whether the investigation is complete (Status C) or still in progress (Status I). The Anthem breach was the largest by far, affecting nearly 80 million people. The next largest breaches, though still in the millions, are far smaller: the second largest, Premera Blue Cross, affected 11 million people.
| Name of Covered Entity | Individuals Affected | Status |
|---|---|---|
| Anthem, Inc. Affiliated Covered Entity | 78800000 | C |
| Premera Blue Cross | 11000000 | C |
| Excellus Health Plan, Inc. | 10000000 | C |
| Science Applications International Corporation (SA | 4900000 | C |
| University of California, Los Angeles Health | 4500000 | C |
| Community Health Systems Professional Services Corporations | 4500000 | C |
| Community Health Systems Professional Services Corporation | 4500000 | C |
| Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Group | 4029530 | C |
| Medical Informatics Engineering | 3900000 | C |
| Banner Health | 3620000 | C |
| Newkirk Products, Inc. | 3466120 | C |
| 21st Century Oncology | 2213597 | C |
| Xerox State Healthcare, LLC | 2000000 | C |
| IBM | 1900000 | C |
| GRM Information Management Services | 1700000 | C |
| Iowa Health System d/b/a UnityPoint Health | 1421107 | I |
| AvMed, Inc. | 1220000 | C |
| CareFirst BlueCross BlueShield | 1100000 | C |
| Montana Department of Public Health & Human Services | 1062509 | C |
| The Nemours Foundation | 1055489 | C |
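A ranking like the one above amounts to sorting by breach size and keeping the first rows. The sketch below uses a handful of rows copied from the table and keeps the top three instead of the top 20; the column names are assumptions.

```r
library(dplyr)

# A few rows copied from the table above, deliberately out of order.
# Column names are assumptions.
breaches <- data.frame(
  entity = c("Banner Health", "Anthem, Inc. Affiliated Covered Entity",
             "IBM", "Premera Blue Cross", "Excellus Health Plan, Inc."),
  individuals_affected = c(3620000, 78800000, 1900000, 11000000, 10000000),
  status = c("C", "C", "C", "C", "C")
)

# Sort from largest to smallest and keep the first three rows
# (the report keeps 20).
top_breaches <- breaches %>%
  arrange(desc(individuals_affected)) %>%
  head(3)
```

In an R Markdown document, passing `top_breaches` to `knitr::kable()` is one way to render the result as a table.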
The following graph shows the number of breaches associated with a Hacking/IT incident. Hacking is becoming an increasingly prevalent strategy amongst data thieves: the number of hacking incidents has grown steadily over the past decade. The number is down as of last year, but whether that trend continues remains to be seen.
This small table contains the total number of individuals affected per entity type. Health Plan entities have been hit hardest by a large margin, while Healthcare Clearing House entities account for the fewest individuals affected. An entity type of “0” means the data was not available.
| Covered Entity Type | Total Individuals Affected |
|---|---|
| 0 | 20781 |
| Business Associate | 36103511 |
| Health Plan | 111447139 |
| Healthcare Clearing House | 17754 |
| Healthcare Provider | 41383951 |
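Totals per group like these can be computed with dplyr's `group_by()` and `summarise()`. This is a sketch on made-up rows with assumed column names; naming the summary column explicitly gives the table a readable header.

```r
library(dplyr)

# Hypothetical sample rows; column names are assumptions.
breaches <- data.frame(
  entity_type = c("Health Plan", "Healthcare Provider", "Health Plan",
                  "Business Associate", "Healthcare Provider"),
  individuals_affected = c(12000, 700, 3000, 5500, 1300)
)

# Sum the individuals affected within each entity type,
# then list the hardest-hit type first.
totals_by_type <- breaches %>%
  group_by(entity_type) %>%
  summarise(total_affected = sum(individuals_affected)) %>%
  arrange(desc(total_affected))
```

The same pattern, grouped by state instead of entity type, would produce a per-state breakdown.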
This graph shows how many breaches occur on each day of the week. From Monday through Thursday the number of breaches is relatively constant, and it nearly doubles from Thursday to Friday (perhaps the data thieves are trying to catch companies off guard as they head into the weekend). Few reports come in on weekend days. This chart tells me that companies should always be on guard, especially on Fridays.
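A weekday breakdown can be derived from the date column alone. The sketch below uses base R's `as.POSIXlt()`, whose `wday` field numbers the days 0 (Sunday) to 6 (Saturday), so the result does not depend on the system locale. The sample dates and the column name `submission_date` are assumptions.

```r
library(dplyr)

# Hypothetical sample rows; the column name `submission_date` is an assumption.
breaches <- data.frame(
  submission_date = as.Date(c("2017-03-03", "2017-03-06", "2017-03-10",
                              "2017-03-15", "2017-03-17"))
)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Map each date to its weekday name, stored as a factor so the days keep
# their calendar order, then count reports per day (empty days count as 0).
reports_per_weekday <- breaches %>%
  mutate(weekday = factor(day_names[as.POSIXlt(submission_date)$wday + 1],
                          levels = day_names)) %>%
  count(weekday, .drop = FALSE)
```

Keeping `weekday` as an ordered set of factor levels also means a bar chart built from this table plots the days in calendar order instead of alphabetically.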
This series of graphs shows how prevalent each type of breach is in a given year. Theft was the predominant means of data breach for the first few years, until about 2014. In 2014, Unauthorized Access and Hacking started to increase in frequency. This trend continued until theft was eventually dwarfed by Hacking and Unauthorized Access, showing a shift toward strategic hacking attacks and illegal data access.
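One way to build a per-year series like this is sketched below. It assumes, without confirmation from the report, that a single breach can list several types in one comma-separated field and that this is where splitstackshape comes in: `cSplit()` with `direction = "long"` gives each type its own row before counting. The sample rows and column names are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(splitstackshape)

# Hypothetical sample rows; column names and values are assumptions.
breaches <- data.frame(
  year = c(2012, 2012, 2016, 2016, 2016),
  breach_type = c("Theft", "Theft, Unauthorized Access/Disclosure",
                  "Hacking/IT Incident", "Hacking/IT Incident, Theft",
                  "Unauthorized Access/Disclosure"),
  stringsAsFactors = FALSE
)

# Split the combined column at the comma: one row per (breach, type) pair.
types_long <- cSplit(breaches, "breach_type", sep = ",", direction = "long")

# Count how often each type appears in each year.
type_counts <- types_long %>%
  as.data.frame() %>%
  mutate(breach_type = as.character(breach_type)) %>%
  count(year, breach_type)

# One panel per year, with one horizontal bar per breach type.
type_plot <- ggplot(type_counts, aes(x = breach_type, y = n)) +
  geom_col() +
  facet_wrap(~ year) +
  coord_flip() +
  labs(x = "Type of breach", y = "Breaches")
```

With this approach a breach that lists two types is counted once under each, so the per-year bars can add up to more than the number of reports.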
Here is a table breaking down the total individuals affected by the state in which the breach took place, in alphabetical order. Washington, Tennessee, California, Indiana, and New York have each seen more than 10 million individuals affected over the past decade. Ohio seems to be staying relatively safe.
| State | Total Individuals Affected |
|---|---|
| 0 | 39426 |
| AK | 75785 |
| AL | 1146315 |
| AR | 488643 |
| AZ | 4792005 |
| CA | 10002397 |
| CO | 286778 |
| CT | 317492 |
| DC | 40441 |
| DE | 49638 |
| FL | 6862514 |
| GA | 3031358 |
| HI | 55136 |
| IA | 1518108 |
| ID | 19786 |
| IL | 4855819 |
| IN | 84079500 |
| KS | 230376 |
| KY | 1046956 |
| LA | 163418 |
| MA | 406939 |
| MD | 2778182 |
| ME | 10063 |
| MI | 1004053 |
| MN | 391550 |
| MO | 822364 |
| MS | 153069 |
| MT | 1174195 |
| NC | 568779 |
| ND | 17515 |
| NE | 194406 |
| NH | 257191 |
| NJ | 3263956 |
| NM | 77407 |
| NV | 117285 |
| NY | 17150170 |
| OH | 859262 |
| OK | 632455 |
| OR | 420707 |
| PA | 1850138 |
| PR | 1712827 |
| RI | 103750 |
| SC | 765657 |
| SD | 35640 |
| TN | 11460836 |
| TX | 4641456 |
| UT | 895632 |
| VA | 5926242 |
| VT | 6806 |
| WA | 11775135 |
| WI | 254458 |
| WV | 82653 |
| WY | 60467 |
This chart outlines the average number of breaches that have taken place in each calendar month, reading from left to right from January to December. March, April, July, and September are when the most individuals are affected on average, though the variance between months is not large. That said, the last few and first few months of each year seem to be down-times for data thieves, at least compared to the months in the middle of the year.