Introduction

This document is designed to update the layout of the US Department of Health and Human Services’ page which contains information about data breaches. The previous setup was unappealing and difficult to interpret, so we are redesigning this data’s presentation to facilitate better understanding of breaches.

The data within this document reflects all data breaches in which the protected health information of at least 500 people was compromised. The data available includes the name, State, and type of the entity in question, the number of individuals affected per breach, the date of the breach, and facts surrounding the nature of the breach such as the type, location, and a description of the breach.

I propose using the dplyr and ggplot2 packages to create more aesthetically pleasing visuals. dplyr can help us manipulate the data to uncover insights that are not apparent in the raw data, and ggplot2 can help us display these findings in a way that is visually pleasing and easy to interpret.

The consumer of this analysis can rest assured that they can find the information they seek in a much more streamlined fashion. Data of this nature tends to be dry and difficult to understand, but thanks to our refresh of this data’s presentation, people will be able to find clear, comprehensible visuals relevant to them at a moment’s notice.

Required Packages

package.name      package.reason
tidyverse         Loads dplyr, ggplot2, and other packages that allow us to transform and visualize our data
readxl            Allows us to read in Excel files (CSV files are read with readr, which tidyverse loads)
knitr             Allows us to display our results in an R Markdown file
splitstackshape   Allows us to split columns at a delimiter
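
As a small illustration of the last entry in the table, here is a minimal sketch of splitting a column at a delimiter with splitstackshape's cSplit function. The data frame, its column names, and the assumption that several breach types can share one comma-separated cell are all made up for illustration; they are not taken from the report's code.

```r
library(splitstackshape)

# Toy data (made up): the second report lists two breach types in one cell.
breaches <- data.frame(
  entity = c("Entity A", "Entity B"),
  breach_type = c("Theft", "Hacking/IT Incident, Unauthorized Access/Disclosure"),
  stringsAsFactors = FALSE
)

# direction = "long" returns one row per entity and breach type
split_types <- cSplit(breaches, "breach_type", sep = ",", direction = "long")
```

With direction = "wide" the pieces would instead be spread across new columns, which may suit other analyses better.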

Reports Over the Years

Below is a visualization of the number of breach reports per year. The number of reports grew rapidly between 2009 and 2013 and has remained high in the years since. One possible explanation is the growing range of technology available to hackers. The year with the most breach reports was 2017.
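
A yearly count like this one could be produced along the following lines. This is only a sketch: the toy dates and the column name report_date are assumptions, not the names used in the actual data file.

```r
library(dplyr)
library(ggplot2)

# Toy data (made up); the real data has one row per breach report.
# `report_date` is an assumed column name.
breaches <- tibble(
  report_date = as.Date(c("2010-03-01", "2010-07-15", "2011-01-20",
                          "2011-05-02", "2011-11-30", "2012-08-09"))
)

# Extract the year from each date and count the rows per year
reports_per_year <- breaches %>%
  mutate(year = as.integer(format(report_date, "%Y"))) %>%
  count(year, name = "reports")

# Bar chart of reports per year; call print(p) to draw it
p <- ggplot(reports_per_year, aes(x = factor(year), y = reports)) +
  geom_col() +
  labs(x = "Year", y = "Number of breach reports")
```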

Average Breach Size per Year

This visual shows the average size of a breach per year, excluding outliers. It loosely follows the previous graph: breach sizes grew in the early years and have since leveled off at a higher level. 2014 saw the highest average breach size.
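
The report does not say how outliers were excluded. The sketch below assumes one common choice, dropping values more than 1.5 times the interquartile range above the third quartile, and uses made-up numbers and column names.

```r
library(dplyr)

# Toy data (made up); `year` and `individuals_affected` are assumed names.
breaches <- tibble(
  year = c(2010, 2010, 2010, 2011, 2011, 2011, 2011),
  individuals_affected = c(600, 1200, 900, 1500, 2500, 2000, 5000000)
)

# Assumed outlier rule: anything above Q3 + 1.5 * IQR is dropped
upper_fence <- quantile(breaches$individuals_affected, 0.75) +
  1.5 * IQR(breaches$individuals_affected)

# Mean breach size per year after removing the outliers
avg_size_per_year <- breaches %>%
  filter(individuals_affected <= upper_fence) %>%
  group_by(year) %>%
  summarise(avg_affected = mean(individuals_affected), .groups = "drop")
```

In this toy example the single 5,000,000-person breach is excluded, so it does not dominate the 2011 average.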

The Largest Breaches

Below is a table detailing the 20 largest breaches on record, whether the investigation is complete or still in progress (shown in the Status column as C and I, respectively). The Anthem breach was the largest by far, affecting nearly 80 million people. The next largest breaches, though still in the millions, are far smaller: the second largest, Premera Blue Cross, affected 11 million people.

Name of Covered Entity                                                    Individuals Affected   Status
Anthem, Inc. Affiliated Covered Entity                                    78800000               C
Premera Blue Cross                                                        11000000               C
Excellus Health Plan, Inc.                                                10000000               C
Science Applications International Corporation (SA                        4900000                C
University of California, Los Angeles Health                              4500000                C
Community Health Systems Professional Services Corporations               4500000                C
Community Health Systems Professional Services Corporation                4500000                C
Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Group   4029530                C
Medical Informatics Engineering                                           3900000                C
Banner Health                                                             3620000                C
Newkirk Products, Inc.                                                    3466120                C
21st Century Oncology                                                     2213597                C
Xerox State Healthcare, LLC                                               2000000                C
IBM                                                                       1900000                C
GRM Information Management Services                                       1700000                C
Iowa Health System d/b/a UnityPoint Health                                1421107                I
AvMed, Inc.                                                               1220000                C
CareFirst BlueCross BlueShield                                            1100000                C
Montana Department of Public Health & Human Services                      1062509                C
The Nemours Foundation                                                    1055489                C
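
A ranking like the table above can be sketched by sorting on the number of individuals affected and keeping the first rows. The four rows below are copied from the table; the real data would hold every report, and the report keeps 20 rows rather than 2.

```r
library(dplyr)

# A few rows copied from the table above, deliberately out of order
breaches <- tibble(
  `Name of Covered Entity` = c("Premera Blue Cross", "IBM", "Banner Health",
                               "Anthem, Inc. Affiliated Covered Entity"),
  `Individuals Affected` = c(11000000, 1900000, 3620000, 78800000),
  Status = c("C", "C", "C", "C")
)

# Sort from largest to smallest and keep the top rows (2 here, 20 in the report)
largest <- breaches %>%
  arrange(desc(`Individuals Affected`)) %>%
  head(2)
```

In an R Markdown file, knitr::kable(largest) would render the result as a formatted table.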

Hacking/IT Breaches

The following graph shows the number of breaches associated with a Hacking/IT incident. Hacking is becoming an increasingly prevalent strategy among data thieves: the number of hacking incidents has grown steadily over the past decade. The number is down as of last year, but whether that trend continues remains to be seen.
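
Counting hacking incidents per year could look roughly like this. The column names, the exact label "Hacking/IT Incident", and the idea that one report may list several types in the same cell are assumptions made for the sketch.

```r
library(dplyr)

# Toy data (made up); `year` and `breach_type` are assumed column names.
breaches <- tibble(
  year = c(2015, 2015, 2016, 2016, 2016, 2017),
  breach_type = c("Hacking/IT Incident", "Theft", "Hacking/IT Incident",
                  "Hacking/IT Incident, Unauthorized Access/Disclosure",
                  "Loss", "Hacking/IT Incident")
)

# Keep any report whose type mentions hacking, then count per year
hacking_per_year <- breaches %>%
  filter(grepl("Hacking/IT Incident", breach_type, fixed = TRUE)) %>%
  count(year, name = "hacking_reports")
```

Matching on a substring rather than testing for equality means reports that list hacking alongside another breach type are still counted.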

Breaches By Entity

This is a small table containing the total number of individuals affected per entity type. Health Plan entities account for the most affected individuals by a large margin, while Healthcare Clearing House entities account for the fewest of any named type. A Covered Entity Type of “0” means the entity type was not available.

Covered Entity Type         Total Individuals Affected
0                           20781
Business Associate          36103511
Health Plan                 111447139
Healthcare Clearing House   17754
Healthcare Provider         41383951
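
Totals per entity type such as those above come from a grouped sum. The sketch below uses the column names from the table headers, but the individual breach sizes are made up.

```r
library(dplyr)

# Toy data (made up values); one row per breach report
breaches <- tibble(
  `Covered Entity Type` = c("Health Plan", "Healthcare Provider",
                            "Health Plan", "Business Associate"),
  `Individuals Affected` = c(1000, 500, 2500, 800)
)

# Add up the individuals affected within each entity type
affected_by_entity <- breaches %>%
  group_by(`Covered Entity Type`) %>%
  summarise(`Total Individuals Affected` = sum(`Individuals Affected`),
            .groups = "drop")
```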

Breaches by Day of the Week

This graph shows how many breaches occur on each day of the week. From Monday through Thursday, the number of breaches is relatively constant. It then nearly doubles from Thursday to Friday (perhaps the data thieves are trying to catch companies off guard as they head into the weekend). Few reports come in on weekend days. This chart tells me that companies should always be on guard, especially on Fridays.
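
A weekday breakdown can be sketched as follows. The dates and the column name report_date are made up; the weekday is derived with base R's as.POSIXlt, whose wday field runs from 0 (Sunday) to 6 (Saturday), so the result does not depend on the system locale.

```r
library(dplyr)

# Toy data (made up); `report_date` is an assumed column name.
breaches <- tibble(
  report_date = as.Date(c("2017-06-02", "2017-06-09", "2017-06-05",
                          "2017-06-07", "2017-06-10"))
)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Map each date to a weekday name, kept as an ordered factor for plotting
breaches_by_weekday <- breaches %>%
  mutate(weekday = factor(day_names[as.POSIXlt(report_date)$wday + 1],
                          levels = day_names)) %>%
  count(weekday, name = "reports")
```

Storing the weekday as a factor with explicit levels keeps the days in calendar order rather than alphabetical order when the counts are charted.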

Breach Type by Year

This series of graphs shows how prevalent each type of breach was in a given year. Theft was the predominant type of data breach for the first few years, until about 2014. In 2014, Unauthorized Access and Hacking started to increase in frequency. This trend continued until theft was eventually dwarfed by Hacking and Unauthorized Access, showing a shift toward strategic hacking attacks and illegal data access.
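
One way to build a series of graphs like this is to count reports per year and breach type, then draw one panel per type with facet_wrap. The data and column names below are assumptions for the sketch.

```r
library(dplyr)
library(ggplot2)

# Toy data (made up); `year` and `breach_type` are assumed column names.
breaches <- tibble(
  year = c(2012, 2012, 2012, 2016, 2016, 2016),
  breach_type = c("Theft", "Theft", "Hacking/IT Incident",
                  "Hacking/IT Incident", "Hacking/IT Incident", "Theft")
)

# One row per (year, breach type) combination with its report count
type_by_year <- breaches %>%
  count(year, breach_type, name = "reports")

# One panel per breach type; call print(p) to draw it
p <- ggplot(type_by_year, aes(x = factor(year), y = reports)) +
  geom_col() +
  facet_wrap(~ breach_type) +
  labs(x = "Year", y = "Number of breach reports")
```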

Black Market Data: Exploratory Analysis

Those affected by State

Here is a table giving the total number of individuals affected by the state in which the breach took place, in alphabetical order. Indiana, New York, Washington, Tennessee, and California have each seen more than 10 million individuals affected over the past decade, with Indiana alone exceeding 84 million. Ohio, at under 1 million, has fared comparatively well.

State Total Individuals Affected
0 39426
AK 75785
AL 1146315
AR 488643
AZ 4792005
CA 10002397
CO 286778
CT 317492
DC 40441
DE 49638
FL 6862514
GA 3031358
HI 55136
IA 1518108
ID 19786
IL 4855819
IN 84079500
KS 230376
KY 1046956
LA 163418
MA 406939
MD 2778182
ME 10063
MI 1004053
MN 391550
MO 822364
MS 153069
MT 1174195
NC 568779
ND 17515
NE 194406
NH 257191
NJ 3263956
NM 77407
NV 117285
NY 17150170
OH 859262
OK 632455
OR 420707
PA 1850138
PR 1712827
RI 103750
SC 765657
SD 35640
TN 11460836
TX 4641456
UT 895632
VA 5926242
VT 6806
WA 11775135
WI 254458
WV 82653
WY 60467
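
To pick out the hardest-hit states from totals like these, the table can be filtered and sorted. The sketch below copies a handful of state totals from the table above and keeps those over 10 million; the column name total_affected is made up for the example.

```r
library(dplyr)

# A few state totals copied from the table above
state_totals <- tibble(
  State = c("CA", "FL", "IN", "NY", "OH", "TN", "TX", "WA"),
  total_affected = c(10002397, 6862514, 84079500, 17150170,
                     859262, 11460836, 4641456, 11775135)
)

# States with more than 10 million individuals affected, largest first
over_10m <- state_totals %>%
  filter(total_affected > 1e7) %>%
  arrange(desc(total_affected))
```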

Average Affected per Month

This chart shows the average number of individuals affected in each calendar month, reading from left to right from January to December. March, April, July, and September are the months in which the most individuals are affected on average, though the variation between months does not appear to be large. That said, the last few and first few months of each year seem to be down-times for data thieves, at least compared to the months in the middle of the year.
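
A monthly average of this kind could be computed as sketched below. The dates, sizes, and column names are made up, and the sketch assumes the average is taken over individual breach reports pooled across years.

```r
library(dplyr)

# Toy data (made up); `report_date` and `individuals_affected` are assumed names.
breaches <- tibble(
  report_date = as.Date(c("2016-01-10", "2017-01-22", "2016-03-05",
                          "2017-07-19", "2015-07-01")),
  individuals_affected = c(1000, 3000, 8000, 6000, 2000)
)

# month.name is base R's vector of English month names, January to December;
# the mon field of as.POSIXlt runs from 0 (January) to 11 (December)
avg_by_month <- breaches %>%
  mutate(month = factor(month.name[as.POSIXlt(report_date)$mon + 1],
                        levels = month.name)) %>%
  group_by(month) %>%
  summarise(avg_affected = mean(individuals_affected), .groups = "drop")
```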