What kind of data are we working with?
The US Department of Health and Human Services (HHS), through its Office for Civil Rights (OCR), is responsible for collecting and publicly disclosing reports of breaches of protected health information. US law requires OCR to publish cases where covered entities (those who are responsible for medical information) have a single breach that affects 500 or more individuals. This data was sourced from the US Department of Health and Human Services. The following variables are listed in the dataset:
| Variable | Description |
|---|---|
| Name of Covered Entity | Those organizations responsible for maintaining medical information. |
| State | State where the breach took place. |
| Covered Entity Type | The type of organization the covered entity is (e.g., healthcare provider). |
| Individuals Affected | Number of those who were affected by the data breach. |
| Breach Submission Date | The date on which the OCR was notified of the breach. |
| Type of Breach | The method of the data breach. These methods are Hacking/IT Incident, Improper Disposal, Loss, Theft, Unauthorized Access/Disclosure, Other, and Unknown. A single record can list more than one method. |
| Location of Breached Information | Where the medical information was held (e.g., on a laptop). |
| Business Associate Present | Whether a business associate (a third party that handles protected health information on behalf of a covered entity) was involved in the breach (Yes/No). |
| Web Description | A brief description of the breach itself. |
Let’s Make the Data Clean
To make life easier down the road, we need to clean up this data to make it more readable and workable. Besides importing the proper packages to run our analysis, we need to make sure the data is consistent by removing missing data points as well as duplicates. Also, since a single record can list several Types of Breach in the same column (for example, “Theft, Unauthorized Access/Disclosure”), we will have to parse them out into their own columns so that we can generate more accurate visualizations.
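The cleaning steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for this analysis; the sample rows, the `entity` and `type_of_breach` field names, and the helper functions are all hypothetical.

```python
# Hypothetical sample rows that mimic the "Type of Breach" column.
records = [
    {"entity": "A", "type_of_breach": "Theft"},
    {"entity": "B", "type_of_breach": "Theft, Unauthorized Access/Disclosure"},
    {"entity": "B", "type_of_breach": "Theft, Unauthorized Access/Disclosure"},  # exact duplicate
    {"entity": "C", "type_of_breach": None},  # missing value
    {"entity": "D", "type_of_breach": "Hacking/IT Incident, Unauthorized Access/Disclosure"},
]


def clean(rows):
    """Drop rows with a missing field, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None or value == "" for value in row.values()):
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def split_breach_types(row):
    """Split a comma-separated breach-type cell into individual types."""
    return [part.strip() for part in row["type_of_breach"].split(",")]


def add_type_flags(rows):
    """Add one True/False column per breach type found in the data."""
    all_types = sorted({t for row in rows for t in split_breach_types(row)})
    flagged = []
    for row in rows:
        present = set(split_breach_types(row))
        flagged.append({**row, **{t: (t in present) for t in all_types}})
    return flagged, all_types


cleaned = clean(records)
flagged, all_types = add_type_flags(cleaned)
for row in flagged:
    print(row)
```

With one indicator column per breach type, a record that lists two methods is counted under both instead of forming its own combined category.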
Some Summary Statistics for You!
The next few data tables are some basic summary statistics to give you a general idea about what’s happening in the dataset. Later in the reading, there will be some interesting questions and cooler visualizations to help answer them.
What is the Worst Kind of Breach?
There are seven kinds of breach in total, and they vary widely in how many individuals each one affects. The table below shows the worst kind of breach, measured by the average number of individuals affected per breach:
## # A tibble: 1 × 2
## `Type of Breach` avg_vict
## <chr> <dbl>
## 1 Hacking/IT Incident 397206.
Hacking/IT Incident takes the cake as the worst kind of breach. The average number of individuals affected by a Hacking/IT incident is staggeringly high, at roughly 397,000. This is most likely due to the efficiency of Hacking/IT-related breaches, as these are meant to gather as much data as possible and are not limited by human hands.
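For readers who want to see how a figure like `avg_vict` is derived, here is a minimal Python sketch of a group-by-and-average step. It is not the code behind the table above, and the numbers are made up for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (breach type, individuals affected) pairs, not the real data.
rows = [
    ("Hacking/IT Incident", 1000),
    ("Hacking/IT Incident", 3000),
    ("Theft", 600),
    ("Theft", 800),
    ("Loss", 500),
]


def average_by_type(pairs):
    """Return the mean number of individuals affected for each breach type."""
    groups = defaultdict(list)
    for breach_type, affected in pairs:
        groups[breach_type].append(affected)
    return {breach_type: mean(values) for breach_type, values in groups.items()}


averages = average_by_type(rows)
# The "worst" type is the one with the highest average.
worst_type = max(averages, key=averages.get)
print(worst_type, averages[worst_type])
```

Note that an average is sensitive to outliers: a single very large breach can pull the mean for its category far above what a typical incident looks like.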
Which Organizations Had the Largest Breaches?
## # A tibble: 10 × 2
## `Biggest Losers` max_i…¹
## <fct> <dbl>
## 1 Anthem, Inc. Affiliated Covered Entity 7.88e7
## 2 Science Applications International Corporation (SA 4.9 e6
## 3 Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Gr… 4.03e6
## 4 21st Century Oncology 2.21e6
## 5 Xerox State Healthcare, LLC 2 e6
## 6 IBM 1.9 e6
## 7 GRM Information Management Services 1.7 e6
## 8 AvMed, Inc. 1.22e6
## 9 Montana Department of Public Health & Human Services 1.06e6
## 10 The Nemours Foundation 1.06e6
## # … with abbreviated variable name ¹max_individuals
This is a “Top 10 Losers” list for organizations, ranked by the largest number of individuals affected in a single breach. Anthem sets the bar quite high (or low) for the number of individuals affected by a breach, distantly followed by the Science Applications International Corporation (SAIC).
What are the Most Common Methods of Data Breaches?
## # A tibble: 28 × 2
## `Type of Breach` n
## <chr> <int>
## 1 Theft 712
## 2 Unauthorized Access/Disclosure 424
## 3 Hacking/IT Incident 220
## 4 Loss 129
## 5 Other 74
## 6 Improper Disposal 57
## 7 Theft, Unauthorized Access/Disclosure 25
## 8 Loss, Theft 14
## 9 Unknown 9
## 10 Hacking/IT Incident, Unauthorized Access/Disclosure 8
## # … with 18 more rows
Theft is easily the most common way that data is stolen. This isn’t too surprising as humans have gotten good at theft over the course of history, but this doesn’t necessarily mean that Theft is the most efficient way of stealing information (as evidenced by the Hacking/IT Incident breaches). The table has 28 rows rather than 7 because some records list more than one type of breach.
Where is Data Taken From?
## # A tibble: 1,638 × 3
## # Groups: Location of Breached Information [64]
## `Location of Breached Information` Name …¹ total…²
## <chr> <chr> <dbl>
## 1 Network Server Anthem… 7.88e7
## 2 Other Scienc… 4.9 e6
## 3 Desktop Computer Advoca… 4.03e6
## 4 Network Server 21st C… 2.21e6
## 5 Desktop Computer, Email, Laptop, Network Server, Other, Othe… Xerox … 2 e6
## 6 Other IBM 1.9 e6
## 7 Electronic Medical Record, Other GRM In… 1.7 e6
## 8 Laptop AvMed,… 1.22e6
## 9 Network Server Montan… 1.06e6
## 10 Other The Ne… 1.06e6
## # … with 1,628 more rows, and abbreviated variable names
## # ¹`Name of Covered Entity`, ²total_individuals
This table shows where the breached information was held in the largest breaches and which organization was responsible. Network Server being number one is expected, as that is where the data in the biggest breach to date was held. However, the second biggest breach is listed as Other, so we don’t know exactly what happened, just that SAIC lost 4.9 million records. Paper didn’t make the top ten, which suggests that the very largest breaches no longer come from traditional theft of physical records. By this measure, paper could be considered a more secure alternative to online storage.
An In-Depth Look at Data Breaches
## # A tibble: 1,065 × 6
## # Groups: State, Type of Breach, Covered Entity Type, Business Associate
## # Present [541]
## State `Type of Breach` Covered Entity…¹ Busin…² Locat…³ total…⁴
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 IN Hacking/IT Incident Health Plan No Networ… 7.88e7
## 2 VA Loss Business Associ… Yes Other 4.9 e6
## 3 IL Theft Healthcare Prov… No Deskto… 4.03e6
## 4 FL Hacking/IT Incident Healthcare Prov… No Networ… 2.22e6
## 5 TX Unauthorized Access/Disclosure Business Associ… Yes Deskto… 2 e6
## 6 NY Unknown Business Associ… Yes Other 1.9 e6
## 7 NJ Theft Business Associ… Yes Electr… 1.7 e6
## 8 FL Theft Health Plan No Laptop 1.22e6
## 9 CA Theft Healthcare Prov… No Laptop 1.15e6
## 10 MT Hacking/IT Incident Health Plan No Networ… 1.06e6
## # … with 1,055 more rows, and abbreviated variable names
## # ¹`Covered Entity Type`, ²`Business Associate Present`,
## # ³`Location of Breached Information`, ⁴total_individuals
This is another statistic that shows more detailed information on the factors that may (or may not) be influencing how many individuals are affected by these breaches. If you look at the Covered Entity Type column, you’ll notice that Business Associates appear four times in the top 10, more than any other entity type. This may point to mishandling of customer data, as business associates may not be as careful about protecting your data as other entity types would be.
Visual of Healthcare Breaches by Year
2014 was an especially bad year for data breaches. One widely publicized
vulnerability disclosed that year was Heartbleed, a flaw in OpenSSL that
left many servers open to exploitation, though the data here cannot tie
the spike to any single cause. As information security got better, the
number of breaches starts to go down year by year.
Top 25 Breaches
## # A tibble: 25 × 2
## `Biggest Losers` max_i…¹
## <fct> <dbl>
## 1 Anthem, Inc. Affiliated Covered Entity 7.88e7
## 2 Science Applications International Corporation (SA 4.9 e6
## 3 Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Gr… 4.03e6
## 4 21st Century Oncology 2.21e6
## 5 Xerox State Healthcare, LLC 2 e6
## 6 IBM 1.9 e6
## 7 GRM Information Management Services 1.7 e6
## 8 AvMed, Inc. 1.22e6
## 9 Montana Department of Public Health & Human Services 1.06e6
## 10 The Nemours Foundation 1.06e6
## 11 BlueCross BlueShield of Tennessee, Inc. 1.02e6
## 12 Sutter Medical Foundation 9.43e5
## 13 Valley Anesthesiology Consultants, Inc. d/b/a Valley Anesthesiology … 8.83e5
## 14 Horizon Healthcare Services, Inc., doing business as Horizon Blue Cr… 8.40e5
## 15 Iron Mountain Data Products, Inc. (now known as 8 e5
## 16 Utah Department of Technology Services 7.8 e5
## 17 AHMC Healthcare Inc. and affiliated Hospitals 7.29e5
## 18 EISENHOWER MEDICAL CENTER 5.14e5
## 19 Radiology Regional Center, PA 4.83e5
## 20 Puerto Rico Department of Health - Triple S Management Corp. 4.75e5
## 21 St Joseph Health System 4.05e5
## 22 Spartanburg Regional Healthcare System 4 e5
## 23 Triple-S Salud, Inc. 3.98e5
## 24 Triple-S Salud, Inc. - Breach Case#2 3.98e5
## 25 Community Health Plan of Washington 3.82e5
## # … with abbreviated variable name ¹max_individuals
Anthem’s breach leaked the information of 78.8 million users, or about 1 in every 3 insurance users.
Top 10 States by Data Loss
Indiana seems to be the worst state to be in for healthcare insurance
data protection. This is skewed by the Anthem breach, but that breach
was the largest in history. To not include it would be to discount quite
an important data point.
Number of Healthcare Hacking Incidents by Month
April seems to be the time when hackers like crawling out to steal data.
Every month, denoted by its respective number (e.g., 1 = January), still
has a high number of hacking incidents. It is 2023, so insurance companies
need to be at the top of their IT game to avoid Hacking/IT incidents and
prevent more Anthem-scale losses.
Number of Breaches per Entity Type
##
## Business Associate Health Plan Healthcare Clearing House
## 285 200 4
## Healthcare Provider
## 1220
Healthcare Providers are the most likely to lose your insurance data, with 1,220 recorded breaches. While these entities are more likely to have a breach due to the nature of their work, the gap is still striking: Healthcare Providers have 935 more breaches than the next entity type (Business Associate, with 285).
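The gap quoted above can be checked directly from the counts in the table. The short Python sketch below does that and also works out each entity type’s share of all breaches; the share calculation is an addition for illustration and assumes the four counts shown cover every record.

```python
# Breach counts per entity type, copied from the table above.
breaches_by_entity = {
    "Business Associate": 285,
    "Health Plan": 200,
    "Healthcare Clearing House": 4,
    "Healthcare Provider": 1220,
}

# Rank entity types from most to fewest breaches.
ranked = sorted(breaches_by_entity.items(), key=lambda item: item[1], reverse=True)
(top_name, top_count), (second_name, second_count) = ranked[0], ranked[1]

gap = top_count - second_count
total = sum(breaches_by_entity.values())
top_share = top_count / total

print(f"{top_name} leads {second_name} by {gap} breaches")
print(f"{top_name} share of all breaches: {top_share:.1%}")
```

On these counts, Healthcare Providers account for roughly 71% of the recorded breaches.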
On What Day of the Week (Sunday, Monday, etc.) are Breaches Most Often Reported?
## # A tibble: 7 × 2
## # Groups: weekday [7]
## weekday n
## <ord> <int>
## 1 Fri 512
## 2 Thu 300
## 3 Mon 286
## 4 Wed 282
## 5 Tue 281
## 6 Sat 29
## 7 Sun 19
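This tells us that Fridays are the biggest days for breach reports, with 512 breaches submitted on a Friday. This is followed by Thursday and Monday. Keep in mind that this is the date the OCR was notified, not necessarily the date the breach occurred, so the Friday peak may partly reflect organizations filing their reports before the weekend. It is also possible that, as the weekend approaches, employees are less likely to notice data breaches or missing items. Surprisingly, the weekends are uneventful, with only 48 breaches reported on Saturdays and Sundays combined.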
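A weekday count like the one above comes from converting each Breach Submission Date to its day of the week and tallying the results. Here is a minimal Python sketch of that step, not the code behind the table; the sample dates are hypothetical, and the `MM/DD/YYYY` date format is an assumption about how the dates are stored.

```python
from collections import Counter
from datetime import datetime

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical submission dates in an assumed MM/DD/YYYY format.
submission_dates = ["01/02/2015", "01/09/2015", "01/05/2015", "01/03/2015"]


def weekday_counts(date_strings, fmt="%m/%d/%Y"):
    """Count how many dates fall on each weekday, most frequent first."""
    counts = Counter()
    for text in date_strings:
        day = datetime.strptime(text, fmt).date()
        counts[WEEKDAY_NAMES[day.weekday()]] += 1
    return counts.most_common()


result = weekday_counts(submission_dates)
print(result)
```

Using a fixed list of weekday names instead of `strftime("%a")` keeps the labels independent of the machine’s locale.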
In which year (or years) were there at least 50 breaches from a ‘Business Associate’ covered entity type and at least 150 breaches from a healthcare provider covered entity type?
## # A tibble: 6 × 3
## # Groups: Covered Entity Type [2]
## `Covered Entity Type` breach_year count
## <chr> <fct> <int>
## 1 Business Associate 2013 64
## 2 Business Associate 2014 67
## 3 Healthcare Provider 2013 187
## 4 Healthcare Provider 2014 179
## 5 Healthcare Provider 2015 155
## 6 Healthcare Provider 2016 182
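Reading the table, 2013 and 2014 are the only years that appear under both entity types, so they are the years that meet both thresholds. The Python sketch below reproduces that reasoning using the six rows shown above; it is an illustration, not the code used to build the table, and the function name is hypothetical.

```python
# (entity type, year, breach count) rows copied from the table above.
counts = [
    ("Business Associate", 2013, 64),
    ("Business Associate", 2014, 67),
    ("Healthcare Provider", 2013, 187),
    ("Healthcare Provider", 2014, 179),
    ("Healthcare Provider", 2015, 155),
    ("Healthcare Provider", 2016, 182),
]

# Minimum number of breaches required for each entity type.
THRESHOLDS = {"Business Associate": 50, "Healthcare Provider": 150}


def years_meeting_all(rows, thresholds):
    """Return the years in which every entity type meets its threshold."""
    qualifying = {entity: set() for entity in thresholds}
    for entity, year, n in rows:
        if entity in thresholds and n >= thresholds[entity]:
            qualifying[entity].add(year)
    # A year counts only if it qualifies for every entity type.
    return sorted(set.intersection(*qualifying.values()))


years = years_meeting_all(counts, THRESHOLDS)
print(years)
```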
How has the type of breach (hacking, improper disposal, loss, etc.) changed for each year?
Theft is still a classic way of stealing data. Despite the
advancement of technology, common theft has been and still is the most
common method of data breach. In 2013, you start to see Hacking/IT
incidents show up more often. By 2016, this breach category is even
higher. While companies should invest more into keeping their digital
information secure, they still need to be on the lookout for common
theft and other traditional methods of breaching data.
Do certain methods of breach occur on different days of the week?
As we saw before, Friday is the biggest breach day. However, there are
almost no Hacking/IT incidents on the weekends (to my surprise). These
types of incidents occur mainly on Mondays and Fridays. I had expected
the data to be in its most vulnerable state on the weekends since
the IT workers aren’t constantly monitoring their networks.
Interestingly, Thefts skyrocket on Fridays. These could primarily happen in
the evening as the workers are headed home for their weekends off.