What kind of data are we working with?

The US Department of Health and Human Services (HHS), through its Office for Civil Rights (OCR), is responsible for collecting and publicly reporting breaches of protected health information. US law requires covered entities (those responsible for medical information) to report any breach affecting 500 or more individuals, and OCR must publish those cases. This data was sourced from HHS. The dataset contains the following variables:

| Variable | Description |
| --- | --- |
| Name of Covered Entity | The organization responsible for maintaining the medical information. |
| State | State where the breach took place. |
| Covered Entity Type | The type of organization the covered entity is (e.g., healthcare provider). |
| Individuals Affected | Number of individuals affected by the data breach. |
| Breach Submission Date | The date on which OCR was notified of the breach. |
| Type of Breach | The method of the data breach: Hacking/IT Incident, Improper Disposal, Loss, Theft, Unauthorized Access/Disclosure, Other, or Unknown. |
| Location of Breached Information | Where the medical information was held (e.g., on a laptop). |
| Business Associate Present | Whether a business associate was involved in the breach. |
| Web Description | A brief description of the breach itself. |

Let’s Make the Data Clean

To make life easier down the road, we need to clean up this data so it is more readable and workable. Besides importing the packages needed for the analysis, we make the data consistent by removing rows with missing values as well as duplicates. Also, a single record can list several breach types in the same Type of Breach cell (for example, “Theft, Unauthorized Access/Disclosure”), so we parse them out into their own columns to generate more accurate visualizations.
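
The cleaning steps described above might look roughly like the following. This is a minimal sketch in plain Python, not the code used for this report; the function names and the sample rows are invented for illustration, and only the column names come from the variable list. It stores the parsed breach types as a list per row, which is one simple stand-in for separate columns.

```python
def clean_records(records):
    """Drop rows with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in records:
        # Skip rows where any field is missing or blank.
        if any(v is None or str(v).strip() == "" for v in row.values()):
            continue
        # Column names are unique, so sorting the items never compares values.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(dict(row))
    return cleaned


def split_breach_types(cell):
    """Turn 'Loss, Theft' into ['Loss', 'Theft']."""
    return [part.strip() for part in cell.split(",") if part.strip()]


# Hypothetical sample rows, invented for illustration only.
sample = [
    {"Name of Covered Entity": "Clinic A", "Type of Breach": "Theft", "Individuals Affected": 1200},
    {"Name of Covered Entity": "Clinic A", "Type of Breach": "Theft", "Individuals Affected": 1200},
    {"Name of Covered Entity": "Plan B", "Type of Breach": "Loss, Theft", "Individuals Affected": 800},
    {"Name of Covered Entity": "Lab C", "Type of Breach": "", "Individuals Affected": 650},
]

cleaned = clean_records(sample)
for row in cleaned:
    row["breach_types"] = split_breach_types(row["Type of Breach"])
```

In this sample, the duplicate Clinic A row and the Lab C row with a blank breach type are both dropped.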

Some Summary Statistics for You!

The next few tables show some basic summary statistics to give you a general idea of what’s happening in the dataset. Later in the report, there are some more interesting questions and cooler visualizations to help answer them.

What is the Worst Kind of Breach?

There are 7 kinds of breach in total, and they differ widely in how many individuals each one affects. The table below shows the worst kind of breach, measured by the average number of individuals affected per breach:

## # A tibble: 1 × 2
##   `Type of Breach`    avg_vict
##   <chr>                  <dbl>
## 1 Hacking/IT Incident  397206.

Hacking/IT Incident takes the cake as the worst kind of breach. The average number of individuals affected by a Hacking/IT incident, roughly 397,000, is staggeringly high. This is most likely because Hacking/IT breaches scale so well: they are designed to gather as much data as possible and are not limited by human hands.
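
To make the “average number of individuals per breach type” calculation concrete, here is a small sketch in plain Python. It is not the code behind the table above; the function name and the sample figures are made up for illustration.

```python
from collections import defaultdict


def average_by_type(rows):
    """Return {breach type: mean individuals affected} for (type, count) pairs."""
    totals = defaultdict(lambda: [0, 0])  # type -> [sum of individuals, number of breaches]
    for breach_type, individuals in rows:
        totals[breach_type][0] += individuals
        totals[breach_type][1] += 1
    return {t: total / n for t, (total, n) in totals.items()}


# Hypothetical sample data, invented for illustration only.
sample = [
    ("Theft", 1000),
    ("Theft", 3000),
    ("Hacking/IT Incident", 50000),
    ("Loss", 700),
]

averages = average_by_type(sample)
worst_type = max(averages, key=averages.get)  # type with the highest average
```

Because a mean is sensitive to a single very large breach, it can be worth comparing it with the median (for example via `statistics.median`) before drawing conclusions.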

Which Organizations Had the Largest Breaches?

## # A tibble: 10 × 2
##    `Biggest Losers`                                                      max_i…¹
##    <fct>                                                                   <dbl>
##  1 Anthem, Inc. Affiliated Covered Entity                                 7.88e7
##  2 Science Applications International Corporation (SA                     4.9 e6
##  3 Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Gr…  4.03e6
##  4 21st Century Oncology                                                  2.21e6
##  5 Xerox State Healthcare, LLC                                            2   e6
##  6 IBM                                                                    1.9 e6
##  7 GRM Information Management Services                                    1.7 e6
##  8 AvMed, Inc.                                                            1.22e6
##  9 Montana Department of Public Health & Human Services                   1.06e6
## 10 The Nemours Foundation                                                 1.06e6
## # … with abbreviated variable name ¹​max_individuals

This is a “Top 10 Losers” list of organizations, ranked by the number of individuals affected in their largest breach. Anthem sets the bar quite high (or low) at roughly 78.8 million individuals, distantly followed by Science Applications International Corporation (SAIC) at 4.9 million.
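
A ranking like the one above can be built by keeping each organization’s largest single breach and sorting in descending order. The sketch below is a plain-Python illustration under that assumption; the function name and sample figures are hypothetical and do not come from the dataset.

```python
def top_entities(rows, n=10):
    """Rank entities by their largest single breach, largest first."""
    largest = {}
    for name, individuals in rows:
        # Keep only the biggest breach seen so far for each entity.
        largest[name] = max(individuals, largest.get(name, 0))
    ranked = sorted(largest.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical sample data, invented for illustration only.
sample = [
    ("Entity A", 500),
    ("Entity B", 9000),
    ("Entity A", 12000),
    ("Entity C", 750),
]

top_two = top_entities(sample, n=2)
```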

What are the Most Common Methods of Data Breaches?

## # A tibble: 28 × 2
##    `Type of Breach`                                        n
##    <chr>                                               <int>
##  1 Theft                                                 712
##  2 Unauthorized Access/Disclosure                        424
##  3 Hacking/IT Incident                                   220
##  4 Loss                                                  129
##  5 Other                                                  74
##  6 Improper Disposal                                      57
##  7 Theft, Unauthorized Access/Disclosure                  25
##  8 Loss, Theft                                            14
##  9 Unknown                                                 9
## 10 Hacking/IT Incident, Unauthorized Access/Disclosure     8
## # … with 18 more rows

Theft is easily the most common type of breach. This isn’t too surprising, as humans have gotten good at theft over the course of history, but it doesn’t necessarily mean that theft is the most effective way of stealing information (as evidenced by the far larger average size of Hacking/IT Incident breaches).
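
Note that the table above counts combined values such as “Loss, Theft” as their own category. If you instead want each individual breach type counted once per record it appears in, a sketch like the following would do it. This is plain Python with made-up sample cells, not the code used for the table.

```python
from collections import Counter


def count_individual_types(type_cells):
    """Count each breach type separately, splitting comma-separated cells."""
    counts = Counter()
    for cell in type_cells:
        for part in cell.split(","):
            part = part.strip()
            if part:
                counts[part] += 1
    return counts


# Hypothetical sample cells, invented for illustration only.
cells = [
    "Theft",
    "Loss, Theft",
    "Hacking/IT Incident",
    "Theft, Unauthorized Access/Disclosure",
]

counts = count_individual_types(cells)
most_common_type, most_common_count = counts.most_common(1)[0]
```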

Where is Data Taken From?

## # A tibble: 1,638 × 3
## # Groups:   Location of Breached Information [64]
##    `Location of Breached Information`                            Name …¹ total…²
##    <chr>                                                         <chr>     <dbl>
##  1 Network Server                                                Anthem…  7.88e7
##  2 Other                                                         Scienc…  4.9 e6
##  3 Desktop Computer                                              Advoca…  4.03e6
##  4 Network Server                                                21st C…  2.21e6
##  5 Desktop Computer, Email, Laptop, Network Server, Other, Othe… Xerox …  2   e6
##  6 Other                                                         IBM      1.9 e6
##  7 Electronic Medical Record, Other                              GRM In…  1.7 e6
##  8 Laptop                                                        AvMed,…  1.22e6
##  9 Network Server                                                Montan…  1.06e6
## 10 Other                                                         The Ne…  1.06e6
## # … with 1,628 more rows, and abbreviated variable names
## #   ¹​`Name of Covered Entity`, ²​total_individuals

This table lists where the breached information was held for the largest breaches, along with the entity responsible. Network Server being number one is expected, as that is where the Anthem breach, the biggest in the dataset, took place. However, the location for the second biggest breach is Other, so we don’t know exactly what happened, just that SAIC lost about 4.9 million records. Paper records didn’t make the top ten, which suggests that the very largest breaches involve electronic rather than paper storage. In terms of scale, paper could be considered a more secure alternative to online storage.

An In-Depth Look at Data Breaches

## # A tibble: 1,065 × 6
## # Groups:   State, Type of Breach, Covered Entity Type, Business Associate
## #   Present [541]
##    State `Type of Breach`               Covered Entity…¹ Busin…² Locat…³ total…⁴
##    <chr> <chr>                          <chr>            <chr>   <chr>     <dbl>
##  1 IN    Hacking/IT Incident            Health Plan      No      Networ…  7.88e7
##  2 VA    Loss                           Business Associ… Yes     Other    4.9 e6
##  3 IL    Theft                          Healthcare Prov… No      Deskto…  4.03e6
##  4 FL    Hacking/IT Incident            Healthcare Prov… No      Networ…  2.22e6
##  5 TX    Unauthorized Access/Disclosure Business Associ… Yes     Deskto…  2   e6
##  6 NY    Unknown                        Business Associ… Yes     Other    1.9 e6
##  7 NJ    Theft                          Business Associ… Yes     Electr…  1.7 e6
##  8 FL    Theft                          Health Plan      No      Laptop   1.22e6
##  9 CA    Theft                          Healthcare Prov… No      Laptop   1.15e6
## 10 MT    Hacking/IT Incident            Health Plan      No      Networ…  1.06e6
## # … with 1,055 more rows, and abbreviated variable names
## #   ¹​`Covered Entity Type`, ²​`Business Associate Present`,
## #   ³​`Location of Breached Information`, ⁴​total_individuals

This table gives a more detailed view of the factors that may (or may not) influence how many individuals a breach affects. If you look at the Covered Entity Type column, you’ll notice that Business Associates appear in four of the top 10 rows, more than any other entity type. This may point to mishandling of customer data, as business associates may not protect your data as carefully as other entity types would.

Visual of Healthcare Breaches by Year

2014 was an especially bad year for data breaches. One possible contributing factor is Heartbleed, a vulnerability in the widely used OpenSSL library that was disclosed in April 2014 and left a large number of servers exposed. After that, the number of breaches starts to go down year by year, possibly because information security practices improved.

Top 25 Breaches

## # A tibble: 25 × 2
##    `Biggest Losers`                                                      max_i…¹
##    <fct>                                                                   <dbl>
##  1 Anthem, Inc. Affiliated Covered Entity                                 7.88e7
##  2 Science Applications International Corporation (SA                     4.9 e6
##  3 Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Gr…  4.03e6
##  4 21st Century Oncology                                                  2.21e6
##  5 Xerox State Healthcare, LLC                                            2   e6
##  6 IBM                                                                    1.9 e6
##  7 GRM Information Management Services                                    1.7 e6
##  8 AvMed, Inc.                                                            1.22e6
##  9 Montana Department of Public Health & Human Services                   1.06e6
## 10 The Nemours Foundation                                                 1.06e6
## 11 BlueCross BlueShield of Tennessee, Inc.                                1.02e6
## 12 Sutter Medical Foundation                                              9.43e5
## 13 Valley Anesthesiology Consultants, Inc. d/b/a Valley Anesthesiology …  8.83e5
## 14 Horizon Healthcare Services, Inc., doing business as Horizon Blue Cr…  8.40e5
## 15 Iron Mountain Data Products, Inc. (now known as                        8   e5
## 16 Utah Department of Technology Services                                 7.8 e5
## 17 AHMC Healthcare Inc. and affiliated Hospitals                          7.29e5
## 18 EISENHOWER MEDICAL CENTER                                              5.14e5
## 19 Radiology Regional Center, PA                                          4.83e5
## 20 Puerto Rico Department of Health - Triple S Management Corp.           4.75e5
## 21 St Joseph Health System                                                4.05e5
## 22 Spartanburg Regional Healthcare System                                 4   e5
## 23 Triple-S Salud, Inc.                                                   3.98e5
## 24 Triple-S Salud, Inc. - Breach Case#2                                   3.98e5
## 25 Community Health Plan of Washington                                    3.82e5
## # … with abbreviated variable name ¹​max_individuals

Anthem’s breach leaked the information of 78.8 million users, or about 1 in every 3 insurance users.

Top 10 States by Data Loss

Indiana seems to be the worst state to be in for health insurance data protection. This result is skewed by the Anthem breach, but that breach was the largest in the dataset, and leaving it out would discount quite an important data point.

Number of Healthcare Hacking Incidents by Month

April seems to be the time when hackers like to crawl out and steal data. That said, every month, denoted by its number (e.g., 1 = January), has a high number of hacking incidents. It is 2023, so insurance companies need to be at the top of their IT game to avoid Hacking/IT incidents and prevent more Anthem-scale losses.

Number of Breaches per Entity Type

## 
##        Business Associate               Health Plan Healthcare Clearing House 
##                       285                       200                         4 
##       Healthcare Provider 
##                      1220

Healthcare Providers account for the most breaches, with 1,220 recorded instances. That is 935 more than the next entity type (Business Associate, with 285). Some of this gap is to be expected, since the nature of their work makes these entities more likely to have a breach.

On What Day of the Week (Sunday, Monday, etc.) are Breaches Most Often Reported?

## # A tibble: 7 × 2
## # Groups:   weekday [7]
##   weekday     n
##   <ord>   <int>
## 1 Fri       512
## 2 Thu       300
## 3 Mon       286
## 4 Wed       282
## 5 Tue       281
## 6 Sat        29
## 7 Sun        19

This tells us that Fridays are the biggest days for breach reports, with 512 reports submitted on a Friday, followed by Thursday (300) and Monday (286). Keep in mind that these dates are when OCR was notified, not necessarily when the breach happened. As the weekend approaches, employees may be less likely to notice data breaches or missing items. Weekends are strikingly quiet, with hardly any breaches reported on Saturdays or Sundays.
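
A weekday tally like the one above comes from converting each Breach Submission Date to a day of the week and counting. The sketch below shows the idea in plain Python. It assumes dates are stored as MM/DD/YYYY strings, which may not match the actual file, and the sample dates are invented.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_counts(date_strings, fmt="%m/%d/%Y"):
    """Count how many dates fall on each day of the week."""
    counts = Counter()
    for text in date_strings:
        day_index = datetime.strptime(text, fmt).weekday()  # Monday is 0
        counts[WEEKDAYS[day_index]] += 1
    return counts


# Hypothetical submission dates, invented for illustration only.
dates = ["02/13/2015", "02/16/2015", "02/20/2015"]

counts_by_day = weekday_counts(dates)
```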

In which year (or years) were there at least 50 breaches from a ‘Business Associate’ covered entity type and at least 150 breaches from a healthcare provider covered entity type?

## # A tibble: 6 × 3
## # Groups:   Covered Entity Type [2]
##   `Covered Entity Type` breach_year count
##   <chr>                 <fct>       <int>
## 1 Business Associate    2013           64
## 2 Business Associate    2014           67
## 3 Healthcare Provider   2013          187
## 4 Healthcare Provider   2014          179
## 5 Healthcare Provider   2015          155
## 6 Healthcare Provider   2016          182
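
Reading the table, only 2013 and 2014 appear for both entity types, so those are the years that meet both thresholds. The same check can be written as a short filter. The sketch below is plain Python with a hypothetical function name, fed with the counts printed in the table above (year and entity combinations missing from the table are treated as below the threshold).

```python
def years_meeting_both(counts, ba_min=50, hp_min=150):
    """Return the years where both entity types reach their minimum breach counts."""
    years = {year for (_entity, year) in counts}
    return sorted(
        year
        for year in years
        if counts.get(("Business Associate", year), 0) >= ba_min
        and counts.get(("Healthcare Provider", year), 0) >= hp_min
    )


# Counts copied from the table above.
table_counts = {
    ("Business Associate", 2013): 64,
    ("Business Associate", 2014): 67,
    ("Healthcare Provider", 2013): 187,
    ("Healthcare Provider", 2014): 179,
    ("Healthcare Provider", 2015): 155,
    ("Healthcare Provider", 2016): 182,
}

qualifying_years = years_meeting_both(table_counts)
```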

How has the type of breach (hacking, improper disposal, loss, etc.) changed for each year?

Theft is still a classic method for stealing data. Despite the advancement of technology, common theft has been and remains the most common method of data breach. In 2013, you start to see Hacking/IT incidents show up more often, and by 2016 this category is even higher. While companies should invest more in keeping their digital information secure, they still need to be on the lookout for common theft and other traditional methods of breaching data.

Do certain methods of breach occur on different days of the week?

As we saw before, Friday is the biggest breach day. However, there are almost no Hacking/IT incidents on the weekends (to my surprise); these incidents show up mainly on Mondays and Fridays. I expected the data to be in its most vulnerable state on the weekends, since IT workers aren’t constantly monitoring their networks then. Interestingly, thefts skyrocket on Fridays. These may happen primarily in the evening, as workers head home for their weekends off.