Data Breach Archive Analysis

Assignment 4 - R Markdown Conversion

Will Wallace


Explaining the dataset

This is a large dataset of healthcare data breaches involving protected health information (PHI). Some of the variables include: Name of the covered entity (organization responsible for the PHI), State (US state where the breach was reported), Covered Entity Type (type of organization responsible for the PHI), Individuals Affected (number of records affected by the breach), Breach submission date (date the breach was reported by the covered entity), Type of breach (how unauthorized access to the PHI was obtained), Location of breached information (where the PHI was when unauthorized access was obtained), Business associate present (whether a business associate such as a consultant or contractor was involved in the breach), and Web description (an optional statement explaining what happened and the resolution).

A few summary statistics

To start the analysis, I chose a few summaries that describe the dataset and show how it can be used.

## # A tibble: 5 x 3
##   State Tot_Affected na.rm
##   <chr>        <dbl> <lgl>
## 1 IN        79576765 TRUE 
## 2 FL         6001825 TRUE 
## 3 VA         5158001 TRUE 
## 4 IL         4692107 TRUE 
## 5 TX         4040208 TRUE
## # A tibble: 5 x 3
##   `Name of Covered Entity`                                    Tot_Affected na.rm
##   <chr>                                                              <dbl> <lgl>
## 1 Anthem, Inc. Affiliated Covered Entity                          78800000 TRUE 
## 2 Science Applications International Corporation (SA               4900000 TRUE 
## 3 Advocate Health and Hospitals Corporation, d/b/a Advocate ~      4029530 TRUE 
## 4 21st Century Oncology                                            2213597 TRUE 
## 5 Xerox State Healthcare, LLC                                      2000000 TRUE
## # A tibble: 1 x 1
##   Count_Breach
##          <int>
## 1         1709
## # A tibble: 1 x 1
##   Total_Affected
##            <dbl>
## 1      124249678
## # A tibble: 5 x 2
##   `Type of Breach`               Count_Breach
##   <chr>                                 <int>
## 1 Theft                                   712
## 2 Unauthorized Access/Disclosure          424
## 3 Hacking/IT Incident                     220
## 4 Loss                                    129
## 5 Other                                    74

These summaries give a quick overview of the data. In order, they show the five states with the most individuals affected in data breaches (table 1), the top 5 covered entities in terms of individuals affected (table 2), the number of breaches in the dataset (table 3), the total number of individuals affected across the dataset (table 4), and the most common types of breach (table 5). Together they show some of the ways this dataset can be evaluated.
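The grouped summaries above could be produced along the following lines. This is a minimal sketch, not the code used for this report: the data frame `breaches`, the helper names, and the `Individuals Affected` column name are assumptions (the other column names are taken from the printed tables). The stray `na.rm` column in the first two tables most likely comes from passing `na.rm = TRUE` to `summarise()` rather than to `sum()`, so the sketch passes it to `sum()`.

```r
library(dplyr)

# Hypothetical helper: total individuals affected per group, largest first.
# `breaches` and the `Individuals Affected` column name are assumptions.
top_affected <- function(breaches, group_col, n = 5) {
  breaches %>%
    group_by(across(all_of(group_col))) %>%
    summarise(
      # na.rm belongs inside sum(); given to summarise() it becomes a column
      Tot_Affected = sum(`Individuals Affected`, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    arrange(desc(Tot_Affected)) %>%
    head(n)
}

# Hypothetical helper: number of breaches per type, most common first.
top_breach_types <- function(breaches, n = 5) {
  breaches %>%
    count(`Type of Breach`, name = "Count_Breach", sort = TRUE) %>%
    head(n)
}
```

Calling `top_affected(breaches, "State")` or `top_affected(breaches, "Name of Covered Entity")` would then give tables shaped like the first two, without the extra `na.rm` column.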

What are the top 25 largest healthcare data breaches?

This table shows which companies have experienced the largest data breaches. A breach of this size stays on a company's record. Companies can use this information to dig deeper into what they need to fix, and it exposes a serious problem to any company that had not already recognized it.
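A ranking like this one can be sketched as follows. The data frame `breaches`, the helper name, and the `Individuals Affected` column name are assumptions for illustration. Note that this ranks individual breach records, so an entity with several large breaches would appear more than once.

```r
library(dplyr)

# Hypothetical helper: the n breach records with the most individuals affected.
# `breaches` and the `Individuals Affected` column name are assumptions.
largest_breaches <- function(breaches, n = 25) {
  breaches %>%
    filter(!is.na(`Individuals Affected`)) %>%   # skip records with no count
    arrange(desc(`Individuals Affected`)) %>%    # largest breach first
    select(`Name of Covered Entity`, State, `Individuals Affected`) %>%
    head(n)
}
```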

What are the top 10 total healthcare data breaches in the US?

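This ranks states by the total number of individuals affected in data breaches. It is a starting point for asking why some states are hit so heavily. One caution: the ranking reflects individuals affected rather than the number of breaches, so a single very large breach can dominate a state's total. Indiana's 79.6 million in the summary tables, for example, is close to the 78.8 million from the Anthem breach alone.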

What are the top healthcare hacking incidents by month?

This data helps show which months are most common for hacking breaches, and it can be a starting point for figuring out why they happen.
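Counting hacking incidents by month could look roughly like this. It is a sketch under several assumptions: the data frame `breaches`, the helper name, the `Breach Submission Date` column name, and the `"%m/%d/%Y"` date format are all guesses that would need checking against the actual file. `grepl()` is used instead of an exact match in case an entry lists more than one breach type, which is also an assumption about the data.

```r
library(dplyr)

# Hypothetical helper: hacking breaches counted by calendar month (1-12),
# most common month first. Column name and date format are assumptions.
hacking_by_month <- function(breaches, date_format = "%m/%d/%Y") {
  breaches %>%
    filter(grepl("Hacking", `Type of Breach`, fixed = TRUE)) %>%
    mutate(
      # parse the submission date, then keep only the month number
      Month = as.integer(format(
        as.Date(`Breach Submission Date`, format = date_format), "%m"
      ))
    ) %>%
    count(Month, name = "Count_Breach") %>%
    arrange(desc(Count_Breach))
}
```

The built-in `month.name` vector can turn the month numbers into labels, for example `month.name[3]` for March.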

How does the number of breaches compare across entity types?

This shows how breaches are distributed across covered entity types. The main takeaway is that the Healthcare Provider category has by far the most data breaches.

How have breaches changed over the years?

This graph shows the number of breaches that have occurred over the years.

My First Question: What is the average number of individuals affected per type of breach, and what are the top 10 types of breach by individuals affected?

I think this question is interesting because it shows which types of breach affect the most individuals.

This graph shows that hacking and IT incidents affect far more individuals than any other type of breach, though the other categories may also be worth a closer look. It looks like many companies need to dig deeper to understand where the breaches of unknown type are coming from.
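The numbers behind a graph like this can be sketched as below. The data frame `breaches`, the helper name, and the `Individuals Affected` column name are assumptions. The sketch reports the average alongside the total and the breach count, since an average based on only a few breaches is easy to over-read.

```r
library(dplyr)

# Hypothetical helper: average and total individuals affected per breach type,
# ordered by the average. `breaches` and column names are assumptions.
affected_by_breach_type <- function(breaches, top = 10) {
  breaches %>%
    group_by(`Type of Breach`) %>%
    summarise(
      Avg_Affected = mean(`Individuals Affected`, na.rm = TRUE),
      Tot_Affected = sum(`Individuals Affected`, na.rm = TRUE),
      Count_Breach = n(),   # number of breach records of this type
      .groups = "drop"
    ) %>%
    arrange(desc(Avg_Affected)) %>%
    head(top)
}
```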

My Second Question: What are the top 5 most common locations of breached information?

This question hasn’t been explored much with this dataset, so I thought it would be worth analyzing, and I’m glad I did.

To my surprise, the most common location of breached information is paper/films. I completely expected it to be something like email. I think this is a statistic that may catch many companies off guard.