This is a large dataset of reported data breaches involving protected health information (PHI). Its variables include: Name of Covered Entity (the organization responsible for the PHI), State (the US state where the breach was reported), Covered Entity Type (the type of organization responsible for the PHI), Individuals Affected (the number of records affected by the breach), Breach submission date (the date the breach was reported by the covered entity, or CE), Type of Breach (how unauthorized access to the PHI was obtained), Location of breached information (where the PHI was when unauthorized access was obtained), Business associate present (whether a business associate, such as a consultant or contractor, was involved in the breach), and Web description (an optional statement explaining what happened and how it was resolved).
To start the analysis, I chose a few summaries that help describe the dataset and show how it can be used.
## # A tibble: 5 x 3
## State Tot_Affected na.rm
## <chr> <dbl> <lgl>
## 1 IN 79576765 TRUE
## 2 FL 6001825 TRUE
## 3 VA 5158001 TRUE
## 4 IL 4692107 TRUE
## 5 TX 4040208 TRUE
## # A tibble: 5 x 3
## `Name of Covered Entity` Tot_Affected na.rm
## <chr> <dbl> <lgl>
## 1 Anthem, Inc. Affiliated Covered Entity 78800000 TRUE
## 2 Science Applications International Corporation (SA 4900000 TRUE
## 3 Advocate Health and Hospitals Corporation, d/b/a Advocate ~ 4029530 TRUE
## 4 21st Century Oncology 2213597 TRUE
## 5 Xerox State Healthcare, LLC 2000000 TRUE
## # A tibble: 1 x 1
## Count_Breach
## <int>
## 1 1709
## # A tibble: 1 x 1
## Total_Affected
## <dbl>
## 1 124249678
## # A tibble: 5 x 2
## `Type of Breach` Count_Breach
## <chr> <int>
## 1 Theft 712
## 2 Unauthorized Access/Disclosure 424
## 3 Hacking/IT Incident 220
## 4 Loss 129
## 5 Other 74
These summaries show several things at a glance: the states with the most individuals affected by data breaches (table 1), the top 5 companies or covered entities by individuals affected (table 2), the number of breaches in the dataset (table 3), the total number of individuals affected across the dataset (table 4), and the most common types of breach (table 5). They give a first idea of how this dataset can be evaluated. (The `na.rm` column in tables 1 and 2 carries no information; it most likely appears because `na.rm = TRUE` was passed to `summarise()` as its own argument rather than inside `sum()`.)
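The five summaries are all simple group-and-aggregate operations. The tables above look like dplyr output, but as a minimal, language-neutral sketch the same steps can be written in plain Python. The rows below are made up for illustration and are not taken from the breach data; the dictionary keys are assumptions modeled on the variable names described earlier, and the helper names `total_affected_by` and `top_n` are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up toy rows for illustration only; NOT real breach records.
# Keys are assumed column names modeled on the variables described above.
rows = [
    {"State": "AA", "Type of Breach": "Theft", "Individuals Affected": 500},
    {"State": "AA", "Type of Breach": "Loss", "Individuals Affected": 1200},
    {"State": "BB", "Type of Breach": "Theft", "Individuals Affected": 3000},
    {"State": "CC", "Type of Breach": "Hacking/IT Incident", "Individuals Affected": None},
    {"State": "CC", "Type of Breach": "Theft", "Individuals Affected": 800},
]


def total_affected_by(rows, key):
    """Sum 'Individuals Affected' per value of `key`, skipping missing counts."""
    totals = defaultdict(int)
    for row in rows:
        affected = row["Individuals Affected"]
        if affected is not None:  # plays the role of na.rm = TRUE
            totals[row[key]] += affected
    return dict(totals)


def top_n(totals, n=5):
    """Return the n largest (group, total) pairs, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Analogue of table 1: groups with the most individuals affected.
top_states = top_n(total_affected_by(rows, "State"))
# Analogue of table 3: number of breaches (one row per breach).
count_breach = len(rows)
# Analogue of table 4: total individuals affected, treating missing as 0.
total_affected = sum(row["Individuals Affected"] or 0 for row in rows)
# Analogue of table 5: most common breach types.
type_counts = Counter(row["Type of Breach"] for row in rows).most_common(5)

print(top_states, count_breach, total_affected, type_counts)
```

Passing `"Name of Covered Entity"` as the `key` (with that field present in each row) would give the analogue of table 2.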
This table shows which companies have experienced the largest data breaches. This is something that will forever be on their record. Companies can use it to dig deeper into what they can do to fix their problems, and it exposes a massive problem to any company that had not already realized it.
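This shows which states have the most individuals affected by data breaches. It is a starting point for asking why a state ranks so high. A large total does not by itself mean breaches happen there frequently: Indiana's total in table 1 (79,576,765) is only slightly larger than the Anthem figure in table 2 (78,800,000), which suggests that one very large breach may account for most of it.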
This can help you start to understand which months are most common for hacking breaches, and it can be a starting point for figuring out why they happen.
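A count of hacking breaches per month can be sketched as below. This is a minimal example in plain Python on made-up rows; the key names, the `MM/DD/YYYY` date format, and the helper name `breaches_per_month` are assumptions, not details confirmed by the dataset.

```python
from collections import Counter
from datetime import datetime

# Made-up toy rows for illustration only; NOT real breach records.
# The MM/DD/YYYY date format is an assumption about the raw data.
rows = [
    {"Breach Submission Date": "01/15/2014", "Type of Breach": "Hacking/IT Incident"},
    {"Breach Submission Date": "01/30/2015", "Type of Breach": "Hacking/IT Incident"},
    {"Breach Submission Date": "03/02/2015", "Type of Breach": "Theft"},
    {"Breach Submission Date": "07/09/2015", "Type of Breach": "Hacking/IT Incident"},
]


def breaches_per_month(rows, breach_type, date_format="%m/%d/%Y"):
    """Count breaches of one type per calendar month (1-12), pooled across years."""
    months = Counter()
    for row in rows:
        if row["Type of Breach"] == breach_type:
            submitted = datetime.strptime(row["Breach Submission Date"], date_format)
            months[submitted.month] += 1
    return months


hacking_by_month = breaches_per_month(rows, "Hacking/IT Incident")
for month in sorted(hacking_by_month):
    print(month, hacking_by_month[month])
```

Because the date is the day the breach was reported, not the day it occurred, any monthly pattern reflects reporting timing. Replacing `submitted.month` with `submitted.year` would give breaches per year instead.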
This shows which types of covered entity are being breached. The main takeaway is that the Healthcare Provider category has by far the most data breaches.
This graph shows the number of breaches that have occurred over the years.
An interesting question to ask is which types of breach affect the most individuals.
This graph shows that hacking and IT incidents affect by far the most individuals, but the other breach types are still worth a closer look. It looks like many companies need to dig deeper to understand where some of these unknown breaches are coming from.
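The type that occurs most often (Theft, in table 5) need not be the type that affects the most individuals, so it helps to compute both measures side by side. Here is a minimal sketch in plain Python; the `(type, affected)` pairs are made up for illustration and the helper name `summarize_by_type` is hypothetical.

```python
from collections import defaultdict

# Made-up (breach type, individuals affected) pairs; NOT real breach records.
rows = [
    ("Theft", 500),
    ("Theft", 700),
    ("Theft", 300),
    ("Hacking/IT Incident", 90000),
    ("Loss", 1000),
]


def summarize_by_type(rows):
    """Per breach type: number of breaches, total affected, largest single breach."""
    summary = defaultdict(lambda: {"count": 0, "total": 0, "largest": 0})
    for breach_type, affected in rows:
        entry = summary[breach_type]
        entry["count"] += 1
        entry["total"] += affected
        entry["largest"] = max(entry["largest"], affected)
    return dict(summary)


summary = summarize_by_type(rows)
# The two rankings can disagree: frequency versus total individuals affected.
most_frequent = max(summary, key=lambda t: summary[t]["count"])
most_affected = max(summary, key=lambda t: summary[t]["total"])
print(most_frequent, most_affected)
```

Comparing `largest` with `total` also shows whether a single breach dominates a category.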
This is a question that hasn’t been explored much in this dataset, so I thought it would be good to analyze, and I’m glad I did.
To my surprise, the most common location of breached information is paper/films. I completely expected it to be something like email. I think this is a statistic that may catch many companies off guard.
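Ranking locations comes down to counting how many records name each one. The sketch below is plain Python on made-up values. It assumes a record may list several locations separated by commas; that format is an assumption about the raw field, and the helper name `location_shares` is hypothetical.

```python
from collections import Counter

# Made-up toy values for illustration only; NOT real breach records.
# The comma-separated multi-location format is an assumption.
locations = [
    "Paper/Films",
    "Laptop",
    "Paper/Films",
    "Email, Laptop",
    "Network Server",
    "Paper/Films",
]


def location_shares(values, sep=","):
    """Return (location, record count, share of records), most common first."""
    counts = Counter()
    for value in values:
        # A record may name several locations; count each distinct one once.
        for location in {part.strip() for part in value.split(sep) if part.strip()}:
            counts[location] += 1
    n_records = len(values)
    return [(loc, n, n / n_records) for loc, n in counts.most_common()]


shares = location_shares(locations)
for location, n, share in shares:
    print(f"{location}: {n} records ({share:.0%})")
```

Since one record can count toward several locations, the shares can add up to more than 100%.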