Introduction to the Dataset

This dataset is composed of records of healthcare data breaches reported across the United States. Each record covers the following data points:

• Name of the covered entity (Organization responsible for the protected health information, or PHI)
• State (US state where the breach was reported)
• Covered Entity Type (Type of organization responsible for the PHI)
• Individuals Affected (Number of records affected by the breach)
• Breach submission date (Date the breach was reported by the covered entity)
• Type of breach (How unauthorized access to the PHI was obtained)
• Location of breached information (Where the PHI was when unauthorized access was obtained)
• Business associate present (Whether a business associate, such as a consultant or contractor, was involved in the breach)
• Web description (An optional statement explaining what happened and the resolution)

This dataset is composed strictly of data breaches, reported since 2009, that each directly impacted 500 or more individuals.

Some data cleansing and wrangling was performed behind the scenes in the script to make the data more readable and easier to analyze!

Summary Statistics

The first summary statistic sets out to answer a simple question: what is the average number of people affected by a data breach?

To answer this question, the mean of the entire “Individuals Affected” column is taken. The average number of people affected by a data breach was 72,703, which is a lot of people! Although this is just a simple mean calculation, it’s worth remembering that there’s a huge range in the number of people impacted by data breaches: from 500 to 78 million!
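The mean and range described above can be sketched in a few lines of Python. This is only an illustration: the values below are made up to stand in for the “Individuals Affected” column, and the variable names are assumptions rather than anything from the original script.

```python
from statistics import mean

# Hypothetical values standing in for the "Individuals Affected" column.
# They are illustrative only and are NOT taken from the dataset.
individuals_affected = [500, 1_200, 8_000, 45_000, 2_000_000]

# Simple mean over the whole column.
average_affected = mean(individuals_affected)

# Smallest and largest breach, to show how wide the range is.
smallest = min(individuals_affected)
largest = max(individuals_affected)

print(f"Mean individuals affected: {average_affected:,.0f}")
print(f"Range: {smallest:,} to {largest:,}")
```

Even in this toy column, one very large breach pulls the mean far above most of the individual values, which is why the range is worth keeping in mind alongside the average.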

The second summary statistic takes a closer look, examining how many data breaches each state in the United States has experienced in the dataset.

To calculate this, the data is grouped by the State column, and a summarize function outputs a data table showing each state represented in the dataset along with a count of the data breaches that occurred in that state.
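As a rough sketch of this group-and-count step, the same idea can be expressed with Python's standard library. The rows below are hypothetical, and the "State" key simply mirrors the column name mentioned in the text; this is not the original script.

```python
from collections import Counter

# Hypothetical rows; only the "State" field matters for this count.
records = [
    {"State": "CA"},
    {"State": "TX"},
    {"State": "CA"},
    {"State": "ME"},
]

# Group by state and count how many breaches fall in each group.
breaches_per_state = Counter(row["State"] for row in records)

# Print the table with the most frequent states first.
for state, count in breaches_per_state.most_common():
    print(state, count)
```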

Every state in the dataset has experienced multiple data breaches; we’ll look at this data in more depth later.

The third summary statistic centers on the various types of data breaches named in this dataset.

The resulting table shows every type of data breach listed in the dataset and a count of how many times each type occurs.

It’s worth noting that this includes the records where more than one type of breach was listed for a single incident; each such combination is counted as its own category, which is where a lot of the 1’s come from. However, the table still gives us some insight into which types of breach are more common.
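To illustrate why combined entries produce so many counts of 1, here is a small sketch that counts the breach types two ways: once as written, and once after splitting combined entries apart. The sample values and the comma separator are assumptions made for illustration; the real column may use a different format, and the report itself only uses the first kind of count.

```python
from collections import Counter

# Hypothetical "Type of breach" values; a comma is ASSUMED to separate
# multiple types recorded for a single incident.
breach_types = [
    "Theft",
    "Hacking/IT Incident",
    "Theft, Loss",
    "Theft",
    "Loss, Unauthorized Access/Disclosure",
]

# Count each value exactly as written: combinations become their own rows.
combined_counts = Counter(breach_types)

# Alternative: split combinations so each individual type is counted.
split_counts = Counter(
    part.strip() for value in breach_types for part in value.split(",")
)

print(combined_counts)
print(split_counts)
```

Splitting trades one kind of clarity for another: the per-type totals become easier to compare, but an incident with two types is then counted twice.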

The fourth summary statistic addresses the distribution of the location of breached information.

The resulting table lists every location of breached information and a count of how many times each location appears in the dataset.

Finally, the fifth summary statistic looks at how often a business associate was involved in a data breach.

To do this, the “Business Associate Present” column is used. This simple table shows whether or not a business associate was directly involved in each data breach, using a simple yes/no value for easier understanding.

It’s heartening to see that the vast majority of data breaches did not have any business associates actively involved in the data breach!

In-depth Analysis

The output for every question in this section will either be a specific visual (e.g. histogram), or a data table.

2013 and 2014 had the most data breaches, with 2016 not far behind. The number of data breaches was fairly steady throughout the period covered by the dataset, with the exception of the first year (2009) and the final years (2019 and 2020). This could be because the data is incomplete: the dataset most likely does not cover all of 2009, and the data might not be complete for 2019 and 2020.
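A yearly count like this one, together with a quick check for partial first and last years, might be sketched as follows. The dates are made up, and the MM/DD/YYYY format is an assumption about how the breach submission date is stored.

```python
from collections import Counter
from datetime import datetime

# Hypothetical breach submission dates; the MM/DD/YYYY format is assumed.
submission_dates = [
    "10/21/2009",
    "03/15/2013",
    "07/02/2013",
    "04/30/2014",
    "01/09/2020",
]

parsed = [datetime.strptime(d, "%m/%d/%Y").date() for d in submission_dates]

# Number of breaches reported in each year.
breaches_per_year = Counter(d.year for d in parsed)

# Earliest and latest dates hint at whether the edge years are partial.
first_date, last_date = min(parsed), max(parsed)

for year in sorted(breaches_per_year):
    print(year, breaches_per_year[year])
print("Coverage:", first_date, "to", last_date)
```

If the earliest date falls late in its year, or the latest date falls early in its year, low counts for those years are more likely a coverage gap than a real dip.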

These are the 25 largest healthcare data breaches in the dataset, and they are quite large: the largest affected 78 million people in a single incident at Anthem, Inc., and the 25th largest affected 398,000 people at Triple-S Salud, Inc.
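Picking out the largest breaches amounts to sorting by the number of individuals affected and keeping the top rows. The sketch below uses placeholder entity names and made-up counts, and keeps only the top 2 so the output stays short; the report uses 25.

```python
# Hypothetical (entity name, individuals affected) pairs; not real data.
records = [
    ("Entity A", 1_200),
    ("Entity B", 950_000),
    ("Entity C", 500),
    ("Entity D", 78_000),
]

top_n = 2  # the report uses 25; 2 keeps this example small

# Sort from largest to smallest breach and keep the first top_n rows.
largest_breaches = sorted(records, key=lambda r: r[1], reverse=True)[:top_n]

for name, affected in largest_breaches:
    print(f"{name}: {affected:,}")
```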

It might not be surprising that California and Texas recorded the highest number of data breaches in the dataset, since they are two of the biggest states in the USA (by both population and area)!

For reference, each month is listed numerically on the x-axis (January is the first bar, February is the second bar, etc.). There does not seem to be any “seasonality” in the data; data breaches are common throughout the year. However, March and April are the two months with the highest number of data breaches.

No entities stick out as being more prone to data breaches; the maximum number of data breaches reported by the same entity is 6!

For reference, the day of the week is represented numerically (Sunday = 1, Monday = 2, etc.). It looks like Sunday has the highest number of data breaches reported, which is interesting. Another thing that sticks out is that Tuesday and Wednesday have the fewest data breaches reported by far; it’s not even close! Guess data breaches are just a weekend thing. Keep in mind that the only date in the dataset is the breach submission date, so this pattern reflects when breaches were reported rather than when they occurred.
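For anyone reproducing the Sunday = 1 numbering in Python, note that date.isoweekday() numbers Monday as 1 and Sunday as 7, so a small conversion is needed. The helper name and the sample dates below are hypothetical and chosen only to illustrate the mapping.

```python
from collections import Counter
from datetime import date


def sunday_first_day_number(d: date) -> int:
    """Return 1 for Sunday, 2 for Monday, ..., 7 for Saturday."""
    # isoweekday() gives Monday=1 ... Sunday=7; shift so Sunday becomes 1.
    return d.isoweekday() % 7 + 1


# Hypothetical submission dates (5 January 2020 was a Sunday).
submission_dates = [
    date(2020, 1, 5),   # Sunday
    date(2020, 1, 6),   # Monday
    date(2020, 1, 11),  # Saturday
    date(2020, 1, 12),  # Sunday
]

breaches_per_weekday = Counter(
    sunday_first_day_number(d) for d in submission_dates
)

for day_number in sorted(breaches_per_weekday):
    print(day_number, breaches_per_weekday[day_number])
```

Checking the numbering convention first is worthwhile, since an off-by-one in the day mapping would shift every bar in the chart by one day.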

This data table shows, by year, the data breaches where a business associate was actively involved and the covered entity type was Healthcare Provider. 2016 had the most data breaches meeting both criteria, with 33.
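The two-condition filter followed by a yearly count could be sketched like this. The field names and the "Yes"/"No" and "Healthcare Provider" values are assumptions about how the cleaned data is encoded, and the rows are invented.

```python
from collections import Counter

# Hypothetical rows; field names and value encodings are assumed.
records = [
    {"year": 2015, "entity_type": "Healthcare Provider", "business_associate": "Yes"},
    {"year": 2016, "entity_type": "Healthcare Provider", "business_associate": "Yes"},
    {"year": 2016, "entity_type": "Health Plan", "business_associate": "Yes"},
    {"year": 2016, "entity_type": "Healthcare Provider", "business_associate": "No"},
    {"year": 2016, "entity_type": "Healthcare Provider", "business_associate": "Yes"},
]

# Keep only rows that satisfy BOTH criteria.
matching = [
    r for r in records
    if r["business_associate"] == "Yes"
    and r["entity_type"] == "Healthcare Provider"
]

# Count the remaining breaches per year.
matching_per_year = Counter(r["year"] for r in matching)

for year in sorted(matching_per_year):
    print(year, matching_per_year[year])
```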

This visual is a trend analysis that shows how each type of breach has changed for each year.
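The table behind a trend chart like this is essentially a count for every combination of breach type and year. A minimal sketch, using invented (year, type) pairs rather than the real data, might look like the following; plotting is left out.

```python
from collections import Counter, defaultdict

# Hypothetical (year, type of breach) pairs; not taken from the dataset.
records = [
    (2014, "Theft"),
    (2014, "Theft"),
    (2014, "Hacking/IT Incident"),
    (2015, "Hacking/IT Incident"),
    (2015, "Hacking/IT Incident"),
    (2015, "Theft"),
]

# One Counter of yearly counts per breach type.
trend = defaultdict(Counter)
for year, breach_type in records:
    trend[breach_type][year] += 1

years = sorted({year for year, _ in records})

# Each printed row is one line of the trend chart: counts in year order.
for breach_type in sorted(trend):
    series = [trend[breach_type][y] for y in years]
    print(breach_type, dict(zip(years, series)))
```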

Delaware and Maine have the fewest data breaches in this dataset, with only 2 breaches each, which is really impressive! Idaho and Vermont each had only 3 data breaches. Contrasted with states like California and Texas, we see just how much the number of breaches can vary from state to state!

Healthcare providers account for the vast majority of the data breaches by a wide margin; business associates follow as the second most common entity type.