| Package | Explanation for Use |
|---|---|
| tidyverse | Collection of packages for importing, tidying, and visualizing data. |
| dplyr | Used for breaking down and summarizing the data. |
| DT | Used to create an interactive data table. |
| lubridate | Used to parse the date column. |
| knitr | Used for creating certain tables with a restricted number of digits. |

| Action Performed | Purpose |
|---|---|
| Combining Data | The data originally came in two separate CSV files, so the rbind function was used to combine the two data frames. The as.data.frame function was then applied so that the combined result is stored as a single data frame with row names. |
| Duplicate Removal | Some incidents appear in more than one row. The slice function was used to remove these duplicate rows. |
| Missing Data Removal | A few rows were missing key pieces of information. Any row with a missing value in a column other than the web description was removed using the drop_na function. |
| New Columns | A new indicator column was added for each value found in the Type of Breach column. A row receives a 1 in the indicator column for each breach type it lists and a 0 in all the others. |
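The four cleaning steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two small data frames stand in for the original CSV files, and every column name and value is hypothetical. Only the functions named in the table (rbind, as.data.frame, slice, drop_na) come from the report, and the exact arguments the report used are not known.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for the two original CSV files.
archive <- data.frame(
  entity = c("Clinic A", "Clinic A", "Plan B"),
  individuals = c(500, 500, 1200),
  breach_type = c("Theft", "Theft", "Hacking/IT Incident, Theft"),
  web_description = c(NA, NA, "Server compromised"),
  stringsAsFactors = FALSE
)
under_investigation <- data.frame(
  entity = c("Hospital C", "Lab D"),
  individuals = c(NA, 800),
  breach_type = c("Loss", "Loss"),
  web_description = c(NA, NA),
  stringsAsFactors = FALSE
)

# Step 1: stack the two data frames into one.
breaches <- as.data.frame(rbind(archive, under_investigation))

# Step 2: keep only the first occurrence of each fully duplicated row.
keep <- which(!duplicated(breaches))
breaches <- slice(breaches, keep)

# Step 3: drop rows with a missing value in any column except web_description.
breaches <- drop_na(breaches, -web_description)

# Step 4: add one 0/1 indicator column per breach type (three shown here).
breaches <- breaches %>%
  mutate(
    theft = as.integer(grepl("Theft", breach_type, fixed = TRUE)),
    hacking = as.integer(grepl("Hacking/IT Incident", breach_type, fixed = TRUE)),
    loss = as.integer(grepl("Loss", breach_type, fixed = TRUE))
  )
```

In this toy data the duplicated "Clinic A" row and the "Hospital C" row with a missing count are removed, and "Plan B" receives a 1 in both the theft and hacking columns because it lists two breach types.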

| Variable Name | Description |
|---|---|
| Name of Covered Entity | This is the company or entity that experienced the data breach. |
| State | Represents the location of the company that had the data breach. |
| Covered Entity Type | The type of company that experienced the breach. |
| Individuals Affected | The number of individuals whose information was impacted by the data breach. |
| Breach Submission Date | The date when the breach was submitted for investigation. |
| Location of Breached Information | The original location of the data that was breached. |
| Business Associate Present | A value of “Yes” or “No” indicating whether a business associate was present at the time of the breach. |
| Complete_Current | This column was created to separate breaches whose investigations are complete from those whose investigations are still underway. |

After omitting the top 5% of the data (the largest breaches), 2,292 breaches remain.
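One way to omit the top 5% is to cut at the 95th percentile of individuals affected. The sketch below assumes that approach; the report does not state exactly how the cutoff was computed, and the data frame and column names are hypothetical.

```r
library(dplyr)

# Hypothetical breach sizes: 19 ordinary breaches and one very large one.
breaches <- data.frame(
  id = 1:20,
  individuals = c(seq(500, 9500, by = 500), 5000000)
)

# 95th percentile of individuals affected (assumed cutoff).
cutoff <- quantile(breaches$individuals, probs = 0.95)

# Keep only the breaches at or below the cutoff.
trimmed <- filter(breaches, individuals <= cutoff)
nrow(trimmed)
```

With these values the single very large breach is dropped and the other 19 rows are kept.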
In order to see how many of each type of breach occurred, each individual type was identified on its own. This chart shows how many of each type of breach have occurred.
It is important to understand how well data is being protected. To gauge whether data protection has improved, we need to see whether breaches have become more or less frequent since 2009.
However, the number of individuals affected each year is a far more concerning figure than the number of breaches.
More data breaches naturally mean more individuals affected. Looking at the average number of individuals affected per breach therefore gives a better understanding of how severe the breaches have been, or are becoming, over time.
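The yearly figures discussed here (breach count, total individuals affected, and average per breach) could be computed along these lines. This is a sketch, not the report's actual code: the column names are hypothetical, and the date format (month/day/year) is an assumption about the Breach Submission Date column.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample of breaches with submission dates as text.
breaches <- data.frame(
  submitted = c("01/15/2013", "06/02/2013", "03/20/2014", "11/05/2014", "12/30/2014"),
  individuals = c(1000, 3000, 600, 900, 7500),
  stringsAsFactors = FALSE
)

yearly <- breaches %>%
  mutate(year = year(mdy(submitted))) %>%  # parse the date, then extract the year
  group_by(year) %>%
  summarise(
    total_breaches = n(),
    total_individuals = sum(individuals),
    avg_per_breach = mean(individuals)
  )
yearly
```

Dividing by the number of breaches separates severity from frequency: in this toy data 2014 has only one more breach than 2013, but its average per breach is 3,000 compared with 2,000.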
The graph shows that the number of breaches has remained relatively stable, but the number of people affected per breach is starting to increase.
Since we have removed the top 5% of our data in almost every other analysis, it is important to see how much larger these breaches were than all the others, especially the Anthem breach. The blue shading behind the Individuals Affected column represents how large a breach is in comparison to the largest breach that occurred.
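Shading of this kind can be produced in DT with formatStyle and styleColorBar. The sketch below is an assumption about how such a table might be built, not the report's code; the entity names and counts are made up, and the bar is scaled from 0 to the largest value so that each bar is proportional to the largest breach.

```r
library(DT)

# Hypothetical largest breaches (names and counts are illustrative only).
largest <- data.frame(
  entity = c("Entity A", "Entity B", "Entity C"),
  individuals = c(8000000, 2000000, 500000)
)

# Draw a light blue bar behind each value, sized relative to the maximum.
tbl <- formatStyle(
  datatable(largest),
  "individuals",
  background = styleColorBar(c(0, max(largest$individuals)), "lightblue"),
  backgroundSize = "98% 88%",
  backgroundRepeat = "no-repeat",
  backgroundPosition = "center"
)
# Evaluating `tbl` in an R Markdown chunk renders the interactive table.
```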
Some of the covered entity types have experienced more data breaches than others. The table below shows the number of breaches each type has experienced and how many individuals have been affected since 2009 as a result.

| Covered Entity Type | Total Breaches | Individuals Affected | Average Individuals Affected |
|---|---|---|---|
| Business Associate | 348 | 31,586,095 | 90,764.64 |
| Health Plan | 315 | 111,324,440 | 353,410.92 |
| Healthcare Clearing House | 4 | 17,754 | 4,438.50 |
| Healthcare Provider | 1,747 | 41,323,245 | 23,653.83 |
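A per-entity summary like the table above can be built with a grouped summarise, with knitr's kable limiting the number of digits shown. This is a minimal sketch with hypothetical column names and values, not the report's own code.

```r
library(dplyr)
library(knitr)

# Hypothetical breaches with their covered entity type.
breaches <- data.frame(
  entity_type = c("Health Plan", "Health Plan", "Healthcare Provider",
                  "Healthcare Provider", "Healthcare Provider", "Business Associate"),
  individuals = c(2000, 4000, 500, 700, 900, 1500),
  stringsAsFactors = FALSE
)

summary_tbl <- breaches %>%
  group_by(entity_type) %>%
  summarise(
    total_breaches = n(),
    individuals_affected = sum(individuals),
    avg_individuals = mean(individuals)
  )

# Render as a table, rounding numeric columns to two decimal places.
kable(summary_tbl, digits = 2)
```

The average column is simply individuals affected divided by total breaches. In the report's table, for example, 17,754 / 4 = 4,438.50 for Healthcare Clearing House.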

The bar graph below shows how many times each entity type has experienced a data breach since 2009.

| Type of Breach | Total Breaches |
|---|---|
| Hacking/IT Incident | 533 |
| Improper Disposal | 81 |
| Loss | 183 |
| Other | 93 |
| Theft | 887 |
| Unauthorized Access/Disclosure | 731 |
| Unknown | 15 |
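Counts per breach type like those above can be obtained by summing the 0/1 indicator columns created during cleaning. The sketch below uses hypothetical indicator columns. Because a single breach can list more than one type, the per-type totals can add up to more than the number of breaches, which would be consistent with the totals in this table exceeding the total number of breaches reported by entity type.

```r
# Hypothetical indicator columns for four breaches (1 = type listed).
indicators <- data.frame(
  theft = c(1, 0, 1, 1),
  hacking = c(0, 1, 1, 0),
  loss = c(0, 0, 0, 1)
)

# Sum each indicator column to get the number of breaches per type.
type_counts <- data.frame(
  type = names(indicators),
  total = unname(colSums(indicators))
)

# Show the most common types first.
type_counts[order(-type_counts$total), ]
```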

Hacking has become one of the most common ways for data breaches to occur. At 533 incidents, it is the third most common breach type. To show the scale of these breaches, the chart below shows how many hacking incidents have occurred in each of the last few years.
There is some indication that hacking incidents are becoming harder and harder to defend against. What’s worse is the number of people being impacted each year, even when excluding the top 5% of hacks.
However, the number of individuals affected per incident varies from year to year. The increase in total individuals affected is likely a direct result of the greater number of breaches that are occurring.
Improper disposal makes up a very small portion of all data breaches. With 81 incidents, it is the second least common breach type. With the movement towards electronic records, it has become less and less likely for a data breach to occur as a result of physical records being lost.
However, despite the far smaller number of improper disposal incidents, a large number of individuals are still affected each year as a result.
Relative to other breach types, the number of individuals affected per incident is still fairly large.
Over the last several years, there have been fewer data breaches as a result of lost data, with the number declining steadily since 2014.
Despite few occurrences, loss still accounts for a significant number of the individuals affected by data breaches. This graph shows how the number of people affected by these incidents has changed over the years.
Even with the small number of incidents and the high number of individuals affected overall, the number of people impacted by each breach is not especially large.
The other category covers breaches that are atypical, with no defining source for how the breach occurred. The small size of this category suggests that few breaches fall outside the named types.
However, despite the small number of other incidents, a few large data breaches in 2014 made the other category quite large in terms of individuals affected.
For the number of individuals affected per incident, 2012 showed the highest average despite having fewer total individuals affected, as shown above.
Theft is the most frequently occurring type of data breach. With 887 incidents since 2009, it has 156 more breaches than the second most common type. From 2010 through 2014, the number of data thefts was consistent from year to year. However, it appears that increased security introduced in 2015 has resulted in fewer incidents in each of the following years.
After peaking at over 175,000 people affected in 2014, the number of individuals impacted fell to its lowest level in 2015, at just over 50,000. The number of individuals affected has not exceeded 75,000 in any year since the 2014 peak.
The number of individuals per breach is still fairly high, but has been steadily declining.
Unlike the other category, the unknown category covers breaches whose cause could not be identified. These cases have been relatively rare, with none logged since 2014.
Overall, very few individuals have been affected in these cases; only two years saw more than 10,000 individuals affected.
Very few people have been impacted per incident, as shown by the graph below.