Introduction

Since 2009 there have been over 2,300 reported data breaches across the United States in which more than 500 records were breached. These breaches have affected over 188 million Americans and have occurred in all 50 states and Puerto Rico. In this report we will take a closer look at how many breaches have occurred each year and what types of breaches have occurred, try to gain some understanding of how they occurred in the first place, and look at some ways to help mitigate their occurrence in the future.

The Data

The data for this paper comes from the Office for Civil Rights (OCR) in the US Department of Health and Human Services. The data is stored in an OCR database, but for this report it has been moved to CSV files hosted on the website http://asayanalytics.com/. The CSV files from this website (http://asayanalytics.com/breach_archive_csv, https://asayanalytics.com/breach_investigation_csv) were read into R to create this report. The first file contains all reported breaches that have a completed investigation. The second file contains all reported breaches that have an ongoing investigation. These two files were merged into a single working data set with the following columns:

  1. Name of Breached Business: This column contains all the businesses that have experienced at least one data breach of over 500 records over the past 8 years
  2. State of Breached Business: This column contains the two-letter abbreviation of the state in which the business resides
  3. Breached Business Type: This column identifies the type of business each breached business operates in
  4. Number of Records Breached: Column of total records breached per reported breach date
  5. Date Breach was Submitted: Column containing the date on which the breached business reported the breach to the OCR
  6. Weekday of Breach Submission: Column of the day of the week for the Date Breach was Submitted
  7. Year of Breach: Column of the year for the Date Breach was Submitted
  8. Type of Breach: Column containing how the breach occurred (i.e. Hacking/IT Incident, Improper Disposal, Loss, Theft, Unauthorized Access/Disclosure, Unknown and Other)
  9. Source of Breach: Column containing the source of the breach (i.e. Email, Desktop Computer, Electronic Medical Record, Laptop, Network Server, Other Portable Electronic Device, Paper/Film and Other)
  10. Was a Business Associate Present: This column contains a flag denoting whether an employee of the business was present at the time of the breach (Yes denotes an employee present)
  11. Is Breach Investigation Completed or Ongoing: This column identifies breaches that have completed their investigation and those that have not
  12. Web Description (not shown): This column contains the actual words that were used to describe the breach. This data will be used later for a word cloud but is not included in the summary table below
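The merge and the two derived date columns described above can be sketched as follows. The report itself was built in R and its code is not shown, so this is a minimal Python illustration only. The column names ("Name", "State", "Records", "Submitted"), the MM/DD/YYYY date format, the "Investigation" flag values and the sample rows are all assumptions made for the example, not the actual layout of the OCR files.

```python
import csv
import io
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def load_breaches(archive_file, investigation_file,
                  date_key="Submitted", fmt="%m/%d/%Y"):
    """Merge both files and add status, weekday and year columns.

    date_key and fmt are assumptions about the CSV layout.
    """
    merged = []
    sources = ((archive_file, "Completed"), (investigation_file, "Ongoing"))
    for handle, status in sources:
        for row in csv.DictReader(handle):
            submitted = datetime.strptime(row[date_key], fmt)
            row["Investigation"] = status              # completed vs. ongoing
            row["Weekday"] = WEEKDAYS[submitted.weekday()]
            row["Year"] = submitted.year
            merged.append(row)
    return merged


# Made-up in-memory stand-ins for the two downloaded CSV files.
archive = io.StringIO(
    "Name,State,Records,Submitted\nExample Clinic,TX,1200,03/02/2018\n")
ongoing = io.StringIO(
    "Name,State,Records,Submitted\nExample Plan,CA,800,11/15/2017\n")

breaches = load_breaches(archive, ongoing)
for breach in breaches:
    print(breach["Name"], breach["Investigation"],
          breach["Weekday"], breach["Year"])
```

Tagging each row with its source file while merging is what makes the "completed or ongoing" column possible, because that information is not stored inside either file in this sketch.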

The table below contains all the data that will be used in this report. Please note that the following cleaning procedures have been applied to prepare the raw data for this report:

  1. Any duplicate data was removed (~49 records). Duplicates were identified as businesses with the same state, number of breached records, business type and breach type. Because the same business appears under multiple names in the dataset, the business name could not be used to help identify duplicates
  2. For all businesses with multiple types of breach, the first reported breach type was used as the primary breach type (while there is a good argument for including all types, for simplicity's sake we will only concern ourselves with the first reported breach type)
  3. For any business with multiple sources of breach listed, the first listed source was used (as with the type of breach, there is a good argument for using all sources, but for simplicity this report focuses on the first reported source)
  4. Any record that has a null (or blank) value for Business Name, State, Breached Business Type, Number of Records Breached, Date Breach was Submitted, Type of Breach, or Source of Breach was removed (~8 records)
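The four cleaning rules above could look roughly like the following Python sketch (the report's own R code is not shown). The column names, the comma separator between multiple breach types or sources, and the order in which the rules are applied are assumptions made for illustration. All sample rows are invented.

```python
# Hypothetical column names; the real CSV headers may differ.
REQUIRED = ("Name", "State", "Business Type", "Records",
            "Submitted", "Breach Type", "Source")


def first_listed(value, sep=","):
    """Keep only the first entry of a multi-valued field (separator assumed)."""
    return value.split(sep)[0].strip()


def clean(rows):
    """Drop incomplete rows, keep first type/source, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Rule 4: skip rows with a null or blank required field.
        if any(not (row.get(key) or "").strip() for key in REQUIRED):
            continue
        row = dict(row)  # copy so the caller's data is left untouched
        # Rules 2 and 3: keep only the first listed type and source.
        row["Breach Type"] = first_listed(row["Breach Type"])
        row["Source"] = first_listed(row["Source"])
        # Rule 1: business name is deliberately not part of the key.
        key = (row["State"], row["Records"],
               row["Business Type"], row["Breach Type"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


base = {"State": "TX", "Business Type": "Healthcare Provider",
        "Records": "1200", "Submitted": "03/02/2018"}
raw = [
    dict(base, Name="Example Clinic",
         **{"Breach Type": "Theft, Loss", "Source": "Laptop, Paper/Film"}),
    dict(base, Name="Example Clinic Inc.",
         **{"Breach Type": "Theft", "Source": "Laptop"}),   # duplicate
    dict(base, Name="", **{"Breach Type": "Loss", "Source": "Email"}),
]
cleaned_rows = clean(raw)
print(len(cleaned_rows), cleaned_rows[0]["Breach Type"],
      cleaned_rows[0]["Source"])
```

Leaving the name out of the duplicate key mirrors rule 1, but it also means two genuinely different businesses that share a state, record count, business type and breach type would be collapsed into one.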

This report should be used to help the user understand how often data breaches are happening and what sources are being used to breach data, and to identify which states have the most opportunity to improve their security. All tables in this report are interactive and can be sorted and filtered to look at specific results.

Breaches by Year

Let us first take a look at how many breaches have been reported since 2009. For this graph we will include all reported breaches, regardless of whether the investigation is complete. The top 5% of reported breaches (ranked by breached record count) have been removed.

In the above graph we can see that breaches have grown substantially since 2009 (from 15 in 2009 to 340 in 2017).
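As a rough illustration of how such a trimmed yearly count could be produced, the Python sketch below drops the largest 5% of rows by record count and then counts breaches per year. The report does not say exactly how the top 5% was determined, so the rank-based cut used here is an assumption, as are the field names "Year" and "Records". The sample rows are synthetic.

```python
from collections import Counter


def drop_top_percent(rows, percent=5, key="Records"):
    """Remove the largest `percent` of rows, ranked by `key` (assumed rule)."""
    ranked = sorted(rows, key=lambda row: row[key])
    n_drop = len(ranked) * percent // 100   # integer math avoids float issues
    return ranked[: len(ranked) - n_drop]


def breaches_per_year(rows):
    """Return (year, count) pairs in chronological order."""
    return sorted(Counter(row["Year"] for row in rows).items())


# Synthetic data: 19 ordinary breaches plus one very large one.
sample = [{"Year": 2016 + i % 2, "Records": 500 + 10 * i} for i in range(19)]
sample.append({"Year": 2017, "Records": 5_000_000})

trimmed = drop_top_percent(sample)
print(breaches_per_year(trimmed))
```

With 20 rows, the 5% cut removes exactly one row, the 5,000,000-record outlier, so the yearly counts describe only the more typical breaches.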

Healthcare Breached Records

Now that we understand the ever-growing occurrence of data breaches, let us take a closer look at healthcare record breaches. In this graph, the average healthcare record breach count is displayed by year. The top 5% of reported record breaches have been removed.

We learn from the above graph that 2015 and 2016 were very bad years for healthcare data record breaches. However, both 2017 and year-to-date 2018 show that this trend is not continuing.

Largest Breaches in Healthcare

Now that we have seen how healthcare breaches have changed year over year, let us take a look at the top 20 healthcare record breaches in our dataset:

From this table we learn that the 2015 Anthem breach was the largest in healthcare breach history (78,800,000 records breached), roughly 7 times the size of the next closest one (Premera Blue Cross with 11,000,000 records breached).

Hacking/IT Incidents

Let us now look at how hacking has played a part in data breaches over the past 9 years.

It can clearly be seen that hacking incidents are on the rise, which raises concern over what businesses are doing to make their systems harder to hack. Hacking incidents, as a whole, have also grown more and more prevalent across America over the past few years. Any company that does business should be diligent in ensuring that its data is as secure as possible from hackers.

Breach by Type

The table below allows us to look at the number of breaches that have occurred year over year by business type:

Breach Reporting

So far we have concerned ourselves with merely understanding our breaches (i.e. their trends, types and sizes). Let us now spend a little time focusing on the timing of the reporting of these breaches. In the chart below we can see what day of the week a breach is reported to the public:

Clearly a business is more likely to report a breach on a Friday than on any other day of the week. One possible explanation is that the company hopes to avoid a firestorm of questions and demands for answers as soon as the announcement is made. Another possible explanation is that most of these companies' corporate offices are closed over the weekend, so by making the announcement on Friday they can avoid having to answer questions for a few days.

Breach Type Year over Year Change

Let us switch our focus again and take a look at how the different types of breaches have changed over the years. The set of graphs below helps us see how each breach type has changed from 2009 to 2018:

Some interesting changes are that theft has been in decline for the past few years, while both Hacking/IT and Unauthorized Access have increased. One possible explanation is that the classification of breach types has changed, and more and more incidents that used to fall under the theft category are now being put into the Hacking/IT and Unauthorized Access categories.

What is Being Said About These Breaches

The word cloud below shows us the top 50 words used to describe the breaches.
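The word frequencies behind a word cloud like this one can be computed along the following lines. This is a hedged Python sketch, not the method used for the report: the tokenizing rule, the small stop-word list and the sample descriptions are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in",
             "was", "that", "on", "for"}


def top_words(descriptions, n=50):
    """Return the n most frequent non-stop-words as (word, count) pairs."""
    counts = Counter()
    for text in descriptions:
        # Lower-case the text and keep runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)


# Invented descriptions standing in for the Web Description column.
descriptions = [
    "The laptop was stolen",
    "A stolen laptop and a stolen phone",
]
print(top_words(descriptions, n=2))
```

Removing common words first matters, because otherwise words such as "the" and "of" would dominate the cloud and hide the terms that actually describe the breaches.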

Exploring the Data a Little More

While it has been established that breaches are most often reported on a Friday, do we see a similar trend across the months of the year (i.e. do more breaches get reported during the summer than during the winter months)? To answer this question we will look at the count of breaches reported by month:

The graph shows that there is a slight increase in reported breaches in March and April. A breach reported in the March-April time frame would likely have occurred 2-3 months prior, suggesting that much of the actual breaching takes place during the winter months.

Let us also consider the question: are most breaches isolated to a single state, or is the spread fairly equal? To answer this question we will take a look at the number of breaches reported by state. (Just as a reminder, Puerto Rico is part of this dataset and its abbreviation is PR):

California and Texas both have some very large spikes in counts of data breaches, with Florida, New York and Illinois rounding out the top 5. However, does a large number of data breaches equate to a large number of records breached? To answer this question, let us take a look at the number of records breached by state:

Indiana businesses have had by far the most records breached of any state. However, the Anthem data breach (the largest in our data set) is located in Indiana. Let us remove this breach from our graph and take another look at how many records businesses in each state have had breached:

With the removal of the Anthem data breach, New York, Tennessee, Washington, Arizona and Florida have the most breached records in the United States. So for businesses in both Florida and New York, more data breaches have resulted in more records breached. Businesses in these states must take extra care when handling their employees' and customers' data.
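The effect of removing a single extreme breach from a per-state total can be sketched in Python as follows. The Anthem record count and state are taken from this report. Every other row, the field names, and the way the outlier is matched by name are assumptions made for the example.

```python
from collections import defaultdict


def records_by_group(rows, group_key, exclude_names=()):
    """Sum breached records per group, largest first, skipping named outliers."""
    totals = defaultdict(int)
    for row in rows:
        if row["Name"] in exclude_names:
            continue
        totals[row[group_key]] += row["Records"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


sample = [
    {"Name": "Anthem", "State": "IN", "Records": 78_800_000},
    # The rows below are invented placeholders, not real breaches.
    {"Name": "Example Hospital", "State": "NY", "Records": 3_000_000},
    {"Name": "Example Clinic", "State": "NY", "Records": 500_000},
    {"Name": "Example Plan", "State": "FL", "Records": 2_000_000},
]

print(records_by_group(sample, "State"))
print(records_by_group(sample, "State", exclude_names={"Anthem"}))
```

The same function works for any grouping column, so passing a business-type or source field instead of "State" would give the other with-and-without-outlier comparisons discussed in this section.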

Next let us ask the question: how do breached record counts vary across the different business types? To answer this, let us look at the graph below:

This graph clearly shows that Health Plan businesses account for by far the most breached records of any business type. Because the Anthem organization is part of the Health Plan business type, let us remove it from our data and take a closer look at how many records are breached by business type:

With the Anthem data breach removed, it can be seen that the Healthcare Provider business type has now become our largest bucket. The past few graphs also give us insight into just how large the Anthem data breach was. Both times that this breach was removed from our graphs, the resulting graphs told a very different story. From this we can get a real understanding of how important it is to ensure that extreme outliers like this are removed from any analysis performed.

Having gained a better understanding of the location of the businesses that have data breaches and the types of breaches that occur, let us get a better understanding of the source of the breach. In this graph we will look at the count of breaches by source:

Below is a table that gives the number of records breached by source. Again the Anthem data breach has been removed:

While sources like email, laptop and network server account for the majority of data records stolen, paper/film sources account for the largest count of breaches. This indicates two things: whenever a person is able to gain access to confidential data through electronic means, they are able to steal vast quantities of data; and many company employees are not being careful enough in their handling of paper documents containing sensitive data.

From these findings we can suggest that companies need to ensure that their employees understand proper paper document handling. Companies also need to have a paper document handling process and work to have it followed 100% of the time. Companies also need to make sure that their employees are following the guidelines for email and laptop use. Specifically, they should ensure that company email accessed from employees' personal smartphones and computers is secured against third-party access. Businesses whose employees use laptops also need to ensure that potentially harmful third-party software and websites cannot be accessed from a company-owned laptop or computer.

Finally let us take a look at how the source of the breaches changes when an associate is present:

This graph shows that most data breaches occur when an associate is not present. This helps to give credence to the idea that the best way to stop theft is to have someone watching.

Conclusion

While data breaches will continue to be part of the world that we live in, there are many simple steps that a company can take to ensure that our data is kept safe. Simple steps like only allowing access to personal data while on a secured network, not allowing third-party software and apps access to company hardware, and training employees in how to handle paper/film documents can have a profound impact on data breaches.