Since 2009, more than 2,300 data breaches, each involving over 500 records, have been reported across the United States. These breaches have affected over 188 million Americans and have occurred in all 50 states and Puerto Rico. In this report we will take a closer look at how many breaches have occurred each year and what types of breaches they were, try to gain some understanding of how they occurred in the first place, and look at some ways to help mitigate their occurrence in the future.
The data for this paper comes from the Office for Civil Rights (OCR) at the US Department of Health and Human Services. The data is stored in an OCR database, but for this report it has been exported to CSV files hosted on the website http://asayanalytics.com/. The CSV files from this website (http://asayanalytics.com/breach_archive_csv, https://asayanalytics.com/breach_investigation_csv) were read into R to create this report. The first file contains all reported breaches that have a completed investigation. The second file contains all reported breaches that have an ongoing investigation. These two files were merged into a single working data set with the following columns:
The table below contains all the data used in this report. Please note that the following cleaning procedures have been applied to prepare the raw data for this report:
This report is intended to help the reader understand how often data breaches are happening, what sources are being used to breach data, and which states have the most opportunity to improve their security. All tables in this report are interactive and can be sorted and filtered to look at specific results.
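The merge of the completed and ongoing investigation files can be sketched as follows. The report itself was built in R; this is a minimal Python illustration of the same idea, not the author's code. The column names (`name`, `state`, `records`, `submitted`), the `status` tag, and the inline sample rows are all assumptions made for the example and do not come from the OCR data.

```python
import csv
import io

# Hypothetical miniature stand-ins for the two CSV exports. The real
# column names are not given in the report, so these are assumed.
ARCHIVE_CSV = """name,state,records,submitted
Clinic A,TX,1200,2015-03-06
Plan B,IN,64000,2015-03-13
"""
INVESTIGATION_CSV = """name,state,records,submitted
Hospital C,CA,5400,2018-01-19
"""


def load(text, status):
    """Parse one CSV export and tag each row with its investigation status."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["records"] = int(row["records"])  # record counts arrive as text
        row["status"] = status
        rows.append(row)
    return rows


# Stack the two files into one data set, keeping track of where each row came from.
breaches = load(ARCHIVE_CSV, "completed") + load(INVESTIGATION_CSV, "ongoing")

for b in breaches:
    print(b["name"], b["state"], b["records"], b["status"])
```

Tagging each row with its source file keeps the option of filtering to completed investigations later without re-reading the files.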
Let us first take a look at how many breaches have been reported since 2009. This graph includes all reported breaches, regardless of whether the investigation is complete. The top 5% of reported breaches (ranked by breached record count) have been removed.
In the above graph we can see that the number of reported breaches has grown substantially since 2009 (from 15 in 2009 to 340 in 2017).
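The trimming and yearly count behind a graph like this can be sketched as below. This is a Python illustration under stated assumptions, not the report's R code: the (year, records) pairs are invented, and dropping the largest 5% of rows with the cutoff rounded down is one possible reading of "top 5% removed" that the report does not confirm.

```python
from collections import Counter

# Hypothetical (year, breached_records) pairs; not taken from the OCR data.
breaches = [
    (2009, 600), (2009, 900), (2009, 1200),
    (2010, 700), (2010, 1500), (2010, 2100), (2010, 800),
    (2011, 650), (2011, 5000), (2011, 1100),
    (2011, 950), (2011, 3200), (2011, 720),
    (2012, 510), (2012, 880), (2012, 4300),
    (2012, 1900), (2012, 760), (2012, 2500),
    (2012, 9000000),
]

# Number of rows in the top 5% by record count (rounded down, an assumption).
n_drop = len(breaches) * 5 // 100

# Sort ascending by record count and keep everything except the largest rows.
kept = sorted(breaches, key=lambda b: b[1])[: len(breaches) - n_drop]

# Count the remaining breaches per year.
per_year = Counter(year for year, _ in kept)

for year in sorted(per_year):
    print(year, per_year[year])
```

With 20 sample rows, exactly one row (the 9,000,000-record breach) is dropped before counting.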
Now that we understand the ever-growing occurrence of data breaches, let us take a closer look at how many healthcare records are breached. In this graph, the average number of healthcare records breached per incident is displayed by year. The top 5% of reported breaches by record count have been removed.
We learn from the above graph that 2015 and 2016 were very bad years for healthcare record breaches. However, both 2017 and year-to-date 2018 show that this trend is not continuing.
Now that we have seen how healthcare breaches have changed year over year, let us take a look at the 20 largest healthcare record breaches in our dataset:
From this table we learn that the 2015 Anthem breach was the largest healthcare breach in our dataset (78,800,000 records breached), roughly 7 times larger than the next closest one (Premera Blue Cross, with 11,000,000 records breached).
Let us now look at the part hacking has played in data breaches over the past 9 years.
It can clearly be seen that hacking incidents are on the rise, which raises concern over what businesses are doing to make their systems harder to hack. Hacking incidents as a whole have also grown more prevalent across America over the past few years. Any company that does business should be diligent in ensuring that its data is as secure as possible from hackers.
The table below shows the number of breaches that have occurred year over year by business type:
So far we have concerned ourselves merely with understanding the breaches themselves (i.e. their trends, types and sizes). Let us now spend a little time on the timing of the reporting of these breaches. The chart below shows the day of the week on which breaches are reported to the public:
Clearly a business is more likely to report a breach on a Friday than on any other day of the week. One possible explanation is that the company hopes to avoid a firestorm of questions and demands for answers as soon as the announcement is made. Another possible explanation is that most of these companies' corporate offices are closed over the weekend, so by making the announcement on Friday they can avoid having to answer questions for a few days.
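A day-of-week tally like the one discussed above can be sketched as follows. This is a Python illustration rather than the report's R code, and the report dates are invented sample values in an assumed ISO `YYYY-MM-DD` format.

```python
from collections import Counter
from datetime import date

# Fixed English day names, indexed by date.weekday() (Monday is 0).
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical report dates; not taken from the OCR data.
submitted = ["2018-01-05", "2018-01-08", "2018-01-12",
             "2018-01-17", "2018-01-19"]

# Convert each date string to its weekday name and count occurrences.
by_weekday = Counter(DAYS[date.fromisoformat(s).weekday()] for s in submitted)

for day in DAYS:
    print(day, by_weekday[day])
```

Using a fixed list of day names instead of locale-dependent formatting keeps the labels the same on any machine.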
Let us switch our focus again and take a look at how the different types of breaches have changed over the years. The set of graphs below shows how each breach type has changed from 2009 to 2018:
Some interesting changes are that theft has been in decline for the past few years, while both Hacking/IT and Unauthorized Access have increased. One possible explanation is a change in how breaches are classified: more and more incidents that used to fall under the theft category are now being put into the Hacking/IT and Unauthorized Access categories.
The word cloud below shows the top 50 words used to describe the breaches.
While it has been established that breaches are most often reported on a Friday, do we see a similar trend across the months of the year (i.e. do more breaches get reported during the summer than during the winter months)? To answer this question we will look at the count of breaches reported by month:
The graph shows a slight increase in reported breaches in March and April. A breach reported in the March to April time frame would likely have occurred 2-3 months prior, which suggests that much of the actual breaching takes place during the winter months.
Let us also consider the question: are most breaches concentrated in a single state, or is the spread fairly equal? To answer this question we will take a look at the number of breaches reported by state (as a reminder, Puerto Rico is part of this dataset and its abbreviation is PR):
California and Texas both show very large spikes in the count of data breaches, with Florida, New York and Illinois rounding out the top 5. However, does a large number of data breaches equate to a large number of records breached? To answer this question, let us take a look at the number of records breached by state:
Indiana businesses have had by far more records breached than businesses in any other state. However, the Anthem data breach (the largest in our data set) is located in Indiana. Let us remove this breach from our graph and take another look at how many records businesses in each state have had breached:
With the removal of the Anthem data breach, New York, Tennessee, Washington, Arizona and Florida have the most breached records in the United States. So for businesses in both Florida and New York, more data breaches have resulted in more records breached. Businesses in these states must take extra care when handling their employees' and customers' data.
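The effect of removing a single very large breach from a per-state total can be sketched as below. This is a Python illustration with invented entities and record counts, not the report's R code or its data. The helper `records_by_state` is a hypothetical name introduced for the example.

```python
from collections import defaultdict

# Hypothetical breaches; names, states and counts are made up for illustration.
breaches = [
    {"name": "Entity A", "state": "IN", "records": 9000000},
    {"name": "Entity B", "state": "IN", "records": 4000},
    {"name": "Entity C", "state": "NY", "records": 250000},
    {"name": "Entity D", "state": "NY", "records": 120000},
    {"name": "Entity E", "state": "FL", "records": 90000},
]


def records_by_state(rows):
    """Sum breached records per state, largest total first."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["state"]] += r["records"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Identify the single largest breach and build a copy of the data without it.
largest = max(breaches, key=lambda r: r["records"])
without_largest = [r for r in breaches if r is not largest]

print("All breaches:   ", records_by_state(breaches))
print("Largest removed:", records_by_state(without_largest))
```

In this sample the top-ranked state changes once the largest breach is excluded, which is the kind of shift described above.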
Next let us ask: how do breached record counts vary across the different business types? To answer this, let us look at the graph below:
This graph clearly shows that Health Plan businesses account for by far more breached records than any other type of business. Because the Anthem organization is part of the Health Plan business type, let us remove it from our data and take a closer look at how many records are breached by business type:
With the Anthem data breach removed, the Healthcare Provider business type becomes our largest bucket. The past few graphs also give us insight into just how large the Anthem data breach was. Both times that this breach was removed from our graphs, the resulting graphs told a very different story. From this we can get a real understanding of how important it is to ensure that extreme outliers like this are removed from any analysis performed.
Having gained a better understanding of where the businesses with data breaches are located and the types of breaches that occur, let us turn to the source of the breach. This graph shows the count of breaches by source:
Below is a table that gives the number of records breached by source. Again, the Anthem data breach has been removed:
While sources like email, laptop and network server account for the majority of data records stolen, paper/film sources account for the largest count of breaches. This indicates that whenever a person is able to gain access to confidential data through electronic means, they are able to steal vast quantities of data. It also indicates that many company employees are not being careful enough in handling paper documents that contain sensitive data.
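The distinction between the number of breaches and the number of records breached per source can be sketched as follows. This is a Python illustration with invented values, not the report's R code or its data. The source labels are assumptions modeled on the categories named in the text.

```python
from collections import Counter

# Hypothetical (source, breached_records) pairs; not taken from the OCR data.
breaches = [
    ("Paper/Films", 600), ("Paper/Films", 800), ("Paper/Films", 550),
    ("Network Server", 150000),
    ("Email", 40000),
    ("Laptop", 25000),
]

# Number of breach incidents per source.
incident_counts = Counter(source for source, _ in breaches)

# Total records breached per source.
record_totals = Counter()
for source, records in breaches:
    record_totals[source] += records

print("Most incidents:", incident_counts.most_common(1)[0])
print("Most records:  ", record_totals.most_common(1)[0])
```

In this sample, paper/film sources lead on incident count while an electronic source leads on records breached, which is the same pattern the report describes.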
From these findings we can suggest that companies need to ensure that their employees understand proper paper document handling. Companies also need to have a paper document handling process and work toward full compliance with it. Companies also need to make sure that their employees are following the guidelines for email and laptop use; specifically, employees who access email on personal smartphones or computers must have those devices secured against third-party access. Businesses whose employees use laptops also need to ensure that potentially harmful third-party software and websites cannot be accessed from a company-owned laptop or computer.
Finally, let us take a look at how the source of the breaches changes when an associate is present:
This graph shows that most data breaches occur when an associate is not present. This lends credence to the idea that the best way to stop theft is to have someone watching.
While data breaches will continue to be part of the world that we live in, there are many simple steps that a company can take to ensure that our data is kept safe. Simple steps like only allowing access to personal data while on a secured network, not allowing third-party software and apps access to company hardware, and training employees in how to handle paper/film documents can have a profound impact on data breaches.