Introduction

This document reports on healthcare information breaches that occurred between 2009 and 2018. Two data sets are used: one covers completed investigations and the other covers investigations that are still in progress. Only breaches that affected at least 500 persons are included.

The data set records the name of the affected entity, the type of breach, the number of persons affected, the medium through which the breach occurred, the date of the breach, the state, whether a business associate was present, and the current status of the investigation.

The data set is explored using multiple EDA techniques, including visualizations and tables. No statistical analysis was performed for this report. Before the exploratory analysis, the data set was cleaned to make it easier to work with.

This analysis will help consumers by showing what they should look for in terms of the data security offered by their providers. By the end of this report, readers should have a clear picture of the current data breach landscape and know what they need to protect themselves against.

Brief Numerical Descriptions:

The mean breach size in the data set is 109,160 persons, the maximum is 78,800,000, and the minimum is 500. Even this brief glimpse shows substantial positive skewness (33.4546291), caused by a few breaches that were enormous relative to the majority.
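To make the skewness figure above more concrete, here is a minimal sketch of how such summary numbers can be computed. It assumes the moment-based definition of skewness (third central moment divided by the 1.5 power of the second); the report does not state which estimator produced its value, so this is an assumption. The `sizes` list and the `moment_skewness` helper are hypothetical and are not taken from the report's data.

```python
from statistics import fmean


def moment_skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical breach sizes (illustration only, NOT the report's data):
# many small breaches plus one very large one.
sizes = [500, 650, 800, 1200, 2500, 4000, 1_000_000]

print("mean:", fmean(sizes))
print("min:", min(sizes), "max:", max(sizes))
print("skewness:", moment_skewness(sizes))
```

With a single very large value, the mean sits far above the typical breach size and the skewness is positive, which is the same pattern described above.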

Data Table

Scale & Trend of Breaches

Both graphs above show that the number of breaches and their size seem to be increasing over time. After removing outliers, both graphs are easier to read and the trends are easier to see than in the original report.

Ranking of Breaches

Anthem suffered the largest reported data breach, by a substantial margin over the other entities in the table. Anthem is one of the largest medical insurance providers, so it can be expected to hold an enormous database for attackers to steal from.

Hacking/IT Incidents Trend

Hacking/IT Incidents appear to have been increasing exponentially over time. This might be due to growing business digitization and the migration from paper to electronic records, which makes a larger pool of information available to steal. It could also be that there are more hackers, that hackers are becoming more sophisticated, that IT departments are falling behind, or that companies are neglecting their cybersecurity. More information would be needed to support any of these explanations.
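One way to check whether growth is roughly exponential is to fit a straight line to the logarithm of the yearly counts: if the fit is good, the slope gives a constant yearly growth factor. The sketch below is a minimal illustration of that idea, not the method used in this report. The `log_linear_fit` helper and the yearly counts are hypothetical, and all counts must be greater than zero for the logarithm to be defined.

```python
import math


def log_linear_fit(years, counts):
    """Least-squares fit of log(count) = a + b * year.

    Returns (a, b, r_squared), where r_squared is measured on the log scale.
    """
    logs = [math.log(c) for c in counts]
    n = len(years)
    x_mean = sum(years) / n
    y_mean = sum(logs) / n
    sxx = sum((x - x_mean) ** 2 for x in years)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(years, logs))
    b = sxy / sxx
    a = y_mean - b * x_mean
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(years, logs))
    ss_tot = sum((y - y_mean) ** 2 for y in logs)
    return a, b, 1 - ss_res / ss_tot


# Hypothetical yearly incident counts (illustration only, NOT the report's data)
years = [2010, 2011, 2012, 2013, 2014]
counts = [10, 20, 40, 80, 160]

a, b, r2 = log_linear_fit(years, counts)
print("estimated yearly growth factor:", math.exp(b))
print("R^2 on the log scale:", r2)
```

An R^2 close to 1 on the log scale is consistent with exponential growth; a poor fit would suggest the increase follows some other pattern.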

Breaches by Day & Type

Friday seems to be the day on which most breaches occur. Breaches caused by theft have been steadily decreasing, whereas hacking and unauthorized disclosure have been growing rapidly over the years. Again, while we don’t have any empirical evidence to support this, it might be because records have been moving from traditional paper and physical formats to digital ones. Either that, or healthcare insurance providers' cybersecurity is slacking.
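The counts behind a day-of-week and type-by-year comparison can be sketched as follows. This is a minimal illustration that assumes each breach is stored as a (date, breach type) pair; the `records` list and both helper functions are hypothetical and do not reflect the report's actual data or column names.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def count_by_weekday(records):
    """Count breaches per weekday name from (date, breach_type) pairs."""
    return Counter(WEEKDAYS[d.weekday()] for d, _ in records)


def count_by_year_and_type(records):
    """Count breaches per (year, breach_type) pair."""
    return Counter((d.year, kind) for d, kind in records)


# Hypothetical records (illustration only, NOT the report's data)
records = [
    (date(2017, 6, 2), "Theft"),
    (date(2018, 1, 5), "Hacking/IT Incident"),
    (date(2018, 1, 12), "Hacking/IT Incident"),
    (date(2018, 2, 7), "Unauthorized Access/Disclosure"),
    (date(2018, 3, 19), "Theft"),
]

by_day = count_by_weekday(records)
by_year_type = count_by_year_and_type(records)

print(by_day.most_common())
print(sorted(by_year_type.items()))
```

Using `date.weekday()` with a fixed list of names avoids depending on the system locale, which `strftime("%A")` would.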

Most Frequent Words in Breaches by Hacking/IT Incident

Most Frequent Words in Breaches by Theft

EDA

The state of California had by far the largest number of breaches. This could be because California has the largest population in the US, or because many healthcare providers are based or have large operations there.

All investigations of breaches prior to 2016 have been completed, whereas the majority of investigations of 2018 breaches are still ongoing. If we had data on the average time it takes to complete an investigation for each type of breach, we could build a model to forecast when all the investigations will be completed.

We can see that most unauthorized access/disclosure breaches occur through paper and film media, whereas hacking and IT incidents mostly occur on network servers or email. Most breaches due to theft seem to involve laptops. I would conjecture two possible explanations for the latter point: either malware steals documents from laptops (although that should perhaps be classified as hacking), or the laptops are physically stolen.

Breaches seem to be slightly less frequent early in the year; a slight jump occurs between March and April, and the level then remains steady for the remainder of the year. The cause of this seasonality is unknown and discovering it is outside the scope of this report, but it might be a useful signal for companies to be on the watch during March and April, whatever the type of breach.

The two tables above corroborate what we saw in one of the graphs above: Hacking/IT Incidents occur most often on network servers and email, whereas Unauthorized Access/Disclosure occurs more frequently through paper/film media.

Among the breaches in which a business associate was present, unauthorized access/disclosure was the most common type. One of the tables above showed that the largest number of unauthorized access/disclosure breaches occurred through paper/film, so perhaps associates were most often present when this type of breach occurred through this medium.

Over the past three years of breaches (the only ones for which investigations are still ongoing), Hacking/IT Incidents account for the largest number of breaches whose investigations are still in progress.
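A type-by-medium table like the ones discussed above can be built by counting (breach type, medium) pairs. The sketch below is a minimal, dependency-free illustration; the `cross_tab` helper and the `records` list are hypothetical and are not the report's data or code.

```python
from collections import Counter


def cross_tab(records):
    """Build a nested dict {breach_type: {medium: count}} from pairs.

    Combinations that never occur are filled with 0 so every row has
    the same columns.
    """
    counts = Counter(records)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


# Hypothetical (breach type, medium) pairs (illustration only)
records = [
    ("Hacking/IT Incident", "Network Server"),
    ("Hacking/IT Incident", "Network Server"),
    ("Hacking/IT Incident", "Email"),
    ("Unauthorized Access/Disclosure", "Paper/Films"),
    ("Unauthorized Access/Disclosure", "Paper/Films"),
    ("Theft", "Laptop"),
]

table = cross_tab(records)
for breach_type, row in table.items():
    print(breach_type, row)
```

Reading across a row shows which medium dominates for each breach type, which is the comparison made in the paragraph above.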