Introduction

Purpose:

The purpose of this document is to make the data easier to interact with and understand. Below, you will find summary statistics, interactive tables, graphs, and other forms of basic analysis of the information.

The Data:

The data included in this report is from the US Department of Health and Human Services’ Office for Civil Rights. One of the Office's roles is to collect data about breaches of unsecured protected health information (PHI). Breaches affecting 500 or more individuals must be reported and are the ones included here. For more detailed information on the data and the variables included, please see the Data section below.

Benefit/Use:

This data describes reported health care breaches in detail. It could be used to monitor companies’ HIPAA violations and to find summary information and commonalities across these kinds of breaches. Combined with other data, such as population or location data, it can be even more useful.

Packages Needed

Here are the packages needed for running this assignment:

| Package | Explanation |
|---|---|
| tidyverse | Group of packages that includes readr, dplyr, ggplot2, etc. |
| DT | Allows the user to make interactive tables |
| lubridate | Makes it easier to work with dates and date data types |
| sqldf | Allows the user to write SQL code for querying data frames |

The Data

The original data for this analysis was loaded from two separate files. A new variable called investigation_complete was added to each file to differentiate the two sets, and the files were then combined into one.
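A minimal sketch of that combining step is shown here. The two small tibbles are hypothetical stand-ins for the source files, the snake_case column names are assumed, and storing investigation_complete as TRUE/FALSE is an assumption rather than the report's confirmed encoding.

```r
library(dplyr)

# Hypothetical stand-ins for the two source files; in practice each would be
# read from disk (for example with readr::read_csv()).
closed_cases <- tibble(
  name_of_covered_entity = c("Clinic A", "Hospital B"),
  individuals_affected   = c(1200, 56000)
)
open_cases <- tibble(
  name_of_covered_entity = c("Insurer C"),
  individuals_affected   = c(800)
)

# Tag each set before combining so the origin of every row is preserved.
breaches <- bind_rows(
  closed_cases %>% mutate(investigation_complete = TRUE),
  open_cases   %>% mutate(investigation_complete = FALSE)
)
```

Adding the flag before binding means no row loses track of which file it came from.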

The combined data was then cleaned in 3 steps:

  1. Remove duplicates

  2. Separate data in ‘Type of Breach’ and ‘Location of Breached Information’ columns

  3. Classify ‘Breach Submission Date’ as a date data type
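The three cleaning steps above could look roughly like the sketch below. It assumes snake_case column names, that merged values are separated by commas, and that dates arrive as MM/DD/YYYY text; none of these details are confirmed by the report, and the sample rows are invented for illustration.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Illustrative rows only; the first two are exact duplicates.
raw <- tibble(
  name_of_covered_entity = c("Clinic A", "Clinic A", "Hospital B"),
  type_of_breach = c("Theft", "Theft",
                     "Hacking/IT Incident, Unauthorized Access/Disclosure"),
  location_of_breached_information = c("Laptop", "Laptop",
                                       "Email, Network Server"),
  breach_submission_date = c("03/14/2015", "03/14/2015", "11/02/2016")
)

cleaned <- raw %>%
  # Step 1: remove duplicate rows.
  distinct() %>%
  # Step 2: split merged values into a primary and an additional column.
  separate(type_of_breach,
           into = c("type_of_breach", "additional_type"),
           sep = ",\\s*", extra = "merge", fill = "right") %>%
  separate(location_of_breached_information,
           into = c("location_of_breached_information", "additional_location"),
           sep = ",\\s*", extra = "merge", fill = "right") %>%
  # Step 3: parse the submission date text into a Date.
  mutate(breach_submission_date = mdy(breach_submission_date))
```

With `fill = "right"`, rows that have only one type or location get NA in the additional column instead of raising a warning.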

The Variables

| Variable | Description |
|---|---|
| Name of the Covered Entity | Organization responsible for the PHI |
| State | US state where the breach was reported |
| Covered Entity Type | Type of organization responsible for the PHI |
| Individuals Affected | Number of records affected by the breach |
| Breach Submission Date | Date the breach was reported by the covered entity (CE) |
| Type of Breach | How unauthorized access to the PHI was obtained |
| Additional Type | A second ‘Type of Breach’ column to separate merged data |
| Location of Breached Information | Where the PHI was when unauthorized access was obtained |
| Additional Location | A second ‘Location of Breached Information’ column to separate merged data |
| Business Associate Present | Whether a business associate, such as a consultant or contractor, was involved in the breach |
| Web description | An optional statement explaining what happened and the resolution |
| Investigation Complete | Whether or not the investigation of this breach is complete |

About the Data

Observations: 2452

Missing values: reported as NA or as /N
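Because missing values appear under two encodings, it can help to normalize them before analysis. The sketch below assumes the placeholder is the literal string "/N" exactly as written above, and uses invented rows with assumed column names.

```r
library(dplyr)

# Illustrative rows mixing real NA values with the "/N" placeholder.
breaches <- tibble(
  state           = c("TX", "/N", NA),
  web_description = c("/N", "Laptop stolen from office.", "/N")
)

# Convert the placeholder to NA in every character column.
breaches_na <- breaches %>%
  mutate(across(where(is.character), ~ na_if(.x, "/N")))

# Count missing values per column after normalization.
missing_counts <- colSums(is.na(breaches_na))
```

After this step, functions such as `is.na()` see both kinds of missing value the same way.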

Analysis

Number of Reported Breaches

This chart shows the total number of breaches reported for each year. The outliers on the upper end of the scale have been removed.
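A yearly count like the one charted could be computed as in this sketch. It assumes the year is taken from the breach submission date, uses an assumed column name and invented dates, and leaves out the outlier removal because the report does not say how it was done.

```r
library(dplyr)
library(lubridate)

# Illustrative submission dates only.
breaches <- tibble(
  breach_submission_date = as.Date(c("2014-02-10", "2014-09-30", "2015-06-01"))
)

# One row per year with the number of breaches submitted in that year.
breaches_per_year <- breaches %>%
  count(year = year(breach_submission_date), name = "n_breaches")
```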

Avg Healthcare Breach Size

This chart shows the average size of a breach in each year. The outliers on the upper end of the scale have been removed because they skewed the averages enough for the chart to misrepresent the data.
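The report does not state which rule was used to identify outliers. As one possible approach, the sketch below drops values above the common upper fence of Q3 + 1.5 × IQR before averaging; that rule, the column names, and the sample values are all assumptions.

```r
library(dplyr)

# Illustrative values; one very large breach dominates 2014.
breaches <- tibble(
  year = c(2014, 2014, 2014, 2014, 2015, 2015, 2015),
  individuals_affected = c(600, 900, 1500, 4000000, 700, 1100, 2500)
)

# Assumed outlier rule: anything above Q3 + 1.5 * IQR is excluded.
upper_fence <- unname(
  quantile(breaches$individuals_affected, 0.75) +
    1.5 * IQR(breaches$individuals_affected)
)

avg_size <- breaches %>%
  filter(individuals_affected <= upper_fence) %>%
  group_by(year) %>%
  summarise(avg_individuals_affected = mean(individuals_affected))
```

Without the filter, the single 4,000,000-record breach would pull the 2014 average far above every other breach in that year.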

Largest Healthcare Data Breaches

This table shows the largest known breaches (those that affected the most individuals) for which data has been collected. Each of these breaches affected over 700,000 people.
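A table like this could be produced by filtering on the 700,000 threshold and sorting from largest to smallest. The sketch uses assumed column names and invented entities.

```r
library(dplyr)

# Illustrative entities and sizes only.
breaches <- tibble(
  name_of_covered_entity = c("Entity A", "Entity B", "Entity C", "Entity D"),
  individuals_affected   = c(950000, 4200, 3600000, 701000)
)

# Keep breaches above the threshold, largest first.
largest <- breaches %>%
  filter(individuals_affected > 700000) %>%
  arrange(desc(individuals_affected))
```

The result could then be passed to `DT::datatable()` to render it as an interactive table.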

Hacking/IT Incidents by Year

This visualization shows the number of breaches categorized as Hacking/IT Incidents in each year.
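Since a breach can carry a second type in the Additional Type column, a count of hacking incidents may need to check both columns. The sketch below assumes the category label is exactly "Hacking/IT Incident" and uses assumed column names and invented rows.

```r
library(dplyr)
library(lubridate)

# Illustrative rows; the third breach lists hacking as its additional type.
breaches <- tibble(
  breach_submission_date = as.Date(c("2014-03-01", "2015-05-12",
                                     "2015-08-20", "2015-11-03")),
  type_of_breach  = c("Theft", "Hacking/IT Incident", "Loss",
                      "Unauthorized Access/Disclosure"),
  additional_type = c(NA, NA, "Hacking/IT Incident", NA)
)

hacking_label <- "Hacking/IT Incident"  # assumed category label

# %in% treats NA as "no match", so rows without an additional type are safe.
hacking_per_year <- breaches %>%
  filter(type_of_breach %in% hacking_label |
           additional_type %in% hacking_label) %>%
  count(year = year(breach_submission_date), name = "n_hacking")
```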

Breaches by Entity Type

This visualization shows how breaches are distributed across the covered entity types. By far the most breaches occurred at entities classified as Healthcare Providers.
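A bar chart of this kind could be built as in the sketch below. The column name, the entity type labels, and the sample rows are assumptions for illustration, and the styling is not meant to match the original figure.

```r
library(dplyr)
library(ggplot2)

# Illustrative rows only.
breaches <- tibble(
  covered_entity_type = c("Healthcare Provider", "Healthcare Provider",
                          "Health Plan", "Business Associate",
                          "Healthcare Provider")
)

# Count breaches per entity type, most frequent first.
by_entity <- breaches %>%
  count(covered_entity_type, name = "n_breaches", sort = TRUE)

# Bars ordered from the most to the fewest breaches.
entity_plot <- ggplot(
  by_entity,
  aes(x = reorder(covered_entity_type, -n_breaches), y = n_breaches)
) +
  geom_col() +
  labs(x = "Covered entity type", y = "Number of breaches")
```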

Breaches by Day of Week

This graph shows the number of breaches by the day of the week on which they happened. Considerably more breaches occurred on Fridays than on any other day of the week.
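The day-of-week counts could be derived as sketched below. This assumes the weekday is taken from the breach submission date, which is the only date variable listed for this data, so it may reflect when a breach was reported and not when it took place. The column name and sample dates are invented.

```r
library(dplyr)
library(lubridate)

# Illustrative dates: 2016-01-01 and 2016-01-08 are Fridays, 2016-01-04 a Monday.
breaches <- tibble(
  breach_submission_date = as.Date(c("2016-01-01", "2016-01-08", "2016-01-04"))
)

# wday(label = TRUE) returns an ordered factor of weekday names.
by_weekday <- breaches %>%
  count(weekday = wday(breach_submission_date, label = TRUE, abbr = FALSE),
        name = "n_breaches")
```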

Other Insights

Here is some other information that I was able to gather from the data provided.

This section includes two graphs. The first shows the average breach size based on whether or not a business associate was present. The second shows the same comparison, but only for breaches that affected over 5,000 individuals. This information could be used to see whether the presence of a business associate affects the size of a breach. The first group is shown in the table above.
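The two comparisons could be computed as in this sketch, which assumes snake_case column names and "Yes"/"No" values for the business associate flag. The sample rows are invented.

```r
library(dplyr)

# Illustrative rows only.
breaches <- tibble(
  business_associate_present = c("Yes", "No", "Yes", "No", "No"),
  individuals_affected       = c(12000, 800, 3000, 6500, 1500)
)

# Average breach size for all breaches, by business associate presence.
avg_all <- breaches %>%
  group_by(business_associate_present) %>%
  summarise(avg_individuals_affected = mean(individuals_affected))

# The same comparison restricted to breaches over 5,000 individuals.
avg_large <- breaches %>%
  filter(individuals_affected > 5000) %>%
  group_by(business_associate_present) %>%
  summarise(avg_individuals_affected = mean(individuals_affected))
```

Each summary has one row per group, so either could feed directly into a bar chart.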