The purpose of this document is to make the data easier to interact with and understand. Below, you will find summary statistics, interactive tables, graphs, and other forms of basic analysis of the information.
The data included in this report is from the US Department of Health and Human Services’ Office for Civil Rights. One of its roles is to collect data about breaches of unsecured protected health information (PHI). All breaches affecting 500 or more individuals must be reported. For more detailed information on the data and the variables included, please see the Data section below.
This data is very informative about health care breaches. It could be used to monitor companies’ HIPAA violations and to identify summary information and commonalities in these kinds of breaches. Combined with other data, such as population or location data, it can be even more useful.
Here are the packages needed for running this assignment:
| Package | Explanation |
|---|---|
| tidyverse | Group of packages that includes readr, dplyr, ggplot2, etc |
| DT | Allows user to make tables that can be interacted with |
| lubridate | Easier to work with dates and date data types |
| sqldf | Allows user to write SQL code for querying data frames |
The original data for this analysis was loaded from two separate files. Before the files were combined into one, a new variable called investigation_complete was added to each so that the rows from each set could still be differentiated.
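The tagging-then-combining step described above can be sketched roughly as follows. This is only an illustration: the object names `closed_cases` and `open_cases`, the rows in them, and the assumption that one file holds completed investigations and the other holds open ones are all hypothetical.

```r
library(dplyr)

# Toy stand-ins for the two source files (made-up rows, not real records).
closed_cases <- tibble(
  `Name of the Covered Entity` = c("Clinic A", "Hospital B"),
  `Individuals Affected` = c(1200, 560)
)
open_cases <- tibble(
  `Name of the Covered Entity` = "Insurer C",
  `Individuals Affected` = 8000
)

# Tag each set first, then stack them, so the origin of every row survives.
breaches <- bind_rows(
  mutate(closed_cases, investigation_complete = TRUE),
  mutate(open_cases, investigation_complete = FALSE)
)
```

Adding the flag before `bind_rows()` matters because, once the rows are stacked, nothing else in the data records which file a row came from.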
The combined data was then cleaned in three steps:
1. Remove duplicates
2. Separate data in ‘Type of Breach’ and ‘Location of Breached Information’ columns
3. Classify ‘Breach Submission Date’ as a date data type
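A minimal sketch of these three cleaning steps is shown below. It assumes that merged values are separated by a comma and that dates arrive as month/day/year text; neither detail is stated in this report, and the rows are made up.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy raw data (made-up rows); the first two rows are exact duplicates.
raw <- tibble(
  `Name of the Covered Entity` = c("Clinic A", "Clinic A", "Hospital B"),
  `Breach Submission Date` = c("01/15/2016", "01/15/2016", "03/02/2017"),
  `Type of Breach` = c("Theft, Loss", "Theft, Loss", "Hacking/IT Incident"),
  `Location of Breached Information` = c("Laptop", "Laptop", "Email, Network Server")
)

cleaned <- raw %>%
  # Step 1: remove duplicate rows.
  distinct() %>%
  # Step 2: split merged values into a primary and an additional column.
  # The comma delimiter is an assumption about the raw format.
  separate(`Type of Breach`,
           into = c("Type of Breach", "Additional Type"),
           sep = ",\\s*", extra = "merge", fill = "right") %>%
  separate(`Location of Breached Information`,
           into = c("Location of Breached Information", "Additional Location"),
           sep = ",\\s*", extra = "merge", fill = "right") %>%
  # Step 3: parse the text dates (assumed month/day/year) into Date values.
  mutate(`Breach Submission Date` = mdy(`Breach Submission Date`))
```

With `fill = "right"`, rows that have only one value get `NA` in the additional column instead of triggering a warning.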
| Variable | Description |
|---|---|
| Name of the Covered Entity | Organization responsible for the PHI |
| State | US State where the breach was reported |
| Covered Entity Type | Type of organization responsible for the PHI |
| Individuals Affected | Number of records affected by the breach |
| Breach Submission Date | Date the breach was reported by the covered entity (CE) |
| Type of Breach | How unauthorized access to the PHI was obtained |
| Additional Type | A second ‘Type of Breach’ column to separate merged data |
| Location of Breached Information | Where was the PHI when unauthorized access was obtained |
| Additional Location | A second ‘Location of Breached Information’ column to separate merged data |
| Business Associate Present | Whether a business associate, such as a consultant or contractor, was involved in the breach |
| Web description | An optional statement explaining what happened and the resolution |
| Investigation Complete | Whether or not the investigation of this breach is complete |
Observations: 2452
Missing values: reported as NA or as /N
This chart shows the total number of breaches reported for each year. The outliers on the upper end of the scale have been removed.
This chart shows the average size of a breach in each year. The outliers on the upper end of the scale have been removed as they were significantly skewing the chart, which was causing it to misrepresent the data.
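The yearly counts and averages behind the two charts above could be computed along these lines. The report does not say how outliers were identified, so the 1.5 × IQR upper fence used here is an assumed rule, and the rows are made up for illustration.

```r
library(dplyr)
library(lubridate)

# Toy data (made-up rows) with one extreme value in 2016.
breaches <- tibble(
  `Breach Submission Date` = as.Date(c("2015-02-01", "2015-06-10", "2015-09-30",
                                       "2016-01-05", "2016-04-20", "2016-11-11")),
  `Individuals Affected` = c(600, 900, 1200, 700, 1500, 4000000)
)

# Assumed outlier rule: drop values above Q3 + 1.5 * IQR.
q <- quantile(breaches$`Individuals Affected`, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

yearly <- breaches %>%
  filter(`Individuals Affected` <= upper_fence) %>%
  group_by(year = year(`Breach Submission Date`)) %>%
  summarise(
    n_breaches = n(),                              # breaches per year
    mean_affected = mean(`Individuals Affected`),  # average breach size
    .groups = "drop"
  )
```

Here the single 4,000,000-record row lies above the fence and is dropped; left in, it would pull the 2016 average far above every other breach in the toy data.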
This table shows the largest known breaches (those that affected the most individuals) for which data has been collected. Each of these breaches affected over 700,000 people.
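One way such a table could be produced is sketched below, using the 700,000 threshold from the text. The rows are made up, and the call to `DT::datatable()` reflects only that DT is listed among the packages; it is an assumption about how the table was rendered.

```r
library(dplyr)

# Toy data (made-up rows).
breaches <- tibble(
  `Name of the Covered Entity` = c("Clinic A", "Insurer C", "Hospital B", "Plan D"),
  `Individuals Affected` = c(1200, 3500000, 560, 950000)
)

# Keep only the largest breaches and sort them from largest to smallest.
largest <- breaches %>%
  filter(`Individuals Affected` > 700000) %>%
  arrange(desc(`Individuals Affected`))

# In an R Markdown chunk this renders as a sortable, searchable table.
DT::datatable(largest)
```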
This visualization shows the number of breaches categorized as Hacking/IT Incidents in each year.
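A rough sketch of the counts behind this visualization follows. It assumes the category label is exactly "Hacking/IT Incident" and that a breach counts if the label appears in either ‘Type of Breach’ or ‘Additional Type’; both are assumptions, and the rows are made up.

```r
library(dplyr)
library(lubridate)

# Toy data (made-up rows).
breaches <- tibble(
  `Breach Submission Date` = as.Date(c("2014-03-01", "2015-05-12",
                                       "2015-08-30", "2015-10-02")),
  `Type of Breach` = c("Theft", "Hacking/IT Incident", "Theft", "Hacking/IT Incident"),
  `Additional Type` = c(NA, NA, "Hacking/IT Incident", NA)
)

hacking_label <- "Hacking/IT Incident"  # assumed exact spelling of the category

hacking_by_year <- breaches %>%
  # %in% treats a missing additional type as "not a match" rather than NA.
  filter(`Type of Breach` %in% hacking_label |
           `Additional Type` %in% hacking_label) %>%
  count(year = year(`Breach Submission Date`), name = "n_hacking")
```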
This visualization shows how the breaches are distributed across the different covered entity types. By far the most breaches occurred at entities classified as Healthcare Providers.
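Since sqldf is among the listed packages, the per-type counts could also be expressed as a SQL query, as in this sketch. The column name `entity_type` is a hypothetical snake_case stand-in for ‘Covered Entity Type’, and the rows are made up.

```r
library(sqldf)

# Toy data (made-up rows) with a hypothetical snake_case column name.
breaches <- data.frame(
  entity_type = c("Healthcare Provider", "Health Plan", "Healthcare Provider",
                  "Business Associate", "Healthcare Provider"),
  stringsAsFactors = FALSE
)

# Count breaches per entity type, most frequent first.
by_entity <- sqldf("
  SELECT entity_type, COUNT(*) AS n_breaches
  FROM breaches
  GROUP BY entity_type
  ORDER BY n_breaches DESC, entity_type
")
```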
This graph shows the number of breaches by the day of the week on which they occurred. Significantly more breaches occurred on Fridays than on any other day of the week.
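The weekday counts can be sketched with lubridate as follows. The weekday here is taken from ‘Breach Submission Date’, the only date variable listed in the Data section, which is an assumption about how the graph was built; the rows are made up.

```r
library(dplyr)
library(lubridate)

# Toy data (made-up rows): two Fridays, one Monday, one Wednesday.
breaches <- tibble(
  `Breach Submission Date` = as.Date(c("2016-01-08", "2016-01-15",
                                       "2016-01-11", "2016-01-13"))
)

by_weekday <- breaches %>%
  # label = TRUE returns an ordered factor, so bars plot in calendar order.
  mutate(weekday = wday(`Breach Submission Date`, label = TRUE, abbr = FALSE)) %>%
  count(weekday, name = "n_breaches")
```

If the weekday does come from the submission date, a Friday peak may reflect when organizations file their reports rather than when breaches take place.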
Here is some other information that I was able to gather from the data provided.
This generates two graphs. The first shows the average breach size based on whether or not a business associate was present. The second shows the same comparison, but only for the breaches that affected over 5,000 individuals. This information could be used to see whether the presence of a business associate is related to the size of a breach. The first group is shown in the table above.
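A minimal sketch of the two comparisons is given below. The helper `avg_by_ba()` is a hypothetical name, the "Yes"/"No" coding of ‘Business Associate Present’ is assumed, and the rows are made up.

```r
library(dplyr)
library(ggplot2)

# Toy data (made-up rows).
breaches <- tibble(
  `Business Associate Present` = c("Yes", "No", "Yes", "No", "No"),
  `Individuals Affected` = c(10000, 800, 2000, 6000, 1300)
)

# Hypothetical helper: average breach size per business-associate group.
avg_by_ba <- function(df) {
  df %>%
    group_by(`Business Associate Present`) %>%
    summarise(mean_affected = mean(`Individuals Affected`), .groups = "drop")
}

all_breaches <- avg_by_ba(breaches)
large_breaches <- avg_by_ba(filter(breaches, `Individuals Affected` > 5000))

# One bar chart per summary; print(p_all) or print(p_large) draws it.
p_all <- ggplot(all_breaches,
                aes(x = `Business Associate Present`, y = mean_affected)) +
  geom_col()
p_large <- ggplot(large_breaches,
                  aes(x = `Business Associate Present`, y = mean_affected)) +
  geom_col()
```

Reusing one helper for both summaries keeps the two graphs comparable, since only the filter differs between them.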