The Office for Civil Rights in the US Department of Health and Human Services collects and reports disclosures (“breaches”) of protected health information. The law requires the public release of information for breaches that have affected 500 or more individuals.
From the data available on breaches that have occurred in the last ten years, certain trends and insights can be found. The purpose of this document is to explain the processes used to clean, standardize, and manipulate the data, followed by an analysis of some of these trends through visualizations and summary statistics.
This analysis was developed in R and used the following packages:
| Package | Description |
|---|---|
| tidyverse | A collection of packages for data manipulation, exploration, and visualization |
| DT | For rendering HTML data tables |
| sqldf | For running SQL statements in R |
| stringr | For manipulation of strings and characters |
| lubridate | For date-time manipulation |
| shiny | For formatting of data tables |
All data for this analysis were publicly reported by the Office for Civil Rights. Two datasets were used, one for investigations that have been completed, available at http://asayanalytics.com/breach_archive_csv, and one for investigations that are currently ongoing, available at https://asayanalytics.com/breach_investigation_csv. For the purposes of this analysis, these two datasets were combined, with each record marked with its investigation’s status.
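The combination step can be sketched as follows. This is a minimal illustration, not the original analysis code: the two small data frames stand in for the downloaded CSV files, and the column names (`entity`, `state`, `individuals_affected`, `status`) are assumptions made for the example.

```r
library(dplyr)

# Toy stand-ins for the two source files (column names are illustrative).
archive <- data.frame(
  entity = c("Clinic A", "Hospital B"),
  state = c("CA", "TX"),
  individuals_affected = c(1200, 560),
  stringsAsFactors = FALSE
)
investigation <- data.frame(
  entity = "Insurer C",
  state = "NY",
  individuals_affected = 25000,
  stringsAsFactors = FALSE
)

# Mark each source with its investigation status, then stack the rows.
breaches <- bind_rows(
  archive %>% mutate(status = "Completed"),
  investigation %>% mutate(status = "Current")
)
```

In practice the two data frames would come from reading the downloaded files (for example with `read.csv()`), after which the same `mutate()` and `bind_rows()` calls apply.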
The original datasets included a total of 2455 records of breaches. Records were removed if they were missing the number of individuals affected, the state in which the breach occurred, the type of breach, or the name of the breached organization, as that information is critical for this analysis. This removed 8 incomplete records.
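One way to express this completeness filter is shown below. It is a sketch under assumptions: the column names and the helper `is_blank()` are hypothetical, and a field is treated as missing when it is `NA` or an empty string, which may differ from how the original data encoded missing values.

```r
library(dplyr)

# Toy data with gaps in each of the four required fields.
raw <- data.frame(
  entity = c("Clinic A", NA, "Hospital B", "Insurer C"),
  state = c("CA", "TX", "", "NY"),
  breach_type = c("Theft", "Loss", "Theft", NA),
  individuals_affected = c(1200, 560, 900, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a value is NA or an empty/whitespace string.
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Keep only records where all four required fields are present.
complete <- raw %>%
  filter(
    !is_blank(entity),
    !is_blank(state),
    !is_blank(breach_type),
    !is_blank(individuals_affected)
  )

removed_incomplete <- nrow(raw) - nrow(complete)
```

Here only the first toy record survives, so `removed_incomplete` is 3; in the analysis described above the corresponding count was 8.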
Also removed were obvious duplicates of breach records, where the organization name and the number of individuals affected were similar between records. This required a removal of 30 duplicate records.
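The text does not spell out how "similar" was judged, so the sketch below swaps in a simple, stated rule: two records count as duplicates when their organization names match after lowercasing and stripping non-alphanumeric characters, and their individuals-affected counts are equal. The column names and the temporary `entity_key` column are assumptions for illustration.

```r
library(dplyr)

records <- data.frame(
  entity = c("Clinic A", "CLINIC A ", "Clinic A", "Hospital B"),
  individuals_affected = c(1200, 1200, 3000, 560),
  stringsAsFactors = FALSE
)

deduped <- records %>%
  # Normalize the name so case and spacing differences do not hide duplicates.
  mutate(entity_key = gsub("[^a-z0-9]", "", tolower(entity))) %>%
  # Keep the first record for each (normalized name, count) pair.
  distinct(entity_key, individuals_affected, .keep_all = TRUE) %>%
  select(-entity_key)

removed_duplicates <- nrow(records) - nrow(deduped)
```

The second toy record is dropped as a duplicate of the first, while the third is kept because the same organization reported a different number of individuals. A fuzzier rule (for example, tolerating small differences in the counts) would need manual review of the flagged pairs.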
After this data cleaning, the final dataset used for this analysis included 2417 records.
Below is an interactive table of the full, prepared dataset that can be filtered and searched as desired.
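A filterable, searchable table of this kind can be produced with the DT package listed above. The snippet is a minimal sketch with toy data and illustrative column names; the options chosen (`filter = "top"`, ten rows per page) are assumptions, not necessarily the settings used for the original table.

```r
library(DT)

breaches <- data.frame(
  entity = c("Clinic A", "Hospital B", "Insurer C"),
  state = c("CA", "TX", "NY"),
  individuals_affected = c(1200, 560, 25000),
  status = c("Completed", "Completed", "Current"),
  stringsAsFactors = FALSE
)

# Build an HTML widget with a per-column filter row and a global search box.
breach_table <- datatable(
  breaches,
  filter = "top",
  rownames = FALSE,
  options = list(pageLength = 10)
)

# Evaluating `breach_table` in an R Markdown chunk renders the widget.
```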
The following variables were used for the analysis and visualizations of the data. This summary gives the total number of records for each variable (or the number of distinct values where appropriate), along with the minimum, maximum, and mean where appropriate.
| Variable | Total (unique) | Min (least) | Max (most) | Average |
|---|---|---|---|---|
| Name of Covered Entity | 2182 | - | - | - |
| State^ | 52 | DE | CA | - |
| Covered Entity Type | 4 | Healthcare Clearing House | Healthcare Provider | - |
| Individuals Affected | 183884733 | 500 | 78800000 | 76079.74 |
| Breach Submission Date | - | 2009-10-21 | 2018-09-28 | 2014-11-13 |
| Status | 2 | Current | Completed | - |
| Type: Hacking Incident | 534 | - | - | - |
| Type: Improper Disposal | 81 | - | - | - |
| Type: Loss | 182 | - | - | - |
| Type: Theft | 884 | - | - | - |
| Type: Unauthorized Access/Disclosure | 0 | - | - | - |
| Type: Unknown | 16 | - | - | - |
| Type: Other | 96 | - | - | - |
| Location: Desktop Computer | 278 | - | - | - |
| Location: Electronic Medical Record | 173 | - | - | - |
| Location: Email | 356 | - | - | - |
| Location: Laptop | 420 | - | - | - |
| Location: Network Server | 485 | - | - | - |
| Location: Portable Electronic Device | 240 | - | - | - |
| Location: Paper/Films | 578 | - | - | - |
| Location: Other | 297 | - | - | - |
^Includes 50 states plus Puerto Rico and “Unknown”
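Summary figures like those in the table can be computed along the following lines. This is a sketch on toy data with assumed column names (`entity`, `state`, `individuals_affected`, `submitted`), not the code that produced the table above.

```r
library(dplyr)

breaches <- data.frame(
  entity = c("Clinic A", "Clinic A", "Hospital B", "Insurer C"),
  state = c("CA", "CA", "TX", "NY"),
  individuals_affected = c(500, 1500, 2000, 4000),
  submitted = as.Date(c("2010-01-01", "2012-01-01", "2014-01-01", "2016-01-01")),
  stringsAsFactors = FALSE
)

# Distinct counts, totals, and min/max/mean for the numeric and date columns.
overview <- breaches %>%
  summarise(
    unique_entities = n_distinct(entity),
    unique_states = n_distinct(state),
    total_individuals = sum(individuals_affected),
    min_individuals = min(individuals_affected),
    max_individuals = max(individuals_affected),
    mean_individuals = mean(individuals_affected),
    first_submission = min(submitted),
    last_submission = max(submitted),
    mean_submission = mean(submitted)
  )

# Records per state, most frequent first (the "least" and "most" columns).
state_counts <- breaches %>% count(state, sort = TRUE)
```

As a consistency check on the table itself, the reported average of 76079.74 individuals matches the total of 183884733 divided by the 2417 records in the final dataset.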