Overview

The Office for Civil Rights in the US Department of Health and Human Services collects and reports disclosures (“breaches”) of protected health information. The law requires the public release of information about breaches that affected 500 or more individuals.

From the data available on breaches that have occurred in the last ten years, certain trends and insights can be found. The purpose of this document is to explain the processes used to clean, standardize, and manipulate the data, and then to analyze some of these trends through visualizations and summary statistics.


Tools Used

Packages

This analysis was developed in R and used the following packages:

Package     Description
tidyverse   A collection of packages for data manipulation, exploration, and visualization
DT          For rendering HTML data tables
sqldf       For running SQL statements in R
stringr     For manipulation of strings and characters
lubridate   For date-time manipulation
shiny       For formatting of data tables

Data

All data for this analysis was publicly reported by the Office for Civil Rights. Two datasets were used: one for investigations that have been completed, available at http://asayanalytics.com/breach_archive_csv, and one for investigations that are currently ongoing, available at https://asayanalytics.com/breach_investigation_csv. For the purposes of this analysis, the two datasets were combined, with each record marked with its investigation’s status.
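The combination step can be sketched as follows. This is a minimal illustration in base R, not the code used for the analysis. The toy data frames and the column names `name`, `individuals_affected`, and `status` are assumptions standing in for the two downloaded files, which would normally be loaded with `read.csv()`.

```r
# Toy stand-ins for the two source files (hypothetical columns and values).
archive <- data.frame(
  name = c("Clinic A", "Hospital B"),
  individuals_affected = c(1200, 560),
  stringsAsFactors = FALSE
)
ongoing <- data.frame(
  name = "Insurer C",
  individuals_affected = 45000,
  stringsAsFactors = FALSE
)

# Note each investigation's status before stacking the two tables.
archive$status <- "Completed"
ongoing$status <- "Current"

# rbind() requires both data frames to share the same column names.
breaches <- rbind(archive, ongoing)
print(table(breaches$status))
```

Adding the status column before binding keeps the origin of every row recoverable after the merge.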

The original datasets included a total of 2455 records of breaches. Records that did not include the number of individuals affected, the state in which the breach occurred, the type of breach, or the name of the breached organization were removed, as that information is critical for this analysis. This removed 8 incomplete records.
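One way to express this filter is shown below. It is a sketch in base R on made-up rows. The column names and the rule that empty strings count as missing are assumptions, not details confirmed by the original analysis.

```r
# Toy data with hypothetical column names; three of four rows are incomplete.
breaches <- data.frame(
  name = c("Clinic A", "Hospital B", NA, "Insurer C"),
  state = c("OH", "", "TX", "CA"),
  breach_type = c("Theft", "Loss", "Theft", "Other"),
  individuals_affected = c(1200, 560, 800, NA),
  stringsAsFactors = FALSE
)

required <- c("name", "state", "breach_type", "individuals_affected")

# Assumption: a field is missing if it is NA or an empty/whitespace-only string.
is_missing <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Build a logical matrix (rows x required fields) and keep fully populated rows.
missing_matrix <- sapply(breaches[required], is_missing)
complete <- breaches[rowSums(missing_matrix) == 0, , drop = FALSE]

cat("Removed", nrow(breaches) - nrow(complete), "incomplete records\n")
```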

Obvious duplicates of breach records were also removed, where the organization name and the number of individuals affected were similar between records. This removed 30 duplicate records.
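The text does not define "similar", so the sketch below substitutes a stricter, simpler rule as an assumption: two records count as duplicates when their names match after normalization (case, punctuation, and whitespace) and their counts of individuals affected are equal. The column names and data are hypothetical.

```r
# Toy data: row 2 repeats row 1 with different casing and trailing space.
breaches <- data.frame(
  name = c("Clinic A", "clinic a ", "Hospital B", "Clinic A"),
  individuals_affected = c(1200, 1200, 560, 3000),
  stringsAsFactors = FALSE
)

# Normalize names: lowercase, drop punctuation, collapse runs of whitespace.
name_key <- tolower(breaches$name)
name_key <- gsub("[[:punct:]]", "", name_key)
name_key <- trimws(gsub("[[:space:]]+", " ", name_key))

# Flag rows whose (normalized name, count) pair has already been seen.
keys <- data.frame(name_key = name_key, n = breaches$individuals_affected)
is_dup <- duplicated(keys)
deduped <- breaches[!is_dup, , drop = FALSE]

cat("Removed", sum(is_dup), "duplicate records\n")
```

The fourth row is kept because it reports a different number of individuals, which suggests a separate incident at the same organization.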

After this data cleaning, the final dataset used for this analysis included 2417 records.

Below is an interactive table of the full, prepared dataset that can be filtered and searched as desired.


[Interactive table: Protected Health Information Breaches]



Summary of Variables

The following variables were used for the analysis and visualizations of the data. This summary gives the total number of records for each variable (or the number of distinct values where appropriate), and the minimum, maximum, and mean where appropriate.

Variable                              Total (unique)  Min (least)                Max (most)           Average
Name of Covered Entity                2182            -                          -                    -
State^                                52              DE                         CA                   -
Covered Entity Type                   4               Healthcare Clearing House  Healthcare Provider  -
Individuals Affected                  183884733       500                        78800000             76079.74
Breach Submission Date                -               2009-10-21                 2018-09-28           2014-11-13
Status                                2               Current                    Completed            -
Type: Hacking Incident                534
Type: Improper Disposal               81
Type: Loss                            182
Type: Theft                           884
Type: Unauthorized Access/Disclosure  0
Type: Unknown                         16
Type: Other                           96
Location: Desktop Computer            278
Location: Electronic Medical Record   173
Location: Email                       356
Location: Laptop                      420
Location: Network Server              485
Location: Portable Electronic Device  240
Location: Paper/Films                 578
Location: Other                       297

^Includes 50 states plus Puerto Rico and “Unknown”
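Summary values of the kind shown in the table can be computed as in the sketch below. It uses base R on made-up data. The column names are assumptions, and "least" and "most" are read here as the categories with the fewest and the most records.

```r
# Toy data with hypothetical column names.
breaches <- data.frame(
  name = c("Clinic A", "Hospital B", "Insurer C", "Clinic A", "Plan D"),
  state = c("OH", "CA", "CA", "OH", "CA"),
  individuals_affected = c(1200, 560, 45000, 3000, 740),
  submitted = as.Date(c("2010-03-01", "2012-07-15", "2015-02-04",
                        "2017-11-30", "2018-01-09")),
  stringsAsFactors = FALSE
)

# Number of distinct covered entities.
n_entities <- length(unique(breaches$name))

# States with the fewest and the most records.
state_counts <- table(breaches$state)
least_state <- names(state_counts)[which.min(state_counts)]
most_state <- names(state_counts)[which.max(state_counts)]

# Total, extremes, and mean of individuals affected.
total_affected <- sum(breaches$individuals_affected)
range_affected <- range(breaches$individuals_affected)
mean_affected <- mean(breaches$individuals_affected)

# Dates support range() and mean() directly; mean() returns a Date.
date_range <- range(breaches$submitted)
mean_date <- mean(breaches$submitted)
```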


Analysis

Healthcare Data Breaches by Year
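A yearly count like the one this chart presents could be derived as sketched below. The sketch assumes that breaches are grouped by the year of their submission date, and the dates are made up.

```r
# Toy submission dates (hypothetical).
submitted <- as.Date(c("2010-03-01", "2010-09-12", "2012-07-15",
                       "2015-02-04", "2015-06-20", "2015-11-30"))

# Extract the calendar year and count records per year.
year <- as.integer(format(submitted, "%Y"))
per_year <- table(year)
print(per_year)
```

The resulting table can be passed to `barplot()` to draw the counts.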


Average Healthcare Data Breach Size by Year
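The average breach size per year could be computed as in the sketch below, again on made-up data with hypothetical column names. The median is included only as an illustration, since a single very large breach can dominate a yearly mean.

```r
# Toy data; the 2015 group contains one much larger breach.
breaches <- data.frame(
  submitted = as.Date(c("2010-03-01", "2010-09-12", "2015-02-04",
                        "2015-06-20", "2015-11-30")),
  individuals_affected = c(1000, 3000, 500, 700, 90000)
)

breaches$year <- as.integer(format(breaches$submitted, "%Y"))

# Mean and median number of individuals affected within each year.
avg_by_year <- tapply(breaches$individuals_affected, breaches$year, mean)
median_by_year <- tapply(breaches$individuals_affected, breaches$year, median)
print(avg_by_year)
```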


Top 10 Largest Data Breaches
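Selecting the largest breaches amounts to sorting by the number of individuals affected and keeping the first rows. The helper `top_breaches()` and the column names below are hypothetical and are shown on toy data.

```r
# Toy data with hypothetical column names.
breaches <- data.frame(
  name = c("Clinic A", "Hospital B", "Insurer C", "Plan D"),
  individuals_affected = c(1200, 560, 45000, 740),
  stringsAsFactors = FALSE
)

# Return the n rows with the highest number of individuals affected.
top_breaches <- function(df, n = 10) {
  ordered <- df[order(df$individuals_affected, decreasing = TRUE), , drop = FALSE]
  head(ordered, n)
}

top3 <- top_breaches(breaches, 3)
print(top3)
```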


Hacking/IT Incidents by Year


Covered Entity Types by Year


Incident Types by Year


Average Size of Incidents by State