Health Care Data Breach

Sections

  • Introduction
  • Required Packages
  • Data Preparation
  • Data Analysis
  • Exploratory Data Analysis

1. Introduction

This document contains compiled healthcare data breach statistics published by the Department of Health and Human Services’ Office for Civil Rights (OCR). The statistics below only include data breaches of 500 or more records, as smaller breaches are not published by OCR.

The breaches include closed cases and breaches still being investigated by OCR. The purpose of this document is to investigate the data and present it in a visual format that includes summary statistics, interactive tables, charts and some simple analytics.

The data used comprises the following fields:

  • Name of the covered entity (organization responsible for the PHI)
  • State (US state where the breach was reported)
  • Covered Entity Type (type of organization responsible for the PHI)
  • Individuals Affected (number of records affected by the breach)
  • Breach submission date (date the breach was reported by the CE)
  • Type of breach (how unauthorized access to the PHI was obtained)
  • Location of breached information (where the PHI was when unauthorized access was obtained)
  • Business associate present (whether a business associate, such as a consultant or contractor, was involved in the breach)
  • Web description (an optional statement explaining what happened and the resolution)

The approach starts by cleaning the data: handling empty values, removing duplicates, and splitting fields that hold multiple entries into single recognizable entries. The split is done by creating extra columns named after the defined categories, coded as 1 when the category is present and 0 when it is not.

Functions from libraries like tidyverse will be used to mutate, summarize, filter and investigate the data. Visualizations like scatterplots and density distributions will be created to help highlight hidden aspects of the discrete data.

Consumers will have access to visual descriptions of which data-sharing channels are more vulnerable, which areas and years have the most breaches, and the trend in the number of breaches. Furthermore, the visualizations will help raise awareness of the topic, helping individuals understand the issue and engage in addressing it through political and social channels.

2. Required Packages

The packages required for this markdown are:

3. Data Preparation

Combining multiple sources

The healthcare breach information is distributed across two sources that have the same structure. The first file corresponds to data breaches already investigated: http://asayanalytics.com/breach_archive_csv. The second file contains the observations for data breaches under investigation: https://asayanalytics.com/breach_investigation_csv. For the purpose of this analysis, the two sources are combined into a single dataset with an added identifier called “Investigated”, which flags the observations from the first file with 1 and those from the second file with 0.
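The combination step can be illustrated with a minimal pandas sketch. It is not the report's own code (the report relies on tidyverse). The function name `combine_sources` and the rows in the two small frames are made up for illustration; real data would be read from the two URLs above.

```python
import pandas as pd

def combine_sources(archive: pd.DataFrame, under_investigation: pd.DataFrame) -> pd.DataFrame:
    """Stack two tables with the same structure and flag each row's origin."""
    # Investigated = 1 for the archive (already investigated), 0 otherwise
    archive = archive.assign(Investigated=1)
    under_investigation = under_investigation.assign(Investigated=0)
    # ignore_index=True gives the combined table a fresh 0..n-1 index
    return pd.concat([archive, under_investigation], ignore_index=True)

# Made-up stand-in rows, only to show the shape of the result
archive = pd.DataFrame({"State": ["OH", "TX"], "Individuals Affected": [1200, 800]})
open_cases = pd.DataFrame({"State": ["CA"], "Individuals Affected": [5000]})

breaches = combine_sources(archive, open_cases)
print(breaches)
```

Because `assign` returns a new frame, the two input tables are left unchanged.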

Columns of interest

The column “Web Description” from the original data sources is excluded for the purposes of this exercise. The analysis runs on a copy of the original dataset that omits the “Web Description” column; the column remains in the original dataset and could be used later for sentiment or context analysis.

 [1] "Name of Covered Entity"           "State"                           
 [3] "Covered Entity Type"              "Individuals Affected"            
 [5] "Breach Submission Date"           "Type of Breach"                  
 [7] "Location of Breached Information" "Business Associate Present"      
 [9] "Web Description"                  "Investigated"                    

Formatting columns

The column “Breach Submission Date” is converted to Date format, which makes it possible to filter by date.

To speed up coding, a new column called “year” is added that contains only the year from “Breach Submission Date”. It allows grouping and filtering by year of report.
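As an illustration of the two formatting steps, here is a small pandas sketch (not the report's tidyverse code). The month/day/year input format is an assumption about how the raw files store dates, and the sample values are made up.

```python
import pandas as pd

# Made-up sample values; the third one is deliberately unparseable
breaches = pd.DataFrame(
    {"Breach Submission Date": ["10/21/2009", "03/05/2018", "not a date"]}
)

# Assumed input format: MM/DD/YYYY. errors="coerce" turns bad values into NaT
breaches["Breach Submission Date"] = pd.to_datetime(
    breaches["Breach Submission Date"], format="%m/%d/%Y", errors="coerce"
)

# Nullable integer type keeps missing years as <NA> instead of forcing floats
breaches["year"] = breaches["Breach Submission Date"].dt.year.astype("Int64")

# With a real date column, date filters become straightforward
since_2015 = breaches[breaches["Breach Submission Date"] >= "2015-01-01"]
print(breaches)
```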

Identifying and removing duplicates

Duplicates were identified among records whose entity names are similar but not a 100% match and whose reported breach dates differ, while the state, type of entity, number of individuals affected, breach type and location of the breach are all equal. In this analysis, a column “Duplicated” was added to the dataset, with a value of 1 for duplicated observations, so that analyses can be run on the non-duplicated observations only (Duplicated = 0).
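The matching rule described above can be sketched in pandas as follows. This is an illustration, not the report's code: the helper name `flag_duplicates` and the sample rows are made up, and keeping the first occurrence as the non-duplicate is an assumption, since the report does not say which record of each group is retained.

```python
import pandas as pd

# Columns that must all be equal for two rows to count as the same breach
MATCH_COLS = [
    "State",
    "Covered Entity Type",
    "Individuals Affected",
    "Type of Breach",
    "Location of Breached Information",
]

def flag_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Assumption: the first row of each matching group stays at 0,
    # later rows in the group are flagged with 1
    out["Duplicated"] = out.duplicated(subset=MATCH_COLS, keep="first").astype(int)
    return out

# Made-up rows: the first two differ only in the entity name
sample = pd.DataFrame({
    "Name of Covered Entity": ["Acme Health", "Acme Health Inc.", "Other Clinic"],
    "State": ["OH", "OH", "TX"],
    "Covered Entity Type": ["Healthcare Provider"] * 3,
    "Individuals Affected": [1200, 1200, 800],
    "Type of Breach": ["Theft", "Theft", "Loss"],
    "Location of Breached Information": ["Laptop", "Laptop", "Paper/Films"],
})

flagged = flag_duplicates(sample)
unique_rows = flagged[flagged["Duplicated"] == 0]
```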

Columns with problematic content

The column “Type of Breach” lists every breach type that applies to a row in a single cell, separated by commas, which makes it impossible to analyze breach types individually. The category has 7 possible values (Hacking/IT Incident, Improper Disposal, Loss, Theft, Unauthorized Access/Disclosure, Unknown and Other). Therefore, in this exercise 7 new columns are created, one per category, with a value of 1 if the original column contains that category and 0 otherwise. For example, the new column TB_Hacking_IT = 1 when Type of Breach contains “Hacking/IT Incident”.

The same applies to the column “Location of Breached Information”, which lists every location that applies to a row in a single cell, separated by commas. The column has 8 possible values (Desktop Computer, Electronic Medical Record, Email, Laptop, Network Server, Other Portable Electronic Device, Paper/Films and Other). The same strategy is used: 8 new columns are created, one per category, with a value of 1 if the original column contains that value and 0 otherwise. For example, the new column LOC_DesktopComputer = 1 when Location of Breached Information contains “Desktop Computer”.
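A minimal pandas sketch of the indicator-column strategy follows. It is an illustration rather than the report's code. The helper `add_indicator_columns` is hypothetical; TB_Hacking_IT comes from the text above, while TB_Theft is an assumed name following the same pattern. The sketch splits each cell on commas before matching, so that a short label such as “Other” is not mistaken for “Other Portable Electronic Device”.

```python
import pandas as pd

def add_indicator_columns(df: pd.DataFrame, source_col: str, categories: dict) -> pd.DataFrame:
    """categories maps a new column name to the category label it represents."""
    # Turn each cell into a set of trimmed, lower-cased labels
    tokens = (
        df[source_col]
        .fillna("")
        .str.split(",")
        .apply(lambda parts: {p.strip().lower() for p in parts})
    )
    out = df.copy()
    for new_col, label in categories.items():
        # 1 if the label is one of the cell's entries, 0 otherwise
        out[new_col] = tokens.apply(lambda labels, key=label.lower(): int(key in labels))
    return out

# Made-up sample cells
sample = pd.DataFrame({"Type of Breach": ["Hacking/IT Incident, Theft", "Other", None]})

flagged = add_indicator_columns(
    sample,
    "Type of Breach",
    {"TB_Hacking_IT": "Hacking/IT Incident", "TB_Theft": "Theft"},
)
```

The same helper can be applied to “Location of Breached Information” with a mapping of LOC_ column names.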

Outliers

To simplify the analysis, outliers in the column “Individuals Affected” are flagged in a new column called outlier with a value of 1 or 0. Outliers need to be excluded in some of the analyses ahead.
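The report does not state which rule defines an outlier. The pandas sketch below assumes a 95th-percentile cutoff, suggested by the “top 5% of outliers” wording in section 4; an IQR-based rule would be an equally plausible choice. The sample values are synthetic.

```python
import pandas as pd

# Synthetic values 1..100, only to make the cutoff easy to follow
breaches = pd.DataFrame({"Individuals Affected": range(1, 101)})

# Assumption: an outlier is any value above the 95th percentile
cutoff = breaches["Individuals Affected"].quantile(0.95)
breaches["outlier"] = (breaches["Individuals Affected"] > cutoff).astype(int)

# Analyses that exclude outliers filter on the flag
without_outliers = breaches[breaches["outlier"] == 0]
```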

4. Data Analysis

4.1.

Following the design of the example at https://www.hipaajournal.com/healthcare-data-breach-statistics/ your boss would like you to recreate the following graphs from the page, but with a few modifications as outlined below. You should not modify your original data object but instead make subsets, filters, summaries, etc. as required and pipe the result directly into the following requests.

Number of Reported Breaches

“Number of Reported Breaches” (with the top 5% of outliers omitted)

Data Breach by Year

“Average Healthcare Data Breach Size by Year” without outliers
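The yearly summaries behind charts like these can be sketched in pandas as a single chained expression that leaves the original data object untouched. This is an illustration with made-up rows, and the output column names `n_breaches` and `avg_size` are hypothetical.

```python
import pandas as pd

# Made-up rows; "outlier" is the 0/1 flag from the data preparation step
breaches = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019],
    "Individuals Affected": [1000, 3000, 900000, 500, 1500],
    "outlier": [0, 0, 1, 0, 0],
})

by_year = (
    breaches.loc[breaches["outlier"] == 0]      # drop flagged outliers
    .groupby("year")
    .agg(
        n_breaches=("Individuals Affected", "size"),  # reported breaches per year
        avg_size=("Individuals Affected", "mean"),    # average breach size per year
    )
    .reset_index()
)
print(by_year)
```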

Largest Data Breaches

Hacking / IT Incidents by year

Breaches by Entity Type

4.2

On what day of the week

Breach Type by Year

How has the type of breach changed for each year
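One way to tabulate how breach types change by year is to sum the 0/1 indicator columns within each year. The pandas sketch below uses made-up rows; TB_Hacking_IT is the column named in the data preparation section, while TB_Theft is an assumed name following the same pattern.

```python
import pandas as pd

# Made-up rows with two of the seven indicator columns
breaches = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019, 2019],
    "TB_Hacking_IT": [0, 1, 1, 1, 0],
    "TB_Theft": [1, 1, 0, 0, 1],
})

type_cols = [c for c in breaches.columns if c.startswith("TB_")]

# Count of breaches involving each type, per year
type_by_year = breaches.groupby("year")[type_cols].sum()

# Share of each year's breaches involving each type; a breach can have
# several types, so the shares in a row need not add up to 1
share_by_year = type_by_year.div(breaches.groupby("year").size(), axis=0)
```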

5. Exploratory Data Analysis

5.1. Location of Breach by Year

5.2. Breach History by Covered Entity Type

Plot of change of covered entity types over the years

5.3 Analysis by Region

Northeast

Number of Breaches in Northeast States

Southeast

Number of Breaches in Southeast States

Midwest

Number of Breaches in Midwest States

Southwest

Number of Breaches in Southwest States

West

Number of Breaches in West States

Top 5 States with the Most Breaches

Number of Breaches in Top 5 States

5.4. By Business Associate

Reinaldo Quintero

2019-10-01