Week 6 - Lonie Moore - Fall 2018 Exam Practice
Introduction and Pre-Analysis
Introduction
The purpose of this document is to investigate, summarize, and provide insight into the protected health information (PHI) breach data sourced from the U.S. Department of Health and Human Services. Various data summarization techniques will be used to provide visual insight into the data, including synthesis with an outside data source (U.S. Census data, which provides the population of each state). This analysis draws out insights that are not readily available from the original source.
Definitions & Terminology
Acronyms:
| Acronym | Description |
|---|---|
| HHS | U.S. Department of Health and Human Services |
| OCR | Office for Civil Rights |
| PHI | Protected Health Information |
| CE | Covered Entity |
Definitions:
| Term | Description |
|---|---|
| Breach | A “Breach” is classified as unauthorized disclosure of protected health information affecting 500 or more individuals |
| Breach Type | Includes Hacking/IT Incidents, Theft, Improper Disposal, Unauthorized Access/Disclosure, Loss, Other, and Unknown |
| Breach Location | Includes Desktop Computer, Laptop, Paper/Films, Electronic Medical Record, Network Server, Email, Other Portable Electronic Device, and Other |
| Covered Entity | Organizations responsible for protecting health information |
| Covered Entity Type | Includes Health Plan, Healthcare Clearing House, and Healthcare Provider |
Sources
U.S. Department of Health and Human Services
| Description | Source | Link |
|---|---|---|
| HHS Breach Dataset | HHS website | https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf |
| Completed Investigations | Professor link | http://asayanalytics.com/breach_archive_csv |
| Under Investigation | Professor link | https://asayanalytics.com/breach_investigation_csv |
Outside Data Sources
| Description | Source | Link |
|---|---|---|
| State Populations | U.S. Census | https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html |
| State Abbreviations | Lonie | https://cheetahanalytics.com/stateabbrev |
Required Packages
Some of the packages listed below are redundant. For example, the dplyr package is contained within the tidyverse package. A package is listed separately when it provides a function that is particularly useful in this analysis. The packages required for this markdown are:
| Package | Description |
|---|---|
| tidyverse | the tidyverse collection of packages all together |
| DT | makes interactive javascript data tables |
| skimr | has a useful “skim” function for quick summary data |
| dplyr | also in tidyverse; allows for easy data manipulation in R (filter, select, mutate, group_by, etc) |
| stringr | provides useful functions for searching for specific strings within a character field |
| lubridate | provides useful date parsing and manipulation functions |
| ggplot2 | makes graphs |
Raw Data
Completed Investigations
The Completed Investigations dataset contains information about each data breach, including a quantitative variable (number of individuals affected), a date variable (breach submission date), a logical variable (Business Associate Present), multiple categorical variables (Name of Covered Entity, State, Covered Entity Type, Type of Breach, and Location of Breached Information) and a character field (Web Description) that contains additional context about the breach. Note that the Type of Breach and Location of Breached Information columns may require additional manipulation before summarizing.
The data set contains 2,043 records and 9 columns.
Five records were removed due to missing data (State (2), Covered Entity Type (2), and Type of Breach (1)), and 6 duplicate records were removed.
In addition, 1 record associated with the Covered Entity name “Anthem (Working file)” was removed.
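The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code used for the actual analysis: the column names (`entity`, `state`, `entity_type`, `breach_type`) are hypothetical, and it assumes missing values were read in as `NA` rather than as empty strings.

```r
library(dplyr)

# Toy stand-in for the Completed Investigations file (hypothetical columns and values)
completed_raw <- data.frame(
  entity      = c("Clinic A", "Clinic A", "Plan B", "Anthem (Working file)", "Hospital C"),
  state       = c("TX", "TX", NA, "IN", "FL"),
  entity_type = c("Healthcare Provider", "Healthcare Provider", "Health Plan",
                  "Health Plan", "Healthcare Provider"),
  breach_type = c("Theft", "Theft", "Loss", "Hacking/IT Incident", "Hacking/IT Incident"),
  stringsAsFactors = FALSE
)

completed_clean <- completed_raw %>%
  # drop rows missing any of the key fields
  filter(!is.na(state), !is.na(entity_type), !is.na(breach_type)) %>%
  # drop exact duplicate rows
  distinct() %>%
  # drop the working-file record
  filter(entity != "Anthem (Working file)")

nrow(completed_clean)
```

Comparing `nrow()` before and after each step is a simple way to confirm how many records each filter removes.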
Open Investigations
The Open Investigations dataset has the same layout/format as the completed investigations dataset. The data set contains 404 observations. The original dataset contained 406 observations, but 2 of those observations have been removed due to missing information (1 was missing State and the other was missing Covered Entity Type). Duplicate records were not found after removing the two records containing missing information.
Population Data
I pulled the population data for each state from the U.S. Census website (census.gov). This dataset gives the 2010 census population for each state and the estimated population for each state for each year from 2010 through 2018. There are also a couple of grouping variables, such as Region Name and Region Number.
State Abbreviations
Because the census dataset contains state names and not abbreviations, I created a state-to-abbreviation file for the purpose of tying the census data to the breach datasets, which uses state abbreviations instead of names.
Summary Data
Combine and Recode
The completed investigation and the open investigation datasets were modified to include a dataset label (open, completed) and combined into a single dataset. I also renamed each column in order to remove spaces and abbreviate. The combined dataset has 2,448 observations and 9 columns.
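A minimal sketch of the label, rename, and combine steps, assuming both files share the same columns. The toy values and the short names (`entity`, `breach_type`, `dataset`) are hypothetical; only the original column names “Name of Covered Entity” and “Type of Breach” come from the dataset description.

```r
library(dplyr)

# Toy stand-ins for the two source files (hypothetical values)
completed <- data.frame(
  `Name of Covered Entity` = c("Clinic A", "Plan B"),
  `Type of Breach`         = c("Theft", "Loss"),
  check.names = FALSE, stringsAsFactors = FALSE
)
open_inv <- data.frame(
  `Name of Covered Entity` = "Hospital C",
  `Type of Breach`         = "Hacking/IT Incident",
  check.names = FALSE, stringsAsFactors = FALSE
)

breaches <- bind_rows(
  mutate(completed, dataset = "completed"),  # label each source before stacking
  mutate(open_inv,  dataset = "open")
) %>%
  # shorten names and remove spaces
  rename(entity = `Name of Covered Entity`, breach_type = `Type of Breach`)

table(breaches$dataset)
```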
Another problem with this dataset is that the “Type of Breach” and “Location of Breach” columns contain useful information but the data is coded in such a way that a record can be classified under multiple “types” and “locations”. For example, a breach can involve both Hacking and Unauthorized Access. These two categories are combined into a single character value and treated as a distinct/unique category. Likewise, a breach may involve both laptop and desktop computers, and this combined “laptop, desktop” is treated as a unique location.
One way to manage this unfortunate coding in the original dataset would be to create subsets of the original data by searching for a particular string within the “Type of Breach” and “Location of Breach” columns. A more useful approach is to mutate the original dataset, creating a separate indicator/dummy variable for each distinct type and location. While this exercise is somewhat mundane and dramatically increases the size of the dataset (more than doubling the number of columns), it allows the most flexibility during analysis and ensures that the original coding does not become an insurmountable obstacle.
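The indicator-variable approach might look like the sketch below, which uses `str_detect()` from stringr inside `mutate()`. The column names and the toy rows are hypothetical, and only a few of the types and locations are shown.

```r
library(dplyr)
library(stringr)

# Toy rows showing the combined coding (hypothetical values)
breaches <- data.frame(
  breach_type = c("Hacking/IT Incident, Unauthorized Access/Disclosure", "Theft", "Loss, Theft"),
  location    = c("Network Server", "Laptop, Desktop Computer", "Paper/Films"),
  stringsAsFactors = FALSE
)

breaches <- breaches %>%
  mutate(
    # one TRUE/FALSE column per category; a record can be TRUE in several
    type_hacking      = str_detect(breach_type, fixed("Hacking")),
    type_unauthorized = str_detect(breach_type, fixed("Unauthorized Access")),
    type_theft        = str_detect(breach_type, fixed("Theft")),
    loc_laptop        = str_detect(location, fixed("Laptop")),
    loc_desktop       = str_detect(location, fixed("Desktop"))
  )

sum(breaches$type_theft)  # number of records that involve theft
```

Because each indicator is logical, `sum()` counts the records in a category, and a record coded under two types is counted once in each.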
The Breach Date field also needs to be recoded to be a “date” data type. I use the “mdy” function from the lubridate package to do this.
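As a small illustration of the date recoding, assuming the raw field holds month/day/year strings (the sample values are made up):

```r
library(lubridate)

raw_dates   <- c("10/21/2009", "3/5/2015", "12/31/2017")  # hypothetical raw values
breach_date <- mdy(raw_dates)                             # parse month/day/year text

class(breach_date)  # Date
year(breach_date)   # year component, useful for by-year summaries
```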
Summary
Number of Breaches by Year and Type - Completed Investigations:
Number of Breaches by Year and Type - Open Investigations:
Of the completed investigations, there are on average 35 breaches per year caused by hacking, 60 caused by unauthorized access, and 84 caused by theft.
However, the average number of breaches per year has increased dramatically over the past few years.
Breaches since 2015:
There are currently 186 open investigations due to hacking, 54 due to theft, and 140 due to unauthorized access. The number of open investigations for the other breach types is much smaller.
Required Analysis - Part 1
Chart1
Description: Number of reported breaches by year, with the top 5% of outliers omitted. The 123 omitted records are the top 5% of breaches ranked by number of individuals affected.
The number of reported breaches per year has grown steadily, averaging 9% year-over-year growth over the past 7 years.
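One way to reproduce this kind of summary is sketched below: trim at the 95th percentile of individuals affected, count breaches by year, and average the year-over-year growth. The data and column names (`year`, `affected`) are toy placeholders, and the use of `quantile()` with its default method for the cutoff is an assumption.

```r
library(dplyr)

# Toy data: 20 breaches over three years, one of them extremely large
breaches <- data.frame(
  year     = rep(2015:2017, times = c(5, 6, 9)),
  affected = c(seq(500, 2300, by = 100), 4000000)
)

cutoff  <- quantile(breaches$affected, 0.95)     # 95th percentile of breach size
trimmed <- filter(breaches, affected <= cutoff)  # omit the top 5%

by_year <- trimmed %>%
  count(year) %>%                        # breaches per year
  mutate(yoy_growth = n / lag(n) - 1)    # growth vs. the prior year (NA for the first year)

avg_growth <- mean(by_year$yoy_growth, na.rm = TRUE)
avg_growth
```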
Chart2
Description: Average Healthcare Data Breach Size by Year. As before, the top 5% of outliers with respect to breach size have been omitted prior to summarizing the data.
This graph shows the number of Protected Health Information (PHI) records exposed by year. Together with the previous chart, it shows that the size of breaches has gone up in addition to the number of breaches.
Table1
Description: Largest Healthcare Data Breaches
The following table shows the largest data breaches on record, as measured by number of individuals affected.
Chart3
Description: Hacking/IT Incidents by Year
The following chart shows that the number of Hacking/IT Incidents rose steadily from 2010 through 2015 and has risen dramatically since 2015.
Table2
Description: Breaches by Entity Type
The summary below shows that virtually no breaches occur at Healthcare Clearing Houses, a comparatively small number occur at Health Plan entities, and the vast majority occur at Healthcare Provider entities.
Required Analysis - Part 2
Visual1
On what day of the week are breaches most-often reported?
In the following table, the day of the week is coded with Sunday = 1 and Saturday = 7. The table shows that 31% of all data breaches are reported on Fridays (day 6). Reports are very rare on Sundays and Saturdays, at 1% and 1.6% of breaches, respectively. On the remaining four days of the work week, Monday through Thursday, the frequency is relatively uniform, averaging 16.4% of breaches per day.
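The Sunday = 1 coding matches the default behavior of `wday()` in lubridate. A minimal sketch of the day-of-week tabulation, using made-up submission dates:

```r
library(lubridate)

# Hypothetical submission dates
submitted <- mdy(c("11/02/2018", "11/05/2018", "11/09/2018", "11/04/2018"))

dow <- wday(submitted)  # 1 = Sunday, ..., 7 = Saturday (default week start)

# Percent of breaches reported on each day, keeping days with zero reports
pct_by_day <- round(100 * prop.table(table(factor(dow, levels = 1:7))), 1)
pct_by_day
```

Wrapping the day codes in `factor(..., levels = 1:7)` keeps days with no reports in the table rather than silently dropping them.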
Visual2
Are there any breach type trends over time?
The number of hacking and unauthorized access breaches has increased over the last 7 years, while breaches due to loss, theft, and improper disposal are down over the same span. This suggests that internal security measures and controls have tightened over the past 7 years, reducing breaches attributable to employee mishandling of data. However, this has not prevented breaches from external/unauthorized sources, which have continued to rise. The trends suggest that companies are struggling to keep up with advances in technology that allow hackers to access their data in new and (perhaps) unpredictable ways, despite being much better at managing internal controls than they were 7 years ago.
Exploratory Data Analysis: Synthesize with External Data
Merge Breach and Population Data
I merged the HHS breach data with external population data taken from the U.S. Census website. State abbreviations were not available in the census data, so I created another dataset myself in Excel (saved in CSV format) to map state names to their abbreviations. The datasets were combined with “merge(by = state, all.x=TRUE)” in R, which is the equivalent of a left join. The population data is in the far-right columns of the “consolidated” data table and consists of the following columns:
| Column Name | Description |
|---|---|
| SUMLEV | This is a numeric summary-level ID that distinguishes state, region, and country level records, since the rows of the table include subtotals and grand totals |
| REGION | This is a numeric region id used for mapping state-to-regions (such as “Midwest”) |
| STATE | This is a numeric state id |
| CENSUS2010POP | This is the actual state population as measured by the 2010 census |
| POPESTIMATE2018 | This is the estimated 2018 population for each state |
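The two-step merge can be sketched in base R as follows. The population figures are round placeholders rather than real census values, and the `NAME` and `state` column names are assumptions about how the files are keyed.

```r
# Toy breach records keyed by state abbreviation (hypothetical values)
breaches <- data.frame(
  entity = c("Clinic A", "Plan B", "Hospital C"),
  state  = c("TX", "IN", "PR"),
  stringsAsFactors = FALSE
)

# Hand-built name-to-abbreviation lookup and a census extract (placeholder populations)
abbrev <- data.frame(NAME = c("Texas", "Indiana"), state = c("TX", "IN"),
                     stringsAsFactors = FALSE)
census <- data.frame(NAME = c("Texas", "Indiana"),
                     POPESTIMATE2018 = c(2000000, 1000000),
                     stringsAsFactors = FALSE)

# Step 1: attach abbreviations to the census rows
pop_by_abbrev <- merge(census, abbrev, by = "NAME")

# Step 2: left join onto the breach data; all.x = TRUE keeps every breach record
consolidated <- merge(breaches, pop_by_abbrev, by = "state", all.x = TRUE)
consolidated
```

Breach records whose state has no match in the lookup (here, "PR") are kept with `NA` population, which makes unmatched rows easy to spot after the join.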
Exploratory Data Analysis
Breaches by State / Region
The Region data is taken from the U.S. Census data and was previously combined with the breach data. The country's four regions are:
| Region Number | Region Name |
|---|---|
| 1 | Northeast |
| 2 | Midwest |
| 3 | South |
| 4 | West |
The graph shows that the Southern region of the country accounts for a disproportionately high share of data breaches, at 35% of all breaches. This compares to only 17% of all data breaches occurring in the Northeast, 24% in the Midwest, and 23% in the West.
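Regional shares like these can be computed by mapping the numeric REGION code to its name and tabulating. The sketch below uses a handful of made-up records, so the resulting percentages are illustrative only.

```r
# Hypothetical REGION codes for eight breach records
region_code <- c(3, 3, 1, 2, 4, 3, 2, 4)

# Map codes 1-4 to the census region names
region <- factor(region_code, levels = 1:4,
                 labels = c("Northeast", "Midwest", "South", "West"))

# Percent of breaches in each region
region_share <- round(100 * prop.table(table(region)), 1)
region_share
```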
The largest contributors by state are shown below. California, Texas, Florida and New York were the largest contributors based on the number of breaches.
Texas and Florida account for nearly 16% of all data breaches, which is more than the 24 lowest-ranked states (nearly half of all other states) combined and explains why the South region has disproportionately more data breaches than the other regions.
Indiana, New York, Washington, and Tennessee were the largest contributors based on the number of records affected.
On a per-capita basis, Indiana is by far the state with the most individuals affected, at 12.6 times the population of the state itself. This is an extreme outlier, with the “runner up”, Tennessee, coming in at only 1.7 times the state population.
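The per-capita figure is the total number of individuals affected in a state divided by that state's population. A minimal sketch, using made-up state codes, counts, and populations:

```r
library(dplyr)

# Toy consolidated data: one row per breach, population repeated per state (hypothetical)
consolidated <- data.frame(
  state           = c("AA", "AA", "BB"),
  affected        = c(3000, 2000, 500),
  POPESTIMATE2018 = c(1000, 1000, 2000),
  stringsAsFactors = FALSE
)

per_capita <- consolidated %>%
  group_by(state) %>%
  summarise(
    breaches       = n(),                     # number of breaches in the state
    total_affected = sum(affected),           # individuals affected across all breaches
    population     = first(POPESTIMATE2018)   # population is constant within a state
  ) %>%
  mutate(affected_per_capita = total_affected / population) %>%
  arrange(desc(affected_per_capita))

per_capita
```

A ratio above 1 means more records were exposed than the state has residents, which can happen when a covered entity headquartered in one state holds records for people nationwide.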
Looking at a scatterplot of number of breaches versus the size of the breach across the four regions, there doesn’t appear to be a strong relationship between the number of breaches for each state and the size of breaches per capita.