Week 6 - Lonie Moore - Fall 2018 Exam Practice

Introduction and Pre-Analysis
Raw Data
Summary Data
Required Analysis - Part 1
Required Analysis - Part 2
Exploratory Data Analysis: Synthesize with External Data
Exploratory Data Analysis

Introduction and Pre-Analysis

Introduction

The purpose of this document is to investigate, summarize, and provide insight into the protected health information (PHI) breach data sourced from the U.S. Department of Health and Human Services. Various data summarization techniques will be used to provide visual insight into the data, including synthesis with an outside data source (U.S. census data, which provides populations for each state). This analysis will help the reader by drawing out insights that were not readily available from the original source.

Definitions & Terminology

Acronyms:

Acronym	Description
HHS	U.S. Department of Health and Human Services
OCR	Office for Civil Rights
PHI	Protected Health Information
CE	Covered Entity

Definitions:

Term	Description
Breach	A “Breach” is classified as unauthorized disclosure of protected health information affecting 500 or more individuals
Breach Type	Includes Hacking/IT Incidents, Theft, Improper Disposal, Unauthorized Access/Disclosure, Loss, Other, and Unknown
Breach Location	Includes Desktop Computer, Laptop, Paper/Films, Electronic Medical Record, Network Server, Email, Other Portable Electronic Device, and Other
Covered Entity	Organizations responsible for protecting health information
Covered Entity Type	Includes Health Plan, Healthcare Clearing House, and Healthcare Provider

Sources

U.S. Department of Health and Human Services

Description	Source	Link
HHS Breach Dataset	HHS website	https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf
Completed Investigations	Professor link	http://asayanalytics.com/breach_archive_csv
Under Investigation	Professor link	https://asayanalytics.com/breach_investigation_csv

Outside Data Sources

Description	Source	Link
State Populations	U.S. Census	https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html
State Abbreviations	Lonie	https://cheetahanalytics.com/stateabbrev

Required Packages

Some of the packages listed below are redundant. For example, dplyr is part of the tidyverse. A package is listed separately when it provides a function that is particularly useful in this analysis. The packages required for this markdown are:

Package	Description
tidyverse	the tidyverse collection of packages all together
DT	makes interactive javascript data tables
skimr	has a useful “skim” function for quick summary data
dplyr	also in tidyverse; allows for easy data manipulation in R (filter, select, mutate, group_by, etc)
stringr	provides useful functions for searching for specific strings within a character field
lubridate	provides useful date parsing and manipulation functions
ggplot2	makes graphs

Raw Data

Completed Investigations

The Completed Investigations dataset contains information about each data breach, including a quantitative variable (number of individuals affected), a date variable (breach submission date), a logical variable (Business Associate Present), multiple categorical variables (Name of Covered Entity, State, Covered Entity Type, Type of Breach, and Location of Breached Information) and a character field (Web Description) that contains additional context about the breach. Note that the Type of Breach and Location of Breached Information columns may require additional manipulation before summarizing.

The data set contains 2,043 records and 9 columns.

5 records were removed due to missing data (State (2), Covered Entity Type (2), and Type of Breach (1)). Additionally, 6 duplicate records were removed.

Additionally, 1 record associated with the Covered Entity name “Anthem (Working file)” was removed.
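The cleaning steps described above can be sketched as follows. This is a minimal illustration on toy data, not the original code: the column names (entity, state, entity_type, breach_type, affected) are hypothetical, and it assumes missing values are read in as NA rather than as empty strings.

```r
library(dplyr)

# Toy stand-in for the Completed Investigations file (hypothetical column names)
completed_raw <- tibble(
  entity      = c("Clinic A", "Clinic A", "Plan B", "Anthem (Working file)", "Hospital C"),
  state       = c("TX", "TX", NA, "IN", "FL"),
  entity_type = c("Healthcare Provider", "Healthcare Provider", "Health Plan",
                  "Health Plan", "Healthcare Provider"),
  breach_type = c("Theft", "Theft", "Loss", "Hacking/IT Incident", "Theft"),
  affected    = c(1200, 1200, 800, 5000, 650)
)

completed_clean <- completed_raw %>%
  # drop records missing State, Covered Entity Type, or Type of Breach
  filter(!is.na(state), !is.na(entity_type), !is.na(breach_type)) %>%
  # drop exact duplicate records
  distinct() %>%
  # drop the working-file record
  filter(entity != "Anthem (Working file)")

nrow(completed_clean)
```

In this toy example, one record is dropped for a missing state, one as a duplicate, and one as the working-file record.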

Open Investigations

The Open Investigations dataset has the same layout/format as the completed investigations dataset. The data set contains 404 observations. The original dataset contained 406 observations, but 2 of those observations have been removed due to missing information (1 was missing State and the other was missing Covered Entity Type). Duplicate records were not found after removing the two records containing missing information.

Population Data

I pulled the population data for each state from the U.S. Census Bureau (census.gov). This dataset gives the 2010 census count for each state and the estimated population for each state for each year from 2010 through 2018. There are also a couple of grouping variables, such as Region Name and Region Number.

State Abbreviations

Because the census dataset contains state names and not abbreviations, I created a state-to-abbreviation file for the purpose of tying the census data to the breach datasets, which use state abbreviations instead of names.

Summary Data

Combine and Recode

The completed investigation and the open investigation datasets were modified to include a dataset label (open, completed) and combined into a single dataset. I also renamed each column in order to remove spaces and abbreviate. The combined dataset has 2,448 observations and 9 columns.
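A minimal sketch of the label-and-combine step, assuming both datasets already share the same (hypothetical) column names. The label column name status is also an assumption.

```r
library(dplyr)

# Toy stand-ins for the two cleaned datasets (hypothetical column names)
completed <- tibble(entity = c("Clinic A", "Hospital C"), state = c("TX", "FL"))
open_inv  <- tibble(entity = "Plan D", state = "CA")

# Tag each dataset with its source, then stack them into one table
breaches <- bind_rows(
  completed %>% mutate(status = "completed"),
  open_inv  %>% mutate(status = "open")
)

count(breaches, status)
```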

Another problem with this dataset is that the “Type of Breach” and “Location of Breach” columns contain useful information but the data is coded in such a way that a record can be classified under multiple “types” and “locations”. For example, a breach can involve both Hacking and Unauthorized Access. These two categories are combined into a single character value and treated as a distinct/unique category. Likewise, a breach may involve both laptop and desktop computers, and this combined “laptop, desktop” is treated as a unique location.

One way to manage this unfortunate coding in the original dataset would be to create subsets of the data by searching for a particular string within the “Type of Breach” and “Location of Breach” columns. A more useful approach is to mutate the original dataset, creating a separate indicator (dummy) variable for each distinct type and location. While this exercise is somewhat tedious and substantially increases the size of the dataset (more than doubling the number of columns), it allows the most flexibility during analysis and ensures that the original coding does not become an insurmountable obstacle.
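The indicator-variable approach can be sketched with stringr's str_detect. The column name breach_type, the indicator names, and the exact category strings in the toy data are assumptions for illustration. The same pattern would apply to the location column.

```r
library(dplyr)
library(stringr)

# Toy records where one row carries two breach types in a single string
breaches <- tibble(
  breach_type = c("Hacking/IT Incident, Unauthorized Access/Disclosure",
                  "Theft",
                  "Loss, Theft")
)

# One logical indicator per breach type; fixed() matches the literal text
breaches <- breaches %>%
  mutate(
    is_hacking      = str_detect(breach_type, fixed("Hacking")),
    is_unauthorized = str_detect(breach_type, fixed("Unauthorized Access")),
    is_theft        = str_detect(breach_type, fixed("Theft")),
    is_loss         = str_detect(breach_type, fixed("Loss"))
  )

# A record with two types is now counted once under each type
colSums(select(breaches, starts_with("is_")))
```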

The Breach Date field also needs to be recoded to be a “date” data type. I use the “mdy” function from the lubridate package to do this.
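A short sketch of the date recode, assuming the raw field arrives as month/day/year text. The sample values are made up.

```r
library(lubridate)

# Made-up submission dates in month/day/year text form
submitted_raw <- c("10/21/2009", "03/15/2018")

# mdy() parses month-day-year strings into Date values
submitted <- mdy(submitted_raw)

class(submitted)
year(submitted)
```

Once the field is a Date, helpers such as year() and wday() can be used for the yearly and day-of-week summaries.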

Summary

Number of Breaches by Year and Type - Completed Investigations:

Number of Breaches by Year and Type - Open Investigations:

Of the completed investigations, each year there are on average 35 breaches caused by Hacking, 60 breaches caused by unauthorized access, and 84 breaches caused by Theft.

However, the average number of breaches per year has increased dramatically over the past few years.

Breaches since 2015:

There are currently 186 open investigations due to hacking, 54 open investigations due to theft, and 140 open investigations due to unauthorized access. The numbers of open investigations for the other breach types are much smaller.

Required Analysis - Part 1

Chart1

Description: Number of reported breaches by year (with the top 5% of outliers omitted). Note that 123 records were omitted, representing the top 5% of breaches by number of individuals affected.
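One way to omit the top 5% of records before counting by year is to cut at the 95th percentile of individuals affected. This is a sketch on made-up data with hypothetical column names (year, affected). The original analysis may have selected the outliers differently, for example by ranking.

```r
library(dplyr)

# Made-up data: 20 breaches across two years
breaches <- tibble(
  year     = rep(c(2016, 2017), each = 10),
  affected = seq(100, 2000, by = 100)
)

# 95th percentile of breach size; records above it are treated as outliers
cutoff  <- quantile(breaches$affected, 0.95)
trimmed <- breaches %>% filter(affected <= cutoff)

# Number of reported breaches per year after trimming
per_year <- trimmed %>% count(year, name = "n_breaches")
per_year
```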

The number of reported breaches per year has been increasing steadily at 9% year-over-year growth on average for the past 7 years.
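An average year-over-year growth rate of this kind can be computed as the mean of the yearly percentage changes. The counts below are invented for illustration and are not taken from the breach data.

```r
# Invented yearly breach counts, for illustration only
counts <- c(`2011` = 100, `2012` = 110, `2013` = 121)

# Year-over-year growth: change from the prior year divided by the prior year
yoy <- diff(counts) / head(counts, -1)

# Simple (arithmetic) average of the yearly growth rates
avg_growth <- mean(yoy)
avg_growth
```

This is the arithmetic mean of the yearly rates. A compound annual growth rate is an alternative summary and can give a slightly different figure.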

Chart2

Description: Average Healthcare Data Breach Size by Year. As before, the top 5% of outliers with respect to breach size have been omitted prior to summarizing the data.

This graph shows the number of Protected Health Information (PHI) records exposed by year. It shows that the size of breaches has gone up, in addition to the number of breaches shown in the previous chart.

Table1

Description: Largest Healthcare Data Breaches

The following table shows the largest data breaches on record, as measured by number of individuals affected.

Chart3

Description: Hacking/IT Incidents by Year

The following chart shows that the number of Hacking/IT Incidents rose steadily from 2010 through 2015 and has risen dramatically since 2015.


Table2

Description: Breaches by Entity Type

The breakdown below shows virtually no breaches at Healthcare Clearing Houses, a comparatively small number at Health Plans, and the vast majority of breaches at Healthcare Providers.

Required Analysis - Part 2

Visual1

On what day of the week are breaches most-often reported?

In the following table, the day of the week is coded with Sunday = 1 and Saturday = 7. The table shows that 31% of all data breaches are reported on Fridays (day 6). Reports are very rare on Sundays and Saturdays, at 1% and 1.6% of breaches, respectively. Across the other four workdays, Monday through Thursday, the frequency is relatively uniform, averaging 16.4% of breaches per day.
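The Sunday = 1 through Saturday = 7 coding matches the default of lubridate's wday function. A minimal sketch with made-up submission dates:

```r
library(lubridate)

# Made-up submission dates: two Fridays, one Sunday, one Monday
submitted <- mdy(c("10/19/2018", "10/21/2018", "10/22/2018", "10/26/2018"))

# week_start = 7 makes the Sunday = 1 ... Saturday = 7 coding explicit
dow <- wday(submitted, week_start = 7)

# Share of breaches reported on each day, keeping days with zero reports
share <- prop.table(table(factor(dow, levels = 1:7)))
share
```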

Visual2

Are there any breach type trends over time?

The numbers of hacking and unauthorized access breaches have increased over the last 7 years, while breaches due to loss, theft, and improper disposal are down over the same span. This suggests that internal security measures and controls have tightened over the past 7 years, reducing breaches attributable to employee mishandling of data. However, this has not prevented breaches from external or unauthorized sources, which have continued to rise. The trends suggest that companies are struggling to keep up with advances in technology that allow hackers to access their data in new and (perhaps) unpredictable ways, even though those same companies are much better at managing internal controls than they were 7 years ago.

Exploratory Data Analysis: Synthesize with External Data

Merge Breach and Population Data

I merged the HHS breach data with external population data taken from the U.S. Census website. State abbreviations were not available in the census data, so I created a state-name-to-abbreviation lookup in Excel (saved in CSV format) to map the state names to their abbreviations. The datasets were joined with R's “merge(by = state, all.x=TRUE)” function, which is the equivalent of a left join. The population data is in the far-right columns of the “consolidated” data table and consists of the following columns:

Column Name	Description
SUMLEV	This is a numeric grouping variable row ID since the rows of the table include subtotals and grand totals (to distinguish between state, region, and country level records)
REGION	This is a numeric region id used for mapping state-to-regions (such as “Midwest”)
STATE	This is a numeric state id
CENSUS2010POP	This is the actual state population as measured by the 2010 census
POPESTIMATE2018	This is the estimated 2018 population for each state based
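The two-step join described above can be sketched in base R. The data frame names, the state_name column, and the population figures are placeholders for illustration. Only the merge arguments (by and all.x = TRUE) come from the text.

```r
# Toy breach records keyed by state abbreviation (PR has no census match here)
breaches <- data.frame(
  entity = c("Clinic A", "Hospital C", "Plan D"),
  state  = c("TX", "FL", "PR")
)

# Hand-built lookup from state name to abbreviation
state_abbrev <- data.frame(
  state_name = c("Texas", "Florida"),
  state      = c("TX", "FL")
)

# Census extract keyed by state name; population values are placeholders
census <- data.frame(
  state_name      = c("Texas", "Florida"),
  REGION          = c(3, 3),
  POPESTIMATE2018 = c(1000, 800)
)

# Step 1: attach abbreviations to the census rows
population <- merge(state_abbrev, census, by = "state_name")

# Step 2: left join, so every breach record is kept even without a match
consolidated <- merge(breaches, population, by = "state", all.x = TRUE)
consolidated
```

With all.x = TRUE, breach records whose state has no census match are retained with NA in the population columns, so they can be inspected rather than silently dropped.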

Exploratory Data Analysis

Breaches by State / Region

The Region data is taken from the U.S. Census data and was previously combined with the breach data. The country's four regions are:

Region Number	Region Name
1	Northeast
2	Midwest
3	South
4	West

The graph shows that the South accounts for a disproportionately high share of data breaches, at 35% of all breaches. This compares to only 17% of all data breaches occurring in the Northeast, 24% in the Midwest, and 23% in the West.

The largest contributors by state are shown below. California, Texas, Florida and New York were the largest contributors based on the number of breaches.

Texas and Florida account for nearly 16% of all data breaches, which is more than the 24 lowest-ranked states (nearly half of all other states) combined. This helps explain why the South region has disproportionately more data breaches than the other regions.

Indiana, New York, Washington, and Tennessee were the largest contributors based on the number of records affected.

On a per-capita basis, Indiana is by far the state with the most affected records, at 12.6 times the population of the state itself. This is an extreme outlier: the “runner up”, Tennessee, comes in at only 1.7 times the state population.
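The per-capita figure is the total number of individuals affected in a state divided by the state's population. A minimal sketch, using placeholder state codes and invented numbers. The column names other than POPESTIMATE2018 are hypothetical.

```r
library(dplyr)

# Invented breach-level records with the state population already merged in
consolidated <- tibble(
  state           = c("AA", "AA", "BB"),
  affected        = c(3000, 2000, 300),
  POPESTIMATE2018 = c(1000, 1000, 600)
)

per_capita <- consolidated %>%
  group_by(state) %>%
  summarise(
    n_breaches = n(),
    affected   = sum(affected),            # total individuals affected in the state
    population = first(POPESTIMATE2018)    # population repeats on every row of a state
  ) %>%
  mutate(affected_per_capita = affected / population) %>%
  arrange(desc(affected_per_capita))

per_capita
```

A ratio above 1, as in the first toy state, means more records were affected than the state has residents, which is possible when a breached entity holds records for people in other states.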

Looking at a scatterplot of number of breaches versus the size of the breach across the four regions, there doesn’t appear to be a strong relationship between the number of breaches for each state and the size of breaches per capita.
