The Office for Civil Rights (OCR) is responsible for reporting breaches of unsecured Protected Health Information (PHI). Breaches are reported to the Secretary of the U.S. Department of Health and Human Services, and the published data covers breaches that affect 500 or more individuals. The purpose is to generate statistics that reveal trends in data breaches, to inform the general public about how often such breaches occur, and to notify consumers who may not yet realize that their PHI has been compromised.
The compiled data includes breach investigations that have been completed as well as breaches currently under investigation. The data contains nine key pieces of information:
The organization responsible for the PHI
The state where the breach was reported
The type of organization responsible for the PHI
The number of affected individuals
The date the breach was reported by the covered entity
The type of breach and how unauthorized access to PHI was gained
Where the PHI was when unauthorized access was obtained
Whether a business associate of the covered entity was involved in the breach
A statement explaining what happened and the resolution
The approach begins with clearing the data of empty values, eliminating duplicate entries, and splitting fields that hold multiple entries into separate entities or variables. New variables are introduced to differentiate between the records under investigation and those no longer being investigated. Functions from libraries like tidyverse will be used to mutate, summarize, filter, and investigate the data. Visualizations like scatterplots and density distributions will be created to help highlight hidden aspects of the discrete data.
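The first two cleaning steps described above can be sketched as follows. This is a minimal illustration on a hypothetical toy data frame (the `toy` object and its column names are assumptions, not the real breach data), and dropping rows with a missing `State` is only one possible way to handle empty values.

```r
library(dplyr)  # part of the tidyverse

# Hypothetical toy records; the real data has more columns.
toy <- data.frame(
  Entity = c("Clinic A", "Clinic A", "Hospital B", "Plan C"),
  State = c("OH", "OH", NA, "TX"),
  IndividualsAffected = c(1200, 1200, 530, 800),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  distinct() %>%           # drop exact duplicate rows
  filter(!is.na(State))    # drop rows where State is missing

cleaned
```

Whether rows with missing values should be removed or kept depends on which column is empty; a missing free-text description, for instance, may not justify discarding the record.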
Consumers can use this information to determine whether their PHI was compromised by such a breach and begin the process of further protecting their information. If consumers suspect they have been victims of identity theft or fraud, this information can also be used to pursue restitution from the responsible entities. For example, when the federal government was the responsible entity in past breaches, it provided victims with credit and fraud monitoring services for an extended period to ensure consumers were protected to the fullest extent possible.
The following packages are necessary to view the breach information.
| Package | Explanation |
|---|---|
| tidyverse | For all things tidy |
| DT | To display some data using Data tables |
| knitr | For introducing R and HTML together |
| rmdformats | For ready-to-use R Markdown |
| lubridate | To manipulate date formats |
The first data set, “Breach_Archive”, contains data on 2,049 breach investigations that were investigated and ultimately closed.
The second data set, “Breach_Investigation”, contains data on 406 breach investigations currently ongoing.
To identify which records are closed and which are under investigation, a new column “Investigated” was created. This column uses a value of ‘1’ if the data breach investigation is complete, and a ‘0’ if the data breach is currently under investigation.
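A minimal sketch of how such a flag column could be added with `mutate()`. The data frames `archive` and `investigation` and their single `Entity` column are hypothetical stand-ins for the two data sets.

```r
library(dplyr)

# Hypothetical stand-ins for the two data sets.
archive <- data.frame(Entity = c("Clinic A", "Hospital B"),
                      stringsAsFactors = FALSE)
investigation <- data.frame(Entity = "Plan C",
                            stringsAsFactors = FALSE)

# 1 = investigation complete, 0 = still under investigation
archive <- archive %>% mutate(Investigated = 1)
investigation <- investigation %>% mutate(Investigated = 0)
```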
To combine the files into a single data set, both data sets must have the same number of columns, and the columns must have the same names. The rbind() function is then used to stack the data from the Investigation Complete data set directly above the data from the Under Investigation data set.
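The stacking step might look like the sketch below. The toy data frames and their columns are assumptions for illustration; the explicit `stopifnot()` check on the column names is an added safeguard, not something the original analysis is known to include.

```r
# Hypothetical toy versions of the two data sets.
archive <- data.frame(
  Entity = c("Clinic A", "Hospital B"),
  State = c("OH", "NY"),
  Investigated = 1,
  stringsAsFactors = FALSE
)
investigation <- data.frame(
  Entity = "Plan C",
  State = "TX",
  Investigated = 0,
  stringsAsFactors = FALSE
)

# Fail early if the column names do not line up.
stopifnot(identical(names(archive), names(investigation)))

# Closed investigations first, ongoing ones below.
breach <- rbind(archive, investigation)
```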
##           Name of Covered Entity                            State
##                                0                                3
##              Covered Entity Type             Individuals Affected
##                                3                                1
##           Breach Submission Date                   Type of Breach
##                                0                                1
## Location of Breached Information       Business Associate Present
##                                0                                0
##                  Web Description                     Investigated
##                              742                                0
## [1] "Name of Covered Entity" "State"
## [3] "Covered Entity Type" "Individuals Affected"
## [5] "Breach Submission Date" "Type of Breach"
## [7] "Location of Breached Information" "Business Associate Present"
## [9] "Web Description" "Investigated"
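The code that produced the two outputs above is not shown. One common way to obtain output of this shape in base R is `colSums(is.na(...))` for a per-column count of missing values and `names()` for the column names; the sketch below assumes that approach and uses a hypothetical toy data frame.

```r
# Hypothetical toy data; check.names = FALSE keeps the spaces in column names.
toy <- data.frame(
  State = c("OH", NA, "TX"),
  `Individuals Affected` = c(1200, 530, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

na_counts <- colSums(is.na(toy))  # missing values per column
na_counts
names(toy)                        # column names
```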
# Convert the submission date from month/day/year text to a Date
breach$BreachSubmissionDate <- as.Date(breach$BreachSubmissionDate, format = "%m/%d/%Y")
Each record is identified as having at least 1 of 7 types of breach:
Hacking/IT
Improper_Disposal
Loss
Theft
Unauthorized Access/Disclosure
Unknown
Other
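Because a record can carry more than one breach type, the type field may need to be split before counting. The sketch below assumes multiple types are stored as comma-separated text in a single column; the `toy` data frame and the `TypeOfBreach` column name are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical records; the second one lists two breach types.
toy <- data.frame(
  Entity = c("Clinic A", "Hospital B", "Plan C"),
  TypeOfBreach = c("Theft", "Hacking/IT, Theft", "Loss"),
  stringsAsFactors = FALSE
)

by_type <- toy %>%
  separate_rows(TypeOfBreach, sep = ",\\s*") %>%  # one row per breach type
  count(TypeOfBreach, sort = TRUE)                # records per type

by_type
```

After the split, a record with two types contributes to both counts, so the counts can sum to more than the number of records.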
breach %>%
  group_by(State) %>%
  datatable()