Introduction:

1.1 Purpose

The Office for Civil Rights (OCR) publishes reports of breaches of unsecured Protected Health Information (PHI). Breaches are reported to the Secretary of the U.S. Department of Health and Human Services, and the published data covers breaches that affect 500 or more individuals. The purpose of this analysis is to generate relevant statistics from the data, identify trends in data breaches, inform the general public about the frequency of such breaches, and notify consumers who may not yet realize that their PHI has been compromised.

1.2 The Data Used

The compiled data includes information on breach investigations that have been completed as well as breaches currently under investigation. The data contains nine key pieces of information:

  1. The organization responsible for the PHI

  2. The state where the breach was reported

  3. The type of organization responsible for the PHI

  4. The number of affected individuals

  5. The date the breach was reported by the covered entity

  6. The type of breach and how unauthorized access to PHI was gained

  7. Where the PHI was when unauthorized access was obtained

  8. Whether a business associate of the covered entity was involved in the breach

  9. A statement explaining what happened and the resolution

1.3 Proposed Analytic Approach

The approach begins by clearing the data of empty values, eliminating duplicate entries, and splitting fields that hold multiple entries into separate entities or variables. New variables are introduced to differentiate records under investigation from those no longer being investigated. Functions from libraries like tidyverse will be used to mutate, summarize, filter, and investigate the data. Visualizations such as scatterplots and density distributions will be created to help highlight hidden aspects of the discrete data.
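A minimal sketch of the mutate, filter, and summarize steps described above, assuming the dplyr package from the tidyverse is installed. The data frame `breach_toy`, its values, and the `Status` labels are hypothetical and stand in for the real breach data.

```r
library(dplyr)

# Toy records; column names and values are illustrative, not the real data
breach_toy <- tibble(
  State = c("OH", "OH", "TX", "CA"),
  IndividualsAffected = c(1200, 560, 30000, 875),
  Investigated = c(1, 0, 1, 1)
)

status_summary <- breach_toy %>%
  # Turn the 1/0 flag into a readable label
  mutate(Status = if_else(Investigated == 1, "Closed", "Under investigation")) %>%
  # Keep breaches that affect 500 or more individuals
  filter(IndividualsAffected >= 500) %>%
  # Count breaches and total affected individuals per status
  group_by(Status) %>%
  summarize(
    Breaches = n(),
    TotalAffected = sum(IndividualsAffected),
    .groups = "drop"
  )

print(status_summary)
```

The same pattern extends to grouping by state or breach type once those columns have been cleaned.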

1.4 How This Helps Consumers

Consumers can use this information to determine whether their PHI was compromised by such a breach and begin the process of further protecting their information. If consumers suspect they have been victims of identity theft or fraud, this information can also be used to pursue restitution from the responsible entities. For example, when the federal government was the responsible entity in past breaches, it provided victims with credit and fraud monitoring services for an extended period to ensure consumers were protected to the fullest extent possible.

Required Packages:

2.1 Packages Used

The following packages are necessary to view the breach information.

Package      Explanation
tidyverse    For all things tidy
DT           To display some data using Data tables
knitr        For introducing R and HTML together
rmdformats   For ready-to-use R Markdown
lubridate    To manipulate date formats
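One way to load the packages listed above is sketched below. The check for missing packages is an addition for convenience, not something the report itself describes, and the variable names `pkgs` and `missing_pkgs` are hypothetical.

```r
# Packages named in the table above
pkgs <- c("tidyverse", "DT", "knitr", "rmdformats", "lubridate")

# Report anything not yet installed instead of failing part-way through
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  message("Install before knitting: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```

Any package reported as missing can be installed with `install.packages()` before the report is knitted.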

Data Preparation:

3.1 Import Data Set

The first data set, “Breach_Archive”, contains data on 2,049 breach investigations that were investigated and ultimately closed.

The second data set, “Breach_Investigation”, contains data on 406 breach investigations currently ongoing.

To identify which records are closed and which are under investigation, a new column “Investigated” was created. This column uses a value of ‘1’ if the data breach investigation is complete, and a ‘0’ if the data breach is currently under investigation.
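The flagging step can be sketched as follows. The two small data frames are hypothetical stand-ins for the imported files (the actual import call and file paths are not shown in this report), and base R is used so the sketch runs without extra packages.

```r
# Toy stand-ins for the two imported files (values are illustrative only)
Breach_Archive <- data.frame(
  "State" = c("OH", "TX"),
  "Individuals Affected" = c(1200, 30000),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
Breach_Investigation <- data.frame(
  "State" = "CA",
  "Individuals Affected" = 875,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1 = investigation complete, 0 = still under investigation
Breach_Archive$Investigated <- 1
Breach_Investigation$Investigated <- 0
```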

3.2 Combining Data Sets

To combine the files into a single data set, both data sets must have the same number of columns, and the columns must use the same names. Once these conditions are met, the function rbind() is used to stack the data from the Investigation Complete data set directly above the data from the Under Investigation data set.
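A minimal sketch of the stacking step, using hypothetical toy data frames named `closed` and `open` in place of the real files. The explicit name check is an added safeguard, not part of the original workflow.

```r
# Toy stand-ins for the two data sets (values are illustrative only)
closed <- data.frame(State = c("OH", "TX"), Investigated = 1)
open <- data.frame(State = "CA", Investigated = 0)

# Stop early if the two data sets do not share the same column names
stopifnot(setequal(names(closed), names(open)))

# Stack the closed investigations above the ongoing ones
breach <- rbind(closed, open)

nrow(breach)
```

With the real files, the row count of the combined data should equal the sum of the two inputs.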

3.3 Cleaning the Data

  • There are 750 total N/A values within the data. The majority of N/A values exist within the Web Description field and are removed from the data moving forward.
##           Name of Covered Entity                            State 
##                                0                                3 
##              Covered Entity Type             Individuals Affected 
##                                3                                1 
##           Breach Submission Date                   Type of Breach 
##                                0                                1 
## Location of Breached Information       Business Associate Present 
##                                0                                0 
##                  Web Description                     Investigated 
##                              742                                0
  • Additionally, the column (variable) names have spaces in them. For example, "Name of Covered Entity" has a space between each word. At this point, the spaces are removed to make the variables easier to use in later analysis.
##  [1] "Name of Covered Entity"           "State"                           
##  [3] "Covered Entity Type"              "Individuals Affected"            
##  [5] "Breach Submission Date"           "Type of Breach"                  
##  [7] "Location of Breached Information" "Business Associate Present"      
##  [9] "Web Description"                  "Investigated"
  • The dates of the records are stored as character strings in MM/DD/YYYY format. The variable is converted to the Date class, which displays as YYYY-MM-DD, by using a date function that is appropriate for this type of data.
breach$BreachSubmissionDate <- as.Date(breach$BreachSubmissionDate, "%m/%d/%Y")
  • Duplicated information was a concern given that the imported files
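The cleaning steps listed above can be sketched on a small example. The toy data frame and its values are hypothetical, the date strings are assumed to be in MM/DD/YYYY form (as the format string in the as.Date() call implies), and dropping exact duplicate rows with duplicated() is only one possible approach, since the report does not state how duplicates were handled.

```r
# Toy data with the original spaced column names (values are illustrative)
breach <- data.frame(
  "Name of Covered Entity" = c("Clinic A", "Clinic A", "Hospital B"),
  "Breach Submission Date" = c("01/15/2015", "01/15/2015", "11/03/2016"),
  "Web Description" = c(NA, NA, "Laptop stolen"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Count missing values per column
na_counts <- colSums(is.na(breach))

# Remove spaces from the column names
names(breach) <- gsub(" ", "", names(breach), fixed = TRUE)

# Parse MM/DD/YYYY strings into Date objects
breach$BreachSubmissionDate <- as.Date(breach$BreachSubmissionDate,
                                       format = "%m/%d/%Y")

# Drop exact duplicate rows (an assumed strategy, see the note above)
breach <- breach[!duplicated(breach), ]
```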

Each record is identified as having at least 1 of 7 types of breach:

  1. Hacking/IT

  2. Improper Disposal

  3. Loss

  4. Theft

  5. Unauthorized Access/Disclosure

  6. Unknown

  7. Other
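Because a record can carry more than one breach type, one way to analyze the types is to split each record into one row per type. The sketch below uses base R and assumes the types are separated by a comma within the field; that separator, the column name `TypeofBreach`, and the toy values are assumptions, not details confirmed by the data.

```r
# Toy records; the second column holds one or more breach types
breach <- data.frame(
  NameofCoveredEntity = c("Clinic A", "Hospital B"),
  TypeofBreach = c("Theft, Loss", "Hacking/IT"),
  stringsAsFactors = FALSE
)

# Split each field on the assumed comma separator
types <- strsplit(breach$TypeofBreach, ",\\s*")

# Repeat each record once per breach type it lists
breach_long <- breach[rep(seq_len(nrow(breach)), lengths(types)), ]
breach_long$TypeofBreach <- unlist(types)
rownames(breach_long) <- NULL

# Frequency of each breach type
table(breach_long$TypeofBreach)
```

Splitting on the comma rather than the slash keeps types such as Hacking/IT and Unauthorized Access/Disclosure intact.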

3.4 The Combined Data

breach %>% 
  group_by(State) %>% 
  datatable()

3.5 Summary

Data Analysis:

4.1