Introduction

The purpose of this is to create a more visual report of OCR Reports! The data used for this includes the archive of all completed investigations as well as those that are still in progress. This will show which entities have the most reports, what areas of the country, as well as how many people are affected.

Packages Required

  library(tidyverse) #to take advantage of all the capabilities of the tidyverse
  library(dplyr)     #Allows for easy manipulation of dataframes to work with variables
  library(DT)        #making pretty javascript data tables

Downloading Data

Breach_archive <- read_csv("http://asayanalytics.com/breach_archive_csv")
Breach_investigations <- read_csv("https://asayanalytics.com/breach_investigation_csv")

Data Preparation

This shows whether the investigation is still open or finished

  Breach_archive$Closed <- "Yes"
  Breach_investigations$Closed <- "No"

Merge of archived and current investigations datasets

Total_Breaches <-rbind(Breach_archive,Breach_investigations)
# Removing Duplicate rows 
Breach1<-Total_Breaches %>% distinct()
# Investigating Missing Data
colSums(is.na(Total_Breaches)) 
##           Name of Covered Entity                            State 
##                                0                                3 
##              Covered Entity Type             Individuals Affected 
##                                3                                1 
##           Breach Submission Date                   Type of Breach 
##                                0                                1 
## Location of Breached Information       Business Associate Present 
##                                0                                0 
##                  Web Description                           Closed 
##                              742                                0
#Removing rows that have less than ten missing data points, not "Web Description"
Breach_nona <- Breach1 %>% drop_na(`Individuals Affected`,`Type of Breach`,`Covered Entity Type`,`State`)
# Dummy Variable Creations
Breach2 <- Breach_nona %>%
  separate_rows(`Type of Breach`, sep = ",") %>%
  mutate(`Type of Breach` =  `Type of Breach`) %>%
  mutate(value = 1) %>%
  spread(`Type of Breach`, value , fill = 0)

Breach3 <- Breach2 %>%
  separate_rows(`Location of Breached Information`, sep = ",") %>%
  mutate(`Location of Breached Information` =  `Location of Breached Information`) %>%
  mutate(value = 1) %>%
  spread(`Location of Breached Information`, value , fill = 0)

#Change data types
Breach3$DateFormat <-as.Date(Breach3$`Breach Submission Date`,format = "%m/%d/%Y")
#Create Year Column 
Breach3$Year <- format(as.Date(Breach3$`Breach Submission Date`,format = "%m/%d/%Y"),"%Y")

Variables Included in Analysis

Variable Description
Name of Covered Entity Organization responsible for the data breached
State State where the breach is located
Covered Entity Type Type of organization
Individuals Affected Number of health records affected
Breach Submission Date Date the breach was reported
Type of Breach Dummy variable to show how access to records was gained.
Location of Breached Information Dummy variable to show where breach records were kept.
Busienss Associate Present Whether an employee was involved in the breach
Web Description Statement on what occurred during the breach, if available.
Closed If the investigation has been completed or not

There are 2446 total observations included after removing duplicates and NAs. NAs within the “Web Description” column were ignored because it is a subjective column and not used for data analysis.

Data Table

Required Data Analysis • Chart: “Average Healthcare Data Breach Size by Year” (with the top 5% of outliers omitted) • Table: “Largest healthcare data breaches” (including all breaches under investigation in 2017-18)

• Chart: “Hacking / IT Incidents by year”

• Table: “Breaches by Entity Type”

## # A tibble: 4 x 2
##   `Covered Entity Type`         n
##   <chr>                     <int>
## 1 Business Associate          354
## 2 Health Plan                 322
## 3 Healthcare Clearing House     4
## 4 Healthcare Provider        1766

• On what day of the week (Sunday, Monday, etc.) are breaches most often reported?

• How has the type of breach (hacking, improper disposal, loss, etc.) changed for each year? For example, are hacking / IT Incidents more prevalent in 2016 than they were in 2010?

I tried my hardest.

Exploratory Data Analysis