The purpose of this is to create a more visual report of OCR Reports! The data used for this includes the archive of all completed investigations as well as those that are still in progress. This will show which entities have the most reports, what areas of the country, as well as how many people are affected.
library(tidyverse) #to take advantage of all the capabilities of the tidyverse
library(dplyr) #Allows for easy manipulation of dataframes to work with variables
library(DT) #making pretty javascript data tables
Breach_archive <- read_csv("http://asayanalytics.com/breach_archive_csv")
Breach_investigations <- read_csv("https://asayanalytics.com/breach_investigation_csv")
Breach_archive$Closed <- "Yes"
Breach_investigations$Closed <- "No"
Total_Breaches <-rbind(Breach_archive,Breach_investigations)
# Removing Duplicate rows
Breach1<-Total_Breaches %>% distinct()
# Investigating Missing Data
colSums(is.na(Total_Breaches))
## Name of Covered Entity State
## 0 3
## Covered Entity Type Individuals Affected
## 3 1
## Breach Submission Date Type of Breach
## 0 1
## Location of Breached Information Business Associate Present
## 0 0
## Web Description Closed
## 742 0
#Removing rows that have less than ten missing data points, not "Web Description"
Breach_nona <- Breach1 %>% drop_na(`Individuals Affected`,`Type of Breach`,`Covered Entity Type`,`State`)
# Dummy Variable Creations
Breach2 <- Breach_nona %>%
separate_rows(`Type of Breach`, sep = ",") %>%
mutate(`Type of Breach` = `Type of Breach`) %>%
mutate(value = 1) %>%
spread(`Type of Breach`, value , fill = 0)
Breach3 <- Breach2 %>%
separate_rows(`Location of Breached Information`, sep = ",") %>%
mutate(`Location of Breached Information` = `Location of Breached Information`) %>%
mutate(value = 1) %>%
spread(`Location of Breached Information`, value , fill = 0)
#Change data types
Breach3$DateFormat <-as.Date(Breach3$`Breach Submission Date`,format = "%m/%d/%Y")
#Create Year Column
Breach3$Year <- format(as.Date(Breach3$`Breach Submission Date`,format = "%m/%d/%Y"),"%Y")
| Variable | Description |
|---|---|
| Name of Covered Entity | Organization responsible for the data breached |
| State | State where the breach is located |
| Covered Entity Type | Type of organization |
| Individuals Affected | Number of health records affected |
| Breach Submission Date | Date the breach was reported |
| Type of Breach | Dummy variable to show how access to records was gained. |
| Location of Breached Information | Dummy variable to show where breach records were kept. |
| Busienss Associate Present | Whether an employee was involved in the breach |
| Web Description | Statement on what occurred during the breach, if available. |
| Closed | If the investigation has been completed or not |
There are 2446 total observations included after removing duplicates and NAs. NAs within the “Web Description” column were ignored because it is a subjective column and not used for data analysis.
• Chart: “Hacking / IT Incidents by year”
• Table: “Breaches by Entity Type”
## # A tibble: 4 x 2
## `Covered Entity Type` n
## <chr> <int>
## 1 Business Associate 354
## 2 Health Plan 322
## 3 Healthcare Clearing House 4
## 4 Healthcare Provider 1766
• On what day of the week (Sunday, Monday, etc.) are breaches most often reported?
• How has the type of breach (hacking, improper disposal, loss, etc.) changed for each year? For example, are hacking / IT Incidents more prevalent in 2016 than they were in 2010?