This is a document for the practice exam for BAIS 660.
The data used in this document come from archived and investigated healthcare data breach cases.
I will examine the data to see whether there are any patterns between the recorded variables and healthcare breaches.
This can help consumers recognize these patterns and avoid healthcare providers that may leak their information.
I have loaded the ggplot2 and tidyverse packages (ggplot2 is also part of the tidyverse). The tidyverse packages will mainly be used to clean and manipulate the data, while ggplot2 will be used for visualizations. The DT package will be used for tables, the lubridate package for working with dates, and the xts package for working with time series objects.
Check for missing values by applying the any and is.na functions to each column.
any(is.na(breach_df$entity))
## [1] FALSE
any(is.na(breach_df$state))
## [1] TRUE
any(is.na(breach_df$entity_type))
## [1] TRUE
any(is.na(breach_df$number_affected))
## [1] TRUE
any(is.na(breach_df$breach_date))
## [1] FALSE
any(is.na(breach_df$type_of_breach))
## [1] TRUE
any(is.na(breach_df$location_of_breach))
## [1] FALSE
any(is.na(breach_df$ba_present))
## [1] FALSE
any(is.na(breach_df$web_description))
## [1] TRUE
any(is.na(breach_df$status))
## [1] FALSE
It appears there are missing values in 5 of the columns.
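The column-by-column check above can also be written as a single call. The following is a minimal sketch on a small made-up data frame; `toy_df` and its values are hypothetical stand-ins for `breach_df`, not the real data.

```r
# Toy stand-in for breach_df (hypothetical values, for illustration only)
toy_df <- data.frame(
  entity = c("Clinic A", "Hospital B", "Insurer C"),
  state = c("IA", NA, "IL"),
  number_affected = c(500, 1200, NA),
  stringsAsFactors = FALSE
)

# Count the missing values in every column at once
na_counts <- colSums(is.na(toy_df))
na_counts

# Names of the columns that contain at least one missing value
cols_with_na <- names(na_counts)[na_counts > 0]
cols_with_na
```

Compared with `any(is.na(...))`, this approach also reports how many values are missing in each column, which helps when deciding whether dropping rows is acceptable.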
Use the na.omit function to remove every row that contains a missing value, then use the duplicated function to detect duplicates.
## [1] 0
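A minimal sketch of this cleaning step is shown here on a made-up data frame. `toy_df` is hypothetical, and unlike the real data it deliberately contains one duplicate row so that the effect is visible.

```r
# Toy stand-in for breach_df (hypothetical values, for illustration only)
toy_df <- data.frame(
  entity = c("Clinic A", "Clinic A", "Hospital B", "Insurer C"),
  state = c("IA", "IA", NA, "IL"),
  number_affected = c(500, 500, 1200, 3000),
  stringsAsFactors = FALSE
)

# Drop every row that has a missing value in any column
clean_df <- na.omit(toy_df)

# Count fully duplicated rows (the first occurrence is not flagged)
n_dupes <- sum(duplicated(clean_df))
n_dupes

# Keep only the first occurrence of each row
dedup_df <- clean_df[!duplicated(clean_df), ]
nrow(dedup_df)
```

Note that `na.omit` drops a row if any column is missing, so a column with many missing values can remove a large share of the data.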
Then use the str_detect function from the stringr package to find which entries in the type of breach column list more than one breach type. Multiple types are separated by commas, so searching for a comma identifies them. Then use the resulting Boolean vector to replace those entries.
Do the same for the location of breach column.
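The comma search and replacement might look like the sketch below. The example values and the replacement label "Multiple" are assumptions made for illustration; the label actually used in this analysis is not shown in the text.

```r
library(stringr)

# Hypothetical example values for the type_of_breach column
type_of_breach <- c("Theft", "Hacking/IT Incident, Theft", "Loss", "Theft, Loss")

# TRUE where the entry lists more than one breach type (comma-separated)
has_multiple <- str_detect(type_of_breach, ",")

# Replace multi-type entries with a single label ("Multiple" is an assumed label)
type_of_breach[has_multiple] <- "Multiple"
type_of_breach
```

The same two steps apply to the location column by swapping in that vector.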
Observations:
## [1] 1708
Number of variables:
## [1] 10
The data has 10 variables and 1708 observations. The entity column holds the name of the entity responsible for the breach, and the state column holds the state where the entity is located. The entity type column describes what kind of business the entity is in, and the number affected column is the total number of people affected by the breach. The breach date is when the breach took place. The type of breach is how the information was exposed, and the location of breach is the device or medium involved. Lastly, the ba present column indicates whether or not a business associate was present during the breach.
It appears hacking has increased over time.
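One way to check a trend like this numerically is to count breaches per year and type. The sketch below uses a made-up data frame and assumes `breach_date` is already stored as a Date; the values are hypothetical and do not reproduce the real counts.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for breach_df (hypothetical values, for illustration only)
toy_df <- data.frame(
  breach_date = as.Date(c("2010-03-01", "2010-07-15", "2014-02-10",
                          "2014-09-30", "2014-11-05")),
  type_of_breach = c("Theft", "Hacking/IT Incident", "Hacking/IT Incident",
                     "Hacking/IT Incident", "Theft"),
  stringsAsFactors = FALSE
)

# Extract the year from each date, then count breaches per year and type
yearly_counts <- toy_df %>%
  mutate(breach_year = year(breach_date)) %>%
  count(breach_year, type_of_breach, name = "n_breaches")

yearly_counts
```

A table like `yearly_counts` can then be passed to ggplot2 to draw one line per breach type.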
Below are two tables showing summary statistics for the number affected, one by location of breach and one by type of breach.
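Grouped summary statistics of this kind can be computed with dplyr, as in the sketch below. The data frame is made up for illustration, and the particular statistics chosen (count, total, mean, median) are an assumption, not necessarily the ones reported in the tables.

```r
library(dplyr)

# Toy stand-in for breach_df (hypothetical values, for illustration only)
toy_df <- data.frame(
  type_of_breach = c("Theft", "Theft", "Loss", "Other"),
  number_affected = c(500, 1500, 800, 4000),
  stringsAsFactors = FALSE
)

# One row of summary statistics per breach type
summary_by_type <- toy_df %>%
  group_by(type_of_breach) %>%
  summarise(
    n_breaches = n(),
    total_affected = sum(number_affected),
    mean_affected = mean(number_affected),
    median_affected = median(number_affected)
  )

summary_by_type
```

Replacing `type_of_breach` with the location column in `group_by` gives the second table.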
Based on the figures above, it appears that theft and “other” affect the largest number of people. Let's see how these changed over time.
Based on the plot above, it looks like a very large number of people were affected by theft from other locations.