Practice Exam

Introduction (1.1-1.4):

Purpose of Analysis: The purpose of this analysis is to analyze and summarize data on breaches of unsecured protected health information (PHI) reported by the Office for Civil Rights (OCR).

Explanation of Data: The data used in this analysis is collected by the OCR, which is responsible for collecting and reporting disclosures of protected health information (PHI) as mandated by law. Additionally, the OCR posts cases where a covered entity (CE) has a breach that affects 500 or more individuals. The data reported for each of these breaches includes State, Covered Entity Type, Breach Submission Date, Type of Breach, and other fields.

Proposed Approach/Analytical Technique: I will take this data and create data visualizations that give a more in-depth understanding of the types of breach, the individuals affected by each breach, and the location of the breached information.

How will my analysis help? It will allow other users to interpret trends and draw insights from data on protected health information (PHI) breaches.

Packages Required (2.1-2.3):

The packages that will be used during this analysis include:

dplyr: Provides a flexible grammar of data manipulation. It is similar to SQL in that it helps manipulate datasets so the data is easy to use in R.

tidyverse: Designed to make it easy to install and load multiple tidyverse packages in a single step.

ggplot2: Lets you map variables to aesthetics, choose which graphical primitives to use, and plot data using different visualizations.

DT: Allows you to create an HTML widget to display data from your dataset using the JavaScript library DataTables.

skimr: Provides an alternative to the default summary functions within R and offers human-readable output.

## -- Attaching packages -------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Data Preparation (3.1-3.5):

Importing Data Sets and Creating the Table:

3.1: Importing the Dataset

## Parsed with column specification:
## cols(
##   `Name of Covered Entity` = col_character(),
##   State = col_character(),
##   `Covered Entity Type` = col_character(),
##   `Individuals Affected` = col_double(),
##   `Breach Submission Date` = col_character(),
##   `Type of Breach` = col_character(),
##   `Location of Breached Information` = col_character(),
##   `Business Associate Present` = col_character(),
##   `Web Description` = col_character()
## )
## Parsed with column specification:
## cols(
##   `Name of Covered Entity` = col_character(),
##   State = col_character(),
##   `Covered Entity Type` = col_character(),
##   `Individuals Affected` = col_double(),
##   `Breach Submission Date` = col_character(),
##   `Type of Breach` = col_character(),
##   `Location of Breached Information` = col_character(),
##   `Business Associate Present` = col_character(),
##   `Web Description` = col_logical()
## )

3.2: Combining Data Sets
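
The rendered report does not show the code for this step, so the following is only a minimal sketch. It assumes the two imports correspond to breaches that are not under investigation and breaches that are, and that the `status` column described later is added when they are stacked. The data frame names (`archived`, `under_investigation`, `breaches`) and the rows are hypothetical.

```r
# Hypothetical stand-ins for the two imported tables (names and rows are illustrative).
archived <- data.frame(
  `Name of Covered Entity` = c("Clinic A", "Plan B"),
  `Individuals Affected` = c(1200, 560),
  `Web Description` = c("Laptop stolen.", NA),
  check.names = FALSE, stringsAsFactors = FALSE
)
under_investigation <- data.frame(
  `Name of Covered Entity` = "Hospital C",
  `Individuals Affected` = 8300,
  `Web Description` = NA,  # logical, because every value is missing
  check.names = FALSE, stringsAsFactors = FALSE
)

# Align the one column type that differs between the two imports.
under_investigation$`Web Description` <-
  as.character(under_investigation$`Web Description`)

# Assumed meaning of the flag: 1 = currently under investigation, 0 = not.
archived$status <- 0
under_investigation$status <- 1

# Stack the two tables; rbind() matches data frame columns by name.
breaches <- rbind(archived, under_investigation)
print(breaches)
```

Converting `Web Description` before stacking keeps the combined column as character, which matches the first import's column specification.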

3.3: Cleaning Data Sets

1. Check for Missing Values: Only Web Description has a large number of missing values (742). Every other column has at most 3, so the data is largely complete.
##           Name of Covered Entity                            State 
##                                0                                3 
##              Covered Entity Type             Individuals Affected 
##                                3                                1 
##           Breach Submission Date                   Type of Breach 
##                                0                                1 
## Location of Breached Information       Business Associate Present 
##                                0                                0 
##                  Web Description                           status 
##                              742                                0
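
Per-column counts like the ones above can be produced with base R. This is a sketch on a toy data frame (the rows are invented), not necessarily the call used in the report.

```r
# Toy data frame with a few missing values (illustrative rows only).
breaches <- data.frame(
  State = c("CA", NA, "TX"),
  `Individuals Affected` = c(500, 1200, NA),
  `Web Description` = c(NA, NA, "Mailing error."),
  check.names = FALSE, stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
missing_counts <- colSums(is.na(breaches))
print(missing_counts)
```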

2a. Removing Missing Values

2b. Remove Duplicates
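
The report does not show which columns were used when removing missing values, so the sketch below makes an assumption: rows are dropped only when a chosen set of key columns is missing, and then exact duplicate rows are removed. The column names and `key_columns` are hypothetical.

```r
# Toy data: one exact duplicate row and one row with a missing state.
breaches <- data.frame(
  entity = c("Clinic A", "Clinic A", "Plan B", "Hospital C"),
  state = c("CA", "CA", NA, "TX"),
  affected = c(1200, 1200, 560, 8300),
  stringsAsFactors = FALSE
)

# 2a. Keep rows that are complete in the chosen key columns (assumed choice).
key_columns <- c("state", "affected")
complete <- breaches[complete.cases(breaches[, key_columns]), ]

# 2b. Drop rows that exactly repeat an earlier row.
deduplicated <- complete[!duplicated(complete), ]
print(deduplicated)
```

Restricting the check to key columns avoids discarding most of the data because of a sparsely populated column such as Web Description.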

3. Location of Breach and Type of Breach: Some records list multiple types of breach or multiple locations of breached information, so I will create dummy variables to separate these values.
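
One way to build such dummy variables is sketched below. It assumes that records with several breach types store them as one comma-separated string, and that such records are then grouped under a "Multiple" label. Both are assumptions, and the column names are hypothetical.

```r
# Toy data: the second record names two breach types in one string (assumed format).
breaches <- data.frame(
  type_of_breach = c("Theft",
                     "Theft, Unauthorized Access/Disclosure",
                     "Hacking/IT Incident"),
  stringsAsFactors = FALSE
)

breach_types <- c("Theft", "Hacking/IT Incident", "Unauthorized Access/Disclosure")

# One 0/1 dummy column per breach type; fixed = TRUE matches the text literally.
for (type in breach_types) {
  breaches[[paste0("is_", type)]] <-
    as.integer(grepl(type, breaches$type_of_breach, fixed = TRUE))
}

# Records that match more than one type are collapsed into "Multiple".
n_types <- rowSums(breaches[paste0("is_", breach_types)])
breaches$type_grouped <- ifelse(n_types > 1, "Multiple", breaches$type_of_breach)
print(breaches)
```

The same pattern would apply to Location of Breached Information with its own list of categories.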

3.3: Source Data is Explained:

Name of Covered Entity: Organization responsible for the PHI

State: US state where the breach was reported

Covered Entity Type: Type of organization responsible for the PHI

Individuals Affected: Number of records affected by the breach

Breach Submission Date: Date the breach was reported by the CE

Type of Breach: How unauthorized access to the PHI was obtained

Location of Breached Information: Where the PHI was when unauthorized access was obtained

Business Associate Present: Whether a business associate, such as a consultant or contractor, was involved in the breach

Web Description: An optional statement explaining what happened and the resolution

Status: 1 if currently under investigation, 0 if not currently under investigation

Year: Year of the reported breach

NA = Data that is missing

There are a total of 2,402 observations in this data.

3.4: Data Table

3.5: Summary Information of Data

Number of Individuals Affected Summary

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      500      988     2286    76680     7920 78800000

Minimum Number of Individuals Affected: 500

Average Number of Individuals Affected: 76,680

Maximum Number of Individuals Affected: 78,800,000
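
The mean is far above the median here, which indicates a strongly right-skewed distribution driven by a few very large breaches. The sketch below illustrates this with a toy vector built from the quartile figures above (it is not the real data).

```r
# Toy vector: mostly small breaches plus one extreme value.
affected <- c(500, 988, 2286, 7920, 78800000)

# summary() returns Min., 1st Qu., Median, Mean, 3rd Qu., and Max.
stats <- summary(affected)
print(stats)

# A single extreme value pulls the mean well above the median.
mean_exceeds_median <- mean(affected) > median(affected)
print(mean_exceeds_median)
```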

Number of Breaches That Are Under Investigation (1) or Not (0)

## 
##    0    1 
## 2004  398

Mean Number of Individuals Affected by Type of Breach

## # A tibble: 8 x 2
##   `Type of Breach`                  mean
##   <fct>                            <dbl>
## 1 Hacking/IT Incident            276568.
## 2 Improper Disposal               17784.
## 3 Loss                            51767.
## 4 Multiple                        12846.
## 5 Other                           12268.
## 6 Theft                           24645.
## 7 Unauthorized Access/Disclosure  13215.
## 8 Unknown                        191669.

Hacking/IT Incident had the highest mean number of individuals affected (about 276,568), followed by Unknown (about 191,669).
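
A grouped mean like the one above can be computed with dplyr. This is a sketch on toy data with hypothetical column names, not the report's own code.

```r
library(dplyr)

# Toy data (invented values).
breaches <- tibble(
  type_of_breach = c("Hacking/IT Incident", "Hacking/IT Incident", "Theft", "Theft"),
  affected = c(1000, 3000, 600, 800)
)

# Mean number of individuals affected per breach type, largest first.
mean_by_type <- breaches %>%
  group_by(type_of_breach) %>%
  summarise(mean = mean(affected, na.rm = TRUE)) %>%
  arrange(desc(mean))

print(mean_by_type)
```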

Required Data Analysis (4.1)

Creating Percentile Column
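
The code for this step is not shown in the rendered report. One possible approach, sketched here on toy data, is to rank each breach by size with `dplyr::percent_rank()` and keep rows at or below the 95th percentile. The column names and the exact cutoff rule are assumptions.

```r
library(dplyr)

# Toy data: 19 ordinary breaches and one extreme outlier.
breaches <- tibble(
  entity = paste("Entity", 1:20),
  affected = c(seq(500, 9500, by = 500), 1000000)
)

# percent_rank() rescales ranks to the range [0, 1].
breaches <- breaches %>%
  mutate(percentile = percent_rank(affected))

# Omit the top 5% of breaches by size.
trimmed <- breaches %>%
  filter(percentile <= 0.95)

print(max(trimmed$affected))
```

Trimming on a percentile column keeps a handful of very large breaches from dominating yearly counts and averages.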

Question 1: Number of Reported Breaches (With the top 5% Omitted)

Question 2: “Average Healthcare Data Breach Size by Year” (with the top 5% of outliers omitted)

Question 3: “Largest healthcare data breaches” (including all breaches under investigation in 2017-18)

| Name of Covered Entity | State | Individuals Affected | Status | Year |
|---|---|---|---|---|
| University of California, Los Angeles Health | CA | 4500000 | 0 | 2015 |
| Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Group | IL | 4029530 | 0 | 2013 |
| Banner Health | AZ | 3620000 | 0 | 2016 |
| 21st Century Oncology | FL | 2213597 | 0 | 2016 |
| The Nemours Foundation | FL | 1055489 | 0 | 2011 |
| Sutter Medical Foundation | AL | 943434 | 0 | 2011 |
| Valley Anesthesiology Consultants, Inc. d/b/a Valley Anesthesiology and Pain Consultants | AZ | 882590 | 0 | 2016 |
| County of Los Angeles Departments of Health and Mental Health | CA | 749017 | 0 | 2016 |
| AHMC Healthcare Inc. and affiliated Hospitals | CA | 729000 | 0 | 2013 |
| Commonwealth Health Corporation | KY | 697800 | 0 | 2017 |
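
A "largest breaches" listing can be obtained by sorting on the number of individuals affected and keeping the first rows. The sketch below uses toy data and hypothetical column names, and it does not reproduce whatever filtering on status or year the report applied.

```r
library(dplyr)

# Toy data (invented entities).
breaches <- tibble(
  entity = c("A", "B", "C", "D"),
  affected = c(4500000, 700, 3620000, 882590),
  status = c(0, 1, 0, 0),
  year = c(2015, 2018, 2016, 2016)
)

# Sort from largest to smallest and keep the top three rows.
largest <- breaches %>%
  arrange(desc(affected)) %>%
  head(3)

print(largest)
```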

Question 4: “Hacking / IT Incidents by year”

Question 5: “Breaches by Entity Type”

| Covered Entity Type | Number of Breaches |
|---|---|
| Business Associate | 342 |
| Health Plan | 317 |
| Healthcare Clearing House | 4 |
| Healthcare Provider | 1736 |
| NA | 3 |
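
Counts per entity type, including a row for missing values, can be produced with `dplyr::count()`. This is a sketch on toy data with hypothetical column names.

```r
library(dplyr)

# Toy data with one missing entity type.
breaches <- tibble(
  entity_type = c("Healthcare Provider", "Health Plan", "Healthcare Provider", NA)
)

# count() keeps NA as its own group, so missing types stay visible.
counts <- breaches %>%
  count(entity_type, name = "n_breaches")

print(counts)
```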

Exploratory Data Analysis (5.1-5.4)

Question 1: How many individuals were affected based on the Type of Breach and whether a Business Associate was present?

Many individuals were affected by theft, regardless of whether a business associate was present. However, for each type of breach, more individuals were affected when no business associate was present than when one was.

Question 2: How do the number of individuals affected and the number of current cases compare across states?

State Number Affected Current.Cases
AK 75785 2
AL 1136366 3
AR 488643 6
AZ 4792005 7
CA 9950736 30
CO 283189 10
CT 310031 5
DC 40441 1
DE 49638 2
FL 6768288 18
GA 3021254 10
HI 54462 1
IA 1518108 6
ID 19786 1
IL 4847583 22
IN 84080000 13
KS 230376 9
KY 1042879 7
LA 160865 2
MA 405578 18
MD 2777562 7
ME 10063 1
MI 998365 20
MN 390000 10
MO 821007 17
MS 151580 2
MT 1174195 2
NC 518155 7
ND 17515 2
NE 194406 9
NH 256420 2
NJ 3263956 15
NM 76867 4
NV 117285 5
NY 17148970 20
OH 859262 11
OK 632455 2
OR 420707 8
PA 1850138 10
PR 1704916 0
RI 103750 4
SC 765657 0
SD 35640 1
TN 6960836 11
TX 4630630 23
UT 895632 4
VA 5926242 5
VT 6806 1
WA 11774508 5
WI 254458 14
WV 82653 0
WY 48532 2
NA 39426 1

Indiana has by far the most individuals affected because it has the largest single current case in the data: The Anthem, Inc. Affiliated Covered Entity reported that about 78,800,000 individuals were affected. Usually, the larger the state, the more individuals are affected, and the smaller the state, the fewer.
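
A per-state table like the one above can be built with a grouped summary. The sketch below assumes that the affected total covers all breaches in a state while the case count covers only rows with `status == 1`, which is consistent with states that show individuals affected but zero current cases. The data and column names are hypothetical.

```r
library(dplyr)

# Toy data (invented rows).
breaches <- tibble(
  state = c("IN", "IN", "CA", "PR"),
  affected = c(78800000, 1000, 4500000, 2000),
  status = c(1, 0, 0, 0)
)

# Total individuals affected and number of current cases per state.
by_state <- breaches %>%
  group_by(state) %>%
  summarise(
    number_affected = sum(affected, na.rm = TRUE),
    current_cases = sum(status == 1)
  ) %>%
  arrange(state)

print(by_state)
```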

Question 3: Is there a relationship between the location of the breached information and the number of individuals affected, depending on the presence of a Business Associate?

In terms of the location of breached information, Paper/Films affected the most individuals. Additionally, regardless of the location of the breached information, more individuals were affected when no business associate was present than when one was.