#The purpose is to investigate data breaches reported to the Office for Civil Rights at the US Department of Health and Human Services (HHS). #We can analyze the cases that the office has completed and also look at the cases that are ongoing. #The data identifies the company responsible for each breach and the type of entity or industry the company works in.

#This data covers both completed and currently investigated data breaches from the Office for Civil Rights at the US Department of Health and Human Services. For each breach it gives the name of the covered entity, the state it happened in, the entity type, how many individuals were affected, when the breach was submitted, the type of breach, the location of the breached information, whether a business associate was involved, and the web description.

#By using visualization, data wrangling, and comparative analysis, we can compare the cases under investigation with the completed cases to find similarities among these HHS breaches. This analysis will let consumers know about data breaches that most people aren’t aware of and what to look out for in the future.

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
## Loading required package: gsubfn
## Loading required package: proto
## Could not load tcltk.  Will use slower R code instead.
## Loading required package: RSQLite

#Data Source

HHS_completed <- read_csv("http://asayanalytics.com/breach_archive_csv")
## Parsed with column specification:
## cols(
##   `Name of Covered Entity` = col_character(),
##   State = col_character(),
##   `Covered Entity Type` = col_character(),
##   `Individuals Affected` = col_double(),
##   `Breach Submission Date` = col_character(),
##   `Type of Breach` = col_character(),
##   `Location of Breached Information` = col_character(),
##   `Business Associate Present` = col_character(),
##   `Web Description` = col_character()
## )
HHS_investigating <- read_csv("https://asayanalytics.com/breach_investigation_csv")
## Parsed with column specification:
## cols(
##   `Name of Covered Entity` = col_character(),
##   State = col_character(),
##   `Covered Entity Type` = col_character(),
##   `Individuals Affected` = col_double(),
##   `Breach Submission Date` = col_character(),
##   `Type of Breach` = col_character(),
##   `Location of Breached Information` = col_character(),
##   `Business Associate Present` = col_character(),
##   `Web Description` = col_logical()
## )

#Data Wrangling

#I added a column that records whether each case is still under investigation or not. Then I combined the two data frames into one, since they have the same column headers. I also removed a duplicated Anthem record. Finally, I split both Location of Breached Information and Type of Breach into individual indicator columns to get a better representation of the data.
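
The wrangling steps above could be sketched roughly as follows. This is not the original code, which isn’t shown in this report. The toy rows, the `Complete` values, and the indicator column names are assumptions for illustration, and `distinct()` only removes exact duplicate rows, which may not be how the Anthem duplicate was actually handled.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for the two HHS files (all values are made up).
completed <- tibble(
  `Name of Covered Entity` = c("Clinic A", "Plan B", "Plan B"),
  `Type of Breach` = c("Theft", "Hacking/IT Incident", "Hacking/IT Incident"),
  `Web Description` = c("Laptop stolen.", NA, NA)
)
investigating <- tibble(
  `Name of Covered Entity` = "Hospital C",
  `Type of Breach` = "Loss, Theft",
  `Web Description` = NA  # logical, as in the parsed investigation file
)

breaches <- bind_rows(
  mutate(completed, Complete = "Yes"),
  mutate(investigating,
         Complete = "No",
         # make the column types match before stacking the two files
         `Web Description` = as.character(`Web Description`))
) %>%
  distinct() %>%  # drops rows that are exact duplicates
  mutate(
    # one TRUE/FALSE indicator per breach type (column names are assumed)
    Theft = str_detect(`Type of Breach`, fixed("Theft")),
    Loss = str_detect(`Type of Breach`, fixed("Loss")),
    HackingorIT = str_detect(`Type of Breach`, fixed("Hacking/IT Incident"))
  )

print(breaches)
```

The same `str_detect()` pattern would apply to the Location of Breached Information column, with one indicator per location.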


#Data Dictionary
• Name of the covered entity (Organization responsible for the PHI)
• State (US state where the breach was reported)
• Covered Entity Type (Type of organization responsible for the PHI)
• Individuals Affected (Number of records affected by the breach)
• Breach submission date (Date the breach was reported by the CE)
• Type of breach (How unauthorized access to the PHI was obtained)
• Location of breached information (Where the PHI was when unauthorized access was obtained)
• Business associate present (Was a business associate such as a consultant or contractor involved in the breach)
• Web description (An optional statement explaining what happened and the resolution)
• Hackingor IT (Was it a hacking or IT breach)
• Improperdisposal (Was it an improper disposal breach)
• Loss (Was it a loss breach)
• Theft (Was it a theft breach)
• Unauthorizedaccessordisclosure (Was it an unauthorized access/disclosure breach)
• Unknowed (Was it an unknown breach)
• Other (Was it another type of breach)
• desktop (Was the breach location a desktop computer)
• electronicmedicalrecord (Was the breach location an electronic medical record)
• Email (Was the breach location email)
• network (Was the breach location a network server)
• otherportaleelectronics (Was the breach location another portable electronic device)
• Paperorfilm (Was the breach location paper or film)
• otherslocations (Was the breach in some other location)

#4.1
• Chart: “Number of Reported Breaches” (with the top 5% of outliers omitted)
• Chart: “Average Healthcare Data Breach Size by Year” (with the top 5% of outliers omitted)
• Table: “Largest healthcare data breaches” (including all breaches under investigation in 2017-18)
• Chart: “Hacking / IT Incidents by year”
• Table: “Breaches by Entity Type”

#I am curious why we had so many breaches between 2013 and 2017. I’m glad to see the number went down in 2018, and I’m curious to see what the 2019 results would be. Hopefully the numbers continued to decrease.

## # A tibble: 1 x 1
##   Percentage
##        <dbl>
## 1   2069198.

#I was surprised to see that health care is still heavily affected even when the top 5% of outliers aren’t included. I didn’t realize how many individuals were affected.
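
A minimal sketch of the outlier trimming behind the first two charts, assuming “top 5%” means breaches above the 95th percentile of `Individuals Affected` and that submission dates are stored as month/day/year strings. The toy rows are made up (the one very large value mimics the Anthem breach), and the names `by_year`, `Breaches`, and `AverageSize` are hypothetical.

```r
library(dplyr)

# Toy data: five ordinary breaches and one extreme outlier.
toy <- tibble(
  `Breach Submission Date` = c("01/15/2016", "03/02/2016", "07/30/2017",
                               "11/05/2017", "12/20/2017", "02/14/2018"),
  `Individuals Affected` = c(500, 1200, 800, 78800000, 2500, 950)
)

# 95th percentile of breach size; anything above it is treated as an outlier.
cutoff <- quantile(toy$`Individuals Affected`, 0.95, na.rm = TRUE)

by_year <- toy %>%
  filter(`Individuals Affected` <= cutoff) %>%
  mutate(
    # assumed date format: month/day/year
    Onlyyear = as.integer(format(
      as.Date(`Breach Submission Date`, format = "%m/%d/%Y"), "%Y"))
  ) %>%
  group_by(Onlyyear) %>%
  summarise(
    Breaches = n(),                              # number of reported breaches
    AverageSize = mean(`Individuals Affected`)   # average breach size
  )

print(by_year)
```

With the outlier removed, the yearly averages are driven by typical breaches rather than by a single very large one.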

#Largest healthcare data breaches

| companyname | Department | Complete | Onlyyear | Databreach |
| --- | --- | --- | --- | --- |
| Anthem, Inc. Affiliated Covered Entity | Health Plan | No | 2015 | 78800000 |
| Premera Blue Cross | Health Plan | No | 2015 | 11000000 |
| Excellus Health Plan, Inc. | Health Plan | No | 2015 | 10000000 |
| Science Applications International Corporation (SA | Business Associate | No | 2011 | 4900000 |
| University of California, Los Angeles Health | Healthcare Provider | No | 2015 | 4500000 |
| Community Health Systems Professional Services Corporations | Business Associate | No | 2014 | 4500000 |
| Community Health Systems Professional Services Corporation | Business Associate | No | 2014 | 4500000 |
| Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Group | Healthcare Provider | No | 2013 | 4029530 |
| Medical Informatics Engineering | Business Associate | No | 2015 | 3900000 |
| Banner Health | Healthcare Provider | No | 2016 | 3620000 |
| Newkirk Products, Inc. | Business Associate | No | 2016 | 3466120 |
| 21st Century Oncology | Healthcare Provider | No | 2016 | 2213597 |
| Xerox State Healthcare, LLC | Business Associate | No | 2014 | 2000000 |
| IBM | Business Associate | No | 2011 | 1900000 |
| GRM Information Management Services | Business Associate | No | 2011 | 1700000 |
| AvMed, Inc. | Health Plan | No | 2010 | 1220000 |
| CareFirst BlueCross BlueShield | Health Plan | No | 2015 | 1100000 |
| Montana Department of Public Health & Human Services | Health Plan | No | 2014 | 1062509 |
| The Nemours Foundation | Healthcare Provider | No | 2011 | 1055489 |
| BlueCross BlueShield of Tennessee, Inc. | Health Plan | No | 2010 | 1023209 |
| Sutter Medical Foundation | Healthcare Provider | No | 2011 | 943434 |
| Valley Anesthesiology Consultants, Inc. d/b/a Valley Anesthesiology and Pain Consultants | Healthcare Provider | No | 2016 | 882590 |
| Horizon Healthcare Services, Inc., doing business as Horizon Blue Cross Blue Shield of New Jersey, and its affiliates | Business Associate | No | 2014 | 839711 |
| Iron Mountain Data Products, Inc. (now known as | Business Associate | No | 2010 | 800000 |
| Utah Department of Technology Services | Business Associate | No | 2012 | 780000 |
#If I were a consumer who does business with these companies, I would be worried. Some of these are very big providers.

#There has been something close to exponential growth in hacking/IT incidents from 2010 to 2017. Luckily, the number declined in 2018.

#Breaches by Entity Type

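| Covered Entity Type | Count |
| --- | --- |
| Business Associate | 355 |
| Health Plan | 325 |
| Healthcare Clearing House | 4 |
| Healthcare Provider | 1767 |

#It’s striking to see the wide range in the number of breaches across entity types: healthcare providers account for 1,767 breaches, while healthcare clearing houses account for only 4.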

On what day of the week (Sunday, Monday, etc.) are breaches most often reported?

| DayOnly | Count |
| --- | --- |
| Friday | 767 |
| Thursday | 434 |
| Tuesday | 407 |
| Monday | 394 |
| Wednesday | 384 |
| Saturday | 42 |
| Sunday | 26 |

#I’m actually not surprised that Friday had the most breach reports: the one day everyone tries to leave early or not work has the most breaches.
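
A small sketch of how the day-of-week counts could be produced, assuming the submission dates are month/day/year strings. The toy dates are made up. Weekday names are looked up from the ISO weekday number (`%u`, where Monday is 1) rather than taken from `weekdays()`, so the result does not depend on the machine’s locale.

```r
library(dplyr)

# Toy submission dates (assumed month/day/year format).
toy <- tibble(
  `Breach Submission Date` = c("03/02/2018", "03/09/2018",
                               "03/07/2018", "03/10/2018")
)

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

by_day <- toy %>%
  mutate(
    date = as.Date(`Breach Submission Date`, format = "%m/%d/%Y"),
    # %u gives the ISO weekday number as text: "1" = Monday ... "7" = Sunday
    DayOnly = day_names[as.integer(format(date, "%u"))]
  ) %>%
  count(DayOnly, sort = TRUE, name = "Count")  # most frequent day first

print(by_day)
```

Note that this counts the day a breach was submitted to HHS, which is not necessarily the day the breach happened.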

How has the type of breach (hacking, improper disposal, loss, etc.) changed for each year?

| BreachType | Onlyyear | Count |
| --- | --- | --- |
| Hacking/IT Incident | 2010 | 8 |
| Hacking/IT Incident | 2011 | 15 |
| Hacking/IT Incident | 2012 | 10 |
| Hacking/IT Incident | 2013 | 27 |
| Hacking/IT Incident | 2014 | 35 |
| Hacking/IT Incident | 2015 | 56 |
| Hacking/IT Incident | 2016 | 113 |
| Hacking/IT Incident | 2017 | 150 |
| Hacking/IT Incident | 2018 | 112 |
| Improper Disposal | 2010 | 8 |
| Improper Disposal | 2011 | 6 |
| Improper Disposal | 2012 | 7 |
| Improper Disposal | 2013 | 12 |
| Improper Disposal | 2014 | 8 |
| Improper Disposal | 2015 | 6 |
| Improper Disposal | 2016 | 7 |
| Improper Disposal | 2017 | 11 |
| Improper Disposal | 2018 | 6 |
| Loss | 2009 | 1 |
| Loss | 2010 | 14 |
| Loss | 2011 | 15 |
| Loss | 2012 | 17 |
| Loss | 2013 | 20 |
| Loss | 2014 | 20 |
| Loss | 2015 | 24 |
| Loss | 2016 | 16 |
| Loss | 2017 | 16 |
| Loss | 2018 | 11 |
| Other | 2009 | 2 |
| Other | 2010 | 21 |
| Other | 2011 | 2 |
| Other | 2012 | 13 |
| Other | 2013 | 16 |
| Other | 2014 | 22 |
| Theft | 2009 | 15 |
| Theft | 2010 | 130 |
| Theft | 2011 | 114 |
| Theft | 2012 | 122 |
| Theft | 2013 | 119 |
| Theft | 2014 | 112 |
| Theft | 2015 | 80 |
| Theft | 2016 | 62 |
| Theft | 2017 | 56 |
| Theft | 2018 | 33 |
| Unauthorized Access/Disclosure | 2010 | 10 |
| Unauthorized Access/Disclosure | 2011 | 29 |
| Unauthorized Access/Disclosure | 2012 | 28 |
| Unauthorized Access/Disclosure | 2013 | 65 |
| Unauthorized Access/Disclosure | 2014 | 86 |
| Unauthorized Access/Disclosure | 2015 | 102 |
| Unauthorized Access/Disclosure | 2016 | 129 |
| Unauthorized Access/Disclosure | 2017 | 126 |
| Unauthorized Access/Disclosure | 2018 | 111 |
| Unknown | 2011 | 7 |
| Unknown | 2013 | 2 |
| Unknown | 2014 | 1 |

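Hacking/IT incidents and unauthorized access/disclosure have increased sharply over the years, while theft has declined steadily since 2010 and improper disposal and loss have stayed roughly flat. The Unknown and Other categories no longer appear after 2014. My guess is that we are getting better at classifying breaches, and that’s why the Unknown category has decreased.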
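
One way the breach-type-by-year counts could be computed is sketched below. It assumes a breach can list several types in a single comma-separated `Type of Breach` value and that a year column (`Onlyyear`) has already been derived. The toy rows are made up, and this is not necessarily how the table above was built.

```r
library(dplyr)
library(tidyr)

# Toy data: one record lists two breach types at once.
toy <- tibble(
  `Type of Breach` = c("Theft", "Loss, Theft",
                       "Hacking/IT Incident", "Hacking/IT Incident"),
  Onlyyear = c(2016L, 2016L, 2017L, 2017L)
)

type_by_year <- toy %>%
  # one row per (breach, type) pair, splitting on commas
  separate_rows(`Type of Breach`, sep = ",\\s*") %>%
  count(BreachType = `Type of Breach`, Onlyyear, name = "Count")

print(type_by_year)
```

A breach that lists two types is counted once under each type, so the counts across types can add up to more than the number of breaches.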

Exploratory Analysis

#Which healthcare providers have had the most breaches reported on Wednesdays in the last 4 years?

This result is interesting. Children’s Mercy Hospital, Massachusetts General Hospital, and MGA Home Healthcare Colorado Inc are the healthcare providers with the most breaches reported on a Wednesday in the last four years. If this is a real pattern rather than chance, we could expect these providers to report breaches on Wednesdays in the future as well.
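
The Wednesday question could be answered with a filter like the sketch below. The toy rows are made up, the column names follow the tables in this report, and “last 4 years” is assumed to mean the four most recent calendar years in the data.

```r
library(dplyr)

# Toy data: provider names and dates are invented for illustration.
toy <- tibble(
  companyname = c("Provider X", "Provider X", "Provider Y", "Plan Z"),
  Department = c("Healthcare Provider", "Healthcare Provider",
                 "Healthcare Provider", "Health Plan"),
  date = as.Date(c("2017-03-01", "2018-03-07", "2018-03-08", "2018-03-07"))
)

latest_year <- as.integer(format(max(toy$date), "%Y"))

wednesday_providers <- toy %>%
  filter(
    Department == "Healthcare Provider",
    format(date, "%u") == "3",                         # ISO weekday 3 = Wednesday
    as.integer(format(date, "%Y")) > latest_year - 4   # four most recent years
  ) %>%
  count(companyname, sort = TRUE, name = "Count")

print(wednesday_providers)
```

With only a handful of Wednesday reports per provider, a result like this is easy to get by chance, so it is worth treating as a lead rather than a confirmed pattern.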

Which states have the most individuals affected by breaches?

## Warning: Removed 1 rows containing missing values (geom_point).

#What is the most common breach type among Theft, Hacking, Improper Disposal, and Loss for each state in the United States?

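| State | Total_Theft | Total_Hacking | Total_Impropperdispodal | Total_Loss |
| --- | --- | --- | --- | --- |
| CA | 140 | 39 | 5 | 21 |
| TX | 78 | 52 | 10 | 15 |
| NY | 57 | 22 | 3 | 10 |
| FL | 56 | 31 | 3 | 11 |
| IL | 40 | 28 | 2 | 10 |
| PA | 34 | 15 | 2 | 9 |
| IN | 29 | 19 | 4 | 1 |
| PR | 25 | 0 | 0 | 0 |
| TN | 24 | 14 | 4 | 6 |
| WA | 24 | 14 | 2 | 2 |
| OH | 23 | 11 | 6 | 8 |
| GA | 22 | 21 | 3 | 5 |
| AZ | 20 | 12 | 3 | 6 |
| MA | 20 | 17 | 2 | 11 |
| KY | 19 | 11 | 1 | 3 |
| MI | 19 | 17 | 0 | 9 |
| VA | 19 | 7 | 2 | 4 |
| NJ | 16 | 12 | 2 | 2 |
| OR | 16 | 11 | 0 | 1 |
| CO | 15 | 12 | 2 | 3 |
| CT | 15 | 6 | 0 | 3 |
| NC | 14 | 14 | 2 | 6 |
| AL | 13 | 10 | 1 | 1 |
| MO | 13 | 15 | 4 | 0 |
| NM | 13 | 5 | 0 | 0 |
| LA | 11 | 6 | 1 | 2 |
| MD | 11 | 19 | 0 | 2 |
| MN | 11 | 11 | 3 | 6 |
| WI | 10 | 12 | 0 | 2 |
| OK | 9 | 8 | 1 | 3 |
| RI | 9 | 1 | 0 | 1 |
| SC | 8 | 2 | 4 | 1 |
| KS | 7 | 5 | 0 | 3 |
| NE | 7 | 7 | 1 | 1 |
| NV | 7 | 5 | 1 | 1 |
| DC | 5 | 1 | 0 | 1 |
| UT | 5 | 6 | 2 | 1 |
| AK | 4 | 4 | 0 | 0 |
| AR | 4 | 5 | 0 | 1 |
| MS | 4 | 6 | 1 | 2 |
| MT | 4 | 4 | 0 | 2 |
| WV | 4 | 2 | 1 | 1 |
| NH | 3 | 3 | 0 | 0 |
| VT | 3 | 1 | 0 | 0 |
| ID | 2 | 1 | 0 | 1 |
| ND | 2 | 2 | 2 | 0 |
| IA | 1 | 5 | 1 | 3 |
| ME | 1 | 2 | 0 | 0 |
| SD | 1 | 2 | 0 | 1 |
| WY | 1 | 2 | 0 | 1 |
| NA | 1 | 0 | 0 | 1 |
| DE | 0 | 2 | 0 | 0 |
| HI | 0 | 2 | 0 | 1 |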
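
A rough sketch of how per-state totals like these could be computed from TRUE/FALSE indicator columns, and how the most common breach type per state could be picked out. The toy rows and the column names are assumptions for illustration, and ties are resolved in favor of the first column listed.

```r
library(dplyr)

# Toy data: one row per breach, with made-up indicator values.
toy <- tibble(
  State = c("CA", "CA", "TX", "TX", "HI"),
  Theft = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  HackingorIT = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  Loss = c(FALSE, FALSE, TRUE, FALSE, FALSE)
)

state_totals <- toy %>%
  group_by(State) %>%
  summarise(
    Total_Theft = sum(Theft),          # summing TRUE/FALSE counts the TRUEs
    Total_Hacking = sum(HackingorIT),
    Total_Loss = sum(Loss)
  ) %>%
  arrange(desc(Total_Theft))

# For each state, name the column holding the largest total.
totals <- as.matrix(
  state_totals[, c("Total_Theft", "Total_Hacking", "Total_Loss")])
state_totals$MostCommon <- colnames(totals)[max.col(totals, ties.method = "first")]

print(state_totals)
```

Raw counts favor large states, so dividing by population or by the number of covered entities would give a fairer comparison between states.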

I chose to look at state-level statistics on Theft, Hacking, Improper Disposal, Loss, and individuals affected because it’s very interesting information that is not talked about much on the news. I wasn’t surprised that certain states were targeted more, given where businesses have their headquarters, but I didn’t realize the magnitude of individuals affected by these breaches. I was curious whether Wednesdays were an unlucky day for some healthcare providers, and it turns out three providers have had breaches specifically on Wednesdays in the last four years. It might be worth checking whether that pattern continues in the future. I think it’s really interesting that Hawaii is one of the least targeted states across the four breach types above. It must be that people are on island time and not worrying about breaches (joke). I can’t explain why that is, but I think it’s a very interesting stat.