One of the challenges in working with data is wrangling. In this assignment we will use R to perform this task.
Here is a list of data sets: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list)
Please select one, download it and perform the following tasks:
Read the csv into R from the Git repository:
gitURL <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/breaches.csv"
breaches <- read.csv(gitURL)
# preview the file contents
head(breaches)
## X Number Name_of_Covered_Entity State
## 1 1 0 Brooke Army Medical Center TX
## 2 2 1 Mid America Kidney Stone Association, LLC MO
## 3 3 2 Alaska Department of Health and Social Services AK
## 4 4 3 Health Services for Children with Special Needs, Inc. DC
## 5 5 4 L. Douglas Carlson, M.D. CA
## 6 6 5 David I. Cohen, MD CA
## Business_Associate_Involved Individuals_Affected Date_of_Breach
## 1 1000 10/16/2009
## 2 1000 9/22/2009
## 3 501 10/12/2009
## 4 3800 10/9/2009
## 5 5257 9/27/2009
## 6 857 9/27/2009
## Type_of_Breach Location_of_Breached_Information Date_Posted_or_Updated
## 1 Theft Paper 6/30/2014
## 2 Theft Network Server 5/30/2014
## 3 Theft Other Portable Electronic Device, Other 1/23/2014
## 4 Loss Laptop 1/23/2014
## 5 Theft Desktop Computer 1/23/2014
## 6 Theft Desktop Computer 1/23/2014
## Summary
## 1 A binder containing the protected health information (PHI) of up to 1,272 individuals was stolen from a staff member's vehicle. The PHI included names, telephone numbers, detailed treatment notes, and possibly social security numbers. In response to the breach, the covered entity (CE) sanctioned the workforce member and developed a new policy requiring on-call staff members to submit any information created during their shifts to the main office instead of adding it to the binder. Following OCR's investigation, the CE notified the local media about the breach.
## 2 Five desktop computers containing unencrypted electronic protected health information (e-PHI) were stolen from the covered entity (CE). Originally, the CE reported that over 500 persons were involved, but subsequent investigation showed that about 260 persons were involved. The ePHI included demographic and financial information. The CE provided breach notification to affected individuals and HHS. Following the breach, the CE improved physical security by installing motion detectors and alarm systems security monitoring. It improved technical safeguards by installing enhanced antivirus and encryption software. As a result of OCR's investigation the CE updated its computer password policy.
## 3
## 4 A laptop was lost by an employee while in transit on public transportation. The computer contained the protected health information of 3800 individuals. The protected health information involved in the breach included names, Medicaid ID numbers, dates of birth, and primary physicians. In response to this incident, the covered entity took steps to enforce the requirements of the Privacy & Security Rules. The covered entity has installed encryption software on all employee computers, strengthened access controls including passwords, reviewed and updated security policies and procedures, and updated it risk assessment. In addition, all employees received additional security training. \n\n
## 5 A shared Computer that was used for backup was stolen on 9/27/09 from the reception desk area of the covered entity. The Computer contained certain electronic protected health information (ePHI) of 5,257 individuals who were patients of the CE. The ePHI involved in the breach included names, dates of birth, and clinical information, but there were no social security numbers, financial information, addresses, phone numbers, or other ePHI in any of the reports on the disks or the hard drive on the stolen Computer. Following the breach, the covered entity notified all 5,257 affected individuals and the appropriate media; added technical safeguards of encryption for all ePHI stored on the USB flash drive or the CD used on the replacement computer; added physical safeguards by keeping new portable devices locked when not in use in a secure combination safe in doctor's private office or in a secure filing cabinet; and added administrative safeguards by requiring annual refresher retraining of CE staff for Privacy and Security Rules as well as requiring immediate retraining of cleaning staff in both Rules.\n\n
## 6 A shared Computer that was used for backup was stolen from the reception desk area, behind a locked desk area, probably while a cleaning crew had left the main door to the building open and the door to the suite was unlocked and perhaps ajar. The Computer contained certain electronic protected health information (ePHI) of 857 patients. The ePHI involved in the breach included names, dates of birth, and clinical information. Following the breach, the covered entity notified all affected individuals and the media, added technical safeguards of encryption for all ePHI stored on the USB flash drive or the CD used on the replacement computer, added physical safeguards by keeping new portable devices locked when not in use in a secure combination safe in doctor's private office or in a secure filing cabinet, and added administrative safeguards by requiring annual refresher retraining staff for Privacy and Security Rules as well as requiring immediate retraining of cleaning staff in both Rules, which has already taken place.\n\n
## breach_start breach_end year
## 1 10/16/2009 <NA> 2009
## 2 9/22/2009 <NA> 2009
## 3 10/12/2009 <NA> 2009
## 4 10/9/2009 <NA> 2009
## 5 9/27/2009 <NA> 2009
## 6 9/27/2009 <NA> 2009
# overview
summary(breaches)
## X Number Name_of_Covered_Entity State
## Min. : 1.0 Min. : 0.0 Length:1055 Length:1055
## 1st Qu.: 264.5 1st Qu.: 263.5 Class :character Class :character
## Median : 528.0 Median : 527.0 Mode :character Mode :character
## Mean : 528.0 Mean : 527.0
## 3rd Qu.: 791.5 3rd Qu.: 790.5
## Max. :1055.0 Max. :1054.0
## Business_Associate_Involved Individuals_Affected Date_of_Breach
## Length:1055 Min. : 500 Length:1055
## Class :character 1st Qu.: 1000 Class :character
## Mode :character Median : 2300 Mode :character
## Mean : 30262
## 3rd Qu.: 6941
## Max. :4900000
## Type_of_Breach Location_of_Breached_Information Date_Posted_or_Updated
## Length:1055 Length:1055 Length:1055
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Summary breach_start breach_end year
## Length:1055 Length:1055 Length:1055 Min. :1997
## Class :character Class :character Class :character 1st Qu.:2010
## Mode :character Mode :character Mode :character Median :2012
## Mean :2011
## 3rd Qu.:2013
## Max. :2014
# mean and median of Individuals_Affected
mean(breaches[,"Individuals_Affected"])
## [1] 30262.02
median(breaches[,"Individuals_Affected"])
## [1] 2300
# convert Summary field to a numeric vector of character counts for each entry
summary_nchar <- c(nchar(breaches[,"Summary"]))
# mean and median of the # of characters in the Summary field
mean(summary_nchar)
## [1] 116.1801
median(summary_nchar)
## [1] 0
# convert year field to a numeric vector and remove NA values
yearint <- c(breaches[,"year"],na.rm = TRUE)
mean(yearint)
## [1] 2009.546
median(yearint)
## [1] 2012
Let’s look at the breaches that affected more than 100,000 people We can subset at least three ways.
The first way would be to use bracketing:
# using bracketing
breachesbracket <- breaches[which(breaches$Individuals_Affected > 100000),names(breaches) %in% c("State","Individuals_Affected","year")]
head(breachesbracket)
## State Individuals_Affected year
## 64 NY 344579 2009
## 76 IL 180111 2010
## 93 FL 1220000 2009
## 95 NY 130495 2010
## 114 CO 105470 2010
## 118 MA 800000 2010
A second way would be to use the base subset() function:
# using subset
breachessubset <- subset(breaches,Individuals_Affected > 100000, select = c("State","Individuals_Affected","year"))
head(breachessubset)
## State Individuals_Affected year
## 64 NY 344579 2009
## 76 IL 180111 2010
## 93 FL 1220000 2009
## 95 NY 130495 2010
## 114 CO 105470 2010
## 118 MA 800000 2010
A third way would be to use the dplyr package:
# using the dplyr package
breachesdplyr <- select(filter(breaches,Individuals_Affected > 100000),c(State,Individuals_Affected,year))
head(breachesdplyr)
## State Individuals_Affected year
## 1 NY 344579 2009
## 2 IL 180111 2010
## 3 FL 1220000 2009
## 4 NY 130495 2010
## 5 CO 105470 2010
## 6 MA 800000 2010
Note that the dplyr package returns rows renumbered from 1 instead of the original index numbers of the rows selected by the condition, as the bracket and subset() methods did.
Renaming columns in the subset that used the dplyr package. Just for fun, let’s translate the English column names to Latin, and while we’re at it let’s change the snake case to camel case.
names(breachesdplyr)[names(breachesdplyr) == "State"] <- "civitas "
names(breachesdplyr)[names(breachesdplyr) == "Individuals_Affected"] <- "AffectusHominum "
names(breachesdplyr)[names(breachesdplyr) == "year"] <- "anno "
head(breachesdplyr)
## civitas AffectusHominum anno
## 1 NY 344579 2009
## 2 IL 180111 2010
## 3 FL 1220000 2009
## 4 NY 130495 2010
## 5 CO 105470 2010
## 6 MA 800000 2010
summary(breachesdplyr)
## civitas AffectusHominum anno
## Length:35 Min. : 105470 Min. :2008
## Class :character 1st Qu.: 177730 1st Qu.:2010
## Mode :character Median : 344579 Median :2011
## Mean : 730039 Mean :2011
## 3rd Qu.: 819856 3rd Qu.:2012
## Max. :4900000 Max. :2014
# mean and median of Individuals_Affected, now AffectusHominum
mean(breachesdplyr$AffectusHominum)
## [1] 730039.2
median(breachesdplyr$AffectusHominum)
## [1] 344579
# mean and median of year, now anno
mean(breachesdplyr$anno)
## [1] 2011.086
median(breachesdplyr$anno)
## [1] 2011
# for comparison purposes, the original means and medians from the entire set
mean(breaches[,"Individuals_Affected"])
## [1] 30262.02
median(breaches[,"Individuals_Affected"])
## [1] 2300
mean(yearint)
## [1] 2009.546
median(yearint)
## [1] 2012
By subsetting the data for breaches that affected 100K+ people, we can see that the mean increased dramatically from 30,262.02 people affected on average across all breaches to 730,039.2 people affected on average in very large breaches. The medians increased significantly as well from 2300 people affected at the midpoint of all breaches and 344,579 people affected at the midpoint of very large breaches.
Similarly the average year of all breaches was 2011 and the median year was also 2011, while much larger breaches had an average year of 2009 but a median year of 2012.
Replacing values in the subset that used the dplyr package for column “civitas”.
We will replace the value NY with MagnumMalum (Latin for BigApple), FL with SolemStatui (Latin for Sun shine State) and MA with FabaOppidum (Latin for Bean Town).
breachesdplyr
## civitas AffectusHominum anno
## 1 NY 344579 2009
## 2 IL 180111 2010
## 3 FL 1220000 2009
## 4 NY 130495 2010
## 5 CO 105470 2010
## 6 MA 800000 2010
## 7 TN 1023209 2009
## 8 PR 475000 2008
## 9 PR 115000 2010
## 10 FL 156000 2010
## 11 NH 231400 2010
## 12 NY 1700000 2010
## 13 CA 514330 2011
## 14 OK 132940 2011
## 15 CA 1900000 2011
## 16 SC 400000 2011
## 17 AZ 175350 2011
## 18 FL 1055489 2011
## 19 VA 4900000 2011
## 20 AL 943434 2011
## 21 UT 780000 2012
## 22 GA 315000 2012
## 23 SC 228435 2012
## 24 FL 105646 2011
## 25 CA 116506 2012
## 26 CA 109000 2012
## 27 CA 189489 2012
## 28 IN 187533 2013
## 29 TX 277014 2013
## 30 IL 4029530 2013
## 31 CA 729000 2013
## 32 NJ 839711 2013
## 33 PR 398000 2010
## 34 TX 405000 2013
## 35 CA 338700 2014
breachesdplyr <- data.frame(lapply(breachesdplyr, function(x) {
gsub("NY","MagnumMalum", x)
}))
breachesdplyr <- data.frame(lapply(breachesdplyr, function(x) {
gsub("MA","FabaOppidum", x)
}))
breachesdplyr <- data.frame(lapply(breachesdplyr, function(x) {
gsub("FL","SolemStatui", x)
}))
breachesdplyr
## civitas. AffectusHominum. anno.
## 1 MagnumMalum 344579 2009
## 2 IL 180111 2010
## 3 SolemStatui 1220000 2009
## 4 MagnumMalum 130495 2010
## 5 CO 105470 2010
## 6 FabaOppidum 800000 2010
## 7 TN 1023209 2009
## 8 PR 475000 2008
## 9 PR 115000 2010
## 10 SolemStatui 156000 2010
## 11 NH 231400 2010
## 12 MagnumMalum 1700000 2010
## 13 CA 514330 2011
## 14 OK 132940 2011
## 15 CA 1900000 2011
## 16 SC 400000 2011
## 17 AZ 175350 2011
## 18 SolemStatui 1055489 2011
## 19 VA 4900000 2011
## 20 AL 943434 2011
## 21 UT 780000 2012
## 22 GA 315000 2012
## 23 SC 228435 2012
## 24 SolemStatui 105646 2011
## 25 CA 116506 2012
## 26 CA 109000 2012
## 27 CA 189489 2012
## 28 IN 187533 2013
## 29 TX 277014 2013
## 30 IL 4029530 2013
## 31 CA 729000 2013
## 32 NJ 839711 2013
## 33 PR 398000 2010
## 34 TX 405000 2013
## 35 CA 338700 2014
Roger that!
Please see first R script above.