R Markdown

R Bridge Week 2 Assignment

One of the challenges in working with data is wrangling. In this assignment we will use R to perform this task.

Here is a list of data sets: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list)

Please select one, download it and perform the following tasks:

Sample Dataset: HHS Cybersecurity Breaches

Read the csv into R from the Git repository:

gitURL <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/breaches.csv"

breaches <- read.csv(gitURL)

# preview the file contents
head(breaches)
##   X Number                                Name_of_Covered_Entity State
## 1 1      0                            Brooke Army Medical Center    TX
## 2 2      1             Mid America Kidney Stone Association, LLC    MO
## 3 3      2       Alaska Department of Health and Social Services    AK
## 4 4      3 Health Services for Children with Special Needs, Inc.    DC
## 5 5      4                              L. Douglas Carlson, M.D.    CA
## 6 6      5                                    David I. Cohen, MD    CA
##   Business_Associate_Involved Individuals_Affected Date_of_Breach
## 1                                             1000     10/16/2009
## 2                                             1000      9/22/2009
## 3                                              501     10/12/2009
## 4                                             3800      10/9/2009
## 5                                             5257      9/27/2009
## 6                                              857      9/27/2009
##   Type_of_Breach        Location_of_Breached_Information Date_Posted_or_Updated
## 1          Theft                                   Paper              6/30/2014
## 2          Theft                          Network Server              5/30/2014
## 3          Theft Other Portable Electronic Device, Other              1/23/2014
## 4           Loss                                  Laptop              1/23/2014
## 5          Theft                        Desktop Computer              1/23/2014
## 6          Theft                        Desktop Computer              1/23/2014
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         A binder containing the protected health information (PHI) of up to 1,272 individuals was stolen from a staff member's vehicle.  The PHI included names, telephone numbers, detailed treatment notes, and possibly social security numbers.  In response to the breach, the covered entity (CE) sanctioned the workforce member and developed a new policy requiring on-call staff members to submit any information created during their shifts to the main office instead of adding it to the binder.  Following OCR's investigation, the CE notified the local media about the breach.
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                  Five desktop computers containing unencrypted electronic protected health information (e-PHI) were stolen from the covered entity (CE).  Originally, the CE reported that over 500 persons were involved, but subsequent investigation showed that about 260 persons were involved.  The ePHI included demographic and financial information. The CE provided breach notification to affected individuals and HHS.  Following the breach, the CE improved physical security by installing motion detectors and alarm systems security monitoring.  It improved technical safeguards by installing enhanced antivirus and encryption software.  As a result of OCR's investigation the CE updated its computer password policy.  
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                      A laptop was lost by an employee while in transit on public transportation.  The computer contained the protected health information of 3800 individuals.  The protected health information involved in the breach included names, Medicaid ID numbers, dates of birth, and primary physicians.  In response to this incident, the covered entity took steps to enforce the requirements of the Privacy & Security Rules.  The covered entity has installed encryption software on all employee computers, strengthened access controls including passwords, reviewed and updated security policies and procedures, and updated it risk assessment.  In addition, all employees received additional security training.  \n\n
## 5 A shared Computer that was used for backup was stolen on 9/27/09 from the reception desk area of the covered entity. The Computer contained certain electronic protected health information (ePHI) of 5,257 individuals who were patients of the CE.  The ePHI involved in the breach included names, dates of birth, and clinical information, but there were no social security numbers, financial information, addresses, phone numbers, or other ePHI in any of the reports on the disks or the hard drive on the stolen Computer. Following the breach, the covered entity notified all 5,257 affected individuals and the appropriate media; added technical safeguards of encryption for all ePHI stored on the USB flash drive or the CD used on the replacement computer; added physical safeguards by keeping new portable devices locked when not in use in a secure combination safe in doctor's private office or in a secure filing cabinet; and added administrative safeguards by requiring annual refresher retraining of CE staff for Privacy and Security Rules as well as requiring immediate retraining of cleaning staff in both Rules.\n\n
## 6                                                                                  A shared Computer that was used for backup was stolen from the reception desk area, behind a locked desk area, probably while a cleaning crew had left the main door to the building open and the door to the suite was unlocked and perhaps ajar.  The Computer contained certain electronic protected health information (ePHI) of 857 patients.  The ePHI involved in the breach included names, dates of birth, and clinical information.  Following the breach, the covered entity notified all affected individuals and the media, added technical safeguards of encryption for all ePHI stored on the USB flash drive or the CD used on the replacement computer, added physical safeguards by keeping new portable devices locked when not in use in a secure combination safe in doctor's private office or in a secure filing cabinet, and added administrative safeguards by requiring annual refresher retraining staff for Privacy and Security Rules as well as requiring immediate retraining of cleaning staff in both Rules, which has already taken place.\n\n
##   breach_start breach_end year
## 1   10/16/2009       <NA> 2009
## 2    9/22/2009       <NA> 2009
## 3   10/12/2009       <NA> 2009
## 4    10/9/2009       <NA> 2009
## 5    9/27/2009       <NA> 2009
## 6    9/27/2009       <NA> 2009
  1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.
# overview
summary(breaches)
##        X              Number       Name_of_Covered_Entity    State          
##  Min.   :   1.0   Min.   :   0.0   Length:1055            Length:1055       
##  1st Qu.: 264.5   1st Qu.: 263.5   Class :character       Class :character  
##  Median : 528.0   Median : 527.0   Mode  :character       Mode  :character  
##  Mean   : 528.0   Mean   : 527.0                                            
##  3rd Qu.: 791.5   3rd Qu.: 790.5                                            
##  Max.   :1055.0   Max.   :1054.0                                            
##  Business_Associate_Involved Individuals_Affected Date_of_Breach    
##  Length:1055                 Min.   :    500      Length:1055       
##  Class :character            1st Qu.:   1000      Class :character  
##  Mode  :character            Median :   2300      Mode  :character  
##                              Mean   :  30262                        
##                              3rd Qu.:   6941                        
##                              Max.   :4900000                        
##  Type_of_Breach     Location_of_Breached_Information Date_Posted_or_Updated
##  Length:1055        Length:1055                      Length:1055           
##  Class :character   Class :character                 Class :character      
##  Mode  :character   Mode  :character                 Mode  :character      
##                                                                            
##                                                                            
##                                                                            
##    Summary          breach_start        breach_end             year     
##  Length:1055        Length:1055        Length:1055        Min.   :1997  
##  Class :character   Class :character   Class :character   1st Qu.:2010  
##  Mode  :character   Mode  :character   Mode  :character   Median :2012  
##                                                           Mean   :2011  
##                                                           3rd Qu.:2013  
##                                                           Max.   :2014
# mean and median of Individuals_Affected

mean(breaches[,"Individuals_Affected"])
## [1] 30262.02
median(breaches[,"Individuals_Affected"])
## [1] 2300
# convert Summary field to a numeric vector of character counts for each entry
summary_nchar <- c(nchar(breaches[,"Summary"]))

# mean and median of the # of characters in the Summary field
mean(summary_nchar)
## [1] 116.1801
median(summary_nchar)
## [1] 0
# convert year field to a numeric vector and remove NA values
yearint <- c(breaches[,"year"],na.rm = TRUE)

mean(yearint)
## [1] 2009.546
median(yearint)
## [1] 2012
  1. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

Let’s look at the breaches that affected more than 100,000 people We can subset at least three ways.

The first way would be to use bracketing:

# using bracketing
breachesbracket <- breaches[which(breaches$Individuals_Affected > 100000),names(breaches) %in% c("State","Individuals_Affected","year")]
  
head(breachesbracket)
##     State Individuals_Affected year
## 64     NY               344579 2009
## 76     IL               180111 2010
## 93     FL              1220000 2009
## 95     NY               130495 2010
## 114    CO               105470 2010
## 118    MA               800000 2010

A second way would be to use the base subset() function:

# using subset
breachessubset <- subset(breaches,Individuals_Affected > 100000, select = c("State","Individuals_Affected","year"))

head(breachessubset)
##     State Individuals_Affected year
## 64     NY               344579 2009
## 76     IL               180111 2010
## 93     FL              1220000 2009
## 95     NY               130495 2010
## 114    CO               105470 2010
## 118    MA               800000 2010

A third way would be to use the dplyr package:

# using the dplyr package
breachesdplyr <- select(filter(breaches,Individuals_Affected > 100000),c(State,Individuals_Affected,year))

head(breachesdplyr)
##   State Individuals_Affected year
## 1    NY               344579 2009
## 2    IL               180111 2010
## 3    FL              1220000 2009
## 4    NY               130495 2010
## 5    CO               105470 2010
## 6    MA               800000 2010

Note that the dplyr package returns rows renumbered from 1 instead of the original index numbers of the rows selected by the condition, as the bracket and subset() methods did.

  1. Create new column names for the new data frame.

Renaming columns in the subset that used the dplyr package. Just for fun, let’s translate the English column names to Latin, and while we’re at it let’s change the snake case to camel case.

names(breachesdplyr)[names(breachesdplyr) == "State"] <- "civitas "

names(breachesdplyr)[names(breachesdplyr) == "Individuals_Affected"] <- "AffectusHominum "

names(breachesdplyr)[names(breachesdplyr) == "year"] <- "anno "

head(breachesdplyr)
##   civitas  AffectusHominum  anno 
## 1       NY           344579  2009
## 2       IL           180111  2010
## 3       FL          1220000  2009
## 4       NY           130495  2010
## 5       CO           105470  2010
## 6       MA           800000  2010
  1. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.
summary(breachesdplyr)
##    civitas          AffectusHominum       anno      
##  Length:35          Min.   : 105470   Min.   :2008  
##  Class :character   1st Qu.: 177730   1st Qu.:2010  
##  Mode  :character   Median : 344579   Median :2011  
##                     Mean   : 730039   Mean   :2011  
##                     3rd Qu.: 819856   3rd Qu.:2012  
##                     Max.   :4900000   Max.   :2014
# mean and median of Individuals_Affected, now AffectusHominum

mean(breachesdplyr$AffectusHominum)
## [1] 730039.2
median(breachesdplyr$AffectusHominum)
## [1] 344579
# mean and median of year, now anno
mean(breachesdplyr$anno)
## [1] 2011.086
median(breachesdplyr$anno)
## [1] 2011
# for comparison purposes, the original means and medians from the entire set
mean(breaches[,"Individuals_Affected"])
## [1] 30262.02
median(breaches[,"Individuals_Affected"])
## [1] 2300
mean(yearint)
## [1] 2009.546
median(yearint)
## [1] 2012

By subsetting the data for breaches that affected 100K+ people, we can see that the mean increased dramatically from 30,262.02 people affected on average across all breaches to 730,039.2 people affected on average in very large breaches. The medians increased significantly as well from 2300 people affected at the midpoint of all breaches and 344,579 people affected at the midpoint of very large breaches.

Similarly the average year of all breaches was 2011 and the median year was also 2011, while much larger breaches had an average year of 2009 but a median year of 2012.

  1. For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.

Replacing values in the subset that used the dplyr package for column “civitas”.

We will replace the value NY with MagnumMalum (Latin for BigApple), FL with SolemStatui (Latin for Sun shine State) and MA with FabaOppidum (Latin for Bean Town).

breachesdplyr
##    civitas  AffectusHominum  anno 
## 1        NY           344579  2009
## 2        IL           180111  2010
## 3        FL          1220000  2009
## 4        NY           130495  2010
## 5        CO           105470  2010
## 6        MA           800000  2010
## 7        TN          1023209  2009
## 8        PR           475000  2008
## 9        PR           115000  2010
## 10       FL           156000  2010
## 11       NH           231400  2010
## 12       NY          1700000  2010
## 13       CA           514330  2011
## 14       OK           132940  2011
## 15       CA          1900000  2011
## 16       SC           400000  2011
## 17       AZ           175350  2011
## 18       FL          1055489  2011
## 19       VA          4900000  2011
## 20       AL           943434  2011
## 21       UT           780000  2012
## 22       GA           315000  2012
## 23       SC           228435  2012
## 24       FL           105646  2011
## 25       CA           116506  2012
## 26       CA           109000  2012
## 27       CA           189489  2012
## 28       IN           187533  2013
## 29       TX           277014  2013
## 30       IL          4029530  2013
## 31       CA           729000  2013
## 32       NJ           839711  2013
## 33       PR           398000  2010
## 34       TX           405000  2013
## 35       CA           338700  2014
breachesdplyr <- data.frame(lapply(breachesdplyr, function(x) {
  gsub("NY","MagnumMalum", x)
}))

breachesdplyr <- data.frame(lapply(breachesdplyr, function(x) {
  gsub("MA","FabaOppidum", x)
}))

breachesdplyr <- data.frame(lapply(breachesdplyr, function(x) {
  gsub("FL","SolemStatui", x)
}))

breachesdplyr
##       civitas. AffectusHominum. anno.
## 1  MagnumMalum           344579  2009
## 2           IL           180111  2010
## 3  SolemStatui          1220000  2009
## 4  MagnumMalum           130495  2010
## 5           CO           105470  2010
## 6  FabaOppidum           800000  2010
## 7           TN          1023209  2009
## 8           PR           475000  2008
## 9           PR           115000  2010
## 10 SolemStatui           156000  2010
## 11          NH           231400  2010
## 12 MagnumMalum          1700000  2010
## 13          CA           514330  2011
## 14          OK           132940  2011
## 15          CA          1900000  2011
## 16          SC           400000  2011
## 17          AZ           175350  2011
## 18 SolemStatui          1055489  2011
## 19          VA          4900000  2011
## 20          AL           943434  2011
## 21          UT           780000  2012
## 22          GA           315000  2012
## 23          SC           228435  2012
## 24 SolemStatui           105646  2011
## 25          CA           116506  2012
## 26          CA           109000  2012
## 27          CA           189489  2012
## 28          IN           187533  2013
## 29          TX           277014  2013
## 30          IL          4029530  2013
## 31          CA           729000  2013
## 32          NJ           839711  2013
## 33          PR           398000  2010
## 34          TX           405000  2013
## 35          CA           338700  2014
  1. Display enough rows to see examples of all of steps 1-5 above.

Roger that!

  1. BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

Please see first R script above.