For this assignment, we were to (1) find a job description that we are interested in, (2) find a dataset that we might analyze if we were to get that job, and (3) apply some type of statistical learning algorithm, either classification or prediction, and report using the 5 steps.
Job Description. The job description was for a Data Analytics Associate at Mathematica Policy Research, in Oakland. Responsibilities include “analyzing data quality and integrity using packaged programs such as SAS, R, Python, and STATA, running basic statistics frequencies.”
Dataset. For good or bad, “policy research” could involve almost any kind of data analysis. Thus, after endless forays into various datasets I ended up choosing the Death Records dataset from Kaggle (https://www.kaggle.com/cdc/mortality), which is a compilation of year 2014 US deaths and their causes as reported by the CDC. The dataset has over 2.6M records. It’s a depressing choice, I know, but it was a good size, and had data that I felt was just messy enough for me to have to get some practice on my R code, without being a time suck; so I’m going all in on this.
Stat Learning Algorithm. After spending a lot of time exploring the data, and realizing that it would take a lot more time to really deep dive into this, I settled on experimenting with classification algorithms to predict individuals that die from external causes, like accidents, suicide, poisonings, homicide, etc. as opposed to dying of natural causes such as old age or more progressive diseases. It turns out that the vast majority of deaths fall into the latter category (> 90%) so as part of my study I am interested in seeing the effects of when there’s a severe imbalance in class–so things like the kappa statistic.
So off we go to the five steps.
Data obtained from kaggle.com, source listed above.
morbid <- read.csv("DeathRecords.csv")
str(morbid)
## 'data.frame': 2631171 obs. of 38 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ResidentStatus : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Education1989Revision : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Education2003Revision : int 2 2 7 6 3 5 4 4 3 3 ...
## $ EducationReportingFlag : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MonthOfDeath : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 2 1 2 2 1 2 2 1 2 ...
## $ AgeType : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 87 58 75 74 64 93 82 55 86 23 ...
## $ AgeSubstitutionFlag : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AgeRecode52 : int 43 37 41 40 38 44 42 37 43 30 ...
## $ AgeRecode27 : int 23 17 21 20 18 24 22 17 23 10 ...
## $ AgeRecode12 : int 11 8 10 9 8 11 10 8 11 4 ...
## $ InfantAgeRecode22 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PlaceOfDeathAndDecedentsStatus: int 4 4 4 6 4 6 4 7 4 7 ...
## $ MaritalStatus : Factor w/ 5 levels "D","M","S","U",..: 2 1 5 1 1 5 2 3 5 3 ...
## $ DayOfWeekOfDeath : int 4 3 2 1 2 4 6 6 6 4 ...
## $ CurrentDataYear : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ InjuryAtWork : Factor w/ 3 levels "N","U","Y": 2 2 2 2 2 2 2 2 2 1 ...
## $ MannerOfDeath : int 7 7 7 7 7 7 7 7 7 2 ...
## $ MethodOfDisposition : Factor w/ 7 levels "B","C","D","E",..: 2 2 2 2 2 2 2 2 2 6 ...
## $ Autopsy : Factor w/ 5 levels "n","N","U","y",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ ActivityCode : int 99 99 99 99 99 99 99 99 99 9 ...
## $ PlaceOfInjury : int 99 99 99 99 99 99 99 99 99 5 ...
## $ Icd10Code : Factor w/ 3695 levels "A020","A021",..: 1570 1440 1762 1521 1440 1846 1441 1440 1070 3540 ...
## $ CauseRecode358 : int 238 214 267 228 214 280 215 214 175 429 ...
## $ CauseRecode113 : int 70 62 86 68 62 111 63 62 111 125 ...
## $ InfantCauseRecode130 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CauseRecode39 : int 24 21 28 22 21 37 21 21 37 40 ...
## $ NumberOfEntityAxisConditions : int 1 3 2 3 1 5 4 2 1 3 ...
## $ NumberOfRecordAxisConditions : int 1 3 2 3 1 5 3 2 1 3 ...
## $ Race : int 1 1 1 1 1 1 1 2 1 1 ...
## $ BridgedRaceFlag : int 0 0 0 0 0 0 0 1 0 0 ...
## $ RaceImputationFlag : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RaceRecode3 : int 1 1 1 1 1 1 1 3 1 1 ...
## $ RaceRecode5 : int 1 1 1 1 1 1 1 2 1 1 ...
## $ HispanicOrigin : int 100 100 100 100 100 100 100 100 100 100 ...
## $ HispanicOriginRaceRecode : int 6 6 6 6 6 6 6 7 6 6 ...
The dataset contains 2,631,171 observations of 38 variables.
First, I’d like to get rid of a lot of variables that are either repetitive (e.g. multiple binnings of the same base variable, such as Age) or otherwise not relevant to my study.
dump <- c("AgeRecode27","AgeRecode52","AgeType","BridgedRaceFlag",
"Education1989Revision","EducationReportingFlag","EntityAxisConditions",
"HispanicOrigin","HispanicOriginRaceRecode","MethodOfDisposition",
"RaceImputationFlag","RaceRecode3","RecordAxisConditions",
"AgeSubstitutionFlag","CauseRecode358","CauseRecode113",
"NumberOfEntityAxisConditions","NumberOfRecordAxisConditions",
"InfantAgeRecode22","InfantCauseRecode130","CurrentDataYear",
"CauseRecode39")
morbid <- morbid[,!(names(morbid) %in% dump)]
str(morbid)
## 'data.frame': 2631171 obs. of 18 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ResidentStatus : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Education2003Revision : int 2 2 7 6 3 5 4 4 3 3 ...
## $ MonthOfDeath : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 2 1 2 2 1 2 2 1 2 ...
## $ Age : int 87 58 75 74 64 93 82 55 86 23 ...
## $ AgeRecode12 : int 11 8 10 9 8 11 10 8 11 4 ...
## $ PlaceOfDeathAndDecedentsStatus: int 4 4 4 6 4 6 4 7 4 7 ...
## $ MaritalStatus : Factor w/ 5 levels "D","M","S","U",..: 2 1 5 1 1 5 2 3 5 3 ...
## $ DayOfWeekOfDeath : int 4 3 2 1 2 4 6 6 6 4 ...
## $ InjuryAtWork : Factor w/ 3 levels "N","U","Y": 2 2 2 2 2 2 2 2 2 1 ...
## $ MannerOfDeath : int 7 7 7 7 7 7 7 7 7 2 ...
## $ Autopsy : Factor w/ 5 levels "n","N","U","y",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ ActivityCode : int 99 99 99 99 99 99 99 99 99 9 ...
## $ PlaceOfInjury : int 99 99 99 99 99 99 99 99 99 5 ...
## $ Icd10Code : Factor w/ 3695 levels "A020","A021",..: 1570 1440 1762 1521 1440 1846 1441 1440 1070 3540 ...
## $ Race : int 1 1 1 1 1 1 1 2 1 1 ...
## $ RaceRecode5 : int 1 1 1 1 1 1 1 2 1 1 ...
summary(morbid)
## Id ResidentStatus Education2003Revision MonthOfDeath
## Min. : 1 Min. :1.00 Min. :0.000 Min. : 1.000
## 1st Qu.: 657794 1st Qu.:1.00 1st Qu.:2.000 1st Qu.: 3.000
## Median :1315586 Median :1.00 Median :3.000 Median : 6.000
## Mean :1315586 Mean :1.26 Mean :3.227 Mean : 6.489
## 3rd Qu.:1973378 3rd Qu.:1.00 3rd Qu.:4.000 3rd Qu.:10.000
## Max. :2631171 Max. :4.00 Max. :9.000 Max. :12.000
##
## Sex Age AgeRecode12
## F:1299710 Min. : 1.00 Min. : 1.000
## M:1331461 1st Qu.: 63.00 1st Qu.: 8.000
## Median : 77.00 Median :10.000
## Mean : 73.41 Mean : 9.298
## 3rd Qu.: 87.00 3rd Qu.:11.000
## Max. :999.00 Max. :12.000
##
## PlaceOfDeathAndDecedentsStatus MaritalStatus DayOfWeekOfDeath
## Min. :1.000 D:400959 Min. :1.000
## 1st Qu.:1.000 M:980016 1st Qu.:2.000
## Median :4.000 S:333043 Median :4.000
## Mean :3.614 U: 18713 Mean :4.009
## 3rd Qu.:6.000 W:898440 3rd Qu.:6.000
## Max. :9.000 Max. :9.000
##
## InjuryAtWork MannerOfDeath Autopsy ActivityCode PlaceOfInjury
## N: 205381 Min. :0.000 n: 3 Min. : 0.00 Min. : 0.00
## U:2421408 1st Qu.:7.000 N:2258840 1st Qu.:99.00 1st Qu.:99.00
## Y: 4382 Median :7.000 U: 168646 Median :99.00 Median :99.00
## Mean :5.609 y: 1 Mean :92.19 Mean :91.94
## 3rd Qu.:7.000 Y: 203681 3rd Qu.:99.00 3rd Qu.:99.00
## Max. :7.000 Max. :99.00 Max. :99.00
##
## Icd10Code Race RaceRecode5
## I251 : 161961 Min. : 1.000 Min. :1.000
## C349 : 154862 1st Qu.: 1.000 1st Qu.:1.000
## F03 : 122021 Median : 1.000 Median :1.000
## I219 : 114107 Mean : 1.686 Mean :1.202
## J449 : 107836 3rd Qu.: 1.000 3rd Qu.:1.000
## G309 : 91356 Max. :78.000 Max. :4.000
## (Other):1879028
Noticing the numerous categorical data that are expressed in numbers, and also the missing data, such as a Max Age of 999. Assuming we didn’t lose some Biblical characters in 2014, we need to deal with those. Since Age is such as key variable in all of my analysis, I’ve decided to eliminate those rows.
This dataset used a lot of lookup tables, with the main dataset relying on numeric codes for categorical data. These need to be converted to the correct data type.
cols <- c("ResidentStatus","Education2003Revision","AgeRecode12",
"PlaceOfDeathAndDecedentsStatus","DayOfWeekOfDeath",
"MannerOfDeath","ActivityCode","PlaceOfInjury",
"RaceRecode5")
morbid[,cols] <- data.frame(apply(morbid[cols], 2, as.factor))
Need to either eliminate or convert “99” and “999” data,
# Get rid of rows with invalid Age, as it's key to my classification
morbid<-morbid[!(morbid$Age=="999"),]
# Recode 99 to NA and get rid of dirty data
morbid$ActivityCode[morbid$ActivityCode == 99]<- NA
morbid$PlaceOfInjury[morbid$PlaceOfInjury == 99] <- NA
morbid$DayOfWeekOfDeath[morbid$DayOfWeekOfDeath == 9] <- NA
morbid$Autopsy[morbid$Autopsy == "n"] <- "N"
morbid$Autopsy[morbid$Autopsy == "y"] <- "Y"
morbid$AgeRecode12 <- sprintf("%02d",morbid$AgeRecode12) # This is to facilitate ordering of the bins.
summary(morbid)
## Id ResidentStatus Education2003Revision MonthOfDeath
## Min. : 1 1:2087112 3 :1006044 Min. : 1.000
## 1st Qu.: 657842 2: 406899 4 : 289848 1st Qu.: 3.000
## Median :1315600 3: 132030 1 : 278044 Median : 6.000
## Mean :1315602 4: 4559 2 : 259327 Mean : 6.489
## 3rd Qu.:1973337 0 : 241593 3rd Qu.:10.000
## Max. :2631171 6 : 234315 Max. :12.000
## (Other): 321429
## Sex Age AgeRecode12
## F:1299494 Min. : 1.00 Length:2630600
## M:1331106 1st Qu.: 63.00 Class :character
## Median : 77.00 Mode :character
## Mean : 73.21
## 3rd Qu.: 87.00
## Max. :118.00
##
## PlaceOfDeathAndDecedentsStatus MaritalStatus DayOfWeekOfDeath
## 1 :795073 D:400955 4 :380438
## 4 :773741 M:980012 6 :378856
## 6 :523232 S:332792 7 :377989
## 5 :187339 U: 18411 2 :375783
## 2 :174102 W:898430 3 :372708
## 7 :161622 (Other):744700
## (Other): 15491 NA's : 126
## InjuryAtWork MannerOfDeath Autopsy ActivityCode
## N: 205296 0: 364008 n: 0 9 : 198121
## U:2420922 1: 132629 N:2258538 4 : 244
## Y: 4382 2: 43132 U: 168634 8 : 183
## 3: 16829 y: 0 1 : 170
## 4: 3666 Y: 203428 2 : 89
## 5: 10703 (Other): 43
## 7:2059633 NA's :2431750
## PlaceOfInjury Icd10Code Race RaceRecode5
## 0 : 98593 I251 : 161950 Min. : 1.000 1:2241095
## 9 : 53947 C349 : 154859 1st Qu.: 1.000 2: 309372
## 8 : 15671 F03 : 122019 Median : 1.000 3: 18025
## 1 : 9268 I219 : 114102 Mean : 1.686 4: 62108
## 4 : 7222 J449 : 107834 3rd Qu.: 1.000
## (Other): 10012 G309 : 91356 Max. :78.000
## NA's :2435887 (Other):1878480
str(morbid)
## 'data.frame': 2630600 obs. of 18 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ResidentStatus : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ Education2003Revision : Factor w/ 10 levels "0","1","2","3",..: 3 3 8 7 4 6 5 5 4 4 ...
## $ MonthOfDeath : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 2 1 2 2 1 2 2 1 2 ...
## $ Age : int 87 58 75 74 64 93 82 55 86 23 ...
## $ AgeRecode12 : chr "03" "11" "02" "12" ...
## $ PlaceOfDeathAndDecedentsStatus: Factor w/ 8 levels "1","2","3","4",..: 4 4 4 6 4 6 4 7 4 7 ...
## $ MaritalStatus : Factor w/ 5 levels "D","M","S","U",..: 2 1 5 1 1 5 2 3 5 3 ...
## $ DayOfWeekOfDeath : Factor w/ 8 levels "1","2","3","4",..: 4 3 2 1 2 4 6 6 6 4 ...
## $ InjuryAtWork : Factor w/ 3 levels "N","U","Y": 2 2 2 2 2 2 2 2 2 1 ...
## $ MannerOfDeath : Factor w/ 7 levels "0","1","2","3",..: 7 7 7 7 7 7 7 7 7 3 ...
## $ Autopsy : Factor w/ 5 levels "n","N","U","y",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ ActivityCode : Factor w/ 8 levels "0","1","2","3",..: NA NA NA NA NA NA NA NA NA 7 ...
## $ PlaceOfInjury : Factor w/ 11 levels "0","1","2","3",..: NA NA NA NA NA NA NA NA NA 6 ...
## $ Icd10Code : Factor w/ 3695 levels "A020","A021",..: 1570 1440 1762 1521 1440 1846 1441 1440 1070 3540 ...
## $ Race : int 1 1 1 1 1 1 1 2 1 1 ...
## $ RaceRecode5 : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 2 1 1 ...
OK. This is starting to look a little bit better. We are down to 18 variables. And we got rid of the 571 rows where Age was 999.
Since I don’t like writing a ton of lines of code and since I want to learn dplyr, this is a good opportunity to work on it. You’ll see that I’m still kind of a one-trick pony at this.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mdf <- dplyr::tbl_df(morbid)
options(dplyr.print_max = 1e9)
Mean/sd of Age and Month of Death (the only two continuous variables, alas).
library(scales)
# Mean and sd of Age, by MonthOfDeath
mdf %>%
group_by(MonthOfDeath) %>%
summarise_each(funs(mean(.,na.rm=TRUE),sd(.,na.rm=TRUE),min(.,na.rm=TRUE),
max(.,na.rm=TRUE)), Age)
## Source: local data frame [12 x 5]
##
## MonthOfDeath mean sd min max
## (int) (dbl) (dbl) (int) (int)
## 1 1 73.25837 17.76808 1 116
## 2 2 73.38481 17.80028 1 110
## 3 3 73.47262 17.86081 1 114
## 4 4 73.36695 17.98120 1 112
## 5 5 73.18296 18.09358 1 114
## 6 6 72.83089 18.27541 1 111
## 7 7 72.79966 18.24446 1 114
## 8 8 72.72435 18.35336 1 113
## 9 9 72.98840 18.18062 1 118
## 10 10 73.11801 18.10792 1 113
## 11 11 73.35125 18.04741 1 113
## 12 12 73.90216 17.77857 1 114
Frequencies by Day of Week, Month, Diagnostic Code (ICD10), etc.–kind of a verification/visual validation to make sure the data are OK to use.
mdf %>%
filter(complete.cases(.)) %>%
group_by(DayOfWeekOfDeath) %>%
summarise (n = n()) %>%
mutate(freq = percent(n / sum(n)))
## Source: local data frame [7 x 3]
##
## DayOfWeekOfDeath n freq
## (fctr) (int) (chr)
## 1 1 23892 14.73%
## 2 2 23248 14.33%
## 3 3 22510 13.88%
## 4 4 23013 14.19%
## 5 5 22409 13.82%
## 6 6 23063 14.22%
## 7 7 24050 14.83%
mdf %>%
filter(complete.cases(.)) %>%
group_by(MonthOfDeath) %>%
select(DayOfWeekOfDeath) %>%
table()
## DayOfWeekOfDeath
## MonthOfDeath 1 2 3 4 5 6 7 9
## 1 1723 1847 1691 2242 2111 2201 1835 0
## 2 1834 1764 1697 1680 1761 1783 1841 0
## 3 2161 2094 1699 1702 1717 1756 2300 0
## 4 1753 1734 2129 2129 1690 1767 1776 0
## 5 1772 1799 1690 1613 2144 2228 2306 0
## 6 2351 2222 1742 1811 1767 1741 1906 0
## 7 1842 1766 2119 2186 2081 1800 1912 0
## 8 2396 1905 1825 1736 1766 2184 2318 0
## 9 1912 2264 2212 1783 1746 1775 1849 0
## 10 1868 1740 1794 2178 2096 2200 1850 0
## 11 2335 1831 1767 1730 1758 1757 2352 0
## 12 1945 2282 2145 2223 1772 1871 1805 0
A peek at some diagnostic codes (there are over 3000 of them)
icd <- mdf %>%
filter(complete.cases(.)) %>%
group_by(Icd10Code) %>%
summarise (n = n()) %>%
mutate(freq = (n / sum(n))) %>%
arrange(desc(freq))
head(icd)
## Source: local data frame [6 x 3]
##
## Icd10Code n freq
## (fctr) (int) (dbl)
## 1 X44 18627 0.11485033
## 2 X42 16869 0.10401085
## 3 W19 14472 0.08923143
## 4 X74 12921 0.07966828
## 5 X70 11439 0.07053057
## 6 W18 10852 0.06691124
And a quick peek at deaths grouped by gender and age group (AgeRecode12, which divides the population into 12 age-based bins).
mdf %>%
filter(complete.cases(.)) %>%
group_by(Sex,AgeRecode12) %>%
summarise(n = n()) %>%
mutate(freq = percent(n / sum(n)))
## Source: local data frame [22 x 4]
## Groups: Sex [2]
##
## Sex AgeRecode12 n freq
## (fctr) (chr) (int) (chr)
## 1 F 01 575 1.0%
## 2 F 02 6303 11.5%
## 3 F 03 12623 23.0%
## 4 F 05 454 0.8%
## 5 F 06 441 0.8%
## 6 F 07 2800 5.1%
## 7 F 08 5325 9.7%
## 8 F 09 6304 11.5%
## 9 F 10 8618 15.7%
## 10 F 11 6975 12.7%
## 11 F 12 4399 8.0%
## 12 M 01 767 0.7%
## 13 M 02 8830 8.2%
## 14 M 03 9129 8.5%
## 15 M 05 702 0.7%
## 16 M 06 894 0.8%
## 17 M 07 11897 11.1%
## 18 M 08 17454 16.3%
## 19 M 09 15235 14.2%
## 20 M 10 18190 16.9%
## 21 M 11 15305 14.3%
## 22 M 12 8965 8.3%
Now it seems a bit strange that 23% of all female deaths occurred in the AgeRecode12 “03” group, which are the 5-14 year olds. Was there some catastrophe in 2014?
Anyway, we start thinking about how to label the observations. I’d like to see how Manner of Death lines up with the age groups.
# Start looking at death data to decide how to label the observations.
agemanner<- mdf %>%
filter(complete.cases(.)) %>%
group_by(AgeRecode12,MannerOfDeath) %>%
summarise (n = n()) %>%
mutate(freq = (n / sum(n))) %>%
arrange(desc(freq)) %>%
mutate(freq = percent(freq))
head(agemanner,n = 13)
## Source: local data frame [13 x 4]
## Groups: AgeRecode12 [2]
##
## AgeRecode12 MannerOfDeath n freq
## (chr) (fctr) (int) (chr)
## 1 01 1 1049 78.2%
## 2 01 3 169 12.6%
## 3 01 5 84 6.3%
## 4 01 7 23 1.7%
## 5 01 0 13 1.0%
## 6 01 4 4 0.3%
## 7 02 1 9414 62.2%
## 8 02 7 2516 16.6%
## 9 02 2 2390 15.8%
## 10 02 0 427 2.8%
## 11 02 3 256 1.7%
## 12 02 5 120 0.8%
## 13 02 4 10 0.1%
Hmm. Now I’m really starting to wonder about the accuracy of the AgeRecode12 variable. AgeRecode12 = 2 are the 1-4 year olds, and MannerOfDeath = 2 are death by suicide. I can’t believe that 2,390 1-4 year olds died by suicide; that would just shatter my faith in humanity, really. I think I will create my own binned age categories and see what results I get. (I realize the below nested ifelse statement is super ugly but couldn’t get a switch function to work.)
morbid$agegrp <- as.factor(ifelse(morbid$Age<5,
"01. <5yo",
ifelse(morbid$Age<=10,
"02. 5-10yo",
ifelse(morbid$Age<=15,
"03. 11-15yo",
ifelse(morbid$Age<=20,
"04. 15-20yo",
ifelse(morbid$Age<=25,
"05. 21-25yo",
ifelse(morbid$Age<=30,
"06. 26-30yo",
ifelse(morbid$Age<=40,
"07. 31-40yo",
ifelse(morbid$Age<=50,
"08. 41-50yo",
ifelse(morbid$Age<=60,
"09. 51-60yo",
ifelse(morbid$Age<=70,
"10. 61-70yo",
ifelse(morbid$Age<=80,
"11. 71-80yo",
"12. >80yo"))))))))))))
mdf <- dplyr::tbl_df(morbid)
agemanner <- mdf %>%
filter(complete.cases(.)) %>%
group_by(agegrp,MannerOfDeath) %>%
summarise (n = n()) %>%
mutate(freq = (n / sum(n))) %>%
arrange(desc(freq)) %>%
mutate(freq = percent(freq))
head(agemanner, n = 13)
## Source: local data frame [13 x 4]
## Groups: agegrp [2]
##
## agegrp MannerOfDeath n freq
## (fctr) (fctr) (int) (chr)
## 1 01. <5yo 1 1428 72.7%
## 2 01. <5yo 3 378 19.3%
## 3 01. <5yo 5 115 5.9%
## 4 01. <5yo 7 21 1.1%
## 5 01. <5yo 0 13 0.7%
## 6 01. <5yo 4 8 0.4%
## 7 02. 5-10yo 1 649 69.3%
## 8 02. 5-10yo 3 194 20.7%
## 9 02. 5-10yo 5 54 5.8%
## 10 02. 5-10yo 7 17 1.8%
## 11 02. 5-10yo 2 12 1.3%
## 12 02. 5-10yo 0 7 0.7%
## 13 02. 5-10yo 4 3 0.3%
OK. The 1-4 year old suicide group went away; my faith in humanity has been restored. I think I’ll use my own age groupings instead of what the CDC gave us. Let’s go ahead and drop the AgeRecode12 variable since it’s apparently gibberish.
drop12 <- "AgeRecode12"
morbid <- morbid[,!names(morbid) %in% drop12]
Label external causes of death (“ext”) based on MannerOfDeath codes provided:
| Label | Description | Classification (ext: 1 = yes, 0 = no) |
|---|---|---|
| 1. | Accident | Deem external cause of death (ext = 1) |
| 2. | Suicide | External |
| 3. | Homicide | External |
| 4. | Pending investigation | External |
| 5. | Could not determine | Assumed external |
| 6. | Self-inflicted | External |
| 7. | Natural | Assumed not external cause of death (ext = 0) |
| 0. | Not specified | Assumed not external |
Source: kaggle / CDC
| ICD10 Codes | Diagnosis |
|---|---|
| S00-T88 | Injury, poisoning, and certain other consequences of external causes |
| V00-Y99 | External causes of morbidity |
*Source: http://www.icd10data.com/ICD10CM/Codes*
morbid$ext <- as.factor(ifelse((morbid$MannerOfDeath %in% c("1","2","3","4","5",
"6")),1,
ifelse((substr(morbid$Icd10Code,1,1) %in%
c("S","T","V","W","X","Y")),1,0)))
mdf <- dplyr::tbl_df(morbid)
mdf %>%
group_by(ext) %>%
tally()
## Source: local data frame [2 x 2]
##
## ext n
## (fctr) (int)
## 1 0 2407613
## 2 1 222987
It looks like a little less than 10% of the total dataset will be classified as having died due to external causes. Let’s look at how external vs non-external causes of death line up against our newly created age groupings.
mdf <- dplyr::tbl_df(morbid)
mdf %>%
group_by(agegrp,ext) %>%
summarise(n = n()) %>%
mutate(freq = percent(n / sum(n))) %>%
as.data.frame()
## agegrp ext n freq
## 1 01. <5yo 0 12726 74.1%
## 2 01. <5yo 1 4450 25.9%
## 3 02. 5-10yo 0 5521 72.7%
## 4 02. 5-10yo 1 2069 27.3%
## 5 03. 11-15yo 0 2878 58%
## 6 03. 11-15yo 1 2081 42%
## 7 04. 15-20yo 0 3564 27.8%
## 8 04. 15-20yo 1 9263 72.2%
## 9 05. 21-25yo 0 5489 25.9%
## 10 05. 21-25yo 1 15683 74.1%
## 11 06. 26-30yo 0 7518 32.9%
## 12 06. 26-30yo 1 15314 67.1%
## 13 07. 31-40yo 0 28701 49.9704%
## 14 07. 31-40yo 1 28735 50.0296%
## 15 08. 41-50yo 0 87009 73%
## 16 08. 41-50yo 1 32227 27%
## 17 09. 51-60yo 0 250072 87.2%
## 18 09. 51-60yo 1 36730 12.8%
## 19 10. 61-70yo 0 404360 94.7%
## 20 10. 61-70yo 1 22431 5.3%
## 21 11. 71-80yo 0 527541 96.6%
## 22 11. 71-80yo 1 18564 3.4%
## 23 12. >80yo 0 1072234 96.8%
## 24 12. >80yo 1 35440 3.2%
We see that for under 10yo and over 40yo, “non-external” causes of death far outnumber the externals. However starting with the 11-15yo group and through the 31-40yo group, there is a greatly increased frequency of death by external causes, with the maximum rate of death by external causes seen in the 26-30yo group.
As an aside, I would have loved to make some plots of these, but could not for the life of me figure out how to create plots from dplyr layouts and did not have time to go back to the raw data to recreate. I think I ought to continue to work on this over the summer.
mdf %>%
group_by(Sex,ext) %>%
summarise(n = n()) %>%
mutate(freq = percent(n / sum(n))) %>%
as.data.frame()
## Sex ext n freq
## 1 F 0 1224371 94.2%
## 2 F 1 75123 5.8%
## 3 M 0 1183242 88.9%
## 4 M 1 147864 11.1%
Perhaps unsurprisingly, it appears that males die by external causes at a greater rate than females.
1: White; 2: Black; 3: American Indian; 4: Asian or Pacific Islander
mdf %>%
group_by(RaceRecode5,ext) %>%
summarise(n = n()) %>%
mutate(freq = percent(n / sum(n))) %>%
as.data.frame()
## RaceRecode5 ext n freq
## 1 1 0 2055056 91.7%
## 2 1 1 186039 8.3%
## 3 2 0 280444 90.6%
## 4 2 1 28928 9.4%
## 5 3 0 14871 82.5%
## 6 3 1 3154 17.5%
## 7 4 0 57242 92.2%
## 8 4 1 4866 7.8%
The group that stands out are the American Indians–they seem to have twice the rate of death by external causes compared with everybody else. That’s just horribly tragic.
(For the purposes of the classification algorithms, I will also be including the variable “Race” which is more detailed.)
*** After running all code in RStudio successfully and writing my entire report, I found that when rendering in RMarkdown uses more memory than running in RStudio. So I went back and halved my dataset. So the final run where I actually run the models, will be using a 50/25/25 of 10% of the data. This was a major bummer but a good lesson about why we don’t want to choose too big of a dataset.***
I’ll be using the caret package for this task. Since we have an overabundance of data, I’m going to use a 50/25/25 split on the training, validation, and test datasets.
But first, I need to remove the MannerOfDeath and Icd10Code fields, since those are the fields from which I created the label. I also would like to remove the “Age” variable, since it would be highly correlated with agegrp. Finally remove “Id”, which has no meaning.
drop2 <- c("MannerOfDeath","Icd10Code","Age","Id")
m_all <- morbid[,!(names(morbid) %in% drop2)]
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(432)
trainIndex <- createDataPartition(m_all$ext, p = .5,list=FALSE,times=1)
mtrain <- m_all[trainIndex,]
mrest <- m_all[-trainIndex,]
valIndex <- createDataPartition(m_all$ext, p = .5,list=FALSE,times=1)
mval <- mrest[valIndex,]
mtest <- mrest[-valIndex,]
The classification is around what factors predict a death by external causes. We’ve looked at a few potentially relevant factors, such as gender, age group and race. I’ll be using C5.0 and Random Forest–unfortunately the same algorithms I used for Project 1, but they are useful for this type of data and I am running out of time.
library(C50)
str(m_all)
## 'data.frame': 2630600 obs. of 15 variables:
## $ ResidentStatus : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ Education2003Revision : Factor w/ 10 levels "0","1","2","3",..: 3 3 8 7 4 6 5 5 4 4 ...
## $ MonthOfDeath : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 2 1 2 2 1 2 2 1 2 ...
## $ PlaceOfDeathAndDecedentsStatus: Factor w/ 8 levels "1","2","3","4",..: 4 4 4 6 4 6 4 7 4 7 ...
## $ MaritalStatus : Factor w/ 5 levels "D","M","S","U",..: 2 1 5 1 1 5 2 3 5 3 ...
## $ DayOfWeekOfDeath : Factor w/ 8 levels "1","2","3","4",..: 4 3 2 1 2 4 6 6 6 4 ...
## $ InjuryAtWork : Factor w/ 3 levels "N","U","Y": 2 2 2 2 2 2 2 2 2 1 ...
## $ Autopsy : Factor w/ 5 levels "n","N","U","y",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ ActivityCode : Factor w/ 8 levels "0","1","2","3",..: NA NA NA NA NA NA NA NA NA 7 ...
## $ PlaceOfInjury : Factor w/ 11 levels "0","1","2","3",..: NA NA NA NA NA NA NA NA NA 6 ...
## $ Race : int 1 1 1 1 1 1 1 2 1 1 ...
## $ RaceRecode5 : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 2 1 1 ...
## $ agegrp : Factor w/ 12 levels "01. <5yo",..: 12 9 11 11 10 12 12 9 12 5 ...
## $ ext : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
m1 <- C5.0(x=mtrain[-15], y=mtrain$ext)
summary(m1)
##
## Call:
## C5.0.default(x = mtrain[-15], y = mtrain$ext)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Jun 9 23:06:41 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 1315301 cases (15 attributes) from undefined.data
##
## Decision tree:
##
## InjuryAtWork = U:
## :...Autopsy in {n,y}: 0 (0)
## : Autopsy in {N,U}:
## : :...PlaceOfInjury = 99: 0 (0)
## : : PlaceOfInjury in {0,1,2,3,4,5,6,7,8}:
## : : :...Autopsy = N:
## : : : :...Education2003Revision in {1,2,3,4,5,6,7,8,9}: 0 (70439.2/402.7)
## : : : : Education2003Revision = 0:
## : : : : :...agegrp in {01. <5yo,02. 5-10yo,08. 41-50yo,09. 51-60yo,
## : : : : : 10. 61-70yo,11. 71-80yo,
## : : : : : 12. >80yo}: 0 (6827.7/286.9)
## : : : : agegrp in {03. 11-15yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,
## : : : : : 07. 31-40yo}:
## : : : : :...PlaceOfDeathAndDecedentsStatus in {1,3,5,
## : : : : : 6}: 0 (74.7/18.8)
## : : : : PlaceOfDeathAndDecedentsStatus in {2,4,7,9}:
## : : : : :...Sex = F: 0 (29.1/11.4)
## : : : : Sex = M: 1 (80.5/25.5)
## : : : Autopsy = U:
## : : : :...agegrp in {10. 61-70yo,11. 71-80yo,12. >80yo}: 0 (5089.6/246.4)
## : : : agegrp in {04. 15-20yo,05. 21-25yo,06. 26-30yo,07. 31-40yo}:
## : : : :...MonthOfDeath <= 6: 0 (52.7/14)
## : : : : MonthOfDeath > 6:
## : : : : :...PlaceOfDeathAndDecedentsStatus in {5,9}: 0 (3.4/0.1)
## : : : : PlaceOfDeathAndDecedentsStatus in {2,3,4,6,
## : : : : : 7}: 1 (237.1/22.4)
## : : : : PlaceOfDeathAndDecedentsStatus = 1:
## : : : : :...Sex = F: 0 (19.5/5.6)
## : : : : Sex = M: 1 (49.4/16.6)
## : : : agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,08. 41-50yo,
## : : : : 09. 51-60yo}:
## : : : :...MonthOfDeath <= 6: 0 (374.1/19.3)
## : : : MonthOfDeath > 6:
## : : : :...PlaceOfDeathAndDecedentsStatus in {3,7}: 1 (95.4/21.5)
## : : : PlaceOfDeathAndDecedentsStatus in {1,5,6,
## : : : : 9}: 0 (313.2/40.5)
## : : : PlaceOfDeathAndDecedentsStatus in {2,4}:
## : : : :...agegrp in {01. <5yo,02. 5-10yo,
## : : : : 03. 11-15yo}: 1 (21.6/3.3)
## : : : agegrp = 09. 51-60yo: 0 (212.9/67.2)
## : : : agegrp = 08. 41-50yo:
## : : : :...RaceRecode5 in {1,4}: 1 (90.6/35.2)
## : : : RaceRecode5 in {2,3}: 0 (13.5/4.4)
## : : PlaceOfInjury = 9:
## : : :...agegrp in {09. 51-60yo,10. 61-70yo,11. 71-80yo,
## : : : 12. >80yo}: 0 (1020393/7596.9)
## : : agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,04. 15-20yo,05. 21-25yo,
## : : : 06. 26-30yo,07. 31-40yo,08. 41-50yo}:
## : : :...PlaceOfDeathAndDecedentsStatus in {1,2,4,5,6,
## : : : 9}: 0 (55534/1225.3)
## : : PlaceOfDeathAndDecedentsStatus in {3,7}:
## : : :...Autopsy = N: 0 (1998.4/162.3)
## : : Autopsy = U:
## : : :...agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,
## : : : 08. 41-50yo}: 0 (151.5/49.3)
## : : agegrp in {04. 15-20yo,05. 21-25yo,06. 26-30yo,
## : : : 07. 31-40yo}:
## : : :...MonthOfDeath > 4: 1 (123.1/27.9)
## : : MonthOfDeath <= 4:
## : : :...Sex = F: 0 (9.3/1)
## : : Sex = M:
## : : :...Education2003Revision in {0,2,3,5,6,7,8,
## : : : 9}: 1 (8.6/0.9)
## : : Education2003Revision in {1,4}: 0 (2.8)
## : Autopsy = Y:
## : :...Education2003Revision = 0:
## : :...agegrp in {09. 51-60yo,10. 61-70yo,11. 71-80yo,
## : : : 12. >80yo}: 0 (2872/597)
## : : agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,04. 15-20yo,05. 21-25yo,
## : : : 06. 26-30yo,07. 31-40yo,08. 41-50yo}:
## : : :...PlaceOfDeathAndDecedentsStatus in {1,5,6}: 0 (472/122)
## : : PlaceOfDeathAndDecedentsStatus in {2,3,4,7,9}:
## : : :...agegrp in {07. 31-40yo,08. 41-50yo}:
## : : :...PlaceOfDeathAndDecedentsStatus in {2,3,
## : : : : 4}: 0 (959/363)
## : : : PlaceOfDeathAndDecedentsStatus in {7,9}:
## : : : :...RaceRecode5 = 3: 0 (5)
## : : : RaceRecode5 in {1,2,4}:
## : : : :...ResidentStatus in {1,2}: 1 (519/212)
## : : : ResidentStatus = 4: 0 (2)
## : : : ResidentStatus = 3:
## : : : :...RaceRecode5 in {1,4}: 0 (32/13)
## : : : RaceRecode5 = 2: 1 (9/3)
## : : agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,04. 15-20yo,
## : : : 05. 21-25yo,06. 26-30yo}:
## : : :...PlaceOfDeathAndDecedentsStatus in {3,7,
## : : : 9}: 1 (273/54)
## : : PlaceOfDeathAndDecedentsStatus in {2,4}:
## : : :...Sex = M: 1 (328/98)
## : : Sex = F:
## : : :...ResidentStatus = 4: 1 (0)
## : : ResidentStatus = 3: 0 (5)
## : : ResidentStatus in {1,2}:
## : : :...agegrp in {01. <5yo,03. 11-15yo,
## : : : 05. 21-25yo}: 1 (77/25)
## : : agegrp = 06. 26-30yo: 0 (33/13)
## : : agegrp = 02. 5-10yo:
## : : :...MonthOfDeath <= 8: 1 (10/4)
## : : : MonthOfDeath > 8: 0 (6)
## : : agegrp = 04. 15-20yo:
## : : :...ResidentStatus = 2: 0 (2)
## : : ResidentStatus = 1:
## : : :...Race <= 1: 1 (9/2)
## : : Race > 1: 0 (2)
## : Education2003Revision in {1,2,3,4,5,6,7,8,9}:
## : :...agegrp in {03. 11-15yo,08. 41-50yo,09. 51-60yo,10. 61-70yo,
## : : 11. 71-80yo,12. >80yo}: 0 (33152/2737)
## : agegrp in {01. <5yo,02. 5-10yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,
## : : 07. 31-40yo}:
## : :...PlaceOfDeathAndDecedentsStatus = 1: 0 (2466/275)
## : PlaceOfDeathAndDecedentsStatus in {2,4,5,6}:
## : :...agegrp in {02. 5-10yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,
## : : : 07. 31-40yo}: 0 (4587/1067)
## : : agegrp = 01. <5yo:
## : : :...RaceRecode5 = 3: 1 (31/9)
## : : RaceRecode5 = 4: 0 (23/6)
## : : RaceRecode5 in {1,2}:
## : : :...DayOfWeekOfDeath in {1,5}: 1 (224/104)
## : : DayOfWeekOfDeath in {2,3,6,7,9}: 0 (463/216)
## : : DayOfWeekOfDeath = 4:
## : : :...ResidentStatus = 2: 0 (13/3)
## : : ResidentStatus in {3,4}: 1 (4/1)
## : : ResidentStatus = 1: [S1]
## : PlaceOfDeathAndDecedentsStatus in {3,7,9}:
## : :...ResidentStatus = 4: 1 (36/7)
## : ResidentStatus in {1,2,3}:
## : :...agegrp = 07. 31-40yo: 0 (622/233)
## : agegrp in {01. <5yo,02. 5-10yo,04. 15-20yo,05. 21-25yo,
## : : 06. 26-30yo}:
## : :...Race <= 1:
## : :...Education2003Revision in {2,3,5,8,
## : : : 9}: 1 (251/86)
## : : Education2003Revision in {6,7}: 0 (30/11)
## : : Education2003Revision = 1:
## : : :...DayOfWeekOfDeath in {1,2,7,9}: 1 (40/7)
## : : : DayOfWeekOfDeath in {3,4,6}: 0 (36/15)
## : : : DayOfWeekOfDeath = 5: [S2]
## : : Education2003Revision = 4: [S3]
## : Race > 1:
## : :...MaritalStatus = W: 0 (0)
## : MaritalStatus in {D,U}: 1 (3)
## : MaritalStatus = M:
## : :...Sex = F: 0 (5/1)
## : : Sex = M: 1 (5/1)
## : MaritalStatus = S:
## : :...Education2003Revision in {2,4,5,7,8,
## : : 9}: 0 (79/26)
## : Education2003Revision = 6:
## : :...Race <= 7: 0 (6/1)
## : : Race > 7: 1 (2)
## : Education2003Revision = 3:
## : :...DayOfWeekOfDeath in {1,4,
## : : : 6}: 1 (12/3)
## : : DayOfWeekOfDeath in {2,3,5,
## : : : 9}: 0 (14/2)
## : : DayOfWeekOfDeath = 7:
## : : :...MonthOfDeath <= 7: 1 (5)
## : : MonthOfDeath > 7: 0 (5/1)
## : Education2003Revision = 1:
## : :...agegrp in {02. 5-10yo,04. 15-20yo,
## : : 05. 21-25yo}: 0 (12/3)
## : agegrp = 06. 26-30yo: 1 (2)
## : agegrp = 01. <5yo:
## : :...RaceRecode5 = 3: 0 (3)
## : RaceRecode5 in {1,4}: 1 (1)
## : RaceRecode5 = 2:
## : :...Sex = F: 1 (16/2)
## : Sex = M: 0 (18/6)
## InjuryAtWork in {N,Y}:
## :...PlaceOfDeathAndDecedentsStatus in {1,2,3,4,5,7,9}:
## :...Autopsy in {n,y,Y}: 1 (53442/668)
## : Autopsy in {N,U}:
## : :...Education2003Revision = 9:
## : :...MaritalStatus = U: 1 (226/23)
## : : MaritalStatus in {D,M,S,W}:
## : : :...ResidentStatus = 4: 1 (7)
## : : ResidentStatus in {1,2,3}:
## : : :...InjuryAtWork = Y: 1 (10/1)
## : : InjuryAtWork = N:
## : : :...agegrp in {03. 11-15yo,10. 61-70yo,11. 71-80yo,
## : : : 12. >80yo}: 0 (1128/288)
## : : agegrp in {01. <5yo,02. 5-10yo,04. 15-20yo,
## : : : 05. 21-25yo,06. 26-30yo,07. 31-40yo,
## : : : 08. 41-50yo,09. 51-60yo}:
## : : :...PlaceOfDeathAndDecedentsStatus in {1,2,3,4,7,
## : : : 9}: 1 (299/91)
## : : PlaceOfDeathAndDecedentsStatus = 5:
## : : :...agegrp in {02. 5-10yo,04. 15-20yo,
## : : : 05. 21-25yo,08. 41-50yo,
## : : : 09. 51-60yo}: 0 (53/3)
## : : agegrp in {01. <5yo,06. 26-30yo,
## : : : 07. 31-40yo}:
## : : :...MonthOfDeath <= 5: 0 (2)
## : : MonthOfDeath > 5: 1 (4)
## : Education2003Revision in {0,1,2,3,4,5,6,7,8}:
## : :...agegrp in {03. 11-15yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,
## : : 07. 31-40yo,08. 41-50yo,
## : : 09. 51-60yo}: 1 (19827/1243)
## : agegrp in {01. <5yo,02. 5-10yo,10. 61-70yo,11. 71-80yo,
## : : 12. >80yo}:
## : :...PlaceOfDeathAndDecedentsStatus in {1,2,3,5,7}:
## : :...Autopsy = N: 1 (17997/3192)
## : : Autopsy = U:
## : : :...ResidentStatus in {2,3,4}: 1 (206/33)
## : : ResidentStatus = 1:
## : : :...Race > 1: 0 (26/10)
## : : Race <= 1:
## : : :...MonthOfDeath <= 5: 0 (99/44)
## : : MonthOfDeath > 5: 1 (218/69)
## : PlaceOfDeathAndDecedentsStatus in {4,9}:
## : :...Sex = M:
## : :...Education2003Revision = 0: 1 (199/20)
## : : Education2003Revision in {1,2,3,4,5,6,7,8}:
## : : :...ResidentStatus in {3,4}: 1 (17/1)
## : : ResidentStatus = 1:
## : : :...Autopsy = N: [S4]
## : : : Autopsy = U:
## : : : :...agegrp in {01. <5yo,10. 61-70yo,
## : : : : 11. 71-80yo}: 1 (72/31)
## : : : agegrp in {02. 5-10yo,
## : : : 12. >80yo}: 0 (39/15)
## : : ResidentStatus = 2:
## : : :...DayOfWeekOfDeath in {1,3,7}: 0 (8/1)
## : : DayOfWeekOfDeath in {2,4,9}: 1 (12/1)
## : : DayOfWeekOfDeath = 5:
## : : :...agegrp = 10. 61-70yo: 1 (2)
## : : : agegrp in {01. <5yo,02. 5-10yo,
## : : : 11. 71-80yo,
## : : : 12. >80yo}: 0 (3)
## : : DayOfWeekOfDeath = 6:
## : : :...MonthOfDeath <= 2: 0 (2)
## : : MonthOfDeath > 2: 1 (3)
## : Sex = F:
## : :...Race > 2: 1 (42/9)
## : Race <= 2:
## : :...PlaceOfInjury in {2,3,4,5,6,7,8,
## : : 99}: 1 (74.6/34)
## : PlaceOfInjury = 9: 0 (129.6/53.5)
## : PlaceOfInjury = 1:
## : :...agegrp in {01. <5yo,10. 61-70yo,
## : : : 11. 71-80yo}: 0 (37.2/9.5)
## : : agegrp in {02. 5-10yo,
## : : 12. >80yo}: 1 (86.5/34.7)
## : PlaceOfInjury = 0:
## : :...agegrp in {01. <5yo,02. 5-10yo,10. 61-70yo}:
## : :...ResidentStatus in {1,2,
## : : : 4}: 1 (480.6/168)
## : : ResidentStatus = 3: 0 (3.4)
## : agegrp in {11. 71-80yo,12. >80yo}:
## : :...Education2003Revision in {0,
## : : 6}: 1 (168.7/70.1)
## : Education2003Revision = 7: 0 (44.4/19.7)
## : Education2003Revision = 1: [S5]
## : Education2003Revision = 4:
## : :...MaritalStatus in {D,S,U,
## : : : W}: 1 (110.3/42.7)
## : : MaritalStatus = M: 0 (36.1/13)
## : Education2003Revision = 8:
## : :...MaritalStatus in {D,U,
## : : : W}: 1 (7.7/1.7)
## : : MaritalStatus in {M,S}: 0 (4.3)
## : Education2003Revision = 5:
## : :...ResidentStatus in {2,3,
## : : : 4}: 1 (3)
## : : ResidentStatus = 1:
## : : :...MonthOfDeath <= 6: 0 (36.1/14.9)
## : : MonthOfDeath > 6: 1 (42.7/18)
## : Education2003Revision = 2:
## : :...DayOfWeekOfDeath in {1,
## : : : 3}: 1 (33.8/11.9)
## : : DayOfWeekOfDeath in {2,4,6,7,
## : : : 9}: 0 (76.3/30.7)
## : : DayOfWeekOfDeath = 5: [S6]
## : Education2003Revision = 3: [S7]
## PlaceOfDeathAndDecedentsStatus = 6:
## :...Autopsy in {n,y,Y}: 1 (260/8)
## Autopsy in {N,U}:
## :...Education2003Revision = 9:
## :...MaritalStatus in {D,M,S,W}: 0 (667/50)
## : MaritalStatus = U: 1 (11/4)
## Education2003Revision in {0,1,2,3,4,5,6,7,8}:
## :...Education2003Revision = 0: 1 (305/46)
## Education2003Revision in {1,2,3,4,5,6,7,8}:
## :...Race > 2: 1 (47/8)
## Race <= 2:
## :...agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,05. 21-25yo,
## : 06. 26-30yo,07. 31-40yo,
## : 08. 41-50yo}: 1 (72/13)
## agegrp in {04. 15-20yo,09. 51-60yo,10. 61-70yo,
## : 11. 71-80yo,12. >80yo}:
## :...Autopsy = U: 0 (129/40)
## Autopsy = N:
## :...InjuryAtWork = Y: 0 (38/13)
## InjuryAtWork = N:
## :...Education2003Revision in {1,4,5,6,7,8}:
## :...MaritalStatus in {D,M,U,
## : : W}: 1 (1666/686)
## : MaritalStatus = S:
## : :...agegrp in {04. 15-20yo,
## : : 12. >80yo}: 0 (85/30)
## : agegrp in {09. 51-60yo,
## : : 11. 71-80yo}: 1 (39/16)
## : agegrp = 10. 61-70yo:
## : :...Sex = M: 1 (12/3)
## : Sex = F: [S8]
## Education2003Revision in {2,3}:
## :...Race > 1: 0 (84/30)
## Race <= 1:
## :...agegrp in {04. 15-20yo,
## : 09. 51-60yo}: 1 (58/22)
## agegrp = 10. 61-70yo:
## :...MaritalStatus in {D,U,
## : : W}: 0 (70/23)
## : MaritalStatus = S: [S9]
## : MaritalStatus = M: [S10]
## agegrp = 11. 71-80yo: [S11]
## agegrp = 12. >80yo:
## :...MaritalStatus in {D,M,
## : U}: 1 (484/204)
## MaritalStatus = S: 0 (102/35)
## MaritalStatus = W: [S12]
##
## SubTree [S1]
##
## PlaceOfDeathAndDecedentsStatus in {2,5,6}: 1 (68/30)
## PlaceOfDeathAndDecedentsStatus = 4: 0 (29/11)
##
## SubTree [S2]
##
## PlaceOfDeathAndDecedentsStatus = 3: 0 (4)
## PlaceOfDeathAndDecedentsStatus in {7,9}: 1 (9/2)
##
## SubTree [S3]
##
## PlaceOfDeathAndDecedentsStatus = 3: 1 (9/1)
## PlaceOfDeathAndDecedentsStatus = 9: 0 (1)
## PlaceOfDeathAndDecedentsStatus = 7:
## :...ResidentStatus in {2,3}: 1 (31/11)
## ResidentStatus = 1:
## :...MonthOfDeath <= 2: 1 (5/1)
## MonthOfDeath > 2: 0 (21/6)
##
## SubTree [S4]
##
## PlaceOfDeathAndDecedentsStatus = 4: 1 (3041/997)
## PlaceOfDeathAndDecedentsStatus = 9: 0 (7/2)
##
## SubTree [S5]
##
## PlaceOfDeathAndDecedentsStatus = 4: 1 (144.3/60)
## PlaceOfDeathAndDecedentsStatus = 9: 0 (2.6)
##
## SubTree [S6]
##
## PlaceOfDeathAndDecedentsStatus = 9: 1 (1)
## PlaceOfDeathAndDecedentsStatus = 4:
## :...MonthOfDeath <= 2: 1 (3.9/0.9)
## MonthOfDeath > 2: 0 (11.4/2)
##
## SubTree [S7]
##
## PlaceOfDeathAndDecedentsStatus = 9: 0 (8.7/1)
## PlaceOfDeathAndDecedentsStatus = 4:
## :...MaritalStatus in {S,W}: 0 (424.7/191.1)
## MaritalStatus = U: 1 (1)
## MaritalStatus = D:
## :...Autopsy = N: 1 (50.8/21.3)
## : Autopsy = U: 0 (3.6/1)
## MaritalStatus = M:
## :...ResidentStatus in {3,4}: 0 (0)
## ResidentStatus = 2: 1 (3)
## ResidentStatus = 1:
## :...agegrp = 11. 71-80yo: 0 (75.8/23.9)
## agegrp = 12. >80yo:
## :...DayOfWeekOfDeath in {2,3,7}: 0 (37.4/12.7)
## DayOfWeekOfDeath in {4,5,6,9}: 1 (46.6/19)
## DayOfWeekOfDeath = 1:
## :...MonthOfDeath <= 4: 1 (5)
## MonthOfDeath > 4: 0 (6.3/2)
##
## SubTree [S8]
##
## Education2003Revision in {1,4,5,6,8}: 0 (6)
## Education2003Revision = 7: 1 (2)
##
## SubTree [S9]
##
## Education2003Revision = 2: 0 (5/1)
## Education2003Revision = 3: 1 (22/8)
##
## SubTree [S10]
##
## ResidentStatus = 3: 1 (1)
## ResidentStatus = 1:
## :...Sex = F: 0 (15/4)
## : Sex = M: 1 (14/4)
## ResidentStatus in {2,4}:
## :...Sex = F: 1 (4/1)
## Sex = M: 0 (8/2)
##
## SubTree [S11]
##
## DayOfWeekOfDeath in {2,3,9}: 0 (98/28)
## DayOfWeekOfDeath = 5: 1 (39/18)
## DayOfWeekOfDeath = 1:
## :...MaritalStatus in {D,S,U}: 0 (10/2)
## : MaritalStatus in {M,W}: 1 (32/13)
## DayOfWeekOfDeath = 4:
## :...MonthOfDeath <= 11: 0 (38/11)
## : MonthOfDeath > 11: 1 (8/1)
## DayOfWeekOfDeath = 6:
## :...ResidentStatus in {2,3,4}: 0 (3)
## : ResidentStatus = 1:
## : :...Sex = F: 1 (19/7)
## : Sex = M: 0 (13/2)
## DayOfWeekOfDeath = 7:
## :...MaritalStatus = D: 1 (11/4)
## MaritalStatus in {M,S,U}: 0 (19/5)
## MaritalStatus = W:
## :...ResidentStatus in {3,4}: 0 (0)
## ResidentStatus = 2: 1 (3)
## ResidentStatus = 1:
## :...MonthOfDeath <= 7: 0 (8)
## MonthOfDeath > 7: 1 (7/2)
##
## SubTree [S12]
##
## DayOfWeekOfDeath in {1,2,7,9}: 0 (601/260)
## DayOfWeekOfDeath = 4: 1 (220/103)
## DayOfWeekOfDeath = 6:
## :...MonthOfDeath <= 4: 0 (65/20)
## : MonthOfDeath > 4: 1 (115/51)
## DayOfWeekOfDeath = 5:
## :...PlaceOfInjury in {1,2,3,4,5,6,7,99}: 1 (104.5/46.4)
## : PlaceOfInjury in {8,9}: 0 (10.8/3.1)
## : PlaceOfInjury = 0:
## : :...Education2003Revision = 2: 0 (7.9/2)
## : Education2003Revision = 3: 1 (58.8/25.1)
## DayOfWeekOfDeath = 3:
## :...Education2003Revision = 2: 0 (32/12)
## Education2003Revision = 3:
## :...MonthOfDeath > 9: 0 (54/25)
## MonthOfDeath <= 9:
## :...MonthOfDeath <= 3: 0 (31/12)
## MonthOfDeath > 3: 1 (92/42)
##
##
## Evaluation on training data (1315301 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 197 26128( 2.0%) <<
##
##
## (a) (b) <-classified as
## ------ ------
## 1194916 8891 (a): class 0
## 17237 94257 (b): class 1
##
##
## Attribute usage:
##
## 100.00% InjuryAtWork
## 100.00% Autopsy
## 95.80% agegrp
## 88.38% Education2003Revision
## 14.05% PlaceOfDeathAndDecedentsStatus
## 1.57% PlaceOfInjury
## 1.12% MonthOfDeath
## 0.63% ResidentStatus
## 0.60% Race
## 0.57% MaritalStatus
## 0.57% Sex
## 0.22% DayOfWeekOfDeath
## 0.16% RaceRecode5
##
##
## Time: 21.2 secs
A look at the results indicates that I should look at a few more of the variables in detail, such as InjuryAtWork, Autopsy, and PlaceOfDeathAndDecedentsStatus. It’s possible that those contain information that would unreasonably correlate with deaths from external causes.
mdf <- dplyr::tbl_df(m_all)
mdf %>%
group_by(ext,InjuryAtWork) %>%
summarise(n=n()) %>%
mutate(freq=percent(n/sum(n)))
## Source: local data frame [6 x 4]
## Groups: ext [2]
##
## ext InjuryAtWork n freq
## (fctr) (fctr) (int) (chr)
## 1 0 N 22371 0.9%
## 2 0 U 2385048 99.1%
## 3 0 Y 194 0.0%
## 4 1 N 182925 82.0%
## 5 1 U 35874 16.1%
## 6 1 Y 4188 1.9%
mdf %>%
group_by(ext,Autopsy) %>%
summarise(n=n()) %>%
mutate(freq=percent(n/sum(n)))
## Source: local data frame [6 x 4]
## Groups: ext [2]
##
## ext Autopsy n freq
## (fctr) (fctr) (int) (chr)
## 1 0 N 2162001 89.8%
## 2 0 U 162013 6.7%
## 3 0 Y 83599 3.5%
## 4 1 N 96537 43.3%
## 5 1 U 6621 3.0%
## 6 1 Y 119829 53.7%
mdf %>%
group_by(ext,PlaceOfDeathAndDecedentsStatus) %>%
summarise(n=n()) %>%
mutate(freq=percent(n/sum(n)))
## Source: local data frame [16 x 4]
## Groups: ext [2]
##
## ext PlaceOfDeathAndDecedentsStatus n freq
## (fctr) (fctr) (int) (chr)
## 1 0 1 744897 30.9%
## 2 0 2 143056 5.9%
## 3 0 3 9057 0.4%
## 4 0 4 706143 29.3%
## 5 0 5 181323 7.5%
## 6 0 6 514489 21.4%
## 7 0 7 106709 4.4%
## 8 0 9 1939 0.1%
## 9 1 1 50176 22.5%
## 10 1 2 31046 13.9%
## 11 1 3 3939 1.8%
## 12 1 4 67598 30.3%
## 13 1 5 6016 2.7%
## 14 1 6 8743 3.9%
## 15 1 7 54913 24.6%
## 16 1 9 556 0.2%
Looking at the results, it seems to make sense to remove these for the following reasons:
InjuryAtWork: It seems that when the work injury is included on the death report, there may be a high likelihood for the death to be related to the workplace injury, so this probably needs to be removed.
Autopsy: For the death by non-external causes group, almost 90% of the deceased did not have an autopsy. In the death by external causes group, over half of the deceased did have an autopsy. Since having an autopsy is not a predictor of dying, this needs to be removed.
PlaceOfDeathAndDecedentsStatus: The things that pop out are groups 6 and 7. It turns out that 6 is “Residential Facility”, such as a nursing home. This would correlate too strongly with older people dying in a nursing home. This needs to be removed, and the model rerun.
It turns out that ActivityCode and PlaceOfInjury are mostly NAs and will screw up the random forest; therefore these need to be removed as well.
drop3 <- c("InjuryAtWork","Autopsy","PlaceOfDeathAndDecedentsStatus","ActivityCode","PlaceOfInjury")
m_all <- m_all[,!(names(m_all) %in% drop3)]
set.seed(4321)
# HAD TO CUT THE DATA BY 90% AFTER THE FACT BECAUSE RENDERING IN RMARKDOWN OVERSHOT MY 16GB MEMORY FOR THE RF.
cutIndex <- createDataPartition(m_all$ext, p = .9,list=FALSE,times=1)
mhalf <- m_all[cutIndex,]
mrest <- m_all[-cutIndex,]
# MODELING ON A 50/25/25 PARTITION OF 10% OF THE DATA.
trainIndex <- createDataPartition(m_all$ext, p = .5,list=FALSE,times=1)
mtrain <- mrest[trainIndex,]
mrest2 <- mrest[-trainIndex,]
valIndex <- createDataPartition(m_all$ext, p = .5,list=FALSE,times=1)
mval <- mrest2[valIndex,]
mtest <- mrest2[-valIndex,]
str(mtrain)
## 'data.frame': 1315301 obs. of 10 variables:
## $ ResidentStatus : Factor w/ 4 levels "1","2","3","4": 3 1 1 1 1 1 2 1 1 1 ...
## $ Education2003Revision: Factor w/ 10 levels "0","1","2","3",..: 5 4 4 3 6 3 4 2 4 6 ...
## $ MonthOfDeath : int 1 1 1 1 1 1 1 2 2 2 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 2 2 1 2 1 1 1 1 1 ...
## $ MaritalStatus : Factor w/ 5 levels "D","M","S","U",..: 3 2 1 2 2 3 1 5 2 5 ...
## $ DayOfWeekOfDeath : Factor w/ 8 levels "1","2","3","4",..: 7 7 7 7 6 4 1 1 3 7 ...
## $ Race : int 1 1 1 3 1 3 3 1 1 1 ...
## $ RaceRecode5 : Factor w/ 4 levels "1","2","3","4": 1 1 1 3 1 3 3 1 1 1 ...
## $ agegrp : Factor w/ 12 levels "01. <5yo",..: 5 10 8 12 9 7 9 12 12 12 ...
## $ ext : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 1 1 1 ...
m2 <- C5.0(x=mtrain[-10],y=mtrain$ext)
summary(m2)
##
## Call:
## C5.0.default(x = mtrain[-10], y = mtrain$ext)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Jun 9 23:07:40 2016
## -------------------------------
##
## Class specified by attribute `outcome'
## *** ignoring cases with bad or unknown class
##
## Read 131205 cases (10 attributes) from undefined.data
##
## Decision tree:
##
## agegrp in {01. <5yo,02. 5-10yo,08. 41-50yo,09. 51-60yo,10. 61-70yo,11. 71-80yo,
## : 12. >80yo}: 0 (125243/7603)
## agegrp in {03. 11-15yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,07. 31-40yo}:
## :...agegrp in {04. 15-20yo,05. 21-25yo,06. 26-30yo}:
## :...Education2003Revision in {0,2,3,4,5,8}:
## : :...Sex = M: 1 (1805/354)
## : : Sex = F:
## : : :...RaceRecode5 in {1,3,4}: 1 (533/149)
## : : RaceRecode5 = 2:
## : : :...agegrp in {04. 15-20yo,05. 21-25yo}: 1 (69/30)
## : : agegrp = 06. 26-30yo: 0 (50/17)
## : Education2003Revision in {1,6,7,9}:
## : :...agegrp = 04. 15-20yo: 0 (58/9)
## : agegrp in {05. 21-25yo,06. 26-30yo}:
## : :...RaceRecode5 = 2: 0 (48/18)
## : RaceRecode5 = 3: 1 (2)
## : RaceRecode5 = 4:
## : :...Race <= 28: 0 (9/3)
## : : Race > 28: 1 (5)
## : RaceRecode5 = 1:
## : :...MaritalStatus in {D,M,U,W}: 1 (51/18)
## : MaritalStatus = S:
## : :...Education2003Revision in {6,7,9}: 1 (119/42)
## : Education2003Revision = 1:
## : :...Sex = F: 0 (22/3)
## : Sex = M:
## : :...DayOfWeekOfDeath in {4,6,9}: 0 (13/1)
## : DayOfWeekOfDeath in {1,2,3,5,7}:
## : :...ResidentStatus in {1,4}: 1 (27/5)
## : ResidentStatus in {2,3}: 0 (14/4)
## agegrp in {03. 11-15yo,07. 31-40yo}:
## :...Sex = M:
## :...RaceRecode5 in {2,4}:
## : :...MaritalStatus = U: 0 (1)
## : : MaritalStatus = W: 1 (3)
## : : MaritalStatus = D:
## : : :...Race <= 48: 1 (32/10)
## : : : Race > 48: 0 (2)
## : : MaritalStatus in {M,S}:
## : : :...ResidentStatus in {1,4}: 0 (307/134)
## : : ResidentStatus = 2:
## : : :...agegrp = 07. 31-40yo: 1 (63/27)
## : : : agegrp = 03. 11-15yo:
## : : : :...DayOfWeekOfDeath in {1,2,3,4,5,6,9}: 0 (7)
## : : : DayOfWeekOfDeath = 7: 1 (2)
## : : ResidentStatus = 3:
## : : :...agegrp = 03. 11-15yo: 0 (5)
## : : agegrp = 07. 31-40yo:
## : : :...MonthOfDeath <= 8: 0 (21/3)
## : : MonthOfDeath > 8: 1 (9/2)
## : RaceRecode5 in {1,3}:
## : :...MaritalStatus in {D,W}: 1 (244/79)
## : MaritalStatus = U: 0 (7/2)
## : MaritalStatus in {M,S}:
## : :...Education2003Revision in {0,2,5,7}: 1 (375/139)
## : Education2003Revision = 8:
## : :...MaritalStatus = M: 1 (4/1)
## : : MaritalStatus = S: 0 (5/1)
## : Education2003Revision = 4:
## : :...ResidentStatus in {1,3,4}: 1 (152/64)
## : : ResidentStatus = 2:
## : : :...Race <= 2: 0 (33/12)
## : : Race > 2: 1 (2)
## : Education2003Revision = 6:
## : :...MaritalStatus = S: 1 (55/20)
## : : MaritalStatus = M:
## : : :...ResidentStatus in {1,3,4}: 0 (33/12)
## : : ResidentStatus = 2: 1 (7/1)
## : Education2003Revision = 9:
## : :...MaritalStatus = M: 0 (7/2)
## : : MaritalStatus = S:
## : : :...MonthOfDeath <= 5: 1 (6)
## : : MonthOfDeath > 5: 0 (4/1)
## : Education2003Revision = 1:
## : :...ResidentStatus = 2: 0 (30/9)
## : : ResidentStatus = 3:
## : : :...MaritalStatus = M: 1 (2)
## : : : MaritalStatus = S: 0 (7/2)
## : : ResidentStatus = 4:
## : : :...agegrp = 03. 11-15yo: 0 (2)
## : : : agegrp = 07. 31-40yo: 1 (3)
## : : ResidentStatus = 1:
## : : :...agegrp = 03. 11-15yo: 1 (46/15)
## : : agegrp = 07. 31-40yo:
## : : :...DayOfWeekOfDeath = 1: 1 (8/2)
## : : DayOfWeekOfDeath in {4,5,7,9}: 0 (22/6)
## : : DayOfWeekOfDeath = 2:
## : : :...MonthOfDeath <= 8: 1 (5/1)
## : : : MonthOfDeath > 8: 0 (2)
## : : DayOfWeekOfDeath = 3:
## : : :...MonthOfDeath <= 10: 1 (6/1)
## : : : MonthOfDeath > 10: 0 (2)
## : : DayOfWeekOfDeath = 6:
## : : :...MaritalStatus = M: 0 (4/1)
## : : MaritalStatus = S: 1 (5/1)
## : Education2003Revision = 3:
## : :...MonthOfDeath > 4: 1 (326/113)
## : MonthOfDeath <= 4:
## : :...DayOfWeekOfDeath in {3,7}: 1 (37/12)
## : DayOfWeekOfDeath in {4,5,6,9}: 0 (69/25)
## : DayOfWeekOfDeath = 2:
## : :...MonthOfDeath <= 1: 0 (4)
## : : MonthOfDeath > 1: 1 (16/7)
## : DayOfWeekOfDeath = 1:
## : :...MaritalStatus = M: 1 (7)
## : MaritalStatus = S:
## : :...MonthOfDeath <= 2: 0 (4)
## : MonthOfDeath > 2: 1 (7/2)
## Sex = F:
## :...RaceRecode5 in {2,4}: 0 (275/58)
## RaceRecode5 in {1,3}:
## :...ResidentStatus = 4: 1 (5)
## ResidentStatus in {1,2,3}:
## :...MaritalStatus in {D,W}: 1 (170/73)
## MaritalStatus = U: 0 (11/2)
## MaritalStatus in {M,S}:
## :...ResidentStatus in {2,3}:
## :...RaceRecode5 = 1: 0 (172/49)
## : RaceRecode5 = 3: 1 (4/1)
## ResidentStatus = 1:
## :...Education2003Revision in {2,7,8,9}: 0 (88/35)
## Education2003Revision in {4,5}: 1 (100/46)
## Education2003Revision = 0:
## :...MaritalStatus = M: 0 (28/10)
## : MaritalStatus = S: 1 (21/8)
## Education2003Revision = 6:
## :...MonthOfDeath <= 2: 1 (2)
## : MonthOfDeath > 2: 0 (36/13)
## Education2003Revision = 1:
## :...agegrp = 03. 11-15yo: 1 (28/12)
## : agegrp = 07. 31-40yo:
## : :...MaritalStatus = S: 0 (21/2)
## : MaritalStatus = M:
## : :...DayOfWeekOfDeath in {2,6}: 1 (4)
## : DayOfWeekOfDeath in {1,3,4,5,7,
## : 9}: 0 (3)
## Education2003Revision = 3:
## :...DayOfWeekOfDeath in {1,2,4,6,9}: 0 (94/40)
## DayOfWeekOfDeath = 5: 1 (20/6)
## DayOfWeekOfDeath = 7:
## :...MaritalStatus = M: 1 (16/6)
## : MaritalStatus = S: 0 (13/5)
## DayOfWeekOfDeath = 3:
## :...MaritalStatus = M: 0 (16/5)
## MaritalStatus = S:
## :...MonthOfDeath <= 10: 1 (8/1)
## MonthOfDeath > 10: 0 (2)
##
##
## Evaluation on training data (131205 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 86 9335( 7.1%) <<
##
##
## (a) (b) <-classified as
## ----- -----
## 118707 1248 (a): class 0
## 8087 3163 (b): class 1
##
##
## Attribute usage:
##
## 100.00% agegrp
## 4.32% Sex
## 3.52% Education2003Revision
## 3.12% RaceRecode5
## 2.37% MaritalStatus
## 1.29% ResidentStatus
## 0.44% MonthOfDeath
## 0.33% DayOfWeekOfDeath
## 0.06% Race
##
##
## Time: 0.7 secs
Taking a look at the validation and test runs of this model (ROC on test):
pv1 <- predict(m2, mval)
pt1 <- predict(m2, mtest)
confusionMatrix(pv1,mval$ext,positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 59922 4077
## 1 731 1413
##
## Accuracy : 0.9273
## 95% CI : (0.9253, 0.9293)
## No Information Rate : 0.917
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3394
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.25738
## Specificity : 0.98795
## Pos Pred Value : 0.65905
## Neg Pred Value : 0.93630
## Prevalence : 0.08300
## Detection Rate : 0.02136
## Detection Prevalence : 0.03241
## Balanced Accuracy : 0.62266
##
## 'Positive' Class : 1
##
confusionMatrix(pt1,mtest$ext,positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 59433 4099
## 1 720 1459
##
## Accuracy : 0.9267
## 95% CI : (0.9246, 0.9286)
## No Information Rate : 0.9154
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.346
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.26250
## Specificity : 0.98803
## Pos Pred Value : 0.66957
## Neg Pred Value : 0.93548
## Prevalence : 0.08458
## Detection Rate : 0.02220
## Detection Prevalence : 0.03316
## Balanced Accuracy : 0.62527
##
## 'Positive' Class : 1
##
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
auc(as.numeric(mtest$ext),as.numeric(pt1))
## Area under the curve: 0.6253
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred <- prediction(predictions = as.numeric(pt1), labels=as.numeric(mtest$ext))
roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf)
abline(a=0, b=1)
As I feel that I did a bit of tuning of the C5.0 model as I went along, I’d rather try the Random Forest algorithm to see if that gives us a better outcome. I’ll use the same training, validation, and test datasets.
Since we have 9 explanatory variables, I’ll try an n of 3, and use 300 trees in order to prevent my computer from melting down.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
rf <- randomForest(mtrain$ext ~ ., data = mtrain, mtry=3, ntree=300, na.action=na.roughfix)
rf
##
## Call:
## randomForest(formula = mtrain$ext ~ ., data = mtrain, mtry = 3, ntree = 300, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.74%
## Confusion matrix:
## 0 1 class.error
## 0 1302595 1456 0.001116521
## 1 8260 2990 0.734222222
Look at model performance on validation set
pv2 <- predict(rf, mtest[-10],type="response")
head(pv2, n=10)
## 4 8 22 43 102 107 119 140 159 186
## 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
confusionMatrix(pv2,mtest$ext,positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 59406 4103
## 1 746 1451
##
## Accuracy : 0.9262
## 95% CI : (0.9242, 0.9282)
## No Information Rate : 0.9155
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3429
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.26125
## Specificity : 0.98760
## Pos Pred Value : 0.66045
## Neg Pred Value : 0.93539
## Prevalence : 0.08453
## Detection Rate : 0.02208
## Detection Prevalence : 0.03344
## Balanced Accuracy : 0.62443
##
## 'Positive' Class : 1
##
Look at model performance on test set, including ROC
pt2 <- predict(rf, mtest[-10],type="response")
head(pt2, n=10)
## 4 8 22 43 102 107 119 140 159 186
## 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
confusionMatrix(pt2,mtest$ext,positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 59405 4104
## 1 747 1450
##
## Accuracy : 0.9262
## 95% CI : (0.9241, 0.9282)
## No Information Rate : 0.9155
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3426
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.26107
## Specificity : 0.98758
## Pos Pred Value : 0.65999
## Neg Pred Value : 0.93538
## Prevalence : 0.08453
## Detection Rate : 0.02207
## Detection Prevalence : 0.03344
## Balanced Accuracy : 0.62433
##
## 'Positive' Class : 1
##
library(pROC)
auc(as.numeric(mtest$ext),as.numeric(pt2))
## Area under the curve: 0.6243
library(ROCR)
pred <- prediction(predictions = as.numeric(pt2), labels=as.numeric(mtest$ext))
roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf)
abline(a=0, b=1)
Below is a wrap up of the results from the two algorithms (model comparison on test dataset):
| Model | Algorithm | Accuracy | Kappa | ROC |
|---|---|---|---|---|
| 1 | C5.0 (DT) | 0.9267 | 0.346 | 0.6253 |
| 2 | Random Forest | 0.9262 | 0.3426 | 0.6243 |
So it appears that (1) the results are consistent, which probably means it’s a fair comparison of the two algorithms, (2) there was no improvement using RF, and (3) with a Kappa statistic that is severely lower than the Accuracy, we can see the effects of the overwhelming prevalence of deaths from non-external causes.
It’s interesting that in this case, the simple C5.0 algorithm may have slightly edged out the Random Forest, although the difference probably isn’t significant. (No, I have not thought about how to go about quantifying the statistical significance of this difference!)
While I feel like I didn’t do the greatest job on this, it was good practice on the skills learned in class, and from the Machine Learning in R book. Thanks so much for a great year, I learned so much in your classes!