A Demographic Look at Death in the United States

About this project.

For this assignment, we were to (1) find a job description that we are interested in, (2) find a dataset that we might analyze if we were to get that job, and (3) apply some type of statistical learning algorithm, either classification or prediction, and report using the 5 steps.

Job Description. The job description was for a Data Analytics Associate at Mathematica Policy Research, in Oakland. Responsibilities include “analyzing data quality and integrity using packaged programs such as SAS, R, Python, and STATA, running basic statistics frequencies.”
Dataset. For good or bad, “policy research” could involve almost any kind of data analysis. Thus, after endless forays into various datasets I ended up choosing the Death Records dataset from Kaggle (https://www.kaggle.com/cdc/mortality), which is a compilation of year 2014 US deaths and their causes as reported by the CDC. The dataset has over 2.6M records. It’s a depressing choice, I know, but it was a good size, and had data that I felt was just messy enough for me to have to get some practice on my R code, without being a time suck; so I’m going all in on this.
Stat Learning Algorithm. After spending a lot of time exploring the data, and realizing that it would take a lot more time to really deep dive into this, I settled on experimenting with classification algorithms to predict individuals that die from external causes, like accidents, suicide, poisonings, homicide, etc. as opposed to dying of natural causes such as old age or more progressive diseases. It turns out that the vast majority of deaths fall into the latter category (> 90%) so as part of my study I am interested in seeing the effects of when there’s a severe imbalance in class–so things like the kappa statistic.

So off we go to the five steps.

I. Get the data.

Data obtained from kaggle.com, source listed above.

morbid <- read.csv("DeathRecords.csv")
str(morbid)

## 'data.frame':    2631171 obs. of  38 variables:
##  $ Id                            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ResidentStatus                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Education1989Revision         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Education2003Revision         : int  2 2 7 6 3 5 4 4 3 3 ...
##  $ EducationReportingFlag        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ MonthOfDeath                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex                           : Factor w/ 2 levels "F","M": 2 2 1 2 2 1 2 2 1 2 ...
##  $ AgeType                       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Age                           : int  87 58 75 74 64 93 82 55 86 23 ...
##  $ AgeSubstitutionFlag           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AgeRecode52                   : int  43 37 41 40 38 44 42 37 43 30 ...
##  $ AgeRecode27                   : int  23 17 21 20 18 24 22 17 23 10 ...
##  $ AgeRecode12                   : int  11 8 10 9 8 11 10 8 11 4 ...
##  $ InfantAgeRecode22             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PlaceOfDeathAndDecedentsStatus: int  4 4 4 6 4 6 4 7 4 7 ...
##  $ MaritalStatus                 : Factor w/ 5 levels "D","M","S","U",..: 2 1 5 1 1 5 2 3 5 3 ...
##  $ DayOfWeekOfDeath              : int  4 3 2 1 2 4 6 6 6 4 ...
##  $ CurrentDataYear               : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ InjuryAtWork                  : Factor w/ 3 levels "N","U","Y": 2 2 2 2 2 2 2 2 2 1 ...
##  $ MannerOfDeath                 : int  7 7 7 7 7 7 7 7 7 2 ...
##  $ MethodOfDisposition           : Factor w/ 7 levels "B","C","D","E",..: 2 2 2 2 2 2 2 2 2 6 ...
##  $ Autopsy                       : Factor w/ 5 levels "n","N","U","y",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ActivityCode                  : int  99 99 99 99 99 99 99 99 99 9 ...
##  $ PlaceOfInjury                 : int  99 99 99 99 99 99 99 99 99 5 ...
##  $ Icd10Code                     : Factor w/ 3695 levels "A020","A021",..: 1570 1440 1762 1521 1440 1846 1441 1440 1070 3540 ...
##  $ CauseRecode358                : int  238 214 267 228 214 280 215 214 175 429 ...
##  $ CauseRecode113                : int  70 62 86 68 62 111 63 62 111 125 ...
##  $ InfantCauseRecode130          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CauseRecode39                 : int  24 21 28 22 21 37 21 21 37 40 ...
##  $ NumberOfEntityAxisConditions  : int  1 3 2 3 1 5 4 2 1 3 ...
##  $ NumberOfRecordAxisConditions  : int  1 3 2 3 1 5 3 2 1 3 ...
##  $ Race                          : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ BridgedRaceFlag               : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ RaceImputationFlag            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RaceRecode3                   : int  1 1 1 1 1 1 1 3 1 1 ...
##  $ RaceRecode5                   : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ HispanicOrigin                : int  100 100 100 100 100 100 100 100 100 100 ...
##  $ HispanicOriginRaceRecode      : int  6 6 6 6 6 6 6 7 6 6 ...

The dataset contains 2,631,171 observations of 38 variables.

II. EDA and data preparation for modeling.

Elimination of duplicative or irrelevant features

First, I’d like to get rid of a lot of variables that are either repetitive (e.g. multiple binnings of the same base variable, such as Age) or otherwise not relevant to my study.

dump <- c("AgeRecode27","AgeRecode52","AgeType","BridgedRaceFlag",
          "Education1989Revision","EducationReportingFlag","EntityAxisConditions",
          "HispanicOrigin","HispanicOriginRaceRecode","MethodOfDisposition",
          "RaceImputationFlag","RaceRecode3","RecordAxisConditions",
          "AgeSubstitutionFlag","CauseRecode358","CauseRecode113",
          "NumberOfEntityAxisConditions","NumberOfRecordAxisConditions",
          "InfantAgeRecode22","InfantCauseRecode130","CurrentDataYear",
          "CauseRecode39") 
morbid <- morbid[,!(names(morbid) %in% dump)]
str(morbid)

## 'data.frame':    2631171 obs. of  18 variables:
##  $ Id                            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ResidentStatus                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Education2003Revision         : int  2 2 7 6 3 5 4 4 3 3 ...
##  $ MonthOfDeath                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex                           : Factor w/ 2 levels "F","M": 2 2 1 2 2 1 2 2 1 2 ...
##  $ Age                           : int  87 58 75 74 64 93 82 55 86 23 ...
##  $ AgeRecode12                   : int  11 8 10 9 8 11 10 8 11 4 ...
##  $ PlaceOfDeathAndDecedentsStatus: int  4 4 4 6 4 6 4 7 4 7 ...
##  $ MaritalStatus                 : Factor w/ 5 levels "D","M","S","U",..: 2 1 5 1 1 5 2 3 5 3 ...
##  $ DayOfWeekOfDeath              : int  4 3 2 1 2 4 6 6 6 4 ...
##  $ InjuryAtWork                  : Factor w/ 3 levels "N","U","Y": 2 2 2 2 2 2 2 2 2 1 ...
##  $ MannerOfDeath                 : int  7 7 7 7 7 7 7 7 7 2 ...
##  $ Autopsy                       : Factor w/ 5 levels "n","N","U","y",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ActivityCode                  : int  99 99 99 99 99 99 99 99 99 9 ...
##  $ PlaceOfInjury                 : int  99 99 99 99 99 99 99 99 99 5 ...
##  $ Icd10Code                     : Factor w/ 3695 levels "A020","A021",..: 1570 1440 1762 1521 1440 1846 1441 1440 1070 3540 ...
##  $ Race                          : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ RaceRecode5                   : int  1 1 1 1 1 1 1 2 1 1 ...

summary(morbid)

##        Id          ResidentStatus Education2003Revision  MonthOfDeath   
##  Min.   :      1   Min.   :1.00   Min.   :0.000         Min.   : 1.000  
##  1st Qu.: 657794   1st Qu.:1.00   1st Qu.:2.000         1st Qu.: 3.000  
##  Median :1315586   Median :1.00   Median :3.000         Median : 6.000  
##  Mean   :1315586   Mean   :1.26   Mean   :3.227         Mean   : 6.489  
##  3rd Qu.:1973378   3rd Qu.:1.00   3rd Qu.:4.000         3rd Qu.:10.000  
##  Max.   :2631171   Max.   :4.00   Max.   :9.000         Max.   :12.000  
##                                                                         
##  Sex              Age          AgeRecode12    
##  F:1299710   Min.   :  1.00   Min.   : 1.000  
##  M:1331461   1st Qu.: 63.00   1st Qu.: 8.000  
##              Median : 77.00   Median :10.000  
##              Mean   : 73.41   Mean   : 9.298  
##              3rd Qu.: 87.00   3rd Qu.:11.000  
##              Max.   :999.00   Max.   :12.000  
##                                               
##  PlaceOfDeathAndDecedentsStatus MaritalStatus DayOfWeekOfDeath
##  Min.   :1.000                  D:400959      Min.   :1.000   
##  1st Qu.:1.000                  M:980016      1st Qu.:2.000   
##  Median :4.000                  S:333043      Median :4.000   
##  Mean   :3.614                  U: 18713      Mean   :4.009   
##  3rd Qu.:6.000                  W:898440      3rd Qu.:6.000   
##  Max.   :9.000                                Max.   :9.000   
##                                                               
##  InjuryAtWork MannerOfDeath   Autopsy      ActivityCode   PlaceOfInjury  
##  N: 205381    Min.   :0.000   n:      3   Min.   : 0.00   Min.   : 0.00  
##  U:2421408    1st Qu.:7.000   N:2258840   1st Qu.:99.00   1st Qu.:99.00  
##  Y:   4382    Median :7.000   U: 168646   Median :99.00   Median :99.00  
##               Mean   :5.609   y:      1   Mean   :92.19   Mean   :91.94  
##               3rd Qu.:7.000   Y: 203681   3rd Qu.:99.00   3rd Qu.:99.00  
##               Max.   :7.000               Max.   :99.00   Max.   :99.00  
##                                                                          
##    Icd10Code            Race         RaceRecode5   
##  I251   : 161961   Min.   : 1.000   Min.   :1.000  
##  C349   : 154862   1st Qu.: 1.000   1st Qu.:1.000  
##  F03    : 122021   Median : 1.000   Median :1.000  
##  I219   : 114107   Mean   : 1.686   Mean   :1.202  
##  J449   : 107836   3rd Qu.: 1.000   3rd Qu.:1.000  
##  G309   :  91356   Max.   :78.000   Max.   :4.000  
##  (Other):1879028

Noticing the numerous categorical data that are expressed in numbers, and also the missing data, such as a Max Age of 999. Assuming we didn’t lose some Biblical characters in 2014, we need to deal with those. Since Age is such as key variable in all of my analysis, I’ve decided to eliminate those rows.

Convert “categorical but numeric” data to factors

This dataset used a lot of lookup tables, with the main dataset relying on numeric codes for categorical data. These need to be converted to the correct data type.

cols <- c("ResidentStatus","Education2003Revision","AgeRecode12",
          "PlaceOfDeathAndDecedentsStatus","DayOfWeekOfDeath",
          "MannerOfDeath","ActivityCode","PlaceOfInjury",
          "RaceRecode5")
morbid[,cols] <- data.frame(apply(morbid[cols], 2, as.factor))

Missing data

Need to either eliminate or convert “99” and “999” data,

# Get rid of rows with invalid Age, as it's key to my classification
morbid<-morbid[!(morbid$Age=="999"),]

# Recode 99 to NA and get rid of dirty data
morbid$ActivityCode[morbid$ActivityCode == 99]<- NA
morbid$PlaceOfInjury[morbid$PlaceOfInjury == 99] <- NA
morbid$DayOfWeekOfDeath[morbid$DayOfWeekOfDeath == 9] <- NA
morbid$Autopsy[morbid$Autopsy == "n"] <- "N"
morbid$Autopsy[morbid$Autopsy == "y"] <- "Y"
morbid$AgeRecode12 <- sprintf("%02d",morbid$AgeRecode12) # This is to facilitate ordering of the bins.
summary(morbid)

##        Id          ResidentStatus Education2003Revision  MonthOfDeath   
##  Min.   :      1   1:2087112      3      :1006044       Min.   : 1.000  
##  1st Qu.: 657842   2: 406899      4      : 289848       1st Qu.: 3.000  
##  Median :1315600   3: 132030      1      : 278044       Median : 6.000  
##  Mean   :1315602   4:   4559      2      : 259327       Mean   : 6.489  
##  3rd Qu.:1973337                  0      : 241593       3rd Qu.:10.000  
##  Max.   :2631171                  6      : 234315       Max.   :12.000  
##                                   (Other): 321429                       
##  Sex              Age         AgeRecode12       
##  F:1299494   Min.   :  1.00   Length:2630600    
##  M:1331106   1st Qu.: 63.00   Class :character  
##              Median : 77.00   Mode  :character  
##              Mean   : 73.21                     
##              3rd Qu.: 87.00                     
##              Max.   :118.00                     
##                                                 
##  PlaceOfDeathAndDecedentsStatus MaritalStatus DayOfWeekOfDeath
##  1      :795073                 D:400955      4      :380438  
##  4      :773741                 M:980012      6      :378856  
##  6      :523232                 S:332792      7      :377989  
##  5      :187339                 U: 18411      2      :375783  
##  2      :174102                 W:898430      3      :372708  
##  7      :161622                               (Other):744700  
##  (Other): 15491                               NA's   :   126  
##  InjuryAtWork MannerOfDeath Autopsy      ActivityCode    
##  N: 205296    0: 364008     n:      0   9      : 198121  
##  U:2420922    1: 132629     N:2258538   4      :    244  
##  Y:   4382    2:  43132     U: 168634   8      :    183  
##               3:  16829     y:      0   1      :    170  
##               4:   3666     Y: 203428   2      :     89  
##               5:  10703                 (Other):     43  
##               7:2059633                 NA's   :2431750  
##  PlaceOfInjury       Icd10Code            Race        RaceRecode5
##  0      :  98593   I251   : 161950   Min.   : 1.000   1:2241095  
##  9      :  53947   C349   : 154859   1st Qu.: 1.000   2: 309372  
##  8      :  15671   F03    : 122019   Median : 1.000   3:  18025  
##  1      :   9268   I219   : 114102   Mean   : 1.686   4:  62108  
##  4      :   7222   J449   : 107834   3rd Qu.: 1.000              
##  (Other):  10012   G309   :  91356   Max.   :78.000              
##  NA's   :2435887   (Other):1878480

str(morbid)

## 'data.frame':    2630600 obs. of  18 variables:
##  $ Id                            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ResidentStatus                : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Education2003Revision         : Factor w/ 10 levels "0","1","2","3",..: 3 3 8 7 4 6 5 5 4 4 ...
##  $ MonthOfDeath                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex                           : Factor w/ 2 levels "F","M": 2 2 1 2 2 1 2 2 1 2 ...
##  $ Age                           : int  87 58 75 74 64 93 82 55 86 23 ...
##  $ AgeRecode12                   : chr  "03" "11" "02" "12" ...
##  $ PlaceOfDeathAndDecedentsStatus: Factor w/ 8 levels "1","2","3","4",..: 4 4 4 6 4 6 4 7 4 7 ...
##  $ MaritalStatus                 : Factor w/ 5 levels "D","M","S","U",..: 2 1 5 1 1 5 2 3 5 3 ...
##  $ DayOfWeekOfDeath              : Factor w/ 8 levels "1","2","3","4",..: 4 3 2 1 2 4 6 6 6 4 ...
##  $ InjuryAtWork                  : Factor w/ 3 levels "N","U","Y": 2 2 2 2 2 2 2 2 2 1 ...
##  $ MannerOfDeath                 : Factor w/ 7 levels "0","1","2","3",..: 7 7 7 7 7 7 7 7 7 3 ...
##  $ Autopsy                       : Factor w/ 5 levels "n","N","U","y",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ActivityCode                  : Factor w/ 8 levels "0","1","2","3",..: NA NA NA NA NA NA NA NA NA 7 ...
##  $ PlaceOfInjury                 : Factor w/ 11 levels "0","1","2","3",..: NA NA NA NA NA NA NA NA NA 6 ...
##  $ Icd10Code                     : Factor w/ 3695 levels "A020","A021",..: 1570 1440 1762 1521 1440 1846 1441 1440 1070 3540 ...
##  $ Race                          : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ RaceRecode5                   : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 2 1 1 ...

OK. This is starting to look a little bit better. We are down to 18 variables. And we got rid of the 571 rows where Age was 999.

Create a tbl_df for easier EDA

Since I don’t like writing a ton of lines of code and since I want to learn dplyr, this is a good opportunity to work on it. You’ll see that I’m still kind of a one-trick pony at this.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mdf <- dplyr::tbl_df(morbid)
options(dplyr.print_max = 1e9)

Basic stats and frequencies.

Mean/sd of Age and Month of Death (the only two continuous variables, alas).

library(scales)
# Mean and sd of Age, by MonthOfDeath
mdf %>%
      group_by(MonthOfDeath) %>%
      summarise_each(funs(mean(.,na.rm=TRUE),sd(.,na.rm=TRUE),min(.,na.rm=TRUE),
                          max(.,na.rm=TRUE)), Age)

## Source: local data frame [12 x 5]
## 
##    MonthOfDeath     mean       sd   min   max
##           (int)    (dbl)    (dbl) (int) (int)
## 1             1 73.25837 17.76808     1   116
## 2             2 73.38481 17.80028     1   110
## 3             3 73.47262 17.86081     1   114
## 4             4 73.36695 17.98120     1   112
## 5             5 73.18296 18.09358     1   114
## 6             6 72.83089 18.27541     1   111
## 7             7 72.79966 18.24446     1   114
## 8             8 72.72435 18.35336     1   113
## 9             9 72.98840 18.18062     1   118
## 10           10 73.11801 18.10792     1   113
## 11           11 73.35125 18.04741     1   113
## 12           12 73.90216 17.77857     1   114

Frequencies by Day of Week, Month, Diagnostic Code (ICD10), etc.–kind of a verification/visual validation to make sure the data are OK to use.

mdf %>%
      filter(complete.cases(.)) %>%
      group_by(DayOfWeekOfDeath) %>%
      summarise (n = n()) %>%
      mutate(freq = percent(n / sum(n)))

## Source: local data frame [7 x 3]
## 
##   DayOfWeekOfDeath     n   freq
##             (fctr) (int)  (chr)
## 1                1 23892 14.73%
## 2                2 23248 14.33%
## 3                3 22510 13.88%
## 4                4 23013 14.19%
## 5                5 22409 13.82%
## 6                6 23063 14.22%
## 7                7 24050 14.83%

mdf %>%
      filter(complete.cases(.)) %>%
      group_by(MonthOfDeath) %>%
      select(DayOfWeekOfDeath) %>%
      table()

##             DayOfWeekOfDeath
## MonthOfDeath    1    2    3    4    5    6    7    9
##           1  1723 1847 1691 2242 2111 2201 1835    0
##           2  1834 1764 1697 1680 1761 1783 1841    0
##           3  2161 2094 1699 1702 1717 1756 2300    0
##           4  1753 1734 2129 2129 1690 1767 1776    0
##           5  1772 1799 1690 1613 2144 2228 2306    0
##           6  2351 2222 1742 1811 1767 1741 1906    0
##           7  1842 1766 2119 2186 2081 1800 1912    0
##           8  2396 1905 1825 1736 1766 2184 2318    0
##           9  1912 2264 2212 1783 1746 1775 1849    0
##           10 1868 1740 1794 2178 2096 2200 1850    0
##           11 2335 1831 1767 1730 1758 1757 2352    0
##           12 1945 2282 2145 2223 1772 1871 1805    0

A peek at some diagnostic codes (there are over 3000 of them)

icd <- mdf %>%
      filter(complete.cases(.)) %>%
      group_by(Icd10Code) %>%
      summarise (n = n()) %>%
      mutate(freq = (n / sum(n))) %>%
      arrange(desc(freq))
head(icd)

## Source: local data frame [6 x 3]
## 
##   Icd10Code     n       freq
##      (fctr) (int)      (dbl)
## 1       X44 18627 0.11485033
## 2       X42 16869 0.10401085
## 3       W19 14472 0.08923143
## 4       X74 12921 0.07966828
## 5       X70 11439 0.07053057
## 6       W18 10852 0.06691124

And a quick peek at deaths grouped by gender and age group (AgeRecode12, which divides the population into 12 age-based bins).

mdf %>%
      filter(complete.cases(.)) %>%
      group_by(Sex,AgeRecode12) %>%
      summarise(n = n()) %>%
      mutate(freq = percent(n / sum(n)))

## Source: local data frame [22 x 4]
## Groups: Sex [2]
## 
##       Sex AgeRecode12     n  freq
##    (fctr)       (chr) (int) (chr)
## 1       F          01   575  1.0%
## 2       F          02  6303 11.5%
## 3       F          03 12623 23.0%
## 4       F          05   454  0.8%
## 5       F          06   441  0.8%
## 6       F          07  2800  5.1%
## 7       F          08  5325  9.7%
## 8       F          09  6304 11.5%
## 9       F          10  8618 15.7%
## 10      F          11  6975 12.7%
## 11      F          12  4399  8.0%
## 12      M          01   767  0.7%
## 13      M          02  8830  8.2%
## 14      M          03  9129  8.5%
## 15      M          05   702  0.7%
## 16      M          06   894  0.8%
## 17      M          07 11897 11.1%
## 18      M          08 17454 16.3%
## 19      M          09 15235 14.2%
## 20      M          10 18190 16.9%
## 21      M          11 15305 14.3%
## 22      M          12  8965  8.3%

Now it seems a bit strange that 23% of all female deaths occurred in the AgeRecode12 “03” group, which are the 5-14 year olds. Was there some catastrophe in 2014?

Anyway, we start thinking about how to label the observations. I’d like to see how Manner of Death lines up with the age groups.

# Start looking at death data to decide how to label the observations.
agemanner<- mdf %>%
      filter(complete.cases(.)) %>%
      group_by(AgeRecode12,MannerOfDeath) %>%
      summarise (n = n()) %>%
      mutate(freq = (n / sum(n))) %>%
      arrange(desc(freq)) %>%
      mutate(freq = percent(freq))
head(agemanner,n = 13)

## Source: local data frame [13 x 4]
## Groups: AgeRecode12 [2]
## 
##    AgeRecode12 MannerOfDeath     n  freq
##          (chr)        (fctr) (int) (chr)
## 1           01             1  1049 78.2%
## 2           01             3   169 12.6%
## 3           01             5    84  6.3%
## 4           01             7    23  1.7%
## 5           01             0    13  1.0%
## 6           01             4     4  0.3%
## 7           02             1  9414 62.2%
## 8           02             7  2516 16.6%
## 9           02             2  2390 15.8%
## 10          02             0   427  2.8%
## 11          02             3   256  1.7%
## 12          02             5   120  0.8%
## 13          02             4    10  0.1%

Hmm. Now I’m really starting to wonder about the accuracy of the AgeRecode12 variable. AgeRecode12 = 2 are the 1-4 year olds, and MannerOfDeath = 2 are death by suicide. I can’t believe that 2,390 1-4 year olds died by suicide; that would just shatter my faith in humanity, really. I think I will create my own binned age categories and see what results I get. (I realize the below nested ifelse statement is super ugly but couldn’t get a switch function to work.)

morbid$agegrp <- as.factor(ifelse(morbid$Age<5,
                        "01.    <5yo",
                        ifelse(morbid$Age<=10,
                               "02.  5-10yo",
                               ifelse(morbid$Age<=15,
                                      "03. 11-15yo",
                                      ifelse(morbid$Age<=20,
                                             "04. 15-20yo",
                                             ifelse(morbid$Age<=25,
                                                    "05. 21-25yo",
                                                    ifelse(morbid$Age<=30,
                                                           "06. 26-30yo",
                                                           ifelse(morbid$Age<=40,
                                                                  "07. 31-40yo",
                                                                  ifelse(morbid$Age<=50,
                                                                         "08. 41-50yo",
                                                                         ifelse(morbid$Age<=60,
                                                                                "09. 51-60yo",
                                                                                ifelse(morbid$Age<=70,
                                                                                       "10. 61-70yo",
                                                                                       ifelse(morbid$Age<=80,
                                                                                              "11. 71-80yo",
                                                                                              "12.   >80yo"))))))))))))

mdf <- dplyr::tbl_df(morbid)
agemanner <- mdf %>%
      filter(complete.cases(.)) %>%
      group_by(agegrp,MannerOfDeath) %>%
      summarise (n = n()) %>%
      mutate(freq = (n / sum(n))) %>%
      arrange(desc(freq)) %>%
      mutate(freq = percent(freq))
head(agemanner, n = 13)

## Source: local data frame [13 x 4]
## Groups: agegrp [2]
## 
##         agegrp MannerOfDeath     n  freq
##         (fctr)        (fctr) (int) (chr)
## 1  01.    <5yo             1  1428 72.7%
## 2  01.    <5yo             3   378 19.3%
## 3  01.    <5yo             5   115  5.9%
## 4  01.    <5yo             7    21  1.1%
## 5  01.    <5yo             0    13  0.7%
## 6  01.    <5yo             4     8  0.4%
## 7  02.  5-10yo             1   649 69.3%
## 8  02.  5-10yo             3   194 20.7%
## 9  02.  5-10yo             5    54  5.8%
## 10 02.  5-10yo             7    17  1.8%
## 11 02.  5-10yo             2    12  1.3%
## 12 02.  5-10yo             0     7  0.7%
## 13 02.  5-10yo             4     3  0.3%

OK. The 1-4 year old suicide group went away; my faith in humanity has been restored. I think I’ll use my own age groupings instead of what the CDC gave us. Let’s go ahead and drop the AgeRecode12 variable since it’s apparently gibberish.

drop12 <- "AgeRecode12"
morbid <- morbid[,!names(morbid) %in% drop12]

Create labels for classification.

Label external causes of death (“ext”) based on MannerOfDeath codes provided:

Label	Description	Classification (ext: 1 = yes, 0 = no)
1.	Accident	Deem external cause of death (ext = 1)
2.	Suicide	External
3.	Homicide	External
4.	Pending investigation	External
5.	Could not determine	Assumed external
6.	Self-inflicted	External
7.	Natural	Assumed not external cause of death (ext = 0)
0.	Not specified	Assumed not external

Source: kaggle / CDC

ICD10 Codes	Diagnosis
S00-T88	Injury, poisoning, and certain other consequences of external causes
V00-Y99	External causes of morbidity

*Source: http://www.icd10data.com/ICD10CM/Codes*

morbid$ext <- as.factor(ifelse((morbid$MannerOfDeath %in% c("1","2","3","4","5",
                                                  "6")),1,
                     ifelse((substr(morbid$Icd10Code,1,1) %in%  
                                   c("S","T","V","W","X","Y")),1,0)))

mdf <- dplyr::tbl_df(morbid)
mdf %>%
      group_by(ext) %>%
      tally()

## Source: local data frame [2 x 2]
## 
##      ext       n
##   (fctr)   (int)
## 1      0 2407613
## 2      1  222987

It looks like a little less than 10% of the total dataset will be classified as having died due to external causes. Let’s look at how external vs non-external causes of death line up against our newly created age groupings.

External causes of death by age group:

mdf <- dplyr::tbl_df(morbid)
mdf %>%
      group_by(agegrp,ext) %>%
      summarise(n = n()) %>%
      mutate(freq = percent(n / sum(n))) %>%
      as.data.frame()

##         agegrp ext       n     freq
## 1  01.    <5yo   0   12726    74.1%
## 2  01.    <5yo   1    4450    25.9%
## 3  02.  5-10yo   0    5521    72.7%
## 4  02.  5-10yo   1    2069    27.3%
## 5  03. 11-15yo   0    2878      58%
## 6  03. 11-15yo   1    2081      42%
## 7  04. 15-20yo   0    3564    27.8%
## 8  04. 15-20yo   1    9263    72.2%
## 9  05. 21-25yo   0    5489    25.9%
## 10 05. 21-25yo   1   15683    74.1%
## 11 06. 26-30yo   0    7518    32.9%
## 12 06. 26-30yo   1   15314    67.1%
## 13 07. 31-40yo   0   28701 49.9704%
## 14 07. 31-40yo   1   28735 50.0296%
## 15 08. 41-50yo   0   87009      73%
## 16 08. 41-50yo   1   32227      27%
## 17 09. 51-60yo   0  250072    87.2%
## 18 09. 51-60yo   1   36730    12.8%
## 19 10. 61-70yo   0  404360    94.7%
## 20 10. 61-70yo   1   22431     5.3%
## 21 11. 71-80yo   0  527541    96.6%
## 22 11. 71-80yo   1   18564     3.4%
## 23 12.   >80yo   0 1072234    96.8%
## 24 12.   >80yo   1   35440     3.2%

We see that for under 10yo and over 40yo, “non-external” causes of death far outnumber the externals. However starting with the 11-15yo group and through the 31-40yo group, there is a greatly increased frequency of death by external causes, with the maximum rate of death by external causes seen in the 26-30yo group.

As an aside, I would have loved to make some plots of these, but could not for the life of me figure out how to create plots from dplyr layouts and did not have time to go back to the raw data to recreate. I think I ought to continue to work on this over the summer.

External causes of death vs gender:

mdf %>%
      group_by(Sex,ext) %>%
      summarise(n = n()) %>%
      mutate(freq = percent(n / sum(n))) %>%
      as.data.frame()

##   Sex ext       n  freq
## 1   F   0 1224371 94.2%
## 2   F   1   75123  5.8%
## 3   M   0 1183242 88.9%
## 4   M   1  147864 11.1%

Perhaps unsurprisingly, it appears that males die by external causes at a greater rate than females.

External causes of death vs race:

1: White; 2: Black; 3: American Indian; 4: Asian or Pacific Islander

mdf %>% 
      group_by(RaceRecode5,ext) %>%
      summarise(n = n()) %>%
      mutate(freq = percent(n / sum(n))) %>%
      as.data.frame()

##   RaceRecode5 ext       n  freq
## 1           1   0 2055056 91.7%
## 2           1   1  186039  8.3%
## 3           2   0  280444 90.6%
## 4           2   1   28928  9.4%
## 5           3   0   14871 82.5%
## 6           3   1    3154 17.5%
## 7           4   0   57242 92.2%
## 8           4   1    4866  7.8%

The group that stands out are the American Indians–they seem to have twice the rate of death by external causes compared with everybody else. That’s just horribly tragic.

(For the purposes of the classification algorithms, I will also be including the variable “Race” which is more detailed.)

Partition the data.

*** After running all code in RStudio successfully and writing my entire report, I found that when rendering in RMarkdown uses more memory than running in RStudio. So I went back and halved my dataset. So the final run where I actually run the models, will be using a 50/25/25 of 10% of the data. This was a major bummer but a good lesson about why we don’t want to choose too big of a dataset.***

I’ll be using the caret package for this task. Since we have an overabundance of data, I’m going to use a 50/25/25 split on the training, validation, and test datasets.

But first, I need to remove the MannerOfDeath and Icd10Code fields, since those are the fields from which I created the label. I also would like to remove the “Age” variable, since it would be highly correlated with agegrp. Finally remove “Id”, which has no meaning.

drop2 <- c("MannerOfDeath","Icd10Code","Age","Id")
m_all <- morbid[,!(names(morbid) %in% drop2)]

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

set.seed(432)
trainIndex <- createDataPartition(m_all$ext, p = .5,list=FALSE,times=1)
mtrain <- m_all[trainIndex,]
mrest <- m_all[-trainIndex,]
valIndex <- createDataPartition(m_all$ext, p = .5,list=FALSE,times=1)
mval <- mrest[valIndex,]
mtest <- mrest[-valIndex,]

III. Train a model.

The classification is around what factors predict a death by external causes. We’ve looked at a few potentially relevant factors, such as gender, age group and race. I’ll be using C5.0 and Random Forest–unfortunately the same algorithms I used for Project 1, but they are useful for this type of data and I am running out of time.

library(C50)
str(m_all)

## 'data.frame':    2630600 obs. of  15 variables:
##  $ ResidentStatus                : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Education2003Revision         : Factor w/ 10 levels "0","1","2","3",..: 3 3 8 7 4 6 5 5 4 4 ...
##  $ MonthOfDeath                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex                           : Factor w/ 2 levels "F","M": 2 2 1 2 2 1 2 2 1 2 ...
##  $ PlaceOfDeathAndDecedentsStatus: Factor w/ 8 levels "1","2","3","4",..: 4 4 4 6 4 6 4 7 4 7 ...
##  $ MaritalStatus                 : Factor w/ 5 levels "D","M","S","U",..: 2 1 5 1 1 5 2 3 5 3 ...
##  $ DayOfWeekOfDeath              : Factor w/ 8 levels "1","2","3","4",..: 4 3 2 1 2 4 6 6 6 4 ...
##  $ InjuryAtWork                  : Factor w/ 3 levels "N","U","Y": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Autopsy                       : Factor w/ 5 levels "n","N","U","y",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ActivityCode                  : Factor w/ 8 levels "0","1","2","3",..: NA NA NA NA NA NA NA NA NA 7 ...
##  $ PlaceOfInjury                 : Factor w/ 11 levels "0","1","2","3",..: NA NA NA NA NA NA NA NA NA 6 ...
##  $ Race                          : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ RaceRecode5                   : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 2 1 1 ...
##  $ agegrp                        : Factor w/ 12 levels "01.    <5yo",..: 12 9 11 11 10 12 12 9 12 5 ...
##  $ ext                           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...

m1 <- C5.0(x=mtrain[-15], y=mtrain$ext)
summary(m1)

## 
## Call:
## C5.0.default(x = mtrain[-15], y = mtrain$ext)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Jun  9 23:06:41 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 1315301 cases (15 attributes) from undefined.data
## 
## Decision tree:
## 
## InjuryAtWork = U:
## :...Autopsy in {n,y}: 0 (0)
## :   Autopsy in {N,U}:
## :   :...PlaceOfInjury = 99: 0 (0)
## :   :   PlaceOfInjury in {0,1,2,3,4,5,6,7,8}:
## :   :   :...Autopsy = N:
## :   :   :   :...Education2003Revision in {1,2,3,4,5,6,7,8,9}: 0 (70439.2/402.7)
## :   :   :   :   Education2003Revision = 0:
## :   :   :   :   :...agegrp in {01. <5yo,02. 5-10yo,08. 41-50yo,09. 51-60yo,
## :   :   :   :       :          10. 61-70yo,11. 71-80yo,
## :   :   :   :       :          12. >80yo}: 0 (6827.7/286.9)
## :   :   :   :       agegrp in {03. 11-15yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,
## :   :   :   :       :          07. 31-40yo}:
## :   :   :   :       :...PlaceOfDeathAndDecedentsStatus in {1,3,5,
## :   :   :   :           :                                  6}: 0 (74.7/18.8)
## :   :   :   :           PlaceOfDeathAndDecedentsStatus in {2,4,7,9}:
## :   :   :   :           :...Sex = F: 0 (29.1/11.4)
## :   :   :   :               Sex = M: 1 (80.5/25.5)
## :   :   :   Autopsy = U:
## :   :   :   :...agegrp in {10. 61-70yo,11. 71-80yo,12. >80yo}: 0 (5089.6/246.4)
## :   :   :       agegrp in {04. 15-20yo,05. 21-25yo,06. 26-30yo,07. 31-40yo}:
## :   :   :       :...MonthOfDeath <= 6: 0 (52.7/14)
## :   :   :       :   MonthOfDeath > 6:
## :   :   :       :   :...PlaceOfDeathAndDecedentsStatus in {5,9}: 0 (3.4/0.1)
## :   :   :       :       PlaceOfDeathAndDecedentsStatus in {2,3,4,6,
## :   :   :       :       :                                  7}: 1 (237.1/22.4)
## :   :   :       :       PlaceOfDeathAndDecedentsStatus = 1:
## :   :   :       :       :...Sex = F: 0 (19.5/5.6)
## :   :   :       :           Sex = M: 1 (49.4/16.6)
## :   :   :       agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,08. 41-50yo,
## :   :   :       :          09. 51-60yo}:
## :   :   :       :...MonthOfDeath <= 6: 0 (374.1/19.3)
## :   :   :           MonthOfDeath > 6:
## :   :   :           :...PlaceOfDeathAndDecedentsStatus in {3,7}: 1 (95.4/21.5)
## :   :   :               PlaceOfDeathAndDecedentsStatus in {1,5,6,
## :   :   :               :                                  9}: 0 (313.2/40.5)
## :   :   :               PlaceOfDeathAndDecedentsStatus in {2,4}:
## :   :   :               :...agegrp in {01. <5yo,02. 5-10yo,
## :   :   :                   :          03. 11-15yo}: 1 (21.6/3.3)
## :   :   :                   agegrp = 09. 51-60yo: 0 (212.9/67.2)
## :   :   :                   agegrp = 08. 41-50yo:
## :   :   :                   :...RaceRecode5 in {1,4}: 1 (90.6/35.2)
## :   :   :                       RaceRecode5 in {2,3}: 0 (13.5/4.4)
## :   :   PlaceOfInjury = 9:
## :   :   :...agegrp in {09. 51-60yo,10. 61-70yo,11. 71-80yo,
## :   :       :          12. >80yo}: 0 (1020393/7596.9)
## :   :       agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,04. 15-20yo,05. 21-25yo,
## :   :       :          06. 26-30yo,07. 31-40yo,08. 41-50yo}:
## :   :       :...PlaceOfDeathAndDecedentsStatus in {1,2,4,5,6,
## :   :           :                                  9}: 0 (55534/1225.3)
## :   :           PlaceOfDeathAndDecedentsStatus in {3,7}:
## :   :           :...Autopsy = N: 0 (1998.4/162.3)
## :   :               Autopsy = U:
## :   :               :...agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,
## :   :                   :          08. 41-50yo}: 0 (151.5/49.3)
## :   :                   agegrp in {04. 15-20yo,05. 21-25yo,06. 26-30yo,
## :   :                   :          07. 31-40yo}:
## :   :                   :...MonthOfDeath > 4: 1 (123.1/27.9)
## :   :                       MonthOfDeath <= 4:
## :   :                       :...Sex = F: 0 (9.3/1)
## :   :                           Sex = M:
## :   :                           :...Education2003Revision in {0,2,3,5,6,7,8,
## :   :                               :                         9}: 1 (8.6/0.9)
## :   :                               Education2003Revision in {1,4}: 0 (2.8)
## :   Autopsy = Y:
## :   :...Education2003Revision = 0:
## :       :...agegrp in {09. 51-60yo,10. 61-70yo,11. 71-80yo,
## :       :   :          12. >80yo}: 0 (2872/597)
## :       :   agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,04. 15-20yo,05. 21-25yo,
## :       :   :          06. 26-30yo,07. 31-40yo,08. 41-50yo}:
## :       :   :...PlaceOfDeathAndDecedentsStatus in {1,5,6}: 0 (472/122)
## :       :       PlaceOfDeathAndDecedentsStatus in {2,3,4,7,9}:
## :       :       :...agegrp in {07. 31-40yo,08. 41-50yo}:
## :       :           :...PlaceOfDeathAndDecedentsStatus in {2,3,
## :       :           :   :                                  4}: 0 (959/363)
## :       :           :   PlaceOfDeathAndDecedentsStatus in {7,9}:
## :       :           :   :...RaceRecode5 = 3: 0 (5)
## :       :           :       RaceRecode5 in {1,2,4}:
## :       :           :       :...ResidentStatus in {1,2}: 1 (519/212)
## :       :           :           ResidentStatus = 4: 0 (2)
## :       :           :           ResidentStatus = 3:
## :       :           :           :...RaceRecode5 in {1,4}: 0 (32/13)
## :       :           :               RaceRecode5 = 2: 1 (9/3)
## :       :           agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,04. 15-20yo,
## :       :           :          05. 21-25yo,06. 26-30yo}:
## :       :           :...PlaceOfDeathAndDecedentsStatus in {3,7,
## :       :               :                                  9}: 1 (273/54)
## :       :               PlaceOfDeathAndDecedentsStatus in {2,4}:
## :       :               :...Sex = M: 1 (328/98)
## :       :                   Sex = F:
## :       :                   :...ResidentStatus = 4: 1 (0)
## :       :                       ResidentStatus = 3: 0 (5)
## :       :                       ResidentStatus in {1,2}:
## :       :                       :...agegrp in {01. <5yo,03. 11-15yo,
## :       :                           :          05. 21-25yo}: 1 (77/25)
## :       :                           agegrp = 06. 26-30yo: 0 (33/13)
## :       :                           agegrp = 02. 5-10yo:
## :       :                           :...MonthOfDeath <= 8: 1 (10/4)
## :       :                           :   MonthOfDeath > 8: 0 (6)
## :       :                           agegrp = 04. 15-20yo:
## :       :                           :...ResidentStatus = 2: 0 (2)
## :       :                               ResidentStatus = 1:
## :       :                               :...Race <= 1: 1 (9/2)
## :       :                                   Race > 1: 0 (2)
## :       Education2003Revision in {1,2,3,4,5,6,7,8,9}:
## :       :...agegrp in {03. 11-15yo,08. 41-50yo,09. 51-60yo,10. 61-70yo,
## :           :          11. 71-80yo,12. >80yo}: 0 (33152/2737)
## :           agegrp in {01. <5yo,02. 5-10yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,
## :           :          07. 31-40yo}:
## :           :...PlaceOfDeathAndDecedentsStatus = 1: 0 (2466/275)
## :               PlaceOfDeathAndDecedentsStatus in {2,4,5,6}:
## :               :...agegrp in {02. 5-10yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,
## :               :   :          07. 31-40yo}: 0 (4587/1067)
## :               :   agegrp = 01. <5yo:
## :               :   :...RaceRecode5 = 3: 1 (31/9)
## :               :       RaceRecode5 = 4: 0 (23/6)
## :               :       RaceRecode5 in {1,2}:
## :               :       :...DayOfWeekOfDeath in {1,5}: 1 (224/104)
## :               :           DayOfWeekOfDeath in {2,3,6,7,9}: 0 (463/216)
## :               :           DayOfWeekOfDeath = 4:
## :               :           :...ResidentStatus = 2: 0 (13/3)
## :               :               ResidentStatus in {3,4}: 1 (4/1)
## :               :               ResidentStatus = 1: [S1]
## :               PlaceOfDeathAndDecedentsStatus in {3,7,9}:
## :               :...ResidentStatus = 4: 1 (36/7)
## :                   ResidentStatus in {1,2,3}:
## :                   :...agegrp = 07. 31-40yo: 0 (622/233)
## :                       agegrp in {01. <5yo,02. 5-10yo,04. 15-20yo,05. 21-25yo,
## :                       :          06. 26-30yo}:
## :                       :...Race <= 1:
## :                           :...Education2003Revision in {2,3,5,8,
## :                           :   :                         9}: 1 (251/86)
## :                           :   Education2003Revision in {6,7}: 0 (30/11)
## :                           :   Education2003Revision = 1:
## :                           :   :...DayOfWeekOfDeath in {1,2,7,9}: 1 (40/7)
## :                           :   :   DayOfWeekOfDeath in {3,4,6}: 0 (36/15)
## :                           :   :   DayOfWeekOfDeath = 5: [S2]
## :                           :   Education2003Revision = 4: [S3]
## :                           Race > 1:
## :                           :...MaritalStatus = W: 0 (0)
## :                               MaritalStatus in {D,U}: 1 (3)
## :                               MaritalStatus = M:
## :                               :...Sex = F: 0 (5/1)
## :                               :   Sex = M: 1 (5/1)
## :                               MaritalStatus = S:
## :                               :...Education2003Revision in {2,4,5,7,8,
## :                                   :                         9}: 0 (79/26)
## :                                   Education2003Revision = 6:
## :                                   :...Race <= 7: 0 (6/1)
## :                                   :   Race > 7: 1 (2)
## :                                   Education2003Revision = 3:
## :                                   :...DayOfWeekOfDeath in {1,4,
## :                                   :   :                    6}: 1 (12/3)
## :                                   :   DayOfWeekOfDeath in {2,3,5,
## :                                   :   :                    9}: 0 (14/2)
## :                                   :   DayOfWeekOfDeath = 7:
## :                                   :   :...MonthOfDeath <= 7: 1 (5)
## :                                   :       MonthOfDeath > 7: 0 (5/1)
## :                                   Education2003Revision = 1:
## :                                   :...agegrp in {02. 5-10yo,04. 15-20yo,
## :                                       :          05. 21-25yo}: 0 (12/3)
## :                                       agegrp = 06. 26-30yo: 1 (2)
## :                                       agegrp = 01. <5yo:
## :                                       :...RaceRecode5 = 3: 0 (3)
## :                                           RaceRecode5 in {1,4}: 1 (1)
## :                                           RaceRecode5 = 2:
## :                                           :...Sex = F: 1 (16/2)
## :                                               Sex = M: 0 (18/6)
## InjuryAtWork in {N,Y}:
## :...PlaceOfDeathAndDecedentsStatus in {1,2,3,4,5,7,9}:
##     :...Autopsy in {n,y,Y}: 1 (53442/668)
##     :   Autopsy in {N,U}:
##     :   :...Education2003Revision = 9:
##     :       :...MaritalStatus = U: 1 (226/23)
##     :       :   MaritalStatus in {D,M,S,W}:
##     :       :   :...ResidentStatus = 4: 1 (7)
##     :       :       ResidentStatus in {1,2,3}:
##     :       :       :...InjuryAtWork = Y: 1 (10/1)
##     :       :           InjuryAtWork = N:
##     :       :           :...agegrp in {03. 11-15yo,10. 61-70yo,11. 71-80yo,
##     :       :               :          12. >80yo}: 0 (1128/288)
##     :       :               agegrp in {01. <5yo,02. 5-10yo,04. 15-20yo,
##     :       :               :          05. 21-25yo,06. 26-30yo,07. 31-40yo,
##     :       :               :          08. 41-50yo,09. 51-60yo}:
##     :       :               :...PlaceOfDeathAndDecedentsStatus in {1,2,3,4,7,
##     :       :                   :                                  9}: 1 (299/91)
##     :       :                   PlaceOfDeathAndDecedentsStatus = 5:
##     :       :                   :...agegrp in {02. 5-10yo,04. 15-20yo,
##     :       :                       :          05. 21-25yo,08. 41-50yo,
##     :       :                       :          09. 51-60yo}: 0 (53/3)
##     :       :                       agegrp in {01. <5yo,06. 26-30yo,
##     :       :                       :          07. 31-40yo}:
##     :       :                       :...MonthOfDeath <= 5: 0 (2)
##     :       :                           MonthOfDeath > 5: 1 (4)
##     :       Education2003Revision in {0,1,2,3,4,5,6,7,8}:
##     :       :...agegrp in {03. 11-15yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,
##     :           :          07. 31-40yo,08. 41-50yo,
##     :           :          09. 51-60yo}: 1 (19827/1243)
##     :           agegrp in {01. <5yo,02. 5-10yo,10. 61-70yo,11. 71-80yo,
##     :           :          12. >80yo}:
##     :           :...PlaceOfDeathAndDecedentsStatus in {1,2,3,5,7}:
##     :               :...Autopsy = N: 1 (17997/3192)
##     :               :   Autopsy = U:
##     :               :   :...ResidentStatus in {2,3,4}: 1 (206/33)
##     :               :       ResidentStatus = 1:
##     :               :       :...Race > 1: 0 (26/10)
##     :               :           Race <= 1:
##     :               :           :...MonthOfDeath <= 5: 0 (99/44)
##     :               :               MonthOfDeath > 5: 1 (218/69)
##     :               PlaceOfDeathAndDecedentsStatus in {4,9}:
##     :               :...Sex = M:
##     :                   :...Education2003Revision = 0: 1 (199/20)
##     :                   :   Education2003Revision in {1,2,3,4,5,6,7,8}:
##     :                   :   :...ResidentStatus in {3,4}: 1 (17/1)
##     :                   :       ResidentStatus = 1:
##     :                   :       :...Autopsy = N: [S4]
##     :                   :       :   Autopsy = U:
##     :                   :       :   :...agegrp in {01. <5yo,10. 61-70yo,
##     :                   :       :       :          11. 71-80yo}: 1 (72/31)
##     :                   :       :       agegrp in {02. 5-10yo,
##     :                   :       :                  12. >80yo}: 0 (39/15)
##     :                   :       ResidentStatus = 2:
##     :                   :       :...DayOfWeekOfDeath in {1,3,7}: 0 (8/1)
##     :                   :           DayOfWeekOfDeath in {2,4,9}: 1 (12/1)
##     :                   :           DayOfWeekOfDeath = 5:
##     :                   :           :...agegrp = 10. 61-70yo: 1 (2)
##     :                   :           :   agegrp in {01. <5yo,02. 5-10yo,
##     :                   :           :              11. 71-80yo,
##     :                   :           :              12. >80yo}: 0 (3)
##     :                   :           DayOfWeekOfDeath = 6:
##     :                   :           :...MonthOfDeath <= 2: 0 (2)
##     :                   :               MonthOfDeath > 2: 1 (3)
##     :                   Sex = F:
##     :                   :...Race > 2: 1 (42/9)
##     :                       Race <= 2:
##     :                       :...PlaceOfInjury in {2,3,4,5,6,7,8,
##     :                           :                 99}: 1 (74.6/34)
##     :                           PlaceOfInjury = 9: 0 (129.6/53.5)
##     :                           PlaceOfInjury = 1:
##     :                           :...agegrp in {01. <5yo,10. 61-70yo,
##     :                           :   :          11. 71-80yo}: 0 (37.2/9.5)
##     :                           :   agegrp in {02. 5-10yo,
##     :                           :              12. >80yo}: 1 (86.5/34.7)
##     :                           PlaceOfInjury = 0:
##     :                           :...agegrp in {01. <5yo,02. 5-10yo,10. 61-70yo}:
##     :                               :...ResidentStatus in {1,2,
##     :                               :   :                  4}: 1 (480.6/168)
##     :                               :   ResidentStatus = 3: 0 (3.4)
##     :                               agegrp in {11. 71-80yo,12. >80yo}:
##     :                               :...Education2003Revision in {0,
##     :                                   :                         6}: 1 (168.7/70.1)
##     :                                   Education2003Revision = 7: 0 (44.4/19.7)
##     :                                   Education2003Revision = 1: [S5]
##     :                                   Education2003Revision = 4:
##     :                                   :...MaritalStatus in {D,S,U,
##     :                                   :   :                 W}: 1 (110.3/42.7)
##     :                                   :   MaritalStatus = M: 0 (36.1/13)
##     :                                   Education2003Revision = 8:
##     :                                   :...MaritalStatus in {D,U,
##     :                                   :   :                 W}: 1 (7.7/1.7)
##     :                                   :   MaritalStatus in {M,S}: 0 (4.3)
##     :                                   Education2003Revision = 5:
##     :                                   :...ResidentStatus in {2,3,
##     :                                   :   :                  4}: 1 (3)
##     :                                   :   ResidentStatus = 1:
##     :                                   :   :...MonthOfDeath <= 6: 0 (36.1/14.9)
##     :                                   :       MonthOfDeath > 6: 1 (42.7/18)
##     :                                   Education2003Revision = 2:
##     :                                   :...DayOfWeekOfDeath in {1,
##     :                                   :   :                    3}: 1 (33.8/11.9)
##     :                                   :   DayOfWeekOfDeath in {2,4,6,7,
##     :                                   :   :                    9}: 0 (76.3/30.7)
##     :                                   :   DayOfWeekOfDeath = 5: [S6]
##     :                                   Education2003Revision = 3: [S7]
##     PlaceOfDeathAndDecedentsStatus = 6:
##     :...Autopsy in {n,y,Y}: 1 (260/8)
##         Autopsy in {N,U}:
##         :...Education2003Revision = 9:
##             :...MaritalStatus in {D,M,S,W}: 0 (667/50)
##             :   MaritalStatus = U: 1 (11/4)
##             Education2003Revision in {0,1,2,3,4,5,6,7,8}:
##             :...Education2003Revision = 0: 1 (305/46)
##                 Education2003Revision in {1,2,3,4,5,6,7,8}:
##                 :...Race > 2: 1 (47/8)
##                     Race <= 2:
##                     :...agegrp in {01. <5yo,02. 5-10yo,03. 11-15yo,05. 21-25yo,
##                         :          06. 26-30yo,07. 31-40yo,
##                         :          08. 41-50yo}: 1 (72/13)
##                         agegrp in {04. 15-20yo,09. 51-60yo,10. 61-70yo,
##                         :          11. 71-80yo,12. >80yo}:
##                         :...Autopsy = U: 0 (129/40)
##                             Autopsy = N:
##                             :...InjuryAtWork = Y: 0 (38/13)
##                                 InjuryAtWork = N:
##                                 :...Education2003Revision in {1,4,5,6,7,8}:
##                                     :...MaritalStatus in {D,M,U,
##                                     :   :                 W}: 1 (1666/686)
##                                     :   MaritalStatus = S:
##                                     :   :...agegrp in {04. 15-20yo,
##                                     :       :          12. >80yo}: 0 (85/30)
##                                     :       agegrp in {09. 51-60yo,
##                                     :       :          11. 71-80yo}: 1 (39/16)
##                                     :       agegrp = 10. 61-70yo:
##                                     :       :...Sex = M: 1 (12/3)
##                                     :           Sex = F: [S8]
##                                     Education2003Revision in {2,3}:
##                                     :...Race > 1: 0 (84/30)
##                                         Race <= 1:
##                                         :...agegrp in {04. 15-20yo,
##                                             :          09. 51-60yo}: 1 (58/22)
##                                             agegrp = 10. 61-70yo:
##                                             :...MaritalStatus in {D,U,
##                                             :   :                 W}: 0 (70/23)
##                                             :   MaritalStatus = S: [S9]
##                                             :   MaritalStatus = M: [S10]
##                                             agegrp = 11. 71-80yo: [S11]
##                                             agegrp = 12. >80yo:
##                                             :...MaritalStatus in {D,M,
##                                                 :                 U}: 1 (484/204)
##                                                 MaritalStatus = S: 0 (102/35)
##                                                 MaritalStatus = W: [S12]
## 
## SubTree [S1]
## 
## PlaceOfDeathAndDecedentsStatus in {2,5,6}: 1 (68/30)
## PlaceOfDeathAndDecedentsStatus = 4: 0 (29/11)
## 
## SubTree [S2]
## 
## PlaceOfDeathAndDecedentsStatus = 3: 0 (4)
## PlaceOfDeathAndDecedentsStatus in {7,9}: 1 (9/2)
## 
## SubTree [S3]
## 
## PlaceOfDeathAndDecedentsStatus = 3: 1 (9/1)
## PlaceOfDeathAndDecedentsStatus = 9: 0 (1)
## PlaceOfDeathAndDecedentsStatus = 7:
## :...ResidentStatus in {2,3}: 1 (31/11)
##     ResidentStatus = 1:
##     :...MonthOfDeath <= 2: 1 (5/1)
##         MonthOfDeath > 2: 0 (21/6)
## 
## SubTree [S4]
## 
## PlaceOfDeathAndDecedentsStatus = 4: 1 (3041/997)
## PlaceOfDeathAndDecedentsStatus = 9: 0 (7/2)
## 
## SubTree [S5]
## 
## PlaceOfDeathAndDecedentsStatus = 4: 1 (144.3/60)
## PlaceOfDeathAndDecedentsStatus = 9: 0 (2.6)
## 
## SubTree [S6]
## 
## PlaceOfDeathAndDecedentsStatus = 9: 1 (1)
## PlaceOfDeathAndDecedentsStatus = 4:
## :...MonthOfDeath <= 2: 1 (3.9/0.9)
##     MonthOfDeath > 2: 0 (11.4/2)
## 
## SubTree [S7]
## 
## PlaceOfDeathAndDecedentsStatus = 9: 0 (8.7/1)
## PlaceOfDeathAndDecedentsStatus = 4:
## :...MaritalStatus in {S,W}: 0 (424.7/191.1)
##     MaritalStatus = U: 1 (1)
##     MaritalStatus = D:
##     :...Autopsy = N: 1 (50.8/21.3)
##     :   Autopsy = U: 0 (3.6/1)
##     MaritalStatus = M:
##     :...ResidentStatus in {3,4}: 0 (0)
##         ResidentStatus = 2: 1 (3)
##         ResidentStatus = 1:
##         :...agegrp = 11. 71-80yo: 0 (75.8/23.9)
##             agegrp = 12. >80yo:
##             :...DayOfWeekOfDeath in {2,3,7}: 0 (37.4/12.7)
##                 DayOfWeekOfDeath in {4,5,6,9}: 1 (46.6/19)
##                 DayOfWeekOfDeath = 1:
##                 :...MonthOfDeath <= 4: 1 (5)
##                     MonthOfDeath > 4: 0 (6.3/2)
## 
## SubTree [S8]
## 
## Education2003Revision in {1,4,5,6,8}: 0 (6)
## Education2003Revision = 7: 1 (2)
## 
## SubTree [S9]
## 
## Education2003Revision = 2: 0 (5/1)
## Education2003Revision = 3: 1 (22/8)
## 
## SubTree [S10]
## 
## ResidentStatus = 3: 1 (1)
## ResidentStatus = 1:
## :...Sex = F: 0 (15/4)
## :   Sex = M: 1 (14/4)
## ResidentStatus in {2,4}:
## :...Sex = F: 1 (4/1)
##     Sex = M: 0 (8/2)
## 
## SubTree [S11]
## 
## DayOfWeekOfDeath in {2,3,9}: 0 (98/28)
## DayOfWeekOfDeath = 5: 1 (39/18)
## DayOfWeekOfDeath = 1:
## :...MaritalStatus in {D,S,U}: 0 (10/2)
## :   MaritalStatus in {M,W}: 1 (32/13)
## DayOfWeekOfDeath = 4:
## :...MonthOfDeath <= 11: 0 (38/11)
## :   MonthOfDeath > 11: 1 (8/1)
## DayOfWeekOfDeath = 6:
## :...ResidentStatus in {2,3,4}: 0 (3)
## :   ResidentStatus = 1:
## :   :...Sex = F: 1 (19/7)
## :       Sex = M: 0 (13/2)
## DayOfWeekOfDeath = 7:
## :...MaritalStatus = D: 1 (11/4)
##     MaritalStatus in {M,S,U}: 0 (19/5)
##     MaritalStatus = W:
##     :...ResidentStatus in {3,4}: 0 (0)
##         ResidentStatus = 2: 1 (3)
##         ResidentStatus = 1:
##         :...MonthOfDeath <= 7: 0 (8)
##             MonthOfDeath > 7: 1 (7/2)
## 
## SubTree [S12]
## 
## DayOfWeekOfDeath in {1,2,7,9}: 0 (601/260)
## DayOfWeekOfDeath = 4: 1 (220/103)
## DayOfWeekOfDeath = 6:
## :...MonthOfDeath <= 4: 0 (65/20)
## :   MonthOfDeath > 4: 1 (115/51)
## DayOfWeekOfDeath = 5:
## :...PlaceOfInjury in {1,2,3,4,5,6,7,99}: 1 (104.5/46.4)
## :   PlaceOfInjury in {8,9}: 0 (10.8/3.1)
## :   PlaceOfInjury = 0:
## :   :...Education2003Revision = 2: 0 (7.9/2)
## :       Education2003Revision = 3: 1 (58.8/25.1)
## DayOfWeekOfDeath = 3:
## :...Education2003Revision = 2: 0 (32/12)
##     Education2003Revision = 3:
##     :...MonthOfDeath > 9: 0 (54/25)
##         MonthOfDeath <= 9:
##         :...MonthOfDeath <= 3: 0 (31/12)
##             MonthOfDeath > 3: 1 (92/42)
## 
## 
## Evaluation on training data (1315301 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##     197 26128( 2.0%)   <<
## 
## 
##       (a)     (b)    <-classified as
##    ------  ------
##   1194916    8891    (a): class 0
##     17237   94257    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% InjuryAtWork
##  100.00% Autopsy
##   95.80% agegrp
##   88.38% Education2003Revision
##   14.05% PlaceOfDeathAndDecedentsStatus
##    1.57% PlaceOfInjury
##    1.12% MonthOfDeath
##    0.63% ResidentStatus
##    0.60% Race
##    0.57% MaritalStatus
##    0.57% Sex
##    0.22% DayOfWeekOfDeath
##    0.16% RaceRecode5
## 
## 
## Time: 21.2 secs

A look at the results indicates that I should look at a few more of the variables in detail, such as InjuryAtWork, Autopsy, and PlaceOfDeathAndDecedentsStatus. It’s possible that those contain information that would unreasonably correlate with deaths from external causes.

mdf <- dplyr::tbl_df(m_all)
mdf %>% 
      group_by(ext,InjuryAtWork) %>% 
      summarise(n=n()) %>% 
      mutate(freq=percent(n/sum(n)))

## Source: local data frame [6 x 4]
## Groups: ext [2]
## 
##      ext InjuryAtWork       n  freq
##   (fctr)       (fctr)   (int) (chr)
## 1      0            N   22371  0.9%
## 2      0            U 2385048 99.1%
## 3      0            Y     194  0.0%
## 4      1            N  182925 82.0%
## 5      1            U   35874 16.1%
## 6      1            Y    4188  1.9%

mdf %>% 
      group_by(ext,Autopsy) %>% 
      summarise(n=n()) %>% 
      mutate(freq=percent(n/sum(n)))

## Source: local data frame [6 x 4]
## Groups: ext [2]
## 
##      ext Autopsy       n  freq
##   (fctr)  (fctr)   (int) (chr)
## 1      0       N 2162001 89.8%
## 2      0       U  162013  6.7%
## 3      0       Y   83599  3.5%
## 4      1       N   96537 43.3%
## 5      1       U    6621  3.0%
## 6      1       Y  119829 53.7%

mdf %>% 
      group_by(ext,PlaceOfDeathAndDecedentsStatus) %>% 
      summarise(n=n()) %>% 
      mutate(freq=percent(n/sum(n)))

## Source: local data frame [16 x 4]
## Groups: ext [2]
## 
##       ext PlaceOfDeathAndDecedentsStatus      n  freq
##    (fctr)                         (fctr)  (int) (chr)
## 1       0                              1 744897 30.9%
## 2       0                              2 143056  5.9%
## 3       0                              3   9057  0.4%
## 4       0                              4 706143 29.3%
## 5       0                              5 181323  7.5%
## 6       0                              6 514489 21.4%
## 7       0                              7 106709  4.4%
## 8       0                              9   1939  0.1%
## 9       1                              1  50176 22.5%
## 10      1                              2  31046 13.9%
## 11      1                              3   3939  1.8%
## 12      1                              4  67598 30.3%
## 13      1                              5   6016  2.7%
## 14      1                              6   8743  3.9%
## 15      1                              7  54913 24.6%
## 16      1                              9    556  0.2%

Looking at the results, it seems to make sense to remove these for the following reasons:

InjuryAtWork: It seems that when the work injury is included on the death report, there may be a high likelihood for the death to be related to the workplace injury, so this probably needs to be removed.
Autopsy: For the death by non-external causes group, almost 90% of the deceased did not have an autopsy. In the death by external causes group, over half of the deceased did have an autopsy. Since having an autopsy is not a predictor of dying, this needs to be removed.
PlaceOfDeathAndDecedentsStatus: The things that pop out are groups 6 and 7. It turns out that 6 is “Residential Facility”, such as a nursing home. This would correlate too strongly with older people dying in a nursing home. This needs to be removed, and the model rerun.

It turns out that ActivityCode and PlaceOfInjury are mostly NAs and will screw up the random forest; therefore these need to be removed as well.

drop3 <- c("InjuryAtWork","Autopsy","PlaceOfDeathAndDecedentsStatus","ActivityCode","PlaceOfInjury")
m_all <- m_all[,!(names(m_all) %in% drop3)]
set.seed(4321)

# HAD TO CUT THE DATA BY 90% AFTER THE FACT BECAUSE RENDERING IN RMARKDOWN OVERSHOT MY 16GB MEMORY FOR THE RF. 
cutIndex <- createDataPartition(m_all$ext, p = .9,list=FALSE,times=1)
mhalf <- m_all[cutIndex,]
mrest <- m_all[-cutIndex,]

# MODELING ON A 50/25/25 PARTITION OF 10% OF THE DATA.
trainIndex <- createDataPartition(m_all$ext, p = .5,list=FALSE,times=1)
mtrain <- mrest[trainIndex,]
mrest2 <- mrest[-trainIndex,]
valIndex <- createDataPartition(m_all$ext, p = .5,list=FALSE,times=1)
mval <- mrest2[valIndex,]
mtest <- mrest2[-valIndex,]
str(mtrain)

## 'data.frame':    1315301 obs. of  10 variables:
##  $ ResidentStatus       : Factor w/ 4 levels "1","2","3","4": 3 1 1 1 1 1 2 1 1 1 ...
##  $ Education2003Revision: Factor w/ 10 levels "0","1","2","3",..: 5 4 4 3 6 3 4 2 4 6 ...
##  $ MonthOfDeath         : int  1 1 1 1 1 1 1 2 2 2 ...
##  $ Sex                  : Factor w/ 2 levels "F","M": 2 2 2 1 2 1 1 1 1 1 ...
##  $ MaritalStatus        : Factor w/ 5 levels "D","M","S","U",..: 3 2 1 2 2 3 1 5 2 5 ...
##  $ DayOfWeekOfDeath     : Factor w/ 8 levels "1","2","3","4",..: 7 7 7 7 6 4 1 1 3 7 ...
##  $ Race                 : int  1 1 1 3 1 3 3 1 1 1 ...
##  $ RaceRecode5          : Factor w/ 4 levels "1","2","3","4": 1 1 1 3 1 3 3 1 1 1 ...
##  $ agegrp               : Factor w/ 12 levels "01.    <5yo",..: 5 10 8 12 9 7 9 12 12 12 ...
##  $ ext                  : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 1 1 1 ...

m2 <- C5.0(x=mtrain[-10],y=mtrain$ext)
summary(m2)

## 
## Call:
## C5.0.default(x = mtrain[-10], y = mtrain$ext)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Jun  9 23:07:40 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## *** ignoring cases with bad or unknown class
## 
## Read 131205 cases (10 attributes) from undefined.data
## 
## Decision tree:
## 
## agegrp in {01. <5yo,02. 5-10yo,08. 41-50yo,09. 51-60yo,10. 61-70yo,11. 71-80yo,
## :          12. >80yo}: 0 (125243/7603)
## agegrp in {03. 11-15yo,04. 15-20yo,05. 21-25yo,06. 26-30yo,07. 31-40yo}:
## :...agegrp in {04. 15-20yo,05. 21-25yo,06. 26-30yo}:
##     :...Education2003Revision in {0,2,3,4,5,8}:
##     :   :...Sex = M: 1 (1805/354)
##     :   :   Sex = F:
##     :   :   :...RaceRecode5 in {1,3,4}: 1 (533/149)
##     :   :       RaceRecode5 = 2:
##     :   :       :...agegrp in {04. 15-20yo,05. 21-25yo}: 1 (69/30)
##     :   :           agegrp = 06. 26-30yo: 0 (50/17)
##     :   Education2003Revision in {1,6,7,9}:
##     :   :...agegrp = 04. 15-20yo: 0 (58/9)
##     :       agegrp in {05. 21-25yo,06. 26-30yo}:
##     :       :...RaceRecode5 = 2: 0 (48/18)
##     :           RaceRecode5 = 3: 1 (2)
##     :           RaceRecode5 = 4:
##     :           :...Race <= 28: 0 (9/3)
##     :           :   Race > 28: 1 (5)
##     :           RaceRecode5 = 1:
##     :           :...MaritalStatus in {D,M,U,W}: 1 (51/18)
##     :               MaritalStatus = S:
##     :               :...Education2003Revision in {6,7,9}: 1 (119/42)
##     :                   Education2003Revision = 1:
##     :                   :...Sex = F: 0 (22/3)
##     :                       Sex = M:
##     :                       :...DayOfWeekOfDeath in {4,6,9}: 0 (13/1)
##     :                           DayOfWeekOfDeath in {1,2,3,5,7}:
##     :                           :...ResidentStatus in {1,4}: 1 (27/5)
##     :                               ResidentStatus in {2,3}: 0 (14/4)
##     agegrp in {03. 11-15yo,07. 31-40yo}:
##     :...Sex = M:
##         :...RaceRecode5 in {2,4}:
##         :   :...MaritalStatus = U: 0 (1)
##         :   :   MaritalStatus = W: 1 (3)
##         :   :   MaritalStatus = D:
##         :   :   :...Race <= 48: 1 (32/10)
##         :   :   :   Race > 48: 0 (2)
##         :   :   MaritalStatus in {M,S}:
##         :   :   :...ResidentStatus in {1,4}: 0 (307/134)
##         :   :       ResidentStatus = 2:
##         :   :       :...agegrp = 07. 31-40yo: 1 (63/27)
##         :   :       :   agegrp = 03. 11-15yo:
##         :   :       :   :...DayOfWeekOfDeath in {1,2,3,4,5,6,9}: 0 (7)
##         :   :       :       DayOfWeekOfDeath = 7: 1 (2)
##         :   :       ResidentStatus = 3:
##         :   :       :...agegrp = 03. 11-15yo: 0 (5)
##         :   :           agegrp = 07. 31-40yo:
##         :   :           :...MonthOfDeath <= 8: 0 (21/3)
##         :   :               MonthOfDeath > 8: 1 (9/2)
##         :   RaceRecode5 in {1,3}:
##         :   :...MaritalStatus in {D,W}: 1 (244/79)
##         :       MaritalStatus = U: 0 (7/2)
##         :       MaritalStatus in {M,S}:
##         :       :...Education2003Revision in {0,2,5,7}: 1 (375/139)
##         :           Education2003Revision = 8:
##         :           :...MaritalStatus = M: 1 (4/1)
##         :           :   MaritalStatus = S: 0 (5/1)
##         :           Education2003Revision = 4:
##         :           :...ResidentStatus in {1,3,4}: 1 (152/64)
##         :           :   ResidentStatus = 2:
##         :           :   :...Race <= 2: 0 (33/12)
##         :           :       Race > 2: 1 (2)
##         :           Education2003Revision = 6:
##         :           :...MaritalStatus = S: 1 (55/20)
##         :           :   MaritalStatus = M:
##         :           :   :...ResidentStatus in {1,3,4}: 0 (33/12)
##         :           :       ResidentStatus = 2: 1 (7/1)
##         :           Education2003Revision = 9:
##         :           :...MaritalStatus = M: 0 (7/2)
##         :           :   MaritalStatus = S:
##         :           :   :...MonthOfDeath <= 5: 1 (6)
##         :           :       MonthOfDeath > 5: 0 (4/1)
##         :           Education2003Revision = 1:
##         :           :...ResidentStatus = 2: 0 (30/9)
##         :           :   ResidentStatus = 3:
##         :           :   :...MaritalStatus = M: 1 (2)
##         :           :   :   MaritalStatus = S: 0 (7/2)
##         :           :   ResidentStatus = 4:
##         :           :   :...agegrp = 03. 11-15yo: 0 (2)
##         :           :   :   agegrp = 07. 31-40yo: 1 (3)
##         :           :   ResidentStatus = 1:
##         :           :   :...agegrp = 03. 11-15yo: 1 (46/15)
##         :           :       agegrp = 07. 31-40yo:
##         :           :       :...DayOfWeekOfDeath = 1: 1 (8/2)
##         :           :           DayOfWeekOfDeath in {4,5,7,9}: 0 (22/6)
##         :           :           DayOfWeekOfDeath = 2:
##         :           :           :...MonthOfDeath <= 8: 1 (5/1)
##         :           :           :   MonthOfDeath > 8: 0 (2)
##         :           :           DayOfWeekOfDeath = 3:
##         :           :           :...MonthOfDeath <= 10: 1 (6/1)
##         :           :           :   MonthOfDeath > 10: 0 (2)
##         :           :           DayOfWeekOfDeath = 6:
##         :           :           :...MaritalStatus = M: 0 (4/1)
##         :           :               MaritalStatus = S: 1 (5/1)
##         :           Education2003Revision = 3:
##         :           :...MonthOfDeath > 4: 1 (326/113)
##         :               MonthOfDeath <= 4:
##         :               :...DayOfWeekOfDeath in {3,7}: 1 (37/12)
##         :                   DayOfWeekOfDeath in {4,5,6,9}: 0 (69/25)
##         :                   DayOfWeekOfDeath = 2:
##         :                   :...MonthOfDeath <= 1: 0 (4)
##         :                   :   MonthOfDeath > 1: 1 (16/7)
##         :                   DayOfWeekOfDeath = 1:
##         :                   :...MaritalStatus = M: 1 (7)
##         :                       MaritalStatus = S:
##         :                       :...MonthOfDeath <= 2: 0 (4)
##         :                           MonthOfDeath > 2: 1 (7/2)
##         Sex = F:
##         :...RaceRecode5 in {2,4}: 0 (275/58)
##             RaceRecode5 in {1,3}:
##             :...ResidentStatus = 4: 1 (5)
##                 ResidentStatus in {1,2,3}:
##                 :...MaritalStatus in {D,W}: 1 (170/73)
##                     MaritalStatus = U: 0 (11/2)
##                     MaritalStatus in {M,S}:
##                     :...ResidentStatus in {2,3}:
##                         :...RaceRecode5 = 1: 0 (172/49)
##                         :   RaceRecode5 = 3: 1 (4/1)
##                         ResidentStatus = 1:
##                         :...Education2003Revision in {2,7,8,9}: 0 (88/35)
##                             Education2003Revision in {4,5}: 1 (100/46)
##                             Education2003Revision = 0:
##                             :...MaritalStatus = M: 0 (28/10)
##                             :   MaritalStatus = S: 1 (21/8)
##                             Education2003Revision = 6:
##                             :...MonthOfDeath <= 2: 1 (2)
##                             :   MonthOfDeath > 2: 0 (36/13)
##                             Education2003Revision = 1:
##                             :...agegrp = 03. 11-15yo: 1 (28/12)
##                             :   agegrp = 07. 31-40yo:
##                             :   :...MaritalStatus = S: 0 (21/2)
##                             :       MaritalStatus = M:
##                             :       :...DayOfWeekOfDeath in {2,6}: 1 (4)
##                             :           DayOfWeekOfDeath in {1,3,4,5,7,
##                             :                                9}: 0 (3)
##                             Education2003Revision = 3:
##                             :...DayOfWeekOfDeath in {1,2,4,6,9}: 0 (94/40)
##                                 DayOfWeekOfDeath = 5: 1 (20/6)
##                                 DayOfWeekOfDeath = 7:
##                                 :...MaritalStatus = M: 1 (16/6)
##                                 :   MaritalStatus = S: 0 (13/5)
##                                 DayOfWeekOfDeath = 3:
##                                 :...MaritalStatus = M: 0 (16/5)
##                                     MaritalStatus = S:
##                                     :...MonthOfDeath <= 10: 1 (8/1)
##                                         MonthOfDeath > 10: 0 (2)
## 
## 
## Evaluation on training data (131205 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      86 9335( 7.1%)   <<
## 
## 
##      (a)    (b)    <-classified as
##    -----  -----
##   118707   1248    (a): class 0
##     8087   3163    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% agegrp
##    4.32% Sex
##    3.52% Education2003Revision
##    3.12% RaceRecode5
##    2.37% MaritalStatus
##    1.29% ResidentStatus
##    0.44% MonthOfDeath
##    0.33% DayOfWeekOfDeath
##    0.06% Race
## 
## 
## Time: 0.7 secs

IV. Evaluate the model.

Taking a look at the validation and test runs of this model (ROC on test):

pv1 <- predict(m2, mval)
pt1 <- predict(m2, mtest)

confusionMatrix(pv1,mval$ext,positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 59922  4077
##          1   731  1413
##                                           
##                Accuracy : 0.9273          
##                  95% CI : (0.9253, 0.9293)
##     No Information Rate : 0.917           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3394          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.25738         
##             Specificity : 0.98795         
##          Pos Pred Value : 0.65905         
##          Neg Pred Value : 0.93630         
##              Prevalence : 0.08300         
##          Detection Rate : 0.02136         
##    Detection Prevalence : 0.03241         
##       Balanced Accuracy : 0.62266         
##                                           
##        'Positive' Class : 1               
##

confusionMatrix(pt1,mtest$ext,positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 59433  4099
##          1   720  1459
##                                           
##                Accuracy : 0.9267          
##                  95% CI : (0.9246, 0.9286)
##     No Information Rate : 0.9154          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.346           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.26250         
##             Specificity : 0.98803         
##          Pos Pred Value : 0.66957         
##          Neg Pred Value : 0.93548         
##              Prevalence : 0.08458         
##          Detection Rate : 0.02220         
##    Detection Prevalence : 0.03316         
##       Balanced Accuracy : 0.62527         
##                                           
##        'Positive' Class : 1               
##

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

auc(as.numeric(mtest$ext),as.numeric(pt1))

## Area under the curve: 0.6253

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

pred <- prediction(predictions = as.numeric(pt1), labels=as.numeric(mtest$ext))
roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf)
abline(a=0, b=1)

V. Improve the model.

As I feel that I did a bit of tuning of the C5.0 model as I went along, I’d rather try the Random Forest algorithm to see if that gives us a better outcome. I’ll use the same training, validation, and test datasets.

Since we have 9 explanatory variables, I’ll try an n of 3, and use 300 trees in order to prevent my computer from melting down.

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

rf <- randomForest(mtrain$ext ~ ., data = mtrain, mtry=3, ntree=300, na.action=na.roughfix)
rf

## 
## Call:
##  randomForest(formula = mtrain$ext ~ ., data = mtrain, mtry = 3,      ntree = 300, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.74%
## Confusion matrix:
##         0    1 class.error
## 0 1302595 1456 0.001116521
## 1    8260 2990 0.734222222

Look at model performance on validation set

pv2 <- predict(rf, mtest[-10],type="response")
head(pv2, n=10)

##   4   8  22  43 102 107 119 140 159 186 
##   0   0   0   0   0   0   0   0   0   0 
## Levels: 0 1

confusionMatrix(pv2,mtest$ext,positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 59406  4103
##          1   746  1451
##                                           
##                Accuracy : 0.9262          
##                  95% CI : (0.9242, 0.9282)
##     No Information Rate : 0.9155          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3429          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.26125         
##             Specificity : 0.98760         
##          Pos Pred Value : 0.66045         
##          Neg Pred Value : 0.93539         
##              Prevalence : 0.08453         
##          Detection Rate : 0.02208         
##    Detection Prevalence : 0.03344         
##       Balanced Accuracy : 0.62443         
##                                           
##        'Positive' Class : 1               
##

Look at model performance on test set, including ROC

pt2 <- predict(rf, mtest[-10],type="response")
head(pt2, n=10)

##   4   8  22  43 102 107 119 140 159 186 
##   0   0   0   0   0   0   0   0   0   0 
## Levels: 0 1

confusionMatrix(pt2,mtest$ext,positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 59405  4104
##          1   747  1450
##                                           
##                Accuracy : 0.9262          
##                  95% CI : (0.9241, 0.9282)
##     No Information Rate : 0.9155          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3426          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.26107         
##             Specificity : 0.98758         
##          Pos Pred Value : 0.65999         
##          Neg Pred Value : 0.93538         
##              Prevalence : 0.08453         
##          Detection Rate : 0.02207         
##    Detection Prevalence : 0.03344         
##       Balanced Accuracy : 0.62433         
##                                           
##        'Positive' Class : 1               
##

library(pROC)
auc(as.numeric(mtest$ext),as.numeric(pt2))

## Area under the curve: 0.6243

library(ROCR)
pred <- prediction(predictions = as.numeric(pt2), labels=as.numeric(mtest$ext))
roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf)
abline(a=0, b=1)

Wrap-up

Below is a wrap up of the results from the two algorithms (model comparison on test dataset):

Model	Algorithm	Accuracy	Kappa	ROC
1	C5.0 (DT)	0.9267	0.346	0.6253
2	Random Forest	0.9262	0.3426	0.6243

So it appears that (1) the results are consistent, which probably means it’s a fair comparison of the two algorithms, (2) there was no improvement using RF, and (3) with a Kappa statistic that is severely lower than the Accuracy, we can see the effects of the overwhelming prevalence of deaths from non-external causes.

It’s interesting that in this case, the simple C5.0 algorithm may have slightly edged out the Random Forest, although the difference probably isn’t significant. (No, I have not thought about how to go about quantifying the statistical significance of this difference!)

While I feel like I didn’t do the greatest job on this, it was good practice on the skills learned in class, and from the Machine Learning in R book. Thanks so much for a great year, I learned so much in your classes!

Things I learned from this project.

Don’t trust anybody’s data, even the CDC’s. That AgeRecode12 field was utter garbage.
I need to decide which data to use faster. At some point it doesn’t matter what it is, it just needs to get done, but I think it’s a form of procrastination and I am not sure how I am going to fix this.
When I create new categorical variables that appear numeric, remember to convert them to factors.
This dataset was too big for my computers. When the algorithms run in RMarkdown, at one point (chunk 22), memory usage spikes to over 16GB. The algorithms ran fine in RStudio, but it was not possible to render in RMarkdown. Therefore I had to reduce the size of the data partitions after I had written up my entire report. This was a major bummer but a good lesson.

Things I feel I should have done better.

Visualizations! Correlations would have been good, as well as frequencies etc.
I should have used some other algorithms.

Items for further study.

How to deal with lookup tables in R.
How to integrate graphics (hello, ggplot2?) into my dplyr stuff.
How to write switch functions!
Functional coding.