DATA 622 Homework 2

library(tidyverse)

## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.3.5     v purrr   0.3.2
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(DataExplorer)
library(rpart)
library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

Objective

Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.

Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.

Based on real cases where desicion trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

Data Exploration

I have decided to use data from “data.cityofnewyork.us” website, which has a collection of different data sets to choose from and work with. Within this website I found a data set with the leading causes of death in New York City that seemed interesting.

We will first load the data, which I have saved in a local folder, and explore it.

data <- readr::read_csv("New_York_City_Leading_Causes_of_Death.csv")

## Parsed with column specification:
## cols(
##   Year = col_double(),
##   `Leading Cause` = col_character(),
##   Sex = col_character(),
##   `Race Ethnicity` = col_character(),
##   Deaths = col_character(),
##   `Death Rate` = col_character(),
##   `Age Adjusted Death Rate` = col_character()
## )

The data contain the variables “Year, Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, and Age Adjusted Death Rate”.

head(data)

## # A tibble: 6 x 7
##    Year `Leading Cause`             Sex   `Race Ethnicity`   Deaths `Death Rate`
##   <dbl> <chr>                       <chr> <chr>              <chr>  <chr>       
## 1  2009 Nephritis, Nephrotic Syndr~ F     Other Race/ Ethni~ .      .           
## 2  2013 Influenza (Flu) and Pneumo~ F     Hispanic           204    16.3        
## 3  2012 Assault (Homicide: Y87.1, ~ M     Other Race/ Ethni~ .      .           
## 4  2007 Essential Hypertension and~ F     Not Stated/Unknown 5      .           
## 5  2014 Cerebrovascular Disease (S~ F     White Non-Hispanic 418    29.5        
## 6  2009 Essential Hypertension and~ M     Asian and Pacific~ 26     5.1         
## # ... with 1 more variable: Age Adjusted Death Rate <chr>

Since “Leading Cause” and “Race Ethnicity” seem like categorical variables we may want to use in our model, let’s take a look at how many different categories we would find within each of these variables.

data %>%
  distinct(`Leading Cause`)

## # A tibble: 34 x 1
##    `Leading Cause`                                                              
##    <chr>                                                                        
##  1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)      
##  2 Influenza (Flu) and Pneumonia (J09-J18)                                      
##  3 Assault (Homicide: Y87.1, X85-Y09)                                           
##  4 Essential Hypertension and Renal Diseases (I10, I12)                         
##  5 Cerebrovascular Disease (Stroke: I60-I69)                                    
##  6 All Other Causes                                                             
##  7 Mental and Behavioral Disorders due to Accidental Poisoning and Other Psycho~
##  8 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)             
##  9 Intentional Self-Harm (Suicide: X60-X84, Y87.0)                              
## 10 Chronic Lower Respiratory Diseases (J40-J47)                                 
## # ... with 24 more rows

data %>%
  distinct(`Race Ethnicity`)

## # A tibble: 8 x 1
##   `Race Ethnicity`          
##   <chr>                     
## 1 Other Race/ Ethnicity     
## 2 Hispanic                  
## 3 Not Stated/Unknown        
## 4 White Non-Hispanic        
## 5 Asian and Pacific Islander
## 6 Black Non-Hispanic        
## 7 Non-Hispanic White        
## 8 Non-Hispanic Black

Seems like we have 34 different categories for “Leading Cause” and 8 for “Race Ethnicity”.

From the structure of the data below, we can observe that a lot of these variables are type “Character”. We may need to transform some of the variables into factors in order to better analyze the data with R.

str(data)

## spec_tbl_df [1,272 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Year                   : num [1:1272] 2009 2013 2012 2007 2014 ...
##  $ Leading Cause          : chr [1:1272] "Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)" "Influenza (Flu) and Pneumonia (J09-J18)" "Assault (Homicide: Y87.1, X85-Y09)" "Essential Hypertension and Renal Diseases (I10, I12)" ...
##  $ Sex                    : chr [1:1272] "F" "F" "M" "F" ...
##  $ Race Ethnicity         : chr [1:1272] "Other Race/ Ethnicity" "Hispanic" "Other Race/ Ethnicity" "Not Stated/Unknown" ...
##  $ Deaths                 : chr [1:1272] "." "204" "." "5" ...
##  $ Death Rate             : chr [1:1272] "." "16.3" "." "." ...
##  $ Age Adjusted Death Rate: chr [1:1272] "." "18.5" "." "." ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Year = col_double(),
##   ..   `Leading Cause` = col_character(),
##   ..   Sex = col_character(),
##   ..   `Race Ethnicity` = col_character(),
##   ..   Deaths = col_character(),
##   ..   `Death Rate` = col_character(),
##   ..   `Age Adjusted Death Rate` = col_character()
##   .. )

These data contain 1,272 observations and 7 variables.

dim(data)

## [1] 1272    7

Although, below it shows that there are no missing values for “Deaths”, I did note on the file that there are about 138 observations that have a “.” instead of a value for “Deaths”. Something we will have to think about when dealing with these missing values in our models. we can also observe that we have missing values for “Age Adjusted Death Rate” and “Death Rate”, but I do not anticipate I will be using these variables in the models.

plot_missing(data)

Additionally, it seems that the variable “Sex” may have possible values that include “F”, “Female”, “M” and “Male”. We will substitute the values “Female” and “Male” with their corresponding letter equivalent. And the variables “Leading Cause” and “Race Ethnicity” have some repetitive values phrased in different ways. We will take care of cleaning up these variables in our next section.

Data Preparation

We will create a separate data set in order to maintain the original data and make all the necessary transformations there. First we’ll transform the previous variables that were of type character into factor.

data_prepared <- data

#Value substitutions cleanup
data_prepared$`Leading Cause`[data_prepared$`Leading Cause` == "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)"] <- "Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)"
data_prepared$`Leading Cause`[data_prepared$`Leading Cause` == "Chronic Liver Disease and Cirrhosis (K70, K73)"] <- "Chronic Liver Disease and Cirrhosis (K70, K73-K74)"
data_prepared$`Leading Cause`[data_prepared$`Leading Cause` == "Intentional Self-Harm (Suicide: X60-X84, Y87.0)"] <- "Intentional Self-Harm (Suicide: U03, X60-X84, Y87.0)"
data_prepared$Sex[data_prepared$Sex == "Female"] <- "F"
data_prepared$Sex[data_prepared$Sex == "Male"] <- "M"
data_prepared$`Race Ethnicity`[data_prepared$`Race Ethnicity` == "Non-Hispanic Black"] <- "Black Non-Hispanic"
data_prepared$`Race Ethnicity`[data_prepared$`Race Ethnicity` == "Non-Hispanic White"] <- "White Non-Hispanic"

#Data type change for columns of interest
data_prepared$Year <- as.integer(data_prepared$Year)
data_prepared$`Leading Cause` <- as.factor(data_prepared$`Leading Cause`)
data_prepared$Sex <- as.factor(data_prepared$Sex)
data_prepared$`Race Ethnicity` <- as.factor(data_prepared$`Race Ethnicity`)
data_prepared$Deaths <- as.integer(data_prepared$Deaths)

## Warning: NAs introduced by coercion

We have changed the structure of our variables as it is shown below.

str(data_prepared)

## spec_tbl_df [1,272 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Year                   : int [1:1272] 2009 2013 2012 2007 2014 2009 2013 2007 2013 2008 ...
##  $ Leading Cause          : Factor w/ 31 levels "Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)",..: 26 20 7 18 9 18 20 7 2 9 ...
##  $ Sex                    : Factor w/ 2 levels "F","M": 1 1 2 1 1 2 2 2 2 2 ...
##  $ Race Ethnicity         : Factor w/ 6 levels "Asian and Pacific Islander",..: 5 3 5 4 6 1 6 2 4 3 ...
##  $ Deaths                 : int [1:1272] NA 204 NA 5 418 26 618 267 139 124 ...
##  $ Death Rate             : chr [1:1272] "." "16.3" "." "." ...
##  $ Age Adjusted Death Rate: chr [1:1272] "." "18.5" "." "." ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Year = col_double(),
##   ..   `Leading Cause` = col_character(),
##   ..   Sex = col_character(),
##   ..   `Race Ethnicity` = col_character(),
##   ..   Deaths = col_character(),
##   ..   `Death Rate` = col_character(),
##   ..   `Age Adjusted Death Rate` = col_character()
##   .. )

hist(data_prepared$Deaths)

Given that the distribution of “Deaths” is skewed to the right, I will use the median to replace the missing values that were present on this column to avoid including any bias in the results.

data_prepared$Deaths[is.na(data_prepared$Deaths)] <- median(data_prepared$Deaths, na.rm=TRUE)

# We use this line of code to get rid of decimals.
data_prepared$Deaths <- trunc(data_prepared$Deaths)

We can also observe the summary statistics for our transformed data.

summary(data_prepared)

##       Year                                             Leading Cause Sex    
##  Min.   :2007   All Other Causes                              :110   F:644  
##  1st Qu.:2009   Diseases of Heart (I00-I09, I11, I13, I20-I51):110   M:628  
##  Median :2011   Influenza (Flu) and Pneumonia (J09-J18)       :110          
##  Mean   :2012   Malignant Neoplasms (Cancer: C00-C97)         :110          
##  3rd Qu.:2013   Diabetes Mellitus (E10-E14)                   :106          
##  Max.   :2019   Cerebrovascular Disease (Stroke: I60-I69)     :103          
##                 (Other)                                       :623          
##                     Race Ethnicity     Deaths        Death Rate       
##  Asian and Pacific Islander:199    Min.   :   1.0   Length:1272       
##  Black Non-Hispanic        :200    1st Qu.:  36.0   Class :character  
##  Hispanic                  :199    Median : 136.0   Mode  :character  
##  Not Stated/Unknown        :223    Mean   : 391.8                     
##  Other Race/ Ethnicity     :253    3rd Qu.: 266.2                     
##  White Non-Hispanic        :198    Max.   :7050.0                     
##                                                                       
##  Age Adjusted Death Rate
##  Length:1272            
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
##

Build Models

I would like to predict the number of “Deaths” variable and find out whether variables “Leading Cause”, “Sex” and “Race Ethnicity” have any relationship with our dependent variable.

The code below has been borrowed from the class examples in order to run a decision tree model on our data.

We first take a look at the “Sex” variable to see if it is evenly distributed among our “Race Ethnicity” variable. At a first glance, we can observe that the number of Females and Males in the data seem to be evenly distributed.

# Two-way table of factor variables
xtabs(~Sex + `Race Ethnicity`, data = data_prepared)

##    Race Ethnicity
## Sex Asian and Pacific Islander Black Non-Hispanic Hispanic Not Stated/Unknown
##   F                         99                100      100                118
##   M                        100                100       99                105
##    Race Ethnicity
## Sex Other Race/ Ethnicity White Non-Hispanic
##   F                   128                 99
##   M                   125                 99

Our next step is to partition the data into training (80%) and test (20%) in order to measure how well the models perform.

# Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(data_prepared), replace = T, prob = c(0.8, 0.2))
train <- data_prepared[ind==1,]
test <- data_prepared[ind==2,]

First Decision Tree

We run the first decision tree model to predict number of “Deaths” with the variable “Leading Cause” below:

# grow tree
first_model <- rpart(Deaths ~ `Leading Cause`,
   method="anova", train)

summary(first_model) # detailed summary of splits

## Call:
## rpart(formula = Deaths ~ `Leading Cause`, data = train, method = "anova")
##   n= 1013 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.30826324      0 1.0000000 1.0014415 0.14785103
## 2 0.02183541      1 0.6917368 0.6942796 0.10517275
## 3 0.01000000      2 0.6699014 0.6976495 0.09964031
## 
## Variable importance
## Leading Cause 
##           100 
## 
## Node number 1: 1013 observations,    complexity param=0.3082632
##   mean=394.5025, MSE=674619.5 
##   left son=2 (763 obs) right son=3 (250 obs)
##   Primary splits:
##       Leading Cause splits as  LRLLLLLLLLLLLLLLRLLLLLRLLLLLLLL, improve=0.3082632, (0 missing)
## 
## Node number 2: 763 observations
##   mean=133.4679, MSE=15124.95 
## 
## Node number 3: 250 observations,    complexity param=0.02183541
##   mean=1191.18, MSE=1844741 
##   left son=6 (78 obs) right son=7 (172 obs)
##   Primary splits:
##       Leading Cause splits as  -L--------------R-----R--------, improve=0.03235595, (0 missing)
## 
## Node number 6: 78 observations
##   mean=828.3846, MSE=591049.4 
## 
## Node number 7: 172 observations
##   mean=1355.703, MSE=2326520

plotcp(first_model) # visualize cross-validation results

As we can see in the graphs below, the R-square increases with the number of splits, and the XRelative Error decreases as the number of splits increase.

# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(first_model) # visualize cross-validation results

## 
## Regression tree:
## rpart(formula = Deaths ~ `Leading Cause`, data = train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] Leading Cause
## 
## Root node error: 683389531/1013 = 674619
## 
## n= 1013 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.308263      0   1.00000 1.00144 0.14785
## 2 0.021835      1   0.69174 0.69428 0.10517
## 3 0.010000      2   0.66990 0.69765 0.09964

# plot tree
plot(first_model, uniform=TRUE,
   main="Regression Tree for Number of Deaths ")
text(first_model, use.n=TRUE, all=TRUE, cex=.8)

# make predictions
p <- predict(first_model, test)
# Root Mean Square Error
sqrt(mean((test$Deaths - p)^2))

## [1] 662.6958

# R-square
(cor(test$Deaths, p))^2

## [1] 0.2599384

We can observe above that the RMSE and R-square results are not great when making our predictions on the first model.

Second Decision Tree

We run the second decision tree model to predict number of “Deaths” with the variables “Race Ethnicity” and “Sex” below:

# grow second tree
second_model <- rpart(Deaths ~ `Race Ethnicity` + Sex,
   method="anova", train)

summary(second_model) # detailed summary of splits

## Call:
## rpart(formula = Deaths ~ `Race Ethnicity` + Sex, data = train, 
##     method = "anova")
##   n= 1013 
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.17590636      0 1.0000000 1.0022630 0.1479408
## 2 0.02837262      1 0.8240936 0.8293548 0.1190863
## 3 0.02029702      2 0.7957210 0.8205950 0.1128873
## 4 0.01000000      3 0.7754240 0.7838013 0.1077029
## 
## Variable importance
## Race Ethnicity 
##            100 
## 
## Node number 1: 1013 observations,    complexity param=0.1759064
##   mean=394.5025, MSE=674619.5 
##   left son=2 (695 obs) right son=3 (318 obs)
##   Primary splits:
##       Race Ethnicity splits as  LRLLLR, improve=1.759064e-01, (0 missing)
##       Sex            splits as  RL,     improve=2.280689e-06, (0 missing)
## 
## Node number 2: 695 observations,    complexity param=0.02029702
##   mean=161.4835, MSE=79789.07 
##   left son=4 (538 obs) right son=5 (157 obs)
##   Primary splits:
##       Race Ethnicity splits as  L-RLL-, improve=0.2501338000, (0 missing)
##       Sex            splits as  LR,     improve=0.0005989079, (0 missing)
## 
## Node number 3: 318 observations,    complexity param=0.02837262
##   mean=903.7736, MSE=1596615 
##   left son=6 (160 obs) right son=7 (158 obs)
##   Primary splits:
##       Race Ethnicity splits as  -L---R, improve=0.0381891900, (0 missing)
##       Sex            splits as  RL,     improve=0.0009853138, (0 missing)
##   Surrogate splits:
##       Sex splits as  LR, agree=0.509, adj=0.013, (0 split)
## 
## Node number 4: 538 observations
##   mean=85.16729, MSE=14644.65 
## 
## Node number 5: 157 observations
##   mean=423, MSE=214674 
## 
## Node number 6: 160 observations
##   mean=658.3937, MSE=531117.1 
## 
## Node number 7: 158 observations
##   mean=1152.259, MSE=2552882

plotcp(second_model) # visualize cross-validation results

As we can see in the graphs below, the R-square increases with the number of splits, and the XRelative Error decreases as the number of splits increase.

# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(second_model) # visualize cross-validation results

## 
## Regression tree:
## rpart(formula = Deaths ~ `Race Ethnicity` + Sex, data = train, 
##     method = "anova")
## 
## Variables actually used in tree construction:
## [1] Race Ethnicity
## 
## Root node error: 683389531/1013 = 674619
## 
## n= 1013 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.175906      0   1.00000 1.00226 0.14794
## 2 0.028373      1   0.82409 0.82935 0.11909
## 3 0.020297      2   0.79572 0.82059 0.11289
## 4 0.010000      3   0.77542 0.78380 0.10770

# plot tree
plot(second_model, uniform=TRUE,
   main="Regression Tree for Number of Deaths ")
text(second_model, use.n=TRUE, all=TRUE, cex=.8)

# make predictions
p2 <- predict(second_model, test)
# Root Mean Square Error
sqrt(mean((test$Deaths - p2)^2))

## [1] 654.8837

# R-square
(cor(test$Deaths, p2))^2

## [1] 0.2525162

We can observe above that the RMSE and R-square results did not improve after switching variables and making predictions on the second model.

Random Forest

We will now use a random forest model and add more predictor variables to our algorithm.

I first had to create another copy for our train and test data and change the column names to remove spaces as the randomForest() function was not working properly.

train2 <- train
colnames(train2)[2] <- "LeadingCause"
colnames(train2)[4] <- "RaceEthnicity"

test2 <- test
colnames(test2)[2] <- "LeadingCause"
colnames(test2)[4] <- "RaceEthnicity"

third_model <- randomForest(Deaths ~ LeadingCause + RaceEthnicity + Sex, importance = TRUE, na.action = na.omit, train2)

From the results below we can see that the percentage of variance explained with this model is about 76%, which suggests this is a very good model.

#Print regression model
print(third_model)

## 
## Call:
##  randomForest(formula = Deaths ~ LeadingCause + RaceEthnicity +      Sex, data = train2, importance = TRUE, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 176892.3
##                     % Var explained: 73.78

Below, we can observe that the “Error” decreases drastically only after a few trees in the model.

# Plot the error vs the number of trees graph
plot(third_model)

# make predictions
p3 <- predict(third_model, test2)
# Root Mean Square Error
sqrt(mean((test2$Deaths - p3)^2))

## [1] 341.5055

# R-square
(cor(test2$Deaths, p3))^2

## [1] 0.9323252

As we can observe above, the RMSE and R-square improved drastically with the random forest model and by adding extra predictor variables to our model.

Decision Tree with Same Predictors as RF

As a curiosity, I decided to run an extra decision tree model with the same predictors as the random forest previously.

# grow third tree
fourth_model <- rpart(Deaths ~ `Leading Cause` + `Race Ethnicity` + Sex,
   method="anova", train)

We would actually be able to get a better RMSE and R-square if we were to use the decision tree model with the same predictors we have used for our random forest model above. However, such a high performing R-square may suggest that we are over-fitting our data with this and the previous model.

# make predictions
p4 <- predict(fourth_model, test)
# Root Mean Square Error
sqrt(mean((test$Deaths - p4)^2))

## [1] 194.1992

# R-square
(cor(test$Deaths, p4))^2

## [1] 0.9447492

Conclusion

After executing different decision tree models with different predictor variables, I have not been able to determine which would be the most appropriate model to use. The models with fewers predictor variables do not perform well, and the models that include more predictor variables seem to work extraordinarily well and may suggest over-fit. In any case, if I have a decision tree that performs just as well as a random forest, I would most likely choose the decision tree as it is the simpler model.

Additionally, based on real cases where decision trees have gone “wrong”, I would argue that the machine learning version of a decision tree is an algorithm that can simplify an outcome for the user with its ability to compute different probabilities and arrive at an optimum result. Although, decision trees that are represented graphically can become very complicated, the algorithm behind a decision tree in machine learning can actually do a lot of the visual work for you but in a rather mathematical manner.