library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.3.5 v purrr 0.3.2
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(DataExplorer)
library(rpart)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.
Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.
Based on real cases where desicion trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
I have decided to use data from “data.cityofnewyork.us” website, which has a collection of different data sets to choose from and work with. Within this website I found a data set with the leading causes of death in New York City that seemed interesting.
We will first load the data, which I have saved in a local folder, and explore it.
data <- readr::read_csv("New_York_City_Leading_Causes_of_Death.csv")
## Parsed with column specification:
## cols(
## Year = col_double(),
## `Leading Cause` = col_character(),
## Sex = col_character(),
## `Race Ethnicity` = col_character(),
## Deaths = col_character(),
## `Death Rate` = col_character(),
## `Age Adjusted Death Rate` = col_character()
## )
The data contain the variables “Year, Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, and Age Adjusted Death Rate”.
head(data)
## # A tibble: 6 x 7
## Year `Leading Cause` Sex `Race Ethnicity` Deaths `Death Rate`
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2009 Nephritis, Nephrotic Syndr~ F Other Race/ Ethni~ . .
## 2 2013 Influenza (Flu) and Pneumo~ F Hispanic 204 16.3
## 3 2012 Assault (Homicide: Y87.1, ~ M Other Race/ Ethni~ . .
## 4 2007 Essential Hypertension and~ F Not Stated/Unknown 5 .
## 5 2014 Cerebrovascular Disease (S~ F White Non-Hispanic 418 29.5
## 6 2009 Essential Hypertension and~ M Asian and Pacific~ 26 5.1
## # ... with 1 more variable: Age Adjusted Death Rate <chr>
Since “Leading Cause” and “Race Ethnicity” seem like categorical variables we may want to use in our model, let’s take a look at how many different categories we would find within each of these variables.
data %>%
distinct(`Leading Cause`)
## # A tibble: 34 x 1
## `Leading Cause`
## <chr>
## 1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 Influenza (Flu) and Pneumonia (J09-J18)
## 3 Assault (Homicide: Y87.1, X85-Y09)
## 4 Essential Hypertension and Renal Diseases (I10, I12)
## 5 Cerebrovascular Disease (Stroke: I60-I69)
## 6 All Other Causes
## 7 Mental and Behavioral Disorders due to Accidental Poisoning and Other Psycho~
## 8 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## 9 Intentional Self-Harm (Suicide: X60-X84, Y87.0)
## 10 Chronic Lower Respiratory Diseases (J40-J47)
## # ... with 24 more rows
data %>%
distinct(`Race Ethnicity`)
## # A tibble: 8 x 1
## `Race Ethnicity`
## <chr>
## 1 Other Race/ Ethnicity
## 2 Hispanic
## 3 Not Stated/Unknown
## 4 White Non-Hispanic
## 5 Asian and Pacific Islander
## 6 Black Non-Hispanic
## 7 Non-Hispanic White
## 8 Non-Hispanic Black
Seems like we have 34 different categories for “Leading Cause” and 8 for “Race Ethnicity”.
From the structure of the data below, we can observe that a lot of these variables are type “Character”. We may need to transform some of the variables into factors in order to better analyze the data with R.
str(data)
## spec_tbl_df [1,272 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Year : num [1:1272] 2009 2013 2012 2007 2014 ...
## $ Leading Cause : chr [1:1272] "Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)" "Influenza (Flu) and Pneumonia (J09-J18)" "Assault (Homicide: Y87.1, X85-Y09)" "Essential Hypertension and Renal Diseases (I10, I12)" ...
## $ Sex : chr [1:1272] "F" "F" "M" "F" ...
## $ Race Ethnicity : chr [1:1272] "Other Race/ Ethnicity" "Hispanic" "Other Race/ Ethnicity" "Not Stated/Unknown" ...
## $ Deaths : chr [1:1272] "." "204" "." "5" ...
## $ Death Rate : chr [1:1272] "." "16.3" "." "." ...
## $ Age Adjusted Death Rate: chr [1:1272] "." "18.5" "." "." ...
## - attr(*, "spec")=
## .. cols(
## .. Year = col_double(),
## .. `Leading Cause` = col_character(),
## .. Sex = col_character(),
## .. `Race Ethnicity` = col_character(),
## .. Deaths = col_character(),
## .. `Death Rate` = col_character(),
## .. `Age Adjusted Death Rate` = col_character()
## .. )
These data contain 1,272 observations and 7 variables.
dim(data)
## [1] 1272 7
Although, below it shows that there are no missing values for “Deaths”, I did note on the file that there are about 138 observations that have a “.” instead of a value for “Deaths”. Something we will have to think about when dealing with these missing values in our models. we can also observe that we have missing values for “Age Adjusted Death Rate” and “Death Rate”, but I do not anticipate I will be using these variables in the models.
plot_missing(data)
Additionally, it seems that the variable “Sex” may have possible values that include “F”, “Female”, “M” and “Male”. We will substitute the values “Female” and “Male” with their corresponding letter equivalent. And the variables “Leading Cause” and “Race Ethnicity” have some repetitive values phrased in different ways. We will take care of cleaning up these variables in our next section.
We will create a separate data set in order to maintain the original data and make all the necessary transformations there. First we’ll transform the previous variables that were of type character into factor.
data_prepared <- data
#Value substitutions cleanup
data_prepared$`Leading Cause`[data_prepared$`Leading Cause` == "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)"] <- "Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)"
data_prepared$`Leading Cause`[data_prepared$`Leading Cause` == "Chronic Liver Disease and Cirrhosis (K70, K73)"] <- "Chronic Liver Disease and Cirrhosis (K70, K73-K74)"
data_prepared$`Leading Cause`[data_prepared$`Leading Cause` == "Intentional Self-Harm (Suicide: X60-X84, Y87.0)"] <- "Intentional Self-Harm (Suicide: U03, X60-X84, Y87.0)"
data_prepared$Sex[data_prepared$Sex == "Female"] <- "F"
data_prepared$Sex[data_prepared$Sex == "Male"] <- "M"
data_prepared$`Race Ethnicity`[data_prepared$`Race Ethnicity` == "Non-Hispanic Black"] <- "Black Non-Hispanic"
data_prepared$`Race Ethnicity`[data_prepared$`Race Ethnicity` == "Non-Hispanic White"] <- "White Non-Hispanic"
#Data type change for columns of interest
data_prepared$Year <- as.integer(data_prepared$Year)
data_prepared$`Leading Cause` <- as.factor(data_prepared$`Leading Cause`)
data_prepared$Sex <- as.factor(data_prepared$Sex)
data_prepared$`Race Ethnicity` <- as.factor(data_prepared$`Race Ethnicity`)
data_prepared$Deaths <- as.integer(data_prepared$Deaths)
## Warning: NAs introduced by coercion
We have changed the structure of our variables as it is shown below.
str(data_prepared)
## spec_tbl_df [1,272 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Year : int [1:1272] 2009 2013 2012 2007 2014 2009 2013 2007 2013 2008 ...
## $ Leading Cause : Factor w/ 31 levels "Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)",..: 26 20 7 18 9 18 20 7 2 9 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 1 2 1 1 2 2 2 2 2 ...
## $ Race Ethnicity : Factor w/ 6 levels "Asian and Pacific Islander",..: 5 3 5 4 6 1 6 2 4 3 ...
## $ Deaths : int [1:1272] NA 204 NA 5 418 26 618 267 139 124 ...
## $ Death Rate : chr [1:1272] "." "16.3" "." "." ...
## $ Age Adjusted Death Rate: chr [1:1272] "." "18.5" "." "." ...
## - attr(*, "spec")=
## .. cols(
## .. Year = col_double(),
## .. `Leading Cause` = col_character(),
## .. Sex = col_character(),
## .. `Race Ethnicity` = col_character(),
## .. Deaths = col_character(),
## .. `Death Rate` = col_character(),
## .. `Age Adjusted Death Rate` = col_character()
## .. )
hist(data_prepared$Deaths)
Given that the distribution of “Deaths” is skewed to the right, I will use the median to replace the missing values that were present on this column to avoid including any bias in the results.
data_prepared$Deaths[is.na(data_prepared$Deaths)] <- median(data_prepared$Deaths, na.rm=TRUE)
# We use this line of code to get rid of decimals.
data_prepared$Deaths <- trunc(data_prepared$Deaths)
We can also observe the summary statistics for our transformed data.
summary(data_prepared)
## Year Leading Cause Sex
## Min. :2007 All Other Causes :110 F:644
## 1st Qu.:2009 Diseases of Heart (I00-I09, I11, I13, I20-I51):110 M:628
## Median :2011 Influenza (Flu) and Pneumonia (J09-J18) :110
## Mean :2012 Malignant Neoplasms (Cancer: C00-C97) :110
## 3rd Qu.:2013 Diabetes Mellitus (E10-E14) :106
## Max. :2019 Cerebrovascular Disease (Stroke: I60-I69) :103
## (Other) :623
## Race Ethnicity Deaths Death Rate
## Asian and Pacific Islander:199 Min. : 1.0 Length:1272
## Black Non-Hispanic :200 1st Qu.: 36.0 Class :character
## Hispanic :199 Median : 136.0 Mode :character
## Not Stated/Unknown :223 Mean : 391.8
## Other Race/ Ethnicity :253 3rd Qu.: 266.2
## White Non-Hispanic :198 Max. :7050.0
##
## Age Adjusted Death Rate
## Length:1272
## Class :character
## Mode :character
##
##
##
##
I would like to predict the number of “Deaths” variable and find out whether variables “Leading Cause”, “Sex” and “Race Ethnicity” have any relationship with our dependent variable.
The code below has been borrowed from the class examples in order to run a decision tree model on our data.
We first take a look at the “Sex” variable to see if it is evenly distributed among our “Race Ethnicity” variable. At a first glance, we can observe that the number of Females and Males in the data seem to be evenly distributed.
# Two-way table of factor variables
xtabs(~Sex + `Race Ethnicity`, data = data_prepared)
## Race Ethnicity
## Sex Asian and Pacific Islander Black Non-Hispanic Hispanic Not Stated/Unknown
## F 99 100 100 118
## M 100 100 99 105
## Race Ethnicity
## Sex Other Race/ Ethnicity White Non-Hispanic
## F 128 99
## M 125 99
Our next step is to partition the data into training (80%) and test (20%) in order to measure how well the models perform.
# Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(data_prepared), replace = T, prob = c(0.8, 0.2))
train <- data_prepared[ind==1,]
test <- data_prepared[ind==2,]
We run the first decision tree model to predict number of “Deaths” with the variable “Leading Cause” below:
# grow tree
first_model <- rpart(Deaths ~ `Leading Cause`,
method="anova", train)
summary(first_model) # detailed summary of splits
## Call:
## rpart(formula = Deaths ~ `Leading Cause`, data = train, method = "anova")
## n= 1013
##
## CP nsplit rel error xerror xstd
## 1 0.30826324 0 1.0000000 1.0014415 0.14785103
## 2 0.02183541 1 0.6917368 0.6942796 0.10517275
## 3 0.01000000 2 0.6699014 0.6976495 0.09964031
##
## Variable importance
## Leading Cause
## 100
##
## Node number 1: 1013 observations, complexity param=0.3082632
## mean=394.5025, MSE=674619.5
## left son=2 (763 obs) right son=3 (250 obs)
## Primary splits:
## Leading Cause splits as LRLLLLLLLLLLLLLLRLLLLLRLLLLLLLL, improve=0.3082632, (0 missing)
##
## Node number 2: 763 observations
## mean=133.4679, MSE=15124.95
##
## Node number 3: 250 observations, complexity param=0.02183541
## mean=1191.18, MSE=1844741
## left son=6 (78 obs) right son=7 (172 obs)
## Primary splits:
## Leading Cause splits as -L--------------R-----R--------, improve=0.03235595, (0 missing)
##
## Node number 6: 78 observations
## mean=828.3846, MSE=591049.4
##
## Node number 7: 172 observations
## mean=1355.703, MSE=2326520
plotcp(first_model) # visualize cross-validation results
As we can see in the graphs below, the R-square increases with the number of splits, and the XRelative Error decreases as the number of splits increase.
# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(first_model) # visualize cross-validation results
##
## Regression tree:
## rpart(formula = Deaths ~ `Leading Cause`, data = train, method = "anova")
##
## Variables actually used in tree construction:
## [1] Leading Cause
##
## Root node error: 683389531/1013 = 674619
##
## n= 1013
##
## CP nsplit rel error xerror xstd
## 1 0.308263 0 1.00000 1.00144 0.14785
## 2 0.021835 1 0.69174 0.69428 0.10517
## 3 0.010000 2 0.66990 0.69765 0.09964
# plot tree
plot(first_model, uniform=TRUE,
main="Regression Tree for Number of Deaths ")
text(first_model, use.n=TRUE, all=TRUE, cex=.8)
# make predictions
p <- predict(first_model, test)
# Root Mean Square Error
sqrt(mean((test$Deaths - p)^2))
## [1] 662.6958
# R-square
(cor(test$Deaths, p))^2
## [1] 0.2599384
We can observe above that the RMSE and R-square results are not great when making our predictions on the first model.
We run the second decision tree model to predict number of “Deaths” with the variables “Race Ethnicity” and “Sex” below:
# grow second tree
second_model <- rpart(Deaths ~ `Race Ethnicity` + Sex,
method="anova", train)
summary(second_model) # detailed summary of splits
## Call:
## rpart(formula = Deaths ~ `Race Ethnicity` + Sex, data = train,
## method = "anova")
## n= 1013
##
## CP nsplit rel error xerror xstd
## 1 0.17590636 0 1.0000000 1.0022630 0.1479408
## 2 0.02837262 1 0.8240936 0.8293548 0.1190863
## 3 0.02029702 2 0.7957210 0.8205950 0.1128873
## 4 0.01000000 3 0.7754240 0.7838013 0.1077029
##
## Variable importance
## Race Ethnicity
## 100
##
## Node number 1: 1013 observations, complexity param=0.1759064
## mean=394.5025, MSE=674619.5
## left son=2 (695 obs) right son=3 (318 obs)
## Primary splits:
## Race Ethnicity splits as LRLLLR, improve=1.759064e-01, (0 missing)
## Sex splits as RL, improve=2.280689e-06, (0 missing)
##
## Node number 2: 695 observations, complexity param=0.02029702
## mean=161.4835, MSE=79789.07
## left son=4 (538 obs) right son=5 (157 obs)
## Primary splits:
## Race Ethnicity splits as L-RLL-, improve=0.2501338000, (0 missing)
## Sex splits as LR, improve=0.0005989079, (0 missing)
##
## Node number 3: 318 observations, complexity param=0.02837262
## mean=903.7736, MSE=1596615
## left son=6 (160 obs) right son=7 (158 obs)
## Primary splits:
## Race Ethnicity splits as -L---R, improve=0.0381891900, (0 missing)
## Sex splits as RL, improve=0.0009853138, (0 missing)
## Surrogate splits:
## Sex splits as LR, agree=0.509, adj=0.013, (0 split)
##
## Node number 4: 538 observations
## mean=85.16729, MSE=14644.65
##
## Node number 5: 157 observations
## mean=423, MSE=214674
##
## Node number 6: 160 observations
## mean=658.3937, MSE=531117.1
##
## Node number 7: 158 observations
## mean=1152.259, MSE=2552882
plotcp(second_model) # visualize cross-validation results
As we can see in the graphs below, the R-square increases with the number of splits, and the XRelative Error decreases as the number of splits increase.
# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(second_model) # visualize cross-validation results
##
## Regression tree:
## rpart(formula = Deaths ~ `Race Ethnicity` + Sex, data = train,
## method = "anova")
##
## Variables actually used in tree construction:
## [1] Race Ethnicity
##
## Root node error: 683389531/1013 = 674619
##
## n= 1013
##
## CP nsplit rel error xerror xstd
## 1 0.175906 0 1.00000 1.00226 0.14794
## 2 0.028373 1 0.82409 0.82935 0.11909
## 3 0.020297 2 0.79572 0.82059 0.11289
## 4 0.010000 3 0.77542 0.78380 0.10770
# plot tree
plot(second_model, uniform=TRUE,
main="Regression Tree for Number of Deaths ")
text(second_model, use.n=TRUE, all=TRUE, cex=.8)
# make predictions
p2 <- predict(second_model, test)
# Root Mean Square Error
sqrt(mean((test$Deaths - p2)^2))
## [1] 654.8837
# R-square
(cor(test$Deaths, p2))^2
## [1] 0.2525162
We can observe above that the RMSE and R-square results did not improve after switching variables and making predictions on the second model.
We will now use a random forest model and add more predictor variables to our algorithm.
I first had to create another copy for our train and test data and change the column names to remove spaces as the randomForest() function was not working properly.
train2 <- train
colnames(train2)[2] <- "LeadingCause"
colnames(train2)[4] <- "RaceEthnicity"
test2 <- test
colnames(test2)[2] <- "LeadingCause"
colnames(test2)[4] <- "RaceEthnicity"
third_model <- randomForest(Deaths ~ LeadingCause + RaceEthnicity + Sex, importance = TRUE, na.action = na.omit, train2)
From the results below we can see that the percentage of variance explained with this model is about 76%, which suggests this is a very good model.
#Print regression model
print(third_model)
##
## Call:
## randomForest(formula = Deaths ~ LeadingCause + RaceEthnicity + Sex, data = train2, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 176892.3
## % Var explained: 73.78
Below, we can observe that the “Error” decreases drastically only after a few trees in the model.
# Plot the error vs the number of trees graph
plot(third_model)
# make predictions
p3 <- predict(third_model, test2)
# Root Mean Square Error
sqrt(mean((test2$Deaths - p3)^2))
## [1] 341.5055
# R-square
(cor(test2$Deaths, p3))^2
## [1] 0.9323252
As we can observe above, the RMSE and R-square improved drastically with the random forest model and by adding extra predictor variables to our model.
As a curiosity, I decided to run an extra decision tree model with the same predictors as the random forest previously.
# grow third tree
fourth_model <- rpart(Deaths ~ `Leading Cause` + `Race Ethnicity` + Sex,
method="anova", train)
We would actually be able to get a better RMSE and R-square if we were to use the decision tree model with the same predictors we have used for our random forest model above. However, such a high performing R-square may suggest that we are over-fitting our data with this and the previous model.
# make predictions
p4 <- predict(fourth_model, test)
# Root Mean Square Error
sqrt(mean((test$Deaths - p4)^2))
## [1] 194.1992
# R-square
(cor(test$Deaths, p4))^2
## [1] 0.9447492
After executing different decision tree models with different predictor variables, I have not been able to determine which would be the most appropriate model to use. The models with fewers predictor variables do not perform well, and the models that include more predictor variables seem to work extraordinarily well and may suggest over-fit. In any case, if I have a decision tree that performs just as well as a random forest, I would most likely choose the decision tree as it is the simpler model.
Additionally, based on real cases where decision trees have gone “wrong”, I would argue that the machine learning version of a decision tree is an algorithm that can simplify an outcome for the user with its ability to compute different probabilities and arrive at an optimum result. Although, decision trees that are represented graphically can become very complicated, the algorithm behind a decision tree in machine learning can actually do a lot of the visual work for you but in a rather mathematical manner.