Introduction

This report uses a COVID-19 dataset with 12 variables and 51 observations. We use regression tree methods to predict COVID-19 deaths and to identify their main determinants.

Covid Data

library(tidyverse)
library(rpart)
library(rpart.plot)
covid_df_train <- read_csv("./data/covidForRegression.csv")
covid_df_train

Decision Trees

A decision tree is a popular non-parametric analytic method. Because it is non-parametric, it does not require assumptions such as linearity or normality of the data, so it can be used to build non-linear models.

There are two types of decision trees. If the response variable is categorical and we want to assign observations to classes, we use a classification tree, for example to predict yes or no, dead or alive, male or female. If the response variable is continuous, we use a regression tree, which, like regression analysis, returns a continuous prediction, for example a person's height or weight.

Here we want to predict the number of COVID deaths, which we treat as a continuous response variable. In the following, we build a regression tree to predict COVID deaths.
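
To make the idea of a regression tree split concrete, the sketch below searches for the threshold on a single predictor that minimizes the total squared error of the two resulting groups. The data (x, y) and the helper sse are made up for illustration; this is a simplified picture of what a regression tree does at each node, not the rpart implementation.

```r
# Toy data (hypothetical): one predictor x and a continuous response y
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 1, 1, 5, 5, 5)

# Sum of squared errors of a vector around its own mean
sse <- function(v) sum((v - mean(v))^2)

# Candidate thresholds: midpoints between consecutive sorted unique x values
xs <- sort(unique(x))
cuts <- (head(xs, -1) + tail(xs, -1)) / 2

# Total SSE of the left and right groups for each candidate threshold
split_sse <- sapply(cuts, function(cut) sse(y[x < cut]) + sse(y[x >= cut]))

# The best split is the threshold with the smallest total SSE
best_cut <- cuts[which.min(split_sse)]
best_cut

# Each group is then predicted by the mean of its responses
left_mean  <- mean(y[x < best_cut])
right_mean <- mean(y[x >= best_cut])
```

A tree repeats this search over every predictor and then again within each resulting group, which is why its predictions are piecewise constant.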

Regression Trees | Analysis

Predicting COVID Deaths Using Regression Trees

reg.tree <- rpart(deaths ~ ., data = covid_df_train)
rpart.plot(reg.tree)

Build a training set

Here we use the COVID data to build a regression tree that predicts the number of deaths.

We randomly split the data, using 70% of the observations to train the model and leaving the remaining 30% for validation.

# build a training set 
# ensure reproducibility
set.seed(100)
# sample 70% of the row indices for training the models
# (51 * 0.7 = 35.7, so floor() makes the training size of 35 explicit)
train <- sample(1:nrow(covid_df_train), floor(nrow(covid_df_train) * 0.7))
tree.covid <- rpart(deaths ~ ., subset = train, data = covid_df_train, method = "anova")
rpart.plot(tree.covid)

summary(tree.covid)
## Call:
## rpart(formula = deaths ~ ., data = covid_df_train, subset = train, 
##     method = "anova")
##   n= 35 
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.67253561      0 1.0000000 1.0401256 0.3566814
## 2 0.07783447      1 0.3274644 0.5191207 0.1672439
## 3 0.01000000      2 0.2496299 0.4687290 0.1677479
## 
## Variable importance
##                     White_alone                       confirmed 
##                              23                              20 
## Black_or_African_American_alone           Some_other_race_alone 
##                              16                              14 
##               Two_or_more_races                     Asian_alone 
##                              14                              11 
##                       Longitude                        Latitude 
##                               1                               1 
## 
## Node number 1: 35 observations,    complexity param=0.6725356
##   mean=7858.029, MSE=6.170986e+07 
##   left son=2 (27 obs) right son=3 (8 obs)
##   Primary splits:
##       White_alone                     < 5807693   to the left,  improve=0.6725356, (0 missing)
##       confirmed                       < 586455.5  to the left,  improve=0.6329116, (0 missing)
##       Some_other_race_alone           < 330157    to the left,  improve=0.5113241, (0 missing)
##       Black_or_African_American_alone < 1172574   to the left,  improve=0.4906350, (0 missing)
##       Asian_alone                     < 318567.5  to the left,  improve=0.4739144, (0 missing)
##   Surrogate splits:
##       confirmed                       < 586455.5  to the left,  agree=0.971, adj=0.875, (0 split)
##       Black_or_African_American_alone < 1366993   to the left,  agree=0.914, adj=0.625, (0 split)
##       Some_other_race_alone           < 313499.5  to the left,  agree=0.914, adj=0.625, (0 split)
##       Two_or_more_races               < 257444    to the left,  agree=0.914, adj=0.625, (0 split)
##       Asian_alone                     < 285532    to the left,  agree=0.886, adj=0.500, (0 split)
## 
## Node number 2: 27 observations,    complexity param=0.07783447
##   mean=4351.333, MSE=9934962 
##   left son=4 (16 obs) right son=5 (11 obs)
##   Primary splits:
##       Black_or_African_American_alone < 368957    to the left,  improve=0.6267070, (0 missing)
##       confirmed                       < 198306.5  to the left,  improve=0.5886668, (0 missing)
##       White_alone                     < 1697487   to the left,  improve=0.5241572, (0 missing)
##       Some_other_race_alone           < 26820.5   to the left,  improve=0.4601538, (0 missing)
##       Asian_alone                     < 44878     to the left,  improve=0.3479326, (0 missing)
##   Surrogate splits:
##       confirmed             < 373415.5  to the left,  agree=0.815, adj=0.545, (0 split)
##       Longitude             < -92.91345 to the left,  agree=0.815, adj=0.545, (0 split)
##       Some_other_race_alone < 53460     to the left,  agree=0.741, adj=0.364, (0 split)
##       White_alone           < 1697487   to the left,  agree=0.741, adj=0.364, (0 split)
##       Latitude              < 35.03085  to the right, agree=0.741, adj=0.364, (0 split)
## 
## Node number 3: 8 observations
##   mean=19693.12, MSE=5.487854e+07 
## 
## Node number 4: 16 observations
##   mean=2282.375, MSE=2390688 
## 
## Node number 5: 11 observations
##   mean=7360.727, MSE=5625689

The values shown in each node of the tree plot are:

  • the predicted number of deaths (the mean of the observations in the node)
  • the percentage of training observations that fall in the node
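
The rel error of the final tree in the summary output can be checked by hand from the node statistics. The sketch below copies the n and MSE values printed for the root and the three leaves (nodes 3, 4 and 5), and assumes that a node's sum of squared errors is its number of observations times its MSE. The variable names are introduced here for illustration only.

```r
# Values copied from the summary(tree.covid) output
root_n   <- 35
root_mse <- 6.170986e+07

leaf_n   <- c(8, 16, 11)                    # nodes 3, 4, 5
leaf_mse <- c(5.487854e+07, 2390688, 5625689)

# Sum of squared errors = n * MSE for each node
root_sse <- root_n * root_mse
leaf_sse <- sum(leaf_n * leaf_mse)

# Relative error of the two-split tree: leaf SSE as a fraction of root SSE
rel_error <- leaf_sse / root_sse
rel_error
```

The result agrees, up to rounding of the printed values, with the rel error of 0.2496299 reported for nsplit = 2.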

Check importance

tree.covid$variable.importance
##                     White_alone                       confirmed 
##                      1513703720                      1362697662 
## Black_or_African_American_alone           Some_other_race_alone 
##                      1075968307                       968988969 
##               Two_or_more_races                     Asian_alone 
##                       907857919                       726286335 
##                       Longitude                        Latitude 
##                        91696575                        61131050

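The variable.importance component of the fitted tree gives an importance score for each predictor. The scores also count a variable's contribution as a surrogate split, which is why predictors that are not used in the final tree, such as confirmed, still appear. Here the most important variable is White_alone and the least important is Latitude.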
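
The raw scores are easier to compare as percentages. The sketch below copies the values printed above into a named vector (vi is a name introduced here for illustration) and rescales them to sum to 100, which appears to be how the "Variable importance" block in the summary output is derived.

```r
# Importance scores copied from tree.covid$variable.importance
vi <- c(
  White_alone                     = 1513703720,
  confirmed                       = 1362697662,
  Black_or_African_American_alone = 1075968307,
  Some_other_race_alone           = 968988969,
  Two_or_more_races               = 907857919,
  Asian_alone                     = 726286335,
  Longitude                       = 91696575,
  Latitude                        = 61131050
)

# Rescale to percentages of the total importance
vi_pct <- round(100 * vi / sum(vi))
vi_pct
```

With the fitted object available, the same calculation can be applied directly to tree.covid$variable.importance instead of typed values.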

Cross-validation

printcp(tree.covid)
## 
## Regression tree:
## rpart(formula = deaths ~ ., data = covid_df_train, subset = train, 
##     method = "anova")
## 
## Variables actually used in tree construction:
## [1] Black_or_African_American_alone White_alone                    
## 
## Root node error: 2159844989/35 = 61709857
## 
## n= 35 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.672536      0   1.00000 1.04013 0.35668
## 2 0.077834      1   0.32746 0.51912 0.16724
## 3 0.010000      2   0.24963 0.46873 0.16775
plotcp(x = tree.covid)

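By default, rpart performs 10-fold cross-validation while fitting and reports the cross-validated error (xerror) for each tree size in the table above. It does not prune the tree automatically based on that error: growth stops according to the control settings (cp = 0.01 by default), and pruning is a separate step done with prune().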

The complexity parameter (cp) is used to control the size of the tree and to select the optimal tree size.

In the plot produced by plotcp, the y-axis shows the cross-validated error relative to the root node for each cp value on the lower x-axis, and the upper x-axis shows the corresponding tree size. Smaller cp values lead to larger trees.

A split is kept only if it reduces the overall relative error by at least cp; otherwise, tree growth stops at that node.
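
The cp table can also be used to choose a tree size explicitly. The sketch below re-enters the table from the output above as a data frame (cp_table is a name introduced here) and compares two common selection rules: the cp with the smallest cross-validated error, and the one-standard-error rule, which picks the smallest tree whose xerror is within one xstd of the minimum. The one-standard-error rule is not used in the original analysis and is shown only as an option.

```r
# cp table copied from the summary(tree.covid) output
cp_table <- data.frame(
  CP     = c(0.67253561, 0.07783447, 0.01000000),
  nsplit = c(0, 1, 2),
  xerror = c(1.0401256, 0.5191207, 0.4687290),
  xstd   = c(0.3566814, 0.1672439, 0.1677479)
)

# Rule 1: cp of the tree with the smallest cross-validated error
best   <- which.min(cp_table$xerror)
cp_min <- cp_table$CP[best]

# Rule 2 (one-standard-error rule): the smallest tree whose xerror
# is no more than one xstd above the minimum xerror
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se    <- which(cp_table$xerror <= threshold)[1]
cp_1se    <- cp_table$CP[one_se]

c(cp_min = cp_min, cp_1se = cp_1se)

# With the fitted tree, pruning would be a separate call such as
# prune(tree.covid, cp = ...), reading cp from tree.covid$cptable
# rather than from rounded printed values.
```

On this table the minimum-error rule keeps the full two-split tree, while the one-standard-error rule would favor the one-split tree. With only 35 training observations the cross-validated errors are noisy, so either choice should be read with caution.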

Finally, we use the predict function to make predictions for the 30% of observations held out for validation.

Prediction function

# predict deaths for the held-out (validation) rows
pred <- predict(tree.covid, covid_df_train[-train, ])
pred
##         1         2         3         4         5         6         7         8 
##  2282.375 19693.125  2282.375  7360.727  2282.375  7360.727  2282.375  2282.375 
##         9        10        11        12        13        14        15        16 
##  2282.375 19693.125 19693.125  2282.375  2282.375  2282.375  7360.727  2282.375
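
To judge how closely these predictions match the observed deaths in the validation rows, one common summary is the root mean squared error (RMSE). The sketch below is not part of the original analysis: rmse is a hypothetical helper, and the toy vectors are made up so the example runs on its own.

```r
# Hypothetical helper: root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy vectors (made up for illustration only)
toy_actual    <- c(2, 4, 6)
toy_predicted <- c(2, 4, 9)

toy_rmse <- rmse(toy_actual, toy_predicted)
toy_rmse
```

With the objects defined earlier, the corresponding call would be rmse(covid_df_train$deaths[-train], pred), which reports the typical prediction error in units of deaths.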