We will use the DescTools, caret, rpart and rpart.plot packages. If you do not already have the rpart and rpart.plot packages installed, you will first install it using the install.packages()
function.
install.packages(c("rpart", "rpart.plot"))
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(rpart)
library(rpart.plot)
In this lesson, we use the DirectMarketing.csv file, which contains information for 1000 customers of a business that sells products by mail-order catalog. The marketing manager of the company wants to use historical data about his customers (demographic and past marketing information) to predict if the customer will spend (Amount
) at or below average (Below_Avg
) or above average (Above_Avg
). The company prioritizes correctly predicting if a customer will purchase more than the average customer, because they have limited advertising budget and need to make strategic marketing decisions.
Variables include:
Young
, Middle
, Old
)Male
, Female
)Married
) or single (Single
)Close
) or not (Far
)Below_Avg
) or above (Above_Avg
)We use the read.csv()
function to import the CSV file into R as a dataframe named DM
. We set stringsAsFactors = FALSE
to keep any character columns as-is.
DM <- read.csv(file = "DirectMarketing.csv",
stringsAsFactors = FALSE)
First, we can obtain high-level information about the DM
dataframe to look at the variable types and to check for missing (NA
) values.
Abstract(DM)
## ------------------------------------------------------------------------------
## DM
##
## data frame: 1000 obs. of 9 variables
## 1000 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 Age character .
## 2 Gender character .
## 3 OwnHome character .
## 4 Married character .
## 5 Location character .
## 6 Salary integer .
## 7 Children integer .
## 8 Catalogs integer .
## 9 Amount character .
Next, we can convert our target class variable that we want to predict, Amount
to a nominal factor variable. Note: although it is ordinal, we will use it as a nominal variable for compatibility. We will identify level 1 as Below_Avg
and level 2 as Above_Avg
to reflect the known ordering.
DM$Amount <- factor(x = DM$Amount,
levels = c("Below_Avg", "Above_Avg"))
We can plot the distribution using a bar plot, which is the default plot for the plot()
function when applied to factor variables.
plot(x = DM$Amount,
main = "Amount Spent")
First, we can convert our nominal predictor variables to factor variables. We will set up a convenience vector of the names of the nominal variables (noms
) so that we can more easily refer to them.
noms <- c("Gender", "OwnHome", "Married", "Location")
DM[ ,noms] <- lapply(X = DM[ ,noms],
FUN = factor)
Next, we can convert our ordinal variables. The Age
variable takes on three levels, which we will need to specify with the correct ordering. Since the Children
and Catalogs
variables take on numbers, we can use the default ordering and convert them at the same time.We will set up a convenience vector of the names of the ordinal variables (ords
) so that we can more easily refer to them.
ords <- c("Age", "Children", "Catalogs")
DM$Age <- factor(x = DM$Age,
levels = c("Young", "Middle", "Old"),
ordered = TRUE)
DM[,c("Children", "Catalogs")] <- lapply(X = DM[,c("Children", "Catalogs")],
FUN = factor,
ordered = TRUE)
We only have one numerical variable, Salary. We can set up a named value, nums
(although we do not need to).
nums <- c("Salary")
Finally, we can create our convenience vector of predictor variables (vars
) containing all of our predictors. We can view the predictors by name by running a code line of the vars
object.
vars <- c(noms, ords, nums)
vars
## [1] "Gender" "OwnHome" "Married" "Location" "Age" "Children" "Catalogs"
## [8] "Salary"
First, we can view summary()
information for our prepared data.
summary(DM)
## Age Gender OwnHome Married Location
## Young :287 Female:506 Own :516 Married:502 Close:710
## Middle:508 Male :494 Rent:484 Single :498 Far :290
## Old :205
##
##
##
## Salary Children Catalogs Amount
## Min. : 10100 0:462 6 :252 Below_Avg:601
## 1st Qu.: 29975 1:267 12:282 Above_Avg:399
## Median : 53700 2:146 18:233
## Mean : 56104 3:125 24:233
## 3rd Qu.: 77025
## Max. :168800
For Decision Tree classification, We know that we can handle missing values (we can impute or eliminate up-front or tell the model how to handle them), irrelevant and redundant variables can be included and no rescaling/standardization is needed. For this reason, we can use the dataset as-is in our modeling, without any transformations.
na.omit()
) or perform imputation (or we can handle them during analysis).PlotMiss(DM)
We use the createDataPartition()
function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition()
we preserve the distribution of our outcome (Y) variable (Amount
). Since the function takes a random sample, we initialize a random seed first for reproducibility.
set.seed(831) # initialize the random seed
# Generate the list of observations for the
# train dataframe
sub <- createDataPartition(y = DM$Amount, # target variable
p = 0.80, # proportion in training set
list = FALSE)
Subset the rows of the DM
dataframe to include the row numbers in the sub
object to create the train
dataframe. We use all observations not in the sub
object to create the test
dataframe.
train <- DM[sub, ]
test <- DM[-sub, ]
rpart()
and default cp)We use the rpart()
function in the rpart package to perform basic Decision Tree classification on our training dataset (train
). If we have NA
values that we did not handle during preprocessing, we can use the na.action
argument, which defaults to “na.rpart”, which removes observations with NA values
for Y (the target variable) but keeps observations with NA
values for predictor variables. The cp
hyperparameter defaults to 0.01.
Behind the scenes, the rpart()
function will perform cross validation, so we will initialize our random seed value before modeling.
set.seed(831)
DM.rpart <- rpart(formula = Amount ~ ., # Y ~ all other variables in dataframe
data = train, # include only relevant variables
method = "class") # classification
We can see the basic output of the model by running a code line with the name of the rpart object we created (DM.rpart
).
DM.rpart
## n= 801
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 801 320 Below_Avg (0.60049938 0.39950062)
## 2) Salary< 58850 439 63 Below_Avg (0.85649203 0.14350797)
## 4) Salary< 31950 205 0 Below_Avg (1.00000000 0.00000000) *
## 5) Salary>=31950 234 63 Below_Avg (0.73076923 0.26923077)
## 10) Catalogs=6,12 130 11 Below_Avg (0.91538462 0.08461538) *
## 11) Catalogs=18,24 104 52 Below_Avg (0.50000000 0.50000000)
## 22) Location=Close 67 24 Below_Avg (0.64179104 0.35820896)
## 44) Salary< 45400 37 8 Below_Avg (0.78378378 0.21621622) *
## 45) Salary>=45400 30 14 Above_Avg (0.46666667 0.53333333)
## 90) Catalogs=6,12,18 16 5 Below_Avg (0.68750000 0.31250000) *
## 91) Catalogs=24 14 3 Above_Avg (0.21428571 0.78571429) *
## 23) Location=Far 37 9 Above_Avg (0.24324324 0.75675676) *
## 3) Salary>=58850 362 105 Above_Avg (0.29005525 0.70994475)
## 6) Children=2,3 106 44 Below_Avg (0.58490566 0.41509434)
## 12) Catalogs=6,12 57 12 Below_Avg (0.78947368 0.21052632) *
## 13) Catalogs=18,24 49 17 Above_Avg (0.34693878 0.65306122) *
## 7) Children=0,1 256 43 Above_Avg (0.16796875 0.83203125) *
We can also obtain information about variable importance from the variable.importance
component of our list object created using the rpart()
function. Higher values indicate that the variable is more important in the construction of the tree and therefore in the model. The variable importance values are provided in decreasing order of importance.
DM.rpart$variable.importance
## Salary Married Age OwnHome Catalogs Children Gender Location
## 148.76980 78.70445 47.45996 46.27343 43.39174 28.07054 21.68372 12.07368
We can use either the prp()
or rpart.plot()
functions from the rpart.plot package to visualize the decision tree. We set extra = 2
so that the proportion of correct predictions are displayed on the terminal nodes. As indicated by the root node, ‘Yes’ is to the left and ‘No’ is to the right of the variable split.
prp(x = DM.rpart, # rpart object
extra = 2) # include proportion of correct predictions
To assess the goodness of fit of the model, we compare the training and testing performance. For this reason, we need to assess how well the model does on the training sample.
First, we use the predict()
function to obtain the class predictions (type = "class"
) for the training data (train
) based on our DT model. This creates a vector of class predictions.
base.trpreds <- predict(object = DM.rpart, # DT model
newdata = train, # training data
type = "class") # class predictions
We can use the confusionMatrix() function from the caret package to obtain aconfusion matrix and obtain performance measures for our model applied to the training dataset (train). We can set mode = "everything"
to obtain all available performance measures. We identify the Above_Avg
class as positive
, since that is the class we are more interested in being able to predict. We will save it so that we can make comparisons.
DT_train_conf <- confusionMatrix(data = base.trpreds, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
DT_train_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below_Avg Above_Avg
## Below_Avg 409 36
## Above_Avg 72 284
##
## Accuracy : 0.8652
## 95% CI : (0.8395, 0.8881)
## No Information Rate : 0.6005
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7242
##
## Mcnemar's Test P-Value : 0.0007575
##
## Sensitivity : 0.8875
## Specificity : 0.8503
## Pos Pred Value : 0.7978
## Neg Pred Value : 0.9191
## Precision : 0.7978
## Recall : 0.8875
## F1 : 0.8402
## Prevalence : 0.3995
## Detection Rate : 0.3546
## Detection Prevalence : 0.4444
## Balanced Accuracy : 0.8689
##
## 'Positive' Class : Above_Avg
##
To assess model performance, we focus on the performance of the model applied to the testing set. Next, we use the predict()
function to obtain the class predictions (type = "class"
) for the testing data based on our DT model.
base.tepreds <- predict(object = DM.rpart, # DT model
newdata = test, # testing data
type = "class")
Again, we use the confusionMatrix()
function from the caret package to obtain a confusion matrix and obtain performance measures for our model, this time applied to the testing dataset (test
).
DT_test_conf <- confusionMatrix(data = base.tepreds, # predictions
reference = test$Amount, # actual
positive = "Above_Avg",
mode = "everything")
DT_test_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below_Avg Above_Avg
## Below_Avg 101 10
## Above_Avg 19 69
##
## Accuracy : 0.8543
## 95% CI : (0.7975, 0.9002)
## No Information Rate : 0.603
## P-Value [Acc > NIR] : 8.632e-15
##
## Kappa : 0.7014
##
## Mcnemar's Test P-Value : 0.1374
##
## Sensitivity : 0.8734
## Specificity : 0.8417
## Pos Pred Value : 0.7841
## Neg Pred Value : 0.9099
## Precision : 0.7841
## Recall : 0.8734
## F1 : 0.8263
## Prevalence : 0.3970
## Detection Rate : 0.3467
## Detection Prevalence : 0.4422
## Balanced Accuracy : 0.8575
##
## 'Positive' Class : Above_Avg
##
As shown, the model has high Accuracy and good agreement between the actual and predicted Amount
values based on Kappa. At the class-level, we also see good performance for predicting our class level of interest (Above_Avg
) based on Sensitivity/Recall, Precision and F1. The performance for predicting Below_Avg
is also strong based on the Specificity.
To assess the model goodness of fit, we want to know if the model is balanced, underfitting or overfitting. We compare the performance on the training and testing sets. We can use the cbind()
function to compare our confusionMatrix()
output side-by-side.
First, we can compare the overall performance on the training and testing sets.
cbind(Training = DT_train_conf$overall,
Testing = DT_test_conf$overall)
## Training Testing
## Accuracy 8.651685e-01 8.542714e-01
## Kappa 7.241771e-01 7.014331e-01
## AccuracyLower 8.395454e-01 7.974599e-01
## AccuracyUpper 8.880667e-01 9.001806e-01
## AccuracyNull 6.004994e-01 6.030151e-01
## AccuracyPValue 5.962647e-61 8.632318e-15
## McnemarPValue 7.574950e-04 1.373948e-01
Based on the side-by side comparison of the overall measures, the model seems to be balanced, based on the consistency across the training and testing sets for the Accuracy and Kappa values.
Next, we can compare the class level performance on the training and testing sets.
cbind(Training = DT_train_conf$byClass,
Testing = DT_test_conf$byClass)
## Training Testing
## Sensitivity 0.8875000 0.8734177
## Specificity 0.8503119 0.8416667
## Pos Pred Value 0.7977528 0.7840909
## Neg Pred Value 0.9191011 0.9099099
## Precision 0.7977528 0.7840909
## Recall 0.8875000 0.8734177
## F1 0.8402367 0.8263473
## Prevalence 0.3995006 0.3969849
## Detection Rate 0.3545568 0.3467337
## Detection Prevalence 0.4444444 0.4422111
## Balanced Accuracy 0.8689059 0.8575422
Based on the side-by side comparison of the class-level measures, the model seems to be balanced, based on the consistency across the training and testing sets for the class-level measures. The performance is good on the training data, so we do not have reason to believe that the model is underfitting and because we have consistency across the training and testing measures, we do not see evidence of overfitting. For this reason, we would say that the model is balanced.
Behind the scenes of the rpart()
function, cross-validation is being used to find the optimal value of cp
, the complexity parameter. This value determines the depth of the tree (tree size, or the number of terminal nodes).
The rpart()
function uses a default value of 0.01
, and then grows the tree based on this value. The cross validation being done considers the impact of using larger cp
values than the default and the effect on the tree in terms of size and error. We look for a tradeoff between accuracy/error and model complexity (tree size). This also helps to reduce the chances of overfitting. Larger values for cp
will impose a higher penalty and will result in a smaller tree. The cross-validated error can be obtained and used to find a cp
value that is an optimal trade-off between minimizing the error (misclassification) and complexity (tree depth). We will choose the cp
value that results in the smallest cross-validated error.
We can view the cross-validation results either visually or as a table. To view the table as a plot, we can use the plotcp()
function. Using the plot, we can choose the size of the tree (nsplit
+ 1) that is the first to the left that is below the dashed line.
plotcp(x = DM.rpart)
The printcp()
function can be used to print the CP
values, (in the cptable
list component of the rpart object), the number of variables (used to split), error
(scaled error, when nsplit
= 0, rel error
= 1), cross-validated error (xerror
) and the standard deviation of the cross-validated error (xstd
).
printcp(x = DM.rpart)
##
## Classification tree:
## rpart(formula = Amount ~ ., data = train, method = "class")
##
## Variables actually used in tree construction:
## [1] Catalogs Children Location Salary
##
## Root node error: 320/801 = 0.3995
##
## n= 801
##
## CP nsplit rel error xerror xstd
## 1 0.475000 0 1.00000 1.00000 0.043319
## 2 0.056250 1 0.52500 0.52813 0.036086
## 3 0.046875 2 0.46875 0.50000 0.035361
## 4 0.019792 3 0.42188 0.45625 0.034145
## 5 0.012500 6 0.36250 0.42188 0.033108
## 6 0.010000 8 0.33750 0.42813 0.033303
We can identify the best cp, that has the minimum cross-validated error (xerror
) in the cptable
and save it for use in the prune()
function, which will prune the tree based on that cp
value.
min_cp <- DM.rpart$cptable[which.min(DM.rpart$cptable[,"xerror"]),"CP"]
min_cp
## [1] 0.0125
We can use this value as the cp, instead of the default (0.01), and use the prune()
function to stop the tree from growing beyond the corresponding number of splits for this cp value in the cptable
.
pruneTree <- prune(tree = DM.rpart, # original DT object
cp = min_cp) # optimal cp value
We can use the prp()
function to view our pruned tree.
prp(x = pruneTree, # prune object
extra = 2) # include correct predictions on terminal nodes
As shown, since the optimal cp
value was larger than the default, we penalize the tree more for having a larger size, creating a smaller tree than our original model. Next, we can use this pruned tree model to obtain performance and goodness of fit information.
We use the predict()
function to generate class predictions for our training data set using the pruned Decision Tree.
tune.trpreds <- predict(object = pruneTree,
newdata = train,
type = "class")
Again, we can use the confusionMatrix()
function from the caret package to obtain a confusion matrix and obtain performance measures for our model. We can set mode = "everything"
to obtain all available performance measures.
DT_trtune_conf <- confusionMatrix(data = tune.trpreds, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
DT_trtune_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below_Avg Above_Avg
## Below_Avg 412 47
## Above_Avg 69 273
##
## Accuracy : 0.8552
## 95% CI : (0.8289, 0.8788)
## No Information Rate : 0.6005
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7016
##
## Mcnemar's Test P-Value : 0.0512
##
## Sensitivity : 0.8531
## Specificity : 0.8565
## Pos Pred Value : 0.7982
## Neg Pred Value : 0.8976
## Precision : 0.7982
## Recall : 0.8531
## F1 : 0.8248
## Prevalence : 0.3995
## Detection Rate : 0.3408
## Detection Prevalence : 0.4270
## Balanced Accuracy : 0.8548
##
## 'Positive' Class : Above_Avg
##
To assess and interpret the model performance, we use the performance measures based on the testing data set.
We use the predict()
function to generate class predictions for our testing data set using the pruned Decision Tree.
tune.tepreds <- predict(object = pruneTree,
newdata = test,
type = "class")
Again, we use the confusionMatrix() function to obtain the model performance.
DT_tetune_conf <- confusionMatrix(data = tune.tepreds, # predictions
reference = test$Amount, # actual
positive = "Above_Avg",
mode = "everything")
DT_tetune_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below_Avg Above_Avg
## Below_Avg 104 15
## Above_Avg 16 64
##
## Accuracy : 0.8442
## 95% CI : (0.7862, 0.8916)
## No Information Rate : 0.603
## P-Value [Acc > NIR] : 1.189e-13
##
## Kappa : 0.6753
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8101
## Specificity : 0.8667
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.8739
## Precision : 0.8000
## Recall : 0.8101
## F1 : 0.8050
## Prevalence : 0.3970
## Detection Rate : 0.3216
## Detection Prevalence : 0.4020
## Balanced Accuracy : 0.8384
##
## 'Positive' Class : Above_Avg
##
Based on the output, we see that the model has good performance based on Accuracy and good consistency between the predicted and actual values based on the Kappa value. At the class-level, we see that based on the Specificity and Sensitivity/Recall values, we are fitting the negative class (Below_Avg
) better than we are fitting the positive class (Above_Avg
). Based on this, it appears that a more complex tree may be preferred, because it may be better able to predict our class value of interest, Above_Avg
for the Amount
variable.
To assess the model goodness of fit, we want to know if the model is balanced, underfitting or overfitting. We compare the performance on the training and testing sets. We can use the cbind()
function to compare our confusionMatrix()
output side-by-side.
First, we can compare the overall performance on the training and testing sets.
cbind(Training = DT_trtune_conf$overall,
Testing = DT_tetune_conf$overall)
## Training Testing
## Accuracy 8.551810e-01 8.442211e-01
## Kappa 7.016012e-01 6.753329e-01
## AccuracyLower 8.288808e-01 7.862307e-01
## AccuracyUpper 8.788333e-01 8.916355e-01
## AccuracyNull 6.004994e-01 6.030151e-01
## AccuracyPValue 4.685959e-56 1.188529e-13
## McnemarPValue 5.119984e-02 1.000000e+00
Next, we can compare the class level performance on the training and testing sets.
cbind(Training = DT_trtune_conf$byClass,
Testing = DT_tetune_conf$byClass)
## Training Testing
## Sensitivity 0.8531250 0.8101266
## Specificity 0.8565489 0.8666667
## Pos Pred Value 0.7982456 0.8000000
## Neg Pred Value 0.8976035 0.8739496
## Precision 0.7982456 0.8000000
## Recall 0.8531250 0.8101266
## F1 0.8247734 0.8050314
## Prevalence 0.3995006 0.3969849
## Detection Rate 0.3408240 0.3216080
## Detection Prevalence 0.4269663 0.4020101
## Balanced Accuracy 0.8548369 0.8383966
As shown, we see consistent performance across our training and testing data and the model appears to be balanced.