Preliminary

We will use the DescTools, caret, rpart and rpart.plot packages. If you do not already have the rpart and rpart.plot packages installed, you will first install it using the install.packages() function.

install.packages(c("rpart", "rpart.plot"))

Next, we load all of the necessary libraries for use in the session.

library(caret)
library(DescTools)
library(rpart)
library(rpart.plot)

In this lesson, we use the DirectMarketing.csv file, which contains information for 1000 customers of a business that sells products by mail-order catalog. The marketing manager of the company wants to use historical data about his customers (demographic and past marketing information) to predict if the customer will spend (Amount) at or below average (Below_Avg) or above average (Above_Avg). The company prioritizes correctly predicting if a customer will purchase more than the average customer, because they have limited advertising budget and need to make strategic marketing decisions.

Variables include:

  • Age: customer age (Young, Middle, Old)
  • Gender: customer gender (Male, Female)
  • OwnHome: if the customer owns a home (Own) or rents (Rent)
  • Married: if the customer is married (Married) or single (Single)
  • Location: if the customer lives close to the company (Close) or not (Far)
  • Salary: the customers yearly salary
  • Children: the number of children that the customer have (0, 1, 2, 3)
  • Catalogs: the total number of catalogs that the customer has received (6, 12, 18, 24)
  • Amount: the level of money that the customer has spent, either at or below average (Below_Avg) or above (Above_Avg)

We use the read.csv() function to import the CSV file into R as a dataframe named DM. We set stringsAsFactors = FALSE to keep any character columns as-is.

DM <- read.csv(file = "DirectMarketing.csv",
                   stringsAsFactors = FALSE)

Data Exploration & Variable Preparation

First, we can obtain high-level information about the DM dataframe to look at the variable types and to check for missing (NA) values.

Abstract(DM)
## ------------------------------------------------------------------------------ 
## DM
## 
## data frame:  1000 obs. of  9 variables
##      1000 complete cases (100.0%)
## 
##   Nr  ColName   Class      NAs  Levels
##   1   Age       character  .          
##   2   Gender    character  .          
##   3   OwnHome   character  .          
##   4   Married   character  .          
##   5   Location  character  .          
##   6   Salary    integer    .          
##   7   Children  integer    .          
##   8   Catalogs  integer    .          
##   9   Amount    character  .

Preparing Target (Y) Variable

Next, we can convert our target class variable that we want to predict, Amount to a nominal factor variable. Note: although it is ordinal, we will use it as a nominal variable for compatibility. We will identify level 1 as Below_Avg and level 2 as Above_Avg to reflect the known ordering.

DM$Amount <- factor(x = DM$Amount,
                    levels = c("Below_Avg", "Above_Avg"))

We can plot the distribution using a bar plot, which is the default plot for the plot() function when applied to factor variables.

plot(x = DM$Amount,
     main = "Amount Spent")

Preparing Predictor (X) Variables

Nominal Variables

First, we can convert our nominal predictor variables to factor variables. We will set up a convenience vector of the names of the nominal variables (noms) so that we can more easily refer to them.

noms <- c("Gender", "OwnHome", "Married", "Location")
DM[ ,noms] <- lapply(X = DM[ ,noms],
                     FUN = factor)

Ordinal Variables

Next, we can convert our ordinal variables. The Age variable takes on three levels, which we will need to specify with the correct ordering. Since the Children and Catalogs variables take on numbers, we can use the default ordering and convert them at the same time.We will set up a convenience vector of the names of the ordinal variables (ords) so that we can more easily refer to them.

ords <- c("Age", "Children", "Catalogs")

DM$Age <- factor(x = DM$Age,
                 levels = c("Young", "Middle", "Old"),
                 ordered = TRUE)

DM[,c("Children", "Catalogs")] <- lapply(X = DM[,c("Children", "Catalogs")],
                                         FUN = factor,
                                         ordered = TRUE)

Numerical Variables

We only have one numerical variable, Salary. We can set up a named value, nums (although we do not need to).

nums <- c("Salary")

Finally, we can create our convenience vector of predictor variables (vars) containing all of our predictors. We can view the predictors by name by running a code line of the vars object.

vars <- c(noms, ords, nums)
vars
## [1] "Gender"   "OwnHome"  "Married"  "Location" "Age"      "Children" "Catalogs"
## [8] "Salary"

Data Preprocessing & Transformation

First, we can view summary() information for our prepared data.

summary(DM)
##      Age         Gender    OwnHome       Married     Location  
##  Young :287   Female:506   Own :516   Married:502   Close:710  
##  Middle:508   Male  :494   Rent:484   Single :498   Far  :290  
##  Old   :205                                                    
##                                                                
##                                                                
##                                                                
##      Salary       Children Catalogs       Amount   
##  Min.   : 10100   0:462    6 :252   Below_Avg:601  
##  1st Qu.: 29975   1:267    12:282   Above_Avg:399  
##  Median : 53700   2:146    18:233                  
##  Mean   : 56104   3:125    24:233                  
##  3rd Qu.: 77025                                    
##  Max.   :168800

For Decision Tree classification, We know that we can handle missing values (we can impute or eliminate up-front or tell the model how to handle them), irrelevant and redundant variables can be included and no rescaling/standardization is needed. For this reason, we can use the dataset as-is in our modeling, without any transformations.

  1. Missing Values First, we check for missing values. If missing values are present, we can either remove them row-wise (na.omit()) or perform imputation (or we can handle them during analysis).
PlotMiss(DM)

Training & Testing Sets

We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (Amount). Since the function takes a random sample, we initialize a random seed first for reproducibility.

set.seed(831) # initialize the random seed

# Generate the list of observations for the
# train dataframe
sub <- createDataPartition(y = DM$Amount, # target variable
                           p = 0.80, # proportion in training set
                           list = FALSE)

Subset the rows of the DM dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.

train <- DM[sub, ] 
test <- DM[-sub, ]

Analysis

Basic Model (using rpart() and default cp)

We use the rpart() function in the rpart package to perform basic Decision Tree classification on our training dataset (train). If we have NA values that we did not handle during preprocessing, we can use the na.action argument, which defaults to “na.rpart”, which removes observations with NA values for Y (the target variable) but keeps observations with NA values for predictor variables. The cp hyperparameter defaults to 0.01.

Behind the scenes, the rpart() function will perform cross validation, so we will initialize our random seed value before modeling.

set.seed(831)

DM.rpart <- rpart(formula = Amount ~ ., # Y ~ all other variables in dataframe
                  data = train, # include only relevant variables
                  method = "class") # classification

We can see the basic output of the model by running a code line with the name of the rpart object we created (DM.rpart).

DM.rpart
## n= 801 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 801 320 Below_Avg (0.60049938 0.39950062)  
##    2) Salary< 58850 439  63 Below_Avg (0.85649203 0.14350797)  
##      4) Salary< 31950 205   0 Below_Avg (1.00000000 0.00000000) *
##      5) Salary>=31950 234  63 Below_Avg (0.73076923 0.26923077)  
##       10) Catalogs=6,12 130  11 Below_Avg (0.91538462 0.08461538) *
##       11) Catalogs=18,24 104  52 Below_Avg (0.50000000 0.50000000)  
##         22) Location=Close 67  24 Below_Avg (0.64179104 0.35820896)  
##           44) Salary< 45400 37   8 Below_Avg (0.78378378 0.21621622) *
##           45) Salary>=45400 30  14 Above_Avg (0.46666667 0.53333333)  
##             90) Catalogs=6,12,18 16   5 Below_Avg (0.68750000 0.31250000) *
##             91) Catalogs=24 14   3 Above_Avg (0.21428571 0.78571429) *
##         23) Location=Far 37   9 Above_Avg (0.24324324 0.75675676) *
##    3) Salary>=58850 362 105 Above_Avg (0.29005525 0.70994475)  
##      6) Children=2,3 106  44 Below_Avg (0.58490566 0.41509434)  
##       12) Catalogs=6,12 57  12 Below_Avg (0.78947368 0.21052632) *
##       13) Catalogs=18,24 49  17 Above_Avg (0.34693878 0.65306122) *
##      7) Children=0,1 256  43 Above_Avg (0.16796875 0.83203125) *

We can also obtain information about variable importance from the variable.importance component of our list object created using the rpart() function. Higher values indicate that the variable is more important in the construction of the tree and therefore in the model. The variable importance values are provided in decreasing order of importance.

DM.rpart$variable.importance
##    Salary   Married       Age   OwnHome  Catalogs  Children    Gender  Location 
## 148.76980  78.70445  47.45996  46.27343  43.39174  28.07054  21.68372  12.07368

We can use either the prp() or rpart.plot() functions from the rpart.plot package to visualize the decision tree. We set extra = 2 so that the proportion of correct predictions are displayed on the terminal nodes. As indicated by the root node, ‘Yes’ is to the left and ‘No’ is to the right of the variable split.

prp(x = DM.rpart, # rpart object
    extra = 2) # include proportion of correct predictions

Model Performance & Fit

Training Performance

To assess the goodness of fit of the model, we compare the training and testing performance. For this reason, we need to assess how well the model does on the training sample.

First, we use the predict() function to obtain the class predictions (type = "class") for the training data (train) based on our DT model. This creates a vector of class predictions.

base.trpreds <- predict(object = DM.rpart, # DT model
                        newdata = train, # training data
                        type = "class") # class predictions

We can use the confusionMatrix() function from the caret package to obtain aconfusion matrix and obtain performance measures for our model applied to the training dataset (train). We can set mode = "everything" to obtain all available performance measures. We identify the Above_Avg class as positive, since that is the class we are more interested in being able to predict. We will save it so that we can make comparisons.

DT_train_conf <- confusionMatrix(data = base.trpreds, # predictions
                                 reference = train$Amount, # actual
                                 positive = "Above_Avg",
                                 mode = "everything")
DT_train_conf
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Below_Avg Above_Avg
##   Below_Avg       409        36
##   Above_Avg        72       284
##                                           
##                Accuracy : 0.8652          
##                  95% CI : (0.8395, 0.8881)
##     No Information Rate : 0.6005          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7242          
##                                           
##  Mcnemar's Test P-Value : 0.0007575       
##                                           
##             Sensitivity : 0.8875          
##             Specificity : 0.8503          
##          Pos Pred Value : 0.7978          
##          Neg Pred Value : 0.9191          
##               Precision : 0.7978          
##                  Recall : 0.8875          
##                      F1 : 0.8402          
##              Prevalence : 0.3995          
##          Detection Rate : 0.3546          
##    Detection Prevalence : 0.4444          
##       Balanced Accuracy : 0.8689          
##                                           
##        'Positive' Class : Above_Avg       
## 
Testing Performance

To assess model performance, we focus on the performance of the model applied to the testing set. Next, we use the predict() function to obtain the class predictions (type = "class") for the testing data based on our DT model.

base.tepreds <- predict(object = DM.rpart, # DT model
                        newdata = test, # testing data
                        type = "class")

Again, we use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model, this time applied to the testing dataset (test).

DT_test_conf <- confusionMatrix(data = base.tepreds, # predictions
                                reference = test$Amount, # actual
                                positive = "Above_Avg",
                                mode = "everything")
DT_test_conf
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Below_Avg Above_Avg
##   Below_Avg       101        10
##   Above_Avg        19        69
##                                           
##                Accuracy : 0.8543          
##                  95% CI : (0.7975, 0.9002)
##     No Information Rate : 0.603           
##     P-Value [Acc > NIR] : 8.632e-15       
##                                           
##                   Kappa : 0.7014          
##                                           
##  Mcnemar's Test P-Value : 0.1374          
##                                           
##             Sensitivity : 0.8734          
##             Specificity : 0.8417          
##          Pos Pred Value : 0.7841          
##          Neg Pred Value : 0.9099          
##               Precision : 0.7841          
##                  Recall : 0.8734          
##                      F1 : 0.8263          
##              Prevalence : 0.3970          
##          Detection Rate : 0.3467          
##    Detection Prevalence : 0.4422          
##       Balanced Accuracy : 0.8575          
##                                           
##        'Positive' Class : Above_Avg       
## 

As shown, the model has high Accuracy and good agreement between the actual and predicted Amount values based on Kappa. At the class-level, we also see good performance for predicting our class level of interest (Above_Avg) based on Sensitivity/Recall, Precision and F1. The performance for predicting Below_Avg is also strong based on the Specificity.

Model Goodness of Fit

To assess the model goodness of fit, we want to know if the model is balanced, underfitting or overfitting. We compare the performance on the training and testing sets. We can use the cbind() function to compare our confusionMatrix() output side-by-side.

First, we can compare the overall performance on the training and testing sets.

cbind(Training = DT_train_conf$overall,
      Testing = DT_test_conf$overall)
##                    Training      Testing
## Accuracy       8.651685e-01 8.542714e-01
## Kappa          7.241771e-01 7.014331e-01
## AccuracyLower  8.395454e-01 7.974599e-01
## AccuracyUpper  8.880667e-01 9.001806e-01
## AccuracyNull   6.004994e-01 6.030151e-01
## AccuracyPValue 5.962647e-61 8.632318e-15
## McnemarPValue  7.574950e-04 1.373948e-01

Based on the side-by side comparison of the overall measures, the model seems to be balanced, based on the consistency across the training and testing sets for the Accuracy and Kappa values.

Next, we can compare the class level performance on the training and testing sets.

cbind(Training = DT_train_conf$byClass,
      Testing = DT_test_conf$byClass)
##                       Training   Testing
## Sensitivity          0.8875000 0.8734177
## Specificity          0.8503119 0.8416667
## Pos Pred Value       0.7977528 0.7840909
## Neg Pred Value       0.9191011 0.9099099
## Precision            0.7977528 0.7840909
## Recall               0.8875000 0.8734177
## F1                   0.8402367 0.8263473
## Prevalence           0.3995006 0.3969849
## Detection Rate       0.3545568 0.3467337
## Detection Prevalence 0.4444444 0.4422111
## Balanced Accuracy    0.8689059 0.8575422

Based on the side-by side comparison of the class-level measures, the model seems to be balanced, based on the consistency across the training and testing sets for the class-level measures. The performance is good on the training data, so we do not have reason to believe that the model is underfitting and because we have consistency across the training and testing measures, we do not see evidence of overfitting. For this reason, we would say that the model is balanced.

Hyperparameter Tuning Model (Pruning the Tree)

Behind the scenes of the rpart() function, cross-validation is being used to find the optimal value of cp, the complexity parameter. This value determines the depth of the tree (tree size, or the number of terminal nodes).

The rpart() function uses a default value of 0.01, and then grows the tree based on this value. The cross validation being done considers the impact of using larger cp values than the default and the effect on the tree in terms of size and error. We look for a tradeoff between accuracy/error and model complexity (tree size). This also helps to reduce the chances of overfitting. Larger values for cp will impose a higher penalty and will result in a smaller tree. The cross-validated error can be obtained and used to find a cp value that is an optimal trade-off between minimizing the error (misclassification) and complexity (tree depth). We will choose the cp value that results in the smallest cross-validated error.

We can view the cross-validation results either visually or as a table. To view the table as a plot, we can use the plotcp() function. Using the plot, we can choose the size of the tree (nsplit + 1) that is the first to the left that is below the dashed line.

plotcp(x = DM.rpart)

The printcp() function can be used to print the CP values, (in the cptable list component of the rpart object), the number of variables (used to split), error (scaled error, when nsplit = 0, rel error = 1), cross-validated error (xerror) and the standard deviation of the cross-validated error (xstd).

printcp(x = DM.rpart)
## 
## Classification tree:
## rpart(formula = Amount ~ ., data = train, method = "class")
## 
## Variables actually used in tree construction:
## [1] Catalogs Children Location Salary  
## 
## Root node error: 320/801 = 0.3995
## 
## n= 801 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.475000      0   1.00000 1.00000 0.043319
## 2 0.056250      1   0.52500 0.52813 0.036086
## 3 0.046875      2   0.46875 0.50000 0.035361
## 4 0.019792      3   0.42188 0.45625 0.034145
## 5 0.012500      6   0.36250 0.42188 0.033108
## 6 0.010000      8   0.33750 0.42813 0.033303

We can identify the best cp, that has the minimum cross-validated error (xerror) in the cptable and save it for use in the prune() function, which will prune the tree based on that cp value.

min_cp <- DM.rpart$cptable[which.min(DM.rpart$cptable[,"xerror"]),"CP"]
min_cp
## [1] 0.0125

We can use this value as the cp, instead of the default (0.01), and use the prune() function to stop the tree from growing beyond the corresponding number of splits for this cp value in the cptable.

pruneTree <- prune(tree = DM.rpart, # original DT object
                   cp = min_cp) # optimal cp value

We can use the prp() function to view our pruned tree.

prp(x = pruneTree, # prune object
    extra = 2) # include correct predictions on terminal nodes

As shown, since the optimal cp value was larger than the default, we penalize the tree more for having a larger size, creating a smaller tree than our original model. Next, we can use this pruned tree model to obtain performance and goodness of fit information.

Model Performance & Fit

Training Performance

We use the predict() function to generate class predictions for our training data set using the pruned Decision Tree.

tune.trpreds <- predict(object = pruneTree,
                        newdata = train, 
                        type = "class")

Again, we can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model. We can set mode = "everything" to obtain all available performance measures.

DT_trtune_conf <- confusionMatrix(data = tune.trpreds, # predictions
                                  reference = train$Amount, # actual
                                  positive = "Above_Avg",
                                  mode = "everything")
DT_trtune_conf
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Below_Avg Above_Avg
##   Below_Avg       412        47
##   Above_Avg        69       273
##                                           
##                Accuracy : 0.8552          
##                  95% CI : (0.8289, 0.8788)
##     No Information Rate : 0.6005          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7016          
##                                           
##  Mcnemar's Test P-Value : 0.0512          
##                                           
##             Sensitivity : 0.8531          
##             Specificity : 0.8565          
##          Pos Pred Value : 0.7982          
##          Neg Pred Value : 0.8976          
##               Precision : 0.7982          
##                  Recall : 0.8531          
##                      F1 : 0.8248          
##              Prevalence : 0.3995          
##          Detection Rate : 0.3408          
##    Detection Prevalence : 0.4270          
##       Balanced Accuracy : 0.8548          
##                                           
##        'Positive' Class : Above_Avg       
## 
Testing Performance

To assess and interpret the model performance, we use the performance measures based on the testing data set.

We use the predict() function to generate class predictions for our testing data set using the pruned Decision Tree.

tune.tepreds <- predict(object = pruneTree,
                        newdata = test,
                        type = "class")

Again, we use the confusionMatrix() function to obtain the model performance.

DT_tetune_conf <- confusionMatrix(data = tune.tepreds, # predictions
                                  reference = test$Amount, # actual
                                  positive = "Above_Avg",
                                  mode = "everything")
DT_tetune_conf
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Below_Avg Above_Avg
##   Below_Avg       104        15
##   Above_Avg        16        64
##                                           
##                Accuracy : 0.8442          
##                  95% CI : (0.7862, 0.8916)
##     No Information Rate : 0.603           
##     P-Value [Acc > NIR] : 1.189e-13       
##                                           
##                   Kappa : 0.6753          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8101          
##             Specificity : 0.8667          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.8739          
##               Precision : 0.8000          
##                  Recall : 0.8101          
##                      F1 : 0.8050          
##              Prevalence : 0.3970          
##          Detection Rate : 0.3216          
##    Detection Prevalence : 0.4020          
##       Balanced Accuracy : 0.8384          
##                                           
##        'Positive' Class : Above_Avg       
## 

Based on the output, we see that the model has good performance based on Accuracy and good consistency between the predicted and actual values based on the Kappa value. At the class-level, we see that based on the Specificity and Sensitivity/Recall values, we are fitting the negative class (Below_Avg) better than we are fitting the positive class (Above_Avg). Based on this, it appears that a more complex tree may be preferred, because it may be better able to predict our class value of interest, Above_Avg for the Amount variable.

Goodness of Fit

To assess the model goodness of fit, we want to know if the model is balanced, underfitting or overfitting. We compare the performance on the training and testing sets. We can use the cbind() function to compare our confusionMatrix() output side-by-side.

First, we can compare the overall performance on the training and testing sets.

cbind(Training = DT_trtune_conf$overall,
      Testing = DT_tetune_conf$overall)
##                    Training      Testing
## Accuracy       8.551810e-01 8.442211e-01
## Kappa          7.016012e-01 6.753329e-01
## AccuracyLower  8.288808e-01 7.862307e-01
## AccuracyUpper  8.788333e-01 8.916355e-01
## AccuracyNull   6.004994e-01 6.030151e-01
## AccuracyPValue 4.685959e-56 1.188529e-13
## McnemarPValue  5.119984e-02 1.000000e+00

Next, we can compare the class level performance on the training and testing sets.

cbind(Training = DT_trtune_conf$byClass,
      Testing = DT_tetune_conf$byClass)
##                       Training   Testing
## Sensitivity          0.8531250 0.8101266
## Specificity          0.8565489 0.8666667
## Pos Pred Value       0.7982456 0.8000000
## Neg Pred Value       0.8976035 0.8739496
## Precision            0.7982456 0.8000000
## Recall               0.8531250 0.8101266
## F1                   0.8247734 0.8050314
## Prevalence           0.3995006 0.3969849
## Detection Rate       0.3408240 0.3216080
## Detection Prevalence 0.4269663 0.4020101
## Balanced Accuracy    0.8548369 0.8383966

As shown, we see consistent performance across our training and testing data and the model appears to be balanced.