We will use the DescTools, caret, randomForest and ipred packages. If you do not already have the randomForest and ipred packages installed, you will first install them using the install.packages() function.
install.packages(c("randomForest", "ipred"))
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(randomForest)
library(ipred)
In this lesson, we use the DirectMarketing.csv file, which contains information for 1000 customers of a business that sells products by mail-order catalog. The marketing manager of the company wants to use historical data about the customers (demographic and past marketing information) to predict whether a customer will spend (Amount) at or below average (Below_Avg) or above average (Above_Avg). The company prioritizes correctly predicting whether a customer will purchase more than the average customer, because it has a limited advertising budget and needs to make strategic marketing decisions.
Variables include:
- Age: age group of the customer (Young, Middle, Old)
- Gender: Male or Female
- OwnHome: whether the customer owns (Own) or rents (Rent) their home
- Married: married (Married) or single (Single)
- Location: whether the customer is located close (Close) or not (Far)
- Salary: annual salary
- Children: number of children (0 to 3)
- Catalogs: number of catalogs sent (6, 12, 18 or 24)
- Amount: amount spent, at or below average (Below_Avg) or above average (Above_Avg)
We use the read.csv() function to import the CSV file into R as a dataframe named DM. We set stringsAsFactors = FALSE to keep any character columns as-is.
DM <- read.csv(file = "DirectMarketing.csv",
stringsAsFactors = FALSE)
First, we can obtain high-level information about the DM dataframe to look at the variable types and to check for missing (NA) values.
Abstract(DM)
## ------------------------------------------------------------------------------
## DM
##
## data frame: 1000 obs. of 9 variables
## 1000 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 Age character .
## 2 Gender character .
## 3 OwnHome character .
## 4 Married character .
## 5 Location character .
## 6 Salary integer .
## 7 Children integer .
## 8 Catalogs integer .
## 9 Amount character .
Next, we can convert the target class variable that we want to predict, Amount, to a nominal factor variable.
DM$Amount <- factor(x = DM$Amount)
We can plot the distribution using a bar plot, which is the default plot for the plot() function when applied to factor variables.
plot(x = DM$Amount,
main = "Amount Spent")
First, we can convert our nominal predictor variables to factor variables. We will set up a convenience vector of the names of the nominal variables (noms) so that we can more easily refer to them.
noms <- c("Gender", "OwnHome", "Married", "Location")
DM[ ,noms] <- lapply(X = DM[ ,noms],
FUN = factor)
Next, we can convert our ordinal variables. The Age variable takes on three levels, which we will need to specify in the correct order. Since the Children and Catalogs variables take on numbers, we can use the default ordering and convert them at the same time. We will set up a convenience vector of the names of the ordinal variables (ords) so that we can more easily refer to them.
ords <- c("Age", "Children", "Catalogs")
DM$Age <- factor(x = DM$Age,
levels = c("Young", "Middle", "Old"),
ordered = TRUE)
DM[,c("Children", "Catalogs")] <- lapply(X = DM[,c("Children", "Catalogs")],
FUN = factor,
ordered = TRUE)
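As an optional sanity check (not part of the original walkthrough), we can confirm that the factor levels were set in the intended order:
levels(DM$Age)   # should be "Young", "Middle", "Old" in that order
str(DM[ , ords]) # Age, Children and Catalogs should show as ordered factors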
We only have one numerical variable, Salary.
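If we want a quick look at its distribution before the summary below, one optional option is a base-R histogram:
hist(x = DM$Salary,
     main = "Distribution of Salary",
     xlab = "Salary")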
First, we can view summary() information for our prepared data.
summary(DM)
## Age Gender OwnHome Married Location
## Young :287 Female:506 Own :516 Married:502 Close:710
## Middle:508 Male :494 Rent:484 Single :498 Far :290
## Old :205
##
##
##
## Salary Children Catalogs Amount
## Min. : 10100 0:462 6 :252 Above_Avg:399
## 1st Qu.: 29975 1:267 12:282 Below_Avg:601
## Median : 53700 2:146 18:233
## Mean : 56104 3:125 24:233
## 3rd Qu.: 77025
## Max. :168800
Since we will be building ensemble classifiers made up of Decision Trees, we have the same data considerations as for Decision Trees. We know that Decision Trees can handle missing values (we can impute or eliminate them up-front, or tell the model how to handle them) and irrelevant and redundant variables, and that no rescaling or standardization is needed. For this reason, we can use the dataset as-is in our modeling, without any transformations.
If missing values were present, we could remove them up-front (na.omit()) or perform imputation, or we could handle them during the analysis. We can confirm that there are no missing values using the PlotMiss() function from the DescTools package.
PlotMiss(DM)
We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 85/15 split (85% in training and 15% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (Amount). Since the function takes a random sample, we initialize a random seed first for reproducibility.
set.seed(831) # initialize the random seed
# Generate the list of observations for the
# train dataframe
sub <- createDataPartition(y = DM$Amount, # target variable
p = 0.85, # proportion in training set
list = FALSE)
We subset the rows of the DM dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.
train <- DM[sub, ]
test <- DM[-sub, ]
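Because createDataPartition() samples within each class, the distribution of Amount should be approximately preserved in both sets. As a quick optional check, we could compare the class proportions:
prop.table(table(train$Amount)) # proportions in the training set
prop.table(table(test$Amount))  # proportions in the testing set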
We use the bagging() function in the ipred package to perform bagging. NA values are handled as in the rpart() function (NA predictor values are ignored and NA target values are removed row-wise).
set.seed(831) # initialize random seed
bag.mod <- bagging(formula = Amount ~ ., # use all other variables to predict Amount
nbagg = 500, # build 500 trees
data = train, # use train data
coob = TRUE) # compute out of bag (OOB) error
We can run a line of code with the name of the model to view basic information about it.
bag.mod
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Amount ~ ., data = train, nbagg = 500,
## coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.1422
We use the predict() function to generate class predictions and do not specify anything for the newdata argument so that the out-of-bag estimate is computed. This allows us to honestly assess our training performance.
bagtrpreds <- predict(object = bag.mod)
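As an optional sanity check, the misclassification rate implied by these out-of-bag predictions can be computed directly; it should approximately match the OOB error reported by bag.mod above:
# proportion of OOB predictions that disagree with the actual classes
mean(bagtrpreds != train$Amount)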
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the training dataset (train). Since we prioritize predicting the Above_Avg class, we set positive = "Above_Avg".
bag_train_conf <- confusionMatrix(data = bagtrpreds, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
We will view the output when assessing goodness of fit.
We use the predict() function to generate class predictions for our testing set (test).
bagtepreds <- predict(object = bag.mod, # Bagging model
newdata = test, # testing data
type = "class")
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the testing dataset (test).
bag_test_conf <- confusionMatrix(data = bagtepreds, # predictions
reference = test$Amount, # actual
positive = "Above_Avg",
mode = "everything")
bag_test_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Above_Avg Below_Avg
## Above_Avg 46 12
## Below_Avg 13 78
##
## Accuracy : 0.8322
## 95% CI : (0.7624, 0.8884)
## No Information Rate : 0.604
## P-Value [Acc > NIR] : 1.443e-09
##
## Kappa : 0.6482
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7797
## Specificity : 0.8667
## Pos Pred Value : 0.7931
## Neg Pred Value : 0.8571
## Precision : 0.7931
## Recall : 0.7797
## F1 : 0.7863
## Prevalence : 0.3960
## Detection Rate : 0.3087
## Detection Prevalence : 0.3893
## Balanced Accuracy : 0.8232
##
## 'Positive' Class : Above_Avg
##
As shown, based on sensitivity and specificity, the model does better at predicting our negative class, Below_Avg, although the number of misclassifications per class is approximately the same (there are more observations of Below_Avg than Above_Avg). Our Accuracy is good and our Kappa value shows good agreement between the actual and predicted values. Overall, the model performs well, although the performance for the negative class, Below_Avg, is better than for our class of interest.
To assess whether the model is balanced, underfitting or overfitting, we compare the performance on the training and testing sets. We can use the cbind() function to compare them side-by-side. We want our model to be balanced, with consistent performance across the training and testing sets.
# Overall
cbind(Training = bag_train_conf$overall,
Testing = bag_test_conf$overall)
## Training Testing
## Accuracy 8.578143e-01 8.322148e-01
## Kappa 7.017555e-01 6.482199e-01
## AccuracyLower 8.325244e-01 7.623590e-01
## AccuracyUpper 8.805949e-01 8.883770e-01
## AccuracyNull 6.004700e-01 6.040268e-01
## AccuracyPValue 8.507076e-61 1.442926e-09
## McnemarPValue 2.753129e-01 1.000000e+00
# Class-Level
cbind(Training = bag_train_conf$byClass,
Testing = bag_test_conf$byClass)
## Training Testing
## Sensitivity 0.8029412 0.7796610
## Specificity 0.8943249 0.8666667
## Pos Pred Value 0.8348624 0.7931034
## Neg Pred Value 0.8721374 0.8571429
## Precision 0.8348624 0.7931034
## Recall 0.8029412 0.7796610
## F1 0.8185907 0.7863248
## Prevalence 0.3995300 0.3959732
## Detection Rate 0.3207991 0.3087248
## Detection Prevalence 0.3842538 0.3892617
## Balanced Accuracy 0.8486330 0.8231638
Based on the output above, we see that the model is performing consistently across the training and testing sets and we would describe the fit as balanced.
We use the randomForest() function in the randomForest package to build the model. We specify that 500 trees will be used with the ntree argument.
If NA values are present, the na.action argument can be used to identify how they should be handled (such as na.omit). The default is na.fail, which will give an error if NAs are present.
By default m (mtry) is equal to the square root of our number of predictor (X) variables, rounded down to the nearest integer.
floor(sqrt(ncol(train)-1))
## [1] 2
After building a default model, we will tune the model by changing the m (mtry) hyperparameter value.
set.seed(831) # initialize random seed
rf_mod <- randomForest(formula = Amount ~. , # use all other variables to predict Amount
data = train, # training data
importance = TRUE, # obtain variable importance
ntree = 500) # number of trees in forest
We can view basic information about our model by running a code line with the name of the randomForest object, rf_mod.
rf_mod
##
## Call:
## randomForest(formula = Amount ~ ., data = train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.57%
## Confusion matrix:
## Above_Avg Below_Avg class.error
## Above_Avg 273 67 0.1970588
## Below_Avg 57 454 0.1115460
We can visualize the most important variables in the Random Forest model using the varImpPlot() function from the randomForest package. The plot displays the variables in decreasing order of importance, so our most important variables are shown at the top. The left plot displays the mean decrease in accuracy when the variable's out-of-bag values are permuted; the right plot displays the mean decrease in Gini Impurity when the variable is used to split.
varImpPlot(x = rf_mod, # randomForest object
main = "Variable Importance Plot") # title
As shown, our two most important variables based on both criteria are Salary and Catalogs.
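If we want the numeric values behind the plot, the importance() function from the randomForest package returns the importance matrix; because we fit the model with importance = TRUE, it should include the per-class and overall mean decrease in accuracy as well as the mean decrease in Gini Impurity.
importance(x = rf_mod) # variable importance matrix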
We use the predict() function to generate class predictions for our training set. Again, to obtain an honest estimate of our training performance, we do not specify anything for the newdata argument, so that the out-of-bag predictions are used.
base.RFpreds <- predict(object = rf_mod, # RF model
type = "class") # class predictions
Next, we can use the confusionMatrix() function from the caret package to obtain a confusion matrix and training performance measures.
RF_btrain_conf <- confusionMatrix(data = base.RFpreds, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
We will view the output when assessing goodness of fit.
To assess model performance, we look at how well the model performs on the testing data. First, we use the predict() function to generate class predictions for our testing set (test).
base.teRFpreds <- predict(object = rf_mod, # RF model
newdata = test, # testing data
type = "class")
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the testing dataset (test).
RF_btest_conf <- confusionMatrix(data = base.teRFpreds, # predictions
reference = test$Amount, # actual
positive = "Above_Avg",
mode = "everything")
RF_btest_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Above_Avg Below_Avg
## Above_Avg 47 14
## Below_Avg 12 76
##
## Accuracy : 0.8255
## 95% CI : (0.7549, 0.8827)
## No Information Rate : 0.604
## P-Value [Acc > NIR] : 4.605e-09
##
## Kappa : 0.6373
##
## Mcnemar's Test P-Value : 0.8445
##
## Sensitivity : 0.7966
## Specificity : 0.8444
## Pos Pred Value : 0.7705
## Neg Pred Value : 0.8636
## Precision : 0.7705
## Recall : 0.7966
## F1 : 0.7833
## Prevalence : 0.3960
## Detection Rate : 0.3154
## Detection Prevalence : 0.4094
## Balanced Accuracy : 0.8205
##
## 'Positive' Class : Above_Avg
##
As in the case of Bagging, we see around the same number of misclassifications for each class based on the confusion matrix. Again, we have higher Specificity than Sensitivity, although the values are pretty close. We have good overall performance based on our Accuracy and Kappa values.
To assess whether the model is balanced, underfitting or overfitting, we compare the performance on the training and testing sets. We can use the cbind() function to compare side-by-side.
# Overall
cbind(Training = RF_btrain_conf$overall,
Testing = RF_btest_conf$overall)
## Training Testing
## Accuracy 8.542891e-01 8.255034e-01
## Kappa 6.948145e-01 6.373338e-01
## AccuracyLower 8.287726e-01 7.548702e-01
## AccuracyUpper 8.773218e-01 8.827328e-01
## AccuracyNull 6.004700e-01 6.040268e-01
## AccuracyPValue 5.267649e-59 4.604956e-09
## McnemarPValue 4.189617e-01 8.445193e-01
# Class-Level
cbind(Training = RF_btrain_conf$byClass,
Testing = RF_btest_conf$byClass)
## Training Testing
## Sensitivity 0.8029412 0.7966102
## Specificity 0.8884540 0.8444444
## Pos Pred Value 0.8272727 0.7704918
## Neg Pred Value 0.8714012 0.8636364
## Precision 0.8272727 0.7704918
## Recall 0.8029412 0.7966102
## F1 0.8149254 0.7833333
## Prevalence 0.3995300 0.3959732
## Detection Rate 0.3207991 0.3154362
## Detection Prevalence 0.3877791 0.4093960
## Balanced Accuracy 0.8456976 0.8205273
Based on the output above for the overall and class-level performance, we see that the performance is consistent across our training and testing data sets and we would describe the model as balanced.
Next, we will tune the number of variables to randomly sample as potential variables to split on (m, the mtry argument). We use the tuneRF() function in the randomForest package. The output will typically be a plot, where we choose the mtry with the smallest OOB Error. By setting doBest = TRUE, the best mtry will be used to automatically create the best fitting, tuned RF model.
set.seed(831) # initialize random seed
tuneR <- tuneRF(x = train[, -9], # use all variables except the 9th (Amount) as predictors
y = train$Amount, # use Amount as the target variable
ntreeTry = 500, # 500 trees in the forest
doBest = TRUE) # use the best m value to create an RF model
## mtry = 2 OOB error = 14.57%
## Searching left ...
## mtry = 1 OOB error = 15.98%
## -0.09677419 0.05
## Searching right ...
## mtry = 4 OOB error = 13.4%
## 0.08064516 0.05
## mtry = 8 OOB error = 13.63%
## -0.01754386 0.05
Based on the plot, we see that the optimal mtry (m) value is 4, which is higher than our default value, 2.
We can run a code line with the name of the tuneRF object, tuneR, to obtain basic information about the model.
tuneR
##
## Call:
## randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 13.16%
## Confusion matrix:
## Above_Avg Below_Avg class.error
## Above_Avg 277 63 0.18529412
## Below_Avg 49 462 0.09589041
Again, we can use the varImpPlot() function to obtain variable importance information for our tuned model. Because the model returned by tuneRF() is not built with importance = TRUE, we only obtain one measure, the mean decrease in Gini Impurity. Based on the plot, we see that Salary and Catalogs are still our two most important variables.
varImpPlot(x = tuneR, main = "Variable Importance Plot")
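If we wanted both importance measures for the tuned model, one option (a sketch using the hypothetical object name rf_tuned_manual; results may differ slightly from tuneR because the random draws differ) is to refit randomForest() ourselves with the tuned mtry value:
set.seed(831) # initialize random seed
rf_tuned_manual <- randomForest(formula = Amount ~ ., # use all other variables to predict Amount
                                data = train,         # training data
                                mtry = 4,             # tuned value from tuneRF()
                                ntree = 500,          # number of trees in forest
                                importance = TRUE)    # obtain both importance measures
varImpPlot(x = rf_tuned_manual,
           main = "Variable Importance Plot (Manual Refit)")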
We obtain predictions based on our out-of-bag estimation using the predict() function to generate class predictions.
tune.trRFpreds <- predict(object = tuneR, # Tuned RF model
type = "class") # class predictions
We use the class predictions and actual class in the confusionMatrix() function to obtain training performance, which we will use to assess goodness of fit.
RF_ttrain_conf <- confusionMatrix(data = tune.trRFpreds, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
We will wait to view the performance until we assess goodness of fit.
To assess model performance, we look at how well the model performs on the testing data. First, we use the predict() function to generate class predictions for our testing set (test).
tune.teRFpreds <- predict(object = tuneR, # tuned RF model
newdata = test, # testing data
type = "class")
We use the class predictions and actual class in the confusionMatrix() function to obtain testing performance.
RF_ttest_conf <- confusionMatrix(data = tune.teRFpreds, # predictions
reference = test$Amount, # actual
positive = "Above_Avg",
mode = "everything")
RF_ttest_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Above_Avg Below_Avg
## Above_Avg 45 9
## Below_Avg 14 81
##
## Accuracy : 0.8456
## 95% CI : (0.7774, 0.8996)
## No Information Rate : 0.604
## P-Value [Acc > NIR] : 1.23e-10
##
## Kappa : 0.6725
##
## Mcnemar's Test P-Value : 0.4042
##
## Sensitivity : 0.7627
## Specificity : 0.9000
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.8526
## Precision : 0.8333
## Recall : 0.7627
## F1 : 0.7965
## Prevalence : 0.3960
## Detection Rate : 0.3020
## Detection Prevalence : 0.3624
## Balanced Accuracy : 0.8314
##
## 'Positive' Class : Above_Avg
##
Based on the output, the tuned model shows an improvement in performance over the default model. Although our sensitivity is slightly lower, our specificity has increased, suggesting that this model is better at predicting the Below_Avg class level. Based on our F1, this model is doing a good job of predicting our positive class, and based on our overall performance measures of Accuracy and Kappa, we have a good model.
To assess whether the model is balanced, underfitting or overfitting, we compare the performance on the training and testing sets. We can use the cbind() function to compare side-by-side.
# Overall
cbind(Training = RF_ttrain_conf$overall,
Testing = RF_ttest_conf$overall)
## Training Testing
## Accuracy 8.683901e-01 8.456376e-01
## Kappa 7.238022e-01 6.725275e-01
## AccuracyLower 8.438097e-01 7.774373e-01
## AccuracyUpper 8.903844e-01 8.995602e-01
## AccuracyNull 6.004700e-01 6.040268e-01
## AccuracyPValue 2.120031e-66 1.230087e-10
## McnemarPValue 2.193026e-01 4.042485e-01
# Class-Level
cbind(Training = RF_ttrain_conf$byClass,
Testing = RF_ttest_conf$byClass)
## Training Testing
## Sensitivity 0.8147059 0.7627119
## Specificity 0.9041096 0.9000000
## Pos Pred Value 0.8496933 0.8333333
## Neg Pred Value 0.8800000 0.8526316
## Precision 0.8496933 0.8333333
## Recall 0.8147059 0.7627119
## F1 0.8318318 0.7964602
## Prevalence 0.3995300 0.3959732
## Detection Rate 0.3254994 0.3020134
## Detection Prevalence 0.3830787 0.3624161
## Balanced Accuracy 0.8594077 0.8313559
Based on the output, we see that the model is performing very consistently across the training and testing sets for both our overall and class-level performance measures.
The goal of our classification analysis is to build the best possible model, which the company should use to strategically target its customers. We will consider our Bagging model and our tuned Random Forest model. The next step is to compare these models so that we can make a recommendation to the company.
When choosing the model we want to consider performance, goodness of fit and any other relevant factors that can impact the choice, such as scalability, computational complexity and interpretability.
First, we can use the cbind() function to compare the testing performance of the Bagging and Random Forest models for the overall and class-level measures.
# Overall
cbind(Bagging = bag_test_conf$overall,
RF = RF_ttest_conf$overall)
## Bagging RF
## Accuracy 8.322148e-01 8.456376e-01
## Kappa 6.482199e-01 6.725275e-01
## AccuracyLower 7.623590e-01 7.774373e-01
## AccuracyUpper 8.883770e-01 8.995602e-01
## AccuracyNull 6.040268e-01 6.040268e-01
## AccuracyPValue 1.442926e-09 1.230087e-10
## McnemarPValue 1.000000e+00 4.042485e-01
# Class-Level
cbind(Bagging = bag_test_conf$byClass,
RF = RF_ttest_conf$byClass)
## Bagging RF
## Sensitivity 0.7796610 0.7627119
## Specificity 0.8666667 0.9000000
## Pos Pred Value 0.7931034 0.8333333
## Neg Pred Value 0.8571429 0.8526316
## Precision 0.7931034 0.8333333
## Recall 0.7796610 0.7627119
## F1 0.7863248 0.7964602
## Prevalence 0.3959732 0.3959732
## Detection Rate 0.3087248 0.3020134
## Detection Prevalence 0.3892617 0.3624161
## Balanced Accuracy 0.8231638 0.8313559
As shown, the models have pretty similar performance, although the Random Forest model has slightly better performance based on Accuracy, Kappa, Precision, F1 and Specificity. The Bagging model, however, has slightly higher Sensitivity/Recall, suggesting that it is doing a better job of predicting our class-level of interest, Above_Avg.
Next, we can consider the goodness of fit for each model. When comparing across methods, if one model is overfitting and another is balanced, we would prefer the model that is balanced. In this case, both models are balanced.
Since the performance of the models is very similar and both are balanced, to make a determination of which model to recommend to the company, we may want to consider additional priorities, such as scalability, implementation and interpretation. Both models are efficient, and can be easily implemented, particularly considering that the amount of preprocessing and transformation is minimal. The Random Forest performed better across most of our performance measures and the implementation using the randomForest package also allows us to get information about variable importance. For this reason, we may suggest that the company implement the Random Forest model because of this added interpretability.