Table of Contents

1. Introduction

2. Initial Setup – load the package and dataset

3. Data Preparation and Preprocessing

3.1. How to split the dataset into training and validation?
3.2. Descriptive statistics
3.3. How to fill missing values using preProcess()?
3.4. How to create One-Hot Encoding (dummy variables)?
3.5. How to preprocess to transform the data?

4. How to visualize the importance of variables using featurePlot()

5. How to do feature selection using recursive feature elimination (rfe)?

6. Training and Tuning the model

6.1. How to train() the model and interpret the results?
6.2 How to compute variable importance?
6.3. Prepare the test dataset and predict
6.4. Predict on test data
6.5. Confusion Matrix

7. How to do hyperparameter tuning to optimize the model for better performance?

7.1. Setting up the trainControl()
7.2 Hyperparameter Tuning using tuneLength
7.3. Hyperparameter Tuning using tuneGrid

8. How to evaluate the performance of multiple machine learning algorithms?

8.1. Training Adaboost
8.2. Training Random Forest
8.3. Training SVM
8.4. Run resamples() to compare the models

9. Ensembling the predictions

9.1. How to ensemble predictions from multiple models using caretEnsemble?
9.2. How to combine the predictions of multiple models to form a final prediction

10. Conclusion

1. Introduction

Caret is short for Classification And Regression Training. It integrates all activities related to model development into a streamlined workflow, for nearly every major ML algorithm available in R.

With R having so many implementations of ML algorithms, it can be challenging to keep track of which algorithm resides in which package. Sometimes the syntax and the way to implement an algorithm differ across packages. This, combined with data preprocessing, consulting help pages, and hyperparameter tuning to find the best model, can make building predictive models an involved task.

Well, thanks to caret, no matter which package the algorithm resides in, caret will remember that for us. It may just prompt us to run install.packages() for that particular algorithm’s package.

Later in this report I will present how to see all the ML algorithms supported by caret (it’s a long list!) and what hyperparameters to tune. Also, we will not stop with the caret package but go a step further and see how to ensemble the predictions from several of the best models. We will use caretEnsemble for this and maybe produce an even better prediction.

To make it simpler, this report is structured to cover the following 5 topics:

Data Preparation and Preprocessing
Visualize the importance of variables
Feature Selection using RFE
Training and Tuning the model
Ensembling the predictions

Now that we have a fair idea of what caret is about, let’s get started with the basics.

2. Initial Setup – load the package and dataset

For this report, I am going to use a modified version of the Orange Juice Data, originally made available in the ISLR package.

The goal of this dataset is to predict which of the two brands of orange juices did the customers buy. The predictor variables are characteristics of the customer and the product itself.

It contains 1070 rows with 18 columns. The response variable is ‘Purchase’ which takes either the value ‘CH’(citrus hill) or ‘MM’(minute maid).

Let’s import the dataset and see it’s structure and starting few rows.

Install and Activate Caret Package:

# Install and Load the caret package
#install.packages('caret')
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
# Import dataset
orange <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv')

# Structure of the dataframe
str(orange)
## 'data.frame':    1070 obs. of  18 variables:
##  $ Purchase      : chr  "CH" "CH" "CH" "MM" ...
##  $ WeekofPurchase: int  237 239 245 227 228 230 232 234 235 238 ...
##  $ StoreID       : int  1 1 1 1 7 7 7 7 7 7 ...
##  $ PriceCH       : num  1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceMM       : num  1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
##  $ DiscCH        : num  0 0 0.17 0 0 0 0 0 0 0 ...
##  $ DiscMM        : num  0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
##  $ SpecialCH     : int  0 0 0 0 0 0 1 1 0 0 ...
##  $ SpecialMM     : int  0 1 0 0 0 1 1 0 0 0 ...
##  $ LoyalCH       : num  0.5 0.6 0.68 0.4 0.957 ...
##  $ SalePriceMM   : num  1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
##  $ SalePriceCH   : num  1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceDiff     : num  0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
##  $ Store7        : chr  "No" "No" "No" "No" ...
##  $ PctDiscMM     : num  0 0.151 0 0 0 ...
##  $ PctDiscCH     : num  0 0 0.0914 0 0 ...
##  $ ListPriceDiff : num  0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
##  $ STORE         : int  1 1 1 1 0 0 0 0 0 0 ...
# See top 6 rows and 10 columns
head(orange[, 1:10])
##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1       CH            237       1    1.75    1.99   0.00    0.0         0
## 2       CH            239       1    1.75    1.99   0.00    0.3         0
## 3       CH            245       1    1.86    2.09   0.17    0.0         0
## 4       MM            227       1    1.69    1.69   0.00    0.0         0
## 5       CH            228       7    1.69    1.69   0.00    0.0         0
## 6       CH            230       7    1.69    1.99   0.00    0.0         0
##   SpecialMM  LoyalCH
## 1         0 0.500000
## 2         1 0.600000
## 3         0 0.680000
## 4         0 0.400000
## 5         0 0.956535
## 6         1 0.965228

3. Data Preparation and Preprocessing

3.1. How to split the dataset into training and validation?

The dataset is ready. The first step is to split it into training (80%) and test (20%) datasets using caret’s createDataPartition function.

The advantage of using createDataPartition() over the traditional random sample() is, it preserves the proportion of the categories in Y variable, that can be disturbed if we sample randomly.

Let me quickly refresh why we are splitting the dataset into training and test data. The reason is, when building models, the algorithm should see only the training data to learn the relationship between X and Y. This learned information forms what is called a machine learning model. The model is then used to predict the Y in the test data by looking only at the X values of the test data.

Finally, the predicted values of Y are compared to the known Y from the test dataset to evaluate how good the model really is. Alright, let’s create the training and test datasets.

Code to split data in train and test data:

# Create the training and test datasets
set.seed(100)

# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(orange$Purchase, p=0.8, list=FALSE)

# Step 2: Create the training  dataset
trainData <- orange[trainRowNumbers,]

# Step 3: Create the test dataset
testData <- orange[-trainRowNumbers,]

# Store X and Y for later use.
x = trainData[, 2:18]
y = trainData$Purchase

createDataPartition() takes as input the Y variable in the source dataset and the percentage of data that should go into training as the p argument. It returns the row numbers that should form the training dataset. Plus, we need to set list=FALSE to prevent returning the result as a list.
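To see the stratification in action, here is a small optional check (not part of the original workflow) comparing the class proportions in the full data and in the training split; they should be nearly identical.

# Class proportions in the full dataset vs the training split
prop.table(table(orange$Purchase))
prop.table(table(trainData$Purchase))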

3.2. Descriptive statistics

Before moving to missing value imputation and feature preprocessing, let’s observe the descriptive statistics of each column in the training dataset.

The skimr package provides a nice solution to show key descriptive stats for each column.

The skimr::skim_to_wide() function produces a nice dataframe containing the descriptive stats of each of the columns. The output even includes a small inline histogram for the numeric columns, drawn without any plotting code.

Skimr package to present data:

library(skimr)
## Warning: package 'skimr' was built under R version 4.2.3
skimmed <- skim_to_wide(trainData)
## Warning: 'skim_to_wide' is deprecated.
## Use 'skim()' instead.
## See help("Deprecated")
skimmed[, c(1:5, 9:11, 13, 15:16)]
Data summary
Name Piped data
Number of rows 857
Number of columns 18
_______________________
Column type frequency:
character 2
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min whitespace
Purchase 0 1 2 0
Store7 0 1 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p25 p75 p100
WeekofPurchase 0 1.00 254.16 15.64 240.00 268.00 278.00
StoreID 1 1.00 4.01 2.33 2.00 7.00 7.00
PriceCH 0 1.00 1.87 0.10 1.79 1.99 2.09
PriceMM 2 1.00 2.08 0.14 1.99 2.18 2.29
DiscCH 1 1.00 0.05 0.12 0.00 0.00 0.50
DiscMM 4 1.00 0.13 0.22 0.00 0.24 0.80
SpecialCH 2 1.00 0.15 0.36 0.00 0.00 1.00
SpecialMM 5 0.99 0.17 0.37 0.00 0.00 1.00
LoyalCH 3 1.00 0.56 0.31 0.33 0.84 1.00
SalePriceMM 5 0.99 1.96 0.26 1.69 2.13 2.29
SalePriceCH 1 1.00 1.81 0.15 1.75 1.89 2.09
PriceDiff 0 1.00 0.14 0.27 0.00 0.32 0.64
PctDiscMM 4 1.00 0.06 0.10 0.00 0.12 0.40
PctDiscCH 2 1.00 0.03 0.06 0.00 0.00 0.25
ListPriceDiff 0 1.00 0.22 0.11 0.14 0.30 0.44
STORE 2 1.00 1.59 1.43 0.00 3.00 4.00

Notice the number of missing values for each feature, the mean, median, proportion split of categories in the factor variables, the percentiles, and the histogram in the last column. We will examine how important each of these features is in predicting the response (Purchase) in Section 4, once we are done with the data preprocessing.
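Since skim_to_wide() is deprecated (the warning above suggests skim() instead), the same summary can be produced with the newer function; a minimal equivalent sketch:

# skim() is the current replacement for skim_to_wide()
skimmed2 <- skim(trainData)
skimmed2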

3.3. How to fill missing values using preProcess()?

We’ve seen that the dataset has a few missing values across the columns, so we would do well to fill them up with some meaningful values. If the feature is a continuous variable, it is a common practice to replace the missing values with the mean of the column. And if it’s a categorical variable, replace the missings with the most frequently occurring value, aka the mode. But this is quite a basic and rather rudimentary approach.
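For contrast, here is a minimal sketch of that rudimentary mean/mode approach (the helper impute_basic() is just for illustration and is not used anywhere else in this report):

# Fill numeric NAs with the column mean and character NAs with the mode
impute_basic <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    } else {
      mode_val <- names(which.max(table(df[[col]])))
      df[[col]][is.na(df[[col]])] <- mode_val
    }
  }
  df
}
# trainData_basic <- impute_basic(trainData)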

Instead, we can actually predict the missing values by considering the rest of the available variables as predictors. A popular algorithm for this kind of imputation is k-Nearest Neighbors.

This can be done quickly and easily using caret, because caret offers a convenient preProcess() function that can predict missing values besides doing other preprocessing.

To predict the missing values with k-Nearest Neighbors using preProcess():

1. Set method='knnImpute' for k-Nearest Neighbors and apply it on the training data. This creates a preprocess model.
2. Then use predict() on the created preprocess model, passing the same training data via the newdata argument.

Caret also provides bagImpute as an alternative imputation algorithm.
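As an aside, a minimal sketch of that alternative (it needs the ipred package, is typically slower than kNN, and is not used in the rest of this report; the object names are illustrative only):

# Alternative: bagged-tree imputation instead of kNN (requires ipred)
preProcess_bag_model <- preProcess(trainData, method = 'bagImpute')
trainData_bag <- predict(preProcess_bag_model, newdata = trainData)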

Create the knn imputation model on the training data

#?preProcess
preProcess_missingdata_model <- preProcess(trainData, method='knnImpute')
preProcess_missingdata_model
## Created from 827 samples and 18 variables
## 
## Pre-processing:
##   - centered (16)
##   - ignored (2)
##   - 5 nearest neighbor imputation (16)
##   - scaled (16)

The above output shows the various preprocessing steps done in the process of knn imputation.

That is, it has centered (subtract by mean) 16 variables, ignored 2, used k=5 (considered 5 nearest neighbors) to predict the missing values and finally scaled (divide by standard deviation) 16 variables.

Let’s now use this model to predict the missing values in trainData.

# Use the imputation model to predict the values of missing data points
library(RANN)  # required for knnImpute
## Warning: package 'RANN' was built under R version 4.2.3
trainData <- predict(preProcess_missingdata_model, newdata = trainData)
anyNA(trainData)
## [1] FALSE

All the missing values are successfully filled.

3.4. How to create One-Hot Encoding (dummy variables)?

Let me first explain what is one-hot encoding and why it is required.

Suppose we have a categorical column as one of the features; it needs to be converted to numeric in order for it to be used by the machine learning algorithms.

Just replacing the categories with a number may not be meaningful, especially if there is no intrinsic ordering amongst the categories. So what we can do instead is convert the categorical variable into as many binary (1 or 0) variables as there are categories. An important aspect to be careful about here is that, in real-world environments, we might get new values of the categorical variables in the new scoring data. So, we should ensure the dummyVars model is built on the training data alone, and that model is in turn used to create the dummy vars on the test data.

In caret, one-hot-encodings can be created using dummyVars(). Just pass in all the features to dummyVars() as the training data and all the factor columns will automatically be converted to one-hot-encodings.

Code for one hot encoding with Caret

dummies_model <- dummyVars(Purchase ~ ., data=trainData)

# Create the dummy variables using predict. The Y variable (Purchase) will not be present in trainData_mat.
trainData_mat <- predict(dummies_model, newdata = trainData)

# Convert to dataframe
trainData <- data.frame(trainData_mat)

# See the structure of the new dataset
str(trainData)
## 'data.frame':    857 obs. of  18 variables:
##  $ WeekofPurchase: num  -1.097 -0.969 -0.586 -1.737 -1.673 ...
##  $ StoreID       : num  -1.29 -1.29 -1.29 -1.29 1.29 ...
##  $ PriceCH       : num  -1.1422 -1.1422 -0.0592 -1.7329 -1.7329 ...
##  $ PriceMM       : num  -0.6795 -0.6795 0.0498 -2.8676 -2.8676 ...
##  $ DiscCH        : num  -0.444 -0.444 0.981 -0.444 -0.444 ...
##  $ DiscMM        : num  -0.578 0.793 -0.578 -0.578 -0.578 ...
##  $ SpecialCH     : num  -0.425 -0.425 -0.425 -0.425 -0.425 ...
##  $ SpecialMM     : num  -0.447 2.235 -0.447 -0.447 -0.447 ...
##  $ LoyalCH       : num  -0.211 0.116 0.378 -0.539 1.284 ...
##  $ SalePriceMM   : num  0.13 -1.037 0.519 -1.037 -1.037 ...
##  $ SalePriceCH   : num  -0.432 -0.432 -0.843 -0.843 -0.843 ...
##  $ PriceDiff     : num  0.352 -0.744 0.936 -0.525 -0.525 ...
##  $ Store7No      : num  1 1 1 1 0 0 0 0 0 0 ...
##  $ Store7Yes     : num  0 0 0 0 1 1 1 1 1 1 ...
##  $ PctDiscMM     : num  -0.587 0.861 -0.587 -0.587 -0.587 ...
##  $ PctDiscCH     : num  -0.44 -0.44 1 -0.44 -0.44 ...
##  $ ListPriceDiff : num  0.21 0.21 0.118 -2.012 -2.012 ...
##  $ STORE         : num  -0.412 -0.412 -0.412 -0.412 -1.111 ...

In the above case, we had one categorical variable, Store7, with 2 categories. It was one-hot-encoded to produce two new columns – Store7No and Store7Yes.

3.5. How to preprocess to transform the data?

With the missing values handled and the factors one-hot-encoded, our training dataset is now ready to undergo variable transformations if required.

So what types of preprocessing are available in caret? The options are listed below (a short sketch of combining them follows the list).

range: Normalize values so it ranges between 0 and 1
center: Subtract Mean
scale: Divide by standard deviation
BoxCox: Remove skewness leading to normality. Values must be > 0
YeoJohnson: Like BoxCox, but works for negative values.
expoTrans: Exponential transformation, works for negative values.
pca: Replace with principal components
ica: Replace with independent components
spatialSign: Project the data to a unit circle
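Several of these methods can be combined in a single preProcess() call. Purely as an illustrative sketch (we use method='range' below, not this, and the object names are only for illustration), centering and scaling could be requested together:

# Combining methods: z-score standardisation (center then scale)
preProcess_cs_model <- preProcess(trainData, method = c('center', 'scale'))
trainData_scaled <- predict(preProcess_cs_model, newdata = trainData)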

For our problem, let’s convert all the numeric variables to range between 0 and 1, by setting method=range in preProcess().

preProcess_range_model <- preProcess(trainData, method='range')
trainData <- predict(preProcess_range_model, newdata = trainData)

# Append the Y variable
trainData$Purchase <- y

apply(trainData[, 1:10], 2, FUN=function(x){c('min'=min(x), 'max'=max(x))})
##     Purchase WeekofPurchase StoreID     PriceCH PriceMM     DiscCH DiscMM 
## min "CH"     "0.00000000"   "0.0000000" "0.000" "0.0000000" "0.00" "0.000"
## max "MM"     "1.00000000"   "1.0000000" "1.000" "1.0000000" "1.00" "1.000"
##     SpecialCH SpecialMM LoyalCH       
## min "0.0"     "0.0"     "0.000000e+00"
## max "1.0"     "1.0"     "9.999870e-01"

All the predictors now range between 0 and 1.

4. How to visualize the importance of variables using featurePlot()

Now that the preprocessing is complete, let’s visually examine how the predictors influence the Y (Purchase).

In this problem, the X variables are numeric whereas the Y is categorical.

So how to gauge if a given X is an important predictor of Y?

A simple common-sense approach is: if we group the X variable by the categories of Y, a significant mean shift amongst the X’s groups is a strong indicator (if not the only indicator) that X will have a significant role in helping predict Y.

It is possible to see this shift visually using box plots and density plots. In fact, caret’s featurePlot() function makes it very convenient.

Simply set the X and Y parameters and choose the plot type with the plot argument (‘density’ here; ‘box’ gives box plots). We can additionally adjust the label font size (using strip) and set the scales to be free, as I have done in the plot below.

featurePlot(x = trainData[, 1:18], 
            y = as.factor(trainData$Purchase), 
            plot = "density",
            strip=strip.custom(par.strip.text=list(cex=.7)),
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")))
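The same comparison can be drawn as box plots by switching the plot argument; a minimal variant of the call above:

# Box plots: a clear shift between the CH and MM boxes for a given
# predictor suggests that predictor is informative.
featurePlot(x = trainData[, 1:18], 
            y = as.factor(trainData$Purchase), 
            plot = "box",
            strip=strip.custom(par.strip.text=list(cex=.7)),
            scales = list(y = list(relation="free")))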

Having visualized the relationships between X and Y, we can only say which variables are likely to be important to predict Y.

It may not be wise to conclude which variables are NOT important, because sometimes variables with an uninteresting pattern can help explain certain aspects of Y that the visually important variables may not.

So to be safe, let’s not arrive at conclusions about excluding variables prematurely.

5. How to do feature selection using recursive feature elimination (rfe)?

Most machine learning algorithms are able to determine which features are important for predicting Y. But in some scenarios, we might need to be careful to include only variables that are significantly important and make strong business sense.

This is quite common in banking, economics and financial institutions.

Or we might just be doing an exploratory analysis to determine important predictors and report them as a metric in our analytics dashboard. Or, if we are using a traditional algorithm like linear or logistic regression, determining which variables to feed to the model is in the hands of the analyst.

Given such requirements, we might need a rigorous way to determine the important variables first before feeding them to the ML algorithm.

A good choice of selecting the important features is the recursive feature elimination (RFE).

So how does recursive feature elimination work? RFE works in 3 broad steps:

Step 1: Build an ML model on a training dataset and estimate the feature importances on the test dataset.

Step 2: Keeping the most important variables at the top of the priority, iterate by building models of the given subset sizes, that is, subgroups of the most important predictors determined in Step 1. The ranking of the predictors is recalculated in each iteration.

Step 3: The model performances are compared across different subset sizes to arrive at the optimal number and list of final predictors.

It can be implemented using the rfe() function, and we have the flexibility to control which algorithm rfe uses and how it cross validates by defining the rfeControl().

Code for feature Selection

set.seed(100)
#options(warn=-1)

subsets <- c(1:5, 10, 15, 18)

ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)

# Note: rfe() with repeated cross validation can take a few minutes to run
lmProfile <- rfe(x = trainData[, 1:18], y = as.factor(trainData$Purchase),
                 sizes = subsets,
                 rfeControl = ctrl)

lmProfile

In the above code, we call the rfe() which implements the recursive feature elimination. Apart from the x and y datasets, RFE also takes two important parameters.

sizes
rfeControl

The sizes argument determines the model sizes (the number of most important features) that rfe should consider.

In the above case, it iterates over models of size 1 to 5, 10, 15 and 18. The rfeControl parameter, on the other hand, receives the output of rfeControl(). If we look at the call to rfeControl(), we set what type of algorithm and what cross validation method should be used.

In above case, the cross validation method is repeatedcv which implements k-Fold cross validation repeated 5 times, which is rigorous enough for our case.

Once rfe() is run, the output shows the accuracy and kappa (and their standard deviation) for the different model sizes we provided.

The final selected model subset size is marked with a * in the rightmost Selected column.

From the above output, a model size of 3 with LoyalCH, PriceDiff and StoreID seems to achieve the optimal accuracy. That means, out of the 18 features, a model with just 3 of them outperformed many larger models.

However, it is not a mandate that only including these 3 variables will always give high accuracy over larger sized models.

That’s because, the rfe() we just implemented is particular to random forest based rfFuncs.

Since ML algorithms have their own way of learning the relationship between the x and y, it is not wise to neglect the other predictors, especially when there is evidence that the rest of the variables contain information that helps explain the relationship between x and y.

Also, since the training dataset isn’t very large, the other predictors may not have had the chance to show their worth. In the next step, we will build the actual machine learning model on trainData.

6. Training and Tuning the model

6.1. How to train() the model and interpret the results?

Now comes the important stage where we actually build the machine learning model. To know what models caret supports, run the following:

The Code

# See available algorithms in caret
modelnames <- paste(names(getModelInfo()), collapse=',  ')
modelnames
## [1] "ada,  AdaBag,  AdaBoost.M1,  adaboost,  amdai,  ANFIS,  avNNet,  awnb,  awtan,  bag,  bagEarth,  bagEarthGCV,  bagFDA,  bagFDAGCV,  bam,  bartMachine,  bayesglm,  binda,  blackboost,  blasso,  blassoAveraged,  bridge,  brnn,  BstLm,  bstSm,  bstTree,  C5.0,  C5.0Cost,  C5.0Rules,  C5.0Tree,  cforest,  chaid,  CSimca,  ctree,  ctree2,  cubist,  dda,  deepboost,  DENFIS,  dnn,  dwdLinear,  dwdPoly,  dwdRadial,  earth,  elm,  enet,  evtree,  extraTrees,  fda,  FH.GBML,  FIR.DM,  foba,  FRBCS.CHI,  FRBCS.W,  FS.HGD,  gam,  gamboost,  gamLoess,  gamSpline,  gaussprLinear,  gaussprPoly,  gaussprRadial,  gbm_h2o,  gbm,  gcvEarth,  GFS.FR.MOGUL,  GFS.LT.RS,  GFS.THRIFT,  glm.nb,  glm,  glmboost,  glmnet_h2o,  glmnet,  glmStepAIC,  gpls,  hda,  hdda,  hdrda,  HYFIS,  icr,  J48,  JRip,  kernelpls,  kknn,  knn,  krlsPoly,  krlsRadial,  lars,  lars2,  lasso,  lda,  lda2,  leapBackward,  leapForward,  leapSeq,  Linda,  lm,  lmStepAIC,  LMT,  loclda,  logicBag,  LogitBoost,  logreg,  lssvmLinear,  lssvmPoly,  lssvmRadial,  lvq,  M5,  M5Rules,  manb,  mda,  Mlda,  mlp,  mlpKerasDecay,  mlpKerasDecayCost,  mlpKerasDropout,  mlpKerasDropoutCost,  mlpML,  mlpSGD,  mlpWeightDecay,  mlpWeightDecayML,  monmlp,  msaenet,  multinom,  mxnet,  mxnetAdam,  naive_bayes,  nb,  nbDiscrete,  nbSearch,  neuralnet,  nnet,  nnls,  nodeHarvest,  null,  OneR,  ordinalNet,  ordinalRF,  ORFlog,  ORFpls,  ORFridge,  ORFsvm,  ownn,  pam,  parRF,  PART,  partDSA,  pcaNNet,  pcr,  pda,  pda2,  penalized,  PenalizedLDA,  plr,  pls,  plsRglm,  polr,  ppr,  pre,  PRIM,  protoclass,  qda,  QdaCov,  qrf,  qrnn,  randomGLM,  ranger,  rbf,  rbfDDA,  Rborist,  rda,  regLogistic,  relaxo,  rf,  rFerns,  RFlda,  rfRules,  ridge,  rlda,  rlm,  rmda,  rocc,  rotationForest,  rotationForestCp,  rpart,  rpart1SE,  rpart2,  rpartCost,  rpartScore,  rqlasso,  rqnc,  RRF,  RRFglobal,  rrlda,  RSimca,  rvmLinear,  rvmPoly,  rvmRadial,  SBC,  sda,  sdwd,  simpls,  SLAVE,  slda,  smda,  snn,  sparseLDA,  spikeslab,  spls,  stepLDA,  stepQDA,  superpc,  svmBoundrangeString,  svmExpoString,  svmLinear,  svmLinear2,  svmLinear3,  svmLinearWeights,  svmLinearWeights2,  svmPoly,  svmRadial,  svmRadialCost,  svmRadialSigma,  svmRadialWeights,  svmSpectrumString,  tan,  tanSearch,  treebag,  vbmpRadial,  vglmAdjCat,  vglmContRatio,  vglmCumulative,  widekernelpls,  WM,  wsrf,  xgbDART,  xgbLinear,  xgbTree,  xyf"

Each of those is a machine learning algorithm caret supports. Yes, it’s a huge list! And if we want to know more details, like the hyperparameters and whether an algorithm can be used for regression or classification problems, we can do a modelLookup(algo). Once we have chosen an algorithm, building the model is fairly easy using the train() function. Let’s train a Multivariate Adaptive Regression Splines (MARS) model by setting method=‘earth’. The MARS algorithm is named ‘earth’ in R because of a possible trademark conflict with Salford Systems. May be a rumor. Or not.

Code to get details of an algorithm:

modelLookup('earth')
##   model parameter          label forReg forClass probModel
## 1 earth    nprune         #Terms   TRUE     TRUE      TRUE
## 2 earth    degree Product Degree   TRUE     TRUE      TRUE

Application of model on train data.

# Set the seed for reproducibility
set.seed(100)

# Train the model using MARS (method='earth') and predict on the training data itself.
model_mars = train(Purchase ~ ., data=trainData, method='earth')
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.2.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.2.3
## Loading required package: plotrix
## Loading required package: TeachingDemos
## Warning: package 'TeachingDemos' was built under R version 4.2.3
fitted <- predict(model_mars)

But we may ask, how is using train() different from using the algorithm’s function directly? The difference is that, besides building the model, train() does multiple other things like:

Cross validating the model
Tune the hyper parameters for optimal model performance
Choose the optimal model based on a given evaluation metric
Preprocess the predictors (what we did so far using preProcess(); a short sketch of this option follows this list)
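For example, here is a rough sketch (not the approach taken in this report; raw_train and model_mars_pp are illustrative names only) of letting train() run the kNN imputation itself on the raw training rows, where na.action = na.pass lets the missing values reach the imputation step:

# Let train() handle the preprocessing internally via its preProcess argument;
# knnImpute also centers and scales the predictors.
raw_train <- orange[trainRowNumbers, ]
raw_train$Purchase <- as.factor(raw_train$Purchase)
model_mars_pp <- train(Purchase ~ ., data = raw_train,
                       method = 'earth',
                       preProcess = 'knnImpute',
                       na.action = na.pass)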

The train function also accepts the arguments used by the algorithm specified in the method argument. Now let’s see what the train() has generated.

Code to see the results:

model_mars
## Multivariate Adaptive Regression Spline 
## 
## 857 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 857, 857, 857, 857, 857, 857, ... 
## Resampling results across tuning parameters:
## 
##   nprune  Accuracy   Kappa    
##    2      0.8116999  0.5969106
##    9      0.8224037  0.6225706
##   17      0.8132064  0.6030569
## 
## Tuning parameter 'degree' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nprune = 9 and degree = 1.

We can see the Accuracy and Kappa for the various values of the tuned hyperparameter – here nprune, while degree was held constant at 1.

And it says ‘Resampling: Bootstrapped (25 reps)’ with a summary of sample sizes. Looks like train() has already done a basic cross validation and hyper parameter tuning.

And that is the default behaviour.

The chosen model and its parameters is reported in the last 2 lines of the output.

When we used model_mars to predict the Y, this final model was automatically used by predict() to compute the predictions. Plotting the model shows how the various iterations of hyperparameter search performed.

Plot to check the accuracy:

plot(model_mars, main="Model Accuracies with MARS")

6.2 How to compute variable importance?

Excellent, since MARS supports computing variable importances, let’s extract the variable importances using varImp() to understand which variables came out to be useful.

Code to get the important variable in the model

varimp_mars <- varImp(model_mars)
plot(varimp_mars, main="Variable Importance with MARS")

As suspected, LoyalCH was the most used variable, followed by PriceDiff and StoreID.

6.3. Prepare the test dataset and predict

A default MARS model has been selected. Now in order to use the model to predict on new data, the data has to be preprocessed and transformed just the way we did on the training data.

Thanks to caret, all the information required for pre-processing is stored in the respective preProcess model and dummyVar model.

If we recall, we did the pre-processing in the following sequence: Missing Value Imputation –> One-Hot Encoding –> Range Normalization. We need to pass the testData through these models in the same sequence:

Code to run the trained model on test data:

# Step 1: fill missing values 
testData2 <- predict(preProcess_missingdata_model, testData)  

# Step 2: Create one-hot encodings (dummy variables)
testData3 <- predict(dummies_model, testData2)

# Step 3: Transform the features to range between 0 and 1
testData4 <- predict(preProcess_range_model, testData3)

# View
head(testData4[, 1:10])
##    WeekofPurchase   StoreID PriceCH   PriceMM DiscCH DiscMM SpecialCH SpecialMM
## 7      0.09803922 1.0000000   0.000 0.5000000      0    0.5         1         1
## 11     0.25490196 1.0000000   0.425 0.6666667      0    0.0         0         0
## 18     0.80392157 0.1666667   0.425 0.8166667      0    0.0         0         1
## 21     0.58823529 1.0000000   0.425 0.8166667      0    0.0         0         0
## 33     0.94117647 0.1666667   0.675 0.8166667      0    1.0         0         1
## 35     0.47058824 0.3333333   0.750 0.9000000      0    0.0         0         0
##      LoyalCH SalePriceMM
## 7  0.9722332   0.3636364
## 11 0.9886583   0.8181818
## 18 0.4000146   0.9000000
## 21 0.6000274   0.9000000
## 33 0.6800325   0.1727273
## 35 0.5440238   0.9454545
#View(testData4)

6.4. Predict on testData

The test dataset is prepared. Let’s predict the Y.

# Predict on testData
predicted <- predict(model_mars, testData2)
#head(predicted)

6.5. Confusion Matrix

The confusion matrix is a tabular representation to compare the predictions (data) vs the actuals (reference).

By setting mode=‘everything’, pretty much all of the common classification evaluation metrics are computed.

Code to get the confusion matrix:

# Compute the confusion matrix
#length(predicted)
confusionMatrix(reference = as.factor(testData$Purchase), data =as.factor(predicted), mode='everything', positive='MM')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction CH MM
##         CH 47  4
##         MM 83 79
##                                           
##                Accuracy : 0.5915          
##                  95% CI : (0.5223, 0.6582)
##     No Information Rate : 0.6103          
##     P-Value [Acc > NIR] : 0.7374          
##                                           
##                   Kappa : 0.2673          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9518          
##             Specificity : 0.3615          
##          Pos Pred Value : 0.4877          
##          Neg Pred Value : 0.9216          
##               Precision : 0.4877          
##                  Recall : 0.9518          
##                      F1 : 0.6449          
##              Prevalence : 0.3897          
##          Detection Rate : 0.3709          
##    Detection Prevalence : 0.7606          
##       Balanced Accuracy : 0.6567          
##                                           
##        'Positive' Class : MM              
## 

We get an overall accuracy of 59.15% on the test data, which is noticeably lower than the resampling estimate. Note that, to stay consistent with the preprocessing sequence in Section 6.3, the fully preprocessed testData4 would normally be the dataset passed to predict() here.

7. How to do hyperparameter tuning to optimize the model for better performance?

There are two main ways to do hyperparameter tuning using the train() function:

Set the tuneLength
Define and set the tuneGrid

tuneLength corresponds to the number of unique values for the tuning parameters caret will consider while forming the hyper parameter combinations.

Caret will automatically determine the values each parameter should take. Alternately, if we want to explicitly control what values should be considered for each parameter, then, we can define the tuneGrid and pass it to train().

Let’s see an example of both these approaches but first let’s setup the trainControl().

7.1. Setting up the trainControl()

The train() function takes a trControl argument that accepts the output of trainControl().

Inside trainControl() we can specify:

The cross validation method to use
How the results should be summarised, using a summary function

Cross validation method can be one amongst:

‘boot’: Bootstrap sampling
‘boot632’: Bootstrap sampling with 63.2% bias correction applied
‘optimism_boot’: The optimism bootstrap estimator
‘boot_all’: All boot methods.
‘cv’: k-Fold cross validation
‘repeatedcv’: Repeated k-Fold cross validation
‘oob’: Out of Bag cross validation
‘LOOCV’: Leave one out cross validation
‘LGOCV’: Leave group out cross validation

The summaryFunction can be twoClassSummary if Y is binary, or multiClassSummary if Y has more than 2 categories. By setting classProbs=T, probability scores are generated instead of directly predicting the class based on a predetermined cutoff of 0.5.

Code to set up the train control:

# Define the training control
fitControl <- trainControl(method = 'cv',# k-fold cross validation
    number = 5,                      # number of folds
    savePredictions = 'final',       # saves predictions for optimal tuning parameter
    classProbs = T,                  # should class probabilities be returned
    summaryFunction=twoClassSummary  # results summary function
)

7.2 Hyperparameter Tuning using tuneLength

Let’s take the train() function we used before, plus additionally set the tuneLength, trControl and metric.

set.seed(100)
model_mars2 = train(Purchase ~ ., data=na.omit(trainData), method='earth', tuneLength = 5, metric='ROC', trControl = fitControl)
model_mars2
## Multivariate Adaptive Regression Spline 
## 
## 857 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 685, 686, 685, 686, 686 
## Resampling results across tuning parameters:
## 
##   nprune  ROC        Sens       Spec     
##    2      0.8837092  0.8757143  0.7094075
##    5      0.9045666  0.8756960  0.7483944
##    9      0.8958465  0.8776190  0.7483039
##   13      0.8942303  0.8719048  0.7513342
##   17      0.8942303  0.8719048  0.7513342
## 
## Tuning parameter 'degree' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nprune = 5 and degree = 1.
# Step 2: Predict on testData and Compute the confusion matrix
predicted2 <- predict(model_mars2, testData2)
confusionMatrix(reference = as.factor(testData$Purchase), data = as.factor(predicted2), mode='everything', positive='MM')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction CH MM
##         CH 53  5
##         MM 77 78
##                                           
##                Accuracy : 0.615           
##                  95% CI : (0.5461, 0.6807)
##     No Information Rate : 0.6103          
##     P-Value [Acc > NIR] : 0.4741          
##                                           
##                   Kappa : 0.3004          
##                                           
##  Mcnemar's Test P-Value : 4.483e-15       
##                                           
##             Sensitivity : 0.9398          
##             Specificity : 0.4077          
##          Pos Pred Value : 0.5032          
##          Neg Pred Value : 0.9138          
##               Precision : 0.5032          
##                  Recall : 0.9398          
##                      F1 : 0.6555          
##              Prevalence : 0.3897          
##          Detection Rate : 0.3662          
##    Detection Prevalence : 0.7277          
##       Balanced Accuracy : 0.6737          
##                                           
##        'Positive' Class : MM              
## 

7.3. Hyperparameter Tuning using tuneGrid

Alternately, we can set the tuneGrid instead of tuneLength.

Code

# Step 1: Define the tuneGrid
marsGrid <-  expand.grid(nprune = c(2, 4, 6, 8, 10), 
                        degree = c(1, 2, 3))

# Step 2: Tune hyper parameters by setting tuneGrid
set.seed(100)
model_mars3 = train(Purchase ~ ., data= na.omit(trainData), method='earth', metric='ROC', tuneGrid = marsGrid, trControl = fitControl)
model_mars3
## Multivariate Adaptive Regression Spline 
## 
## 857 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 685, 686, 685, 686, 686 
## Resampling results across tuning parameters:
## 
##   degree  nprune  ROC        Sens       Spec     
##   1        2      0.8837092  0.8757143  0.7094075
##   1        4      0.9046530  0.8795238  0.7692899
##   1        6      0.9007984  0.8795055  0.7454545
##   1        8      0.8980735  0.8776190  0.7512890
##   1       10      0.8938840  0.8700000  0.7422433
##   2        2      0.8594193  0.8701648  0.6499774
##   2        4      0.9020261  0.8776374  0.7693351
##   2        6      0.9012110  0.8603297  0.7752601
##   2        8      0.8956165  0.8565385  0.7693351
##   2       10      0.8949678  0.8603846  0.7723202
##   3        2      0.8773872  0.8297802  0.7604251
##   3        4      0.9034469  0.8776007  0.7753053
##   3        6      0.9026720  0.8660806  0.7631389
##   3        8      0.8997580  0.8603846  0.7511533
##   3       10      0.8974952  0.8565751  0.7571687
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nprune = 4 and degree = 1.
# Step 3: Predict on testData and Compute the confusion matrix
predicted3 <- predict(model_mars3, testData2)
confusionMatrix(reference = as.factor(testData$Purchase), data = as.factor(predicted3), mode='everything', positive='MM')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction CH MM
##         CH 52  5
##         MM 78 78
##                                           
##                Accuracy : 0.6103          
##                  95% CI : (0.5413, 0.6762)
##     No Information Rate : 0.6103          
##     P-Value [Acc > NIR] : 0.53            
##                                           
##                   Kappa : 0.2932          
##                                           
##  Mcnemar's Test P-Value : 2.722e-15       
##                                           
##             Sensitivity : 0.9398          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.9123          
##               Precision : 0.5000          
##                  Recall : 0.9398          
##                      F1 : 0.6527          
##              Prevalence : 0.3897          
##          Detection Rate : 0.3662          
##    Detection Prevalence : 0.7324          
##       Balanced Accuracy : 0.6699          
##                                           
##        'Positive' Class : MM              
## 

8. How to evaluate the performance of multiple machine learning algorithms?

Caret provides the resamples() function where we can provide multiple machine learning models and collectively evaluate them.

Let’s first train some more algorithms.

8.1. Training Adaboost

set.seed(100)

# Train the model using adaboost
model_adaboost = train(Purchase ~ ., data=trainData, method='adaboost', tuneLength=2, trControl = fitControl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
model_adaboost
## AdaBoost Classification Trees 
## 
## 857 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 685, 686, 685, 686, 686 
## Resampling results across tuning parameters:
## 
##   nIter  method         ROC        Sens       Spec     
##    50    Adaboost.M1    0.8783495  0.8298535  0.7543193
##    50    Real adaboost  0.6860001  0.8393407  0.7303935
##   100    Adaboost.M1    0.8768766  0.8317399  0.7602895
##   100    Real adaboost  0.6629364  0.8450733  0.7363636
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nIter = 50 and method = Adaboost.M1.

8.2. Training Random Forest

set.seed(100)

# Train the model using rf
model_rf = train(Purchase ~ ., data=trainData, method='rf', tuneLength=5, trControl = fitControl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
model_rf
## Random Forest 
## 
## 857 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 685, 686, 685, 686, 686 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.8720916  0.8622711  0.6824514
##    5    0.8855879  0.8527656  0.7303030
##    9    0.8864110  0.8508608  0.7482587
##   13    0.8835262  0.8508425  0.7602442
##   17    0.8863651  0.8508425  0.7692447
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 9.

8.3. Training SVM

set.seed(100)

# Train the model using svmRadial
model_svmRadial = train(Purchase ~ ., data=trainData, method='svmRadial', tuneLength=15, trControl = fitControl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
model_svmRadial
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 857 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 685, 686, 685, 686, 686 
## Resampling results across tuning parameters:
## 
##   C        ROC        Sens       Spec     
##      0.25  0.8979965  0.8833516  0.7244233
##      0.50  0.8982256  0.8776007  0.7274536
##      1.00  0.8983574  0.8757143  0.7334238
##      2.00  0.8934986  0.8737729  0.7303483
##      4.00  0.8920373  0.8794872  0.7064677
##      8.00  0.8871427  0.8909524  0.6855269
##     16.00  0.8829390  0.8870696  0.6914971
##     32.00  0.8769735  0.8889744  0.6554048
##     64.00  0.8592418  0.8870696  0.6643600
##    128.00  0.8480155  0.8794139  0.6494346
##    256.00  0.8402301  0.8813736  0.6254636
##    512.00  0.8320629  0.8871245  0.6196744
##   1024.00  0.8208694  0.8890476  0.6106287
##   2048.00  0.8145820  0.8947802  0.5628223
##   4096.00  0.8121453  0.8967582  0.5358661
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06561907
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.06561907 and C = 1.

8.4. Run resamples() to compare the models

# Compare model performances using resample()
models_compare <- resamples(list(ADABOOST=model_adaboost, RF=model_rf, MARS=model_mars3, SVM=model_svmRadial))

# Summary of the models performances
summary(models_compare)
## 
## Call:
## summary.resamples(object = models_compare)
## 
## Models: ADABOOST, RF, MARS, SVM 
## Number of resamples: 5 
## 
## ROC 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## ADABOOST 0.8525253 0.8772245 0.8828932 0.8783495 0.8852878 0.8938166    0
## RF       0.8718615 0.8757893 0.8922933 0.8864110 0.8953092 0.8968017    0
## MARS     0.8855700 0.8943743 0.9068941 0.9046530 0.9146411 0.9217853    0
## SVM      0.8679654 0.8844719 0.9053305 0.8983574 0.9153272 0.9186923    0
## 
## Sens 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## ADABOOST 0.7714286 0.8076923 0.8380952 0.8298535 0.8653846 0.8666667    0
## RF       0.8076923 0.8285714 0.8476190 0.8508608 0.8761905 0.8942308    0
## MARS     0.8365385 0.8380952 0.8952381 0.8795238 0.9134615 0.9142857    0
## SVM      0.8269231 0.8761905 0.8761905 0.8757143 0.8761905 0.9230769    0
## 
## Spec 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## ADABOOST 0.6969697 0.7014925 0.7462687 0.7543193 0.8059701 0.8208955    0
## RF       0.6666667 0.6716418 0.7761194 0.7482587 0.7910448 0.8358209    0
## MARS     0.7121212 0.7761194 0.7761194 0.7692899 0.7761194 0.8059701    0
## SVM      0.6969697 0.7014925 0.7164179 0.7334238 0.7761194 0.7761194    0

Let’s plot the resamples summary output.

# Draw box plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models_compare, scales=scales)

In the above output we can clearly see how the algorithms performed in terms of ROC, Sensitivity and Specificity, and how consistent each of them has been.

The MARS model appears to be the best performing model overall because of its high ROC. It also has the highest mean sensitivity, so if we need a model that predicts the positives well, MARS again looks attractive. Either way, we can now make an informed decision on which model to choose.

9. Ensembling the predictions

9.1. How to ensemble predictions from multiple models using caretEnsemble?

So we have predictions from multiple individual models. To do this, we had to run the train() function once for each model, store the models and pass them to resamples(). The caretEnsemble package lets us do all of this in one call.

All we have to do is put the names of all the algorithms we want to run in a vector and pass it to caretEnsemble::caretList() instead of caret::train()

library(caretEnsemble)
## Warning: package 'caretEnsemble' was built under R version 4.2.3
## 
## Attaching package: 'caretEnsemble'
## The following object is masked from 'package:ggplot2':
## 
##     autoplot
# Stacking Algorithms - Run multiple algos in one call.
trainControl <- trainControl(method="repeatedcv", 
                             number=3, 
                             repeats=2,
                             savePredictions=TRUE, 
                             classProbs=TRUE)

algorithmList <- c('rf', 'adaboost', 'earth', 'svmRadial')

set.seed(100)
models <- caretList(Purchase ~ ., data=trainData, trControl=trainControl, methodList=algorithmList) 
## Warning in trControlCheck(x = trControl, y = target): x$savePredictions == TRUE
## is depreciated. Setting to 'final' instead.
## Warning in trControlCheck(x = trControl, y = target): indexes not defined in
## trControl.  Attempting to set them ourselves, so each model in the ensemble
## will have the same resampling indexes.
results <- resamples(models)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: rf, adaboost, earth, svmRadial 
## Number of resamples: 6 
## 
## Accuracy 
##                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rf        0.7692308 0.8027997 0.8122807 0.8063174 0.8175439 0.8251748    0
## adaboost  0.7482517 0.7924537 0.8024513 0.7952247 0.8114035 0.8146853    0
## earth     0.7727273 0.8001564 0.8272155 0.8179767 0.8345039 0.8526316    0
## svmRadial 0.7832168 0.8134696 0.8269393 0.8267404 0.8425101 0.8666667    0
## 
## Kappa 
##                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rf        0.5202806 0.5852219 0.6024232 0.5929820 0.6184562 0.6294761    0
## adaboost  0.4716749 0.5613085 0.5844175 0.5665583 0.5976776 0.6031371    0
## earth     0.5207528 0.5732292 0.6347936 0.6136130 0.6481526 0.6860246    0
## svmRadial 0.5406694 0.5994878 0.6317400 0.6296834 0.6627546 0.7121212    0

Plot the resamples output to compare the models.

# Box plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales)

It is possible to further tune the models within caretList() in a customised way.

9.2. How to combine the predictions of multiple models to form a final prediction

That one function simplified a whole lot of work in one line of code. Here is another thought: Is it possible to combine these predicted values from multiple models somehow and make a new ensemble that predicts better?

Turns out this can be done too, using caretStack(). We just need to make sure we don’t use the same trainControl we used to build the models.

# Create the trainControl
set.seed(101)
stackControl <- trainControl(method="repeatedcv", 
                             number=10, 
                             repeats=3,
                             savePredictions=TRUE, 
                             classProbs=TRUE)

# Ensemble the predictions of `models` to form a new combined prediction based on glm
stack.glm <- caretStack(models, method="glm", metric="Accuracy", trControl=stackControl)
print(stack.glm)
## A glm ensemble of 4 base models: rf, adaboost, earth, svmRadial
## 
## Ensemble results:
## Generalized Linear Model 
## 
## 1714 samples
##    4 predictor
##    2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1543, 1543, 1542, 1544, 1543, 1542, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8307866  0.6393798

A point to consider: ensembles tend to perform better if the predictions are less correlated with each other. So we may want to try passing different types of models, both high and low performing, rather than sticking to only high-accuracy models in caretStack.
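One quick way to gauge this with caret is modelCor(), which reports the correlation of the models’ resampling results (a rough proxy for how correlated their predictions are); a minimal sketch using the results object created earlier:

# Lower correlation between base models generally leaves more room
# for the stacked ensemble to improve on any single model.
modelCor(results)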

# Predict on testData
stack_predicteds <- predict(stack.glm, newdata=testData2)
head(stack_predicteds)
## [1] MM CH MM MM MM MM
## Levels: CH MM

10. Conclusion

The purpose of this report is to cover the core pieces of the caret package and how we can effectively use it to build machine learning models. This information should serve as a reference and also as a template we can use to build a standardised machine learning workflow, so we can develop it further from there.