The data is obtained from the C50 package and contains the customer churn data at a telecom company. We will fit different models.

In order to do a proper ‘apples to apples’ comparison, we must make sure that each model uses the exact split of the training and testing model for each fold. We will do this by creating a shared trainControl object.

Lets load the data -

We have divided the data into a set of predictor variables “churn_x” and a vector of response variables “churn_y”.

The churn dataset contains data on a variety of telecom customers and the modeling challenge is to predict which customers will cancel their service (or churn).

We will be exploring two different types of predictive models: glmnet and rf, so the first order of business is to create a reusable trainControl object you can use to reliably compare them.

#Use createFolds() to create 5 CV folds on churn_y, our target variable for this exercise.
myFolds <- createFolds(churn_y, k = 5)

#Pass them to trainControl() to create a reusable trainControl for comparing models.

myControl <- trainControl(summaryFunction = twoClassSummary, #uses 'AUC'
     classProbs = T,
     verboseIter = T,
     savePredictions = T,
     index = myFolds)

By saving the indexes in the trainControl(), we can fit many models useing the same CV fold.

Lets look at what exactly is a glmnet model. 1)It is a linear model with built in variable selection.

2)It’s a great baseline model for any predictive modeling problem. -Its fast -Uses variable selection to ignore noisy variables - Provides linear regression coefficients that we can use to understand patterns in the undelying data. -Easily interpretable models(like lm or glm from base R).

As a Business analysis, we could use these coefficients to understand key drivers of churn, but even if we care omly about the predictions, glmnet fits quickly and often provides accurate results.

FIT THE BASELINE MODEL

Now that we have a reusable trainControl object called myControl, we can start fitting different predictive models to our churn dataset and evaluate their predictive accuracy.

we will start with one of my favorite models, glmnet, which penalizes linear and logistic regression models on the size and number of coefficients to help prevent over-fitting.

model_glmnet <- train(
  x = churn_x,
  y = churn_y,
  metric = "ROC",
  method = "glmnet",
  trControl = myControl
)
## + Fold1: alpha=0.10, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold1: alpha=0.10, lambda=0.01821 
## + Fold1: alpha=0.55, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold1: alpha=0.55, lambda=0.01821 
## + Fold1: alpha=1.00, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold1: alpha=1.00, lambda=0.01821 
## + Fold2: alpha=0.10, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold2: alpha=0.10, lambda=0.01821 
## + Fold2: alpha=0.55, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold2: alpha=0.55, lambda=0.01821 
## + Fold2: alpha=1.00, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold2: alpha=1.00, lambda=0.01821 
## + Fold3: alpha=0.10, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold3: alpha=0.10, lambda=0.01821 
## + Fold3: alpha=0.55, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold3: alpha=0.55, lambda=0.01821 
## + Fold3: alpha=1.00, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold3: alpha=1.00, lambda=0.01821 
## + Fold4: alpha=0.10, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold4: alpha=0.10, lambda=0.01821 
## + Fold4: alpha=0.55, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold4: alpha=0.55, lambda=0.01821 
## + Fold4: alpha=1.00, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold4: alpha=1.00, lambda=0.01821 
## + Fold5: alpha=0.10, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold5: alpha=0.10, lambda=0.01821 
## + Fold5: alpha=0.55, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold5: alpha=0.55, lambda=0.01821 
## + Fold5: alpha=1.00, lambda=0.01821
## Warning in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## - Fold5: alpha=1.00, lambda=0.01821 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.55, lambda = 0.0182 on full training set

Plotting the model

plot(model_glmnet)

plot(model_glmnet$finalModel)

Trying a Random Forest model-

Afer glmnet, Random Forests are always the second model I try on any new predictive modeling problem.

1)Random Forests are slightly slower that glmnet models

  1. Bit of a ‘blackbox’ when it comes to interpretability, but can yield accurate models in lot of situations with little parameter tuning.

  2. Require less preprocessing (No need to log transform or otherwise normalize our predictors as well as handles the MNAR data quite beautifully, even with median imputation)

  3. They also often capture ‘threshold effects’ and ‘variable interactions’ which quite often happen in the real world as well.

5)Easy to tune, but slower running.

The default parameters of caret are great for fitting an RF model, hence we wont need a custom tuning grid(alpha and lambda parameters).

Lets fit a Random Forest model to the churn data using the ‘ranger’ package.(Rather than using the classic randomForest package, I’ll be using the ranger package, which is a re-implementation of randomForest that produces almost the exact same results, but is faster, more stable, and uses less memory.)

model_rf <- train(
  x = churn_x, 
  y = churn_y,
  metric = "ROC",
  method = "ranger",
  trControl = myControl
)
## + Fold1: mtry= 2, min.node.size=1, splitrule=gini 
## - Fold1: mtry= 2, min.node.size=1, splitrule=gini 
## + Fold1: mtry=36, min.node.size=1, splitrule=gini 
## - Fold1: mtry=36, min.node.size=1, splitrule=gini 
## + Fold1: mtry=70, min.node.size=1, splitrule=gini 
## - Fold1: mtry=70, min.node.size=1, splitrule=gini 
## + Fold1: mtry= 2, min.node.size=1, splitrule=extratrees 
## - Fold1: mtry= 2, min.node.size=1, splitrule=extratrees 
## + Fold1: mtry=36, min.node.size=1, splitrule=extratrees 
## - Fold1: mtry=36, min.node.size=1, splitrule=extratrees 
## + Fold1: mtry=70, min.node.size=1, splitrule=extratrees 
## - Fold1: mtry=70, min.node.size=1, splitrule=extratrees 
## + Fold2: mtry= 2, min.node.size=1, splitrule=gini 
## - Fold2: mtry= 2, min.node.size=1, splitrule=gini 
## + Fold2: mtry=36, min.node.size=1, splitrule=gini 
## - Fold2: mtry=36, min.node.size=1, splitrule=gini 
## + Fold2: mtry=70, min.node.size=1, splitrule=gini 
## - Fold2: mtry=70, min.node.size=1, splitrule=gini 
## + Fold2: mtry= 2, min.node.size=1, splitrule=extratrees 
## - Fold2: mtry= 2, min.node.size=1, splitrule=extratrees 
## + Fold2: mtry=36, min.node.size=1, splitrule=extratrees 
## - Fold2: mtry=36, min.node.size=1, splitrule=extratrees 
## + Fold2: mtry=70, min.node.size=1, splitrule=extratrees 
## - Fold2: mtry=70, min.node.size=1, splitrule=extratrees 
## + Fold3: mtry= 2, min.node.size=1, splitrule=gini 
## - Fold3: mtry= 2, min.node.size=1, splitrule=gini 
## + Fold3: mtry=36, min.node.size=1, splitrule=gini 
## - Fold3: mtry=36, min.node.size=1, splitrule=gini 
## + Fold3: mtry=70, min.node.size=1, splitrule=gini 
## - Fold3: mtry=70, min.node.size=1, splitrule=gini 
## + Fold3: mtry= 2, min.node.size=1, splitrule=extratrees 
## - Fold3: mtry= 2, min.node.size=1, splitrule=extratrees 
## + Fold3: mtry=36, min.node.size=1, splitrule=extratrees 
## - Fold3: mtry=36, min.node.size=1, splitrule=extratrees 
## + Fold3: mtry=70, min.node.size=1, splitrule=extratrees 
## - Fold3: mtry=70, min.node.size=1, splitrule=extratrees 
## + Fold4: mtry= 2, min.node.size=1, splitrule=gini 
## - Fold4: mtry= 2, min.node.size=1, splitrule=gini 
## + Fold4: mtry=36, min.node.size=1, splitrule=gini 
## - Fold4: mtry=36, min.node.size=1, splitrule=gini 
## + Fold4: mtry=70, min.node.size=1, splitrule=gini 
## - Fold4: mtry=70, min.node.size=1, splitrule=gini 
## + Fold4: mtry= 2, min.node.size=1, splitrule=extratrees 
## - Fold4: mtry= 2, min.node.size=1, splitrule=extratrees 
## + Fold4: mtry=36, min.node.size=1, splitrule=extratrees 
## - Fold4: mtry=36, min.node.size=1, splitrule=extratrees 
## + Fold4: mtry=70, min.node.size=1, splitrule=extratrees 
## - Fold4: mtry=70, min.node.size=1, splitrule=extratrees 
## + Fold5: mtry= 2, min.node.size=1, splitrule=gini 
## - Fold5: mtry= 2, min.node.size=1, splitrule=gini 
## + Fold5: mtry=36, min.node.size=1, splitrule=gini 
## - Fold5: mtry=36, min.node.size=1, splitrule=gini 
## + Fold5: mtry=70, min.node.size=1, splitrule=gini 
## - Fold5: mtry=70, min.node.size=1, splitrule=gini 
## + Fold5: mtry= 2, min.node.size=1, splitrule=extratrees 
## - Fold5: mtry= 2, min.node.size=1, splitrule=extratrees 
## + Fold5: mtry=36, min.node.size=1, splitrule=extratrees 
## - Fold5: mtry=36, min.node.size=1, splitrule=extratrees 
## + Fold5: mtry=70, min.node.size=1, splitrule=extratrees 
## - Fold5: mtry=70, min.node.size=1, splitrule=extratrees 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 36, splitrule = extratrees, min.node.size = 1 on full training set
plot(model_rf)

Caret has automatically chose the best results for mtry = 70.

COMPARING MODELS

Selection criteria-

We want to pick the model with the highest average AUC across all the folds, but also a lower sd in AUC(model with the higher median AUC, as well as a smaller range between min and max AUC).

caret::resamples() function can help us(provided they have the same training data and use the same trainControl object with preset cross-validation folds). It provides a variety of methods for us to assess which of 2 models is the best for a given dataset.

model_list <- list(glmnet_model = model_glmnet,
                   rf_model = model_rf)

resamps <- resamples(model_list)

summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: glmnet_model, rf_model 
## Number of resamples: 5 
## 
## ROC 
##                   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glmnet_model 0.3956322 0.5300619 0.6171429 0.5948031 0.6918681 0.7393103    0
## rf_model     0.5556322 0.6234066 0.6543988 0.6587370 0.6946154 0.7656322    0
## 
## Sens 
##                   Min.   1st Qu.    Median      Mean 3rd Qu.      Max. NA's
## glmnet_model 0.9310345 0.9371429 0.9597701 0.9518424    0.96 0.9712644    0
## rf_model     0.9885057 0.9942529 0.9942529 0.9954023    1.00 1.0000000    0
## 
## Spec 
##                    Min.    1st Qu.    Median       Mean 3rd Qu. Max. NA's
## glmnet_model 0.03846154 0.03846154 0.1153846 0.09446154    0.12 0.16    0
## rf_model     0.00000000 0.00000000 0.0000000 0.02400000    0.04 0.08    0
bwplot(resamps, metric = "ROC")

Box and whisker plots show the median of each distribution as a line and the interquartile range of each distribution as a box around the median line.

We can easily use the above plot to select the model with the highest AUC - the Random Forest model.

We can use a density plot that shows the full distribution of the AUC scores using kernel density plot and can be a useful way to look for outlier folds with unusually high or low AUC.

densityplot(resamps, metric = "ROC")

Let’s use a scatterplot, also known as the xy-plot. This plot shows us how similar the two models’ performances are on different folds.It directly compares the AUC on all 10 cross validation folds.

It’s particularly useful for identifying if one model is consistently better than the other across all folds, or if there are situations when the inferior model produces better predictions on a particular subset of the data.

xyplot(resamps, metric = "ROC")

This plot shows us that on every fold, the random forest model provided a higher AUC than the glmnet model.

It would make us very confident in choosing the random forest model over the glmnet model for this customer churn modeling. problem.