In this tutorial, we’ll predict Telco customer churn by building the following classification models with the tidymodels framework:

  • Logistic Regression,

  • Random Forest,

  • XGBoost (extreme gradient boosted trees),

  • K-nearest neighbor,

  • Neural network

Explore Data

The Telco Customer Churn data is retrieved from IBM’s sample data set:

library(tidyverse) # read_csv(), dplyr verbs, ggplot2, forcats

customer <- read_csv("https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv")

customer <- customer %>%
  mutate_if(is.character, factor)

Plot a summary of all variables with DataExplorer:

library(DataExplorer)
plot_intro(customer)

Summarize the customer data (excluding customerID).
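
The table below has the shape of skimr output; a sketch of the call that likely produced it:

library(skimr)
customer %>%
  select(-customerID) %>%
  skim()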

Data summary

Name                     customer[, -1]
Number of rows           7043
Number of columns        20
Column type frequency:
  factor                 16
  numeric                4
Group variables          None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts pct
gender 0 1 FALSE 2 Mal: 3555, Fem: 3488 Mal: 0.50, Fem: 0.50
Partner 0 1 FALSE 2 No: 3641, Yes: 3402 No: 0.52, Yes: 0.48
Dependents 0 1 FALSE 2 No: 4933, Yes: 2110 No: 0.70, Yes: 0.30
PhoneService 0 1 FALSE 2 Yes: 6361, No: 682 Yes: 0.90, No: 0.10
MultipleLines 0 1 FALSE 3 No: 3390, Yes: 2971, No : 682 No: 0.48, Yes: 0.42, No : 0.10
InternetService 0 1 FALSE 3 Fib: 3096, DSL: 2421, No: 1526 Fib: 0.44, DSL: 0.34, No: 0.22
OnlineSecurity 0 1 FALSE 3 No: 3498, Yes: 2019, No : 1526 No: 0.50, Yes: 0.29, No : 0.22
OnlineBackup 0 1 FALSE 3 No: 3088, Yes: 2429, No : 1526 No: 0.44, Yes: 0.34, No : 0.22
DeviceProtection 0 1 FALSE 3 No: 3095, Yes: 2422, No : 1526 No: 0.44, Yes: 0.34, No : 0.22
TechSupport 0 1 FALSE 3 No: 3473, Yes: 2044, No : 1526 No: 0.49, Yes: 0.29, No : 0.22
StreamingTV 0 1 FALSE 3 No: 2810, Yes: 2707, No : 1526 No: 0.40, Yes: 0.38, No : 0.22
StreamingMovies 0 1 FALSE 3 No: 2785, Yes: 2732, No : 1526 No: 0.40, Yes: 0.39, No : 0.22
Contract 0 1 FALSE 3 Mon: 3875, Two: 1695, One: 1473 Mon: 0.55, Two: 0.24, One: 0.21
PaperlessBilling 0 1 FALSE 2 Yes: 4171, No: 2872 Yes: 0.59, No: 0.41
PaymentMethod 0 1 FALSE 4 Ele: 2365, Mai: 1612, Ban: 1544, Cre: 1522 Ele: 0.34, Mai: 0.23, Ban: 0.22, Cre: 0.22
Churn 0 1 FALSE 2 No: 5174, Yes: 1869 No: 0.73, Yes: 0.27

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SeniorCitizen 0 1 0.16 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
tenure 0 1 32.37 24.56 0.00 9.00 29.00 55.00 72.00 ▇▃▃▃▆
MonthlyCharges 0 1 64.76 30.09 18.25 35.50 70.35 89.85 118.75 ▇▅▆▇▅
TotalCharges 11 1 2283.30 2266.77 18.80 401.45 1397.47 3794.74 8684.80 ▇▂▂▂▁


  • About 80% of the variables are categorical, which makes tree-based algorithms a good fit.

  • No columns are entirely missing, and missing observations are rare: there are only 11 NAs, all in TotalCharges - about 0.0074% of all values. We can safely ignore them in prediction.

  • The outcome Churn is imbalanced: the Churn/No-Churn split is roughly 27% vs 73%, as the quick count below confirms.
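
A quick count verifies the imbalance:

customer %>%
  count(Churn) %>%
  mutate(prop = round(n / sum(n), 2))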

Quick EDA

Explore Categorical Data

plot_bar(customer, by = "Churn")

  • Customers with Fiber optic internet, or without Online Security, Online Backup, or Tech Support, are more likely to churn.

  • Customers on a Month-to-month contract, or paying by Electronic check, are more likely to leave.

  • Gender, PhoneService, MultipleLines, StreamingTV, and StreamingMovies show very similar patterns across the churn groups; they may not contribute much to a predictive model.

  • With ExpCatStat() - a function that provides summary statistics for character or categorical columns - the variables mentioned above are rated Weak or Very Weak in Degree of Association.

library(SmartEDA)
library(flextable)
ExpCatStat(customer,
           Target = "Churn") %>%
  flextable()

Explore Numeric Variables

Correlation check

library(GGally)
customer %>%
  select(where(is.numeric), Churn) %>%
  ggpairs(aes(color = fct_rev(Churn)),
          diag = list(continuous = wrap("densityDiag", alpha = 0.6), discrete = wrap("barDiag", alpha = 0.7, color = "grey30")))

TotalCharges is highly correlated with tenure, especially for customers who have churned, and it is also highly correlated with MonthlyCharges. We can therefore omit TotalCharges to reduce multicollinearity.
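
A quick numeric check of these correlations (using complete rows only, since TotalCharges has 11 NAs):

customer %>%
  select(tenure, MonthlyCharges, TotalCharges) %>%
  cor(use = "complete.obs") %>%
  round(2)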

We can also see:

  • tenure is relatively uniform for customers who did not churn, and right-skewed for customers who churned: churn is concentrated at low tenure values, and loyalty increases beyond tenure > 60.

  • The distribution of MonthlyCharges differs between the groups: when MonthlyCharges is high, churn is also higher.

  • For TotalCharges, the distribution is right-skewed for both churn groups, and we see more churn at lower total charges.

  • SeniorCitizen takes only 2 values (0/1), so it should be treated as a factor rather than a numeric variable.

Box-plot

customer %>% 
  select(Churn,PhoneService,tenure,MonthlyCharges,TotalCharges) %>%
  pivot_longer(tenure:TotalCharges, names_to = "stat", values_to = "value") %>%
  ggplot(aes(PhoneService, value, fill = Churn, color = Churn)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~stat, scales = "free_y", nrow = 2) +
  labs(y=NULL, color=NULL, fill=NULL)

Feature engineering

Wrangling data

  • There are 11 NAs in TotalCharges, which we can drop.
  • SeniorCitizen is not really numeric but a factor, so we convert it.
  • We eliminate TotalCharges due to multicollinearity.
  • We also remove customerID, as it is not useful for prediction.
customer <- customer %>%
  na.omit() %>%
  mutate_at(vars(SeniorCitizen), factor) %>%
  select(-customerID, -TotalCharges)

Data splitting and resamples

Split the data into training and testing sets, then create resamples from the training set using cross-validation with vfold_cv():

library(tidymodels)
set.seed(123)
cust_split <- customer %>%
  initial_split(prop = 0.8, strata = Churn)

train <- training(cust_split)
test <- testing(cust_split)

# Cross validation folds from training dataset
set.seed(234)
folds <- vfold_cv(train, strata = Churn)
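
Because we stratified on Churn, both sets should preserve the roughly 27/73 class split. A quick sanity check:

# Verify the churn ratio survives the stratified split
bind_rows(train = count(train, Churn),
          test  = count(test, Churn),
          .id   = "set") %>%
  group_by(set) %>%
  mutate(prop = round(n / sum(n), 2))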

Pre-processing

The type of data preprocessing depends on the data and on the type of model being fit. The excellent book “Tidy Modeling with R” provides an appendix with recommendations for the baseline levels of pre-processing needed by various model functions.

Based on the models we are going to use, the following baseline pre-processing steps may be considered (Yes = required, Opt = optional / potentially helpful, No = not needed):

Pre-processing steps

Model                dummy  zv   impute  decorrelate  normalize
Logistic Regression  Yes    Yes  Yes     Yes          No
Random Forest        No     Opt  Opt     Opt          No
XGBoost              Yes    Yes  Opt     Opt          No
Knn                  Yes    Yes  Yes     No           Yes
Neural network       Yes    Yes  Yes     Opt          Yes

Let’s create a base recipe()

cust_rec <- recipe(Churn ~ ., data = train) %>%
  step_corr(all_numeric_predictors(), threshold = 0.7, method = "spearman") %>% # drop highly correlated numerics
  step_zv(all_numeric_predictors()) %>% # drop zero-variance columns
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

summary(cust_rec)
## # A tibble: 19 × 4
##    variable         type    role      source  
##    <chr>            <chr>   <chr>     <chr>   
##  1 gender           nominal predictor original
##  2 SeniorCitizen    nominal predictor original
##  3 Partner          nominal predictor original
##  4 Dependents       nominal predictor original
##  5 tenure           numeric predictor original
##  6 PhoneService     nominal predictor original
##  7 MultipleLines    nominal predictor original
##  8 InternetService  nominal predictor original
##  9 OnlineSecurity   nominal predictor original
## 10 OnlineBackup     nominal predictor original
## 11 DeviceProtection nominal predictor original
## 12 TechSupport      nominal predictor original
## 13 StreamingTV      nominal predictor original
## 14 StreamingMovies  nominal predictor original
## 15 Contract         nominal predictor original
## 16 PaperlessBilling nominal predictor original
## 17 PaymentMethod    nominal predictor original
## 18 MonthlyCharges   numeric predictor original
## 19 Churn            nominal outcome   original

If we would like to check whether all of our pre-processing steps above actually worked, we can proceed as follows:

cust_prep <- 
  cust_rec %>% # use the recipe object
  prep() %>% # estimate preprocessing parameters from the training data
  juice() # extract the pre-processed training data
# cust_prep
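
A glimpse() of the result should show normalized numeric columns and 0/1 dummy columns in place of the factors:

glimpse(cust_prep)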

Build models

Specify model

# Logistic Regression
log_spec <- # your model specification
  logistic_reg() %>%  # model type
  set_engine(engine = "glm") %>%  # model engine
  set_mode("classification") # model mode

# Random Forest
rf_spec <- 
  rand_forest() %>% 
  # importance = "impurity" will provide variable importance score for Random Forest model
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# XGBoost
#library(xgboost)
xgb_spec <- 
  boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("classification") 

# K-nearest neighbor
knn_spec <- 
  nearest_neighbor(neighbors = 4) %>% # we can adjust the number of neighbors 
  set_engine("kknn") %>% 
  set_mode("classification") 

# Neural network
library(keras)
nnet_spec <-
  mlp() %>%
  set_mode("classification") %>% 
  set_engine("keras", verbose = FALSE) # verbose to prevent logging results

Create Workflow

# Logistic Regression
log_wf <- # new workflow object
 workflow() %>% # use workflow function
 add_recipe(cust_rec) %>%   # use the new recipe
 add_model(log_spec)   # add your model spec

# Random Forest
rf_wf <-
 workflow() %>%
 add_recipe(cust_rec) %>% 
 add_model(rf_spec) 

# XGBoost
xgb_wf <-
 workflow() %>%
 add_recipe(cust_rec) %>% 
 add_model(xgb_spec)

# K-nearest neighbor
knn_wf <-
 workflow() %>%
 add_recipe(cust_rec) %>% 
 add_model(knn_spec)

# Neural Network
nnet_wf <-
 workflow() %>%
 add_recipe(cust_rec) %>% 
 add_model(nnet_spec)
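
As an aside: since all five models share one recipe, the workflowsets package can bundle them into a single set and fit them with one workflow_map() call. A minimal sketch (this tutorial keeps the explicit workflows for clarity):

# Optional: bundle the shared recipe and model specs into a workflow set
library(workflowsets)
all_wf <- workflow_set(
  preproc = list(rec = cust_rec),
  models  = list(LG = log_spec, RF = rf_spec, XGB = xgb_spec,
                 Knn = knn_spec, Nnet = nnet_spec)
)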

Fit models

We use the resampled evaluation sets (folds) to estimate the performance of our models with fit_resamples().
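
All five fit_resamples() calls below use the same metric list, so it can be defined once with metric_set() and passed as metrics = churn_metrics; the calls below spell the list out so each block stands alone:

# Optional refactor: one shared metric set for every model
churn_metrics <- metric_set(recall, precision, f_meas,
                            accuracy, kap, roc_auc, sens, spec)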

doParallel::registerDoParallel()
# Logistic regression
log_fit <- fit_resamples(log_wf,
                         resamples = folds,
                         metrics = metric_set(recall, precision, f_meas,
                                              accuracy, kap, roc_auc, sens,
                                              spec),
                         control = control_resamples(save_pred = TRUE))

# Random Forest
rf_fit <- fit_resamples(rf_wf,
                        resamples = folds,
                        metrics = metric_set(recall, precision, f_meas,
                                              accuracy, kap, roc_auc, sens,
                                              spec),
                        control = control_resamples(save_pred = TRUE))

# XGBoost
xgb_fit <- fit_resamples(xgb_wf,
                         resamples = folds,
                         metrics = metric_set(recall, precision, f_meas,
                                              accuracy, kap, roc_auc, sens,
                                              spec),
                         control = control_resamples(save_pred = TRUE))

# K-nearest neighbors
knn_fit <- fit_resamples(knn_wf,
                         resamples = folds,
                         metrics = metric_set(recall, precision, f_meas,
                                              accuracy, kap, roc_auc, sens,
                                              spec),
                         control = control_resamples(save_pred = TRUE))

# Neural Network
nnet_fit <- fit_resamples(nnet_wf,
                          resamples = folds,
                          metrics = metric_set(recall, precision, f_meas,
                                               accuracy, kap, roc_auc, sens,
                                               spec),
                          control = control_resamples(save_pred = TRUE))

Evaluate models

Compare models

After fitting these five models, it’s time to evaluate their performance with the collect_metrics() function.

log_metrics <- 
  log_fit %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "LG") # add the name of the model to every row

rf_metrics <- 
  rf_fit %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "RF")

xgb_metrics <- 
  xgb_fit %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "XGB")

knn_metrics <- 
  knn_fit %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Knn")

nnet_metrics <- 
  nnet_fit %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Nnet")

# create dataframe with all models
model_compare <- bind_rows(log_metrics,
                           rf_metrics,
                           xgb_metrics,
                           knn_metrics,
                           nnet_metrics
                           ) 

# reshape to wide format: one row per model, one column per metric
model_comp <- 
  model_compare %>% 
  select(model, .metric, mean) %>% 
  pivot_wider(names_from = .metric, values_from = mean) 

model_comp %>% 
  knitr::kable(caption = "Metric evaluation",digits = 3) %>%
  kableExtra::kable_styling()
Metric evaluation

model  accuracy  f_meas    kap  precision  recall  roc_auc   sens   spec
LG        0.806   0.872  0.473      0.846   0.899    0.843  0.899  0.549
RF        0.801   0.870  0.450      0.836   0.907    0.841  0.907  0.509
XGB       0.797   0.866  0.448      0.839   0.895    0.842  0.895  0.526
Knn       0.716   0.804  0.287      0.815   0.795    0.745  0.795  0.499
Nnet      0.716   0.804  0.287      0.815   0.795    0.745  0.795  0.499

ROC curve
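
Because we saved the out-of-fold predictions (save_pred = TRUE), we can also compare the models visually by overlaying their cross-validated ROC curves. A minimal sketch:

# Combine out-of-fold predictions and plot one ROC curve per model
bind_rows(
  collect_predictions(log_fit)  %>% mutate(model = "LG"),
  collect_predictions(rf_fit)   %>% mutate(model = "RF"),
  collect_predictions(xgb_fit)  %>% mutate(model = "XGB"),
  collect_predictions(knn_fit)  %>% mutate(model = "Knn"),
  collect_predictions(nnet_fit) %>% mutate(model = "Nnet")
) %>%
  group_by(model) %>%
  roc_curve(Churn, .pred_No) %>% # "No" is the first factor level, the default event
  autoplot()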

We note that the metric results of the models are all quite similar. In this example, we choose the F1 score as the performance measure for selecting the best model. Let’s find the model with the maximum mean F1 score:

model_comp %>% slice_max(f_meas)
## # A tibble: 1 × 9
##   model accuracy f_meas   kap precision recall roc_auc  sens  spec
##   <chr>    <dbl>  <dbl> <dbl>     <dbl>  <dbl>   <dbl> <dbl> <dbl>
## 1 LG       0.806  0.872 0.473     0.846  0.899   0.843 0.899 0.549

The Logistic Regression model is therefore suggested as the selected model!

Finalize model

The final step is to fit the selected model - in this case, Logistic Regression - on the whole training data and evaluate it on the test set; last_fit() does both in one call.

# Logistic Regression
log_final <- last_fit(
  log_wf,
  split = cust_split,
  metrics = metric_set(recall, precision, f_meas,
                         accuracy, kap, roc_auc, sens,spec)
)

We can then check the out-of-sample performance of the selected model.

Performance metrics

# On testing
log_final %>%
  collect_metrics() %>% arrange(.metric)
## # A tibble: 8 × 4
##   .metric   .estimator .estimate .config             
##   <chr>     <chr>          <dbl> <chr>               
## 1 accuracy  binary         0.799 Preprocessor1_Model1
## 2 f_meas    binary         0.867 Preprocessor1_Model1
## 3 kap       binary         0.458 Preprocessor1_Model1
## 4 precision binary         0.844 Preprocessor1_Model1
## 5 recall    binary         0.891 Preprocessor1_Model1
## 6 roc_auc   binary         0.842 Preprocessor1_Model1
## 7 sens      binary         0.891 Preprocessor1_Model1
## 8 spec      binary         0.545 Preprocessor1_Model1

The performance metrics on the unseen test set are slightly lower than on the training resamples, as expected:

# On training data set
log_fit %>% 
  collect_metrics() %>% arrange(.metric)
## # A tibble: 8 × 6
##   .metric   .estimator  mean     n std_err .config             
##   <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy  binary     0.806    10 0.00574 Preprocessor1_Model1
## 2 f_meas    binary     0.872    10 0.00401 Preprocessor1_Model1
## 3 kap       binary     0.473    10 0.0149  Preprocessor1_Model1
## 4 precision binary     0.846    10 0.00395 Preprocessor1_Model1
## 5 recall    binary     0.899    10 0.00643 Preprocessor1_Model1
## 6 roc_auc   binary     0.843    10 0.00527 Preprocessor1_Model1
## 7 sens      binary     0.899    10 0.00643 Preprocessor1_Model1
## 8 spec      binary     0.549    10 0.0136  Preprocessor1_Model1

Confusion Matrix

log_final %>%
  collect_predictions() %>%
  conf_mat(Churn, .pred_class)
##           Truth
## Prediction  No Yes
##        No  920 170
##        Yes 113 204
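
The matrix makes the model’s weakness concrete: 170 of the 374 actual churners are predicted as No, which matches the low specificity (204/374 ≈ 0.545) reported above. yardstick can also draw the matrix as a heatmap:

log_final %>%
  collect_predictions() %>%
  conf_mat(Churn, .pred_class) %>%
  autoplot(type = "heatmap")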

Variable importance

Plot variable importance scores for the predictors in the model. For a glm engine, vip() ranks predictors by default using the absolute value of the coefficient’s z-statistic.

library(vip)
log_final %>%
  extract_fit_engine() %>% # extract engine from the final model
  vip(geom = "point") + 
  labs(title = "Variable importance scores",
               subtitle = "Logistic Regression")