In this tutorial, we’ll predict Telco Customer Churn by building the following classification models with the tidymodels framework:
Logistic Regression,
Random Forest,
XGBoost (extreme gradient boosted trees),
K-nearest neighbor,
Neural network
The Telco Customer Churn data is retrieved from an IBM sample repository:
library(tidyverse)

customer <- read_csv("https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv")

customer <- customer %>%
  mutate_if(is.character, factor) # convert all character columns to factors
Plot a summary of all variables with DataExplorer:
library(DataExplorer)
plot_intro(customer)
Summary of the customer data (excluding customerID):
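The summary below has the shape of skimr output; assuming the skimr package is available, it could be reproduced with something like:

library(skimr)

# Skim everything except the ID column
skim(customer[, -1])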
| Data summary | |
|---|---|
| Name | customer[, -1] |
| Number of rows | 7043 |
| Number of columns | 20 |
| Column type frequency: factor | 16 |
| Column type frequency: numeric | 4 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts | pct |
|---|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | Mal: 3555, Fem: 3488 | Mal: 0.50, Fem: 0.50 |
| Partner | 0 | 1 | FALSE | 2 | No: 3641, Yes: 3402 | No: 0.52, Yes: 0.48 |
| Dependents | 0 | 1 | FALSE | 2 | No: 4933, Yes: 2110 | No: 0.70, Yes: 0.30 |
| PhoneService | 0 | 1 | FALSE | 2 | Yes: 6361, No: 682 | Yes: 0.90, No: 0.10 |
| MultipleLines | 0 | 1 | FALSE | 3 | No: 3390, Yes: 2971, No : 682 | No: 0.48, Yes: 0.42, No : 0.10 |
| InternetService | 0 | 1 | FALSE | 3 | Fib: 3096, DSL: 2421, No: 1526 | Fib: 0.44, DSL: 0.34, No: 0.22 |
| OnlineSecurity | 0 | 1 | FALSE | 3 | No: 3498, Yes: 2019, No : 1526 | No: 0.50, Yes: 0.29, No : 0.22 |
| OnlineBackup | 0 | 1 | FALSE | 3 | No: 3088, Yes: 2429, No : 1526 | No: 0.44, Yes: 0.34, No : 0.22 |
| DeviceProtection | 0 | 1 | FALSE | 3 | No: 3095, Yes: 2422, No : 1526 | No: 0.44, Yes: 0.34, No : 0.22 |
| TechSupport | 0 | 1 | FALSE | 3 | No: 3473, Yes: 2044, No : 1526 | No: 0.49, Yes: 0.29, No : 0.22 |
| StreamingTV | 0 | 1 | FALSE | 3 | No: 2810, Yes: 2707, No : 1526 | No: 0.40, Yes: 0.38, No : 0.22 |
| StreamingMovies | 0 | 1 | FALSE | 3 | No: 2785, Yes: 2732, No : 1526 | No: 0.40, Yes: 0.39, No : 0.22 |
| Contract | 0 | 1 | FALSE | 3 | Mon: 3875, Two: 1695, One: 1473 | Mon: 0.55, Two: 0.24, One: 0.21 |
| PaperlessBilling | 0 | 1 | FALSE | 2 | Yes: 4171, No: 2872 | Yes: 0.59, No: 0.41 |
| PaymentMethod | 0 | 1 | FALSE | 4 | Ele: 2365, Mai: 1612, Ban: 1544, Cre: 1522 | Ele: 0.34, Mai: 0.23, Ban: 0.22, Cre: 0.22 |
| Churn | 0 | 1 | FALSE | 2 | No: 5174, Yes: 1869 | No: 0.73, Yes: 0.27 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| tenure | 0 | 1 | 32.37 | 24.56 | 0.00 | 9.00 | 29.00 | 55.00 | 72.00 | ▇▃▃▃▆ |
| MonthlyCharges | 0 | 1 | 64.76 | 30.09 | 18.25 | 35.50 | 70.35 | 89.85 | 118.75 | ▇▅▆▇▅ |
| TotalCharges | 11 | 1 | 2283.30 | 2266.77 | 18.80 | 401.45 | 1397.47 | 3794.74 | 8684.80 | ▇▂▂▂▁ |
About 80% of the variables are categorical, which makes tree-based algorithms a good fit.
There are no fully missing columns and the amount of missing data is small: only 11 NAs in TotalCharges (~0.0074% of all values), so we can safely drop those rows before modeling.
Looking at the outcome Churn, the data is imbalanced, with a Churn/No-Churn split of roughly 27%/73%.
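These figures can be verified with a quick check (output omitted here):

# Missing values per column and class balance of the outcome
colSums(is.na(customer))
customer %>%
  count(Churn) %>%
  mutate(prop = n / sum(n))

Next, plot each categorical variable against Churn: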
plot_bar(customer, by = "Churn")
Customers using Fiber optic internet, or without Online Security, Online Backup, or Tech Support, are more likely to churn.
Customers on a Month-to-month contract or paying by Electronic check are also more likely to leave.
We can also see that gender, PhoneService, MultipleLines, StreamingTV, and StreamingMovies show very similar patterns across the churn groups, so they might not contribute much to the predictive model.
With ExpCatStat(), a function that provides summary statistics for character or categorical columns, we can see that most of the variables mentioned above show a Weak or Very Weak Degree of Association.
library(SmartEDA)
library(flextable)
ExpCatStat(customer,
Target = "Churn") %>%
flextable()
| Variable | Target | Unique | Chi-squared | p-value | df | Cramers V | Degree of Association | Predictive Power |
|---|---|---|---|---|---|---|---|---|
| gender | Churn | 2 | 0.522 | 0.484 | 0 | 0.01 | Very Weak | Not Predictive |
| Partner | Churn | 2 | 159.415 | 0.000 | 0 | 0.15 | Weak | Not Predictive |
| Dependents | Churn | 2 | 189.940 | 0.000 | 0 | 0.16 | Weak | Not Predictive |
| PhoneService | Churn | 2 | 1.004 | 0.335 | 0 | 0.01 | Very Weak | Not Predictive |
| MultipleLines | Churn | 3 | 11.330 | 0.002 | 0 | 0.04 | Very Weak | Not Predictive |
| InternetService | Churn | 3 | 732.310 | 0.000 | 0 | 0.32 | Strong | Not Predictive |
| OnlineSecurity | Churn | 3 | 849.999 | 0.000 | 0 | 0.35 | Strong | Not Predictive |
| OnlineBackup | Churn | 3 | 601.813 | 0.000 | 0 | 0.29 | Moderate | Not Predictive |
| DeviceProtection | Churn | 3 | 558.419 | 0.000 | 0 | 0.28 | Moderate | Not Predictive |
| TechSupport | Churn | 3 | 828.197 | 0.000 | 0 | 0.34 | Strong | Not Predictive |
| StreamingTV | Churn | 3 | 374.204 | 0.000 | 0 | 0.23 | Moderate | Not Predictive |
| StreamingMovies | Churn | 3 | 375.661 | 0.000 | 0 | 0.23 | Moderate | Not Predictive |
| Contract | Churn | 3 | 1184.597 | 0.000 | 0 | 0.41 | Strong | Not Predictive |
| PaperlessBilling | Churn | 2 | 259.161 | 0.000 | 0 | 0.19 | Moderate | Not Predictive |
| PaymentMethod | Churn | 4 | 648.142 | 0.000 | 0 | 0.30 | Strong | Not Predictive |
| SeniorCitizen | Churn | 2 | 160.352 | 0.000 | 0 | 0.15 | Weak | Not Predictive |
| tenure | Churn | 10 | 990.439 | 0.000 | 0 | 0.38 | Strong | Not Predictive |
| MonthlyCharges | Churn | 10 | 423.962 | 0.000 | 0 | 0.25 | Moderate | Not Predictive |
| TotalCharges | Churn | 9 | 164.522 | 0.000 | 0 | 0.17 | Weak | Not Predictive |
library(GGally)

customer %>%
  select(where(is.numeric), Churn) %>%
  ggpairs(aes(color = fct_rev(Churn)),
          diag = list(continuous = wrap("densityDiag", alpha = 0.6),
                      discrete = wrap("barDiag", alpha = 0.7, color = "grey30")))
TotalCharges looks highly correlated with tenure, especially for customers who have churned, and it is also highly correlated with MonthlyCharges. We can therefore omit TotalCharges to reduce multicollinearity.
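As a quick numeric check (output not shown), we can compute the pairwise Spearman correlations directly:

# Spearman correlations between the numeric predictors;
# pairwise.complete.obs handles the 11 missing TotalCharges values
customer %>%
  select(tenure, MonthlyCharges, TotalCharges) %>%
  cor(use = "pairwise.complete.obs", method = "spearman")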
We can also see:
tenure is relatively uniform for customers who have not churned, and skewed toward low values for customers who have churned. Churn is higher at low tenure, and loyalty increases at tenure > 60.
The distribution of MonthlyCharges differs between the groups: when MonthlyCharges is high, churn is also high.
For TotalCharges, the distribution is skewed toward lower values for both churn groups, with more churn at lower total charges.
SeniorCitizen takes only two values, so we will treat it as a factor rather than a numeric variable.
customer %>%
select(Churn,PhoneService,tenure,MonthlyCharges,TotalCharges) %>%
pivot_longer(tenure:TotalCharges, names_to = "stat", values_to = "value") %>%
ggplot(aes(PhoneService,value, fill = Churn, color = Churn)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~stat, scales = "free_y", nrow = 2) +
labs(y=NULL, color=NULL, fill=NULL)
customer <- customer %>%
  na.omit() %>%                              # drop the 11 rows with missing TotalCharges
  mutate_at(vars(SeniorCitizen), factor) %>% # treat SeniorCitizen as a factor
  select(-customerID, -TotalCharges)         # drop the ID and the collinear TotalCharges
Split the data into training and testing sets, then create resamples from the training set using cross-validation with vfold_cv():
library(tidymodels)
set.seed(123)
cust_split <- customer %>%
initial_split(prop = 0.8, strata = Churn)
train <- training(cust_split)
test <- testing(cust_split)
# Cross validation folds from training dataset
set.seed(234)
folds <- vfold_cv(train, strata = Churn)
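We can quickly confirm that the stratified split preserved the Churn ratio in both sets (output omitted):

# Check the class balance in the training and testing sets
train %>% count(Churn) %>% mutate(prop = n / sum(n))
test  %>% count(Churn) %>% mutate(prop = n / sum(n))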
The type of data preprocessing is dependent on the data and the type of model being fit. The excellent book “Tidy Modeling with R” provides an appendix with recommendations for baseline levels of pre-processing that are needed for various model functions.
Based on the models we are going to use, the following pre-processing steps are recommended.
| Model | dummy | nz | impute | decorrelate | normalize |
|---|---|---|---|---|---|
| Logistic Regression | Yes | Yes | Yes | Yes | No |
| Random Forest | No | Opt | Opt | Opt | No |
| XGBoost | Yes | Yes | Opt | Opt | No |
| Knn | Yes | Yes | Yes | No | Yes |
| Neural network | Yes | Yes | Yes | Yes | Yes |
Let’s create a base recipe() that all of the models can share; the normalization and dummy-variable steps are not strictly required for the tree-based models.
cust_rec <- recipe(Churn ~ ., data = train) %>%
  step_corr(all_numeric(), threshold = 0.7, method = "spearman") %>% # remove highly correlated numeric predictors
  step_zv(all_numeric()) %>%           # filter zero-variance predictors
  step_normalize(all_numeric()) %>%    # center and scale numeric predictors
  step_dummy(all_nominal_predictors()) # convert nominal predictors to dummy variables
summary(cust_rec)
## # A tibble: 19 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 gender nominal predictor original
## 2 SeniorCitizen nominal predictor original
## 3 Partner nominal predictor original
## 4 Dependents nominal predictor original
## 5 tenure numeric predictor original
## 6 PhoneService nominal predictor original
## 7 MultipleLines nominal predictor original
## 8 InternetService nominal predictor original
## 9 OnlineSecurity nominal predictor original
## 10 OnlineBackup nominal predictor original
## 11 DeviceProtection nominal predictor original
## 12 TechSupport nominal predictor original
## 13 StreamingTV nominal predictor original
## 14 StreamingMovies nominal predictor original
## 15 Contract nominal predictor original
## 16 PaperlessBilling nominal predictor original
## 17 PaymentMethod nominal predictor original
## 18 MonthlyCharges numeric predictor original
## 19 Churn nominal outcome original
If we would like to check whether all of our pre-processing steps from above actually worked, we can proceed as follows:
cust_prep <-
cust_rec %>% # use the recipe object
prep() %>% # perform the recipe on training data
juice() # extract only the pre-processed dataframe
# cust_prep
# Logistic Regression
log_spec <- # your model specification
logistic_reg() %>% # model type
set_engine(engine = "glm") %>% # model engine
set_mode("classification") # model mode
# Random Forest
rf_spec <-
rand_forest() %>%
# importance = "impurity" will provide variable importance score for Random Forest model
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
# XGBoost
#library(xgboost)
xgb_spec <-
boost_tree() %>%
set_engine("xgboost") %>%
set_mode("classification")
# K-nearest neighbor
knn_spec <-
nearest_neighbor(neighbors = 4) %>% # we can adjust the number of neighbors
set_engine("kknn") %>%
set_mode("classification")
#Neural network
library(keras)
nnet_spec <-
mlp() %>%
set_mode("classification") %>%
set_engine("keras", verbose = FALSE) # verbose to prevent logging results
# Logistic Regression
log_wf <- # new workflow object
workflow() %>% # use workflow function
add_recipe(cust_rec) %>% # use the new recipe
add_model(log_spec) # add your model spec
# Random Forest
rf_wf <-
workflow() %>%
add_recipe(cust_rec) %>%
add_model(rf_spec)
# XGBoost
xgb_wf <-
workflow() %>%
add_recipe(cust_rec) %>%
add_model(xgb_spec)
# K-nearest neighbor
knn_wf <-
workflow() %>%
add_recipe(cust_rec) %>%
add_model(knn_spec)
# Neural Network
nnet_wf <-
workflow() %>%
add_recipe(cust_rec) %>%
add_model(nnet_spec)
We use the resampled evaluation sets (folds) to estimate the performance of our models with fit_resamples():
doParallel::registerDoParallel()
# Logistic regression
log_fit <- fit_resamples(log_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
# Random Forest
rf_fit <- fit_resamples(rf_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
# XGBoost
xgb_fit <- fit_resamples(xgb_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
# K-nearest neighbors
knn_fit <- fit_resamples(knn_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
# Neural Network
nnet_fit <- fit_resamples(nnet_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
After running these five models, it’s time to evaluate their performance with the collect_metrics() function.
log_metrics <-
log_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "LG") # add the name of the model to every row
rf_metrics <-
rf_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "RF")
xgb_metrics <-
xgb_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "XGB")
knn_metrics <-
knn_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "Knn")
nnet_metrics <-
nnet_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "Nnet")
# create dataframe with all models
model_compare <- bind_rows(log_metrics,
rf_metrics,
xgb_metrics,
knn_metrics,
nnet_metrics
)
# change data structure
model_comp <-
model_compare %>%
select(model, .metric, mean) %>%
pivot_wider(names_from = .metric, values_from = mean)
model_comp %>%
knitr::kable(caption = "Metric evaluation",digits = 3) %>%
kableExtra::kable_styling()
| model | accuracy | f_meas | kap | precision | recall | roc_auc | sens | spec |
|---|---|---|---|---|---|---|---|---|
| LG | 0.806 | 0.872 | 0.473 | 0.846 | 0.899 | 0.843 | 0.899 | 0.549 |
| RF | 0.801 | 0.870 | 0.450 | 0.836 | 0.907 | 0.841 | 0.907 | 0.509 |
| XGB | 0.797 | 0.866 | 0.448 | 0.839 | 0.895 | 0.842 | 0.895 | 0.526 |
| Knn | 0.716 | 0.804 | 0.287 | 0.815 | 0.795 | 0.745 | 0.795 | 0.499 |
| Nnet | 0.716 | 0.804 | 0.287 | 0.815 | 0.795 | 0.745 | 0.795 | 0.499 |
We note that the metric results across the models are all quite similar. In this example, we choose the F1-score as the performance measure for selecting the best model. Let’s find the maximum mean F1-score:
model_comp %>% slice_max(f_meas)
## # A tibble: 1 × 9
## model accuracy f_meas kap precision recall roc_auc sens spec
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 LG 0.806 0.872 0.473 0.846 0.899 0.843 0.899 0.549
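For reference, f_meas is the F1-score, the harmonic mean of precision and recall; the LG value can be checked directly from the table:

# F1 = 2 * precision * recall / (precision + recall)
2 * 0.846 * 0.899 / (0.846 + 0.899) # ~ 0.872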
The Logistic Regression model is therefore selected!
The final step is to fit the selected model, in this case Logistic Regression, to the whole training data with last_fit() and evaluate it on the test set.
# Logistic Regression
log_final <- last_fit(
log_wf,
split = cust_split,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,spec)
)
We can then check the out-of-sample performance of the selected model:
# On testing
log_final %>%
collect_metrics() %>% arrange(.metric)
## # A tibble: 8 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.799 Preprocessor1_Model1
## 2 f_meas binary 0.867 Preprocessor1_Model1
## 3 kap binary 0.458 Preprocessor1_Model1
## 4 precision binary 0.844 Preprocessor1_Model1
## 5 recall binary 0.891 Preprocessor1_Model1
## 6 roc_auc binary 0.842 Preprocessor1_Model1
## 7 sens binary 0.891 Preprocessor1_Model1
## 8 spec binary 0.545 Preprocessor1_Model1
The performance metrics on the unseen testing data are slightly lower than on the training resamples:
# On training data set
log_fit %>%
collect_metrics() %>% arrange(.metric)
## # A tibble: 8 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.806 10 0.00574 Preprocessor1_Model1
## 2 f_meas binary 0.872 10 0.00401 Preprocessor1_Model1
## 3 kap binary 0.473 10 0.0149 Preprocessor1_Model1
## 4 precision binary 0.846 10 0.00395 Preprocessor1_Model1
## 5 recall binary 0.899 10 0.00643 Preprocessor1_Model1
## 6 roc_auc binary 0.843 10 0.00527 Preprocessor1_Model1
## 7 sens binary 0.899 10 0.00643 Preprocessor1_Model1
## 8 spec binary 0.549 10 0.0136 Preprocessor1_Model1
log_final %>%
collect_predictions() %>%
conf_mat(Churn,.pred_class)
## Truth
## Prediction No Yes
## No 920 170
## Yes 113 204
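The confusion matrix shows that 170 of the 374 churners in the test set are missed, which is consistent with the low specificity of 0.545. We can also plot it as a heatmap:

log_final %>%
  collect_predictions() %>%
  conf_mat(Churn, .pred_class) %>%
  autoplot(type = "heatmap") # yardstick also supports type = "mosaic"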
Finally, let’s plot variable importance scores for the predictors in the final model; the plot gives us a measure of how much each predictor contributes.
library(vip)
log_final %>%
extract_fit_engine() %>% # extract engine from the final model
vip(geom = "point") +
labs(title = "Variable importance scores",
subtitle = "Logistic Regression")
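For a glm engine, the vip() scores are derived from the absolute value of the coefficient z-statistics; as a follow-up, we can also inspect the fitted coefficients directly:

log_final %>%
  extract_fit_parsnip() %>% # pull the fitted parsnip model from last_fit()
  tidy() %>%                # coefficient estimates, std. errors, z-statistics, p-values
  arrange(p.value)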