In this tutorial, we’ll predict Telco Customer Churn by building the following classification models with the tidymodels framework:
Logistic Regression,
Random Forest,
XGBoost (extreme gradient boosted trees),
K-nearest neighbor,
Neural network
The Telco Customer Churn data is retrieved from an IBM sample repository:
library(tidyverse)

customer <- read_csv("https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv")

customer <- customer %>%
  mutate_if(is.character, factor) # convert all character columns to factors
Plot a summary of all variables with DataExplorer:
library(DataExplorer)
plot_intro(customer)
Summary of the customer data (excluding customerID):
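The summary below has the shape of skimr output; assuming the skimr package is available, it could be reproduced with something like:

library(skimr)

# Skim everything except the ID column
skim(customer[, -1])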
| Data summary | |
|---|---|
| Name | customer[, -1] |
| Number of rows | 7043 |
| Number of columns | 20 |
| Column type frequency: factor | 16 |
| Column type frequency: numeric | 4 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts | pct |
|---|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | Mal: 3555, Fem: 3488 | Mal: 0.50, Fem: 0.50 |
| Partner | 0 | 1 | FALSE | 2 | No: 3641, Yes: 3402 | No: 0.52, Yes: 0.48 |
| Dependents | 0 | 1 | FALSE | 2 | No: 4933, Yes: 2110 | No: 0.70, Yes: 0.30 |
| PhoneService | 0 | 1 | FALSE | 2 | Yes: 6361, No: 682 | Yes: 0.90, No: 0.10 |
| MultipleLines | 0 | 1 | FALSE | 3 | No: 3390, Yes: 2971, No : 682 | No: 0.48, Yes: 0.42, No : 0.10 |
| InternetService | 0 | 1 | FALSE | 3 | Fib: 3096, DSL: 2421, No: 1526 | Fib: 0.44, DSL: 0.34, No: 0.22 |
| OnlineSecurity | 0 | 1 | FALSE | 3 | No: 3498, Yes: 2019, No : 1526 | No: 0.50, Yes: 0.29, No : 0.22 |
| OnlineBackup | 0 | 1 | FALSE | 3 | No: 3088, Yes: 2429, No : 1526 | No: 0.44, Yes: 0.34, No : 0.22 |
| DeviceProtection | 0 | 1 | FALSE | 3 | No: 3095, Yes: 2422, No : 1526 | No: 0.44, Yes: 0.34, No : 0.22 |
| TechSupport | 0 | 1 | FALSE | 3 | No: 3473, Yes: 2044, No : 1526 | No: 0.49, Yes: 0.29, No : 0.22 |
| StreamingTV | 0 | 1 | FALSE | 3 | No: 2810, Yes: 2707, No : 1526 | No: 0.40, Yes: 0.38, No : 0.22 |
| StreamingMovies | 0 | 1 | FALSE | 3 | No: 2785, Yes: 2732, No : 1526 | No: 0.40, Yes: 0.39, No : 0.22 |
| Contract | 0 | 1 | FALSE | 3 | Mon: 3875, Two: 1695, One: 1473 | Mon: 0.55, Two: 0.24, One: 0.21 |
| PaperlessBilling | 0 | 1 | FALSE | 2 | Yes: 4171, No: 2872 | Yes: 0.59, No: 0.41 |
| PaymentMethod | 0 | 1 | FALSE | 4 | Ele: 2365, Mai: 1612, Ban: 1544, Cre: 1522 | Ele: 0.34, Mai: 0.23, Ban: 0.22, Cre: 0.22 |
| Churn | 0 | 1 | FALSE | 2 | No: 5174, Yes: 1869 | No: 0.73, Yes: 0.27 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| tenure | 0 | 1 | 32.37 | 24.56 | 0.00 | 9.00 | 29.00 | 55.00 | 72.00 | ▇▃▃▃▆ |
| MonthlyCharges | 0 | 1 | 64.76 | 30.09 | 18.25 | 35.50 | 70.35 | 89.85 | 118.75 | ▇▅▆▇▅ |
| TotalCharges | 11 | 1 | 2283.30 | 2266.77 | 18.80 | 401.45 | 1397.47 | 3794.74 | 8684.80 | ▇▂▂▂▁ |
About 80% of the variables are categorical, which makes tree-based algorithms a good fit.
There are no fully missing columns and the amount of missing data is small: only 11 NAs in TotalCharges (~0.0074% of all values), so we can safely drop those rows before modeling.
Looking at the outcome Churn, the data is imbalanced, with a Churn/No-Churn split of roughly 27%/73%.
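These figures can be verified with a quick check (output omitted here):

# Missing values per column and class balance of the outcome
colSums(is.na(customer))
customer %>%
  count(Churn) %>%
  mutate(prop = n / sum(n))

Next, plot each categorical variable against Churn: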
plot_bar(customer, by = "Churn")
Customers using Fiber optic internet, or without Online Security, Online Backup, or Tech Support, are more likely to churn.
Customers on a Month-to-month contract or paying by Electronic check are also more likely to leave.
We can also see that gender, PhoneService, MultipleLines, StreamingTV, and StreamingMovies show very similar patterns across the churn groups, so they might not contribute much to the predictive model.
With ExpCatStat(), a function that provides summary statistics for character or categorical columns, we can see that most of the variables mentioned above show a Weak or Very Weak Degree of Association.
library(SmartEDA)
library(flextable)
ExpCatStat(customer,
Target = "Churn") %>%
flextable()
| Variable | Target | Unique | Chi-squared | p-value | df | Cramers V | Degree of Association | Predictive Power |
|---|---|---|---|---|---|---|---|---|
| gender | Churn | 2 | 0.522 | 0.484 | 0 | 0.01 | Very Weak | Not Predictive |
| Partner | Churn | 2 | 159.415 | 0.000 | 0 | 0.15 | Weak | Not Predictive |
| Dependents | Churn | 2 | 189.940 | 0.000 | 0 | 0.16 | Weak | Not Predictive |
| PhoneService | Churn | 2 | 1.004 | 0.335 | 0 | 0.01 | Very Weak | Not Predictive |
| MultipleLines | Churn | 3 | 11.330 | 0.002 | 0 | 0.04 | Very Weak | Not Predictive |
| InternetService | Churn | 3 | 732.310 | 0.000 | 0 | 0.32 | Strong | Not Predictive |
| OnlineSecurity | Churn | 3 | 849.999 | 0.000 | 0 | 0.35 | Strong | Not Predictive |
| OnlineBackup | Churn | 3 | 601.813 | 0.000 | 0 | 0.29 | Moderate | Not Predictive |
| DeviceProtection | Churn | 3 | 558.419 | 0.000 | 0 | 0.28 | Moderate | Not Predictive |
| TechSupport | Churn | 3 | 828.197 | 0.000 | 0 | 0.34 | Strong | Not Predictive |
| StreamingTV | Churn | 3 | 374.204 | 0.000 | 0 | 0.23 | Moderate | Not Predictive |
| StreamingMovies | Churn | 3 | 375.661 | 0.000 | 0 | 0.23 | Moderate | Not Predictive |
| Contract | Churn | 3 | 1184.597 | 0.000 | 0 | 0.41 | Strong | Not Predictive |
| PaperlessBilling | Churn | 2 | 259.161 | 0.000 | 0 | 0.19 | Moderate | Not Predictive |
| PaymentMethod | Churn | 4 | 648.142 | 0.000 | 0 | 0.30 | Strong | Not Predictive |
| SeniorCitizen | Churn | 2 | 160.352 | 0.000 | 0 | 0.15 | Weak | Not Predictive |
| tenure | Churn | 10 | 990.439 | 0.000 | 0 | 0.38 | Strong | Not Predictive |
| MonthlyCharges | Churn | 10 | 423.962 | 0.000 | 0 | 0.25 | Moderate | Not Predictive |
| TotalCharges | Churn | 9 | 164.522 | 0.000 | 0 | 0.17 | Weak | Not Predictive |
library(GGally)

customer %>%
  select(where(is.numeric), Churn) %>%
  ggpairs(aes(color = fct_rev(Churn)),
          diag = list(continuous = wrap("densityDiag", alpha = 0.6),
                      discrete = wrap("barDiag", alpha = 0.7, color = "grey30")))
TotalCharges looks highly correlated with tenure, especially for customers who have churned, and it is also highly correlated with MonthlyCharges. We can therefore omit TotalCharges to reduce multicollinearity.
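As a quick numeric check (output not shown), we can compute the pairwise Spearman correlations directly:

# Spearman correlations between the numeric predictors;
# pairwise.complete.obs handles the 11 missing TotalCharges values
customer %>%
  select(tenure, MonthlyCharges, TotalCharges) %>%
  cor(use = "pairwise.complete.obs", method = "spearman")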
We can also see:
tenure is relatively uniform for customers who have not churned, and skewed toward low values for customers who have churned. Churn is higher at low tenure, and loyalty increases at tenure > 60.
The distribution of MonthlyCharges differs between the groups: when MonthlyCharges is high, churn is also high.
For TotalCharges, the distribution is skewed toward lower values for both churn groups, with more churn at lower total charges.
SeniorCitizen takes only two values, so we will treat it as a factor rather than a numeric variable.
customer %>%
select(Churn,PhoneService,tenure,MonthlyCharges,TotalCharges) %>%
pivot_longer(tenure:TotalCharges, names_to = "stat", values_to = "value") %>%
ggplot(aes(PhoneService,value, fill = Churn, color = Churn)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~stat, scales = "free_y", nrow = 2) +
labs(y=NULL, color=NULL, fill=NULL)
customer <- customer %>%
  na.omit() %>%                              # drop the 11 rows with missing TotalCharges
  mutate_at(vars(SeniorCitizen), factor) %>% # treat SeniorCitizen as a factor
  select(-customerID, -TotalCharges)         # drop the ID and the collinear TotalCharges
Split the data into training and testing sets, then create resamples from the training set using cross-validation with vfold_cv():
library(tidymodels)
set.seed(123)
cust_split <- customer %>%
initial_split(prop = 0.8, strata = Churn)
train <- training(cust_split)
test <- testing(cust_split)
# Cross validation folds from training dataset
set.seed(234)
folds <- vfold_cv(train, strata = Churn)
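We can quickly confirm that the stratified split preserved the Churn ratio in both sets (output omitted):

# Check the class balance in the training and testing sets
train %>% count(Churn) %>% mutate(prop = n / sum(n))
test  %>% count(Churn) %>% mutate(prop = n / sum(n))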
The type of data preprocessing is dependent on the data and the type of model being fit. The excellent book “Tidy Modeling with R” provides an appendix with recommendations for baseline levels of pre-processing that are needed for various model functions.
Based on the models we are going to use, the following pre-processing steps are recommended.
| Model | dummy | nz | impute | decorrelate | normalize |
|---|---|---|---|---|---|
| Logistic Regression | Yes | Yes | Yes | Yes | No |
| Random Forest | No | Opt | Opt | Opt | No |
| XGBoost | Yes | Yes | Opt | Opt | No |
| Knn | Yes | Yes | Yes | No | Yes |
| Neural network | Yes | Yes | Yes | Yes | Yes |
Let’s create a base recipe() that all of the models can share; the normalization and dummy-variable steps are not strictly required for the tree-based models.
cust_rec <- recipe(Churn ~ ., data = train) %>%
  step_corr(all_numeric(), threshold = 0.7, method = "spearman") %>% # remove highly correlated numeric predictors
  step_zv(all_numeric()) %>%           # filter zero-variance predictors
  step_normalize(all_numeric()) %>%    # center and scale numeric predictors
  step_dummy(all_nominal_predictors()) # convert nominal predictors to dummy variables
summary(cust_rec)
## # A tibble: 19 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 gender nominal predictor original
## 2 SeniorCitizen nominal predictor original
## 3 Partner nominal predictor original
## 4 Dependents nominal predictor original
## 5 tenure numeric predictor original
## 6 PhoneService nominal predictor original
## 7 MultipleLines nominal predictor original
## 8 InternetService nominal predictor original
## 9 OnlineSecurity nominal predictor original
## 10 OnlineBackup nominal predictor original
## 11 DeviceProtection nominal predictor original
## 12 TechSupport nominal predictor original
## 13 StreamingTV nominal predictor original
## 14 StreamingMovies nominal predictor original
## 15 Contract nominal predictor original
## 16 PaperlessBilling nominal predictor original
## 17 PaymentMethod nominal predictor original
## 18 MonthlyCharges numeric predictor original
## 19 Churn nominal outcome original
If we would like to check whether all of our pre-processing steps from above actually worked, we can proceed as follows:
cust_prep <-
cust_rec %>% # use the recipe object
prep() %>% # perform the recipe on training data
juice() # extract only the pre-processed dataframe
# cust_prep
# Logistic Regression
log_spec <- # your model specification
logistic_reg() %>% # model type
set_engine(engine = "glm") %>% # model engine
set_mode("classification") # model mode
# Random Forest
rf_spec <-
rand_forest() %>%
# importance = "impurity" will provide variable importance score for Random Forest model
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
# XGBoost
#library(xgboost)
xgb_spec <-
boost_tree() %>%
set_engine("xgboost") %>%
set_mode("classification")
# K-nearest neighbor
knn_spec <-
nearest_neighbor(neighbors = 4) %>% # we can adjust the number of neighbors
set_engine("kknn") %>%
set_mode("classification")
#Neural network
library(keras)
nnet_spec <-
mlp() %>%
set_mode("classification") %>%
set_engine("keras", verbose = FALSE) # verbose to prevent logging results
# Logistic Regression
log_wf <- # new workflow object
workflow() %>% # use workflow function
add_recipe(cust_rec) %>% # use the new recipe
add_model(log_spec) # add your model spec
# Random Forest
rf_wf <-
workflow() %>%
add_recipe(cust_rec) %>%
add_model(rf_spec)
# XGBoost
xgb_wf <-
workflow() %>%
add_recipe(cust_rec) %>%
add_model(xgb_spec)
# K-nearest neighbor
knn_wf <-
workflow() %>%
add_recipe(cust_rec) %>%
add_model(knn_spec)
# Neural Network
nnet_wf <-
workflow() %>%
add_recipe(cust_rec) %>%
add_model(nnet_spec)
We use the resampled evaluation sets (folds) to estimate the performance of our models with fit_resamples():
doParallel::registerDoParallel()
# Logistic regression
log_fit <- fit_resamples(log_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
# Random Forest
rf_fit <- fit_resamples(rf_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
# XGBoost
xgb_fit <- fit_resamples(xgb_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
# K-nearest neighbors
knn_fit <- fit_resamples(knn_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
# Neural Network
nnet_fit <- fit_resamples(nnet_wf,
resamples = folds,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,
spec),
control = control_resamples(save_pred = TRUE))
After running these five models, it’s time to evaluate their performance with the collect_metrics() function.
log_metrics <-
log_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "LG") # add the name of the model to every row
rf_metrics <-
rf_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "RF")
xgb_metrics <-
xgb_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "XGB")
knn_metrics <-
knn_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "Knn")
nnet_metrics <-
nnet_fit %>%
collect_metrics(summarize = TRUE) %>%
mutate(model = "Nnet")
# create dataframe with all models
model_compare <- bind_rows(log_metrics,
rf_metrics,
xgb_metrics,
knn_metrics,
nnet_metrics
)
# change data structure
model_comp <-
model_compare %>%
select(model, .metric, mean) %>%
pivot_wider(names_from = .metric, values_from = mean)
model_comp %>%
knitr::kable(caption = "Metric evaluation",digits = 3) %>%
kableExtra::kable_styling()
| model | accuracy | f_meas | kap | precision | recall | roc_auc | sens | spec |
|---|---|---|---|---|---|---|---|---|
| LG | 0.806 | 0.872 | 0.473 | 0.846 | 0.899 | 0.843 | 0.899 | 0.549 |
| RF | 0.801 | 0.870 | 0.450 | 0.836 | 0.907 | 0.841 | 0.907 | 0.509 |
| XGB | 0.797 | 0.866 | 0.448 | 0.839 | 0.895 | 0.842 | 0.895 | 0.526 |
| Knn | 0.716 | 0.804 | 0.287 | 0.815 | 0.795 | 0.745 | 0.795 | 0.499 |
| Nnet | 0.716 | 0.804 | 0.287 | 0.815 | 0.795 | 0.745 | 0.795 | 0.499 |
We note that the metric results across the models are all quite similar. In this example, we choose the F1-score as the performance measure for selecting the best model. Let’s find the maximum mean F1-score:
model_comp %>% slice_max(f_meas)
## # A tibble: 1 × 9
## model accuracy f_meas kap precision recall roc_auc sens spec
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 LG 0.806 0.872 0.473 0.846 0.899 0.843 0.899 0.549
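For reference, f_meas is the F1-score, the harmonic mean of precision and recall; the LG value can be checked directly from the table:

# F1 = 2 * precision * recall / (precision + recall)
2 * 0.846 * 0.899 / (0.846 + 0.899) # ~ 0.872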
The Logistic Regression model is therefore selected!
The final step is to fit the selected model, in this case Logistic Regression, to the whole training data with last_fit() and evaluate it on the test set.
# Logistic Regression
log_final <- last_fit(
log_wf,
split = cust_split,
metrics = metric_set(recall, precision, f_meas,
accuracy, kap, roc_auc, sens,spec)
)
We can then check the out-of-sample performance of the selected model:
# On testing
log_final %>%
collect_metrics() %>% arrange(.metric)
## # A tibble: 8 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.799 Preprocessor1_Model1
## 2 f_meas binary 0.867 Preprocessor1_Model1
## 3 kap binary 0.458 Preprocessor1_Model1
## 4 precision binary 0.844 Preprocessor1_Model1
## 5 recall binary 0.891 Preprocessor1_Model1
## 6 roc_auc binary 0.842 Preprocessor1_Model1
## 7 sens binary 0.891 Preprocessor1_Model1
## 8 spec binary 0.545 Preprocessor1_Model1
The performance metrics on the unseen testing data are slightly lower than on the training resamples:
# On training data set
log_fit %>%
collect_metrics() %>% arrange(.metric)
## # A tibble: 8 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.806 10 0.00574 Preprocessor1_Model1
## 2 f_meas binary 0.872 10 0.00401 Preprocessor1_Model1
## 3 kap binary 0.473 10 0.0149 Preprocessor1_Model1
## 4 precision binary 0.846 10 0.00395 Preprocessor1_Model1
## 5 recall binary 0.899 10 0.00643 Preprocessor1_Model1
## 6 roc_auc binary 0.843 10 0.00527 Preprocessor1_Model1
## 7 sens binary 0.899 10 0.00643 Preprocessor1_Model1
## 8 spec binary 0.549 10 0.0136 Preprocessor1_Model1
log_final %>%
collect_predictions() %>%
conf_mat(Churn,.pred_class)
## Truth
## Prediction No Yes
## No 920 170
## Yes 113 204
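The confusion matrix shows that 170 of the 374 churners in the test set are missed, which is consistent with the low specificity of 0.545. We can also plot it as a heatmap:

log_final %>%
  collect_predictions() %>%
  conf_mat(Churn, .pred_class) %>%
  autoplot(type = "heatmap") # yardstick also supports type = "mosaic"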
Finally, let’s plot variable importance scores for the predictors in the final model; the plot gives us a measure of how much each predictor contributes.
library(vip)
log_final %>%
extract_fit_engine() %>% # extract engine from the final model
vip(geom = "point") +
labs(title = "Variable importance scores",
subtitle = "Logistic Regression")
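For a glm engine, the vip() scores are derived from the absolute value of the coefficient z-statistics; as a follow-up, we can also inspect the fitted coefficients directly:

log_final %>%
  extract_fit_parsnip() %>% # pull the fitted parsnip model from last_fit()
  tidy() %>%                # coefficient estimates, std. errors, z-statistics, p-values
  arrange(p.value)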