This machine learning project tackles the essential task of predicting customer churn. The primary objective is to build a strong predictive model that identifies customers who are likely to churn. By recognizing potential churners early, the model aims to sharpen decision-making in customer retention strategies, reduce business risk, and support more informed customer relationship management.

The results of this project offer several valuable business advantages:

  • Enhanced Risk Management: Accurate identification of potential churners minimizes the risk of revenue loss and enables proactive intervention strategies.
  • Personalized Customer Engagement: Predicting churn allows for targeted retention efforts, fostering stronger customer relationships and loyalty.
  • Operational Efficiency: Automated churn prediction streamlines resource allocation, reducing manual efforts and operational costs.
  • Maximized Customer Lifetime Value: Retaining high-value customers through predictive modeling increases their long-term contribution to the business.
  • Financial Stability: Minimizing customer attrition ensures a steady revenue stream and enhances overall financial stability.
  • Data-Driven Decision-Making: Churn prediction provides valuable insights into customer behavior, enabling informed strategic decisions and improvements.
  • Competitive Advantage: Effective churn management enhances the company’s reputation for customer-centric practices, differentiating it from competitors.
  • Optimized Marketing Strategies: Accurate churn predictions lead to targeted marketing campaigns, maximizing their impact and minimizing waste.
  • Enhanced Profitability: Reduced churn rates translate to increased customer retention, contributing to improved financial performance and profitability.
  • Continuous Improvement: Ongoing analysis of churn patterns enables the refinement of retention strategies, adapting to evolving customer preferences.

In short: save the business time, save the business money, give the business more insight, and make the business more money.

Now, let’s jump into the coding process!

Data Pre-processing

Import the required libraries

library(dplyr)
library(inspectdf)
library(caret)
library(Ardian) # My personal package
library(ggplot2)
library(GGally)
library(xgboost)
library(randomForest)

Read the data

churn <- read.csv("Telco Customer Churn.csv")

Inspect the data

Top 6 rows

churn %>% head()

Bottom 6 rows

churn %>% tail()

Check duplicated rows

churn %>% duplicated() %>% any()
## [1] FALSE

There’s no duplicated row

Check missing values

churn %>% anyNA()
## [1] TRUE

There are missing values. Let's find out where they are!

Inspect missing values

churn %>% is.na() %>% colSums()
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

Only 11 rows have a missing value, all in TotalCharges. Let's remove those rows!

Handle missing values

churn <- churn %>% filter(complete.cases(.))

churn %>% is.na() %>% colSums()
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0                0 
##            Churn 
##                0

Cool! We are now free of missing values!

Inspect data structure

churn %>% glimpse()
## Rows: 7,032
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…

We need to do two things:

  1. Remove the customerID column
  2. Parse the categorical columns into factors

Remove unneeded columns

churn <- churn %>% select(-customerID)

Parse categorical columns

churn <- churn %>% 
  mutate_if(is.character, as.factor) %>%  # convert every character column to a factor
  mutate(SeniorCitizen = as.factor(SeniorCitizen))  # SeniorCitizen is coded 0/1, so convert it explicitly

churn %>% glimpse()
## Rows: 7,032
## Columns: 20
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…

Cool. All categorical columns now have the correct data type

Exploratory Data Analysis & Feature Selection

Check target variable proportion

churn$Churn %>% table() %>% barplot()
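The bar plot makes the imbalance visible; to put exact numbers on it, a quick proportion table helps:

churn$Churn %>% table() %>% prop.table()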

The proportion of the target variable is imbalanced. Let’s upsample the data!

Upsample data

up_churn <- upSample(x = churn %>% select(-Churn),
                     y = churn$Churn,
                     yname = "Churn")

up_churn$Churn %>% table() %>% barplot()

Cool. The target proportion is now balanced!

Inspect categorical columns distributions

up_churn %>% inspect_cat() %>% show_plot()

Cool. No categorical column looks degenerate; every level is reasonably represented

Inspect numerical columns distributions

plotNumericalDistribution(up_churn) # This function is from my personal package
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
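Since plotNumericalDistribution lives in my personal package, here is a rough ggplot2 equivalent for readers following along (a sketch: it assumes tidyr is installed and won't match the original styling exactly):

library(tidyr)

# Reshape the numerical columns to long format, then facet one histogram per column
up_churn %>% 
  select(where(is.numeric)) %>% 
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")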

Cool. The numerical distributions look sensible, with no impossible values or obvious anomalies

Numerical Feature Correlations

up_churn %>% ggcorr(label = T)

TotalCharges is highly correlated with the other numerical columns (it is roughly tenure × MonthlyCharges), so we'll remove it

Remove highly correlated columns

up_churn <- up_churn %>% select(-TotalCharges)

up_churn %>% ggcorr(label = T)

Cool! No highly correlated columns remain
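With only two numerical columns left this is easy to eyeball, but for wider feature sets caret's findCorrelation can flag highly correlated columns automatically. A quick sketch (the 0.75 cutoff here is an arbitrary illustrative choice):

cor_matrix <- up_churn %>% select(where(is.numeric)) %>% cor()

# Returns the columns caret suggests dropping so that no remaining pair exceeds the cutoff
findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)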

Train-Test Split

Set training indices

Because we don’t have a lot of records, we will set the training proportion to 85%

set.seed(1)

indices <- createDataPartition(y = up_churn$Churn,
                               p = 0.85,
                               list = F)

Split train & test

train_data <- up_churn[indices, ]
test_data <- up_churn[-indices, ]

X_train <- train_data %>% select(-Churn)
X_test <- test_data %>% select(-Churn)

y_train <- train_data$Churn
y_test <- test_data$Churn

Model Fitting & Evaluation

Random Forest Algorithm

I chose the Random Forest algorithm because, when predicting human behavior, a single simple rule is rarely enough. Random Forest, an ensemble method that combines many decision trees, is well suited to such scenarios because it doesn't rely on any single rule.

model_rf <- randomForest(formula = Churn ~ .,
                         data = train_data,
                         ntree = 505,
                         importance = T)
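Since we fit the model with importance = T, we can optionally peek at which features drive the predictions, using randomForest's built-in importance plot:

varImpPlot(model_rf) # mean decrease in accuracy and in Gini impurity, per feature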

Random Forest Model Evaluation

pred_rf <- predict(model_rf, X_test)

confusionMatrix(pred_rf, y_test, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  637  27
##        Yes 137 747
##                                          
##                Accuracy : 0.8941         
##                  95% CI : (0.8776, 0.909)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7881         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9651         
##             Specificity : 0.8230         
##          Pos Pred Value : 0.8450         
##          Neg Pred Value : 0.9593         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4826         
##    Detection Prevalence : 0.5711         
##       Balanced Accuracy : 0.8941         
##                                          
##        'Positive' Class : Yes            
## 

The metric that we’re gonna be focusing on is Sensitivity. Why? because we want to minimize the case where customers who are likely to churn are predicted as not likely to churn.

The random forest model achieved an overall Accuracy of 89.4%, with a Sensitivity of 96.5%. This high sensitivity indicates that the model is effective at correctly identifying customers who are likely to churn, which is crucial for proactively addressing potential churn and retaining valuable customers.

We’ll tune the model and see if we could increase it’s performance!

Model Tuning & Selection

We’re gonna tune the model by implementing K-Fold Cross Validation with 3 repetitions of 5 folds

ctrl <- trainControl(
  method = "repeatedcv",      
  number = 5,    
  repeats = 3
)

model_rf_tuned <- train(
  Churn ~ .,
  data = train_data,
  method = "rf",      
  trControl = ctrl
)
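Under the hood, caret's train with method = "rf" tunes the mtry parameter over a small default grid and keeps the value with the best cross-validated Accuracy; you can inspect the chosen value with model_rf_tuned$bestTune. If we wanted to select mtry by cross-validated ROC AUC instead, a sketch would look like this (the explicit mtry grid is illustrative, not from the original run):

ctrl_roc <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 3,
  classProbs = TRUE,                 # required to compute ROC within each fold
  summaryFunction = twoClassSummary  # reports ROC, Sens, Spec instead of Accuracy
)

model_rf_roc <- train(
  Churn ~ .,
  data = train_data,
  method = "rf",
  metric = "ROC",                               # select mtry by cross-validated AUC
  tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)), # illustrative grid
  trControl = ctrl_roc
)

Note that twoClassSummary treats the first factor level ("No" here) as the event of interest, which matters if you optimize for Sens or Spec rather than ROC.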

Random Forest Tuned Model Evaluation

pred_rf_tuned <- predict(model_rf_tuned, X_test)

confusionMatrix(pred_rf_tuned, y_test, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  653  23
##        Yes 121 751
##                                          
##                Accuracy : 0.907          
##                  95% CI : (0.8914, 0.921)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.814          
##                                          
##  Mcnemar's Test P-Value : 6.302e-16      
##                                          
##             Sensitivity : 0.9703         
##             Specificity : 0.8437         
##          Pos Pred Value : 0.8612         
##          Neg Pred Value : 0.9660         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4851         
##    Detection Prevalence : 0.5633         
##       Balanced Accuracy : 0.9070         
##                                          
##        'Positive' Class : Yes            
## 

The performance increased!

The tuned random forest model achieved a higher Accuracy of 90.7% and a Sensitivity of 97%, indicating a stronger ability to identify customers who are likely to churn and making it the winning model of this project!

AUC of ROC

My way of explaining the AUC of the ROC curve is that it reflects how well the model separates the two classes: it is the probability that a randomly chosen churner receives a higher predicted churn score than a randomly chosen non-churner. As people who use the model, we want that separation to be strong, since we rely on the scores for making decisions; a model whose scores barely distinguish churners from non-churners isn't one we can act on. This is why the AUC of the ROC curve is a crucial measure of whether the model is ready for practical use.

pred_rf_tuned_raw <- predict(model_rf_tuned, X_test, type = "prob")

plotROC(pred_rf_tuned_raw[, 2], ifelse(y_test == "Yes", 1, 0)) # This function is from my personal package

Magnificent. Why? Because the closer the AUC is to 1, the better the model separates customers who are churning from those who are not
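For readers without my personal package, here is a rough equivalent of the ROC plot and AUC computation using the widely used pROC package (an assumed dependency, not part of the original script):

library(pROC)

# Build the ROC object from the true labels and the predicted "Yes" probabilities
roc_obj <- roc(response = y_test,
               predictor = pred_rf_tuned_raw[, "Yes"],
               levels = c("No", "Yes"))

plot(roc_obj) # draw the ROC curve
auc(roc_obj)  # print the area under the curve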

Conclusion

In conclusion, the final model performs very well, with an Accuracy of 90.7% and a Sensitivity of 97%. It combines high accuracy, high sensitivity, and strong specificity, meaning it effectively identifies customers who are likely to churn while also performing well at identifying customers who are not.

Furthermore, the AUC of the ROC curve of 0.96 is additional evidence of the model's readiness for practical deployment. A score that close to 1 indicates the model reliably separates customers who are likely to churn from those who are not. With strong performance across these metrics, the model is well prepared for reliable, real-world use.