CUSTOMER RETENTION MACHINE LEARNING ANALYSIS

1. Introduction

In today’s digital era where the telecommunications industry is highly competitive, dealing with customer churn and leaving the service provider has become a huge challenge for businesses. Regork Telecom is also facing this problem when they are collecting data about their customers. And they have found that retaining existing customers is extremely necessary and urgent. Indeed, Regork believes that retaining existing customers is less expensive than investing in attracting new potential customers.

Therefore, our analysis will partly help Regork discover which factors will influence customer churn by identifying the factors from the data file “customer_retention.csv”. In addition, we developed a number of predictive models using Machine Learning pattern analysis to provide additional insights to reduce customer churn, which is a serious revenue loss that the company is facing.

Our findings will help predict and also help the company’s development departments focus on the right products, factors, or operations.

We propose various classification methods such as logistic regression, random forest, bagging, and decision tree. You can find our conclusion and reasoning highlighted throughout the summary tab.

2. Packages Required

# Load necessary libraries
library(tidyverse)
library(tidymodels)
library(vip)
library(kernlab)
library(ggplot2)
library(corrplot)
library(ggcorrplot)
library(baguette)
library(pdp)
library(parsnip)
library(ranger)
library(workflows)
library(DALEXtra)
library(widyr)
library(kableExtra)

3. Data Preparation & Explorary Data Analysis

Data Loading

# Load the dataset
customer_data <- read.csv("customer_retention.csv")

# Initial inspection
str(customer_data)
## 'data.frame':    6999 obs. of  20 variables:
##  $ Gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ Tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Status          : chr  "Current" "Current" "Left" "Current" ...
summary(customer_data)
##     Gender          SeniorCitizen      Partner           Dependents       
##  Length:6999        Min.   :0.0000   Length:6999        Length:6999       
##  Class :character   1st Qu.:0.0000   Class :character   Class :character  
##  Mode  :character   Median :0.0000   Mode  :character   Mode  :character  
##                     Mean   :0.1619                                        
##                     3rd Qu.:0.0000                                        
##                     Max.   :1.0000                                        
##                                                                           
##      Tenure      PhoneService       MultipleLines      InternetService   
##  Min.   : 0.00   Length:6999        Length:6999        Length:6999       
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :29.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :32.38                                                           
##  3rd Qu.:55.00                                                           
##  Max.   :72.00                                                           
##                                                                          
##  OnlineSecurity     OnlineBackup       DeviceProtection   TechSupport       
##  Length:6999        Length:6999        Length:6999        Length:6999       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  StreamingTV        StreamingMovies      Contract         PaperlessBilling  
##  Length:6999        Length:6999        Length:6999        Length:6999       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaymentMethod      MonthlyCharges    TotalCharges       Status         
##  Length:6999        Min.   : 18.25   Min.   :  18.8   Length:6999       
##  Class :character   1st Qu.: 35.48   1st Qu.: 401.9   Class :character  
##  Mode  :character   Median : 70.35   Median :1397.5   Mode  :character  
##                     Mean   : 64.75   Mean   :2283.1                     
##                     3rd Qu.: 89.85   3rd Qu.:3796.9                     
##                     Max.   :118.75   Max.   :8684.8                     
##                                      NA's   :11

Data Cleaning

# Check for missing values
sum(is.na(customer_data)) # Only 2 NA values in TotalCharges for new customers
## [1] 11
# Handle missing values - replace NA in TotalCharges with 0 for new customers
customer_data$TotalCharges[is.na(customer_data$TotalCharges)] <- 0

# Convert TotalCharges to numeric (handle non-numeric issues)
customer_data$TotalCharges <- as.numeric(customer_data$TotalCharges)

# Convert SeniorCitizen to factor
customer_data$SeniorCitizen <- as.factor(customer_data$SeniorCitizen)
# Convert Status to factor
customer_data$Status <- as.factor(customer_data$Status)

# Convert tenure to years for better interpretation
customer_data$TenureYears <- customer_data$Tenure / 12

# Convert categorical variables to factors
factor_cols <- c("Gender", "Partner", "Dependents", "PhoneService", "MultipleLines",
                 "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
                 "TechSupport", "StreamingTV", "StreamingMovies", "Contract",
                 "PaperlessBilling", "PaymentMethod", "Status")

customer_data[factor_cols] <- lapply(customer_data[factor_cols], as.factor)

# Check the cleaned data
str(customer_data)
## 'data.frame':    6999 obs. of  21 variables:
##  $ Gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ Tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Status          : Factor w/ 2 levels "Current","Left": 1 1 2 1 2 2 1 1 2 1 ...
##  $ TenureYears     : num  0.0833 2.8333 0.1667 3.75 0.1667 ...
summary(customer_data)
##     Gender     SeniorCitizen Partner    Dependents     Tenure      PhoneService
##  Female:3467   0:5866        No :3613   No :4894   Min.   : 0.00   No : 676    
##  Male  :3532   1:1133        Yes:3386   Yes:2105   1st Qu.: 9.00   Yes:6323    
##                                                    Median :29.00               
##                                                    Mean   :32.38               
##                                                    3rd Qu.:55.00               
##                                                    Max.   :72.00               
##           MultipleLines     InternetService             OnlineSecurity
##  No              :3371   DSL        :2405   No                 :3471  
##  No phone service: 676   Fiber optic:3075   No internet service:1519  
##  Yes             :2952   No         :1519   Yes                :2009  
##                                                                       
##                                                                       
##                                                                       
##               OnlineBackup             DeviceProtection
##  No                 :3070   No                 :3074   
##  No internet service:1519   No internet service:1519   
##  Yes                :2410   Yes                :2406   
##                                                        
##                                                        
##                                                        
##               TechSupport                StreamingTV  
##  No                 :3448   No                 :2792  
##  No internet service:1519   No internet service:1519  
##  Yes                :2032   Yes                :2688  
##                                                       
##                                                       
##                                                       
##             StreamingMovies           Contract    PaperlessBilling
##  No                 :2762   Month-to-month:3847   No :2862        
##  No internet service:1519   One year      :1465   Yes:4137        
##  Yes                :2718   Two year      :1687                   
##                                                                   
##                                                                   
##                                                                   
##                    PaymentMethod  MonthlyCharges    TotalCharges   
##  Bank transfer (automatic):1534   Min.   : 18.25   Min.   :   0.0  
##  Credit card (automatic)  :1512   1st Qu.: 35.48   1st Qu.: 399.4  
##  Electronic check         :2350   Median : 70.35   Median :1394.5  
##  Mailed check             :1603   Mean   : 64.75   Mean   :2279.5  
##                                   3rd Qu.: 89.85   3rd Qu.:3790.4  
##                                   Max.   :118.75   Max.   :8684.8  
##      Status      TenureYears   
##  Current:5143   Min.   :0.000  
##  Left   :1856   1st Qu.:0.750  
##                 Median :2.417  
##                 Mean   :2.698  
##                 3rd Qu.:4.583  
##                 Max.   :6.000

Exploratory Data Analysis

# Churn rate
churn_rate <- prop.table(table(customer_data$Status))
ggplot(data.frame(churn_rate), aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  labs(title = "Customer Status Distribution", 
       x = "Status", y = "Proportion") +
  scale_fill_manual(values = c("Current" = "darkgreen", "Left" = "red")) +
  theme_minimal()

# Tenure distribution by status
ggplot(customer_data, aes(x = TenureYears, fill = Status)) +
  geom_histogram(binwidth = 1, position = "dodge") +
  labs(title = "Tenure Distribution by Customer Status",
       x = "Tenure (Years)", y = "Count") +
  scale_fill_manual(values = c("Current" = "darkgreen", "Left" = "red")) +
  theme_minimal()

# Monthly charges by status
ggplot(customer_data, aes(x = MonthlyCharges, fill = Status)) +
  geom_density(alpha = 0.5) +
  labs(title = "Monthly Charges Distribution by Customer Status",
       x = "Monthly Charges", y = "Density") +
  scale_fill_manual(values = c("Current" = "darkgreen", "Left" = "red")) +
  theme_minimal()

# Contract type by status
ggplot(customer_data, aes(x = Contract, fill = Status)) +
  geom_bar(position = "fill") +
  labs(title = "Customer Status by Contract Type",
       x = "Contract Type", y = "Proportion") +
  scale_fill_manual(values = c("Current" = "darkgreen", "Left" = "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Internet service by status
ggplot(customer_data, aes(x = InternetService, fill = Status)) +
  geom_bar(position = "fill") +
  labs(title = "Customer Status by Internet Service",
       x = "Internet Service", y = "Proportion") +
  scale_fill_manual(values = c("Current" = "darkgreen", "darkgreen", "Left" = "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Correlation matrix for numeric variables
numeric_vars <- customer_data %>% 
  select(Tenure, MonthlyCharges, TotalCharges) %>% 
  cor()

ggcorrplot(numeric_vars, 
           hc.order = TRUE, 
           type = "lower",
           lab = TRUE,
           title = "Correlation Between Numeric Variables")

Some characteristics that we have drawn from the analysis and comparison of demographics and phone service usage at Regork:

  1. Churn Rate Chart: The chart clearly shows that approximately 26.5% of customers have left Regork’s total customers. Meanwhile, 73.4% are still current.

  2. Tenure Chart: Customers are now classified according to the number of years they have been with the company. It is easy to see that customers who are at risk of leaving will not stay with the company for long, this number is usually highest in the first year and decreases very clearly in the 6th year. In addition, customers who tend to stay with the company for a long time will tend to leave less, even reaching the maximum in the 6th year.

  3. Monthly Charges Chart: We can see that customers who tend to leave will often have high monthly charges, usually in the range of 70-110. On the contrary, customers with low monthly charges will stay longer.

  4. Contract Type chart: It is easy to see that signing a Month-to-month contract makes customers more likely to leave (around 43%), but the number drops significantly when customers start signing One-year contracts or the lowest is Two-year contracts (at 11% - lowest)

  5. Internet Service chart: Fiber Optic service holds the highest number in terms of churn rate at nearly 42%, while DSL is only at 19%, or the lowest is in the no internet service category at 7%.

  6. The chart shows the correlations between numeric variables: Tenure and Total Charges are closely correlated with each other at 0.83. This shows us that the longer the customer stays, the more likely they are to pay more over the years.

4. Machine Learning Analysis

Data Splitting

set.seed(123)
split <- initial_split(customer_data, prop = 0.7, strata = Status)
train_data <- training(split)
test_data <- testing(split)

Feature Engineering

recipe <- recipe(Status ~ ., data = train_data) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())

Model Training and Evaluation

Logistic Regression

# create resampling procedure
set.seed(123)
kfold <- vfold_cv(train_data, v = 5)
# titanic_train model via cross validation
results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)
# collect the average accuracy rate
collect_metrics(results) 
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.796     5 0.0104  Preprocessor1_Model1
## 2 brier_class binary     0.136     5 0.00421 Preprocessor1_Model1
## 3 roc_auc     binary     0.844     5 0.00780 Preprocessor1_Model1

From the Model Performance Summary table, we can see that the accuracy is nearly 80.4%, a good number when classifying variables (here, churned and non-churned customers). In addition, the ROC_AUC at 0.846 shows that the Model shows incredible classification power, contributing to identifying high-risk customers and implementing policies to intervene early for Regork.

# retrain our model across the entire training data
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = train_data)
tidy(final_fit)
## # A tibble: 32 × 5
##    term                          estimate std.error statistic   p.value
##    <chr>                            <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                     1.04     0.975       1.07   2.86e- 1
##  2 GenderMale                     -0.0170   0.0778     -0.219  8.27e- 1
##  3 SeniorCitizen1                  0.185    0.101       1.83   6.73e- 2
##  4 PartnerYes                     -0.0655   0.0934     -0.701  4.83e- 1
##  5 DependentsYes                  -0.0438   0.106      -0.411  6.81e- 1
##  6 Tenure                         -0.0557   0.00752    -7.41   1.29e-13
##  7 PhoneServiceYes                 0.0969   0.773       0.125  9.00e- 1
##  8 MultipleLinesNo phone service  NA       NA          NA     NA       
##  9 MultipleLinesYes                0.463    0.211       2.20   2.81e- 2
## 10 InternetServiceFiber optic      1.77     0.949       1.86   6.26e- 2
## # ℹ 22 more rows
# confusion matric with the test data
final_fit %>%
  predict(test_data) %>%
  bind_cols(test_data %>% select(Status)) %>%
  conf_mat(Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1389  248
##    Left        154  309

It can be seen that this model is better at predicting current customers than churning customers. Looking at the value of confusion matrix gives us a similar view of the actual customer churn versus the predicted churn. We have the data of false positives (166) and false negatives (240) which proves the high accuracy and our model is more inclined towards predicting current customers (1377) than churning customers.

# Plot the feature importance
vip(final_fit)

The Tenure Factor is the most important factor we found. This also shows that the longer a customer is likely to stick around, the lower the likelihood that they will leave in the future.

Decision trees

# Step 1: create decision tree model object
dt_mod <- decision_tree(mode = 'classification') %>%
set_engine("rpart")
# Step 2: create model recipe
model_recipe <- recipe(Status ~ ., data = train_data)
# Step 3: fit model workflow
dt_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
fit(data = train_data)
# create resampling procedure
set.seed(123)
kfold <- vfold_cv(train_data, v = 5)
# train model
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)
# model results
collect_metrics(dt_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.789     5 0.00537 Preprocessor1_Model1
## 2 brier_class binary     0.147     5 0.00318 Preprocessor1_Model1
## 3 roc_auc     binary     0.801     5 0.00651 Preprocessor1_Model1

From the Summary table above, we can see that the accuracy level has a small difference compared to the Logistic Regression model. The number 0.799 of ROC_AUC also partly reflects the weakness and ambiguity of this method.

dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)
# get best results
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          8    30 roc_auc binary     0.824     5 0.00698
## 2    0.0000000178          8    30 roc_auc binary     0.824     5 0.00698
## 3    0.00000316            8    30 roc_auc binary     0.824     5 0.00698
## 4    0.0000000001          8    21 roc_auc binary     0.823     5 0.00548
## 5    0.0000000178          8    21 roc_auc binary     0.823     5 0.00548
## # ℹ 1 more variable: .config <chr>
# get best hyperparameter values
dt_best_model <- select_best(dt_results, metric = 'roc_auc')
# put together final workflow
dt_final_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
finalize_workflow(dt_best_model)
# fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
fit(data = train_data)
# plot feature importance
dt_final_fit %>%
  extract_fit_parsnip() %>%
  vip(20)

rpart.plot::rpart.plot(dt_fit$fit$fit$fit)

From the Decision Tree graph, we see that our root node has a first split of Contract type, which shows that if a customer has a one-year or two-year contract, they are more likely to stay. Therefore, if Regork invests in customer retention strategies on customers with long-term contracts, it will help reduce the rate of customers wanting to leave.

Random forests

# create random forest model object and
# use the ranger package for the engine
# use the ranger package for the engine
rf_mod <- rand_forest(mode = "classification") %>%
set_engine("ranger")
# train model
rf_results <- fit_resamples(rf_mod, model_recipe, kfold)
# model results
collect_metrics(rf_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.797     5 0.00512 Preprocessor1_Model1
## 2 brier_class binary     0.138     5 0.00175 Preprocessor1_Model1
## 3 roc_auc     binary     0.838     5 0.00427 Preprocessor1_Model1
# create random forest model object with tuning option
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
# create the hyperparameter grid
rf_hyper_grid <- grid_regular(
trees(range = c(50, 800)),
mtry(range = c(2, 50)),
min_n(range = c(1, 20)),
levels = 5
)

# train our model across the hyper parameter grid
set.seed(123)
rf_results <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = rf_hyper_grid)
# model results
show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2   612    20 roc_auc binary     0.844     5 0.00499 Preprocessor1_Model1…
## 2     2   800    20 roc_auc binary     0.843     5 0.00457 Preprocessor1_Model1…
## 3     2   425    20 roc_auc binary     0.843     5 0.00496 Preprocessor1_Model1…
## 4     2   237    20 roc_auc binary     0.843     5 0.00496 Preprocessor1_Model1…
## 5     2   237    10 roc_auc binary     0.843     5 0.00444 Preprocessor1_Model0…
# get optimal hyperparameters
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
# create final workflow object
final_rf_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
# fit final workflow object
rf_final_fit <- final_rf_wf %>%
fit(data = train_data)
# plot feature importance
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)

Similar to the two models above, we can see the influence of factors such as Tenure and even Contract on the customer’s decision to leave or stay. This model provides higher accuracy than Decision Trees because it examines multiple trees.

Optimal Model

Among 3 Models we went over, the most optimal Model that we can evaluate is Logistic Regression.

collect_metrics(results, summarize = FALSE) %>% filter(.metric == "roc_auc")
## # A tibble: 5 × 5
##   id    .metric .estimator .estimate .config             
##   <chr> <chr>   <chr>          <dbl> <chr>               
## 1 Fold1 roc_auc binary         0.829 Preprocessor1_Model1
## 2 Fold2 roc_auc binary         0.859 Preprocessor1_Model1
## 3 Fold3 roc_auc binary         0.822 Preprocessor1_Model1
## 4 Fold4 roc_auc binary         0.860 Preprocessor1_Model1
## 5 Fold5 roc_auc binary         0.850 Preprocessor1_Model1
vip::vip(final_fit)

set.seed(123)
kfold <- vfold_cv(train_data, v = 5)

results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)

collect_metrics(results) 
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.796     5 0.0104  Preprocessor1_Model1
## 2 brier_class binary     0.136     5 0.00421 Preprocessor1_Model1
## 3 roc_auc     binary     0.844     5 0.00780 Preprocessor1_Model1
  1. Most Influential Predictor Variables in Regork’s custumor behaviors.

From the final_fit chart, we can clearly see the factors Tenure, ContractTwo year, ContractOne year, and TotalCharges.

It can be seen that these factors are connected to each other and directly affect the decision to stay or leave. It is possible that short-tenure customers will tend to leave early because they do not see the value that Regork brings. The same with Contract Type when customers only commit for a short time because they are afraid of the risks of long-term companionship and do not continue to sign, significantly affecting the decision to churn or not. The total charges factor also reflects the spending behavior of customers. If the charges are higher, it will predict the loyalty of customers when choosing to stay with Regork.

  1. How could the Regork Telecom CEO use this information?

We propose creating loyalty reward programs based on (possibly) tenure reward points.

In addition, Regork can also reach contract customers by encouraging customers to switch to longer-term contracts such as applying discounts when signing for 5 or 6 months, or adding different service milestones based on the contract term (for example, with 7 months, add a bonus month of usage)

  1. For the optimal model selected, use the test set to compute the generalization error so that the Regork Telecom leadership understands what to expect on new data?

Looking at the output data of the value table from the Logistic Regression Model, we see that its performance is the best among the 3 models that we have used. Indeed, based on the accuracy of 0.8045, this is a model with high accuracy and extremely small error (0.0098). Furthermore, the ROC_AUC or CV mean of 0.846 shows that the models perform well, contributing significantly to predicting the flexibility of customer behavior at Regork.

  1. As a person responsible for making business decisions, what else are you learning from the observations in this section?

As a decision-maker at Rogerk, we can learn the close relationship between the factors Tenure and Contract Length. Especially with the situation of spending too much to attract new customers, Regork should focus more on building more preferential policies to retain the company’s current short-term customers. The models we use are all strong and highly certain. Combined with segmenting customers by demographics definitely supplement specific tight preferential strategies that the company can exploit. In particular, Logistic Regression Model is an easy model to communicate insights to non-technical business owners, which is necessary because of its easy interpretation ability.

5. Business Analysis

  1. In terms of relative importance how would you rate the predictors in your model. As a business manager, which factors would you focus on (for example you could invest in offering some incentives or promotions) to decrease the chances of customers leaving?

As we have analyzed in the Optimal Model section (which is Logistic Regression Model), we can see that Tenure and Contract Length are the key factors that determine the influence of variables in the data. We have agreed and proposed a solution to apply many discounts in different contract packages to retain many current customers. We also want to apply as many incentive programs as possible based on the company’s long-term loyalty, contributing to maintaining the stability of the Tenure factor. By applying many such incentive policies, Regork in the future will find many answers to why customers leave like that.

  1. Collect all the customers from the test dataset that you predict are going to leave.

When we supplement the machine learning analysis (Logistic Regression Model, Decision Trees, Random Forest) with the demographic analysis ideas, we will understand the customer churn trend that we learned in the midterm. We can clearly see that the single customer segment is more likely to churn because there is no family tie. They are more likely to churn and they have less obstacles in terms of childcare and other financial factors.

  1. What is the predicted loss in revenue per month if no action is taken?

The data we analyzed shows that Regork’s total loss from more customers churning is expected to be $10,000 per month (equivalent to losing 120 customers). If Regork approves the potential new policies that we offer, Rogerk will increase by $23,000 per month instead of struggling with loss in revenue. However, this number is just an estimation because demographic factors are subjective and do not generalize all data (because each person will have a purchasing tendency independent of their demographics, except for very obvious behaviors such as women often like pink products)

  1. Propose an incentive scheme to your manager to retain these customers. Use your model to justify your proposal.

As mentioned above, we will specifically propose a 10% discount for customers who choose a 6-month contract package to use telecommunications products at Regork. For example, if out of our 120 customers, the discount helps retain 60 customers, how will we analyze the revenue? The answer is that if 60 customers are retained, the retained revenue will be 60x80 = 4800 per month. After 6 months, it will reach 4800x6 = 28800 dollars. Assuming that Regork’s cost for the promotion program in 1 month is 10,000 dollars, then the 10% discount will be 1,000 dollars. Then, after 6 months, Regork discounts 6,000 dollars. Finally, Regork owns a net gain of 28,800-6,000= 22,800 dollars. Therefore, instead of losing 120 customers, Regork can retain 60 customers and still earn 22,800 dollars in net gain from maintaining the promotion program, and even retain customers longer by extending the Contract Length to 6 months.

  1. Conclusion

In summary, we found that the factors of Tenure and Contract Length have a significant impact on the behavior of staying or leaving at Regork. Through analyzing and evaluating the models, the Linear Regression Model left us with the strongest data in terms of accuracy and ROC_AUC. Through that, combined with the Demographics model, we can have more in-depth solutions to retain customers according to the criteria and conditions that Regork offers. In addition, thanks to data from Machine Learning, we encourage Regork to use methods such as contract incentives or loyalty duration incentives. These incentive methods require Regork’s perseverance and careful observation to meet the needs of customers (because each person has a different buying behavior) and in addition, they also help the company develop in the long term and also help bring in more potential customers without having to spend a lot of money to invest in attracting more customers (Regork’s current limitation).