In today’s digital era where the telecommunications industry is highly competitive, dealing with customer churn and leaving the service provider has become a huge challenge for businesses. Regork Telecom is also facing this problem when they are collecting data about their customers. And they have found that retaining existing customers is extremely necessary and urgent. Indeed, Regork believes that retaining existing customers is less expensive than investing in attracting new potential customers.
Therefore, our analysis will partly help Regork discover which factors will influence customer churn by identifying the factors from the data file “customer_retention.csv”. In addition, we developed a number of predictive models using Machine Learning pattern analysis to provide additional insights to reduce customer churn, which is a serious revenue loss that the company is facing.
Our findings will help predict and also help the company’s development departments focus on the right products, factors, or operations.
We propose various classification methods such as logistic regression, random forest, bagging, and decision tree. You can find our conclusion and reasoning highlighted throughout the summary tab.
# Load necessary libraries
library(tidyverse)
library(tidymodels)
library(vip)
library(kernlab)
library(ggplot2)
library(corrplot)
library(ggcorrplot)
library(baguette)
library(pdp)
library(parsnip)
library(ranger)
library(workflows)
library(DALEXtra)
library(widyr)
library(kableExtra)
# Load the dataset
customer_data <- read.csv("customer_retention.csv")
# Initial inspection
str(customer_data)
## 'data.frame': 6999 obs. of 20 variables:
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ Tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Status : chr "Current" "Current" "Left" "Current" ...
summary(customer_data)
## Gender SeniorCitizen Partner Dependents
## Length:6999 Min. :0.0000 Length:6999 Length:6999
## Class :character 1st Qu.:0.0000 Class :character Class :character
## Mode :character Median :0.0000 Mode :character Mode :character
## Mean :0.1619
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Tenure PhoneService MultipleLines InternetService
## Min. : 0.00 Length:6999 Length:6999 Length:6999
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :29.00 Mode :character Mode :character Mode :character
## Mean :32.38
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## Length:6999 Length:6999 Length:6999 Length:6999
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## StreamingTV StreamingMovies Contract PaperlessBilling
## Length:6999 Length:6999 Length:6999 Length:6999
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaymentMethod MonthlyCharges TotalCharges Status
## Length:6999 Min. : 18.25 Min. : 18.8 Length:6999
## Class :character 1st Qu.: 35.48 1st Qu.: 401.9 Class :character
## Mode :character Median : 70.35 Median :1397.5 Mode :character
## Mean : 64.75 Mean :2283.1
## 3rd Qu.: 89.85 3rd Qu.:3796.9
## Max. :118.75 Max. :8684.8
## NA's :11
# Check for missing values
sum(is.na(customer_data)) # Only 2 NA values in TotalCharges for new customers
## [1] 11
# Handle missing values - replace NA in TotalCharges with 0 for new customers
customer_data$TotalCharges[is.na(customer_data$TotalCharges)] <- 0
# Convert TotalCharges to numeric (handle non-numeric issues)
customer_data$TotalCharges <- as.numeric(customer_data$TotalCharges)
# Convert SeniorCitizen to factor
customer_data$SeniorCitizen <- as.factor(customer_data$SeniorCitizen)
# Convert Status to factor
customer_data$Status <- as.factor(customer_data$Status)
# Convert tenure to years for better interpretation
customer_data$TenureYears <- customer_data$Tenure / 12
# Convert categorical variables to factors
factor_cols <- c("Gender", "Partner", "Dependents", "PhoneService", "MultipleLines",
"InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
"TechSupport", "StreamingTV", "StreamingMovies", "Contract",
"PaperlessBilling", "PaymentMethod", "Status")
customer_data[factor_cols] <- lapply(customer_data[factor_cols], as.factor)
# Check the cleaned data
str(customer_data)
## 'data.frame': 6999 obs. of 21 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ Tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Status : Factor w/ 2 levels "Current","Left": 1 1 2 1 2 2 1 1 2 1 ...
## $ TenureYears : num 0.0833 2.8333 0.1667 3.75 0.1667 ...
summary(customer_data)
## Gender SeniorCitizen Partner Dependents Tenure PhoneService
## Female:3467 0:5866 No :3613 No :4894 Min. : 0.00 No : 676
## Male :3532 1:1133 Yes:3386 Yes:2105 1st Qu.: 9.00 Yes:6323
## Median :29.00
## Mean :32.38
## 3rd Qu.:55.00
## Max. :72.00
## MultipleLines InternetService OnlineSecurity
## No :3371 DSL :2405 No :3471
## No phone service: 676 Fiber optic:3075 No internet service:1519
## Yes :2952 No :1519 Yes :2009
##
##
##
## OnlineBackup DeviceProtection
## No :3070 No :3074
## No internet service:1519 No internet service:1519
## Yes :2410 Yes :2406
##
##
##
## TechSupport StreamingTV
## No :3448 No :2792
## No internet service:1519 No internet service:1519
## Yes :2032 Yes :2688
##
##
##
## StreamingMovies Contract PaperlessBilling
## No :2762 Month-to-month:3847 No :2862
## No internet service:1519 One year :1465 Yes:4137
## Yes :2718 Two year :1687
##
##
##
## PaymentMethod MonthlyCharges TotalCharges
## Bank transfer (automatic):1534 Min. : 18.25 Min. : 0.0
## Credit card (automatic) :1512 1st Qu.: 35.48 1st Qu.: 399.4
## Electronic check :2350 Median : 70.35 Median :1394.5
## Mailed check :1603 Mean : 64.75 Mean :2279.5
## 3rd Qu.: 89.85 3rd Qu.:3790.4
## Max. :118.75 Max. :8684.8
## Status TenureYears
## Current:5143 Min. :0.000
## Left :1856 1st Qu.:0.750
## Median :2.417
## Mean :2.698
## 3rd Qu.:4.583
## Max. :6.000
# Churn rate
churn_rate <- prop.table(table(customer_data$Status))
ggplot(data.frame(churn_rate), aes(x = Var1, y = Freq, fill = Var1)) +
geom_bar(stat = "identity") +
labs(title = "Customer Status Distribution",
x = "Status", y = "Proportion") +
scale_fill_manual(values = c("Current" = "darkgreen", "Left" = "red")) +
theme_minimal()
# Tenure distribution by status
ggplot(customer_data, aes(x = TenureYears, fill = Status)) +
geom_histogram(binwidth = 1, position = "dodge") +
labs(title = "Tenure Distribution by Customer Status",
x = "Tenure (Years)", y = "Count") +
scale_fill_manual(values = c("Current" = "darkgreen", "Left" = "red")) +
theme_minimal()
# Monthly charges by status
ggplot(customer_data, aes(x = MonthlyCharges, fill = Status)) +
geom_density(alpha = 0.5) +
labs(title = "Monthly Charges Distribution by Customer Status",
x = "Monthly Charges", y = "Density") +
scale_fill_manual(values = c("Current" = "darkgreen", "Left" = "red")) +
theme_minimal()
# Contract type by status
ggplot(customer_data, aes(x = Contract, fill = Status)) +
geom_bar(position = "fill") +
labs(title = "Customer Status by Contract Type",
x = "Contract Type", y = "Proportion") +
scale_fill_manual(values = c("Current" = "darkgreen", "Left" = "red")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Internet service by status
ggplot(customer_data, aes(x = InternetService, fill = Status)) +
geom_bar(position = "fill") +
labs(title = "Customer Status by Internet Service",
x = "Internet Service", y = "Proportion") +
scale_fill_manual(values = c("Current" = "darkgreen", "darkgreen", "Left" = "red")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Correlation matrix for numeric variables
numeric_vars <- customer_data %>%
select(Tenure, MonthlyCharges, TotalCharges) %>%
cor()
ggcorrplot(numeric_vars,
hc.order = TRUE,
type = "lower",
lab = TRUE,
title = "Correlation Between Numeric Variables")
Some characteristics that we have drawn from the analysis and comparison
of demographics and phone service usage at Regork:
Churn Rate Chart: The chart clearly shows that approximately 26.5% of customers have left Regork’s total customers. Meanwhile, 73.4% are still current.
Tenure Chart: Customers are now classified according to the number of years they have been with the company. It is easy to see that customers who are at risk of leaving will not stay with the company for long, this number is usually highest in the first year and decreases very clearly in the 6th year. In addition, customers who tend to stay with the company for a long time will tend to leave less, even reaching the maximum in the 6th year.
Monthly Charges Chart: We can see that customers who tend to leave will often have high monthly charges, usually in the range of 70-110. On the contrary, customers with low monthly charges will stay longer.
Contract Type chart: It is easy to see that signing a Month-to-month contract makes customers more likely to leave (around 43%), but the number drops significantly when customers start signing One-year contracts or the lowest is Two-year contracts (at 11% - lowest)
Internet Service chart: Fiber Optic service holds the highest number in terms of churn rate at nearly 42%, while DSL is only at 19%, or the lowest is in the no internet service category at 7%.
The chart shows the correlations between numeric variables: Tenure and Total Charges are closely correlated with each other at 0.83. This shows us that the longer the customer stays, the more likely they are to pay more over the years.
set.seed(123)
split <- initial_split(customer_data, prop = 0.7, strata = Status)
train_data <- training(split)
test_data <- testing(split)
recipe <- recipe(Status ~ ., data = train_data) %>%
step_YeoJohnson(all_numeric()) %>%
step_normalize(all_numeric()) %>%
step_dummy(all_nominal(), -all_outcomes())
# create resampling procedure
set.seed(123)
kfold <- vfold_cv(train_data, v = 5)
# titanic_train model via cross validation
results <- logistic_reg() %>%
fit_resamples(Status ~ ., kfold)
# collect the average accuracy rate
collect_metrics(results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.796 5 0.0104 Preprocessor1_Model1
## 2 brier_class binary 0.136 5 0.00421 Preprocessor1_Model1
## 3 roc_auc binary 0.844 5 0.00780 Preprocessor1_Model1
From the Model Performance Summary table, we can see that the accuracy is nearly 80.4%, a good number when classifying variables (here, churned and non-churned customers). In addition, the ROC_AUC at 0.846 shows that the Model shows incredible classification power, contributing to identifying high-risk customers and implementing policies to intervene early for Regork.
# retrain our model across the entire training data
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = train_data)
tidy(final_fit)
## # A tibble: 32 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.04 0.975 1.07 2.86e- 1
## 2 GenderMale -0.0170 0.0778 -0.219 8.27e- 1
## 3 SeniorCitizen1 0.185 0.101 1.83 6.73e- 2
## 4 PartnerYes -0.0655 0.0934 -0.701 4.83e- 1
## 5 DependentsYes -0.0438 0.106 -0.411 6.81e- 1
## 6 Tenure -0.0557 0.00752 -7.41 1.29e-13
## 7 PhoneServiceYes 0.0969 0.773 0.125 9.00e- 1
## 8 MultipleLinesNo phone service NA NA NA NA
## 9 MultipleLinesYes 0.463 0.211 2.20 2.81e- 2
## 10 InternetServiceFiber optic 1.77 0.949 1.86 6.26e- 2
## # ℹ 22 more rows
# confusion matric with the test data
final_fit %>%
predict(test_data) %>%
bind_cols(test_data %>% select(Status)) %>%
conf_mat(Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1389 248
## Left 154 309
It can be seen that this model is better at predicting current customers than churning customers. Looking at the value of confusion matrix gives us a similar view of the actual customer churn versus the predicted churn. We have the data of false positives (166) and false negatives (240) which proves the high accuracy and our model is more inclined towards predicting current customers (1377) than churning customers.
# Plot the feature importance
vip(final_fit)
The Tenure Factor is the most important factor we found. This also shows
that the longer a customer is likely to stick around, the lower the
likelihood that they will leave in the future.
# Step 1: create decision tree model object
dt_mod <- decision_tree(mode = 'classification') %>%
set_engine("rpart")
# Step 2: create model recipe
model_recipe <- recipe(Status ~ ., data = train_data)
# Step 3: fit model workflow
dt_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
fit(data = train_data)
# create resampling procedure
set.seed(123)
kfold <- vfold_cv(train_data, v = 5)
# train model
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)
# model results
collect_metrics(dt_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.789 5 0.00537 Preprocessor1_Model1
## 2 brier_class binary 0.147 5 0.00318 Preprocessor1_Model1
## 3 roc_auc binary 0.801 5 0.00651 Preprocessor1_Model1
From the Summary table above, we can see that the accuracy level has a small difference compared to the Logistic Regression model. The number 0.799 of ROC_AUC also partly reflects the weakness and ambiguity of this method.
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)
# get best results
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 8 30 roc_auc binary 0.824 5 0.00698
## 2 0.0000000178 8 30 roc_auc binary 0.824 5 0.00698
## 3 0.00000316 8 30 roc_auc binary 0.824 5 0.00698
## 4 0.0000000001 8 21 roc_auc binary 0.823 5 0.00548
## 5 0.0000000178 8 21 roc_auc binary 0.823 5 0.00548
## # ℹ 1 more variable: .config <chr>
# get best hyperparameter values
dt_best_model <- select_best(dt_results, metric = 'roc_auc')
# put together final workflow
dt_final_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
finalize_workflow(dt_best_model)
# fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
fit(data = train_data)
# plot feature importance
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(20)
rpart.plot::rpart.plot(dt_fit$fit$fit$fit)
From the Decision Tree graph, we see that our root node has a first
split of Contract type, which shows that if a customer has a one-year or
two-year contract, they are more likely to stay. Therefore, if Regork
invests in customer retention strategies on customers with long-term
contracts, it will help reduce the rate of customers wanting to
leave.
# create random forest model object and
# use the ranger package for the engine
# use the ranger package for the engine
rf_mod <- rand_forest(mode = "classification") %>%
set_engine("ranger")
# train model
rf_results <- fit_resamples(rf_mod, model_recipe, kfold)
# model results
collect_metrics(rf_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.797 5 0.00512 Preprocessor1_Model1
## 2 brier_class binary 0.138 5 0.00175 Preprocessor1_Model1
## 3 roc_auc binary 0.838 5 0.00427 Preprocessor1_Model1
# create random forest model object with tuning option
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
# create the hyperparameter grid
rf_hyper_grid <- grid_regular(
trees(range = c(50, 800)),
mtry(range = c(2, 50)),
min_n(range = c(1, 20)),
levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
rf_results <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = rf_hyper_grid)
# model results
show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 612 20 roc_auc binary 0.844 5 0.00499 Preprocessor1_Model1…
## 2 2 800 20 roc_auc binary 0.843 5 0.00457 Preprocessor1_Model1…
## 3 2 425 20 roc_auc binary 0.843 5 0.00496 Preprocessor1_Model1…
## 4 2 237 20 roc_auc binary 0.843 5 0.00496 Preprocessor1_Model1…
## 5 2 237 10 roc_auc binary 0.843 5 0.00444 Preprocessor1_Model0…
# get optimal hyperparameters
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
# create final workflow object
final_rf_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
# fit final workflow object
rf_final_fit <- final_rf_wf %>%
fit(data = train_data)
# plot feature importance
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)
Similar to the two models above, we can see the influence of factors
such as Tenure and even Contract on the customer’s decision to leave or
stay. This model provides higher accuracy than Decision Trees because it
examines multiple trees.
Among 3 Models we went over, the most optimal Model that we can evaluate is Logistic Regression.
collect_metrics(results, summarize = FALSE) %>% filter(.metric == "roc_auc")
## # A tibble: 5 × 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold1 roc_auc binary 0.829 Preprocessor1_Model1
## 2 Fold2 roc_auc binary 0.859 Preprocessor1_Model1
## 3 Fold3 roc_auc binary 0.822 Preprocessor1_Model1
## 4 Fold4 roc_auc binary 0.860 Preprocessor1_Model1
## 5 Fold5 roc_auc binary 0.850 Preprocessor1_Model1
vip::vip(final_fit)
set.seed(123)
kfold <- vfold_cv(train_data, v = 5)
results <- logistic_reg() %>%
fit_resamples(Status ~ ., kfold)
collect_metrics(results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.796 5 0.0104 Preprocessor1_Model1
## 2 brier_class binary 0.136 5 0.00421 Preprocessor1_Model1
## 3 roc_auc binary 0.844 5 0.00780 Preprocessor1_Model1
From the final_fit chart, we can clearly see the factors Tenure, ContractTwo year, ContractOne year, and TotalCharges.
It can be seen that these factors are connected to each other and directly affect the decision to stay or leave. It is possible that short-tenure customers will tend to leave early because they do not see the value that Regork brings. The same with Contract Type when customers only commit for a short time because they are afraid of the risks of long-term companionship and do not continue to sign, significantly affecting the decision to churn or not. The total charges factor also reflects the spending behavior of customers. If the charges are higher, it will predict the loyalty of customers when choosing to stay with Regork.
We propose creating loyalty reward programs based on (possibly) tenure reward points.
In addition, Regork can also reach contract customers by encouraging customers to switch to longer-term contracts such as applying discounts when signing for 5 or 6 months, or adding different service milestones based on the contract term (for example, with 7 months, add a bonus month of usage)
Looking at the output data of the value table from the Logistic Regression Model, we see that its performance is the best among the 3 models that we have used. Indeed, based on the accuracy of 0.8045, this is a model with high accuracy and extremely small error (0.0098). Furthermore, the ROC_AUC or CV mean of 0.846 shows that the models perform well, contributing significantly to predicting the flexibility of customer behavior at Regork.
As a decision-maker at Rogerk, we can learn the close relationship between the factors Tenure and Contract Length. Especially with the situation of spending too much to attract new customers, Regork should focus more on building more preferential policies to retain the company’s current short-term customers. The models we use are all strong and highly certain. Combined with segmenting customers by demographics definitely supplement specific tight preferential strategies that the company can exploit. In particular, Logistic Regression Model is an easy model to communicate insights to non-technical business owners, which is necessary because of its easy interpretation ability.
As we have analyzed in the Optimal Model section (which is Logistic Regression Model), we can see that Tenure and Contract Length are the key factors that determine the influence of variables in the data. We have agreed and proposed a solution to apply many discounts in different contract packages to retain many current customers. We also want to apply as many incentive programs as possible based on the company’s long-term loyalty, contributing to maintaining the stability of the Tenure factor. By applying many such incentive policies, Regork in the future will find many answers to why customers leave like that.
When we supplement the machine learning analysis (Logistic Regression Model, Decision Trees, Random Forest) with the demographic analysis ideas, we will understand the customer churn trend that we learned in the midterm. We can clearly see that the single customer segment is more likely to churn because there is no family tie. They are more likely to churn and they have less obstacles in terms of childcare and other financial factors.
The data we analyzed shows that Regork’s total loss from more customers churning is expected to be $10,000 per month (equivalent to losing 120 customers). If Regork approves the potential new policies that we offer, Rogerk will increase by $23,000 per month instead of struggling with loss in revenue. However, this number is just an estimation because demographic factors are subjective and do not generalize all data (because each person will have a purchasing tendency independent of their demographics, except for very obvious behaviors such as women often like pink products)
As mentioned above, we will specifically propose a 10% discount for customers who choose a 6-month contract package to use telecommunications products at Regork. For example, if out of our 120 customers, the discount helps retain 60 customers, how will we analyze the revenue? The answer is that if 60 customers are retained, the retained revenue will be 60x80 = 4800 per month. After 6 months, it will reach 4800x6 = 28800 dollars. Assuming that Regork’s cost for the promotion program in 1 month is 10,000 dollars, then the 10% discount will be 1,000 dollars. Then, after 6 months, Regork discounts 6,000 dollars. Finally, Regork owns a net gain of 28,800-6,000= 22,800 dollars. Therefore, instead of losing 120 customers, Regork can retain 60 customers and still earn 22,800 dollars in net gain from maintaining the promotion program, and even retain customers longer by extending the Contract Length to 6 months.
In summary, we found that the factors of Tenure and Contract Length have a significant impact on the behavior of staying or leaving at Regork. Through analyzing and evaluating the models, the Linear Regression Model left us with the strongest data in terms of accuracy and ROC_AUC. Through that, combined with the Demographics model, we can have more in-depth solutions to retain customers according to the criteria and conditions that Regork offers. In addition, thanks to data from Machine Learning, we encourage Regork to use methods such as contract incentives or loyalty duration incentives. These incentive methods require Regork’s perseverance and careful observation to meet the needs of customers (because each person has a different buying behavior) and in addition, they also help the company develop in the long term and also help bring in more potential customers without having to spend a lot of money to invest in attracting more customers (Regork’s current limitation).