Customer Retention Analysis

Introduction

Customer retention is a key challenge in the telecom industry. Losing customers, or “churn,” not only impacts revenue but also increases the cost of acquiring new customers. In this project, we analyze customer churn data to uncover patterns and identify factors influencing customer retention. Using data from a telecom company, we apply machine learning models to predict which customers are most likely to leave. These predictions enable businesses to take proactive steps to improve customer satisfaction and loyalty. This report highlights our approach to data preparation, exploratory analysis, and predictive modeling to support data-driven retention strategies.

Required Packages

Here are the packages used in our analysis, including a short description of each:

  • tidymodels: For building and validating machine learning models.
  • tidyverse: For data manipulation and visualization.
  • baguette: For bagged tree models.
  • vip: For assessing variable importance in models.
  • pdp: For creating partial dependence plots.
  • kernlab: For kernel-based methods.
  • ggplot2: For data visualization.
# Loading Required Libraries
  library(tidymodels)
  library(tidyverse)
  library(baguette)
  library(vip)
  library(pdp)
  library(kernlab)
  library(ggplot2)

Data Preparation

We begin by importing the dataset and exploring its structure to identify potential data quality issues or necessary transformations.

# Importing the dataset
customer_retention <- read.csv("Data/customer_retention.csv")

# Viewing the first few rows of the dataset
head(customer_retention)

# Checking the structure of the dataset
str(customer_retention)

# Summary statistics of the dataset
summary(customer_retention)

The column TotalCharges contains 11 missing values. We omit these rows to ensure data quality for our analysis.

# Identifying missing values
colSums(is.na(customer_retention))
##           Gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           Tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges           Status 
##                0                0               11                0
# Removing rows with missing values
customer_retention <- na.omit(customer_retention)

# Confirming all missing values have been removed
sum(is.na(customer_retention))
## [1] 0

To prevent issues during analysis, we ensure that categorical variables are consistent and have no unexpected levels.

unique(customer_retention$Status)
unique(customer_retention$Gender)
unique(customer_retention$SeniorCitizen)
unique(customer_retention$Partner)
unique(customer_retention$Dependents)
unique(customer_retention$InternetService)
unique(customer_retention$OnlineSecurity)
unique(customer_retention$OnlineBackup)
unique(customer_retention$DeviceProtection)
unique(customer_retention$TechSupport)
unique(customer_retention$StreamingTV)
unique(customer_retention$StreamingMovies)
unique(customer_retention$Contract)
unique(customer_retention$PaperlessBilling)

We split the dataset into training (70%) and test (30%) sets, ensuring stratified sampling to maintain the balance of Status in both sets.

# Recode response variable as a factor
customer_retention <- customer_retention %>%
  mutate(Status = factor(Status, levels = c("Current", "Left")))

# Splitting the dataset
set.seed(123)
retention_split <- initial_split(customer_retention, prop = 0.7, strata = "Status")
retention_train <- training(retention_split)
retention_test <- testing(retention_split)

# Checking dimensions of the splits
dim(retention_train)
## [1] 4891   20
dim(retention_test)
## [1] 2097   20

Exploratory Data Analysis

Overview of Target Variable

The Status variable indicates whether a customer is currently retained (Current) or has left (Left). Below is the distribution of the target variable in the dataset.

table(customer_retention$Status)
## 
## Current    Left 
##    5132    1856
prop.table(table(customer_retention$Status))
## 
##   Current      Left 
## 0.7344018 0.2655982

The majority of customers (73.4%) are current, while just over a quarter (26.6%) have left the service.

Continuous Variables

Tenure Distribution

We first explored the distribution of tenure among our customer base, both current and past.

cust_tenure <- ggplot(customer_retention, aes(x = Tenure, fill = Status)) +
  geom_histogram(binwidth = 5) + 
  theme_grey() +
  scale_x_continuous(n.breaks=10)+
  labs(title = "Customer Tenure", x = "Number of Months Kept", y = "Count of Instances",)
cust_tenure

It seems that of the current customer base, the majority have either just joined in the last few months, or have been with the service since the beginning. This could be because the service offered a special deal upon startup that customers had been grandfathered into for those who have been long-standing customers. Those who have left the service departed within the first few months, and many of the current ones are new as well, suggesting there could be a promotion for the first ~6 months of the contract that attracts people, and they may choose to switch services once the promotion is up.

Monthly Charges

In addition to the tenure, we explored monthly charges to understand why customers may leave or stay with the service.

monthly_charges<- ggplot(customer_retention, aes(x = MonthlyCharges, fill = Status)) +
  geom_histogram(binwidth = 3) +
  theme_grey() +
  scale_x_continuous(n.breaks=10)+
  labs(title = "Customer Monthly Charges", x = "Amount Billed ($)", y = "Count of Instances")
monthly_charges

In this graph, we find that most of the current customers are being charged around $20/month. If we analyze this in tandem with the previous tenure graph, we can conclude there is some sort of promotion for new members that offers the lower price. Many of the customers who left the service began doing so around the $70/month mark, which we may need to consider in our pricing strategy.

Categorical Variables

Internet Service

We graphed the internet service variable to help understand why customers may be leaving, and if it had to do with their type of internet service.

internet_service <- ggplot(customer_retention, aes(x = InternetService, fill = Status)) + 
  geom_bar() +
  theme_minimal()+
  labs(x = "Internet Service", y = "Count", title = "Status by Internet Service Type")
internet_service

In doing this, we found that the majority of customers who left had fiber optic cables as their internet service. We may need to look into enhancing or changing our approach for our fiber optic customers to increase retention.

Senior Citizens

We looked at the demographics of our customer base and found that the distribution between male and female identifying customers were fairly even, so we decided to explore the general age of our customer base.

senior_citizen <- ggplot(customer_retention, aes(x = SeniorCitizen, fill = Status)) + 
  geom_bar() +
  theme_minimal()+
  labs(x = "Senior Status", y = "Count", title = "Customers vs Senior Citizen Status")
senior_citizen

Most of the customers, both current and past, have not identified as senior citizens. This could help us in marketing strategies in the future, as we can better cater to our larger base, but know we can work to improve how our service could be advertised to and fit for senior citizens.

Relationships Between Variables

Demographics and Service

Because the non-senior citizen status was so significant in our demographic, we decided to analyze the differnt services between age groups to understand how our service could better fit the senior citizen group.

senior_service <- ggplot(customer_retention, aes(x = SeniorCitizen, fill = InternetService)) +
  geom_bar(position = "dodge") +
  theme_minimal()+
  labs(x = "Senior Status", y = "Count", title = "Relationship between Age and Internet Service Type")
senior_service

Both the senior and non-senior demographics utlizie the fiber optic services for their internet, so Regork may want to focus on potentially enhancing their services for the DSL to gain a larger base from the senior citizens, but make sure they fortify the service they do provide in conjunction with fiber optic, as that seems to be the most popular among the graphs.

Tenure vs Status

This graph helps us to better visualize the range of those staying with the service vs leaving it.

tenure_status <- ggplot(customer_retention, aes(x = Status, y = Tenure)) +
  geom_boxplot(fill = "purple") +
  theme_minimal()
tenure_status

Most of the customers that leave do so in under 30 months of commitment, with the mean being around 10 months. Regork can look at what promotions or incentives they can offer around the 10 month mark to increase the perceived value of their service and hopefully retain more customers.

Monthly Charge vs Status

We watned to get a better understanding of the costs related to customers leaving the service and the effect the price range may have on the retention.

monthly_status <- ggplot(customer_retention, aes(x = Status, y = MonthlyCharges)) +
  geom_boxplot(fill = "red") +
  theme_minimal()
monthly_status

The customers who have left were generally charged more than those who have stayed, paying an average of around $80/month. If Regork can determine why those costs were not associated with perceived value, they may be able to better retain that customer group.

Machine Learning Analysis

Predicting Likelihood of Churn

In this section, we use a logistic regression model to predict customer churn. By analyzing factors such as customer demographics, service usage, and billing methods, the model identifies which variables are most influential in predicting whether a customer will churn. This allows us to prioritize retention efforts effectively.

set.seed(123)
kfold <- vfold_cv(retention_train, v = 5, strata = "Status")

To ensure the model performs reliably, we divided the data into smaller subsets and tested the model on different portions of the data. This approach gives a clearer picture of how well the model will work in real-world scenarios.

results <- logistic_reg() %>%
  set_engine("glm") %>%
  fit_resamples(Status ~ ., resamples = kfold)

The model was trained using all available customer data, including demographics, contract details, and service preferences. This allows us to determine which factors most influence customer decisions to leave or stay.

collect_metrics(results) %>%
  filter(.metric == "accuracy")
## # A tibble: 1 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.799     5 0.00401 Preprocessor1_Model1

The model correctly predicts whether a customer will churn approximately 80% of the time. This accuracy ensures that the model provides valuable insights into customer behavior without being overly complex.

final_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(Status ~ ., data = retention_train)

We trained the logistic regression model on the entire training dataset to finalize it.

final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

When applied to new data, the model continued to perform well. While it correctly identifies many customers likely to leave, it does occasionally miss some (false negatives) and predicts churn for a few customers who stay (false positives). These predictions can guide retention efforts, but care must be taken to avoid unnecessary interventions.

vip::vip(final_fit)

After finalizing the model, we identified the factors most associated with churn:

  • Contract Length: Customers on shorter contracts, such as month-to-month, are more likely to leave.
  • Internet Service Type: Fiber optic users are at higher risk, potentially due to service quality concerns or competition.
  • Monthly Charges: Higher bills slightly increase the risk of churn. These findings align with our earlier data analysis and provide a foundation for targeted strategies.

The logistic regression model provides valuable insights into the linear relationships between predictors and churn. Its simplicity makes it an excellent starting point for understanding churn patterns. However, its linear nature might miss complex interactions between variables, which more advanced models like random forests can capture.

Predicting High-Risk Customer Churn

In this section, we use a random forest model to predict which customers are at the highest risk of churn (probability > 70%). By identifying these customers, Regork Telecom can implement targeted retention strategies, focusing efforts where they are most needed.

model_recipe <- recipe(Status ~ ., data = retention_train)

rf_mod <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")

set.seed(123)
kfold <- vfold_cv(retention_train, v = 5, strata = "Status")

To build the random forest model, we prepared a recipe using all available predictors, including customer demographics, contract types, and service usage. Cross-validation ensures the model performs reliably across various data subsets, providing consistent results.

rf_hyper_grid <- grid_regular(
  trees(range = c(50, 800)),
  mtry(range = c(2, 19)),
  min_n(range = c(1, 20)),
  levels = 5
)

We created a grid of potential hyperparameter values, allowing the model to test various settings for the number of trees, predictors per split, and minimum node size. Optimizing these values ensures the model is both accurate and efficient.

set.seed(123)
rf_results <- tune_grid(
  rf_mod,
  model_recipe,
  resamples = kfold,
  grid = rf_hyper_grid,
  metrics = metric_set(accuracy, roc_auc, mn_log_loss)
)

The model was trained and evaluated across multiple folds using the hyperparameter grid. Metrics such as ROC AUC help us determine how well the model distinguishes between churners and non-churners, ensuring reliable predictions.

show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2   800    20 roc_auc binary     0.845     5 0.00534 Preprocessor1_Model1…
## 2     2   800    15 roc_auc binary     0.845     5 0.00535 Preprocessor1_Model0…
## 3     2   425    20 roc_auc binary     0.845     5 0.00536 Preprocessor1_Model1…
## 4     2   612    20 roc_auc binary     0.845     5 0.00555 Preprocessor1_Model1…
## 5     2   612    15 roc_auc binary     0.844     5 0.00523 Preprocessor1_Model0…
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")

The best-performing model was selected based on the ROC AUC metric. This ensures the final model focuses on accurately predicting high-risk churners, which is critical for targeted interventions.

final_rf_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best_hyperparameters)

rf_final_fit <- final_rf_wf %>%
  fit(data = retention_train)

Using the best hyperparameter settings, we trained the final random forest model on the entire training dataset. This model captures complex relationships and non-linear interactions between predictors, which were less visible in the linear regression model.

rf_predictions <- predict(rf_final_fit, new_data = retention_test, type = "prob") %>%
  bind_cols(predict(rf_final_fit, new_data = retention_test, type = "class")) %>%
  bind_cols(retention_test)

high_risk_customers <- rf_predictions %>%
  filter(.pred_Left > 0.7)

high_risk_metrics <- high_risk_customers %>%
  metrics(truth = Status, estimate = .pred_class)

print(high_risk_metrics)
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary          0.86
## 2 kap      binary          0

The model was then used to predict churn probabilities for the test dataset. We focused on customers with a churn probability greater than 70% (high risk). The predictions were 86% accurate for this group, demonstrating strong reliability in identifying high-risk churners.

rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10) +
  labs(
    title = "Top 10 Features Influencing High-Risk Customer Churn",
    x = "Feature",
    y = "Importance"
  ) +
  theme_minimal()

Finally, we examined the key factors influencing churn risk using variable importance. The top drivers include:

  • Tenure: Shorter tenure is associated with higher churn risk.
  • Total Charges: Customers with higher total charges are more likely to leave.
  • Contract Type: Month-to-month contracts increase churn likelihood. These insights align with earlier graphs, such as tenure distribution and monthly charges, and provide Regork with clear priorities for improving customer retention.

The feature importance graph underscores the need for targeted retention efforts. Customers in the first year of service, particularly on month-to-month plans, should receive personalized offers to prevent churn. Addressing these pain points with promotions or improved service quality can yield significant retention benefits.

Which Demographic Groups are Most Likely to Churn?

This section focuses on predicting churn likelihood based on demographic variables such as Gender, Senior Citizen, Partner, and Dependents. By understanding the demographic groups most at risk of leaving, Regork Telecom can tailor retention strategies to these customers effectively.

dt_mod <- decision_tree(
  mode = "classification"
) %>%
  set_engine("rpart")

model_recipe <- recipe(Status ~ Gender + SeniorCitizen + Partner + Dependents, data = retention_train)

We begin by building a simple decision tree model to explore the relationships between demographic variables and churn. These features were selected based on their potential impact on customer behavior and insights derived from the exploratory data analysis. For example, earlier findings showed senior citizens and customers without dependents may be more likely to churn.

dt_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = retention_train)

After specifying the model and recipe, we train the decision tree using the training dataset. This initial model helps us understand the relationships between demographic variables and churn. It sets the stage for further refinement through hyperparameter tuning to improve performance.

set.seed(123)
kfold <- vfold_cv(retention_train, v = 5, strata = "Status")

dt_results <- fit_resamples(
  dt_mod,
  model_recipe,
  resamples = kfold,
  metrics = metric_set(accuracy, roc_auc)
)

collect_metrics(dt_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n  std_err .config             
##   <chr>    <chr>      <dbl> <int>    <dbl> <chr>               
## 1 accuracy binary     0.734     5 0.000137 Preprocessor1_Model1
## 2 roc_auc  binary     0.5       5 0        Preprocessor1_Model1

To assess the model’s reliability, we perform 5-fold cross-validation and evaluate metrics such as accuracy and ROC AUC. The initial decision tree achieved an accuracy of 73.4% during cross-validation, consistent with earlier models. However, the ROC AUC of 0.5 suggests it is not yet effective at distinguishing between churners and non-churners.

dt_mod_tuned <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

set.seed(123)
dt_tuned_results <- tune_grid(
  dt_mod_tuned,
  model_recipe,
  resamples = kfold,
  grid = dt_hyper_grid,
  metrics = metric_set(accuracy, roc_auc)
)

dt_best_params <- select_best(dt_tuned_results, metric = "roc_auc")
show_best(dt_tuned_results, metric = "roc_auc")
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          4     2 roc_auc binary     0.614     5 0.00727
## 2    0.0000000178          4     2 roc_auc binary     0.614     5 0.00727
## 3    0.00000316            4     2 roc_auc binary     0.614     5 0.00727
## 4    0.0000000001          8     2 roc_auc binary     0.614     5 0.00727
## 5    0.0000000178          8     2 roc_auc binary     0.614     5 0.00727
## # ℹ 1 more variable: .config <chr>

To enhance performance, we tune hyperparameters like tree depth, minimum node size, and cost complexity. These parameters control the tree’s size and shape, balancing simplicity and accuracy.

dt_final_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod_tuned) %>%
  finalize_workflow(dt_best_params)

dt_final_fit <- dt_final_wf %>%
  fit(data = retention_train)

This final model captures optimized relationships between demographics and churn. By focusing on tuned parameters, we achieve a balance between interpretability and predictive accuracy.

dt_predictions <- predict(dt_final_fit, new_data = retention_test, type = "prob") %>%
  bind_cols(predict(dt_final_fit, new_data = retention_test, type = "class")) %>%
  bind_cols(retention_test)

confusion <- dt_predictions %>%
  conf_mat(truth = Status, estimate = .pred_class)
confusion
##           Truth
## Prediction Current Left
##    Current    1444  488
##    Left         96   69

We evaluate the tuned model on the test dataset, focusing on accuracy and the confusion matrix to assess its ability to predict churn. The model correctly predicted churn for 69 customers who left but misclassified 488 churners as current customers (false negatives). These misclassifications emphasize the need for additional strategies to minimize churn predictions’ business impact.

dt_final_fit %>%
  extract_fit_parsnip() %>%
  vip(10) +
  labs(
    title = "Top Demographic Features Influencing Churn",
    x = "Feature",
    y = "Importance"
  )

After finalizing the decision tree model, we identified the most influential demographic factors associated with churn:

  • Dependents: Customers without dependents are significantly more likely to churn, suggesting a need for personalized retention strategies for individuals.
  • Senior Citizen: Senior citizens show a higher likelihood of churn, possibly reflecting dissatisfaction with service features or pricing.
  • Partner: Customers without partners are at greater risk of leaving, indicating opportunities for campaigns tailored to single customers.
  • Gender: Gender had minimal impact on churn, consistent with earlier demographic analysis. These insights allow Regork Telecom to develop group-specific retention strategies. By proactively addressing the needs of senior citizens or customers without dependents, the company can prevent churn while maintaining a focus on high-risk groups identified earlier.

Potential Revenue Loss from Predicted Churn by Demographic

Lastly, this section identifies customers who are predicted to churn and calculates their potential revenue loss. Using monthly charges, we estimate the annual revenue loss per customer

predicted_churners <- rf_predictions %>%
  filter(.pred_class == "Left") %>%
  mutate(RevenueLoss = MonthlyCharges * 12)

We aggregate revenue loss by demographic groups to better understand the financial impact of churn across different segments. This step groups the predicted churners by attributes like dependents, senior citizenship, partnership status, and gender, summarizing total and average revenue loss, as well as the count of customers in each group

demographic_loss <- predicted_churners %>%
  group_by(Dependents, SeniorCitizen, Partner, Gender) %>%
  summarize(
    TotalRevenueLoss = sum(RevenueLoss, na.rm = TRUE),
    AvgRevenueLoss = mean(RevenueLoss, na.rm = TRUE),
    Count = n()
  ) %>%
  arrange(desc(TotalRevenueLoss))
## `summarise()` has grouped output by 'Dependents', 'SeniorCitizen', 'Partner'.
## You can override using the `.groups` argument.

The resulting table, displayed below, reveals which demographic groups contribute most to revenue loss. For example, non-senior females without dependents or partners seem to account for the largest share of potential revenue loss, highlighting a priority segment for retention strategies.

print(demographic_loss)
## # A tibble: 15 × 7
## # Groups:   Dependents, SeniorCitizen, Partner [8]
##    Dependents SeniorCitizen Partner Gender TotalRevenueLoss AvgRevenueLoss Count
##    <chr>              <int> <chr>   <chr>             <dbl>          <dbl> <int>
##  1 No                     0 No      Female          102455.           907.   113
##  2 No                     0 No      Male             95996.           906.   106
##  3 No                     1 No      Female           45361.           926.    49
##  4 No                     1 No      Male             34546.           909.    38
##  5 No                     1 Yes     Male             28844.           961.    30
##  6 No                     0 Yes     Male             19508.          1027.    19
##  7 Yes                    0 Yes     Female           16655.           980.    17
##  8 No                     0 Yes     Female           16088.           946.    17
##  9 Yes                    0 No      Male             10523.           809.    13
## 10 No                     1 Yes     Female            8337            926.     9
## 11 Yes                    0 Yes     Male              7728            966      8
## 12 Yes                    0 No      Female            5587.           931.     6
## 13 Yes                    1 No      Female            1216.          1216.     1
## 14 Yes                    1 Yes     Female            1150.          1150.     1
## 15 Yes                    1 Yes     Male               965.           965.     1

To better show this, we use a bar chart that breaks down total revenue loss by demographic combinations. The x-axis represents whether customers have dependents, while the bars are grouped and colored by combinations of senior citizenship, partnership status, and gender.

ggplot(demographic_loss, aes(x = Dependents, y = TotalRevenueLoss, fill = interaction(SeniorCitizen, Partner, Gender))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Potential Revenue Loss by Demographics",
    x = "Dependents",
    y = "Total Revenue Loss ($)",
    fill = "SeniorCitizen-Partner-Gender"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From the chart, we observe that non-dependent customers, particularly those who are not senior citizens, contribute the majority of potential revenue loss. This insight underscores the importance of tailored retention efforts targeting these high-impact groups.

Business Analysis and Conclusion

Most influential Predictor Variables and Demographics

Our analysis reveals that the most influential predictor variables driving customer churn are Tenure, Contract Type, Monthly Charges, Total Charges, and Online Security. Among these, Tenure is the strongest indicator. Customers with longer tenures are significantly less likely to leave, emphasizing the importance of fostering loyalty through retention programs and incentives aimed at long-standing customers.

Contract Type emerges as another critical factor, with customers on month-to-month contracts being more likely to churn than those on longer-term contracts. Encouraging customers to switch to annual or bi-annual plans, potentially through discounted rates or bundled service offerings, could help mitigate churn. Similarly, Monthly Charges and Total Charges have substantial influence, indicating that cost-sensitive customers are more likely to leave. Competitive pricing, targeted discounts, or personalized offers for high-risk customers could be effective in addressing this issue. Additionally, Online Security plays a notable role, as customers who subscribe to these services tend to have lower churn rates. Strengthening and promoting online security services could further enhance customer retention.

Demographically, the customers most likely to churn include younger individuals, non-senior citizens, and single customers without dependents. These groups may have more limited disposable income or greater sensitivity to price changes. By contrast, senior citizens and those with dependents tend to display greater stability, likely due to fixed incomes and family obligations that anchor their choices. Targeting high-risk demographics with customized promotions, such as discounts for younger customers or tailored offers for single individuals, can help address their specific concerns and improve retention.

Potential Revenue Loss and Incentives

Using our model, we predict that younger single females and males contribute $102,454.80 and $95,995.80 in potential revenue loss respectively. To retain these customers, we propose a multi-pronged incentive plan. First, offer a 10% discount on monthly charges for three months to cost-sensitive demographics. This would address concerns related to Monthly Charges and Total Charges while providing immediate financial relief. Second, implement a rewards program to incentivize long-term contracts and recognize customer loyalty, especially for younger and single individuals, who are more likely to churn. For instance, rewards could include discounts, exclusive services, or recognition on significant milestones like anniversaries with the company.

Conclusion

Our analysis highlights the key drivers of churn and demographic trends that Regork can target to improve customer retention. By addressing critical predictors such as Tenure, Contract Type, Monthly Charges, and Total Charges, alongside demographic groups at higher risk, the company can proactively reduce churn. Implementing incentive programs, promoting long-term contracts, and enhancing services like Online Security can ensure customer satisfaction and loyalty. These measures, grounded in our predictive model with an average AUC of 80%, provide a strong foundation for reducing revenue loss and building stronger customer relationships. By taking these steps, Regork can sustain long-term growth and solidify its position in the telecommunications market.