Executive Summary

Regork recently expanded into telecommunications. Since it’s far more cost-effective to keep existing customers than to acquire new ones, our team was asked to figure out why customers leave and identify who is most likely to cancel their service (this is called “churn”).

To do this, we explored customer behavior patterns with visualizations, then trained several predictive models to estimate which customers were at risk of leaving. Our strongest model, a Random Forest, identified contract type, tenure (length of time with the company), monthly charges, and access to tech support as the strongest predictors of churn.

With this knowledge, we proposed a plan to keep the most at-risk customers by offering small incentives—helping Regork save money and keep loyal users.

1. Data Preparation and Cleaning

We started by loading the data and making sure it was clean and usable. This involved:

  • Changing text columns into categories (factors) that models can understand

  • Removing rows with missing values (empty cells)

  • Making sure values like total charges are stored as numbers, not text

Packages Used

library(tidyverse)
library(caret)
library(rpart)
library(randomForest)
library(pROC)
library(knitr)
library(cowplot)
library(ggthemes)
library(vip)
library(rsample)
library(tune)
library(recipes)
library(dials)
library(workflows)
library(parsnip)
library(yardstick)

# Load the data; coerce TotalCharges to numeric (blank strings become NA)
df <- read_csv("customer_retention.csv")
df$TotalCharges <- as.numeric(df$TotalCharges)

# Drop rows with missing values and convert character columns to factors
df <- df %>% drop_na() %>% mutate_if(is.character, as.factor)

# Collapse "No internet service" / "No phone service" into a single "No" level
standardize_cols <- c("OnlineSecurity", "OnlineBackup", "DeviceProtection", 
                      "TechSupport", "StreamingTV", "StreamingMovies", "MultipleLines")

for (col in standardize_cols) {
  df[[col]] <- fct_collapse(df[[col]],
                            No = c("No", "No internet service", "No phone service"),
                            Yes = "Yes")
}
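
A quick optional check confirms that each standardized column now has only two levels (a minimal sketch using purrr, which tidyverse loads):

# Each element should print as "No" "Yes"
map(df[standardize_cols], levels)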

2. Exploratory Data Analysis

This section helps us understand trends in the data. We made simple graphs to explore how certain customer types were more likely to leave. These early insights helped us decide which features were important.

ggplot(df, aes(x = Status, fill = Status)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, color = "black", size = 3) +
  labs(title = "Customer Churn Breakdown") +
  theme_minimal()

This shows the total number of customers who stayed versus those who left.
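
For reference, the same breakdown can be expressed as a rate (a short sketch on the plotted Status column):

# Overall churn rate: share of customers in each Status group
df %>%
  count(Status) %>%
  mutate(proportion = n / sum(n))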

ggplot(df, aes(x = Contract, fill = Status)) +
  geom_bar(position = "stack") +
  geom_text(stat = "count", aes(label = after_stat(count)), 
            position = position_stack(vjust = 0.5), color = "black", size = 3) +
  labs(title = "Churn by Contract Type") +
  theme_minimal()

Here we see that customers with month-to-month contracts are much more likely to leave.
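
That visual pattern can be quantified as a churn rate per contract type (a short sketch; "Left" is the churn label used throughout):

# Share of customers who left, by contract type
df %>%
  group_by(Contract) %>%
  summarise(churn_rate = mean(Status == "Left"))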

# Tenure vs Monthly Charges by Status (scatterplot; no count labels needed)
ggplot(df, aes(x = Tenure, y = MonthlyCharges, color = Status)) +
  geom_point(alpha = 0.6) +
  labs(title = "Tenure vs Monthly Charges by Status") +
  theme_minimal()

This shows that newer customers paying more each month tend to leave more often.
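
A compact numeric summary of the same pattern (a sketch using the columns from the plot):

# Median tenure and monthly charges for customers who stayed vs. left
df %>%
  group_by(Status) %>%
  summarise(median_tenure = median(Tenure),
            median_monthly_charges = median(MonthlyCharges))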

ggplot(df, aes(x = InternetService, fill = Status)) +
  geom_bar(position = "stack") +
  geom_text(stat = "count", aes(label = after_stat(count)), 
            position = position_stack(vjust = 0.5), color = "black", size = 3) +
  labs(title = "Churn by Internet Service") +
  theme_minimal()

ggplot(df, aes(x = TechSupport, fill = Status)) +
  geom_bar(position = "stack") +
  geom_text(stat = "count", aes(label = after_stat(count)), 
            position = position_stack(vjust = 0.5), color = "black", size = 3) +
  labs(title = "Churn by Tech Support Access") +
  theme_minimal()

Customers without tech support are more likely to churn—possibly because issues aren’t being resolved.
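
The same rate calculation extends to any categorical feature; a short sketch for the two service columns plotted above:

# Churn rate by internet service type and by tech support access
for (col in c("InternetService", "TechSupport")) {
  df %>%
    group_by(.data[[col]]) %>%
    summarise(churn_rate = mean(Status == "Left")) %>%
    print()
}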

3. Machine Learning: Model Building and Evaluation

We tested three types of models to predict which customers will churn:

  • Logistic Regression: a simple, traditional model

  • MARS (Multivariate Adaptive Regression Splines): a flexible model that handles different types of data well

  • Random Forest: a powerful model made from combining many decision trees

Logistic Regression

This gives us a good starting point. It’s simple and helps us see if the data is predictive.

# 70/30 train/test split, stratified on the outcome so both sets keep the same churn rate
set.seed(123)
split <- initial_split(df, prop = 0.7, strata = Status)
train <- training(split)
test <- testing(split)

# Baseline logistic regression, evaluated with 5-fold cross-validation
log_model <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification")
log_res <- fit_resamples(log_model, Status ~ ., vfold_cv(train, v = 5))
collect_metrics(log_res)
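
To see which predictors the baseline leans on, the model can be fit once on the full training set and its coefficients inspected (a sketch; tidy() is re-exported by parsnip and formats the underlying glm output):

# Ten most statistically significant coefficients
log_fit <- fit(log_model, Status ~ ., data = train)
tidy(log_fit) %>% arrange(p.value) %>% slice_head(n = 10)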

MARS

This model helps detect more complicated patterns in customer behavior.

# MARS: tune the number of retained terms and the interaction degree
mars_model <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification") %>% set_engine("earth")

mars_wf <- workflow() %>% add_recipe(recipe(Status ~ ., data = train)) %>% add_model(mars_model)

# Regular grid over both tuning parameters, scored with 5-fold cross-validation
mars_grid <- grid_regular(num_terms(range = c(1, 30)), prod_degree(), levels = 5)
mars_res <- tune_grid(mars_wf, resamples = vfold_cv(train, v = 5), grid = mars_grid)
show_best(mars_res, metric = "roc_auc")
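
The best cross-validated MARS candidate can be placed next to the logistic baseline for a direct comparison (a short sketch reusing the objects defined above):

# Cross-validated ROC AUC: logistic baseline vs. best MARS configuration
collect_metrics(log_res) %>% filter(.metric == "roc_auc")
show_best(mars_res, metric = "roc_auc", n = 1)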

Random Forest

This model turned out to be the most accurate. It builds many decision trees and combines their predictions, which captures complex patterns while resisting overfitting.

# Random forest: tune tree count, variables tried per split (mtry), and minimum node size
rf_model <- rand_forest(trees = tune(), mtry = tune(), min_n = tune()) %>%
  set_mode("classification") %>% set_engine("ranger", importance = "impurity")

rf_wf <- workflow() %>% add_recipe(recipe(Status ~ ., data = train)) %>% add_model(rf_model)

# Finalize mtry against the predictors only (its upper bound is the predictor count)
final_mtry <- finalize(mtry(), train %>% select(-Status))
rf_grid <- grid_regular(
  trees(range = c(100, 500)),
  final_mtry,
  min_n(range = c(1, 20)),
  levels = 3
)

rf_res <- tune_grid(rf_wf, resamples = vfold_cv(train, v = 5), grid = rf_grid)
show_best(rf_res, metric = "roc_auc")

Model Performance and Variable Importance

We selected the best-performing Random Forest configuration, refit it on the training data, and evaluated it on the held-out test set. We also examined which variables were most important.

rf_best <- select_best(rf_res, metric = "roc_auc")
final_rf <- finalize_workflow(rf_wf, rf_best)
rf_fit <- fit(final_rf, data = train)
preds <- predict(rf_fit, test) %>% bind_cols(test %>% select(Status))
conf_mat(preds, truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1507  430
##    Left         33  127
rf_fit %>% extract_fit_parsnip() %>% vip(num_features = 15) + ggtitle("Top 15 Important Features")
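
The confusion matrix reflects the default 0.5 probability cutoff, at which the model catches 127 of the 557 test-set churners. A threshold-free test-set ROC AUC is a useful complement (a sketch; assumes the factor levels Current/Left shown above):

# Class probabilities on the held-out test set
rf_probs <- predict(rf_fit, test, type = "prob") %>%
  bind_cols(test %>% select(Status))

# "Left" is the second factor level, so point yardstick at it explicitly
roc_auc(rf_probs, truth = Status, .pred_Left, event_level = "second")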

4. Retention Strategy and Business Case

Now that we know who is most likely to leave, we can estimate how much money Regork could lose—and how much they could save by offering incentives.

# Test-set customers the model predicts will leave
at_risk <- predict(rf_fit, new_data = test) %>% bind_cols(test) %>% filter(.pred_class == "Left")

# Monthly revenue at risk from those customers
monthly_loss <- sum(at_risk$MonthlyCharges)
monthly_loss
## [1] 13190.5

This is the monthly revenue Regork stands to lose from the test-set customers the model flags as likely to leave.
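
Scaled to a year, the figure is easier to weigh against retention spending (simple arithmetic on the value above, assuming these customers would otherwise have stayed):

# Annualized revenue at risk
annual_loss <- monthly_loss * 12
annual_loss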

# Highest-paying at-risk customers (top_n returns every row when fewer than 500 qualify)
top_500 <- at_risk %>% top_n(500, wt = MonthlyCharges)
incentive_cost <- 20 * nrow(top_500)    # $20 incentive per targeted customer
savings <- sum(top_500$MonthlyCharges)  # monthly revenue retained if all of them stay
tibble(Incentive_Cost = incentive_cost, Potential_Savings = savings)

By offering a $20/month incentive to the highest-paying at-risk customers (up to 500 of them), the monthly revenue retained would far exceed the cost of the incentive.
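
The payoff also hinges on how many targeted customers the incentive actually retains. A minimal sketch under a hypothetical 40% response rate (the true rate would need to be measured in a pilot):

# Hypothetical response rate: fraction of targeted customers who stay
response_rate <- 0.40
net_benefit <- response_rate * savings - incentive_cost
net_benefit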

Final Thoughts and Analysis

Through detailed analysis and predictive modeling, we identified key patterns driving customer churn at Regork Telecom. Our findings show that month-to-month contract holders, new customers (low tenure), fiber optic users, and customers lacking support services are the most at-risk groups for leaving.

To proactively retain these valuable customers and maximize profits, we recommend the following targeted incentive strategies:

  • Offer 1-month free promotions to encourage customers to shift from month-to-month contracts to longer 1- or 2-year agreements.

  • Provide loyalty rewards (such as free upgrades or small bill credits) to customers early in their tenure, especially within the first six months.

  • Bundle free Tech Support and Online Security services to create additional value with minimal cost, boosting satisfaction and loyalty.

  • Incentivize automatic payments by offering small bill credits to customers who switch from electronic checks, reducing both churn and operational costs.

  • Offer personalized service packages to senior citizens and fiber optic users, increasing perceived value and lowering churn in key segments.

Implementing these strategies is expected to yield significant business benefits:

  • Lower monthly revenue loss by retaining high-risk customers.

  • Increase customer lifetime value through contract extensions.

  • Reduce operational costs associated with payment collection and churn.

  • Improve customer satisfaction by offering services aligned with their needs.

In summary, by focusing on targeted, cost-effective incentives, Regork Telecom can proactively retain more customers, stabilize revenue, and strengthen its market position as a trusted service provider.

Limitations:

Historical Data Only:

Our model is trained on past customer behavior. Future market conditions, new competitors, or changes in customer preferences may impact how well the model performs.

Incomplete Customer View:

Important factors like customer satisfaction ratings, prior service issues, and competitor promotions were not available. Including these could make predictions even more accurate.

Incentive Effectiveness May Vary:

While we estimate positive impacts from incentive programs, actual customer response rates and the true cost of offering incentives could differ once implemented at scale.