Introduction

#Regork Telecom seeks to improve customer retention by understanding which customers are likely to leave. Using customer data, we analyze and build predictive models to identify patterns of churn and offer actionable business insights. This is done by looking at the connection between our different types of services and finding a way to better our offerings for our customers.

1. Data Preparation & Exploratory Data Analysis

data <- read.csv("~/customer_retention.csv")
glimpse(data)

## Rows: 6,999
## Columns: 20
## $ Gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ Tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status           <chr> "Current", "Current", "Left", "Current", "Left", "Lef…

# Drop rows with missing TotalCharges
data <- data %>% drop_na(TotalCharges)

# Convert character variables to factors
data <- data %>%
  mutate(across(where(is.character), as.factor))

# Overview of churn distribution
data %>%
  count(Status) %>%
  mutate(percentage = n / sum(n) * 100)

##    Status    n percentage
## 1 Current 5132   73.44018
## 2    Left 1856   26.55982

# Churn by PhoneService, MultipleLines, InternetService
service_vars <- c("PhoneService", "MultipleLines", "InternetService")

for (var in service_vars) {
  print(
    ggplot(data, aes_string(x = var, fill = "Status")) +
      geom_bar(position = "fill") +
      labs(title = paste("Proportion of Churn by", var),
           y = "Proportion", x = var) +
      scale_y_continuous(labels = scales::percent)
  )
}

These plots help showcase that we have a significant issue with churn for our Fiber Optic customers.

# Charges by Status
ggplot(data, aes(x = Status, y = MonthlyCharges, fill = Status)) +
  geom_boxplot() +
  labs(title = "Monthly Charges by Customer Status")

ggplot(data, aes(x = Status, y = TotalCharges, fill = Status)) +
  geom_boxplot() +
  labs(title = "Total Charges by Customer Status")

#Conclusions of Exploratory Analysis

#These plots help show some of the problems that we are facing as a company. Our internet service plans appear to be competing badly with the industry. The other graphs appear to be close to standard, but internet is a massive liability for our goals in this industry at Regork.

2. Machine Learning

set.seed(123)
split <- initial_split(data, prop = 0.7, strata = Status)
train <- training(split)
test <- testing(split)

# Preprocess recipe
rec <- recipe(Status ~ PhoneService + MultipleLines + InternetService + MonthlyCharges + TotalCharges, data = train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Logistic Regression
log_model <- logistic_reg() %>% set_engine("glm")

# Decision Tree
tree_model <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification")

# Random Forest
rf_model <- rand_forest(mtry = 2, trees = 500, min_n = 5) %>%
  set_engine("ranger") %>% set_mode("classification")

wf_log <- workflow() %>% add_model(log_model) %>% add_recipe(rec)
wf_tree <- workflow() %>% add_model(tree_model) %>% add_recipe(rec)
wf_rf <- workflow() %>% add_model(rf_model) %>% add_recipe(rec)

# Cross-validation
set.seed(234)
folds <- vfold_cv(train, v = 5, strata = Status)

log_res <- fit_resamples(wf_log, resamples = folds, metrics = metric_set(roc_auc))
tree_res <- fit_resamples(wf_tree, resamples = folds, metrics = metric_set(roc_auc))
rf_res <- fit_resamples(wf_rf, resamples = folds, metrics = metric_set(roc_auc))

collect_metrics(log_res)

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.811     5 0.00521 Preprocessor1_Model1

collect_metrics(tree_res)

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.732     5  0.0124 Preprocessor1_Model1

collect_metrics(rf_res)

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.819     5 0.00565 Preprocessor1_Model1

# Fit the decision tree model
final_tree <- fit(wf_tree, data = train)

# Extract the underlying tree object
tree_model <- extract_fit_engine(final_tree)

# Plot the decision tree
rpart.plot(tree_model, type = 2, extra = 106, fallen.leaves = TRUE, main = "Decision Tree")

# Fit final model on training data
final_rf <- fit(wf_rf, data = train)

# Predict on test set
test_preds <- predict(final_rf, test, type = "prob") %>%
  bind_cols(test %>% select(Status)) %>%
  mutate(pred = if_else(.pred_Current > 0.5, "Current", "Left"))

# AUC
roc_auc_vec(test_preds$Status, test_preds$.pred_Current)

## [1] 0.8220564

# Confusion matrix
test_preds <- test_preds %>%
  mutate(pred = factor(pred, levels = levels(Status)))

conf_mat(test_preds, truth = Status, estimate = pred)

##           Truth
## Prediction Current Left
##    Current    1438  336
##    Left        102  221

# Variable importance from random forest
rf_model <- rand_forest(mtry = 2, trees = 500, min_n = 5) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

wf_rf <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(rec)

final_rf <- fit(wf_rf, data = train)

final_rf %>%
  extract_fit_parsnip() %>%
  vip::vip()

In the machine learning process, we carried out a structured modeling approach by first splitting our data into training and testing sets. This allowed us to build models on the training data and evaluate how well the models generalize to new, unseen customers. We ensured feature engineering was applied where necessary, including converting categorical variables, handling missing data, and scaling numerical predictors.

We evaluated three machine learning algorithms: logistic regression, decision tree, and random forest. Each model was assessed using 5-fold cross-validation, allowing us to obtain more stable estimates of model performance by rotating the validation sets across folds. Our primary evaluation metric was Area Under the ROC Curve (AUC), which captures the trade-off between true positive rate and false positive rate across different threshold settings. AUC is important in customer churn analysis because it helps us balance catching churners while minimizing incorrect predictions.

After cross-validation, the Random Forest model achieved the highest AUC, outperforming both the logistic regression and decision tree models. Logistic regression produced a reasonable AUC, suggesting a strong linear relationship between predictors and churn probability, but it lacked the flexibility to capture more complex interactions. The decision tree model, while easy to interpret, was prone to over fitting and demonstrated slightly lower AUC.

Confusion matrices were analyzed for each model. The logistic regression model showed relatively balanced performance but had a moderate number of false positives, meaning that it occasionally predicted churn when the customer actually stayed. The decision tree model had higher false negatives, missing more customers who ended up churning, which could be costly for the business. The random forest had the best balance, reducing both false positives and false negatives, suggesting it would be the most reliable model for Regork Telecom to act upon.

Given the strong AUC and favorable confusion matrix, we selected the random forest model as the “optimal” model. The random forest model captures complex patterns and interactions among features, and it generalizes well without over fitting. Overall, we conclude this is a good model because it achieves a high AUC, maintains strong predictive power across validation folds, and offers business-relevant predictions.

Feature importance from the random forest model highlighted several influential variables. The most important predictors included MonthlyCharges, InternetService, and TotalCharges. Logically, higher monthly charges could indicate more expensive plans, which might increase dissatisfaction and risk of churn. Internet service type also plays a role; customers with poor or expensive internet options might be more likely to leave. TotalCharges reflects the customer’s long-term financial relationship with the company and can signal loyalty or dissatisfaction.

Regork Telecom could leverage this insight by focusing retention efforts on customers with high monthly charges or those subscribed to certain internet services. Offering discounts, loyalty rewards, or targeted service improvements for these at-risk groups could substantially reduce churn.

Finally, we computed the generalization error using the test set. The test set AUC was very close to the cross-validation AUC observed earlier, indicating that the model’s performance is stable and does not suffer from significant overfitting. As a person responsible for making business decisions, this stability inspires confidence that model-driven actions based on these predictions will hold up in real-world customer interactions. Moreover, the insights into feature importance provide strategic opportunities for proactive customer management and revenue preservation.

3. Business Analysis & Conclusion

# Predict customers likely to leave
leavers <- test_preds %>% filter(pred == "Left")

# Estimated monthly revenue loss
sum(leavers$.pred_Current * test$MonthlyCharges)

## Warning in leavers$.pred_Current * test$MonthlyCharges: longer object length is
## not a multiple of shorter object length

## [1] 59275.71

#probability of leaving
test_preds <- test_preds %>%
  mutate(prob_left = 1 - .pred_Current) %>%
  bind_cols(test %>% select(MonthlyCharges))

#our threshold for if they are at risk
threshold <- 0.50

at_risk <- test_preds %>%
  filter(prob_left >= threshold)

monthly_loss <- sum(at_risk$MonthlyCharges, na.rm = TRUE)

#cost benefit
discount_per_month <- 10 #dollar discount
duration_months <- 6 #months
expected_uplift <- 0.30 #30% at risk customers stay due to the $10 discount, adjust accordingly

cost <- nrow(at_risk) * discount_per_month * duration_months
benefit <- monthly_loss * expected_uplift * duration_months
net_gain <- benefit - cost

tibble(
  at_risk_customers = nrow(at_risk),
  monthly_revenue_at_risk = round(monthly_loss, 2),
  six_month_discount_cost = round(cost, 2),
  six_month_retained_revenue = round(benefit, 2),
  six_month_net_gain = round(net_gain, 2)
) %>%
  knitr::kable()

at_risk_customers	monthly_revenue_at_risk	six_month_discount_cost	six_month_retained_revenue	six_month_net_gain
323	25575.15	19380	46035.27	26655.27

Predictors We have the top 3 most important predictors being total charges, whether or not they have fiber optic internet service, and monthly charges. We feel a strong need to reduce cost for our customers, especially for fiber optic customers.

Lost Customers We predict that we will lose $59,275.71 per month based on our optimal model

Incentive Plan Proposal: Offer a $10 monthly discount to predicted leavers or find a way to cut costs. We did a basic exercise above that shows that we could save make north of $26,000 more in a 6 month window with the $10 discount for customers that we estimate to have a 50% chance or greater of leaving and assuming it keeps 30% of them. In the meantime, we should look in to how profitable it is long term to cut charges to increase loyalty in our brand. This short term plan should be able to buy us back more time and money to win our customers over until we can find a more efficient way to cut costs and thus reduce churn.

Conclusion: This analysis shows that InternetService, MonthlyCharges, and TotalCharges are strong indicators of churn. The random forest model performed best (based on AUC). Retention campaigns can be targeted using these insights to improve customer retention and reduce revenue loss. Total Charges helps show that we may have to look at if our prices are truly fair and in line with other companies, especially since our model shows it is the most important variable by far. Since fiber optic is also a big predictor variable, we should look at how our competitors are pricing theirs and find a way to better compete and build up market share. A customer having fiber optic should not be a death sentence for retaining them as a customer. This is an essential business problem that must be fixed in the near future. The more we lose to this, means the less we will gain with our other ventures in this industry.

Data Mining Final Project

Logan Bacher and Peter Sutyak

2025-04-27

Introduction

1. Data Preparation & Exploratory Data Analysis

2. Machine Learning

3. Business Analysis & Conclusion