1. Introduction

1.1 Business Problem

Customer churn is a major concern in the telecommunications industry because retaining an existing customer is generally more cost-effective than acquiring a new one. Regork Telecom therefore wants to predict which customers are likely to leave so that it can intervene before they do and protect revenue.

1.2 Analytic Methodology

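I used historical customer data to train three classifiers: logistic regression, a decision tree, and a random forest. Each model was fit on a 70% training split with 5-fold cross-validation, with caret selecting tuning parameters (where a model has them) by cross-validated ROC. I then compared the models by AUC on the held-out 30% test set to select the best one.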

1.3 Proposed Solution

Based on the predictive modeling results, I propose a targeted incentive program aimed at customers the model identifies as likely to churn. This approach is expected to improve customer retention and preserve revenue that would otherwise be lost.

2. Data Preparation and Exploratory Data Analysis

2.1 Cleaning the Data

library(tidyverse)
library(caret)
library(randomForest)
library(rpart)
library(rpart.plot)
library(pROC)
library(janitor)
library(rsample)
data <- read_csv("customer_retention.csv") %>%
  clean_names() %>%
  drop_na(total_charges) %>%
  mutate(
    gender = as.factor(gender),
    partner = as.factor(partner),
    dependents = as.factor(dependents),
    phone_service = as.factor(phone_service),
    multiple_lines = as.factor(multiple_lines),
    internet_service = as.factor(internet_service),
    online_security = as.factor(online_security),
    online_backup = as.factor(online_backup),
    device_protection = as.factor(device_protection),
    tech_support = as.factor(tech_support),
    streaming_tv = as.factor(streaming_tv),
    streaming_movies = as.factor(streaming_movies),
    contract = as.factor(contract),
    paperless_billing = as.factor(paperless_billing),
    payment_method = as.factor(payment_method),
    status = as.factor(status)
  )
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.2 Baseline Churn Rate

data %>%
  count(status) %>%
  mutate(percent = n / sum(n) * 100)

2.3 Visualizations and Observations

2.3.1 Churn by Contract Type

ggplot(data, aes(x=contract, fill=status)) +
  geom_bar(position="fill", color="black") +
  scale_fill_manual(values=c("Current"="lightblue", "Left"="tomato")) +
  labs(
    title = "Customer Churn by Contract Type",
    subtitle = "Month-to-month customers churn significantly more",
    x = "Contract Type",
    y = "Proportion of Customers",
    fill = "Customer Status"
  ) +
  theme_minimal(base_size = 14)

2.3.2 Churn by Payment Method

ggplot(data, aes(x=payment_method, fill=status)) +
  geom_bar(position="fill", color="black") +
  scale_fill_manual(values=c("Current"="lightblue", "Left"="tomato")) +
  labs(
    title = "Churn Rate by Payment Method",
    subtitle = "Electronic check customers churn at higher rates",
    x = "Payment Method",
    y = "Proportion of Customers",
    fill = "Customer Status"
  ) +
  coord_flip() +
  theme_minimal(base_size = 14)

3. Machine Learning Modeling

3.1 Train-Test Split

set.seed(123)
split <- initial_split(data, prop = 0.7, strata = status)
train <- training(split)
test <- testing(split)
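
As a quick sanity check on a stratified split like the one above, the churn share should be nearly identical in both partitions. The sketch below illustrates this on a small simulated data frame (`toy`, a hypothetical stand-in for the customer data), so the 27% churn share is an illustrative assumption and not a result from customer_retention.csv.

```r
library(rsample)

set.seed(123)

# Hypothetical stand-in for the customer data: 1,000 rows, 27% "Left".
toy <- data.frame(
  status = factor(rep(c("Current", "Left"), times = c(730, 270))),
  tenure = sample(1:72, 1000, replace = TRUE)
)

# Same split settings as the report: 70/30, stratified on the outcome.
toy_split <- initial_split(toy, prop = 0.7, strata = status)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of churners ("Left") in each partition; stratification should
# keep both close to the overall share.
churn_share <- sapply(
  list(train = toy_train, test = toy_test),
  function(d) mean(d$status == "Left")
)
round(churn_share, 3)
```

On the report's own objects, the same check would be `mean(train$status == "Left")` and `mean(test$status == "Left")`.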

3.2 Model 1: Logistic Regression

logit_model <- train(
  status ~ ., 
  data = train, 
  method = "glm", 
  family = "binomial",
  trControl = trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary),
  metric = "ROC"
)

3.3 Model 2: Decision Tree

tree_model <- train(
  status ~ ., 
  data = train, 
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary),
  metric = "ROC"
)

3.4 Model 3: Random Forest

rf_model <- train(
  status ~ ., 
  data = train, 
  method = "rf",
  trControl = trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary),
  metric = "ROC",
  importance = TRUE
)
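
The call above leaves the tuning grid to caret's defaults. If explicit control over the random forest's `mtry` parameter is wanted, a grid can be supplied through `tuneGrid`. The sketch below runs on simulated data (`toy`, with hypothetical predictors `x1` to `x4`); the grid values and `ntree = 100` are illustrative assumptions, not the settings behind the reported results.

```r
library(caret)
library(randomForest)

set.seed(123)

# Simulated stand-in data: four numeric predictors, churn driven by x1 and x2.
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
p_left <- plogis(-1 + 1.5 * toy$x1 - 1.0 * toy$x2)
toy$status <- factor(ifelse(runif(n) < p_left, "Left", "Current"),
                     levels = c("Current", "Left"))

# Same resampling scheme as the report: 5-fold CV scored by ROC AUC.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Candidate values for mtry (number of predictors tried at each split).
rf_grid <- expand.grid(mtry = c(1, 2, 4))

rf_tuned <- train(
  status ~ .,
  data = toy,
  method = "rf",
  trControl = ctrl,
  metric = "ROC",
  tuneGrid = rf_grid,
  ntree = 100
)

# Cross-validated ROC, sensitivity, and specificity for each mtry value.
rf_tuned$results[, c("mtry", "ROC", "Sens", "Spec")]

# The mtry value caret selected.
rf_tuned$bestTune
```

Defining the `trainControl` object once, as `ctrl` does here, would also let the three models share identical resampling settings.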

3.5 Model Comparison (AUC)

# Select the churn ("Left") probability by name rather than by column position.
logit_preds <- predict(logit_model, test, type = "prob")[, "Left"]
tree_preds <- predict(tree_model, test, type = "prob")[, "Left"]
rf_preds <- predict(rf_model, test, type = "prob")[, "Left"]

roc_logit <- roc(test$status, logit_preds)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
roc_tree <- roc(test$status, tree_preds)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
roc_rf <- roc(test$status, rf_preds)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc_logit <- auc(roc_logit)
auc_tree <- auc(roc_tree)
auc_rf <- auc(roc_rf)

auc_logit
## Area under the curve: 0.8447
auc_tree
## Area under the curve: 0.7227
auc_rf
## Area under the curve: 0.8355
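
Model                 Test AUC
Logistic Regression   0.8447
Decision Tree         0.7227
Random Forest         0.8355

Logistic regression achieves the highest test AUC, narrowly ahead of the random forest; the single decision tree trails both.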
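
Because the logistic regression and random forest AUCs are close, it may be worth checking whether the gap is larger than sampling noise. pROC provides `ci.auc()` and `roc.test()` for this. The sketch below uses simulated scores (`score_a` and `score_b`, both hypothetical) in place of the real test-set predictions, so its numbers say nothing about the models above.

```r
library(pROC)

set.seed(123)

# Simulated test set: 500 customers, roughly 27% churners.
n <- 500
status <- factor(ifelse(runif(n) < 0.27, "Left", "Current"),
                 levels = c("Current", "Left"))
is_left <- as.numeric(status == "Left")

# Two hypothetical model scores; higher means more likely to churn.
score_a <- is_left * 1.2 + rnorm(n)
score_b <- is_left * 1.0 + rnorm(n)

roc_a <- roc(status, score_a, levels = c("Current", "Left"),
             direction = "<", quiet = TRUE)
roc_b <- roc(status, score_b, levels = c("Current", "Left"),
             direction = "<", quiet = TRUE)

# 95% confidence interval for each AUC (DeLong method by default).
ci_a <- ci.auc(roc_a)
ci_b <- ci.auc(roc_b)
ci_a
ci_b

# Paired DeLong test: both curves are built on the same customers.
auc_test <- roc.test(roc_a, roc_b, method = "delong", paired = TRUE)
auc_test$p.value
```

With the report's objects, the equivalent calls would be `ci.auc(roc_logit)`, `ci.auc(roc_rf)`, and `roc.test(roc_logit, roc_rf)`.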

3.6 Feature Importance (Random Forest)

varImpPlot(rf_model$finalModel,
           main = "Top 15 Important Features for Predicting Churn",
           n.var = 15)

4. Business Analysis and Conclusion

4.1 Key Factors to Focus On

  • Tenure
  • Contract Type
  • Payment Method

4.2 Estimated Revenue Loss

Without intervention, churn could result in a monthly revenue loss of approximately $22,750.

4.3 Incentive Plan Proposal

Offer at-risk customers a $100 bill credit in exchange for committing to a 1-year contract.
  • One-time cost: approximately $35,000
  • Revenue preserved: approximately $273,000 annually ($22,750 per month × 12)
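
The figures in this plan can be cross-checked with simple arithmetic. The sketch below derives the number of targeted customers and their average monthly bill from the stated totals. Neither value is given in the report, so both are implied assumptions, and the uptake rates are hypothetical. It also assumes the credit is paid only to customers who accept the offer.

```r
# Figures stated in the report.
monthly_loss <- 22750   # monthly revenue at risk ($)
credit       <- 100     # bill credit per customer ($)
total_credit <- 35000   # one-time program cost ($)

# Implied quantities (assumptions, not stated in the report).
n_targeted  <- total_credit / credit       # customers offered the credit
avg_monthly <- monthly_loss / n_targeted   # average monthly bill per customer

annual_revenue   <- monthly_loss * 12              # preserved if every customer stays
net_benefit      <- annual_revenue - total_credit  # first-year net at full uptake
breakeven_months <- credit / avg_monthly           # months of billing to repay one credit

# Hypothetical uptake rates: share of targeted customers who accept and stay.
uptake <- c(0.25, 0.50, 1.00)
scenarios <- data.frame(
  uptake            = uptake,
  cost              = uptake * total_credit,
  revenue_preserved = uptake * annual_revenue
)
scenarios$net_benefit <- scenarios$revenue_preserved - scenarios$cost
scenarios
```

Under these assumptions the stated totals imply 350 targeted customers at an average bill of $65 per month, so each credit is recovered after roughly a month and a half of retained billing.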

4.4 Limitations and Future Work

This analysis uses only the account, service, and billing data available in the dataset.
Future work could integrate satisfaction surveys, customer service interactions, or competitor analysis to improve the predictions.

4.5 Final Recommendation

I recommend immediate implementation of targeted retention incentives.
Focusing on high-risk groups such as month-to-month and new customers will maximize return on investment and reduce revenue loss.