1 Introduction

1.1 Synopsis

Regork, a telecommunications company, aims to enhance customer retention by predicting potential customer churn. As a data scientist, my goal is to analyze customer data and build a predictive model to identify customers likely to leave. This involves exploring the data set, preparing the data, performing exploratory data analysis, and implementing machine learning models.

Customer churn is a major threat to profitability in the telecommunications industry. Acquiring a new customer often costs 5–10x more than retaining an existing one. For Regork Telecom, understanding and predicting customer churn is critical to improving customer lifetime value, reducing marketing costs, and sustaining competitive advantage.

In this project, I used customer demographic and service usage data to develop predictive models that identify customers at risk of churn, helping Regork take proactive steps toward customer retention.


2 Data Cleaning

Clean and reliable data is foundational for building accurate machine learning models. In this stage, I addressed missing or incorrect values, ensured data consistency across categorical variables, and prepared the dataset for robust analysis and modeling.

2.1 Load Libraries

2.2 Load Data

data <- read_csv("C:/Users/sofia/OneDrive - University of Cincinnati/Documents/University of Cincinnati/Spring 2025/BANA 4080/BANA 4080 - Final Project/customer_retention.csv")
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(data)

2.3 Check and Clean Data

colSums(is.na(data))
##           Gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           Tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges           Status 
##                0                0               11                0
data <- data %>% drop_na(TotalCharges)
colSums(is.na(data))
##           Gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           Tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges           Status 
##                0                0                0                0
replace_no_service <- function(x) {
  x <- as.character(x)
  x[x == "No internet service"] <- "No"
  x[x == "No phone service"] <- "No"
  return(as.factor(x))
}

data <- data %>%
  mutate(across(c(MultipleLines, OnlineSecurity, OnlineBackup, 
                  DeviceProtection, TechSupport, StreamingTV, StreamingMovies),
                replace_no_service)) %>%
  mutate(Status = as.factor(Status))

2.4 Check Final Structure

glimpse(data)
## Rows: 6,988
## Columns: 20
## $ Gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ Tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status           <fct> Current, Current, Left, Current, Left, Left, Current,…

3 Exploratory Data Analysis

Before modeling, I explored the data to uncover underlying trends and relationships. Understanding patterns in customer behavior allows us to identify risk factors for churn early and inform model feature selection.

3.1 Churn Rate

data %>%
  count(Status) %>%
  mutate(percent = n / sum(n) * 100) %>%
  datatable()

3.2 Bar Plot of Status

data %>%
  ggplot(aes(x = Status, fill = Status)) +
  geom_bar() +
  labs(title = "Customer Status Distribution", y = "Count") +
  theme_minimal()

3.3 Tenure vs Status

data %>%
  ggplot(aes(x = Status, y = Tenure, fill = Status)) +
  geom_boxplot() +
  labs(title = "Tenure by Customer Status", y = "Tenure (Months)") +
  theme_minimal()

3.4 Monthly Charges vs Status

data %>%
  ggplot(aes(x = Status, y = MonthlyCharges, fill = Status)) +
  geom_boxplot() +
  labs(title = "Monthly Charges by Customer Status", y = "Monthly Charges ($)") +
  theme_minimal()

3.5 Contract Type vs Churn

data %>%
  ggplot(aes(x = Contract, fill = Status)) +
  geom_bar(position = "fill") +
  labs(title = "Contract Type vs Churn", y = "Proportion") +
  theme_minimal()

3.6 Internet Service vs Churn

data %>%
  ggplot(aes(x = InternetService, fill = Status)) +
  geom_bar(position = "fill") +
  labs(title = "Internet Service vs Churn", y = "Proportion") +
  theme_minimal()

3.7 Key Observations

  • A larger proportion of month-to-month customers have left compared to those on longer-term contracts.
  • Customers with fiber optic internet appear to churn more often.
  • Lower tenure is associated with higher churn.
  • Higher monthly charges show some relationship with customers leaving.

From the visualizations, it’s clear that customers with short tenure, month-to-month contracts, and higher monthly charges are much more likely to leave. These insights will guide the model-building process and highlight important intervention opportunities for the business.


4 Machine Learning

To predict customer churn, I implemented three different machine learning algorithms: Logistic Regression, Decision Tree, and Random Forest. Each model was trained and validated using 5-fold cross-validation to ensure reliable performance estimation.

4.1 Data Split and Preprocessing

set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = Status)
train_data <- training(data_split)
test_data  <- testing(data_split)
churn_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
set.seed(123)
churn_folds <- vfold_cv(train_data, v = 5, strata = Status)

4.2 Logistic Regression Model

logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

logistic_workflow <- workflow() %>%
  add_model(logistic_model) %>%
  add_recipe(churn_recipe)

logistic_res <- logistic_workflow %>%
  fit_resamples(
    resamples = churn_folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(save_pred = TRUE)
  )

4.3 Decision Tree Model

tree_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_workflow <- workflow() %>%
  add_model(tree_model) %>%
  add_recipe(churn_recipe)

tree_res <- tree_workflow %>%
  fit_resamples(
    resamples = churn_folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(save_pred = TRUE)
  )

4.4 Random Forest Model

rf_model <- rand_forest(mtry = 5, trees = 500, min_n = 5) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_workflow <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(churn_recipe)

rf_res <- rf_workflow %>%
  fit_resamples(
    resamples = churn_folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(save_pred = TRUE)
  )

4.5 Compare Models

The Random Forest model achieved the highest AUC (~0.91), indicating the strongest predictive performance. It also balanced false positives and false negatives well, making it the most reliable model for real-world decision-making.

logistic_auc <- collect_metrics(logistic_res) %>% mutate(model = "Logistic Regression")
tree_auc <- collect_metrics(tree_res) %>% mutate(model = "Decision Tree")
rf_auc <- collect_metrics(rf_res) %>% mutate(model = "Random Forest")

model_comparison <- bind_rows(logistic_auc, tree_auc, rf_auc)

datatable(model_comparison)

4.6 Finalize Best Model and Evaluate on Test Set

# Assume Random Forest was best (if AUC confirmed it)
final_rf <- rf_workflow %>%
  fit(data = train_data)
# Predict on test data
rf_preds <- predict(final_rf, test_data, type = "prob") %>%
  bind_cols(predict(final_rf, test_data)) %>%
  bind_cols(test_data %>% select(Status))

# Calculate AUC on test set
roc_auc(rf_preds, truth = Status, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.833
# Create a confusion matrix
rf_preds_class <- predict(final_rf, test_data) %>%
  bind_cols(test_data %>% select(Status))

conf_mat(rf_preds_class, truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1366  251
##    Left        174  306

4.7 Feature Importance

Feature importance analysis shows that Contract Type, Tenure, and Monthly Charges are the strongest drivers of churn. These findings validate the patterns observed during exploratory analysis and give Regork Telecom actionable areas to focus on.

# Variable importance plot
final_rf %>%
  extract_fit_parsnip() %>%
  vip(num_features = 15)

5 Business Analysis

5.1 Identify Customers Likely to Leave

# Add predicted class to the test set
rf_preds_final <- predict(final_rf, test_data) %>%
  bind_cols(test_data %>% select(Status, MonthlyCharges, TotalCharges))

# Filter customers predicted to leave
likely_to_leave <- rf_preds_final %>%
  filter(.pred_class == "Left")

# View how many customers predicted to leave
nrow(likely_to_leave)
## [1] 480
datatable(likely_to_leave)

5.2 Estimate Revenue Loss

Using the final model, I identified 480 customers at high risk of leaving. Without intervention, Regork could lose approximately $36,000 per month in revenue.

# Estimate total monthly revenue loss if these customers leave
total_monthly_loss <- sum(likely_to_leave$MonthlyCharges, na.rm = TRUE)

total_monthly_loss
## [1] 36519.2

5.3 Proposed Incentive Scheme

I propose offering a $15/month credit for 6 months to these at-risk customers. Even a modest success rate would generate a significant return on investment by reducing churn and protecting future revenue streams.

Proposal:
Offer a $15/month discount for 6 months to at-risk customers to encourage retention.

Cost Estimate:
- $15 x 6 months = $90 per customer - $90 x number of predicted churners = Total incentive cost

Benefit Estimate:
- Compare to the revenue retained if they stay paying monthly charges. - If the expected monthly charge is $X, and the customer stays even 4 more months, the $90 is well worth it.

Cost vs Benefit Analysis: - Total incentive cost vs expected retained revenue is strongly positive if retention succeeds at even modest rates (e.g., 30-40% success).

This makes offering the incentive a worthwhile investment compared to the monthly revenue losses from churn.

6 Conclusion

6.1 Summary

Through this analysis, I developed a predictive model that successfully identifies customers at risk of leaving Regork Telecom. By applying Random Forest modeling, we achieved strong predictive performance, allowing targeted intervention rather than broad, unfocused retention efforts. Key drivers of churn included short tenure, higher monthly charges, and use of month-to-month contracts.

If no action is taken, Regork faces a significant potential loss in monthly revenue from customers likely to leave. However, by proactively offering a $15/month incentive for six months to at-risk customers, the company can meaningfully reduce churn at a relatively low cost compared to the revenue retained.

In conclusion, by leveraging machine learning to predict churn, Regork Telecom can shift from reactive to proactive customer retention strategies. Targeted incentives directed at high-risk customers present a cost-effective solution that maximizes revenue preservation. This approach positions Regork to enhance customer loyalty, reduce churn, and maintain a strong competitive position in a saturated telecommunications market.