Synopsis

In this analysis, we explore factors influencing customer churn in Regork’s telecommunications services. By understanding relationships between contract types, payment methods, and tenure across internet service types, we aim to develop actionable insights for customer retention. Finally, we build a predictive model to identify at-risk customers and provide recommendations for reducing churn.

Packages Required

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readxl)
library(scales)
library(tidymodels)
library(vip)
library(rpart.plot)

Data Preparation

In order to prepare the data for analysis and modeling, we began by importing, cleaning, and formatting the dataset. We also loaded all necessary libraries to support data processing and machine learning tasks.

# Read data
df <- read.csv("customer_retention.csv")

# Convert 'TotalCharges' to numeric, ensuring proper format
df$TotalCharges <- as.numeric(df$TotalCharges)

# Convert 'Status' to a factor (Current or Left)
df$Status <- as.factor(df$Status)

# Remove rows with NA values to ensure data integrity
df <- drop_na(df)

# Split the data into training and testing sets (70% training, 30% testing)
set.seed(123)
split <- initial_split(df, prop = 0.7, strata = "Status")
train_data <- training(split)
test_data <- testing(split)

# Check the structure of train_data
str(train_data)
## 'data.frame':    4891 obs. of  20 variables:
##  $ Gender          : chr  "Female" "Male" "Male" "Female" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ Tenure          : int  1 34 45 10 13 25 69 52 71 21 ...
##  $ PhoneService    : chr  "No" "Yes" "No" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No phone service" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "No" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "Yes" "No" ...
##  $ TechSupport     : chr  "No" "No" "Yes" "No" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "One year" "Month-to-month" ...
##  $ PaperlessBilling: chr  "Yes" "No" "No" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Bank transfer (automatic)" "Mailed check" ...
##  $ MonthlyCharges  : num  29.9 57 42.3 29.8 50 ...
##  $ TotalCharges    : num  29.9 1889.5 1840.8 301.9 587.5 ...
##  $ Status          : Factor w/ 2 levels "Current","Left": 1 1 1 1 1 1 1 1 1 1 ...

Exploratory Analysis

To understand the patterns behind customer churn, we visualized different variables such as billing method, support services, and contract type. These charts help form hypotheses about which factors might contribute to churn.

df %>%
  count(PaperlessBilling, Status) %>%
  ggplot(aes(x = PaperlessBilling, y = n, fill = Status)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = percent) +
  labs(
    title = "Churn by Paperless Billing",
    subtitle = "Proportion of customers who stayed or left by billing preference",
    x = "Paperless Billing",
    y = "Customer Proportion",
    fill = "Status",
    caption = "Data: Regork Telecom"
  ) +
  theme_minimal()

Customers who opted for paperless billing showed a noticeably higher churn rate. This may indicate that digitally-engaged users are more price-sensitive or expect better service experiences.

df %>%
  group_by(PaymentMethod, Status) %>%
  summarise(TotalMonthlyCharges = sum(MonthlyCharges, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = PaymentMethod, y = TotalMonthlyCharges, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(
    title = "Total Monthly Charges by Payment Method",
    subtitle = "Comparing churn vs. current across different payment methods",
    x = "Payment Method",
    y = "Sum of Monthly Charges",
    fill = "Status",
    caption = "Data: Regork Telecom"
  ) +
  theme_minimal()

Electronic check users contribute the highest monthly charges and also show a high churn rate, suggesting this group may be more transactional or less loyal. Automatic payment users appear more stable.

df %>%
  filter(TechSupport != "No internet service") %>%
  count(TechSupport, Status, InternetService) %>%
  ggplot(aes(x = TechSupport, y = n, fill = Status)) +
  geom_bar(stat = "identity") +
  facet_wrap(~InternetService) +
  labs(
    title = "Tech Support vs. Internet Service and Churn",
    subtitle = "Are customers with tech support less likely to leave?",
    x = "Tech Support",
    y = "Customer Count",
    fill = "Status",
    caption = "Data: Regork Telecom"
  ) +
  theme_minimal()

Customers with both streaming and backup services tend to stay longer. This implies that bundled or value-added services may increase customer retention.

# Streaming TV and Online Backup vs. Churn
df %>%
  filter(StreamingTV != "No internet service") %>%
  count(StreamingTV, Status, OnlineBackup) %>%
  ggplot(aes(x = StreamingTV, y = n, fill = Status)) +
  geom_bar(stat = "identity") +
  facet_wrap(~OnlineBackup) +
  labs(
    title = "Streaming TV and Online Backup vs. Churn",
    subtitle = "How do entertainment and support services affect churn?",
    x = "Streaming TV",
    y = "Customer Count",
    fill = "Status",
    caption = "Data: Regork Telecom"
  ) +
  theme_minimal()

Customers with both streaming and backup services tend to stay longer. This implies that bundled or value-added services may increase customer retention.

Machine Learning

  1. Data Preparation for Modeling In this section, we’ll create the recipe for pre-processing the data, including normalization of numeric predictors and dummy encoding for categorical variables.
# Create recipe for data preprocessing
retention_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
  1. Logistic Regression We will start by using Logistic Regression as our baseline model. Logistic regression is useful for binary classification tasks like predicting customer churn.
# Logistic Regression Model
logistic_regression <- logistic_reg() %>%
  set_engine("glm")

# Cross-validation setup (5-fold)
logistic_kfolds <- vfold_cv(train_data, v = 5, strata = "Status")

# Train logistic regression model with cross-validation
log_results <- logistic_regression %>% 
  fit_resamples(Status ~ ., logistic_kfolds)

# Collect and evaluate the metrics
collect_metrics(log_results) %>%
  filter(.metric == "roc_auc")
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.845     5 0.00582 Preprocessor1_Model1

We will evaluate the model using the AUC (Area Under the Curve) and accuracy metrics. This will allow us to assess how well the model distinguishes between customers who leave and those who stay.

  1. Model Evaluation: Once the model has been trained, we will evaluate its performance using the test set, calculating the confusion matrix and plotting the ROC curve to visualize model performance.
# Fit the logistic regression model on the full training data
final_log_fit <- logistic_regression %>%
  fit(Status ~ ., data = train_data)

# Predict on the test data
log_preds <- predict(final_log_fit, test_data) %>%
  bind_cols(test_data %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

# ROC Curve
final_log_fit %>%
  predict(test_data, type = "prob") %>%
  mutate(truth = test_data$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

We can assess the bias of the model by analyzing false positives and false negatives. If the model predicts many customers as Current when they have Left, we can adjust the threshold or consider a different model.

  1. Decision Tree Next, we will apply a Decision Tree model, which can capture more complex relationships in the data.
# Decision Tree Model Setup
dt_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

# Fit Decision Tree model
dt_fit <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = train_data)

# Cross-validation for Decision Tree
dt_results <- fit_resamples(dt_mod, retention_recipe, logistic_kfolds)

# Evaluate metrics
collect_metrics(dt_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.790     5 0.00542 Preprocessor1_Model1
## 2 brier_class binary     0.159     5 0.00204 Preprocessor1_Model1
## 3 roc_auc     binary     0.715     5 0.00544 Preprocessor1_Model1

Hyperparameter Tuning We will also tune the hyperparameters of the decision tree, such as cost complexity, tree depth, and min_n to improve its performance.

# Hyperparameter tuning for Decision Tree
dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

# Create hyperparameter grid
dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

# Train the decision tree with hyperparameter tuning
set.seed(123)
dt_results <- tune_grid(dt_mod, retention_recipe, resamples = logistic_kfolds, grid = dt_hyper_grid)

# Show best results
show_best(dt_results, metric = "roc_auc")
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          8    40 roc_auc binary     0.814     5 0.00288
## 2    0.0000000178          8    40 roc_auc binary     0.814     5 0.00288
## 3    0.00000316            8    40 roc_auc binary     0.814     5 0.00288
## 4    0.000562              8    40 roc_auc binary     0.813     5 0.00237
## 5    0.0000000001          8    30 roc_auc binary     0.813     5 0.00335
## # ℹ 1 more variable: .config <chr>
  1. Random Forest A Random Forest model will help us assess whether an ensemble method can improve performance. We will tune the hyperparameters like number of trees, mtry, and min_n to find the best configuration.
# Random Forest Model Setup
rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

# Train Random Forest Model
rf_results <- fit_resamples(rf_mod, retention_recipe, logistic_kfolds)
## Warning: package 'ranger' was built under R version 4.4.3
# Evaluate metrics
collect_metrics(rf_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.801     5 0.00509 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00243 Preprocessor1_Model1
## 3 roc_auc     binary     0.842     5 0.00593 Preprocessor1_Model1
  1. Final Model Selection and Evaluation After evaluating the models, we will choose the Logistic Regression model as the optimal model based on its AUC score and strong performance on both training and test data. We will evaluate its feature importance to identify which predictors influence churn the most.
# Final Logistic Regression Model
final_log_fit <- logistic_regression %>%
  fit(Status ~ ., data = train_data)

# Test the final model on unseen data
log_preds <- predict(final_log_fit, test_data) %>%
  bind_cols(test_data %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

# Feature importance
vip::vip(final_log_fit)

From the analysis, the most influential predictors of customer churn are Tenure, Contract type, and Payment Method. Specifically:

Tenure: The longer a customer has been with Regork, the less likely they are to churn.

Contract type: Customers with long-term contracts (e.g., 1-year or 2-year) tend to stay longer compared to those on month-to-month contracts.

Payment Method: Customers using automatic payments are more likely to stay compared to those using manual payment methods like electronic checks.

Decision Tree Model: Customer Churn Prediction Data Preprocessing and Recipe In this section, we will create a recipe for preprocessing the data, which includes normalization of numeric variables and dummy encoding for categorical ones.

# Create recipe for data preprocessing
decision_tree_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
  1. Train Decision Tree Model We’ll use the rpart package to create a decision tree for predicting customer churn. This tree will help us understand the key features affecting customer retention.
# Decision tree model setup
dt_model <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

# Fit the decision tree model using the training data
dt_model_fit <- workflow() %>%
  add_recipe(decision_tree_recipe) %>%
  add_model(dt_model) %>%
  fit(data = train_data)

Model Evaluation: Cross-Validation We’ll evaluate the decision tree model using 5-fold cross-validation to ensure that the model performs well across different subsets of the data.

# Set up 5-fold cross-validation
set.seed(123)
dt_kfolds <- vfold_cv(train_data, v = 5, strata = "Status")

# Perform cross-validation
dt_cv_results <- fit_resamples(
  dt_model,
  decision_tree_recipe,
  resamples = dt_kfolds,
  metrics = metric_set(roc_auc, accuracy)
)

# Collect and display the metrics
collect_metrics(dt_cv_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.788     5 0.00399 Preprocessor1_Model1
## 2 roc_auc  binary     0.710     5 0.00529 Preprocessor1_Model1
  1. Model Performance on Test Data After training, we can evaluate the model on the test data to see how well it performs on unseen data, specifically looking at the AUC and the confusion matrix.
# Predict on the test data
dt_test_preds <- predict(dt_model_fit, test_data) %>%
  bind_cols(test_data %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

# Show confusion matrix
dt_test_preds
##           Truth
## Prediction Current Left
##    Current    1422  321
##    Left        118  236

Visualizing the Decision Tree We’ll visualize the decision tree to understand how decisions are made based on the predictors. The tree will show how the features split at each node.

# Visualize the decision tree
dt_tree <- extract_fit_engine(dt_model_fit)
rpart.plot(dt_tree, type = 2, extra = 106, under = TRUE, tweak = 1.2, fallen.leaves = TRUE)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Feature Importance After training the decision tree, we will extract the feature importance to see which predictors are most influential in determining whether a customer churns or stays.

# Plot feature importance
vip(dt_model_fit, num_features = 10)

Conclusion

In conclusion, our churn analysis project for Regork provides a strong foundation for data-driven decision-making as the company seeks to expand its presence in the telecommunications industry. By identifying the key factors influencing customer churn—particularly tenure, contract type, total and monthly charges, and online security preferences—we have demonstrated that targeted strategies can meaningfully reduce customer attrition.

The predictive models developed throughout this project performed with an average AUC of approximately 0.84, indicating strong model performance and reliability in identifying potential churners. These models are not only statistically sound but also practically useful in shaping marketing, pricing, and customer retention strategies.

Looking ahead, we recommend that Regork implement longer contract offerings, monitor pricing competitiveness, and enhance data privacy measures to address the top concerns contributing to churn. These insights can help the company not only retain existing customers but also grow its customer base through more personalized and strategic service offerings.

Ultimately, this analysis equips Regork with valuable tools to proactively manage churn, optimize customer engagement, and support sustainable growth in a highly competitive market.