In this analysis, we explore factors influencing customer churn in Regork’s telecommunications services. By understanding relationships between contract types, payment methods, and tenure across internet service types, we aim to develop actionable insights for customer retention. Finally, we build a predictive model to identify at-risk customers and provide recommendations for reducing churn.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readxl)
library(scales)
library(tidymodels)
library(vip)
library(rpart.plot)
In order to prepare the data for analysis and modeling, we began by importing, cleaning, and formatting the dataset. We also loaded all necessary libraries to support data processing and machine learning tasks.
# Read data
df <- read.csv("customer_retention.csv")
# Convert 'TotalCharges' to numeric, ensuring proper format
df$TotalCharges <- as.numeric(df$TotalCharges)
# Convert 'Status' to a factor (Current or Left)
df$Status <- as.factor(df$Status)
# Remove rows with NA values to ensure data integrity
df <- drop_na(df)
# Split the data into training and testing sets (70% training, 30% testing)
set.seed(123)
split <- initial_split(df, prop = 0.7, strata = "Status")
train_data <- training(split)
test_data <- testing(split)
# Check the structure of train_data
str(train_data)
## 'data.frame': 4891 obs. of 20 variables:
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ Tenure : int 1 34 45 10 13 25 69 52 71 21 ...
## $ PhoneService : chr "No" "Yes" "No" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No phone service" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "No" "No" ...
## $ DeviceProtection: chr "No" "Yes" "Yes" "No" ...
## $ TechSupport : chr "No" "No" "Yes" "No" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "One year" "Month-to-month" ...
## $ PaperlessBilling: chr "Yes" "No" "No" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Bank transfer (automatic)" "Mailed check" ...
## $ MonthlyCharges : num 29.9 57 42.3 29.8 50 ...
## $ TotalCharges : num 29.9 1889.5 1840.8 301.9 587.5 ...
## $ Status : Factor w/ 2 levels "Current","Left": 1 1 1 1 1 1 1 1 1 1 ...
To understand the patterns behind customer churn, we visualized different variables such as billing method, support services, and contract type. These charts help form hypotheses about which factors might contribute to churn.
df %>%
count(PaperlessBilling, Status) %>%
ggplot(aes(x = PaperlessBilling, y = n, fill = Status)) +
geom_bar(stat = "identity", position = "fill") +
scale_y_continuous(labels = percent) +
labs(
title = "Churn by Paperless Billing",
subtitle = "Proportion of customers who stayed or left by billing preference",
x = "Paperless Billing",
y = "Customer Proportion",
fill = "Status",
caption = "Data: Regork Telecom"
) +
theme_minimal()
Customers who opted for paperless billing showed a noticeably higher
churn rate. This may indicate that digitally-engaged users are more
price-sensitive or expect better service experiences.
df %>%
group_by(PaymentMethod, Status) %>%
summarise(TotalMonthlyCharges = sum(MonthlyCharges, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = PaymentMethod, y = TotalMonthlyCharges, fill = Status)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
labs(
title = "Total Monthly Charges by Payment Method",
subtitle = "Comparing churn vs. current across different payment methods",
x = "Payment Method",
y = "Sum of Monthly Charges",
fill = "Status",
caption = "Data: Regork Telecom"
) +
theme_minimal()
Electronic check users contribute the highest monthly charges and also
show a high churn rate, suggesting this group may be more transactional
or less loyal. Automatic payment users appear more stable.
df %>%
filter(TechSupport != "No internet service") %>%
count(TechSupport, Status, InternetService) %>%
ggplot(aes(x = TechSupport, y = n, fill = Status)) +
geom_bar(stat = "identity") +
facet_wrap(~InternetService) +
labs(
title = "Tech Support vs. Internet Service and Churn",
subtitle = "Are customers with tech support less likely to leave?",
x = "Tech Support",
y = "Customer Count",
fill = "Status",
caption = "Data: Regork Telecom"
) +
theme_minimal()
Customers with both streaming and backup services tend to stay longer.
This implies that bundled or value-added services may increase customer
retention.
# Streaming TV and Online Backup vs. Churn
df %>%
filter(StreamingTV != "No internet service") %>%
count(StreamingTV, Status, OnlineBackup) %>%
ggplot(aes(x = StreamingTV, y = n, fill = Status)) +
geom_bar(stat = "identity") +
facet_wrap(~OnlineBackup) +
labs(
title = "Streaming TV and Online Backup vs. Churn",
subtitle = "How do entertainment and support services affect churn?",
x = "Streaming TV",
y = "Customer Count",
fill = "Status",
caption = "Data: Regork Telecom"
) +
theme_minimal()
Customers with both streaming and backup services tend to stay longer. This implies that bundled or value-added services may increase customer retention.
# Create recipe for data preprocessing
retention_recipe <- recipe(Status ~ ., data = train_data) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
# Logistic Regression Model
logistic_regression <- logistic_reg() %>%
set_engine("glm")
# Cross-validation setup (5-fold)
logistic_kfolds <- vfold_cv(train_data, v = 5, strata = "Status")
# Train logistic regression model with cross-validation
log_results <- logistic_regression %>%
fit_resamples(Status ~ ., logistic_kfolds)
# Collect and evaluate the metrics
collect_metrics(log_results) %>%
filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.845 5 0.00582 Preprocessor1_Model1
We will evaluate the model using the AUC (Area Under the Curve) and accuracy metrics. This will allow us to assess how well the model distinguishes between customers who leave and those who stay.
# Fit the logistic regression model on the full training data
final_log_fit <- logistic_regression %>%
fit(Status ~ ., data = train_data)
# Predict on the test data
log_preds <- predict(final_log_fit, test_data) %>%
bind_cols(test_data %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
# ROC Curve
final_log_fit %>%
predict(test_data, type = "prob") %>%
mutate(truth = test_data$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
We can assess the bias of the model by analyzing false positives and
false negatives. If the model predicts many customers as Current when
they have Left, we can adjust the threshold or consider a different
model.
# Decision Tree Model Setup
dt_mod <- decision_tree(mode = "classification") %>%
set_engine("rpart")
# Fit Decision Tree model
dt_fit <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(dt_mod) %>%
fit(data = train_data)
# Cross-validation for Decision Tree
dt_results <- fit_resamples(dt_mod, retention_recipe, logistic_kfolds)
# Evaluate metrics
collect_metrics(dt_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.790 5 0.00542 Preprocessor1_Model1
## 2 brier_class binary 0.159 5 0.00204 Preprocessor1_Model1
## 3 roc_auc binary 0.715 5 0.00544 Preprocessor1_Model1
Hyperparameter Tuning We will also tune the hyperparameters of the decision tree, such as cost complexity, tree depth, and min_n to improve its performance.
# Hyperparameter tuning for Decision Tree
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# Create hyperparameter grid
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
# Train the decision tree with hyperparameter tuning
set.seed(123)
dt_results <- tune_grid(dt_mod, retention_recipe, resamples = logistic_kfolds, grid = dt_hyper_grid)
# Show best results
show_best(dt_results, metric = "roc_auc")
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 8 40 roc_auc binary 0.814 5 0.00288
## 2 0.0000000178 8 40 roc_auc binary 0.814 5 0.00288
## 3 0.00000316 8 40 roc_auc binary 0.814 5 0.00288
## 4 0.000562 8 40 roc_auc binary 0.813 5 0.00237
## 5 0.0000000001 8 30 roc_auc binary 0.813 5 0.00335
## # ℹ 1 more variable: .config <chr>
# Random Forest Model Setup
rf_mod <- rand_forest(mode = "classification") %>%
set_engine("ranger")
# Train Random Forest Model
rf_results <- fit_resamples(rf_mod, retention_recipe, logistic_kfolds)
## Warning: package 'ranger' was built under R version 4.4.3
# Evaluate metrics
collect_metrics(rf_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.801 5 0.00509 Preprocessor1_Model1
## 2 brier_class binary 0.137 5 0.00243 Preprocessor1_Model1
## 3 roc_auc binary 0.842 5 0.00593 Preprocessor1_Model1
# Final Logistic Regression Model
final_log_fit <- logistic_regression %>%
fit(Status ~ ., data = train_data)
# Test the final model on unseen data
log_preds <- predict(final_log_fit, test_data) %>%
bind_cols(test_data %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
# Feature importance
vip::vip(final_log_fit)
From the analysis, the most influential predictors of customer churn are Tenure, Contract type, and Payment Method. Specifically:
Tenure: The longer a customer has been with Regork, the less likely they are to churn.
Contract type: Customers with long-term contracts (e.g., 1-year or 2-year) tend to stay longer compared to those on month-to-month contracts.
Payment Method: Customers using automatic payments are more likely to stay compared to those using manual payment methods like electronic checks.
Decision Tree Model: Customer Churn Prediction Data Preprocessing and Recipe In this section, we will create a recipe for preprocessing the data, which includes normalization of numeric variables and dummy encoding for categorical ones.
# Create recipe for data preprocessing
decision_tree_recipe <- recipe(Status ~ ., data = train_data) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
# Decision tree model setup
dt_model <- decision_tree(mode = "classification") %>%
set_engine("rpart")
# Fit the decision tree model using the training data
dt_model_fit <- workflow() %>%
add_recipe(decision_tree_recipe) %>%
add_model(dt_model) %>%
fit(data = train_data)
Model Evaluation: Cross-Validation We’ll evaluate the decision tree model using 5-fold cross-validation to ensure that the model performs well across different subsets of the data.
# Set up 5-fold cross-validation
set.seed(123)
dt_kfolds <- vfold_cv(train_data, v = 5, strata = "Status")
# Perform cross-validation
dt_cv_results <- fit_resamples(
dt_model,
decision_tree_recipe,
resamples = dt_kfolds,
metrics = metric_set(roc_auc, accuracy)
)
# Collect and display the metrics
collect_metrics(dt_cv_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.788 5 0.00399 Preprocessor1_Model1
## 2 roc_auc binary 0.710 5 0.00529 Preprocessor1_Model1
# Predict on the test data
dt_test_preds <- predict(dt_model_fit, test_data) %>%
bind_cols(test_data %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
# Show confusion matrix
dt_test_preds
## Truth
## Prediction Current Left
## Current 1422 321
## Left 118 236
Visualizing the Decision Tree We’ll visualize the decision tree to understand how decisions are made based on the predictors. The tree will show how the features split at each node.
# Visualize the decision tree
dt_tree <- extract_fit_engine(dt_model_fit)
rpart.plot(dt_tree, type = 2, extra = 106, under = TRUE, tweak = 1.2, fallen.leaves = TRUE)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Feature Importance After training the decision tree, we will extract the feature importance to see which predictors are most influential in determining whether a customer churns or stays.
# Plot feature importance
vip(dt_model_fit, num_features = 10)
In conclusion, our churn analysis project for Regork provides a strong foundation for data-driven decision-making as the company seeks to expand its presence in the telecommunications industry. By identifying the key factors influencing customer churn—particularly tenure, contract type, total and monthly charges, and online security preferences—we have demonstrated that targeted strategies can meaningfully reduce customer attrition.
The predictive models developed throughout this project performed with an average AUC of approximately 0.84, indicating strong model performance and reliability in identifying potential churners. These models are not only statistically sound but also practically useful in shaping marketing, pricing, and customer retention strategies.
Looking ahead, we recommend that Regork implement longer contract offerings, monitor pricing competitiveness, and enhance data privacy measures to address the top concerns contributing to churn. These insights can help the company not only retain existing customers but also grow its customer base through more personalized and strategic service offerings.
Ultimately, this analysis equips Regork with valuable tools to proactively manage churn, optimize customer engagement, and support sustainable growth in a highly competitive market.