Customer retention is a critical priority for Regork Telecom, especially given the high costs associated with acquiring new customers compared to retaining existing ones.
The goal of this project was to analyze customer data and build a predictive model that accurately identifies customers at risk of churning. By doing so, Regork Telecom can proactively target high-risk customers with strategic incentives to improve retention and protect future revenue.
This report outlines the full data analysis process, including exploratory data analysis (EDA), model development and selection, business impact analysis, and a final strategic recommendation based on the predictive model’s insights.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.8 ✔ rsample 1.2.1
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.7 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.3.0 ✔ yardstick 1.3.2
## ✔ recipes 1.1.1
## Warning: package 'broom' was built under R version 4.4.3
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'modeldata' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(rpart)
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(ranger)
## Warning: package 'ranger' was built under R version 4.4.3
Loading essential packages for this projetc. Tidyverse handles data wrangling, ggplot2 handles helps with visualization and tidymodels, rpart, and ranger are all used for machine learning models.
df <- read.csv('customer_retention.csv')
head(df)
## Gender SeniorCitizen Partner Dependents Tenure PhoneService MultipleLines
## 1 Female 0 Yes No 1 No No phone service
## 2 Male 0 No No 34 Yes No
## 3 Male 0 No No 2 Yes No
## 4 Male 0 No No 45 No No phone service
## 5 Female 0 No No 2 Yes No
## 6 Female 0 No No 8 Yes Yes
## InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 1 DSL No Yes No No
## 2 DSL Yes No Yes No
## 3 DSL Yes Yes No No
## 4 DSL Yes No Yes Yes
## 5 Fiber optic No No No No
## 6 Fiber optic No No Yes No
## StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No Month-to-month Yes
## 2 No No One year No
## 3 No No Month-to-month Yes
## 4 No No One year No
## 5 No No Month-to-month Yes
## 6 Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Status
## 1 Electronic check 29.85 29.85 Current
## 2 Mailed check 56.95 1889.50 Current
## 3 Mailed check 53.85 108.15 Left
## 4 Bank transfer (automatic) 42.30 1840.75 Current
## 5 Electronic check 70.70 151.65 Left
## 6 Electronic check 99.65 820.50 Left
Here I am importing the customer retention data set into R. Then I am displaying the first few rows to understand the data set’s structure and confirm a successful import.
# Converting total charges to numerics and removing missing values
df <- df %>%
mutate(TotalCharges = as.numeric(TotalCharges)) %>%
filter(!is.na(TotalCharges))
# Convert status to factor
df <- df %>%
mutate(Status = factor(Status, levels = c("Current", "Left")))
df$Status <- as.factor(df$Status)
# Convert Senior Citizen to Factor
df <- df %>%
mutate(SeniorCitizen = factor(SeniorCitizen, levels = c(0,1),
labels = c("Not Senior", "Senior")))
# Preview
glimpse(df)
## Rows: 6,988
## Columns: 20
## $ Gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <fct> Not Senior, Not Senior, Not Senior, Not Senior, Not S…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ Tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status <fct> Current, Current, Left, Current, Left, Left, Current,…
# Check missing values
colSums(is.na(df))
## Gender SeniorCitizen Partner Dependents
## 0 0 0 0
## Tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Status
## 0 0 0 0
Preparing the data set for modeling by converting TotalCharges to numeric and removing missing values. Then I convert Status and SeniorCitizen columns into factors for appropriate modeling. Then I confirm that all variables have the correct data type and that there are no missing values.
Now that the data is all imported and cleaned, we can now start to visualize to see what factors may play an important role in the model that we will create.
ggplot(data = df, aes(x = Status, fill = Status)) +
geom_bar() +
labs(title = "Customer Churn Distribution",
x = "Customer Status", y = "Count") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
Here I am visualizing the distribution of customer status (Current
vs. Left). This provides a baseline understanding of the churn rate and
class imbalance.
ggplot(df, aes(x = Contract, fill = Status))+
geom_bar(position = "dodge")+
labs(title = "Churn by Contract Type",
x = "Contract Type", y = "Count")+
theme_minimal()+
scale_fill_brewer(palette = "Set1")
This explores how churn rate relates to different contract
type(Month-to-Month, One year, Two year). Contract type is expected to
be a strong indicator of churn behavior.
ggplot(df, aes(x = Status, y = Tenure, fill = Status))+
geom_boxplot()+
labs(title = "Tenure by Churn Status",
x = "Customer Status", y = "Tenure (Months)")+
theme_minimal()+
scale_fill_brewer(palette = "Set3")
The Tenure by Current Status graph compares the length of customer
tenure for those who stayed vs. those who left. Shorter tenures
typically suggest a higher risk for churn.
ggplot(df, aes(x = Status, y = MonthlyCharges, fill = Status))+
geom_boxplot()+
labs(title = "Monthly Charges by Churn Status",
x = "Customer Status", y = "Monthly Charges ($)")+
theme_minimal()+
scale_fill_brewer(palette = "Set2")
This graph analyzes how monthly billing amounts differ between customers
who churned and those who stayed. High monthly charges might correlate
with and increased likelihood of churn.
ggplot(df, aes(x= Gender, fill = Status))+
geom_bar(position = "fill")+
labs(title = "Churn Rate by Gender",
x = "Gender", y = "Proportion")+
theme_minimal()+
scale_fill_brewer(palette = "Set1")+
scale_y_continuous(labels = scales::percent)
This stacked bar char shows the churn rate for male and female
customers. The churn rates are nearly identical indicating that gender
does not appear to be a strong predictor of churn in this data set.
ggplot(df, aes(x = SeniorCitizen, fill = Status))+
geom_bar(position = "fill")+
labs(title = "Churn Rate by Senior Citizen Status",
x = "Senior Citizen", y = "Proportion")+
theme_minimal()+
scale_fill_brewer(palette = "Set2")+
scale_y_continuous(labels = scales::percent)
From this graph of churn rate by senior citizen status we can see that
senior citizens have a noticeably higher churn rate compared to non
senior customers. This suggests that being a senior is associated with
an increased likelihood of churn.
After exploring key patterns in the data, we observed many factors that can be associated with customer churn. These findings suggest that predictive modeling could effectively identify customers at high risk of leaving. ## Splitting the Data
set.seed(123)
split <- initial_split(df, prop = 0.7, strata = Status)
train_data <- training(split)
test_data <- testing(split)
I split the data set into a 70% training set and a 30% testing set, stratified by churn status to maintain class proportions. This ensures that model evaluation is fair and unbiased.
churn_recipe <- recipe(Status ~ ., data = train_data) %>%
step_dummy(all_nominal_predictors()) %>% # one hot encode categorical factors
step_normalize(all_numeric_predictors()) # normalize numeric factors
This recipe defines a data preprocessing workflow to prepare data for modeling. Categorical predictors are one-hot encoded and numeric predictors are normalized to ensure consistent model input.
# Define logistic regression model
log_mod <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
# Create workflow
log_wf <- workflow() %>%
add_model(log_mod) %>%
add_recipe(churn_recipe)
# Cross validation setup
set.seed(123)
kfold <- vfold_cv(train_data, v = 5, strata = Status)
# Fit model
log_fit <- fit_resamples(
log_wf,
resamples = kfold,
metrics = metric_set(roc_auc, accuracy),
control = control_resamples(save_pred = TRUE)
)
# Get results
collect_metrics(log_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00401 Preprocessor1_Model1
## 2 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
This logistic regression model was evaluated using a 5 fold cross validation. It achieved an average accuracy of roughly 80% and an AUC of 0.845. This suggests that the logistic regression model is effective at distinguishing between customers who will stay and those who will churn. While this model performs well, we will proceed to evaluate more complex models to determine if further predictive imporvements can be made.
# Define decision tree model
tree_mod <- decision_tree(
cost_complexity = tune(),
tree_depth = tune()
) %>%
set_engine("rpart") %>%
set_mode("classification")
# Create workflow
tree_wf <- workflow() %>%
add_model(tree_mod) %>%
add_recipe(churn_recipe)
# Deinfe tuning grid
tree_grid <- grid_regular(
cost_complexity(range = c(-3,-1)),
tree_depth(range = c(3,10)),
levels = 5
)
# Tune modelusing cross validation
set.seed(123)
tree_res <- tune_grid(
tree_wf,
resamples = kfold,
grid = tree_grid,
metrics = metric_set(roc_auc, accuracy)
)
# Get best results
show_best(tree_res, metric = "roc_auc")
## # A tibble: 5 × 8
## cost_complexity tree_depth .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.001 6 roc_auc binary 0.808 5 0.00623 Preprocesso…
## 2 0.001 8 roc_auc binary 0.803 5 0.00399 Preprocesso…
## 3 0.001 10 roc_auc binary 0.798 5 0.00455 Preprocesso…
## 4 0.00316 6 roc_auc binary 0.792 5 0.0152 Preprocesso…
## 5 0.00316 8 roc_auc binary 0.787 5 0.0152 Preprocesso…
The decision tree model was tuned across different values of cost complexity and tree depth. The best performing decision tree achieved an AUC of 0.808 with a cost complexity of 0.001 and a maximum tree depth of 6. Although this model performed well, its AUC was slightly lower than the logistic regression model. Lastly we will evaluate a random forest model to see if it can further improve predictive power.
# Define random forest model
rf_mod <- rand_forest(
mtry = tune(),
trees = 500,
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
# Create workflow
rf_wf <- workflow() %>%
add_model(rf_mod) %>%
add_recipe(churn_recipe)
# Tune hyperparameters
rf_grid <- grid_regular(
mtry(range = c(3,15)),
min_n(range = c(5,30)),
levels = 5
)
# Run tuning
set.seed(123)
rf_res <- tune_grid(
rf_wf,
resamples = kfold,
grid = rf_grid,
metrics = metric_set(roc_auc, accuracy)
)
# Show top performers
show_best(rf_res, metric = "roc_auc")
## # A tibble: 5 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 30 roc_auc binary 0.845 5 0.00460 Preprocessor1_Model22
## 2 3 17 roc_auc binary 0.845 5 0.00447 Preprocessor1_Model11
## 3 3 23 roc_auc binary 0.845 5 0.00471 Preprocessor1_Model16
## 4 3 30 roc_auc binary 0.845 5 0.00487 Preprocessor1_Model21
## 5 6 23 roc_auc binary 0.844 5 0.00398 Preprocessor1_Model17
The random forest model was tuned by varying the number of predictors considered at each split (mtry) and the minimum number of observations required in a node (min_n). The best performing forest achieved an AUC of 0.845 with mtry of 6 and min_n of 30. The random forest model slightly outperformed the logistic regression and decision tree models in terms of AUC. Its strong predictive performance makes it the most suitable model for predicting customer churn in this project.
# Select best model based on AUC
best_rf <- select_best(rf_res, metric = "roc_auc")
# Finalize workflow
final_rf_wf <- finalize_workflow(rf_wf, best_rf)
# Fit final model on training data
final_rf_fit <- fit(final_rf_wf, data = train_data)
Finalizing the best random forest model based on cross validated AUC results. Then we are fitting the finalized model to complete the training data set to prepare for test set evaluation.
# Predict on test set
rf_predictions <- predict(final_rf_fit, new_data = test_data, type = "prob") %>%
bind_cols(predict(final_rf_fit, new_data = test_data)) %>%
bind_cols(test_data %>% select(Status))
# Evaluate AUC
roc_auc(rf_predictions, truth = Status, .pred_Current)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.843
# Confusion Matrix
rf_class_preds <- rf_predictions %>%
mutate(pred_class = .pred_class)
conf_mat(rf_class_preds, truth = Status, estimate = pred_class)
## Truth
## Prediction Current Left
## Current 1389 271
## Left 151 286
The final random forest model achieved an AUC of 0.843 on the test set, closely matching its cross validation performance during training. This strong AUC indicates that the model perfroms well at distinguishing customers who are likely to churn form those who will stay. The confusion matrix confirms high classification accuracy with a balanced distribution of false positives and false negatives.
# Calculate the ROC curve manually
roc_data <- yardstick::roc_curve(rf_predictions, truth = Status, .pred_Current)
# Plot the ROC curve
ggplot(roc_data, aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty = 2, color = "gray") +
labs(title = "ROC Curve",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)") +
theme_minimal()
The ROC curve plots the true positive rate against the false positive rate across different classification thresholds. The high area under the curve (AUC) indicates the model is very effective.
# Extract fitted model
rf_model <- extract_fit_parsnip(final_rf_fit)$fit
# PLot feature importance
vip::vip(rf_model, num_features = 10)
This bar chart displays the relative importance of the top predictors used by the random forest model to classify customers as likely to stay or churn. Feature importance is calculated based on how much each variable improves the model’s ability to make accurate splits across all trees.
After selecting the final random forest model based on its strong AUC and predictive performance, we now apply the model to conduct a business-focused analysis. Specifically, we use the model to identify customers at high risk of churning, estimate the potential monthly revenue loss if no action is taken, and propose an incentive strategy to retain valuable customers.
# Predict probabilities and attach the original test data
rf_predictions <- predict(final_rf_fit, new_data = test_data, type = "prob") %>%
bind_cols(test_data) %>%
mutate(pred_class = if_else(.pred_Left > 0.5, "Left", "Current"))
# Now filter predicted churners
predicted_leavers <- rf_predictions %>%
filter(pred_class == "Left")
n_predicted_leavers <- nrow(predicted_leavers)
n_predicted_leavers
## [1] 437
The finalized random forest model was used to predict the probability that each customer in the test set would churn.After generating the predictions, we filtered the results to identify customers predicted to leave. A total of 437 customers were predicted to churn if no intervention is made.
# Sum monthly charges of predicted leavers
predicted_loss <- sum(predicted_leavers$MonthlyCharges)
predicted_loss
## [1] 33517.4
After identifying the 437 customers predicted to churn, we calculated the total monthly revenue associated with these customers.By summing the MonthlyCharges for all predicted leavers, we estimate that Regork Telecom is at risk of losing approximately $33,517.40 per month if no action is taken to retain them.
# Assume offering $15 discount per month for 6 months
cost_per_customer <- 15 * 6
# Total Cost offered to all predicted leavers
total_incentive_cost <- cost_per_customer * nrow(predicted_leavers)
# Revenue preserved if they stay (6 months)
revenue_preserved <- sum(predicted_leavers$MonthlyCharges) * 6
# Net Benefit
net_benefit <- revenue_preserved - total_incentive_cost
total_incentive_cost
## [1] 39330
revenue_preserved
## [1] 201104.4
net_benefit
## [1] 161774.4
To mitigate this loss, we propose offering a $15 per month discount for six months to each predicted leaver. The total cost of implementing this incentive program would be approximately $39,330. If successful in retaining these customers, the company would preserve approximately $201,104.40 in revenue over the six-month period.
After accounting for the incentive costs, the net financial benefit of retaining these customers would be approximately $161,774.40.
Based on this analysis, implementing a targeted retention campaign offering a modest discount to high-risk customers is financially justified and strongly recommended.
In this project, we successfully developed a predictive model to identify customers at risk of churning for Regork Telecom.
After extensive exploratory data analysis and model comparison, a random forest model was selected based on its strong cross-validation and test set performance, achieving an AUC of 0.843.
Business analysis using the model’s predictions revealed that Regork Telecom is at risk of losing approximately $33,517.40 per month if no action is taken. We proposed a cost-effective incentive plan offering a $15 per month discount to predicted leavers, resulting in an estimated net financial benefit of $161,774.40 over six months.
Overall, this project demonstrates how predictive modeling combined with targeted business strategies can drive data-informed decisions to increase customer retention and protect company revenue.