Customer retention is a key challenge in the telecom industry. Losing customers, or “churn,” not only impacts revenue but also increases the cost of acquiring new customers. In this project, we analyze customer churn data to uncover patterns and identify factors influencing customer retention. Using data from a telecom company, we apply machine learning models to predict which customers are most likely to leave. These predictions enable businesses to take proactive steps to improve customer satisfaction and loyalty. This report highlights our approach to data preparation, exploratory analysis, and predictive modeling to support data-driven retention strategies.
Here are the packages used in our analysis, including a short description of each:
# Loading Required Libraries
library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(kernlab)
library(ggplot2)
We begin by importing the dataset and exploring its structure to identify potential data quality issues or necessary transformations.
# Importing the dataset
customer_retention <- read.csv("Data/customer_retention.csv")
# Viewing the first few rows of the dataset
head(customer_retention)
# Checking the structure of the dataset
str(customer_retention)
# Summary statistics of the dataset
summary(customer_retention)
The column TotalCharges contains 11 missing values. We omit these rows to ensure data quality for our analysis.
# Identifying missing values
colSums(is.na(customer_retention))
## Gender SeniorCitizen Partner Dependents
## 0 0 0 0
## Tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Status
## 0 0 11 0
# Removing rows with missing values
customer_retention <- na.omit(customer_retention)
# Confirming all missing values have been removed
sum(is.na(customer_retention))
## [1] 0
To prevent issues during analysis, we ensure that categorical variables are consistent and have no unexpected levels.
unique(customer_retention$Status)
unique(customer_retention$Gender)
unique(customer_retention$SeniorCitizen)
unique(customer_retention$Partner)
unique(customer_retention$Dependents)
unique(customer_retention$InternetService)
unique(customer_retention$OnlineSecurity)
unique(customer_retention$OnlineBackup)
unique(customer_retention$DeviceProtection)
unique(customer_retention$TechSupport)
unique(customer_retention$StreamingTV)
unique(customer_retention$StreamingMovies)
unique(customer_retention$Contract)
unique(customer_retention$PaperlessBilling)
We split the dataset into training (70%) and test (30%) sets, ensuring stratified sampling to maintain the balance of Status in both sets.
# Recode response variable as a factor
customer_retention <- customer_retention %>%
mutate(Status = factor(Status, levels = c("Current", "Left")))
# Splitting the dataset
set.seed(123)
retention_split <- initial_split(customer_retention, prop = 0.7, strata = "Status")
retention_train <- training(retention_split)
retention_test <- testing(retention_split)
# Checking dimensions of the splits
dim(retention_train)
## [1] 4891 20
dim(retention_test)
## [1] 2097 20
The Status variable indicates whether a customer is currently retained (Current) or has left (Left). Below is the distribution of the target variable in the dataset.
table(customer_retention$Status)
##
## Current Left
## 5132 1856
prop.table(table(customer_retention$Status))
##
## Current Left
## 0.7344018 0.2655982
The majority of customers (73.4%) are current, while just over a quarter (26.6%) have left the service.
We first explored the distribution of tenure among our customer base, both current and past.
cust_tenure <- ggplot(customer_retention, aes(x = Tenure, fill = Status)) +
geom_histogram(binwidth = 5) +
theme_grey() +
scale_x_continuous(n.breaks=10)+
labs(title = "Customer Tenure", x = "Number of Months Kept", y = "Count of Instances",)
cust_tenure
It seems that of the current customer base, the majority have either just joined in the last few months, or have been with the service since the beginning. This could be because the service offered a special deal upon startup that customers had been grandfathered into for those who have been long-standing customers. Those who have left the service departed within the first few months, and many of the current ones are new as well, suggesting there could be a promotion for the first ~6 months of the contract that attracts people, and they may choose to switch services once the promotion is up.
In addition to the tenure, we explored monthly charges to understand why customers may leave or stay with the service.
monthly_charges<- ggplot(customer_retention, aes(x = MonthlyCharges, fill = Status)) +
geom_histogram(binwidth = 3) +
theme_grey() +
scale_x_continuous(n.breaks=10)+
labs(title = "Customer Monthly Charges", x = "Amount Billed ($)", y = "Count of Instances")
monthly_charges
In this graph, we find that most of the current customers are being charged around $20/month. If we analyze this in tandem with the previous tenure graph, we can conclude there is some sort of promotion for new members that offers the lower price. Many of the customers who left the service began doing so around the $70/month mark, which we may need to consider in our pricing strategy.
We graphed the internet service variable to help understand why customers may be leaving, and if it had to do with their type of internet service.
internet_service <- ggplot(customer_retention, aes(x = InternetService, fill = Status)) +
geom_bar() +
theme_minimal()+
labs(x = "Internet Service", y = "Count", title = "Status by Internet Service Type")
internet_service
In doing this, we found that the majority of customers who left had
fiber optic cables as their internet service. We may need to look into
enhancing or changing our approach for our fiber optic customers to
increase retention.
We looked at the demographics of our customer base and found that the distribution between male and female identifying customers were fairly even, so we decided to explore the general age of our customer base.
senior_citizen <- ggplot(customer_retention, aes(x = SeniorCitizen, fill = Status)) +
geom_bar() +
theme_minimal()+
labs(x = "Senior Status", y = "Count", title = "Customers vs Senior Citizen Status")
senior_citizen
Most of the customers, both current and past, have not identified as
senior citizens. This could help us in marketing strategies in the
future, as we can better cater to our larger base, but know we can work
to improve how our service could be advertised to and fit for senior
citizens.
Because the non-senior citizen status was so significant in our demographic, we decided to analyze the differnt services between age groups to understand how our service could better fit the senior citizen group.
senior_service <- ggplot(customer_retention, aes(x = SeniorCitizen, fill = InternetService)) +
geom_bar(position = "dodge") +
theme_minimal()+
labs(x = "Senior Status", y = "Count", title = "Relationship between Age and Internet Service Type")
senior_service
Both the senior and non-senior demographics utlizie the fiber optic services for their internet, so Regork may want to focus on potentially enhancing their services for the DSL to gain a larger base from the senior citizens, but make sure they fortify the service they do provide in conjunction with fiber optic, as that seems to be the most popular among the graphs.
This graph helps us to better visualize the range of those staying with the service vs leaving it.
tenure_status <- ggplot(customer_retention, aes(x = Status, y = Tenure)) +
geom_boxplot(fill = "purple") +
theme_minimal()
tenure_status
Most of the customers that leave do so in under 30 months of commitment, with the mean being around 10 months. Regork can look at what promotions or incentives they can offer around the 10 month mark to increase the perceived value of their service and hopefully retain more customers.
We watned to get a better understanding of the costs related to customers leaving the service and the effect the price range may have on the retention.
monthly_status <- ggplot(customer_retention, aes(x = Status, y = MonthlyCharges)) +
geom_boxplot(fill = "red") +
theme_minimal()
monthly_status
The customers who have left were generally charged more than those who have stayed, paying an average of around $80/month. If Regork can determine why those costs were not associated with perceived value, they may be able to better retain that customer group.
In this section, we use a logistic regression model to predict customer churn. By analyzing factors such as customer demographics, service usage, and billing methods, the model identifies which variables are most influential in predicting whether a customer will churn. This allows us to prioritize retention efforts effectively.
set.seed(123)
kfold <- vfold_cv(retention_train, v = 5, strata = "Status")
To ensure the model performs reliably, we divided the data into smaller subsets and tested the model on different portions of the data. This approach gives a clearer picture of how well the model will work in real-world scenarios.
results <- logistic_reg() %>%
set_engine("glm") %>%
fit_resamples(Status ~ ., resamples = kfold)
The model was trained using all available customer data, including demographics, contract details, and service preferences. This allows us to determine which factors most influence customer decisions to leave or stay.
collect_metrics(results) %>%
filter(.metric == "accuracy")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00401 Preprocessor1_Model1
The model correctly predicts whether a customer will churn approximately 80% of the time. This accuracy ensures that the model provides valuable insights into customer behavior without being overly complex.
final_fit <- logistic_reg() %>%
set_engine("glm") %>%
fit(Status ~ ., data = retention_train)
We trained the logistic regression model on the entire training dataset to finalize it.
final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
When applied to new data, the model continued to perform well. While it correctly identifies many customers likely to leave, it does occasionally miss some (false negatives) and predicts churn for a few customers who stay (false positives). These predictions can guide retention efforts, but care must be taken to avoid unnecessary interventions.
vip::vip(final_fit)
After finalizing the model, we identified the factors most associated with churn:
The logistic regression model provides valuable insights into the linear relationships between predictors and churn. Its simplicity makes it an excellent starting point for understanding churn patterns. However, its linear nature might miss complex interactions between variables, which more advanced models like random forests can capture.
In this section, we use a random forest model to predict which customers are at the highest risk of churn (probability > 70%). By identifying these customers, Regork Telecom can implement targeted retention strategies, focusing efforts where they are most needed.
model_recipe <- recipe(Status ~ ., data = retention_train)
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
set.seed(123)
kfold <- vfold_cv(retention_train, v = 5, strata = "Status")
To build the random forest model, we prepared a recipe using all available predictors, including customer demographics, contract types, and service usage. Cross-validation ensures the model performs reliably across various data subsets, providing consistent results.
rf_hyper_grid <- grid_regular(
trees(range = c(50, 800)),
mtry(range = c(2, 19)),
min_n(range = c(1, 20)),
levels = 5
)
We created a grid of potential hyperparameter values, allowing the model to test various settings for the number of trees, predictors per split, and minimum node size. Optimizing these values ensures the model is both accurate and efficient.
set.seed(123)
rf_results <- tune_grid(
rf_mod,
model_recipe,
resamples = kfold,
grid = rf_hyper_grid,
metrics = metric_set(accuracy, roc_auc, mn_log_loss)
)
The model was trained and evaluated across multiple folds using the hyperparameter grid. Metrics such as ROC AUC help us determine how well the model distinguishes between churners and non-churners, ensuring reliable predictions.
show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 800 20 roc_auc binary 0.845 5 0.00534 Preprocessor1_Model1…
## 2 2 800 15 roc_auc binary 0.845 5 0.00535 Preprocessor1_Model0…
## 3 2 425 20 roc_auc binary 0.845 5 0.00536 Preprocessor1_Model1…
## 4 2 612 20 roc_auc binary 0.845 5 0.00555 Preprocessor1_Model1…
## 5 2 612 15 roc_auc binary 0.844 5 0.00523 Preprocessor1_Model0…
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
The best-performing model was selected based on the ROC AUC metric. This ensures the final model focuses on accurately predicting high-risk churners, which is critical for targeted interventions.
final_rf_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- final_rf_wf %>%
fit(data = retention_train)
Using the best hyperparameter settings, we trained the final random forest model on the entire training dataset. This model captures complex relationships and non-linear interactions between predictors, which were less visible in the linear regression model.
rf_predictions <- predict(rf_final_fit, new_data = retention_test, type = "prob") %>%
bind_cols(predict(rf_final_fit, new_data = retention_test, type = "class")) %>%
bind_cols(retention_test)
high_risk_customers <- rf_predictions %>%
filter(.pred_Left > 0.7)
high_risk_metrics <- high_risk_customers %>%
metrics(truth = Status, estimate = .pred_class)
print(high_risk_metrics)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.86
## 2 kap binary 0
The model was then used to predict churn probabilities for the test dataset. We focused on customers with a churn probability greater than 70% (high risk). The predictions were 86% accurate for this group, demonstrating strong reliability in identifying high-risk churners.
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 10) +
labs(
title = "Top 10 Features Influencing High-Risk Customer Churn",
x = "Feature",
y = "Importance"
) +
theme_minimal()
Finally, we examined the key factors influencing churn risk using variable importance. The top drivers include:
The feature importance graph underscores the need for targeted retention efforts. Customers in the first year of service, particularly on month-to-month plans, should receive personalized offers to prevent churn. Addressing these pain points with promotions or improved service quality can yield significant retention benefits.
This section focuses on predicting churn likelihood based on demographic variables such as Gender, Senior Citizen, Partner, and Dependents. By understanding the demographic groups most at risk of leaving, Regork Telecom can tailor retention strategies to these customers effectively.
dt_mod <- decision_tree(
mode = "classification"
) %>%
set_engine("rpart")
model_recipe <- recipe(Status ~ Gender + SeniorCitizen + Partner + Dependents, data = retention_train)
We begin by building a simple decision tree model to explore the relationships between demographic variables and churn. These features were selected based on their potential impact on customer behavior and insights derived from the exploratory data analysis. For example, earlier findings showed senior citizens and customers without dependents may be more likely to churn.
dt_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
fit(data = retention_train)
After specifying the model and recipe, we train the decision tree using the training dataset. This initial model helps us understand the relationships between demographic variables and churn. It sets the stage for further refinement through hyperparameter tuning to improve performance.
set.seed(123)
kfold <- vfold_cv(retention_train, v = 5, strata = "Status")
dt_results <- fit_resamples(
dt_mod,
model_recipe,
resamples = kfold,
metrics = metric_set(accuracy, roc_auc)
)
collect_metrics(dt_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.734 5 0.000137 Preprocessor1_Model1
## 2 roc_auc binary 0.5 5 0 Preprocessor1_Model1
To assess the model’s reliability, we perform 5-fold cross-validation and evaluate metrics such as accuracy and ROC AUC. The initial decision tree achieved an accuracy of 73.4% during cross-validation, consistent with earlier models. However, the ROC AUC of 0.5 suggests it is not yet effective at distinguishing between churners and non-churners.
dt_mod_tuned <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
set.seed(123)
dt_tuned_results <- tune_grid(
dt_mod_tuned,
model_recipe,
resamples = kfold,
grid = dt_hyper_grid,
metrics = metric_set(accuracy, roc_auc)
)
dt_best_params <- select_best(dt_tuned_results, metric = "roc_auc")
show_best(dt_tuned_results, metric = "roc_auc")
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 4 2 roc_auc binary 0.614 5 0.00727
## 2 0.0000000178 4 2 roc_auc binary 0.614 5 0.00727
## 3 0.00000316 4 2 roc_auc binary 0.614 5 0.00727
## 4 0.0000000001 8 2 roc_auc binary 0.614 5 0.00727
## 5 0.0000000178 8 2 roc_auc binary 0.614 5 0.00727
## # ℹ 1 more variable: .config <chr>
To enhance performance, we tune hyperparameters like tree depth, minimum node size, and cost complexity. These parameters control the tree’s size and shape, balancing simplicity and accuracy.
dt_final_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod_tuned) %>%
finalize_workflow(dt_best_params)
dt_final_fit <- dt_final_wf %>%
fit(data = retention_train)
This final model captures optimized relationships between demographics and churn. By focusing on tuned parameters, we achieve a balance between interpretability and predictive accuracy.
dt_predictions <- predict(dt_final_fit, new_data = retention_test, type = "prob") %>%
bind_cols(predict(dt_final_fit, new_data = retention_test, type = "class")) %>%
bind_cols(retention_test)
confusion <- dt_predictions %>%
conf_mat(truth = Status, estimate = .pred_class)
confusion
## Truth
## Prediction Current Left
## Current 1444 488
## Left 96 69
We evaluate the tuned model on the test dataset, focusing on accuracy and the confusion matrix to assess its ability to predict churn. The model correctly predicted churn for 69 customers who left but misclassified 488 churners as current customers (false negatives). These misclassifications emphasize the need for additional strategies to minimize churn predictions’ business impact.
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(10) +
labs(
title = "Top Demographic Features Influencing Churn",
x = "Feature",
y = "Importance"
)
After finalizing the decision tree model, we identified the most influential demographic factors associated with churn:
Lastly, this section identifies customers who are predicted to churn and calculates their potential revenue loss. Using monthly charges, we estimate the annual revenue loss per customer
predicted_churners <- rf_predictions %>%
filter(.pred_class == "Left") %>%
mutate(RevenueLoss = MonthlyCharges * 12)
We aggregate revenue loss by demographic groups to better understand the financial impact of churn across different segments. This step groups the predicted churners by attributes like dependents, senior citizenship, partnership status, and gender, summarizing total and average revenue loss, as well as the count of customers in each group
demographic_loss <- predicted_churners %>%
group_by(Dependents, SeniorCitizen, Partner, Gender) %>%
summarize(
TotalRevenueLoss = sum(RevenueLoss, na.rm = TRUE),
AvgRevenueLoss = mean(RevenueLoss, na.rm = TRUE),
Count = n()
) %>%
arrange(desc(TotalRevenueLoss))
## `summarise()` has grouped output by 'Dependents', 'SeniorCitizen', 'Partner'.
## You can override using the `.groups` argument.
The resulting table, displayed below, reveals which demographic groups contribute most to revenue loss. For example, non-senior females without dependents or partners seem to account for the largest share of potential revenue loss, highlighting a priority segment for retention strategies.
print(demographic_loss)
## # A tibble: 15 × 7
## # Groups: Dependents, SeniorCitizen, Partner [8]
## Dependents SeniorCitizen Partner Gender TotalRevenueLoss AvgRevenueLoss Count
## <chr> <int> <chr> <chr> <dbl> <dbl> <int>
## 1 No 0 No Female 102455. 907. 113
## 2 No 0 No Male 95996. 906. 106
## 3 No 1 No Female 45361. 926. 49
## 4 No 1 No Male 34546. 909. 38
## 5 No 1 Yes Male 28844. 961. 30
## 6 No 0 Yes Male 19508. 1027. 19
## 7 Yes 0 Yes Female 16655. 980. 17
## 8 No 0 Yes Female 16088. 946. 17
## 9 Yes 0 No Male 10523. 809. 13
## 10 No 1 Yes Female 8337 926. 9
## 11 Yes 0 Yes Male 7728 966 8
## 12 Yes 0 No Female 5587. 931. 6
## 13 Yes 1 No Female 1216. 1216. 1
## 14 Yes 1 Yes Female 1150. 1150. 1
## 15 Yes 1 Yes Male 965. 965. 1
To better show this, we use a bar chart that breaks down total revenue loss by demographic combinations. The x-axis represents whether customers have dependents, while the bars are grouped and colored by combinations of senior citizenship, partnership status, and gender.
ggplot(demographic_loss, aes(x = Dependents, y = TotalRevenueLoss, fill = interaction(SeniorCitizen, Partner, Gender))) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Potential Revenue Loss by Demographics",
x = "Dependents",
y = "Total Revenue Loss ($)",
fill = "SeniorCitizen-Partner-Gender"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
From the chart, we observe that non-dependent customers, particularly those who are not senior citizens, contribute the majority of potential revenue loss. This insight underscores the importance of tailored retention efforts targeting these high-impact groups.
Our analysis reveals that the most influential predictor variables driving customer churn are Tenure, Contract Type, Monthly Charges, Total Charges, and Online Security. Among these, Tenure is the strongest indicator. Customers with longer tenures are significantly less likely to leave, emphasizing the importance of fostering loyalty through retention programs and incentives aimed at long-standing customers.
Contract Type emerges as another critical factor, with customers on month-to-month contracts being more likely to churn than those on longer-term contracts. Encouraging customers to switch to annual or bi-annual plans, potentially through discounted rates or bundled service offerings, could help mitigate churn. Similarly, Monthly Charges and Total Charges have substantial influence, indicating that cost-sensitive customers are more likely to leave. Competitive pricing, targeted discounts, or personalized offers for high-risk customers could be effective in addressing this issue. Additionally, Online Security plays a notable role, as customers who subscribe to these services tend to have lower churn rates. Strengthening and promoting online security services could further enhance customer retention.
Demographically, the customers most likely to churn include younger individuals, non-senior citizens, and single customers without dependents. These groups may have more limited disposable income or greater sensitivity to price changes. By contrast, senior citizens and those with dependents tend to display greater stability, likely due to fixed incomes and family obligations that anchor their choices. Targeting high-risk demographics with customized promotions, such as discounts for younger customers or tailored offers for single individuals, can help address their specific concerns and improve retention.
Using our model, we predict that younger single females and males contribute $102,454.80 and $95,995.80 in potential revenue loss respectively. To retain these customers, we propose a multi-pronged incentive plan. First, offer a 10% discount on monthly charges for three months to cost-sensitive demographics. This would address concerns related to Monthly Charges and Total Charges while providing immediate financial relief. Second, implement a rewards program to incentivize long-term contracts and recognize customer loyalty, especially for younger and single individuals, who are more likely to churn. For instance, rewards could include discounts, exclusive services, or recognition on significant milestones like anniversaries with the company.
Our analysis highlights the key drivers of churn and demographic trends that Regork can target to improve customer retention. By addressing critical predictors such as Tenure, Contract Type, Monthly Charges, and Total Charges, alongside demographic groups at higher risk, the company can proactively reduce churn. Implementing incentive programs, promoting long-term contracts, and enhancing services like Online Security can ensure customer satisfaction and loyalty. These measures, grounded in our predictive model with an average AUC of 80%, provide a strong foundation for reducing revenue loss and building stronger customer relationships. By taking these steps, Regork can sustain long-term growth and solidify its position in the telecommunications market.