The business problem presented is predicting whether customers will leave the company in the future, and providing actionable decisions to retain more customers. I addressed the problem by creating graphs to better understand the customer demographics, and creating models to find factors in why customers are more likely to leave in the future. This report will help the CEO by better understanding their customers and how to keep them through data driven insights.
# Load required libraries
suppressWarnings(library(tidyverse))
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
suppressWarnings(library(ggplot2))
suppressWarnings(library(corrplot))
## corrplot 0.95 loaded
suppressWarnings(library(caret))
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
suppressWarnings(library(tidymodels))
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.7 ✔ rsample 1.2.1
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.7 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.3.0 ✔ yardstick 1.3.2
## ✔ recipes 1.1.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
suppressWarnings(library(vip))
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
# Read the data
df <- read.csv("customer_retention.csv", stringsAsFactors = FALSE)
# Convert TotalCharges to numeric (handle blank entries)
df$TotalCharges <- as.numeric(gsub(" ", NA, df$TotalCharges))
# Remove rows with NA in TotalCharges
df <- df[!is.na(df$TotalCharges), ]
# Clean categorical levels
df <- df %>%
mutate(across(c(OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport,
StreamingTV, StreamingMovies, MultipleLines),
~ ifelse(. == "No internet service" | . == "No phone service", "No", .)))
# Churn distribution
churn_rate <- prop.table(table(df$Status)) * 100
print(churn_rate)
##
## Current Left
## 73.44018 26.55982
# Plot churn distribution
ggplot(df, aes(x = Status, fill = Status)) +
geom_bar() +
ggtitle("Churn Rate (Customer Status)") +
ylab("Count") +
theme_minimal()
ggplot(df, aes(x = Tenure, fill = Status)) +
geom_histogram(binwidth = 5, position = "fill") +
scale_y_continuous(labels = scales::percent) +
ggtitle("Churn Rate by Tenure") +
xlab("Tenure (Months)") +
ylab("Churn Percentage") +
theme_minimal()
ggplot(df, aes(x = Contract, fill = Status)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ggtitle("Churn by Contract Type") +
ylab("Proportion") +
theme_minimal()
ggplot(df, aes(x = as.factor(SeniorCitizen), fill = Status)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ggtitle("Churn by Senior Citizen") +
xlab("Senior Citizen (1 = Yes, 0 = No)") +
theme_minimal()
ggplot(df, aes(x = InternetService, fill = Status)) +
geom_bar(position = "fill") +
ggtitle("Churn by Internet Service Type") +
ylab("Proportion") +
theme_minimal()
ggplot(df, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
ggtitle("Monthly Charges vs Churn") +
theme_minimal()
# Encode categorical Yes/No to binary
df_binary <- df %>%
mutate(across(c(Partner, Dependents, PhoneService, PaperlessBilling,
OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport,
StreamingTV, StreamingMovies), ~ as.numeric(. == "Yes"))) %>%
mutate(SeniorCitizen = as.numeric(SeniorCitizen),
Churn = as.numeric(Status == "Left")) %>%
select(Tenure, MonthlyCharges, TotalCharges, SeniorCitizen, Partner, Dependents,
PhoneService, OnlineSecurity, OnlineBackup, DeviceProtection,
TechSupport, StreamingTV, StreamingMovies, PaperlessBilling, Churn)
# Correlation matrix
corr_matrix <- cor(df_binary, use = "complete.obs")
corrplot(corr_matrix, method = "circle", type = "lower", tl.col = "black")
The correlation matrix shows the correlation between any two variables
in the data set. Larger circles indicate a stronger relationship, with
blue circles being positive and red circles being a negative
correlation.
# Create CustomerID index
df$CustomerID <- paste0("", seq_len(nrow(df)))
df <- df %>% select(CustomerID, everything())
# Load necessary libraries
suppressWarnings(library(caret))
suppressWarnings(library(pROC))
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
suppressWarnings(library(randomForest))
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
suppressWarnings(library(e1071))
##
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
##
## tune
## The following object is masked from 'package:rsample':
##
## permutations
## The following object is masked from 'package:parsnip':
##
## tune
suppressWarnings(library(rpart))
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
suppressWarnings(library(rpart.plot))
# Set seed for reproducibility
set.seed(123)
# Convert the target variable to a binary factor
df$Churn <- factor(ifelse(df$Status == "Left", "Yes", "No"), levels = c("No", "Yes"))
# Split the dataset into training (70%) and testing (30%)
trainIndex <- createDataPartition(df$Churn, p = 0.7, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]
# Set up training control for cross-validation
ctrl <- trainControl(method = "cv", number = 5, summaryFunction = twoClassSummary,
classProbs = TRUE, savePredictions = TRUE)
df <- df %>%
mutate(CustomerID = paste0("", row_number()))
set.seed(123)
# Create binary target variable
df <- df %>%
mutate(Churn = factor(ifelse(Status == "Left", "Yes", "No"), levels = c("No", "Yes"))) %>%
select(-Status, -CustomerID) # Drop non-predictive columns
data_split <- initial_split(df, prop = 0.7, strata = Churn)
train_data <- training(data_split)
test_data <- testing(data_split)
churn_recipe <- recipe(Churn ~ ., data = train_data) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>% # Remove zero variance
step_normalize(all_numeric_predictors())
log_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
tree_spec <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
rf_spec <- rand_forest(mtry = 5, trees = 100, min_n = 10) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
log_wf <- workflow() %>%
add_model(log_spec) %>%
add_recipe(churn_recipe)
tree_wf <- workflow() %>%
add_model(tree_spec) %>%
add_recipe(churn_recipe)
rf_wf <- workflow() %>%
add_model(rf_spec) %>%
add_recipe(churn_recipe)
log_fit <- fit(log_wf, data = train_data)
tree_fit <- fit(tree_wf, data = train_data)
rf_fit <- fit(rf_wf, data = train_data)
# Logistic
log_preds <- predict(log_fit, test_data, type = "prob") %>%
bind_cols(predict(log_fit, test_data)) %>%
bind_cols(test_data %>% select(Churn))
# Tree
tree_preds <- predict(tree_fit, test_data, type = "prob") %>%
bind_cols(predict(tree_fit, test_data)) %>%
bind_cols(test_data %>% select(Churn))
# RF
rf_preds <- predict(rf_fit, test_data, type = "prob") %>%
bind_cols(predict(rf_fit, test_data)) %>%
bind_cols(test_data %>% select(Churn))
roc_auc(log_preds, truth = Churn, .pred_Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.155
roc_auc(tree_preds, truth = Churn, .pred_Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.284
roc_auc(rf_preds, truth = Churn, .pred_Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.166
rf_preds %>%
conf_mat(truth = Churn, estimate = .pred_class)
## Truth
## Prediction No Yes
## No 1374 268
## Yes 166 289
rf_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 10)
Tenure is the most important predictor since customers who have stayed for a long time are more likely to keep using their services. Total charge is important since it goes with tenure, the longer a customer has used the services, the more total charges they will accumulate. Thirdly, monthly charges could be key in understanding which customers stay, since customers willing to pay a high bill will not be as urgent to switch companies.
Looking at the variables, I am learning that some of the variables have little to no correlation in predicting whether the customer will stay with the company. for example, the 10th most important factor of customers using paperless billing was statistically insignificant in trying to predict observations of the customer.
test_data <- test_data %>%
mutate(CustomerID = paste0("", row_number()))
# Predict churn probabilities and classes
rf_preds <- predict(rf_fit, test_data, type = "prob") %>%
bind_cols(predict(rf_fit, test_data)) %>%
bind_cols(test_data %>% select(CustomerID, MonthlyCharges, Churn))
# Filter customers likely to leave
churn_risk <- rf_preds %>%
filter(.pred_class == "Yes")
# Total predicted revenue loss if no action is taken
predicted_loss <- sum(churn_risk$MonthlyCharges)
cat("Predicted monthly revenue loss: $", round(predicted_loss, 2))
## Predicted monthly revenue loss: $ 34801.25
n_customers <- nrow(churn_risk)
avg_charge <- mean(churn_risk$MonthlyCharges)
retention_rate <- 0.3 # assume 30% stay with incentive
# Total cost of incentive
total_cost <- n_customers * 10 * 6
# Retained customers' value (assuming 6 more months)
retained_value <- n_customers * retention_rate * avg_charge * 6
# Net benefit
net_benefit <- retained_value - total_cost
cat("Total incentive cost: $", total_cost, "\n")
## Total incentive cost: $ 27300
cat("Expected retained revenue: $", round(retained_value, 2), "\n")
## Expected retained revenue: $ 62642.25
cat("Net benefit: $", round(net_benefit, 2))
## Net benefit: $ 35342.25
The most important churn predictors include short tenure, month-to-month contracts, lack of tech support, and high monthly charges. I recommend that the company offers a $10 per month discount to all high risk churn customers. If a modest 30% of the customers that were going to leave end up staying, it would result in over 40,000 in profit, which exceeds the costs of the initiative.