Final Project

Introduction

The business problem presented is predicting whether customers will leave the company in the future, and providing actionable decisions to retain more customers. I addressed the problem by creating graphs to better understand the customer demographics, and creating models to find factors in why customers are more likely to leave in the future. This report will help the CEO by better understanding their customers and how to keep them through data driven insights.

# Load required libraries
suppressWarnings(library(tidyverse))

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

suppressWarnings(library(ggplot2))
suppressWarnings(library(corrplot))

## corrplot 0.95 loaded

suppressWarnings(library(caret))

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

suppressWarnings(library(tidymodels))

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.0     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ caret::lift()            masks purrr::lift()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()

suppressWarnings(library(vip))

## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi

# Read the data
df <- read.csv("customer_retention.csv", stringsAsFactors = FALSE)

# Convert TotalCharges to numeric (handle blank entries)
df$TotalCharges <- as.numeric(gsub(" ", NA, df$TotalCharges))

# Remove rows with NA in TotalCharges
df <- df[!is.na(df$TotalCharges), ]

# Clean categorical levels
df <- df %>%
  mutate(across(c(OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport,
                  StreamingTV, StreamingMovies, MultipleLines),
                ~ ifelse(. == "No internet service" | . == "No phone service", "No", .)))

# Churn distribution
churn_rate <- prop.table(table(df$Status)) * 100
print(churn_rate)

## 
##  Current     Left 
## 73.44018 26.55982

# Plot churn distribution
ggplot(df, aes(x = Status, fill = Status)) +
  geom_bar() +
  ggtitle("Churn Rate (Customer Status)") +
  ylab("Count") +
  theme_minimal()

ggplot(df, aes(x = Tenure, fill = Status)) +
  geom_histogram(binwidth = 5, position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Churn Rate by Tenure") +
  xlab("Tenure (Months)") +
  ylab("Churn Percentage") +
  theme_minimal()

ggplot(df, aes(x = Contract, fill = Status)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Churn by Contract Type") +
  ylab("Proportion") +
  theme_minimal()

ggplot(df, aes(x = as.factor(SeniorCitizen), fill = Status)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Churn by Senior Citizen") +
  xlab("Senior Citizen (1 = Yes, 0 = No)") +
  theme_minimal()

ggplot(df, aes(x = InternetService, fill = Status)) +
  geom_bar(position = "fill") +
  ggtitle("Churn by Internet Service Type") +
  ylab("Proportion") +
  theme_minimal()

ggplot(df, aes(x = Status, y = MonthlyCharges, fill = Status)) +
  geom_boxplot() +
  ggtitle("Monthly Charges vs Churn") +
  theme_minimal()

# Encode categorical Yes/No to binary
df_binary <- df %>%
  mutate(across(c(Partner, Dependents, PhoneService, PaperlessBilling, 
                  OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, 
                  StreamingTV, StreamingMovies), ~ as.numeric(. == "Yes"))) %>%
  mutate(SeniorCitizen = as.numeric(SeniorCitizen),
         Churn = as.numeric(Status == "Left")) %>%
  select(Tenure, MonthlyCharges, TotalCharges, SeniorCitizen, Partner, Dependents, 
         PhoneService, OnlineSecurity, OnlineBackup, DeviceProtection, 
         TechSupport, StreamingTV, StreamingMovies, PaperlessBilling, Churn)

# Correlation matrix
corr_matrix <- cor(df_binary, use = "complete.obs")
corrplot(corr_matrix, method = "circle", type = "lower", tl.col = "black")

The correlation matrix shows the correlation between any two variables in the data set. Larger circles indicate a stronger relationship, with blue circles being positive and red circles being a negative correlation.

# Create CustomerID index
df$CustomerID <- paste0("", seq_len(nrow(df)))

df <- df %>% select(CustomerID, everything())

# Load necessary libraries
suppressWarnings(library(caret))
suppressWarnings(library(pROC))

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

suppressWarnings(library(randomForest))

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

suppressWarnings(library(e1071))

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:tune':
## 
##     tune

## The following object is masked from 'package:rsample':
## 
##     permutations

## The following object is masked from 'package:parsnip':
## 
##     tune

suppressWarnings(library(rpart))

## 
## Attaching package: 'rpart'

## The following object is masked from 'package:dials':
## 
##     prune

suppressWarnings(library(rpart.plot))

# Set seed for reproducibility
set.seed(123)

# Convert the target variable to a binary factor
df$Churn <- factor(ifelse(df$Status == "Left", "Yes", "No"), levels = c("No", "Yes"))

# Split the dataset into training (70%) and testing (30%)
trainIndex <- createDataPartition(df$Churn, p = 0.7, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]

# Set up training control for cross-validation
ctrl <- trainControl(method = "cv", number = 5, summaryFunction = twoClassSummary,
                     classProbs = TRUE, savePredictions = TRUE)

df <- df %>%
  mutate(CustomerID = paste0("", row_number()))

set.seed(123)
# Create binary target variable
df <- df %>%
  mutate(Churn = factor(ifelse(Status == "Left", "Yes", "No"), levels = c("No", "Yes"))) %>%
  select(-Status, -CustomerID)  # Drop non-predictive columns

data_split <- initial_split(df, prop = 0.7, strata = Churn)
train_data <- training(data_split)
test_data  <- testing(data_split)

churn_recipe <- recipe(Churn ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%        # Remove zero variance
  step_normalize(all_numeric_predictors())

log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

rf_spec <- rand_forest(mtry = 5, trees = 100, min_n = 10) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

log_wf <- workflow() %>%
  add_model(log_spec) %>%
  add_recipe(churn_recipe)

tree_wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_recipe(churn_recipe)

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(churn_recipe)

log_fit <- fit(log_wf, data = train_data)
tree_fit <- fit(tree_wf, data = train_data)
rf_fit <- fit(rf_wf, data = train_data)

# Logistic
log_preds <- predict(log_fit, test_data, type = "prob") %>%
  bind_cols(predict(log_fit, test_data)) %>%
  bind_cols(test_data %>% select(Churn))

# Tree
tree_preds <- predict(tree_fit, test_data, type = "prob") %>%
  bind_cols(predict(tree_fit, test_data)) %>%
  bind_cols(test_data %>% select(Churn))

# RF
rf_preds <- predict(rf_fit, test_data, type = "prob") %>%
  bind_cols(predict(rf_fit, test_data)) %>%
  bind_cols(test_data %>% select(Churn))

roc_auc(log_preds, truth = Churn, .pred_Yes)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.155

roc_auc(tree_preds, truth = Churn, .pred_Yes)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.284

roc_auc(rf_preds, truth = Churn, .pred_Yes)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.166

rf_preds %>%
  conf_mat(truth = Churn, estimate = .pred_class)

##           Truth
## Prediction   No  Yes
##        No  1374  268
##        Yes  166  289

rf_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

Model Analysis

Tenure is the most important predictor since customers who have stayed for a long time are more likely to keep using their services. Total charge is important since it goes with tenure, the longer a customer has used the services, the more total charges they will accumulate. Thirdly, monthly charges could be key in understanding which customers stay, since customers willing to pay a high bill will not be as urgent to switch companies.

Looking at the variables, I am learning that some of the variables have little to no correlation in predicting whether the customer will stay with the company. for example, the 10th most important factor of customers using paperless billing was statistically insignificant in trying to predict observations of the customer.

test_data <- test_data %>%
  mutate(CustomerID = paste0("", row_number()))

# Predict churn probabilities and classes
rf_preds <- predict(rf_fit, test_data, type = "prob") %>%
  bind_cols(predict(rf_fit, test_data)) %>%
  bind_cols(test_data %>% select(CustomerID, MonthlyCharges, Churn))

# Filter customers likely to leave
churn_risk <- rf_preds %>%
  filter(.pred_class == "Yes")

# Total predicted revenue loss if no action is taken
predicted_loss <- sum(churn_risk$MonthlyCharges)
cat("Predicted monthly revenue loss: $", round(predicted_loss, 2))

## Predicted monthly revenue loss: $ 34801.25

n_customers <- nrow(churn_risk)
avg_charge <- mean(churn_risk$MonthlyCharges)
retention_rate <- 0.3  # assume 30% stay with incentive

# Total cost of incentive
total_cost <- n_customers * 10 * 6

# Retained customers' value (assuming 6 more months)
retained_value <- n_customers * retention_rate * avg_charge * 6

# Net benefit
net_benefit <- retained_value - total_cost

cat("Total incentive cost: $", total_cost, "\n")

## Total incentive cost: $ 27300

cat("Expected retained revenue: $", round(retained_value, 2), "\n")

## Expected retained revenue: $ 62642.25

cat("Net benefit: $", round(net_benefit, 2))

## Net benefit: $ 35342.25

Recommendations

The most important churn predictors include short tenure, month-to-month contracts, lack of tech support, and high monthly charges. I recommend that the company offers a $10 per month discount to all high risk churn customers. If a modest 30% of the customers that were going to leave end up staying, it would result in over 40,000 in profit, which exceeds the costs of the initiative.

Final Project

Cal McKittrick

2025-04-27

Introduction

Model Analysis

Recommendations