knitr::opts_chunk$set(echo = TRUE)
suppressPackageStartupMessages({
library(tidymodels)
library(tidyverse)
library(vip)
library(ggplot2)
library(randomForest)
library(pROC)
library(rpart)
})

Introduction

Regork Telecom wants to reduce customer churn, which is more cost-effective than acquiring new customers. Our goal is to use historical customer data to build models that predict whether a customer is likely to leave.

Approach

We use a data set of customer account and service details. We split the data into training and testing sets, explore key trends using visualizations, and then build three models: logistic regression, decision tree, and random forest. We evaluate performance based on accuracy and AUC.

The Importance

Predicting churn allows the business to intervene with retention strategies. Understanding key drivers of churn also helps shape future business offerings.

data <- read.csv("~/Lab data/customer_retention.csv",  stringsAsFactors = TRUE)
#convert to numeric and remove NAs
data$TotalCharges <- as.numeric(as.character(data$TotalCharges))
#remove missing values
data <- na.omit(data)
#response variable
table(data$Status)

## 
## Current    Left 
##    5132    1856

prop.table(table(data$Status)) * 100

## 
##  Current     Left 
## 73.44018 26.55982

# tenure vs churn
ggplot(data, aes(x = Tenure, fill = Status)) +
  geom_histogram(binwidth = 5) +
  labs(title = "Tenure by Customer Status", x = "Tenure (months)", y = "Count")

# Tenure by Customer Status Chart Analysis

This chart reveals that customer churn is mostly seen within the first 5 months. This suggests that Regork’s initial quality may not be meeting the expectations that customers have set. Although customers who remain beyond 60 months show extremely strong loyalty.

# churn by type of contract
ggplot(data, aes(x = Contract, fill = Status)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Contract Type", y = "Proportion")

# Churn Rate by Contract Type Chart Analysis

This chart illustrates that the type of contract a customer holds is correlated with their likelihood to churn. Customers with 2 year contracts are likely to remain with the company while those that have month-to-month contracts are less likely to remain. This should encourage Regork to focus on long-term contracts to reduce the churn rates.

# train and test split
set.seed(123)
logistic_split <- initial_split(data, prop = 0.7, strata = "Status")
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)

# logistic regression
set.seed(123)
logistic_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)
# logistic regression with re sampling
log_metrics <- logistic_reg() %>%
  fit_resamples(Status ~ ., resamples = logistic_kfolds) %>%
  collect_metrics()
log_metrics

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.799     5 0.00401 Preprocessor1_Model1
## 2 brier_class binary     0.136     5 0.00188 Preprocessor1_Model1
## 3 roc_auc     binary     0.845     5 0.00521 Preprocessor1_Model1

# final model fit
final_log_model <- logistic_reg() %>%
  fit(Status ~ ., data = logistic_train)

#MARS
set.seed(123)
mars_spec <- mars(mode = "classification")
final_mars_model <- mars_spec %>%
  fit(Status ~ ., data = logistic_train)

## 
## Attaching package: 'plotrix'

## The following object is masked from 'package:scales':
## 
##     rescale

mars_preds <- final_mars_model %>%
  predict(logistic_test) %>%
  bind_cols(logistic_test %>% select(Status))
conf_mat(mars_preds, truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1383  242
##    Left        157  315

# decision tree
tree_model <- rpart(Status ~ ., data = logistic_train, method = "class")
tree_pred <- predict(tree_model, newdata = logistic_test, type = "class")
#confusion matrix and accuracy
table(Predicted = tree_pred, Actual = logistic_test$Status)

##          Actual
## Predicted Current Left
##   Current    1398  294
##   Left        142  263

# random forest 
rf_model <- randomForest(Status ~ ., data = logistic_train)
rf_pred <- predict(rf_model, newdata = logistic_test)
rf_model_tuned <- randomForest(Status ~ ., data = logistic_train, ntree = 500, mtry = 3)
#confusion matrix
table(Predicted = rf_pred, Actual = logistic_test$Status)

##          Actual
## Predicted Current Left
##   Current    1350  247
##   Left        190  310

mean(rf_pred == logistic_test$Status)

## [1] 0.7916071

#AUC
rf_probs <- predict(rf_model, newdata = logistic_test, type = "prob")[, "Left"]
rf_roc <- roc(logistic_test$Status, rf_probs)

## Setting levels: control = Current, case = Left

## Setting direction: controls < cases

auc(rf_roc)

## Area under the curve: 0.8313

plot(rf_roc, main = "ROC Curve - Random Forest")

The Presence of False Positives

More false positives are present in our data set to be proactive about losing customers. These false positives may be people that have potential to leave so prioritizing this could help retention.

# feature importance
vip(rf_model, num_features = 10, bar = TRUE) +
  labs(title = "Top 10 Influencing Features")

# Monthly Revenue Loss from Churned Customers
monthly_revenue_loss <- data %>%
  filter(Status == "Left") %>%
  summarise(revenue_lost = sum(MonthlyCharges, na.rm = TRUE))
monthly_revenue_loss

##   revenue_lost
## 1     138222.4

Potential Monthly Revenue Loss

Based on our insights, if no action is taken there could be a potential revenue loss of $136,222.40 each month.

Our Implications and Future Incentive Scheme Recommendations

For Regork to continue to retain customers, our overall recommendation is to focus on month-to-month contracts and understand the customers behaviors to offer retention incentives to continue to come to Regork. Through our analysis, we found that the highest churn rates occurred with customers on month-to-month contracts. We found that customers with longer contracts experienced higher loyalty to Regork.

Our Limitations on our Analysis

Limitations on our analysis include not having behavior data on our customers to focus on why retention rates may decline. Having real-time data like this could increase the importance of our insights and allow for more meaningful data to be built. Having customer satisfaction data could also further our insights. With this future analysis, this could improve our predication accuracy overall.

Final Project

Emily Whitehead

2025-04-18