knitr::opts_chunk$set(echo = TRUE)
suppressPackageStartupMessages({
library(tidymodels)
library(tidyverse)
library(vip)
library(ggplot2)
library(randomForest)
library(pROC)
library(rpart)
})
Regork Telecom wants to reduce customer churn, which is more cost-effective than acquiring new customers. Our goal is to use historical customer data to build models that predict whether a customer is likely to leave.
We use a data set of customer account and service details. We split the data into training and testing sets, explore key trends using visualizations, and then build three models: logistic regression, decision tree, and random forest. We evaluate performance based on accuracy and AUC.
Predicting churn allows the business to intervene with retention strategies. Understanding key drivers of churn also helps shape future business offerings.
data <- read.csv("~/Lab data/customer_retention.csv", stringsAsFactors = TRUE)
#convert to numeric and remove NAs
data$TotalCharges <- as.numeric(as.character(data$TotalCharges))
#remove missing values
data <- na.omit(data)
#response variable
table(data$Status)
##
## Current Left
## 5132 1856
prop.table(table(data$Status)) * 100
##
## Current Left
## 73.44018 26.55982
# tenure vs churn
ggplot(data, aes(x = Tenure, fill = Status)) +
geom_histogram(binwidth = 5) +
labs(title = "Tenure by Customer Status", x = "Tenure (months)", y = "Count")
# Tenure by Customer Status Chart Analysis
This chart reveals that customer churn is mostly seen within the first 5 months. This suggests that Regork’s initial quality may not be meeting the expectations that customers have set. Although customers who remain beyond 60 months show extremely strong loyalty.
# churn by type of contract
ggplot(data, aes(x = Contract, fill = Status)) +
geom_bar(position = "fill") +
labs(title = "Churn Rate by Contract Type", y = "Proportion")
# Churn Rate by Contract Type Chart Analysis
This chart illustrates that the type of contract a customer holds is correlated with their likelihood to churn. Customers with 2 year contracts are likely to remain with the company while those that have month-to-month contracts are less likely to remain. This should encourage Regork to focus on long-term contracts to reduce the churn rates.
# train and test split
set.seed(123)
logistic_split <- initial_split(data, prop = 0.7, strata = "Status")
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)
# logistic regression
set.seed(123)
logistic_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)
# logistic regression with re sampling
log_metrics <- logistic_reg() %>%
fit_resamples(Status ~ ., resamples = logistic_kfolds) %>%
collect_metrics()
log_metrics
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00401 Preprocessor1_Model1
## 2 brier_class binary 0.136 5 0.00188 Preprocessor1_Model1
## 3 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
# final model fit
final_log_model <- logistic_reg() %>%
fit(Status ~ ., data = logistic_train)
#MARS
set.seed(123)
mars_spec <- mars(mode = "classification")
final_mars_model <- mars_spec %>%
fit(Status ~ ., data = logistic_train)
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:scales':
##
## rescale
mars_preds <- final_mars_model %>%
predict(logistic_test) %>%
bind_cols(logistic_test %>% select(Status))
conf_mat(mars_preds, truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1383 242
## Left 157 315
# decision tree
tree_model <- rpart(Status ~ ., data = logistic_train, method = "class")
tree_pred <- predict(tree_model, newdata = logistic_test, type = "class")
#confusion matrix and accuracy
table(Predicted = tree_pred, Actual = logistic_test$Status)
## Actual
## Predicted Current Left
## Current 1398 294
## Left 142 263
# random forest
rf_model <- randomForest(Status ~ ., data = logistic_train)
rf_pred <- predict(rf_model, newdata = logistic_test)
rf_model_tuned <- randomForest(Status ~ ., data = logistic_train, ntree = 500, mtry = 3)
#confusion matrix
table(Predicted = rf_pred, Actual = logistic_test$Status)
## Actual
## Predicted Current Left
## Current 1350 247
## Left 190 310
mean(rf_pred == logistic_test$Status)
## [1] 0.7916071
#AUC
rf_probs <- predict(rf_model, newdata = logistic_test, type = "prob")[, "Left"]
rf_roc <- roc(logistic_test$Status, rf_probs)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc(rf_roc)
## Area under the curve: 0.8313
plot(rf_roc, main = "ROC Curve - Random Forest")
More false positives are present in our data set to be proactive about losing customers. These false positives may be people that have potential to leave so prioritizing this could help retention.
# feature importance
vip(rf_model, num_features = 10, bar = TRUE) +
labs(title = "Top 10 Influencing Features")
# Monthly Revenue Loss from Churned Customers
monthly_revenue_loss <- data %>%
filter(Status == "Left") %>%
summarise(revenue_lost = sum(MonthlyCharges, na.rm = TRUE))
monthly_revenue_loss
## revenue_lost
## 1 138222.4
Based on our insights, if no action is taken there could be a potential revenue loss of $136,222.40 each month.
For Regork to continue to retain customers, our overall recommendation is to focus on month-to-month contracts and understand the customers behaviors to offer retention incentives to continue to come to Regork. Through our analysis, we found that the highest churn rates occurred with customers on month-to-month contracts. We found that customers with longer contracts experienced higher loyalty to Regork.
Limitations on our analysis include not having behavior data on our customers to focus on why retention rates may decline. Having real-time data like this could increase the importance of our insights and allow for more meaningful data to be built. Having customer satisfaction data could also further our insights. With this future analysis, this could improve our predication accuracy overall.