Now that Regork has entered the telecommunications industry, retaining existing customers has become even more important because attracting new ones is costly. As a result, we need to identify which customers Regork is at risk of losing and offer them incentives to stay. Using data on current and former customers, we can apply several analytical techniques and machine learning methods to build accurate models that let us target the customers at highest risk of leaving.
library(tidyverse)  # data wrangling and visualization (attaches tidyr, dplyr, and ggplot2)
library(tidymodels) # machine learning framework (resampling, tuning, metrics)
library(vip)        # variable importance plots
First, we read in the “Customer Status” dataset.
customer <- read_csv("customer_retention.csv")
Next, we will tidy the data by changing all entries such as “No phone service” or “No internet service” to “No”. Keeping the predictors binary (yes or no) makes their relationships with customer status easier to visualize. For example, a customer with no phone service cannot have multiple lines, so changing that entry from “No phone service” to “No” makes it clearer whether or not a customer has a given attribute.
customer[customer=="No phone service"] <- "No"
customer[customer=="No internet service"] <- "No"
Next, we will perform some exploratory analysis using tables and plots to examine the predictor variables and uncover underlying trends that help explain why some customers leave.
table(customer$Contract, customer$Status)
##
## Current Left
## Month-to-month 2204 1643
## One year 1300 165
## Two year 1639 48
In the above table, we see that those with a month-to-month contract leave at a much higher rate than those who have (or had) one- or two-year contracts. As a result, we may want to incentivize one- and two-year contracts to reduce customer churn, which is a much bigger problem among customers on the monthly plan.
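To put a number on “much higher rate,” a small helper (hypothetical, not part of the original analysis) computes the share of customers who left within each level of a variable:
# hypothetical helper: churn rate within each level of a grouping variable
churn_rate_by <- function(df, var) {
  df %>%
    count({{ var }}, Status) %>%
    group_by({{ var }}) %>%
    mutate(rate = n / sum(n)) %>%
    filter(Status == "Left")
}
churn_rate_by(customer, Contract) # from the table above: roughly 43% month-to-month vs 11% (one year) and 3% (two year)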
ggplot(customer, aes(x = PaymentMethod, fill = Status)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2), name = "Customer Payment Method") +
  labs(title = "Payment Method vs Customer Status")
In this plot, we see that those who pay with electronic check leave at a much higher rate than those who pay by other methods. Therefore, we may want to target incentives at customers who pay this way.
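The same hypothetical helper from above quantifies the gap shown in the plot:
churn_rate_by(customer, PaymentMethod)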
ggplot(customer, aes(x = InternetService, fill = Status)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(name = "Customer's Internet Service") +
  labs(title = "Internet Service vs Customer Status")
Those with fiber optic internet service leave at a much higher rate than those with DSL or no internet. This is surprising, considering fiber optic tends to be the faster and more reliable form of internet service; higher monthly charges for fiber plans may be a contributing factor. Nonetheless, we may need to incentivize customers with fiber optic service to reduce churn.
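And again for internet service:
churn_rate_by(customer, InternetService)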
# average tenure (in months) by customer status
customer %>%
  group_by(Status) %>%
  summarize(avg_tenure = mean(Tenure))
## # A tibble: 2 × 2
## Status avg_tenure
## <chr> <dbl>
## 1 Current 37.6
## 2 Left 18.0
This table shows that our current customers have an average tenure of 37.6 months, while our former customers left after an average of 18 months. With this knowledge, we may want to offer incentives for customers to stay through their first one to two years, the window in which churn is most common.
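To see when customers leave, rather than just the averages, we can sketch the full tenure distribution by status (a quick plot on the same data; the 6-month bin width is an arbitrary choice):
ggplot(customer, aes(x = Tenure, fill = Status)) +
  geom_histogram(binwidth = 6, position = "dodge") +
  labs(title = "Tenure Distribution vs Customer Status", x = "Tenure (months)")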
Next we will develop some models using different classification methods to better predict which customers we are at most risk of losing. We will assess each of these methods and look at the feature importance of the best model to determine what features seem to be most influential in determining whether or not customers will leave.
Our first model will use the logistic regression method.
# encode the response as a factor for classification
customer <- mutate(customer, Status = factor(Status))
set.seed(123)
split <- initial_split(customer, prop = 0.7, strata = Status)
customer_train <- training(split)
customer_test <- testing(split)
set.seed(123)
kfolds <- vfold_cv(customer_train, v = 5)
logistic_reg() %>%
  fit_resamples(Status ~ ., kfolds) %>%
  collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.796 5 0.00994 Preprocessor1_Model1
## 2 roc_auc binary 0.844 5 0.00769 Preprocessor1_Model1
This model has an AUC of 0.844, meaning that if we pick one customer who left and one who stayed at random, the model ranks the churner as the higher risk about 84.4% of the time.
# retrain our model across the entire training data
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = customer_train)
tidy(final_fit)
## # A tibble: 24 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.03 0.975 1.06 2.89e- 1
## 2 GenderMale -0.0174 0.0778 -0.224 8.23e- 1
## 3 SeniorCitizen 0.187 0.101 1.85 6.47e- 2
## 4 PartnerYes -0.0630 0.0934 -0.674 5.00e- 1
## 5 DependentsYes -0.0397 0.106 -0.372 7.10e- 1
## 6 Tenure -0.0561 0.00752 -7.45 9.16e-14
## 7 PhoneServiceYes 0.0864 0.774 0.112 9.11e- 1
## 8 MultipleLinesYes 0.464 0.211 2.20 2.80e- 2
## 9 InternetServiceFiber optic 1.76 0.949 1.85 6.42e- 2
## 10 InternetServiceNo -1.72 0.965 -1.78 7.51e- 2
## # … with 14 more rows
final_fit %>%
  predict(customer_test) %>%
  bind_cols(customer_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1389 247
## Left 153 310
Above is the confusion matrix from the logistic regression model on the test set. Treating “Left” as the positive class, the model correctly identifies 1389 current customers (true negatives) and 310 churners (true positives), while misclassifying 247 churners as staying (false negatives) and 153 current customers as leaving (false positives). Overall it performs well, though the 247 missed churners are the costly errors for a retention campaign.
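For churn-focused rates rather than raw counts, yardstick can compute sensitivity and specificity from the same predictions. A sketch (test_preds is a name we introduce here); event_level = "second" is needed because “Left” is the second factor level:
test_preds <- final_fit %>%
  predict(customer_test) %>%
  bind_cols(customer_test %>% select(Status))
# share of actual churners we catch, and share of current customers correctly kept
test_preds %>% sens(truth = Status, estimate = .pred_class, event_level = "second")
test_preds %>% spec(truth = Status, estimate = .pred_class, event_level = "second")
From the confusion matrix above, these work out to roughly 310/557 ≈ 56% of churners caught and 1389/1542 ≈ 90% of current customers correctly classified.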
vip(final_fit, num_features = 20)
Here we see that a customer’s tenure and contract type appear to be the biggest factors in whether a customer stays with Regork or leaves. This confirms what we found in our exploratory analysis: we should incentivize our customers to move into longer contracts.
Our second model will use the decision tree method.
dt_mod <- decision_tree(
  mode = "classification",
  tree_depth = 10,
  min_n = 5,
  cost_complexity = 0.01
) %>%
  set_engine("rpart") %>%
  fit(Status ~ ., data = customer_train)
# roundint = FALSE silences a warning about not being able to retrieve the model data
rpart.plot::rpart.plot(dt_mod$fit, roundint = FALSE)
# model recipe
full_model_recipe <- recipe(
Status ~ .,
data = customer_train
)
# 5-fold cross validation object
set.seed(123)
kfold2 <- vfold_cv(customer_train, v = 5)
dt_mod <- decision_tree(
  mode = "classification",
  tree_depth = 10,
  min_n = 5,
  cost_complexity = 0.01
) %>%
  set_engine("rpart")
# train model
results <- fit_resamples(dt_mod, full_model_recipe, kfold2)
# model results
collect_metrics(results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.789 5 0.00529 Preprocessor1_Model1
## 2 roc_auc binary 0.801 5 0.00664 Preprocessor1_Model1
This AUC of 0.801 is lower than our first model’s 0.844, so we will tune the hyperparameters.
# create model object with tuning options
dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")
# create the hyperparameter grid
hyper_grid2 <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n()
)
# hyperparameter value combinations to be assessed
hyper_grid2
## # A tibble: 27 × 3
## cost_complexity tree_depth min_n
## <dbl> <int> <int>
## 1 0.0000000001 1 2
## 2 0.00000316 1 2
## 3 0.1 1 2
## 4 0.0000000001 8 2
## 5 0.00000316 8 2
## 6 0.1 8 2
## 7 0.0000000001 15 2
## 8 0.00000316 15 2
## 9 0.1 15 2
## 10 0.0000000001 1 21
## # … with 17 more rows
# train our model across the hyper parameter grid
results2 <- tune_grid(dt_mod, full_model_recipe, resamples = kfold2, grid = hyper_grid2)
# get best results
show_best(results2, metric = "roc_auc", n = 10)
## # A tibble: 10 × 9
## cost_complexity tree_depth min_n .metric .estim…¹ mean n std_err .config
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 8 21 roc_auc binary 0.821 5 0.00477 Prepro…
## 2 0.00000316 8 21 roc_auc binary 0.821 5 0.00477 Prepro…
## 3 0.0000000001 8 40 roc_auc binary 0.818 5 0.00913 Prepro…
## 4 0.00000316 8 40 roc_auc binary 0.818 5 0.00913 Prepro…
## 5 0.0000000001 15 40 roc_auc binary 0.815 5 0.00796 Prepro…
## 6 0.00000316 15 40 roc_auc binary 0.815 5 0.00796 Prepro…
## 7 0.0000000001 8 2 roc_auc binary 0.802 5 0.00415 Prepro…
## 8 0.00000316 8 2 roc_auc binary 0.802 5 0.00415 Prepro…
## 9 0.0000000001 15 21 roc_auc binary 0.798 5 0.00651 Prepro…
## 10 0.00000316 15 21 roc_auc binary 0.798 5 0.00651 Prepro…
## # … with abbreviated variable name ¹.estimator
After tuning, the best AUC improves to about 0.821, but the logistic regression model still performs better.
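If we wanted to carry the tuned tree forward, the standard tidymodels pattern is to plug the best hyperparameters back into the specification and refit. A sketch (we did not end up using this model):
best_params <- select_best(results2, metric = "roc_auc")
final_dt <- finalize_model(dt_mod, best_params) %>%
  fit(Status ~ ., data = customer_train)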
Our third model will use the random forest method.
# ranger cannot handle missing values, so drop incomplete rows
customer_train2 <- na.omit(customer_train)
# create model recipe with all features
model_recipe <- recipe(
Status ~ .,
data = customer_train2
)
# create random forest model object
rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")
# create resampling procedure
set.seed(13)
kfold3 <- vfold_cv(customer_train2, v = 5)
# train model
results3 <- fit_resamples(rf_mod, model_recipe, kfold3)
# model results
collect_metrics(results3)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.797 5 0.00548 Preprocessor1_Model1
## 2 roc_auc binary 0.836 5 0.00608 Preprocessor1_Model1
This is already our second-best model; we’ll tune it to see if we can improve it even more.
rf_mod_oob <- rand_forest(mode = "classification") %>%
  set_engine("ranger") %>%
  fit(Status ~ ., data = customer_train2)
# square root of ranger's out-of-bag prediction error, a quick baseline estimate
sqrt(rf_mod_oob$fit$prediction.error)
## [1] 0.3721683
# create random forest model object with tuning options
rf_mod <- rand_forest(
  mode = "classification",
  trees = 1000,
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "permutation")
# create the hyperparameter grid
# (note: mtry cannot exceed the number of predictors, so the largest values in
# this range will not produce usable fits)
hyper_grid <- grid_regular(
  mtry(range = c(2, 80)),
  min_n(range = c(1, 20)),
  levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(rf_mod, model_recipe, resamples = kfold3, grid = hyper_grid)
# model results
show_best(results, metric = "roc_auc")
## # A tibble: 5 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 15 roc_auc binary 0.842 5 0.00609 Preprocessor1_Model16
## 2 2 20 roc_auc binary 0.842 5 0.00602 Preprocessor1_Model21
## 3 2 10 roc_auc binary 0.842 5 0.00627 Preprocessor1_Model11
## 4 2 5 roc_auc binary 0.841 5 0.00578 Preprocessor1_Model06
## 5 2 1 roc_auc binary 0.839 5 0.00609 Preprocessor1_Model01
After tuning, we see a slight improvement, but the logistic regression model still has the highest AUC (0.844) and remains our best model.
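As a cross-check on the logistic model’s feature importance, we could finalize the tuned random forest (which was fit with permutation importance) and plot its variable importance. A sketch:
best_rf <- select_best(results, metric = "roc_auc")
final_rf <- finalize_model(rf_mod, best_rf) %>%
  fit(Status ~ ., data = customer_train2)
vip(final_rf$fit, num_features = 10)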
Based on our analysis, the three biggest predictors of whether a customer will stay with Regork are tenure and the one- and two-year contract indicators. Customers who have stayed with us for a while are far less likely to leave, as are customers on one- or two-year contracts rather than month-to-month plans. Because of this, Regork should focus on incentivizing these longer contracts, since moving customers onto them also directly leads to longer tenure. Those on monthly contracts are far more likely to leave, so by offering discounts and promotions on longer contracts we can convert many monthly customers to longer commitments and greatly reduce churn. If we fail to do this, we will continue losing revenue each month on short-tenure, month-to-month customers.
The biggest limitation of our analysis is that we may not have enough data. While 19 predictor variables may seem like a lot, there are certainly other factors to consider; in particular, additional demographic information would be useful for targeting specific groups of customers and would make our analysis stronger.