Introduction

Regork wants to keep its customers around because it’s a lot cheaper to keep current customers than to find new ones. Our project focused on figuring out which customers are most likely to leave so the company can step in and keep them. We used customer data like contract type, billing info, and how long people have been with Regork. We ran some data analysis and built a few graphs to find patterns tied to people leaving. The goal is to help Regork find the at-risk customers early and offer deals or incentives to keep them. This will save money, reduce churn, and help the company grow faster.

Data Preparation

Packages required for the results in this report.

library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(here)
library(kernlab)
library(ggplot2)
library(ranger)
library(randomForest)
library(pROC)

The data that will be used in this report can be seen below.

df <- read.csv("data/customer_retention.csv")
df <- mutate(df, Status = factor(Status))
df <- na.omit(df)

Exploratory Analysis

We first started by creating graphs to help identify some trends in our data.

churn_by_contract <- df %>%
  group_by(Contract) %>%
  summarize(ChurnRate = mean(Status == "Left")) %>%
  mutate(ChurnRate = round(ChurnRate * 100, 1))  # Convert to percentage
ggplot(churn_by_contract, aes(x = Contract, y = ChurnRate, fill = Contract)) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_text(aes(label = paste0(ChurnRate, "%")), vjust = -0.5) +
  labs(title = "Churn Rate by Contract Type",
       x = "Contract Type",
       y = "Churn Rate (%)") +
  theme_minimal() +
  theme(legend.position = "none")

The graph above shows us the customer churn-rate based on contract type. As we can see there is a large churn when it comes to customers under the month-to-month contract.

ggplot(df, aes(x = Status, y = Tenure, fill = Status)) +
  geom_boxplot() +
  labs(title = "Customer Tenure by Churn Status",
       x = "Customer Status",
       y = "Tenure (Months)") +
  theme_minimal() +
  theme(legend.position = "none")

This box and whisker plot displays the data we have gathered on the customers who have left vs. the customers that have stayed based on their tenure.

churn_by_payment <- df %>%
  group_by(PaymentMethod) %>%
  summarize(ChurnRate = mean(Status == "Left") * 100)
ggplot(churn_by_payment, aes(x = reorder(PaymentMethod, -ChurnRate), y = ChurnRate, fill = PaymentMethod)) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_text(aes(label = paste0(round(ChurnRate, 1), "%")), vjust = -0.5) +
  labs(title = "Churn Rate by Payment Method",
       x = "Payment Method",
       y = "Churn Rate (%)") +
  theme_minimal() +
  theme(legend.position = "none")

We then decided to plot the churn rate found in our customers based on their payment method. As you can see there is a large difference in churn rate when it comes to customers who payed with an electronic check rather then those who paid through mailed checks, bank transfers, and credit cards.

churn_by_senior <- df %>%
  group_by(SeniorCitizen) %>%
  summarize(ChurnRate = mean(Status == "Left") * 100) %>%
  mutate(SeniorCitizen = ifelse(SeniorCitizen == 1, "Senior", "Non-Senior"))
ggplot(churn_by_senior, aes(x = SeniorCitizen, y = ChurnRate, fill = SeniorCitizen)) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_text(aes(label = paste0(round(ChurnRate, 1), "%")), vjust = -0.5) +
  labs(title = "Churn Rate by Senior Citizen Status",
       x = "Customer Group",
       y = "Churn Rate (%)") +
  theme_minimal() +
  theme(legend.position = "none")

Lastly we decided to plot the data on churn status based on senior status. We once again see a large difference between our variables. This time with seniors having a much larger churn rate then younger customers.

Machine Learning

Logistic Regression

The first machine learning model we decided to use for our analysis is a logistic regression model.

set.seed(123)
logistic_split <- initial_split(df, prop = .7, strata = Status)
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)

set.seed(123)
logistic_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)

logistic_reg() %>% 
  fit_resamples(
    Status ~ ., 
    logistic_kfolds
    ) %>%
  
  collect_metrics()
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.799     5 0.00401 Preprocessor1_Model1
## 2 brier_class binary     0.136     5 0.00188 Preprocessor1_Model1
## 3 roc_auc     binary     0.845     5 0.00521 Preprocessor1_Model1

After running our logistic regression model we have gotten an acceptable AUC rating. Next we will continue our machine learning by running a decision tree model.

Decision Tree Model

set.seed(123)
dt_split <- initial_split(df, prop = 0.7, strata = "Status")
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
dt_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

dt_mod_recipe <- recipe(
  Status ~ .,
  data = logistic_train
)

dt_fit <- workflow() %>%
  add_recipe(dt_mod_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = logistic_train)

dt_results <- fit_resamples(dt_mod, dt_mod_recipe, logistic_kfolds)

collect_metrics(dt_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.790     5 0.00338 Preprocessor1_Model1
## 2 brier_class binary     0.145     5 0.00266 Preprocessor1_Model1
## 3 roc_auc     binary     0.808     5 0.00813 Preprocessor1_Model1
rpart.plot::rpart.plot(dt_fit$fit$fit$fit, roundint = FALSE)

dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
)

dt_hypergrid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5 
)

set.seed(123)
dt_results <- tune_grid(dt_mod, dt_mod_recipe, resamples = logistic_kfolds, grid = dt_hypergrid)

show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          4     2 roc_auc binary     0.820     5 0.00667
## 2    0.0000000178          4     2 roc_auc binary     0.820     5 0.00667
## 3    0.00000316            4     2 roc_auc binary     0.820     5 0.00667
## 4    0.0000000001          4    11 roc_auc binary     0.815     5 0.00808
## 5    0.0000000178          4    11 roc_auc binary     0.815     5 0.00808
## # ℹ 1 more variable: .config <chr>
dt_best_model <- select_best(dt_results, metric = "roc_auc")

dt_final_wf <- workflow() %>%
  add_recipe(dt_mod_recipe) %>%
  add_model(dt_mod) %>%
  finalize_workflow(dt_best_model)

dt_final_fit <- dt_final_wf %>%
  fit(data = logistic_train)

dt_final_fit %>%
  extract_fit_parsnip() %>%
  vip(10)

After running our decision tree model we noticed that the model is very good when it comes to predicting current customers rather then the ones that left. The most important features being Contract and Tenure.

Random Forest Model

set.seed(123)
rf_split <- initial_split(df, prop = 0.7, strata = "Status")
rf_train <- training(rf_split)
rf_test <- testing(rf_split)

rf_recipe <- recipe(
  Status ~ .,
  data = logistic_train
)

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

set.seed(123)
kfold <- vfold_cv(rf_train, v = 5)

results <- fit_resamples(rf_mod, rf_recipe, kfold)

collect_metrics(results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.797     5 0.00643 Preprocessor1_Model1
## 2 brier_class binary     0.139     5 0.00356 Preprocessor1_Model1
## 3 roc_auc     binary     0.837     5 0.0104  Preprocessor1_Model1
rf_mod <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
  ) %>% 
  set_engine("ranger", importance = "impurity") 
  

hyper_grid <- grid_regular(
  trees(range = c(100,1000)),
  mtry(range = c(2, 19)),
  min_n(range = c(1, 20)),
  levels = 5
)

set.seed(123)
results <- tune_grid(rf_mod, rf_recipe, resamples = kfold, grid = hyper_grid)

show_best(results, metric = "roc_auc")
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2   775    20 roc_auc binary     0.844     5  0.0101 Preprocessor1_Model1…
## 2     2  1000    20 roc_auc binary     0.844     5  0.0103 Preprocessor1_Model1…
## 3     2   550    20 roc_auc binary     0.843     5  0.0103 Preprocessor1_Model1…
## 4     2  1000    15 roc_auc binary     0.843     5  0.0104 Preprocessor1_Model0…
## 5     2   550    15 roc_auc binary     0.843     5  0.0103 Preprocessor1_Model0…
autoplot(results)

rf_best <- select_best(results, metric = "roc_auc")

rf_wf <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe) %>%
  finalize_workflow(rf_best)

rf_final_fit <- rf_wf %>%
  fit(data = rf_train)

rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

After running our Random Forest Model, we can see that we have gotten a very successful AUC rating. Our model provides great accuracy and AUC values. In comparison to our Decision Tree Model we also see that we have the same top 3 predictor variables being Contract, Tenure, and Total Charges. We will discuss more in our Business Analysis how Tenure is the most effective.

Confusion Matrix

Lastly we cre2ated a confusion matrix to identify false positives or negatives in the dataset for our logistic regression model.

confus_matrix <- logistic_reg() %>%
fit(Status ~ ., data = logistic_train)

confus_matrix %>% predict(logistic_test) %>% 
  bind_cols(logistic_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

Business Analysis

Important Predictor Variables

After running all of our machine learning models, our 5 most important predictor variables are, Tenure, Total Charges, Contract, Monthly Charges, and Online Security.

Our most important predictor variable is Tenure. Through our exploratory analysis and machine learning models, there is strong evidence that it is the most important. We reccomend that Regork strongly considers targeting their most tenured customers.

Another strong predictor variable in our analysis we think Regork should focus on is Contract. There seems to be a large trend in churn rate when it comes to what contract Regork’s customers choose. We would strongly recommend encouraging current and future customers to apply for longer contracts and leading them away from the month-to-month contract.

Lastly, as previously mentioned, Total Charges, Monthly Charges, and Online Security are also strong features of this data set. It is clear that the cost of Regork’s contracts unsatisfy a considerable portion of their customers. As for Online Security, customers also seem to be more self-aware and concerned about how Regork is using and protecting their data. We would recommend Regork to maintain their prices as low as possible, and to also ensure customers of how protecting their data is of great importance.

Conclusion

Regork should consider our random forest model as it predicts pretty well on customer status. With an AUC of 84%, the model can be used to make successful business decisions. We also mentioned the most important areas to look into so Regork is able to focus on those areas to keep these customers long term. Regork can make a great telecommunication service by using our predictive model.