Introduction

I have been tasked with analyzing customer data from Regork’s telecommunications market. Ultimately, the goal at hand is to find a model that efficiently and accurately predicts whether customers will churn in the future, as to help Regork retain them as customers.

Data Preparation

Required Packages

library(tidymodels)
library(tidyverse)
library(dplyr)
library(vip)
library(pdp)
library(kernlab)
library(baguette)
library(GGally)
library(ggmosaic)
library(rpart.plot)

tidymodels: A set of tools for building and tuning machine learning models in a tidy way.
tidyverse: A collection of R packages for data science, making data manipulation and visualization easier.
dplyr: A package for easily manipulating and transforming data.
vip: Helps visualize which features are most important in machine learning models.
pdp: Creates plots to show how different features affect your model’s predictions.
kernlab: Provides tools for advanced machine learning methods like support vector machines.
baguette: Makes it easy to use bagging techniques to improve model performance.
GGally: Adds extra plotting functions to ggplot2 for more complex visualizations.
ggmosaic: Allows you to create mosaic plots using ggplot2.
rpart.plot: Simplifies the process of visualizing decision trees.

Data Loading, Cleaning, and Churn Rate

retention <- read_csv("customer_retention.csv")
retention <- retention %>%
  mutate(Status = as.factor(Status))
retention <- retention %>%
  drop_na()

churn_rate <- retention %>%
  count(Status) %>%
  mutate(percentage = n / sum(n) * 100)
churn_rate

## # A tibble: 2 × 3
##   Status      n percentage
##   <fct>   <int>      <dbl>
## 1 Current  5132       73.4
## 2 Left     1856       26.6

After loading and cleaning our data, we can utilize a churn rate calculation to view a baseline of customer retention and potential areas for improvement, before starting our exploratory analysis.

Exploratory Analysis

The Exploratory Analysis section presents the visualizations and data findings that have been utilized throughout the R project, as to address areas for improvement.

Gender vs Status

# Gender vs Status
retention %>% 
  ggplot(aes(PhoneService, fill=Gender))+
  geom_bar() +
  labs(title = "Gender Count and Phone Service Use")

# Gender and Senior Citizen vs. Status
retention %>% 
  mutate(Senior_Citizen = case_when(
    SeniorCitizen == 1 ~ "Senior Citizen",
    SeniorCitizen == 0 ~ "Not a Senior Citizen")) %>% 
  ggplot(aes(PhoneService, fill=Senior_Citizen))+
  geom_bar()+
  labs(title = "Senior Citizen Count and Phone Service Use")

According to the charts above, more female customers use Regork’s phone services than males. Additionally, the second chart shows us that there are more people who aren’t senior citizens using said phone plan than there are actual senior citizens. This shows us that there is both a prevalance of females who are not senior citizens using the phone services.

Partner & Dependents vs. Status

# Stacked Bar Chart of Partner and Dependents vs. Status
retention %>%
  ggplot(aes(x = interaction(Partner, Dependents), fill = Status)) +
  geom_bar(position = "stack") +
  labs(title = "Count of Customers by Partner and Dependents Status", x = "Partner and Dependents", y = "Count") +
  scale_x_discrete(labels = c("No Partner\nNo Dependents", "No Partner\nDependents", "Partner\nNo Dependents", "Partner\nDependents")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As Shown above, most of Regork’s telecommunications customer base consists of those who have no partner and no dependents.

The order of customer prevelance based on partners & dependents is as follows:

No Partner or Dependents
Partner with Dependents
No Partner but has Dependents
Partner but no Dependents

Internet, Security, and Tenure vs. Status

# Internet Service and Online Security vs. Status
retention %>%
  ggplot(aes(InternetService, fill = Status)) +
  geom_bar() +
  facet_wrap(~ OnlineSecurity) +
  labs(title = "Internet Service and Online Security Count vs. Status")

# Tenure and Internet Service vs. Status
retention %>%
  ggplot(aes(Tenure, fill = Status)) +
  geom_histogram(binwidth = 1, position = "dodge") +
  facet_wrap(~ InternetService) +
  labs(title = "Tenure and Internet Service Count vs. Status", x = "Tenure (months)", y = "Count")

The first plot above displays Internet Service and Online Security versus Status. It is clear that there are a higher number of customers with no online security that have fiber optics and DSL. In fact, most of Regork’s current and previous customers did not have online security, yet they had internet services.

The second plot displays Tenure and Internet Service versus Status. In this plot we clearly see that fiber optics retains the most customers. However, it appears as though many previous customers (left) stopped their fiber optics services after only a few months.

Payment Methods & Contract Types vs. Status

# Payment Method vs. Status
retention %>%
  ggplot(aes(PaymentMethod, fill = Status)) +
  geom_bar() +
  labs(title = "Payment Method Count vs. Status")

# Contract vs Status
retention %>%
  ggplot(aes(Contract, fill = Status)) +
  geom_bar() +
  labs(title = "Contract Type and Customer Status")

Correlation Plot

# Correlation plot for numeric variables
numeric_vars <- retention %>%
  select(Tenure, MonthlyCharges, TotalCharges)
ggpairs(numeric_vars, aes(color = retention$Status))

This chart depicts a scatter plot matrix, which allows us to identify correlations between our numeric variables (Tenure, Monthly Charges, and Total Charges) and their difference regarding customers who have or have not churned.

Machine Learning (ML)

Logistic Regression & Global Recipe (Optimal Model)

# Logistic Regression: 
set.seed(123)
retention_split <- initial_split(retention, prop = 0.7, strata = "Status")
retention_train <- training(retention_split)
retention_test  <- testing(retention_split)

retention_recipe <- recipe(Status ~ ., data = retention_train) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
  step_zv(all_predictors())

lr_mod <- logistic_reg() %>%
  set_engine("glm")

set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)

log_results <- lr_mod %>%
  fit_resamples(Status ~ ., kfolds)

collect_metrics(log_results) %>% filter(.metric == "roc_auc")

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.845     5 0.00521 Preprocessor1_Model1

final_fit <- lr_mod %>%
  fit(Status ~ ., data = retention_train)

# Confusion Matrix
final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

# ROC Curve
final_fit %>%
  predict(retention_train, type = "prob") %>%
  mutate(truth = retention_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

# Coefficients for feature importance
coefficients <- tidy(final_fit$fit)


vip(final_fit$fit, num_features = 10) +
  ggtitle("Top 10 Important Features from Logistic Regression Model")

top_features <- coefficients %>%
  mutate(abs_estimate = abs(estimate)) %>%
  arrange(desc(abs_estimate)) %>%
  top_n(10, abs_estimate)

predictions <- predict(final_fit, retention_test, type = "prob")


retention_test <- retention_test %>%
  bind_cols(predictions) %>%
  mutate(expected_loss = .pred_Left * MonthlyCharges)


monthly_predicted_loss <- retention_test %>%
  summarize(monthly_loss = sum(expected_loss))

Our logistic regresion model does a good job at predicting when a customer is still active compared to when a customer has left/churned.

Additionally, it is clear that the inaccuracy of the model stems from FP (false positives) rather than FN (false negatives).

Model Bias:

Indicative of the curve above, there is bias towards predicting customer activity while customers are no longer using Regork’s services

Important Feature(s)

As displayed by the bar graph, our most significant and influential feature is tenure. This means that the length of time Regork’s customers has been with the company is the strongest predictor of churn.

Generalization Error and Its Implications

# Predict on the test set
test_predictions <- final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status))

# Compute confusion matrix
conf_matrix <- test_predictions %>%
  conf_mat(truth = Status, estimate = .pred_class)

# Calculate generalization error
generalization_error <- 1 - sum(diag(conf_matrix$table)) / sum(conf_matrix$table)
generalization_error

## [1] 0.1921793

Because our model is efficient and has a low generalization error, it is likely that our model will have a high success rate when applied in the future.

Decision Tree

#Decision Tree
dt_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

dt_fit <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = retention_train)

dt_results <- fit_resamples(dt_mod, retention_recipe, kfolds)

collect_metrics(dt_results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.788     5 0.00399 Preprocessor1_Model1
## 2 brier_class binary     0.161     5 0.00171 Preprocessor1_Model1
## 3 roc_auc     binary     0.710     5 0.00529 Preprocessor1_Model1

# Hypertuning
dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

set.seed(123)
dt_results <- tune_grid(dt_mod, retention_recipe, resamples = kfolds, grid = dt_hyper_grid)

show_best(dt_results, metric = "roc_auc", n = 5)

## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          8    30 roc_auc binary     0.816     5 0.00552
## 2    0.0000000178          8    30 roc_auc binary     0.816     5 0.00552
## 3    0.00000316            8    30 roc_auc binary     0.816     5 0.00552
## 4    0.0000000001          8    40 roc_auc binary     0.815     5 0.00420
## 5    0.0000000178          8    40 roc_auc binary     0.815     5 0.00420
## # ℹ 1 more variable: .config <chr>

dt_best_model <- select_best(dt_results, metric = 'roc_auc')

dt_final_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(dt_mod) %>%
  finalize_workflow(dt_best_model)

dt_final_fit <- dt_final_wf %>%
  fit(data = retention_train)

dt_final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1317  243
##    Left        223  314

dt_final_fit %>%
  predict(retention_train, type = "prob") %>%
  mutate(truth = retention_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

# Visualization with rpart.plot
final_tree <- pull_workflow_fit(dt_final_fit)$fit

printcp(final_tree)

## 
## Classification tree:
## rpart::rpart(formula = ..y ~ ., data = data, cp = ~1e-10, maxdepth = ~8, 
##     minsplit = min_rows(30, data))
## 
## Variables actually used in tree construction:
##  [1] Contract_One.year                     Contract_Two.year                    
##  [3] Gender_Male                           InternetService_Fiber.optic          
##  [5] InternetService_No                    MonthlyCharges                       
##  [7] MultipleLines_Yes                     OnlineBackup_Yes                     
##  [9] OnlineSecurity_Yes                    PaperlessBilling_Yes                 
## [11] PaymentMethod_Credit.card..automatic. PaymentMethod_Electronic.check       
## [13] PaymentMethod_Mailed.check            SeniorCitizen                        
## [15] StreamingMovies_Yes                   Tenure                               
## [17] TotalCharges                         
## 
## Root node error: 1299/4891 = 0.26559
## 
## n= 4891 
## 
##            CP nsplit rel error  xerror     xstd
## 1  1.0431e-01      0   1.00000 1.00000 0.023777
## 2  8.0831e-03      2   0.79138 0.80216 0.022044
## 3  5.0038e-03      4   0.77521 0.80754 0.022098
## 4  3.3359e-03      7   0.75674 0.78984 0.021920
## 5  3.0793e-03     14   0.72440 0.79369 0.021959
## 6  2.3095e-03     21   0.69746 0.78984 0.021920
## 7  1.5396e-03     23   0.69284 0.79677 0.021990
## 8  1.1547e-03     28   0.68514 0.80446 0.022068
## 9  1.0778e-03     31   0.68129 0.80370 0.022060
## 10 7.6982e-04     37   0.67360 0.80370 0.022060
## 11 6.1586e-04     40   0.67129 0.82063 0.022227
## 12 5.1322e-04     45   0.66821 0.82294 0.022250
## 13 1.0000e-10     51   0.66436 0.83141 0.022332

pruned_tree <- prune(final_tree, cp = 0.01)  

rpart.plot(pruned_tree, 
           type = 2,         
           extra = 104,       
           under = TRUE,      
           faclen = 0,        
           fallen.leaves = TRUE, 
           main = "Pruned Decision Tree: Significant Variables")

Bagging

#bagging
bag_mod <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("classification")

bag_results <- fit_resamples(bag_mod, retention_recipe, kfolds)

collect_metrics(bag_results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.764     5 0.00189 Preprocessor1_Model1
## 2 brier_class binary     0.171     5 0.00193 Preprocessor1_Model1
## 3 roc_auc     binary     0.769     5 0.00433 Preprocessor1_Model1

bag_mod <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")

# hyperparameter grid
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))


set.seed(123)
bag_results <- tune_grid(bag_mod, retention_recipe, resamples = kfolds, grid = bag_hyper_grid)

show_best(bag_results, metric = "roc_auc")

## # A tibble: 5 × 7
##   times .metric .estimator  mean     n std_err .config             
##   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1   200 roc_auc binary     0.822     5 0.00347 Preprocessor1_Model5
## 2   300 roc_auc binary     0.821     5 0.00421 Preprocessor1_Model6
## 3   100 roc_auc binary     0.821     5 0.00454 Preprocessor1_Model4
## 4    50 roc_auc binary     0.816     5 0.00374 Preprocessor1_Model3
## 5    25 roc_auc binary     0.809     5 0.00453 Preprocessor1_Model2

bag_best_model <- select_best(bag_results, metric = 'roc_auc')

bag_final_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(bag_mod) %>%
  finalize_workflow(bag_best_model)

bag_final_fit <- bag_final_wf %>%
  fit(data = retention_train)

# Confusion Matrix
bag_final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1341  263
##    Left        199  294

# ROC Curve
bag_final_fit %>%
  predict(retention_train, type = "prob") %>%
  mutate(truth = retention_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

Business Analysis & Recommendation

In terms of relative importance how would you rate the predictors in your model. As a business manager, which factors would you focus on (for example you could invest in offering some incentives or promotions) to decrease the chances of customers leaving?

In our Logistic Regression model, our most optimal model, it is clear that Tenure and Total Charges are our most influential features. Additionally, our decision tree model shows that Monthly Charges and Internet Service Fiber Optic follow in regards to influence. To ensure that Regork keeps as many customers as they can, it is imperative for them to focus on Tenure and Total Charges. With relation to these features, Regork could roll out incentives and promotions. Possible promotions related to Tenure would be discounts or special package deals for those who have used the company services for a certain amount of time.

Collect all the customers from the test dataset that you predict are going to leave.

Looking at our plots that show demographics versus status, we can see that the majority of customer churn revolves around males, those who are without partners, do not have dependents, and aren’t senior citizens. This trend may be due to the financial stability of those who do not fall under those categories. For instance, those who have dependents (families) and partners face far worse financial strain than those who are single and live alone.

What is the predicted loss in revenue per month if no action is taken?

monthly_predicted_loss

## # A tibble: 1 × 1
##   monthly_loss
##          <dbl>
## 1       42412.

ggplot(retention_test, aes(x = expected_loss)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Predicted Monthly Revenue Loss",
       x = "Predicted Monthly Revenue Loss",
       y = "Count of Customers")

According to our optimal model, Regork’s predicted loss per month based on customer churn is $42411.7.

Propose an incentive scheme to your manager to retain these customers. Use your model to justify your proposal. You can do this by performing a cost benefit analysis (comparing the cost of the incentive plan to the benefit of retaining the customers).

After analyzing the trends of customer churn through Regork’s telecommunications model it is clear that there must be incentive plans in place to avoid future churn. One idea fo this incentive system would be lyoalty rewards, wherein customers earn points for tenure (months stayed) which can be applied to future bills or other services they wish to purchase. Additionally, because customer churn is higher for those who have less internet services tenure, it would be beneficial to give customers random service plan upgrades, as well as having their first month be free or discounted.

Conclusion

In conclusion, it is now evident that Regork’s telecommunications service market is highly influenced by both customer tenure and their total charges. Because of this, it is imperative that Regork begins implementing loyalty and reward programs that cater to quickly churning demographics mentioned above. Once Regork rolls out such programs, they will ultimately avoid the monthly loss that was calculated before. Finally, using the optimal logistic regression model will help Regork to continue in this market.

Final Project

Henry Bekins

2024-12-09

Introduction

Introduction

Data Preparation

Required Packages

Data Loading, Cleaning, and Churn Rate

Exploratory Analysis

Gender vs Status

Partner & Dependents vs. Status

Internet, Security, and Tenure vs. Status

Payment Methods & Contract Types vs. Status

Correlation Plot

Machine Learning (ML)

Logistic Regression & Global Recipe (Optimal Model)

Model Bias:

Important Feature(s)

Generalization Error and Its Implications

Decision Tree

Bagging

Business Analysis & Recommendation

In terms of relative importance how would you rate the predictors in your model. As a business manager, which factors would you focus on (for example you could invest in offering some incentives or promotions) to decrease the chances of customers leaving?

Collect all the customers from the test dataset that you predict are going to leave.

What is the predicted loss in revenue per month if no action is taken?

Propose an incentive scheme to your manager to retain these customers. Use your model to justify your proposal. You can do this by performing a cost benefit analysis (comparing the cost of the incentive plan to the benefit of retaining the customers).

Conclusion