Final Project

1. Introduction

Problem Statement and Summary:

Our team has been tasked with exploring the telecommunication division of Regork and running some data analysis on its current customer retention.

This will provide valuable insight for Regork moving forward with its business strategy to boost their churn and to attract new customers to their services. Since the cost of acquiring new customers is exponentially more than retaining them, we seek to provide the key functions that the firm is facing problems with and try to propose a plan that works.

Utilizing the provided dataset and machine learning models, we were able to find the most influential variables in the customer demographics and create a plan for Regork.

2. Packages Required

library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(kernlab)
library(ggplot2)
library(ranger)
library(earth)

Summary The libraries needed for the analysis

3. Data Prep

Calling the data

retention <- read_csv("data/customer_retention.csv")

retention_prepped <- retention %>%
  dplyr::mutate(Status = as.factor(Status)) %>%
  drop_na()

Splitting the data

#Data Split
split <- initial_split(retention_prepped, prop = 0.7, strata = "Status")
retention_train <- training(split)
retention_test <- testing(split)

Splitting 70/30 for training/testing

4. Exploratory Analysis

Status Distribution

mutate_retention <- retention_prepped %>%
  count(Status) %>%
  mutate(percentage = n / sum(n) * 100)

# Visualization of Status distribution
ggplot(mutate_retention, aes(x = Status, fill = Status)) +
  geom_bar() +
  labs(title = "Customer Status Distribution", x = "Status", y = "Count") +
  theme_minimal()

In the first chunk of code, we want to find out the percentage of current customers and customers who have left, as well as finding out the total number of active clients and total clients.

We then take our ‘Current’ vs ‘Left’ findings and put them into a box graph. The pink bar shows the count of current clients, while the blue bar is showing the count of clients that have terminated their contract with the internet providers.

Tenure vs Internet Service

# Tenure vs Internet Service
ggplot(retention_prepped, aes(x = InternetService, y = Tenure, fill = Status)) +
  geom_boxplot() +
  labs(title = "Tenure vs Internet Service", x = "Internet Service", y = "Tenure") +
  theme_minimal()

This next graph we created shows the tenure distribution for each internet service. The graph shows that Fiber Optic has the highest median tenure of active and inactive clients. This will help us in finding certain trends that are causing Fiber Optic to have the highest median tenure.

Monthly Charges by Status

# Tenure vs Internet Service
ggplot(retention_prepped, aes(x = MonthlyCharges, fill = Status)) +
  geom_density(alpha = 0.5) +
  labs(title = "Monthly Charges Distribution by Status", x = "Monthly Charges", y = "Density") +
  theme_minimal()

The next graph we created shows us the monthly charges distribution by status. This graph tells us that a lot of old customers had the highest monthly charges. This could be a big factor in their decision to terminate their contract with the internet providers. There are a lot less clients that have taken their business elsewhere that had monthly charges cheaper than 62 dollars each month.

Online Security Proportion

# Tenure vs Internet Service
retention_prepped %>%
  filter(InternetService != "No") %>%
  ggplot(aes(Status, fill = OnlineSecurity)) +
  geom_bar() +
  labs(title = "Online Security Proportion Internet Service", x = "Customer Status", y = "Count") +
  theme_minimal()

This next visual is showing the status of all of the clients, as well as sharing whether they had online security or not. It looks that approximately 50% of current clients have online security. However, looking at only clients who left, there is a significantly smaller amount with online security. This could have been a big factor in those client’s decision to take their business elsewhere.

Contract vs Status

# Create a stacked bar chart for Contract vs Status
ggplot(retention_prepped, aes(x = Contract, fill = Status)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Proportion of Churn by Contract Type",
    x = "Contract Type",
    y = "Percentage",
    fill = "Customer Status"
  ) +
  theme_minimal()

For our final visual, we created a stacked bar chart comparing the type of contract with the percentage of current customers and customers who have left. There is a clear trend that the majority of customers who left had a month-to-month contract.

5. Data Analysis

Creating a recipe

recipe_retention <- recipe(Status ~ ., data = retention_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

This creating a recipe using YeoJohnson Normalization and Standardization. Additionally, using step dummy to create dummy values for the variables.

Logistic Model

logistic_model <- logistic_reg() %>%
  set_engine("glm")

workflow_logreg <- workflow() %>%
  add_recipe(recipe_retention) %>%
  add_model(logistic_model)

logreg_model <- workflow_logreg %>%
  fit(data = retention_prepped)

kfold <- vfold_cv(data = retention_train, v = 5)

results <- workflow_logreg %>%
  fit_resamples(resamples = kfold)

results %>%
  collect_metrics()
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.806     5 0.00751 Preprocessor1_Model1
## 2 brier_class binary     0.134     5 0.00347 Preprocessor1_Model1
## 3 roc_auc     binary     0.848     5 0.00498 Preprocessor1_Model1

Summary/Explanation: Using 5 fold cross validation on the training data of the retention dataset, the logistic model has a roc_auc mean of 0.846 and an accuracy mean of 0.803. This means that this model has an acceptable accuracy and AUC variable. However, there may be a model with a better mean in both accuracy and the AUC variable.

Decision Tree Model

Initial MOdel

decision_tree_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

workflow_tree_model <- workflow() %>%
  add_recipe(recipe_retention) %>%
  add_model(decision_tree_model)

decision_tree_results <- workflow_tree_model %>%
  fit_resamples(resamples = kfold)

print(collect_metrics(decision_tree_results))
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.791     5 0.00550 Preprocessor1_Model1
## 2 brier_class binary     0.146     5 0.00392 Preprocessor1_Model1
## 3 roc_auc     binary     0.802     5 0.00462 Preprocessor1_Model1

Tuning

decision_tree_model_tuned <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")
tune_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5)

workflow_tree_model_tuned <- workflow() %>%
  add_recipe(recipe_retention) %>%
  add_model(decision_tree_model_tuned)

tune_results <- tune_grid(
  workflow_tree_model_tuned,
  resamples = kfold,
  grid = tune_grid,
  metrics = metric_set(roc_auc)
)

Selecting best parameters and finalize

best_params <- select_best(tune_results, metric = "roc_auc")

final_workflow <- finalize_workflow(workflow_tree_model_tuned, best_params)

final_results <- fit_resamples(
  final_workflow,
  resamples = kfold)

final_results %>%
  collect_metrics()
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.784     5 0.00964 Preprocessor1_Model1
## 2 brier_class binary     0.149     5 0.00543 Preprocessor1_Model1
## 3 roc_auc     binary     0.815     5 0.00674 Preprocessor1_Model1

Summary/Explanation: The model does not perform well initially, however after applying tuning to the hyperparameters cost complexity and tree depth at the 5th level we return the result of a 0.810 roc_auc. This is still not a great roc_auc or with great accuracy so this will not be our optimal model.

MARS Model

mars_model <- mars() %>%
  set_engine("earth") %>%
  set_mode("classification")

workflow_mars_model <- workflow() %>%
  add_recipe(recipe_retention) %>%
  add_model(mars_model)

mars_model_results <- workflow_mars_model %>%
  fit_resamples(resamples = kfold)

print(collect_metrics(mars_model_results))
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.807     5 0.00664 Preprocessor1_Model1
## 2 brier_class binary     0.134     5 0.00330 Preprocessor1_Model1
## 3 roc_auc     binary     0.848     5 0.00500 Preprocessor1_Model1

Summary/Explanation: The MARS model we made displayed almost identical accuracy and AUC variable means as the logistic model. The reason we chose the MARS model as the optimal model is because it is the best fit to our data set. The roc_auc is 0.844. Using the MARS model for our prediction model, we have the top five most influential variables which are Contract_Month.to.month, Tenure, InternetService_Fiber.optic, OnlineSecurity_No, and TotalCharges.

Optimal Model

final_fit <- workflow_mars_model %>%
  fit(data = retention_train)

mars_fit <- extract_fit_parsnip(final_fit)$fit

vip(mars_fit)

Summary/Explanation: Using the MARS model which predict the likelihood of a customer leaving the Regork Telecommunication plan, we got the most influential items that the business should focus on to increase the customer retention rate.

CONFUSION MATRIX

test_predictions <- predict(final_fit, retention_test) %>%
  bind_cols(retention_test)

conf_matrix <- conf_mat(test_predictions, truth = Status, estimate = .pred_class)

print(conf_matrix)
##           Truth
## Prediction Current Left
##    Current    1392  270
##    Left        148  287

Summary/Explanation: Using the MARS model which predict the likelihood of a customer is currently using Regork Telecommunication plan, we got the most influential items that the business should focus on to increase the customer retention rate.

6. Summary

Business Plan: Using a month-to-month contract is one of the biggest factors in clients taking their business elsewhere. The company could focus on finding a way to lower the monthly payments to help retain customers. Another idea the company could pursue is finding a way to promote the contract type that appeals to the majority of the customers.

The month-to-month contract seems to predict that the contract isn’t structured as well as it should be. If the customer has a month-to-month contract our model is predicting that the customer will leave. This means the company should redirect its focus on promoting the year and two year contract, as well as restructuring the month-to-month contract.

We can also see customers who signed up for the Fiber Optic Cable are predicted to not be satisfied with their purchase. This means the company should also explore ways they can revamp their fiber optic cable(either performance or cost), as well as look into other available options.

People who did not purchase online security in their contract are also predicted to be dissatisfied with their purchase. The company should push customers to purchase the online security in hopes of increasing the average tenure of their clients. This also means that the current online security program the company has is providing quality service.