Regork Final Presentation

By: Paxton Huber

## Introduction

To support Regork in enhancing their telecommunication service, I developed a customer retention model. After analyzing the provided dataset, I visualized trends and then constructed, trained, and tested three distinct machine learning models: a logistic regression model, a MARS model, and a random forest model. The performance of each model was evaluated using a standard metric, the area under the ROC curve (AUC).

## Packages Required

The following packages were used to produce the results for Regork:

- tidyverse: designed for data science
- tidymodels: a collection of packages for modeling and machine learning using tidyverse principles
- baguette: bags tree- and rule-based models and stores their prediction equations in an efficient format to reduce model object size and speed up prediction
- vip: a general framework for constructing variable importance plots from various types of machine learning models
- pdp: constructs partial dependence plots (PDPs) and individual conditional expectation (ICE) curves
- ggplot2: creates visualizations (graphical representations) of data
- kernlab: kernel-based machine learning methods

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ recipes      1.0.8
## ✔ dials        1.2.0     ✔ rsample      1.2.0
## ✔ dplyr        1.1.3     ✔ tibble       3.2.1
## ✔ ggplot2      3.4.3     ✔ tidyr        1.3.0
## ✔ infer        1.0.5     ✔ tune         1.1.2
## ✔ modeldata    1.2.0     ✔ workflows    1.1.3
## ✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
## ✔ purrr        1.0.2     ✔ yardstick    1.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ lubridate 1.9.2     ✔ stringr   1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(baguette)
library(vip)
## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi
library(pdp)
## 
## Attaching package: 'pdp'
## 
## The following object is masked from 'package:purrr':
## 
##     partial
library(ggplot2)
library(kernlab)
## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## 
## The following object is masked from 'package:scales':
## 
##     alpha

The customer retention data provided by Regork needs to be read into the global environment before the analysis can begin.

df <- read.csv("~/Documents/customer_retention.csv")  # read in the customer retention data
df <- mutate(df, Status = factor(Status))              # convert the outcome to a factor
df <- na.omit(df)                                       # drop rows with missing values
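
Before exploring trends, it is worth checking how the Status outcome is balanced, since the later splits are stratified on it. A minimal sketch using the df object created above:

# Check the class balance of the Status outcome
df %>%
  count(Status) %>%
  mutate(prop = n / sum(n))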

## Exploratory Analysis

To begin the analysis, I compared several trends within the dataset, using graphs to visualize the differences.

ggplot(df, aes(MultipleLines)) + 
  geom_bar(fill = "pink") + 
  facet_wrap(~Contract) +
  ggtitle("Customer With Mutliple Lines vs. Length of Contract") +
  theme(plot.title = element_text(hjust = 0.25)) +
  labs(y = "Count of Customers", x = "Multiple Lines")

This graph shows the length of each contract versus the number of customers with multiple lines. The majority of customers on a month-to-month contract do not have multiple lines. Customers on a one-year contract mostly do not have multiple lines, although some do, while customers on a two-year contract mostly do have multiple lines, though some still do not. Very few customers have no phone service at all.
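
To quantify this pattern, the same relationship can be summarized as proportions rather than raw counts; a minimal sketch using the MultipleLines and Contract columns shown in the plot:

# Share of customers with and without multiple lines within each contract type
df %>%
  count(Contract, MultipleLines) %>%
  group_by(Contract) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()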

ggplot(df, aes(Tenure)) +
  geom_bar(fill = "deeppink1") +
  facet_wrap(~PaymentMethod) +
  ggtitle("Tenure vs. Type of Payment Method ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Count of Customers", x = "Length of Tenure (months)")

Most customers who have been with Regork for a while tend to use automatic payment methods such as bank transfers or credit cards. Customers who have not been with Regork for long typically use electronic or mailed checks.
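
The same comparison can be made numerically by summarizing tenure within each payment method; a minimal sketch using the columns plotted above:

# Median tenure (in months) and customer count for each payment method
df %>%
  group_by(PaymentMethod) %>%
  summarize(median_tenure = median(Tenure), customers = n())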

ggplot(df, aes(InternetService)) + 
  geom_bar(fill = "magenta1") + 
  facet_wrap(~PhoneService) +
  ggtitle("Customer With Phone Service Accompanied by Internet Service") +
  theme(plot.title = element_text(hjust = 0.25)) +
  labs(y = "Count of Customers", x = "Type of Internet Service")

Customers with no phone service tend to use only DSL for their internet service. The majority of customers who have phone service typically have internet service as well, mostly fiber optic or DSL.
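
A simple cross-tabulation confirms this split; a minimal sketch using the same two columns:

# Count of customers by phone service and type of internet service
df %>%
  count(PhoneService, InternetService) %>%
  pivot_wider(names_from = InternetService, values_from = n)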

## Machine Learning

**Logistic Regression**

I chose logistic regression as the first machine learning model for my analysis.

set.seed(123)
logistic_split <- initial_split(df, prop = .7, strata = Status)
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)

set.seed(123)
logistic_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)

logistic_reg() %>% 
  fit_resamples(Status ~ ., logistic_kfolds) %>%
  collect_metrics()
## → A | warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## There were issues with some computations   A: x5
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.799     5 0.00401 Preprocessor1_Model1
## 2 roc_auc  binary     0.845     5 0.00521 Preprocessor1_Model1

Although this model achieves an acceptable AUC of about 0.845, there may be room for improvement with more flexible models.
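
For a sense of how this baseline generalizes, the same logistic regression can be fit on the training split and scored on the held-out test split. A minimal sketch (the logistic_fit object is introduced only for this check; the cross-validated estimates above remain the primary evaluation):

# Fit the logistic regression on the training data and compute test-set AUC
logistic_fit <- logistic_reg() %>%
  fit(Status ~ ., data = logistic_train)

logistic_fit %>%
  predict(logistic_test, type = "prob") %>%
  bind_cols(logistic_test %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)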

**MARS Model**

set.seed(123)
mars_split <- initial_split(df, prop = .7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)

mars_recipe <- recipe(Status ~ ., data = mars_train)

set.seed(123)
mars_kfolds <- vfold_cv(mars_train, v = 5, strata = "Status")

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")

mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 50)

mars_wf <- workflow() %>% 
  add_recipe(mars_recipe) %>% 
  add_model(mars_mod)

mars_results <- mars_wf %>% 
  tune_grid(resamples = mars_kfolds, grid = mars_grid)

mars_results %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 60 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        20           1 roc_auc binary     0.850     5 0.00486 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  3        21           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  4        22           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  5        23           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  6        24           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  7        25           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  8        26           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  9        27           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
## 10        28           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
## # ℹ 50 more rows
autoplot(mars_results)

mars_best <- select_best(mars_results, metric = "roc_auc")

mars_final_wf <- workflow() %>% 
  add_model(mars_mod) %>% add_formula(Status ~ .) %>% 
  finalize_workflow(mars_best)

mars_final_wf %>% 
  fit(data = mars_train) %>%
  extract_fit_parsnip() %>%
  vip(10, type = "rss")

As the variable importance plot shows, Tenure is the most important predictor of customer status, with total charges and monthly charges close behind.
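
To check how the tuned MARS model generalizes, the finalized workflow can be refit on the training data and evaluated on the test split; a minimal sketch using last_fit():

# Fit the finalized MARS workflow on the training data and score the test split
mars_final_wf %>%
  last_fit(mars_split) %>%
  collect_metrics()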

**Random Forest Model**

set.seed(123)
rf_split <- initial_split(df, prop = 0.7, strata = Status)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)

rf_recipe <- recipe(Status ~ ., data = rf_train)

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

set.seed(123)
rf_kfold <- vfold_cv(rf_train, v = 5)

results <- fit_resamples(rf_mod, rf_recipe, rf_kfold)

collect_metrics(results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.797     5 0.00643 Preprocessor1_Model1
## 2 roc_auc  binary     0.837     5 0.0104  Preprocessor1_Model1
rf_recipe <- recipe(Status ~ ., data = rf_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
rf_mod <- rand_forest(mode = "classification", trees = tune(), mtry = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity")

rf_hyper_grid <- grid_regular(trees(range = c(10, 100)), mtry(range = c(1, 25)), min_n(range = c(1, 15)), levels = 5)

set.seed(123)
results <- tune_grid(rf_mod, rf_recipe, resamples = rf_kfold, grid = rf_hyper_grid)

show_best(results)
## Warning: No value of `metric` was given; metric 'roc_auc' will be used.
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     7   100    15 roc_auc binary     0.840     5  0.0106 Preprocessor1_Model1…
## 2     7   100    11 roc_auc binary     0.839     5  0.0105 Preprocessor1_Model0…
## 3     7    77    15 roc_auc binary     0.839     5  0.0105 Preprocessor1_Model1…
## 4     7    55    15 roc_auc binary     0.838     5  0.0108 Preprocessor1_Model1…
## 5     7    77    11 roc_auc binary     0.837     5  0.0104 Preprocessor1_Model0…
rf_best <- select_best(results, metric = "roc_auc")

final_rf_wf <- workflow() %>% add_recipe(rf_recipe) %>% add_model(rf_mod) %>%
  finalize_workflow(rf_best)

final_rf_fit <- final_rf_wf %>% fit(data = rf_train)

final_rf_fit %>% extract_fit_parsnip() %>% vip(num_features = 10)

Like the MARS model, the most important predictors are tenure, total charges, and monthly charges.
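
The finalized random forest can be evaluated on its held-out test split in the same way; a minimal sketch using last_fit():

# Fit the finalized random forest workflow and score the test split
final_rf_wf %>%
  last_fit(rf_split) %>%
  collect_metrics()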

**Confusion Matrix**

confus_matrix <- logistic_reg() %>%
  fit(Status ~ ., data = logistic_train)

confus_matrix %>% predict(logistic_test) %>% 
  bind_cols(logistic_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332
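
The confusion matrix above can also be summarized into standard classification metrics with yardstick's summary(); a minimal sketch that repeats the prediction step:

# Derive accuracy, sensitivity, specificity, and related metrics from the confusion matrix
confus_matrix %>%
  predict(logistic_test) %>%
  bind_cols(logistic_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class) %>%
  summary()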

## Conclusion

The results of my analysis provide insight into why customers may find it easy to leave Regork. In each of the scenarios I observed, customers had the freedom and flexibility to leave without any penalty or friction, which makes it easy for competitors to convince them to try their services.

In my first analysis, I noticed that most individuals with multiple lines tended to pay month to month. This gives them an easy exit: they have little reason to stay with Regork, especially when competitors present them with incentives or special deals.

In my second analysis, I observed that the longer customers stayed, the more likely they were to set up autopay or some other form of electronic payment. Customers who are presented with their bill each month are more aware of the cost, so if other companies offer incentives or deals to switch, they may be more likely to leave.

My third analysis looked at the packages/bundles customers had. Most customers had internet service bundled with their phone service.

In conclusion, I think bundling is a good incentive to offer customers, and contracts may also help lock them into a long-term commitment.