Final_Project

Synopsis

Regork is well established in the consumers grocery market. After recently entering the telecommunications market to offer internet service, phone service, and online streaming, we believe there will always be areas for continuous development. One of the largest sectors to capitalize on will be consumers who are most likely to leave the company. This loss in revenue could hurt Regork in the long run if swift action is not taken timely and efficiently.

Our team has work with efficacy to ensure Regork has continuable success in the telecommunications market. Firstly, we observed historical data from past Regork customers in telecommunications. With this data, we found several trends worth noting that may have an impact on retaining customer status. Secondly, we trained several machine learning models to view the interactions between several predictors in the data to more accurately predict whether or not a customer will leave. Lastly, we focus in on our best performing model to give specific and actionable suggestions.

Based on our analysis, we believe a new incentives program focused on newer customers (with a tenure lower than 30) will increase Regork revenue in the long-term. The incentives program we will suggest is combining the contract resigning process with a complimentary device or upgrade to current customers that apply for such benefits.

Exploratory Data Analysis

Packages

library(tidymodels)
library(tidyverse) # for data wrangling & plotting
library(baguette)
library(vip) # for variable importance
library(pdp) # for variable relationships
library(rpart.plot)
library(ranger)
library(dplyr)
library(ggplot2)

Data Preparation

This customer retention data is historical data from the telecommunications customers with Regork.

retention <- read_csv(here::here('data', 'customer_retention.csv'))

## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Recode response variable as a factor
retention <- mutate(retention, Status = factor(Status))

retention %>% select(Status, everything())

## # A tibble: 6,999 × 20
##    Status  Gender SeniorCitizen Partner Dependents Tenure PhoneService
##    <fct>   <chr>          <dbl> <chr>   <chr>       <dbl> <chr>       
##  1 Current Female             0 Yes     No              1 No          
##  2 Current Male               0 No      No             34 Yes         
##  3 Left    Male               0 No      No              2 Yes         
##  4 Current Male               0 No      No             45 No          
##  5 Left    Female             0 No      No              2 Yes         
##  6 Left    Female             0 No      No              8 Yes         
##  7 Current Male               0 No      Yes            22 Yes         
##  8 Current Female             0 No      No             10 No          
##  9 Left    Female             0 Yes     No             28 Yes         
## 10 Current Male               0 No      Yes            62 Yes         
## # ℹ 6,989 more rows
## # ℹ 13 more variables: MultipleLines <chr>, InternetService <chr>,
## #   OnlineSecurity <chr>, OnlineBackup <chr>, DeviceProtection <chr>,
## #   TechSupport <chr>, StreamingTV <chr>, StreamingMovies <chr>,
## #   Contract <chr>, PaperlessBilling <chr>, PaymentMethod <chr>,
## #   MonthlyCharges <dbl>, TotalCharges <dbl>

This data includes 19 features that we will use to predict the status of individual customers.

Customer Status Sensitivity Analysis

Customer’s Internet Service Provider and Length of Contract

p1 <- retention %>%
  filter(Status == "Left") %>%
  ggplot(aes(InternetService)) + 
    geom_bar(fill = "yellow", color = "red") +
    facet_wrap(~Contract) +
    coord_flip() +
    ggtitle("Left - Customer's Internet Service and Length of Contract") +
    labs(y = "Count of Contract Type", x = "Customer Internet Service")

p2<- retention %>%
  filter(Status == "Current") %>%
  ggplot(aes(InternetService)) + 
    geom_bar(fill = "yellow", color = "red") +
    facet_wrap(~Contract) +
    coord_flip() +
    ggtitle("Current - Customer's Internet Service and Length of Contract") +
    labs(y = "Count of Contract Type", x = "Customer Internet Service")

gridExtra::grid.arrange(p1, p2)

This graphs illustrate the relationship between customer’s internet services and the length of their contract. Among customer’s who have left, month-to-month contracts are a common factor in this sector. Next, those customer’s with Fiber Optic internet have a higher proportional share for customers who leave.

Paper vs Paperless Billing among Seniors

p1 <- retention %>%
  filter(SeniorCitizen == 1,
         PaperlessBilling == "Yes") %>%
  ggplot(aes(Status, SeniorCitizen)) +
    geom_col(mapping = aes(color = Status)) +
  labs(title = "Paperless Billing")

p2 <- retention %>%
  filter(SeniorCitizen == 1,
         PaperlessBilling == "No") %>%
  ggplot(aes(Status, SeniorCitizen)) +
  geom_col(mapping = aes(color = Status))  +
  labs(title = "Paper Billing")

gridExtra::grid.arrange(p1, p2)

While Paperless Billing is more popular among senior citizens, it is also more popular for those with paperless billing to leave. This may be caused by several factors of not being able to keep up with technology or not knowing how to switch to paper billing.

Payment Method Status

retention %>%
  ggplot(aes(PaymentMethod, fill = Status)) +
  geom_bar() +
  labs(title = "Payment Method and Status")

While the population of customers that left by payment status is relatively low for automatic bank transfers, automatic credit card payments, and mailed checks, the same cannot be said for electronic checks. Electronic checks is the most common form of payment, but closer to 40% of those who pay by electronic check leave. that number is compared with the average of 15% for the other three payment methods.

Retention by Online Security and Internet Service

ggplot(retention, aes(OnlineSecurity)) +
  geom_bar(fill = "red") +
  facet_wrap(~Status)

retention %>%
  filter(InternetService != "No") %>%
  ggplot(aes(OnlineSecurity)) +
  geom_bar(fill = "red") +
  facet_grid(~Status)

retention %>%
  filter(Status == "Left") %>%
  ggplot(aes(InternetService)) +
  geom_bar(fill = "red") +
  facet_grid(~Status)

retention %>%
  filter(InternetService == "Fiber optic") %>%
  filter(Status == "Left") %>%
  ggplot(aes(OnlineSecurity)) +
  geom_bar(fill = "red") +
  facet_grid(~Status)

The first of four graphs shows that customers without Internet Security are most likely to leave Regork. Following the trail of thought and the earlier one that fiber optic is the majority shareholder of customers who left, the combination of the two is remarkable. When compounded, customers without Online Security and with Fiber Optic as their internet provider, customers are more likely to leave.

Machine Learning

Distribution

retention %>%
  count(Status) %>%
  mutate(proportion = (n / sum(n))*100) %>%
  filter(Status == "Current" | Status == "Left")

## # A tibble: 2 × 3
##   Status      n proportion
##   <fct>   <int>      <dbl>
## 1 Current  5143       73.5
## 2 Left     1856       26.5

From historical data, about 73.5% of customers are current with us and close to 26.5% have left.

To train our model, we split the data into two sets to make sure our model will generalize well.

set.seed(123)
split <- initial_split(retention, prop = 0.7, strata = Status)
retention_train <- training(split)
retention_test <- testing(split)

For each model, we will be maximizing our “roc-auc” value.

Logistic Regression

# create resampling procedure
set.seed(123)
kfold <- vfold_cv(retention_train, v = 5, strata = Status)

# titanic_train model via cross validation 
results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)

## → A | warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

## There were issues with some computations   A: x1There were issues with some computations   A: x1
## → B | error:   Cannot find current progress bar for `<environment: 0x000002c681368100>`

# collect the average accuracy rate
collect_metrics(results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.801     1      NA Preprocessor1_Model1
## 2 roc_auc  binary     0.848     1      NA Preprocessor1_Model1

The logistic regression model gives us an auc of 0.843.

Decision Tree

dt_mod <- decision_tree(mode = 'classification') %>%
set_engine("rpart")

model_recipe <- recipe(Status ~ ., data = retention_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

dt_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
fit(data = retention_train)

rpart.plot::rpart.plot(dt_fit$fit$fit$fit)

# create resampling procedure
set.seed(123)
kfold <- vfold_cv(retention_train, v = 5)

# train model
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)

# model results
collect_metrics(dt_results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.785     5 0.00841 Preprocessor1_Model1
## 2 roc_auc  binary     0.714     5 0.00880 Preprocessor1_Model1

The logistic regression model gives us an auc of 0.714. This was a decrease from the logistic regression model, so we will move on to our final model.

Random Forest

# split the data while omitting NA values
set.seed(123)
split <- initial_split(retention, prop = 0.7, strata = Status)
rf_train <- na.omit(training(split))
rf_test <- na.omit(testing(split))

# Base Random Forest
model_recipe <- recipe(Status ~ ., data = rf_train)

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

set.seed(123)
kfold <- vfold_cv(rf_train, v = 5)

results <- fit_resamples(rf_mod, model_recipe, kfold)

collect_metrics(results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.795     5 0.00151 Preprocessor1_Model1
## 2 roc_auc  binary     0.835     5 0.00341 Preprocessor1_Model1

The logistic regression model gives us an auc of 0.835. This is not bad when compared to 0.843 of the logistic model, but next we will tune the model to decrease our prediction error.

Tune Random Forest

rf_mod <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")

rf_hyper_grid <- grid_regular(
  trees(range = c(20, 300)), # 19 predictors * 10 = 190 with generous rounding
  mtry(range = c(2, 19)),    # There are 19 predictors
  min_n(range = c(1, 20)),
  levels = 5
)

set.seed(123)
rf_results <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = rf_hyper_grid)


show_best(rf_results, metric = "roc_auc")

## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2   300    20 roc_auc binary     0.842     5 0.00411 Preprocessor1_Model1…
## 2     2   230    20 roc_auc binary     0.842     5 0.00378 Preprocessor1_Model1…
## 3     2   230    10 roc_auc binary     0.841     5 0.00339 Preprocessor1_Model0…
## 4     2   160    15 roc_auc binary     0.841     5 0.00372 Preprocessor1_Model0…
## 5     2   230    15 roc_auc binary     0.841     5 0.00396 Preprocessor1_Model0…

After tuning, our new best auc is 0.841, which means that the logistic regression still performs a better estimation for our data. Let’s quickly view the best features for prediction when using the Random Forest and we will compare that will Logistic regression later.

rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")

final_rf_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best_hyperparameters)

rf_final_fit <- final_rf_wf %>%
  fit(data = rf_train)

rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 20)

Tenure is the most important predictor meaning it will have the greatest effect on whether or not a customer will stay with Regork.

rf_final_fit %>%
  predict(rf_test) %>%
  bind_cols(rf_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1421  279
##    Left        121  278

This model mispredicts a total of 19%. In this instance, it would be more crucial to figure out the reason of mispredicting the 280 customers that in fact left when we thought they would stay. Since we mispredicted, there would not be any indication by our model that they would leave and no incentive offer would be suggested.

Business Analysis & Conclusion

Best Algorithmic Model

Since the Logistic Regression gave us the highest estimation accuracy, we decided to move forward with this model. We would like to know the most important predictors for this model.

# retrain our model across the entire training data
final_fit <- logistic_reg() %>%
  fit(Status ~ ., data = retention_train)

tidy(final_fit)

## # A tibble: 31 × 5
##    term                          estimate std.error statistic   p.value
##    <chr>                            <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                     1.03     0.975       1.06   2.89e- 1
##  2 GenderMale                     -0.0174   0.0778     -0.224  8.23e- 1
##  3 SeniorCitizen                   0.187    0.101       1.85   6.47e- 2
##  4 PartnerYes                     -0.0630   0.0934     -0.674  5.00e- 1
##  5 DependentsYes                  -0.0397   0.106      -0.372  7.10e- 1
##  6 Tenure                         -0.0561   0.00752    -7.45   9.16e-14
##  7 PhoneServiceYes                 0.0864   0.774       0.112  9.11e- 1
##  8 MultipleLinesNo phone service  NA       NA          NA     NA       
##  9 MultipleLinesYes                0.464    0.211       2.20   2.80e- 2
## 10 InternetServiceFiber optic      1.76     0.949       1.85   6.42e- 2
## # ℹ 21 more rows

final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1389  247
##    Left        153  310

This model also mispredicts a total of 19%. Since we discussed earlier about wanting to decrease false predictions for staying, this model does that. We have decreased the number to 247 from 280. This number is still high, but this slight adjustment helps make sure that we are making the best predictions possible.

vip(final_fit$fit, num_features = 20)

Similar to the Random Forest model, the tenure predictor is once again very vital to customer status. Contract One and Two Year are also close in value.

final_fit %>%
  predict(retention_train, type = "prob") %>%
  mutate(truth = retention_train$Status) %>%
  roc_curve(truth, .pred_Current)

## # A tibble: 4,867 × 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1   -Inf        0               1   
##  2      0.155    0               1   
##  3      0.157    0.000770        1   
##  4      0.158    0.00154         1   
##  5      0.160    0.00231         1   
##  6      0.166    0.00308         1   
##  7      0.167    0.00385         1   
##  8      0.171    0.00385         1.00
##  9      0.173    0.00462         1.00
## 10      0.177    0.00539         1.00
## # ℹ 4,857 more rows

final_fit %>%
  predict(retention_train, type = "prob") %>%
  mutate(truth = retention_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

Partial Dependency Plot

Since Tenure was the common most important predictor, we will look at what effect changing this value will have on the final prediction of a customer’s status.

# prediction function
pdp_pred_fun <- function(object, newdata) {
  predict(object, newdata, type = "response")
}

# use the pdp package to extract partial dependence predictions
# and then plot
partial(
  object = final_fit,
  pred.var = "Tenure",
  pred.fun = pdp_pred_fun,
  grid.resolution = 10, 
  train = retention_train
) %>%
  ggplot(aes(x = Tenure, y = yhat)) +
    geom_smooth() +
  labs(x = "Tenure", y = "Predicted Outcome") +
  ggtitle("Partial Dependence Plot for Tenure") +
  ylim(0,1)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The longer they stay with Regork, the more likely the customer will keep returning. It is vital for us to focus on these customers with a tenure of less than 30. This will be a big share of your prospective customers.

As shown in the graph above, the probability of of leaving Regork is a steady decline over the years. The most significant time to capitalize on retention is early on. As tenure hits close to 20, the majority of the decline has taken effect and by 40 to 60, there will be minimal gain from these customers.

Existing Regork Potential Customers

Using the historical data, we have shown our prediction on the effect that certain predictors have on customer status. Using our model, we will transfer this prediction to our current customer data.

predictions <- predict(final_fit, retention_test, type = "prob") %>%
  select(.pred_Left)

threshold <- 0.5

retention_test <- retention_test %>%
  mutate(Status_Prediction = ifelse(predictions > threshold, "Leave", "Stay")) %>%
  select(Status_Prediction, everything())

# View the updated retention_test data with predicted classes
retention_test %>%
  filter(Status_Prediction == "Leave")

## # A tibble: 463 × 21
##    Status_Prediction[,".pred_Le…¹ Gender SeniorCitizen Partner Dependents Tenure
##    <chr>                          <chr>          <dbl> <chr>   <chr>       <dbl>
##  1 Leave                          Male               0 No      Yes            22
##  2 Leave                          Female             0 Yes     No             28
##  3 Leave                          Male               0 No      No              5
##  4 Leave                          Female             0 No      No              2
##  5 Leave                          Female             0 Yes     No             47
##  6 Leave                          Female             0 No      No             12
##  7 Leave                          Female             1 Yes     No             25
##  8 Leave                          Male               0 Yes     Yes             3
##  9 Leave                          Female             0 No      No             13
## 10 Leave                          Female             0 No      No              4
## # ℹ 453 more rows
## # ℹ abbreviated name: ¹Status_Prediction[,".pred_Left"]
## # ℹ 15 more variables: PhoneService <chr>, MultipleLines <chr>,
## #   InternetService <chr>, OnlineSecurity <chr>, OnlineBackup <chr>,
## #   DeviceProtection <chr>, TechSupport <chr>, StreamingTV <chr>,
## #   StreamingMovies <chr>, Contract <chr>, PaperlessBilling <chr>,
## #   PaymentMethod <chr>, MonthlyCharges <dbl>, TotalCharges <dbl>, …

Our model predicts that 463 of your current customer base will leave. This list is composed of those customers who should be the focus of your incentive programs.

Offering incentives to these customers to sign contracts for a minimum of one to two years will help in the long run. Incentives such as receiving a complimentary device when resigning their contract could help retain these customers for years past their current contract.