Intro

I will be researching correlation and trends within the Regork data using data visualization. Then, performing three different types of machine learning models to better help determine the most important factors in customer retention.

Packages, data, cleanup

library(tidyverse)
library(vip)
library(tidymodels)
library(kernlab)
library(dplyr)
library(ggplot2)
library(earth)
library(ranger)


maindata <- read.csv("C:/Users/grube/OneDrive/Desktop/data mining/customer_retention.csv")
maindata <- mutate(maindata, Status = as.factor(Status))
maindata <- na.omit(maindata)

Exploratory Data Analysis

avg_charges <- maindata %>%
  group_by(Tenure) %>%
  summarise(avgMonthlyCharges = mean(MonthlyCharges, na.rm = TRUE))

ggplot(data = avg_charges, aes(x = cut(Tenure, breaks = seq(0, max(Tenure, na.rm = TRUE) + 9, by = 9), include.lowest = TRUE), y = avgMonthlyCharges)) + geom_bar(stat = 'identity', fill = 'skyblue', color = 'black') + theme_minimal() + labs(
  title = 'Average Monthly Charges by Tenure (Bucketed)',
  x = 'Tenure (grouped by 9 months)',
  y = 'Average Monthly Charges ($)'
) +  theme(plot.title = element_text(hjust = 0.5))

ggplot(maindata, aes(Tenure)) + geom_bar(fill = 'skyblue') + facet_wrap(~PaymentMethod) + labs(y = 'Customer Count', x = 'Length of Tenure in Months', title = 'Tenure vs Payment Method') + theme(plot.title = element_text(hjust = 0.5))

ggplot(maindata, aes(Status)) + geom_bar(fill = 'skyblue') + facet_wrap(~Contract) + labs(
  title = 'Status vs Contract Type',
  x = 'Status',
  y = 'Contract Count'
) + theme(plot.title = element_text(hjust = 0.5))

ggplot(maindata, aes(Status)) + geom_bar(fill = 'skyblue') + facet_wrap(~PaymentMethod) + labs(
  title = 'Status vs Payment Method',
  x = 'Status',
  y = 'Payment Method'
) + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

Machine Learning

table(maindata$Status) %>% prop.table()

## 
##   Current      Left 
## 0.7344018 0.2655982

Logistic Regression

set.seed(123)
logistic_split <- initial_split(maindata, prop = .7, strata = Status)
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)

set.seed(123)
lr_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)

logistic_reg() %>%
  fit_resamples(Status ~ ., lr_kfolds) %>%
  collect_metrics %>%
  filter(.metric == 'roc_auc')

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.845     5 0.00521 Preprocessor1_Model1

The logistic regression model returns an acceptable AUC (see above) but there could be a better model out there to find.

Multivariate Adaptive Regression Splines (MARS)

set.seed(123)
mars_split <- initial_split(maindata, prop = .7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)

mars_recipe <- recipe(Status ~ ., data = mars_train)

set.seed(123)
mars_kfolds <- vfold_cv(mars_train, v = 5, strata = 'Status')
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode('classification')

mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 50)

mars_wflow <- workflow() %>% add_recipe(mars_recipe) %>% add_model(mars_mod)

mars_results <- mars_wflow %>% tune_grid(resamples = mars_kfolds, grid = mars_grid)

mars_results %>% collect_metrics() %>% filter(.metric == 'roc_auc') %>% arrange(desc(mean))

## # A tibble: 60 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        20           1 roc_auc binary     0.850     5 0.00486 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  3        18           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  4        21           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  5        22           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  6        23           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  7        24           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  8        25           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  9        26           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
## 10        27           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
## # ℹ 50 more rows

autoplot(mars_results)

Based off the graphs above, the MARS model outputs good accuracy and GREAT AUC values. We can further explore the model by finalizing the workflow by plotting the most important features (using vip).

mars_best_results <- select_best(mars_results, metric = 'roc_auc')

mars_wflow_final <- workflow() %>%
  add_model(mars_mod) %>% add_formula(Status ~ .) %>%
  finalize_workflow(mars_best_results)

mars_wflow_final %>%
  fit(data = mars_train) %>%
  extract_fit_parsnip() %>%
  vip(10, type = 'rss')

The most important factor in the model is Tenure, followed by Total and Monthly Charges. This makes sense because the longer a customer is with us, the less likely they would be to leave. Also, cost is obviously a big factor in deciding whether to stay or go.

Random Forest

set.seed(123)
rf_split <- initial_split(maindata, prop = 0.7, strata = Status)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)

rf_recipe <- recipe(Status ~ ., data = rf_train)

rf_mod <- rand_forest(mode = 'classification') %>%
  set_engine('ranger', importance = 'impurity')

set.seed(123)
rf_kfolds <- vfold_cv(rf_train, v = 5)

rf_results <- fit_resamples(rf_mod, rf_recipe, rf_kfolds)

collect_metrics(rf_results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.798     5 0.00509 Preprocessor1_Model1
## 2 brier_class binary     0.139     5 0.00355 Preprocessor1_Model1
## 3 roc_auc     binary     0.837     5 0.0103  Preprocessor1_Model1

This random forest model provides great performance at its default work, now lets look at the top factors of important within the model using vip.

rf_best <- select_best(rf_results, metric = "roc_auc")

final_rf_wf <- workflow() %>% add_recipe(rf_recipe) %>% add_model(rf_mod) %>%
  finalize_workflow(rf_best)

final_rf_fit <- final_rf_wf %>% fit(data = rf_train)

final_rf_fit %>% extract_fit_parsnip() %>% vip(num_features = 10)

Similar to MARS, Tenure and both charges types make up the top of the list. Contract and Online Security is important as well. As technology advances, people are becoming more aware of what companies may or may not be doing with their data. So, online security plays a big factor.

Confusion Matrix

confus_matrix <- logistic_reg() %>%
  fit(Status ~ ., data = logistic_train)

confus_matrix %>% predict(logistic_test) %>% bind_cols(logistic_test %>% select(Status)) %>% conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

I created a Confusion Matrix to better judge the accuracy of the data. It can be used to identify if there are any false positives or false negatives within the data set. I used the logistic regression model to create the confusion matrix.

Optimal Model?

The MARS model is the most optimal model for this data frame because of how we judge it, based on AUC, compared to the other models.

Business Analysis

What are the most important predictor variables?

After evaluating 3 different models, it was encouraging to find similar results throughout all the models. That shows me that the findings have truth to them. The most important predictor variables are Tenure, TotalCharges, MonthlyCharges, PaymentMethod, and OnlineSecurity. All these make sense when you break them all down. First off, Tenure is important on whether people stay or leave. The longer they stay with the company the less likely they would be to leave. Secondly, the cost is important for customers when determining whether to stay or not. Thirdly, the payment method that the company provides is important because people value flexibility when it comes to paying the bill. Lastly, online security is an important factor especially with the rise of technology. People want to know what the company is doing with their data in order to feel safe these days.

To conclude, Regork’s telecommunication service is on the right path of being a profitable branch of their company. However Regork must be aware and proactive in their efforts to continue research and development for the service. They should remain aware of what factors customers value the most when it comes to staying or leaving the service. I believe the models created here are good foundational models to start off on the right foot and understand what needs to be prioritized. There is always room for improvement whether it is the model or the overall company. This work should be able to help Regork make better decisions in the future because they have tangible research and models to base them off of.

Final Project

Henry Gruber

2024-12-09