Final Project

Results

Introduction

Introduction to our business problem and why Regork Telecom should be interested:

With Regork entering into the telecommunications market offering new services such as internet service, phone service, online streaming, and more. We wanted to increase efforts to retain existing customers. Regork should be interested in this problem because the costs required to attract new customers is much higher than the costs to retain current customers.
How we addressed the problem and the analytic methodologies employed:

To address this problem we wanted to highlight the various factors that lead to customer retention within the data of our “customer_rentention.csv” file. After further investigation and putting together various relationships within the data, we then decided to build, train, and test three types of machine learning models. Including a regularized regression, logistic regression, and random forest models. To help us determine performance based off of AUC (area under the curve).
How our analysis will help and our proposed solution:

Our analysis will find some of the most important relationships that will help lead to the efficiency within the Regork’s new telecommunication market and how they can continue effectively retaining new and existing customers.

Data Preparation

The following packages are required to run the code throughout.

library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(here)
library(kernlab)
library(ggplot2)

The following data will be stored within the environment for code throughout.

df <- read.csv("customer_retention.csv")
df <- mutate(df, Status = factor(Status))
df <- na.omit(df)

Exploratory Analysis

ggplot(df, aes(InternetService)) +
  geom_bar(fill = "orange", color = "red") +
  facet_wrap(~Contract) +
  coord_flip() +
  ggtitle("Customer Internet Service and Length of Contract") +
  labs(y = "Count of Contract Type", x = "Customer Internet Service")

This analysis shows the relationship between customer’s internet services and the length of their contract. From this analysis we can tell that the Fiber Optic service generates the most month-to-month contracts, DSL generates the most one year contracts, and no internet service generates the most contracts within the two year time frame.

ggplot(df, aes(PaymentMethod)) +
  geom_bar(fill = "orange", color = "red") +
  facet_wrap(~Contract) +
  ggtitle("Payment Method vs. Type of Contract") +
  theme(axis.text.x = element_text(angle = 50, size = 7, vjust = 0.5)) +
  labs(y = "Count of Customers", x = "Payment Method")

This plot highlights the relationship between a customer’s payment method and the type of contract they have. If you look closer you can see that the electronic check is used the most within month-to-month contracts, credit card(automatic) is the highest in the one year contract, and credit card(automatic) is the highest in two year contracts.

ggplot(df, aes(Partner)) +
  geom_bar(fill = "orange", color = "red") +
  facet_wrap(~Contract) +
  ggtitle("Partner vs. Length of Contract") +
  labs(y = "Count of Contract Type", x = "Partner")

From this analysis you can tell that in month-to-month contracts, their are more people without a partner. In one year and two year contracts their are more customers with a partner.

Machine Learning

Logistic Regression Model

set.seed(123)
logistic_split <- initial_split(df, prop = .7, strata = Status)
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)

set.seed(123)
logistic_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)

logistic_reg() %>% 
  fit_resamples(Status ~ ., logistic_kfolds) %>%
  collect_metrics()

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.799     5 0.00401 Preprocessor1_Model1
## 2 roc_auc  binary     0.845     5 0.00521 Preprocessor1_Model1

From this we can tell that the logistic regression model is pretty good in terms of accuracy and mean AUC, but we will train two more models in comparison.

Multivariate Adaptive Regression Splines (MARS)

set.seed(123)
mars_split <- initial_split(df, prop = .7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)

mars_recipe <- recipe(Status ~ ., data = mars_train)

set.seed(123)
mars_kfolds <- vfold_cv(mars_train, v = 5, strata = "Status")

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")

mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 50)

mars_wf <- workflow() %>% add_recipe(mars_recipe) %>% add_model(mars_mod)

mars_results <- mars_wf %>% tune_grid(resamples = mars_kfolds, grid = mars_grid)

mars_results %>% collect_metrics() %>% filter(.metric == "roc_auc") %>%
arrange(desc(mean))

## # A tibble: 60 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        20           1 roc_auc binary     0.850     5 0.00486 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  3        18           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  4        21           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  5        22           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  6        23           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  7        24           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  8        25           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  9        26           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
## 10        27           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
## # … with 50 more rows

From the MARS model we can tell that based off of the mean AUC that this is the most accurate model.

Decision Tree Model

retention_recipe <- recipe(Status ~ ., data = logistic_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

dt_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

dt_fit <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = logistic_train)

dt_results <- fit_resamples(dt_mod, retention_recipe, logistic_kfolds)

collect_metrics(dt_results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.788     5 0.00399 Preprocessor1_Model1
## 2 roc_auc  binary     0.710     5 0.00529 Preprocessor1_Model1

From the decision tree model we can tell that based off of the accuracy and AUC mean that this model is the least accurate model for making our decisions.

Business Analysis

Our Most Important Predictor Variables From our analysis Partner, Contract, PaymentMethod, and InternetService are our top four features for our models. Through our analysis we found within the InternetService variable that fiber optic is the is the highest amongst customer’s in month-to-month contracts which then drops off quite a bit from one year and two year contracts. So, from this information Regork needs to determine a way to initially attract customers interested in the fiber optic internet service to get on longer contracts, which would lead to less customer retention across at least two years, guaranteeing customer longevity. We also found that within customer’s payment methods that similarly to customer’s internet service it was much higher in a month-to-month contracts and then drops off in one and two year contracts. Regork needs to emphasize the use of electronic checks across all contracts to attract customers and lead to customer longevity within the new telecommunication services. Lastly, we found that customers having a partner are more likely to be on a month-to-month contract as well, so Regork could acquire some sort of long-term partner deal for customer’s with an interest in the new services.

Conclusion From all of our analysis and models we found that with a mean AUC of around 0.85, our models did a pretty solid job in helping Regork make new and efficient business decision while entering the telecommunication market. We believe that with our highlighted analysis, Regork will be able to decrease customer retention and draw an interest for customer’s to sign up for longer contracts for the new services. While our analysis was pretty good, we believe that it could potentially improved upon, but will give Regork a good foundation for more accurate data analysis of their customers.