The purpose of this report is to look deeper into Regorks customer data in order to predict whether customers will leave in the future. Regork wants to appeal to these customers before they leave. In order to do this I selected three models to use on this data.

##1. Data Preparation & Exploratory Data Analysis

#Packages Required

For this project the following packages are required:

library(tidyverse)
library(tidymodels)
library(baguette)    
library(vip)       
library(pdp)
library(readr)
library(ggplot2)

#Importing & Cleaning Data

The customer retention data set shows many variables that relate to Regorks telecommunication data. The response variable in this data set is “Status” which explains whether the customer is current or has left. There are many predictor variables including customer details, contract type, types of services, years with Regork, the charges and many other details.

To prepare the data I mutated it in order to make the “Status” variable into a factor. I also used the na.omit function to remove the rows that contained NA. The reason that I did this was to remove the rows that had NA in the TotalCharges variable. I did this in order to be able to run certain algorithms.

df <- read_csv("data/customer_retention.csv")

glimpse(df)
## Rows: 6,999
## Columns: 20
## $ Gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ Tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status           <chr> "Current", "Current", "Left", "Current", "Left", "Lef…

df <- mutate(df, Status = factor(Status)) %>%
  na.omit()

I created three plots in order to

ggplot(df, aes(Tenure)) + 
geom_histogram(color = "Black", fill = "Light Green") +
facet_wrap(~Status)

This plot explains that customers who have been with Regork for under 5 years tend to be more likely to leave. This is a good variable to look out for to potentially target.

ggplot(df, aes(MonthlyCharges)) + 
geom_histogram(color = "Black", fill = "Pink") +
facet_wrap(~Status)

This plot explains that customers are more likely to stay with Regork when they have monthly charges of less than $25. This makes sense that people with cheap monthly bills will likely stay with Regork. It would take more effort to move services to another company than just staying with Regork.

ggplot(df, aes(Contract)) + 
geom_bar(color = "Black", fill = "Light Blue") +
facet_wrap(~Status)

This plot explains that customers who have a month-to-month contract will be more likely to leave. This makes sense because they will be more likely to examine their monthly bills and aren’t tied down to the company. This could definitely be a good place for Regork to look more deeply into.

set.seed(123)
churn_split <- initial_split(df,prop = 0.7, strata = "Status")
churn_train <- training(churn_split)
churn_test <- testing(churn_split)
set.seed(123)
kfold <- vfold_cv(churn_train, v = 5)

###Logistic Reg
results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)

collect_metrics(results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.797     5 0.00705 Preprocessor1_Model1
## 2 roc_auc  binary     0.844     5 0.00974 Preprocessor1_Model1

The first model that I tested was a basic logistic regression model. When testing this model on the customer training set it resulted in a roc auc of 0.8442.

model_recipe <- recipe(Status ~ ., data = churn_train)

dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")
# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)
# get best results
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estima…¹  mean     n std_err .config
##             <dbl>      <int> <int> <chr>   <chr>     <dbl> <int>   <dbl> <chr>  
## 1    0.0000000001         15    40 roc_auc binary    0.814     5 0.00780 Prepro…
## 2    0.0000000178         15    40 roc_auc binary    0.814     5 0.00780 Prepro…
## 3    0.00000316           15    40 roc_auc binary    0.814     5 0.00780 Prepro…
## 4    0.0000000001         11    40 roc_auc binary    0.814     5 0.00827 Prepro…
## 5    0.0000000178         11    40 roc_auc binary    0.814     5 0.00827 Prepro…
## # … with abbreviated variable name ¹​.estimator

###Bagging

bag_mod <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")

bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))

set.seed(123)
bag_results <- tune_grid(bag_mod, model_recipe, resamples = kfold, grid = bag_hyper_grid)

# model results
show_best(bag_results, metric = "roc_auc")
## # A tibble: 5 × 7
##   times .metric .estimator  mean     n std_err .config             
##   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1   200 roc_auc binary     0.821     5 0.0108  Preprocessor1_Model5
## 2   300 roc_auc binary     0.820     5 0.00981 Preprocessor1_Model6
## 3   100 roc_auc binary     0.819     5 0.00991 Preprocessor1_Model4
## 4    50 roc_auc binary     0.816     5 0.00940 Preprocessor1_Model3
## 5    25 roc_auc binary     0.809     5 0.00971 Preprocessor1_Model2

BAG AUC ROC Times Mean 300 roc_auc binary 0.8243749


set.seed(123)
kfold2 <- vfold_cv(churn_test, v = 5)

###Logistic Reg

results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold2)

collect_metrics(results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.803     5 0.00335 Preprocessor1_Model1
## 2 roc_auc  binary     0.841     5 0.00462 Preprocessor1_Model1

final_fit <- logistic_reg() %>%
fit(Status ~ ., data = churn_test)
tidy(final_fit)
## # A tibble: 31 × 5
##    term                          estimate std.error statistic     p.value
##    <chr>                            <dbl>     <dbl>     <dbl>       <dbl>
##  1 (Intercept)                    2.17       1.55      1.40    0.160     
##  2 GenderMale                    -0.00317    0.120    -0.0264  0.979     
##  3 SeniorCitizen                  0.101      0.157     0.643   0.520     
##  4 PartnerYes                     0.328      0.149     2.20    0.0275    
##  5 DependentsYes                 -0.382      0.166    -2.30    0.0215    
##  6 Tenure                        -0.0511     0.0110   -4.64    0.00000351
##  7 PhoneServiceYes                1.07       1.22      0.877   0.381     
##  8 MultipleLinesNo phone service NA         NA        NA      NA         
##  9 MultipleLinesYes               0.551      0.336     1.64    0.101     
## 10 InternetServiceFiber optic     2.93       1.51      1.94    0.0521    
## # … with 21 more rows

final_fit %>%
predict(churn_test) %>%
bind_cols(churn_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1386  243
##    Left        154  314

The confusion matrix resulted in 154 false positives and 243 false negatives. The false negatives show that there were more customers that we did not predict to leave that did/will.

vip(final_fit$fit, num_features = 20)

This chart shows the most important predictor variables in this model.