INTRODUCTION:

When looking at this set of data we decided to analysis Regorks customer data in order to try and predict whether customers will leave in the future. As a company, Regork is trying to retain as many customers as possible. We have created three models to try and figure out how we can retain these customers and try and prevent them from leaving.

Packages Required

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(vip))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(baguette))
suppressPackageStartupMessages(library(pdp))

CustomerR <- read_csv(file = "Data/customer_retention.csv")

## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning Data

When looking at the data we noticed a ton of different variables that related to Regorks telecommunications data. We decided we wanted to go more in depth and look at the “Status” variable which is the response variable. This status variable shows if the employee is currently at the company or if he left. Some of the independent variables in this data set are as follows: contract type, customer details, years with Regork, types of services, etc.

Prepping Data

When looking at the data we decided to mutate the “Status” variable to make it a factor. We noticed a lot of Nas in the data so we used na.omit to remove those column NAs. We wanted to make sure we removed all the columns with NAs in the Totalcharge variable. We had to do this to make sure we could run our machine learning program without any problems.

CustomerR <- mutate(CustomerR, Status = factor(Status))%>%
  na.omit()

#Data Visualization

Graph1 <- CustomerR%>%
  select(Tenure, Status)%>% 
  group_by(Status)%>%
  count(Tenure)%>%
  arrange(desc(n))%>%
  ggplot(aes(Tenure, n)) + geom_col(aes(Tenure, n, fill = Status)) + ylab("Number of Customers") + xlab("Tenure") +  facet_wrap(~Status)  + ggtitle("Tenure of Customers Current vs Left")

print(Graph1)

In the first graph you can see the comparison between the employees that are working at the company and the ones who left. When looking at the graph you can see that the current chart is normally distributed, and the left chart is skewed to the right. In both graphs there are a few outliers but nothing major.

Graph2 <- CustomerR%>%
  select(Contract, Status)%>% 
  group_by(Status)%>%
  count(Contract)%>%
  arrange(desc(n))%>%
  ggplot(aes(Contract, n)) + geom_col(aes(Contract, n, fill = Contract)) + facet_wrap(~Status) + ylab("Number of Customers") + ggtitle("Customer contracts Current vs Left")

print(Graph2)

This second graph looks at the bills between the current and the employees that have left. When analyzing this graph, you can see that the Month to Month plan has the highest rate for employees leaving the company. This makes perfect sense because it would be super easy for a customer to cancel a month to month than to cancel a year plan.

Graph3 <- CustomerR%>%
  select(MonthlyCharges, Status)%>% 
  group_by(Status)%>%
  count(MonthlyCharges)%>%
  arrange(desc(n))%>%
  ggplot(aes(MonthlyCharges, n)) + geom_col(aes(MonthlyCharges, n, fill = Status)) + facet_wrap(~Status) +ylab("Number of Customers") + xlab("Monthly Charges") + scale_y_continuous(limits = c(0,25)) + ggtitle("Monthly Charges per Customer Current vs Left")

print(Graph3)

## Warning: Removed 21 rows containing missing values (`position_stack()`).

This third and final graph shows the number of customers compared to their monthly charges. When looking at this graph it seems current customers are the highest when costs are below $25. The left column, without its few outliers, is normally distributed.

Exploratory Data Analysis & Machine Learning

set.seed(123)
spam_split <- initial_split(CustomerR,prop = 0.7, strata = "Status")
spam_train <- training(spam_split)
spam_test <- testing(spam_split)

set.seed(123)
kfold <- vfold_cv(spam_train, v = 5)

results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)

## → A | warning: prediction from a rank-deficient fit may be misleading

## 
There were issues with some computations   A: x1

There were issues with some computations   A: x2

There were issues with some computations   A: x3

There were issues with some computations   A: x4

There were issues with some computations   A: x5

There were issues with some computations   A: x5

collect_metrics(results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.797     5 0.00705 Preprocessor1_Model1
## 2 roc_auc  binary     0.844     5 0.00974 Preprocessor1_Model1

In the code above we used a linear regression model. When using this linear regression model we end up getting a roc_auc of .844.

model_recipe <- recipe(Status ~ ., data = spam_train)

dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)

show_best(dt_results, metric = "roc_auc", n = 5)

## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estima…¹  mean     n std_err .config
##             <dbl>      <int> <int> <chr>   <chr>     <dbl> <int>   <dbl> <chr>  
## 1    0.0000000001         15    40 roc_auc binary    0.814     5 0.00780 Prepro…
## 2    0.0000000178         15    40 roc_auc binary    0.814     5 0.00780 Prepro…
## 3    0.00000316           15    40 roc_auc binary    0.814     5 0.00780 Prepro…
## 4    0.0000000001         11    40 roc_auc binary    0.814     5 0.00827 Prepro…
## 5    0.0000000178         11    40 roc_auc binary    0.814     5 0.00827 Prepro…
## # … with abbreviated variable name ¹.estimator

bag_mod <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")

bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))

set.seed(123)
bag_results <- tune_grid(bag_mod, model_recipe, resamples = kfold, grid = bag_hyper_grid)

show_best(bag_results, metric = "roc_auc")

## # A tibble: 5 × 7
##   times .metric .estimator  mean     n std_err .config             
##   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1   200 roc_auc binary     0.821     5 0.0108  Preprocessor1_Model5
## 2   300 roc_auc binary     0.820     5 0.00981 Preprocessor1_Model6
## 3   100 roc_auc binary     0.819     5 0.00991 Preprocessor1_Model4
## 4    50 roc_auc binary     0.816     5 0.00940 Preprocessor1_Model3
## 5    25 roc_auc binary     0.809     5 0.00971 Preprocessor1_Model2

set.seed(123)
kfold2 <- vfold_cv(spam_test, v = 5)

results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold2)

## → A | warning: prediction from a rank-deficient fit may be misleading

## 
There were issues with some computations   A: x1

There were issues with some computations   A: x2

There were issues with some computations   A: x3

There were issues with some computations   A: x4

There were issues with some computations   A: x5

There were issues with some computations   A: x5

collect_metrics(results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.803     5 0.00335 Preprocessor1_Model1
## 2 roc_auc  binary     0.841     5 0.00462 Preprocessor1_Model1

final_fit <- logistic_reg() %>%
  fit(Status ~ ., data = spam_test)
  tidy(final_fit)

## # A tibble: 31 × 5
##    term                          estimate std.error statistic     p.value
##    <chr>                            <dbl>     <dbl>     <dbl>       <dbl>
##  1 (Intercept)                    2.17       1.55      1.40    0.160     
##  2 GenderMale                    -0.00317    0.120    -0.0264  0.979     
##  3 SeniorCitizen                  0.101      0.157     0.643   0.520     
##  4 PartnerYes                     0.328      0.149     2.20    0.0275    
##  5 DependentsYes                 -0.382      0.166    -2.30    0.0215    
##  6 Tenure                        -0.0511     0.0110   -4.64    0.00000351
##  7 PhoneServiceYes                1.07       1.22      0.877   0.381     
##  8 MultipleLinesNo phone service NA         NA        NA      NA         
##  9 MultipleLinesYes               0.551      0.336     1.64    0.101     
## 10 InternetServiceFiber optic     2.93       1.51      1.94    0.0521    
## # … with 21 more rows

final_fit %>%
  predict(spam_test) %>%
  bind_cols(spam_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

##           Truth
## Prediction Current Left
##    Current    1386  243
##    Left        154  314

When looking at the confusion matrix there were 154 false positives and 243 false negatives. Seeing that there are more false negatives than false positives there could be more employees in the company that have already left or even playing on leaving.

final_fit%>%
  vip(num_features = 20)

When looking at the model we noticed the top 3 predictor variables within this model were Tenure, ContractTwo year and PaperlessBillingYes.

Conclusion

After looking at the data, we can conclude that Regork industries need to put more of an emphasis and more time on their Month to Month Contracts. Month to Month is where Regork is losing most of their money. We were thinking of instituting a rule that sort of the first contract you get has to be a year contract after that you can then more back to a month to month. We also noticed when looking at our graphs that when the price hits the $75 mark that is when most people are leaving. Regork should definitely have a promotion of some sort that keeps bills under that price of 75. After going over all the data and looking at all the graphs we found that tenure was the best predictor variable whether a customer should leave, so Regork should put some more time into researching tenure.

Final Project - Group 15

Alex Sieve and Jake Elmore

2023-04-25