The purpose of this report is to look deeper into Regorks customer data in order to predict whether customers will leave in the future. Regork wants to appeal to these customers before they leave. In order to do this I selected three models to use on this data.
##1. Data Preparation & Exploratory Data Analysis
#Packages Required
For this project the following packages are required:
library(tidyverse)
library(tidymodels)
library(baguette)
library(vip)
library(pdp)
library(readr)
library(ggplot2)
#Importing & Cleaning Data
The customer retention data set shows many variables that relate to Regorks telecommunication data. The response variable in this data set is “Status” which explains whether the customer is current or has left. There are many predictor variables including customer details, contract type, types of services, years with Regork, the charges and many other details.
To prepare the data I mutated it in order to make the “Status” variable into a factor. I also used the na.omit function to remove the rows that contained NA. The reason that I did this was to remove the rows that had NA in the TotalCharges variable. I did this in order to be able to run certain algorithms.
df <- read_csv("data/customer_retention.csv")
glimpse(df)
## Rows: 6,999
## Columns: 20
## $ Gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ Tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status <chr> "Current", "Current", "Left", "Current", "Left", "Lef…
df <- mutate(df, Status = factor(Status)) %>%
na.omit()
I created three plots in order to
ggplot(df, aes(Tenure)) +
geom_histogram(color = "Black", fill = "Light Green") +
facet_wrap(~Status)
This plot explains that customers who have been with Regork for under 5
years tend to be more likely to leave. This is a good variable to look
out for to potentially target.
ggplot(df, aes(MonthlyCharges)) +
geom_histogram(color = "Black", fill = "Pink") +
facet_wrap(~Status)
This plot explains that customers are more likely to stay with Regork
when they have monthly charges of less than $25. This makes sense that
people with cheap monthly bills will likely stay with Regork. It would
take more effort to move services to another company than just staying
with Regork.
ggplot(df, aes(Contract)) +
geom_bar(color = "Black", fill = "Light Blue") +
facet_wrap(~Status)
This plot explains that customers who have a month-to-month contract
will be more likely to leave. This makes sense because they will be more
likely to examine their monthly bills and aren’t tied down to the
company. This could definitely be a good place for Regork to look more
deeply into.
set.seed(123)
churn_split <- initial_split(df,prop = 0.7, strata = "Status")
churn_train <- training(churn_split)
churn_test <- testing(churn_split)
set.seed(123)
kfold <- vfold_cv(churn_train, v = 5)
###Logistic Reg
results <- logistic_reg() %>%
fit_resamples(Status ~ ., kfold)
collect_metrics(results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.797 5 0.00705 Preprocessor1_Model1
## 2 roc_auc binary 0.844 5 0.00974 Preprocessor1_Model1
The first model that I tested was a basic logistic regression model. When testing this model on the customer training set it resulted in a roc auc of 0.8442.
model_recipe <- recipe(Status ~ ., data = churn_train)
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)
# get best results
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estima…¹ mean n std_err .config
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 15 40 roc_auc binary 0.814 5 0.00780 Prepro…
## 2 0.0000000178 15 40 roc_auc binary 0.814 5 0.00780 Prepro…
## 3 0.00000316 15 40 roc_auc binary 0.814 5 0.00780 Prepro…
## 4 0.0000000001 11 40 roc_auc binary 0.814 5 0.00827 Prepro…
## 5 0.0000000178 11 40 roc_auc binary 0.814 5 0.00827 Prepro…
## # … with abbreviated variable name ¹.estimator
###Bagging
bag_mod <- bag_tree() %>%
set_engine("rpart", times = tune()) %>%
set_mode("classification")
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))
set.seed(123)
bag_results <- tune_grid(bag_mod, model_recipe, resamples = kfold, grid = bag_hyper_grid)
# model results
show_best(bag_results, metric = "roc_auc")
## # A tibble: 5 × 7
## times .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 200 roc_auc binary 0.821 5 0.0108 Preprocessor1_Model5
## 2 300 roc_auc binary 0.820 5 0.00981 Preprocessor1_Model6
## 3 100 roc_auc binary 0.819 5 0.00991 Preprocessor1_Model4
## 4 50 roc_auc binary 0.816 5 0.00940 Preprocessor1_Model3
## 5 25 roc_auc binary 0.809 5 0.00971 Preprocessor1_Model2
BAG AUC ROC Times Mean 300 roc_auc binary 0.8243749
set.seed(123)
kfold2 <- vfold_cv(churn_test, v = 5)
###Logistic Reg
results <- logistic_reg() %>%
fit_resamples(Status ~ ., kfold2)
collect_metrics(results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.803 5 0.00335 Preprocessor1_Model1
## 2 roc_auc binary 0.841 5 0.00462 Preprocessor1_Model1
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = churn_test)
tidy(final_fit)
## # A tibble: 31 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.17 1.55 1.40 0.160
## 2 GenderMale -0.00317 0.120 -0.0264 0.979
## 3 SeniorCitizen 0.101 0.157 0.643 0.520
## 4 PartnerYes 0.328 0.149 2.20 0.0275
## 5 DependentsYes -0.382 0.166 -2.30 0.0215
## 6 Tenure -0.0511 0.0110 -4.64 0.00000351
## 7 PhoneServiceYes 1.07 1.22 0.877 0.381
## 8 MultipleLinesNo phone service NA NA NA NA
## 9 MultipleLinesYes 0.551 0.336 1.64 0.101
## 10 InternetServiceFiber optic 2.93 1.51 1.94 0.0521
## # … with 21 more rows
final_fit %>%
predict(churn_test) %>%
bind_cols(churn_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1386 243
## Left 154 314
The confusion matrix resulted in 154 false positives and 243 false negatives. The false negatives show that there were more customers that we did not predict to leave that did/will.
vip(final_fit$fit, num_features = 20)
This chart shows the most important predictor variables in this model.