When assisting Regork with their new telecommunications market, I will be building multiple machine learning models to predict retention of customers. I will first begin with visualizing a few trends that are present in the customer_retention dataset. I will then build a logistic regression model, multivariate adaptive regression spline (MARS), and random forest model. These trends will help cross-functional teams assess how to retain customers, specifically within Marketing department. With this data, Regork can highlight the areas with highest impact related to retention. Finally, with the conclusive models, Regork can assess what trends might need more attention going forward.
The following packages will be used throughout this report
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(tidymodels)))
suppressWarnings(suppressMessages(library(tidyverse)))
suppressWarnings(suppressMessages(library(baguette)))
suppressWarnings(suppressMessages(library(vip)))
suppressWarnings(suppressMessages(library(pdp)))
suppressWarnings(suppressMessages(library(here)))
suppressWarnings(suppressMessages(library(kernlab)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(ranger)))
suppressWarnings(suppressMessages(library(earth)))
suppressWarnings(suppressMessages(library(knitr)))
suppressMessages(suppressWarnings(setwd("C:/Users/samev/OneDrive/BANA_4080/data")))
suppressWarnings(suppressMessages(retention<- readr::read_csv("customer_retention.csv")))
suppressWarnings(suppressMessages(retention<- mutate(retention, Status = factor(Status))))
suppressWarnings(suppressMessages((retention<- na.omit(retention))))
## # A tibble: 6,988 × 20
## Gender SeniorCitizen Partner Dependents Tenure PhoneService MultipleLines
## <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 Female 0 Yes No 1 No No phone service
## 2 Male 0 No No 34 Yes No
## 3 Male 0 No No 2 Yes No
## 4 Male 0 No No 45 No No phone service
## 5 Female 0 No No 2 Yes No
## 6 Female 0 No No 8 Yes Yes
## 7 Male 0 No Yes 22 Yes Yes
## 8 Female 0 No No 10 No No phone service
## 9 Female 0 Yes No 28 Yes Yes
## 10 Male 0 No Yes 62 Yes No
## # ℹ 6,978 more rows
## # ℹ 13 more variables: InternetService <chr>, OnlineSecurity <chr>,
## # OnlineBackup <chr>, DeviceProtection <chr>, TechSupport <chr>,
## # StreamingTV <chr>, StreamingMovies <chr>, Contract <chr>,
## # PaperlessBilling <chr>, PaymentMethod <chr>, MonthlyCharges <dbl>,
## # TotalCharges <dbl>, Status <fct>
ggplot(retention, aes(Tenure)) +
geom_bar(fill = "lightblue") +
facet_wrap(~InternetService) +
ggtitle("Tenure and Internet Service") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(y = "Number of Customers", x = "Length of Tenure (months)")
This plot displays the relationship between Tenure and
Internet Service. There seems to be some relations between each
graph, as it clearly shows a similar distribution of tenure for
DSL, Fiber Optic, and NO internet
service.
ggplot(retention, aes(x= MonthlyCharges)) +
geom_bar(fill = "lightgreen") +
facet_wrap(~Contract)+
ggtitle("Contract Type and Monthly Charges") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "Monthly Charges", y = "Number of Customers")
This plot displays the trends in Contract Type and Monthly Charges. This graph shows that more customers have a month-to-month contract and the mean of the distribution also shows that customers are paying around 80 dollars per month. This is a critical factor in the prediction of whether customers will stay with their telecommunications service.
retention%>%
ggplot(aes(x=Tenure, fill=Status))+
geom_bar() +
labs(x = "Tenure", y = "Number of Customers")+
ggtitle("Tenure and Status of Customer")
This plot is crucial for Regork executives to examine. As shown in the plot, we do not see much of a relationship between tenure and whether a customer left or is current. We can see that the majority of the customers that did leave, left early on in their tenure with the internet service.
The underlying outcome variables that we look to evaluate are the roc_auc and accuracy. The lower the roc_auc, the better the model that will be built.
To Split the Data
set.seed(123)
retention_split <- initial_split(retention, prop = .7, strata = "Status")
retention_train <- training(retention_split)
retention_test <- testing(retention_split)
The first machine learning model used is the logistic regression model
# Use same kfold for whole report
set.seed(123)
kfold<- vfold_cv(retention_train, v = 5)
log_results<- logistic_reg()%>%
fit_resamples(Status ~.,kfold)
collect_metrics(log_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.797 5 0.00705 Preprocessor1_Model1
## 2 brier_class binary 0.137 5 0.00374 Preprocessor1_Model1
## 3 roc_auc binary 0.844 5 0.00974 Preprocessor1_Model1
final_fit_log<- logistic_reg()%>%
fit(Status ~., data = retention_train)
Logistic Regression shows a relatively good roc_auc.
final_fit_log %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
Multivariate Adaptive Regression Sampling, or MARS is the next machine learning model that will be tested. This model provides another non-linearity scope which often proves to be more accurate than simple logistic regression.
retention_recipe <- recipe(Status ~ ., data = retention_train) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode('classification')
mars_grid <- grid_regular(
num_terms(range = c(1, 50)),
prod_degree(),
levels = 30)
retention_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(mars_mod)
tuning_results <- retention_wf %>%
tune_grid(resamples = kfold, grid = mars_grid)
tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 60 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 17 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## 2 19 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## 3 21 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## 4 22 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## 5 24 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## 6 26 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## 7 28 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## 8 29 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## 9 31 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## 10 33 1 roc_auc binary 0.848 5 0.0107 Preprocessor1_M…
## # ℹ 50 more rows
autoplot(tuning_results)
When we plot the results of MARS, we see the accuracy and roc_auc prove to be significant.
The final machine learning model that I will be testing is the random forest.
suppressMessages(suppressWarnings(retention_recipe<-recipe(Status~.,retention_train)))
suppressMessages(suppressWarnings(retention_mod <- rand_forest(mode = "classification") %>%
set_engine("ranger")))
suppressMessages(suppressWarnings(retention_mod <- rand_forest(mode = "classification", trees = tune(), mtry = tune(), min_n = tune()) %>%
set_engine("ranger", importance = "impurity")))
suppressMessages(suppressWarnings(retention_hyper_grid <- grid_regular(trees(range = c(1, 800)), mtry(range = c(1, 50)), min_n(range = c(1, 20)), levels = 5)))
set.seed(123)
suppressMessages(suppressWarnings(results <- tune_grid(retention_mod, retention_recipe, resamples = kfold, grid = retention_hyper_grid)))
suppressMessages(suppressWarnings(show_best(results, metric = "roc_auc")))
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 800 1 roc_auc binary 0.840 5 0.0110 Preprocessor1_Model0…
## 2 1 800 15 roc_auc binary 0.840 5 0.0108 Preprocessor1_Model0…
## 3 1 400 1 roc_auc binary 0.840 5 0.0106 Preprocessor1_Model0…
## 4 1 400 20 roc_auc binary 0.839 5 0.0108 Preprocessor1_Model1…
## 5 1 600 10 roc_auc binary 0.839 5 0.0109 Preprocessor1_Model0…
With the use of tuning, the model’s performance can be maximized. The results show that random forest modeling for this situation provides optimal results, with a low mean roc_auc.
# get optimal hyperparameters
retention_best_hyperparameters <- select_best(results, metric = "roc_auc")
# create final workflow object
final_retention_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(retention_mod) %>%
finalize_workflow(retention_best_hyperparameters)
# fit final workflow object
retention_final_fit <- final_retention_wf %>%
fit(data = retention_train)
# plot feature importance
retention_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)
5
## [1] 5
Finally, I have created a chart that shows which variables are more important (VIP) in the random forest model.
retention_final_fit%>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1480 375
## Left 60 182
To show the validity of the model compared to the actual data, a simple confusion matrix was created to list false positive or false negative results from the random forest.
train_predictions <- retention_final_fit %>%
predict(retention_train) %>%
bind_cols(retention_train %>% select(Status))
train_accuracy <- train_predictions %>%
metrics(truth = Status, estimate = .pred_class) %>%
filter(.metric == "accuracy") %>%
pull(.estimate)
test_predictions <- retention_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status))
test_accuracy <- test_predictions %>%
metrics(truth = Status, estimate = .pred_class) %>%
filter(.metric == "accuracy") %>%
pull(.estimate)
generalization_error <- train_accuracy - test_accuracy
generalization_error
## [1] 0.01688512
The next thing I did was create a generalization error equation which shows a low generalization error in this model.The generalization error is similar to the validation error which means this model was able to accurately predict the data.
Business Decision: From the observations in this section, I have learned that each machine learning model is not too far off from one another. Also, the low generalization number means that the random forest model can be accurate at predicting the the actual or future data sets.
TOP 5 FACTORS Contract: Customers that have larger contracts are less likely to leave the service. It would be wise for Regork to advertise larger contracts to customers to decrease the chances of customers leaving.
Tenure: Tenure is the second most important predictor. It is apparent that retaining customers for longer periods of time leads to better retaining of customers.
Total Charges: Coming in at the 3rd most important predictor, customers tend to value the pricing of their plans when deciding who to choose.
Monthly Charges: Related to the 3rd most influential predictor. Monthly charges is similar to total charges as customers tend to put priority on their monthly bill.
Online Security: Coming in at 5th, online security. Customers who have online security also tend to have better retention with these service providers.
retention_monthly_income<- retention%>%
filter(Status == 'Left')%>%
select(MonthlyCharges)%>%
summarize(total_income = sum(MonthlyCharges))
tibble(retention_monthly_income)
## # A tibble: 1 × 1
## total_income
## <dbl>
## 1 138222.
When we look at lost customers from our data set, we see that potential lost monthly income is $138,222. The reason why these customers left would need more potential research but if there is no action taken, Regork would suffer this extreme loss per month.
Incentive Shceme: One incentive scheme that I propose is implementing a loyalty program. Customers can climb tiers with longevity and with each level up, customers can reap the rewards.
For Example:
Bronze: Customers with 3-6 months of tenure receive a discount on their next bill.
Silver: Customers with 6-12 months of tenure are eligible to receive a 10% discount on their monthly bill.
Gold: Customers with 12-18 months of tenure are elegible for free streaming services along with their discount.
Platinum: Customers with 18+ months of tenure receive a free upgrade to higher internet speed and a boost of 15% discount on their monthly bill.
Summary of Report: With this report, Regork officials should consider getting ahead of the game with the incentive program mentioned above. Customer tenure and contract length are two of the highest predictors of retaining customers. The loyalty program will cut the lost profits from customers who were not satisfied with their current provider. The loyalty program is not only an incentive for customers to stay longer, it also decreases the prices for customers who stay longer, which is another major factor of retention. Regork executives can expect higher retention rates if they choose to implement this plan.
Limitations: Throughout this process, there were several limitations in this data set. One of the biggest limitations we might see is the recognition of irrelevant patterns due to an over-fit or under-fit model. It would also be helpful to have more information on our customer base which may explain demographic variable predictors. With this information, we can use factors like age and income range to determine what our ideal customer base is and eliminate any outliers. The last limitation is that the random forest model could be better with more trees.