Introduction
This report analyzes the customer data given to us about the telecommunications market. The ultimate goal is to find a model that performs well at predicting whether customers will leave in the future, as well as their potential reasons for leaving. With a detailed analysis and report, company managers will be better informed about the situation and able to take preventive, fact-backed actions to retain their customers.
This report starts with the Data Preparation & Exploratory Data Analysis sections, which highlight the packages used, the data cleaning performed, and, most importantly, the visualizations used to find underlying trends and unique patterns within the data. Next is the Machine Learning section, where three different machine learning algorithms are assessed to find the best one. The best model is then used to determine which predictor variables appear most influential in customer behavior. The report closes with the Business Analysis and Conclusion section, which projects some possible future outcomes if no action is taken, along with a summary of the findings in this report and proposed solutions.
All code is hidden for a better reading experience, but it can be viewed by clicking the small "Code" button located on the right.
Packages Required
The following packages are used in this project:
library(tidyverse)   # data manipulation and plotting
library(tidymodels)  # modeling framework: recipes, resampling, tuning
library(ggplot2)     # visualization (also attached by tidyverse)
library(baguette)    # bagged model engines
library(vip)         # variable importance plots
library(pdp)         # partial dependence plots
library(here)        # project-relative file paths
library(kernlab)     # support vector machine engine
library(ranger)      # random forest engine
library(earth)       # MARS engine
library(DT)          # interactive data tables
We start by importing the data set and dropping the rows with NA values in them. A preview of the data table is shown below.
# Import the data, convert the outcome to a factor, and drop rows with NAs
retention <- read.csv("data/customer_retention.csv")
retention <- retention %>%
  mutate(Status = as.factor(Status)) %>%
  drop_na()
datatable(head(retention, 50), options = list(pageLength = 10))
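For context on how much data the drop_na() step removes, here is a quick sketch (re-reading the raw file into a hypothetical raw object):
raw <- read.csv("data/customer_retention.csv")
colSums(is.na(raw))             # NA counts per column in the raw data
nrow(raw) - nrow(retention)     # total rows removed by drop_na()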
Figure 1
To start, the first visualization shows the relationship between the total charges customers paid and the number of months the customers have stayed with the company (tenure), as well as whether they are current customers or have left already.
ggplot(retention, aes(x = Tenure, y = TotalCharges, color = Status)) +
  geom_point(size = 1) +
  geom_smooth() +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) +
  labs(x = "Tenure", y = "Total Charges",
       title = "Tenure and Total Charges of Customers")
A clear trend emerges: the longer a customer stays with the company, the higher their total charges become. This applies to both current customers and customers who have left. We can also see that, at a given tenure, customers who leave tend to have higher total charges than current customers. This could imply that customers leave when costs get too high.
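As a rough numeric check on this idea, a sketch comparing average charges and tenure by status:
retention %>%
  group_by(Status) %>%
  summarise(avg_monthly_charges = mean(MonthlyCharges),
            avg_tenure = mean(Tenure))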
Figure 2
The second visualization shows the relationship between tenure, count of customers, and status of customers.
ggplot(retention, aes(Tenure)) +
  geom_bar(fill = "blue") +
  facet_wrap(~ Status) +
  labs(x = "Tenure", y = "Count of Customers",
       title = "Tenure vs Count of Customers")
We can see that the number of customers who have left is highest at short tenures, and that it decreases as tenure increases.
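To quantify this, a sketch of the churn rate within 12-month tenure bins (cut_width() comes from ggplot2):
retention %>%
  mutate(tenure_bin = cut_width(Tenure, width = 12, boundary = 0)) %>%
  group_by(tenure_bin) %>%
  summarise(customers = n(),
            churn_rate = mean(Status == "Left"))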
Figure 3
The third visualization is similar to the second one, but it compares the contract type instead.
ggplot(retention, aes(Tenure)) +
  geom_bar(fill = "red") +
  facet_wrap(~ Contract) +
  labs(x = "Tenure", y = "Count of Customers",
       title = "Tenure vs Length of Contract")
From the visualization above, we can see a trend among the contract types. Customers with a lower tenure tend to choose the month-to-month contract over the one-year or two-year contracts, but as tenure increases, the number of customers on month-to-month contracts decreases while the number on one-year or two-year contracts increases.
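The same pattern can be summarized as a churn rate per contract type (a quick sketch):
retention %>%
  group_by(Contract) %>%
  summarise(customers = n(),
            churn_rate = mean(Status == "Left"))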
Figures 4 and 5
The fourth and fifth visualizations compare the number of customers with and without certain services. The data is reshaped with pivot_longer() to make it easier to graph.
long_df <- retention %>%
  select(Tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup,
         DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Status) %>%
  pivot_longer(cols = c('PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
                        'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'),
               names_to = 'Services',
               values_to = 'Service_Status')
long_df %>%
  filter(Service_Status == "Yes" | Service_Status == "Fiber optic" | Service_Status == "DSL") %>%
  ggplot(aes(Tenure, fill = Status)) +
  geom_bar(position = "stack") +
  facet_wrap(~ Services, nrow = 3) +
  labs(x = "Tenure", y = "Count of Customers",
       title = "Customers with Services")
long_df %>%
  filter(Service_Status == "No" | Service_Status == "No internet service") %>%
  ggplot(aes(Tenure, fill = Status)) +
  geom_bar(position = "stack") +
  facet_wrap(~ Services, nrow = 3) +
  labs(x = "Tenure", y = "Count of Customers",
       title = "Customers without Services")
Once again, for every service, the count of customers who left declines as tenure increases, whether or not they had the service. This could mean that services are not an important variable when customers are deciding whether to leave.
On the other hand, as tenure increases, the count of current customers with a service trends somewhat upward, while the count of current customers without a service trends somewhat downward. This could imply that the longer customers stay with the company, the more likely they are to add services to their plans.
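To check whether services really matter little, a sketch of the churn rate among subscribers of each service, reusing long_df:
long_df %>%
  filter(Service_Status %in% c("Yes", "Fiber optic", "DSL")) %>%
  group_by(Services) %>%
  summarise(subscribers = n(),
            churn_rate = mean(Status == "Left"))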
Splitting Data and Preparing for the Machine Learning Process
We split the data into training and testing sets, and create the k-fold cross-validation object and the model recipe shared by the three machine learning models.
set.seed(123)
retention_split <- initial_split(retention, prop = .7, strata = "Status")
retention_train <- training(retention_split)
retention_test <- testing(retention_split)
kfold <- vfold_cv(retention_train, v = 5, strata = "Status")
model_recipe <- recipe(Status ~ ., data = retention_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
The first machine learning model will be the logistic regression model.
lr_results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)
collect_metrics(lr_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00449 Preprocessor1_Model1
## 2 roc_auc binary 0.845 5 0.00590 Preprocessor1_Model1
The AUC we get here is about 0.845, which is a reasonably strong baseline. We will compare it with the remaining models to find the best one.
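Note that this logistic regression is resampled with a bare formula rather than model_recipe. For identical preprocessing across all three models, a workflow version would look like this (a sketch; glm handles the raw predictors directly, so the results should be very close):
lr_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(logistic_reg())
lr_wf %>%
  fit_resamples(resamples = kfold) %>%
  collect_metrics()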
Confusion Matrix
lr_final_fit <- logistic_reg() %>%
  fit(Status ~ ., data = retention_train)
lr_final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
The confusion matrix shows the logistic regression model correctly predicted 1362 customers who stayed and 332 who left. It predicted 225 customers would stay who actually left, and 178 would leave who actually stayed. Treating "Current" as the positive class, the model produces more false positives than false negatives: it more often predicts customers will stay when in reality they leave.
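Since missing a leaver is costlier to the business than flagging a stayer, it helps to look past raw accuracy; piping the confusion matrix into summary() reports sensitivity, specificity, and related metrics (a quick sketch):
lr_final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class) %>%
  summary()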
The second machine learning model will be the multivariate adaptive regression spline, abbreviated as MARS.
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")
mars_grid <- grid_regular(num_terms(range = c(1, 50)),
                          prod_degree(), levels = 50)
mars_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(mars_mod)
mars_results <- mars_wf %>%
  tune_grid(resamples = kfold, grid = mars_grid)
mars_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 100 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 18 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 2 19 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 3 20 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 4 21 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 5 22 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 6 23 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 7 24 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 8 25 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 9 26 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 10 27 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## # ℹ 90 more rows
autoplot(mars_results)
This model has a cross-validated AUC of 0.851, slightly higher than the logistic regression model's 0.845.
Confusion Matrix
mars_best_hyperparameters <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(mars_mod) %>%
  finalize_workflow(mars_best_hyperparameters)
mars_final_fit <- mars_final_wf %>%
  fit(data = retention_train)
mars_final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1383 242
## Left 157 315
The confusion matrix for this model shows it correctly predicted 1383 customers who stayed and 315 who left. It predicted 242 customers would stay who actually left, and 157 would leave who actually stayed. Like the logistic regression model, the MARS model produces more false positives than false negatives, predicting customers will stay when in reality they leave.
The third and final model will be the random forest model.
rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")
rf_results <- fit_resamples(rf_mod, model_recipe, kfold)
collect_metrics(rf_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00292 Preprocessor1_Model1
## 2 roc_auc binary 0.844 5 0.00691 Preprocessor1_Model1
This model has an AUC of 0.844, but with some hyperparameter tuning, it could potentially be higher.
Tuning
rf_mod <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()) %>%
  set_engine("ranger", importance = "impurity")
rf_hyper_grid <- grid_regular(
  trees(range = c(50, 500)),
  mtry(range = c(2, 20)),
  min_n(range = c(1, 20)),
  levels = 5)
set.seed(123)
rf_results <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = rf_hyper_grid)
show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 500 20 roc_auc binary 0.846 5 0.00640 Preprocessor1_Model1…
## 2 6 387 20 roc_auc binary 0.846 5 0.00645 Preprocessor1_Model1…
## 3 6 162 20 roc_auc binary 0.846 5 0.00660 Preprocessor1_Model1…
## 4 6 275 20 roc_auc binary 0.845 5 0.00643 Preprocessor1_Model1…
## 5 6 500 15 roc_auc binary 0.845 5 0.00627 Preprocessor1_Model0…
After hyperparameter tuning, the AUC increased to 0.846, an improvement of 0.002. That is higher than the logistic regression model but still lower than the MARS model.
Confusion Matrix
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
final_rf_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- final_rf_wf %>%
  fit(data = retention_train)
rf_final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1386 267
## Left 154 290
The confusion matrix for this model is also quite similar to the previous ones, with more false positives than false negatives.
The MARS model ended up with the highest AUC, so we use it to find which predictor variables appear to be the most influential on customer behavior.
mars_final_fit %>%
  extract_fit_parsnip() %>%
  vip()
It can be seen that the predictor variables Tenure, Total Charges, and Monthly Charges are the most influential when it comes to customer behavior. This means that the number of months that a customer stayed with the company and their costs for the services influence whether they decide to stay or leave the most.
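To see the direction of these effects, we can approximate a partial dependence curve by hand (a sketch; the pdp package loaded earlier offers a polished version, but a manual average of predictions is easy to verify):
# Vary Tenure over a grid while keeping all other predictors at their
# observed values, then average the predicted probability of leaving.
tenure_grid <- seq(min(retention_train$Tenure), max(retention_train$Tenure), length.out = 20)
pd_tenure <- purrr::map_dfr(tenure_grid, function(t) {
  tibble(Tenure = t,
         avg_prob_left = mean(predict(mars_final_fit,
                                      mutate(retention_train, Tenure = t),
                                      type = "prob")$.pred_Left))
})
ggplot(pd_tenure, aes(Tenure, avg_prob_left)) +
  geom_line() +
  labs(x = "Tenure", y = "Average Predicted Probability of Leaving",
       title = "Partial Dependence of Churn on Tenure")
Finally, we evaluate the chosen MARS model on the held-out test set.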
mars_final_fit %>%
  predict(retention_test, type = "prob") %>%
  mutate(truth = retention_test$Status) %>%
  roc_auc(truth, .pred_Current)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.847
The test-set AUC is 0.847, slightly below the cross-validation AUC (0.851). This suggests the model will be slightly less accurate on new data, but the gap is small enough that the model can still be relied upon.
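As a visual complement to the single AUC number, the full ROC curve behind it can be plotted with yardstick's roc_curve():
mars_final_fit %>%
  predict(retention_test, type = "prob") %>%
  mutate(truth = retention_test$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()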
Conclusion
To conclude the findings of this analysis, the most important predictor variables are Tenure, Total Charges, and Monthly Charges, and all of the visualizations and models support this. Our graphs show that customers who have stayed longer are less likely to leave and more likely to spend more on services.
Total Charges and Monthly Charges are also very influential, as customers care a great deal about prices. Keeping prices low and competitive while still generating acceptable profit isn't an easy task, but it is one of the most important ones. If prices get too high, customers are more likely to leave for other service providers.
# Sum the monthly charges of the test-set customers who actually left
cust_left <- retention_test %>%
  filter(Status == "Left")
monthly_loss <- sum(cust_left$MonthlyCharges)
The confusion matrix for the MARS model shows 557 customers in the test set who left (242 + 315 in the Truth = Left column). Their combined monthly charges amount to about $40,887.80 in lost monthly revenue, and losses of this scale can be expected to continue if nothing is done.
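To put that figure on an annual scale, a small sketch under the simplifying assumption that each of these customers would otherwise have stayed a full year at their current monthly charges:
# Hypothetical annualization: one full year of retained revenue per customer
annual_loss <- monthly_loss * 12
scales::dollar(annual_loss)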
Proposed Incentive Scheme
Our first suggestion is to retain customers for as long as possible. Figure 3 in the Exploratory Data Analysis section shows that new customers tend to begin with month-to-month contracts, and Figure 4 shows that most new customers buy Phone Service or Internet Service first. We can start there by offering new customers small discounts and deals on those services under month-to-month contracts, letting them buy at slightly lower prices than competitors offer.
We can build on this by offering incentives at the end of each contract to keep customers with us, such as a discount on Streaming TV or Multiple Lines for renewing. After six months to two years of month-to-month contracts with occasional incentives, we can offer longer contracts with small package deals that introduce services the customer has never used. When a one-year contract ends, we can offer another one-year contract that adds more services to the package, and eventually two-year contracts that include all services at discounted prices, such as 10% off everything when they subscribe to the full bundle. By that point, customers would normally have a good impression of the company and be more willing to try new things even when they don't strictly need them, especially as part of a deal. This part of the plan works for current customers as well; the model can help target it, as sketched below.
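The fitted MARS model can score current customers and rank them by predicted churn risk, so incentives go to the highest-risk accounts first (a sketch; note this scores customers the model may have been trained on):
current_customers <- retention %>%
  filter(Status == "Current")
current_customers %>%
  bind_cols(predict(mars_final_fit, current_customers, type = "prob")) %>%
  arrange(desc(.pred_Left)) %>%
  select(Tenure, Contract, MonthlyCharges, .pred_Left) %>%
  head(10)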
Limitations
One limitation of this analysis is the set of machine learning models considered. Only three were explored; other algorithms might fit this problem better.