Regork is building a new telecommunication system and is seeking to attract existing customers rather than new customers due to the price difference. We have been tasked to create a predictive model to identify trends and patterns related to customer churn. The main goal is to retain the existing customers and offer incentives in case they leave for another organization. We will be analyzing three different models (Logistic Regression, Decision Tree, and Random Forest) and choose one of these to go deeper in our analysis.
These are the packages used in this report to be able to make this analysis:
library(tidymodels)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(vip)
library(ranger)
library(kernlab)
library(baguette)
This is the data we will be using throughout the report. We will be referring to this data as “customer_data”
customer_data <- read.csv("/Users/chrisdobies/Desktop/BANA 4080/customer_retention.csv")
Looking at the data, our main variable is Status. We have to convert this variable for it to work in our models since it is considered a character.
customer_data <- customer_data %>%
mutate(Status = factor(Status, levels = c("Current", "Left")))
We first want to discover some patterns and trends in customers to get some answers.
We will start by looking into how many customers are current and the customers that have left.
customer_data %>%
count(Status) %>%
mutate(percentage = n / sum(n) * 100)
## Status n percentage
## 1 Current 5143 73.48193
## 2 Left 1856 26.51807
There is a significant amount of customers that stayed, however the customers that left is still a good amount. About 27% of customers left which is something to take note of.
Something good to look at is to see what the tenure of these two sides of customers look like.
ggplot(customer_data, aes(x = Tenure, fill = Status)) +
geom_density(alpha = 0.5) +
labs(title = "Distribution of Tenure by Customer Status", x = "Months", y = "Density") +
theme_minimal()
This graph shows us the patterns these two customer types have. As seen, the “left” customers are present early and have a massive drop off overtime in just a few months. Rather than current customers stay longer overtime. There is a loyalty piece in there which is something to note. Building loyalty may be key to keeping customers long term.
Now let’s take a look to see the monthly charges between the two customers.
ggplot(customer_data, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(title = "Monthly Charges by Customer Status", y = "Monthly Charges ($)") +
theme_minimal()
The “Left” customer box has a higher median than the “Current”. Meaning the left customers pay higher prices than the other customers. There could be a few reasons for this. One being since the left customers leave pretty early as seen in our last graph, there is a lot of data from the specific customers. That also could signify that the “Left” customers do not like the prices of Regork products that also causes them to leave quickly.
customer_data %>%
count(Contract, Status) %>%
group_by(Contract) %>%
mutate(percent = n / sum(n) * 100) %>%
ggplot(aes(x = Contract, y = percent, fill = Status)) +
geom_col(position = "fill") +
labs(title = "Customer Status by Contract Type", y = "Proportion of Customers") +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
This graph shows that the longer the contract, the more likely they are to stay. The main finding here is that the longer the contract, the better customer retention is.
With all three graphs, we see the same trend. It appears the customers with more loyalty to Regork stay longer and are still existing customers. Thus the reason for them leaning towards having two-year contracts. It seems to be that the “left” customers are less committed, paying higher prices due to a short tenure, and the customers are just new.
We are now going to be looking at three different models to find our perfect optimal model. We will be comparing our models based off of their AUC percentages.
First we will be creating a Logistic Regression model.
set.seed(123)
logistic_split <- initial_split(customer_data, prop = 0.7, strata = Status)
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)
set.seed(123)
logistic_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)
logistic_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
logistic_recipe <- recipe(Status ~ ., data = logistic_train)
logistic_workflow <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(logistic_recipe)
logistic_results <- logistic_workflow %>%
fit_resamples(
resamples = logistic_kfolds,
metrics = metric_set(roc_auc, accuracy),
control = control_resamples(save_pred = TRUE)
)
collect_metrics(logistic_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.798 5 0.00463 Preprocessor1_Model1
## 2 roc_auc binary 0.844 5 0.00421 Preprocessor1_Model1
So our AUC percentage is 84% which is pretty good. This suggests that this model is able to distinguish customers who stay or leave 84% of the time.
Our Second model will be a Decision Tree.
tree_model <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_recipe <- recipe(Status ~ ., data = logistic_train)
tree_workflow <- workflow() %>%
add_model(tree_model) %>%
add_recipe(tree_recipe)
tree_results <- tree_workflow %>%
fit_resamples(
resamples = logistic_kfolds,
metrics = metric_set(roc_auc, accuracy),
control = control_resamples(save_pred = TRUE)
)
collect_metrics(tree_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.784 5 0.00344 Preprocessor1_Model1
## 2 roc_auc binary 0.803 5 0.00330 Preprocessor1_Model1
This AUC percentage comes out to 80%. This is still a good score but not better than the Logistic Regression Model.
Let’s look at our final model which will be a Random Forest Model.
rf_model <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
rf_recipe <- recipe(Status ~ ., data = logistic_train)
rf_workflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(rf_recipe)
rf_grid <- grid_regular(
mtry(range = c(2, 10)),
min_n(range = c(5, 20)),
levels = 5
)
set.seed(123)
rf_tune_results <- tune_grid(
rf_workflow,
resamples = logistic_kfolds,
grid = rf_grid,
metrics = metric_set(roc_auc, accuracy),
control = control_grid(save_pred = TRUE)
)
rf_best <- select_best(rf_tune_results, metric = "roc_auc")
rf_final_workflow <- finalize_workflow(rf_workflow, rf_best)
rf_results <- rf_final_workflow %>%
fit_resamples(
resamples = logistic_kfolds,
metrics = metric_set(roc_auc, accuracy),
control = control_resamples(save_pred = TRUE)
)
collect_metrics(rf_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00331 Preprocessor1_Model1
## 2 roc_auc binary 0.843 5 0.00427 Preprocessor1_Model1
We once again got an AUC of 84% making it the same exact result as the Logistic Regression. So which one will we choose?
Both models have good scores however we will end up using the Random Forest model as this model offers a little more for us to break down and analyze. Let’s create a plot to show the most important features.
rf_final_fit <- rf_final_workflow %>%
last_fit(logistic_split, metrics = metric_set(roc_auc, accuracy))
rf_final_model <- extract_fit_parsnip(rf_final_fit$.workflow[[1]])
vip::vip(rf_final_model, num_features = 10)
Our Random Forest model shows that Tenure, Contract Type, Total Charges, and Monthly Charges as the most influential predictor variables with what customers do. We will also create a confusion matrix to show more of the accuracy of the model.
rf_predictions <- rf_final_fit %>%
collect_predictions()
rf_predictions %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1423 285
## Left 120 272
Looking at what prediction variables came up in our model compared to what we found in our initial graphs is something to point out. Starting with Tenure. This variable is seen as the most important reason for why customers leave. We seen this in our graph made earlier as well. The longer a customer stays, the better chances there are of being a long term customer. Customers that leave have a very short tenure and as seen throughout the report, is the biggest way of predicting customers.
We also found that the contract variable is another important variable. The longer the contract, the more likely customers are to stay. Customers with short contracts are likely to leave. So targeting a good deal for long term contracts or making more perks for long term contracts is the way to go. Maybe even make an exclusive offer for new customers.
Another variable is monthly charges. New customers may be paying more which is causing this data to be sweked this way. But this is an important feature to understand if Regork is trying to keep customers rather than losing them.
Regork should consider our random forest model as it predicts pretty well on customer status. With an AUC of 84%, the model can be used to make successful business decisions. We also mentioned the most important areas to look into so Regork is able to focus on those areas to keep these customers long term. Regork can make a great telecommunication service by using our predictive model.