I have been tasked with analyzing customer data from Regork’s telecommunications market. Ultimately, the goal at hand is to find a model that efficiently and accurately predicts whether customers will churn in the future, as to help Regork retain them as customers.
library(tidymodels)
library(tidyverse)
library(dplyr)
library(vip)
library(pdp)
library(kernlab)
library(baguette)
library(GGally)
library(ggmosaic)
library(rpart.plot)
retention <- read_csv("customer_retention.csv")
retention <- retention %>%
mutate(Status = as.factor(Status))
retention <- retention %>%
drop_na()
churn_rate <- retention %>%
count(Status) %>%
mutate(percentage = n / sum(n) * 100)
churn_rate
## # A tibble: 2 × 3
## Status n percentage
## <fct> <int> <dbl>
## 1 Current 5132 73.4
## 2 Left 1856 26.6
After loading and cleaning our data, we can utilize a churn rate calculation to view a baseline of customer retention and potential areas for improvement, before starting our exploratory analysis.
The Exploratory Analysis section presents the visualizations and data findings that have been utilized throughout the R project, as to address areas for improvement.
# Gender vs Status
retention %>%
ggplot(aes(PhoneService, fill=Gender))+
geom_bar() +
labs(title = "Gender Count and Phone Service Use")
# Gender and Senior Citizen vs. Status
retention %>%
mutate(Senior_Citizen = case_when(
SeniorCitizen == 1 ~ "Senior Citizen",
SeniorCitizen == 0 ~ "Not a Senior Citizen")) %>%
ggplot(aes(PhoneService, fill=Senior_Citizen))+
geom_bar()+
labs(title = "Senior Citizen Count and Phone Service Use")
According to the charts above, more female customers use Regork’s phone services than males. Additionally, the second chart shows us that there are more people who aren’t senior citizens using said phone plan than there are actual senior citizens. This shows us that there is both a prevalance of females who are not senior citizens using the phone services.
# Stacked Bar Chart of Partner and Dependents vs. Status
retention %>%
ggplot(aes(x = interaction(Partner, Dependents), fill = Status)) +
geom_bar(position = "stack") +
labs(title = "Count of Customers by Partner and Dependents Status", x = "Partner and Dependents", y = "Count") +
scale_x_discrete(labels = c("No Partner\nNo Dependents", "No Partner\nDependents", "Partner\nNo Dependents", "Partner\nDependents")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
As Shown above, most of Regork’s telecommunications customer base consists of those who have no partner and no dependents.
The order of customer prevelance based on partners & dependents is as follows:
# Internet Service and Online Security vs. Status
retention %>%
ggplot(aes(InternetService, fill = Status)) +
geom_bar() +
facet_wrap(~ OnlineSecurity) +
labs(title = "Internet Service and Online Security Count vs. Status")
# Tenure and Internet Service vs. Status
retention %>%
ggplot(aes(Tenure, fill = Status)) +
geom_histogram(binwidth = 1, position = "dodge") +
facet_wrap(~ InternetService) +
labs(title = "Tenure and Internet Service Count vs. Status", x = "Tenure (months)", y = "Count")
The first plot above displays Internet Service and Online Security versus Status. It is clear that there are a higher number of customers with no online security that have fiber optics and DSL. In fact, most of Regork’s current and previous customers did not have online security, yet they had internet services.
The second plot displays Tenure and Internet Service versus Status. In this plot we clearly see that fiber optics retains the most customers. However, it appears as though many previous customers (left) stopped their fiber optics services after only a few months.
# Payment Method vs. Status
retention %>%
ggplot(aes(PaymentMethod, fill = Status)) +
geom_bar() +
labs(title = "Payment Method Count vs. Status")
# Contract vs Status
retention %>%
ggplot(aes(Contract, fill = Status)) +
geom_bar() +
labs(title = "Contract Type and Customer Status")
# Correlation plot for numeric variables
numeric_vars <- retention %>%
select(Tenure, MonthlyCharges, TotalCharges)
ggpairs(numeric_vars, aes(color = retention$Status))
This chart depicts a scatter plot matrix, which allows us to identify correlations between our numeric variables (Tenure, Monthly Charges, and Total Charges) and their difference regarding customers who have or have not churned.
# Logistic Regression:
set.seed(123)
retention_split <- initial_split(retention, prop = 0.7, strata = "Status")
retention_train <- training(retention_split)
retention_test <- testing(retention_split)
retention_recipe <- recipe(Status ~ ., data = retention_train) %>%
step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
step_zv(all_predictors())
lr_mod <- logistic_reg() %>%
set_engine("glm")
set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)
log_results <- lr_mod %>%
fit_resamples(Status ~ ., kfolds)
collect_metrics(log_results) %>% filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
final_fit <- lr_mod %>%
fit(Status ~ ., data = retention_train)
# Confusion Matrix
final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
# ROC Curve
final_fit %>%
predict(retention_train, type = "prob") %>%
mutate(truth = retention_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
# Coefficients for feature importance
coefficients <- tidy(final_fit$fit)
vip(final_fit$fit, num_features = 10) +
ggtitle("Top 10 Important Features from Logistic Regression Model")
top_features <- coefficients %>%
mutate(abs_estimate = abs(estimate)) %>%
arrange(desc(abs_estimate)) %>%
top_n(10, abs_estimate)
predictions <- predict(final_fit, retention_test, type = "prob")
retention_test <- retention_test %>%
bind_cols(predictions) %>%
mutate(expected_loss = .pred_Left * MonthlyCharges)
monthly_predicted_loss <- retention_test %>%
summarize(monthly_loss = sum(expected_loss))
Our logistic regresion model does a good job at predicting when a customer is still active compared to when a customer has left/churned.
Additionally, it is clear that the inaccuracy of the model stems from FP (false positives) rather than FN (false negatives).
Indicative of the curve above, there is bias towards predicting customer activity while customers are no longer using Regork’s services
As displayed by the bar graph, our most significant and influential feature is tenure. This means that the length of time Regork’s customers has been with the company is the strongest predictor of churn.
# Predict on the test set
test_predictions <- final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status))
# Compute confusion matrix
conf_matrix <- test_predictions %>%
conf_mat(truth = Status, estimate = .pred_class)
# Calculate generalization error
generalization_error <- 1 - sum(diag(conf_matrix$table)) / sum(conf_matrix$table)
generalization_error
## [1] 0.1921793
Because our model is efficient and has a low generalization error, it is likely that our model will have a high success rate when applied in the future.
#Decision Tree
dt_mod <- decision_tree(mode = "classification") %>%
set_engine("rpart")
dt_fit <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(dt_mod) %>%
fit(data = retention_train)
dt_results <- fit_resamples(dt_mod, retention_recipe, kfolds)
collect_metrics(dt_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.788 5 0.00399 Preprocessor1_Model1
## 2 brier_class binary 0.161 5 0.00171 Preprocessor1_Model1
## 3 roc_auc binary 0.710 5 0.00529 Preprocessor1_Model1
# Hypertuning
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
set.seed(123)
dt_results <- tune_grid(dt_mod, retention_recipe, resamples = kfolds, grid = dt_hyper_grid)
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 8 30 roc_auc binary 0.816 5 0.00552
## 2 0.0000000178 8 30 roc_auc binary 0.816 5 0.00552
## 3 0.00000316 8 30 roc_auc binary 0.816 5 0.00552
## 4 0.0000000001 8 40 roc_auc binary 0.815 5 0.00420
## 5 0.0000000178 8 40 roc_auc binary 0.815 5 0.00420
## # ℹ 1 more variable: .config <chr>
dt_best_model <- select_best(dt_results, metric = 'roc_auc')
dt_final_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(dt_mod) %>%
finalize_workflow(dt_best_model)
dt_final_fit <- dt_final_wf %>%
fit(data = retention_train)
dt_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1317 243
## Left 223 314
dt_final_fit %>%
predict(retention_train, type = "prob") %>%
mutate(truth = retention_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
# Visualization with rpart.plot
final_tree <- pull_workflow_fit(dt_final_fit)$fit
printcp(final_tree)
##
## Classification tree:
## rpart::rpart(formula = ..y ~ ., data = data, cp = ~1e-10, maxdepth = ~8,
## minsplit = min_rows(30, data))
##
## Variables actually used in tree construction:
## [1] Contract_One.year Contract_Two.year
## [3] Gender_Male InternetService_Fiber.optic
## [5] InternetService_No MonthlyCharges
## [7] MultipleLines_Yes OnlineBackup_Yes
## [9] OnlineSecurity_Yes PaperlessBilling_Yes
## [11] PaymentMethod_Credit.card..automatic. PaymentMethod_Electronic.check
## [13] PaymentMethod_Mailed.check SeniorCitizen
## [15] StreamingMovies_Yes Tenure
## [17] TotalCharges
##
## Root node error: 1299/4891 = 0.26559
##
## n= 4891
##
## CP nsplit rel error xerror xstd
## 1 1.0431e-01 0 1.00000 1.00000 0.023777
## 2 8.0831e-03 2 0.79138 0.80216 0.022044
## 3 5.0038e-03 4 0.77521 0.80754 0.022098
## 4 3.3359e-03 7 0.75674 0.78984 0.021920
## 5 3.0793e-03 14 0.72440 0.79369 0.021959
## 6 2.3095e-03 21 0.69746 0.78984 0.021920
## 7 1.5396e-03 23 0.69284 0.79677 0.021990
## 8 1.1547e-03 28 0.68514 0.80446 0.022068
## 9 1.0778e-03 31 0.68129 0.80370 0.022060
## 10 7.6982e-04 37 0.67360 0.80370 0.022060
## 11 6.1586e-04 40 0.67129 0.82063 0.022227
## 12 5.1322e-04 45 0.66821 0.82294 0.022250
## 13 1.0000e-10 51 0.66436 0.83141 0.022332
pruned_tree <- prune(final_tree, cp = 0.01)
rpart.plot(pruned_tree,
type = 2,
extra = 104,
under = TRUE,
faclen = 0,
fallen.leaves = TRUE,
main = "Pruned Decision Tree: Significant Variables")
#bagging
bag_mod <- bag_tree() %>%
set_engine("rpart", times = 5) %>%
set_mode("classification")
bag_results <- fit_resamples(bag_mod, retention_recipe, kfolds)
collect_metrics(bag_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.764 5 0.00189 Preprocessor1_Model1
## 2 brier_class binary 0.171 5 0.00193 Preprocessor1_Model1
## 3 roc_auc binary 0.769 5 0.00433 Preprocessor1_Model1
bag_mod <- bag_tree() %>%
set_engine("rpart", times = tune()) %>%
set_mode("classification")
# hyperparameter grid
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))
set.seed(123)
bag_results <- tune_grid(bag_mod, retention_recipe, resamples = kfolds, grid = bag_hyper_grid)
show_best(bag_results, metric = "roc_auc")
## # A tibble: 5 × 7
## times .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 200 roc_auc binary 0.822 5 0.00347 Preprocessor1_Model5
## 2 300 roc_auc binary 0.821 5 0.00421 Preprocessor1_Model6
## 3 100 roc_auc binary 0.821 5 0.00454 Preprocessor1_Model4
## 4 50 roc_auc binary 0.816 5 0.00374 Preprocessor1_Model3
## 5 25 roc_auc binary 0.809 5 0.00453 Preprocessor1_Model2
bag_best_model <- select_best(bag_results, metric = 'roc_auc')
bag_final_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(bag_mod) %>%
finalize_workflow(bag_best_model)
bag_final_fit <- bag_final_wf %>%
fit(data = retention_train)
# Confusion Matrix
bag_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1341 263
## Left 199 294
# ROC Curve
bag_final_fit %>%
predict(retention_train, type = "prob") %>%
mutate(truth = retention_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
In our Logistic Regression model, our most optimal model, it is clear that Tenure and Total Charges are our most influential features. Additionally, our decision tree model shows that Monthly Charges and Internet Service Fiber Optic follow in regards to influence. To ensure that Regork keeps as many customers as they can, it is imperative for them to focus on Tenure and Total Charges. With relation to these features, Regork could roll out incentives and promotions. Possible promotions related to Tenure would be discounts or special package deals for those who have used the company services for a certain amount of time.
Looking at our plots that show demographics versus status, we can see that the majority of customer churn revolves around males, those who are without partners, do not have dependents, and aren’t senior citizens. This trend may be due to the financial stability of those who do not fall under those categories. For instance, those who have dependents (families) and partners face far worse financial strain than those who are single and live alone.
monthly_predicted_loss
## # A tibble: 1 × 1
## monthly_loss
## <dbl>
## 1 42412.
ggplot(retention_test, aes(x = expected_loss)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of Predicted Monthly Revenue Loss",
x = "Predicted Monthly Revenue Loss",
y = "Count of Customers")
According to our optimal model, Regork’s predicted loss per month based on customer churn is $42411.7.
After analyzing the trends of customer churn through Regork’s telecommunications model it is clear that there must be incentive plans in place to avoid future churn. One idea fo this incentive system would be lyoalty rewards, wherein customers earn points for tenure (months stayed) which can be applied to future bills or other services they wish to purchase. Additionally, because customer churn is higher for those who have less internet services tenure, it would be beneficial to give customers random service plan upgrades, as well as having their first month be free or discounted.
In conclusion, it is now evident that Regork’s telecommunications service market is highly influenced by both customer tenure and their total charges. Because of this, it is imperative that Regork begins implementing loyalty and reward programs that cater to quickly churning demographics mentioned above. Once Regork rolls out such programs, they will ultimately avoid the monthly loss that was calculated before. Finally, using the optimal logistic regression model will help Regork to continue in this market.