Customer churn is a major challenge for Regork Telecom: every lost customer means lost revenue plus the cost of acquiring a replacement. Keeping existing customers is cheaper than winning new ones, so it is important to understand why people leave.
I focused on customer tenure and monthly charges to find patterns tied to churn. Using data analysis and predictive models, I looked at what influences churn and built tools to spot customers most at risk.
This analysis shows how Regork can reduce churn by focusing on customers with short tenure or high monthly charges. Acting on these insights will help improve retention, keep customers happy, and protect revenue.
The packages required for this analysis are loaded below (install any that are missing first):
# Load libraries
library(tidymodels)
library(tidyverse)
library(kableExtra)
library(vip)
The analysis uses the customer retention dataset:
# Load the dataset
data <- read_csv("customer_retention.csv")
data %>%
  ggplot(aes(x = Status)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Customer Churn Distribution", x = "Status", y = "Count")
What It Shows: The bar chart shows that a majority of customers are retained (“Current”), while a smaller percentage have churned (“Left”). This baseline provides context for comparing churn trends across other variables.
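To put a number on that baseline, a quick tally of the Status column gives the raw churn rate:

# Baseline churn rate from the raw counts
data %>%
  count(Status) %>%
  mutate(proportion = n / sum(n))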
data %>%
  ggplot(aes(x = MonthlyCharges, fill = Status)) +
  geom_density(alpha = 0.6) +
  labs(title = "Monthly Charges by Customer Status", x = "Monthly Charges", y = "Density")
What It Shows: The density plot indicates that customers who churned (“Left”) tend to have higher monthly charges compared to retained customers. This suggests that pricing may play a role in customer dissatisfaction and churn.
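A quick summary backs up what the density plot suggests; the median is used here because it is robust to skewed charges:

# Typical monthly charges for retained vs. churned customers
data %>%
  group_by(Status) %>%
  summarize(median_monthly_charges = median(MonthlyCharges, na.rm = TRUE))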
data %>%
  ggplot(aes(x = Tenure, fill = Status)) +
  geom_histogram(binwidth = 5, position = "dodge", color = "black") +
  labs(title = "Tenure Distribution by Customer Status", x = "Tenure (Months)", y = "Count")
What It Shows: The histogram reveals that churned customers are more likely to have shorter tenures compared to retained customers. This suggests that customers may leave early if their needs or expectations are not met.
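One step is missing before modeling: the chunks below reference train_data and test_data, which are never created in this section. A minimal sketch using rsample (part of tidymodels) fills the gap; the 70/30 ratio and the NA handling are assumptions, and classification in tidymodels requires the outcome to be a factor:

# Convert the outcome to a factor and drop incomplete rows (assumed cleaning step)
data <- data %>%
  mutate(Status = as.factor(Status)) %>%
  drop_na()

# Stratified 70/30 train/test split (the ratio is an assumption)
set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = Status)
train_data <- training(data_split)
test_data  <- testing(data_split)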
# Logistic Regression Model
logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

logistic_wf <- workflow() %>%
  add_model(logistic_model) %>%
  add_formula(Status ~ .)

# Estimate performance with 5-fold cross-validation, stratified on the outcome
set.seed(123)
logistic_results <- fit_resamples(
  logistic_wf,
  vfold_cv(train_data, v = 5, strata = Status),
  metrics = metric_set(roc_auc, accuracy)
)
logistic_results %>% collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.798     5 0.00463 Preprocessor1_Model1
## 2 roc_auc  binary     0.844     5 0.00421 Preprocessor1_Model1
Explanation: Logistic regression provides a solid baseline for churn prediction, reaching a cross-validated AUC of 0.844. Because the model is linear, its coefficients also give a direct read on how predictors such as tenure and monthly charges relate to churn.
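To see which predictors actually carry weight in the linear model, the fitted coefficients can be pulled out with broom; logistic_fit is a new object introduced here, fit once on the training data:

# Fit the workflow on the training data and list coefficients by significance
logistic_fit <- logistic_wf %>% fit(data = train_data)
logistic_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(p.value)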
# Random Forest Model
rf_model <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_model(rf_model) %>%
  add_formula(Status ~ .)

# Randomly sample 10 hyperparameter combinations to try
rf_grid <- grid_random(mtry(range = c(2, 10)), min_n(range = c(5, 20)), size = 10)

set.seed(123)
rf_tuned <- tune_grid(
  rf_wf,
  resamples = vfold_cv(train_data, v = 5, strata = Status),
  grid = rf_grid,
  metrics = metric_set(roc_auc, accuracy)
)

# Lock in the best hyperparameters and refit on the full training set
rf_best_params <- select_best(rf_tuned, metric = "roc_auc")
rf_final_wf <- rf_wf %>%
  finalize_workflow(rf_best_params)
rf_fit <- rf_final_wf %>%
  fit(data = train_data)

# Permutation-based variable importance from the fitted ranger model
vip::vip(extract_fit_parsnip(rf_fit)$fit)
What It Shows: After tuning, the Random Forest reaches a cross-validated AUC of 0.843, essentially tied with logistic regression. The permutation-importance plot highlights tenure, contract type, and monthly charges as the most influential predictors of churn, giving retention efforts concrete levers to pull.
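For reference, the hyperparameters the search settled on can be inspected directly from the tuning objects created above:

# Top tuning results and the selected mtry/min_n combination
rf_tuned %>% show_best(metric = "roc_auc", n = 3)
rf_best_params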
# Decision Tree Model
tree_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wf <- workflow() %>%
  add_model(tree_model) %>%
  add_formula(Status ~ .)

set.seed(123)
tree_results <- fit_resamples(
  tree_wf,
  vfold_cv(train_data, v = 5, strata = Status),
  metrics = metric_set(roc_auc, accuracy)
)
tree_results %>% collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.784     5 0.00344 Preprocessor1_Model1
## 2 roc_auc  binary     0.803     5 0.00330 Preprocessor1_Model1
Explanation: The decision tree, while easy to interpret, achieves a lower AUC of 0.803. Its performance suggests it’s less effective for complex relationships compared to Random Forest and Logistic Regression.
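Because the tree is so interpretable, it can be worth drawing it outright. A minimal sketch, assuming the rpart.plot package is installed; tree_fit is a new object introduced here, fit once on the training data:

# Fit the tree on the full training set and plot its splits
tree_fit <- tree_wf %>% fit(data = train_data)
rpart.plot::rpart.plot(extract_fit_engine(tree_fit))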
# Extract cross-validated AUC for Logistic Regression and Decision Tree
logistic_auc <- logistic_results %>% collect_metrics() %>% filter(.metric == "roc_auc")
tree_auc <- tree_results %>% collect_metrics() %>% filter(.metric == "roc_auc")

# Extract AUC for the best Random Forest model
rf_best_auc <- rf_tuned %>%
  show_best(metric = "roc_auc", n = 1) %>%
  mutate(model = "Random Forest") %>%
  select(model, mean, std_err)

# Combine AUC metrics into one table
auc_comparison <- bind_rows(
  logistic_auc %>% mutate(model = "Logistic Regression") %>% select(model, mean, std_err),
  rf_best_auc,
  tree_auc %>% mutate(model = "Decision Tree") %>% select(model, mean, std_err)
)

# Display AUC comparison table, best model first
auc_comparison %>%
  arrange(desc(mean)) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
| model               | mean      | std_err   |
|---------------------|-----------|-----------|
| Logistic Regression | 0.8437671 | 0.0042101 |
| Random Forest       | 0.8431790 | 0.0037668 |
| Decision Tree       | 0.8029446 | 0.0032961 |
Explanation: The AUC comparison confirms that Logistic Regression and Random Forest clearly outperform the Decision Tree, while the difference between the top two (0.8438 vs. 0.8432) is well within one standard error. Random Forest remains the practical choice because it pairs competitive accuracy with feature-importance insights.
# Evaluate the best model (Random Forest) on the test set
rf_test_predictions <- predict(rf_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

rf_test_auc <- rf_test_predictions %>%
  roc_auc(truth = Status, .pred_Left)
rf_test_auc
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.156
Explanation: A test AUC of 0.156 does not actually signal poor generalization. An AUC far below 0.5 means the model's ranking is inverted, which almost always indicates the metric is scoring the wrong class as the event rather than a failing model. Since 1 − 0.156 ≈ 0.844, test performance is in line with the cross-validated results once the event level is corrected.
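In yardstick, roc_auc() treats the first factor level of the truth column as the event by default. Assuming Status is ordered as ("Current", "Left"), scoring .pred_Left requires flipping the event level; under that assumption the corrected call should return roughly 1 − 0.156 ≈ 0.844:

# Score the "Left" class as the event instead of the default first level
rf_test_predictions %>%
  roc_auc(truth = Status, .pred_Left, event_level = "second")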
# Generate class predictions for the test dataset
final_predictions <- predict(rf_fit, test_data) %>%
  bind_cols(test_data)

# Generate the confusion matrix
confusion_matrix <- final_predictions %>%
  conf_mat(truth = Status, estimate = .pred_class)

# Convert the confusion matrix to a tibble for display
confusion_matrix_table <- as_tibble(confusion_matrix$table)

# Display the confusion matrix as a table
confusion_matrix_table %>%
  kable() %>%
  kable_styling(full_width = FALSE)
| Prediction | Truth   | n    |
|------------|---------|------|
| Current    | Current | 1425 |
| Left       | Current | 118  |
| Current    | Left    | 286  |
| Left       | Left    | 271  |
Explanation: The confusion matrix shows the Random Forest predicts churn with decent overall accuracy, but it misses 286 actual churners (predicted Current, truly Left: false negatives) and flags 118 loyal customers as churn risks (predicted Left, truly Current: false positives). Reducing the false negatives should be the priority, since each one is a departing customer the model failed to catch.
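yardstick can derive the standard rates from the same object; note that, as with roc_auc(), these default to treating the first factor level as the event:

# Accuracy, sensitivity, specificity, etc. from the confusion matrix
summary(confusion_matrix)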
Two of the biggest factors driving customer churn are tenure and monthly charges. Customers with shorter tenure and higher bills are more likely to leave. For example, newer customers in their first 6-12 months and those paying premium rates should be the focus for retention efforts.
If we don’t act, we’re looking at losing a significant chunk of monthly revenue from high-risk customers. Even saving a small percentage of these customers could make a noticeable difference to the bottom line over time.
Offering a discount on monthly charges or introducing a loyalty program could work wonders. For example, giving a 10-15% discount to customers in their first year or bundling additional services for free could help keep them around longer. It’s a small upfront cost that could pay off big in the long run.
To tackle churn, we should focus on easing pricing concerns for newer customers. A targeted discount or loyalty reward program could encourage them to stay longer. Running a small pilot program first to measure how effective these incentives are would be a smart way to start before rolling it out broadly.
This analysis mainly looks at tenure and charges but doesn’t include things like customer satisfaction or service usage, which could give even better insights. Adding that data in the future could make our predictions stronger. We could also experiment with more advanced models to see if they do a better job of identifying at-risk customers.