Customer retention is a critical aspect of maintaining steady revenue for any subscription-based business. In this analysis, we investigate patterns in customer churn and predict those at risk of leaving. We also propose actionable insights to retain these customers and minimize revenue loss.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## corrplot 0.95 loaded
## Loading required package: lattice
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# Load the dataset
data <- readr::read_csv("C:/Users/Mahika/OneDrive - University of Cincinnati/BANA 4080/data/customer_retention.csv")## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [6,999 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Gender : chr [1:6999] "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : num [1:6999] 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr [1:6999] "Yes" "No" "No" "No" ...
## $ Dependents : chr [1:6999] "No" "No" "No" "No" ...
## $ Tenure : num [1:6999] 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr [1:6999] "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr [1:6999] "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr [1:6999] "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr [1:6999] "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr [1:6999] "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr [1:6999] "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr [1:6999] "No" "No" "No" "Yes" ...
## $ StreamingTV : chr [1:6999] "No" "No" "No" "No" ...
## $ StreamingMovies : chr [1:6999] "No" "No" "No" "No" ...
## $ Contract : chr [1:6999] "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr [1:6999] "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr [1:6999] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num [1:6999] 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num [1:6999] 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Status : chr [1:6999] "Current" "Current" "Left" "Current" ...
## - attr(*, "spec")=
## .. cols(
## .. Gender = col_character(),
## .. SeniorCitizen = col_double(),
## .. Partner = col_character(),
## .. Dependents = col_character(),
## .. Tenure = col_double(),
## .. PhoneService = col_character(),
## .. MultipleLines = col_character(),
## .. InternetService = col_character(),
## .. OnlineSecurity = col_character(),
## .. OnlineBackup = col_character(),
## .. DeviceProtection = col_character(),
## .. TechSupport = col_character(),
## .. StreamingTV = col_character(),
## .. StreamingMovies = col_character(),
## .. Contract = col_character(),
## .. PaperlessBilling = col_character(),
## .. PaymentMethod = col_character(),
## .. MonthlyCharges = col_double(),
## .. TotalCharges = col_double(),
## .. Status = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
## Gender SeniorCitizen Partner Dependents
## Length:6999 Min. :0.0000 Length:6999 Length:6999
## Class :character 1st Qu.:0.0000 Class :character Class :character
## Mode :character Median :0.0000 Mode :character Mode :character
## Mean :0.1619
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Tenure PhoneService MultipleLines InternetService
## Min. : 0.00 Length:6999 Length:6999 Length:6999
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :29.00 Mode :character Mode :character Mode :character
## Mean :32.38
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## Length:6999 Length:6999 Length:6999 Length:6999
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## StreamingTV StreamingMovies Contract PaperlessBilling
## Length:6999 Length:6999 Length:6999 Length:6999
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaymentMethod MonthlyCharges TotalCharges Status
## Length:6999 Min. : 18.25 Min. : 18.8 Length:6999
## Class :character 1st Qu.: 35.48 1st Qu.: 401.9 Class :character
## Mode :character Median : 70.35 Median :1397.5 Mode :character
## Mean : 64.75 Mean :2283.1
## 3rd Qu.: 89.85 3rd Qu.:3796.9
## Max. :118.75 Max. :8684.8
## NA's :11
## Gender SeniorCitizen Partner Dependents
## 0 0 0 0
## Tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Status
## 0 0 11 0
# Convert categorical variables to factors
categorical_vars <- c("Status", "Gender", "Partner", "Dependents", "PhoneService",
"MultipleLines", "InternetService", "OnlineSecurity",
"OnlineBackup", "DeviceProtection", "TechSupport",
"StreamingTV", "StreamingMovies", "Contract",
"PaperlessBilling", "PaymentMethod")
data[categorical_vars] <- lapply(data[categorical_vars], as.factor)
# Normalize numerical columns
numerical_vars <- c("Tenure", "MonthlyCharges", "TotalCharges")
data[numerical_vars] <- scale(data[numerical_vars])In this first churn distribution we analyzed the percentage of current customers and the customers that have already left, finding that about 75% of customers stayed and about 25% who have already left.
churn_distribution <- data %>%
count(Status) %>%
mutate(Percent = n / sum(n) * 100)
ggplot(churn_distribution, aes(x = Status, y = Percent, fill = Status)) +
geom_bar(stat = "identity", position = position_dodge(), width = 0.7) +
scale_fill_manual(values = c("purple", "yellow")) +
labs(
title = "Churn Distribution",
y = "Percentage of Customers",
x = "Customer Status"
) +
theme_minimal()We then decided to analyze the proportions of churn across different categories of predictors, below are the graphs showing the proportion by status, gender, partner, dependents, Phone service(if one has it or not), multiple lines of phone service, internet service, online security, online backup, device protection, tech support, streaming TV, streaming movies, contract, paperless billing, and payment method.
plot_list <- list()
for (var in categorical_vars) {
p <- ggplot(data, aes_string(x = var, fill = "Status")) +
geom_bar(position = "fill") +
labs(title = paste("Churn by", var), y = "Proportion", x = var) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot_list[[var]] <- p
}
# Display all plots
for (p in plot_list) {
print(p)
}We compared the distribution of numerical values using boxplots for tenure, monthly charges, and total charges between churned and retained customers.
for (var in numerical_vars) {
p <- ggplot(data, aes_string(x = "Status", y = var, fill = "Status")) +
geom_boxplot() +
scale_fill_manual(values = c("steelblue", "tomato")) + # Custom colors
labs(title = paste("Churn by", var), y = var, x = "Status") +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#f7f7f7", color = NA),
panel.background = element_rect(fill = "#f7f7f7", color = NA),
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(color = "#0056b3"),
legend.position = "top"
)
print(p)
}Then we analyzed a dot plot that compares the performance of logistic regression, decision tree and random forest models based on cross validated models.
set.seed(123)
trainIndex <- createDataPartition(data$Status, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary)
log_model <- train(Status ~ ., data = train_data, method = "glm", family = "binomial", trControl = control, metric = "ROC")
tree_model <- train(Status ~ ., data = train_data, method = "rpart", trControl = control, metric = "ROC")
rf_model <- train(Status ~ ., data = train_data, method = "rf", trControl = control, metric = "ROC")best_model <- rf_model
predictions <- predict(best_model, newdata = test_data, type = "prob")
conf_matrix <- confusionMatrix(predict(best_model, newdata = test_data), test_data$Status)
# Print confusion matrix
conf_matrix## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 1471 349
## Left 71 207
##
## Accuracy : 0.7998
## 95% CI : (0.782, 0.8167)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 2.61e-12
##
## Kappa : 0.3883
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9540
## Specificity : 0.3723
## Pos Pred Value : 0.8082
## Neg Pred Value : 0.7446
## Prevalence : 0.7350
## Detection Rate : 0.7011
## Detection Prevalence : 0.8675
## Balanced Accuracy : 0.6631
##
## 'Positive' Class : Current
##
test_data$predicted_prob <- predictions[, 2]
churn_customers <- test_data %>%
filter(predicted_prob > 0.5)
predicted_loss <- sum(churn_customers$MonthlyCharges)
retention_cost <- sum(churn_customers$MonthlyCharges * 0.2)
retained_revenue <- predicted_loss - retention_cost
cat("Predicted monthly revenue loss:", predicted_loss, "\n")## Predicted monthly revenue loss: 122.4018
## Retention incentive cost (20%): 24.48036
## Net retained revenue: 97.92145
Based on the model predictions, it is evident that customer churn can significantly impact the company’s revenue. For customers identified as high-risk of churn:
Revenue Loss: The predicted monthly revenue loss due to these customers leaving is substantial. This indicates a pressing need to address churn to safeguard the company’s financial health.
Retention Incentive Cost: By offering targeted retention incentives, such as discounts or added services, we can retain a significant portion of these at-risk customers. The retention incentive cost, calculated at 20% of their monthly charges, is a fraction of the potential revenue loss.
Net Retained Revenue: Implementing retention strategies could result in a net positive impact on revenue, showcasing the financial benefits of proactive customer retention efforts.
In conclusion, customer retention is pivotal for maintaining a steady revenue stream in subscription-based businesses. Through data preprocessing and exploratory data analysis, we identified patterns and predictors of customer churn. The predictive models developed, particularly the Random Forest model, demonstrate high accuracy in identifying customers at risk of leaving.
By understanding these patterns and implementing targeted retention strategies, businesses can not only mitigate revenue loss but also enhance customer satisfaction and loyalty. This analysis underscores the importance of data-driven decision-making in crafting effective retention strategies and ensuring long-term business sustainability.