As Regork enters the telecommunications market, understanding
customer behavior is essential for long-term success. In this analysis,
we focus on customer retention, using data from the
customer_retention.csv file to build a predictive model.
The goal is to accurately forecast whether a customer will leave in the
future, enabling Regork to take proactive steps to retain valuable
customers.
Customer retention is crucial in the telecommunications industry, as the cost of acquiring new customers often exceeds the cost of keeping existing ones. By developing a model that predicts customer churn, Regork can take targeted actions—such as offering promotions or incentives—based on these predictions. For example, marketing teams can use the model to tailor customer retention campaigns, while the finance team can better forecast company revenue and risk.
This report details the data analysis process, including necessary packages and data preparation, followed by an exploration of trends and relationships between predictors and the response variable. The focus of the analysis is on building and evaluating various machine learning models, including logistic regression, random forest, and decision trees, to identify the best approach for predicting customer churn.
The report provides insights that support model selection and offers recommendations for how Regork can leverage these findings to enhance customer retention strategies.
setwd("/Users/ezishr/Documents/CINCY/Fall 2024/BANA 4080/Final Project") # Local working directory
library(tidyverse) # Data manipulation, visualization, reading data files
library(tidymodels) # Models and recipes use purpose
library(rpart) # Tree-based models use purpose
library(baguette) # Tools for bagging models
library(pdp) # Use for understanding model behavior
library(vip) # Feature importance
library(corrplot) # Correlation matrices visualization
library(patchwork) # Side-by-side or multipanel plots purpose
library(gridExtra) # Arranging multiple plots in a grid
library(kernlab) # Kernel-based machine learning algorithms
1. Read data
First, we load the data from provided csv file and look at the
NA count across every column:
data <- read_csv('customer_retention.csv')
data$SeniorCitizen <- as.character(data$SeniorCitizen)
colSums(is.na(data)) # 11 NAs in TotalCharges
Gender SeniorCitizen Partner Dependents
0 0 0 0
Tenure PhoneService MultipleLines InternetService
0 0 0 0
OnlineSecurity OnlineBackup DeviceProtection TechSupport
0 0 0 0
StreamingTV StreamingMovies Contract PaperlessBilling
0 0 0 0
PaymentMethod MonthlyCharges TotalCharges Status
0 0 11 0
The TotalCharges column currently contains 11 missing
values. When implementing models, we may consider removing this column
due to its potential high multicollinearity with the
MonthlyCharges column.
2. Add Churn Type
We categorized customers with Status =
"Left" based on their contract types into three levels:
Still Active (customers with Status =
"Current"), Contract Fulfilled (customers who left after
their contracts expired), and Early Termination (customers who left
before their contracts ended).
# Convert Contract Into Months --------------------------------------------
sample_df <- data
sample_df <- sample_df %>% mutate(
contract_months = case_when (
Contract == 'Month-to-month' ~ Tenure,
Contract == 'One year' ~ 12,
Contract == 'Two year' ~ 24,
TRUE ~ NA_real_ )
)
sample_df <- sample_df %>% mutate(
contract_period = case_when (
(Tenure <= contract_months) & (Contract != 'Month-to-month') ~ 1,
(Tenure > contract_months) & (Contract != 'Month-to-month') & (Tenure%%contract_months == 0) ~ (Tenure / contract_months),
(Tenure > contract_months) & (Contract != 'Month-to-month') & (Tenure%%contract_months != 0) ~ floor(Tenure / contract_months) + 1,
TRUE ~ Tenure
)
)
sample_df <- sample_df %>% mutate(
total_time = case_when(
(Contract == 'Month-to-month') ~ Tenure,
(Contract != 'Month-to-month') ~ (contract_months * contract_period)
)
)
sample_df <- sample_df %>% mutate(
churn_case = case_when (
(Status == 'Current') ~ 'Still Active',
(Status == 'Left') & (total_time == Tenure) ~ 'Contract fulfilled',
(Status == 'Left') & (total_time > Tenure) & (Contract != 'Month-to-month') ~ 'Early Termination',
TRUE ~ 'Contract fulfilled'
)
)
data <- sample_df %>% select(!total_time)
data$contract_period <- as.character(data$contract_period)
char_cols <- colnames(data)[sapply(data, is.character)]
char_cols <- char_cols[(char_cols!='Status') & (char_cols!='contract_period')]
3. Function: read unique values of each column
The function takes a dataset and a list of character columns as input. It prints the unique values of each specified column, displaying them line by line.
check_unique <- function(dataset, char_cols) {
for (col in char_cols) {
cat('\nUnique values for', col,':\n')
print(unique(dataset[[col]]))
}
}
1. Initial Look of Left Status
In this analysis, we begin by examining the counts of
Status across various services and factors in the
dataset.
plot_list <- list()
for (col in char_cols) {
plot <- ggplot(data = data, aes(x = !!sym(col), fill = Status)) +
geom_bar() +
ylim(0, 7000) +
geom_text(stat = 'count', aes(label = after_stat(count), vjust = -0.5, size = 0.1)) +
labs(title = paste0('Count of Churn by ', col), y='Count')
plot_list[[col]] <- plot
}
grid_plots <- grid.arrange(grobs = plot_list, nrow=6, ncol=3)
The Phone Service column has the highest count of
Status = "Left", totaling 1,687. To understand
this trend, we will analyze customers with Phone Service
and Internet Service separately, along with their add-ons,
to identify potential reasons for the high counts. Additionally, we will
investigate whether customers with both services are more likely to
leave and explore the factors contributing to their departure.
2. Demographic Look
Second, we have a look at demographic:
demographic_cols <- c('Gender', 'SeniorCitizen', 'Partner', 'Dependents')
demo <- data
demo <- demo %>% mutate(
SeniorCitizen = case_when(
SeniorCitizen == 0 ~ 'No',
SeniorCitizen == 1 ~ 'Yes'
)
)
## Charts demographics
for (col in demographic_cols) {
plot <- ggplot(data = demo, aes(x = !!sym(col), fill = Status)) +
geom_bar(position = "dodge") +
labs(title = paste("Count of Status by", col), x = col, y = "Count") +
theme_minimal() +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "bottom") +
geom_text(
stat = 'count',
aes(label = after_stat(count)),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 3)
assign(paste0("plot_", col), plot)
}
plot_Dependents + plot_Gender + plot_Partner + plot_SeniorCitizen
Overall, there were more Current customers than those
who left. However, customers without partners or dependents tended to
leave at higher rates, which could be attributed to the charges for
individuals living alone. We will conduct further analysis to explore
this aspect. Regarding gender, there was no significant difference in
the departure rates between males and females.
3. Churn Category Look
When preparing the dataset, we have classified the churn category into three: Still Active, Contract fulfilled, and Early Termination.
ggplot(data, aes(x = Contract)) +
geom_bar(fill = "skyblue", color = "black") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3, color = "black", size = 3.5) +
facet_grid(churn_case ~ .) +
labs(title = "Churn Cases Count by Contract Type", x = "Contract Type", y = "Count")
As shown above, there are relatively few customers leaving due to
Early Termination, suggesting that most customers tend to
use the service until the end of their contract. However, the
Month-to-month contract type has the highest churn rate,
with approximately half of the customers in this category leaving. This
could indicate that these customers may not have received the expected
level of service or their specific needs and concerns were not
addressed. We will conduct further analysis on this in the
Monthly Charges and Churn Type section.
ggplot(data, aes(PaymentMethod, fill = churn_case)) +
geom_bar() +
facet_wrap( ~Contract ) +
coord_flip() +
labs(title = "Count of Payment Methods Across Contract Types and Churn Cases",
x = "Payment Method",
y = "Count",
fill = "Churn Case"
)
The analysis reveals that the payment methods
Credit Bank and Bank Transfer maintain a
consistent count across all contract types. This uniform distribution
suggests that these payment methods are equally preferred by customers
regardless of their chosen contract type. This consistency highlights
the broad appeal and reliability of these options, making them essential
for meeting customer expectations across the board. Additionally, the
balanced usage of these methods provides operational benefits by
simplifying resource allocation and planning.
The Month-to-month contract type displays an
exceptionally high count for Electronic Check payments. This trend could
signify that customers with shorter-term commitments prefer the
flexibility and immediacy provided by this method.
1. General View
First, we get the data of customers having Phone Service, which means
PhoneService != "No" and have a breif look at unique values
across columns:
having_phone_service <- data[data$PhoneService != 'No', ]
having_phone_service <- having_phone_service[, !(colnames(having_phone_service) %in% c('contract_period', 'contract_months', 'contract_period'))]
char_cols <- colnames(having_phone_service)[sapply(having_phone_service, is_character)]
char_cols <- char_cols[char_cols!='PhoneService']
check_unique(having_phone_service, char_cols)
Unique values for Gender :
[1] "Male" "Female"
Unique values for SeniorCitizen :
[1] "0" "1"
Unique values for Partner :
[1] "No" "Yes"
Unique values for Dependents :
[1] "No" "Yes"
Unique values for MultipleLines :
[1] "No" "Yes"
Unique values for InternetService :
[1] "DSL" "Fiber optic" "No"
Unique values for OnlineSecurity :
[1] "Yes" "No" "No internet service"
Unique values for OnlineBackup :
[1] "No" "Yes" "No internet service"
Unique values for DeviceProtection :
[1] "Yes" "No" "No internet service"
Unique values for TechSupport :
[1] "No" "Yes" "No internet service"
Unique values for StreamingTV :
[1] "No" "Yes" "No internet service"
Unique values for StreamingMovies :
[1] "No" "Yes" "No internet service"
Unique values for Contract :
[1] "One year" "Month-to-month" "Two year"
Unique values for PaperlessBilling :
[1] "No" "Yes"
Unique values for PaymentMethod :
[1] "Mailed check" "Electronic check"
[3] "Credit card (automatic)" "Bank transfer (automatic)"
Unique values for Status :
[1] "Current" "Left"
Unique values for churn_case :
[1] "Still Active" "Contract fulfilled" "Early Termination"
plot_list <- list()
for (col in char_cols) {
plot <- ggplot(data = having_phone_service, aes(x = !!sym(col), fill = Status)) +
geom_bar() +
ylim(0, 7000) +
geom_text(stat = 'count', aes(label = after_stat(count), vjust = -0.5, size = 0.1)) +
labs(title = paste0('Count of Churn by ', col), y='Count')
plot_list[[col]] <- plot
}
grid_plots <- grid.arrange(grobs = plot_list, nrow=5, ncol=4)
As seen above, customers with InternetService =
"Fiber optic" and a Month-to-month contract
type had the highest churn rates.
2. Charges across Contract Type and Internet Service
We will look more in the monthly chagres across the Contract Type and Internet Services.
ggplot(data = having_phone_service, aes(x = Contract, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(title = "Monthly Charges In General by Contract Type and Churn Status",
x = "Contract Type",
y = "Monthly Charges")
Generally, regardless of the type of Internet Service, the median monthly charges for customers who have churned tend to be higher than those for active customers. This suggests that higher charges may be a contributing factor to customer churn, potentially indicating a perception of being overcharged.
have_internet <- having_phone_service[having_phone_service$InternetService != 'No',]
ggplot(data = have_internet, aes(x = Contract, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
facet_wrap(~ InternetService) +
labs(title = "Monthly Charges by Contract Type, Churn Status, and Internet Service",
x = "Contract Type",
y = "Monthly Charges")
When examining specific Internet Services combined with Contract Type
among customers using Phone Service, customers using
Fiber optic typically face higher charges compared to those
with DSL. However, the median charges for both churned and
active customers across Contract Types are not significantly different.
This suggests that customer churn may be influenced more by
dissatisfaction with the service quality rather than pricing.
3. Charges across Contract Type, Multiple Lines, and Internet Service
# Boxplot of Monthly Charges by Contract and Status with NO InternetService
no_internet <- having_phone_service[having_phone_service$InternetService == 'No',]
ggplot(data = no_internet, aes(x = Contract, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
facet_wrap(~ MultipleLines, labeller = as_labeller(c("Yes" = "Has Multiple Lines", "No" = "No Multiple Lines"))) +
labs(title = "Monthly Charges by Contract Type, Multiple Lines, and Churn Status - No Internet Service",
x = "Contract Type",
y = "Monthly Charges")
Churned customers in month-to-month contracts could be more price-sensitive. They might have initially chosen the month-to-month flexibility, attracted by low starting prices or discounts, but left when they found the pricing unsustainable or felt they were not getting value for their money.
no_internet_no_multipleLines <- no_internet[no_internet$MultipleLines == 'No',]
# Any SeniorCitizen with NO Internet & NO Multiplelines --------
senior_no_multiple <- no_internet_no_multipleLines[no_internet_no_multipleLines$SeniorCitizen == 1,] # Subset for senior citizens with no multiple lines
non_senior_no_multiple <- no_internet_no_multipleLines[no_internet_no_multipleLines$SeniorCitizen == 0,] # Subset for non-senior citizens with no multiple lines
# Boxplot of Monthly Charges by Status for Senior Citizens
plot3 <- ggplot(senior_no_multiple, aes(x = Contract, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(title = "Monthly Charges by Status for Senior Citizens (No Internet & No Multiple Lines)",
x = "Status", y = "Monthly Charges")
# Boxplot of Monthly Charges by Status for Non-Senior Citizens
plot4 <- ggplot(non_senior_no_multiple, aes(x = Contract, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(title = "Monthly Charges by Status for Non-Senior Citizens (No Internet & No Multiple Lines)",
x = "Status", y = "Monthly Charges")
plot3 + plot4
Among senior citizens, there is no churn for customers with one-year and two-year contract types when they do not use internet service or multiple lines for their phone service. This suggests that senior citizens who choose these contracts are likely more stable in their decisions, potentially due to the predictability and stability of long-term commitments. Additionally, the fact that they are not using internet service or multiple lines could indicate that they prefer simpler service plans that meet their basic needs, possibly at lower costs, which may increase their satisfaction and loyalty.
In contrast, non-senior citizens are exhibiting churn in all contract types (month-to-month, one-year, and two-year), indicating that they are less committed or more price-sensitive. They might be more likely to switch providers, possibly due to competitive offers or dissatisfaction with the service. The fact that churn occurs across all contract types for non-senior citizens suggests that factors like pricing, service features, or customer experience are significant drivers of churn, even when the charges across these customers appear similar.
Although the charges for non-senior citizens are similar across contract types, the churn rates differ significantly. This could indicate that the reason for churn may not be directly tied to price but possibly to factors like service satisfaction, perceived value, or flexibility.
4. Tenure Distribution
Now we look at the Tenure distribution across contract
types and churn status:
# Plot tenure vs churn for customers with phone service (initial view) -----------
ggplot(data = having_phone_service, aes(x = Tenure, fill = Status)) +
geom_histogram(binwidth = 1, alpha = 0.6) +
facet_wrap(~Contract) +
labs(title = "Tenure Distribution by Contract Type and Churn Status",
x = "Tenure (Months)",
y = "Count")
The tenure distribution by contract type and churn status shows that month-to-month customers tend to churn at lower tenures, mostly between 0-20 months. In contrast, one-year and two-year contract customers are more likely to stay longer, with most of the churn occurring towards the end of the tenure distribution, around 60+ months. This suggests that month-to-month contracts may attract more short-term customers, while longer contracts tend to retain customers for extended periods.
In this part, we will analyze the Internet Service and its add-ons.
First, we get the data of customers having Internet Service, which means
InternetService != "No":
having_internet_service <- data[data$InternetService != 'No', ]
having_internet_service <- having_internet_service[, !(colnames(having_internet_service) %in% c('contract_period', 'contract_months'))]
1. General View
Similar to the analysis of phone service, the churn patterns for customers with internet service across different demographics show comparable trends. This suggests that the company maintains a balanced approach, providing consistent service quality for both phone and internet services, without bias towards either.
char_cols <- colnames(having_internet_service)[sapply(having_internet_service, is_character)]
char_cols <- char_cols[char_cols != 'InternetService']
check_unique(having_internet_service, char_cols)
Unique values for Gender :
[1] "Female" "Male"
Unique values for SeniorCitizen :
[1] "0" "1"
Unique values for Partner :
[1] "Yes" "No"
Unique values for Dependents :
[1] "No" "Yes"
Unique values for PhoneService :
[1] "No" "Yes"
Unique values for MultipleLines :
[1] "No phone service" "No" "Yes"
Unique values for OnlineSecurity :
[1] "No" "Yes"
Unique values for OnlineBackup :
[1] "Yes" "No"
Unique values for DeviceProtection :
[1] "No" "Yes"
Unique values for TechSupport :
[1] "No" "Yes"
Unique values for StreamingTV :
[1] "No" "Yes"
Unique values for StreamingMovies :
[1] "No" "Yes"
Unique values for Contract :
[1] "Month-to-month" "One year" "Two year"
Unique values for PaperlessBilling :
[1] "Yes" "No"
Unique values for PaymentMethod :
[1] "Electronic check" "Mailed check"
[3] "Bank transfer (automatic)" "Credit card (automatic)"
Unique values for Status :
[1] "Current" "Left"
Unique values for churn_case :
[1] "Still Active" "Contract fulfilled" "Early Termination"
plot_list <- list()
for (col in char_cols) {
plot <- ggplot(data = having_internet_service, aes(x = !!sym(col), fill = churn_case)) +
geom_bar() +
labs(title = paste0("Count of Churn by ", col), y = 'Count', fill = "Churn Case")
plot_list[[col]] <- plot
}
grid_plots <- grid.arrange(grobs = plot_list, nrow = 5, ncol = 4)
2. Charges across Contract Type and Internet Service
Next, we examine how monthly charges vary with different add-ons for
customers with Internet Service. The logic behind this
graph is to focus on customers who have Internet Service along with only
one additional feature. For example, the first graph illustrates
customers with Internet Service who have added only Online Security,
without any other add-ons.
plot_list <- list()
add_ons <- c("OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies")
for(add_on in add_ons) {
non_add <- add_ons[add_ons != add_on]
sample1 <- having_internet_service
for (col in non_add) {
sample1 <- sample1[sample1[[col]] == "No",]
}
plot <- ggplot(data = sample1, aes(x = MonthlyCharges, y = Contract, fill = !!sym(add_on))) +
geom_boxplot() +
facet_grid(InternetService ~ Status) +
labs(title = paste0("Monthly Charges by Contract Type and ", add_on),
x = "Monthly Charges",
y = "Contract Type")
plot_list[[add_on]] <- plot
}
grid_plots <- grid.arrange(grobs = plot_list, nrow=3, ncol=2)
plot_list <- list()
for(add_on in add_ons) {
non_add <- add_ons[add_ons != add_on]
sample1 <- having_internet_service
for (col in non_add) {
sample1 <- sample1[sample1[[col]] == "No",]
}
plot <- ggplot(data = sample1, aes(x = Contract, fill = !!sym(add_on))) +
geom_bar(position = "dodge") +
facet_grid(InternetService ~ Status) +
labs(title = paste0("Count of Status by Contract Type and ", add_on),
x = "Contract Type",
y = "Count")
plot_list[[add_on]] <- plot
}
grid_plots <- grid.arrange(grobs = plot_list, nrow=3, ncol=2)
From the grid plots above, it is evident that customers with the
StreamingMovies and StreamingTV add-ons incur
significantly higher charges. This group also exhibits a notably higher
churn rate, suggesting that the increased costs may not align with
customer expectations, prompting them to leave.
3. Tenure Distribution
ggplot(data = having_internet_service, aes(x = Tenure, fill = Status)) +
geom_histogram(binwidth = 1, alpha = 0.6) +
facet_wrap(~Contract) +
labs(title = "Tenure Distribution by Contract Type and Churn Status",
x = "Tenure (Months)",
y = "Count",
fill = "Status")
The similar tenure distribution across both internet and phone services suggests that customer retention patterns are consistent between the two. This implies that the factors influencing churn, such as pricing, service quality, or competition, are likely universal and not specific to either service.
1. Initial Look
We first look at the churn_case count across demographic
features. Generally, customers tent to churn when their contracts
expired. However, customers that are SeniorCitizen are the
least likey to churn, which might imply that they prefer stable
service.
demographic_cols <- c('Gender', 'SeniorCitizen', 'Partner', 'Dependents')
demo <- data
demo <- demo %>% mutate(
SeniorCitizen = case_when(
SeniorCitizen == 0 ~ 'No',
SeniorCitizen == 1 ~ 'Yes'
)
)
## Charts demographics
for (col in demographic_cols) {
plot <- ggplot(data = demo, aes(x = !!sym(col), fill = churn_case)) +
geom_bar(position = "dodge") +
labs(title = paste("Count of Status by", col), x = col, y = "Count") +
theme_minimal() +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "bottom")
assign(paste0("plot_", col), plot)
}
plot_Dependents + plot_Gender + plot_Partner + plot_SeniorCitizen
Now we look further into the monthly charges. We also extract data about customers having no dependents and customers having no partner to see if there is significant difference in charge for them.
no_dependents <- demo[demo$Dependents == 'No',]
has_dependents <- demo[demo$Dependents == 'Yes',]
no_partner <- demo[demo$Partner == 'No',]
has_partner <- demo[demo$Partner == 'Yes',]
2. Dependents Perspective
plot1 <- ggplot(no_dependents, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(
title = "Monthly Charges Distribution by Status (No Dependents)",
x = "Status",
y = "Monthly Charges"
) +
theme_minimal()
plot2 <- ggplot(has_dependents, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(
title = "Monthly Charges Distribution by Status (Have Dependents)",
x = "Status",
y = "Monthly Charges"
) +
theme_minimal()
plot1 + plot2
Insights:
Left:
3. Partner Perspective
plot1 <- ggplot(no_partner, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(
title = "Monthly Charges Distribution by Status (No Partner)",
x = "Status",
y = "Monthly Charges"
) +
theme_minimal()
plot2 <- ggplot(has_partner, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(
title = "Monthly Charges Distribution by Status (No Partner)",
x = "Status",
y = "Monthly Charges"
) +
theme_minimal()
plot1 + plot2
Insights:
Dependents perspective above, the customers
who have left had been monthly charged higher than those are still on
service.Business Implication:
ggplot(data, aes(x = churn_case, y = MonthlyCharges, fill = Contract)) +
geom_boxplot() +
labs(
title = "Monthly Charges by Contract Type and Churn Case",
x = "Churn Case",
y = "Monthly Charges"
) +
theme_minimal() +
theme(legend.position = "right")
The average monthly charge for customers with a
Month-to-month contract is slightly higher compared to
other contract types. This likely reflects the comprehensive range of
services utilized by these customers. Therefore, it is unlikely that
monthly charges are a significant factor contributing to customer churn
in this group.
Since certain predictor variables are dependent on other columns
(e.g., MultipleLines depends on PhoneService),
it is important to cleanse these columns first to avoid
Multicollinearity in logistic regression
models. We will create custom processing functions for the
InternetService and PhoneService columns,
ensuring that each add-on for these services is grouped and combined
into a single column. This will help prevent issues with
multicollinearity and improve the model’s accuracy.
process_internet_features <- function(df) {
new_data <- df %>%
mutate(
internet_status = case_when(
InternetService == "No" ~ "No Internet",
InternetService == "DSL" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "No")) == 3 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "No")) == 2 &
TechSupport == "No" ~ "DSL Basic",
InternetService == "DSL" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) > 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) > 0 &
TechSupport == "Yes" ~ "DSL Full",
InternetService == "DSL" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) > 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) <= 0 &
TechSupport != "Yes" ~ "DSL Security",
InternetService == "DSL" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) <= 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) > 0 &
TechSupport != "Yes" ~ "DSL Entertainment",
InternetService == "DSL" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) <= 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) <= 0 &
TechSupport == "Yes" ~ "DSL Support",
InternetService == "DSL" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) > 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) <= 0 &
TechSupport == "Yes" ~ "DSL Security & Support",
InternetService == "DSL" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) <= 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) > 0 &
TechSupport == "Yes" ~ "DSL Entertainment & Support",
InternetService == "DSL" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) > 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) > 0 &
TechSupport == "No" ~ "DSL Security & Entertainment",
InternetService == "Fiber optic" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "No")) == 3 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "No")) == 2 &
TechSupport == "No" ~ "Fiber optic Basic",
InternetService == "Fiber optic" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) > 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) > 0 &
TechSupport == "Yes" ~ "Fiber optic Full",
InternetService == "Fiber optic" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) > 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) <= 0 &
TechSupport != "Yes" ~ "Fiber optic Security",
InternetService == "Fiber optic" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) <= 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) > 0 &
TechSupport != "Yes" ~ "Fiber optic Entertainment",
InternetService == "Fiber optic" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) <= 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) <= 0 &
TechSupport == "Yes" ~ "Fiber optic Support",
InternetService == "Fiber optic" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) > 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) <= 0 &
TechSupport == "Yes" ~ "Fiber optic Security & Support",
InternetService == "Fiber optic" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) <= 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) > 0 &
TechSupport == "Yes" ~ "Fiber optic Entertainment & Support",
InternetService == "Fiber optic" &
rowSums(across(c(OnlineSecurity, OnlineBackup, DeviceProtection), ~ .x == "Yes")) > 0 &
rowSums(across(c(StreamingTV, StreamingMovies), ~ .x == "Yes")) > 0 &
TechSupport == "No" ~ "Fiber optic Security & Entertainment",
TRUE ~ NA_character_
)
) %>% select(-c(InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, StreamingTV, StreamingMovies, TechSupport))
return(new_data)
}
process_phone_feature <- function(df) {
df %>% mutate (
phone_status = case_when(
PhoneService == "No" ~ "No",
PhoneService == "Yes" & MultipleLines == "No" ~ "Single Line",
PhoneService == "Yes" & MultipleLines == "Yes" ~ "Multiple Lines",
TRUE ~ NA_character_
)
) %>% select (-c(PhoneService, MultipleLines))
}
Now we apply the above processors to the dataset and remove the
column TotalCharges since it has high correlation with
MonthlyCharges, which might create multicollinearity. We
also have a look at the first 5 rows:
df1 <- data %>% select(-c("contract_months", "contract_period", "churn_case", "TotalCharges"))
df2 <- process_internet_features(df1)
df2 <- process_phone_feature(df2)
head(df2, 5)
We create a Logistic Regression model with
the Classification mode since the target
variable Status has two categories: Current and
Left. The model includes 5-fold cross-validation, a recipe
for creating dummy variables for categorical columns, and normalizing
the values for numerical columns. Additionally, a tuning grid is applied
to optimize the model’s parameters.
# Convert Status to a factor and specify levels
df2$Status <- factor(df2$Status)
# Split data into training and testing sets
set.seed(123)
data_split <- initial_split(data = df2, prop = 0.7, strata = Status)
train <- training(data_split)
test <- testing(data_split)
# Set up cross-validation
set.seed(123)
logistic_kfolds <- vfold_cv(train, v = 5, strata = Status)
logistic_recipe <- recipe(Status ~ ., data = train) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
# Define logistic regression model
logistic_ridge_model <- logistic_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet") %>%
set_mode("classification")
# Create workflow
logistic_ridge_wf <- workflow() %>%
add_recipe(logistic_recipe) %>%
add_model(logistic_ridge_model)
# Create hyperparameter search grid
logistic_grid <- grid_regular(mixture(), penalty(c(-10,5)), levels = 10)
# Perform hyperparamter search
tuning_results <- logistic_ridge_wf %>%
tune_grid(resamples = logistic_kfolds, grid = logistic_grid)
print(tuning_results %>%
collect_metrics() %>%
dplyr::filter(.metric == "roc_auc") %>%
arrange(desc(mean)))
autoplot(tuning_results)
With such low regularization values (ranging from around 1e-07 to
1e-03), the model can fit the training data well, achieving a high ROC
AUC score (around 0.84) during cross-validation. This suggests that the
features in the dataset are highly informative, enabling the Logistic
Regression model to effectively distinguish between the
Status classes, even with minimal regularization.
Now, we choose the best from tuning search to apply on our test dataset and see top 10 important features:
best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")
final_wf <- workflow() %>%
add_recipe(logistic_recipe) %>%
add_model(logistic_ridge_model) %>%
finalize_workflow(best_hyperparameters)
final_fit <- final_wf %>%
fit(data = train)
final_fit %>%
extract_fit_parsnip() %>%
vip()
Tenure appears to be the most influential factor in
predicting customer churn. Following that, MonthlyCharges
also plays a significant role in customer departures. This suggests that
we should further investigate the current charges, especially in
conjunction with the previous analysis of different services, to gain
deeper insights into the factors driving churn.
Let have a look at ROC curve, which measures the model’s ability to
distinguish between the two classes of Status:
final_fit %>%
predict(test, type = "prob") %>%
mutate(truth = test$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
print(final_fit %>%
predict(test, type = "prob") %>%
mutate(truth = test$Status) %>%
roc_auc(truth, .pred_Current)
)
print(final_fit %>%
predict(test) %>%
bind_cols(test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
)
Truth
Prediction Current Left
Current 1411 258
Left 132 299
The curve’s proximity to the top-left corner indicates that the Logistic Regression model achieves high sensitivity while maintaining a low false positive rate at optimal thresholds.
The model’s AUC score of 0.85 (as previously calculated) further validates its effectiveness in distinguishing between classes, with values closer to 1 signifying excellent performance.
Now we test the Decision Tree model with same training and testing
dataset. To ensure consistency, the logistic_recipe was
applied as a preprocessing step:
decision_tree_model <- decision_tree(mode = "classification") %>%
set_engine("rpart")
decision_tree_fit <- workflow() %>%
add_recipe(logistic_recipe) %>%
add_model(decision_tree_model) %>%
fit(data = train)
rpart.plot::rpart.plot(decision_tree_fit$fit$fit$fit)
Warning: Cannot retrieve the data used to build the model (model.frame: object '..y' not found).
To silence this warning:
Call rpart.plot with roundint=FALSE,
or rebuild the rpart model with model=TRUE.
The Decision Tree model identified
Contract_Month.to.month as the key predictor of churn,
prioritizing the flexibility of short-term contracts as a significant
risk factor. This information is also seen when we analyzed the
services. In contrast, the Logistic Regression model highlighted
Tenure, suggesting that customers with longer relationships
are less likely to churn.
This difference reflects the models’ approaches: the Decision Tree focuses on immediate splits in the data, while Logistic Regression emphasizes linear relationships. Together, these insights suggest targeting month-to-month customers and fostering long-term loyalty to reduce churn.
Now we do hyperparameters tuning in other to choose the best model.
We choose cost_complexity, tree_depth, and
min_n to tune:
decision_tree_model <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
decision_tree_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
set.seed(123)
decision_tree_results <- tune_grid(decision_tree_model, logistic_recipe, resamples = logistic_kfolds, grid = decision_tree_hyper_grid)
show_best(decision_tree_results, metric = "roc_auc", n = 5)
The roc_auc score of this model is slightly lower than
those from Logistic Regression. We continue to choose the best model
from these tuning results and apply it on test dataset:
dt_best_model <- select_best(decision_tree_results, metric = 'roc_auc')
dt_final_wf <- workflow() %>%
add_recipe(logistic_recipe) %>%
add_model(decision_tree_model) %>%
finalize_workflow(dt_best_model)
dt_final_fit <- dt_final_wf %>%
fit(data = train)
dt_final_fit %>%
predict(test, type = "prob") %>%
mutate(truth = test$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
print(dt_final_fit %>%
predict(test) %>%
bind_cols(test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class))
Truth
Prediction Current Left
Current 1365 262
Left 178 295
print(dt_final_fit %>%
predict(test, type = "prob") %>%
mutate(truth = test$Status) %>%
roc_auc(truth, .pred_Current)
)
Compared with Logistic Regression model, the Decision Tree model performs slightly worse.
Now we create Random Forest model, which has a number of sub trees, to see if it performs better:
rf_mod <- rand_forest(mode = "classification") %>%
set_engine("ranger")
rf_results <- fit_resamples(rf_mod, logistic_recipe, logistic_kfolds)
collect_metrics(rf_results)
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
rf_hyper_grid <- grid_regular(
trees(range = c(50, 800)),
mtry(range = c(2, 39)),
min_n(range = c(1, 20)),
levels = 5
)
set.seed(123)
rf_results <- tune_grid(rf_mod, logistic_recipe, resamples = logistic_kfolds, grid = rf_hyper_grid)
show_best(rf_results, metric = "roc_auc")
Generally, the roc_auc scores across different models
converges to 0.84, which is higher than Decision Tree model but still
lower than the Logistic Regressions model. Now we choose the best model
from these hyperparameters tuning and predict with our test
dataset:
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
final_rf_wf <- workflow() %>%
add_recipe(logistic_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- final_rf_wf %>%
fit(data = train)
rf_final_fit %>%
predict(test, type = "prob") %>%
mutate(truth = test$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
print(rf_final_fit %>%
predict(test) %>%
bind_cols(test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class))
Truth
Prediction Current Left
Current 1468 362
Left 75 195
print(rf_final_fit %>%
predict(test, type = "prob") %>%
mutate(truth = test$Status) %>%
roc_auc(truth, .pred_Current)
)
From the confusion matrix, we observe that the model performs
significantly better at predicting the Current class
compared to the Left class, with a higher number of true
positives and fewer false negatives for Current. However,
its performance in predicting the Left class is weaker, as
indicated by the higher number of false negatives.
InternetService and
PhoneService columns, ensuring each add-on was grouped into
a single column.Status variable (Current or
Left). The model included:
Tenure emerged as the most influential predictor of
churn, followed by MonthlyCharges.logistic_recipe
preprocessing step was reused to maintain consistency.Contract_Month.to.month as the
most significant risk factor for churn, emphasizing the flexibility of
short-term contracts.Tenure, the Decision Tree focused on splits like contract
type.Current
class, with a high number of true positives and fewer false
negatives.Left class was
weaker, with more false negatives observed.Contract_Month.to.month but fell short in overall
accuracy.Current class but struggled with Left,
reflecting its focus on overall prediction stability rather than
class-specific performance.This analysis underscores the importance of understanding customer behavior through demographic and service-related lenses. Pricing adjustments, tailored service plans, and a focus on enhancing customer experience can significantly improve retention rates. By addressing the specific needs of diverse customer groups, the company can build stronger relationships and reduce churn effectively.