Customer churn — the phenomenon where customers discontinue a service — is one of the most critical challenges facing telecommunications companies worldwide. Acquiring a new customer costs five to ten times more than retaining an existing one, making churn prediction a high-priority business problem. By identifying customers likely to leave before they actually do, companies can deploy targeted retention campaigns, personalise offers, and improve service quality.
Data-driven machine learning approaches enable companies to analyse behavioural and demographic patterns at scale and build predictive models that generalise across millions of customers.
This project uses the Customer Churn Prediction Dataset (1M) published on Kaggle by isandeep06 (2025). The dataset is a large-scale synthetic telecom churn dataset modelled on real-world patterns, containing 1,000,000 customer records across multiple features capturing demographics, service subscriptions, contract details, and billing information.
Yes) or stay (No) based on
customer tenure, monthly charges, total charges, customer satisfaction,
number of complaints, service calls, late payments, and credit
score.totalcharges) based on customer tenure, monthly charges,
their interaction effect, services, and other customer-related
variables.# ── IMPORTANT: Change this path to where your CSV file is saved ──────────
# Example Windows: "C:/Users/YourName/Downloads/customer_churn_1m.csv"
# Example Mac: "/Users/YourName/Downloads/customer_churn_1m.csv"
# data.table::fread() is much faster than read.csv() for large files
csv_path <- "customer_churn_1m.csv"
if (!file.exists(csv_path)) {
message("CSV not found in the current folder: ", normalizePath(csv_path, mustWork = FALSE))
message("Please choose the customer_churn_1m.csv file manually.")
csv_path <- file.choose()
}
df_raw <- fread(csv_path, stringsAsFactors = FALSE)
cat("CSV path used:", csv_path, "\n")## CSV path used: customer_churn_1m.csv
# Convert to standard data frame for compatibility with all functions
df_raw <- as.data.frame(df_raw)
cat("Dataset loaded successfully!\n")## Dataset loaded successfully!
## Rows: 1000000
## Columns: 32
## Dimension: 32000000
The dataset contains 1,000,000 customer records. In the working CSV file used for this project, there are 32 columns, including customer demographic variables, account information, service usage variables, behavioural risk variables, and the churn target variable.
## 'data.frame': 1000000 obs. of 32 variables:
## $ customer_id : chr "CUST0000000001" "CUST0000000002" "CUST0000000003" "CUST0000000004" ...
## $ signup_date : POSIXct, format: "2022-12-12 12:53:58" "2022-01-13 12:53:58" ...
## $ age : int 43 18 38 44 45 55 47 39 31 38 ...
## $ gender : chr "Female" "Male" "Female" "Male" ...
## $ annual_income : num 55085 60786 73184 40924 36401 ...
## $ education : chr "college" "master" "high_school" "high_school" ...
## $ marital_status : chr "married" "married" "widowed" "married" ...
## $ dependents : int 1 1 0 1 0 1 0 0 0 0 ...
## $ tenure : int 2 22 3 6 9 47 26 18 1 31 ...
## $ contract : chr "two_year" "one_year" "two_year" "two_year" ...
## $ payment_method : chr "electronic_check" "bank_transfer" "credit_card" "bank_transfer" ...
## $ paperless_billing : chr "Yes" "Yes" "No" "Yes" ...
## $ senior_citizen : int 0 0 0 0 0 0 0 0 1 0 ...
## $ monthlycharges : num 67.2 71.5 112.2 107.5 110 ...
## $ totalcharges : num 144 1602 329 643 649 ...
## $ num_services : int 1 2 4 3 3 2 4 3 4 1 ...
## $ has_phone_service : int 1 0 1 1 0 1 1 1 1 1 ...
## $ has_internet_service : int 1 1 1 1 1 1 1 1 1 1 ...
## $ has_online_security : int 0 0 0 0 0 0 1 1 1 0 ...
## $ has_online_backup : int 0 1 0 0 0 0 1 0 1 0 ...
## $ has_device_protection : int 1 0 1 1 0 0 0 0 1 0 ...
## $ has_tech_support : int 1 0 1 1 0 0 0 1 1 1 ...
## $ has_streaming_tv : int 1 0 1 0 0 1 1 1 0 0 ...
## $ has_streaming_movies : int 1 1 0 1 1 1 1 0 1 0 ...
## $ customer_satisfaction : num 9 7 6 5 8 8 4 5 4 5 ...
## $ num_complaints : num 0 0 1 2 1 0 2 2 NA 0 ...
## $ num_service_calls : int 0 3 1 2 1 0 3 4 0 1 ...
## $ late_payments : int 0 1 0 1 0 0 0 0 0 0 ...
## $ avg_monthly_gb : num 109.6 63.2 47.8 50.8 16.7 ...
## $ days_since_last_interaction: int 16 134 11 6 18 22 11 106 28 91 ...
## $ credit_score : num NA 585 632 569 657 702 716 525 637 850 ...
## $ churn : int 0 0 0 0 0 0 0 0 0 0 ...
## customer_id signup_date age gender annual_income education
## 1 CUST0000000001 2022-12-12 12:53:58 43 Female 55085.25 college
## 2 CUST0000000002 2022-01-13 12:53:58 18 Male 60786.11 master
## 3 CUST0000000003 2023-09-04 12:53:58 38 Female 73184.32 high_school
## 4 CUST0000000004 2022-06-27 12:53:58 44 Male 40923.78 high_school
## 5 CUST0000000005 2022-12-08 12:53:58 45 Female 36400.94 bachelor
## 6 CUST0000000006 2023-01-12 12:53:58 55 Female 30771.38 master
## marital_status dependents tenure contract payment_method paperless_billing
## 1 married 1 2 two_year electronic_check Yes
## 2 married 1 22 one_year bank_transfer Yes
## 3 widowed 0 3 two_year credit_card No
## 4 married 1 6 two_year bank_transfer Yes
## 5 single 0 9 two_year credit_card Yes
## 6 married 1 47 two_year credit_card Yes
## senior_citizen monthlycharges totalcharges num_services has_phone_service
## 1 0 67.20 144.39 1 1
## 2 0 71.54 1602.22 2 0
## 3 0 112.20 328.81 4 1
## 4 0 107.49 643.45 3 1
## 5 0 110.05 648.79 3 0
## 6 0 71.80 3041.87 2 1
## has_internet_service has_online_security has_online_backup
## 1 1 0 0
## 2 1 0 1
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
## has_device_protection has_tech_support has_streaming_tv has_streaming_movies
## 1 1 1 1 1
## 2 0 0 0 1
## 3 1 1 1 0
## 4 1 1 0 1
## 5 0 0 0 1
## 6 0 0 1 1
## customer_satisfaction num_complaints num_service_calls late_payments
## 1 9 0 0 0
## 2 7 0 3 1
## 3 6 1 1 0
## 4 5 2 2 1
## 5 8 1 1 0
## 6 8 0 0 0
## avg_monthly_gb days_since_last_interaction credit_score churn
## 1 109.63 16 NA 0
## 2 63.25 134 585 0
## 3 47.77 11 632 0
## 4 50.82 6 569 0
## 5 16.74 18 657 0
## 6 24.56 22 702 0
## customer_id signup_date age
## Length :1000000 Min. :2021-01-13 12:53:58 Min. :18.00
## N.unique :1000000 1st Qu.:2022-04-14 12:54:00 1st Qu.:34.00
## N.blank : 0 Median :2023-07-14 12:54:09 Median :44.00
## Min.nchar: 14 Mean :2023-07-14 13:39:07 Mean :44.72
## Max.nchar: 24 3rd Qu.:2024-10-12 12:54:09 3rd Qu.:55.00
## Max. :2026-01-11 12:54:09 Max. :90.00
##
## gender annual_income education marital_status
## Length :1000000 Min. : 20000 Length :1000000 Length :1000000
## N.unique : 3 1st Qu.: 32713 N.unique : 5 N.unique : 4
## N.blank : 0 Median : 48955 N.blank : 0 N.blank : 0
## Min.nchar: 4 Mean : 58788 Min.nchar: 3 Min.nchar: 6
## Max.nchar: 6 3rd Qu.: 73475 Max.nchar: 11 Max.nchar: 8
## Max. :250000
## NAs :29959
## dependents tenure contract payment_method
## Min. :0.0000 Min. : 1.00 Length :1000000 Length :1000000
## 1st Qu.:0.0000 1st Qu.: 6.00 N.unique : 3 N.unique : 4
## Median :1.0000 Median :16.00 N.blank : 0 N.blank : 0
## Mean :0.7998 Mean :22.38 Min.nchar: 8 Min.nchar: 11
## 3rd Qu.:1.0000 3rd Qu.:33.00 Max.nchar: 14 Max.nchar: 16
## Max. :5.0000 Max. :72.00
##
## paperless_billing senior_citizen monthlycharges totalcharges
## Length :1000000 Min. :0.0000 Min. : 20.00 Min. : 16.3
## N.unique : 2 1st Qu.:0.0000 1st Qu.: 70.49 1st Qu.: 484.5
## N.blank : 0 Median :0.0000 Median : 85.48 Median : 1249.8
## Min.nchar: 2 Mean :0.1995 Mean : 86.44 Mean : 1837.3
## Max.nchar: 3 3rd Qu.:0.0000 3rd Qu.:100.68 3rd Qu.: 2617.7
## Max. :1.0000 Max. :854.96 Max. :16252.9
##
## num_services has_phone_service has_internet_service has_online_security
## Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :2.000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :2.564 Mean :0.7696 Mean :0.8497 Mean :0.3395
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :6.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## has_online_backup has_device_protection has_tech_support has_streaming_tv
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.0
## Mean :0.4257 Mean :0.2971 Mean :0.4973 Mean :0.6
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0
##
## has_streaming_movies customer_satisfaction num_complaints num_service_calls
## Min. :0.0000 Min. :1.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:4.000 1st Qu.:0.0000 1st Qu.: 1.00
## Median :1.0000 Median :7.000 Median :1.0000 Median : 1.00
## Mean :0.5491 Mean :6.159 Mean :0.7017 Mean : 1.76
## 3rd Qu.:1.0000 3rd Qu.:8.000 3rd Qu.:1.0000 3rd Qu.: 3.00
## Max. :1.0000 Max. :9.000 Max. :7.0000 Max. :12.00
## NAs :19921 NAs :29906
## late_payments avg_monthly_gb days_since_last_interaction credit_score
## Min. :0.000 Min. : 0.00 Min. : 1.00 Min. :300.0
## 1st Qu.:0.000 1st Qu.: 12.84 1st Qu.: 12.00 1st Qu.:619.0
## Median :0.000 Median : 27.77 Median : 31.00 Median :680.0
## Mean :0.401 Mean : 39.10 Mean : 44.49 Mean :678.6
## 3rd Qu.:1.000 3rd Qu.: 51.23 3rd Qu.: 62.00 3rd Qu.:740.0
## Max. :5.000 Max. :557.82 Max. :365.00 Max. :850.0
## NAs :50012 NAs :40395
## churn
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.09923
## 3rd Qu.:0.00000
## Max. :1.00000
##
The summary statistics show that the numeric variables are generally within reasonable ranges. For example, customer age is within the expected adult range, and tenure ranges from new customers to long-term customers. Monthly charges and total charges also show large variation, which is reasonable for telecom customers with different contract types and service usage. Overall, there are no obvious extreme data-entry errors based on the summary output.
# ── Missing values per column ─────────────────────────────────────────────
missing_counts <- colSums(is.na(df_raw))
missing_pct <- round(missing_counts / nrow(df_raw) * 100, 2)
missing_df <- data.frame(Column = names(missing_counts),
Missing_Count = missing_counts,
Missing_Pct = missing_pct)
missing_df <- missing_df[missing_df$Missing_Count > 0, ]
if (nrow(missing_df) == 0) {
cat("No missing values found in any column.\n")
} else {
print(missing_df)
}## Column Missing_Count Missing_Pct
## annual_income annual_income 29959 3.00
## customer_satisfaction customer_satisfaction 19921 1.99
## num_complaints num_complaints 29906 2.99
## avg_monthly_gb avg_monthly_gb 50012 5.00
## credit_score credit_score 40395 4.04
The missing value check shows that a few variables contain missing
values, including annual_income,
customer_satisfaction, num_complaints,
avg_monthly_gb, and credit_score. These
missing values are not removed in this section. They are handled later
in the data cleaning and processing section.
# ── Target variable distribution (Churn) ─────────────────────────────────
churn_table <- table(df_raw$churn)
churn_pct <- round(prop.table(churn_table) * 100, 2)
cat("Churn Distribution:\n")## Churn Distribution:
##
## 0 1
## 900773 99227
##
## Churn Percentage:
##
## 0 1
## 90.08 9.92
The churn distribution shows that the dataset is imbalanced, with fewer churned customers than retained customers. This is important because model accuracy alone may not be enough for evaluating the classification model. Therefore, other metrics such as precision, recall, and F1-score are also considered later.
The dataset contains the following 32 columns:
| # | Column | Type | Description |
|---|---|---|---|
| 1 | customer_id | Character | Unique customer identifier |
| 2 | signup_date | Character | Customer signup date |
| 3 | age | Numeric | Customer age |
| 4 | gender | Character | Customer gender |
| 5 | annual_income | Numeric | Annual customer income |
| 6 | education | Character | Education level |
| 7 | marital_status | Character | Marital status |
| 8 | dependents | Character | Whether customer has dependents |
| 9 | tenure | Numeric | Months as a customer |
| 10 | contract | Character | Contract type |
| 11 | payment_method | Character | Payment method |
| 12 | paperless_billing | Character | Whether customer uses paperless billing |
| 13 | senior_citizen | Character | Senior citizen status |
| 14 | monthlycharges | Numeric | Monthly charges |
| 15 | totalcharges | Numeric | Total accumulated charges (Regression target) |
| 16 | num_services | Numeric | Number of subscribed services |
| 17 | has_phone_service | Character | Phone service subscription |
| 18 | has_internet_service | Character | Internet service subscription |
| 19 | has_online_security | Character | Online security subscription |
| 20 | has_online_backup | Character | Online backup subscription |
| 21 | has_device_protection | Character | Device protection subscription |
| 22 | has_tech_support | Character | Tech support subscription |
| 23 | has_streaming_tv | Character | Streaming TV subscription |
| 24 | has_streaming_movies | Character | Streaming movie subscription |
| 25 | customer_satisfaction | Numeric | Customer satisfaction score |
| 26 | num_complaints | Numeric | Number of complaints |
| 27 | num_service_calls | Numeric | Number of service calls |
| 28 | late_payments | Numeric | Number of late payments |
| 29 | avg_monthly_gb | Numeric | Average monthly internet usage (GB) |
| 30 | days_since_last_interaction | Numeric | Days since last interaction |
| 31 | credit_score | Numeric | Customer credit score |
| 32 | churn | Character | Churn status (Classification target) |
Overall, this dataset is suitable for this project because it
supports three types of data mining tasks. First, the churn
variable can be used for classification to predict whether a customer
will leave or stay. Second, totalcharges is a continuous
numeric variable, so it can be used for regression. Third, customer
behaviour and account variables such as tenure,
monthlycharges, num_services,
customer_satisfaction, num_complaints, and
late_payments can be used for clustering to group similar
customers. Therefore, the dataset matches the objectives of
classification, regression, and clustering in this project.
Since this is a synthetic dataset, the findings should be interpreted as an academic simulation rather than direct evidence from a real telecom company.
Data cleaning is applied in a structured sequence of steps to produce a reliable, analysis-ready dataset.
# ── Make a working copy ───────────────────────────────────────────────────
df <- df_raw
# ── Remove customer_id: unique identifier with no predictive value ───────
df$customer_id <- NULL
df$signup_date <- NULL
cat("customer_id removed. Columns remaining:", ncol(df), "\n")## customer_id removed. Columns remaining: 30
## Handle missing values
# Mode function for categorical variables
get_mode <- function(x) {
ux <- na.omit(unique(x))
ux[which.max(tabulate(match(x, ux)))]
}
# Numeric columns: impute with median
numeric_cols <- c(
"age", "annual_income", "tenure", "monthlycharges", "totalcharges",
"num_services", "customer_satisfaction", "num_complaints",
"num_service_calls", "late_payments", "avg_monthly_gb",
"days_since_last_interaction", "credit_score"
)
for (col in numeric_cols) {
if (col %in% colnames(df) && sum(is.na(df[[col]])) > 0) {
df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}
}
# Categorical columns: impute with mode
factor_cols <- c(
"gender", "education", "marital_status", "dependents",
"contract", "payment_method", "paperless_billing", "senior_citizen",
"has_phone_service", "has_internet_service",
"has_online_security", "has_online_backup",
"has_device_protection", "has_tech_support",
"has_streaming_tv", "has_streaming_movies", "churn"
)
for (col in factor_cols) {
if (col %in% colnames(df) && sum(is.na(df[[col]])) > 0) {
df[[col]][is.na(df[[col]])] <- get_mode(df[[col]])
}
}
cat("Missing values handled.\n")## Missing values handled.
## Convert all variables to correct types
factor_cols <- c(
"gender", "education", "marital_status", "dependents",
"contract", "payment_method", "paperless_billing", "senior_citizen",
"has_phone_service", "has_internet_service",
"has_online_security", "has_online_backup",
"has_device_protection", "has_tech_support",
"has_streaming_tv", "has_streaming_movies", "churn"
)
numeric_cols <- c(
"age", "annual_income", "tenure", "monthlycharges", "totalcharges",
"num_services", "customer_satisfaction", "num_complaints",
"num_service_calls", "late_payments", "avg_monthly_gb",
"days_since_last_interaction", "credit_score"
)
# Convert to factor
for (col in factor_cols) {
if (col %in% colnames(df)) df[[col]] <- as.factor(df[[col]])
}
# Convert to numeric
for (col in numeric_cols) {
if (col %in% colnames(df)) {
df[[col]] <- as.numeric(as.character(df[[col]]))
df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}
}
cat("All variable types converted successfully.\n")## All variable types converted successfully.
##Treat outliers
cap_outliers <- function(x) {
if (!is.numeric(x)) return(x)
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
if (iqr == 0) return(x)
pmax(pmin(x, q[2] + 1.5*iqr), q[1] - 1.5*iqr)
}
outlier_vars <- c("monthlycharges", "annual_income")
for (col in outlier_vars) {
if (col %in% colnames(df)) df[[col]] <- cap_outliers(df[[col]])
}
cat("Outliers treated.\n")## Outliers treated.
# Save cleaned data to CSV for later use
write.csv(df, "cleaned_customer_data.csv", row.names = FALSE)
# Verify save
cat("Cleaned dataset saved successfully.\n")## Cleaned dataset saved successfully.
## File: cleaned_customer_data.csv
## Rows saved: 1000000
## Columns saved: 30
# Final cleaned dataset ready for EDA and modelling
df_clean <- df
cat("df_clean is ready for EDA and modelling.\n")## df_clean is ready for EDA and modelling.
In this section, we explore the cleaned dataset to understand the patterns in customer behaviour and identify which variables are likely to be useful for our research questions on churn prediction (Q1), total charges prediction (Q2), and customer segmentation (Q3).
Before anything else, we need to understand how balanced our target variable is. If the data is heavily imbalanced, the classification model may be biased.
# —— Bar chart: how many customers churned vs stayed ——————————————————————
churn_counts <- data.frame(
Status = c("Not Churn", "Churn"),
Count = c(sum(as.character(df_clean$churn) == "0"),
sum(as.character(df_clean$churn) == "1"))
) %>%
mutate(Pct = round(Count / sum(Count) * 100, 1))
ggplot(churn_counts, aes(x = Status, y = Count, fill = Status)) +
geom_bar(stat = "identity", width = 0.5) +
geom_text(aes(label = paste0(Pct, "%")),
vjust = -0.5, size = 5, fontface = "bold") +
scale_fill_manual(values = c("Churn" = "#E74C3C",
"Not Churn" = "#2C3E50")) +
scale_y_continuous(labels = comma) +
labs(title = "Customer Churn Distribution",
x = "Status", y = "Number of Customers") +
theme_minimal(base_size = 13) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))From the chart, only 9.9% of customers churned while 90.1% stayed. This means the dataset is quite imbalanced. When we build the classification model later, we will need to handle this imbalance so the model does not just predict “Not Churn” for everyone.
Contract type is one of the most commonly discussed factors in churn analysis. Here we check whether customers on different contract types have different churn rates.
# —— Stacked bar: churn rate by contract type —————————————————————————————
df_clean %>%
mutate(Churn_Label = ifelse(as.character(churn) == "1",
"Churn", "Not Churn")) %>%
count(contract, Churn_Label) %>%
ggplot(aes(x = contract, y = n, fill = Churn_Label)) +
geom_bar(stat = "identity", position = "fill") +
scale_y_continuous(labels = percent) +
scale_fill_manual(values = c("Churn" = "#E74C3C",
"Not Churn" = "#2C3E50")) +
labs(title = "Churn Rate by Contract Type",
x = "Contract Type", y = "Proportion", fill = "Status") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))Customers on month-to-month contracts have a clearly higher churn rate compared to those on one-year or two-year contracts. This makes sense because short-term contracts give customers more flexibility to leave. This variable is likely to be an important predictor in our classification model.
We look at two important numerical variables together — monthly charges and tenure — to see how they differ between customers who churned and those who did not.
# —— Left plot: monthly charges by churn ——————————————————————————————————
p1 <- df_clean %>%
mutate(Churn_Label = ifelse(as.character(churn) == "1",
"Churn", "Not Churn")) %>%
ggplot(aes(x = Churn_Label, y = monthlycharges, fill = Churn_Label)) +
geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
scale_fill_manual(values = c("Churn" = "#E74C3C",
"Not Churn" = "#2C3E50")) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Monthly Charges",
x = "", y = "Monthly Charges (USD)") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
# —— Right plot: tenure by churn ——————————————————————————————————————————
p2 <- df_clean %>%
mutate(Churn_Label = ifelse(as.character(churn) == "1",
"Churn", "Not Churn")) %>%
ggplot(aes(x = Churn_Label, y = tenure, fill = Churn_Label)) +
geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
scale_fill_manual(values = c("Churn" = "#E74C3C",
"Not Churn" = "#2C3E50")) +
labs(title = "Tenure",
x = "", y = "Tenure (months)") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
# —— Arrange side by side —————————————————————————————————————————————————
grid.arrange(p1, p2, ncol = 2)From the boxplots, both monthlycharges and
tenure show very similar distributions between churned and
retained customers, with median differences of less than $1 and
approximately 1 month respectively. This is consistent with their
near-zero Pearson correlation with churn (r < 0.02), confirming that
these two variables have limited standalone predictive power for churn
(Q1).
However, they are retained in the classification model because they
may contribute through combined effects with other variables, such as
contract type or number of services. More importantly,
tenure is the strongest predictor of
totalcharges in our regression model (Q2), with a Pearson r
of 0.93, as confirmed in the scatter plot below. The variables with the
strongest direct relationship to churn — including
customer_satisfaction, num_complaints,
num_service_calls, and late_payments — are
explored in the next section.
The Pearson correlation analysis (Section 4.6) identifies
customer_satisfaction, num_complaints,
num_service_calls, and late_payments as the
four numerical variables with the strongest linear association with
churn (r = −0.085, +0.080, +0.077, +0.048 respectively). This section
visualises their distributions across churned and retained customers to
support the findings.
# —— Four key churn predictors side by side ———————————————————————————————
df_exp <- df_clean %>%
mutate(Churn_Label = ifelse(as.character(churn) == "1",
"Churn", "Not Churn"))
p_sat <- ggplot(df_exp,
aes(x = Churn_Label, y = customer_satisfaction,
fill = Churn_Label)) +
geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
scale_fill_manual(values = c("Churn" = "#E74C3C",
"Not Churn" = "#2C3E50")) +
labs(title = "Customer Satisfaction",
x = "", y = "Satisfaction Score") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
p_comp <- ggplot(df_exp,
aes(x = Churn_Label, y = num_complaints,
fill = Churn_Label)) +
geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
scale_fill_manual(values = c("Churn" = "#E74C3C",
"Not Churn" = "#2C3E50")) +
labs(title = "Number of Complaints",
x = "", y = "Count") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
p_calls <- ggplot(df_exp,
aes(x = Churn_Label, y = num_service_calls,
fill = Churn_Label)) +
geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
scale_fill_manual(values = c("Churn" = "#E74C3C",
"Not Churn" = "#2C3E50")) +
labs(title = "Service Calls",
x = "", y = "Count") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
p_late <- ggplot(df_exp,
aes(x = Churn_Label, y = late_payments,
fill = Churn_Label)) +
geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
scale_fill_manual(values = c("Churn" = "#E74C3C",
"Not Churn" = "#2C3E50")) +
labs(title = "Late Payments",
x = "", y = "Count") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
grid.arrange(p_sat, p_comp, p_calls, p_late, ncol = 4)Churned customers show slightly lower satisfaction scores and slightly higher complaint counts, service call counts, and late payment counts compared to retained customers. Although the differences appear moderate — consistent with the synthetic nature of the dataset where most variables were generated with low correlation to churn — these four variables represent the strongest available linear signals for churn prediction. They are therefore prioritised as key input features in the classification model (Q1).
This chart is specifically to support our regression question (Q2). We want to see whether there is a clear linear relationship between tenure and total charges, since this is what the regression model will try to predict.
# —— Scatter plot: tenure vs totalcharges with regression line ————————————
# Using a sample of 10,000 rows to keep the plot readable
set.seed(42)
df_sample <- df_clean %>% sample_n(10000)
ggplot(df_sample, aes(x = tenure, y = totalcharges)) +
geom_point(alpha = 0.2, size = 0.8, colour = "#2C3E50") +
geom_smooth(method = "lm", colour = "#E74C3C",
se = TRUE, linewidth = 1) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Tenure vs Total Charges",
subtitle = "Based on a random sample of 10,000 customers",
x = "Tenure (months)", y = "Total Charges (USD)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey50"))There is a clear positive linear relationship between tenure and total charges — the longer a customer has been subscribed, the higher their total charges. This is expected because total charges accumulate over time. The regression model should be able to use tenure as a strong predictor of total charges.
Finally, we look at the correlation between all numerical variables to get an overall picture of how they relate to each other. This is useful for both models.
# —— Select numerical columns and compute correlation matrix ———————————————
num_vars <- df_clean %>%
select(age, annual_income, tenure, monthlycharges, totalcharges,
num_services, num_complaints, num_service_calls,
late_payments, customer_satisfaction, credit_score,
avg_monthly_gb, days_since_last_interaction) %>%
mutate(across(everything(), as.numeric))
cor_matrix <- cor(num_vars, use = "complete.obs", method = "pearson")
corrplot(cor_matrix,
method = "color",
type = "upper",
tl.cex = 0.85,
tl.col = "black",
addCoef.col = "black",
number.cex = 0.60,
col = colorRampPalette(
c("#E74C3C", "white", "#2C3E50"))(200),
title = "Correlation Heatmap of Numerical Variables",
mar = c(0, 0, 2, 0))From the heatmap, most variables have very little correlation with each other, with values close to 0. This is consistent with the synthetic nature of the dataset, where most variables were generated independently.
The strongest relationship is between tenure and
totalcharges (r = 0.91), which is mathematically expected
since total charges accumulate over time as a function of tenure and
monthly charges. Two additional notable correlations were also found:
num_complaints and num_service_calls (r =
0.44), suggesting that dissatisfied customers tend to both complain and
contact support more frequently; and monthlycharges and
num_services (r = 0.49), which reflects the logical pricing
relationship where subscribing to more services leads to higher monthly
bills.
Because totalcharges is closely related to
tenure and monthlycharges, the regression
model is expected to achieve a high R-squared. This relationship is
explored further in the regression section.
Summary:
Pearson correlation is used to (1) visualise feature correlations, (2)
screen potential multicollinearity (highly redundant features), and (3)
observe the initial linear trend between numeric features and churn.
Explanation:
Pearson correlation captures linear relationships only. In
later sections, we use XGBoost + SHAP to detect non-linear patterns and
interactions that Pearson correlation may miss.
find_redundant_by_correlation <- function(cor_mat, cutoff = 0.90) {
# A simple iterative filter:
# repeatedly remove the variable with the highest average absolute correlation
cor_abs <- abs(cor_mat)
diag(cor_abs) <- 0
to_remove <- character(0)
keep <- colnames(cor_abs)
while (TRUE) {
sub <- cor_abs[keep, keep, drop = FALSE]
if (max(sub, na.rm = TRUE) < cutoff) break
mean_corr <- colMeans(sub, na.rm = TRUE)
remove_var <- names(which.max(mean_corr))
to_remove <- c(to_remove, remove_var)
keep <- setdiff(keep, remove_var)
}
list(keep = keep, remove = unique(to_remove))
}
redundancy <- find_redundant_by_correlation(cor_matrix, cutoff = 0.90)
redundant_features <- redundancy$remove
cat("Highly redundant numeric features (|r| >= 0.90):\n")## Highly redundant numeric features (|r| >= 0.90):
## [1] "totalcharges"
# Reduced numeric set (can be used for some models to reduce redundancy)
num_vars_reduced <- num_vars[, redundancy$keep, drop = FALSE]
cat("Original numeric features:", ncol(num_vars), "\n")## Original numeric features: 13
## Reduced numeric features: 12
Note:
tenureandtotalchargesare expected to be highly correlated because total charges accumulate over time. For Q2 regression, we keep them (and even model the interaction with monthly charges). For Q1 classification, we keep the required variables but also report redundancy so the model interpretation remains cautious. For Q3 clustering, the numeric variables identified here are used as the input features after standardisation.
# churn is coded as 0/1 in the dataset; convert to numeric for correlation
churn_num <- as.numeric(as.character(df_clean$churn))
corr_with_churn <- sapply(num_vars, function(x) {
suppressWarnings(cor(x, churn_num, use = "complete.obs", method = "pearson"))
})
corr_df <- data.frame(
Feature = names(corr_with_churn),
Pearson_r = as.numeric(corr_with_churn)
) %>%
arrange(desc(abs(Pearson_r)))
head(corr_df, 12)## Feature Pearson_r
## 1 customer_satisfaction -0.0838270226
## 2 num_complaints 0.0786322707
## 3 num_service_calls 0.0772835977
## 4 late_payments 0.0477047675
## 5 num_services -0.0333776309
## 6 monthlycharges -0.0171147524
## 7 totalcharges -0.0165011945
## 8 tenure -0.0127086881
## 9 avg_monthly_gb -0.0060656635
## 10 annual_income 0.0004749200
## 11 days_since_last_interaction 0.0003913459
## 12 credit_score -0.0003690728
ggplot(corr_df, aes(x = reorder(Feature, abs(Pearson_r)),
y = Pearson_r,
fill = Pearson_r > 0)) +
geom_col() +
coord_flip() +
scale_fill_manual(
values = c("TRUE" = "#E74C3C",
"FALSE" = "#2C3E50"),
labels = c("Negative (protective — higher value → lower churn)",
"Positive (risk factor — higher value → higher churn)"),
name = "Direction"
) +
labs(
title = "Pearson Correlation with Churn (Linear Trend Only)",
subtitle = "Red = risk factor; Dark = protective factor (linear association only)",
x = "Feature", y = "Pearson r (with churn coded as 0/1)"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom",
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey50"))Q1 Classification (Churn):
The target is imbalanced (minority churn class). EDA suggests that churn
is not explained by a single variable alone; instead, it likely depends
on combined effects of customer experience (satisfaction,
complaints, service calls), billing behaviour (late payments), and
contract/payment choices. Pearson correlation provides only an initial
linear signal, so tree-based models and SHAP are later used to uncover
non-linear effects.
Q2 Regression (Total Charges):
EDA confirms a strong linear relationship between tenure
and totalcharges, supporting the linear regression approach
and motivating an interaction term with monthlycharges.
Q3 Clustering (Segmentation):
The distribution and scales of tenure, charges, service counts, and
experience variables motivate standardisation before clustering, and the
feature set used later reflects the behavioural drivers identified in
EDA.
Summary:: This section focuses on predicting whether a customer will churn or not. Since the target variable churn has two possible outcomes, this is a binary classification problem.
Explanation:
The target variable is churn, where 0
represents customers who did not churn and 1 represents
customers who churned. Logistic regression was selected as the baseline
classification model because it is suitable for binary classification
and easy to interpret. The model uses both numerical and categorical
customer-related variables, including tenure, monthly charges, total
charges, customer satisfaction, number of complaints, service calls,
late payments, credit score, contract type, payment method, and
paperless billing.
Summary:
Before building the model, the cleaned dataset df_clean was
used. The target variable and selected categorical predictors were
converted into factor variables, and the dataset was split into 70%
training data and 30% testing data.
Explanation:
The training data was used to build the logistic regression model, while
the testing data was used to evaluate the model performance on unseen
data. Categorical variables such as contract type, payment method, and
paperless billing were converted into factor variables so that they
could be properly included in the logistic regression model.
df_clean$churn <- as.factor(df_clean$churn)
df_clean$contract <- as.factor(df_clean$contract)
df_clean$payment_method <- as.factor(df_clean$payment_method)
df_clean$paperless_billing <- as.factor(df_clean$paperless_billing)
set.seed(123)
train_index <- sample(1:nrow(df_clean), size = 0.7 * nrow(df_clean))
train_data <- df_clean[train_index, ]
test_data <- df_clean[-train_index, ]
table(df_clean$churn)##
## 0 1
## 900773 99227
##
## 0 1
## 0.900773 0.099227
The churn distribution shows that most customers belong to the non-churn group, while only a smaller proportion of customers churned. This indicates that the dataset is imbalanced. Therefore, the model evaluation should not rely only on accuracy.
Summary:
A logistic regression model was built using both numerical and
categorical predictors. These variables are related to customer usage,
payment behaviour, customer experience, and service-related
characteristics.
Explanation:
The logistic regression model estimates the probability that a customer
will churn. The selected predictors include tenure, monthly charges,
total charges, customer satisfaction, number of complaints, number of
service calls, late payments, credit score, contract type, payment
method, and paperless billing. Including categorical predictors makes
the model more consistent with the EDA findings, especially because
contract type is strongly related to customer churn.
logit_model <- glm(
churn ~ tenure + monthlycharges + totalcharges + customer_satisfaction +
num_complaints + num_service_calls + late_payments + credit_score +
contract + payment_method + paperless_billing,
data = train_data,
family = binomial
)
summary(logit_model)##
## Call:
## glm(formula = churn ~ tenure + monthlycharges + totalcharges +
## customer_satisfaction + num_complaints + num_service_calls +
## late_payments + credit_score + contract + payment_method +
## paperless_billing, family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.776e-01 4.577e-02 -10.435 <2e-16 ***
## tenure -1.465e-03 6.479e-04 -2.262 0.0237 *
## monthlycharges -2.505e-03 2.392e-04 -10.473 <2e-16 ***
## totalcharges -9.659e-06 7.585e-06 -1.274 0.2028
## customer_satisfaction -1.200e-01 1.695e-03 -70.808 <2e-16 ***
## num_complaints 2.083e-01 5.084e-03 40.965 <2e-16 ***
## num_service_calls 1.092e-01 2.854e-03 38.248 <2e-16 ***
## late_payments 2.385e-01 5.929e-03 40.216 <2e-16 ***
## credit_score 1.266e-05 4.752e-05 0.266 0.7899
## contractone_year -9.484e-01 2.016e-02 -47.046 <2e-16 ***
## contracttwo_year -1.838e+00 2.111e-02 -87.087 <2e-16 ***
## payment_methodcredit_card -5.607e-04 1.069e-02 -0.052 0.9582
## payment_methodelectronic_check 5.874e-03 1.153e-02 0.510 0.6104
## payment_methodmailed_check -6.981e-04 1.333e-02 -0.052 0.9582
## paperless_billingYes -1.951e-02 9.381e-03 -2.080 0.0375 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 453250 on 699999 degrees of freedom
## Residual deviance: 427809 on 699985 degrees of freedom
## AIC: 427839
##
## Number of Fisher Scoring iterations: 5
Summary: After building the model, the churn probability was predicted for each customer in the testing dataset. Since the churn class is relatively small, a threshold of 0.10 was used instead of the default 0.50.
Explanation:
A lower threshold makes the model more sensitive to potential churn
customers. Customers with a predicted churn probability of 0.10 or above
were classified as churn. This helps reduce the issue where the model
predicts almost all customers as non-churn, which can happen when the
dataset is imbalanced.
pred_prob <- predict(logit_model, newdata = test_data, type = "response")
threshold <- 0.10
pred_class <- ifelse(pred_prob >= threshold, "1", "0")
pred_class <- factor(pred_class, levels = levels(test_data$churn))
conf_matrix <- table(
Actual = test_data$churn,
Predicted = pred_class
)
conf_matrix## Predicted
## Actual 0 1
## 0 168407 101942
## 1 10903 18748
Summary: The model was evaluated using a confusion matrix, accuracy, precision, recall, specificity, balanced accuracy, and F1 score. Recall and F1 score are especially important because churn customers are the minority class.
Explanation:
Accuracy measures the overall proportion of correct predictions.
Precision shows how many customers predicted as churn actually churned.
Recall shows how many actual churn customers were successfully detected.
Specificity measures how well the model identifies non-churn customers.
Balanced accuracy considers both recall and specificity, which is useful
for imbalanced classification problems. F1 score combines precision and
recall into one measure.
TN <- conf_matrix["0", "0"]
FP <- conf_matrix["0", "1"]
FN <- conf_matrix["1", "0"]
TP <- conf_matrix["1", "1"]
accuracy <- (TP + TN) / sum(conf_matrix)
precision <- ifelse((TP + FP) == 0, 0, TP / (TP + FP))
recall <- ifelse((TP + FN) == 0, 0, TP / (TP + FN))
specificity <- ifelse((TN + FP) == 0, 0, TN / (TN + FP))
balanced_accuracy <- (recall + specificity) / 2
f1_score <- ifelse((precision + recall) == 0, 0,
2 * precision * recall / (precision + recall))
classification_result <- data.frame(
Metric = c(
"Accuracy",
"Precision",
"Recall",
"Specificity",
"Balanced Accuracy",
"F1 Score"
),
Value = round(c(
accuracy,
precision,
recall,
specificity,
balanced_accuracy,
f1_score
), 4)
)
classification_result## Metric Value
## 1 Accuracy 0.6238
## 2 Precision 0.1553
## 3 Recall 0.6323
## 4 Specificity 0.6229
## 5 Balanced Accuracy 0.6276
## 6 F1 Score 0.2494
The evaluation results show how well the model predicts churn customers after including both numerical and categorical predictors. Since the churn class is relatively small, recall and F1 score are useful for understanding model performance beyond accuracy.
conf_df <- as.data.frame(conf_matrix)
ggplot(conf_df, aes(x = Predicted, y = Actual, fill = Freq)) +
geom_tile() +
geom_text(aes(label = scales::comma(Freq)), size = 5) +
labs(
title = "Confusion Matrix Heatmap",
x = "Predicted Class",
y = "Actual Class",
fill = "Count"
) +
theme_minimal()The confusion matrix heatmap provides a visual summary of correct and incorrect predictions. The diagonal cells represent correct predictions, while the off-diagonal cells represent false positives and false negatives.
Summary:
The ROC curve and AUC were used as an additional evaluation method. This
helps evaluate how well the model separates churn and non-churn
customers without depending only on one threshold.
Explanation:
The ROC curve shows the trade-off between the true positive rate and the
false positive rate. The AUC value summarises the model’s overall
ability to distinguish churn customers from non-churn customers. This is
useful for imbalanced classification problems because it evaluates the
ranking ability of the model across different thresholds.
roc_obj <- roc(
response = test_data$churn,
predictor = as.numeric(pred_prob),
levels = c("0", "1"),
direction = "<",
quiet = TRUE
)
auc_value <- auc(roc_obj)
cat("AUC-ROC:", round(auc_value, 4), "\n")## AUC-ROC: 0.6777
Summary:
The precision-recall curve was used to further evaluate the model
because the churn class is much smaller than the non-churn class.
Explanation:
The precision-recall curve focuses on the model’s ability to identify
churn customers. This is useful for imbalanced classification problems
because precision and recall directly describe the model performance on
the minority class. In this project, recall shows how many actual churn
customers were detected, while precision shows how many predicted churn
customers actually churned.
pr_obj <- pr.curve(
scores.class0 = pred_prob[test_data$churn == "1"],
scores.class1 = pred_prob[test_data$churn == "0"],
curve = TRUE
)
plot(
pr_obj,
main = "Precision-Recall Curve for Churn Classification"
)## AUC-PR: 0.194
Summary: The probability plot shows the distribution of predicted churn probabilities. Most customers have relatively low predicted churn probabilities, which is reasonable because most customers did not churn.
Explanation:
The plot confirms that most predicted churn probabilities are relatively
low, and the churn and non-churn groups overlap. This supports the use
of multiple evaluation metrics instead of relying only on accuracy.
prediction_df <- data.frame(
Actual_Churn = test_data$churn,
Predicted_Probability = pred_prob
)
ggplot(prediction_df, aes(x = Predicted_Probability, fill = Actual_Churn)) +
geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
labs(
title = "Predicted Probability of Customer Churn",
x = "Predicted Churn Probability",
y = "Number of Customers",
fill = "Actual Churn"
) +
theme_minimal()Summary:
To extend the baseline logistic regression, we trained two additional
classification models: a decision tree and a random forest. The goal is
to examine whether a non-linear model can improve churn detection on an
imbalanced dataset.
Explanation:
Logistic regression is highly interpretable but assumes a linear
relationship between predictors and the log-odds of churn. In contrast,
decision trees can model non-linear interactions, while random forests
often achieve better predictive performance by aggregating many trees.
However, tree-based models may require more computation, especially on a
large dataset. To keep runtime manageable, we optionally down-sample the
training set for the tree-based models while keeping the same testing
set for fair evaluation.
evaluate_binary_classifier <- function(actual, prob, threshold = 0.10) {
# actual must be a factor with levels c("0","1")
pred_class <- ifelse(prob >= threshold, "1", "0")
pred_class <- factor(pred_class, levels = levels(actual))
cm <- table(Actual = actual, Predicted = pred_class)
TN <- cm["0", "0"]
FP <- cm["0", "1"]
FN <- cm["1", "0"]
TP <- cm["1", "1"]
accuracy <- (TP + TN) / sum(cm)
precision <- ifelse((TP + FP) == 0, 0, TP / (TP + FP))
recall <- ifelse((TP + FN) == 0, 0, TP / (TP + FN))
specificity <- ifelse((TN + FP) == 0, 0, TN / (TN + FP))
balanced_accuracy <- (recall + specificity) / 2
f1_score <- ifelse((precision + recall) == 0, 0,
2 * precision * recall / (precision + recall))
roc_obj <- pROC::roc(
response = actual,
predictor = as.numeric(prob),
levels = c("0", "1"),
direction = "<",
quiet = TRUE
)
auc_roc <- as.numeric(pROC::auc(roc_obj))
pr_obj <- PRROC::pr.curve(
scores.class0 = prob[actual == "1"],
scores.class1 = prob[actual == "0"],
curve = FALSE
)
auc_pr <- as.numeric(pr_obj$auc.integral)
metrics <- data.frame(
Accuracy = accuracy,
Precision = precision,
Recall = recall,
Specificity = specificity,
Balanced_Accuracy = balanced_accuracy,
F1_Score = f1_score,
AUC_ROC = auc_roc,
AUC_PR = auc_pr
)
list(confusion_matrix = cm, metrics = metrics, roc_obj = roc_obj)
}
# Optional down-sampling for tree-based models (speed / memory)
train_tree <- train_data
set.seed(123)
if (nrow(train_tree) > 200000) {
train_tree <- train_tree[sample(1:nrow(train_tree), 200000), ]
}
cat("Tree/RF training rows used:", nrow(train_tree), "\n")## Tree/RF training rows used: 200000
Summary:
A decision tree was trained to capture non-linear splits of customer
attributes. A shallow depth was used to reduce overfitting and improve
interpretability.
# Load required libraries for decision tree training & visualization
library(rpart)
library(rpart.plot)
# Define classification formula for churn prediction model
# Predict churn status using all selected customer feature variables
tree_formula <- churn ~ tenure + monthlycharges + totalcharges + customer_satisfaction +
num_complaints + num_service_calls + late_payments + credit_score +
contract + payment_method + paperless_billing
# Train base classification decision tree with regularization control parameters
dt_model <- rpart(
formula = tree_formula,
data = train_tree,
method = "class", # Specify classification task for binary churn prediction
# Regularization settings to avoid severe overfitting before pruning
control = rpart.control(
maxdepth = 6, # Restrict maximum tree depth for business interpretability
minsplit = 200, # Minimum number of samples required to split a node
cp = 0.001 # Minimum complexity threshold to accept a split
)
)
# Print CP table to inspect cross-validation error values for pruning reference
printcp(dt_model)##
## Classification tree:
## rpart(formula = tree_formula, data = train_tree, method = "class",
## control = rpart.control(maxdepth = 6, minsplit = 200, cp = 0.001))
##
## Variables actually used in tree construction:
## character(0)
##
## Root node error: 19916/200000 = 0.09958
##
## n= 200000
##
## CP nsplit rel error xerror xstd
## 1 0 0 1 0 0
# Extract the optimal CP value that yields the minimal cross-validation error (xerror)
best_cp <- dt_model$cptable[which.min(dt_model$cptable[, "xerror"]), "CP"]
# Print selected best CP for debugging & report records
cat("Selected optimal CP value for pruning:", best_cp, "\n")## Selected optimal CP value for pruning: 0
# Prune the original decision tree using the optimal CP value to reduce overfitting
dt_pruned <- prune(dt_model, cp = best_cp)
# Export high-resolution tree plot as PNG for embedding into HTML analysis report
png("pruned_churn_decision_tree.png", width = 1600, height = 1000, res = 300)
rpart.plot(
x = dt_pruned,
type = 2,
extra = 103,
fallen.leaves = TRUE,
box.col = c("#b8e1ff", "#ffb8b8"),
main = "Pruned Decision Tree - Customer Churn Prediction",
cex = 0.85
)
dev.off()## quartz_off_screen
## 2
# Generate churn probability predictions on unseen test dataset
dt_prob <- predict(
object = dt_pruned,
newdata = test_data,
type = "prob"
)[, "1"] # Extract probability of positive churn class (churn = 1)
# Evaluate binary classification performance
# Fix: pass arguments by position to avoid unused parameter error
dt_eval <- evaluate_binary_classifier(test_data$churn, dt_prob, threshold)
# Print confusion matrix of test set predictions
cat("\nConfusion Matrix of Pruned Decision Tree:\n")##
## Confusion Matrix of Pruned Decision Tree:
## Predicted
## Actual 0 1
## 0 270349 0
## 1 29651 0
# Print classification metrics rounded to 4 decimal places
cat("\nPerformance Metrics (4 decimal precision):\n")##
## Performance Metrics (4 decimal precision):
## Accuracy Precision Recall Specificity Balanced_Accuracy F1_Score AUC_ROC
## 1 0.9012 0 0 1 0.5 0 0.5
## AUC_PR
## 1 0.0988
Summary:
A random forest was trained to improve predictive performance by
averaging many decision trees. Variable importance was also extracted to
identify the most influential drivers of churn.
# Compute mtry based on the number of columns in the design matrix
p <- ncol(model.matrix(tree_formula, data = train_tree)) - 1
mtry_val <- max(1, floor(sqrt(p)))
rf_model <- randomForest(
tree_formula,
data = train_tree,
ntree = 300,
mtry = mtry_val,
importance = TRUE
)
rf_prob <- predict(rf_model, newdata = test_data, type = "prob")[, "1"]
rf_eval <- evaluate_binary_classifier(test_data$churn, rf_prob, threshold = threshold)
rf_eval$confusion_matrix## Predicted
## Actual 0 1
## 0 174218 96131
## 1 13092 16559
## Accuracy Precision Recall Specificity Balanced_Accuracy F1_Score AUC_ROC
## 1 0.6359 0.1469 0.5585 0.6444 0.6014 0.2327 0.6429
## AUC_PR
## 1 0.1688
lr_eval <- evaluate_binary_classifier(test_data$churn, pred_prob, threshold = threshold)
comparison_tbl <- dplyr::bind_rows(
data.frame(Model = "Logistic Regression", lr_eval$metrics),
data.frame(Model = "Decision Tree", dt_eval$metrics),
data.frame(Model = "Random Forest", rf_eval$metrics)
) %>%
mutate(across(where(is.numeric), ~ round(.x, 4)))
comparison_tbl## Model Accuracy Precision Recall Specificity Balanced_Accuracy
## 1 Logistic Regression 0.6238 0.1553 0.6323 0.6229 0.6276
## 2 Decision Tree 0.9012 0.0000 0.0000 1.0000 0.5000
## 3 Random Forest 0.6359 0.1469 0.5585 0.6444 0.6014
## F1_Score AUC_ROC AUC_PR
## 1 0.2494 0.6777 0.1940
## 2 0.0000 0.5000 0.0988
## 3 0.2327 0.6429 0.1688
Summary:
We further trained an XGBoost churn model and used SHAP values to
interpret both global feature importance and local (single-customer)
explanations. This helps uncover non-linear and interaction effects that
Pearson correlation may miss.
Explanation:
- XGBoost is a strong gradient-boosted tree model that
typically performs well on tabular business data.
- SHAP decomposes each prediction into additive
contributions from each feature, providing a consistent explanation
framework.
# Prepare one-hot encoded design matrix for XGBoost
# Use the same features as the classification models above.
df_train_xgb <- train_tree
df_test_xgb <- test_data
mm_all <- model.matrix(tree_formula, data = rbind(df_train_xgb, df_test_xgb))
X_all <- mm_all[, -1, drop = FALSE] # drop intercept
X_train <- X_all[1:nrow(df_train_xgb), , drop = FALSE]
X_test <- X_all[(nrow(df_train_xgb) + 1):nrow(X_all), , drop = FALSE]
y_train <- as.numeric(as.character(df_train_xgb$churn))
y_test <- as.numeric(as.character(df_test_xgb$churn))
scale_pos_weight <- sum(y_train == 0) / sum(y_train == 1)
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest <- xgb.DMatrix(data = X_test, label = y_test)
params <- list(
objective = "binary:logistic",
eval_metric = "auc",
eta = 0.05,
max_depth = 6,
min_child_weight = 1,
subsample = 0.8,
colsample_bytree = 0.8,
scale_pos_weight = scale_pos_weight
)
set.seed(123)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 2000,
watchlist = list(train = dtrain, eval = dtest),
early_stopping_rounds = 50,
verbose = 0
)
xgb_prob <- predict(xgb_model, dtest)
xgb_eval <- evaluate_binary_classifier(test_data$churn, xgb_prob, threshold = threshold)
xgb_eval$confusion_matrix## Predicted
## Actual 0 1
## 0 0 270349
## 1 0 29651
## Accuracy Precision Recall Specificity Balanced_Accuracy F1_Score AUC_ROC
## 1 0.0988 0.0988 1 0 0.5 0.1799 0.6765
## AUC_PR
## 1 0.1928
# SHAP can be computationally heavy on very large test sets.
# Use a subset of the test set for explanation plots.
set.seed(123)
shap_n <- min(20000, nrow(X_test))
shap_idx <- sample(seq_len(nrow(X_test)), shap_n)
sv <- shapviz(xgb_model, X_pred = X_test[shap_idx, , drop = FALSE])
# 1) Global feature importance (bar)
p_imp <- shapviz::sv_importance(sv, kind = "bar", max_display = 15) +
ggtitle("SHAP Global Feature Importance (Top 15)")
print(p_imp)# 2) Beeswarm plot (global distribution of SHAP values)
# Note: some older shapviz versions do not have sv_beeswarm().
# Fall back to sv_importance(kind = "beeswarm") if available; otherwise show bar importance.
p_bee <- tryCatch(
shapviz::sv_beeswarm(sv, max_display = 15),
error = function(e) NULL
)
if (is.null(p_bee)) {
p_bee <- tryCatch(
shapviz::sv_importance(sv, kind = "beeswarm", max_display = 15),
error = function(e) {
message("sv_beeswarm() is not available in your shapviz version. Showing bar importance instead.")
shapviz::sv_importance(sv, kind = "bar", max_display = 15)
}
)
}
print(p_bee + ggtitle("SHAP Beeswarm (Top 15)"))# 3) Single-customer explanation (waterfall)
# Use the first row in the SHAP subset as an example.
p_wf <- shapviz::sv_waterfall(sv, row_id = 1, max_display = 12) +
ggtitle("SHAP Waterfall (One Customer Example)")
print(p_wf)comparison_tbl_ext <- dplyr::bind_rows(
data.frame(Model = "Logistic Regression", lr_eval$metrics),
data.frame(Model = "Decision Tree", dt_eval$metrics),
data.frame(Model = "Random Forest", rf_eval$metrics),
data.frame(Model = "XGBoost", xgb_eval$metrics)
) %>%
mutate(across(where(is.numeric), ~ round(.x, 4)))
comparison_tbl_ext## Model Accuracy Precision Recall Specificity Balanced_Accuracy
## 1 Logistic Regression 0.6238 0.1553 0.6323 0.6229 0.6276
## 2 Decision Tree 0.9012 0.0000 0.0000 1.0000 0.5000
## 3 Random Forest 0.6359 0.1469 0.5585 0.6444 0.6014
## 4 XGBoost 0.0988 0.0988 1.0000 0.0000 0.5000
## F1_Score AUC_ROC AUC_PR
## 1 0.2494 0.6777 0.1940
## 2 0.0000 0.5000 0.0988
## 3 0.2327 0.6429 0.1688
## 4 0.1799 0.6765 0.1928
Summary:
Overall, logistic regression provides a simple and interpretable
baseline model for customer churn prediction. By including both
numerical and categorical predictors, the model becomes more consistent
with the EDA findings.
Explanation:
The model includes customer usage, payment behaviour, customer
experience, and service-related categorical variables such as contract
type, payment method, and paperless billing. Since the churn class is
imbalanced, a lower threshold was used to improve the model’s ability to
detect potential churn customers. However, this may also increase false
positives. This trade-off is acceptable in a customer churn context
because companies may prefer to identify more potential churn customers
for retention actions.
Compared with logistic regression, the decision tree can capture non-linear splits (for example, churn risk increasing sharply beyond certain thresholds of tenure or complaints). However, single trees can be unstable and may underperform if the true pattern is complex. The random forest typically improves performance by averaging many trees and often yields better ranking metrics (AUC-ROC / AUC-PR), which is important when the churn class is imbalanced.
In practice, logistic regression remains attractive when explainability and ease of deployment are priorities, while random forest can be used when higher predictive performance is needed for targeted retention campaigns. The choice depends on the business trade-off between model interpretability, operational cost, and the cost of false positives vs false negatives.
This section predicts totalcharges, which is the total
amount paid by each customer. Since this variable is numeric and
continuous, it is treated as a regression problem.
Linear regression was used because it is simple and easy to
interpret. The model also includes an interaction term between
tenure and monthlycharges, because total
charges are naturally related to both how long a customer stays and how
much the customer pays each month. The other predictors are
num_services, customer_satisfaction,
credit_score, age, and
annual_income.
This part uses the same training and testing split that was already created in the classification section. Using the same split ensures that the regression and classification results are based on the same customers, which is important when the two models are combined later.
# Reuse the same train/test split from the classification section
train_reg <- train_data
test_reg <- test_data
cat("Training set size:", nrow(train_reg), "\n")## Training set size: 700000
## Testing set size: 300000
The regression model was fitted using the training data. The model
includes tenure, monthlycharges, and their
interaction term. In R, tenure * monthlycharges includes
tenure, monthlycharges, and the interaction
effect between them.
This is suitable for this task because customers with both long tenure and high monthly charges usually have much higher total charges.
lm_model <- lm(
totalcharges ~ tenure * monthlycharges + num_services +
customer_satisfaction + credit_score + age + annual_income,
data = train_reg
)
summary(lm_model)##
## Call:
## lm(formula = totalcharges ~ tenure * monthlycharges + num_services +
## customer_satisfaction + credit_score + age + annual_income,
## data = train_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7806.8 -31.0 14.5 106.3 6421.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.100e-01 4.373e+00 -0.117 0.907158
## tenure -2.871e-01 7.853e-02 -3.656 0.000257 ***
## monthlycharges -1.363e-01 2.843e-02 -4.793 1.65e-06 ***
## num_services 3.045e+00 3.282e-01 9.277 < 2e-16 ***
## customer_satisfaction -9.813e-02 1.752e-01 -0.560 0.575418
## credit_score 6.977e-03 4.706e-03 1.483 0.138169
## age -7.720e-03 2.793e-02 -0.276 0.782222
## annual_income 1.363e-05 1.327e-05 1.027 0.304489
## tenure:monthlycharges 9.585e-01 8.836e-04 1084.774 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 338.1 on 699991 degrees of freedom
## Multiple R-squared: 0.9649, Adjusted R-squared: 0.9649
## F-statistic: 2.405e+06 on 8 and 699991 DF, p-value: < 2.2e-16
The fitted model was then used to predict totalcharges
for the testing data. The predicted values were compared with the actual
values to evaluate model performance.
pred_totalcharges <- predict(lm_model, newdata = test_reg)
head(data.frame(
Actual = round(test_reg$totalcharges, 2),
Predicted = round(pred_totalcharges, 2)
), 10)## Actual Predicted
## 5 648.79 944.33
## 7 3602.75 3153.26
## 8 2067.41 2079.01
## 11 504.35 525.26
## 12 3187.81 2925.58
## 13 677.30 706.72
## 22 10279.40 8951.59
## 24 4679.77 4090.45
## 28 5465.79 5142.84
## 31 92.20 83.90
The model was evaluated using MAE, RMSE, and R-squared. MAE shows the
average prediction error in dollars. RMSE gives more weight to larger
errors. R-squared shows how much variation in totalcharges
can be explained by the model.
actuals <- test_reg$totalcharges
residuals <- actuals - pred_totalcharges
mae <- mean(abs(residuals))
rmse <- sqrt(mean(residuals^2))
sse <- sum(residuals^2)
sst <- sum((actuals - mean(actuals))^2)
r_squared <- 1 - sse / sst
regression_result <- data.frame(
Metric = c("MAE (USD)", "RMSE (USD)", "R-squared"),
Value = round(c(mae, rmse, r_squared), 4)
)
regression_result## Metric Value
## 1 MAE (USD) 177.1236
## 2 RMSE (USD) 338.3381
## 3 R-squared 0.9647
The actual-versus-predicted plot was used to check how close the predictions were to the real values. The residual plot was used to check whether the prediction errors were roughly centred around zero.
set.seed(42)
plot_df <- data.frame(Actual = actuals, Predicted = pred_totalcharges)
plot_sample <- plot_df %>% sample_n(min(10000, nrow(plot_df)))
ggplot(plot_sample, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.2, size = 0.8, colour = "#2C3E50") +
geom_abline(slope = 1, intercept = 0,
colour = "#E74C3C", linewidth = 1) +
scale_x_continuous(labels = dollar_format()) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Actual vs Predicted Total Charges",
subtitle = "Random sample of 10,000 customers; red line = perfect prediction",
x = "Actual Total Charges (USD)",
y = "Predicted Total Charges (USD)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))residual_df <- data.frame(Predicted = pred_totalcharges, Residual = residuals)
residual_sample <- residual_df %>% sample_n(min(10000, nrow(residual_df)))
ggplot(residual_sample, aes(x = Predicted, y = Residual)) +
geom_point(alpha = 0.2, size = 0.8, colour = "#2C3E50") +
geom_hline(yintercept = 0, colour = "#E74C3C", linewidth = 1) +
scale_x_continuous(labels = dollar_format()) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Residual Plot",
subtitle = "Random sample of 10,000 customers; residuals should be centred on zero",
x = "Predicted Total Charges (USD)",
y = "Residual (Actual − Predicted)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))Based on the evaluation results, the model achieved an R-squared of
0.965, with an MAE of about $177 and
an RMSE of about $338. This suggests that the model has
good predictive performance for totalcharges. This is
reasonable because total charges are naturally related to how long a
customer has stayed with the company and how much the customer pays each
month.
The interaction term between tenure and
monthlycharges helps the model capture their combined
effect on total charges. However, the result should still be interpreted
carefully because totalcharges is closely related to these
variables by definition. Therefore, this model is useful for estimating
customer spending value, but the result should not be
over-interpreted.
One limitation is that outlier capping was applied to some input
features (e.g., monthlycharges and
annual_income) during data cleaning to reduce the influence
of extreme values. This improves model stability, but it may slightly
dampen the estimated effect size for those predictors. Importantly, the
regression target (totalcharges) itself was
not capped in this version of the pipeline.
A 5-fold cross-validation was used to check whether the regression model is stable. This cross-validation was used as an additional stability check on the cleaned dataset, separate from the fixed train-test split used for the main model comparison. The dataset was divided into five parts. Each time, four parts were used for training and one part was used for testing.
set.seed(123)
k_folds <- 5
fold_ids <- sample(rep(1:k_folds, length.out = nrow(df_clean)))
cv_results <- data.frame(
Fold = as.character(1:k_folds),
MAE = NA_real_,
RMSE = NA_real_,
R2 = NA_real_
)
for (i in 1:k_folds) {
cv_train <- df_clean[fold_ids != i, ]
cv_test <- df_clean[fold_ids == i, ]
cv_model <- lm(
totalcharges ~ tenure * monthlycharges + num_services +
customer_satisfaction + credit_score + age + annual_income,
data = cv_train
)
cv_pred <- predict(cv_model, newdata = cv_test)
cv_actual <- cv_test$totalcharges
cv_results$MAE[i] <- mean(abs(cv_actual - cv_pred))
cv_results$RMSE[i] <- sqrt(mean((cv_actual - cv_pred)^2))
cv_results$R2[i] <- 1 - sum((cv_actual - cv_pred)^2) /
sum((cv_actual - mean(cv_actual))^2)
}
cv_results <- rbind(
cv_results,
data.frame(
Fold = "Average",
MAE = mean(cv_results$MAE),
RMSE = mean(cv_results$RMSE),
R2 = mean(cv_results$R2)
)
)
cv_results$MAE <- round(cv_results$MAE, 4)
cv_results$RMSE <- round(cv_results$RMSE, 4)
cv_results$R2 <- round(cv_results$R2, 4)
cv_results## Fold MAE RMSE R2
## 1 1 177.0401 338.4160 0.9649
## 2 2 176.3736 336.6289 0.9649
## 3 3 177.0498 337.4517 0.9651
## 4 4 178.2538 339.7117 0.9645
## 5 5 177.1107 338.6662 0.9648
## 6 Average 177.1656 338.1749 0.9648
The bar chart shows the R-squared value for each fold. If the values are close to each other, it means the model performance is stable across different data splits.
cv_plot_df <- cv_results[cv_results$Fold != "Average", ]
ggplot(cv_plot_df, aes(x = Fold, y = R2)) +
geom_bar(stat = "identity", fill = "#2C3E50", width = 0.5) +
geom_text(aes(label = R2), vjust = -0.5, fontface = "bold", size = 4) +
geom_hline(yintercept = mean(cv_plot_df$R2),
colour = "#E74C3C", linetype = "dashed", linewidth = 1) +
coord_cartesian(ylim = c(0, max(cv_plot_df$R2) * 1.1)) +
labs(title = "5-Fold Cross-Validation: R-squared by Fold",
subtitle = "Red dashed line shows the mean across folds",
x = "Fold", y = "R-squared") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey50"))A baseline model was also used for comparison. This baseline model
simply predicts the mean value of totalcharges from the
training data. If the regression model has lower MAE and RMSE than the
baseline model, it means the selected predictors and interaction term
are useful.
baseline_pred <- rep(mean(train_reg$totalcharges), nrow(test_reg))
baseline_actual <- test_reg$totalcharges
baseline_mae <- mean(abs(baseline_actual - baseline_pred))
baseline_rmse <- sqrt(mean((baseline_actual - baseline_pred)^2))
comparison_table <- data.frame(
Model = c("Baseline (Predict Mean)", "Regression with Interaction"),
MAE_USD = round(c(baseline_mae, mae), 2),
RMSE_USD = round(c(baseline_rmse, rmse), 2)
)
comparison_table## Model MAE_USD RMSE_USD
## 1 Baseline (Predict Mean) 1387.32 1801.58
## 2 Regression with Interaction 177.12 338.34
comp_long <- data.frame(
Model = rep(comparison_table$Model, 2),
Metric = rep(c("MAE (USD)", "RMSE (USD)"), each = 2),
Value = c(comparison_table$MAE_USD, comparison_table$RMSE_USD)
)
ggplot(comp_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge", width = 0.6) +
geom_text(aes(label = round(Value, 1)),
position = position_dodge(width = 0.6),
vjust = -0.5, fontface = "bold", size = 4) +
scale_fill_manual(values = c("Baseline (Predict Mean)" = "#E74C3C",
"Regression with Interaction" = "#2C3E50")) +
labs(title = "Regression Model vs Baseline Model",
x = "Metric (Lower is Better)",
y = "Error (USD)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))The cross-validation result shows that the model performs
consistently across the five folds, which suggests that the model is
stable across different data splits. The baseline comparison also shows
that the regression model has much lower MAE and RMSE than simply
predicting the mean value of totalcharges.
This confirms that the selected predictors provide useful information. Overall, this regression model can help estimate customer value based on account and customer characteristics.
The residual plot also suggests that the error spread may not be perfectly constant, so the regression result should be interpreted mainly as a prediction model rather than a strict inference model.
This section uses K-means clustering to group customers into different segments. Unlike classification and regression, clustering does not use a target variable. The purpose is to find customers with similar behaviour and account characteristics.
Seven numeric variables were selected for clustering:
tenure, monthlycharges,
num_services, customer_satisfaction,
credit_score, num_complaints, and
late_payments.
The variables were standardised before applying K-means. This is important because K-means is based on distance. Without standardisation, variables with larger values may have too much influence on the clustering result.
A sample of 50,000 customers was used to reduce computation time.
cluster_vars <- df_clean %>%
select(tenure, monthlycharges, num_services, customer_satisfaction,
credit_score, num_complaints, late_payments) %>%
mutate(across(everything(), as.numeric))
set.seed(123)
cluster_idx <- sample(1:nrow(cluster_vars), size = 50000)
cluster_data <- cluster_vars[cluster_idx, ]
cluster_scaled <- scale(cluster_data)
cat("Sample size for clustering:", nrow(cluster_data), "\n")## Sample size for clustering: 50000
## Number of features used: 7
Based on the elbow plot and the later comparison of silhouette scores, k = 4 was selected as a reasonable number of clusters. The within-cluster sum of squares was calculated for different k values to compare the compactness of the clusters.
set.seed(123)
wss <- sapply(1:10, function(k) {
kmeans(cluster_scaled, centers = k, nstart = 10, iter.max = 30)$tot.withinss
})
elbow_df <- data.frame(k = 1:10, WSS = wss)
ggplot(elbow_df, aes(x = k, y = WSS)) +
geom_line(colour = "#2C3E50", linewidth = 1) +
geom_point(size = 3, colour = "#E74C3C") +
scale_x_continuous(breaks = 1:10) +
labs(title = "Elbow Method for Optimal Number of Clusters",
x = "Number of Clusters (k)",
y = "Within-Cluster Sum of Squares (WSS)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))The final K-means model was built using k = 4. The option
nstart = 25 was used so that the algorithm tries different
starting points and keeps the best result.
set.seed(123)
k_optimal <- 4
km_model <- kmeans(cluster_scaled, centers = k_optimal, nstart = 25)
cluster_data$Cluster <- factor(km_model$cluster)
cat("Cluster sizes:\n")## Cluster sizes:
##
## 1 2 3 4
## 8077 11302 18363 12258
After clustering, the mean value of each variable was calculated for every cluster. This profile table helps explain the characteristics of each customer segment.
cluster_profile <- cluster_data %>%
group_by(Cluster) %>%
summarise(
Count = n(),
Avg_Tenure = round(mean(tenure), 1),
Avg_MonthlyCharges = round(mean(monthlycharges), 1),
Avg_NumServices = round(mean(num_services), 1),
Avg_Satisfaction = round(mean(customer_satisfaction), 1),
Avg_CreditScore = round(mean(credit_score), 0),
Avg_Complaints = round(mean(num_complaints), 1),
Avg_LatePayments = round(mean(late_payments), 1)
)
cluster_profile## # A tibble: 4 × 9
## Cluster Count Avg_Tenure Avg_MonthlyCharges Avg_NumServices Avg_Satisfaction
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 8077 58.1 82.8 2.3 6.2
## 2 2 11302 16 79.4 2.1 6.2
## 3 3 18363 13.7 75.6 1.8 6.1
## 4 4 12258 17.5 110. 4.3 6.2
## # ℹ 3 more variables: Avg_CreditScore <dbl>, Avg_Complaints <dbl>,
## # Avg_LatePayments <dbl>
PCA was used to visualise the clustering result in two dimensions. Since the clustering model uses seven variables, PCA helps show the general separation of clusters in a simple plot.
pca <- prcomp(cluster_scaled, center = FALSE, scale. = FALSE)
pca_df <- data.frame(
PC1 = pca$x[, 1],
PC2 = pca$x[, 2],
Cluster = cluster_data$Cluster
)
ggplot(pca_df, aes(x = PC1, y = PC2, colour = Cluster)) +
geom_point(alpha = 0.4, size = 0.8) +
scale_colour_manual(values = c("#E74C3C", "#2C3E50", "#3498DB", "#F39C12")) +
labs(title = "Customer Segments (PCA 2D Projection)",
subtitle = paste0("k = ", k_optimal,
"; PC1 explains ",
round(summary(pca)$importance[2, 1] * 100, 1), "%, PC2 explains ",
round(summary(pca)$importance[2, 2] * 100, 1), "% of variance"),
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey50"))Based on the cluster profile table, the four clusters mainly differ in customer tenure, monthly charges, and the number of services used. Customer satisfaction is almost the same across all clusters, so it is not a strong factor in separating the groups.
One cluster contains customers with the longest tenure, suggesting a group of long-term customers with relatively stable relationships with the company. Another cluster has the highest monthly charges and the highest number of services, suggesting a high-spending group. The remaining clusters mainly contain newer customers with shorter tenure and lower spending.
These differences can help the company design different retention strategies. For example, long-term customers and high-spending customers may need more personalised retention actions because they represent important customer groups, while lower-spending customers may be handled with lower-cost retention actions.
Unlike classification and regression, clustering has no true target label. Therefore, accuracy and RMSE are not suitable for clustering evaluation. In this report, silhouette width and the BSS/TSS ratio were used to evaluate the clustering result.
Silhouette width was used to evaluate how well each customer fits into the assigned cluster. A higher average silhouette value means the clusters are better separated.
set.seed(42)
sil_idx <- sample(1:nrow(cluster_scaled), 10000)
sil_data <- cluster_scaled[sil_idx, ]
sil_clusters <- km_model$cluster[sil_idx]
sil <- silhouette(sil_clusters, dist(sil_data))
sil_overall <- mean(sil[, "sil_width"])
sil_per_cluster <- tapply(sil[, "sil_width"], sil[, "cluster"], mean)
cat("Overall average silhouette width:", round(sil_overall, 4), "\n\n")## Overall average silhouette width: 0.144
## Average silhouette per cluster:
## 1 2 3 4
## 0.1184 0.1013 0.1963 0.1224
The silhouette plot provides a visual check of the clustering quality. Customers with positive silhouette values fit better within their assigned clusters.
plot(sil,
col = c("#E74C3C", "#2C3E50", "#3498DB", "#F39C12"),
border = NA,
main = "Silhouette Plot for K-Means (k = 4)")The BSS/TSS ratio was also used to evaluate the clustering result. A higher ratio means that more variation in the data is explained by the cluster grouping.
wss_final <- km_model$tot.withinss
bss_final <- km_model$betweenss
tss_final <- km_model$totss
ratio <- bss_final / tss_final
variance_table <- data.frame(
Metric = c("Within-Cluster SS (WSS)",
"Between-Cluster SS (BSS)",
"Total SS (TSS)",
"BSS / TSS (variance explained)"),
Value = c(round(wss_final, 2),
round(bss_final, 2),
round(tss_final, 2),
paste0(round(ratio * 100, 2), "%"))
)
variance_table## Metric Value
## 1 Within-Cluster SS (WSS) 246018.49
## 2 Between-Cluster SS (BSS) 103974.51
## 3 Total SS (TSS) 349993
## 4 BSS / TSS (variance explained) 29.71%
Different k values were compared to check whether k = 4 was a reasonable choice. The final decision should consider both the evaluation metrics and whether the clusters are easy to interpret.
set.seed(123)
compare_k <- data.frame(k = 3:6,
Avg_Silhouette = NA_real_,
BSS_TSS_Ratio = NA_real_)
for (i in 1:nrow(compare_k)) {
k_test <- compare_k$k[i]
km_test <- kmeans(cluster_scaled, centers = k_test,
nstart = 10, iter.max = 30)
sil_test_clusters <- km_test$cluster[sil_idx]
sil_test <- silhouette(sil_test_clusters, dist(sil_data))
compare_k$Avg_Silhouette[i] <- round(mean(sil_test[, "sil_width"]), 4)
compare_k$BSS_TSS_Ratio[i] <- round(km_test$betweenss / km_test$totss, 4)
}
compare_k## k Avg_Silhouette BSS_TSS_Ratio
## 1 3 0.1407 0.2291
## 2 4 0.1440 0.2971
## 3 5 0.1338 0.3475
## 4 6 0.1379 0.3965
ggplot(compare_k, aes(x = factor(k), y = Avg_Silhouette)) +
geom_bar(stat = "identity", fill = "#2C3E50", width = 0.5) +
geom_text(aes(label = sprintf("%.3f", Avg_Silhouette)),
vjust = -0.5, fontface = "bold", size = 4) +
labs(title = "Average Silhouette Width Across Different k Values",
subtitle = "The k value with the highest bar is the strongest candidate",
x = "Number of Clusters (k)",
y = "Average Silhouette Width") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey50"))The clustering evaluation shows that the K-means result has a relatively low average silhouette width (about 0.144) and a BSS/TSS ratio of about 29.7%. This means the clusters are not strongly separated and there is some overlap between groups. This is common for real customer data, where customer behaviour usually changes gradually instead of forming clearly separated groups.
When comparing different k values, k = 4 showed a reasonable balance between clustering performance and business interpretability. Even though the separation is not very strong, the clusters still show clear differences in tenure, spending, and service usage. Therefore, the clustering result is still useful for customer segmentation, as long as it is interpreted together with the cluster profile rather than relying only on the metrics.
Since K-means uses Euclidean distance, it works best with continuous numeric variables. Some variables used here, such as satisfaction score and complaint counts, are ordinal or count-based. Therefore, the clustering result should be interpreted as an approximate customer segmentation rather than a perfect grouping.
This section compares the three models used in this project: classification, regression, and clustering. Each model answers a different question about customer churn and customer value.
Each model answers a different question. Classification and regression are supervised models, which means they learn from a known target variable. Clustering is unsupervised, so it does not use any target variable. Because of this, the three models cannot be compared using the same metric directly.
model_roles <- data.frame(
Model = c("Classification",
"Regression",
"Clustering"),
Algorithm = c("Logistic Regression",
"Linear Regression (with interaction)",
"K-Means (k = 4)"),
Learning_Type = c("Supervised (binary)",
"Supervised (continuous)",
"Unsupervised"),
Target_Variable = c("churn (0/1)",
"totalcharges (USD)",
"None"),
Business_Question = c("WHO is likely to churn?",
"HOW MUCH is each customer worth?",
"WHAT kind of customer is each?"),
Output = c("Churn probability",
"Predicted total charges",
"Cluster label (1–4)")
)
model_roles## Model Algorithm Learning_Type
## 1 Classification Logistic Regression Supervised (binary)
## 2 Regression Linear Regression (with interaction) Supervised (continuous)
## 3 Clustering K-Means (k = 4) Unsupervised
## Target_Variable Business_Question Output
## 1 churn (0/1) WHO is likely to churn? Churn probability
## 2 totalcharges (USD) HOW MUCH is each customer worth? Predicted total charges
## 3 None WHAT kind of customer is each? Cluster label (1–4)
Even though the three models use different metrics, the main performance result for each model is shown together for an overall comparison. F1 score, R-squared, and silhouette width are shown as the main summary metrics for the three models, but they should not be directly compared because the models solve different tasks.
performance_summary <- data.frame(
Model = c("Classification (Logistic Regression)",
"Regression (Linear Regression with interaction)",
"Clustering (K-Means, k = 4)"),
Primary_Metric = c(paste0("F1 Score = ", round(f1_score, 3)),
paste0("R-squared = ", round(r_squared, 3)),
paste0("Silhouette = ", round(sil_overall, 3))),
Secondary_Metric = c(paste0("Accuracy = ", round(accuracy, 3)),
paste0("RMSE = $", round(rmse, 0)),
paste0("BSS/TSS = ", round(ratio*100, 1), "%")),
Interpretation = c("Model has limited ability to identify churners under class imbalance",
"Regression model explains most of the variance in totalcharges",
"Four customer segments were identified, but the silhouette score suggests some overlap")
)
performance_summary## Model Primary_Metric
## 1 Classification (Logistic Regression) F1 Score = 0.249
## 2 Regression (Linear Regression with interaction) R-squared = 0.965
## 3 Clustering (K-Means, k = 4) Silhouette = 0.144
## Secondary_Metric
## 1 Accuracy = 0.624
## 2 RMSE = $338
## 3 BSS/TSS = 29.7%
## Interpretation
## 1 Model has limited ability to identify churners under class imbalance
## 2 Regression model explains most of the variance in totalcharges
## 3 Four customer segments were identified, but the silhouette score suggests some overlap
The three-panel plot shows the output of each model in one view: the churn probability distribution from the classification model, the predicted-versus-actual plot from the regression model, and the cluster sizes from the K-means model.
library(gridExtra)
p1 <- data.frame(Prob = pred_prob) %>%
ggplot(aes(x = Prob)) +
geom_histogram(bins = 30, fill = "#E74C3C", alpha = 0.8) +
labs(title = "Classification Output",
subtitle = "Distribution of predicted churn probability",
x = "Churn Probability", y = "Customer Count") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey50"))
set.seed(42)
reg_sample <- data.frame(Actual = test_reg$totalcharges,
Predicted = pred_totalcharges) %>%
sample_n(5000)
p2 <- ggplot(reg_sample, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.2, colour = "#2C3E50", size = 0.6) +
geom_abline(slope = 1, intercept = 0, colour = "#E74C3C", linewidth = 1) +
labs(title = "Regression Output",
subtitle = "Predicted vs actual total charges",
x = "Actual (USD)", y = "Predicted (USD)") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey50"))
p3 <- as.data.frame(table(cluster_data$Cluster)) %>%
rename(Cluster = Var1, Count = Freq) %>%
ggplot(aes(x = Cluster, y = Count, fill = Cluster)) +
geom_bar(stat = "identity", width = 0.6) +
geom_text(aes(label = Count), vjust = -0.4, fontface = "bold", size = 3.5) +
scale_fill_manual(values = c("#E74C3C", "#2C3E50", "#3498DB", "#F39C12")) +
labs(title = "Clustering Output",
subtitle = "Customer count per cluster",
x = "Cluster", y = "Number of Customers") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey50"),
legend.position = "none")
grid.arrange(p1, p2, p3, ncol = 3)The classification and regression outputs are combined into a retention-priority score for every testing customer, while the clustering result is used to interpret the customer segment. This score answers an important business question: which customers should the retention team contact first?
The score was calculated using the same testing customers from the classification and regression models, so that each churn probability and predicted total charge refer to the same customer.
A customer is treated as higher priority when the customer has both a high churn probability and a high predicted total charge. The formula combines the classification probability and the regression prediction:
\[\text{Retention Priority} = P(\text{Churn}) \times \text{Predicted Total Charges}\]
The clustering result then suggests which retention strategy to use for each high-priority customer.
# Assign test customers to clusters
test_features <- test_reg %>%
select(tenure, monthlycharges, num_services, customer_satisfaction,
credit_score, num_complaints, late_payments) %>%
mutate(across(everything(), as.numeric)) %>%
as.matrix()
test_scaled <- scale(test_features,
center = attr(cluster_scaled, "scaled:center"),
scale = attr(cluster_scaled, "scaled:scale"))
assign_to_cluster <- function(scaled_data, centers) {
apply(scaled_data, 1, function(point) {
dists <- rowSums((centers - matrix(point,
nrow = nrow(centers),
ncol = length(point),
byrow = TRUE))^2)
which.min(dists)
})
}
test_clusters <- assign_to_cluster(test_scaled, km_model$centers)
# Create combined result table (using the final regression model's predictions)
combined_results <- data.frame(
Cluster = factor(test_clusters),
Churn_Probability = round(pred_prob, 4),
Predicted_TotalCharges = round(pred_totalcharges, 2),
Actual_TotalCharges = round(test_reg$totalcharges, 2),
Actual_Churn = test_reg$churn
)
# Calculate retention priority score
combined_results$Retention_Priority <- round(
combined_results$Churn_Probability * combined_results$Predicted_TotalCharges,
2
)
summary(combined_results$Retention_Priority)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.97 38.65 103.20 174.25 229.10 3885.69
# Top 10 highest-priority customers
top_priority <- combined_results %>%
arrange(desc(Retention_Priority)) %>%
head(10)
top_priority## Cluster Churn_Probability Predicted_TotalCharges Actual_TotalCharges
## 256008 1 0.3987 9745.90 10605.68
## 566831 1 0.5378 6808.99 7728.85
## 296576 1 0.5195 6537.42 6888.60
## 611640 4 0.4628 7159.88 7149.09
## 682415 1 0.3919 8412.76 9202.15
## 36107 4 0.3388 9345.60 6318.23
## 173046 4 0.3490 9042.90 8681.82
## 626011 1 0.4922 6328.20 6574.42
## 288400 4 0.3075 10055.47 11333.97
## 103548 1 0.3746 7916.84 8759.24
## Actual_Churn Retention_Priority
## 256008 1 3885.69
## 566831 1 3661.87
## 296576 0 3396.19
## 611640 0 3313.59
## 682415 0 3296.96
## 36107 0 3166.29
## 173046 0 3155.97
## 626011 1 3114.74
## 288400 0 3092.06
## 103548 0 2965.65
The boxplot shows the distribution of retention priority scores across the four customer clusters. Segments with higher overall priority scores may contain more urgent retention candidates because these customers combine higher churn risk with higher predicted customer value.
ggplot(combined_results,
aes(x = Cluster, y = Retention_Priority, fill = Cluster)) +
geom_boxplot(alpha = 0.8, outlier.size = 1.5) +
scale_fill_manual(values = c("#E74C3C", "#2C3E50", "#3498DB", "#F39C12")) +
scale_y_continuous(labels = comma_format()) +
labs(title = "Retention Priority Score by Customer Segment",
subtitle = "Higher boxes indicate segments with more urgent retention needs",
x = "Customer Cluster",
y = "Retention Priority Score") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey50"),
legend.position = "none")priority_by_cluster <- combined_results %>%
group_by(Cluster) %>%
summarise(
Customer_Count = n(),
Mean_Churn_Prob = round(mean(Churn_Probability), 4),
Mean_Predicted_Value = round(mean(Predicted_TotalCharges), 2),
Mean_Priority_Score = round(mean(Retention_Priority), 2),
Top_10_Pct_Threshold = round(quantile(Retention_Priority, 0.90), 2)
)
priority_by_cluster## # A tibble: 4 × 6
## Cluster Customer_Count Mean_Churn_Prob Mean_Predicted_Value
## <fct> <int> <dbl> <dbl>
## 1 1 49170 0.0906 4591.
## 2 2 68667 0.122 1219.
## 3 3 109513 0.0948 977.
## 4 4 72650 0.0919 1854.
## # ℹ 2 more variables: Mean_Priority_Score <dbl>, Top_10_Pct_Threshold <dbl>
The table below summarises the strengths, limitations, and possible future improvements for each model.
strengths_limits <- data.frame(
Model = c("Classification (Logistic / Tree / RF / XGBoost + SHAP)",
"Regression (Linear Regression with interaction)",
"Clustering (K-Means)"),
Strengths = c(
"Baseline interpretability (logistic) + stronger performance (RF/XGBoost) + explainability via SHAP",
"Easy to interpret and captures the combined effect of tenure and monthly charges",
"Finds customer segments without needing a target variable"
),
Limitations = c(
"More computation for advanced models; threshold tuning is needed for imbalanced churn",
"Still depends on a specified model form and may not capture all non-linear patterns",
"Requires choosing k and depends on feature scaling"
),
Future_Improvement = c(
"Use cross-validation and cost-sensitive threshold selection; explore additional features",
"Try another regression method and compare the error",
"Try other clustering methods such as PAM or Gaussian mixture models"
)
)
strengths_limits## Model
## 1 Classification (Logistic / Tree / RF / XGBoost + SHAP)
## 2 Regression (Linear Regression with interaction)
## 3 Clustering (K-Means)
## Strengths
## 1 Baseline interpretability (logistic) + stronger performance (RF/XGBoost) + explainability via SHAP
## 2 Easy to interpret and captures the combined effect of tenure and monthly charges
## 3 Finds customer segments without needing a target variable
## Limitations
## 1 More computation for advanced models; threshold tuning is needed for imbalanced churn
## 2 Still depends on a specified model form and may not capture all non-linear patterns
## 3 Requires choosing k and depends on feature scaling
## Future_Improvement
## 1 Use cross-validation and cost-sensitive threshold selection; explore additional features
## 2 Try another regression method and compare the error
## 3 Try other clustering methods such as PAM or Gaussian mixture models
Overall, the three models are useful in different ways. The classification model helps identify customers who may churn. The regression model helps estimate customer value. The clustering model helps group customers with similar behaviour. When used together, these models can support better customer retention decisions. However, the clustering result should be interpreted carefully because the silhouette score suggests that customer groups are not strongly separated.
Since the dataset is synthetic, the conclusions should be interpreted as an academic modelling exercise rather than direct evidence from a real telecom company.
For simplicity in this academic demonstration, preprocessing was conducted on the entire cleaned dataset. In a stricter production pipeline, statistics for missing value imputation and outlier capping should be derived only from the training set and then applied to the testing set to reduce data leakage.
This project demonstrated a complete customer churn analytics workflow using data cleaning, exploratory data analysis, classification, regression, and clustering techniques in R. The preprocessing stage improved dataset quality by handling missing values, converting data types, treating outliers, and engineering additional behavioural features.
The overall report is designed to answer the following three questions:
We trained multiple classification models on the same 70/30 split to predict churn based on tenure, monthly charges, total charges, customer satisfaction, number of complaints, service calls, late payments, credit score, and key categorical variables (contract type, payment method, and paperless billing).
In the exploration stage, Pearson correlation was used to screen linear trends and potential multicollinearity. In the modelling stage, SHAP was used to interpret XGBoost, providing global and local explanations and revealing non-linear and interaction effects that Pearson correlation can miss.
We used linear regression with an interaction term between tenure and monthly charges to estimate total charges. EDA confirmed a strong positive relationship between tenure and total charges, and the interaction term reflects how long-term customers with higher monthly plans accumulate substantially higher total charges. This supports customer value estimation and budgeting.
We applied clustering to segment customers based on tenure, monthly charges, number of services, satisfaction, credit score, complaints, and late payments. The resulting clusters can be used to tailor retention actions (e.g., different offers for high-value loyal customers vs price-sensitive or complaint-heavy segments).
Although the models achieved meaningful analytical results, several limitations remain. The dataset is synthetic and may not fully represent real-world telecom customer behaviour. In addition, clustering showed moderate separation between groups, meaning segments can overlap in practice. Future improvements could include cross-validation, more robust feature selection, and threshold selection based on the business cost of false positives vs false negatives.
Overall, the project shows how data science techniques can support customer retention decision-making by helping organisations identify high-risk customers, estimate customer value, and design targeted retention strategies more effectively.