1 Introduction

1.1 Background

Customer churn — the phenomenon where customers discontinue a service — is one of the most critical challenges facing telecommunications companies worldwide. Acquiring a new customer costs five to ten times more than retaining an existing one, making churn prediction a high-priority business problem. By identifying customers likely to leave before they actually do, companies can deploy targeted retention campaigns, personalise offers, and improve service quality.

Data-driven machine learning approaches enable companies to analyse behavioural and demographic patterns at scale and build predictive models that generalise across millions of customers.

1.2 Dataset

This project uses the Customer Churn Prediction Dataset (1M) published on Kaggle by isandeep06 (2025). The dataset is a large-scale synthetic telecom churn dataset modelled on real-world patterns, containing 1,000,000 customer records across multiple features capturing demographics, service subscriptions, contract details, and billing information.

Kaggle URL: https://www.kaggle.com/datasets/isandeep06/customer-churn-prediction-dataset-1m
Rows: 1,000,000
Columns: 32
Dimension: 1,000,000 × 32 = 32,000,000 data points (far exceeds the 100,000 requirement)

1.3 Research Objectives

Classification (Q1): Predict whether a customer will churn (Yes) or stay (No) based on customer tenure, monthly charges, total charges, customer satisfaction, number of complaints, service calls, late payments, and credit score.
Regression (Q2): Predict a customer’s total charges (totalcharges) based on customer tenure, monthly charges, their interaction effect, services, and other customer-related variables.
Clustering (Q3): Group customers into different segments based on customer tenure, monthly charges, number of services, customer satisfaction, credit score, number of complaints, and late payments.

2 Dataset Description

2.1 Load and Install Required Packages

library(data.table)
library(ggplot2)
library(dplyr)
library(corrplot)
library(scales)
library(gridExtra)
library(rpart)
library(rpart.plot)
library(randomForest)
library(cluster)
library(pROC)
library(PRROC)
library(Matrix)
library(xgboost)
library(shapviz)

2.2 Load the Dataset

# ── IMPORTANT: Change this path to where your CSV file is saved ──────────
# Example Windows: "C:/Users/YourName/Downloads/customer_churn_1m.csv"
# Example Mac:     "/Users/YourName/Downloads/customer_churn_1m.csv"

# data.table::fread() is much faster than read.csv() for large files
csv_path <- "customer_churn_1m.csv"
if (!file.exists(csv_path)) {
  message("CSV not found in the current folder: ", normalizePath(csv_path, mustWork = FALSE))
  message("Please choose the customer_churn_1m.csv file manually.")
  csv_path <- file.choose()
}

df_raw <- fread(csv_path, stringsAsFactors = FALSE)
cat("CSV path used:", csv_path, "\n")

## CSV path used: customer_churn_1m.csv

# Convert to standard data frame for compatibility with all functions
df_raw <- as.data.frame(df_raw)

cat("Dataset loaded successfully!\n")

## Dataset loaded successfully!

cat("Rows:     ", nrow(df_raw), "\n")

## Rows:      1000000

cat("Columns:  ", ncol(df_raw), "\n")

## Columns:   32

cat("Dimension:", nrow(df_raw) * ncol(df_raw), "\n")

## Dimension: 32000000

The dataset contains 1,000,000 customer records. In the working CSV file used for this project, there are 32 columns, including customer demographic variables, account information, service usage variables, behavioural risk variables, and the churn target variable.

2.3 Explore the Raw Data

# ── Data structure: column names and types ───────────────────────────────
str(df_raw)

## 'data.frame':    1000000 obs. of  32 variables:
##  $ customer_id                : chr  "CUST0000000001" "CUST0000000002" "CUST0000000003" "CUST0000000004" ...
##  $ signup_date                : POSIXct, format: "2022-12-12 12:53:58" "2022-01-13 12:53:58" ...
##  $ age                        : int  43 18 38 44 45 55 47 39 31 38 ...
##  $ gender                     : chr  "Female" "Male" "Female" "Male" ...
##  $ annual_income              : num  55085 60786 73184 40924 36401 ...
##  $ education                  : chr  "college" "master" "high_school" "high_school" ...
##  $ marital_status             : chr  "married" "married" "widowed" "married" ...
##  $ dependents                 : int  1 1 0 1 0 1 0 0 0 0 ...
##  $ tenure                     : int  2 22 3 6 9 47 26 18 1 31 ...
##  $ contract                   : chr  "two_year" "one_year" "two_year" "two_year" ...
##  $ payment_method             : chr  "electronic_check" "bank_transfer" "credit_card" "bank_transfer" ...
##  $ paperless_billing          : chr  "Yes" "Yes" "No" "Yes" ...
##  $ senior_citizen             : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ monthlycharges             : num  67.2 71.5 112.2 107.5 110 ...
##  $ totalcharges               : num  144 1602 329 643 649 ...
##  $ num_services               : int  1 2 4 3 3 2 4 3 4 1 ...
##  $ has_phone_service          : int  1 0 1 1 0 1 1 1 1 1 ...
##  $ has_internet_service       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_online_security        : int  0 0 0 0 0 0 1 1 1 0 ...
##  $ has_online_backup          : int  0 1 0 0 0 0 1 0 1 0 ...
##  $ has_device_protection      : int  1 0 1 1 0 0 0 0 1 0 ...
##  $ has_tech_support           : int  1 0 1 1 0 0 0 1 1 1 ...
##  $ has_streaming_tv           : int  1 0 1 0 0 1 1 1 0 0 ...
##  $ has_streaming_movies       : int  1 1 0 1 1 1 1 0 1 0 ...
##  $ customer_satisfaction      : num  9 7 6 5 8 8 4 5 4 5 ...
##  $ num_complaints             : num  0 0 1 2 1 0 2 2 NA 0 ...
##  $ num_service_calls          : int  0 3 1 2 1 0 3 4 0 1 ...
##  $ late_payments              : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ avg_monthly_gb             : num  109.6 63.2 47.8 50.8 16.7 ...
##  $ days_since_last_interaction: int  16 134 11 6 18 22 11 106 28 91 ...
##  $ credit_score               : num  NA 585 632 569 657 702 716 525 637 850 ...
##  $ churn                      : int  0 0 0 0 0 0 0 0 0 0 ...

# ── First 6 rows ──────────────────────────────────────────────────────────
head(df_raw, 6)

##      customer_id         signup_date age gender annual_income   education
## 1 CUST0000000001 2022-12-12 12:53:58  43 Female      55085.25     college
## 2 CUST0000000002 2022-01-13 12:53:58  18   Male      60786.11      master
## 3 CUST0000000003 2023-09-04 12:53:58  38 Female      73184.32 high_school
## 4 CUST0000000004 2022-06-27 12:53:58  44   Male      40923.78 high_school
## 5 CUST0000000005 2022-12-08 12:53:58  45 Female      36400.94    bachelor
## 6 CUST0000000006 2023-01-12 12:53:58  55 Female      30771.38      master
##   marital_status dependents tenure contract   payment_method paperless_billing
## 1        married          1      2 two_year electronic_check               Yes
## 2        married          1     22 one_year    bank_transfer               Yes
## 3        widowed          0      3 two_year      credit_card                No
## 4        married          1      6 two_year    bank_transfer               Yes
## 5         single          0      9 two_year      credit_card               Yes
## 6        married          1     47 two_year      credit_card               Yes
##   senior_citizen monthlycharges totalcharges num_services has_phone_service
## 1              0          67.20       144.39            1                 1
## 2              0          71.54      1602.22            2                 0
## 3              0         112.20       328.81            4                 1
## 4              0         107.49       643.45            3                 1
## 5              0         110.05       648.79            3                 0
## 6              0          71.80      3041.87            2                 1
##   has_internet_service has_online_security has_online_backup
## 1                    1                   0                 0
## 2                    1                   0                 1
## 3                    1                   0                 0
## 4                    1                   0                 0
## 5                    1                   0                 0
## 6                    1                   0                 0
##   has_device_protection has_tech_support has_streaming_tv has_streaming_movies
## 1                     1                1                1                    1
## 2                     0                0                0                    1
## 3                     1                1                1                    0
## 4                     1                1                0                    1
## 5                     0                0                0                    1
## 6                     0                0                1                    1
##   customer_satisfaction num_complaints num_service_calls late_payments
## 1                     9              0                 0             0
## 2                     7              0                 3             1
## 3                     6              1                 1             0
## 4                     5              2                 2             1
## 5                     8              1                 1             0
## 6                     8              0                 0             0
##   avg_monthly_gb days_since_last_interaction credit_score churn
## 1         109.63                          16           NA     0
## 2          63.25                         134          585     0
## 3          47.77                          11          632     0
## 4          50.82                           6          569     0
## 5          16.74                          18          657     0
## 6          24.56                          22          702     0

# ── Statistical summary ───────────────────────────────────────────────────
summary(df_raw)

##     customer_id       signup_date                       age       
##  Length   :1000000   Min.   :2021-01-13 12:53:58   Min.   :18.00  
##  N.unique :1000000   1st Qu.:2022-04-14 12:54:00   1st Qu.:34.00  
##  N.blank  :      0   Median :2023-07-14 12:54:09   Median :44.00  
##  Min.nchar:     14   Mean   :2023-07-14 13:39:07   Mean   :44.72  
##  Max.nchar:     24   3rd Qu.:2024-10-12 12:54:09   3rd Qu.:55.00  
##                      Max.   :2026-01-11 12:54:09   Max.   :90.00  
##                                                                   
##        gender        annual_income        education         marital_status   
##  Length   :1000000   Min.   : 20000   Length   :1000000   Length   :1000000  
##  N.unique :      3   1st Qu.: 32713   N.unique :      5   N.unique :      4  
##  N.blank  :      0   Median : 48955   N.blank  :      0   N.blank  :      0  
##  Min.nchar:      4   Mean   : 58788   Min.nchar:      3   Min.nchar:      6  
##  Max.nchar:      6   3rd Qu.: 73475   Max.nchar:     11   Max.nchar:      8  
##                      Max.   :250000                                          
##                      NAs    :29959                                           
##    dependents         tenure           contract         payment_method   
##  Min.   :0.0000   Min.   : 1.00   Length   :1000000   Length   :1000000  
##  1st Qu.:0.0000   1st Qu.: 6.00   N.unique :      3   N.unique :      4  
##  Median :1.0000   Median :16.00   N.blank  :      0   N.blank  :      0  
##  Mean   :0.7998   Mean   :22.38   Min.nchar:      8   Min.nchar:     11  
##  3rd Qu.:1.0000   3rd Qu.:33.00   Max.nchar:     14   Max.nchar:     16  
##  Max.   :5.0000   Max.   :72.00                                          
##                                                                          
##  paperless_billing   senior_citizen   monthlycharges    totalcharges    
##  Length   :1000000   Min.   :0.0000   Min.   : 20.00   Min.   :   16.3  
##  N.unique :      2   1st Qu.:0.0000   1st Qu.: 70.49   1st Qu.:  484.5  
##  N.blank  :      0   Median :0.0000   Median : 85.48   Median : 1249.8  
##  Min.nchar:      2   Mean   :0.1995   Mean   : 86.44   Mean   : 1837.3  
##  Max.nchar:      3   3rd Qu.:0.0000   3rd Qu.:100.68   3rd Qu.: 2617.7  
##                      Max.   :1.0000   Max.   :854.96   Max.   :16252.9  
##                                                                         
##   num_services   has_phone_service has_internet_service has_online_security
##  Min.   :1.000   Min.   :0.0000    Min.   :0.0000       Min.   :0.0000     
##  1st Qu.:1.000   1st Qu.:1.0000    1st Qu.:1.0000       1st Qu.:0.0000     
##  Median :2.000   Median :1.0000    Median :1.0000       Median :0.0000     
##  Mean   :2.564   Mean   :0.7696    Mean   :0.8497       Mean   :0.3395     
##  3rd Qu.:3.000   3rd Qu.:1.0000    3rd Qu.:1.0000       3rd Qu.:1.0000     
##  Max.   :6.000   Max.   :1.0000    Max.   :1.0000       Max.   :1.0000     
##                                                                            
##  has_online_backup has_device_protection has_tech_support has_streaming_tv
##  Min.   :0.0000    Min.   :0.0000        Min.   :0.0000   Min.   :0.0     
##  1st Qu.:0.0000    1st Qu.:0.0000        1st Qu.:0.0000   1st Qu.:0.0     
##  Median :0.0000    Median :0.0000        Median :0.0000   Median :1.0     
##  Mean   :0.4257    Mean   :0.2971        Mean   :0.4973   Mean   :0.6     
##  3rd Qu.:1.0000    3rd Qu.:1.0000        3rd Qu.:1.0000   3rd Qu.:1.0     
##  Max.   :1.0000    Max.   :1.0000        Max.   :1.0000   Max.   :1.0     
##                                                                           
##  has_streaming_movies customer_satisfaction num_complaints   num_service_calls
##  Min.   :0.0000       Min.   :1.000         Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:0.0000       1st Qu.:4.000         1st Qu.:0.0000   1st Qu.: 1.00    
##  Median :1.0000       Median :7.000         Median :1.0000   Median : 1.00    
##  Mean   :0.5491       Mean   :6.159         Mean   :0.7017   Mean   : 1.76    
##  3rd Qu.:1.0000       3rd Qu.:8.000         3rd Qu.:1.0000   3rd Qu.: 3.00    
##  Max.   :1.0000       Max.   :9.000         Max.   :7.0000   Max.   :12.00    
##                       NAs    :19921         NAs    :29906                     
##  late_payments   avg_monthly_gb   days_since_last_interaction  credit_score  
##  Min.   :0.000   Min.   :  0.00   Min.   :  1.00              Min.   :300.0  
##  1st Qu.:0.000   1st Qu.: 12.84   1st Qu.: 12.00              1st Qu.:619.0  
##  Median :0.000   Median : 27.77   Median : 31.00              Median :680.0  
##  Mean   :0.401   Mean   : 39.10   Mean   : 44.49              Mean   :678.6  
##  3rd Qu.:1.000   3rd Qu.: 51.23   3rd Qu.: 62.00              3rd Qu.:740.0  
##  Max.   :5.000   Max.   :557.82   Max.   :365.00              Max.   :850.0  
##                  NAs    :50012                                NAs    :40395  
##      churn        
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.09923  
##  3rd Qu.:0.00000  
##  Max.   :1.00000  
##

The summary statistics show that the numeric variables are generally within reasonable ranges. For example, customer age is within the expected adult range, and tenure ranges from new customers to long-term customers. Monthly charges and total charges also show large variation, which is reasonable for telecom customers with different contract types and service usage. Overall, there are no obvious extreme data-entry errors based on the summary output.

# ── Missing values per column ─────────────────────────────────────────────
missing_counts <- colSums(is.na(df_raw))
missing_pct    <- round(missing_counts / nrow(df_raw) * 100, 2)
missing_df     <- data.frame(Column = names(missing_counts),
                              Missing_Count = missing_counts,
                              Missing_Pct   = missing_pct)
missing_df <- missing_df[missing_df$Missing_Count > 0, ]

if (nrow(missing_df) == 0) {
  cat("No missing values found in any column.\n")
} else {
  print(missing_df)
}

##                                      Column Missing_Count Missing_Pct
## annual_income                 annual_income         29959        3.00
## customer_satisfaction customer_satisfaction         19921        1.99
## num_complaints               num_complaints         29906        2.99
## avg_monthly_gb               avg_monthly_gb         50012        5.00
## credit_score                   credit_score         40395        4.04

The missing value check shows that a few variables contain missing values, including annual_income, customer_satisfaction, num_complaints, avg_monthly_gb, and credit_score. These missing values are not removed in this section. They are handled later in the data cleaning and processing section.

# ── Target variable distribution (Churn) ─────────────────────────────────
churn_table <- table(df_raw$churn)
churn_pct   <- round(prop.table(churn_table) * 100, 2)

cat("Churn Distribution:\n")

## Churn Distribution:

print(churn_table)

## 
##      0      1 
## 900773  99227

cat("\nChurn Percentage:\n")

## 
## Churn Percentage:

print(churn_pct)

## 
##     0     1 
## 90.08  9.92

The churn distribution shows that the dataset is imbalanced, with fewer churned customers than retained customers. This is important because model accuracy alone may not be enough for evaluating the classification model. Therefore, other metrics such as precision, recall, and F1-score are also considered later.

2.4 Column Descriptions

The dataset contains the following 32 columns:

#	Column	Type	Description
1	customer_id	Character	Unique customer identifier
2	signup_date	Character	Customer signup date
3	age	Numeric	Customer age
4	gender	Character	Customer gender
5	annual_income	Numeric	Annual customer income
6	education	Character	Education level
7	marital_status	Character	Marital status
8	dependents	Character	Whether customer has dependents
9	tenure	Numeric	Months as a customer
10	contract	Character	Contract type
11	payment_method	Character	Payment method
12	paperless_billing	Character	Whether customer uses paperless billing
13	senior_citizen	Character	Senior citizen status
14	monthlycharges	Numeric	Monthly charges
15	totalcharges	Numeric	Total accumulated charges (Regression target)
16	num_services	Numeric	Number of subscribed services
17	has_phone_service	Character	Phone service subscription
18	has_internet_service	Character	Internet service subscription
19	has_online_security	Character	Online security subscription
20	has_online_backup	Character	Online backup subscription
21	has_device_protection	Character	Device protection subscription
22	has_tech_support	Character	Tech support subscription
23	has_streaming_tv	Character	Streaming TV subscription
24	has_streaming_movies	Character	Streaming movie subscription
25	customer_satisfaction	Numeric	Customer satisfaction score
26	num_complaints	Numeric	Number of complaints
27	num_service_calls	Numeric	Number of service calls
28	late_payments	Numeric	Number of late payments
29	avg_monthly_gb	Numeric	Average monthly internet usage (GB)
30	days_since_last_interaction	Numeric	Days since last interaction
31	credit_score	Numeric	Customer credit score
32	churn	Character	Churn status (Classification target)

Overall, this dataset is suitable for this project because it supports three types of data mining tasks. First, the churn variable can be used for classification to predict whether a customer will leave or stay. Second, totalcharges is a continuous numeric variable, so it can be used for regression. Third, customer behaviour and account variables such as tenure, monthlycharges, num_services, customer_satisfaction, num_complaints, and late_payments can be used for clustering to group similar customers. Therefore, the dataset matches the objectives of classification, regression, and clustering in this project.

Since this is a synthetic dataset, the findings should be interpreted as an academic simulation rather than direct evidence from a real telecom company.

3 Data Cleaning

Data cleaning is applied in a structured sequence of steps to produce a reliable, analysis-ready dataset.

3.1 Remove customer_id

# ── Make a working copy ───────────────────────────────────────────────────
df <- df_raw

# ── Remove customer_id: unique identifier with no predictive value ───────
df$customer_id <- NULL
df$signup_date <- NULL
cat("customer_id removed. Columns remaining:", ncol(df), "\n")

## customer_id removed. Columns remaining: 30

3.2 Handle Missing Values

## Handle missing values
# Mode function for categorical variables
get_mode <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# Numeric columns: impute with median
numeric_cols <- c(
  "age", "annual_income", "tenure", "monthlycharges", "totalcharges",
  "num_services", "customer_satisfaction", "num_complaints",
  "num_service_calls", "late_payments", "avg_monthly_gb",
  "days_since_last_interaction", "credit_score"
)

for (col in numeric_cols) {
  if (col %in% colnames(df) && sum(is.na(df[[col]])) > 0) {
    df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
  }
}

# Categorical columns: impute with mode
factor_cols <- c(
  "gender", "education", "marital_status", "dependents",
  "contract", "payment_method", "paperless_billing", "senior_citizen",
  "has_phone_service", "has_internet_service",
  "has_online_security", "has_online_backup",
  "has_device_protection", "has_tech_support",
  "has_streaming_tv", "has_streaming_movies", "churn"
)

for (col in factor_cols) {
  if (col %in% colnames(df) && sum(is.na(df[[col]])) > 0) {
    df[[col]][is.na(df[[col]])] <- get_mode(df[[col]])
  }
}

cat("Missing values handled.\n")

## Missing values handled.

3.3 Convert Categorical Variables to Factors

## Convert all variables to correct types

factor_cols <- c(
  "gender", "education", "marital_status", "dependents",
  "contract", "payment_method", "paperless_billing", "senior_citizen",
  "has_phone_service", "has_internet_service",
  "has_online_security", "has_online_backup",
  "has_device_protection", "has_tech_support",
  "has_streaming_tv", "has_streaming_movies", "churn"
)

numeric_cols <- c(
  "age", "annual_income", "tenure", "monthlycharges", "totalcharges",
  "num_services", "customer_satisfaction", "num_complaints",
  "num_service_calls", "late_payments", "avg_monthly_gb",
  "days_since_last_interaction", "credit_score"
)

# Convert to factor
for (col in factor_cols) {
  if (col %in% colnames(df)) df[[col]] <- as.factor(df[[col]])
}

# Convert to numeric
for (col in numeric_cols) {
  if (col %in% colnames(df)) {
    df[[col]] <- as.numeric(as.character(df[[col]]))
    df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
  }
}

cat("All variable types converted successfully.\n")

## All variable types converted successfully.

3.4 Detect and Cap Outliers

##Treat outliers

cap_outliers <- function(x) {
  if (!is.numeric(x)) return(x)
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  if (iqr == 0) return(x)
  pmax(pmin(x, q[2] + 1.5*iqr), q[1] - 1.5*iqr)
}

outlier_vars <- c("monthlycharges", "annual_income")
for (col in outlier_vars) {
  if (col %in% colnames(df)) df[[col]] <- cap_outliers(df[[col]])
}

cat("Outliers treated.\n")

## Outliers treated.

3.5 Save Cleaned Dataset

# Save cleaned data to CSV for later use
write.csv(df, "cleaned_customer_data.csv", row.names = FALSE)

# Verify save
cat("Cleaned dataset saved successfully.\n")

## Cleaned dataset saved successfully.

cat("File: cleaned_customer_data.csv\n")

## File: cleaned_customer_data.csv

cat("Rows saved:", nrow(df), "\n")

## Rows saved: 1000000

cat("Columns saved:", ncol(df), "\n")

## Columns saved: 30

# Final cleaned dataset ready for EDA and modelling
df_clean <- df
cat("df_clean is ready for EDA and modelling.\n")

## df_clean is ready for EDA and modelling.

4 Exploratory Data Analysis

In this section, we explore the cleaned dataset to understand the patterns in customer behaviour and identify which variables are likely to be useful for our research questions on churn prediction (Q1), total charges prediction (Q2), and customer segmentation (Q3).

4.1 Churn Distribution

Before anything else, we need to understand how balanced our target variable is. If the data is heavily imbalanced, the classification model may be biased.

# —— Bar chart: how many customers churned vs stayed ——————————————————————
churn_counts <- data.frame(
  Status = c("Not Churn", "Churn"),
  Count  = c(sum(as.character(df_clean$churn) == "0"),
             sum(as.character(df_clean$churn) == "1"))
) %>%
  mutate(Pct = round(Count / sum(Count) * 100, 1))

ggplot(churn_counts, aes(x = Status, y = Count, fill = Status)) +
  geom_bar(stat = "identity", width = 0.5) +
  geom_text(aes(label = paste0(Pct, "%")),
            vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("Churn"     = "#E74C3C",
                               "Not Churn" = "#2C3E50")) +
  scale_y_continuous(labels = comma) +
  labs(title = "Customer Churn Distribution",
       x = "Status", y = "Number of Customers") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

From the chart, only 9.9% of customers churned while 90.1% stayed. This means the dataset is quite imbalanced. When we build the classification model later, we will need to handle this imbalance so the model does not just predict “Not Churn” for everyone.

4.2 Churn Rate by Contract Type

Contract type is one of the most commonly discussed factors in churn analysis. Here we check whether customers on different contract types have different churn rates.

# —— Stacked bar: churn rate by contract type —————————————————————————————
df_clean %>%
  mutate(Churn_Label = ifelse(as.character(churn) == "1",
                              "Churn", "Not Churn")) %>%
  count(contract, Churn_Label) %>%
  ggplot(aes(x = contract, y = n, fill = Churn_Label)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = percent) +
  scale_fill_manual(values = c("Churn"     = "#E74C3C",
                               "Not Churn" = "#2C3E50")) +
  labs(title = "Churn Rate by Contract Type",
       x = "Contract Type", y = "Proportion", fill = "Status") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Customers on month-to-month contracts have a clearly higher churn rate compared to those on one-year or two-year contracts. This makes sense because short-term contracts give customers more flexibility to leave. This variable is likely to be an important predictor in our classification model.

4.3 Monthly Charges and Tenure by Churn Status

We look at two important numerical variables together — monthly charges and tenure — to see how they differ between customers who churned and those who did not.

# —— Left plot: monthly charges by churn ——————————————————————————————————
p1 <- df_clean %>%
  mutate(Churn_Label = ifelse(as.character(churn) == "1",
                              "Churn", "Not Churn")) %>%
  ggplot(aes(x = Churn_Label, y = monthlycharges, fill = Churn_Label)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
  scale_fill_manual(values = c("Churn"     = "#E74C3C",
                               "Not Churn" = "#2C3E50")) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Monthly Charges",
       x = "", y = "Monthly Charges (USD)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

# —— Right plot: tenure by churn ——————————————————————————————————————————
p2 <- df_clean %>%
  mutate(Churn_Label = ifelse(as.character(churn) == "1",
                              "Churn", "Not Churn")) %>%
  ggplot(aes(x = Churn_Label, y = tenure, fill = Churn_Label)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
  scale_fill_manual(values = c("Churn"     = "#E74C3C",
                               "Not Churn" = "#2C3E50")) +
  labs(title = "Tenure",
       x = "", y = "Tenure (months)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

# —— Arrange side by side —————————————————————————————————————————————————
grid.arrange(p1, p2, ncol = 2)

From the boxplots, both monthlycharges and tenure show very similar distributions between churned and retained customers, with median differences of less than $1 and approximately 1 month respectively. This is consistent with their near-zero Pearson correlation with churn (r < 0.02), confirming that these two variables have limited standalone predictive power for churn (Q1).

However, they are retained in the classification model because they may contribute through combined effects with other variables, such as contract type or number of services. More importantly, tenure is the strongest predictor of totalcharges in our regression model (Q2), with a Pearson r of 0.93, as confirmed in the scatter plot below. The variables with the strongest direct relationship to churn — including customer_satisfaction, num_complaints, num_service_calls, and late_payments — are explored in the next section.

4.4 Key Churn Predictors: Customer Experience Variables

The Pearson correlation analysis (Section 4.6) identifies customer_satisfaction, num_complaints, num_service_calls, and late_payments as the four numerical variables with the strongest linear association with churn (r = −0.085, +0.080, +0.077, +0.048 respectively). This section visualises their distributions across churned and retained customers to support the findings.

# —— Four key churn predictors side by side ———————————————————————————————
df_exp <- df_clean %>%
  mutate(Churn_Label = ifelse(as.character(churn) == "1",
                              "Churn", "Not Churn"))

p_sat <- ggplot(df_exp,
                aes(x = Churn_Label, y = customer_satisfaction,
                    fill = Churn_Label)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
  scale_fill_manual(values = c("Churn" = "#E74C3C",
                               "Not Churn" = "#2C3E50")) +
  labs(title = "Customer Satisfaction",
       x = "", y = "Satisfaction Score") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

p_comp <- ggplot(df_exp,
                 aes(x = Churn_Label, y = num_complaints,
                     fill = Churn_Label)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
  scale_fill_manual(values = c("Churn" = "#E74C3C",
                               "Not Churn" = "#2C3E50")) +
  labs(title = "Number of Complaints",
       x = "", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

p_calls <- ggplot(df_exp,
                  aes(x = Churn_Label, y = num_service_calls,
                      fill = Churn_Label)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
  scale_fill_manual(values = c("Churn" = "#E74C3C",
                               "Not Churn" = "#2C3E50")) +
  labs(title = "Service Calls",
       x = "", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

p_late <- ggplot(df_exp,
                 aes(x = Churn_Label, y = late_payments,
                     fill = Churn_Label)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1) +
  scale_fill_manual(values = c("Churn" = "#E74C3C",
                               "Not Churn" = "#2C3E50")) +
  labs(title = "Late Payments",
       x = "", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

grid.arrange(p_sat, p_comp, p_calls, p_late, ncol = 4)

Churned customers show slightly lower satisfaction scores and slightly higher complaint counts, service call counts, and late payment counts compared to retained customers. Although the differences appear moderate — consistent with the synthetic nature of the dataset where most variables were generated with low correlation to churn — these four variables represent the strongest available linear signals for churn prediction. They are therefore prioritised as key input features in the classification model (Q1).

4.5 Tenure vs Total Charges (Regression Preview)

This chart is specifically to support our regression question (Q2). We want to see whether there is a clear linear relationship between tenure and total charges, since this is what the regression model will try to predict.

# —— Scatter plot: tenure vs totalcharges with regression line ————————————
# Using a sample of 10,000 rows to keep the plot readable
set.seed(42)
df_sample <- df_clean %>% sample_n(10000)

ggplot(df_sample, aes(x = tenure, y = totalcharges)) +
  geom_point(alpha = 0.2, size = 0.8, colour = "#2C3E50") +
  geom_smooth(method = "lm", colour = "#E74C3C",
              se = TRUE, linewidth = 1) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Tenure vs Total Charges",
       subtitle = "Based on a random sample of 10,000 customers",
       x = "Tenure (months)", y = "Total Charges (USD)") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(colour = "grey50"))

There is a clear positive linear relationship between tenure and total charges — the longer a customer has been subscribed, the higher their total charges. This is expected because total charges accumulate over time. The regression model should be able to use tenure as a strong predictor of total charges.

4.6 Correlation Between Numerical Variables

Finally, we look at the correlation between all numerical variables to get an overall picture of how they relate to each other. This is useful for both models.

# —— Select numerical columns and compute correlation matrix ———————————————
num_vars <- df_clean %>%
  select(age, annual_income, tenure, monthlycharges, totalcharges,
         num_services, num_complaints, num_service_calls,
         late_payments, customer_satisfaction, credit_score,
         avg_monthly_gb, days_since_last_interaction) %>%
  mutate(across(everything(), as.numeric))

cor_matrix <- cor(num_vars, use = "complete.obs", method = "pearson")

corrplot(cor_matrix,
         method      = "color",
         type        = "upper",
         tl.cex      = 0.85,
         tl.col      = "black",
         addCoef.col = "black",
         number.cex  = 0.60,
         col         = colorRampPalette(
                         c("#E74C3C", "white", "#2C3E50"))(200),
         title       = "Correlation Heatmap of Numerical Variables",
         mar         = c(0, 0, 2, 0))

From the heatmap, most variables have very little correlation with each other, with values close to 0. This is consistent with the synthetic nature of the dataset, where most variables were generated independently.

The strongest relationship is between tenure and totalcharges (r = 0.91), which is mathematically expected since total charges accumulate over time as a function of tenure and monthly charges. Two additional notable correlations were also found: num_complaints and num_service_calls (r = 0.44), suggesting that dissatisfied customers tend to both complain and contact support more frequently; and monthlycharges and num_services (r = 0.49), which reflects the logical pricing relationship where subscribing to more services leads to higher monthly bills.

Because totalcharges is closely related to tenure and monthlycharges, the regression model is expected to achieve a high R-squared. This relationship is explored further in the regression section.

4.7 Pearson Correlation Analysis (Exploration Stage)

Summary:
Pearson correlation is used to (1) visualise feature correlations, (2) screen potential multicollinearity (highly redundant features), and (3) observe the initial linear trend between numeric features and churn.

Explanation:
Pearson correlation captures linear relationships only. In later sections, we use XGBoost + SHAP to detect non-linear patterns and interactions that Pearson correlation may miss.

4.7.1 1) Screen multicollinearity and remove highly redundant numeric features

find_redundant_by_correlation <- function(cor_mat, cutoff = 0.90) {
  # A simple iterative filter:
  # repeatedly remove the variable with the highest average absolute correlation
  cor_abs <- abs(cor_mat)
  diag(cor_abs) <- 0

  to_remove <- character(0)
  keep <- colnames(cor_abs)

  while (TRUE) {
    sub <- cor_abs[keep, keep, drop = FALSE]
    if (max(sub, na.rm = TRUE) < cutoff) break

    mean_corr <- colMeans(sub, na.rm = TRUE)
    remove_var <- names(which.max(mean_corr))
    to_remove <- c(to_remove, remove_var)
    keep <- setdiff(keep, remove_var)
  }

  list(keep = keep, remove = unique(to_remove))
}

redundancy <- find_redundant_by_correlation(cor_matrix, cutoff = 0.90)
redundant_features <- redundancy$remove

cat("Highly redundant numeric features (|r| >= 0.90):\n")

## Highly redundant numeric features (|r| >= 0.90):

print(redundant_features)

## [1] "totalcharges"

# Reduced numeric set (can be used for some models to reduce redundancy)
num_vars_reduced <- num_vars[, redundancy$keep, drop = FALSE]
cat("Original numeric features:", ncol(num_vars), "\n")

## Original numeric features: 13

cat("Reduced numeric features: ", ncol(num_vars_reduced), "\n")

## Reduced numeric features:  12

Note: tenure and totalcharges are expected to be highly correlated because total charges accumulate over time. For Q2 regression, we keep them (and even model the interaction with monthly charges). For Q1 classification, we keep the required variables but also report redundancy so the model interpretation remains cautious. For Q3 clustering, the numeric variables identified here are used as the input features after standardisation.

4.7.2 2) Initial linear trend between features and churn (Pearson, point-biserial)

# churn is coded as 0/1 in the dataset; convert to numeric for correlation
churn_num <- as.numeric(as.character(df_clean$churn))

corr_with_churn <- sapply(num_vars, function(x) {
  suppressWarnings(cor(x, churn_num, use = "complete.obs", method = "pearson"))
})

corr_df <- data.frame(
  Feature = names(corr_with_churn),
  Pearson_r = as.numeric(corr_with_churn)
) %>%
  arrange(desc(abs(Pearson_r)))

head(corr_df, 12)

##                        Feature     Pearson_r
## 1        customer_satisfaction -0.0838270226
## 2               num_complaints  0.0786322707
## 3            num_service_calls  0.0772835977
## 4                late_payments  0.0477047675
## 5                 num_services -0.0333776309
## 6               monthlycharges -0.0171147524
## 7                 totalcharges -0.0165011945
## 8                       tenure -0.0127086881
## 9               avg_monthly_gb -0.0060656635
## 10               annual_income  0.0004749200
## 11 days_since_last_interaction  0.0003913459
## 12                credit_score -0.0003690728

ggplot(corr_df, aes(x = reorder(Feature, abs(Pearson_r)),
                    y = Pearson_r,
                    fill = Pearson_r > 0)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(
    values = c("TRUE"  = "#E74C3C",
               "FALSE" = "#2C3E50"),
    labels = c("Negative (protective — higher value → lower churn)",
               "Positive (risk factor — higher value → higher churn)"),
    name = "Direction"
  ) +
  labs(
    title    = "Pearson Correlation with Churn (Linear Trend Only)",
    subtitle = "Red = risk factor; Dark = protective factor (linear association only)",
    x = "Feature", y = "Pearson r (with churn coded as 0/1)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(colour = "grey50"))

4.8 EDA Summary Aligned with the Three Project Objectives

Q1 Classification (Churn):
The target is imbalanced (minority churn class). EDA suggests that churn is not explained by a single variable alone; instead, it likely depends on combined effects of customer experience (satisfaction, complaints, service calls), billing behaviour (late payments), and contract/payment choices. Pearson correlation provides only an initial linear signal, so tree-based models and SHAP are later used to uncover non-linear effects.

Q2 Regression (Total Charges):
EDA confirms a strong linear relationship between tenure and totalcharges, supporting the linear regression approach and motivating an interaction term with monthlycharges.

Q3 Clustering (Segmentation):
The distribution and scales of tenure, charges, service counts, and experience variables motivate standardisation before clustering, and the feature set used later reflects the behavioural drivers identified in EDA.

5 Classification Model: Logistic Regression

5.1 Define Classification Objective

Summary:: This section focuses on predicting whether a customer will churn or not. Since the target variable churn has two possible outcomes, this is a binary classification problem.

Explanation:
The target variable is churn, where 0 represents customers who did not churn and 1 represents customers who churned. Logistic regression was selected as the baseline classification model because it is suitable for binary classification and easy to interpret. The model uses both numerical and categorical customer-related variables, including tenure, monthly charges, total charges, customer satisfaction, number of complaints, service calls, late payments, credit score, contract type, payment method, and paperless billing.

5.2 Prepare Data and Split Dataset

Summary:
Before building the model, the cleaned dataset df_clean was used. The target variable and selected categorical predictors were converted into factor variables, and the dataset was split into 70% training data and 30% testing data.

Explanation:
The training data was used to build the logistic regression model, while the testing data was used to evaluate the model performance on unseen data. Categorical variables such as contract type, payment method, and paperless billing were converted into factor variables so that they could be properly included in the logistic regression model.

df_clean$churn <- as.factor(df_clean$churn)

df_clean$contract <- as.factor(df_clean$contract)
df_clean$payment_method <- as.factor(df_clean$payment_method)
df_clean$paperless_billing <- as.factor(df_clean$paperless_billing)

set.seed(123)

train_index <- sample(1:nrow(df_clean), size = 0.7 * nrow(df_clean))

train_data <- df_clean[train_index, ]
test_data <- df_clean[-train_index, ]

table(df_clean$churn)

## 
##      0      1 
## 900773  99227

prop.table(table(df_clean$churn))

## 
##        0        1 
## 0.900773 0.099227

The churn distribution shows that most customers belong to the non-churn group, while only a smaller proportion of customers churned. This indicates that the dataset is imbalanced. Therefore, the model evaluation should not rely only on accuracy.

5.3 Build Logistic Regression Model

Summary:
A logistic regression model was built using both numerical and categorical predictors. These variables are related to customer usage, payment behaviour, customer experience, and service-related characteristics.

Explanation:
The logistic regression model estimates the probability that a customer will churn. The selected predictors include tenure, monthly charges, total charges, customer satisfaction, number of complaints, number of service calls, late payments, credit score, contract type, payment method, and paperless billing. Including categorical predictors makes the model more consistent with the EDA findings, especially because contract type is strongly related to customer churn.

logit_model <- glm(
  churn ~ tenure + monthlycharges + totalcharges + customer_satisfaction +
    num_complaints + num_service_calls + late_payments + credit_score +
    contract + payment_method + paperless_billing,
  data = train_data,
  family = binomial
)

summary(logit_model)

## 
## Call:
## glm(formula = churn ~ tenure + monthlycharges + totalcharges + 
##     customer_satisfaction + num_complaints + num_service_calls + 
##     late_payments + credit_score + contract + payment_method + 
##     paperless_billing, family = binomial, data = train_data)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -4.776e-01  4.577e-02 -10.435   <2e-16 ***
## tenure                         -1.465e-03  6.479e-04  -2.262   0.0237 *  
## monthlycharges                 -2.505e-03  2.392e-04 -10.473   <2e-16 ***
## totalcharges                   -9.659e-06  7.585e-06  -1.274   0.2028    
## customer_satisfaction          -1.200e-01  1.695e-03 -70.808   <2e-16 ***
## num_complaints                  2.083e-01  5.084e-03  40.965   <2e-16 ***
## num_service_calls               1.092e-01  2.854e-03  38.248   <2e-16 ***
## late_payments                   2.385e-01  5.929e-03  40.216   <2e-16 ***
## credit_score                    1.266e-05  4.752e-05   0.266   0.7899    
## contractone_year               -9.484e-01  2.016e-02 -47.046   <2e-16 ***
## contracttwo_year               -1.838e+00  2.111e-02 -87.087   <2e-16 ***
## payment_methodcredit_card      -5.607e-04  1.069e-02  -0.052   0.9582    
## payment_methodelectronic_check  5.874e-03  1.153e-02   0.510   0.6104    
## payment_methodmailed_check     -6.981e-04  1.333e-02  -0.052   0.9582    
## paperless_billingYes           -1.951e-02  9.381e-03  -2.080   0.0375 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 453250  on 699999  degrees of freedom
## Residual deviance: 427809  on 699985  degrees of freedom
## AIC: 427839
## 
## Number of Fisher Scoring iterations: 5

5.4 Predict Customer Churn

Summary: After building the model, the churn probability was predicted for each customer in the testing dataset. Since the churn class is relatively small, a threshold of 0.10 was used instead of the default 0.50.

Explanation:
A lower threshold makes the model more sensitive to potential churn customers. Customers with a predicted churn probability of 0.10 or above were classified as churn. This helps reduce the issue where the model predicts almost all customers as non-churn, which can happen when the dataset is imbalanced.

pred_prob <- predict(logit_model, newdata = test_data, type = "response")

threshold <- 0.10
pred_class <- ifelse(pred_prob >= threshold, "1", "0")
pred_class <- factor(pred_class, levels = levels(test_data$churn))

conf_matrix <- table(
  Actual = test_data$churn,
  Predicted = pred_class
)

conf_matrix

##       Predicted
## Actual      0      1
##      0 168407 101942
##      1  10903  18748

5.5 Classification Model Evaluation

Summary: The model was evaluated using a confusion matrix, accuracy, precision, recall, specificity, balanced accuracy, and F1 score. Recall and F1 score are especially important because churn customers are the minority class.

Explanation:
Accuracy measures the overall proportion of correct predictions. Precision shows how many customers predicted as churn actually churned. Recall shows how many actual churn customers were successfully detected. Specificity measures how well the model identifies non-churn customers. Balanced accuracy considers both recall and specificity, which is useful for imbalanced classification problems. F1 score combines precision and recall into one measure.

TN <- conf_matrix["0", "0"]
FP <- conf_matrix["0", "1"]
FN <- conf_matrix["1", "0"]
TP <- conf_matrix["1", "1"]

accuracy <- (TP + TN) / sum(conf_matrix)

precision <- ifelse((TP + FP) == 0, 0, TP / (TP + FP))
recall <- ifelse((TP + FN) == 0, 0, TP / (TP + FN))
specificity <- ifelse((TN + FP) == 0, 0, TN / (TN + FP))
balanced_accuracy <- (recall + specificity) / 2

f1_score <- ifelse((precision + recall) == 0, 0,
                   2 * precision * recall / (precision + recall))

classification_result <- data.frame(
  Metric = c(
    "Accuracy",
    "Precision",
    "Recall",
    "Specificity",
    "Balanced Accuracy",
    "F1 Score"
  ),
  Value = round(c(
    accuracy,
    precision,
    recall,
    specificity,
    balanced_accuracy,
    f1_score
  ), 4)
)

classification_result

##              Metric  Value
## 1          Accuracy 0.6238
## 2         Precision 0.1553
## 3            Recall 0.6323
## 4       Specificity 0.6229
## 5 Balanced Accuracy 0.6276
## 6          F1 Score 0.2494

The evaluation results show how well the model predicts churn customers after including both numerical and categorical predictors. Since the churn class is relatively small, recall and F1 score are useful for understanding model performance beyond accuracy.

conf_df <- as.data.frame(conf_matrix)

ggplot(conf_df, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = scales::comma(Freq)), size = 5) +
  labs(
    title = "Confusion Matrix Heatmap",
    x = "Predicted Class",
    y = "Actual Class",
    fill = "Count"
  ) +
  theme_minimal()

The confusion matrix heatmap provides a visual summary of correct and incorrect predictions. The diagonal cells represent correct predictions, while the off-diagonal cells represent false positives and false negatives.

5.6 ROC Curve and AUC

Summary:
The ROC curve and AUC were used as an additional evaluation method. This helps evaluate how well the model separates churn and non-churn customers without depending only on one threshold.

Explanation:
The ROC curve shows the trade-off between the true positive rate and the false positive rate. The AUC value summarises the model’s overall ability to distinguish churn customers from non-churn customers. This is useful for imbalanced classification problems because it evaluates the ranking ability of the model across different thresholds.

roc_obj <- roc(
  response = test_data$churn,
  predictor = as.numeric(pred_prob),
  levels = c("0", "1"),
  direction = "<",
  quiet = TRUE
)

auc_value <- auc(roc_obj)

cat("AUC-ROC:", round(auc_value, 4), "\n")

## AUC-ROC: 0.6777

plot(
  roc_obj,
  col = "#2C3E50",
  lwd = 2,
  main = "ROC Curve for Churn Classification"
)

5.7 Precision-Recall Curve

Summary:
The precision-recall curve was used to further evaluate the model because the churn class is much smaller than the non-churn class.

Explanation:
The precision-recall curve focuses on the model’s ability to identify churn customers. This is useful for imbalanced classification problems because precision and recall directly describe the model performance on the minority class. In this project, recall shows how many actual churn customers were detected, while precision shows how many predicted churn customers actually churned.

pr_obj <- pr.curve(
  scores.class0 = pred_prob[test_data$churn == "1"],
  scores.class1 = pred_prob[test_data$churn == "0"],
  curve = TRUE
)

plot(
  pr_obj,
  main = "Precision-Recall Curve for Churn Classification"
)

cat("AUC-PR:", round(pr_obj$auc.integral, 4), "\n")

## AUC-PR: 0.194

5.8 Visualise Predicted Churn Probability

Summary: The probability plot shows the distribution of predicted churn probabilities. Most customers have relatively low predicted churn probabilities, which is reasonable because most customers did not churn.

Explanation:
The plot confirms that most predicted churn probabilities are relatively low, and the churn and non-churn groups overlap. This supports the use of multiple evaluation metrics instead of relying only on accuracy.

prediction_df <- data.frame(
  Actual_Churn = test_data$churn,
  Predicted_Probability = pred_prob
)

ggplot(prediction_df, aes(x = Predicted_Probability, fill = Actual_Churn)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  labs(
    title = "Predicted Probability of Customer Churn",
    x = "Predicted Churn Probability",
    y = "Number of Customers",
    fill = "Actual Churn"
  ) +
  theme_minimal()

5.9 Additional Classification Models: Decision Tree & Random Forest

Summary:
To extend the baseline logistic regression, we trained two additional classification models: a decision tree and a random forest. The goal is to examine whether a non-linear model can improve churn detection on an imbalanced dataset.

Explanation:
Logistic regression is highly interpretable but assumes a linear relationship between predictors and the log-odds of churn. In contrast, decision trees can model non-linear interactions, while random forests often achieve better predictive performance by aggregating many trees. However, tree-based models may require more computation, especially on a large dataset. To keep runtime manageable, we optionally down-sample the training set for the tree-based models while keeping the same testing set for fair evaluation.

evaluate_binary_classifier <- function(actual, prob, threshold = 0.10) {
  # actual must be a factor with levels c("0","1")
  pred_class <- ifelse(prob >= threshold, "1", "0")
  pred_class <- factor(pred_class, levels = levels(actual))

  cm <- table(Actual = actual, Predicted = pred_class)

  TN <- cm["0", "0"]
  FP <- cm["0", "1"]
  FN <- cm["1", "0"]
  TP <- cm["1", "1"]

  accuracy <- (TP + TN) / sum(cm)
  precision <- ifelse((TP + FP) == 0, 0, TP / (TP + FP))
  recall <- ifelse((TP + FN) == 0, 0, TP / (TP + FN))
  specificity <- ifelse((TN + FP) == 0, 0, TN / (TN + FP))
  balanced_accuracy <- (recall + specificity) / 2
  f1_score <- ifelse((precision + recall) == 0, 0,
                     2 * precision * recall / (precision + recall))

  roc_obj <- pROC::roc(
    response = actual,
    predictor = as.numeric(prob),
    levels = c("0", "1"),
    direction = "<",
    quiet = TRUE
  )
  auc_roc <- as.numeric(pROC::auc(roc_obj))

  pr_obj <- PRROC::pr.curve(
    scores.class0 = prob[actual == "1"],
    scores.class1 = prob[actual == "0"],
    curve = FALSE
  )
  auc_pr <- as.numeric(pr_obj$auc.integral)

  metrics <- data.frame(
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall,
    Specificity = specificity,
    Balanced_Accuracy = balanced_accuracy,
    F1_Score = f1_score,
    AUC_ROC = auc_roc,
    AUC_PR = auc_pr
  )

  list(confusion_matrix = cm, metrics = metrics, roc_obj = roc_obj)
}

# Optional down-sampling for tree-based models (speed / memory)
train_tree <- train_data
set.seed(123)
if (nrow(train_tree) > 200000) {
  train_tree <- train_tree[sample(1:nrow(train_tree), 200000), ]
}
cat("Tree/RF training rows used:", nrow(train_tree), "\n")

## Tree/RF training rows used: 200000

5.9.1 Decision Tree Model (rpart)

Summary:
A decision tree was trained to capture non-linear splits of customer attributes. A shallow depth was used to reduce overfitting and improve interpretability.

# Load required libraries for decision tree training & visualization
library(rpart)
library(rpart.plot)

# Define classification formula for churn prediction model
# Predict churn status using all selected customer feature variables
tree_formula <- churn ~ tenure + monthlycharges + totalcharges + customer_satisfaction +
  num_complaints + num_service_calls + late_payments + credit_score +
  contract + payment_method + paperless_billing

# Train base classification decision tree with regularization control parameters
dt_model <- rpart(
  formula = tree_formula,
  data = train_tree,
  method = "class",  # Specify classification task for binary churn prediction
  # Regularization settings to avoid severe overfitting before pruning
  control = rpart.control(
    maxdepth = 6,      # Restrict maximum tree depth for business interpretability
    minsplit = 200,    # Minimum number of samples required to split a node
    cp = 0.001         # Minimum complexity threshold to accept a split
  )
)

# Print CP table to inspect cross-validation error values for pruning reference
printcp(dt_model)

## 
## Classification tree:
## rpart(formula = tree_formula, data = train_tree, method = "class", 
##     control = rpart.control(maxdepth = 6, minsplit = 200, cp = 0.001))
## 
## Variables actually used in tree construction:
## character(0)
## 
## Root node error: 19916/200000 = 0.09958
## 
## n= 200000 
## 
##   CP nsplit rel error xerror xstd
## 1  0      0         1      0    0

# Extract the optimal CP value that yields the minimal cross-validation error (xerror)
best_cp <- dt_model$cptable[which.min(dt_model$cptable[, "xerror"]), "CP"]
# Print selected best CP for debugging & report records
cat("Selected optimal CP value for pruning:", best_cp, "\n")

## Selected optimal CP value for pruning: 0

# Prune the original decision tree using the optimal CP value to reduce overfitting
dt_pruned <- prune(dt_model, cp = best_cp)

# Export high-resolution tree plot as PNG for embedding into HTML analysis report
png("pruned_churn_decision_tree.png", width = 1600, height = 1000, res = 300)
rpart.plot(
  x = dt_pruned,
  type = 2,
  extra = 103,
  fallen.leaves = TRUE,
  box.col = c("#b8e1ff", "#ffb8b8"),
  main = "Pruned Decision Tree - Customer Churn Prediction",
  cex = 0.85
)
dev.off()

## quartz_off_screen 
##                 2

# Generate churn probability predictions on unseen test dataset
dt_prob <- predict(
  object = dt_pruned,
  newdata = test_data,
  type = "prob"
)[, "1"]  # Extract probability of positive churn class (churn = 1)

# Evaluate binary classification performance
# Fix: pass arguments by position to avoid unused parameter error
dt_eval <- evaluate_binary_classifier(test_data$churn, dt_prob, threshold)

# Print confusion matrix of test set predictions
cat("\nConfusion Matrix of Pruned Decision Tree:\n")

## 
## Confusion Matrix of Pruned Decision Tree:

print(dt_eval$confusion_matrix)

##       Predicted
## Actual      0      1
##      0 270349      0
##      1  29651      0

# Print classification metrics rounded to 4 decimal places
cat("\nPerformance Metrics (4 decimal precision):\n")

## 
## Performance Metrics (4 decimal precision):

print(round(dt_eval$metrics, 4))

##   Accuracy Precision Recall Specificity Balanced_Accuracy F1_Score AUC_ROC
## 1   0.9012         0      0           1               0.5        0     0.5
##   AUC_PR
## 1 0.0988

5.9.2 Random Forest Model (randomForest)

Summary:
A random forest was trained to improve predictive performance by averaging many decision trees. Variable importance was also extracted to identify the most influential drivers of churn.

# Compute mtry based on the number of columns in the design matrix
p <- ncol(model.matrix(tree_formula, data = train_tree)) - 1
mtry_val <- max(1, floor(sqrt(p)))

rf_model <- randomForest(
  tree_formula,
  data = train_tree,
  ntree = 300,
  mtry = mtry_val,
  importance = TRUE
)

rf_prob <- predict(rf_model, newdata = test_data, type = "prob")[, "1"]
rf_eval <- evaluate_binary_classifier(test_data$churn, rf_prob, threshold = threshold)

rf_eval$confusion_matrix

##       Predicted
## Actual      0      1
##      0 174218  96131
##      1  13092  16559

round(rf_eval$metrics, 4)

##   Accuracy Precision Recall Specificity Balanced_Accuracy F1_Score AUC_ROC
## 1   0.6359    0.1469 0.5585      0.6444            0.6014   0.2327  0.6429
##   AUC_PR
## 1 0.1688

varImpPlot(rf_model, n.var = 10,
           main = "Top Variable Importance (Random Forest)")

5.9.3 Model Comparison (Logistic vs Tree vs Random Forest)

lr_eval <- evaluate_binary_classifier(test_data$churn, pred_prob, threshold = threshold)

comparison_tbl <- dplyr::bind_rows(
  data.frame(Model = "Logistic Regression", lr_eval$metrics),
  data.frame(Model = "Decision Tree", dt_eval$metrics),
  data.frame(Model = "Random Forest", rf_eval$metrics)
) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4)))

comparison_tbl

##                 Model Accuracy Precision Recall Specificity Balanced_Accuracy
## 1 Logistic Regression   0.6238    0.1553 0.6323      0.6229            0.6276
## 2       Decision Tree   0.9012    0.0000 0.0000      1.0000            0.5000
## 3       Random Forest   0.6359    0.1469 0.5585      0.6444            0.6014
##   F1_Score AUC_ROC AUC_PR
## 1   0.2494  0.6777 0.1940
## 2   0.0000  0.5000 0.0988
## 3   0.2327  0.6429 0.1688

5.10 XGBoost Modelling + SHAP Interpretation (Modelling & Deep Analysis Stage)

Summary:
We further trained an XGBoost churn model and used SHAP values to interpret both global feature importance and local (single-customer) explanations. This helps uncover non-linear and interaction effects that Pearson correlation may miss.

Explanation:
- XGBoost is a strong gradient-boosted tree model that typically performs well on tabular business data.
- SHAP decomposes each prediction into additive contributions from each feature, providing a consistent explanation framework.

# Prepare one-hot encoded design matrix for XGBoost
# Use the same features as the classification models above.
df_train_xgb <- train_tree
df_test_xgb  <- test_data

mm_all <- model.matrix(tree_formula, data = rbind(df_train_xgb, df_test_xgb))
X_all  <- mm_all[, -1, drop = FALSE]  # drop intercept

X_train <- X_all[1:nrow(df_train_xgb), , drop = FALSE]
X_test  <- X_all[(nrow(df_train_xgb) + 1):nrow(X_all), , drop = FALSE]

y_train <- as.numeric(as.character(df_train_xgb$churn))
y_test  <- as.numeric(as.character(df_test_xgb$churn))

scale_pos_weight <- sum(y_train == 0) / sum(y_train == 1)

dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest  <- xgb.DMatrix(data = X_test,  label = y_test)

params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  eta = 0.05,
  max_depth = 6,
  min_child_weight = 1,
  subsample = 0.8,
  colsample_bytree = 0.8,
  scale_pos_weight = scale_pos_weight
)

set.seed(123)
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 2000,
  watchlist = list(train = dtrain, eval = dtest),
  early_stopping_rounds = 50,
  verbose = 0
)

xgb_prob <- predict(xgb_model, dtest)
xgb_eval <- evaluate_binary_classifier(test_data$churn, xgb_prob, threshold = threshold)

xgb_eval$confusion_matrix

##       Predicted
## Actual      0      1
##      0      0 270349
##      1      0  29651

round(xgb_eval$metrics, 4)

##   Accuracy Precision Recall Specificity Balanced_Accuracy F1_Score AUC_ROC
## 1   0.0988    0.0988      1           0               0.5   0.1799  0.6765
##   AUC_PR
## 1 0.1928

5.10.1 SHAP: Global importance, beeswarm, and single-customer waterfall

# SHAP can be computationally heavy on very large test sets.
# Use a subset of the test set for explanation plots.
set.seed(123)
shap_n <- min(20000, nrow(X_test))
shap_idx <- sample(seq_len(nrow(X_test)), shap_n)

sv <- shapviz(xgb_model, X_pred = X_test[shap_idx, , drop = FALSE])

# 1) Global feature importance (bar)
p_imp <- shapviz::sv_importance(sv, kind = "bar", max_display = 15) +
  ggtitle("SHAP Global Feature Importance (Top 15)")
print(p_imp)

# 2) Beeswarm plot (global distribution of SHAP values)
# Note: some older shapviz versions do not have sv_beeswarm().
# Fall back to sv_importance(kind = "beeswarm") if available; otherwise show bar importance.
p_bee <- tryCatch(
  shapviz::sv_beeswarm(sv, max_display = 15),
  error = function(e) NULL
)
if (is.null(p_bee)) {
  p_bee <- tryCatch(
    shapviz::sv_importance(sv, kind = "beeswarm", max_display = 15),
    error = function(e) {
      message("sv_beeswarm() is not available in your shapviz version. Showing bar importance instead.")
      shapviz::sv_importance(sv, kind = "bar", max_display = 15)
    }
  )
}
print(p_bee + ggtitle("SHAP Beeswarm (Top 15)"))

# 3) Single-customer explanation (waterfall)
# Use the first row in the SHAP subset as an example.
p_wf <- shapviz::sv_waterfall(sv, row_id = 1, max_display = 12) +
  ggtitle("SHAP Waterfall (One Customer Example)")
print(p_wf)

5.10.2 Extended model comparison (including XGBoost)

comparison_tbl_ext <- dplyr::bind_rows(
  data.frame(Model = "Logistic Regression", lr_eval$metrics),
  data.frame(Model = "Decision Tree", dt_eval$metrics),
  data.frame(Model = "Random Forest", rf_eval$metrics),
  data.frame(Model = "XGBoost", xgb_eval$metrics)
) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4)))

comparison_tbl_ext

##                 Model Accuracy Precision Recall Specificity Balanced_Accuracy
## 1 Logistic Regression   0.6238    0.1553 0.6323      0.6229            0.6276
## 2       Decision Tree   0.9012    0.0000 0.0000      1.0000            0.5000
## 3       Random Forest   0.6359    0.1469 0.5585      0.6444            0.6014
## 4             XGBoost   0.0988    0.0988 1.0000      0.0000            0.5000
##   F1_Score AUC_ROC AUC_PR
## 1   0.2494  0.6777 0.1940
## 2   0.0000  0.5000 0.0988
## 3   0.2327  0.6429 0.1688
## 4   0.1799  0.6765 0.1928

5.11 Classification Result Discussion

Summary:
Overall, logistic regression provides a simple and interpretable baseline model for customer churn prediction. By including both numerical and categorical predictors, the model becomes more consistent with the EDA findings.

Explanation:
The model includes customer usage, payment behaviour, customer experience, and service-related categorical variables such as contract type, payment method, and paperless billing. Since the churn class is imbalanced, a lower threshold was used to improve the model’s ability to detect potential churn customers. However, this may also increase false positives. This trade-off is acceptable in a customer churn context because companies may prefer to identify more potential churn customers for retention actions.

Compared with logistic regression, the decision tree can capture non-linear splits (for example, churn risk increasing sharply beyond certain thresholds of tenure or complaints). However, single trees can be unstable and may underperform if the true pattern is complex. The random forest typically improves performance by averaging many trees and often yields better ranking metrics (AUC-ROC / AUC-PR), which is important when the churn class is imbalanced.

In practice, logistic regression remains attractive when explainability and ease of deployment are priorities, while random forest can be used when higher predictive performance is needed for targeted retention campaigns. The choice depends on the business trade-off between model interpretability, operational cost, and the cost of false positives vs false negatives.

6 Regression Model: Linear Regression with Interaction Term

6.1 Define Regression Objective

This section predicts totalcharges, which is the total amount paid by each customer. Since this variable is numeric and continuous, it is treated as a regression problem.

Linear regression was used because it is simple and easy to interpret. The model also includes an interaction term between tenure and monthlycharges, because total charges are naturally related to both how long a customer stays and how much the customer pays each month. The other predictors are num_services, customer_satisfaction, credit_score, age, and annual_income.

6.2 Prepare Data and Split Dataset

This part uses the same training and testing split that was already created in the classification section. Using the same split ensures that the regression and classification results are based on the same customers, which is important when the two models are combined later.

# Reuse the same train/test split from the classification section
train_reg <- train_data
test_reg  <- test_data

cat("Training set size:", nrow(train_reg), "\n")

## Training set size: 700000

cat("Testing set size: ", nrow(test_reg),  "\n")

## Testing set size:  300000

6.3 Build Linear Regression Model

The regression model was fitted using the training data. The model includes tenure, monthlycharges, and their interaction term. In R, tenure * monthlycharges includes tenure, monthlycharges, and the interaction effect between them.

This is suitable for this task because customers with both long tenure and high monthly charges usually have much higher total charges.

lm_model <- lm(
  totalcharges ~ tenure * monthlycharges + num_services +
    customer_satisfaction + credit_score + age + annual_income,
  data = train_reg
)

summary(lm_model)

## 
## Call:
## lm(formula = totalcharges ~ tenure * monthlycharges + num_services + 
##     customer_satisfaction + credit_score + age + annual_income, 
##     data = train_reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7806.8   -31.0    14.5   106.3  6421.8 
## 
## Coefficients:
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)           -5.100e-01  4.373e+00   -0.117 0.907158    
## tenure                -2.871e-01  7.853e-02   -3.656 0.000257 ***
## monthlycharges        -1.363e-01  2.843e-02   -4.793 1.65e-06 ***
## num_services           3.045e+00  3.282e-01    9.277  < 2e-16 ***
## customer_satisfaction -9.813e-02  1.752e-01   -0.560 0.575418    
## credit_score           6.977e-03  4.706e-03    1.483 0.138169    
## age                   -7.720e-03  2.793e-02   -0.276 0.782222    
## annual_income          1.363e-05  1.327e-05    1.027 0.304489    
## tenure:monthlycharges  9.585e-01  8.836e-04 1084.774  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 338.1 on 699991 degrees of freedom
## Multiple R-squared:  0.9649, Adjusted R-squared:  0.9649 
## F-statistic: 2.405e+06 on 8 and 699991 DF,  p-value: < 2.2e-16

6.4 Predict Total Charges

The fitted model was then used to predict totalcharges for the testing data. The predicted values were compared with the actual values to evaluate model performance.

pred_totalcharges <- predict(lm_model, newdata = test_reg)

head(data.frame(
  Actual    = round(test_reg$totalcharges, 2),
  Predicted = round(pred_totalcharges,    2)
), 10)

##      Actual Predicted
## 5    648.79    944.33
## 7   3602.75   3153.26
## 8   2067.41   2079.01
## 11   504.35    525.26
## 12  3187.81   2925.58
## 13   677.30    706.72
## 22 10279.40   8951.59
## 24  4679.77   4090.45
## 28  5465.79   5142.84
## 31    92.20     83.90

6.5 Evaluate Regression Model

The model was evaluated using MAE, RMSE, and R-squared. MAE shows the average prediction error in dollars. RMSE gives more weight to larger errors. R-squared shows how much variation in totalcharges can be explained by the model.

actuals   <- test_reg$totalcharges
residuals <- actuals - pred_totalcharges

mae       <- mean(abs(residuals))
rmse      <- sqrt(mean(residuals^2))
sse       <- sum(residuals^2)
sst       <- sum((actuals - mean(actuals))^2)
r_squared <- 1 - sse / sst

regression_result <- data.frame(
  Metric = c("MAE (USD)", "RMSE (USD)", "R-squared"),
  Value  = round(c(mae, rmse, r_squared), 4)
)

regression_result

##       Metric    Value
## 1  MAE (USD) 177.1236
## 2 RMSE (USD) 338.3381
## 3  R-squared   0.9647

6.6 Visualise Actual vs Predicted

The actual-versus-predicted plot was used to check how close the predictions were to the real values. The residual plot was used to check whether the prediction errors were roughly centred around zero.

set.seed(42)
plot_df     <- data.frame(Actual = actuals, Predicted = pred_totalcharges)
plot_sample <- plot_df %>% sample_n(min(10000, nrow(plot_df)))

ggplot(plot_sample, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.2, size = 0.8, colour = "#2C3E50") +
  geom_abline(slope = 1, intercept = 0,
              colour = "#E74C3C", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title    = "Actual vs Predicted Total Charges",
       subtitle = "Random sample of 10,000 customers; red line = perfect prediction",
       x = "Actual Total Charges (USD)",
       y = "Predicted Total Charges (USD)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

residual_df     <- data.frame(Predicted = pred_totalcharges, Residual = residuals)
residual_sample <- residual_df %>% sample_n(min(10000, nrow(residual_df)))

ggplot(residual_sample, aes(x = Predicted, y = Residual)) +
  geom_point(alpha = 0.2, size = 0.8, colour = "#2C3E50") +
  geom_hline(yintercept = 0, colour = "#E74C3C", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title    = "Residual Plot",
       subtitle = "Random sample of 10,000 customers; residuals should be centred on zero",
       x = "Predicted Total Charges (USD)",
       y = "Residual (Actual − Predicted)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

6.7 Regression Result Discussion

Based on the evaluation results, the model achieved an R-squared of 0.965, with an MAE of about $177 and an RMSE of about $338. This suggests that the model has good predictive performance for totalcharges. This is reasonable because total charges are naturally related to how long a customer has stayed with the company and how much the customer pays each month.

The interaction term between tenure and monthlycharges helps the model capture their combined effect on total charges. However, the result should still be interpreted carefully because totalcharges is closely related to these variables by definition. Therefore, this model is useful for estimating customer spending value, but the result should not be over-interpreted.

One limitation is that outlier capping was applied to some input features (e.g., monthlycharges and annual_income) during data cleaning to reduce the influence of extreme values. This improves model stability, but it may slightly dampen the estimated effect size for those predictors. Importantly, the regression target (totalcharges) itself was not capped in this version of the pipeline.

6.8 5-Fold Cross-Validation

A 5-fold cross-validation was used to check whether the regression model is stable. This cross-validation was used as an additional stability check on the cleaned dataset, separate from the fixed train-test split used for the main model comparison. The dataset was divided into five parts. Each time, four parts were used for training and one part was used for testing.

set.seed(123)
k_folds  <- 5
fold_ids <- sample(rep(1:k_folds, length.out = nrow(df_clean)))

cv_results <- data.frame(
  Fold = as.character(1:k_folds),
  MAE  = NA_real_,
  RMSE = NA_real_,
  R2   = NA_real_
)

for (i in 1:k_folds) {
  cv_train <- df_clean[fold_ids != i, ]
  cv_test  <- df_clean[fold_ids == i, ]

  cv_model <- lm(
    totalcharges ~ tenure * monthlycharges + num_services +
      customer_satisfaction + credit_score + age + annual_income,
    data = cv_train
  )

  cv_pred   <- predict(cv_model, newdata = cv_test)
  cv_actual <- cv_test$totalcharges

  cv_results$MAE[i]  <- mean(abs(cv_actual - cv_pred))
  cv_results$RMSE[i] <- sqrt(mean((cv_actual - cv_pred)^2))
  cv_results$R2[i]   <- 1 - sum((cv_actual - cv_pred)^2) /
                            sum((cv_actual - mean(cv_actual))^2)
}

cv_results <- rbind(
  cv_results,
  data.frame(
    Fold = "Average",
    MAE  = mean(cv_results$MAE),
    RMSE = mean(cv_results$RMSE),
    R2   = mean(cv_results$R2)
  )
)

cv_results$MAE  <- round(cv_results$MAE,  4)
cv_results$RMSE <- round(cv_results$RMSE, 4)
cv_results$R2   <- round(cv_results$R2,   4)

cv_results

##      Fold      MAE     RMSE     R2
## 1       1 177.0401 338.4160 0.9649
## 2       2 176.3736 336.6289 0.9649
## 3       3 177.0498 337.4517 0.9651
## 4       4 178.2538 339.7117 0.9645
## 5       5 177.1107 338.6662 0.9648
## 6 Average 177.1656 338.1749 0.9648

6.9 Visualise Cross-Validation Stability

The bar chart shows the R-squared value for each fold. If the values are close to each other, it means the model performance is stable across different data splits.

cv_plot_df <- cv_results[cv_results$Fold != "Average", ]

ggplot(cv_plot_df, aes(x = Fold, y = R2)) +
  geom_bar(stat = "identity", fill = "#2C3E50", width = 0.5) +
  geom_text(aes(label = R2), vjust = -0.5, fontface = "bold", size = 4) +
  geom_hline(yintercept = mean(cv_plot_df$R2),
             colour = "#E74C3C", linetype = "dashed", linewidth = 1) +
  coord_cartesian(ylim = c(0, max(cv_plot_df$R2) * 1.1)) +
  labs(title    = "5-Fold Cross-Validation: R-squared by Fold",
       subtitle = "Red dashed line shows the mean across folds",
       x = "Fold", y = "R-squared") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(colour = "grey50"))

6.10 Compare Against Baseline Model

A baseline model was also used for comparison. This baseline model simply predicts the mean value of totalcharges from the training data. If the regression model has lower MAE and RMSE than the baseline model, it means the selected predictors and interaction term are useful.

baseline_pred   <- rep(mean(train_reg$totalcharges), nrow(test_reg))
baseline_actual <- test_reg$totalcharges

baseline_mae  <- mean(abs(baseline_actual - baseline_pred))
baseline_rmse <- sqrt(mean((baseline_actual - baseline_pred)^2))

comparison_table <- data.frame(
  Model     = c("Baseline (Predict Mean)", "Regression with Interaction"),
  MAE_USD   = round(c(baseline_mae,  mae),  2),
  RMSE_USD  = round(c(baseline_rmse, rmse), 2)
)

comparison_table

##                         Model MAE_USD RMSE_USD
## 1     Baseline (Predict Mean) 1387.32  1801.58
## 2 Regression with Interaction  177.12   338.34

6.11 Visualise Baseline vs Regression Model

comp_long <- data.frame(
  Model  = rep(comparison_table$Model, 2),
  Metric = rep(c("MAE (USD)", "RMSE (USD)"), each = 2),
  Value  = c(comparison_table$MAE_USD, comparison_table$RMSE_USD)
)

ggplot(comp_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
  geom_text(aes(label = round(Value, 1)),
            position = position_dodge(width = 0.6),
            vjust = -0.5, fontface = "bold", size = 4) +
  scale_fill_manual(values = c("Baseline (Predict Mean)"     = "#E74C3C",
                               "Regression with Interaction" = "#2C3E50")) +
  labs(title = "Regression Model vs Baseline Model",
       x = "Metric (Lower is Better)",
       y = "Error (USD)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

6.12 Regression Evaluation Discussion

The cross-validation result shows that the model performs consistently across the five folds, which suggests that the model is stable across different data splits. The baseline comparison also shows that the regression model has much lower MAE and RMSE than simply predicting the mean value of totalcharges.

This confirms that the selected predictors provide useful information. Overall, this regression model can help estimate customer value based on account and customer characteristics.

The residual plot also suggests that the error spread may not be perfectly constant, so the regression result should be interpreted mainly as a prediction model rather than a strict inference model.

7 Clustering: K-Means Customer Segmentation

7.1 Define Clustering Objective

This section uses K-means clustering to group customers into different segments. Unlike classification and regression, clustering does not use a target variable. The purpose is to find customers with similar behaviour and account characteristics.

7.2 Prepare Data for Clustering

Seven numeric variables were selected for clustering: tenure, monthlycharges, num_services, customer_satisfaction, credit_score, num_complaints, and late_payments.

The variables were standardised before applying K-means. This is important because K-means is based on distance. Without standardisation, variables with larger values may have too much influence on the clustering result.

A sample of 50,000 customers was used to reduce computation time.

cluster_vars <- df_clean %>%
  select(tenure, monthlycharges, num_services, customer_satisfaction,
         credit_score, num_complaints, late_payments) %>%
  mutate(across(everything(), as.numeric))

set.seed(123)
cluster_idx    <- sample(1:nrow(cluster_vars), size = 50000)
cluster_data   <- cluster_vars[cluster_idx, ]

cluster_scaled <- scale(cluster_data)

cat("Sample size for clustering:", nrow(cluster_data), "\n")

## Sample size for clustering: 50000

cat("Number of features used:   ", ncol(cluster_data), "\n")

## Number of features used:    7

7.3 Determine Optimal Number of Clusters (Elbow Method)

Based on the elbow plot and the later comparison of silhouette scores, k = 4 was selected as a reasonable number of clusters. The within-cluster sum of squares was calculated for different k values to compare the compactness of the clusters.

set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(cluster_scaled, centers = k, nstart = 10, iter.max = 30)$tot.withinss
})

elbow_df <- data.frame(k = 1:10, WSS = wss)

ggplot(elbow_df, aes(x = k, y = WSS)) +
  geom_line(colour = "#2C3E50", linewidth = 1) +
  geom_point(size = 3, colour = "#E74C3C") +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Elbow Method for Optimal Number of Clusters",
       x = "Number of Clusters (k)",
       y = "Within-Cluster Sum of Squares (WSS)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

7.4 Build K-Means Model

The final K-means model was built using k = 4. The option nstart = 25 was used so that the algorithm tries different starting points and keeps the best result.

set.seed(123)
k_optimal <- 4
km_model  <- kmeans(cluster_scaled, centers = k_optimal, nstart = 25)

cluster_data$Cluster <- factor(km_model$cluster)

cat("Cluster sizes:\n")

## Cluster sizes:

print(table(cluster_data$Cluster))

## 
##     1     2     3     4 
##  8077 11302 18363 12258

7.5 Profile Each Cluster

After clustering, the mean value of each variable was calculated for every cluster. This profile table helps explain the characteristics of each customer segment.

cluster_profile <- cluster_data %>%
  group_by(Cluster) %>%
  summarise(
    Count              = n(),
    Avg_Tenure         = round(mean(tenure), 1),
    Avg_MonthlyCharges = round(mean(monthlycharges), 1),
    Avg_NumServices    = round(mean(num_services), 1),
    Avg_Satisfaction   = round(mean(customer_satisfaction), 1),
    Avg_CreditScore    = round(mean(credit_score), 0),
    Avg_Complaints     = round(mean(num_complaints), 1),
    Avg_LatePayments   = round(mean(late_payments), 1)
  )

cluster_profile

## # A tibble: 4 × 9
##   Cluster Count Avg_Tenure Avg_MonthlyCharges Avg_NumServices Avg_Satisfaction
##   <fct>   <int>      <dbl>              <dbl>           <dbl>            <dbl>
## 1 1        8077       58.1               82.8             2.3              6.2
## 2 2       11302       16                 79.4             2.1              6.2
## 3 3       18363       13.7               75.6             1.8              6.1
## 4 4       12258       17.5              110.              4.3              6.2
## # ℹ 3 more variables: Avg_CreditScore <dbl>, Avg_Complaints <dbl>,
## #   Avg_LatePayments <dbl>

7.6 Visualise Clusters Using PCA

PCA was used to visualise the clustering result in two dimensions. Since the clustering model uses seven variables, PCA helps show the general separation of clusters in a simple plot.

pca <- prcomp(cluster_scaled, center = FALSE, scale. = FALSE)
pca_df <- data.frame(
  PC1     = pca$x[, 1],
  PC2     = pca$x[, 2],
  Cluster = cluster_data$Cluster
)

ggplot(pca_df, aes(x = PC1, y = PC2, colour = Cluster)) +
  geom_point(alpha = 0.4, size = 0.8) +
  scale_colour_manual(values = c("#E74C3C", "#2C3E50", "#3498DB", "#F39C12")) +
  labs(title    = "Customer Segments (PCA 2D Projection)",
       subtitle = paste0("k = ", k_optimal,
                         "; PC1 explains ",
                         round(summary(pca)$importance[2, 1] * 100, 1), "%, PC2 explains ",
                         round(summary(pca)$importance[2, 2] * 100, 1), "% of variance"),
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(colour = "grey50"))

7.7 Clustering Result Discussion

Based on the cluster profile table, the four clusters mainly differ in customer tenure, monthly charges, and the number of services used. Customer satisfaction is almost the same across all clusters, so it is not a strong factor in separating the groups.

One cluster contains customers with the longest tenure, suggesting a group of long-term customers with relatively stable relationships with the company. Another cluster has the highest monthly charges and the highest number of services, suggesting a high-spending group. The remaining clusters mainly contain newer customers with shorter tenure and lower spending.

These differences can help the company design different retention strategies. For example, long-term customers and high-spending customers may need more personalised retention actions because they represent important customer groups, while lower-spending customers may be handled with lower-cost retention actions.

Unlike classification and regression, clustering has no true target label. Therefore, accuracy and RMSE are not suitable for clustering evaluation. In this report, silhouette width and the BSS/TSS ratio were used to evaluate the clustering result.

7.8 Compute Silhouette Width

Silhouette width was used to evaluate how well each customer fits into the assigned cluster. A higher average silhouette value means the clusters are better separated.

set.seed(42)
sil_idx      <- sample(1:nrow(cluster_scaled), 10000)
sil_data     <- cluster_scaled[sil_idx, ]
sil_clusters <- km_model$cluster[sil_idx]

sil <- silhouette(sil_clusters, dist(sil_data))

sil_overall     <- mean(sil[, "sil_width"])
sil_per_cluster <- tapply(sil[, "sil_width"], sil[, "cluster"], mean)

cat("Overall average silhouette width:", round(sil_overall, 4), "\n\n")

## Overall average silhouette width: 0.144

cat("Average silhouette per cluster:\n")

## Average silhouette per cluster:

print(round(sil_per_cluster, 4))

##      1      2      3      4 
## 0.1184 0.1013 0.1963 0.1224

7.9 Visualise the Silhouette Plot

The silhouette plot provides a visual check of the clustering quality. Customers with positive silhouette values fit better within their assigned clusters.

plot(sil,
     col    = c("#E74C3C", "#2C3E50", "#3498DB", "#F39C12"),
     border = NA,
     main   = "Silhouette Plot for K-Means (k = 4)")

7.10 Within-Cluster vs Between-Cluster Variance

The BSS/TSS ratio was also used to evaluate the clustering result. A higher ratio means that more variation in the data is explained by the cluster grouping.

wss_final <- km_model$tot.withinss
bss_final <- km_model$betweenss
tss_final <- km_model$totss
ratio     <- bss_final / tss_final

variance_table <- data.frame(
  Metric = c("Within-Cluster SS (WSS)",
             "Between-Cluster SS (BSS)",
             "Total SS (TSS)",
             "BSS / TSS (variance explained)"),
  Value  = c(round(wss_final, 2),
             round(bss_final, 2),
             round(tss_final, 2),
             paste0(round(ratio * 100, 2), "%"))
)

variance_table

##                           Metric     Value
## 1        Within-Cluster SS (WSS) 246018.49
## 2       Between-Cluster SS (BSS) 103974.51
## 3                 Total SS (TSS)    349993
## 4 BSS / TSS (variance explained)    29.71%

7.11 Compare Different k Values

Different k values were compared to check whether k = 4 was a reasonable choice. The final decision should consider both the evaluation metrics and whether the clusters are easy to interpret.

set.seed(123)
compare_k <- data.frame(k             = 3:6,
                        Avg_Silhouette = NA_real_,
                        BSS_TSS_Ratio  = NA_real_)

for (i in 1:nrow(compare_k)) {
  k_test  <- compare_k$k[i]
  km_test <- kmeans(cluster_scaled, centers = k_test,
                    nstart = 10, iter.max = 30)

  sil_test_clusters <- km_test$cluster[sil_idx]
  sil_test <- silhouette(sil_test_clusters, dist(sil_data))

  compare_k$Avg_Silhouette[i] <- round(mean(sil_test[, "sil_width"]), 4)
  compare_k$BSS_TSS_Ratio[i]  <- round(km_test$betweenss / km_test$totss, 4)
}

compare_k

##   k Avg_Silhouette BSS_TSS_Ratio
## 1 3         0.1407        0.2291
## 2 4         0.1440        0.2971
## 3 5         0.1338        0.3475
## 4 6         0.1379        0.3965

ggplot(compare_k, aes(x = factor(k), y = Avg_Silhouette)) +
  geom_bar(stat = "identity", fill = "#2C3E50", width = 0.5) +
  geom_text(aes(label = sprintf("%.3f", Avg_Silhouette)),
            vjust = -0.5, fontface = "bold", size = 4) +
  labs(title    = "Average Silhouette Width Across Different k Values",
       subtitle = "The k value with the highest bar is the strongest candidate",
       x = "Number of Clusters (k)",
       y = "Average Silhouette Width") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(colour = "grey50"))

7.12 Clustering Evaluation Discussion

The clustering evaluation shows that the K-means result has a relatively low average silhouette width (about 0.144) and a BSS/TSS ratio of about 29.7%. This means the clusters are not strongly separated and there is some overlap between groups. This is common for real customer data, where customer behaviour usually changes gradually instead of forming clearly separated groups.

When comparing different k values, k = 4 showed a reasonable balance between clustering performance and business interpretability. Even though the separation is not very strong, the clusters still show clear differences in tenure, spending, and service usage. Therefore, the clustering result is still useful for customer segmentation, as long as it is interpreted together with the cluster profile rather than relying only on the metrics.

Since K-means uses Euclidean distance, it works best with continuous numeric variables. Some variables used here, such as satisfaction score and complaint counts, are ordinal or count-based. Therefore, the clustering result should be interpreted as an approximate customer segmentation rather than a perfect grouping.

8 Comparative Analysis of the Three Models

This section compares the three models used in this project: classification, regression, and clustering. Each model answers a different question about customer churn and customer value.

8.1 Model Roles and Differences

Each model answers a different question. Classification and regression are supervised models, which means they learn from a known target variable. Clustering is unsupervised, so it does not use any target variable. Because of this, the three models cannot be compared using the same metric directly.

model_roles <- data.frame(
  Model = c("Classification",
            "Regression",
            "Clustering"),
  Algorithm = c("Logistic Regression",
                "Linear Regression (with interaction)",
                "K-Means (k = 4)"),
  Learning_Type = c("Supervised (binary)",
                    "Supervised (continuous)",
                    "Unsupervised"),
  Target_Variable = c("churn (0/1)",
                      "totalcharges (USD)",
                      "None"),
  Business_Question = c("WHO is likely to churn?",
                        "HOW MUCH is each customer worth?",
                        "WHAT kind of customer is each?"),
  Output = c("Churn probability",
             "Predicted total charges",
             "Cluster label (1–4)")
)

model_roles

##            Model                            Algorithm           Learning_Type
## 1 Classification                  Logistic Regression     Supervised (binary)
## 2     Regression Linear Regression (with interaction) Supervised (continuous)
## 3     Clustering                      K-Means (k = 4)            Unsupervised
##      Target_Variable                Business_Question                  Output
## 1        churn (0/1)          WHO is likely to churn?       Churn probability
## 2 totalcharges (USD) HOW MUCH is each customer worth? Predicted total charges
## 3               None   WHAT kind of customer is each?     Cluster label (1–4)

8.2 Side-by-Side Performance Summary

Even though the three models use different metrics, the main performance result for each model is shown together for an overall comparison. F1 score, R-squared, and silhouette width are shown as the main summary metrics for the three models, but they should not be directly compared because the models solve different tasks.

performance_summary <- data.frame(
  Model = c("Classification (Logistic Regression)",
            "Regression (Linear Regression with interaction)",
            "Clustering (K-Means, k = 4)"),
  Primary_Metric = c(paste0("F1 Score = ",   round(f1_score,    3)),
                     paste0("R-squared = ",  round(r_squared,   3)),
                     paste0("Silhouette = ", round(sil_overall, 3))),
  Secondary_Metric = c(paste0("Accuracy = ", round(accuracy,  3)),
                       paste0("RMSE = $",    round(rmse, 0)),
                       paste0("BSS/TSS = ",  round(ratio*100, 1), "%")),
  Interpretation = c("Model has limited ability to identify churners under class imbalance",
                     "Regression model explains most of the variance in totalcharges",
                     "Four customer segments were identified, but the silhouette score suggests some overlap")
)

performance_summary

##                                             Model     Primary_Metric
## 1            Classification (Logistic Regression)   F1 Score = 0.249
## 2 Regression (Linear Regression with interaction)  R-squared = 0.965
## 3                     Clustering (K-Means, k = 4) Silhouette = 0.144
##   Secondary_Metric
## 1 Accuracy = 0.624
## 2      RMSE = $338
## 3  BSS/TSS = 29.7%
##                                                                           Interpretation
## 1                   Model has limited ability to identify churners under class imbalance
## 2                         Regression model explains most of the variance in totalcharges
## 3 Four customer segments were identified, but the silhouette score suggests some overlap

8.3 Visualise the Three Models’ Outputs

The three-panel plot shows the output of each model in one view: the churn probability distribution from the classification model, the predicted-versus-actual plot from the regression model, and the cluster sizes from the K-means model.

library(gridExtra)

p1 <- data.frame(Prob = pred_prob) %>%
  ggplot(aes(x = Prob)) +
  geom_histogram(bins = 30, fill = "#E74C3C", alpha = 0.8) +
  labs(title    = "Classification Output",
       subtitle = "Distribution of predicted churn probability",
       x = "Churn Probability", y = "Customer Count") +
  theme_minimal(base_size = 11) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(colour = "grey50"))

set.seed(42)
reg_sample <- data.frame(Actual    = test_reg$totalcharges,
                         Predicted = pred_totalcharges) %>%
  sample_n(5000)

p2 <- ggplot(reg_sample, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.2, colour = "#2C3E50", size = 0.6) +
  geom_abline(slope = 1, intercept = 0, colour = "#E74C3C", linewidth = 1) +
  labs(title    = "Regression Output",
       subtitle = "Predicted vs actual total charges",
       x = "Actual (USD)", y = "Predicted (USD)") +
  theme_minimal(base_size = 11) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(colour = "grey50"))

p3 <- as.data.frame(table(cluster_data$Cluster)) %>%
  rename(Cluster = Var1, Count = Freq) %>%
  ggplot(aes(x = Cluster, y = Count, fill = Cluster)) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_text(aes(label = Count), vjust = -0.4, fontface = "bold", size = 3.5) +
  scale_fill_manual(values = c("#E74C3C", "#2C3E50", "#3498DB", "#F39C12")) +
  labs(title    = "Clustering Output",
       subtitle = "Customer count per cluster",
       x = "Cluster", y = "Number of Customers") +
  theme_minimal(base_size = 11) +
  theme(plot.title      = element_text(face = "bold"),
        plot.subtitle   = element_text(colour = "grey50"),
        legend.position = "none")

grid.arrange(p1, p2, p3, ncol = 3)

8.4 Integrated Business Application — Customer Retention Priority Score

The classification and regression outputs are combined into a retention-priority score for every testing customer, while the clustering result is used to interpret the customer segment. This score answers an important business question: which customers should the retention team contact first?

The score was calculated using the same testing customers from the classification and regression models, so that each churn probability and predicted total charge refer to the same customer.

A customer is treated as higher priority when the customer has both a high churn probability and a high predicted total charge. The formula combines the classification probability and the regression prediction:

\[\text{Retention Priority} = P(\text{Churn}) \times \text{Predicted Total Charges}\]

The clustering result then suggests which retention strategy to use for each high-priority customer.

# Assign test customers to clusters
test_features <- test_reg %>%
  select(tenure, monthlycharges, num_services, customer_satisfaction,
         credit_score, num_complaints, late_payments) %>%
  mutate(across(everything(), as.numeric)) %>%
  as.matrix()

test_scaled <- scale(test_features,
                     center = attr(cluster_scaled, "scaled:center"),
                     scale  = attr(cluster_scaled, "scaled:scale"))

assign_to_cluster <- function(scaled_data, centers) {
  apply(scaled_data, 1, function(point) {
    dists <- rowSums((centers - matrix(point,
                                       nrow  = nrow(centers),
                                       ncol  = length(point),
                                       byrow = TRUE))^2)
    which.min(dists)
  })
}

test_clusters <- assign_to_cluster(test_scaled, km_model$centers)

# Create combined result table (using the final regression model's predictions)
combined_results <- data.frame(
  Cluster                = factor(test_clusters),
  Churn_Probability      = round(pred_prob, 4),
  Predicted_TotalCharges = round(pred_totalcharges, 2),
  Actual_TotalCharges    = round(test_reg$totalcharges, 2),
  Actual_Churn           = test_reg$churn
)

# Calculate retention priority score
combined_results$Retention_Priority <- round(
  combined_results$Churn_Probability * combined_results$Predicted_TotalCharges,
  2
)

summary(combined_results$Retention_Priority)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.97   38.65  103.20  174.25  229.10 3885.69

# Top 10 highest-priority customers
top_priority <- combined_results %>%
  arrange(desc(Retention_Priority)) %>%
  head(10)

top_priority

##        Cluster Churn_Probability Predicted_TotalCharges Actual_TotalCharges
## 256008       1            0.3987                9745.90            10605.68
## 566831       1            0.5378                6808.99             7728.85
## 296576       1            0.5195                6537.42             6888.60
## 611640       4            0.4628                7159.88             7149.09
## 682415       1            0.3919                8412.76             9202.15
## 36107        4            0.3388                9345.60             6318.23
## 173046       4            0.3490                9042.90             8681.82
## 626011       1            0.4922                6328.20             6574.42
## 288400       4            0.3075               10055.47            11333.97
## 103548       1            0.3746                7916.84             8759.24
##        Actual_Churn Retention_Priority
## 256008            1            3885.69
## 566831            1            3661.87
## 296576            0            3396.19
## 611640            0            3313.59
## 682415            0            3296.96
## 36107             0            3166.29
## 173046            0            3155.97
## 626011            1            3114.74
## 288400            0            3092.06
## 103548            0            2965.65

8.5 Retention Priority by Customer Segment

The boxplot shows the distribution of retention priority scores across the four customer clusters. Segments with higher overall priority scores may contain more urgent retention candidates because these customers combine higher churn risk with higher predicted customer value.

ggplot(combined_results,
       aes(x = Cluster, y = Retention_Priority, fill = Cluster)) +
  geom_boxplot(alpha = 0.8, outlier.size = 1.5) +
  scale_fill_manual(values = c("#E74C3C", "#2C3E50", "#3498DB", "#F39C12")) +
  scale_y_continuous(labels = comma_format()) +
  labs(title    = "Retention Priority Score by Customer Segment",
       subtitle = "Higher boxes indicate segments with more urgent retention needs",
       x = "Customer Cluster",
       y = "Retention Priority Score") +
  theme_minimal(base_size = 13) +
  theme(plot.title      = element_text(face = "bold"),
        plot.subtitle   = element_text(colour = "grey50"),
        legend.position = "none")

priority_by_cluster <- combined_results %>%
  group_by(Cluster) %>%
  summarise(
    Customer_Count       = n(),
    Mean_Churn_Prob      = round(mean(Churn_Probability),      4),
    Mean_Predicted_Value = round(mean(Predicted_TotalCharges), 2),
    Mean_Priority_Score  = round(mean(Retention_Priority),     2),
    Top_10_Pct_Threshold = round(quantile(Retention_Priority, 0.90), 2)
  )

priority_by_cluster

## # A tibble: 4 × 6
##   Cluster Customer_Count Mean_Churn_Prob Mean_Predicted_Value
##   <fct>            <int>           <dbl>                <dbl>
## 1 1                49170          0.0906                4591.
## 2 2                68667          0.122                 1219.
## 3 3               109513          0.0948                 977.
## 4 4                72650          0.0919                1854.
## # ℹ 2 more variables: Mean_Priority_Score <dbl>, Top_10_Pct_Threshold <dbl>

8.6 Strengths and Limitations of Each Model

The table below summarises the strengths, limitations, and possible future improvements for each model.

strengths_limits <- data.frame(
  Model = c("Classification (Logistic / Tree / RF / XGBoost + SHAP)",
            "Regression (Linear Regression with interaction)",
            "Clustering (K-Means)"),
  Strengths = c(
    "Baseline interpretability (logistic) + stronger performance (RF/XGBoost) + explainability via SHAP",
    "Easy to interpret and captures the combined effect of tenure and monthly charges",
    "Finds customer segments without needing a target variable"
  ),
  Limitations = c(
    "More computation for advanced models; threshold tuning is needed for imbalanced churn",
    "Still depends on a specified model form and may not capture all non-linear patterns",
    "Requires choosing k and depends on feature scaling"
  ),
  Future_Improvement = c(
    "Use cross-validation and cost-sensitive threshold selection; explore additional features",
    "Try another regression method and compare the error",
    "Try other clustering methods such as PAM or Gaussian mixture models"
  )
)

strengths_limits

##                                                    Model
## 1 Classification (Logistic / Tree / RF / XGBoost + SHAP)
## 2        Regression (Linear Regression with interaction)
## 3                                   Clustering (K-Means)
##                                                                                            Strengths
## 1 Baseline interpretability (logistic) + stronger performance (RF/XGBoost) + explainability via SHAP
## 2                   Easy to interpret and captures the combined effect of tenure and monthly charges
## 3                                          Finds customer segments without needing a target variable
##                                                                             Limitations
## 1 More computation for advanced models; threshold tuning is needed for imbalanced churn
## 2   Still depends on a specified model form and may not capture all non-linear patterns
## 3                                    Requires choosing k and depends on feature scaling
##                                                                         Future_Improvement
## 1 Use cross-validation and cost-sensitive threshold selection; explore additional features
## 2                                      Try another regression method and compare the error
## 3                      Try other clustering methods such as PAM or Gaussian mixture models

8.7 Overall Comparison of the Three Models

Overall, the three models are useful in different ways. The classification model helps identify customers who may churn. The regression model helps estimate customer value. The clustering model helps group customers with similar behaviour. When used together, these models can support better customer retention decisions. However, the clustering result should be interpreted carefully because the silhouette score suggests that customer groups are not strongly separated.

Since the dataset is synthetic, the conclusions should be interpreted as an academic modelling exercise rather than direct evidence from a real telecom company.

For simplicity in this academic demonstration, preprocessing was conducted on the entire cleaned dataset. In a stricter production pipeline, statistics for missing value imputation and outlier capping should be derived only from the training set and then applied to the testing set to reduce data leakage.

9 Final Conclusion

This project demonstrated a complete customer churn analytics workflow using data cleaning, exploratory data analysis, classification, regression, and clustering techniques in R. The preprocessing stage improved dataset quality by handling missing values, converting data types, treating outliers, and engineering additional behavioural features.

The overall report is designed to answer the following three questions:

9.1 Q1 (Classification): Predict whether a customer will churn (Yes) or stay (No)

We trained multiple classification models on the same 70/30 split to predict churn based on tenure, monthly charges, total charges, customer satisfaction, number of complaints, service calls, late payments, credit score, and key categorical variables (contract type, payment method, and paperless billing).

Logistic regression provides a strong and interpretable baseline.
Decision tree / Random forest capture non-linear patterns.
XGBoost provides an advanced boosted-tree model with strong ranking performance.

In the exploration stage, Pearson correlation was used to screen linear trends and potential multicollinearity. In the modelling stage, SHAP was used to interpret XGBoost, providing global and local explanations and revealing non-linear and interaction effects that Pearson correlation can miss.

9.2 Q2 (Regression): Predict a customer’s total charges (totalcharges)

We used linear regression with an interaction term between tenure and monthly charges to estimate total charges. EDA confirmed a strong positive relationship between tenure and total charges, and the interaction term reflects how long-term customers with higher monthly plans accumulate substantially higher total charges. This supports customer value estimation and budgeting.

9.3 Q3 (Clustering): Group customers into behavioural segments

We applied clustering to segment customers based on tenure, monthly charges, number of services, satisfaction, credit score, complaints, and late payments. The resulting clusters can be used to tailor retention actions (e.g., different offers for high-value loyal customers vs price-sensitive or complaint-heavy segments).

Although the models achieved meaningful analytical results, several limitations remain. The dataset is synthetic and may not fully represent real-world telecom customer behaviour. In addition, clustering showed moderate separation between groups, meaning segments can overlap in practice. Future improvements could include cross-validation, more robust feature selection, and threshold selection based on the business cost of false positives vs false negatives.

Overall, the project shows how data science techniques can support customer retention decision-making by helping organisations identify high-risk customers, estimate customer value, and design targeted retention strategies more effectively.

Customer Churn Prediction Analysis

WQD7004 Programming for Data Science – Group Project

Group Members: CHEN MIAOJUN, HUANG SHENGQI, WANG HEXIAO, WANG PEIYUAN

2026-06-02