| Name | Matric No |
|---|---|
| Kenneth Wong Wei Keong (Leader) | U2103199 |
| Heng Huey Ying | 24202166 |
| Kaung Htet Shyan | 24213227 |
| Yong Kai Jing | 24219325 |
| Lee Jer Shen | U2103193 |
| Ong Keong Yee | 24058788 |
Our group decided to proceed with a research project titled “Machine Learning Based Churn Prediction for Telecom Customers”, which aims to apply data science and machine learning techniques to address the problem of customer churn in the telecommunications industry. By leveraging customer demographic, service subscription, and billing data, the project seeks to identify patterns and key factors associated with customer attrition.
Customer churn is a critical challenge in the telecommunications industry, where competition is intense and customers can easily switch service providers due to similar pricing structures and service offerings. Customer churn refers to the phenomenon in which customers discontinue their subscription or terminate their relationship with a company. High churn rates can significantly impact a company’s revenue, profitability, and long-term sustainability, as acquiring new customers is often more costly than retaining existing ones.
Telecommunication companies typically collect large volumes of customer-related data, including demographic information, service subscriptions, billing details, and contract types. When analyzed effectively, this data can provide valuable insights into customer behavior and help identify patterns associated with churn. Traditional methods of churn analysis may fail to capture complex relationships within the data, making them less effective for accurate prediction.
With the advancement of data science and machine learning techniques, predictive models can be developed to identify customers who are at a higher risk of churning. By predicting churn in advance, telecom operators can implement targeted retention strategies, such as personalized offers or service improvements, to reduce customer attrition. Applying machine learning approaches to customer churn analysis is a valuable problem in the context of modern data-driven decision-making.
The project objective is to apply R programming, data science, and machine learning techniques to analyze and predict customer churn in the telecommunications industry. This study utilizes the Telco Customer Churn dataset obtained from Kaggle, which contains customer demographic information, service subscription details, contract types, and billing records. By working with a real-world dataset, the project aims to demonstrate the practical application of programming for data science in solving a relevant business problem.
A high-level Data Science Methodology (DISM) is proposed to ensure a structured and systematic analytical process. The methodology begins with understanding the business problem and dataset, followed by data collection and exploration. Data cleaning and preprocessing are then performed in R to address data quality issues such as missing values, inconsistent data types, and irrelevant attributes. Exploratory data analysis is conducted to identify patterns and relationships between customer characteristics and churn behavior.
Machine learning classification models are developed and compared to predict whether a customer is likely to churn, while a regression model is built to predict customer charges based on service usage and subscription features. The performance of the models is evaluated and interpreted in a business-relevant context to provide insights that may support customer retention strategies in the telecommunications industry.
The project identified and selected the Telco Customer Churn dataset from Kaggle as its data source. The dataset is used to explore customer behavior, identify factors associated with churn, and support the development of machine learning models for both classification and regression tasks.
The dataset used in this project is titled Telco Customer Churn and was obtained from the Kaggle data repository. The dataset was originally published in 2017 and is based on a real-world business scenario in the telecommunications industry. Its primary purpose is to support the analysis of customer behavior and to enable the development of predictive models for identifying customers who are at risk of churning.
The dataset contains customer demographic information, service subscription details, contract types, and billing records, making it suitable for supervised machine learning tasks such as classification and regression. By providing structured and labeled data, the dataset facilitates the application of data science techniques to study factors associated with customer attrition and customer spending behavior.
The Telco Customer Churn dataset consists of 7,043 customer records and 21 variables. Each record represents an individual customer, while each variable captures specific information related to customer demographics, service subscriptions, billing details, or churn status. The dataset size is well suited to our machine learning analysis in R.
The table below summarizes the variables in the Telco Customer Churn dataset along with their data types and descriptions.
| Variable Name | Data Type | Description |
|---|---|---|
| customerID | Character | Unique identifier assigned to each customer |
| gender | Categorical | Customer’s gender |
| SeniorCitizen | Categorical | Indicates whether the customer is a senior citizen |
| Partner | Categorical | Indicates if the customer has a partner |
| Dependents | Categorical | Indicates if the customer has dependents |
| tenure | Numeric | Number of months the customer has stayed with the company |
| PhoneService | Categorical | Indicates whether the customer has phone service |
| MultipleLines | Categorical | Indicates if multiple phone lines are used |
| InternetService | Categorical | Type of internet service subscribed |
| OnlineSecurity | Categorical | Indicates if online security service is subscribed |
| OnlineBackup | Categorical | Indicates if online backup service is subscribed |
| DeviceProtection | Categorical | Indicates if device protection service is subscribed |
| TechSupport | Categorical | Indicates if technical support service is subscribed |
| StreamingTV | Categorical | Indicates if streaming TV service is subscribed |
| StreamingMovies | Categorical | Indicates if streaming movie service is subscribed |
| Contract | Categorical | Type of customer contract |
| PaperlessBilling | Categorical | Indicates if paperless billing is enabled |
| PaymentMethod | Categorical | Method of payment used by the customer |
| MonthlyCharges | Numeric | Monthly charges billed to the customer |
| TotalCharges | Numeric | Total charges accumulated by the customer |
| Churn | Categorical | Indicates whether the customer churned |
Initial exploration of the dataset indicates an imbalance in the churn variable, with a larger proportion of customers remaining subscribed compared to those who churned. Customers with shorter tenure periods and those on month-to-month contracts tend to exhibit higher churn rates, while customers under longer-term contracts demonstrate stronger retention. Several data quality issues are observed, including inconsistent data types and missing or blank values in billing-related variables. These findings emphasize the importance of performing systematic data cleaning and preprocessing before proceeding to exploratory data analysis and model development.
The main objective of the data cleaning process is to transform the raw Telco Customer Churn dataset into a clean and consistent format suitable for data analysis and machine learning modelling. Specifically, the cleaning process aims to resolve missing values, correct inconsistent data types, remove redundant categories, and ensure that all variables are properly formatted for classification and regression tasks.
The data cleaning process was conducted using R programming with the tidyverse package. The steps were performed to ensure transparency, reproducibility, and data integrity.
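The import step itself is not shown in the output below; a minimal sketch, assuming the standard Kaggle file name for this dataset:
library(tidyverse)
# Read the raw CSV; TotalCharges arrives as text because of blank entries,
# so coercing it to numeric turns those blanks into NA (11 rows, as shown below)
df <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv") %>%
  mutate(TotalCharges = as.numeric(TotalCharges))
cat("Data Structure Before Cleaning:\n")
glimpse(df)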
## Data Structure Before Cleaning:
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
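The per-column missing-value count that follows is presumably generated along these lines:
cat("Missing Values for Each Column:\n")
colSums(is.na(df))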
## Missing Values for Each Column:
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
missing_rows <- df %>%
filter(is.na(TotalCharges)) %>%
select(customerID, tenure, TotalCharges)
cat("Check Tenure for Rows With Missing TotalCharges:\n")## Check Tenure for Rows With Missing TotalCharges:
## customerID tenure TotalCharges
## 1 4472-LVYGI 0 NA
## 2 3115-CZMZD 0 NA
## 3 5709-LVOEQ 0 NA
## 4 4367-NUYAO 0 NA
## 5 1371-DWPAZ 0 NA
## 6 7644-OMVMY 0 NA
## 7 3213-VVOLG 0 NA
## 8 2520-SGTTA 0 NA
## 9 2923-ARZLG 0 NA
## 10 4075-WKNIU 0 NA
## 11 2775-SEFEE 0 NA
df_clean <- df %>%
# Drop CustomerID column (not needed for modeling)
select(-customerID) %>%
# Impute 0 for missing TotalCharges (new customers)
mutate(TotalCharges = replace_na(TotalCharges, 0)) %>%
# Collapse redundant categories ("No internet service" is the same as "No")
mutate(across(c(OnlineSecurity, OnlineBackup, DeviceProtection,
TechSupport, StreamingTV, StreamingMovies),
~ ifelse(. == "No internet service", "No", .))) %>%
# Collapse redundant categories ("No phone service" is the same as "No")
mutate(MultipleLines = ifelse(MultipleLines == "No phone service", "No",
MultipleLines)) %>%
# Convert SeniorCitizen 0/1 (integer) to "No"/"Yes" (factor)
mutate(SeniorCitizen = factor(ifelse(SeniorCitizen == 1, "Yes", "No"))) %>%
# Convert all character columns to factors
  mutate(across(where(is.character), as.factor))
cat("Missing Values in the Dataset after Cleaning:\n")
sum(is.na(df_clean))
## Missing Values in the Dataset after Cleaning:
## [1] 0
## Data Structure After Cleaning:
## Rows: 7,043
## Columns: 20
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
collapsed_cols <- c("OnlineSecurity", "OnlineBackup", "DeviceProtection",
"TechSupport", "StreamingTV", "StreamingMovies",
"MultipleLines")
cat("Verifying Collapsed Columns (Should only contain 'Yes'/'No'):\n")## Verifying Collapsed Columns (Should only contain 'Yes'/'No'):
## $OnlineSecurity
## [1] "No" "Yes"
##
## $OnlineBackup
## [1] "No" "Yes"
##
## $DeviceProtection
## [1] "No" "Yes"
##
## $TechSupport
## [1] "No" "Yes"
##
## $StreamingTV
## [1] "No" "Yes"
##
## $StreamingMovies
## [1] "No" "Yes"
##
## $MultipleLines
## [1] "No" "Yes"
## Verifying factor columns and their levels:
## $gender
## [1] "Female" "Male"
##
## $SeniorCitizen
## [1] "No" "Yes"
##
## $Partner
## [1] "No" "Yes"
##
## $Dependents
## [1] "No" "Yes"
##
## $PhoneService
## [1] "No" "Yes"
##
## $MultipleLines
## [1] "No" "Yes"
##
## $InternetService
## [1] "DSL" "Fiber optic" "No"
##
## $OnlineSecurity
## [1] "No" "Yes"
##
## $OnlineBackup
## [1] "No" "Yes"
##
## $DeviceProtection
## [1] "No" "Yes"
##
## $TechSupport
## [1] "No" "Yes"
##
## $StreamingTV
## [1] "No" "Yes"
##
## $StreamingMovies
## [1] "No" "Yes"
##
## $Contract
## [1] "Month-to-month" "One year" "Two year"
##
## $PaperlessBilling
## [1] "No" "Yes"
##
## $PaymentMethod
## [1] "Bank transfer (automatic)" "Credit card (automatic)"
## [3] "Electronic check" "Mailed check"
##
## $Churn
## [1] "No" "Yes"
The exploratory data analysis (EDA) was conducted to better understand churn distribution, billing patterns, contract effects, and relationships between numeric variables. The visualizations below support feature selection and guide model choices for both classification and regression tasks.
library(corrplot)
library(patchwork)
# Import cleaned dataset rds file into a dataframe
df <- readRDS("TelcoCustomerChurn_Cleaned.rds")
# Check data structure to ensure data types are correct
cat("Data Structure from RDS File:\n")## Data Structure from RDS File:
## Rows: 7,043
## Columns: 20
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
# Plot bar charts for churn distribution to check class imbalance
p1 <- ggplot(df, aes(x = Churn, fill = Churn)) +
geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
labs(title = "Churn Distribution") +
theme_minimal() +
theme(legend.position = "none")
# Plot histogram for total charges distribution to check skewness
p2 <- ggplot(df, aes(x = TotalCharges)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Total Charges Distribution") +
theme_minimal()
# Combine two plots side by side
p1 + p2
The churn distribution plot on the left reveals a significant class imbalance. There are far more customers who stayed (“No” = 5,174) than churned (“Yes” = 1,869). This could potentially lead to the model being naturally biased toward predicting “No”.
The total charges distribution on the right is right-skewed: most customers have low total charges, with a long tail of high-value customers.
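As a quick numeric check of the imbalance, a minimal sketch on the cleaned data frame df:
# Class proportions; roughly 73% "No" vs 27% "Yes" given the counts above
prop.table(table(df$Churn))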
# Plot bar charts for churn by contract type
p3 <- ggplot(df, aes(x = Contract, fill = Churn)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("No" = "tomato", "Yes" = "seagreen")) +
labs(y = "Proportion", title = "Churn by Contract Type") +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
# Plot box plots for monthly charges vs churn
p4 <- ggplot(df, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
geom_boxplot() +
scale_fill_manual(values = c("No" = "tomato", "Yes" = "seagreen")) +
labs(title = "Monthly Charges vs Churn") +
theme_minimal() +
theme(legend.position = "none")
# Combine two plots side by side
p3 + p4
Contract type is likely the strongest categorical predictor in the dataset. The churn rate for month-to-month contracts is drastically higher than for one-year or two-year contracts; long-term contracts essentially "lock in" customers.
The boxplot reveals that customers who churn (“Yes” / Green) tend to have higher median monthly charges compared to those who stay (“No” / Red). This suggests price sensitivity is a factor in churning.
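To put numbers on the boxplot comparison, a short summary sketch using the same cleaned data:
# Median and mean monthly charges by churn status
df %>%
  group_by(Churn) %>%
  summarise(median_monthly = median(MonthlyCharges),
            mean_monthly   = mean(MonthlyCharges))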
# Plot correlational matrix to check for multicollinearity
# Isolate numeric columns first
num_cols <- df %>% select(where(is.numeric))
cor_matrix <- cor(num_cols)
corrplot(cor_matrix, method = "number", type = "upper",
title = "Correlation Matrix (Check Multicollinearity)",
         mar = c(0, 0, 2, 0))
There is a very strong positive correlation of 0.83 between tenure and TotalCharges. This indicates multicollinearity, as TotalCharges is essentially the accumulated product of tenure and MonthlyCharges.
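Since TotalCharges should approximate tenure times MonthlyCharges, this can be checked directly (a quick sketch; a value near 1 would confirm the relationship):
# Correlation between actual TotalCharges and the tenure * MonthlyCharges product
with(df, cor(TotalCharges, tenure * MonthlyCharges))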
# Plot for tenure vs total charges to visualize correlation
ggplot(df, aes(x = tenure, y = TotalCharges, color = Churn)) +
geom_point(alpha = 0.5) +
labs(title = "Tenure vs Total Charges") +
  theme_minimal()
The scatter plot visually confirms the correlation matrix: TotalCharges grows roughly linearly as tenure increases.
The classification task aims to predict customer churn (Yes/No) using the available customer demographics, service subscriptions, contract type, and billing-related variables from the Telco dataset. This is important because early identification of high-risk customers enables telecom providers to implement targeted retention actions and reduce revenue loss. Therefore, supervised classification models are trained and compared to determine the most effective approach for churn prediction.
We evaluated four models to predict customer churn from the available demographic, behavioral, and transaction data: Decision Tree, Random Forest, Logistic Regression, and Support Vector Machine (SVM).
get_metrics <- function(cm, model_name){
accuracy <- round(cm$overall["Accuracy"], 4)
p_value <- cm$overall["AccuracyPValue"]
sensitivity <- round(cm$byClass["Sensitivity"], 4)
specificity <- round(cm$byClass["Specificity"], 4)
precision <- round(cm$byClass["Pos Pred Value"], 4)
prevalence <- round(cm$byClass["Prevalence"], 4)
f1_score <- round(2 * (precision * sensitivity) / (precision + sensitivity), 3)
data.frame(
Model = model_name,
Accuracy = accuracy,
Sensitivity = sensitivity,
Specificity = specificity,
Precision = precision,
F1 = f1_score,
Prevalence = prevalence,
p_value = p_value
)
}
clean_vi <- function(model, model_name) {
vi <- varImp(model)$importance %>%
as.data.frame()
# Handle multiclass importance
if (ncol(vi) > 1) {
vi$Overall <- rowMeans(vi)
} else {
colnames(vi) <- "Overall"
}
vi %>%
mutate(
var = rownames(.),
model = model_name
) %>%
select(var, Overall, model) %>%
arrange(desc(Overall))
}
library(caret)
library(knitr)
library(tidytext)
telco<-readRDS("TelcoCustomerChurn_Cleaned.rds")
telco$Churn <- factor(telco$Churn, levels = c("No", "Yes"))
summary(telco)
## gender SeniorCitizen Partner Dependents tenure PhoneService
## Female:3488 No :5901 No :3641 No :4933 Min. : 0.00 No : 682
## Male :3555 Yes:1142 Yes:3402 Yes:2110 1st Qu.: 9.00 Yes:6361
## Median :29.00
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## No :4072 DSL :2421 No :5024 No :4614 No :4621
## Yes:2971 Fiber optic:3096 Yes:2019 Yes:2429 Yes:2422
## No :1526
##
##
##
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## No :4999 No :4336 No :4311 Month-to-month:3875 No :2872
## Yes:2044 Yes:2707 Yes:2732 One year :1473 Yes:4171
## Two year :1695
##
##
##
## PaymentMethod MonthlyCharges TotalCharges Churn
## Bank transfer (automatic):1544 Min. : 18.25 Min. : 0.0 No :5174
## Credit card (automatic) :1522 1st Qu.: 35.50 1st Qu.: 398.6 Yes:1869
## Electronic check :2365 Median : 70.35 Median :1394.5
## Mailed check :1612 Mean : 64.76 Mean :2279.7
## 3rd Qu.: 89.85 3rd Qu.:3786.6
## Max. :118.75 Max. :8684.8
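Two objects used below, train_control and matrix2df, are not defined in the chunks shown. A minimal reconstruction, assuming 5-fold cross-validation (mirroring the regression section) and a simple conversion of caret's confusion-matrix table for plotting:
# Assumed resampling scheme for caret::train
train_control <- trainControl(method = "cv", number = 5)
# Assumed helper: flatten a confusionMatrix into a data frame with a Model column
# (as.data.frame(cm$table) yields Prediction, Reference, and Freq columns)
matrix2df <- function(cm, model_name) {
  as.data.frame(cm$table) %>% mutate(Model = model_name)
}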
split<- 0.80
#partition the data
train_index <- createDataPartition(telco$Churn, p = split, list = FALSE)#define training dataset
data_train <- telco[train_index, ]
#define testing dataset
data_test <- telco[-train_index, ]data_model <- train(Churn ~ ., data = data_train, method = "rpart", trControl = train_control,tuneLength = 5)
prediction <- predict(data_model, data_test)data_model <- train(Churn ~ ., data = data_train, method = "rf", trControl = train_control,tuneLength = 5)
prediction <- predict(data_model, data_test)
churn_rf_performance <- confusionMatrix(prediction, data_test$Churn)
rf_vi <- clean_vi(data_model, "Random Forest")data_model <- train(Churn ~ ., data = data_train, method = "glm", family = "binomial", trControl = train_control,tuneLength = 5)
prediction <- predict(data_model, data_test)
churn_lr_performance <- confusionMatrix(prediction, data_test$Churn)
lr_vi <- clean_vi(data_model, "Logistic Regression")data_model <- train(Churn ~ ., data = data_train, method = "svmRadial", preProcess = c("center", "scale"), trControl = train_control,tuneLength = 5)
prediction <- predict(data_model, data_test)
churn_svm_performance <- confusionMatrix(prediction, data_test$Churn)
svm_vi <- clean_vi(data_model, "Support Vector Machine")dt_df <- matrix2df(churn_dt_performance, "Decision Tree")
rf_df <- matrix2df(churn_rf_performance, "Random Forest")
lr_df <- matrix2df(churn_lr_performance, "Logistic Regression")
svm_df <- matrix2df(churn_svm_performance, "SVM")
cm_combined <- bind_rows(dt_df, rf_df, lr_df, svm_df)
ggplot(cm_combined, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 5) +
scale_fill_gradient(low = "lightblue", high = "steelblue") +
facet_wrap(~ Model) +
labs(title = "Confusion Matrices: All Models") +
theme_minimal() +
  theme(strip.text = element_text(size = 12))
dt_metrics <- get_metrics(churn_dt_performance, "Decision Tree")
rf_metrics <- get_metrics(churn_rf_performance, "Random Forest")
lr_metrics <- get_metrics(churn_lr_performance, "Logistic Regression")
svm_metrics <- get_metrics(churn_svm_performance, "SVM")
# Combine into one table
metrics_table <- bind_rows(dt_metrics, rf_metrics, lr_metrics, svm_metrics)
kable(metrics_table, row.names = FALSE, format = "pipe")
| Model | Accuracy | Sensitivity | Specificity | Precision | F1 | Prevalence | p_value |
|---|---|---|---|---|---|---|---|
| Decision Tree | 0.7889 | 0.9246 | 0.4129 | 0.8136 | 0.866 | 0.7349 | 1.5e-06 |
| Random Forest | 0.8060 | 0.9265 | 0.4718 | 0.8294 | 0.875 | 0.7349 | 0.0e+00 |
| Logistic Regression | 0.8074 | 0.8994 | 0.5523 | 0.8478 | 0.873 | 0.7349 | 0.0e+00 |
| SVM | 0.8031 | 0.9178 | 0.4853 | 0.8317 | 0.873 | 0.7349 | 0.0e+00 |
The confusion matrices summarize the churn predictions of the four models by comparing predicted and actual outcomes. Logistic Regression has the highest accuracy (80.74%), closely followed by Random Forest (80.60%) and SVM (80.31%); Decision Tree is slightly lower at 78.89%. Note that caret treats "No" (the majority class, prevalence 73.49% for all models) as the positive class here, so sensitivity measures how well a model identifies customers who stay, while specificity measures how well it detects actual churners. Random Forest has the highest sensitivity, with Decision Tree slightly lower, followed by SVM and lastly Logistic Regression. Logistic Regression has the highest specificity (0.5523), meaning it catches the most true churners, while Decision Tree has the lowest (0.4129). Logistic Regression also leads on precision (0.8478), with Decision Tree again the lowest (0.8136). Random Forest, Logistic Regression, and SVM all perform well on F1-score, with Random Forest the highest. Finally, the p-values indicate that all four models perform significantly better than the no-information rate.

Among the models, Random Forest achieved the most balanced performance across precision and recall with competitive accuracy, indicating its robustness for predicting customer churn. Logistic Regression provided the most interpretable results and the best detection of actual churners. Decision Trees were prone to overfitting, while SVM performed competitively but required feature scaling.
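The combined importance table used below is not shown being built; presumably it stacks the per-model tables returned by clean_vi():
# Assumed combination of the per-model variable-importance tables
vi_combined <- bind_rows(dt_vi, rf_vi, lr_vi, svm_vi)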
vi_top10 <- vi_combined %>%
group_by(model) %>%
slice_max(Overall, n = 10) %>%
  mutate(var = reorder_within(var, Overall, model))
ggplot(vi_top10, aes(x = var, y = Overall)) +
geom_col() +
coord_flip() +
facet_wrap(~ model, scales = "free_y") +
scale_x_reordered() +
labs(x = "Variable", y = "Importance")The variable importance plots reveal both consistent and model-specific drivers of customer churn across the four classification models. Tenure emerges as the most influential predictor in all models, indicating that customers with shorter service duration are significantly more likely to churn. This suggests customer loyalty strengthens over time, making early-stage customers the most vulnerable group. Contract type (one-year and two-year contracts) also plays a critical role, particularly in Logistic Regression and SVM, where longer contracts are associated with lower churn risk, highlighting the stabilizing effect of long-term commitments. MonthlyCharges and TotalCharges are consistently important across Random Forest and SVM, suggesting that higher financial burden increases churn probability. Service-related features such as InternetService (Fiber optic), OnlineSecurity, and TechSupport appear prominently in tree-based models, indicating that service quality and perceived value strongly influence customer retention. While Logistic Regression emphasizes linear financial and contractual effects, tree-based models capture more complex interactions among service features. Overall, despite differences in ranking, all models consistently identify tenure, contract duration, and pricing variables as the key determinants of churn, reinforcing the robustness of these predictors across different classification approaches.
The regression task aims to predict customers’ MonthlyCharges based on their subscribed services, contract characteristics, and demographic attributes. Understanding the factors that influence monthly billing amounts is important for telecommunications companies to support pricing optimisation, revenue forecasting, and service bundle design. Supervised regression models are developed and evaluated to identify patterns that explain variations in customer charges.
Multiple regression models, including Linear Regression, Random Forest Regression, and Partial Least Squares (PLS), were implemented to capture both linear and non-linear relationships between customer attributes and monthly charges while addressing potential multicollinearity among predictors.
library(randomForest)
library(pls)
str(telco)
## 'data.frame': 7043 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
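The regression data frame telco_reg is referenced below without its construction being shown. Given the 18-column training set (17 predictors plus MonthlyCharges) and the 21 centered and scaled dummy variables reported by PLS, it is presumably the cleaned data with the classification target and the collinear TotalCharges removed; a sketch under that assumption:
# Assumed construction: drop Churn (the classification target) and TotalCharges
# (collinear with tenure, as shown in the EDA correlation matrix)
telco_reg <- telco %>% select(-Churn, -TotalCharges)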
set.seed(42)
split <- 0.80
train_idx <- createDataPartition(telco_reg$MonthlyCharges, p = split, list = FALSE)
train_data <- telco_reg[train_idx, ]
test_data <- telco_reg[-train_idx, ]
dim(train_data)
## [1] 5636 18
dim(test_data)
## [1] 1407 18
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE, allowParallel = TRUE)
# Linear Regression (LR)
fit_lr <- train(
MonthlyCharges ~ .,
data = train_data,
method = "lm",
metric = "RMSE",
trControl = ctrl
)
# Random Forest (RF)
fit_rf <- train(
MonthlyCharges ~ .,
data = train_data,
method = "rf",
tuneLength = 5,
metric = "RMSE",
trControl = ctrl
)
# Partial Least Squares (PLS)
fit_pls <- train(
MonthlyCharges ~ .,
data = train_data,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 10,
metric = "RMSE",
trControl = ctrl
)
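The cross-validation summaries that follow are presumably printed by evaluating each fitted model object:
fit_lr
fit_rf
fit_pls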
## Linear Regression
##
## 5636 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4509, 4509, 4508, 4510, 4508
## Resampling results:
##
## RMSE Rsquared MAE
## 1.024385 0.9988405 0.7832199
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Random Forest
##
## 5636 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4509, 4508, 4510, 4508, 4509
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.530269 0.9833686 4.483291
## 6 1.627212 0.9971904 1.193221
## 11 1.390421 0.9978638 1.022253
## 16 1.366523 0.9979321 1.002964
## 21 1.370940 0.9979185 1.002905
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 16.
## Partial Least Squares
##
## 5636 samples
## 17 predictor
##
## Pre-processing: centered (21), scaled (21)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4509, 4510, 4508, 4508, 4509
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 9.323039 0.9039015 7.3368494
## 2 4.959831 0.9727608 4.0029247
## 3 2.360437 0.9938354 1.9147502
## 4 1.449075 0.9976762 1.1640424
## 5 1.153290 0.9985299 0.8995188
## 6 1.062011 0.9987538 0.8166797
## 7 1.031180 0.9988263 0.7901844
## 8 1.025804 0.9988387 0.7848406
## 9 1.025284 0.9988400 0.7844231
## 10 1.025209 0.9988403 0.7841488
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.
4. Test-Set Evaluation
# Test predictions
pred_lr <- predict(fit_lr, newdata = test_data)
pred_rf <- predict(fit_rf, newdata = test_data)
pred_pls <- predict(fit_pls, newdata = test_data)
# Metrics: RMSE, RSquared, MAE
m_lr <- postResample(pred = pred_lr, obs = test_data$MonthlyCharges)
m_rf <- postResample(pred = pred_rf, obs = test_data$MonthlyCharges)
m_pls <- postResample(pred = pred_pls, obs = test_data$MonthlyCharges)
results <- data.frame(
Model = c("Linear Regression (lm)", "Random Forest (rf)", "PLS (pls)"),
RMSE = c(m_lr["RMSE"], m_rf["RMSE"], m_pls["RMSE"]),
Rsquared = c(m_lr["Rsquared"], m_rf["Rsquared"], m_pls["Rsquared"]),
MAE = c(m_lr["MAE"], m_rf["MAE"], m_pls["MAE"])
)
results
Based on the test-set evaluation, Partial Least Squares is the best-performing model: it achieves the lowest prediction errors and the highest R-squared of the three models, with RMSE 1.043026, MAE 0.7860574, and R-squared 0.998815. Linear Regression is a very close second, which indicates the relationship between the predictors and MonthlyCharges is predominantly linear, while PLS gains a slight edge by using latent components that reduce the impact of correlated predictors and focus on the strongest shared signal. In contrast, Random Forest underperforms: its RMSE and MAE are materially higher, suggesting that its non-linear flexibility does not translate into better generalization on this dataset and may introduce extra variance when the underlying pattern is already well explained by linear structure.
5. Diagnostic Plots
obs <- test_data$MonthlyCharges
preds <- data.frame(LR = pred_lr, RF = pred_rf, PLS = pred_pls)
long <- cbind(Actual = obs, stack(preds))
colnames(long) <- c("Actual", "Predicted", "Model")
long$Residual <- long$Actual - long$Predicted
center_title <- theme(plot.title = element_text(hjust = 0.5))
# 1) Actual vs Predicted
ggplot(long, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.35, size = 1) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
geom_smooth(method = "loess", se = FALSE) +
facet_wrap(~ Model) +
coord_equal() +
labs(
title = "Actual vs Predicted (Test Set)",
x = "Actual MonthlyCharges",
y = "Predicted MonthlyCharges"
) +
theme_minimal() +
  center_title
## `geom_smooth()` using formula = 'y ~ x'
# 2) Residuals vs Predicted
ggplot(long, aes(x = Predicted, y = Residual)) +
geom_point(alpha = 0.35, size = 1) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(method = "loess", se = FALSE) +
facet_wrap(~ Model) +
labs(
title = "Residuals vs Predicted (Test Set)",
x = "Predicted MonthlyCharges",
y = "Residual (Actual - Predicted)"
) +
theme_minimal() +
  center_title
## `geom_smooth()` using formula = 'y ~ x'
# 3) Residual distribution
ggplot(long, aes(x = Residual)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.5) +
geom_density() +
facet_wrap(~ Model, scales = "free_y") +
labs(
title = "Residual Distribution (Test Set)",
x = "Residual",
y = "Density"
) +
theme_minimal() +
  center_title
In the Actual vs Predicted plots, LR and PLS points lie almost exactly on the 45-degree reference line across the full range of MonthlyCharges, and the smooth trend line stays close to the diagonal. This indicates excellent calibration: the models are not systematically overpredicting or underpredicting. RF shows the same overall trend but with wider scatter around the diagonal, especially in the mid-to-high charge range, which aligns with its higher RMSE and MAE.
In the Residuals vs Predicted plots, LR and PLS residuals are centered near zero and remain relatively tight, which suggests stable errors and good generalization. There is a mild increase in residual spread at higher predicted MonthlyCharges, indicating slight heteroscedasticity: error variance increases as charges rise. The smooth line also drifts slightly above zero at the high end, implying mild underprediction for higher-charge customers, since a positive residual means actual exceeds predicted. RF displays noticeably larger dispersion and more extreme residuals, confirming that its errors are less stable and that the model introduces additional variability without improving fit.
In the Residual Distribution plots, LR and PLS show symmetric distributions centered near zero, indicating low bias and consistently small errors. RF shows a wider distribution with a more pronounced right tail, suggesting more frequent and larger underpredictions, which matches its weaker error metrics.
Overall, the plots justify selecting PLS as the best model: it combines near-perfect calibration with the tightest residual spread and the most symmetric residual distribution.
imp_lr <- varImp(fit_lr, scale = TRUE)
imp_rf <- varImp(fit_rf, scale = TRUE)
imp_pls <- varImp(fit_pls, scale = TRUE)
plot(imp_lr, top = 10, main = "Top 10 Predictors - LR")
plot(imp_rf, top = 10, main = "Top 10 Predictors - RF")
plot(imp_pls, top = 10, main = "Top 10 Predictors - PLS")
Across all three models, InternetServiceFiber optic is the most influential predictor, with InternetServiceNo also ranking very highly. This is expected in a telecom pricing context because internet service type typically sets the base price tier: fiber optic plans are usually priced higher, while having no internet service implies a much lower monthly bill.
Variables such as StreamingTVYes, StreamingMoviesYes, PhoneServiceYes, and MultipleLinesYes appear repeatedly in the top 10. This indicates that once the base internet tier is set, bundled entertainment and phone features explain additional variation in monthly charges. These are the upsell components that add predictable increments to the bill.
OnlineBackupYes, OnlineSecurityYes, DeviceProtectionYes, and TechSupportYes also appear among the top predictors. These are optional add-on services that are expected to contribute to MonthlyCharges, but their impact is smaller than that of the base internet service tier and the major bundle features.
In the PLS plot, PaperlessBillingYes and PaymentMethodMailed check appear in the top 10. This does not mean these billing choices directly change MonthlyCharges; rather, they act as a signal of plan type, because customers who pay by mailed check often subscribe to different services than customers who use paperless billing or other payment methods. In PLS they are therefore captured within the same latent components as the plan and service variables, since these features tend to vary together. Overall, all three models agree that the internet service tier and bundled add-on services are the key drivers of MonthlyCharges.
This project applied R programming and data science techniques to analyse the Telco Customer Churn dataset with the aim of understanding customer behaviour and developing predictive models for business decision support. The dataset was systematically cleaned and prepared to address missing values, inconsistent data types, and redundant categories, ensuring that the data was suitable for reliable analysis and modelling.
For the classification task, multiple supervised learning models were implemented and compared to predict customer churn. Among the models evaluated, Random Forest demonstrated the most balanced performance, effectively identifying customers at higher risk of churn while maintaining overall predictive accuracy. Key predictors such as tenure, contract type, and monthly charges were consistently highlighted across models, reinforcing their importance in customer retention analysis.
In the regression task, models were developed to predict customers' monthly charges based on their subscribed services and contract attributes. Partial Least Squares achieved the best predictive performance, with Linear Regression a close second, indicating that the relationship between customer features and billing amounts is predominantly linear once correlated predictors are handled. Overall, the project demonstrates how a structured and reproducible data science workflow can provide actionable insights to support customer retention strategies and pricing decisions in the telecommunications industry.