| Name | Matric No |
|---|---|
| Kenneth Wong Wei Keong (Leader) | U2103199 |
| Heng Huey Ying | 24202166 |
| Kaung Htet Shyan | 24213227 |
| Yong Kai Jing | 24219325 |
| Lee Jer Shen | U2103193 |
| Ong Keong Yee | 24058788 |
Our group decided to proceed with a research project titled “Machine Learning Based Churn Prediction for Telecom Customers”, which aims to apply data science and machine learning techniques to address the problem of customer churn in the telecommunications industry. By leveraging customer demographic, service subscription, and billing data, the project seeks to identify patterns and key factors associated with customer attrition.
Customer churn is a critical challenge in the telecommunications industry, where competition is intense and customers can easily switch service providers due to similar pricing structures and service offerings. Customer churn refers to the phenomenon in which customers discontinue their subscription or terminate their relationship with a company. High churn rates can significantly impact a company’s revenue, profitability, and long-term sustainability, as acquiring new customers is often more costly than retaining existing ones.
Telecommunication companies typically collect large volumes of customer-related data, including demographic information, service subscriptions, billing details, and contract types. When analyzed effectively, this data can provide valuable insights into customer behavior and help identify patterns associated with churn. Traditional methods of churn analysis may fail to capture complex relationships within the data, making them less effective for accurate prediction.
With the advancement of data science and machine learning techniques, predictive models can be developed to identify customers who are at a higher risk of churning. By predicting churn in advance, telecom operators can implement targeted retention strategies, such as personalized offers or service improvements, to reduce customer attrition. Applying machine learning approaches to customer churn analysis is a valuable problem in the context of modern data-driven decision-making.
The project objective is to apply R programming, data science, and machine learning techniques to analyze and predict customer churn in the telecommunications industry. This study utilizes the Telco Customer Churn dataset obtained from Kaggle, which contains customer demographic information, service subscription details, contract types, and billing records. By working with a real-world dataset, the project aims to demonstrate the practical application of programming for data science in solving a relevant business problem.
A high-level Data Science Methodology (DISM) is proposed to ensure a structured and systematic analytical process. The methodology begins with understanding the business problem and dataset, followed by data collection and exploration. Data cleaning and preprocessing are then performed in R to address data quality issues such as missing values, inconsistent data types, and irrelevant attributes. Exploratory data analysis is conducted to identify patterns and relationships between customer characteristics and churn behavior.
Machine learning classification models are developed and compared to predict whether a customer is likely to churn, while a regression model is built to predict customer charges based on service usage and subscription features. The performance of the models is evaluated and interpreted in a business-relevant context to provide insights that may support customer retention strategies in the telecommunications industry.
The project identified and selected the Telco Customer Churn dataset from Kaggle as its data source. The dataset is used to explore customer behavior, identify factors associated with churn, and support the development of machine learning models for both classification and regression tasks.
The dataset used in this project is titled Telco Customer Churn and was obtained from the Kaggle data repository. The dataset was originally published in 2017 and is based on a real-world business scenario in the telecommunications industry. Its primary purpose is to support the analysis of customer behavior and to enable the development of predictive models for identifying customers who are at risk of churning.
The dataset contains customer demographic information, service subscription details, contract types, and billing records, making it suitable for supervised machine learning tasks such as classification and regression. By providing structured and labeled data, the dataset facilitates the application of data science techniques to study factors associated with customer attrition and customer spending behavior.
The Telco Customer Churn dataset consists of 7,043 customer records and 21 variables. Each record represents an individual customer, while each variable captures specific information related to customer demographics, service subscriptions, billing details, or churn status. The dataset size is well suited to our machine learning analysis in R.
The table below summarizes the variables in the Telco Customer Churn dataset along with their data types and descriptions.
| Variable Name | Data Type | Description |
|---|---|---|
| customerID | Character | Unique identifier assigned to each customer |
| gender | Categorical | Customer’s gender |
| SeniorCitizen | Categorical | Indicates whether the customer is a senior citizen |
| Partner | Categorical | Indicates if the customer has a partner |
| Dependents | Categorical | Indicates if the customer has dependents |
| tenure | Numeric | Number of months the customer has stayed with the company |
| PhoneService | Categorical | Indicates whether the customer has phone service |
| MultipleLines | Categorical | Indicates if multiple phone lines are used |
| InternetService | Categorical | Type of internet service subscribed |
| OnlineSecurity | Categorical | Indicates if online security service is subscribed |
| OnlineBackup | Categorical | Indicates if online backup service is subscribed |
| DeviceProtection | Categorical | Indicates if device protection service is subscribed |
| TechSupport | Categorical | Indicates if technical support service is subscribed |
| StreamingTV | Categorical | Indicates if streaming TV service is subscribed |
| StreamingMovies | Categorical | Indicates if streaming movie service is subscribed |
| Contract | Categorical | Type of customer contract |
| PaperlessBilling | Categorical | Indicates if paperless billing is enabled |
| PaymentMethod | Categorical | Method of payment used by the customer |
| MonthlyCharges | Numeric | Monthly charges billed to the customer |
| TotalCharges | Numeric | Total charges accumulated by the customer |
| Churn | Categorical | Indicates whether the customer churned |
Initial exploration of the dataset indicates an imbalance in the churn variable, with a larger proportion of customers remaining subscribed compared to those who churned. Customers with shorter tenure periods and those on month-to-month contracts tend to exhibit higher churn rates, while customers under longer-term contracts demonstrate stronger retention. Several data quality issues are observed, including inconsistent data types and missing or blank values in billing-related variables. These findings emphasize the importance of performing systematic data cleaning and preprocessing before proceeding to exploratory data analysis and model development.
The main objective of the data cleaning process is to transform the raw Telco Customer Churn dataset into a clean and consistent format suitable for data analysis and machine learning modelling. Specifically, the cleaning process aims to resolve missing values, correct inconsistent data types, remove redundant categories, and ensure that all variables are properly formatted for classification and regression tasks.
The data cleaning process was conducted using R programming with the tidyverse package. The steps were performed to ensure transparency, reproducibility, and data integrity.
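The import step itself is not shown in the output below; a minimal sketch, assuming the standard Kaggle file name for this dataset:
library(tidyverse)
# Read the raw CSV; TotalCharges arrives as text because of blank entries,
# so coercing it to numeric turns those blanks into NA (11 rows, as shown below)
df <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv") %>%
  mutate(TotalCharges = as.numeric(TotalCharges))
cat("Data Structure Before Cleaning:\n")
glimpse(df)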
## Data Structure Before Cleaning:
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
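The per-column missing-value count that follows is presumably generated along these lines:
cat("Missing Values for Each Column:\n")
colSums(is.na(df))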
## Missing Values for Each Column:
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
missing_rows <- df %>%
filter(is.na(TotalCharges)) %>%
select(customerID, tenure, TotalCharges)
cat("Check Tenure for Rows With Missing TotalCharges:\n")## Check Tenure for Rows With Missing TotalCharges:
## customerID tenure TotalCharges
## 1 4472-LVYGI 0 NA
## 2 3115-CZMZD 0 NA
## 3 5709-LVOEQ 0 NA
## 4 4367-NUYAO 0 NA
## 5 1371-DWPAZ 0 NA
## 6 7644-OMVMY 0 NA
## 7 3213-VVOLG 0 NA
## 8 2520-SGTTA 0 NA
## 9 2923-ARZLG 0 NA
## 10 4075-WKNIU 0 NA
## 11 2775-SEFEE 0 NA
df_clean <- df %>%
# Drop CustomerID column (not needed for modeling)
select(-customerID) %>%
# Impute 0 for missing TotalCharges (new customers)
mutate(TotalCharges = replace_na(TotalCharges, 0)) %>%
# Collapse redundant categories ("No internet service" is the same as "No")
mutate(across(c(OnlineSecurity, OnlineBackup, DeviceProtection,
TechSupport, StreamingTV, StreamingMovies),
~ ifelse(. == "No internet service", "No", .))) %>%
# Collapse redundant categories ("No phone service" is the same as "No")
mutate(MultipleLines = ifelse(MultipleLines == "No phone service", "No",
MultipleLines)) %>%
# Convert SeniorCitizen 0/1 (integer) to "No"/"Yes" (factor)
mutate(SeniorCitizen = factor(ifelse(SeniorCitizen == 1, "Yes", "No"))) %>%
# Convert all character columns to factors
  mutate(across(where(is.character), as.factor))
cat("Missing Values in the Dataset after Cleaning:\n")
sum(is.na(df_clean))
## Missing Values in the Dataset after Cleaning:
## [1] 0
## Data Structure After Cleaning:
## Rows: 7,043
## Columns: 20
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
collapsed_cols <- c("OnlineSecurity", "OnlineBackup", "DeviceProtection",
"TechSupport", "StreamingTV", "StreamingMovies",
"MultipleLines")
cat("Verifying Collapsed Columns (Should only contain 'Yes'/'No'):\n")## Verifying Collapsed Columns (Should only contain 'Yes'/'No'):
## $OnlineSecurity
## [1] "No" "Yes"
##
## $OnlineBackup
## [1] "No" "Yes"
##
## $DeviceProtection
## [1] "No" "Yes"
##
## $TechSupport
## [1] "No" "Yes"
##
## $StreamingTV
## [1] "No" "Yes"
##
## $StreamingMovies
## [1] "No" "Yes"
##
## $MultipleLines
## [1] "No" "Yes"
## Verifying factor columns and their levels:
## $gender
## [1] "Female" "Male"
##
## $SeniorCitizen
## [1] "No" "Yes"
##
## $Partner
## [1] "No" "Yes"
##
## $Dependents
## [1] "No" "Yes"
##
## $PhoneService
## [1] "No" "Yes"
##
## $MultipleLines
## [1] "No" "Yes"
##
## $InternetService
## [1] "DSL" "Fiber optic" "No"
##
## $OnlineSecurity
## [1] "No" "Yes"
##
## $OnlineBackup
## [1] "No" "Yes"
##
## $DeviceProtection
## [1] "No" "Yes"
##
## $TechSupport
## [1] "No" "Yes"
##
## $StreamingTV
## [1] "No" "Yes"
##
## $StreamingMovies
## [1] "No" "Yes"
##
## $Contract
## [1] "Month-to-month" "One year" "Two year"
##
## $PaperlessBilling
## [1] "No" "Yes"
##
## $PaymentMethod
## [1] "Bank transfer (automatic)" "Credit card (automatic)"
## [3] "Electronic check" "Mailed check"
##
## $Churn
## [1] "No" "Yes"
The exploratory data analysis (EDA) was conducted to better understand churn distribution, billing patterns, contract effects, and relationships between numeric variables. The visualizations below support feature selection and guide model choices for both classification and regression tasks.
library(corrplot)
library(patchwork)
# Import cleaned dataset rds file into a dataframe
df <- readRDS("TelcoCustomerChurn_Cleaned.rds")
# Check data structure to ensure data types are correct
cat("Data Structure from RDS File:\n")## Data Structure from RDS File:
## Rows: 7,043
## Columns: 20
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
# Plot bar charts for churn distribution to check class imbalance
p1 <- ggplot(df, aes(x = Churn, fill = Churn)) +
geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
labs(title = "Churn Distribution") +
theme_minimal() +
theme(legend.position = "none")
# Plot histogram for total charges distribution to check skewness
p2 <- ggplot(df, aes(x = TotalCharges)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Total Charges Distribution") +
theme_minimal()
# Combine two plots side by side
p1 + p2
The churn distribution plot on the left reveals a significant class imbalance. There are far more customers who stayed (“No” = 5,174) than churned (“Yes” = 1,869). This could potentially lead to the model being naturally biased toward predicting “No”.
The total charges distribution on the right is right-skewed: most customers have low total charges, with a long tail of high-value customers.
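As a quick numeric check of the imbalance, a minimal sketch on the cleaned data frame df:
# Class proportions; roughly 73% "No" vs 27% "Yes" given the counts above
prop.table(table(df$Churn))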
# Plot bar charts for churn by contract type
p3 <- ggplot(df, aes(x = Contract, fill = Churn)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("No" = "tomato", "Yes" = "seagreen")) +
labs(y = "Proportion", title = "Churn by Contract Type") +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
# Plot box plots for monthly charges vs churn
p4 <- ggplot(df, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
geom_boxplot() +
scale_fill_manual(values = c("No" = "tomato", "Yes" = "seagreen")) +
labs(title = "Monthly Charges vs Churn") +
theme_minimal() +
theme(legend.position = "none")
# Combine two plots side by side
p3 + p4
Contract type is likely the strongest categorical predictor in the dataset. The churn rate for month-to-month contracts is drastically higher than for one-year or two-year contracts; long-term contracts essentially "lock in" customers.
The boxplot reveals that customers who churn (“Yes” / Green) tend to have higher median monthly charges compared to those who stay (“No” / Red). This suggests price sensitivity is a factor in churning.
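To put numbers on the boxplot comparison, a short summary sketch using the same cleaned data:
# Median and mean monthly charges by churn status
df %>%
  group_by(Churn) %>%
  summarise(median_monthly = median(MonthlyCharges),
            mean_monthly   = mean(MonthlyCharges))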
# Plot correlational matrix to check for multicollinearity
# Isolate numeric columns first
num_cols <- df %>% select(where(is.numeric))
cor_matrix <- cor(num_cols)
corrplot(cor_matrix, method = "number", type = "upper",
title = "Correlation Matrix (Check Multicollinearity)",
         mar = c(0, 0, 2, 0))
There is a very strong positive correlation of 0.83 between tenure and TotalCharges. This indicates multicollinearity, as TotalCharges is essentially the accumulated product of tenure and MonthlyCharges.
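Since TotalCharges should approximate tenure times MonthlyCharges, this can be checked directly (a quick sketch; a value near 1 would confirm the relationship):
# Correlation between actual TotalCharges and the tenure * MonthlyCharges product
with(df, cor(TotalCharges, tenure * MonthlyCharges))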
# Plot for tenure vs total charges to visualize correlation
ggplot(df, aes(x = tenure, y = TotalCharges, color = Churn)) +
geom_point(alpha = 0.5) +
labs(title = "Tenure vs Total Charges") +
  theme_minimal()
The scatter plot visually confirms the correlation matrix: TotalCharges grows roughly linearly as tenure increases.
The classification task aims to predict customer churn (Yes/No) using the available customer demographics, service subscriptions, contract type, and billing-related variables from the Telco dataset. This is important because early identification of high-risk customers enables telecom providers to implement targeted retention actions and reduce revenue loss. Therefore, supervised classification models are trained and compared to determine the most effective approach for churn prediction.
We evaluated four models to predict customer churn from the available demographic, behavioral, and transaction data: Decision Tree, Random Forest, Logistic Regression, and Support Vector Machine (SVM).
get_metrics <- function(cm, model_name){
accuracy <- round(cm$overall["Accuracy"], 4)
p_value <- cm$overall["AccuracyPValue"]
sensitivity <- round(cm$byClass["Sensitivity"], 4)
specificity <- round(cm$byClass["Specificity"], 4)
precision <- round(cm$byClass["Pos Pred Value"], 4)
prevalence <- round(cm$byClass["Prevalence"], 4)
f1_score <- round(2 * (precision * sensitivity) / (precision + sensitivity), 3)
data.frame(
Model = model_name,
Accuracy = accuracy,
Sensitivity = sensitivity,
Specificity = specificity,
Precision = precision,
F1 = f1_score,
Prevalence = prevalence,
p_value = p_value
)
}
clean_vi <- function(model, model_name) {
vi <- varImp(model)$importance %>%
as.data.frame()
# Handle multiclass importance
if (ncol(vi) > 1) {
vi$Overall <- rowMeans(vi)
} else {
colnames(vi) <- "Overall"
}
vi %>%
mutate(
var = rownames(.),
model = model_name
) %>%
select(var, Overall, model) %>%
arrange(desc(Overall))
}
library(caret)
library(knitr)
library(tidytext)
telco<-readRDS("TelcoCustomerChurn_Cleaned.rds")
telco$Churn <- factor(telco$Churn, levels = c("No", "Yes"))
summary(telco)
## gender SeniorCitizen Partner Dependents tenure PhoneService
## Female:3488 No :5901 No :3641 No :4933 Min. : 0.00 No : 682
## Male :3555 Yes:1142 Yes:3402 Yes:2110 1st Qu.: 9.00 Yes:6361
## Median :29.00
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## No :4072 DSL :2421 No :5024 No :4614 No :4621
## Yes:2971 Fiber optic:3096 Yes:2019 Yes:2429 Yes:2422
## No :1526
##
##
##
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## No :4999 No :4336 No :4311 Month-to-month:3875 No :2872
## Yes:2044 Yes:2707 Yes:2732 One year :1473 Yes:4171
## Two year :1695
##
##
##
## PaymentMethod MonthlyCharges TotalCharges Churn
## Bank transfer (automatic):1544 Min. : 18.25 Min. : 0.0 No :5174
## Credit card (automatic) :1522 1st Qu.: 35.50 1st Qu.: 398.6 Yes:1869
## Electronic check :2365 Median : 70.35 Median :1394.5
## Mailed check :1612 Mean : 64.76 Mean :2279.7
## 3rd Qu.: 89.85 3rd Qu.:3786.6
## Max. :118.75 Max. :8684.8
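Two objects used below, train_control and matrix2df, are not defined in the chunks shown. A minimal reconstruction, assuming 5-fold cross-validation (mirroring the regression section) and a simple conversion of caret's confusion-matrix table for plotting:
# Assumed resampling scheme for caret::train
train_control <- trainControl(method = "cv", number = 5)
# Assumed helper: flatten a confusionMatrix into a data frame with a Model column
# (as.data.frame(cm$table) yields Prediction, Reference, and Freq columns)
matrix2df <- function(cm, model_name) {
  as.data.frame(cm$table) %>% mutate(Model = model_name)
}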
split<- 0.80
#partition the data
train_index <- createDataPartition(telco$Churn, p = split, list = FALSE)#define training dataset
data_train <- telco[train_index, ]
#define testing dataset
data_test <- telco[-train_index, ]data_model <- train(Churn ~ ., data = data_train, method = "rpart", trControl = train_control,tuneLength = 5)
prediction <- predict(data_model, data_test)data_model <- train(Churn ~ ., data = data_train, method = "rf", trControl = train_control,tuneLength = 5)
prediction <- predict(data_model, data_test)
churn_rf_performance <- confusionMatrix(prediction, data_test$Churn)
rf_vi <- clean_vi(data_model, "Random Forest")data_model <- train(Churn ~ ., data = data_train, method = "glm", family = "binomial", trControl = train_control,tuneLength = 5)
prediction <- predict(data_model, data_test)
churn_lr_performance <- confusionMatrix(prediction, data_test$Churn)
lr_vi <- clean_vi(data_model, "Logistic Regression")data_model <- train(Churn ~ ., data = data_train, method = "svmRadial", preProcess = c("center", "scale"), trControl = train_control,tuneLength = 5)
prediction <- predict(data_model, data_test)
churn_svm_performance <- confusionMatrix(prediction, data_test$Churn)
svm_vi <- clean_vi(data_model, "Support Vector Machine")dt_df <- matrix2df(churn_dt_performance, "Decision Tree")
rf_df <- matrix2df(churn_rf_performance, "Random Forest")
lr_df <- matrix2df(churn_lr_performance, "Logistic Regression")
svm_df <- matrix2df(churn_svm_performance, "SVM")
cm_combined <- bind_rows(dt_df, rf_df, lr_df, svm_df)
ggplot(cm_combined, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 5) +
scale_fill_gradient(low = "lightblue", high = "steelblue") +
facet_wrap(~ Model) +
labs(title = "Confusion Matrices: All Models") +
theme_minimal() +
  theme(strip.text = element_text(size = 12))
dt_metrics <- get_metrics(churn_dt_performance, "Decision Tree")
rf_metrics <- get_metrics(churn_rf_performance, "Random Forest")
lr_metrics <- get_metrics(churn_lr_performance, "Logistic Regression")
svm_metrics <- get_metrics(churn_svm_performance, "SVM")
# Combine into one table
metrics_table <- bind_rows(dt_metrics, rf_metrics, lr_metrics, svm_metrics)
kable(metrics_table, row.names = FALSE, format = "pipe")
| Model | Accuracy | Sensitivity | Specificity | Precision | F1 | Prevalence | p_value |
|---|---|---|---|---|---|---|---|
| Decision Tree | 0.7889 | 0.9246 | 0.4129 | 0.8136 | 0.866 | 0.7349 | 1.5e-06 |
| Random Forest | 0.8060 | 0.9265 | 0.4718 | 0.8294 | 0.875 | 0.7349 | 0.0e+00 |
| Logistic Regression | 0.8074 | 0.8994 | 0.5523 | 0.8478 | 0.873 | 0.7349 | 0.0e+00 |
| SVM | 0.8031 | 0.9178 | 0.4853 | 0.8317 | 0.873 | 0.7349 | 0.0e+00 |
The confusion matrices summarize the churn predictions of the four models by comparing predicted and actual outcomes. Logistic Regression has the highest accuracy (80.74%), closely followed by Random Forest (80.60%) and SVM (80.31%); Decision Tree is slightly lower at 78.89%. Note that caret treats "No" (the majority class, prevalence 73.49% for all models) as the positive class here, so sensitivity measures how well a model identifies customers who stay, while specificity measures how well it detects actual churners. Random Forest has the highest sensitivity, with Decision Tree slightly lower, followed by SVM and lastly Logistic Regression. Logistic Regression has the highest specificity (0.5523), meaning it catches the most true churners, while Decision Tree has the lowest (0.4129). Logistic Regression also leads on precision (0.8478), with Decision Tree again the lowest (0.8136). Random Forest, Logistic Regression, and SVM all perform well on F1-score, with Random Forest the highest. Finally, the p-values indicate that all four models perform significantly better than the no-information rate.

Among the models, Random Forest achieved the most balanced performance across precision and recall with competitive accuracy, indicating its robustness for predicting customer churn. Logistic Regression provided the most interpretable results and the best detection of actual churners. Decision Trees were prone to overfitting, while SVM performed competitively but required feature scaling.
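The combined importance table used below is not shown being built; presumably it stacks the per-model tables returned by clean_vi():
# Assumed combination of the per-model variable-importance tables
vi_combined <- bind_rows(dt_vi, rf_vi, lr_vi, svm_vi)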
vi_top10 <- vi_combined %>%
group_by(model) %>%
slice_max(Overall, n = 10) %>%
  mutate(var = reorder_within(var, Overall, model))
ggplot(vi_top10, aes(x = var, y = Overall)) +
geom_col() +
coord_flip() +
facet_wrap(~ model, scales = "free_y") +
scale_x_reordered() +
labs(x = "Variable", y = "Importance")The variable importance plots reveal both consistent and model-specific drivers of customer churn across the four classification models. Tenure emerges as the most influential predictor in all models, indicating that customers with shorter service duration are significantly more likely to churn. This suggests customer loyalty strengthens over time, making early-stage customers the most vulnerable group. Contract type (one-year and two-year contracts) also plays a critical role, particularly in Logistic Regression and SVM, where longer contracts are associated with lower churn risk, highlighting the stabilizing effect of long-term commitments. MonthlyCharges and TotalCharges are consistently important across Random Forest and SVM, suggesting that higher financial burden increases churn probability. Service-related features such as InternetService (Fiber optic), OnlineSecurity, and TechSupport appear prominently in tree-based models, indicating that service quality and perceived value strongly influence customer retention. While Logistic Regression emphasizes linear financial and contractual effects, tree-based models capture more complex interactions among service features. Overall, despite differences in ranking, all models consistently identify tenure, contract duration, and pricing variables as the key determinants of churn, reinforcing the robustness of these predictors across different classification approaches.
The regression task aims to predict customers’ MonthlyCharges based on their subscribed services, contract characteristics, and demographic attributes. Understanding the factors that influence monthly billing amounts is important for telecommunications companies to support pricing optimisation, revenue forecasting, and service bundle design. Supervised regression models are developed and evaluated to identify patterns that explain variations in customer charges.
Multiple regression models, including Linear Regression, Random Forest Regression, and Partial Least Squares (PLS), were implemented to capture both linear and non-linear relationships between customer attributes and monthly charges while addressing potential multicollinearity among predictors.
library(randomForest)
library(pls)
str(telco)
## 'data.frame': 7043 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
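The regression data frame telco_reg is referenced below without its construction being shown. Given the 18-column training set (17 predictors plus MonthlyCharges) and the 21 centered and scaled dummy variables reported by PLS, it is presumably the cleaned data with the classification target and the collinear TotalCharges removed; a sketch under that assumption:
# Assumed construction: drop Churn (the classification target) and TotalCharges
# (collinear with tenure, as shown in the EDA correlation matrix)
telco_reg <- telco %>% select(-Churn, -TotalCharges)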
set.seed(42)
split <- 0.80
train_idx <- createDataPartition(telco_reg$MonthlyCharges, p = split, list = FALSE)
train_data <- telco_reg[train_idx, ]
test_data <- telco_reg[-train_idx, ]
dim(train_data)
## [1] 5636 18
dim(test_data)
## [1] 1407 18
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE, allowParallel = TRUE)
# Linear Regression (LR)
fit_lr <- train(
MonthlyCharges ~ .,
data = train_data,
method = "lm",
metric = "RMSE",
trControl = ctrl
)
# Random Forest (RF)
fit_rf <- train(
MonthlyCharges ~ .,
data = train_data,
method = "rf",
tuneLength = 5,
metric = "RMSE",
trControl = ctrl
)
# Partial Least Squares (PLS)
fit_pls <- train(
MonthlyCharges ~ .,
data = train_data,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 10,
metric = "RMSE",
trControl = ctrl
)
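The cross-validation summaries that follow are presumably printed by evaluating each fitted model object:
fit_lr
fit_rf
fit_pls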
## Linear Regression
##
## 5636 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4509, 4509, 4508, 4510, 4508
## Resampling results:
##
## RMSE Rsquared MAE
## 1.024385 0.9988405 0.7832199
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Random Forest
##
## 5636 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4509, 4508, 4510, 4508, 4509
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.530269 0.9833686 4.483291
## 6 1.627212 0.9971904 1.193221
## 11 1.390421 0.9978638 1.022253
## 16 1.366523 0.9979321 1.002964
## 21 1.370940 0.9979185 1.002905
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 16.
## Partial Least Squares
##
## 5636 samples
## 17 predictor
##
## Pre-processing: centered (21), scaled (21)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4509, 4510, 4508, 4508, 4509
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 9.323039 0.9039015 7.3368494
## 2 4.959831 0.9727608 4.0029247
## 3 2.360437 0.9938354 1.9147502
## 4 1.449075 0.9976762 1.1640424
## 5 1.153290 0.9985299 0.8995188
## 6 1.062011 0.9987538 0.8166797
## 7 1.031180 0.9988263 0.7901844
## 8 1.025804 0.9988387 0.7848406
## 9 1.025284 0.9988400 0.7844231
## 10 1.025209 0.9988403 0.7841488
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.
4. Test-Set Evaluation
# Test predictions
pred_lr <- predict(fit_lr, newdata = test_data)
pred_rf <- predict(fit_rf, newdata = test_data)
pred_pls <- predict(fit_pls, newdata = test_data)
# Metrics: RMSE, RSquared, MAE
m_lr <- postResample(pred = pred_lr, obs = test_data$MonthlyCharges)
m_rf <- postResample(pred = pred_rf, obs = test_data$MonthlyCharges)
m_pls <- postResample(pred = pred_pls, obs = test_data$MonthlyCharges)
results <- data.frame(
Model = c("Linear Regression (lm)", "Random Forest (rf)", "PLS (pls)"),
RMSE = c(m_lr["RMSE"], m_rf["RMSE"], m_pls["RMSE"]),
Rsquared = c(m_lr["Rsquared"], m_rf["Rsquared"], m_pls["Rsquared"]),
MAE = c(m_lr["MAE"], m_rf["MAE"], m_pls["MAE"])
)
results
Based on the test-set evaluation, Partial Least Squares is the best-performing model: it achieves the lowest prediction errors and the highest R-squared of the three models, with RMSE 1.043026, MAE 0.7860574, and R-squared 0.998815. Linear Regression is a very close second, which indicates the relationship between the predictors and MonthlyCharges is predominantly linear, while PLS gains a slight edge by using latent components that reduce the impact of correlated predictors and focus on the strongest shared signal. In contrast, Random Forest underperforms: its RMSE and MAE are materially higher, suggesting that its non-linear flexibility does not translate into better generalization on this dataset and may introduce extra variance when the underlying pattern is already well explained by linear structure.
5. Diagnostic Plots
obs <- test_data$MonthlyCharges
preds <- data.frame(LR = pred_lr, RF = pred_rf, PLS = pred_pls)
long <- cbind(Actual = obs, stack(preds))
colnames(long) <- c("Actual", "Predicted", "Model")
long$Residual <- long$Actual - long$Predicted
center_title <- theme(plot.title = element_text(hjust = 0.5))
# 1) Actual vs Predicted
ggplot(long, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.35, size = 1) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
geom_smooth(method = "loess", se = FALSE) +
facet_wrap(~ Model) +
coord_equal() +
labs(
title = "Actual vs Predicted (Test Set)",
x = "Actual MonthlyCharges",
y = "Predicted MonthlyCharges"
) +
theme_minimal() +
  center_title
## `geom_smooth()` using formula = 'y ~ x'
# 2) Residuals vs Predicted
ggplot(long, aes(x = Predicted, y = Residual)) +
geom_point(alpha = 0.35, size = 1) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(method = "loess", se = FALSE) +
facet_wrap(~ Model) +
labs(
title = "Residuals vs Predicted (Test Set)",
x = "Predicted MonthlyCharges",
y = "Residual (Actual - Predicted)"
) +
theme_minimal() +
  center_title
## `geom_smooth()` using formula = 'y ~ x'
# 3) Residual distribution
ggplot(long, aes(x = Residual)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.5) +
geom_density() +
facet_wrap(~ Model, scales = "free_y") +
labs(
title = "Residual Distribution (Test Set)",
x = "Residual",
y = "Density"
) +
theme_minimal() +
  center_title
In the Actual vs Predicted plots, LR and PLS points lie almost exactly on the 45-degree reference line across the full range of MonthlyCharges, and the smooth trend line stays close to the diagonal. This indicates excellent calibration: the models are not systematically overpredicting or underpredicting. RF shows the same overall trend but with wider scatter around the diagonal, especially in the mid-to-high charge range, which aligns with its higher RMSE and MAE.
In the Residuals vs Predicted plots, LR and PLS residuals are centered near zero and remain relatively tight, which suggests stable errors and good generalization. There is a mild increase in residual spread at higher predicted MonthlyCharges, indicating slight heteroscedasticity: error variance increases as charges rise. The smooth line also drifts slightly above zero at the high end, implying mild underprediction for higher-charge customers, since a positive residual means actual exceeds predicted. RF displays noticeably larger dispersion and more extreme residuals, confirming that its errors are less stable and that the model introduces additional variability without improving fit.
In the Residual Distribution plots, LR and PLS show symmetric distributions centered near zero, indicating low bias and consistently small errors. RF shows a wider distribution with a more pronounced right tail, suggesting more frequent and larger underpredictions, which matches its weaker error metrics.
Overall, the plots justify selecting PLS as the best model: it combines near-perfect calibration with the tightest residual spread and the most symmetric residual distribution.
imp_lr <- varImp(fit_lr, scale = TRUE)
imp_rf <- varImp(fit_rf, scale = TRUE)
imp_pls <- varImp(fit_pls, scale = TRUE)
plot(imp_lr, top = 10, main = "Top 10 Predictors - LR")
plot(imp_rf, top = 10, main = "Top 10 Predictors - RF")
plot(imp_pls, top = 10, main = "Top 10 Predictors - PLS")
Across all three models, InternetServiceFiber optic is the most influential predictor, with InternetServiceNo also ranking very highly. This is expected in a telecom pricing context because internet service type typically sets the base price tier: fiber optic plans are usually priced higher, while having no internet service implies a much lower monthly bill.
Variables such as StreamingTVYes, StreamingMoviesYes, PhoneServiceYes, and MultipleLinesYes appear repeatedly in the top 10. This indicates that once the base internet tier is set, bundled entertainment and phone features explain additional variation in monthly charges. These are the upsell components that add predictable increments to the bill.
OnlineBackupYes, OnlineSecurityYes, DeviceProtectionYes, and TechSupportYes also appear among the top predictors. These are optional add-on services that are expected to contribute to MonthlyCharges, but their impact is smaller than that of the base internet service tier and the major bundle features.
In the PLS plot, PaperlessBillingYes and PaymentMethodMailed check appear in the top 10. This does not mean these billing choices directly change MonthlyCharges; rather, they act as a signal of plan type, because customers who pay by mailed check often subscribe to different services than customers who use paperless billing or other payment methods. In PLS they are therefore captured within the same latent components as the plan and service variables, since these features tend to vary together. Overall, all three models agree that the internet service tier and bundled add-on services are the key drivers of MonthlyCharges.
This project applied R programming and data science techniques to analyse the Telco Customer Churn dataset with the aim of understanding customer behaviour and developing predictive models for business decision support. The dataset was systematically cleaned and prepared to address missing values, inconsistent data types, and redundant categories, ensuring that the data was suitable for reliable analysis and modelling.
For the classification task, multiple supervised learning models were implemented and compared to predict customer churn. Among the models evaluated, Random Forest demonstrated the most balanced performance, effectively identifying customers at higher risk of churn while maintaining overall predictive accuracy. Key predictors such as tenure, contract type, and monthly charges were consistently highlighted across models, reinforcing their importance in customer retention analysis.
In the regression task, models were developed to predict customers' monthly charges based on their subscribed services and contract attributes. Partial Least Squares achieved the best predictive performance, with Linear Regression a close second, indicating that the relationship between customer features and billing amounts is predominantly linear once correlated predictors are handled. Overall, the project demonstrates how a structured and reproducible data science workflow can provide actionable insights to support customer retention strategies and pricing decisions in the telecommunications industry.