Tan Foo Hou (23074766)
Shaik Mohammed Muzeeb (23097292)
Loh Ke Yi (23092758)
Loh Jing Wei (23118574)
Lim Sze Chie (s2157250)
In the competitive telecommunication industry, retaining customers is crucial for sustaining profitability and long-term growth. Customer churn, the phenomenon of customers leaving a service provider, poses a significant challenge, leading to revenue losses and increased acquisition costs. Understanding customer behavior, particularly their spending patterns and likelihood to churn, can enable companies to develop targeted strategies that enhance customer satisfaction and retention. This project leverages data-driven approaches to analyze and predict customers’ service usage and churn likelihood.
To classify customers likely to churn using classification models, facilitating the development of targeted retention strategies to minimize customer turnover.
To predict customer spending patterns using regression models, enabling the design of personalized services, benefits, packages or promotion tailored to individual customer needs.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.2
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(ROSE)
## Warning: package 'ROSE' was built under R version 4.4.2
## Loaded ROSE 0.0-4
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
setwd("C:\\Users\\user\\OneDrive\\Documents")
data <- read.csv("C:/Users/user/OneDrive/Documents/Telco-Customer-Churn.csv", stringsAsFactors = FALSE)
The dataset chosen for this project is Telco Customer Churn, originally uploaded to Kaggle in 2018. It comprises detailed customer data from a telecommunications company, with the goal of analyzing the factors affecting customer behaviour, service engagement, and churn.
The dataset consists of 7,043 observations and 21 columns. Each observation represents a unique customer.
This dataset is used for two separate models, each with a different target variable. For the classification model, the target variable in this dataset is Churn, a categorical variable indicating whether the customer has churned within the last month. The values are:
For the regression model, the target variable is TotalCharges, a variable representing the total amount charged to the customer.
The 19 independent variables in this dataset includes:
Demographic Information:
customerID, gender,
SeniorCitizen, Partner,
DependentsAccount Information:
tenure, Contract,
PaperlessBilling, PaymentMethod,
MonthlyChargesServices Subscribed:
PhoneService, MultipleLines,
InternetService, OnlineSecurity,
OnlineBackup, DeviceProtection,
TechSupport, StreamingTV,
StreamingMoviesstr(data)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
summary(data)
## customerID gender SeniorCitizen Partner
## Length:7043 Length:7043 Min. :0.0000 Length:7043
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1621
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Dependents tenure PhoneService MultipleLines
## Length:7043 Min. : 0.00 Length:7043 Length:7043
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7043 Length:7043 Min. : 18.25 Min. : 18.8
## Class :character Class :character 1st Qu.: 35.50 1st Qu.: 401.4
## Mode :character Mode :character Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
## Churn
## Length:7043
## Class :character
## Mode :character
##
##
##
##
Convert ‘TotalCharges’ column to numeric datatype
data$TotalCharges <- as.numeric(data$TotalCharges)
We performed a Chi-Square test on all categorical variables to evaluate their association with the target variable Churn.
# categorical variables
categorical_vars <- c("Contract", "PaymentMethod", "InternetService", "TechSupport", "StreamingTV",
"OnlineSecurity", "OnlineBackup", "DeviceProtection", "MultipleLines", "StreamingMovies",
"PaperlessBilling", "SeniorCitizen", "Partner", "gender", "Dependents", "PhoneService")
# Create an empty data frame to store the chi2 results
chi2_results <- data.frame(Var=character(), Chi_2=numeric(), P_Val=numeric(), stringsAsFactors=FALSE)
for (var in categorical_vars){
# Create a contingency table
contingency_table <- table(data[[var]], data$Churn)
# Perform the chi-square test
chi2_test <- chisq.test(contingency_table)
chi2_value <- chi2_test$statistic
p_value <- chi2_test$p.value
p_value <- ifelse(p_value < 0.0001, "<0.0001", round(p_value, 4))
# Store the results in the data frame
chi2_results <- rbind(chi2_results, data.frame(Var=var, Chi_2=chi2_value, P_Val=p_value))
}
# sort the results by chi-square value
chi2_results <- chi2_results[order(-chi2_results$Chi_2),]
print(chi2_results)
## Var Chi_2 P_Val
## X-squared Contract 1184.5965721 <0.0001
## X-squared5 OnlineSecurity 849.9989680 <0.0001
## X-squared3 TechSupport 828.1970685 <0.0001
## X-squared2 InternetService 732.3095897 <0.0001
## X-squared1 PaymentMethod 648.1423275 <0.0001
## X-squared6 OnlineBackup 601.8127901 <0.0001
## X-squared7 DeviceProtection 558.4193694 <0.0001
## X-squared9 StreamingMovies 375.6614793 <0.0001
## X-squared4 StreamingTV 374.2039433 <0.0001
## X-squared10 PaperlessBilling 258.2776491 <0.0001
## X-squared14 Dependents 189.1292494 <0.0001
## X-squared11 SeniorCitizen 159.4263004 <0.0001
## X-squared12 Partner 158.7333820 <0.0001
## X-squared8 MultipleLines 11.3304415 0.0035
## X-squared15 PhoneService 0.9150330 0.3388
## X-squared13 gender 0.4840829 0.4866
The chi-squared test was used to analyze the relationship between each categorical variable and the target variable, Churn. A higher chi-squared value indicates a stronger statistical association, but it does not show the direction or real-world impact of the relationship.
Most variables demonstrate a significant association with Churn at the 95% confidence level (p-value < 0.05). However, PhoneService, Gender, and MultipleLines have higher p-values, indicating no significant association.
While high chi-squared values suggest stronger relationships, visual analysis is essential to understand how different groups within each variable relate to Churn. This will help uncover the practical importance and direction of the associations.
num_data <- data %>% select(tenure, MonthlyCharges, TotalCharges)
# Compute correlation matrix
cor_matrix <- cor(num_data, use = "complete.obs")
print(cor_matrix)
## tenure MonthlyCharges TotalCharges
## tenure 1.0000000 0.2468618 0.8258805
## MonthlyCharges 0.2468618 1.0000000 0.6510648
## TotalCharges 0.8258805 0.6510648 1.0000000
corrplot::corrplot(cor_matrix, method = "circle", type = "lower", tl.col = "black")
The plot_var function visualizes a
categorical variable’s distribution and churn percentage in one combined
plot. It uses a bar chart to show category counts and
overlays a line chart to display the churn percentage
for each group, enabling a clear comparison of distribution and churn
trends.
# Write a function to plot the bar chart and line chart for categorical variables
plot_var <- function(df, var, target="Churn"){
# Group by the variable and target variable
grouped_by_var <- df %>%
group_by(!!sym(var), !!sym(target)) %>%
summarise(n = n(),.groups = 'keep')
# Calculate the churn rate
churn_rate <- df %>%
group_by(!!sym(var), !!sym(target)) %>%
summarise(n = n(),.groups = 'drop_last') %>%
mutate(Percentage_of_Churn = n / sum(n) * 100) %>%
filter(!!sym(target) == "Yes")
# Calculate the maximum number of customers for the grouped variable for the y-axis limit
max_customers <- max(grouped_by_var$n)
# Plot the bar chart
p <- ggplot(data=grouped_by_var, aes(x=!!sym(var),y=n,fill=!!sym(target))) +
geom_bar(stat="identity", position="dodge") +
geom_text(data=grouped_by_var, aes(label=n, y=n), vjust=-0.5, position=position_dodge(width=0.9)) +
theme_minimal()
# Plot the line chart
p <- p + geom_line(data=churn_rate, aes(y=Percentage_of_Churn/100 * max_customers, group=!!sym(target)), color="black", linewidth=1) +
geom_point(data=churn_rate, aes(y=Percentage_of_Churn/100 * max_customers, color=!!sym(target)), color="black", size=3) +
geom_text(data=churn_rate, aes(y=Percentage_of_Churn/100 * max_customers, label=round(Percentage_of_Churn,2), vjust=-0.5)) +
scale_y_continuous(sec.axis = sec_axis(~. * 100/max_customers, name="Churn Rate (%)")) +
labs(title=paste(var, "and Its Influence on Churn Rates"), x=var, y="Number of Customers") +
theme(legend.position="bottom", plot.title = element_text(hjust = 0.5))
return(p)
}
data$tenure_bin <- cut(data$tenure, breaks=seq(0,72,by=12), right=TRUE,include.lowest=TRUE, labels=c("0-12", "13-24", "25-36", "37-48", "49-60", "61-72"))
print(plot_var(data, "tenure_bin"))
Customers with shorter tenure (within 1 year) have the highest churn rate of 47.44%. The churn rate consistently decreases as tenure increases, with customers having over 10 years of tenure exhibiting the lowest churn rate (6.61%). This highlights that longer-tenured customers are less likely to churn.
data$TotalCharges_bin <- cut(data$TotalCharges, breaks=seq(18.8,8684.8,by=1083.25), include.lowest=TRUE, labels=c("[18.8-1102.05]", "(1102.05-2185.3]", "(2185.3-3268.55]", "(3268.55-4351.8]", "(4351.8-5435.05]", "(5435.05-6518.3]", "(6518.3-7601.55]", "(7601.55-8684.8]"))
print(plot_var(data, "TotalCharges_bin") + theme(axis.text.x = element_text(angle = 45, hjust = 0.8)))
The highest churn rate (36.32%) is observed among customers with the lowest total charges, in the range of $18.8–$1,102.05, typically linked to shorter tenure. As total charges increase—reflecting longer tenure with the company—churn rates steadily decrease, reaching a low of 7.78% for the highest total charges in the range of $7,601.55–$8,684.8.
data$MonthlyCharges_bin <- cut(data$MonthlyCharges, breaks=seq(18.25,118.75,by=16.75), include.lowest=TRUE, labels=c("[18.25-35]", "(35-51.75]", "(51.75-68.5]", "(68.5-85.25]", "(85.25-102]", "[102-118.75]"))
print(plot_var(data, "MonthlyCharges_bin"))
Customers with monthly charges of $68.5–$85.25 have the highest churn rate (38%), while those with the lowest charges of $18.25–$35 have the lowest churn rate (10.88%). Churn decreases for charges above $102.
ggplot(data, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
geom_boxplot() +
ggtitle("Monthly Charges by Churn") +
xlab("Churn") +
ylab("Monthly Charges") +
theme_minimal()
The boxplot compares the distribution of monthly charges between customers who churned (“Yes”) and those who did not churn (“No”).
Customers who churned generally have higher monthly charges, with a median monthly charge above that of non-churning customers
The interquartile range (IQR) for both groups overlaps, but non-churned customers show a slightly wider range, suggesting more variability in their monthly charges
Higher monthly charges may be associated with a higher likelihood of churn, it will be investigated further in later section.
# Remove the binned columns after plotting the graphs
data <- data %>% select(-tenure_bin)
data <- data %>% select(-MonthlyCharges_bin)
data <- data %>% select(-TotalCharges_bin)
print(plot_var(data, "Contract"))
Month-to-month contracts have the highest churn rate (42.71%), showing customers are more likely to leave when not bound by longer agreements. In contrast, one-year and two-year contracts show lower churn rates at 11.27% and 2.83%, respectively, indicating longer contracts improve retention.
print(plot_var(data, "InternetService"))
The type of internet service also greatly affects churn rates. Customers using fiber optic services have the highest churn rate (41.89%), followed by DSL users (18.96%). Customers without internet service have the lowest churn rate (7.4%).
print(plot_var(data, "OnlineSecurity"))
print(plot_var(data, "TechSupport", "Churn"))
print(plot_var(data, "OnlineBackup"))
print(plot_var(data, "DeviceProtection"))
These plots of additional services, including online security, tech support, online backup, and device protection, reveals a common trend. Customers who do not subscribe to these services exhibit a churn rate ranging from 39% to 42%. In contrast, customers who do subscribe show significantly lower churn rates, ranging from 14% to 22%. Those categorized as “No Internet Service” consistently have the lowest churn rate of 7.4%. This highlights the importance of offering and encouraging the use of additional services to enhance customer retention.
print(plot_var(data, "PaymentMethod"))
Electronic check users have the highest churn rate (45.29%), while bank transfer, credit card, and mailed check users have lower rates (15%-19%). This suggests electronic check users may face more difficulties or inconveniences.
print(plot_var(data, "StreamingMovies"))
print(plot_var(data, "StreamingTV"))
Streaming services, including streaming movies and streaming TV, exhibit similar patterns. Customers who do not subscribe to these services have higher churn rates, approximately 33% to 34%, while those who subscribe show lower churn rates, around 29% to 31%. Suggesting that attractive streaming options might be an effective approach to lowering churn.
print(plot_var(data, "PaperlessBilling"))
Customers using paperless billing have a higher churn rate (33.57%) compared to those using traditional billing (16.33%). This suggests that traditional billing may be more comfortable for some customers, acting as a tangible reminder and fostering stronger engagement, which helps reduce churn.
print(plot_var(data, "Dependents"))
Customers without dependents exhibit a churn rate of 31.28%, significantly higher than the 15.45% churn rate among customers with dependents. Dependents may act as a stabilizing factor, encouraging customers to stay.
print(plot_var(data, "SeniorCitizen"))
Senior citizens have a higher churn rate of 41.68% compared to non-senior citizens at 23.61%. This indicates that senior citizens may face challenges or lack engagement with the services offered.
print(plot_var(data, "Partner"))
Customers without a partner have a churn rate of 32.96%, compared to 19.66% for those with a partner. Having a partner may contribute to a more stable service usage pattern.
Count NA values in each column
na_count <- sapply(data, function(x) sum(is.na(x)))
print(na_count)
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
Upon inspection, the column
TotalCharges has 11 missing
values (NA)
To understand the nature of these missing values, we inspected the
rows where TotalCharges is
NA:
data[is.na(data$TotalCharges), ]
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 489 4472-LVYGI Female 0 Yes Yes 0 No
## 754 3115-CZMZD Male 0 No Yes 0 Yes
## 937 5709-LVOEQ Female 0 Yes Yes 0 Yes
## 1083 4367-NUYAO Male 0 Yes Yes 0 Yes
## 1341 1371-DWPAZ Female 0 Yes Yes 0 No
## 3332 7644-OMVMY Male 0 Yes Yes 0 Yes
## 3827 3213-VVOLG Male 0 Yes Yes 0 Yes
## 4381 2520-SGTTA Female 0 Yes Yes 0 Yes
## 5219 2923-ARZLG Male 0 Yes Yes 0 Yes
## 6671 4075-WKNIU Female 0 Yes Yes 0 Yes
## 6755 2775-SEFEE Male 0 No Yes 0 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 489 No phone service DSL Yes No
## 754 No No No internet service No internet service
## 937 No DSL Yes Yes
## 1083 Yes No No internet service No internet service
## 1341 No phone service DSL Yes Yes
## 3332 No No No internet service No internet service
## 3827 Yes No No internet service No internet service
## 4381 No No No internet service No internet service
## 5219 No No No internet service No internet service
## 6671 Yes DSL No Yes
## 6755 Yes DSL Yes Yes
## DeviceProtection TechSupport StreamingTV
## 489 Yes Yes Yes
## 754 No internet service No internet service No internet service
## 937 Yes No Yes
## 1083 No internet service No internet service No internet service
## 1341 Yes Yes Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671 Yes Yes Yes
## 6755 No Yes No
## StreamingMovies Contract PaperlessBilling PaymentMethod
## 489 No Two year Yes Bank transfer (automatic)
## 754 No internet service Two year No Mailed check
## 937 Yes Two year No Mailed check
## 1083 No internet service Two year No Mailed check
## 1341 No Two year No Credit card (automatic)
## 3332 No internet service Two year No Mailed check
## 3827 No internet service Two year No Mailed check
## 4381 No internet service Two year No Mailed check
## 5219 No internet service One year Yes Mailed check
## 6671 No Two year No Mailed check
## 6755 No Two year Yes Bank transfer (automatic)
## MonthlyCharges TotalCharges Churn
## 489 52.55 NA No
## 754 20.25 NA No
## 937 80.85 NA No
## 1083 25.75 NA No
## 1341 56.05 NA No
## 3332 19.85 NA No
## 3827 25.35 NA No
## 4381 20.00 NA No
## 5219 19.70 NA No
## 6671 73.35 NA No
## 6755 61.90 NA No
Upon inspection, the 11 missing values
(NA) in the TotalCharges column correspond to
rows where the tenure is 0.
This indicates that these customers are new and have not incurred any
charges yet.
To address this, replace the NA values in the
TotalCharges column with 0.
data$TotalCharges[is.na(data$TotalCharges)] <- 0
After handling missing values, the number of NA values
in each column is calculated
na_count <- sapply(data, function(x) sum(is.na(x)))
print(na_count)
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 0
## Churn
## 0
customerID column is drop as it is redundant and does not contribute to the analysis
data <- data %>% select(-customerID)
For machine learning and analysis, categorical variables need to be converted to factors to ensure proper handling during modeling. The following columns in the dataset are identified as categorical and are converted into factors:
# Convert categorical columns to factors
data$gender <- as.factor(data$gender)
data$SeniorCitizen <- as.factor(data$SeniorCitizen)
data$Partner <- as.factor(data$Partner)
data$Dependents <- as.factor(data$Dependents)
data$PhoneService <- as.factor(data$PhoneService)
data$MultipleLines <- as.factor(data$MultipleLines)
data$InternetService <- as.factor(data$InternetService)
data$OnlineSecurity <- as.factor(data$OnlineSecurity)
data$OnlineBackup <- as.factor(data$OnlineBackup)
data$DeviceProtection <- as.factor(data$DeviceProtection)
data$TechSupport <- as.factor(data$TechSupport)
data$StreamingTV <- as.factor(data$StreamingTV)
data$StreamingMovies <- as.factor(data$StreamingMovies)
data$Contract <- as.factor(data$Contract)
data$PaperlessBilling <- as.factor(data$PaperlessBilling)
data$PaymentMethod <- as.factor(data$PaymentMethod)
data$Churn <- as.factor(data$Churn)
Outliers in the dataset can have a significant impact on the results
of machine learning models. To identify potential outliers,
summary statistics such as mean, minimum, maximum,
median, standard deviation, and quartiles (Q1 and
Q3) are calculated for each numerical column:
# Identify numerical columns in 'data'
numerical_attributes <- sapply(data, is.numeric)
numerical_data <- data[, numerical_attributes]
# Calculate summary statistics for numerical data
numerical_distribution <- sapply(numerical_data, function(x) {
c(mean = round(mean(x, na.rm = TRUE), digits = 2),
min = min(x, na.rm = TRUE),
max = max(x, na.rm = TRUE),
median = median(x, na.rm = TRUE),
sd = round(sd(x, na.rm = TRUE), digits = 2),
q1 = quantile(x, probs = 0.25, na.rm = TRUE),
q3 = quantile(x, probs = 0.75, na.rm = TRUE))
}) %>% t() %>% as.data.frame()
numerical_distribution
## mean min max median sd q1.25% q3.75%
## tenure 32.37 0.00 72.00 29.00 24.56 9.00 55.00
## MonthlyCharges 64.76 18.25 118.75 70.35 30.09 35.50 89.85
## TotalCharges 2279.73 0.00 8684.80 1394.55 2266.79 398.55 3786.60
Upon inspecting the minimum and maximum values for each numerical column, no unrealistic values were detected.
However, to ensure that subtle outliers are accounted for, we use the Interquartile Range (IQR) method to detect and handle potential outliers.
# Function to check for outliers using IQR method
check_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
sum(x < lower_bound | x > upper_bound, na.rm = TRUE) > 0
}
# Identify numerical columns
numerical_attributes <- sapply(data, is.numeric)
numerical_data <- data[, numerical_attributes]
# Check for outliers in each numerical column
outlier_check <- sapply(numerical_data, check_outliers)
# Display results
outlier_summary <- data.frame(
Column = names(outlier_check),
Has_Outliers = ifelse(outlier_check, "Yes", "No")
)
# Print the summary table
print(outlier_summary)
## Column Has_Outliers
## tenure tenure No
## MonthlyCharges MonthlyCharges No
## TotalCharges TotalCharges No
As shown on the above table, no outliers were detected.
Boxplot for Tenure
boxplot(data$tenure, main = "Tenure Outliers", ylab = "Tenure")
Boxplot for MonthlyChargers
boxplot(data$MonthlyCharges, main = "Monthly Charges Outliers", ylab = "Monthly Charges")
Boxplot for TotalCharges
boxplot(data$TotalCharges, main = "Total Charges Outliers", ylab = "Total Charges")
Z-score normalization - scale()
function was applied to numerical columns (tenure,
MonthlyCharges, and TotalCharges),
transforming them to have a mean of 0 and a standard deviation of 1.
This ensures that all numerical variables contribute equally to the model, avoiding biases caused by differences in scale.
# List of numerical columns to normalize
numerical_cols <- c("tenure", "MonthlyCharges", "TotalCharges")
# Normalizing the numerical columns
data[numerical_cols] <- scale(data[numerical_cols])
# Preview the normalized data
head(data)
## gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 Female 0 Yes No -1.27735389 No
## 2 Male 0 No No 0.06632271 Yes
## 3 Male 0 No No -1.23663642 Yes
## 4 Male 0 No No 0.51421491 No
## 5 Female 0 No No -1.23663642 Yes
## 6 Female 0 No No -0.99233158 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service DSL No Yes No
## 2 No DSL Yes No Yes
## 3 No DSL Yes Yes No
## 4 No phone service DSL Yes No Yes
## 5 No Fiber optic No No No
## 6 Yes Fiber optic No No Yes
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No No Month-to-month Yes
## 2 No No No One year No
## 3 No No No Month-to-month Yes
## 4 Yes No No One year No
## 5 No No No Month-to-month Yes
## 6 No Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 1 Electronic check -1.1602405 -0.9925401 No
## 2 Mailed check -0.2596105 -0.1721525 No
## 3 Mailed check -0.3626346 -0.9579979 Yes
## 4 Bank transfer (automatic) -0.7464825 -0.1936586 No
## 5 Electronic check 0.1973512 -0.9388078 Yes
## 6 Electronic check 1.1594634 -0.6437435 Yes
ggplot(data, aes(x = Churn, fill = Churn)) +
geom_bar() +
ggtitle("Churn Distribution") +
xlab("Churn") +
ylab("Count") +
theme_minimal()
The chart categorized customers into two groups: those who did not churn (“No”) and those who did churn (“Yes”).
This imbalance in the data highlights that the dataset is skewed towards non-churners, model may become biased towards predicting non-churners, leading to poor performance in identifying actual churners
To address this imbalance, oversampling was performed using the ROSE package, which increased the number of churned (“Yes”) customers to balance the dataset and ensure fairer representation for predictive model training.
# Oversampling
data <- ovun.sample(Churn ~ ., data = data, method = "over")$data
ggplot(data, aes(x = Churn, fill = Churn)) +
geom_bar() +
ggtitle("Churn Distribution After Resampling") +
xlab("Churn") +
ylab("Count") +
theme_minimal()
table(data$Churn)
##
## No Yes
## 5174 5276
After Oversampling is done, there is almost equal numbers of Churners and Non-Churners
Save the cleaned dataset for further analysis or modeling
write.csv(data, "cleaned_telco_data.csv", row.names = FALSE)
Divide the dataset into training and testing sets using an 70-30 split.
set.seed(123)
train_index <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
# Load required libraries
library(ada)
## Warning: package 'ada' was built under R version 4.4.2
## Loading required package: rpart
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
library(pROC)
# Train AdaBoost model
set.seed(123)
adaboost_model <- ada(Churn ~ ., data = train_data, iter = 50, bag.frac = 0.5)
# Make predictions on the test data
adaboost_model_pred <- predict(adaboost_model, newdata = test_data)
# Evaluate the model using confusion matrix
adaboost_model_confusion_mat <- confusionMatrix(adaboost_model_pred, test_data$Churn)
print(adaboost_model_confusion_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1121 304
## Yes 397 1314
##
## Accuracy : 0.7765
## 95% CI : (0.7615, 0.7909)
## No Information Rate : 0.5159
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5516
##
## Mcnemar's Test P-Value : 0.0005112
##
## Sensitivity : 0.7385
## Specificity : 0.8121
## Pos Pred Value : 0.7867
## Neg Pred Value : 0.7680
## Prevalence : 0.4841
## Detection Rate : 0.3575
## Detection Prevalence : 0.4544
## Balanced Accuracy : 0.7753
##
## 'Positive' Class : No
##
# Extract performance metrics
adaboost_precision <- adaboost_model_confusion_mat$byClass["Precision"]
adaboost_recall <- adaboost_model_confusion_mat$byClass["Recall"]
adaboost_f1_score <- 2 * ((adaboost_precision * adaboost_recall) / (adaboost_precision + adaboost_recall))
# Compute probabilities manually (since ada::predict does not return probabilities directly)
# Obtain the predicted class proportions from the training data
yes_prob <- mean(train_data$Churn == "Yes")
no_prob <- 1 - yes_prob
# Assign probabilities to predictions
adaboost_probs <- ifelse(adaboost_model_pred == "Yes", yes_prob, no_prob)
# Convert the actual test labels to binary for ROC
binary_churn <- ifelse(test_data$Churn == "Yes", 1, 0)
# Compute ROC and AUC
adaboost_roc_curve <- roc(binary_churn, adaboost_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
adaboost_auc <- auc(adaboost_roc_curve)
# Print performance metrics and AUC
cat("AdaBoost - Precision:", adaboost_precision, "\n")
## AdaBoost - Precision: 0.7866667
cat("AdaBoost - Recall:", adaboost_recall, "\n")
## AdaBoost - Recall: 0.7384717
cat("AdaBoost - F1-Score:", adaboost_f1_score, "\n")
## AdaBoost - F1-Score: 0.7618077
cat("AdaBoost - ROC-AUC:", adaboost_auc, "\n")
## AdaBoost - ROC-AUC: 0.7752927
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.2
library(caret)
# Train Naive Bayes model
nb_model <- naiveBayes(Churn~., data = train_data)
# Make predictions on the test data
nb_model_pred <- predict(nb_model, newdata = test_data)
# Evaluate the model using confusion matrix
library(caret)
nb_model_confusion_mat <- confusionMatrix(nb_model_pred, test_data$Churn)
nb_model_confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 981 269
## Yes 537 1349
##
## Accuracy : 0.743
## 95% CI : (0.7273, 0.7582)
## No Information Rate : 0.5159
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4826
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6462
## Specificity : 0.8337
## Pos Pred Value : 0.7848
## Neg Pred Value : 0.7153
## Prevalence : 0.4841
## Detection Rate : 0.3128
## Detection Prevalence : 0.3986
## Balanced Accuracy : 0.7400
##
## 'Positive' Class : No
##
# Extract performance metrics
nb_precision <- nb_model_confusion_mat$byClass["Precision"]
nb_recall <- nb_model_confusion_mat$byClass["Recall"]
nb_f1_score <- 2 * ((nb_precision * nb_recall) / (nb_precision + nb_recall))
# Compute ROC and AUC
nb_probs <- predict(nb_model, newdata = test_data, type = "raw")
nb_roc_curve <- roc(test_data$Churn, nb_probs[, "Yes"])
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
nb_auc <- auc(nb_roc_curve)
# Print results
cat("Naive Bayes - Precision:", nb_precision, "\n")
## Naive Bayes - Precision: 0.7848
cat("Naive Bayes - Recall:", nb_recall, "\n")
## Naive Bayes - Recall: 0.6462451
cat("Naive Bayes - F1-Score:", nb_f1_score, "\n")
## Naive Bayes - F1-Score: 0.708815
cat("Naive Bayes - ROC-AUC:", nb_auc, "\n")
## Naive Bayes - ROC-AUC: 0.8162446
library(caret)
control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Train KNN model
knn_model <- train(Churn~., data = train_data, method = "knn",
trControl = control)
knn_model
## k-Nearest Neighbors
##
## 7314 samples
## 19 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 6583, 6582, 6582, 6584, 6582, 6583, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7575915 0.5151534
## 7 0.7548570 0.5096909
## 9 0.7559475 0.5118746
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
# Make predictions on the test data
knn_model_pred <- predict(knn_model, newdata = test_data)
# Evaluate the model using confusion matrix
knn_model_confusion_mat <- confusionMatrix(knn_model_pred, test_data$Churn)
knn_model_confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 978 233
## Yes 540 1385
##
## Accuracy : 0.7535
## 95% CI : (0.738, 0.7685)
## No Information Rate : 0.5159
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5034
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6443
## Specificity : 0.8560
## Pos Pred Value : 0.8076
## Neg Pred Value : 0.7195
## Prevalence : 0.4841
## Detection Rate : 0.3119
## Detection Prevalence : 0.3862
## Balanced Accuracy : 0.7501
##
## 'Positive' Class : No
##
# Extract performance metrics
knn_precision <- knn_model_confusion_mat$byClass["Precision"]
knn_recall <- knn_model_confusion_mat$byClass["Recall"]
knn_f1_score <- 2 * ((knn_precision * knn_recall) / (knn_precision + knn_recall))
# Compute ROC and AUC
knn_probs <- predict(knn_model, newdata = test_data, type = "prob")
knn_roc_curve <- roc(test_data$Churn, knn_probs$Yes)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
knn_auc <- auc(knn_roc_curve)
# Print results
cat("KNN - Precision:", knn_precision, "\n")
## KNN - Precision: 0.807597
cat("KNN - Recall:", knn_recall, "\n")
## KNN - Recall: 0.6442688
cat("KNN - F1-Score:", knn_f1_score, "\n")
## KNN - F1-Score: 0.7167461
cat("KNN - ROC-AUC:", knn_auc, "\n")
## KNN - ROC-AUC: 0.8383042
library(caret)
library(pROC)
# Define train control with probability output and cross-validation
control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE, # Enables probability predictions
summaryFunction = twoClassSummary # Enables AUC calculation
)
# Train the logistic regression model
logistic_model <- train(
Churn ~ .,
data = train_data,
method = "glm",
family = "binomial",
trControl = control,
metric = "ROC" # Optimize based on ROC
)
# Display the trained model
print(logistic_model)
## Generalized Linear Model
##
## 7314 samples
## 19 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 6583, 6582, 6583, 6583, 6583, 6583, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8430136 0.7232001 0.8004439
# Make predictions on the test data
logistic_model_pred <- predict(logistic_model, newdata = test_data)
logistic_model_probs <- predict(logistic_model, newdata = test_data, type = "prob")
# Evaluate the model using confusion matrix
logistic_model_confusion_mat <- confusionMatrix(logistic_model_pred, test_data$Churn, positive = "Yes")
print(logistic_model_confusion_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1094 309
## Yes 424 1309
##
## Accuracy : 0.7663
## 95% CI : (0.751, 0.781)
## No Information Rate : 0.5159
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.531
##
## Mcnemar's Test P-Value : 2.546e-05
##
## Sensitivity : 0.8090
## Specificity : 0.7207
## Pos Pred Value : 0.7553
## Neg Pred Value : 0.7798
## Prevalence : 0.5159
## Detection Rate : 0.4174
## Detection Prevalence : 0.5526
## Balanced Accuracy : 0.7649
##
## 'Positive' Class : Yes
##
# Extract performance metrics: Precision, Recall, F1-Score
logistic_precision <- logistic_model_confusion_mat$byClass["Precision"]
logistic_recall <- logistic_model_confusion_mat$byClass["Recall"]
logistic_f1_score <- 2 * ((logistic_precision * logistic_recall) / (logistic_precision + logistic_recall))
cat("Precision:", logistic_precision, "\n")
## Precision: 0.7553376
cat("Recall:", logistic_recall, "\n")
## Recall: 0.8090235
cat("F1-Score:", logistic_f1_score, "\n")
## F1-Score: 0.7812593
# Compute ROC and AUC
roc_curve <- roc(test_data$Churn, logistic_model_probs$Yes) # Assuming "Yes" is the positive class
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
cat("ROC-AUC:", auc_value, "\n")
## ROC-AUC: 0.8463893
# Calculate Balanced Accuracy
TP <- logistic_model_confusion_mat$table["Yes", "Yes"] # True Positives
TN <- logistic_model_confusion_mat$table["No", "No"] # True Negatives
FP <- logistic_model_confusion_mat$table["Yes", "No"] # False Positives
FN <- logistic_model_confusion_mat$table["No", "Yes"] # False Negatives
# Calculate Sensitivity and Specificity
sensitivity <- TP / (TP + FN) # Recall
specificity <- TN / (TN + FP)
# Calculate Balanced Accuracy
balanced_accuracy <- (sensitivity + specificity) / 2
# Print Balanced Accuracy
cat("Balanced Accuracy:", balanced_accuracy, "\n")
## Balanced Accuracy: 0.7648543
# Create a comparison table
classification_comparison <- data.frame(
Model = c("AdaBoost", "Naive Bayes", "KNN", "Logistic Regression"),
Precision = c(adaboost_precision, nb_precision, knn_precision, logistic_precision),
Recall = c(adaboost_recall, nb_recall, knn_recall, logistic_recall),
F1_Score = c(adaboost_f1_score, nb_f1_score, knn_f1_score, logistic_f1_score),
ROC_AUC = c(adaboost_auc, nb_auc, knn_auc, auc_value)
)
# Print the table
print(classification_comparison)
## Model Precision Recall F1_Score ROC_AUC
## 1 AdaBoost 0.7866667 0.7384717 0.7618077 0.7752927
## 2 Naive Bayes 0.7848000 0.6462451 0.7088150 0.8162446
## 3 KNN 0.8075970 0.6442688 0.7167461 0.8383042
## 4 Logistic Regression 0.7553376 0.8090235 0.7812593 0.8463893
library(tidyr)
library(ggplot2)
# Reshape data for plotting
comparison_plot_data <- classification_comparison %>%
pivot_longer(cols = c(Precision, Recall, F1_Score, ROC_AUC), names_to = "Metric", values_to = "Value")
# Plot
ggplot(comparison_plot_data, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("Comparison of Classification Metrics") +
ylab("Metric Value") +
xlab("Model") +
theme_minimal()
plot(adaboost_roc_curve, col = "red", lwd = 2, main = "ROC Curves for Models")
plot(nb_roc_curve, col = "blue", lwd = 2, add = TRUE)
plot(knn_roc_curve, col = "green", lwd = 2, add = TRUE)
plot(roc_curve, col = "purple", lwd = 2, add = TRUE)
legend("bottomright", legend = c("AdaBoost", "Naive Bayes", "KNN", "Logistic Regression"),
col = c("red", "blue", "green", "purple"), lwd = 2)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
# Train Random Forest model
rf_model <- randomForest(TotalCharges ~ ., data = train_data, ntree = 100)
# Make predictions on the test data
rf_predictions <- predict(rf_model, newdata = test_data)
# Evaluate the model using regression metrics
rf_results <- postResample(rf_predictions, test_data$TotalCharges)
# Compute MAE
rf_mae <- mean(abs(rf_predictions - test_data$TotalCharges))
# Extract RMSE and R-squared (interpreted as 'accuracy' here)
rf_rmse <- rf_results[["RMSE"]]
rf_rsquared <- rf_results[["Rsquared"]]
# Calculate 'accuracy' as the percentage of R-squared (for illustration purposes)
rf_accuracy_percent <- round(rf_rsquared * 100, 2)
# Print the results
cat("The root mean squared error (RMSE) is", round(rf_rmse, 4), ".\n")
## The root mean squared error (RMSE) is 0.0479 .
cat("The mean absolute error (MAE) is", round(rf_mae, 4), ".\n")
## The mean absolute error (MAE) is 0.028 .
cat("The R-squared value is", round(rf_rsquared, 4), ".\n")
## The R-squared value is 0.9978 .
cat("The accuracy is", rf_accuracy_percent, "%.\n")
## The accuracy is 99.78 %.
library(e1071)
library(caret)
# Train SVR model
svr_model <- svm(TotalCharges ~ ., data = train_data, type = "eps-regression")
# Make predictions on the test data
svr_predictions <- predict(svr_model, newdata = test_data)
# Evaluate the model using regression metrics
svr_results <- postResample(svr_predictions, test_data$TotalCharges)
# Compute MAE
svr_mae <- mean(abs(svr_predictions - test_data$TotalCharges))
# Extract RMSE and R-squared (interpreted as 'accuracy' here)
svr_rmse <- svr_results[["RMSE"]]
svr_rsquared <- svr_results[["Rsquared"]]
# Calculate 'accuracy' as the percentage of R-squared (for illustration purposes)
svr_accuracy_percent <- round(svr_rsquared * 100, 2)
# Print the results
cat("The root mean squared error (RMSE) is", round(svr_rmse, 4), ".\n")
## The root mean squared error (RMSE) is 0.0449 .
cat("The mean absolute error (MAE) is", round(svr_mae, 4), ".\n")
## The mean absolute error (MAE) is 0.0357 .
cat("The R-squared value is", round(svr_rsquared, 4), ".\n")
## The R-squared value is 0.9979 .
cat("The accuracy is", svr_accuracy_percent, "%.\n")
## The accuracy is 99.79 %.
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.2
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(caret)
# Train GBM model
set.seed(123)
gbm_model <- gbm(TotalCharges ~ .,
data = train_data,
distribution = "gaussian",
n.trees = 500,
interaction.depth = 5,
shrinkage = 0.01,
cv.folds = 5)
# Make predictions on the test data
gbm_predictions <- predict(gbm_model, newdata = test_data, n.trees = 500)
# Evaluate the model using regression metrics
gbm_results <- postResample(gbm_predictions, test_data$TotalCharges)
# Compute MAE
gbm_mae <- mean(abs(gbm_predictions - test_data$TotalCharges))
# Extract RMSE and R-squared (interpreted as 'accuracy' here)
gbm_rmse <- gbm_results[["RMSE"]]
gbm_rsquared <- gbm_results[["Rsquared"]]
# Calculate 'accuracy' as the percentage of R-squared (for illustration purposes)
gbm_accuracy_percent <- round(gbm_rsquared * 100, 2)
# Print the results
cat("The root mean squared error (RMSE) is", round(gbm_rmse, 4), ".\n")
## The root mean squared error (RMSE) is 0.0434 .
cat("The mean absolute error (MAE) is", round(gbm_mae, 4), ".\n")
## The mean absolute error (MAE) is 0.0311 .
cat("The R-squared value is", round(gbm_rsquared, 4), ".\n")
## The R-squared value is 0.9982 .
cat("The accuracy is", gbm_accuracy_percent, "%.\n")
## The accuracy is 99.82 %.
# Create a data frame to summarize regression results
regression_comparison <- data.frame(
Model = c("Random Forest", "SVR", "GBM"),
RMSE = c(rf_rmse, svr_rmse, gbm_rmse),
MAE = c(rf_mae, svr_mae, gbm_mae),
R_Squared = c(rf_rsquared, svr_rsquared, gbm_rsquared)
)
# Print the comparison table
print(regression_comparison)
## Model RMSE MAE R_Squared
## 1 Random Forest 0.04787860 0.02804931 0.9978233
## 2 SVR 0.04486929 0.03571839 0.9978528
## 3 GBM 0.04343913 0.03109167 0.9982029
library(ggplot2)
library(tidyr)
# Reshape data for plotting
regression_plot_data <- regression_comparison %>%
pivot_longer(cols = c(RMSE, MAE, R_Squared), names_to = "Metric", values_to = "Value")
# Plot R-Squared separately
rsquared_plot <- ggplot(
subset(regression_plot_data, Metric == "R_Squared"),
aes(x = Model, y = Value, fill = Metric)
) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("R-Squared Comparison of Regression Models") +
ylab("R-Squared Value") +
xlab("Model") +
theme_minimal()
# Plot RMSE and MAE together
rmse_mae_plot <- ggplot(
subset(regression_plot_data, Metric %in% c("RMSE", "MAE")),
aes(x = Model, y = Value, fill = Metric)
) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("RMSE and MAE Comparison of Regression Models") +
ylab("Metric Value") +
xlab("Model") +
theme_minimal()
# Print plots
print(rsquared_plot)
print(rmse_mae_plot)