1 Title: Enhancing Customer Retention in a Telecommunication Company Through Service Usage and Churn Analysis

1.1 Group Members

Tan Foo Hou (23074766)

Shaik Mohammed Muzeeb (23097292)

Loh Ke Yi (23092758)

Loh Jing Wei (23118574)

Lim Sze Chie (s2157250)

2 Introduction

In the competitive telecommunication industry, retaining customers is crucial for sustaining profitability and long-term growth. Customer churn, the phenomenon of customers leaving a service provider, poses a significant challenge, leading to revenue losses and increased acquisition costs. Understanding customer behavior, particularly their spending patterns and likelihood to churn, can enable companies to develop targeted strategies that enhance customer satisfaction and retention. This project leverages data-driven approaches to analyze and predict customers’ service usage and churn likelihood.

2.1 Objectives

  1. To classify customers likely to churn using classification models, facilitating the development of targeted retention strategies to minimize customer turnover.

  2. To predict customer spending patterns using regression models, enabling the design of personalized services, benefits, packages, or promotions tailored to individual customer needs.

2.2 Environment Setup

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.2
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.2
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(ROSE)
## Warning: package 'ROSE' was built under R version 4.4.2
## Loaded ROSE 0.0-4
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

2.2.1 Set Working Directory

setwd("C:\\Users\\user\\OneDrive\\Documents")

2.2.2 Load the Dataset

data <- read.csv("C:/Users/user/OneDrive/Documents/Telco-Customer-Churn.csv", stringsAsFactors = FALSE)
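Since the working directory is already set above, repeating the absolute path in read.csv() is redundant and ties the script to one machine. A minimal, more portable alternative (a sketch assuming the CSV sits in that working directory):

# Relative path resolved against the working directory set above;
# file.exists() guards against a silent wrong-path failure.
csv_path <- "Telco-Customer-Churn.csv"
stopifnot(file.exists(csv_path))
data <- read.csv(csv_path, stringsAsFactors = FALSE)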

3 Dataset Overview

The dataset chosen for this project is Telco Customer Churn, originally uploaded to Kaggle in 2018. It comprises detailed customer data from a telecommunications company, with the goal of analyzing the factors affecting customer behavior, service engagement, and churn.

The dataset consists of 7,043 observations and 21 columns. Each observation represents a unique customer.

3.1 Target Variables

This dataset is used for two separate models, each with a different target variable. For the classification model, the target variable in this dataset is Churn, a categorical variable indicating whether the customer has churned within the last month. The values are:

  • Yes: The customer has churned.
  • No: The customer has remained with the service.

For the regression model, the target variable is TotalCharges, a variable representing the total amount charged to the customer.

3.2 Independent Variables

The 19 independent variables in this dataset include (customerID is a unique identifier and is dropped during data cleaning):

  • Demographic Information:

    • customerID, gender, SeniorCitizen, Partner, Dependents
  • Account Information:

    • tenure, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges
  • Services Subscribed:

    • PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies

4 Exploratory Data Analysis (EDA)

4.1 Initial Overview of the Data

str(data)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
summary(data)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##     Churn          
##  Length:7043       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Convert the TotalCharges column to a numeric data type:

data$TotalCharges <- as.numeric(data$TotalCharges)

4.2 Chi-Squared Test

We performed a chi-squared test on all categorical variables to evaluate their association with the target variable Churn.

# categorical variables
categorical_vars <- c("Contract", "PaymentMethod", "InternetService", "TechSupport", "StreamingTV", 
                    "OnlineSecurity", "OnlineBackup", "DeviceProtection", "MultipleLines", "StreamingMovies", 
                    "PaperlessBilling", "SeniorCitizen", "Partner", "gender", "Dependents", "PhoneService")

# Create an empty data frame to store the chi2 results
chi2_results <- data.frame(Var=character(), Chi_2=numeric(), P_Val=numeric(), stringsAsFactors=FALSE)
for (var in categorical_vars){
    # Create a contingency table
    contingency_table <- table(data[[var]], data$Churn)
    # Perform the chi-square test
    chi2_test <- chisq.test(contingency_table)
    chi2_value <- chi2_test$statistic
    p_value <- chi2_test$p.value
    p_value <- ifelse(p_value < 0.0001, "<0.0001", round(p_value, 4))
    # Store the results in the data frame
    chi2_results <- rbind(chi2_results, data.frame(Var=var, Chi_2=chi2_value, P_Val=p_value))
}

# sort the results by chi-square value
chi2_results <- chi2_results[order(-chi2_results$Chi_2),]
print(chi2_results)
##                          Var        Chi_2   P_Val
## X-squared           Contract 1184.5965721 <0.0001
## X-squared5    OnlineSecurity  849.9989680 <0.0001
## X-squared3       TechSupport  828.1970685 <0.0001
## X-squared2   InternetService  732.3095897 <0.0001
## X-squared1     PaymentMethod  648.1423275 <0.0001
## X-squared6      OnlineBackup  601.8127901 <0.0001
## X-squared7  DeviceProtection  558.4193694 <0.0001
## X-squared9   StreamingMovies  375.6614793 <0.0001
## X-squared4       StreamingTV  374.2039433 <0.0001
## X-squared10 PaperlessBilling  258.2776491 <0.0001
## X-squared14       Dependents  189.1292494 <0.0001
## X-squared11    SeniorCitizen  159.4263004 <0.0001
## X-squared12          Partner  158.7333820 <0.0001
## X-squared8     MultipleLines   11.3304415  0.0035
## X-squared15     PhoneService    0.9150330  0.3388
## X-squared13           gender    0.4840829  0.4866

The chi-squared test was used to analyze the relationship between each categorical variable and the target variable, Churn. A higher chi-squared value indicates a stronger statistical association, but it does not show the direction or real-world impact of the relationship.

Most variables demonstrate a significant association with Churn at the 95% confidence level (p-value < 0.05). PhoneService and gender, however, have p-values above 0.05, indicating no significant association; MultipleLines is statistically significant (p = 0.0035) but shows a much weaker association than the other variables.

While high chi-squared values suggest stronger relationships, visual analysis is essential to understand how different groups within each variable relate to Churn. This will help uncover the practical importance and direction of the associations.
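Because the chi-squared statistic grows with sample size, raw values are only loosely comparable across variables. A normalized effect size such as Cramér's V (an addition beyond the original analysis) puts every variable on a 0-1 scale; a minimal sketch reusing categorical_vars:

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), a 0-1 effect size.
cramers_v <- function(x, y) {
  tbl <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl))$statistic
  k <- min(nrow(tbl), ncol(tbl))
  as.numeric(sqrt(chi2 / (sum(tbl) * (k - 1))))
}
sort(sapply(categorical_vars, function(v) round(cramers_v(data[[v]], data$Churn), 3)), decreasing = TRUE)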

4.3 Correlation of Numerical Features

num_data <- data %>% select(tenure, MonthlyCharges, TotalCharges)

# Compute correlation matrix
cor_matrix <- cor(num_data, use = "complete.obs")
print(cor_matrix)
##                   tenure MonthlyCharges TotalCharges
## tenure         1.0000000      0.2468618    0.8258805
## MonthlyCharges 0.2468618      1.0000000    0.6510648
## TotalCharges   0.8258805      0.6510648    1.0000000

4.3.1 Correlation Plot

corrplot::corrplot(cor_matrix, method = "circle", type = "lower", tl.col = "black")

4.4 Distribution of the variables

The plot_var function visualizes a categorical variable’s distribution and churn percentage in one combined plot. It uses a bar chart to show category counts and overlays a line chart to display the churn percentage for each group, enabling a clear comparison of distribution and churn trends.

# Write a function to plot the bar chart and line chart for categorical variables
plot_var <- function(df, var, target="Churn"){
    # Group by the variable and target variable
    grouped_by_var <- df %>% 
                    group_by(!!sym(var), !!sym(target)) %>% 
                    summarise(n = n(),.groups = 'keep')

    # Calculate the churn rate
    churn_rate <- df %>% 
                group_by(!!sym(var), !!sym(target)) %>% 
                summarise(n = n(),.groups = 'drop_last') %>%
                mutate(Percentage_of_Churn = n / sum(n) * 100) %>%
                filter(!!sym(target) == "Yes")
    
    # Calculate the maximum number of customers for the grouped variable for the y-axis limit
    max_customers <- max(grouped_by_var$n)
    
    # Plot the bar chart
    p <- ggplot(data=grouped_by_var, aes(x=!!sym(var),y=n,fill=!!sym(target))) +
        geom_bar(stat="identity", position="dodge") +
        geom_text(data=grouped_by_var, aes(label=n, y=n), vjust=-0.5, position=position_dodge(width=0.9)) +
        theme_minimal()

    # Plot the line chart
    p <- p + geom_line(data=churn_rate, aes(y=Percentage_of_Churn/100 * max_customers, group=!!sym(target)), color="black", linewidth=1) +
        geom_point(data=churn_rate, aes(y=Percentage_of_Churn/100 * max_customers, color=!!sym(target)), color="black", size=3) +
        geom_text(data=churn_rate, aes(y=Percentage_of_Churn/100 * max_customers, label=round(Percentage_of_Churn,2), vjust=-0.5)) +
        scale_y_continuous(sec.axis = sec_axis(~. * 100/max_customers, name="Churn Rate (%)")) +
        labs(title=paste(var, "and Its Influence on Churn Rates"), x=var, y="Number of Customers") +
        theme(legend.position="bottom", plot.title = element_text(hjust = 0.5))
    
    return(p)
}

4.4.1 Numerical Variables and Their Impact on Churn

4.4.1.1 Tenure (in months)

data$tenure_bin <- cut(data$tenure, breaks=seq(0,72,by=12), right=TRUE,include.lowest=TRUE, labels=c("0-12", "13-24", "25-36", "37-48", "49-60", "61-72"))
print(plot_var(data, "tenure_bin"))

Customers with shorter tenure (0-12 months) have the highest churn rate at 47.44%. The churn rate consistently decreases as tenure increases, with customers in the 61-72 month bin exhibiting the lowest churn rate (6.61%). This highlights that longer-tenured customers are less likely to churn.

4.4.1.2 Total Charges

data$TotalCharges_bin <- cut(data$TotalCharges, breaks=seq(18.8,8684.8,by=1083.25), include.lowest=TRUE, labels=c("[18.8-1102.05]", "(1102.05-2185.3]", "(2185.3-3268.55]", "(3268.55-4351.8]", "(4351.8-5435.05]", "(5435.05-6518.3]", "(6518.3-7601.55]", "(7601.55-8684.8]"))
print(plot_var(data, "TotalCharges_bin") + theme(axis.text.x = element_text(angle = 45, hjust = 0.8)))

The highest churn rate (36.32%) is observed among customers with the lowest total charges, in the range of $18.8–$1,102.05, typically linked to shorter tenure. As total charges increase—reflecting longer tenure with the company—churn rates steadily decrease, reaching a low of 7.78% for the highest total charges in the range of $7,601.55–$8,684.8.

4.4.1.3 Monthly Charges

data$MonthlyCharges_bin <- cut(data$MonthlyCharges, breaks=seq(18.25,118.75,by=16.75), include.lowest=TRUE, labels=c("[18.25-35]", "(35-51.75]", "(51.75-68.5]", "(68.5-85.25]", "(85.25-102]", "(102-118.75]"))
print(plot_var(data, "MonthlyCharges_bin"))

Customers with monthly charges of $68.5–$85.25 have the highest churn rate (38%), while those with the lowest charges of $18.25–$35 have the lowest churn rate (10.88%). Churn decreases for charges above $102.

4.4.1.4 Monthly Charges by Churn Using Side-by-Side Boxplots

ggplot(data, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
  geom_boxplot() +
  ggtitle("Monthly Charges by Churn") +
  xlab("Churn") +
  ylab("Monthly Charges") +
  theme_minimal()

The boxplot compares the distribution of monthly charges between customers who churned (“Yes”) and those who did not churn (“No”).

  • Customers who churned generally have higher monthly charges, with a median monthly charge above that of non-churning customers.

  • The interquartile ranges (IQR) of the two groups overlap, but non-churned customers show a slightly wider range, suggesting more variability in their monthly charges.

Higher monthly charges may be associated with a higher likelihood of churn; this is investigated further in a later section.

# Remove the binned columns after plotting the graphs
data <- data %>% select(-tenure_bin)
data <- data %>% select(-MonthlyCharges_bin)
data <- data %>% select(-TotalCharges_bin)

4.4.2 Categorical Variables and Their Impact on Churn

4.4.2.1 Contract Type

print(plot_var(data, "Contract"))

Month-to-month contracts have the highest churn rate (42.71%), showing customers are more likely to leave when not bound by longer agreements. In contrast, one-year and two-year contracts show lower churn rates at 11.27% and 2.83%, respectively, indicating longer contracts improve retention.

4.4.2.2 Internet Service Type

print(plot_var(data, "InternetService"))

The type of internet service also greatly affects churn rates. Customers using fiber optic services have the highest churn rate (41.89%), followed by DSL users (18.96%). Customers without internet service have the lowest churn rate (7.4%).

4.4.2.3 Additional Services

4.4.2.3.1 Online Security Services
print(plot_var(data, "OnlineSecurity"))

4.4.2.3.2 Tech Support Services
print(plot_var(data, "TechSupport", "Churn"))

4.4.2.3.3 Online Backup Services
print(plot_var(data, "OnlineBackup"))

4.4.2.3.4 Device Protection Services
print(plot_var(data, "DeviceProtection"))

These plots of additional services, including online security, tech support, online backup, and device protection, reveal a common trend. Customers who do not subscribe to these services exhibit churn rates ranging from 39% to 42%. In contrast, customers who do subscribe show significantly lower churn rates, ranging from 14% to 22%. Those categorized as “No internet service” consistently have the lowest churn rate of 7.4%. This highlights the importance of offering and encouraging the use of additional services to enhance customer retention.

4.4.2.4 Payment Method

print(plot_var(data, "PaymentMethod"))

Electronic check users have the highest churn rate (45.29%), while bank transfer, credit card, and mailed check users have lower rates (15%-19%). This suggests electronic check users may face more difficulties or inconveniences.

4.4.2.5 Streaming Services

4.4.2.5.1 Streaming Movies
print(plot_var(data, "StreamingMovies"))

4.4.2.5.2 Streaming TV
print(plot_var(data, "StreamingTV"))

Streaming services, including streaming movies and streaming TV, exhibit similar patterns. Customers who do not subscribe to these services have higher churn rates, approximately 33% to 34%, while those who subscribe show lower churn rates, around 29% to 31%, suggesting that attractive streaming options might be an effective approach to lowering churn.

4.4.2.6 Paperless Billing

print(plot_var(data, "PaperlessBilling"))

Customers using paperless billing have a higher churn rate (33.57%) compared to those using traditional billing (16.33%). This suggests that traditional billing may be more comfortable for some customers, acting as a tangible reminder and fostering stronger engagement, which helps reduce churn.

4.4.2.7 Demographic Features

4.4.2.7.1 Dependents
print(plot_var(data, "Dependents"))

Customers without dependents exhibit a churn rate of 31.28%, significantly higher than the 15.45% churn rate among customers with dependents. Dependents may act as a stabilizing factor, encouraging customers to stay.

4.4.2.7.2 Senior Citizen Status
print(plot_var(data, "SeniorCitizen"))

Senior citizens have a higher churn rate of 41.68% compared to non-senior citizens at 23.61%. This indicates that senior citizens may face challenges or lack engagement with the services offered.

4.4.2.7.3 Partner Status
print(plot_var(data, "Partner"))

Customers without a partner have a churn rate of 32.96%, compared to 19.66% for those with a partner. Having a partner may contribute to a more stable service usage pattern.

5 Data Cleaning

5.1 Missing Values

Count NA values in each column

na_count <- sapply(data, function(x) sum(is.na(x)))
print(na_count)
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

Upon inspection, the TotalCharges column has 11 missing values (NA).

To understand the nature of these missing values, we inspected the rows where TotalCharges is NA:

data[is.na(data$TotalCharges), ]
##      customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 489  4472-LVYGI Female             0     Yes        Yes      0           No
## 754  3115-CZMZD   Male             0      No        Yes      0          Yes
## 937  5709-LVOEQ Female             0     Yes        Yes      0          Yes
## 1083 4367-NUYAO   Male             0     Yes        Yes      0          Yes
## 1341 1371-DWPAZ Female             0     Yes        Yes      0           No
## 3332 7644-OMVMY   Male             0     Yes        Yes      0          Yes
## 3827 3213-VVOLG   Male             0     Yes        Yes      0          Yes
## 4381 2520-SGTTA Female             0     Yes        Yes      0          Yes
## 5219 2923-ARZLG   Male             0     Yes        Yes      0          Yes
## 6671 4075-WKNIU Female             0     Yes        Yes      0          Yes
## 6755 2775-SEFEE   Male             0      No        Yes      0          Yes
##         MultipleLines InternetService      OnlineSecurity        OnlineBackup
## 489  No phone service             DSL                 Yes                  No
## 754                No              No No internet service No internet service
## 937                No             DSL                 Yes                 Yes
## 1083              Yes              No No internet service No internet service
## 1341 No phone service             DSL                 Yes                 Yes
## 3332               No              No No internet service No internet service
## 3827              Yes              No No internet service No internet service
## 4381               No              No No internet service No internet service
## 5219               No              No No internet service No internet service
## 6671              Yes             DSL                  No                 Yes
## 6755              Yes             DSL                 Yes                 Yes
##         DeviceProtection         TechSupport         StreamingTV
## 489                  Yes                 Yes                 Yes
## 754  No internet service No internet service No internet service
## 937                  Yes                  No                 Yes
## 1083 No internet service No internet service No internet service
## 1341                 Yes                 Yes                 Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671                 Yes                 Yes                 Yes
## 6755                  No                 Yes                  No
##          StreamingMovies Contract PaperlessBilling             PaymentMethod
## 489                   No Two year              Yes Bank transfer (automatic)
## 754  No internet service Two year               No              Mailed check
## 937                  Yes Two year               No              Mailed check
## 1083 No internet service Two year               No              Mailed check
## 1341                  No Two year               No   Credit card (automatic)
## 3332 No internet service Two year               No              Mailed check
## 3827 No internet service Two year               No              Mailed check
## 4381 No internet service Two year               No              Mailed check
## 5219 No internet service One year              Yes              Mailed check
## 6671                  No Two year               No              Mailed check
## 6755                  No Two year              Yes Bank transfer (automatic)
##      MonthlyCharges TotalCharges Churn
## 489           52.55           NA    No
## 754           20.25           NA    No
## 937           80.85           NA    No
## 1083          25.75           NA    No
## 1341          56.05           NA    No
## 3332          19.85           NA    No
## 3827          25.35           NA    No
## 4381          20.00           NA    No
## 5219          19.70           NA    No
## 6671          73.35           NA    No
## 6755          61.90           NA    No

Upon inspection, the 11 missing values (NA) in the TotalCharges column correspond to rows where the tenure is 0. This indicates that these customers are new and have not incurred any charges yet.

To address this, replace the NA values in the TotalCharges column with 0.

data$TotalCharges[is.na(data$TotalCharges)] <- 0

After handling missing values, the number of NA values in each column is calculated

na_count <- sapply(data, function(x) sum(is.na(x)))
print(na_count)
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0                0 
##            Churn 
##                0

5.2 Drop Irrelevant Features

The customerID column is dropped, as it is a unique identifier that does not contribute to the analysis:

data <- data %>% select(-customerID)

5.3 Convert Categorical Variables to Factors

For machine learning and analysis, categorical variables need to be converted to factors to ensure proper handling during modeling. The following columns in the dataset are identified as categorical and are converted into factors:

# Convert categorical columns to factors
data$gender <- as.factor(data$gender)
data$SeniorCitizen <- as.factor(data$SeniorCitizen)
data$Partner <- as.factor(data$Partner)
data$Dependents <- as.factor(data$Dependents)
data$PhoneService <- as.factor(data$PhoneService)
data$MultipleLines <- as.factor(data$MultipleLines)
data$InternetService <- as.factor(data$InternetService)
data$OnlineSecurity <- as.factor(data$OnlineSecurity)
data$OnlineBackup <- as.factor(data$OnlineBackup)
data$DeviceProtection <- as.factor(data$DeviceProtection)
data$TechSupport <- as.factor(data$TechSupport)
data$StreamingTV <- as.factor(data$StreamingTV)
data$StreamingMovies <- as.factor(data$StreamingMovies)
data$Contract <- as.factor(data$Contract)
data$PaperlessBilling <- as.factor(data$PaperlessBilling)
data$PaymentMethod <- as.factor(data$PaymentMethod)
data$Churn <- as.factor(data$Churn)
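The block above can be written more compactly with dplyr. A sketch assuming, as holds here after dropping customerID, that every remaining character column is categorical; SeniorCitizen is stored as 0/1 integers, so it needs an explicit conversion of its own:

# One-pass alternative to the explicit as.factor() calls above.
data <- data %>%
  mutate(across(where(is.character), as.factor),
         SeniorCitizen = as.factor(SeniorCitizen))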

5.4 Handle Outliers

Outliers in the dataset can have a significant impact on the results of machine learning models. To identify potential outliers, summary statistics such as mean, minimum, maximum, median, standard deviation, and quartiles (Q1 and Q3) are calculated for each numerical column:

# Identify numerical columns in 'data'
numerical_attributes <- sapply(data, is.numeric)
numerical_data <- data[, numerical_attributes]

# Calculate summary statistics for numerical data
numerical_distribution <- sapply(numerical_data, function(x) {
  c(mean = round(mean(x, na.rm = TRUE), digits = 2),
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = round(sd(x, na.rm = TRUE), digits = 2),
    q1 = quantile(x, probs = 0.25, na.rm = TRUE),
    q3 = quantile(x, probs = 0.75, na.rm = TRUE))
}) %>% t() %>% as.data.frame()

numerical_distribution
##                   mean   min     max  median      sd q1.25%  q3.75%
## tenure           32.37  0.00   72.00   29.00   24.56   9.00   55.00
## MonthlyCharges   64.76 18.25  118.75   70.35   30.09  35.50   89.85
## TotalCharges   2279.73  0.00 8684.80 1394.55 2266.79 398.55 3786.60

Inspection of the minimum and maximum values for each numerical column revealed no unrealistic values.

However, to ensure that subtle outliers are accounted for, we use the Interquartile Range (IQR) method to detect and handle potential outliers.

# Function to check for outliers using IQR method
check_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  sum(x < lower_bound | x > upper_bound, na.rm = TRUE) > 0
}

# Identify numerical columns
numerical_attributes <- sapply(data, is.numeric)
numerical_data <- data[, numerical_attributes]

# Check for outliers in each numerical column
outlier_check <- sapply(numerical_data, check_outliers)

# Display results
outlier_summary <- data.frame(
  Column = names(outlier_check),
  Has_Outliers = ifelse(outlier_check, "Yes", "No")
)

# Print the summary table
print(outlier_summary)
##                        Column Has_Outliers
## tenure                 tenure           No
## MonthlyCharges MonthlyCharges           No
## TotalCharges     TotalCharges           No

As shown in the table above, no outliers were detected.

Boxplot for Tenure

boxplot(data$tenure, main = "Tenure Outliers", ylab = "Tenure")

Boxplot for MonthlyCharges

boxplot(data$MonthlyCharges, main = "Monthly Charges Outliers", ylab = "Monthly Charges")

Boxplot for TotalCharges

boxplot(data$TotalCharges, main = "Total Charges Outliers", ylab = "Total Charges")

5.5 Data Transformation

Z-score normalization via the scale() function was applied to the numerical columns (tenure, MonthlyCharges, and TotalCharges), transforming them to have a mean of 0 and a standard deviation of 1.

This ensures that all numerical variables contribute equally to the model, avoiding biases caused by differences in scale.

# List of numerical columns to normalize
numerical_cols <- c("tenure", "MonthlyCharges", "TotalCharges")

# Normalizing the numerical columns
data[numerical_cols] <- scale(data[numerical_cols])

# Preview the normalized data
head(data)
##   gender SeniorCitizen Partner Dependents      tenure PhoneService
## 1 Female             0     Yes         No -1.27735389           No
## 2   Male             0      No         No  0.06632271          Yes
## 3   Male             0      No         No -1.23663642          Yes
## 4   Male             0      No         No  0.51421491           No
## 5 Female             0      No         No -1.23663642          Yes
## 6 Female             0      No         No -0.99233158          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service             DSL             No          Yes               No
## 2               No             DSL            Yes           No              Yes
## 3               No             DSL            Yes          Yes               No
## 4 No phone service             DSL            Yes           No              Yes
## 5               No     Fiber optic             No           No               No
## 6              Yes     Fiber optic             No           No              Yes
##   TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No          No              No Month-to-month              Yes
## 2          No          No              No       One year               No
## 3          No          No              No Month-to-month              Yes
## 4         Yes          No              No       One year               No
## 5          No          No              No Month-to-month              Yes
## 6          No         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn
## 1          Electronic check     -1.1602405   -0.9925401    No
## 2              Mailed check     -0.2596105   -0.1721525    No
## 3              Mailed check     -0.3626346   -0.9579979   Yes
## 4 Bank transfer (automatic)     -0.7464825   -0.1936586    No
## 5          Electronic check      0.1973512   -0.9388078   Yes
## 6          Electronic check      1.1594634   -0.6437435   Yes
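A quick sanity check (not part of the original output) confirms that each scaled column now has a mean of approximately 0 and a standard deviation of 1:

# Means should be ~0 up to floating-point noise; SDs exactly 1.
round(colMeans(data[numerical_cols]), 10)
sapply(data[numerical_cols], sd)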

5.6 Re-sampling Imbalanced Dataset

5.6.1 Churn Distribution Using a Bar Chart

ggplot(data, aes(x = Churn, fill = Churn)) +
  geom_bar() +
  ggtitle("Churn Distribution") +
  xlab("Churn") +
  ylab("Count") +
  theme_minimal()

The chart categorizes customers into two groups: those who did not churn (“No”) and those who did churn (“Yes”).

  • The majority of customers (5,174) did not churn, while a smaller proportion (1,869) did.

This imbalance shows the dataset is skewed towards non-churners; a model trained on it may become biased towards predicting non-churners, leading to poor performance in identifying actual churners.

5.6.2 Oversampling

To address this imbalance, oversampling was performed using the ROSE package, which increased the number of churned (“Yes”) customers to balance the dataset and ensure fairer representation for predictive model training.

# Oversampling
data <- ovun.sample(Churn ~ ., data = data, method = "over")$data

5.6.3 Churn Distribution After Oversampling Using a Bar Chart

ggplot(data, aes(x = Churn, fill = Churn)) +
  geom_bar() +
  ggtitle("Churn Distribution After Resampling") +
  xlab("Churn") +
  ylab("Count") +
  theme_minimal()

table(data$Churn)
## 
##   No  Yes 
## 5174 5276

After oversampling, the numbers of churners and non-churners are almost equal (5,174 vs. 5,276).

Save the cleaned dataset for further analysis or modeling

write.csv(data, "cleaned_telco_data.csv", row.names = FALSE)

6 Data Splitting

Divide the dataset into training and testing sets using a 70-30 split.

set.seed(123)
train_index <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
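For reference, caret's createDataPartition() draws a stratified sample that preserves the Churn class proportions in both sets. The results below were produced with the simple random split above, so this is only an optional alternative sketch:

# Stratified 70-30 split on Churn (alternative; not used for the
# results reported in the following sections).
library(caret)
set.seed(123)
strat_index <- createDataPartition(data$Churn, p = 0.7, list = FALSE)
strat_train <- data[strat_index, ]
strat_test  <- data[-strat_index, ]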

7 Data Modeling

7.1 Objective 1: Churn Prediction

7.1.1 Classification Algorithm 1: Adaptive Boosting (AdaBoost)

# Load required libraries
library(ada)
## Warning: package 'ada' was built under R version 4.4.2
## Loading required package: rpart
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
library(pROC)

# Train AdaBoost model
set.seed(123)
adaboost_model <- ada(Churn ~ ., data = train_data, iter = 50, bag.frac = 0.5)

# Make predictions on the test data
adaboost_model_pred <- predict(adaboost_model, newdata = test_data)

# Evaluate the model using confusion matrix
adaboost_model_confusion_mat <- confusionMatrix(adaboost_model_pred, test_data$Churn)
print(adaboost_model_confusion_mat)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1121  304
##        Yes  397 1314
##                                           
##                Accuracy : 0.7765          
##                  95% CI : (0.7615, 0.7909)
##     No Information Rate : 0.5159          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5516          
##                                           
##  Mcnemar's Test P-Value : 0.0005112       
##                                           
##             Sensitivity : 0.7385          
##             Specificity : 0.8121          
##          Pos Pred Value : 0.7867          
##          Neg Pred Value : 0.7680          
##              Prevalence : 0.4841          
##          Detection Rate : 0.3575          
##    Detection Prevalence : 0.4544          
##       Balanced Accuracy : 0.7753          
##                                           
##        'Positive' Class : No              
## 
# Extract performance metrics
adaboost_precision <- adaboost_model_confusion_mat$byClass["Precision"]
adaboost_recall <- adaboost_model_confusion_mat$byClass["Recall"]
adaboost_f1_score <- 2 * ((adaboost_precision * adaboost_recall) / (adaboost_precision + adaboost_recall))

# Approximate probabilities from the training class proportions
# (a crude two-valued proxy; see the type = "probs" sketch below)
yes_prob <- mean(train_data$Churn == "Yes")
no_prob <- 1 - yes_prob

# Assign probabilities to predictions
adaboost_probs <- ifelse(adaboost_model_pred == "Yes", yes_prob, no_prob)

# Convert the actual test labels to binary for ROC
binary_churn <- ifelse(test_data$Churn == "Yes", 1, 0)

# Compute ROC and AUC
adaboost_roc_curve <- roc(binary_churn, adaboost_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
adaboost_auc <- auc(adaboost_roc_curve)

# Print performance metrics and AUC
cat("AdaBoost - Precision:", adaboost_precision, "\n")
## AdaBoost - Precision: 0.7866667
cat("AdaBoost - Recall:", adaboost_recall, "\n")
## AdaBoost - Recall: 0.7384717
cat("AdaBoost - F1-Score:", adaboost_f1_score, "\n")
## AdaBoost - F1-Score: 0.7618077
cat("AdaBoost - ROC-AUC:", adaboost_auc, "\n")
## AdaBoost - ROC-AUC: 0.7752927
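The ada package also documents a type = "probs" argument for predict(), which returns per-class probability estimates and should yield a smoother ROC curve than the two-valued proxy used above. A sketch (the AUC reported above was computed with the proxy):

# Class-probability output; column order follows levels(train_data$Churn),
# so column 2 corresponds to "Yes".
adaboost_probs_raw <- predict(adaboost_model, newdata = test_data, type = "probs")
adaboost_roc_probs <- roc(test_data$Churn, adaboost_probs_raw[, 2])
auc(adaboost_roc_probs)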

7.1.2 Classification Algorithm 2: Naive Bayes (NB)

library(e1071)
## Warning: package 'e1071' was built under R version 4.4.2
library(caret)

# Train Naive Bayes model
nb_model <- naiveBayes(Churn~., data = train_data)

# Make predictions on the test data
nb_model_pred <- predict(nb_model, newdata = test_data)

# Evaluate the model using confusion matrix
library(caret)
nb_model_confusion_mat <- confusionMatrix(nb_model_pred, test_data$Churn)
nb_model_confusion_mat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   981  269
##        Yes  537 1349
##                                           
##                Accuracy : 0.743           
##                  95% CI : (0.7273, 0.7582)
##     No Information Rate : 0.5159          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4826          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6462          
##             Specificity : 0.8337          
##          Pos Pred Value : 0.7848          
##          Neg Pred Value : 0.7153          
##              Prevalence : 0.4841          
##          Detection Rate : 0.3128          
##    Detection Prevalence : 0.3986          
##       Balanced Accuracy : 0.7400          
##                                           
##        'Positive' Class : No              
## 
# Extract performance metrics
nb_precision <- nb_model_confusion_mat$byClass["Precision"]
nb_recall <- nb_model_confusion_mat$byClass["Recall"]
nb_f1_score <- 2 * ((nb_precision * nb_recall) / (nb_precision + nb_recall))

# Compute ROC and AUC
nb_probs <- predict(nb_model, newdata = test_data, type = "raw")
nb_roc_curve <- roc(test_data$Churn, nb_probs[, "Yes"])
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
nb_auc <- auc(nb_roc_curve)

# Print results
cat("Naive Bayes - Precision:", nb_precision, "\n")
## Naive Bayes - Precision: 0.7848
cat("Naive Bayes - Recall:", nb_recall, "\n")
## Naive Bayes - Recall: 0.6462451
cat("Naive Bayes - F1-Score:", nb_f1_score, "\n")
## Naive Bayes - F1-Score: 0.708815
cat("Naive Bayes - ROC-AUC:", nb_auc, "\n")
## Naive Bayes - ROC-AUC: 0.8162446

7.1.3 Classification Algorithm 3: K-Nearest Neighbors (KNN)

library(caret)
control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation

# Train KNN model
knn_model <- train(Churn~., data = train_data, method = "knn",
                         trControl = control)
knn_model
## k-Nearest Neighbors 
## 
## 7314 samples
##   19 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6583, 6582, 6582, 6584, 6582, 6583, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.7575915  0.5151534
##   7  0.7548570  0.5096909
##   9  0.7559475  0.5118746
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
# Make predictions on the test data
knn_model_pred <- predict(knn_model, newdata = test_data)

# Evaluate the model using confusion matrix
knn_model_confusion_mat <- confusionMatrix(knn_model_pred, test_data$Churn)
knn_model_confusion_mat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   978  233
##        Yes  540 1385
##                                          
##                Accuracy : 0.7535         
##                  95% CI : (0.738, 0.7685)
##     No Information Rate : 0.5159         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5034         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.6443         
##             Specificity : 0.8560         
##          Pos Pred Value : 0.8076         
##          Neg Pred Value : 0.7195         
##              Prevalence : 0.4841         
##          Detection Rate : 0.3119         
##    Detection Prevalence : 0.3862         
##       Balanced Accuracy : 0.7501         
##                                          
##        'Positive' Class : No             
## 
# Extract performance metrics
knn_precision <- knn_model_confusion_mat$byClass["Precision"]
knn_recall <- knn_model_confusion_mat$byClass["Recall"]
knn_f1_score <- 2 * ((knn_precision * knn_recall) / (knn_precision + knn_recall))

# Compute ROC and AUC
knn_probs <- predict(knn_model, newdata = test_data, type = "prob")
knn_roc_curve <- roc(test_data$Churn, knn_probs$Yes)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
knn_auc <- auc(knn_roc_curve)

# Print results
cat("KNN - Precision:", knn_precision, "\n")
## KNN - Precision: 0.807597
cat("KNN - Recall:", knn_recall, "\n")
## KNN - Recall: 0.6442688
cat("KNN - F1-Score:", knn_f1_score, "\n")
## KNN - F1-Score: 0.7167461
cat("KNN - ROC-AUC:", knn_auc, "\n")
## KNN - ROC-AUC: 0.8383042
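caret's default search only tried k = 5, 7, and 9. An optional wider grid over odd values of k (odd k avoids tied votes between the two classes) can be supplied through tuneGrid; a sketch reusing the 10-fold CV control defined above, not used for the metrics reported:

# Wider tuning grid for k; the results above use caret's default grid.
knn_grid <- expand.grid(k = seq(3, 21, by = 2))
knn_tuned <- train(Churn ~ ., data = train_data, method = "knn",
                   trControl = control, tuneGrid = knn_grid)
print(knn_tuned)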

7.1.4 Classification Algorithm 4: Logistic Regression (LR)

library(caret)
library(pROC)

# Define train control with probability output and cross-validation
control <- trainControl(
  method = "cv", 
  number = 10, 
  classProbs = TRUE,  # Enables probability predictions
  summaryFunction = twoClassSummary  # Enables AUC calculation
)

# Train the logistic regression model
logistic_model <- train(
  Churn ~ ., 
  data = train_data, 
  method = "glm", 
  family = "binomial", 
  trControl = control,
  metric = "ROC"  # Optimize based on ROC
)

# Display the trained model
print(logistic_model)
## Generalized Linear Model 
## 
## 7314 samples
##   19 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6583, 6582, 6583, 6583, 6583, 6583, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8430136  0.7232001  0.8004439
# Make predictions on the test data
logistic_model_pred <- predict(logistic_model, newdata = test_data)
logistic_model_probs <- predict(logistic_model, newdata = test_data, type = "prob")

# Evaluate the model using confusion matrix
logistic_model_confusion_mat <- confusionMatrix(logistic_model_pred, test_data$Churn, positive = "Yes")
print(logistic_model_confusion_mat)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1094  309
##        Yes  424 1309
##                                         
##                Accuracy : 0.7663        
##                  95% CI : (0.751, 0.781)
##     No Information Rate : 0.5159        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.531         
##                                         
##  Mcnemar's Test P-Value : 2.546e-05     
##                                         
##             Sensitivity : 0.8090        
##             Specificity : 0.7207        
##          Pos Pred Value : 0.7553        
##          Neg Pred Value : 0.7798        
##              Prevalence : 0.5159        
##          Detection Rate : 0.4174        
##    Detection Prevalence : 0.5526        
##       Balanced Accuracy : 0.7649        
##                                         
##        'Positive' Class : Yes           
## 
# Extract performance metrics: Precision, Recall, F1-Score
logistic_precision <- logistic_model_confusion_mat$byClass["Precision"]
logistic_recall <- logistic_model_confusion_mat$byClass["Recall"]
logistic_f1_score <- 2 * ((logistic_precision * logistic_recall) / (logistic_precision + logistic_recall))

cat("Precision:", logistic_precision, "\n")
## Precision: 0.7553376
cat("Recall:", logistic_recall, "\n")
## Recall: 0.8090235
cat("F1-Score:", logistic_f1_score, "\n")
## F1-Score: 0.7812593
# Compute ROC and AUC
roc_curve <- roc(test_data$Churn, logistic_model_probs$Yes)  # Assuming "Yes" is the positive class
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
cat("ROC-AUC:", auc_value, "\n")
## ROC-AUC: 0.8463893
# Calculate Balanced Accuracy
TP <- logistic_model_confusion_mat$table["Yes", "Yes"]  # True Positives
TN <- logistic_model_confusion_mat$table["No", "No"]     # True Negatives
FP <- logistic_model_confusion_mat$table["Yes", "No"]    # False Positives
FN <- logistic_model_confusion_mat$table["No", "Yes"]    # False Negatives

# Calculate Sensitivity and Specificity
sensitivity <- TP / (TP + FN)  # Recall
specificity <- TN / (TN + FP)

# Calculate Balanced Accuracy
balanced_accuracy <- (sensitivity + specificity) / 2

# Print Balanced Accuracy
cat("Balanced Accuracy:", balanced_accuracy, "\n")
## Balanced Accuracy: 0.7648543

7.1.5 Summary of Classification Results

# Create a comparison table
classification_comparison <- data.frame(
  Model = c("AdaBoost", "Naive Bayes", "KNN", "Logistic Regression"),
  Precision = c(adaboost_precision, nb_precision, knn_precision, logistic_precision),
  Recall = c(adaboost_recall, nb_recall, knn_recall, logistic_recall),
  F1_Score = c(adaboost_f1_score, nb_f1_score, knn_f1_score, logistic_f1_score),
  ROC_AUC = c(adaboost_auc, nb_auc, knn_auc, auc_value)
)

# Print the table
print(classification_comparison)
##                 Model Precision    Recall  F1_Score   ROC_AUC
## 1            AdaBoost 0.7866667 0.7384717 0.7618077 0.7752927
## 2         Naive Bayes 0.7848000 0.6462451 0.7088150 0.8162446
## 3                 KNN 0.8075970 0.6442688 0.7167461 0.8383042
## 4 Logistic Regression 0.7553376 0.8090235 0.7812593 0.8463893
library(tidyr)
library(ggplot2)

# Reshape data for plotting
comparison_plot_data <- classification_comparison %>%
  pivot_longer(cols = c(Precision, Recall, F1_Score, ROC_AUC), names_to = "Metric", values_to = "Value")

# Plot
ggplot(comparison_plot_data, aes(x = Model, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Comparison of Classification Metrics") +
  ylab("Metric Value") +
  xlab("Model") +
  theme_minimal()

plot(adaboost_roc_curve, col = "red", lwd = 2, main = "ROC Curves for Models")
plot(nb_roc_curve, col = "blue", lwd = 2, add = TRUE)
plot(knn_roc_curve, col = "green", lwd = 2, add = TRUE)
plot(roc_curve, col = "purple", lwd = 2, add = TRUE)
legend("bottomright", legend = c("AdaBoost", "Naive Bayes", "KNN", "Logistic Regression"),
       col = c("red", "blue", "green", "purple"), lwd = 2)

7.2 Objective 2: Customer Spending Patterns Prediction

7.2.1 Regression Algorithm 1: Random Forest (RF)

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)

# Train Random Forest model
rf_model <- randomForest(TotalCharges ~ ., data = train_data, ntree = 100)

# Make predictions on the test data
rf_predictions <- predict(rf_model, newdata = test_data)

# Evaluate the model using regression metrics
rf_results <- postResample(rf_predictions, test_data$TotalCharges)

# Compute MAE
rf_mae <- mean(abs(rf_predictions - test_data$TotalCharges))

# Extract RMSE and R-squared (interpreted as 'accuracy' here)
rf_rmse <- rf_results[["RMSE"]]
rf_rsquared <- rf_results[["Rsquared"]]

# Calculate 'accuracy' as the percentage of R-squared (for illustration purposes)
rf_accuracy_percent <- round(rf_rsquared * 100, 2)

# Print the results
cat("The root mean squared error (RMSE) is", round(rf_rmse, 4), ".\n")
## The root mean squared error (RMSE) is 0.0479 .
cat("The mean absolute error (MAE) is", round(rf_mae, 4), ".\n")
## The mean absolute error (MAE) is 0.028 .
cat("The R-squared value is", round(rf_rsquared, 4), ".\n")
## The R-squared value is 0.9978 .
cat("The accuracy is", rf_accuracy_percent, "%.\n")
## The accuracy is 99.78 %.
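Since TotalCharges accumulates MonthlyCharges over a customer's tenure, those two predictors would be expected to dominate the fit. randomForest's built-in importance measures (not part of the original output) can verify this:

# IncNodePurity: total reduction in residual sum of squares from
# splits on each predictor, summed over all trees.
importance(rf_model)
varImpPlot(rf_model, main = "Random Forest Variable Importance")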

7.2.2 Regression Algorithm 2: Support Vector Regression (SVR)

library(e1071)
library(caret)

# Train SVR model
svr_model <- svm(TotalCharges ~ ., data = train_data, type = "eps-regression")

# Make predictions on the test data
svr_predictions <- predict(svr_model, newdata = test_data)

# Evaluate the model using regression metrics
svr_results <- postResample(svr_predictions, test_data$TotalCharges)

# Compute MAE
svr_mae <- mean(abs(svr_predictions - test_data$TotalCharges))

# Extract RMSE and R-squared (interpreted as 'accuracy' here)
svr_rmse <- svr_results[["RMSE"]]
svr_rsquared <- svr_results[["Rsquared"]]

# Calculate 'accuracy' as the percentage of R-squared (for illustration purposes)
svr_accuracy_percent <- round(svr_rsquared * 100, 2)

# Print the results
cat("The root mean squared error (RMSE) is", round(svr_rmse, 4), ".\n")
## The root mean squared error (RMSE) is 0.0449 .
cat("The mean absolute error (MAE) is", round(svr_mae, 4), ".\n")
## The mean absolute error (MAE) is 0.0357 .
cat("The R-squared value is", round(svr_rsquared, 4), ".\n")
## The R-squared value is 0.9979 .
cat("The accuracy is", svr_accuracy_percent, "%.\n")
## The accuracy is 99.79 %.
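svm() was run with its defaults here. e1071::tune() can grid-search cost and epsilon, the parameters governing the bias-variance trade-off of eps-regression; a coarse sketch (computationally heavy on the full training set, and not used for the metrics above):

# Coarse grid search; tune() uses 10-fold cross-validation by default,
# and the winning fit is available as svr_tune$best.model.
set.seed(123)
svr_tune <- tune(svm, TotalCharges ~ ., data = train_data,
                 ranges = list(cost = c(0.1, 1, 10),
                               epsilon = c(0.01, 0.1, 0.5)))
summary(svr_tune)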

7.2.3 Regression Algorithm 3: Gradient Boosting (GBM)

library(gbm)
## Warning: package 'gbm' was built under R version 4.4.2
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(caret)

# Train GBM model
set.seed(123)
gbm_model <- gbm(TotalCharges ~ .,
                 data = train_data,
                 distribution = "gaussian",
                 n.trees = 500,
                 interaction.depth = 5,
                 shrinkage = 0.01,
                 cv.folds = 5)

# Make predictions on the test data
gbm_predictions <- predict(gbm_model, newdata = test_data, n.trees = 500)

# Evaluate the model using regression metrics
gbm_results <- postResample(gbm_predictions, test_data$TotalCharges)

# Compute MAE
gbm_mae <- mean(abs(gbm_predictions - test_data$TotalCharges))

# Extract RMSE and R-squared (interpreted as 'accuracy' here)
gbm_rmse <- gbm_results[["RMSE"]]
gbm_rsquared <- gbm_results[["Rsquared"]]

# Calculate 'accuracy' as the percentage of R-squared (for illustration purposes)
gbm_accuracy_percent <- round(gbm_rsquared * 100, 2)

# Print the results
cat("The root mean squared error (RMSE) is", round(gbm_rmse, 4), ".\n")
## The root mean squared error (RMSE) is 0.0434 .
cat("The mean absolute error (MAE) is", round(gbm_mae, 4), ".\n")
## The mean absolute error (MAE) is 0.0311 .
cat("The R-squared value is", round(gbm_rsquared, 4), ".\n")
## The R-squared value is 0.9982 .
cat("The accuracy is", gbm_accuracy_percent, "%.\n")
## The accuracy is 99.82 %.
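Because the model was fitted with cv.folds = 5, gbm stores a cross-validation error curve, and gbm.perf() returns the iteration that minimizes it; this may be fewer than the fixed 500 trees used for the predictions above. A sketch:

# CV-optimal number of boosting iterations (also plots the error curves).
best_iter <- gbm.perf(gbm_model, method = "cv")
gbm_predictions_cv <- predict(gbm_model, newdata = test_data, n.trees = best_iter)
postResample(gbm_predictions_cv, test_data$TotalCharges)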

7.2.4 Summary of Regression Results

# Create a data frame to summarize regression results
regression_comparison <- data.frame(
  Model = c("Random Forest", "SVR", "GBM"),
  RMSE = c(rf_rmse, svr_rmse, gbm_rmse),
  MAE = c(rf_mae, svr_mae, gbm_mae),
  R_Squared = c(rf_rsquared, svr_rsquared, gbm_rsquared)
)

# Print the comparison table
print(regression_comparison)
##           Model       RMSE        MAE R_Squared
## 1 Random Forest 0.04787860 0.02804931 0.9978233
## 2           SVR 0.04486929 0.03571839 0.9978528
## 3           GBM 0.04343913 0.03109167 0.9982029
library(ggplot2)
library(tidyr)

# Reshape data for plotting
regression_plot_data <- regression_comparison %>%
  pivot_longer(cols = c(RMSE, MAE, R_Squared), names_to = "Metric", values_to = "Value")

# Plot R-Squared separately
rsquared_plot <- ggplot(
  subset(regression_plot_data, Metric == "R_Squared"),
  aes(x = Model, y = Value, fill = Metric)
) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("R-Squared Comparison of Regression Models") +
  ylab("R-Squared Value") +
  xlab("Model") +
  theme_minimal()

# Plot RMSE and MAE together
rmse_mae_plot <- ggplot(
  subset(regression_plot_data, Metric %in% c("RMSE", "MAE")),
  aes(x = Model, y = Value, fill = Metric)
) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("RMSE and MAE Comparison of Regression Models") +
  ylab("Metric Value") +
  xlab("Model") +
  theme_minimal()

# Print plots
print(rsquared_plot)

print(rmse_mae_plot)

8 Conclusion

8.1 Churn Prediction (Classification Model)

  • The best model was built with the Logistic Regression (LR) algorithm.
  • It achieved an accuracy of 76.63% on the test set and the highest F1-score (0.7813) among the four classifiers.
  • With the highest ROC-AUC score of 0.8464, it ranks likely churners most reliably, making it well suited for identifying customers at risk of churning.
  • Logistic regression also balances simplicity and interpretability while providing robust performance, making it a practical choice for this classification task.

8.2 Customer Spending Patterns Prediction (Regression Model)

  • The best model was built with the Random Forest (RF) algorithm.
  • It achieved an R-squared of 0.9978 on the test set, explaining 99.78% of the variance in customer spending.
  • Random Forest has the lowest Mean Absolute Error (MAE) of 0.0280 among the three models; GBM attains a marginally lower RMSE and higher R-squared, so the two are effectively tied on those metrics.
  • Random Forest is well-suited for regression tasks because it effectively handles nonlinear relationships, reduces overfitting through ensemble averaging, and provides reliable predictions even in the presence of noise.