Member detail

Wei Xiaoting 24232389, Lee Sijie S2018064, Liu Yiqian 24229675, Mohammad Irfanul Haque S2012063, Shi Hui 25070378

Introduction

Telecommunication is a competitive and subscription-based industry. Customers pay monthly, so losing a subscriber directly reduces recurring revenue. Customer churn refers to subscribers who discontinue their mobile, internet or TV services and switch to another telecommunication service provider. High customer churn typically indicates dissatisfaction caused by poor network coverage, cheaper offers or bundled deals from competitors, poor customer service (long waiting times, unresolved issues), limited data or slow internet speeds, billing disputes, or a lack of loyalty programs and incentives. In this project, we analyze the Telco Customer Churn dataset obtained from Kaggle to detect customer churn and to predict the total billing charges of a customer who signed up for services with a telecommunication company.

Objectives

Our goal is to clean the raw data, handle missing values and perform exploratory data analysis (EDA) to understand customer behaviour. After data cleaning is completed, the cleaned and processed dataset is used to solve two machine learning problems. The objectives of this project are:
1. To develop machine learning-based classification models to detect the customer churn of a telecommunication company.
2. To develop machine learning-based regression models to predict the total billing charges of a customer.
3. To evaluate the performance of the classification models with accuracy, precision, recall and F1 score.
4. To evaluate the performance of the regression models with mean absolute error (MAE), root mean squared error (RMSE) and R-squared (R2).
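For reference, these metrics can be computed directly in R. The sketch below assumes hypothetical vectors actual and predicted for classification (with "Yes" as the positive class) and obs and pred for regression; in the report itself we rely on caret's confusionMatrix(), RMSE() and R2() helpers.

cm <- table(Predicted = predicted, Actual = actual)   # 2x2 confusion matrix
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])      # TP / (TP + FP)
recall    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])      # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

mae  <- mean(abs(pred - obs))
rmse <- sqrt(mean((pred - obs)^2))
r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)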

Data Cleaning

Load R packages

First, we load the necessary R packages. We use tidyverse for data manipulation and data visualization, plus skimr for data summary.

library(tidyverse) 
library(skimr)

Read dataset

Next, we import the raw dataset, which contains missing values, using the read.csv() function. Empty strings are replaced with NA, and all character columns are then converted into factors.

raw_data <- read.csv("telco_churn_dirty.csv", stringsAsFactors = FALSE)
raw_data[raw_data == ""] <- NA
raw_data <- raw_data %>% mutate(across(where(is.character), as.factor))

View dimensions and structures

In order to fully understand the dataset, we employed four different methods to study its dimensions and structure; this comparison helps us identify the data types and potential issues.

dim(): Returns the dimensions (rows and columns) to verify the shape and size of the dataset. The result shows that the dataset consists of 7043 rows and 21 columns. Each row represents a customer of the telecommunication company and each column holds a customer attribute, covering demographics, account information and subscribed services.

dim(raw_data) 
## [1] 7043   21

str(): A Base R function that shows the internal structure of the dataset, detailing data types such as integer, numeric and factor.

str(raw_data)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

glimpse(): A tidyverse function similar to str(), but produces a more compact and readable summary of the dataset.

glimpse(raw_data) 
## Rows: 7,043
## Columns: 21
## $ customerID       <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-…
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…

summary(): Provides a statistical overview of the dataset by identifying the data distribution, central tendency, spread and potential anomalies. For numerical attributes, it returns the minimum, 1st quartile, median, mean, 3rd quartile and maximum. For categorical attributes, it returns the frequency of each level.

summary(raw_data)
##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.39                                                          
##  3rd Qu.:55.50                                                          
##  Max.   :72.00                                                          
##  NA's   :140                                                            
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2810   No                 :2785   Month-to-month:3875  
##  No internet service:1526   No internet service:1526   One year      :1473  
##  Yes                :2707   Yes                :2732   Two year      :1695  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling                   PaymentMethod  MonthlyCharges  
##  No :2872         Bank transfer (automatic):1467   Min.   : 18.25  
##  Yes:4171         Credit card (automatic)  :1439   1st Qu.: 35.65  
##                   Electronic check         :2248   Median : 70.35  
##                   Mailed check             :1537   Mean   : 64.78  
##                   NA's                     : 352   3rd Qu.: 89.90  
##                                                    Max.   :118.75  
##                                                    NA's   :352     
##   TotalCharges    Churn     
##  Min.   :  18.8   No :5174  
##  1st Qu.: 401.4   Yes:1869  
##  Median :1397.5             
##  Mean   :2283.3             
##  3rd Qu.:3794.7             
##  Max.   :8684.8             
##  NA's   :11

Check for missing values

Identify missing values: Count the NA values in each column (empty strings were already converted to NA when the data was loaded).

missing_values <- colSums(is.na(raw_data)) 
missing_values[missing_values > 0]
##         tenure  PaymentMethod MonthlyCharges   TotalCharges 
##            140            352            352             11
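As an optional cross-check, skimr (loaded earlier) summarizes missingness per column; its n_missing and complete_rate columns should agree with the counts above.

skim(raw_data)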

Based on the data exploration above, we have formulated the following data cleaning strategies:

Remove rows with missing critical values

Clean column tenure: Remove the rows with missing values, since tenure is a critical time-based feature that cannot be imputed reliably.

clean_data <- raw_data 
clean_data <- clean_data %>% filter(!is.na(tenure))

Convert SeniorCitizen to factor (0/1 categorical)

clean_data$SeniorCitizen <- factor(clean_data$SeniorCitizen,
                                   levels = c(0, 1),
                                   labels = c("No", "Yes"))

Impute missing values with median

Clean column MonthlyCharges: Impute the missing numerical values with the median, which is robust to outliers: unusually high or low monthly charges will not skew the imputed value, so the central tendency is preserved.

median1 <- median(clean_data$MonthlyCharges, na.rm = TRUE)
clean_data$MonthlyCharges[is.na(clean_data$MonthlyCharges)] <- median1

Impute missing values with mode

Clean column PaymentMethod: Impute missing categorical values using the mode (most frequent value) to maintain the existing distribution of categories.

# Return the most frequent value (mode) of a vector
get_mode <- function(value) {
  uniqv <- unique(value)
  uniqv[which.max(tabulate(match(value, uniqv)))]
}

mode_payment <- get_mode(clean_data$PaymentMethod[!is.na(clean_data$PaymentMethod)]) 
clean_data$PaymentMethod[is.na(clean_data$PaymentMethod)] <- mode_payment

Impute missing values with mean

Clean column TotalCharges: Impute the missing numerical values with the mean so that total revenue estimates are not biased by the missing values and the overall average is preserved.

mean_total <- mean(clean_data$TotalCharges, na.rm = TRUE)
clean_data$TotalCharges[is.na(clean_data$TotalCharges)] <- mean_total

Verify the missing values

Make sure all missing values are handled.

sum(is.na(clean_data))
## [1] 0

Save cleaned dataset

The cleaned dataset is written to a new CSV file using the write.csv() function and saved as telco_churn_cleaned.csv.

write.csv(clean_data, "telco_churn_cleaned.csv", row.names = FALSE) 
print("Cleaned dataset saved as 'telco_churn_cleaned.csv'.")
## [1] "Cleaned dataset saved as 'telco_churn_cleaned.csv'."

Now that the data is clean, we can visualize the distribution of customer churn and other key features.

Exploratory Data Analysis (EDA)

Exploratory data analysis allows us to uncover patterns and anomalies in the data. We focus on the relationship between customer attributes and the target variable Churn.

Number of customer churn (lost users)

Before building any model, we must check the distribution of the target variable to determine whether it is balanced or imbalanced. If one class dominates the other, accuracy becomes a misleading metric; hence, we need to establish a baseline.

ggplot(clean_data, aes(x = Churn, fill = Churn)) + 
  geom_bar() + 
  geom_text(stat='count', aes(label=..count..), vjust=-0.5) + 
  theme_minimal() + 
  labs(title = "Number of customer churn (lost users)", y = "Count")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Observation:

Customer churn is divided into two categories: “Yes” and “No.” From the result, we can see that the telecommunication company managed to retain the majority of its customers, with a count of 5072 users. However, 1831 users, approximately 26.5% of the total, were lost. The number of customers who did not churn is significantly higher than the number who did, indicating an imbalanced target with a relatively low churn rate.
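As a concrete baseline, a naive model that always predicts “No” would already be about 73.5% accurate, which is why precision, recall and F1 score are reported alongside accuracy in the modelling section.

prop.table(table(clean_data$Churn))   # approx. 0.735 "No" vs 0.265 "Yes"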

Distribution of monthly charge

We hypothesize that customers with higher monthly charges are more likely to leave. By visualizing the distribution of monthly charges for both retained and churned customers, we can see whether price sensitivity drives customer churn at the telecommunication company.

ggplot(clean_data, aes(x = MonthlyCharges, fill = Churn)) + 
  geom_histogram(binwidth = 5, alpha = 0.7, position = "identity") + 
  theme_minimal() + 
  labs(title = "Distribution of Monthly Charges by Churn Status")

Observation:

The chart shows that when the churn status is “Yes”, the distribution peaks at higher values than when the churn status is “No”. This suggests that high-paying customers are at greater risk of leaving.

Relationship between payment method and churn

Payment friction can cause dissatisfaction among customers. Thus, we analyze the churn rate within each payment method to identify whether specific methods, such as electronic check, correlate with higher churn.

ggplot(clean_data, aes(x = PaymentMethod, fill = Churn)) + 
  geom_bar(position = "fill") + 
  theme_minimal() + 
  labs(title = "Churn Proportion by Payment Method", y = "Proportion") +
  coord_flip()

Observation:

“Electronic check” has a significantly larger churn proportion, as shown by the blue bar, compared to automatic payment methods like “Credit card” and “Bank transfer”. The “Credit card” and “Bank transfer” methods show the lowest churn proportions, indicating that customers using these automatic payment methods are more likely to stay.

Data Modelling

Task 1: Customer Churn Detection Using Classification

In order to detect the customer churn of the telecommunications company, customers are classified into two categories, whereby “Yes” indicates that the user has stopped using the service and “No” indicates that the user is still utilizing the service. Based on this classification, we have developed three machine learning models: Decision Tree, Random Forest and Logistic Regression.

Load required R packages

Before beginning the classification task, the necessary R packages are loaded. We use caret to streamline the model training process for classification and regression problems. Next, rpart implements recursive partitioning for the Decision Tree and rpart.plot visualizes the resulting tree. For the Random Forest, we use randomForest, which builds the model by combining multiple decision trees based on Breiman’s random forest algorithm. Lastly, ggplot2 is used for data visualization.

library(caret) 
library(rpart)        
library(rpart.plot)   
library(randomForest) 
library(ggplot2)

Set seed for reproducibility

The call set.seed(123) ensures that random processes are reproducible during model training. It guarantees the same data splits, model initialization and results across executions, as long as the code and environment remain the same.

set.seed(123) 

Perform train-test split

Stratified sampling is used to split the cleaned data into 80% for training and 20% for testing based on the target variable Churn so that both sets have similar class proportions. The indices returned by the createDataPartition() function are used to subset rows for train_data and the remaining rows for test_data. From the result, we can see that 5523 instances are used for training the models and 1380 instances are used for testing the models.

index <- createDataPartition(clean_data$Churn, p = 0.8, list = FALSE)
train_data <- clean_data[index, ]
nrow(train_data)
## [1] 5523
test_data  <- clean_data[-index, ]
nrow(test_data)
## [1] 1380

Decision Tree

Train Decision Tree

We trained the Decision Tree model by using the rpart() function to predict the target variable Churn from all features except customerID in the training dataset. It uses the Gini index for splits and controls tree complexity with a minimum split size of 10 and a complexity parameter of 0.0034.

dt_model <- rpart(Churn ~ . -customerID,
                  data = train_data,
                  method = "class",
                  parms = list(split = "gini"),
                  control = rpart.control(minsplit = 10, cp = 0.0034))
dt_model
## n= 5523 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 5523 1465 No (0.73474561 0.26525439)  
##     2) Contract=One year,Two year 2511  169 No (0.93269614 0.06730386) *
##     3) Contract=Month-to-month 3012 1296 No (0.56972112 0.43027888)  
##       6) InternetService=DSL,No 1353  388 No (0.71322986 0.28677014)  
##        12) tenure>=5.5 775  138 No (0.82193548 0.17806452) *
##        13) tenure< 5.5 578  250 No (0.56747405 0.43252595)  
##          26) TechSupport=No internet service,Yes 279   81 No (0.70967742 0.29032258) *
##          27) TechSupport=No 299  130 Yes (0.43478261 0.56521739)  
##            54) SeniorCitizen=No 257  122 Yes (0.47470817 0.52529183)  
##             108) OnlineSecurity=Yes 41   14 No (0.65853659 0.34146341) *
##             109) OnlineSecurity=No 216   95 Yes (0.43981481 0.56018519) *
##            55) SeniorCitizen=Yes 42    8 Yes (0.19047619 0.80952381) *
##       7) InternetService=Fiber optic 1659  751 Yes (0.45268234 0.54731766)  
##        14) tenure>=13.5 910  380 No (0.58241758 0.41758242)  
##          28) PaymentMethod=Bank transfer (automatic),Credit card (automatic),Mailed check 389  121 No (0.68894602 0.31105398) *
##          29) PaymentMethod=Electronic check 521  259 No (0.50287908 0.49712092)  
##            58) tenure>=55.5 75   20 No (0.73333333 0.26666667) *
##            59) tenure< 55.5 446  207 Yes (0.46412556 0.53587444)  
##             118) TotalCharges>=3083.6 171   80 No (0.53216374 0.46783626)  
##               236) TotalCharges< 4127.475 92   32 No (0.65217391 0.34782609) *
##               237) TotalCharges>=4127.475 79   31 Yes (0.39240506 0.60759494) *
##             119) TotalCharges< 3083.6 275  116 Yes (0.42181818 0.57818182) *
##        15) tenure< 13.5 749  221 Yes (0.29506008 0.70493992) *

Plot Decision Tree

The Decision Tree is plotted, showing the split conditions in a top-down layout. Additional information such as the predicted class, class probability and percentage of observations reaching each node are also displayed. Based on the Decision Tree:

  • Contract type is the most important factor. 1 year or 2 year contracts have very low churn, approximately 93.3% do not churn. Month-to-month contracts have a higher risk of churn, around 57.0% and this group is explored further. This shows that longer contracts dramatically reduce churn.
  • For customers without long-term contracts, churn depends on internet service type. Those who use DSL or no internet have lower churn compared to those who use fiber optic.
  • Tenure is a strong predictor of churn. Early-stage customers are the most likely to churn: among month-to-month customers on DSL or no internet, churn risk rises from 17.8% to 43.3% when tenure falls below 6 months, and month-to-month fiber optic customers with tenure below about 14 months churn at 70.5%. In contrast, tenure beyond roughly 4.5 years cuts churn risk in half, from 53.6% to 26.7%.
  • Among short-tenure or month-to-month customers, support and security services significantly reduce churn. Customers without these services are much more likely to leave.
  • Payment method affects churn: customers who pay by electronic check have higher churn rates (49.7%) whereas automatic payments such as bank transfer and credit card have lower churn rates (31.1%).
  • Senior citizens are at a greater risk of churn, especially when combined with month-to-month contracts, no technical support and short tenure.

rpart.plot(dt_model,
           type = 2, 
           extra = 104, 
           fallen.leaves = TRUE,
           box.palette = "auto",
           main = "Decision Tree for Customer Churn Detection")

Generate predictions with Decision Tree

After training the Decision Tree model, we predict customer churn on the testing dataset by using the predict() function.

dt_pred <- predict(dt_model, test_data, type = "class")

Evaluate performance of Decision Tree

We evaluate the performance of the Decision Tree by computing a confusion matrix dt_conf to compare the predicted churn outcomes to the actual churn values.

dt_conf <- confusionMatrix(dt_pred, test_data$Churn, positive = "Yes")
dt_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  876 157
##        Yes 138 209
##                                           
##                Accuracy : 0.7862          
##                  95% CI : (0.7636, 0.8076)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 5.604e-06       
##                                           
##                   Kappa : 0.4423          
##                                           
##  Mcnemar's Test P-Value : 0.2946          
##                                           
##             Sensitivity : 0.5710          
##             Specificity : 0.8639          
##          Pos Pred Value : 0.6023          
##          Neg Pred Value : 0.8480          
##              Prevalence : 0.2652          
##          Detection Rate : 0.1514          
##    Detection Prevalence : 0.2514          
##       Balanced Accuracy : 0.7175          
##                                           
##        'Positive' Class : Yes             
## 

Visualize confusion matrix of Decision Tree

Based on the confusion matrix, the Decision Tree model achieves an accuracy of 78.62%, meaning it correctly predicts the outcome for most cases. It is particularly strong at identifying customers who do not churn (“No”), with a specificity of 86.39%, but shows only moderate performance in recognizing customers who do churn (“Yes”), with a recall of 57.10%. The precision of the Decision Tree for positive predictions is 60.23%, while its negative predictive value is higher at 84.80%. Given that only about 26.52% of the dataset consists of actual “Yes” cases, the model performs reasonably well on the majority class but struggles with the minority class. While the overall accuracy is acceptable, the lower recall in this imbalanced classification may limit the detection of customers leaving the telecommunication company.

dt_cm_table <- as.table(dt_conf$table)

dt_cm_df <- as.data.frame(dt_cm_table)
colnames(dt_cm_df) <- c("Prediction", "Reference", "Freq")

dt_cm_df$Type <- with(dt_cm_df,  ifelse((Prediction == "Yes" & Reference == "Yes") | (Prediction == "No" & Reference == "No"), "Correct (TP/TN)", "Incorrect (FP/FN)"))

ggplot(dt_cm_df, aes(x = Reference, y = Prediction, fill = Type, label = Freq)) +
  geom_tile(color = "black", alpha = 0.7) + geom_text(size = 6) +
  scale_fill_manual(values = c("Correct (TP/TN)" = "skyblue",
                               "Incorrect (FP/FN)" = "salmon")) + 
  labs(title = "Confusion Matrix of Decision Tree", x = "Actual", 
       y = "Predicted") + theme_minimal(base_size = 15)

Random Forest

Train Random Forest

We trained the Random Forest model by using the randomForest() function to predict the target variable Churn from all features except customerID in the training dataset. It specifies 500 trees and 4 variables considered at each split while also calculating variable importance to identify which predictors contribute most to the model.

rf_model <- randomForest(Churn ~ . -customerID,
                         data = train_data,
                         ntree = 500,
                         mtry = 4,
                         importance = TRUE)
rf_model
## 
## Call:
##  randomForest(formula = Churn ~ . - customerID, data = train_data,      ntree = 500, mtry = 4, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 20.44%
## Confusion matrix:
##       No Yes class.error
## No  3650 408   0.1005421
## Yes  721 744   0.4921502

Plot variable importance of Random Forest

The importance of features in the Random Forest is evaluated and visualized. The importance() function retrieves the numerical importance scores while the varImpPlot() creates a plot to display which features contribute most to the model.

MeanDecreaseAccuracy measures how much the accuracy of the Random Forest decreases when the values of a feature are randomly shuffled. A higher value means the feature is more important for making correct predictions. The top 8 features based on MeanDecreaseAccuracy include:

  • tenure
  • TotalCharges
  • Contract
  • MonthlyCharges
  • InternetService
  • TechSupport
  • OnlineSecurity
  • PaperlessBilling

MeanDecreaseGini measures the total decrease in node impurity averaged over all trees when a feature is used to split the data. A higher value means the feature is used more often or earlier in splits, contributes more to splitting nodes and reduces impurity effectively. The top 8 features based on MeanDecreaseGini include:

  • TotalCharges
  • tenure
  • MonthlyCharges
  • Contract
  • PaymentMethod
  • OnlineSecurity
  • TechSupport
  • InternetService

TotalCharges, MonthlyCharges, Contract and tenure are consistently top features in both metrics, indicating they are strong predictors of the Random Forest’s decisions.

importance(rf_model)
##                          No        Yes MeanDecreaseAccuracy MeanDecreaseGini
## gender           -0.5174706 -3.7061547            -2.869033        46.029488
## SeniorCitizen     5.0703889  3.7106840             6.430713        36.688691
## Partner           2.1048771 -5.1536578            -1.554991        39.748978
## Dependents       -2.0605871  8.1775086             4.222721        34.517107
## tenure           25.7108724 26.7821187            47.073919       316.341769
## PhoneService     -1.3763118 10.2866812             6.146633         8.863325
## MultipleLines     6.1700692 11.9754297            14.032826        46.511554
## InternetService  15.2300075 22.1328815            25.412103        75.814673
## OnlineSecurity   10.3836819 25.8632971            19.633618        94.951937
## OnlineBackup      8.7320811  8.9530164            11.817432        50.311576
## DeviceProtection 10.5450509 -6.8127494             7.537554        42.587469
## TechSupport       8.9405146 31.5264404            21.709602        81.017142
## StreamingTV       9.9677211  1.1161928             9.346541        39.196849
## StreamingMovies  13.5287261 -0.8537165            12.060363        40.110479
## Contract          2.2170094 34.7673948            36.805084       164.699772
## PaperlessBilling -1.7604984 21.6614835            15.155278        46.989941
## PaymentMethod     1.7768121 10.3882595             8.605394       110.170671
## MonthlyCharges   17.8428263 14.9600394            27.032589       299.800488
## TotalCharges     29.4515793 15.6623291            42.322819       341.716945
varImpPlot(rf_model, main = "Random Forest Feature Importance")

Generate predictions with Random Forest

After training the Random Forest model, we predict customer churn on the testing dataset by using the predict() function.

rf_pred <- predict(rf_model, test_data)

Evaluate performance of Random Forest

We evaluate the performance of the Random Forest by computing a confusion matrix rf_conf to compare the predicted churn outcomes to the actual churn values.

rf_conf <- confusionMatrix(rf_pred, test_data$Churn, positive = "Yes")
rf_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  911 164
##        Yes 103 202
##                                           
##                Accuracy : 0.8065          
##                  95% CI : (0.7847, 0.8271)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 2.771e-10       
##                                           
##                   Kappa : 0.4757          
##                                           
##  Mcnemar's Test P-Value : 0.0002407       
##                                           
##             Sensitivity : 0.5519          
##             Specificity : 0.8984          
##          Pos Pred Value : 0.6623          
##          Neg Pred Value : 0.8474          
##              Prevalence : 0.2652          
##          Detection Rate : 0.1464          
##    Detection Prevalence : 0.2210          
##       Balanced Accuracy : 0.7252          
##                                           
##        'Positive' Class : Yes             
## 

Visualize confusion matrix of Random Forest

Based on the confusion matrix, the Random Forest model achieves an accuracy of 80.65%, correctly predicting both classes, “Yes” and “No”, in 1113 out of 1380 cases. Similarly to the Decision Tree model, it excels at identifying customers who do not churn, with a specificity of 89.84%, but struggles to recognize customers who do churn, showing a sensitivity of only 55.19%. While the Random Forest model is reliable when it predicts “No”, with a high negative predictive value of 84.74%, its precision for “Yes” predictions is lower, with a positive predictive value of 66.23%. This suggests it is conservative in detecting customers lost by the telecommunication company and is better suited for scenarios where correctly ruling out retained customers is the priority.

rf_cm_table <- as.table(rf_conf$table)

rf_cm_df <- as.data.frame(rf_cm_table)
colnames(rf_cm_df) <- c("Prediction", "Reference", "Freq")

rf_cm_df$Type <- with(rf_cm_df,  ifelse((Prediction == "Yes" & Reference == "Yes") | (Prediction == "No" & Reference == "No"), "Correct (TP/TN)", "Incorrect (FP/FN)"))

ggplot(rf_cm_df, aes(x = Reference, y = Prediction, fill = Type, label = Freq)) +
  geom_tile(color = "black", alpha = 0.7) + geom_text(size = 6) +
  scale_fill_manual(values = c("Correct (TP/TN)" = "skyblue",
                               "Incorrect (FP/FN)" = "salmon")) +
  labs(title = "Confusion Matrix of Random Forest", x = "Actual",
       y = "Predicted") + theme_minimal(base_size = 15)

Logistic Regression

Logistic Regression is applied as a baseline classification technique to predict customer churn. This model estimates the probability of churn by modelling the relationship between the predictor variables and the binary response variable Churn using a logistic (sigmoid) function. All predictor variables, excluding customerID, are used to train the model on the training dataset.
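As a brief illustration of the logistic link, the linear predictor is mapped onto a probability between 0 and 1 by the sigmoid function; the values below are arbitrary examples for illustration, not fitted coefficients.

sigmoid <- function(eta) 1 / (1 + exp(-eta))   # equivalent to plogis(eta)
sigmoid(c(-2, 0, 2))                           # 0.119, 0.500, 0.881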

Create datasets specifically for Logistic Regression (remove customerID)

train_glm <- train_data %>% select(-customerID)
test_glm  <- test_data  %>% select(-customerID)

Train Logistic Regression

log_model <- glm(
  Churn ~ .,
  data = train_glm,
  family = binomial(link = "logit")
)

summary(log_model)
## 
## Call:
## glm(formula = Churn ~ ., family = binomial(link = "logit"), data = train_glm)
## 
## Coefficients: (7 not defined because of singularities)
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           9.633e-02  2.581e-01   0.373 0.709035    
## genderMale                           -7.738e-02  7.342e-02  -1.054 0.291869    
## SeniorCitizenYes                      1.796e-01  9.559e-02   1.879 0.060271 .  
## PartnerYes                            3.155e-02  8.805e-02   0.358 0.720144    
## DependentsYes                        -1.796e-01  1.023e-01  -1.757 0.078945 .  
## tenure                               -6.205e-02  6.967e-03  -8.907  < 2e-16 ***
## PhoneServiceYes                      -6.724e-01  1.927e-01  -3.489 0.000485 ***
## MultipleLinesNo phone service                NA         NA      NA       NA    
## MultipleLinesYes                      2.112e-01  9.534e-02   2.215 0.026741 *  
## InternetServiceFiber optic            7.183e-01  1.884e-01   3.813 0.000137 ***
## InternetServiceNo                    -7.881e-01  2.163e-01  -3.644 0.000269 ***
## OnlineSecurityNo internet service            NA         NA      NA       NA    
## OnlineSecurityYes                    -3.624e-01  1.017e-01  -3.565 0.000364 ***
## OnlineBackupNo internet service              NA         NA      NA       NA    
## OnlineBackupYes                      -1.891e-01  9.217e-02  -2.052 0.040174 *  
## DeviceProtectionNo internet service          NA         NA      NA       NA    
## DeviceProtectionYes                   1.981e-02  9.437e-02   0.210 0.833700    
## TechSupportNo internet service               NA         NA      NA       NA    
## TechSupportYes                       -4.254e-01  1.022e-01  -4.162 3.15e-05 ***
## StreamingTVNo internet service               NA         NA      NA       NA    
## StreamingTVYes                        2.060e-01  1.093e-01   1.885 0.059451 .  
## StreamingMoviesNo internet service           NA         NA      NA       NA    
## StreamingMoviesYes                    1.797e-01  1.114e-01   1.613 0.106806    
## ContractOne year                     -6.925e-01  1.206e-01  -5.744 9.27e-09 ***
## ContractTwo year                     -1.565e+00  2.061e-01  -7.591 3.18e-14 ***
## PaperlessBillingYes                   3.755e-01  8.412e-02   4.465 8.02e-06 ***
## PaymentMethodCredit card (automatic) -1.244e-02  1.332e-01  -0.093 0.925587    
## PaymentMethodElectronic check         3.198e-01  1.089e-01   2.936 0.003329 ** 
## PaymentMethodMailed check             3.759e-02  1.339e-01   0.281 0.778972    
## MonthlyCharges                        1.214e-03  6.371e-03   0.191 0.848888    
## TotalCharges                          3.456e-04  7.922e-05   4.362 1.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6389.9  on 5522  degrees of freedom
## Residual deviance: 4535.0  on 5499  degrees of freedom
## AIC: 4583
## 
## Number of Fisher Scoring iterations: 6

Generate predictions with Logistic Regression

log_prob <- predict(log_model, newdata = test_data, type = "response")

log_pred <- ifelse(log_prob > 0.5, "Yes", "No")
log_pred <- factor(log_pred, levels = c("No", "Yes"))

Evaluate performance of Logistic Regression

The performance of the Logistic Regression model is evaluated using a confusion matrix. The confusion matrix summarizes the model’s prediction results by comparing predicted churn outcomes with actual churn labels. Key evaluation metrics such as accuracy, precision, recall, and F1-score are derived to assess the model’s effectiveness.

log_conf <- confusionMatrix(log_pred, test_data$Churn, positive = "Yes")
log_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  900 155
##        Yes 114 211
##                                           
##                Accuracy : 0.8051          
##                  95% CI : (0.7832, 0.8257)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 6.308e-10       
##                                           
##                   Kappa : 0.4813          
##                                           
##  Mcnemar's Test P-Value : 0.01473         
##                                           
##             Sensitivity : 0.5765          
##             Specificity : 0.8876          
##          Pos Pred Value : 0.6492          
##          Neg Pred Value : 0.8531          
##              Prevalence : 0.2652          
##          Detection Rate : 0.1529          
##    Detection Prevalence : 0.2355          
##       Balanced Accuracy : 0.7320          
##                                           
##        'Positive' Class : Yes             
## 

Visualize confusion matrix of Logistic Regression

Based on the confusion matrix, the Logistic Regression model achieves an accuracy of 80.51%, correctly predicting both classes, “Yes” and “No”, in 1111 out of 1380 cases. The model performs well in identifying customers who do not churn, as indicated by a high specificity of 88.76%. However, it shows relatively lower performance in identifying customers who churn, with a sensitivity of 57.65%. This means the model correctly identifies 211 churned customers, while 155 churn cases are still misclassified as non-churn. The positive predictive value of 64.92% suggests that when the model predicts churn, it is correct in the majority of cases, while the negative predictive value of 85.31% indicates strong performance in predicting customers who remain with the service.

Overall, the confusion matrix provides a clear understanding of how the Logistic Regression model distributes its predictions across churned and non-churned customers, highlighting its strengths in identifying retained customers and its limitations in detecting churned customers.

log_cm_table <- as.table(log_conf$table)

log_cm_df <- as.data.frame(log_cm_table)
colnames(log_cm_df) <- c("Prediction", "Reference", "Freq")

log_cm_df$Type <- with(
  log_cm_df,
  ifelse(
    (Prediction == "Yes" & Reference == "Yes") |
    (Prediction == "No" & Reference == "No"),
    "Correct (TP/TN)",
    "Incorrect (FP/FN)"
  )
)

ggplot(log_cm_df, aes(x = Reference, y = Prediction, fill = Type, label = Freq)) +
  geom_tile(color = "black", alpha = 0.7) +
  geom_text(size = 6) +
  scale_fill_manual(
    values = c(
      "Correct (TP/TN)" = "skyblue",
      "Incorrect (FP/FN)" = "salmon"
    )
  ) +
  labs(
    title = "Confusion Matrix of Logistic Regression",
    x = "Actual",
    y = "Predicted"
  ) +
  theme_minimal(base_size = 15)

Extract performance metrics for Decision Tree

We create a data frame named dt_metrics to summarize the performance of the Decision Tree by storing the accuracy, precision, recall and F1 score calculated from the confusion matrix dt_conf.

dt_metrics <- data.frame(Model = "Decision Tree",
                         Accuracy = dt_conf$overall['Accuracy'],
                         Precision = dt_conf$byClass['Precision'],
                         Recall = dt_conf$byClass['Recall'],
                         F1_Score = dt_conf$byClass['F1'])

Extract performance metrics for Random Forest

The same is done for the Random Forest, in which a data frame called rf_metrics is generated to capture the model performance by recording its accuracy, precision, recall and F1 score derived from the confusion matrix rf_conf.

rf_metrics <- data.frame(Model = "Random Forest",
                         Accuracy = rf_conf$overall['Accuracy'],
                         Precision = rf_conf$byClass['Precision'],
                         Recall = rf_conf$byClass['Recall'],
                         F1_Score = rf_conf$byClass['F1'])

Extract performance metrics for Logistic Regression

Performance metrics from the Logistic Regression confusion matrix are extracted and stored for comparison with other classification models. These metrics are later used to evaluate the relative performance of Logistic Regression against Decision Tree and Random Forest models.

log_metrics <- data.frame(Model = "Logistic Regression",
                          Accuracy  = log_conf$overall["Accuracy"],
                          Precision = log_conf$byClass["Precision"],
                          Recall    = log_conf$byClass["Recall"],
                          F1_Score  = log_conf$byClass["F1"]
)

Comparison of model performance for classification of customer churn

performance <- rbind(dt_metrics, rf_metrics, log_metrics)
performance
##                         Model  Accuracy Precision    Recall  F1_Score
## Accuracy        Decision Tree 0.7862319 0.6023055 0.5710383 0.5862553
## Accuracy1       Random Forest 0.8065217 0.6622951 0.5519126 0.6020864
## Accuracy2 Logistic Regression 0.8050725 0.6492308 0.5765027 0.6107091

Result discussion for classification of customer churn

Based on the performance metrics obtained, all three classification models demonstrate comparable accuracy levels of approximately 80%, indicating that each model is capable of learning meaningful patterns from the dataset. Logistic Regression achieves an accuracy of 80.51%, while Decision Tree and Random Forest achieve accuracies of 78.62% and 80.65%, respectively.

In terms of precision, Random Forest records the highest value (66.23%), followed by Logistic Regression (64.92%) and Decision Tree (60.23%). This suggests that Random Forest is more effective at correctly identifying churned customers when a churn prediction is made. When considering recall, however, Logistic Regression achieves the highest value (57.65%), followed by Decision Tree (57.10%) and Random Forest (55.19%), indicating a slightly stronger ability to identify actual churn cases.

The F1-score, which balances precision and recall, further highlights the overall effectiveness of each model. Logistic Regression achieves the highest F1-score (0.6107), followed closely by Random Forest (0.6021) and Decision Tree (0.5863). This indicates that while all models perform reasonably well, Logistic Regression and Random Forest provide a more balanced trade-off between identifying churned customers and avoiding incorrect churn predictions.
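For example, the Logistic Regression F1-score can be reproduced from its precision and recall, since F1 is their harmonic mean:

2 * 0.6492 * 0.5765 / (0.6492 + 0.5765)   # approx. 0.6107, matching the table above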

Overall, the comparison shows that each model has distinct strengths in customer churn prediction. Logistic Regression offers stable, interpretable performance with the highest recall and F1-score, Random Forest demonstrates the strongest precision and overall accuracy, and the Decision Tree, while slightly less accurate, provides an easily interpretable set of churn rules.

Task 2: Total Billing Charges Prediction Using Regression

In this section, we apply regression models to predict the total billing charges. The goal is to quantify the relationship between customer tenure, subscribed services and the total bill. The regression models used are Multiple Linear Regression (OLS) as a baseline, Lasso Regression, which is built to handle potential multicollinearity and perform feature selection, and Random Forest Regression as a non-linear ensemble model.

Load required R packages

library(caret)    # for data splitting and evaluation
library(glmnet)   # for Lasso regression

Import the dataset

df <- read.csv("telco_churn_cleaned.csv")

Key Preprocessing Step: Remove the ID column

# The first column is customerID. If we keep it, the model will use it as a feature for training, which is incorrect.
df_model <- df %>% select(-customerID) 

Split the dataset into training (70%) and testing (30%) sets

# Set seed to ensure reproducibility
set.seed(123)

trainIndex <- createDataPartition(df_model$TotalCharges, p = .7, 
                                  list = FALSE, 
                                  times = 1)
train_data <- df_model[ trainIndex,]
test_data  <- df_model[-trainIndex,]

print("Data splitting completed.")
## [1] "Data splitting completed."

Multiple Linear Regression

We first establish a baseline using standard Ordinary Least Squares (OLS) regression.

Train Multiple Linear Regression

lm_model <- lm(TotalCharges ~ ., data = train_data)

Display summary of the model

summary(lm_model)
## 
## Call:
## lm(formula = TotalCharges ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2208.8  -459.8   -44.4   506.8  3489.6 
## 
## Coefficients: (7 not defined because of singularities)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          -1388.3078    62.7551 -22.123  < 2e-16 ***
## genderMale                              13.0232    20.0964   0.648   0.5170    
## SeniorCitizenYes                        -2.0098    29.3391  -0.069   0.9454    
## PartnerYes                              15.5144    24.2969   0.639   0.5232    
## DependentsYes                          -30.4144    25.6634  -1.185   0.2360    
## tenure                                  59.6912     0.6939  86.028  < 2e-16 ***
## PhoneServiceYes                        608.8421    47.8707  12.718  < 2e-16 ***
## MultipleLinesNo phone service                NA         NA      NA       NA    
## MultipleLinesYes                       269.0606    25.3577  10.611  < 2e-16 ***
## InternetServiceFiber optic             876.2143    44.8250  19.547  < 2e-16 ***
## InternetServiceNo                     -422.7149    51.7083  -8.175 3.75e-16 ***
## OnlineSecurityNo internet service            NA         NA      NA       NA    
## OnlineSecurityYes                      372.4142    27.2326  13.675  < 2e-16 ***
## OnlineBackupNo internet service              NA         NA      NA       NA    
## OnlineBackupYes                        440.7410    25.8659  17.039  < 2e-16 ***
## DeviceProtectionNo internet service          NA         NA      NA       NA    
## DeviceProtectionYes                    361.4424    26.6921  13.541  < 2e-16 ***
## TechSupportNo internet service               NA         NA      NA       NA    
## TechSupportYes                         359.9608    27.8931  12.905  < 2e-16 ***
## StreamingTVNo internet service               NA         NA      NA       NA    
## StreamingTVYes                         463.2455    29.4629  15.723  < 2e-16 ***
## StreamingMoviesNo internet service           NA         NA      NA       NA    
## StreamingMoviesYes                     440.1196    29.8074  14.765  < 2e-16 ***
## ContractOne year                       -28.4688    31.5682  -0.902   0.3672    
## ContractTwo year                      -201.4450    37.9996  -5.301 1.20e-07 ***
## PaperlessBillingYes                    -25.7778    22.4349  -1.149   0.2506    
## PaymentMethodCredit card (automatic)    16.3421    31.6147   0.517   0.6052    
## PaymentMethodElectronic check          -73.3124    29.5553  -2.481   0.0132 *  
## PaymentMethodMailed check              184.0511    33.1615   5.550 3.01e-08 ***
## MonthlyCharges                           0.6941     1.4383   0.483   0.6294    
## ChurnYes                              -176.9107    26.7572  -6.612 4.21e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 697.1 on 4811 degrees of freedom
## Multiple R-squared:  0.9062, Adjusted R-squared:  0.9057 
## F-statistic:  2020 on 23 and 4811 DF,  p-value: < 2.2e-16

Insights

  • Coefficients: The sign of each coefficient indicates the direction of the relationship (positive or negative).

  • Multiple R-squared: Measures the proportion of variance in the response variable explained by the model.

The OLS model identified 7 singularities, resulting in NA coefficients. This indicates redundant information among the categorical features (e.g., repeated encoding of “No internet service” across multiple variables).

Surprisingly, MonthlyCharges was found to be statistically insignificant (p = 0.63). This is a classic symptom of multicollinearity: the variance in TotalCharges is already captured by individual service-related features (e.g., Fiber optic, Streaming services), rendering the aggregate MonthlyCharges variable redundant in the linear model.

Prediction on the test set

lm_pred <- predict(lm_model, test_data)

Evaluate performance of Multiple Linear Regression

lm_rmse <- RMSE(lm_pred, test_data$TotalCharges)
lm_r2 <- R2(lm_pred, test_data$TotalCharges)

cat("Linear Regression Results - RMSE:", lm_rmse," R-squared:", lm_r2)
## Linear Regression Results - RMSE: 702.46  R-squared: 0.9033625

Issues identified

1. High R-squared (0.90): The model explains over 90% of the variance, confirming that tenure and the charge-related features are the dominant drivers of the total bill.
2. Singularities: The summary output shows NA coefficients for several features (e.g., the “No internet service” levels of OnlineSecurity and related columns). This indicates perfect multicollinearity: redundant information exists in the categorical features.
3. Insignificance of key variables: Surprisingly, MonthlyCharges was deemed statistically insignificant. This is a symptom of multicollinearity, where its variance is “stolen” by the specific service features (Fiber optic, Streaming, etc.).
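To pinpoint exactly which dummy variables lie behind the NA coefficients, base R's alias() function can be applied to the fitted model; this is a diagnostic sketch whose output is not shown in the report.

alias(lm_model)$Complete   # lists the aliased (NA) coefficients and their linear dependencies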

Lasso Regression

To address the multicollinearity and singularities observed in the linear model, we implement Lasso Regression. Lasso introduces a penalty term that can shrink coefficients of redundant variables to zero, effectively performing feature selection.
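Formally, Lasso estimates the coefficients by minimizing the residual sum of squares plus an L1 penalty, where the tuning parameter \(\lambda\) controls the amount of shrinkage (selected by cross-validation below):

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
\]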

Prepare matrix inputs

The glmnet implementation of Lasso regression does not accept data frames and requires matrix inputs. The model.matrix() function automatically converts the categorical variables into dummy (0/1) variables.

x_train <- model.matrix(TotalCharges ~ ., train_data)[,-1] 
y_train <- train_data$TotalCharges

x_test <- model.matrix(TotalCharges ~ ., test_data)[,-1]
y_test <- test_data$TotalCharges

Train Lasso Regression

Setting alpha = 1 specifies Lasso regression.

lasso_cv_model <- cv.glmnet(x_train, y_train, alpha = 1)

Identify the optimal lambda

best_lambda <- lasso_cv_model$lambda.min
plot(lasso_cv_model)
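For reference, the selected penalty can be printed directly (based on the coefficient table below, it is roughly 3.67); cv.glmnet also stores lambda.1se, a more heavily regularized alternative within one standard error of the minimum.

best_lambda                  # lambda minimizing the cross-validated error
lasso_cv_model$lambda.1se    # more conservative choice (1-SE rule)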

Examine selected features

Variables with coefficients equal to 0 are removed by Lasso.

coef(lasso_cv_model, s = best_lambda)
## 31 x 1 sparse Matrix of class "dgCMatrix"
##                                        s=3.673108
## (Intercept)                          -1412.893131
## genderMale                               6.393781
## SeniorCitizenYes                         .       
## PartnerYes                               4.070882
## DependentsYes                          -16.756789
## tenure                                  59.254420
## PhoneServiceYes                        561.731648
## MultipleLinesNo phone service            .       
## MultipleLinesYes                       259.866634
## InternetServiceFiber optic             831.826839
## InternetServiceNo                     -366.246488
## OnlineSecurityNo internet service        .       
## OnlineSecurityYes                      360.192889
## OnlineBackupNo internet service         -2.306222
## OnlineBackupYes                        430.873206
## DeviceProtectionNo internet service     -5.200375
## DeviceProtectionYes                    351.128156
## TechSupportNo internet service          -4.726654
## TechSupportYes                         344.113597
## StreamingTVNo internet service           .       
## StreamingTVYes                         442.272965
## StreamingMoviesNo internet service      -4.395051
## StreamingMoviesYes                     418.977773
## ContractOne year                         .       
## ContractTwo year                      -160.033712
## PaperlessBillingYes                    -13.169609
## PaymentMethodCredit card (automatic)     3.159533
## PaymentMethodElectronic check          -72.498665
## PaymentMethodMailed check              163.678116
## MonthlyCharges                           2.421755
## ChurnYes                              -167.235252

Prediction on the test set

lasso_pred <- predict(lasso_cv_model, s = best_lambda, newx = x_test)

Evaluate performance of Lasso Regression

lasso_rmse <- RMSE(lasso_pred, y_test)
lasso_r2 <- R2(lasso_pred, y_test)

cat("Lasso Regression Results - RMSE:", lasso_rmse, " R-squared:", lasso_r2)
## Lasso Regression Results - RMSE: 702.6844  R-squared: 0.9032689

Insights

  • Robustness: Lasso handled the singularities gracefully. The redundant levels that produced NA coefficients in the OLS model were shrunk to zero or to negligible values by the penalty.
  • Feature Selection: The model simplified the feature space by removing non-informative variables (like SeniorCitizen or ContractOneYear in some splits), making the model more interpretable and less prone to overfitting.
  • Accuracy: The RMSE is comparable to the Linear model, showing that we can achieve similar accuracy with a cleaner, structurally sounder model.

Prepare data for visualization

We visualize the actual versus predicted values for the Lasso model to assess fit.

plot_data <- data.frame(
  Actual = test_data$TotalCharges,
  Predicted = as.vector(lasso_pred)
)

Data visualization using ggplot2

library(ggplot2)

ggplot(plot_data, aes(x = Actual, y = Predicted)) +
  geom_point(color = "#00AFBB", alpha = 0.5) +  # Scatter plot
  geom_abline(intercept = 0, slope = 1, color = "red", size = 1) + # 45-degree reference line
  labs(
    title = "Lasso Regression: Actual vs Predicted Total Charges",
    subtitle = paste("R-squared =", round(lasso_r2, 3), "| RMSE =", round(lasso_rmse, 2)),
    x = "Actual Total Charges ($)",
    y = "Predicted Total Charges ($)"
  ) +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Figure Analysis: Actual vs. Predicted Total Charges

The scatter plot above visualizes the performance of our final Lasso Regression model on the test dataset:

1. The Red Line (Ideal Fit): The diagonal red line represents the line of perfect prediction (\(y = x\)). If the model were 100% accurate, every blue dot would lie exactly on this line.

2. The Blue Dots (Predictions): Each dot represents a customer in the test set.

  • Strong Correlation: The points are tightly clustered around the red reference line, forming a clear diagonal trend. This visually confirms our high R-squared value of ~0.90, indicating that the model successfully captures the underlying linear relationship between customer attributes and their total bill.

  • Error Distribution: The vertical distance between a dot and the red line represents the prediction error (Residual). While most points are close to the line, the slight dispersion (scatter) corresponds to our RMSE of ~702, which reflects the natural variance in customer usage that cannot be perfectly explained by a linear model alone.

The visual alignment confirms that the Lasso model is a reliable tool for forecasting customer value, with no obvious patterns of bias (e.g., consistently overestimating or underestimating).
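The residuals behind this plot can also be summarized numerically; a short sketch using the plot_data frame defined above:

residuals <- plot_data$Actual - plot_data$Predicted
summary(residuals)        # centred near zero if the model is unbiased
sqrt(mean(residuals^2))   # reproduces the RMSE of ~702 reported above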

Random Forest Regression

While Multiple Linear Regression and Lasso Regression assume a linear relationship between features and the target, customer billing data often contains non-linear patterns. To capture these complexities, we employ Random Forest Regression, an ensemble method that averages predictions from multiple decision trees.

Split the dataset

library(caret)
library(readr)
telco_df <- read_csv("telco_churn_cleaned.csv")
## Rows: 6903 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): customerID, gender, SeniorCitizen, Partner, Dependents, PhoneServi...
## dbl  (3): tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(123) 
train_index <- createDataPartition(
  y = telco_df$TotalCharges,
  p = 0.7,  
  list = FALSE
)
train_data <- telco_df[train_index, ]  
test_data <- telco_df[-train_index, ]  

Train Random Forest Regressor

We use the randomForest package. Unlike Lasso, Random Forest can handle categorical factors directly without needing a model matrix and is robust to outliers.
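
Because read_csv() imports the text columns as characters and customerID is a unique identifier with no predictive value, the forest is typically trained after dropping the ID and coercing the remaining character columns to factors. The preprocessing below is a minimal sketch of that step (an assumption on our part; the original run may have handled it differently). dplyr is available from the tidyverse loaded earlier.

# Drop the identifier and convert character columns to factors before training
# (assumed preprocessing; not shown in the original run)
train_data <- train_data %>%
  select(-customerID) %>%
  mutate(across(where(is.character), as.factor))
test_data <- test_data %>%
  select(-customerID) %>%
  mutate(across(where(is.character), as.factor))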

library(randomForest)

# Set seed for reproducibility
set.seed(123)

# Train the Random Forest Regression Model
# ntree = 500: We build 500 trees
# importance = TRUE: To calculate variable importance
rf_reg_model <- randomForest(TotalCharges ~ ., 
                             data = train_data, 
                             ntree = 500, 
                             importance = TRUE)

print(rf_reg_model)
## 
## Call:
##  randomForest(formula = TotalCharges ~ ., data = train_data, ntree = 500,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 33893.25
##                     % Var explained: 99.34
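
The "Mean of squared residuals" printed above is the out-of-bag (OOB) mean squared error. Taking its square root puts it on the same scale as the hold-out RMSE computed in the next step; the two should be close if the forest generalizes well. A small sketch using the mse component stored by randomForest:

# OOB MSE of the completed forest; its square root is comparable to the test RMSE
sqrt(tail(rf_reg_model$mse, 1))  # roughly 184 for the run shown above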

Prediction on test set

rf_reg_pred <- predict(rf_reg_model, test_data)

Evaluate performance of Random Forest Regressor

rf_rmse <- RMSE(rf_reg_pred, test_data$TotalCharges)
rf_r2 <- R2(rf_reg_pred, test_data$TotalCharges)

cat("Random Forest Regression Results - RMSE:", rf_rmse, " R-squared:", rf_r2)
## Random Forest Regression Results - RMSE: 187.2001  R-squared: 0.9941287

Feature importance analysis

One of the advantages of Random Forest Regression is the ability to see which variables drive the prediction. We visualize the importance of features in predicting TotalCharges.

Plot variable importance

varImpPlot(rf_reg_model, main = "Feature Importance: Predictors of Total Charges")

Observation:

The plot highlights two importance metrics: %IncMSE (the increase in mean squared error when the variable's values are randomly permuted) and IncNodePurity (the total reduction in node impurity from splits on that variable). Unsurprisingly, tenure and MonthlyCharges are the dominant predictors. However, Random Forest also lets us see whether other variables such as Contract or InternetService play a significant role in reducing prediction error through non-linear interactions.
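
The same information can be read as a table; the sketch below extracts the importance matrix from the fitted forest and sorts it by %IncMSE.

# Importance matrix with %IncMSE (permutation importance) and IncNodePurity
imp <- importance(rf_reg_model)
imp_sorted <- imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
head(imp_sorted, 10)  # top predictors of TotalCharges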

Figure Analysis: Actual vs. Predicted Total Charges

We visualize the Random Forest Regression predictions, as a representative of the non-linear models, against the actual values to check for fit and bias.

Create a dataframe for plotting

rf_plot_data <- data.frame(
  Actual = test_data$TotalCharges,
  Predicted = as.vector(rf_reg_pred)
)

ggplot(rf_plot_data, aes(x = Actual, y = Predicted)) +
  geom_point(color = "darkgreen", alpha = 0.4) +  # Scatter plot
  geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1) + # 45-degree reference line
  labs(
    title = "Random Forest: Actual vs Predicted Total Charges",
    subtitle = paste("R-squared =", round(rf_r2, 3), "| RMSE =", round(rf_rmse, 2)),
    x = "Actual Total Charges ($)",
    y = "Predicted Total Charges ($)"
  ) +
  theme_minimal()

Comparative Analysis: Choosing the Best Regression Model

To finalize our regression analysis, we compare the performance of all three models: Multiple Linear Regression (Baseline), Lasso Regression (Regularization), and Random Forest (Non-linear Ensemble).

Comparison Table

We aggregate the RMSE and R-squared metrics for all three models to identify the best performer.

Create a summary dataframe

model_comparison <- data.frame(
  Model = c("Linear Regression (OLS)", "Lasso Regression", "Random Forest Regression"),
  RMSE = c(lm_rmse, lasso_rmse, rf_rmse),
  R_Squared = c(lm_r2, lasso_r2, rf_r2)
)

Sort by RMSE (Lower is better)

model_comparison <- model_comparison[order(model_comparison$RMSE), ]

print(model_comparison)
##                      Model     RMSE R_Squared
## 3 Random Forest Regression 187.2001 0.9941287
## 1  Linear Regression (OLS) 702.4600 0.9033625
## 2         Lasso Regression 702.6844 0.9032689
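
The project objectives also list MAE as a regression metric. The sketch below computes it with caret's MAE() for the Lasso and Random Forest predictions already held in memory; the linear-model predictions could be added in the same way.

# Mean absolute error (caret::MAE takes predictions first, observed values second)
lasso_mae <- MAE(as.vector(lasso_pred), y_test)
rf_mae <- MAE(rf_reg_pred, test_data$TotalCharges)
cat("Lasso MAE:", lasso_mae, " Random Forest MAE:", rf_mae)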

Result discussion for prediction of total billing charges

In this project, we aimed to predict TotalCharges using three distinct algorithms:

  1. Multiple Linear Regression (OLS): Provided a strong baseline with an \(R^2\) of ~0.90 but suffered from singularities due to perfect multicollinearity among the categorical variables.

  2. Lasso Regression: Successfully handled the multicollinearity by shrinking redundant coefficients to zero. It maintained high accuracy while offering a cleaner, more interpretable set of features.

  3. Random Forest Regression: Added the ability to model non-linear relationships.

Recommendation: If Random Forest Regression achieves a significantly lower RMSE than Lasso Regression, it indicates that complex, non-linear interactions (e.g., how specific contracts affect billing over time) are crucial. However, if the performance is similar, Lasso Regression is often preferred for business deployment because it provides a clear mathematical formula (coefficients) that is easier for stakeholders to interpret than the “black box” nature of Random Forest Regression.

Based on our analysis, tenure and MonthlyCharges are such strong linear drivers of the total billing charges that linear models such as Lasso Regression perform exceptionally well, making them the most efficient choice for this specific problem.
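
This is intuitive because total charges accumulate roughly as tenure multiplied by the monthly bill; a quick sanity check on the cleaned data (using the telco_df loaded above) shows how tight that relationship is.

# TotalCharges should closely track tenure * MonthlyCharges
approx_total <- telco_df$tenure * telco_df$MonthlyCharges
cor(approx_total, telco_df$TotalCharges)  # expected to be very close to 1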

Conclusion

Customer retention efforts should focus on month-to-month customers, fiber optic users, customers in their first six months, customers without tech support or online security, and customers paying by electronic check. To address these risks, the telecommunication company can:

  1. Redesign Contract and Pricing Strategies: Incentivize long-term commitments through discounted monthly fees and free value-added services such as TechSupport and OnlineSecurity. Introducing hybrid contract options (e.g., flexible 6-month plans) can ease the transition for month-to-month customers who may be hesitant to commit long term.

  2. Strengthen Early-Life Customer Engagement: Implement structured 90-day and 180-day retention programs that include proactive check-ins, welcome calls or emails, and personalized usage tips. Offering temporary discounts or service upgrades during early tenure can improve satisfaction and reduce early churn.

  3. Bundle Value-Added Services: Increase perceived value by creating default service bundles that include TechSupport and OnlineSecurity. Free or low-risk trial periods for add-on services can help customers experience their benefits before being charged, increasing adoption and long-term retention.

  4. Improve Billing and Payment Experience: Encourage enrollment in automatic payment methods through small incentives and clear communication. Reducing reliance on electronic checks by promoting digital and automated payment options can lower friction, improve payment reliability, and decrease churn risk.

As discussed, while all three models achieve similar accuracy (~80%), each has its own strengths. By aligning data-driven insights with customer-centric strategies, the telecommunication company can reduce churn, stabilize revenue and increase customer lifetime value, transforming analytics into a competitive advantage.

On the other hand, the regression analysis demonstrates that the models explain approximately 90% of the variation in total billing, driven primarily by how long a customer stays and which services they subscribe to. Total billing is cumulative, meaning long-tenured customers with richer service bundles generate disproportionately higher lifetime revenue. Contrary to intuition, monthly charges become statistically insignificant once the individual services are accounted for. This indicates that total revenue is shaped less by the aggregate bill than by what customers are actually paying for, such as fiber optic internet, streaming and security services.

Lasso Regression removed redundant features such as the duplicated “No internet service” indicators, isolating the services that truly contribute to revenue. This produces a cleaner and more reliable forecast model because not all customer attributes contribute equally to revenue; some add noise without adding insight. Although Multiple Linear Regression and Lasso Regression achieved similar accuracy, Multiple Linear Regression suffered from multicollinearity and unstable coefficients, making it risky for business interpretation, whereas Lasso Regression provided comparable accuracy with better stability and interpretability. An RMSE of ~702 relative to high total charges indicates the model is accurate enough for the telecommunication company to make more informed decisions around retention strategy, service bundling, revenue forecasting and customer value management, ultimately driving sustainable and predictable revenue growth.