Wei Xiaoting 24232389, Lee Sijie S2018064, Liu Yiqian 24229675, Mohammad Irfanul Haque S2012063, Shi Hui 25070378
Telecommunication is a competitive and subscription-based industry. Customers pay monthly, thus losing a subscriber directly reduces recurring revenue. Customer churn refers to subscribers who discontinue their mobile, internet or TV services and switch to another telecommunication service provider. High customer churn indicates dissatisfaction due to poor network coverage, cheaper offers or bundled deals from competitors, bad customer service such as long waiting time and unresolved issues, limited data or slow internet speed, billing disputes as well as lack of loyalty programs and incentives. In this project, we analyze the Telco Customer Churn dataset obtained from Kaggle to detect the customer churn and predict the total billing charges of a customer who signed up for services with a telecommunication company.
Our goal is to clean the raw data, handle missing values and perform
exploratory data analysis (EDA) to understand the customer behaviour.
After data cleaning is completed, the cleaned and processed dataset is
used to solve two machine learning problems. The objectives of this
project are: 1. To develop machine learning-based classification models
to detect the customer churn of a telecommunication company.
2. To develop machine learning-based regression models to predict the
total billing charges of a customer.
3. To evaluate the performance of the classification models with
accuracy, precision, recall and F1 score.
4. To evaluate the performance of the regression models with mean
absolute error (MAE), root mean squared error (RMSE) and R-squared
(R2).
First, we load the necessary R packages. We use tidyverse for data manipulation and data visualization, plus skimr for data summary.
library(tidyverse)
library(skimr)
Next, we import the raw dataset containing missing values by using the read.csv() function. All columns whose data type is character are converted into factors.
raw_data <- read.csv("telco_churn_dirty.csv", stringsAsFactors = FALSE)
raw_data[raw_data == ""] <- NA
raw_data <- raw_data %>% mutate(across(where(is.character), as.factor))
In order to fully understand the dataset, we employed four different methods to study its dimensions and structure and this comparison help us identify the data types and potential issues.
dim(): Returns the dimensions (rows and columns) to verify the shape and size of the dataset. The result shows that the dataset consists of 7043 rows and 21 columns. Each row represents a customer tied with a telecommunication company and each column contains the customer’s attributes such as the demographic, account information and services.
dim(raw_data)
## [1] 7043 21
str(): A Base R function that shows the internal structure of the dataset, detailing data types such as integer, numeric and factor.
str(raw_data)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
glimpse(): A tidyverse function similar to str(), but gives an output of a more compact and readable summary of the dataset.
glimpse(raw_data)
## Rows: 7,043
## Columns: 21
## $ customerID <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-…
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
summary(): Provides a statistical overview of the dataset by identifying the data distribution, central tendency, spread and potential anomalies. For numerical attributes, it returns the minimum, 1st quartile, median, mean, 3rd quartile and maximum. For categorical attributes, it returns the frequency of each levels.
summary(raw_data)
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.39
## 3rd Qu.:55.50
## Max. :72.00
## NA's :140
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies Contract
## No :2810 No :2785 Month-to-month:3875
## No internet service:1526 No internet service:1526 One year :1473
## Yes :2707 Yes :2732 Two year :1695
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges
## No :2872 Bank transfer (automatic):1467 Min. : 18.25
## Yes:4171 Credit card (automatic) :1439 1st Qu.: 35.65
## Electronic check :2248 Median : 70.35
## Mailed check :1537 Mean : 64.78
## NA's : 352 3rd Qu.: 89.90
## Max. :118.75
## NA's :352
## TotalCharges Churn
## Min. : 18.8 No :5174
## 1st Qu.: 401.4 Yes:1869
## Median :1397.5
## Mean :2283.3
## 3rd Qu.:3794.7
## Max. :8684.8
## NA's :11
Identify missing values: Check for NA and empty strings.
missing_values <- colSums(is.na(raw_data))
missing_values[missing_values > 0]
## tenure PaymentMethod MonthlyCharges TotalCharges
## 140 352 352 11
Based on the data exploration above, we have formulated the following data cleaning strategies:
Clean column tenure: Remove rows with missing values as
it is a critical time-series feature.
clean_data <- raw_data
clean_data <- clean_data %>% filter(!is.na(tenure))
clean_data$SeniorCitizen <- factor(clean_data$SeniorCitizen,
levels = c(0, 1),
labels = c("No", "Yes"))
labels = c("No", "Yes")
Clean column MonthlyCharges: Impute missing numerical
values using the median to ensure robustness to outliers, meaning
extreme values like unusually high or low monthly charges will not skew
the imputation in order to preserve the central tendency.
median1 <- median(clean_data$MonthlyCharges, na.rm = TRUE)
clean_data$MonthlyCharges[is.na(clean_data$MonthlyCharges)] <- median1
Clean column PaymentMethod: Impute missing categorical
values using the mode (most frequent value) to maintain the existing
distribution of categories.
get_mode <- function(value){
uniqv <- unique(value)
uniqv[which.max(tabulate(match(value,uniqv)))]
}
mode_payment <- get_mode(clean_data$PaymentMethod[!is.na(clean_data$PaymentMethod)])
clean_data$PaymentMethod[is.na(clean_data$PaymentMethod)] <- mode_payment
Clean column TotalCharges: Impute missing numerical
values using the mean to ensure that the total revenue estimates are not
biased by missing values and preserves the overall average.
mean_total <- mean(clean_data$TotalCharges, na.rm = TRUE)
clean_data$TotalCharges[is.na(clean_data$TotalCharges)] <- mean_total
Make sure all missing values are handled.
sum(is.na(clean_data))
## [1] 0
The cleaned dataset is write into a new CSV file by using the write.csv() function and saved as telco_churn_cleaned.csv
write.csv(clean_data, "telco_churn_cleaned.csv", row.names = FALSE)
print("Cleaned dataset saved as 'telco_churn_cleaned.csv'.")
## [1] "Cleaned dataset saved as 'telco_churn_cleaned.csv'."
Now that the data is clean, we can visualize the distribution of customer churn and other key features.
Exploratory data analysis allows us to uncover patterns and anomalies
in the data. We focus on the relationship between customer attributes
and the target variable Churn.
Before building any model, we must check the distribution of the target variable, whether it is balance or imbalance. If one class dominates the other, then accuracy becomes a misleading metric. Hence, we need to establish a baseline.
ggplot(clean_data, aes(x = Churn, fill = Churn)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=-0.5) +
theme_minimal() +
labs(title = "Number of customer churn (lost users)", y = "Count")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Customer churn is divided into two categories: “Yes” and “No.” From the result, we can deduce that the telecommunication company managed to retain a majority of their customers with a count 5072 users. However, a portion of customers that sums up 1831 users which is approximately 26.5% of the total are lost. The number of customers who did not churn is significantly higher than those who did churn, suggesting a relatively lower churn rate.
We hypothesize that customers with higher monthly charge billing are more likely to leave. By visualizing the distribution of the monthly charges for both retained and churned customers, we can see if the price sensitivity drives the rate of customer churn in the telecommunication company.
ggplot(clean_data, aes(x = MonthlyCharges, fill = Churn)) +
geom_histogram(binwidth = 5, alpha = 0.7, position = "identity") +
theme_minimal() +
labs(title = "Distribution of Monthly Charges by Churn Status")
The chart shows that when the churn status is “Yes”, the distribution curve peaks at higher values compared to when the churn status is “No”. This confirms that high-paying customers are at greater risk of leaving.
Payment friction can cause dissatisfaction among customers. Thus, we analyze the churn rate within each payment method to identify if specific methods like electronic check correlate with higher churn.
ggplot(clean_data, aes(x = PaymentMethod, fill = Churn)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(title = "Churn Proportion by Payment Method", y = "Proportion") +
coord_flip()
“Electronic check” has a significantly larger churn, as showed in the blue bar compared to automatic payment methods like “Credit card” and “Bank transfer”. The “Credit card” and “Bank transfer” show the lowest churn proportions, indicating that customers using these automatic payment methods tend to stay more.
In order to detect the customer churn of the telecommunications company, customers are classified into two categories, whereby “Yes” indicates that the user has stopped using the service and “No” indicates that the user is still utilizing the service. Based on this classification, we have developed three machine learning models: Decision Tree, Random Forest and Logistic Regression.
Before beginning the classification task, some necessary R packages are loaded. We use caret to streamline the model training process for classification and regression problems. Next, rpart implements recursive partitioning for Decision Tree and rpart.plot visualizes the plotting of rpart in Decision Tree. For Random Forest, we use randomForest to build the model by combining multiple decision trees based on the Breiman’s random forest algorithm. Lastly, ggplot2 is for data visualization.
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(ggplot2)
The line of code set.seed(123) ensures random processes
are reproducible during model training. It guarantees the same data
splits, model initialization and results across different executions for
as long as the code and environment remain the same.
set.seed(123)
Stratified sampling is used to split the cleaned data into 80% for
training and 20% for testing based on the target variable
Churn so that both sets have similar class proportions. The
indices returned by the createDataPartition() function are used to
subset rows for train_data and the remaining rows for
test_data. From the result, we can see that 5523 instances
are used for training the models and 1380 instances are used for testing
the models.
index <- createDataPartition(clean_data$Churn, p = 0.8, list = FALSE)
train_data <- clean_data[index, ]
nrow(train_data)
## [1] 5523
test_data <- clean_data[-index, ]
nrow(test_data)
## [1] 1380
We trained the Decision Tree model by using the rpart() function to
predict the target variable Churn from all features except
customerID in the training dataset. It uses the Gini index
for splits and controls tree complexity with a minimum split size of 10
and a complexity parameter of 0.0034.
dt_model <- rpart(Churn ~ . -customerID,
data = train_data,
method = "class",
parms = list(split = "gini"),
control = rpart.control(minsplit = 10, cp = 0.0034))
dt_model
## n= 5523
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 5523 1465 No (0.73474561 0.26525439)
## 2) Contract=One year,Two year 2511 169 No (0.93269614 0.06730386) *
## 3) Contract=Month-to-month 3012 1296 No (0.56972112 0.43027888)
## 6) InternetService=DSL,No 1353 388 No (0.71322986 0.28677014)
## 12) tenure>=5.5 775 138 No (0.82193548 0.17806452) *
## 13) tenure< 5.5 578 250 No (0.56747405 0.43252595)
## 26) TechSupport=No internet service,Yes 279 81 No (0.70967742 0.29032258) *
## 27) TechSupport=No 299 130 Yes (0.43478261 0.56521739)
## 54) SeniorCitizen=No 257 122 Yes (0.47470817 0.52529183)
## 108) OnlineSecurity=Yes 41 14 No (0.65853659 0.34146341) *
## 109) OnlineSecurity=No 216 95 Yes (0.43981481 0.56018519) *
## 55) SeniorCitizen=Yes 42 8 Yes (0.19047619 0.80952381) *
## 7) InternetService=Fiber optic 1659 751 Yes (0.45268234 0.54731766)
## 14) tenure>=13.5 910 380 No (0.58241758 0.41758242)
## 28) PaymentMethod=Bank transfer (automatic),Credit card (automatic),Mailed check 389 121 No (0.68894602 0.31105398) *
## 29) PaymentMethod=Electronic check 521 259 No (0.50287908 0.49712092)
## 58) tenure>=55.5 75 20 No (0.73333333 0.26666667) *
## 59) tenure< 55.5 446 207 Yes (0.46412556 0.53587444)
## 118) TotalCharges>=3083.6 171 80 No (0.53216374 0.46783626)
## 236) TotalCharges< 4127.475 92 32 No (0.65217391 0.34782609) *
## 237) TotalCharges>=4127.475 79 31 Yes (0.39240506 0.60759494) *
## 119) TotalCharges< 3083.6 275 116 Yes (0.42181818 0.57818182) *
## 15) tenure< 13.5 749 221 Yes (0.29506008 0.70493992) *
The Decision Tree is plotted, showing the split conditions in a top-down layout. Additional information such as the predicted class, class probability and percentage of observations reaching each node are also displayed. Based on the Decision Tree:
rpart.plot(dt_model,
type = 2,
extra = 104,
fallen.leaves = TRUE,
box.palette = "auto",
main = "Decision Tree for Customer Churn Detection")
After training the Decision Tree model, we make predictions of the customer churn based on the testing dataset by using the predict() function.
dt_pred <- predict(dt_model, test_data, type = "class")
We evaluate the performance of the Decision Tree by computing a
confusion matrix dt_conf to compare the predicted churn
outcomes to the actual churn values.
dt_conf <- confusionMatrix(dt_pred, test_data$Churn, positive = "Yes")
dt_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 876 157
## Yes 138 209
##
## Accuracy : 0.7862
## 95% CI : (0.7636, 0.8076)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : 5.604e-06
##
## Kappa : 0.4423
##
## Mcnemar's Test P-Value : 0.2946
##
## Sensitivity : 0.5710
## Specificity : 0.8639
## Pos Pred Value : 0.6023
## Neg Pred Value : 0.8480
## Prevalence : 0.2652
## Detection Rate : 0.1514
## Detection Prevalence : 0.2514
## Balanced Accuracy : 0.7175
##
## 'Positive' Class : Yes
##
Based on the confusion matrix, the Decision Tree model achieves an accuracy of 78.62%, meaning it correctly predicts outcomes for most of the cases. It is particularly strong at identifying customer’s churn when it is labelled “No”, with a specificity of 86.39% but shows moderate performance in recognizing customer’s churn when it is labelled “Yes”, with a recall of only 57.10%. The precision of the Decision Tree model for positive predictions is 60.23% while its negative predictive value is higher at 84.80%. Given that only about 26.52% of the dataset consists of actual “Yes”, the model appears to perform reasonably well on the majority class but struggles somewhat with the minority class. While overall accuracy is acceptable, the lower recall in an imbalanced classification may affect the detection of customer leaving the telecommunication company.
dt_cm_table <- as.table(dt_conf$table)
dt_cm_df <- as.data.frame(dt_cm_table)
colnames(dt_cm_df) <- c("Prediction", "Reference", "Freq")
dt_cm_df$Type <- with(dt_cm_df, ifelse((Prediction == "Yes" & Reference == "Yes") | (Prediction == "No" & Reference == "No"), "Correct (TP/TN)", "Incorrect (FP/FN)"))
ggplot(dt_cm_df, aes(x = Reference, y = Prediction, fill = Type, label = Freq)) +
geom_tile(color = "black", alpha = 0.7) + geom_text(size = 6) +
scale_fill_manual(values = c("Correct (TP/TN)" = "skyblue",
"Incorrect (FP/FN)" = "salmon")) +
labs(title = "Confusion Matrix of Decision Tree", x = "Actual",
y = "Predicted") + theme_minimal(base_size = 15)
We trained the Random Forest model by using the randomForest()
function to predict the target variable Churn from all
features except customerID in the training dataset. It
specifies 500 trees and 4 variables considered at each split while also
calculating variable importance to identify which predictors contribute
most to the model.
rf_model <- randomForest(Churn ~ . -customerID,
data = train_data,
ntree = 500,
mtry = 4,
importance = TRUE)
rf_model
##
## Call:
## randomForest(formula = Churn ~ . - customerID, data = train_data, ntree = 500, mtry = 4, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 20.44%
## Confusion matrix:
## No Yes class.error
## No 3650 408 0.1005421
## Yes 721 744 0.4921502
The importance of features in the Random Forest is evaluated and visualized. The importance() function retrieves the numerical importance scores while the varImpPlot() creates a plot to display which features contribute most to the model.
MeanDecreaseAccuracy measures how much the accuracy of
the Random Forest decreases when the values of a feature are randomly
shuffled. A higher value means the feature is more important for making
correct predictions. The top 8 features based on
MeanDecreaseAccuracy include:
MeanDecreaseGini measures the total decrease in node
impurity averaged over all trees when a feature is used to split the
data. A higher value means the feature is used more often or earlier in
splits, contributes more to splitting nodes and reduces impurity
effectively. The top 8 features based on MeanDecreaseGini
include:
TotalCharges, MonthlyCharges,
Contract and tenure are consistently top
features in both metrics, indicating they are strong predictors of the
Random Forest’s decisions.
importance(rf_model)
## No Yes MeanDecreaseAccuracy MeanDecreaseGini
## gender -0.5174706 -3.7061547 -2.869033 46.029488
## SeniorCitizen 5.0703889 3.7106840 6.430713 36.688691
## Partner 2.1048771 -5.1536578 -1.554991 39.748978
## Dependents -2.0605871 8.1775086 4.222721 34.517107
## tenure 25.7108724 26.7821187 47.073919 316.341769
## PhoneService -1.3763118 10.2866812 6.146633 8.863325
## MultipleLines 6.1700692 11.9754297 14.032826 46.511554
## InternetService 15.2300075 22.1328815 25.412103 75.814673
## OnlineSecurity 10.3836819 25.8632971 19.633618 94.951937
## OnlineBackup 8.7320811 8.9530164 11.817432 50.311576
## DeviceProtection 10.5450509 -6.8127494 7.537554 42.587469
## TechSupport 8.9405146 31.5264404 21.709602 81.017142
## StreamingTV 9.9677211 1.1161928 9.346541 39.196849
## StreamingMovies 13.5287261 -0.8537165 12.060363 40.110479
## Contract 2.2170094 34.7673948 36.805084 164.699772
## PaperlessBilling -1.7604984 21.6614835 15.155278 46.989941
## PaymentMethod 1.7768121 10.3882595 8.605394 110.170671
## MonthlyCharges 17.8428263 14.9600394 27.032589 299.800488
## TotalCharges 29.4515793 15.6623291 42.322819 341.716945
varImpPlot(rf_model, main = "Random Forest Feature Importance")
After training the Random Forest model, we make predictions of the customer churn based on the testing dataset by using the predict() function.
rf_pred <- predict(rf_model, test_data)
We evaluate the performance of the Random Forest by computing a
confusion matrix rf_conf to compare the predicted churn
outcomes to the actual churn values.
rf_conf <- confusionMatrix(rf_pred, test_data$Churn, positive = "Yes")
rf_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 911 164
## Yes 103 202
##
## Accuracy : 0.8065
## 95% CI : (0.7847, 0.8271)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : 2.771e-10
##
## Kappa : 0.4757
##
## Mcnemar's Test P-Value : 0.0002407
##
## Sensitivity : 0.5519
## Specificity : 0.8984
## Pos Pred Value : 0.6623
## Neg Pred Value : 0.8474
## Prevalence : 0.2652
## Detection Rate : 0.1464
## Detection Prevalence : 0.2210
## Balanced Accuracy : 0.7252
##
## 'Positive' Class : Yes
##
Based on the confusion matrix, the Random Forest model achieves an accuracy of 80.58%, correctly predicting both classes, “Yes” and “No” in 1112 out of 1380 cases. Similarly with the Decision Tree model, it excels at identifying customer’s churn when it is labelled “No”, with a specificity of 89.74% but struggles in recognizing customer’s churn when it is labelled “Yes”, showing a sensitivity of only 55.19%. While the Random Forest model is reliable when it predicts “No” due to a high negative predictive value of 84.73%, its precision for the “Yes” is lower, with a positive predictive value of 66.01%. This suggests it is conservative in detecting customers lost by the telecommunication company and is better suited for scenarios where correctly ruling out customers retained by the telecommunication company is the priority.
rf_cm_table <- as.table(rf_conf$table)
rf_cm_df <- as.data.frame(rf_cm_table)
colnames(rf_cm_df) <- c("Prediction", "Reference", "Freq")
rf_cm_df$Type <- with(rf_cm_df, ifelse((Prediction == "Yes" & Reference == "Yes") | (Prediction == "No" & Reference == "No"), "Correct (TP/TN)", "Incorrect (FP/FN)"))
ggplot(rf_cm_df, aes(x = Reference, y = Prediction, fill = Type, label = Freq)) +
geom_tile(color = "black", alpha = 0.7) + geom_text(size = 6) +
scale_fill_manual(values = c("Correct (TP/TN)" = "skyblue",
"Incorrect (FP/FN)" = "salmon")) +
labs(title = "Confusion Matrix of Random Forest", x = "Actual",
y = "Predicted") + theme_minimal(base_size = 15)
Logistic Regression is applied as a baseline classification technique
to predict customer churn. This model estimates the probability of churn
by modelling the relationship between the predictor variables and the
binary response variable Churn using a logistic (sigmoid)
function. All predictor variables, excluding customerID,
are used to train the model on the training dataset.
train_glm <- train_data %>% select(-customerID)
test_glm <- test_data %>% select(-customerID)
log_model <- glm(
Churn ~ .,
data = train_glm,
family = binomial(link = "logit")
)
summary(log_model)
##
## Call:
## glm(formula = Churn ~ ., family = binomial(link = "logit"), data = train_glm)
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.633e-02 2.581e-01 0.373 0.709035
## genderMale -7.738e-02 7.342e-02 -1.054 0.291869
## SeniorCitizenYes 1.796e-01 9.559e-02 1.879 0.060271 .
## PartnerYes 3.155e-02 8.805e-02 0.358 0.720144
## DependentsYes -1.796e-01 1.023e-01 -1.757 0.078945 .
## tenure -6.205e-02 6.967e-03 -8.907 < 2e-16 ***
## PhoneServiceYes -6.724e-01 1.927e-01 -3.489 0.000485 ***
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 2.112e-01 9.534e-02 2.215 0.026741 *
## InternetServiceFiber optic 7.183e-01 1.884e-01 3.813 0.000137 ***
## InternetServiceNo -7.881e-01 2.163e-01 -3.644 0.000269 ***
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes -3.624e-01 1.017e-01 -3.565 0.000364 ***
## OnlineBackupNo internet service NA NA NA NA
## OnlineBackupYes -1.891e-01 9.217e-02 -2.052 0.040174 *
## DeviceProtectionNo internet service NA NA NA NA
## DeviceProtectionYes 1.981e-02 9.437e-02 0.210 0.833700
## TechSupportNo internet service NA NA NA NA
## TechSupportYes -4.254e-01 1.022e-01 -4.162 3.15e-05 ***
## StreamingTVNo internet service NA NA NA NA
## StreamingTVYes 2.060e-01 1.093e-01 1.885 0.059451 .
## StreamingMoviesNo internet service NA NA NA NA
## StreamingMoviesYes 1.797e-01 1.114e-01 1.613 0.106806
## ContractOne year -6.925e-01 1.206e-01 -5.744 9.27e-09 ***
## ContractTwo year -1.565e+00 2.061e-01 -7.591 3.18e-14 ***
## PaperlessBillingYes 3.755e-01 8.412e-02 4.465 8.02e-06 ***
## PaymentMethodCredit card (automatic) -1.244e-02 1.332e-01 -0.093 0.925587
## PaymentMethodElectronic check 3.198e-01 1.089e-01 2.936 0.003329 **
## PaymentMethodMailed check 3.759e-02 1.339e-01 0.281 0.778972
## MonthlyCharges 1.214e-03 6.371e-03 0.191 0.848888
## TotalCharges 3.456e-04 7.922e-05 4.362 1.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6389.9 on 5522 degrees of freedom
## Residual deviance: 4535.0 on 5499 degrees of freedom
## AIC: 4583
##
## Number of Fisher Scoring iterations: 6
log_prob <- predict(log_model, newdata = test_data, type = "response")
log_pred <- ifelse(log_prob > 0.5, "Yes", "No")
log_pred <- factor(log_pred, levels = c("No", "Yes"))
The performance of the Logistic Regression model is evaluated using a confusion matrix. The confusion matrix summarizes the model’s prediction results by comparing predicted churn outcomes with actual churn labels. Key evaluation metrics such as accuracy, precision, recall, and F1-score are derived to assess the model’s effectiveness.
log_conf <- confusionMatrix(log_pred, test_data$Churn, positive = "Yes")
log_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 900 155
## Yes 114 211
##
## Accuracy : 0.8051
## 95% CI : (0.7832, 0.8257)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : 6.308e-10
##
## Kappa : 0.4813
##
## Mcnemar's Test P-Value : 0.01473
##
## Sensitivity : 0.5765
## Specificity : 0.8876
## Pos Pred Value : 0.6492
## Neg Pred Value : 0.8531
## Prevalence : 0.2652
## Detection Rate : 0.1529
## Detection Prevalence : 0.2355
## Balanced Accuracy : 0.7320
##
## 'Positive' Class : Yes
##
Based on the confusion matrix, the Logistic Regression model achieves an accuracy of 80.51%, correctly predicting both classes, “Yes” and “No”, in 1111 out of 1380 cases. The model performs well in identifying customers who do not churn when the actual label is “No”, as indicated by a high specificity of 88.76%. However, the model shows relatively lower performance in identifying customers who churn when the actual label is “Yes”, with a sensitivity of 57.65%. This indicates that while the model is able to correctly identify 155 churned customers, a number of churn cases are still misclassified as non-churn. The positive predictive value of 64.92% suggests that when the model predicts churn, it is correct in a majority of cases, while the negative predictive value of 85.31% indicates strong performance in predicting customers who remain with the service.
Overall, the confusion matrix provides a clear understanding of how the Logistic Regression model distributes its predictions across churned and non-churned customers, highlighting its strengths in identifying retained customers and its limitations in detecting churned customers.
log_cm_table <- as.table(log_conf$table)
log_cm_df <- as.data.frame(log_cm_table)
colnames(log_cm_df) <- c("Prediction", "Reference", "Freq")
log_cm_df$Type <- with(
log_cm_df,
ifelse(
(Prediction == "Yes" & Reference == "Yes") |
(Prediction == "No" & Reference == "No"),
"Correct (TP/TN)",
"Incorrect (FP/FN)"
)
)
ggplot(log_cm_df, aes(x = Reference, y = Prediction, fill = Type, label = Freq)) +
geom_tile(color = "black", alpha = 0.7) +
geom_text(size = 6) +
scale_fill_manual(
values = c(
"Correct (TP/TN)" = "skyblue",
"Incorrect (FP/FN)" = "salmon"
)
) +
labs(
title = "Confusion Matrix of Logistic Regression",
x = "Actual",
y = "Predicted"
) +
theme_minimal(base_size = 15)
We create a data frame named dt_metrics to summarize the
performance of the Decision Tree by storing the accuracy, precision,
recall and F1 score calculated from the confusion matrix
dt_conf.
dt_metrics <- data.frame(Model = "Decision Tree",
Accuracy = dt_conf$overall['Accuracy'],
Precision = dt_conf$byClass['Precision'],
Recall = dt_conf$byClass['Recall'],
F1_Score = dt_conf$byClass['F1'])
The same is done for Random Forest, in which a data frame called
dt_metrics is generated to capture the model performance by
recording its accuracy, precision, recall and F1 score derived from the
confusion matrix rf_conf.
rf_metrics <- data.frame(Model = "Random Forest",
Accuracy = rf_conf$overall['Accuracy'],
Precision = rf_conf$byClass['Precision'],
Recall = rf_conf$byClass['Recall'],
F1_Score = rf_conf$byClass['F1'])
Performance metrics from the Logistic Regression confusion matrix are extracted and stored for comparison with other classification models. These metrics are later used to evaluate the relative performance of Logistic Regression against Decision Tree and Random Forest models.
log_metrics <- data.frame(Model = "Logistic Regression",
Accuracy = log_conf$overall["Accuracy"],
Precision = log_conf$byClass["Precision"],
Recall = log_conf$byClass["Recall"],
F1_Score = log_conf$byClass["F1"]
)
performance <- rbind(dt_metrics, rf_metrics, log_metrics)
performance
## Model Accuracy Precision Recall F1_Score
## Accuracy Decision Tree 0.7862319 0.6023055 0.5710383 0.5862553
## Accuracy1 Random Forest 0.8065217 0.6622951 0.5519126 0.6020864
## Accuracy2 Logistic Regression 0.8050725 0.6492308 0.5765027 0.6107091
Based on the performance metrics obtained, all three classification models demonstrate comparable accuracy levels of approximately 80%, indicating that each model is capable of learning meaningful patterns from the dataset. Logistic Regression achieves an accuracy of 80.51%, while Decision Tree and Random Forest achieve accuracies of 78.62% and 80.65%, respectively.
In terms of precision, Random Forest records the highest value (66.23%), followed by Logistic Regression (64.92%) and Decision Tree (60.23%). This suggests that Random Forest is more effective at correctly identifying churned customers when a churn prediction is made. However, when considering recall, the Decision Tree model achieves the highest value (57.10%), indicating a stronger ability to identify actual churn cases compared to Logistic Regression (57.65%) and Random Forest (55.19%).
The F1-score, which balances both precision and recall, further highlights the overall effectiveness of each model. Random Forest achieves the highest F1-score (0.6021), followed closely by Logistic Regression (0.6107) and Decision Tree (0.5863). This indicates that while all models perform reasonably well, Random Forest and Logistic Regression provide a more balanced trade-off between identifying churned customers and avoiding incorrect churn predictions.
Overall, the comparison shows that each model has distinct strengths in customer churn prediction. Logistic Regression offers stable and interpretable performance, Decision Tree provides higher recall in detecting churned customers, and Random Forest demonstrates stronger precision and balanced performance across evaluation metrics.
In this section, we apply regression models to predict the total billing charges.The goal is to quantify the relationship between customer tenure, services and total billing. The regression models used include Multiple Linear Regression (OLS) as a baseline model, Lasso Regression is build to handle potential multicollinearity and perform feature selection as well as Random Forest Regression as a non-linear ensemble model.
library(caret) # for data splitting and evaluation
library(glmnet) # for Lasso regression
df <- read.csv("telco_churn_cleaned.csv")
# The first column is customerID. If we keep it, the model will use it as a feature for training, which is incorrect.
df_model <- df %>% select(-customerID)
# Set seed to ensure reproducibility
set.seed(123)
trainIndex <- createDataPartition(df_model$TotalCharges, p = .7,
list = FALSE,
times = 1)
train_data <- df_model[ trainIndex,]
test_data <- df_model[-trainIndex,]
print("Data splitting completed.")
## [1] "Data splitting completed."
We first establish a baseline using standard Ordinary Least Squares (OLS) regression.
lm_model <- lm(TotalCharges ~ ., data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = TotalCharges ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2208.8 -459.8 -44.4 506.8 3489.6
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1388.3078 62.7551 -22.123 < 2e-16 ***
## genderMale 13.0232 20.0964 0.648 0.5170
## SeniorCitizenYes -2.0098 29.3391 -0.069 0.9454
## PartnerYes 15.5144 24.2969 0.639 0.5232
## DependentsYes -30.4144 25.6634 -1.185 0.2360
## tenure 59.6912 0.6939 86.028 < 2e-16 ***
## PhoneServiceYes 608.8421 47.8707 12.718 < 2e-16 ***
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 269.0606 25.3577 10.611 < 2e-16 ***
## InternetServiceFiber optic 876.2143 44.8250 19.547 < 2e-16 ***
## InternetServiceNo -422.7149 51.7083 -8.175 3.75e-16 ***
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes 372.4142 27.2326 13.675 < 2e-16 ***
## OnlineBackupNo internet service NA NA NA NA
## OnlineBackupYes 440.7410 25.8659 17.039 < 2e-16 ***
## DeviceProtectionNo internet service NA NA NA NA
## DeviceProtectionYes 361.4424 26.6921 13.541 < 2e-16 ***
## TechSupportNo internet service NA NA NA NA
## TechSupportYes 359.9608 27.8931 12.905 < 2e-16 ***
## StreamingTVNo internet service NA NA NA NA
## StreamingTVYes 463.2455 29.4629 15.723 < 2e-16 ***
## StreamingMoviesNo internet service NA NA NA NA
## StreamingMoviesYes 440.1196 29.8074 14.765 < 2e-16 ***
## ContractOne year -28.4688 31.5682 -0.902 0.3672
## ContractTwo year -201.4450 37.9996 -5.301 1.20e-07 ***
## PaperlessBillingYes -25.7778 22.4349 -1.149 0.2506
## PaymentMethodCredit card (automatic) 16.3421 31.6147 0.517 0.6052
## PaymentMethodElectronic check -73.3124 29.5553 -2.481 0.0132 *
## PaymentMethodMailed check 184.0511 33.1615 5.550 3.01e-08 ***
## MonthlyCharges 0.6941 1.4383 0.483 0.6294
## ChurnYes -176.9107 26.7572 -6.612 4.21e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 697.1 on 4811 degrees of freedom
## Multiple R-squared: 0.9062, Adjusted R-squared: 0.9057
## F-statistic: 2020 on 23 and 4811 DF, p-value: < 2.2e-16
Coefficients: The sign of each coefficient indicates the direction of the relationship (positive or negative).
Multiple R-squared: Measures the proportion of variance in the response variable explained by the model.
The OLS model identified 7 singularities, resulting in NA coefficients. This indicates redundant information among the categorical features (e.g., repeated encoding of “No internet service” across multiple variables).
Surprisingly, MonthlyCharges was found to be statistically
insignificant (p = 0.63). This is a classic symptom of
multicollinearity: the variance in TotalCharges is already
captured by individual service-related features (e.g., Fiber optic,
Streaming services), rendering the aggregate MonthlyCharges
variable redundant in the linear model.
lm_pred <- predict(lm_model, test_data)
lm_rmse <- RMSE(lm_pred, test_data$TotalCharges)
lm_r2 <- R2(lm_pred, test_data$TotalCharges)
cat("Linear Regression Results - RMSE:", lm_rmse," R-squared:", lm_r2)
## Linear Regression Results - RMSE: 702.46 R-squared: 0.9033625
1.High R-squared (0.90): The model explains over 90% of the variance,
confirming that Tenure and MonthlyCharges are
dominant drivers. 2.Singularities: The summary output shows NA
coefficients for several features (e.g., OnlineSecurity, No internet
service). This indicates Perfect Multicollinearity—redundant information
exists in the categorical features. 3.Insignificance of Key Variables:
Surprisingly, MonthlyCharges was deemed statistically
insignificant in some iterations. This is a symptom of
multicollinearity, where the variance is “stolen” by specific service
features (Fiber optic, Streaming, etc.).
To address the multicollinearity and singularities observed in the linear model, we implement Lasso Regression. Lasso introduces a penalty term that can shrink coefficients of redundant variables to zero, effectively performing feature selection.
Lasso regression does not accept data frames and requires matrix inputs. The model.matrix() function automatically converts categorical variables into dummy (0/1) variables.
x_train <- model.matrix(TotalCharges ~ ., train_data)[,-1]
y_train <- train_data$TotalCharges
x_test <- model.matrix(TotalCharges ~ ., test_data)[,-1]
y_test <- test_data$TotalCharges
Setting alpha = 1 specifies Lasso regression.
lasso_cv_model <- cv.glmnet(x_train, y_train, alpha = 1)
best_lambda <- lasso_cv_model$lambda.min
plot(lasso_cv_model)
Variables with coefficients equal to 0 are removed by Lasso.
coef(lasso_cv_model, s = best_lambda)
## 31 x 1 sparse Matrix of class "dgCMatrix"
## s=3.673108
## (Intercept) -1412.893131
## genderMale 6.393781
## SeniorCitizenYes .
## PartnerYes 4.070882
## DependentsYes -16.756789
## tenure 59.254420
## PhoneServiceYes 561.731648
## MultipleLinesNo phone service .
## MultipleLinesYes 259.866634
## InternetServiceFiber optic 831.826839
## InternetServiceNo -366.246488
## OnlineSecurityNo internet service .
## OnlineSecurityYes 360.192889
## OnlineBackupNo internet service -2.306222
## OnlineBackupYes 430.873206
## DeviceProtectionNo internet service -5.200375
## DeviceProtectionYes 351.128156
## TechSupportNo internet service -4.726654
## TechSupportYes 344.113597
## StreamingTVNo internet service .
## StreamingTVYes 442.272965
## StreamingMoviesNo internet service -4.395051
## StreamingMoviesYes 418.977773
## ContractOne year .
## ContractTwo year -160.033712
## PaperlessBillingYes -13.169609
## PaymentMethodCredit card (automatic) 3.159533
## PaymentMethodElectronic check -72.498665
## PaymentMethodMailed check 163.678116
## MonthlyCharges 2.421755
## ChurnYes -167.235252
lasso_pred <- predict(lasso_cv_model, s = best_lambda, newx = x_test)
lasso_rmse <- RMSE(lasso_pred, y_test)
lasso_r2 <- R2(lasso_pred, y_test)
cat("Lasso Regression Results - RMSE:", lasso_rmse, " R-squared:", lasso_r2)
## Lasso Regression Results - RMSE: 702.6844 R-squared: 0.9032689
SeniorCitizen or
ContractOneYear in some splits), making the model more
interpretable and less prone to overfitting.We visualize the actual versus predicted values for the Lasso model to assess fit.
plot_data <- data.frame(
Actual = test_data$TotalCharges,
Predicted = as.vector(lasso_pred)
)
library(ggplot2)
ggplot(plot_data, aes(x = Actual, y = Predicted)) +
geom_point(color = "#00AFBB", alpha = 0.5) + # Scatter plot
geom_abline(intercept = 0, slope = 1, color = "red", size = 1) + # 45-degree reference line
labs(
title = "Lasso Regression: Actual vs Predicted Total Charges",
subtitle = paste("R-squared =", round(lasso_r2, 3), "| RMSE =", round(lasso_rmse, 2)),
x = "Actual Total Charges ($)",
y = "Predicted Total Charges ($)"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The scatter plot above visualizes the performance of our final Lasso Regression model on the test dataset:
1.The Red Line (Ideal Fit): The diagonal red line represents the line of perfect prediction (\(y = x\)). If the model were 100% accurate, every single blue dot would lie exactly on this line.
2.The Blue Dots (Predictions): Each dot represents a customer in the test set.
Strong Correlation: The points are tightly clustered around the red reference line, forming a clear diagonal trend. This visually confirms our high R-squared value of ~0.90, indicating that the model successfully captures the underlying linear relationship between customer attributes and their total bill.
Error Distribution: The vertical distance between a dot and the red line represents the prediction error (Residual). While most points are close to the line, the slight dispersion (scatter) corresponds to our RMSE of ~702, which reflects the natural variance in customer usage that cannot be perfectly explained by a linear model alone.
The visual alignment confirms that the Lasso model is a reliable tool for forecasting customer value, with no obvious patterns of bias (e.g., consistently overestimating or underestimating).
While Multiple Linear Regression and Lasso Regression assume a linear relationship between features and the target, customer billing data often contains non-linear patterns. To capture these complexities, we employ Random Forest Regression, an ensemble method that averages predictions from multiple decision trees.
library(caret)
library(readr)
telco_df <- read_csv("telco_churn_cleaned.csv")
## Rows: 6903 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): customerID, gender, SeniorCitizen, Partner, Dependents, PhoneServi...
## dbl (3): tenure, MonthlyCharges, TotalCharges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(123)
train_index <- createDataPartition(
y = telco_df$TotalCharges,
p = 0.7,
list = FALSE
)
train_data <- telco_df[train_index, ]
test_data <- telco_df[-train_index, ]
We use the randomForest package. Unlike Lasso, Random
Forest can handle categorical factors directly without needing a model
matrix and is robust to outliers.
library(randomForest)
# Set seed for reproducibility
set.seed(123)
# Train the Random Forest Regression Model
# ntree = 500: We build 500 trees
# importance = TRUE: To calculate variable importance
rf_reg_model <- randomForest(TotalCharges ~ .,
data = train_data,
ntree = 500,
importance = TRUE)
print(rf_reg_model)
##
## Call:
## randomForest(formula = TotalCharges ~ ., data = train_data, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 33893.25
## % Var explained: 99.34
rf_reg_pred <- predict(rf_reg_model, test_data)
rf_rmse <- RMSE(rf_reg_pred, test_data$TotalCharges)
rf_r2 <- R2(rf_reg_pred, test_data$TotalCharges)
cat("Random Forest Regression Results - RMSE:", rf_rmse, " R-squared:", rf_r2)
## Random Forest Regression Results - RMSE: 187.2001 R-squared: 0.9941287
One of the advantages of Random Forest Regression is the ability to
see which variables drive the prediction. We visualize the importance of
features in predicting TotalCharges.
varImpPlot(rf_reg_model, main = "Feature Importance: Predictors of Total Charges")
The plot highlights two importance metrics: %IncMSE (accuracy loss if variable is excluded) and IncNodePurity. Unsurprisingly, Tenure and MonthlyCharges are the dominant predictors. However, Random Forest allows us to see if other variables like Contract or InternetService play a significant role in reducing prediction error through non-linear interactions.
We visualize the predictions of the Random Forest Regression as a representative of non-linear models against the actual values to check for fit and bias.
rf_plot_data <- data.frame(
Actual = test_data$TotalCharges,
Predicted = as.vector(rf_reg_pred)
)
ggplot(rf_plot_data, aes(x = Actual, y = Predicted)) +
geom_point(color = "darkgreen", alpha = 0.4) + # Scatter plot
geom_abline(intercept = 0, slope = 1, color = "red", size = 1) + # 45-degree reference line
labs(
title = "Random Forest: Actual vs Predicted Total Charges",
subtitle = paste("R-squared =", round(rf_r2, 3), "| RMSE =", round(rf_rmse, 2)),
x = "Actual Total Charges ($)",
y = "Predicted Total Charges ($)"
) +
theme_minimal()
To finalize our regression analysis, we compare the performance of all three models: Multiple Linear Regression (Baseline), Lasso Regression (Regularization), and Random Forest (Non-linear Ensemble).
We aggregate the RMSE and R-squared metrics for all three models to identify the best performer.
model_comparison <- data.frame(
Model = c("Linear Regression (OLS)", "Lasso Regression", "Random Forest Regression"),
RMSE = c(lm_rmse, lasso_rmse, rf_rmse),
R_Squared = c(lm_r2, lasso_r2, rf_r2)
)
model_comparison <- model_comparison[order(model_comparison$RMSE), ]
print(model_comparison)
## Model RMSE R_Squared
## 3 Random Forest Regression 187.2001 0.9941287
## 1 Linear Regression (OLS) 702.4600 0.9033625
## 2 Lasso Regression 702.6844 0.9032689
In this project, we aimed to predict TotalCharges using
three distinct algorithms: 1. Multiple Linear Regression (OLS): Provided
a strong baseline with an \(R^2\) of
~0.90 but suffered from singularities due to perfect multicollinearity
among the categorical variables. 2. Lasso Regression: Successfully
handled the multicollinearity by shrinking redundant coefficients to
zero. It maintained high accuracy while offering a cleaner, more
interpretable set of features. 3. Random Forest Regression: Added the
ability to model non-linear relationships. Recommendation: If the Random
Forest Regression achieves a significantly lower RMSE than Lasso
Regression, it indicates that complex, non-linear interactions (e.g.,
how specific contracts affect billing over time) are crucial. However,
if the performance is similar, Lasso Regression is often preferred for
business deployment because it provides a clear mathematical formula
(coefficients) that is easier to interpret for stakeholders than the
“black box” nature of Random Forest Regression.
Based on our analysis, Tenure and
MonthlyCharges are such strong linear drivers of the total
billing charges that the linear models such as Lasso Regression perform
exceptionally well, making them the most efficient choice for this
specific problem.
Customer retention should focus on month-to-month customers, fiber optic users, customers in first 6 months, customers without tech support or online security and customers using electronic check. To address these risks, the telecommunication company can:
Redesign Contract and Pricing Strategies Incentivize long-term commitments through discounted monthly fees and free value-added services such as TechSupport and OnlineSecurity. Introducing hybrid contract options (e.g., flexible 6-month plans) can ease the transition for month-to-month customers who may be hesitant to commit long term.
Strengthen Early-Life Customer Engagement Implement structured 90-day and 180-day retention programs that include proactive check-ins, welcome calls or emails, and personalized usage tips. Offering temporary discounts or service upgrades during early tenure can improve satisfaction and reduce early churn.
Bundle Value-Added Services Increase perceived value by creating default service bundles that include TechSupport and OnlineSecurity. Free or low-risk trial periods for add-on services can help customers experience their benefits before being charged, increasing adoption and long-term retention.
Improve Billing and Payment Experience Encourage enrollment in automatic payment methods through small incentives and clear communication. Reducing reliance on electronic checks by promoting digital and automated payment options can lower friction, improve payment reliability, and decrease churn risk.
As discussed, while all three models achieve similar accuracy (~80%), they each have their own strength. By aligning data-driven insights with customer-centric strategies, the telecommunications company can reduce churn, stabilize revenue and increase customer lifetime value, transforming analytics into a competitive advantage.
On the other hand, the regression analysis demonstrates that the models explain approximately 90% of total billing variation, driven far more by how long a customer stays and which services they subscribe to. Total billing is cumulative, meaning long-tenured customers with richer service bundles generate disproportionately higher lifetime revenue. Despite intuition, monthly charges becomes statistically insignificant once individual services are accounted for. This indicates that customers do not respond to the aggregate bill as much as to what they are paying for such as fiber optic, streaming and security.
Lasso Regression removed redundant features like duplicated “No internet service” indicators, isolating the true revenue-contributing services. This produces a cleaner and more reliable forecast model because not all customer attributes contribute equally to revenue—some add noise without adding insight. Although Multiple Linear Regression and Lasso Regression achieved similar accuracy, Multiple Linear Regression suffered from multicollinearity and unstable coefficients making it risky for business interpretation whereas Lasso Regression provided similar accuracy with better stability and interpretability. An RMSE of ~702 relative to high total charges indicates the model is accurate enough for the telecommunication company to make more informed decisions around retention strategy, service bundling, revenue forecasting and customer value management, ultimately driving sustainable and predictable revenue growth.