Executive Summary This project presents a dual-layered analytical approach to the telecommunications industry, bridging the gap between Customer Retention (Classification) and Pricing Integrity (Regression). By analyzing 7,000+ customer records, the project developed a predictive framework that identifies high-risk churners (AUC 0.84) and an auditing model that explains 99.88% of pricing variance.
Team Contributions & Division of Labor The project operated under a structured Data Science Lifecycle, ensuring every phase—from raw data to strategic insight—was handled with technical precision:
TIANBOYI (25069211-Team Leader): Integrate project.Defined the research framework and core business questions. Responsible for global quality control, code optimization, and the final synthesis of the R Markdown report. Authored the strategic introduction and executive conclusions.
YE MINGMIN (24070972 - Data Engineer): Data Preprocessing. Executed end-to-end data cleaning, handling missing values and feature engineering to ensure a high-quality “Single Source of Truth” for modeling.
MA LING (24206583 - Data Analyst): Exploratory Data Analysis (EDA). Conducted multi-dimensional visualization and statistical profiling to uncover hidden correlations between customer demographics and churn behavior.
YU MINGJUN (24222016 - Modeling Specialist): Regression Analysis. Built and validated the Linear Regression models to deconstruct the deterministic pricing structure of monthly charges.
YANG YICHEN (24211050 - ML Specialist): Classification Modeling. Implemented and tuned Logistic Regression and Random Forest algorithms to predict customer churn, focused on maximizing predictive power and model interpretability.
Key Outcomes 1.Churn Prediction: Achieved a robust AUC of 0.84, identifying Fiber Optic users and month-to-month contract holders as high-risk segments.
2.Pricing Audit: Confirmed a highly consistent billing logic with an Adjusted R² of 0.9988, proving that charges are strictly driven by service components rather than demographic bias.
3.Strategic Roadmap: Provided data-driven recommendations for service-based bundling and targeted retention for high-value customers.
4.Technical Stack: R (tidyverse, caret, randomForest, ggplot2, broom)
This project aims to address one of the most critical challenges in the telecommunications industry: Customer Churn. This problem has been approached from two distinct analytical angles:
-Classification: Predicting which customers are likely to leave based on their behavior and service profiles.
-Regression: Conducting a pricing audit to understand exactly how the service components drive the MonthlyCharges.
By combining these two methods, the project provides the business with actionable strategies for both retention and revenue optimization.
Before any analysis could begin, the project performed a rigorous data cleaning process to ensure the integrity of the results.
As the first stage of the group project, the primary objective was to transform the raw telecommunications dataset into a high-quality, analysis-ready format. The project was focus on ensuring data integrity, handling missing values, and correctly encoding variables to support both the classification and regression tasks that follow.
To maintain consistency across all team members’ environments, the project utilized the tidyverse suite for data manipulation. The raw dataset, consisting of over 7,000 customer records, was imported for initial inspection.
# 1. Load required packages
library(tidyverse)
# 2. Read the raw data
# Ensuring the raw file is in the working directory
df <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
# Preview the first few rows
head(df)## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service DSL No Yes No
## 2 No DSL Yes No Yes
## 3 No DSL Yes Yes No
## 4 No phone service DSL Yes No Yes
## 5 No Fiber optic No No No
## 6 Yes Fiber optic No No Yes
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No No Month-to-month Yes
## 2 No No No One year No
## 3 No No No Month-to-month Yes
## 4 Yes No No One year No
## 5 No No No Month-to-month Yes
## 6 No Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 1 Electronic check 29.85 29.85 No
## 2 Mailed check 56.95 1889.50 No
## 3 Mailed check 53.85 108.15 Yes
## 4 Bank transfer (automatic) 42.30 1840.75 No
## 5 Electronic check 70.70 151.65 Yes
## 6 Electronic check 99.65 820.50 Yes
A comprehensive scan of the data structure was performed. The project identified that the Total Charges column was improperly read as a character type due to empty strings, and several categorical variables needed to be converted into factors for proper modeling.
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
## Dataset Dimensions: 7043 rows and 21 columns.
To prepare the data, the project executed a rigorous cleaning pipeline:
-Feature Selection: The project removed the customer ID column as it provides no predictive value.
-Type Conversion: Total Charges was converted to numeric.
-Handling Missing Values: The project identified 11 missing values in Total Charges (occurring where tenure was 0). It decided to remove these records to ensure model accuracy.
-Categorical Encoding: All categorical character columns and the Senior Citizen indicator were converted into factors.
# 1. Remove unnecessary ID column
df <- df %>% select(-customerID)
# 2. Fix TotalCharges and handle NAs
df <- df %>%
mutate(TotalCharges = as.numeric(as.character(TotalCharges))) %>%
drop_na()
# 3. Factorize categorical variables
# The project grouped variables to ensure consistent encoding
df <- df %>%
mutate(across(
c(gender, SeniorCitizen, Partner, Dependents, PhoneService,
MultipleLines, InternetService, OnlineSecurity, OnlineBackup,
DeviceProtection, TechSupport, StreamingTV, StreamingMovies,
Contract, PaperlessBilling, PaymentMethod, Churn),
as.factor
))
# 4. Final validation of numeric scales
df <- df %>%
mutate(
tenure = as.numeric(tenure),
MonthlyCharges = as.numeric(MonthlyCharges)
)Before handing off the data to the EDA and Modeling teams, the project performed a final quality check to confirm there are no remaining missing values and that all factor levels are correctly assigned.
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
## 'data.frame': 7032 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
Finally, the project exported this version as data_cleaned.csv to serve as the “Single Source of Truth” for all remaining project parts.
# --- Export Data ---
# 1. Define output path
export_path <- "data_cleaned.csv"
# 2. Create folder if it doesn't exist
if (!dir.exists(dirname(export_path))) {
dir.create(dirname(export_path), recursive = TRUE)
}
# 3. Try to save data; if it fails (e.g., file open in Excel), just show a warning
# This prevents the entire HTML knitting process from stopping.
tryCatch({
write.csv(df, export_path, row.names = FALSE)
message("Success: Data saved to ", export_path)
}, error = function(e) {
warning("Could not save CSV. Please close 'data_cleaned.csv' in Excel and try again.")
})After the initial data cleaning, the project focused on a deep dive into the dataset to understand the “signals” behind customer churn. The objective of this EDA is to investigate the data structure, assess quality issues, and identify which demographic and service attributes have the strongest correlation with Churn. These insights serve as the empirical foundation for the subsequent classification and regression models.
The project selected a specialized toolset—ggplot2 for visualization and skimr for detailed statistical summaries—to ensure the analysis is both visually clear and statistically sound.
```{rload-libraries} library(dplyr) library(ggplot2) library(tidyr) library(skimr)
telco <- read.csv(“WA_Fn-UseC_-Telco-Customer-Churn.csv”, stringsAsFactors = FALSE)
## 3.3 Comprehensive Data Diagnostics
The project began by validating the dimensions and structure of the dataset to confirm the scope of the analysis.
``` r
# --- Comprehensive Data Diagnostics ---
# Check the internal structure of the dataframe
# The project used 'df' as defined in Section 1
str(df)
## 'data.frame': 7032 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
## gender SeniorCitizen Partner Dependents tenure PhoneService
## Female:3483 0:5890 No :3639 No :4933 Min. : 1.00 No : 680
## Male :3549 1:1142 Yes:3393 Yes:2099 1st Qu.: 9.00 Yes:6352
## Median :29.00
## Mean :32.42
## 3rd Qu.:55.00
## Max. :72.00
## MultipleLines InternetService OnlineSecurity
## No :3385 DSL :2416 No :3497
## No phone service: 680 Fiber optic:3096 No internet service:1520
## Yes :2967 No :1520 Yes :2015
##
##
##
## OnlineBackup DeviceProtection
## No :3087 No :3094
## No internet service:1520 No internet service:1520
## Yes :2425 Yes :2418
##
##
##
## TechSupport StreamingTV
## No :3472 No :2809
## No internet service:1520 No internet service:1520
## Yes :2040 Yes :2703
##
##
##
## StreamingMovies Contract PaperlessBilling
## No :2781 Month-to-month:3875 No :2864
## No internet service:1520 One year :1472 Yes:4168
## Yes :2731 Two year :1685
##
##
##
## PaymentMethod MonthlyCharges TotalCharges Churn
## Bank transfer (automatic):1542 Min. : 18.25 Min. : 18.8 No :5163
## Credit card (automatic) :1521 1st Qu.: 35.59 1st Qu.: 401.4 Yes:1869
## Electronic check :2365 Median : 70.35 Median :1397.5
## Mailed check :1604 Mean : 64.80 Mean :2283.3
## 3rd Qu.: 89.86 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
A critical part of the workflow was identifying potential data gaps. The project discovered that Total Charges contained several missing values that required attention, while categorical variables remained highly consistent.
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
# Identifying unique entries in categorical fields
sapply(df %>% select_if(is.character), n_distinct)## named list()
The project explored the distribution of key variables to establish a baseline for the customer profiles.
The project observed that the dataset is imbalanced, with a significant majority of customers currently staying with the provider.
ggplot(df, aes(x = Churn, fill = Churn)) +
geom_bar() +
labs(title = "Baseline Distribution of Customer Churn", x = "Churn Status", y = "Total Count") +
theme_minimal()The project analyzed the density of charges to see the common billing tiers and investigated how long customers typically stay with the company.
# Density of Monthly Charges
ggplot(df, aes(x = MonthlyCharges)) +
geom_density(fill = "steelblue", alpha = 0.6) +
labs(title = "Density Distribution of Monthly Charges", x = "Monthly Charges ($)")# Histogram of Tenure
ggplot(df, aes(x = tenure)) +
geom_histogram(binwidth = 5, fill = "orange", color = "white") +
labs(title = "Distribution of Customer Tenure", x = "Months of Tenure", y = "Frequency")The project shifted the focus to how specific variables interact with the churn event.
The following visualization confirms the project’s hypothesis: customers who churn tend to have higher monthly charges compared to those who remain.
ggplot(df, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
geom_boxplot() +
labs(title = "Monthly Charges Comparison by Churn Status", y = "Monthly Charges ($)")The boxplot reveals that long-term tenure is a strong “moat” against churn. Customers who leave generally do so early in their lifecycle.
ggplot(df, aes(x = Churn, y = tenure, fill = Churn)) +
geom_boxplot() +
labs(title = "Tenure Duration Comparison by Churn Status", y = "Tenure (Months)")The project examined how service types and contract terms influence a customer’s decision to stay or leave.
The project identified Contract Type as perhaps the strongest predictor in the dataset. Month-to-month users are at a much higher risk of leaving.
ggplot(df, aes(x = Contract, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = "Churn Proportion by Contract Type", y = "Proportion") +
theme_minimal()The project also looked at how specific technologies and billing methods correlate with churn.
# Internet Service Type
ggplot(df, aes(x = InternetService, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = "Churn Proportion by Internet Service", y = "Proportion")# Payment Method
ggplot(df, aes(x = PaymentMethod, fill = Churn)) +
geom_bar(position = "fill") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Churn Proportion by Payment Method", y = "Proportion")Based on the exploratory results, several high-priority features for the next phase has been identified:
-High Risk Factors: Month-to-month contracts, Fiber Optic internet, and Electronic check payments are strongly associated with higher churn rates.
-Protective Factors: One- and Two-year contracts, and longer tenure (loyalty), significantly reduce the likelihood of churn.
-Predictive Potential: Monthly Charges and Tenure are the most significant continuous variables to include in the predictive classification model.
This EDA phase has successfully provided a clear understanding of the customer landscape. By identifying these critical drivers, the project has established a solid analytical foundation for building the Classification and Regression models in the following sections.
Research Question: “Can we build a reliable model to identify high-risk customers based on their service usage and demographic profiles?”
Building on the insights from the Exploratory Data Analysis (EDA), the project moved to the core predictive task of this project. The primary question the chapter aims to answer is: “Can we accurately predict customer churn using demographic characteristics and consumption behavior?”
To ensure the robustness of the predictions, wo distinct supervised learning models were decided to train and compare:
Logistic Regression: Serving as the baseline model due to its high interpretability and efficiency.
Random Forest: Employed as an advanced ensemble method to capture non-linear relationships and complex interactions between variables that a linear model might miss.
To maintain consistency across the workflow, the data_cleaned.csv generated in Part 1 was utilized . The project applied a 70/30 split to create independent training and testing sets, ensuring that the model evaluations reflect real-world performance.
library(caret)
library(randomForest)
library(pROC)
library(dplyr)
# Load the cleaned dataset
model_data <- read.csv("data_cleaned.csv", stringsAsFactors = TRUE)
# Ensure the target variable is a factor
model_data$Churn <- factor(model_data$Churn, levels = c("Yes", "No"))
# Data Partitioning (70% Train, 30% Test)
set.seed(2025)
trainIndex <- createDataPartition(model_data$Churn, p = 0.7, list = FALSE)
trainData <- model_data[trainIndex, ]
testData <- model_data[-trainIndex, ]
cat("Training samples:", nrow(trainData), "\n")## Training samples: 4924
## Testing samples: 2108
The project implemented 5-fold cross-validation to ensure the results are generalizable and to prevent overfitting.
fitControl <- trainControl (method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# model1-Logistic Regression
set.seed (2025)
logit_model <- train (Churn ~ .,
data = trainData,
method = "glm",
family = "binomial",
trControl = fitControl,
metric = "ROC")
# model2-Random Forest
set.seed (2025)
rf_model <- train (Churn ~ .,
data = trainData,
method = "rf",
trControl = fitControl,
metric = "ROC",
tuneLength = 3) The project evaluated both models using AUC (Area Under the Curve) and confusion matrices. A higher AUC indicates stronger discrimination ability.
# ROC
prob_logit <- predict (logit_model, testData, type = "prob")
prob_rf <- predict (rf_model, testData, type = "prob")
# predict classfication
pred_logit <- predict (logit_model, testData)
pred_rf <- predict (rf_model, testData)
cm_logit <- confusionMatrix (pred_logit, testData$Churn, mode = "everything")
cm_rf <- confusionMatrix (pred_rf, testData$Churn, mode = "everything")
print (cm_logit)## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 304 147
## No 256 1401
##
## Accuracy : 0.8088
## 95% CI : (0.7914, 0.8254)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 7.089e-16
##
## Kappa : 0.4776
##
## Mcnemar's Test P-Value : 7.454e-08
##
## Sensitivity : 0.5429
## Specificity : 0.9050
## Pos Pred Value : 0.6741
## Neg Pred Value : 0.8455
## Precision : 0.6741
## Recall : 0.5429
## F1 : 0.6014
## Prevalence : 0.2657
## Detection Rate : 0.1442
## Detection Prevalence : 0.2139
## Balanced Accuracy : 0.7239
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 252 106
## No 308 1442
##
## Accuracy : 0.8036
## 95% CI : (0.786, 0.8204)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 6.717e-14
##
## Kappa : 0.4312
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4500
## Specificity : 0.9315
## Pos Pred Value : 0.7039
## Neg Pred Value : 0.8240
## Precision : 0.7039
## Recall : 0.4500
## F1 : 0.5490
## Prevalence : 0.2657
## Detection Rate : 0.1195
## Detection Prevalence : 0.1698
## Balanced Accuracy : 0.6908
##
## 'Positive' Class : Yes
##
The ROC curve shows model performance under different thresholds. Both curves on the same graph for direct comparison were plotted.
roc_logit <- roc (testData$Churn, prob_logit$Yes, levels = c ("No", "Yes"), direction = "<")
roc_rf <- roc (testData$Churn, prob_rf$Yes, levels = c ("No", "Yes"), direction = "<")
# draw
plot (roc_logit, col = "#E69F00", lwd = 2, main = "ROC Curve Comparison: Logit vs RF",
print.auc = TRUE, print.auc.y = 0.4, legacy.axes = TRUE)
plot (roc_rf, col = "#56B4E9", lwd = 2, add = TRUE,
print.auc = TRUE, print.auc.y = 0.3)
legend ("bottomright", legend = c ("Logistic Regression", "Random Forest"),
col = c ("#E69F00", "#56B4E9"), lwd = 2)Interpretation—Both models show strong predictive performance, with AUC values above 0.80. Logistic Regression achieved an AUC of approximately 0.84, slightly higher than Random Forest. This suggests that, for this dataset, a simple linear model is already sufficient.
To understand the “why” behind customer churn, the variable importance derived from the Random Forest model is examined. This allows it to move beyond simple prediction into actionable business intelligence.
# Visualizing top predictors
varImpPlot(rf_model$finalModel, main = "Top Predictors of Churn (Random Forest)")The analysis strongly confirms several critical drivers identified during EDA:
-Tenure is Paramount: New customers represent the highest risk; churn probability drops significantly as customer loyalty (tenure) increases.
-Price Sensitivity: Financial variables rank high in importance, indicating that cost-related pressure is a primary reason for leaving.
-Contractual Commitment: Customers on month-to-month plans are significantly more volatile than those on one- or two-year contracts.
-The Fiber Optic Paradox: Despite higher speeds, Fiber Optic users show higher churn rates, suggesting a potential gap between service cost and perceived value.
Based on the modeling results, the following conclusions were drew:
-Model Selection: Logistic Regression was recommended for immediate deployment. It delivered strong, competitive performance (AUC ≈ 0.84) with the added benefit of low computational cost and clear interpret ability for stakeholders.
-Retention Strategy: Management should focus retention efforts on the first 0–12 months of the customer lifecycle. Specifically, implementing incentive programs to migrate users from month-to-month plans to long-term contracts could substantially reduce overall churn risk.
Research Question: “How do individual service subscriptions quantitatively influence Monthly Charges after controlling for account profiles?”
While the previous section focused on predicting the behavior of customers (Churn), this chapter now shifts focus to auditing the financial structure of the business. The aim of this section is to quantify how various service subscriptions and customer attributes influence the monthly fee charged. By constructing and interpreting a Linear Regression model, this chapter provides the business with transparent pricing insights and identifies the primary levers for revenue optimization.
The analysis uses the pre-cleaned Telco customer churn dataset (data_cleaned.csv), which contains records of approximately 7,000 customers. Each observation describes demographics, contract details, service subscriptions, and billing information. Since the data have already been cleaned by previous parts, the preparation focuses on correctly encoding categorical variables.
library(tidyverse)
library(car)
library(Metrics)
# Load the cleaned data
telco <- read_csv('data_cleaned.csv')
# Encoding categorical variables as factors
telco <- telco %>%
mutate(across(c(gender, SeniorCitizen, Partner, Dependents, PhoneService,
MultipleLines, InternetService, OnlineSecurity, OnlineBackup,
DeviceProtection, TechSupport, StreamingTV, StreamingMovies,
Contract, PaperlessBilling, PaymentMethod, Churn), as.factor))
# Ensuring numerical columns are correctly typed
telco <- telco %>% mutate(across(c(tenure, MonthlyCharges, TotalCharges), as.numeric))
glimpse(telco)## Rows: 7,032
## Columns: 20
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
Key variables include tenure, Contract, and a series of binary service indicators such as PhoneService, MultipleLines, and InternetService. The MonthlyCharges variable is continuous and serves as the dependent variable in the regression models.
Before fitting any models, the project explore the distribution of monthly charges and their relationship with tenure. The average customer pays around RM 65 per month, with a range between RM 18 and RM 119.
summary_stats <- summary(telco$MonthlyCharges)
summary_table <- tibble(
Statistic = c("Min", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max"),
Value = as.numeric(summary_stats)
)
knitr::kable(summary_table, digits = 2, caption = "Summary of Monthly Charges")| Statistic | Value |
|---|---|
| Min | 18.25 |
| 1st Qu. | 35.59 |
| Median | 70.35 |
| Mean | 64.80 |
| 3rd Qu. | 89.86 |
| Max | 118.75 |
The project also examined the correlation between a customer’s tenure and their monthly fee.
ggplot(telco, aes(x = tenure, y = MonthlyCharges)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, colour = "blue") +
labs(title = "Relationship Between Tenure and Monthly Charges",
x = "Customer Tenure (months)", y = "Monthly Charges") +
theme_minimal()##
## Pearson's product-moment correlation
##
## data: telco$tenure and telco$MonthlyCharges
## t = 21.359, df = 7030, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2247854 0.2686849
## sample estimates:
## cor
## 0.2468618
The correlation coefficient is relatively low (~0.25) but statistically significant. This indicates a modest positive association: customers with longer tenure tend to pay slightly higher fees, likely due to additional service subscriptions over time.
The project fits two multiple linear regression models to understand the pricing structure:
1.Full model: Includes demographics, contract details, billing setup, and all service indicators.
2,Pricing model: Focuses on service subscription variables while controlling for tenure and Contract.
full_model <- lm(MonthlyCharges ~ gender + SeniorCitizen + Partner + Dependents +
tenure + PhoneService + MultipleLines + InternetService +
OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
StreamingTV + StreamingMovies + Contract +
PaperlessBilling + PaymentMethod,
data = telco)
summary(full_model)##
## Call:
## lm(formula = MonthlyCharges ~ gender + SeniorCitizen + Partner +
## Dependents + tenure + PhoneService + MultipleLines + InternetService +
## OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
## StreamingTV + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod, data = telco)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2296 -0.6166 -0.0055 0.6079 4.8413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.496e+01 5.825e-02 428.509 <2e-16
## genderMale 2.492e-02 2.447e-02 1.018 0.309
## SeniorCitizenYes 1.434e-02 3.560e-02 0.403 0.687
## PartnerYes -4.216e-02 2.959e-02 -1.425 0.154
## DependentsYes 1.416e-02 3.140e-02 0.451 0.652
## tenure 1.455e-04 8.376e-04 0.174 0.862
## PhoneServiceYes 2.006e+01 4.815e-02 416.574 <2e-16
## MultipleLinesYes 5.019e+00 2.954e-02 169.908 <2e-16
## InternetServiceFiber optic 2.496e+01 3.488e-02 715.626 <2e-16
## InternetServiceNo -2.505e+01 4.874e-02 -513.867 <2e-16
## OnlineSecurityYes 5.013e+00 3.217e-02 155.850 <2e-16
## OnlineBackupYes 4.996e+00 3.023e-02 165.278 <2e-16
## DeviceProtectionYes 5.020e+00 3.131e-02 160.305 <2e-16
## TechSupportYes 5.033e+00 3.279e-02 153.491 <2e-16
## StreamingTVYes 9.970e+00 3.205e-02 311.106 <2e-16
## StreamingMoviesYes 9.966e+00 3.207e-02 310.727 <2e-16
## ContractOne year 1.019e-02 3.832e-02 0.266 0.790
## ContractTwo year -2.564e-02 4.648e-02 -0.552 0.581
## PaperlessBillingYes -2.071e-02 2.735e-02 -0.757 0.449
## PaymentMethodCredit card (automatic) 5.085e-05 3.710e-02 0.001 0.999
## PaymentMethodElectronic check -1.940e-02 3.636e-02 -0.534 0.594
## PaymentMethodMailed check -1.399e-02 3.950e-02 -0.354 0.723
##
## (Intercept) ***
## genderMale
## SeniorCitizenYes
## PartnerYes
## DependentsYes
## tenure
## PhoneServiceYes ***
## MultipleLinesYes ***
## InternetServiceFiber optic ***
## InternetServiceNo ***
## OnlineSecurityYes ***
## OnlineBackupYes ***
## DeviceProtectionYes ***
## TechSupportYes ***
## StreamingTVYes ***
## StreamingMoviesYes ***
## ContractOne year
## ContractTwo year
## PaperlessBillingYes
## PaymentMethodCredit card (automatic)
## PaymentMethodElectronic check
## PaymentMethodMailed check
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 7010 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 2.88e+05 on 21 and 7010 DF, p-value: < 2.2e-16
Many demographic variables have p-values above 0.05, while specific service indicators are highly significant. This supports the hypothesis that monthly charges are essentially the sum of base fees plus subscribed services.
The project verified that multi-collinearity is not inflating standard errors using Variance Inflation Factors (VIFs).
## GVIF Df GVIF^(1/(2*Df))
## gender 1.001769 1 1.000884
## SeniorCitizen 1.153168 1 1.073857
## Partner 1.462369 1 1.209284
## Dependents 1.380811 1 1.175079
## tenure 2.827609 1 1.681549
## PhoneService 1.354768 1 1.163945
## MultipleLines 1.423682 1 1.193182
## InternetService 3.783812 2 1.394705
## OnlineSecurity 1.415235 1 1.189637
## OnlineBackup 1.380823 1 1.175084
## DeviceProtection 1.480118 1 1.216601
## TechSupport 1.481353 1 1.217108
## StreamingTV 1.625973 1 1.275137
## StreamingMovies 1.634755 1 1.278575
## Contract 2.567229 2 1.265803
## PaperlessBilling 1.208329 1 1.099240
## PaymentMethod 1.580562 3 1.079283
Values are under commonly cited thresholds, suggesting stable estimated coefficients.
The project specifies a refined pricing model including only the variables most directly related to the monthly fee.
pricing_model <- lm(MonthlyCharges ~ tenure + Contract + PhoneService + MultipleLines +
InternetService + OnlineSecurity + OnlineBackup +
DeviceProtection + TechSupport + StreamingTV +
StreamingMovies,
data = telco)
summary(pricing_model)##
## Call:
## lm(formula = MonthlyCharges ~ tenure + Contract + PhoneService +
## MultipleLines + InternetService + OnlineSecurity + OnlineBackup +
## DeviceProtection + TechSupport + StreamingTV + StreamingMovies,
## data = telco)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2425 -0.6160 -0.0047 0.6063 4.8075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.495e+01 4.550e-02 548.265 <2e-16 ***
## tenure -4.821e-07 7.952e-04 -0.001 1.000
## ContractOne year 1.262e-02 3.792e-02 0.333 0.739
## ContractTwo year -2.299e-02 4.571e-02 -0.503 0.615
## PhoneServiceYes 2.006e+01 4.804e-02 417.534 <2e-16 ***
## MultipleLinesYes 5.017e+00 2.943e-02 170.454 <2e-16 ***
## InternetServiceFiber optic 2.495e+01 3.401e-02 733.750 <2e-16 ***
## InternetServiceNo -2.504e+01 4.760e-02 -526.106 <2e-16 ***
## OnlineSecurityYes 5.014e+00 3.194e-02 156.970 <2e-16 ***
## OnlineBackupYes 4.995e+00 3.019e-02 165.446 <2e-16 ***
## DeviceProtectionYes 5.020e+00 3.126e-02 160.597 <2e-16 ***
## TechSupportYes 5.034e+00 3.256e-02 154.609 <2e-16 ***
## StreamingTVYes 9.967e+00 3.190e-02 312.458 <2e-16 ***
## StreamingMoviesYes 9.964e+00 3.197e-02 311.698 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 7018 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 4.655e+05 on 13 and 7018 DF, p-value: < 2.2e-16
The Adjusted R² is very high (~0.9988), explaining over 99.8% of the variation in charges, confirming a deterministic pricing structure driven by selected services.
The project tabulates the coefficients for significant predictors (p < 0.05).
coef_df <- broom::tidy(pricing_model, conf.int = TRUE) %>%
filter(term != "(Intercept)") %>%
mutate(Effect = ifelse(estimate > 0, "Positive", "Negative")) %>%
select(term, estimate, conf.low, conf.high, p.value, Effect)
knitr::kable(coef_df, digits = 3,
col.names = c("Predictor", "Estimate", "CI Lower", "CI Upper", "p-value", "Effect"),
caption = "Coefficients for the pricing model (significant predictors).")| Predictor | Estimate | CI Lower | CI Upper | p-value | Effect |
|---|---|---|---|---|---|
| tenure | 0.000 | -0.002 | 0.002 | 1.000 | Negative |
| ContractOne year | 0.013 | -0.062 | 0.087 | 0.739 | Positive |
| ContractTwo year | -0.023 | -0.113 | 0.067 | 0.615 | Negative |
| PhoneServiceYes | 20.058 | 19.964 | 20.152 | 0.000 | Positive |
| MultipleLinesYes | 5.017 | 4.959 | 5.075 | 0.000 | Positive |
| InternetServiceFiber optic | 24.954 | 24.887 | 25.020 | 0.000 | Positive |
| InternetServiceNo | -25.043 | -25.136 | -24.950 | 0.000 | Negative |
| OnlineSecurityYes | 5.014 | 4.951 | 5.077 | 0.000 | Positive |
| OnlineBackupYes | 4.995 | 4.936 | 5.054 | 0.000 | Positive |
| DeviceProtectionYes | 5.020 | 4.959 | 5.082 | 0.000 | Positive |
| TechSupportYes | 5.034 | 4.971 | 5.098 | 0.000 | Positive |
| StreamingTVYes | 9.967 | 9.905 | 10.030 | 0.000 | Positive |
| StreamingMoviesYes | 9.964 | 9.901 | 10.026 | 0.000 | Positive |
The project quantified prediction error using RMSE and MAE.
pred <- predict(pricing_model, newdata = telco)
rmse_val <- sqrt(mean((telco$MonthlyCharges - pred)^2))The RMSE (~1.02) and MAE (~0.78) indicate that the model’s predictions are on average within about RM 1 of the actual monthly charges. Given that charges range from RM 18 to RM 118, such a small error underscores the deterministic nature of the pricing structure.
The diagnostic plot compares actual versus predicted charges.
ggplot(data.frame(Actual = telco$MonthlyCharges, Predicted = pred), aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.3) +
geom_abline(intercept = 0, slope = 1, colour = 'red', linetype = 'dashed') +
labs(title = 'Actual vs Predicted Monthly Charges',
x = 'Actual Monthly Charges',
y = 'Predicted Monthly Charges') +
theme_minimal()The regression analysis reveals a transparent and additive pricing structure. The following strategic takeaways has identified:
-Service-Driven Pricing: Subscriptions to phone services, multiple lines, fiber optic internet, and optional add-ons (like Streaming TV) have fixed and statistically significant effects on the monthly fee.
-Demographic Neutrality: After controlling for service mix and contract length, demographics such as gender or having a partner do not influence the price. This confirms that the company maintains billing equity across all customer segments.
-The Role of Tenure: A positive but very small tenure coefficient is founded, suggesting only minor price adjustments or upgrades occur over time. The specific service mix is a much stronger determinant of the bill than tenure alone.
-Actionable Revenue Levers: Because the pricing is so deterministic (R-Squared ≈ 0.99), the company can confidently offer “service-based” bundled discounts (e.g., discounting streaming services for high-risk churn groups identified in Part 4) to increase perceived value without disrupting the core pricing logic.
By integrating the classification and regression findings, a dual-action strategy was provided: who is likely to leave can now be identified and precisely how their monthly bill through targeted service bundles to retain them can be adjusted.
The project successfully moved beyond simple data observation to create a cohesive framework for customer retention. By merging classification techniques with a detailed pricing audit through regression, a comprehensive picture of the customer lifecycle has be built. Who is likely to leave was not just identified; The underlying financial mechanics that influence that decision were uncovered.
The classification analysis provided a reliable foundation for identifying at-risk customers. Both the Logistic Regression and Random Forest models demonstrated strong predictive power, achieving AUC values around 0.84. While the Random Forest model offered a slight edge in capturing complex, non-linear interactions, both models were consistent in identifying the primary drivers of churn.
According to the Variable Importance Plot, the top three predictors are:
-Tenure: Long-term loyalty significantly reduces the likelihood of leaving.
-Contract Type: Month-to-month plans are high-risk zones.
-Monthly Charges: Financial pressure is a direct trigger for churn.
This finding led to a vital Strategic Question: If Monthly Charges are so critical to churn, how exactly does the company calculate these fees? This question served as the logical bridge to the regression analysis.
The regression model provided remarkable clarity, achieving an Adjusted R-squared of 0.9988. This confirms that the company’s pricing structure is highly deterministic and based almost entirely on the specific services a customer selects.
The findings from this audit include:
-Service Mix vs. Tenure: While the scatter plot shows a slight positive correlation between tenure and monthly charges (indicated by the blue trend line), the high variance at every level suggests that the specific service mix (e.g., Fiber Optic, Streaming) is a much stronger determinant of price than tenure alone.
-Primary Cost Drivers: Fiber Optic service and Streaming add-ons are the most significant contributors to the monthly fee.
Billing Equity: Demographic factors such as Gender and Partner status showed negligible coefficients, ensuring that the company maintains pricing consistency and equity across different customer segments.
Because the pricing logic is so consistent and transparent, the company can confidently implement targeted retention strategies without disrupting its core business model. Based on the integrated analysis, the following is recommended:
-Implement Service-Based Discounts: Instead of general price cuts, the company should offer targeted bundles (e.g., discounting streaming services for long-tenure customers) to lower the Monthly Charges for at-risk users.
-Focus on High-Value Churners: Priority should be given to Fiber Optic users in their first 12 months, as they face the highest costs and show the highest risk of churn.
-Contract Incentivization: Use the deterministic nature of the pricing model to create transparent “upgrade paths” that move customers from month-to-month plans to long-term contracts in exchange for localized service discounts.
By linking the “who” (classification) with the “how much” (regression), the project have provided a roadmap that allows the company to balance revenue goals with a fairer, more personalized customer experience.