Executive Summary This project presents a dual-layered analytical approach to the telecommunications industry, bridging the gap between Customer Retention (Classification) and Pricing Integrity (Regression). By analyzing 7,000+ customer records, our team developed a predictive framework that identifies high-risk churners (AUC 0.84) and an auditing model that explains 99.88% of pricing variance.
Team Contributions & Division of Labor Our team operated under a structured Data Science Lifecycle, ensuring every phase—from raw data to strategic insight—was handled with technical precision:
TIANBOYI (25069211-Team Leader): Integrate project.Defined the research framework and core business questions. Responsible for global quality control, code optimization, and the final synthesis of the R Markdown report. Authored the strategic introduction and executive conclusions.
YE MINGMIN (24070972 - Data Engineer): Data Preprocessing. Executed end-to-end data cleaning, handling missing values and feature engineering to ensure a high-quality “Single Source of Truth” for modeling.
MA LING (24206583 - Data Analyst): Exploratory Data Analysis (EDA). Conducted multi-dimensional visualization and statistical profiling to uncover hidden correlations between customer demographics and churn behavior.
YU MINGJUN (24222016 - Modeling Specialist): Regression Analysis. Built and validated the Linear Regression models to deconstruct the deterministic pricing structure of monthly charges.
YANG YICHEN (24211050 - ML Specialist): Classification Modeling. Implemented and tuned Logistic Regression and Random Forest algorithms to predict customer churn, focused on maximizing predictive power and model interpretability.
Key Outcomes 1.Churn Prediction: Achieved a robust AUC of 0.84, identifying Fiber Optic users and month-to-month contract holders as high-risk segments.
2.Pricing Audit: Confirmed a highly consistent billing logic with an Adjusted R² of 0.9988, proving that charges are strictly driven by service components rather than demographic bias.
3.Strategic Roadmap: Provided data-driven recommendations for service-based bundling and targeted retention for high-value customers.
4.Technical Stack: R (tidyverse, caret, randomForest, ggplot2, broom)
This project aims to address one of the most critical challenges in the telecommunications industry: Customer Churn. Our team has approached this problem from two distinct analytical angles:
-Classification: Predicting which customers are likely to leave based on their behavior and service profiles.
-Regression: Conducting a pricing audit to understand exactly how our service components drive the MonthlyCharges.
By combining these two methods, we provide the business with actionable strategies for both retention and revenue optimization.
Before any analysis could begin, our team performed a rigorous data cleaning process to ensure the integrity of the results.
As the first stage of our group project, the primary objective was to transform the raw telecommunications dataset into a high-quality, analysis-ready format. Our team focus was on ensuring data integrity, handling missing values, and correctly encoding variables to support both the classification and regression tasks that follow.
To maintain consistency across all team members’ environments, we utilized the tidyverse suite for data manipulation. The raw dataset, consisting of over 7,000 customer records, was imported for initial inspection.
# 1. Load required packages
library(tidyverse)
# 2. Read the raw data
# Ensuring the raw file is in the working directory
df <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
# Preview the first few rows
head(df)## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service DSL No Yes No
## 2 No DSL Yes No Yes
## 3 No DSL Yes Yes No
## 4 No phone service DSL Yes No Yes
## 5 No Fiber optic No No No
## 6 Yes Fiber optic No No Yes
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No No Month-to-month Yes
## 2 No No No One year No
## 3 No No No Month-to-month Yes
## 4 Yes No No One year No
## 5 No No No Month-to-month Yes
## 6 No Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 1 Electronic check 29.85 29.85 No
## 2 Mailed check 56.95 1889.50 No
## 3 Mailed check 53.85 108.15 Yes
## 4 Bank transfer (automatic) 42.30 1840.75 No
## 5 Electronic check 70.70 151.65 Yes
## 6 Electronic check 99.65 820.50 Yes
Our team performed a comprehensive scan of the data structure. We identified that the TotalCharges column was improperly read as a character type due to empty strings, and several categorical variables needed to be converted into factors for proper modeling.
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
## Dataset Dimensions: 7043 rows and 21 columns.
To prepare the data, our team executed a rigorous cleaning pipeline:
-Feature Selection: We removed the customerID column as it provides no predictive value.
-Type Conversion: TotalCharges was converted to numeric.
-Handling Missing Values: We identified 11 missing values in TotalCharges (occurring where tenure was 0). We decided to remove these records to ensure model accuracy.
-Categorical Encoding: We converted all categorical character columns and the SeniorCitizen indicator into factors.
# 1. Remove unnecessary ID column
df <- df %>% select(-customerID)
# 2. Fix TotalCharges and handle NAs
df <- df %>%
mutate(TotalCharges = as.numeric(as.character(TotalCharges))) %>%
drop_na()
# 3. Factorize categorical variables
# We grouped variables to ensure consistent encoding
df <- df %>%
mutate(across(
c(gender, SeniorCitizen, Partner, Dependents, PhoneService,
MultipleLines, InternetService, OnlineSecurity, OnlineBackup,
DeviceProtection, TechSupport, StreamingTV, StreamingMovies,
Contract, PaperlessBilling, PaymentMethod, Churn),
as.factor
))
# 4. Final validation of numeric scales
df <- df %>%
mutate(
tenure = as.numeric(tenure),
MonthlyCharges = as.numeric(MonthlyCharges)
)Before handing off the data to the EDA and Modeling teams, we performed a final quality check to confirm there are no remaining missing values and that all factor levels are correctly assigned.
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
## 'data.frame': 7032 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
Finally, we exported this version as data_cleaned.csv to serve as the “Single Source of Truth” for all remaining project parts.
# --- Export Data ---
# 1. Define output path
export_path <- "data_cleaned.csv"
# 2. Create folder if it doesn't exist
if (!dir.exists(dirname(export_path))) {
dir.create(dirname(export_path), recursive = TRUE)
}
# 3. Try to save data; if it fails (e.g., file open in Excel), just show a warning
# This prevents the entire HTML knitting process from stopping.
tryCatch({
write.csv(df, export_path, row.names = FALSE)
message("Success: Data saved to ", export_path)
}, error = function(e) {
warning("Could not save CSV. Please close 'data_cleaned.csv' in Excel and try again.")
})After the initial data cleaning, our team focused on a deep dive into the dataset to understand the “signals” behind customer churn. The objective of this EDA is to investigate the data structure, assess quality issues, and identify which demographic and service attributes have the strongest correlation with Churn. These insights serve as the empirical foundation for our subsequent classification and regression models.
Our team selected a specialized toolset—ggplot2 for visualization and skimr for detailed statistical summaries—to ensure our analysis is both visually clear and statistically sound.
```{rload-libraries} library(dplyr) library(ggplot2) library(tidyr) library(skimr)
telco <- read.csv(“WA_Fn-UseC_-Telco-Customer-Churn.csv”, stringsAsFactors = FALSE)
## 3.3 Comprehensive Data Diagnostics
We began by validating the dimensions and structure of the dataset to confirm the scope of our analysis.
``` r
# --- Comprehensive Data Diagnostics ---
# Check the internal structure of the dataframe
# We use 'df' as defined in Section 1
str(df)
## 'data.frame': 7032 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
## gender SeniorCitizen Partner Dependents tenure PhoneService
## Female:3483 0:5890 No :3639 No :4933 Min. : 1.00 No : 680
## Male :3549 1:1142 Yes:3393 Yes:2099 1st Qu.: 9.00 Yes:6352
## Median :29.00
## Mean :32.42
## 3rd Qu.:55.00
## Max. :72.00
## MultipleLines InternetService OnlineSecurity
## No :3385 DSL :2416 No :3497
## No phone service: 680 Fiber optic:3096 No internet service:1520
## Yes :2967 No :1520 Yes :2015
##
##
##
## OnlineBackup DeviceProtection
## No :3087 No :3094
## No internet service:1520 No internet service:1520
## Yes :2425 Yes :2418
##
##
##
## TechSupport StreamingTV
## No :3472 No :2809
## No internet service:1520 No internet service:1520
## Yes :2040 Yes :2703
##
##
##
## StreamingMovies Contract PaperlessBilling
## No :2781 Month-to-month:3875 No :2864
## No internet service:1520 One year :1472 Yes:4168
## Yes :2731 Two year :1685
##
##
##
## PaymentMethod MonthlyCharges TotalCharges Churn
## Bank transfer (automatic):1542 Min. : 18.25 Min. : 18.8 No :5163
## Credit card (automatic) :1521 1st Qu.: 35.59 1st Qu.: 401.4 Yes:1869
## Electronic check :2365 Median : 70.35 Median :1397.5
## Mailed check :1604 Mean : 64.80 Mean :2283.3
## 3rd Qu.: 89.86 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
A critical part of our team’s workflow was identifying potential data gaps. We discovered that TotalCharges contained several missing values that required attention, while categorical variables remained highly consistent.
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
# Identifying unique entries in categorical fields
sapply(df %>% select_if(is.character), n_distinct)## named list()
We explored the distribution of key variables to establish a baseline for our customer profiles.
Our team observed that the dataset is imbalanced, with a significant majority of customers currently staying with the provider.
ggplot(df, aes(x = Churn, fill = Churn)) +
geom_bar() +
labs(title = "Baseline Distribution of Customer Churn", x = "Churn Status", y = "Total Count") +
theme_minimal()We analyzed the density of charges to see the common billing tiers and investigated how long customers typically stay with the company.
# Density of Monthly Charges
ggplot(df, aes(x = MonthlyCharges)) +
geom_density(fill = "steelblue", alpha = 0.6) +
labs(title = "Density Distribution of Monthly Charges", x = "Monthly Charges ($)")# Histogram of Tenure
ggplot(df, aes(x = tenure)) +
geom_histogram(binwidth = 5, fill = "orange", color = "white") +
labs(title = "Distribution of Customer Tenure", x = "Months of Tenure", y = "Frequency")Our team shifted the focus to how specific variables interact with the churn event.
The following visualization confirms our team’s hypothesis: customers who churn tend to have higher monthly charges compared to those who remain.
ggplot(df, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
geom_boxplot() +
labs(title = "Monthly Charges Comparison by Churn Status", y = "Monthly Charges ($)")The boxplot reveals that long-term tenure is a strong “moat” against churn. Customers who leave generally do so early in their lifecycle.
ggplot(df, aes(x = Churn, y = tenure, fill = Churn)) +
geom_boxplot() +
labs(title = "Tenure Duration Comparison by Churn Status", y = "Tenure (Months)")We examined how service types and contract terms influence a customer’s decision to stay or leave.
The team identified Contract Type as perhaps the strongest predictor in the dataset. Month-to-month users are at a much higher risk of leaving.
ggplot(df, aes(x = Contract, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = "Churn Proportion by Contract Type", y = "Proportion") +
theme_minimal()We also looked at how specific technologies and billing methods correlate with churn.
# Internet Service Type
ggplot(df, aes(x = InternetService, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = "Churn Proportion by Internet Service", y = "Proportion")# Payment Method
ggplot(df, aes(x = PaymentMethod, fill = Churn)) +
geom_bar(position = "fill") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Churn Proportion by Payment Method", y = "Proportion")Based on our exploratory results, our team has identified several high-priority features for the next phase:
-High Risk Factors: Month-to-month contracts, Fiber Optic internet, and Electronic check payments are strongly associated with higher churn rates.
-Protective Factors: One- and Two-year contracts, and longer tenure (loyalty), significantly reduce the likelihood of churn.
-Predictive Potential: Monthly Charges and Tenure are the most significant continuous variables to include in our predictive classification model.
This EDA phase has successfully provided our team with a clear understanding of the customer landscape. By identifying these critical drivers, we have established a solid analytical foundation for building our Classification and Regression models in the following sections.
Research Question: “Can we build a reliable model to identify high-risk customers based on their service usage and demographic profiles?”
Building on the insights from our Exploratory Data Analysis (EDA), our team moved to the core predictive task of this project. The primary question we aim to answer is: “Can we accurately predict customer churn using demographic characteristics and consumption behavior?”
To ensure the robustness of our predictions, our team decided to train and compare two distinct supervised learning models:
Logistic Regression: Serving as our baseline model due to its high interpretability and efficiency.
Random Forest: Employed as an advanced ensemble method to capture non-linear relationships and complex interactions between variables that a linear model might miss.
To maintain consistency across our project workflow, we utilized the data_cleaned.csv generated in Part 1. Our team applied a 70/30 split to create independent training and testing sets, ensuring that our model evaluations reflect real-world performance.
library(caret)
library(randomForest)
library(pROC)
library(dplyr)
# Load the cleaned dataset
model_data <- read.csv("data_cleaned.csv", stringsAsFactors = TRUE)
# Ensure the target variable is a factor
model_data$Churn <- factor(model_data$Churn, levels = c("Yes", "No"))
# Data Partitioning (70% Train, 30% Test)
set.seed(2025)
trainIndex <- createDataPartition(model_data$Churn, p = 0.7, list = FALSE)
trainData <- model_data[trainIndex, ]
testData <- model_data[-trainIndex, ]
cat("Training samples:", nrow(trainData), "\n")## Training samples: 4924
## Testing samples: 2108
We implement 5-fold cross-validation to ensure our results are generalizable and to prevent overfitting.
fitControl <- trainControl (method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# model1-Logistic Regression
set.seed (2025)
logit_model <- train (Churn ~ .,
data = trainData,
method = "glm",
family = "binomial",
trControl = fitControl,
metric = "ROC")
# model2-Random Forest
set.seed (2025)
rf_model <- train (Churn ~ .,
data = trainData,
method = "rf",
trControl = fitControl,
metric = "ROC",
tuneLength = 3) We evaluated both models using AUC (Area Under the Curve) and confusion matrices. A higher AUC indicates stronger discrimination ability.
# ROC
prob_logit <- predict (logit_model, testData, type = "prob")
prob_rf <- predict (rf_model, testData, type = "prob")
# predict classfication
pred_logit <- predict (logit_model, testData)
pred_rf <- predict (rf_model, testData)
cm_logit <- confusionMatrix (pred_logit, testData$Churn, mode = "everything")
cm_rf <- confusionMatrix (pred_rf, testData$Churn, mode = "everything")
print (cm_logit)## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 304 147
## No 256 1401
##
## Accuracy : 0.8088
## 95% CI : (0.7914, 0.8254)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 7.089e-16
##
## Kappa : 0.4776
##
## Mcnemar's Test P-Value : 7.454e-08
##
## Sensitivity : 0.5429
## Specificity : 0.9050
## Pos Pred Value : 0.6741
## Neg Pred Value : 0.8455
## Precision : 0.6741
## Recall : 0.5429
## F1 : 0.6014
## Prevalence : 0.2657
## Detection Rate : 0.1442
## Detection Prevalence : 0.2139
## Balanced Accuracy : 0.7239
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 252 106
## No 308 1442
##
## Accuracy : 0.8036
## 95% CI : (0.786, 0.8204)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 6.717e-14
##
## Kappa : 0.4312
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4500
## Specificity : 0.9315
## Pos Pred Value : 0.7039
## Neg Pred Value : 0.8240
## Precision : 0.7039
## Recall : 0.4500
## F1 : 0.5490
## Prevalence : 0.2657
## Detection Rate : 0.1195
## Detection Prevalence : 0.1698
## Balanced Accuracy : 0.6908
##
## 'Positive' Class : Yes
##
The ROC curve shows model performance under different thresholds. We plotted both curves on the same graph for direct comparison.
roc_logit <- roc (testData$Churn, prob_logit$Yes, levels = c ("No", "Yes"), direction = "<")
roc_rf <- roc (testData$Churn, prob_rf$Yes, levels = c ("No", "Yes"), direction = "<")
# draw
plot (roc_logit, col = "#E69F00", lwd = 2, main = "ROC Curve Comparison: Logit vs RF",
print.auc = TRUE, print.auc.y = 0.4, legacy.axes = TRUE)
plot (roc_rf, col = "#56B4E9", lwd = 2, add = TRUE,
print.auc = TRUE, print.auc.y = 0.3)
legend ("bottomright", legend = c ("Logistic Regression", "Random Forest"),
col = c ("#E69F00", "#56B4E9"), lwd = 2)
Interpretation—Both models show strong predictive
performance, with AUC values above 0.80. Logistic Regression achieved an
AUC of approximately 0.84, slightly higher than Random
Forest. This suggests that, for this dataset, a simple linear model is
already sufficient.
To understand the “why” behind customer churn, our team examined the variable importance derived from the Random Forest model. This allows us to move beyond simple prediction into actionable business intelligence.
# Visualizing top predictors
varImpPlot(rf_model$finalModel, main = "Top Predictors of Churn (Random Forest)")Our analysis strongly confirms several critical drivers identified during EDA:
-Tenure is Paramount: New customers represent the highest risk; churn probability drops significantly as customer loyalty (tenure) increases.
-Price Sensitivity: Financial variables rank high in importance, indicating that cost-related pressure is a primary reason for leaving.
-Contractual Commitment: Customers on month-to-month plans are significantly more volatile than those on one- or two-year contracts.
-The Fiber Optic Paradox: Despite higher speeds, Fiber Optic users show higher churn rates, suggesting a potential gap between service cost and perceived value.
Based on the modeling results, our team draws the following conclusions:
-Model Selection: We recommend Logistic Regression for immediate deployment. It delivers strong, competitive performance (AUC ≈ 0.84) with the added benefit of low computational cost and clear interpretability for stakeholders.
-Retention Strategy: Management should focus retention efforts on the first 0–12 months of the customer lifecycle. Specifically, implementing incentive programs to migrate users from month-to-month plans to long-term contracts could substantially reduce overall churn risk.
Research Question: “How do individual service subscriptions quantitatively influence Monthly Charges after controlling for account profiles?”
While the previous section focused on predicting the behavior of customers (Churn), our team now shifts focus to auditing the financial structure of the business. The aim of this section is to quantify how various service subscriptions and customer attributes influence the monthly fee charged. By constructing and interpreting a Linear Regression model, our team provides the business with transparent pricing insights and identifies the primary levers for revenue optimization.
The analysis uses the pre-cleaned Telco customer churn dataset (data_cleaned.csv), which contains records of approximately 7,000 customers. Each observation describes demographics, contract details, service subscriptions, and billing information. Since the data have already been cleaned by the team in previous parts, our preparation focuses on correctly encoding categorical variables.
library(tidyverse)
library(car)
library(Metrics)
# Load the cleaned data
telco <- read_csv('data_cleaned.csv')
# Encoding categorical variables as factors
telco <- telco %>%
mutate(across(c(gender, SeniorCitizen, Partner, Dependents, PhoneService,
MultipleLines, InternetService, OnlineSecurity, OnlineBackup,
DeviceProtection, TechSupport, StreamingTV, StreamingMovies,
Contract, PaperlessBilling, PaymentMethod, Churn), as.factor))
# Ensuring numerical columns are correctly typed
telco <- telco %>% mutate(across(c(tenure, MonthlyCharges, TotalCharges), as.numeric))
glimpse(telco)## Rows: 7,032
## Columns: 20
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
Key variables include tenure, Contract, and a series of binary service indicators such as PhoneService, MultipleLines, and InternetService. The MonthlyCharges variable is continuous and serves as the dependent variable in our regression models.
Before fitting any models, we explore the distribution of monthly charges and their relationship with tenure. The average customer pays around RM 65 per month, with a range between RM 18 and RM 119.
summary_stats <- summary(telco$MonthlyCharges)
summary_table <- tibble(
Statistic = c("Min", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max"),
Value = as.numeric(summary_stats)
)
knitr::kable(summary_table, digits = 2, caption = "Summary of Monthly Charges")| Statistic | Value |
|---|---|
| Min | 18.25 |
| 1st Qu. | 35.59 |
| Median | 70.35 |
| Mean | 64.80 |
| 3rd Qu. | 89.86 |
| Max | 118.75 |
We also examined the correlation between a customer’s tenure and their monthly fee.
ggplot(telco, aes(x = tenure, y = MonthlyCharges)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, colour = "blue") +
labs(title = "Relationship Between Tenure and Monthly Charges",
x = "Customer Tenure (months)", y = "Monthly Charges") +
theme_minimal()##
## Pearson's product-moment correlation
##
## data: telco$tenure and telco$MonthlyCharges
## t = 21.359, df = 7030, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2247854 0.2686849
## sample estimates:
## cor
## 0.2468618
The correlation coefficient is relatively low (~0.25) but statistically significant. This indicates a modest positive association: customers with longer tenure tend to pay slightly higher fees, likely due to additional service subscriptions over time.
We fit two multiple linear regression models to understand the pricing structure:
1.Full model: Includes demographics, contract details, billing setup, and all service indicators.
2,Pricing model: Focuses on service subscription variables while controlling for tenure and Contract.
full_model <- lm(MonthlyCharges ~ gender + SeniorCitizen + Partner + Dependents +
tenure + PhoneService + MultipleLines + InternetService +
OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
StreamingTV + StreamingMovies + Contract +
PaperlessBilling + PaymentMethod,
data = telco)
summary(full_model)##
## Call:
## lm(formula = MonthlyCharges ~ gender + SeniorCitizen + Partner +
## Dependents + tenure + PhoneService + MultipleLines + InternetService +
## OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
## StreamingTV + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod, data = telco)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2296 -0.6166 -0.0055 0.6079 4.8413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.496e+01 5.825e-02 428.509 <2e-16
## genderMale 2.492e-02 2.447e-02 1.018 0.309
## SeniorCitizenYes 1.434e-02 3.560e-02 0.403 0.687
## PartnerYes -4.216e-02 2.959e-02 -1.425 0.154
## DependentsYes 1.416e-02 3.140e-02 0.451 0.652
## tenure 1.455e-04 8.376e-04 0.174 0.862
## PhoneServiceYes 2.006e+01 4.815e-02 416.574 <2e-16
## MultipleLinesYes 5.019e+00 2.954e-02 169.908 <2e-16
## InternetServiceFiber optic 2.496e+01 3.488e-02 715.626 <2e-16
## InternetServiceNo -2.505e+01 4.874e-02 -513.867 <2e-16
## OnlineSecurityYes 5.013e+00 3.217e-02 155.850 <2e-16
## OnlineBackupYes 4.996e+00 3.023e-02 165.278 <2e-16
## DeviceProtectionYes 5.020e+00 3.131e-02 160.305 <2e-16
## TechSupportYes 5.033e+00 3.279e-02 153.491 <2e-16
## StreamingTVYes 9.970e+00 3.205e-02 311.106 <2e-16
## StreamingMoviesYes 9.966e+00 3.207e-02 310.727 <2e-16
## ContractOne year 1.019e-02 3.832e-02 0.266 0.790
## ContractTwo year -2.564e-02 4.648e-02 -0.552 0.581
## PaperlessBillingYes -2.071e-02 2.735e-02 -0.757 0.449
## PaymentMethodCredit card (automatic) 5.085e-05 3.710e-02 0.001 0.999
## PaymentMethodElectronic check -1.940e-02 3.636e-02 -0.534 0.594
## PaymentMethodMailed check -1.399e-02 3.950e-02 -0.354 0.723
##
## (Intercept) ***
## genderMale
## SeniorCitizenYes
## PartnerYes
## DependentsYes
## tenure
## PhoneServiceYes ***
## MultipleLinesYes ***
## InternetServiceFiber optic ***
## InternetServiceNo ***
## OnlineSecurityYes ***
## OnlineBackupYes ***
## DeviceProtectionYes ***
## TechSupportYes ***
## StreamingTVYes ***
## StreamingMoviesYes ***
## ContractOne year
## ContractTwo year
## PaperlessBillingYes
## PaymentMethodCredit card (automatic)
## PaymentMethodElectronic check
## PaymentMethodMailed check
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 7010 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 2.88e+05 on 21 and 7010 DF, p-value: < 2.2e-16
Many demographic variables have p-values above 0.05, while specific service indicators are highly significant. This supports the hypothesis that monthly charges are essentially the sum of base fees plus subscribed services.
We verified that multicollinearity is not inflating standard errors using Variance Inflation Factors (VIFs).
## GVIF Df GVIF^(1/(2*Df))
## gender 1.001769 1 1.000884
## SeniorCitizen 1.153168 1 1.073857
## Partner 1.462369 1 1.209284
## Dependents 1.380811 1 1.175079
## tenure 2.827609 1 1.681549
## PhoneService 1.354768 1 1.163945
## MultipleLines 1.423682 1 1.193182
## InternetService 3.783812 2 1.394705
## OnlineSecurity 1.415235 1 1.189637
## OnlineBackup 1.380823 1 1.175084
## DeviceProtection 1.480118 1 1.216601
## TechSupport 1.481353 1 1.217108
## StreamingTV 1.625973 1 1.275137
## StreamingMovies 1.634755 1 1.278575
## Contract 2.567229 2 1.265803
## PaperlessBilling 1.208329 1 1.099240
## PaymentMethod 1.580562 3 1.079283
Values are under commonly cited thresholds, suggesting stable estimated coefficients.
We specify a refined pricing model including only the variables most directly related to the monthly fee.
pricing_model <- lm(MonthlyCharges ~ tenure + Contract + PhoneService + MultipleLines +
InternetService + OnlineSecurity + OnlineBackup +
DeviceProtection + TechSupport + StreamingTV +
StreamingMovies,
data = telco)
summary(pricing_model)##
## Call:
## lm(formula = MonthlyCharges ~ tenure + Contract + PhoneService +
## MultipleLines + InternetService + OnlineSecurity + OnlineBackup +
## DeviceProtection + TechSupport + StreamingTV + StreamingMovies,
## data = telco)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2425 -0.6160 -0.0047 0.6063 4.8075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.495e+01 4.550e-02 548.265 <2e-16 ***
## tenure -4.821e-07 7.952e-04 -0.001 1.000
## ContractOne year 1.262e-02 3.792e-02 0.333 0.739
## ContractTwo year -2.299e-02 4.571e-02 -0.503 0.615
## PhoneServiceYes 2.006e+01 4.804e-02 417.534 <2e-16 ***
## MultipleLinesYes 5.017e+00 2.943e-02 170.454 <2e-16 ***
## InternetServiceFiber optic 2.495e+01 3.401e-02 733.750 <2e-16 ***
## InternetServiceNo -2.504e+01 4.760e-02 -526.106 <2e-16 ***
## OnlineSecurityYes 5.014e+00 3.194e-02 156.970 <2e-16 ***
## OnlineBackupYes 4.995e+00 3.019e-02 165.446 <2e-16 ***
## DeviceProtectionYes 5.020e+00 3.126e-02 160.597 <2e-16 ***
## TechSupportYes 5.034e+00 3.256e-02 154.609 <2e-16 ***
## StreamingTVYes 9.967e+00 3.190e-02 312.458 <2e-16 ***
## StreamingMoviesYes 9.964e+00 3.197e-02 311.698 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 7018 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 4.655e+05 on 13 and 7018 DF, p-value: < 2.2e-16
The Adjusted R² is very high (~0.9988), explaining over 99.8% of the variation in charges, confirming a deterministic pricing structure driven by selected services.
We tabulate the coefficients for significant predictors (p < 0.05).
coef_df <- broom::tidy(pricing_model, conf.int = TRUE) %>%
filter(term != "(Intercept)") %>%
mutate(Effect = ifelse(estimate > 0, "Positive", "Negative")) %>%
select(term, estimate, conf.low, conf.high, p.value, Effect)
knitr::kable(coef_df, digits = 3,
col.names = c("Predictor", "Estimate", "CI Lower", "CI Upper", "p-value", "Effect"),
caption = "Coefficients for the pricing model (significant predictors).")| Predictor | Estimate | CI Lower | CI Upper | p-value | Effect |
|---|---|---|---|---|---|
| tenure | 0.000 | -0.002 | 0.002 | 1.000 | Negative |
| ContractOne year | 0.013 | -0.062 | 0.087 | 0.739 | Positive |
| ContractTwo year | -0.023 | -0.113 | 0.067 | 0.615 | Negative |
| PhoneServiceYes | 20.058 | 19.964 | 20.152 | 0.000 | Positive |
| MultipleLinesYes | 5.017 | 4.959 | 5.075 | 0.000 | Positive |
| InternetServiceFiber optic | 24.954 | 24.887 | 25.020 | 0.000 | Positive |
| InternetServiceNo | -25.043 | -25.136 | -24.950 | 0.000 | Negative |
| OnlineSecurityYes | 5.014 | 4.951 | 5.077 | 0.000 | Positive |
| OnlineBackupYes | 4.995 | 4.936 | 5.054 | 0.000 | Positive |
| DeviceProtectionYes | 5.020 | 4.959 | 5.082 | 0.000 | Positive |
| TechSupportYes | 5.034 | 4.971 | 5.098 | 0.000 | Positive |
| StreamingTVYes | 9.967 | 9.905 | 10.030 | 0.000 | Positive |
| StreamingMoviesYes | 9.964 | 9.901 | 10.026 | 0.000 | Positive |
We quantified prediction error using RMSE and MAE.
pred <- predict(pricing_model, newdata = telco)
rmse_val <- sqrt(mean((telco$MonthlyCharges - pred)^2))The RMSE (~1.02) and MAE (~0.78) indicate that the model’s predictions are on average within about RM 1 of the actual monthly charges. Given that charges range from RM 18 to RM 118, such a small error underscores the deterministic nature of the pricing structure.
The diagnostic plot compares actual versus predicted charges.
ggplot(data.frame(Actual = telco$MonthlyCharges, Predicted = pred), aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.3) +
geom_abline(intercept = 0, slope = 1, colour = 'red', linetype = 'dashed') +
labs(title = 'Actual vs Predicted Monthly Charges',
x = 'Actual Monthly Charges',
y = 'Predicted Monthly Charges') +
theme_minimal()Our regression analysis reveals a transparent and additive pricing structure. The team has identified the following strategic takeaways:
-Service-Driven Pricing: Subscriptions to phone services, multiple lines, fiber optic internet, and optional add-ons (like Streaming TV) have fixed and statistically significant effects on the monthly fee.
-Demographic Neutrality: After controlling for service mix and contract length, demographics such as gender or having a partner do not influence the price. This confirms that the company maintains billing equity across all customer segments.
-The Role of Tenure: We found a positive but very small tenure coefficient, suggesting only minor price adjustments or upgrades occur over time. The specific service mix is a much stronger determinant of the bill than tenure alone.
-Actionable Revenue Levers: Because our pricing is so deterministic (R-Squared ≈ 0.99), the company can confidently offer “service-based” bundled discounts (e.g., discounting streaming services for high-risk churn groups identified in Part 4) to increase perceived value without disrupting the core pricing logic.
By integrating our classification and regression findings, our team provides a dual-action strategy: we can now identify who is likely to leave and precisely how we can adjust their monthly bill through targeted service bundles to retain them.
Our project successfully moved beyond simple data observation to create a cohesive framework for customer retention. By merging classification techniques with a detailed pricing audit through regression, our team has built a comprehensive picture of the customer lifecycle. We did not just identify who is likely to leave; we uncovered the underlying financial mechanics that influence that decision.
Our classification analysis provided a reliable foundation for identifying at-risk customers. Both the Logistic Regression and Random Forest models demonstrated strong predictive power, achieving AUC values around 0.84. While the Random Forest model offered a slight edge in capturing complex, non-linear interactions, both models were consistent in identifying the primary drivers of churn.
According to the Variable Importance Plot, the top three predictors are:
-Tenure: Long-term loyalty significantly reduces the likelihood of leaving.
-Contract Type: Month-to-month plans are high-risk zones.
-Monthly Charges: Financial pressure is a direct trigger for churn.
This finding led our team to a vital Strategic Question: If Monthly Charges are so critical to churn, how exactly does the company calculate these fees? This question served as the logical bridge to our regression analysis.
The regression model provided remarkable clarity, achieving an Adjusted R-squared of 0.9988. This confirms that the company’s pricing structure is highly deterministic and based almost entirely on the specific services a customer selects.
Our findings from this audit include:
-Service Mix vs. Tenure: While the scatter plot shows a slight positive correlation between tenure and monthly charges (indicated by the blue trend line), the high variance at every level suggests that the specific service mix (e.g., Fiber Optic, Streaming) is a much stronger determinant of price than tenure alone.
-Primary Cost Drivers: Fiber Optic service and Streaming add-ons are the most significant contributors to the monthly fee.
Billing Equity: Demographic factors such as Gender and Partner status showed negligible coefficients, ensuring that the company maintains pricing consistency and equity across different customer segments.
Because the pricing logic is so consistent and transparent, the company can confidently implement targeted retention strategies without disrupting its core business model. Based on our integrated analysis, we recommend the following:
-Implement Service-Based Discounts: Instead of general price cuts, the company should offer targeted bundles (e.g., discounting streaming services for long-tenure customers) to lower the Monthly Charges for at-risk users.
-Focus on High-Value Churners: Priority should be given to Fiber Optic users in their first 12 months, as they face the highest costs and show the highest risk of churn.
-Contract Incentivization: Use the deterministic nature of the pricing model to create transparent “upgrade paths” that move customers from month-to-month plans to long-term contracts in exchange for localized service discounts.
By linking the “who” (classification) with the “how much” (regression), we have provided a roadmap that allows the company to balance revenue goals with a fairer, more personalized customer experience.