==================================================

Project Overview & Team Contributions

==================================================

Executive Summary This project presents a dual-layered analytical approach to the telecommunications industry, bridging the gap between Customer Retention (Classification) and Pricing Integrity (Regression). By analyzing 7,000+ customer records, the project developed a predictive framework that identifies high-risk churners (AUC 0.84) and an auditing model that explains 99.88% of pricing variance.

Team Contributions & Division of Labor The project operated under a structured Data Science Lifecycle, ensuring every phase—from raw data to strategic insight—was handled with technical precision:

TIANBOYI (25069211-Team Leader): Integrate project.Defined the research framework and core business questions. Responsible for global quality control, code optimization, and the final synthesis of the R Markdown report. Authored the strategic introduction and executive conclusions.

YE MINGMIN (24070972 - Data Engineer): Data Preprocessing. Executed end-to-end data cleaning, handling missing values and feature engineering to ensure a high-quality “Single Source of Truth” for modeling.

MA LING (24206583 - Data Analyst): Exploratory Data Analysis (EDA). Conducted multi-dimensional visualization and statistical profiling to uncover hidden correlations between customer demographics and churn behavior.

YU MINGJUN (24222016 - Modeling Specialist): Regression Analysis. Built and validated the Linear Regression models to deconstruct the deterministic pricing structure of monthly charges.

YANG YICHEN (24211050 - ML Specialist): Classification Modeling. Implemented and tuned Logistic Regression and Random Forest algorithms to predict customer churn, focused on maximizing predictive power and model interpretability.

Key Outcomes 1.Churn Prediction: Achieved a robust AUC of 0.84, identifying Fiber Optic users and month-to-month contract holders as high-risk segments.

2.Pricing Audit: Confirmed a highly consistent billing logic with an Adjusted R² of 0.9988, proving that charges are strictly driven by service components rather than demographic bias.

3.Strategic Roadmap: Provided data-driven recommendations for service-based bundling and targeted retention for high-value customers.

4.Technical Stack: R (tidyverse, caret, randomForest, ggplot2, broom)

==================================================

1. Introduction and Project Objective

==================================================

This project aims to address one of the most critical challenges in the telecommunications industry: Customer Churn. This problem has been approached from two distinct analytical angles:

-Classification: Predicting which customers are likely to leave based on their behavior and service profiles.

-Regression: Conducting a pricing audit to understand exactly how the service components drive the MonthlyCharges.

By combining these two methods, the project provides the business with actionable strategies for both retention and revenue optimization.

==================================================

2. Data cleaning and Preprocessing

==================================================

Before any analysis could begin, the project performed a rigorous data cleaning process to ensure the integrity of the results.

2.1 Objective and Strategy

As the first stage of the group project, the primary objective was to transform the raw telecommunications dataset into a high-quality, analysis-ready format. The project was focus on ensuring data integrity, handling missing values, and correctly encoding variables to support both the classification and regression tasks that follow.

2.2 Environment Setup and Data Import

To maintain consistency across all team members’ environments, the project utilized the tidyverse suite for data manipulation. The raw dataset, consisting of over 7,000 customer records, was imported for initial inspection.

# 1. Load required packages
library(tidyverse)

# 2. Read the raw data
# Ensuring the raw file is in the working directory
df <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Preview the first few rows
head(df)

##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service             DSL             No          Yes               No
## 2               No             DSL            Yes           No              Yes
## 3               No             DSL            Yes          Yes               No
## 4 No phone service             DSL            Yes           No              Yes
## 5               No     Fiber optic             No           No               No
## 6              Yes     Fiber optic             No           No              Yes
##   TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No          No              No Month-to-month              Yes
## 2          No          No              No       One year               No
## 3          No          No              No Month-to-month              Yes
## 4         Yes          No              No       One year               No
## 5          No          No              No Month-to-month              Yes
## 6          No         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn
## 1          Electronic check          29.85        29.85    No
## 2              Mailed check          56.95      1889.50    No
## 3              Mailed check          53.85       108.15   Yes
## 4 Bank transfer (automatic)          42.30      1840.75    No
## 5          Electronic check          70.70       151.65   Yes
## 6          Electronic check          99.65       820.50   Yes

2.3 Initial Diagnostics

A comprehensive scan of the data structure was performed. The project identified that the Total Charges column was improperly read as a character type due to empty strings, and several categorical variables needed to be converted into factors for proper modeling.

# Verify data types and dimensions
glimpse(df)

## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…

cat("Dataset Dimensions:", dim(df)[1], "rows and", dim(df)[2], "columns.")

## Dataset Dimensions: 7043 rows and 21 columns.

2.4 Execution of Cleaning Steps

To prepare the data, the project executed a rigorous cleaning pipeline:

-Feature Selection: The project removed the customer ID column as it provides no predictive value.

-Type Conversion: Total Charges was converted to numeric.

-Handling Missing Values: The project identified 11 missing values in Total Charges (occurring where tenure was 0). It decided to remove these records to ensure model accuracy.

-Categorical Encoding: All categorical character columns and the Senior Citizen indicator were converted into factors.

# 1. Remove unnecessary ID column
df <- df %>% select(-customerID)

# 2. Fix TotalCharges and handle NAs
df <- df %>%
  mutate(TotalCharges = as.numeric(as.character(TotalCharges))) %>%
  drop_na()

# 3. Factorize categorical variables
# The project grouped variables to ensure consistent encoding
df <- df %>%
  mutate(across(
    c(gender, SeniorCitizen, Partner, Dependents, PhoneService, 
      MultipleLines, InternetService, OnlineSecurity, OnlineBackup, 
      DeviceProtection, TechSupport, StreamingTV, StreamingMovies, 
      Contract, PaperlessBilling, PaymentMethod, Churn),
    as.factor
  ))

# 4. Final validation of numeric scales
df <- df %>%
  mutate(
    tenure = as.numeric(tenure),
    MonthlyCharges = as.numeric(MonthlyCharges)
  )

2.5 Final Validation and Export

Before handing off the data to the EDA and Modeling teams, the project performed a final quality check to confirm there are no remaining missing values and that all factor levels are correctly assigned.

# Check for any remaining NAs
colSums(is.na(df))

##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0

# Confirm clean dataset structure
str(df)

## 'data.frame':    7032 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Finally, the project exported this version as data_cleaned.csv to serve as the “Single Source of Truth” for all remaining project parts.

# --- Export Data ---

# 1. Define output path
export_path <- "data_cleaned.csv" 

# 2. Create folder if it doesn't exist
if (!dir.exists(dirname(export_path))) {
  dir.create(dirname(export_path), recursive = TRUE)
}

# 3. Try to save data; if it fails (e.g., file open in Excel), just show a warning
# This prevents the entire HTML knitting process from stopping.
tryCatch({
  write.csv(df, export_path, row.names = FALSE)
  message("Success: Data saved to ", export_path)
}, error = function(e) {
  warning("Could not save CSV. Please close 'data_cleaned.csv' in Excel and try again.")
})

==================================================

3. Exploratory Data Analysis (EDA)

==================================================

3.1 Introduction and Objective

After the initial data cleaning, the project focused on a deep dive into the dataset to understand the “signals” behind customer churn. The objective of this EDA is to investigate the data structure, assess quality issues, and identify which demographic and service attributes have the strongest correlation with Churn. These insights serve as the empirical foundation for the subsequent classification and regression models.

3.2 Environment and Data Acquisition

The project selected a specialized toolset—ggplot2 for visualization and skimr for detailed statistical summaries—to ensure the analysis is both visually clear and statistically sound.

```{rload-libraries} library(dplyr) library(ggplot2) library(tidyr) library(skimr)

Loading the raw data for exploration

telco <- read.csv(“WA_Fn-UseC_-Telco-Customer-Churn.csv”, stringsAsFactors = FALSE)


## 3.3 Comprehensive Data Diagnostics

The project began by validating the dimensions and structure of the dataset to confirm the scope of the analysis.


``` r
# --- Comprehensive Data Diagnostics ---

# Check the internal structure of the dataframe
# The project used 'df' as defined in Section 1
str(df)

## 'data.frame':    7032 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

# Summarize the distribution of variables
summary(df)

##     gender     SeniorCitizen Partner    Dependents     tenure      PhoneService
##  Female:3483   0:5890        No :3639   No :4933   Min.   : 1.00   No : 680    
##  Male  :3549   1:1142        Yes:3393   Yes:2099   1st Qu.: 9.00   Yes:6352    
##                                                    Median :29.00               
##                                                    Mean   :32.42               
##                                                    3rd Qu.:55.00               
##                                                    Max.   :72.00               
##           MultipleLines     InternetService             OnlineSecurity
##  No              :3385   DSL        :2416   No                 :3497  
##  No phone service: 680   Fiber optic:3096   No internet service:1520  
##  Yes             :2967   No         :1520   Yes                :2015  
##                                                                       
##                                                                       
##                                                                       
##               OnlineBackup             DeviceProtection
##  No                 :3087   No                 :3094   
##  No internet service:1520   No internet service:1520   
##  Yes                :2425   Yes                :2418   
##                                                        
##                                                        
##                                                        
##               TechSupport                StreamingTV  
##  No                 :3472   No                 :2809  
##  No internet service:1520   No internet service:1520  
##  Yes                :2040   Yes                :2703  
##                                                       
##                                                       
##                                                       
##             StreamingMovies           Contract    PaperlessBilling
##  No                 :2781   Month-to-month:3875   No :2864        
##  No internet service:1520   One year      :1472   Yes:4168        
##  Yes                :2731   Two year      :1685                   
##                                                                   
##                                                                   
##                                                                   
##                    PaymentMethod  MonthlyCharges    TotalCharges    Churn     
##  Bank transfer (automatic):1542   Min.   : 18.25   Min.   :  18.8   No :5163  
##  Credit card (automatic)  :1521   1st Qu.: 35.59   1st Qu.: 401.4   Yes:1869  
##  Electronic check         :2365   Median : 70.35   Median :1397.5             
##  Mailed check             :1604   Mean   : 64.80   Mean   :2283.3             
##                                   3rd Qu.: 89.86   3rd Qu.:3794.7             
##                                   Max.   :118.75   Max.   :8684.8

# Check for missing values across all columns
colSums(is.na(df))

##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0

3.3.1 Quality Check: Missing and Unique Values

A critical part of the workflow was identifying potential data gaps. The project discovered that Total Charges contained several missing values that required attention, while categorical variables remained highly consistent.

# Missing values across all columns
colSums(is.na(df))

##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0

# Identifying unique entries in categorical fields
sapply(df %>% select_if(is.character), n_distinct)

## named list()

3.4 Univariate Analysis: Understanding the Baseline

The project explored the distribution of key variables to establish a baseline for the customer profiles.

3.4.1 Target Variable: Churn

The project observed that the dataset is imbalanced, with a significant majority of customers currently staying with the provider.

ggplot(df, aes(x = Churn, fill = Churn)) +
  geom_bar() +
  labs(title = "Baseline Distribution of Customer Churn", x = "Churn Status", y = "Total Count") +
  theme_minimal()

3.4.2 Distribution of Monthly Charges and Tenure

The project analyzed the density of charges to see the common billing tiers and investigated how long customers typically stay with the company.

# Density of Monthly Charges
ggplot(df, aes(x = MonthlyCharges)) +
  geom_density(fill = "steelblue", alpha = 0.6) +
  labs(title = "Density Distribution of Monthly Charges", x = "Monthly Charges ($)")

# Histogram of Tenure
ggplot(df, aes(x = tenure)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "white") +
  labs(title = "Distribution of Customer Tenure", x = "Months of Tenure", y = "Frequency")

3.5 Bivariate Analysis: Correlating Features with Churn

The project shifted the focus to how specific variables interact with the churn event.

3.5.1 Financial Drivers: Charges vs. Churn

The following visualization confirms the project’s hypothesis: customers who churn tend to have higher monthly charges compared to those who remain.

ggplot(df, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
  geom_boxplot() +
  labs(title = "Monthly Charges Comparison by Churn Status", y = "Monthly Charges ($)")

3.5.2 Loyalty Drivers: Tenure vs. Churn

The boxplot reveals that long-term tenure is a strong “moat” against churn. Customers who leave generally do so early in their lifecycle.

ggplot(df, aes(x = Churn, y = tenure, fill = Churn)) +
  geom_boxplot() +
  labs(title = "Tenure Duration Comparison by Churn Status", y = "Tenure (Months)")

3.6 Categorical Insights: Contract and Services

The project examined how service types and contract terms influence a customer’s decision to stay or leave.

3.6.1 Contract Influence

The project identified Contract Type as perhaps the strongest predictor in the dataset. Month-to-month users are at a much higher risk of leaving.

ggplot(df, aes(x = Contract, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Proportion by Contract Type", y = "Proportion") +
  theme_minimal()

3.6.2 Internet Service and Payment Methods

The project also looked at how specific technologies and billing methods correlate with churn.

# Internet Service Type
ggplot(df, aes(x = InternetService, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Proportion by Internet Service", y = "Proportion")

# Payment Method
ggplot(df, aes(x = PaymentMethod, fill = Churn)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Churn Proportion by Payment Method", y = "Proportion")

3.7 Key EDA Insights for Modeling

Based on the exploratory results, several high-priority features for the next phase has been identified:

-High Risk Factors: Month-to-month contracts, Fiber Optic internet, and Electronic check payments are strongly associated with higher churn rates.

-Protective Factors: One- and Two-year contracts, and longer tenure (loyalty), significantly reduce the likelihood of churn.

-Predictive Potential: Monthly Charges and Tenure are the most significant continuous variables to include in the predictive classification model.

3.8 EDA Conclusion

This EDA phase has successfully provided a clear understanding of the customer landscape. By identifying these critical drivers, the project has established a solid analytical foundation for building the Classification and Regression models in the following sections.

==================================================

4. Classification Analysis: Predictive Modeling for Churn

==================================================

Research Question: “Can we build a reliable model to identify high-risk customers based on their service usage and demographic profiles?”

4.1 Objective and Modeling Strategy

Building on the insights from the Exploratory Data Analysis (EDA), the project moved to the core predictive task of this project. The primary question the chapter aims to answer is: “Can we accurately predict customer churn using demographic characteristics and consumption behavior?”

To ensure the robustness of the predictions, wo distinct supervised learning models were decided to train and compare:

Logistic Regression: Serving as the baseline model due to its high interpretability and efficiency.

Random Forest: Employed as an advanced ensemble method to capture non-linear relationships and complex interactions between variables that a linear model might miss.

4.2 Data Preparation for Modeling

To maintain consistency across the workflow, the data_cleaned.csv generated in Part 1 was utilized . The project applied a 70/30 split to create independent training and testing sets, ensuring that the model evaluations reflect real-world performance.

library(caret)
library(randomForest)
library(pROC)
library(dplyr)

# Load the cleaned dataset
model_data <- read.csv("data_cleaned.csv", stringsAsFactors = TRUE)  

# Ensure the target variable is a factor
model_data$Churn <- factor(model_data$Churn, levels = c("Yes", "No"))

# Data Partitioning (70% Train, 30% Test)
set.seed(2025) 
trainIndex <- createDataPartition(model_data$Churn, p = 0.7, list = FALSE)
trainData <- model_data[trainIndex, ]
testData <- model_data[-trainIndex, ]

cat("Training samples:", nrow(trainData), "\n")

## Training samples: 4924

cat("Testing samples:", nrow(testData), "\n")

## Testing samples: 2108

4.3 Model Training

The project implemented 5-fold cross-validation to ensure the results are generalizable and to prevent overfitting.

fitControl <- trainControl (method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

# model1-Logistic Regression

set.seed (2025)
logit_model <- train (Churn ~ ., 
                     data = trainData, 
                     method = "glm", 
                     family = "binomial", 
                     trControl = fitControl, 
                     metric = "ROC")

#  model2-Random Forest
set.seed (2025)
rf_model <- train (Churn ~ ., 
                  data = trainData, 
                  method = "rf", 
                  trControl = fitControl, 
                  metric = "ROC",
                  tuneLength = 3)

4.4. Model Evaluation

The project evaluated both models using AUC (Area Under the Curve) and confusion matrices. A higher AUC indicates stronger discrimination ability.

#  ROC 
prob_logit <- predict (logit_model, testData, type = "prob")
prob_rf    <- predict (rf_model, testData, type = "prob")

# predict classfication
pred_logit <- predict (logit_model, testData)
pred_rf    <- predict (rf_model, testData)


cm_logit <- confusionMatrix (pred_logit, testData$Churn, mode = "everything")
cm_rf    <- confusionMatrix (pred_rf, testData$Churn, mode = "everything")


print (cm_logit)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  304  147
##        No   256 1401
##                                           
##                Accuracy : 0.8088          
##                  95% CI : (0.7914, 0.8254)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 7.089e-16       
##                                           
##                   Kappa : 0.4776          
##                                           
##  Mcnemar's Test P-Value : 7.454e-08       
##                                           
##             Sensitivity : 0.5429          
##             Specificity : 0.9050          
##          Pos Pred Value : 0.6741          
##          Neg Pred Value : 0.8455          
##               Precision : 0.6741          
##                  Recall : 0.5429          
##                      F1 : 0.6014          
##              Prevalence : 0.2657          
##          Detection Rate : 0.1442          
##    Detection Prevalence : 0.2139          
##       Balanced Accuracy : 0.7239          
##                                           
##        'Positive' Class : Yes             
##

print (cm_rf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  252  106
##        No   308 1442
##                                          
##                Accuracy : 0.8036         
##                  95% CI : (0.786, 0.8204)
##     No Information Rate : 0.7343         
##     P-Value [Acc > NIR] : 6.717e-14      
##                                          
##                   Kappa : 0.4312         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.4500         
##             Specificity : 0.9315         
##          Pos Pred Value : 0.7039         
##          Neg Pred Value : 0.8240         
##               Precision : 0.7039         
##                  Recall : 0.4500         
##                      F1 : 0.5490         
##              Prevalence : 0.2657         
##          Detection Rate : 0.1195         
##    Detection Prevalence : 0.1698         
##       Balanced Accuracy : 0.6908         
##                                          
##        'Positive' Class : Yes            
##

4.4.1 ROC Curve Comparison

The ROC curve shows model performance under different thresholds. Both curves on the same graph for direct comparison were plotted.

roc_logit <- roc (testData$Churn, prob_logit$Yes, levels = c ("No", "Yes"), direction = "<")
roc_rf    <- roc (testData$Churn, prob_rf$Yes, levels = c ("No", "Yes"), direction = "<")

# draw
plot (roc_logit, col = "#E69F00", lwd = 2, main = "ROC Curve Comparison: Logit vs RF",
     print.auc = TRUE, print.auc.y = 0.4, legacy.axes = TRUE)
plot (roc_rf, col = "#56B4E9", lwd = 2, add = TRUE, 
     print.auc = TRUE, print.auc.y = 0.3)
legend ("bottomright", legend = c ("Logistic Regression", "Random Forest"), 
       col = c ("#E69F00", "#56B4E9"), lwd = 2)

Interpretation—Both models show strong predictive performance, with AUC values above 0.80. Logistic Regression achieved an AUC of approximately 0.84, slightly higher than Random Forest. This suggests that, for this dataset, a simple linear model is already sufficient.

4.5 Key Findings and Business Insights

To understand the “why” behind customer churn, the variable importance derived from the Random Forest model is examined. This allows it to move beyond simple prediction into actionable business intelligence.

# Visualizing top predictors
varImpPlot(rf_model$finalModel, main = "Top Predictors of Churn (Random Forest)")

The analysis strongly confirms several critical drivers identified during EDA:

-Tenure is Paramount: New customers represent the highest risk; churn probability drops significantly as customer loyalty (tenure) increases.

-Price Sensitivity: Financial variables rank high in importance, indicating that cost-related pressure is a primary reason for leaving.

-Contractual Commitment: Customers on month-to-month plans are significantly more volatile than those on one- or two-year contracts.

-The Fiber Optic Paradox: Despite higher speeds, Fiber Optic users show higher churn rates, suggesting a potential gap between service cost and perceived value.

4.6 Classification Conclusion and Recommendations

Based on the modeling results, the following conclusions were drew:

-Model Selection: Logistic Regression was recommended for immediate deployment. It delivered strong, competitive performance (AUC ≈ 0.84) with the added benefit of low computational cost and clear interpret ability for stakeholders.

-Retention Strategy: Management should focus retention efforts on the first 0–12 months of the customer lifecycle. Specifically, implementing incentive programs to migrate users from month-to-month plans to long-term contracts could substantially reduce overall churn risk.

==================================================

5. Regression Analysis: Deconstructing Pricing Drivers

==================================================

Research Question: “How do individual service subscriptions quantitatively influence Monthly Charges after controlling for account profiles?”

5.1 Objective and Rational

While the previous section focused on predicting the behavior of customers (Churn), this chapter now shifts focus to auditing the financial structure of the business. The aim of this section is to quantify how various service subscriptions and customer attributes influence the monthly fee charged. By constructing and interpreting a Linear Regression model, this chapter provides the business with transparent pricing insights and identifies the primary levers for revenue optimization.

5.2 Data and Preparation

The analysis uses the pre-cleaned Telco customer churn dataset (data_cleaned.csv), which contains records of approximately 7,000 customers. Each observation describes demographics, contract details, service subscriptions, and billing information. Since the data have already been cleaned by previous parts, the preparation focuses on correctly encoding categorical variables.

library(tidyverse)  
library(car)       
library(Metrics)   

# Load the cleaned data
telco <- read_csv('data_cleaned.csv')

# Encoding categorical variables as factors
telco <- telco %>%
  mutate(across(c(gender, SeniorCitizen, Partner, Dependents, PhoneService,
                  MultipleLines, InternetService, OnlineSecurity, OnlineBackup,
                  DeviceProtection, TechSupport, StreamingTV, StreamingMovies,
                  Contract, PaperlessBilling, PaymentMethod, Churn), as.factor))

# Ensuring numerical columns are correctly typed
telco <- telco %>% mutate(across(c(tenure, MonthlyCharges, TotalCharges), as.numeric))

glimpse(telco)

## Rows: 7,032
## Columns: 20
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…

Key variables include tenure, Contract, and a series of binary service indicators such as PhoneService, MultipleLines, and InternetService. The MonthlyCharges variable is continuous and serves as the dependent variable in the regression models.

5.3 Exploratory Analysis of Monthly Charges

Before fitting any models, the project explore the distribution of monthly charges and their relationship with tenure. The average customer pays around RM 65 per month, with a range between RM 18 and RM 119.

summary_stats <- summary(telco$MonthlyCharges)
summary_table <- tibble(
  Statistic = c("Min", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max"),
  Value     = as.numeric(summary_stats)
)
knitr::kable(summary_table, digits = 2, caption = "Summary of Monthly Charges")

Summary of Monthly Charges
Statistic	Value
Min	18.25
1st Qu.	35.59
Median	70.35
Mean	64.80
3rd Qu.	89.86
Max	118.75

The project also examined the correlation between a customer’s tenure and their monthly fee.

ggplot(telco, aes(x = tenure, y = MonthlyCharges)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +
  labs(title = "Relationship Between Tenure and Monthly Charges",
       x = "Customer Tenure (months)", y = "Monthly Charges") +
  theme_minimal()

cor_test <- cor.test(telco$tenure, telco$MonthlyCharges)
cor_test

## 
##  Pearson's product-moment correlation
## 
## data:  telco$tenure and telco$MonthlyCharges
## t = 21.359, df = 7030, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2247854 0.2686849
## sample estimates:
##       cor 
## 0.2468618

The correlation coefficient is relatively low (~0.25) but statistically significant. This indicates a modest positive association: customers with longer tenure tend to pay slightly higher fees, likely due to additional service subscriptions over time.

5.4 Building Linear Regression Models

The project fits two multiple linear regression models to understand the pricing structure:

1.Full model: Includes demographics, contract details, billing setup, and all service indicators.

2，Pricing model: Focuses on service subscription variables while controlling for tenure and Contract.

full_model <- lm(MonthlyCharges ~ gender + SeniorCitizen + Partner + Dependents +
                   tenure + PhoneService + MultipleLines + InternetService +
                   OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
                   StreamingTV + StreamingMovies + Contract +
                   PaperlessBilling + PaymentMethod,
                 data = telco)
summary(full_model)

## 
## Call:
## lm(formula = MonthlyCharges ~ gender + SeniorCitizen + Partner + 
##     Dependents + tenure + PhoneService + MultipleLines + InternetService + 
##     OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod, data = telco)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2296 -0.6166 -0.0055  0.6079  4.8413 
## 
## Coefficients:
##                                        Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                           2.496e+01  5.825e-02  428.509   <2e-16
## genderMale                            2.492e-02  2.447e-02    1.018    0.309
## SeniorCitizenYes                      1.434e-02  3.560e-02    0.403    0.687
## PartnerYes                           -4.216e-02  2.959e-02   -1.425    0.154
## DependentsYes                         1.416e-02  3.140e-02    0.451    0.652
## tenure                                1.455e-04  8.376e-04    0.174    0.862
## PhoneServiceYes                       2.006e+01  4.815e-02  416.574   <2e-16
## MultipleLinesYes                      5.019e+00  2.954e-02  169.908   <2e-16
## InternetServiceFiber optic            2.496e+01  3.488e-02  715.626   <2e-16
## InternetServiceNo                    -2.505e+01  4.874e-02 -513.867   <2e-16
## OnlineSecurityYes                     5.013e+00  3.217e-02  155.850   <2e-16
## OnlineBackupYes                       4.996e+00  3.023e-02  165.278   <2e-16
## DeviceProtectionYes                   5.020e+00  3.131e-02  160.305   <2e-16
## TechSupportYes                        5.033e+00  3.279e-02  153.491   <2e-16
## StreamingTVYes                        9.970e+00  3.205e-02  311.106   <2e-16
## StreamingMoviesYes                    9.966e+00  3.207e-02  310.727   <2e-16
## ContractOne year                      1.019e-02  3.832e-02    0.266    0.790
## ContractTwo year                     -2.564e-02  4.648e-02   -0.552    0.581
## PaperlessBillingYes                  -2.071e-02  2.735e-02   -0.757    0.449
## PaymentMethodCredit card (automatic)  5.085e-05  3.710e-02    0.001    0.999
## PaymentMethodElectronic check        -1.940e-02  3.636e-02   -0.534    0.594
## PaymentMethodMailed check            -1.399e-02  3.950e-02   -0.354    0.723
##                                         
## (Intercept)                          ***
## genderMale                              
## SeniorCitizenYes                        
## PartnerYes                              
## DependentsYes                           
## tenure                                  
## PhoneServiceYes                      ***
## MultipleLinesYes                     ***
## InternetServiceFiber optic           ***
## InternetServiceNo                    ***
## OnlineSecurityYes                    ***
## OnlineBackupYes                      ***
## DeviceProtectionYes                  ***
## TechSupportYes                       ***
## StreamingTVYes                       ***
## StreamingMoviesYes                   ***
## ContractOne year                        
## ContractTwo year                        
## PaperlessBillingYes                     
## PaymentMethodCredit card (automatic)    
## PaymentMethodElectronic check           
## PaymentMethodMailed check               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 7010 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 2.88e+05 on 21 and 7010 DF,  p-value: < 2.2e-16

Many demographic variables have p-values above 0.05, while specific service indicators are highly significant. This supports the hypothesis that monthly charges are essentially the sum of base fees plus subscribed services.

5.5 Multi-collinearity Check

The project verified that multi-collinearity is not inflating standard errors using Variance Inflation Factors (VIFs).

vif(full_model)

##                      GVIF Df GVIF^(1/(2*Df))
## gender           1.001769  1        1.000884
## SeniorCitizen    1.153168  1        1.073857
## Partner          1.462369  1        1.209284
## Dependents       1.380811  1        1.175079
## tenure           2.827609  1        1.681549
## PhoneService     1.354768  1        1.163945
## MultipleLines    1.423682  1        1.193182
## InternetService  3.783812  2        1.394705
## OnlineSecurity   1.415235  1        1.189637
## OnlineBackup     1.380823  1        1.175084
## DeviceProtection 1.480118  1        1.216601
## TechSupport      1.481353  1        1.217108
## StreamingTV      1.625973  1        1.275137
## StreamingMovies  1.634755  1        1.278575
## Contract         2.567229  2        1.265803
## PaperlessBilling 1.208329  1        1.099240
## PaymentMethod    1.580562  3        1.079283

Values are under commonly cited thresholds, suggesting stable estimated coefficients.

5.6 Pricing Model

The project specifies a refined pricing model including only the variables most directly related to the monthly fee.

pricing_model <- lm(MonthlyCharges ~ tenure + Contract + PhoneService + MultipleLines +
                      InternetService + OnlineSecurity + OnlineBackup +
                      DeviceProtection + TechSupport + StreamingTV +
                      StreamingMovies,
                    data = telco)
summary(pricing_model)

## 
## Call:
## lm(formula = MonthlyCharges ~ tenure + Contract + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies, 
##     data = telco)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2425 -0.6160 -0.0047  0.6063  4.8075 
## 
## Coefficients:
##                              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 2.495e+01  4.550e-02  548.265   <2e-16 ***
## tenure                     -4.821e-07  7.952e-04   -0.001    1.000    
## ContractOne year            1.262e-02  3.792e-02    0.333    0.739    
## ContractTwo year           -2.299e-02  4.571e-02   -0.503    0.615    
## PhoneServiceYes             2.006e+01  4.804e-02  417.534   <2e-16 ***
## MultipleLinesYes            5.017e+00  2.943e-02  170.454   <2e-16 ***
## InternetServiceFiber optic  2.495e+01  3.401e-02  733.750   <2e-16 ***
## InternetServiceNo          -2.504e+01  4.760e-02 -526.106   <2e-16 ***
## OnlineSecurityYes           5.014e+00  3.194e-02  156.970   <2e-16 ***
## OnlineBackupYes             4.995e+00  3.019e-02  165.446   <2e-16 ***
## DeviceProtectionYes         5.020e+00  3.126e-02  160.597   <2e-16 ***
## TechSupportYes              5.034e+00  3.256e-02  154.609   <2e-16 ***
## StreamingTVYes              9.967e+00  3.190e-02  312.458   <2e-16 ***
## StreamingMoviesYes          9.964e+00  3.197e-02  311.698   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 7018 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 4.655e+05 on 13 and 7018 DF,  p-value: < 2.2e-16

The Adjusted R² is very high (~0.9988), explaining over 99.8% of the variation in charges, confirming a deterministic pricing structure driven by selected services.

5.7 Interpreting the Coefficients

The project tabulates the coefficients for significant predictors (p < 0.05).

coef_df <- broom::tidy(pricing_model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(Effect = ifelse(estimate > 0, "Positive", "Negative")) %>%
  select(term, estimate, conf.low, conf.high, p.value, Effect)

knitr::kable(coef_df, digits = 3,
             col.names = c("Predictor", "Estimate", "CI Lower", "CI Upper", "p-value", "Effect"),
             caption = "Coefficients for the pricing model (significant predictors).")

Coefficients for the pricing model (significant predictors).
Predictor	Estimate	CI Lower	CI Upper	p-value	Effect
tenure	0.000	-0.002	0.002	1.000	Negative
ContractOne year	0.013	-0.062	0.087	0.739	Positive
ContractTwo year	-0.023	-0.113	0.067	0.615	Negative
PhoneServiceYes	20.058	19.964	20.152	0.000	Positive
MultipleLinesYes	5.017	4.959	5.075	0.000	Positive
InternetServiceFiber optic	24.954	24.887	25.020	0.000	Positive
InternetServiceNo	-25.043	-25.136	-24.950	0.000	Negative
OnlineSecurityYes	5.014	4.951	5.077	0.000	Positive
OnlineBackupYes	4.995	4.936	5.054	0.000	Positive
DeviceProtectionYes	5.020	4.959	5.082	0.000	Positive
TechSupportYes	5.034	4.971	5.098	0.000	Positive
StreamingTVYes	9.967	9.905	10.030	0.000	Positive
StreamingMoviesYes	9.964	9.901	10.026	0.000	Positive

5.8 Model Performance

The project quantified prediction error using RMSE and MAE.

pred <- predict(pricing_model, newdata = telco)
rmse_val <- sqrt(mean((telco$MonthlyCharges - pred)^2))

The RMSE (~1.02) and MAE (~0.78) indicate that the model’s predictions are on average within about RM 1 of the actual monthly charges. Given that charges range from RM 18 to RM 118, such a small error underscores the deterministic nature of the pricing structure.

5.9 Diagnostics

The diagnostic plot compares actual versus predicted charges.

ggplot(data.frame(Actual = telco$MonthlyCharges, Predicted = pred), aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, colour = 'red', linetype = 'dashed') +
  labs(title = 'Actual vs Predicted Monthly Charges',
       x = 'Actual Monthly Charges',
       y = 'Predicted Monthly Charges') +
  theme_minimal()

5.10 Key Findings and Business Implications

The regression analysis reveals a transparent and additive pricing structure. The following strategic takeaways has identified:

-Service-Driven Pricing: Subscriptions to phone services, multiple lines, fiber optic internet, and optional add-ons (like Streaming TV) have fixed and statistically significant effects on the monthly fee.

-Demographic Neutrality: After controlling for service mix and contract length, demographics such as gender or having a partner do not influence the price. This confirms that the company maintains billing equity across all customer segments.

-The Role of Tenure: A positive but very small tenure coefficient is founded, suggesting only minor price adjustments or upgrades occur over time. The specific service mix is a much stronger determinant of the bill than tenure alone.

-Actionable Revenue Levers: Because the pricing is so deterministic (R-Squared ≈ 0.99), the company can confidently offer “service-based” bundled discounts (e.g., discounting streaming services for high-risk churn groups identified in Part 4) to increase perceived value without disrupting the core pricing logic.

5.11 Regression Conclusion

By integrating the classification and regression findings, a dual-action strategy was provided: who is likely to leave can now be identified and precisely how their monthly bill through targeted service bundles to retain them can be adjusted.

==================================================

6.Final Conclusion

==================================================

6.1 Project Synthesis

The project successfully moved beyond simple data observation to create a cohesive framework for customer retention. By merging classification techniques with a detailed pricing audit through regression, a comprehensive picture of the customer lifecycle has be built. Who is likely to leave was not just identified; The underlying financial mechanics that influence that decision were uncovered.

6.2 Classification Insights and the Strategic Pivot

The classification analysis provided a reliable foundation for identifying at-risk customers. Both the Logistic Regression and Random Forest models demonstrated strong predictive power, achieving AUC values around 0.84. While the Random Forest model offered a slight edge in capturing complex, non-linear interactions, both models were consistent in identifying the primary drivers of churn.

According to the Variable Importance Plot, the top three predictors are:

-Tenure: Long-term loyalty significantly reduces the likelihood of leaving.

-Contract Type: Month-to-month plans are high-risk zones.

-Monthly Charges: Financial pressure is a direct trigger for churn.

This finding led to a vital Strategic Question: If Monthly Charges are so critical to churn, how exactly does the company calculate these fees? This question served as the logical bridge to the regression analysis.

6.3 Deconstructing the Pricing Structure

The regression model provided remarkable clarity, achieving an Adjusted R-squared of 0.9988. This confirms that the company’s pricing structure is highly deterministic and based almost entirely on the specific services a customer selects.

The findings from this audit include:

-Service Mix vs. Tenure: While the scatter plot shows a slight positive correlation between tenure and monthly charges (indicated by the blue trend line), the high variance at every level suggests that the specific service mix (e.g., Fiber Optic, Streaming) is a much stronger determinant of price than tenure alone.

-Primary Cost Drivers: Fiber Optic service and Streaming add-ons are the most significant contributors to the monthly fee.

Billing Equity: Demographic factors such as Gender and Partner status showed negligible coefficients, ensuring that the company maintains pricing consistency and equity across different customer segments.

6.4 Data-Driven Business Recommendations

Because the pricing logic is so consistent and transparent, the company can confidently implement targeted retention strategies without disrupting its core business model. Based on the integrated analysis, the following is recommended:

-Implement Service-Based Discounts: Instead of general price cuts, the company should offer targeted bundles (e.g., discounting streaming services for long-tenure customers) to lower the Monthly Charges for at-risk users.

-Focus on High-Value Churners: Priority should be given to Fiber Optic users in their first 12 months, as they face the highest costs and show the highest risk of churn.

-Contract Incentivization: Use the deterministic nature of the pricing model to create transparent “upgrade paths” that move customers from month-to-month plans to long-term contracts in exchange for localized service discounts.

By linking the “who” (classification) with the “how much” (regression), the project have provided a roadmap that allows the company to balance revenue goals with a fairer, more personalized customer experience.

Predicting Customer Churn and Analyzing Pricing Factors: A Deep Dive into Telecommunications User Behavior

2026-01-07

==================================================

Project Overview & Team Contributions

==================================================

==================================================

1. Introduction and Project Objective

==================================================

==================================================

2. Data cleaning and Preprocessing

==================================================

2.1 Objective and Strategy

2.2 Environment Setup and Data Import

2.3 Initial Diagnostics

2.4 Execution of Cleaning Steps

2.5 Final Validation and Export

==================================================

3. Exploratory Data Analysis (EDA)

==================================================

3.1 Introduction and Objective

3.2 Environment and Data Acquisition

Loading the raw data for exploration

3.3.1 Quality Check: Missing and Unique Values

3.4 Univariate Analysis: Understanding the Baseline

3.4.1 Target Variable: Churn

3.4.2 Distribution of Monthly Charges and Tenure

3.5 Bivariate Analysis: Correlating Features with Churn

3.5.1 Financial Drivers: Charges vs. Churn

3.5.2 Loyalty Drivers: Tenure vs. Churn

3.6 Categorical Insights: Contract and Services

3.6.1 Contract Influence

3.6.2 Internet Service and Payment Methods

3.7 Key EDA Insights for Modeling

3.8 EDA Conclusion

==================================================

4. Classification Analysis: Predictive Modeling for Churn

==================================================

4.1 Objective and Modeling Strategy

4.2 Data Preparation for Modeling

4.3 Model Training

4.4. Model Evaluation

4.4.1 ROC Curve Comparison

4.5 Key Findings and Business Insights

4.6 Classification Conclusion and Recommendations

==================================================

5. Regression Analysis: Deconstructing Pricing Drivers

==================================================

5.1 Objective and Rational

5.2 Data and Preparation

5.3 Exploratory Analysis of Monthly Charges

5.4 Building Linear Regression Models

5.5 Multi-collinearity Check

5.6 Pricing Model

5.7 Interpreting the Coefficients

5.8 Model Performance

5.9 Diagnostics

5.10 Key Findings and Business Implications

5.11 Regression Conclusion

==================================================

6.Final Conclusion

==================================================

6.1 Project Synthesis

6.2 Classification Insights and the Strategic Pivot

6.3 Deconstructing the Pricing Structure

6.4 Data-Driven Business Recommendations