==================================================

# Project Overview & Team Contributions

==================================================

Executive Summary This project presents a dual-layered analytical approach to the telecommunications industry, bridging the gap between Customer Retention (Classification) and Pricing Integrity (Regression). By analyzing 7,000+ customer records, our team developed a predictive framework that identifies high-risk churners (AUC 0.84) and an auditing model that explains 99.88% of pricing variance.

Team Contributions & Division of Labor Our team operated under a structured Data Science Lifecycle, ensuring every phase—from raw data to strategic insight—was handled with technical precision:

TIANBOYI (25069211-Team Leader): Integrate project.Defined the research framework and core business questions. Responsible for global quality control, code optimization, and the final synthesis of the R Markdown report. Authored the strategic introduction and executive conclusions.

YE MINGMIN (24070972 - Data Engineer): Data Preprocessing. Executed end-to-end data cleaning, handling missing values and feature engineering to ensure a high-quality “Single Source of Truth” for modeling.

MA LING (24206583 - Data Analyst): Exploratory Data Analysis (EDA). Conducted multi-dimensional visualization and statistical profiling to uncover hidden correlations between customer demographics and churn behavior.

YU MINGJUN (24222016 - Modeling Specialist): Regression Analysis. Built and validated the Linear Regression models to deconstruct the deterministic pricing structure of monthly charges.

YANG YICHEN (24211050 - ML Specialist): Classification Modeling. Implemented and tuned Logistic Regression and Random Forest algorithms to predict customer churn, focused on maximizing predictive power and model interpretability.

Key Outcomes 1.Churn Prediction: Achieved a robust AUC of 0.84, identifying Fiber Optic users and month-to-month contract holders as high-risk segments.

2.Pricing Audit: Confirmed a highly consistent billing logic with an Adjusted R² of 0.9988, proving that charges are strictly driven by service components rather than demographic bias.

3.Strategic Roadmap: Provided data-driven recommendations for service-based bundling and targeted retention for high-value customers.

4.Technical Stack: R (tidyverse, caret, randomForest, ggplot2, broom)

==================================================

1. Introduction and Project Objective

==================================================

This project aims to address one of the most critical challenges in the telecommunications industry: Customer Churn. Our team has approached this problem from two distinct analytical angles:

-Classification: Predicting which customers are likely to leave based on their behavior and service profiles.

-Regression: Conducting a pricing audit to understand exactly how our service components drive the MonthlyCharges.

By combining these two methods, we provide the business with actionable strategies for both retention and revenue optimization.

==================================================

2. Data cleaning and Preprocessing

==================================================

Before any analysis could begin, our team performed a rigorous data cleaning process to ensure the integrity of the results.

2.1 Objective and Strategy

As the first stage of our group project, the primary objective was to transform the raw telecommunications dataset into a high-quality, analysis-ready format. Our team focus was on ensuring data integrity, handling missing values, and correctly encoding variables to support both the classification and regression tasks that follow.

2.2 Environment Setup and Data Import

To maintain consistency across all team members’ environments, we utilized the tidyverse suite for data manipulation. The raw dataset, consisting of over 7,000 customer records, was imported for initial inspection.

# 1. Load required packages
library(tidyverse)

# 2. Read the raw data
# Ensuring the raw file is in the working directory
df <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Preview the first few rows
head(df)

##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service             DSL             No          Yes               No
## 2               No             DSL            Yes           No              Yes
## 3               No             DSL            Yes          Yes               No
## 4 No phone service             DSL            Yes           No              Yes
## 5               No     Fiber optic             No           No               No
## 6              Yes     Fiber optic             No           No              Yes
##   TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No          No              No Month-to-month              Yes
## 2          No          No              No       One year               No
## 3          No          No              No Month-to-month              Yes
## 4         Yes          No              No       One year               No
## 5          No          No              No Month-to-month              Yes
## 6          No         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn
## 1          Electronic check          29.85        29.85    No
## 2              Mailed check          56.95      1889.50    No
## 3              Mailed check          53.85       108.15   Yes
## 4 Bank transfer (automatic)          42.30      1840.75    No
## 5          Electronic check          70.70       151.65   Yes
## 6          Electronic check          99.65       820.50   Yes

2.3 Initial Diagnostics

Our team performed a comprehensive scan of the data structure. We identified that the TotalCharges column was improperly read as a character type due to empty strings, and several categorical variables needed to be converted into factors for proper modeling.

# Verify data types and dimensions
glimpse(df)

## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…

cat("Dataset Dimensions:", dim(df)[1], "rows and", dim(df)[2], "columns.")

## Dataset Dimensions: 7043 rows and 21 columns.

2.4 Execution of Cleaning Steps

To prepare the data, our team executed a rigorous cleaning pipeline:

-Feature Selection: We removed the customerID column as it provides no predictive value.

-Type Conversion: TotalCharges was converted to numeric.

-Handling Missing Values: We identified 11 missing values in TotalCharges (occurring where tenure was 0). We decided to remove these records to ensure model accuracy.

-Categorical Encoding: We converted all categorical character columns and the SeniorCitizen indicator into factors.

# 1. Remove unnecessary ID column
df <- df %>% select(-customerID)

# 2. Fix TotalCharges and handle NAs
df <- df %>%
  mutate(TotalCharges = as.numeric(as.character(TotalCharges))) %>%
  drop_na()

# 3. Factorize categorical variables
# We grouped variables to ensure consistent encoding
df <- df %>%
  mutate(across(
    c(gender, SeniorCitizen, Partner, Dependents, PhoneService, 
      MultipleLines, InternetService, OnlineSecurity, OnlineBackup, 
      DeviceProtection, TechSupport, StreamingTV, StreamingMovies, 
      Contract, PaperlessBilling, PaymentMethod, Churn),
    as.factor
  ))

# 4. Final validation of numeric scales
df <- df %>%
  mutate(
    tenure = as.numeric(tenure),
    MonthlyCharges = as.numeric(MonthlyCharges)
  )

2.5 Final Validation and Export

Before handing off the data to the EDA and Modeling teams, we performed a final quality check to confirm there are no remaining missing values and that all factor levels are correctly assigned.

# Check for any remaining NAs
colSums(is.na(df))

##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0

# Confirm clean dataset structure
str(df)

## 'data.frame':    7032 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Finally, we exported this version as data_cleaned.csv to serve as the “Single Source of Truth” for all remaining project parts.

# --- Export Data ---

# 1. Define output path
export_path <- "data_cleaned.csv" 

# 2. Create folder if it doesn't exist
if (!dir.exists(dirname(export_path))) {
  dir.create(dirname(export_path), recursive = TRUE)
}

# 3. Try to save data; if it fails (e.g., file open in Excel), just show a warning
# This prevents the entire HTML knitting process from stopping.
tryCatch({
  write.csv(df, export_path, row.names = FALSE)
  message("Success: Data saved to ", export_path)
}, error = function(e) {
  warning("Could not save CSV. Please close 'data_cleaned.csv' in Excel and try again.")
})

==================================================

3. Exploratory Data Analysis (EDA)

==================================================

3.1 Introduction and Objective

After the initial data cleaning, our team focused on a deep dive into the dataset to understand the “signals” behind customer churn. The objective of this EDA is to investigate the data structure, assess quality issues, and identify which demographic and service attributes have the strongest correlation with Churn. These insights serve as the empirical foundation for our subsequent classification and regression models.

3.2 Environment and Data Acquisition

Our team selected a specialized toolset—ggplot2 for visualization and skimr for detailed statistical summaries—to ensure our analysis is both visually clear and statistically sound.

```{rload-libraries} library(dplyr) library(ggplot2) library(tidyr) library(skimr)

Loading the raw data for exploration

telco <- read.csv(“WA_Fn-UseC_-Telco-Customer-Churn.csv”, stringsAsFactors = FALSE)



## 3.3 Comprehensive Data Diagnostics
We began by validating the dimensions and structure of the dataset to confirm the scope of our analysis.

``` r
# --- Comprehensive Data Diagnostics ---

# Check the internal structure of the dataframe
# We use 'df' as defined in Section 1
str(df)

## 'data.frame':    7032 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

# Summarize the distribution of variables
summary(df)

##     gender     SeniorCitizen Partner    Dependents     tenure      PhoneService
##  Female:3483   0:5890        No :3639   No :4933   Min.   : 1.00   No : 680    
##  Male  :3549   1:1142        Yes:3393   Yes:2099   1st Qu.: 9.00   Yes:6352    
##                                                    Median :29.00               
##                                                    Mean   :32.42               
##                                                    3rd Qu.:55.00               
##                                                    Max.   :72.00               
##           MultipleLines     InternetService             OnlineSecurity
##  No              :3385   DSL        :2416   No                 :3497  
##  No phone service: 680   Fiber optic:3096   No internet service:1520  
##  Yes             :2967   No         :1520   Yes                :2015  
##                                                                       
##                                                                       
##                                                                       
##               OnlineBackup             DeviceProtection
##  No                 :3087   No                 :3094   
##  No internet service:1520   No internet service:1520   
##  Yes                :2425   Yes                :2418   
##                                                        
##                                                        
##                                                        
##               TechSupport                StreamingTV  
##  No                 :3472   No                 :2809  
##  No internet service:1520   No internet service:1520  
##  Yes                :2040   Yes                :2703  
##                                                       
##                                                       
##                                                       
##             StreamingMovies           Contract    PaperlessBilling
##  No                 :2781   Month-to-month:3875   No :2864        
##  No internet service:1520   One year      :1472   Yes:4168        
##  Yes                :2731   Two year      :1685                   
##                                                                   
##                                                                   
##                                                                   
##                    PaymentMethod  MonthlyCharges    TotalCharges    Churn     
##  Bank transfer (automatic):1542   Min.   : 18.25   Min.   :  18.8   No :5163  
##  Credit card (automatic)  :1521   1st Qu.: 35.59   1st Qu.: 401.4   Yes:1869  
##  Electronic check         :2365   Median : 70.35   Median :1397.5             
##  Mailed check             :1604   Mean   : 64.80   Mean   :2283.3             
##                                   3rd Qu.: 89.86   3rd Qu.:3794.7             
##                                   Max.   :118.75   Max.   :8684.8

# Check for missing values across all columns
colSums(is.na(df))

##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0

3.3.1 Quality Check: Missing and Unique Values

A critical part of our team’s workflow was identifying potential data gaps. We discovered that TotalCharges contained several missing values that required attention, while categorical variables remained highly consistent.

# Missing values across all columns
colSums(is.na(df))

##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0

# Identifying unique entries in categorical fields
sapply(df %>% select_if(is.character), n_distinct)

## named list()

3.4 Univariate Analysis: Understanding the Baseline

We explored the distribution of key variables to establish a baseline for our customer profiles.

3.4.1 Target Variable: Churn

Our team observed that the dataset is imbalanced, with a significant majority of customers currently staying with the provider.

ggplot(df, aes(x = Churn, fill = Churn)) +
  geom_bar() +
  labs(title = "Baseline Distribution of Customer Churn", x = "Churn Status", y = "Total Count") +
  theme_minimal()

3.4.2 Distribution of Monthly Charges and Tenure

We analyzed the density of charges to see the common billing tiers and investigated how long customers typically stay with the company.

# Density of Monthly Charges
ggplot(df, aes(x = MonthlyCharges)) +
  geom_density(fill = "steelblue", alpha = 0.6) +
  labs(title = "Density Distribution of Monthly Charges", x = "Monthly Charges ($)")

# Histogram of Tenure
ggplot(df, aes(x = tenure)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "white") +
  labs(title = "Distribution of Customer Tenure", x = "Months of Tenure", y = "Frequency")

3.5 Bivariate Analysis: Correlating Features with Churn

Our team shifted the focus to how specific variables interact with the churn event.

3.5.1 Financial Drivers: Charges vs. Churn

The following visualization confirms our team’s hypothesis: customers who churn tend to have higher monthly charges compared to those who remain.

ggplot(df, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
  geom_boxplot() +
  labs(title = "Monthly Charges Comparison by Churn Status", y = "Monthly Charges ($)")

3.5.2 Loyalty Drivers: Tenure vs. Churn

The boxplot reveals that long-term tenure is a strong “moat” against churn. Customers who leave generally do so early in their lifecycle.

ggplot(df, aes(x = Churn, y = tenure, fill = Churn)) +
  geom_boxplot() +
  labs(title = "Tenure Duration Comparison by Churn Status", y = "Tenure (Months)")

3.6 Categorical Insights: Contract and Services

We examined how service types and contract terms influence a customer’s decision to stay or leave.

3.6.1 Contract Influence

The team identified Contract Type as perhaps the strongest predictor in the dataset. Month-to-month users are at a much higher risk of leaving.

ggplot(df, aes(x = Contract, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Proportion by Contract Type", y = "Proportion") +
  theme_minimal()

3.6.2 Internet Service and Payment Methods

We also looked at how specific technologies and billing methods correlate with churn.

# Internet Service Type
ggplot(df, aes(x = InternetService, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Proportion by Internet Service", y = "Proportion")

# Payment Method
ggplot(df, aes(x = PaymentMethod, fill = Churn)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Churn Proportion by Payment Method", y = "Proportion")

3.7 Key EDA Insights for Modeling

Based on our exploratory results, our team has identified several high-priority features for the next phase:

-High Risk Factors: Month-to-month contracts, Fiber Optic internet, and Electronic check payments are strongly associated with higher churn rates.

-Protective Factors: One- and Two-year contracts, and longer tenure (loyalty), significantly reduce the likelihood of churn.

-Predictive Potential: Monthly Charges and Tenure are the most significant continuous variables to include in our predictive classification model.

3.8 EDA Conclusion

This EDA phase has successfully provided our team with a clear understanding of the customer landscape. By identifying these critical drivers, we have established a solid analytical foundation for building our Classification and Regression models in the following sections.

==================================================

4. Classification Analysis: Predictive Modeling for Churn

==================================================

Research Question: “Can we build a reliable model to identify high-risk customers based on their service usage and demographic profiles?”

4.1 Objective and Modeling Strategy

Building on the insights from our Exploratory Data Analysis (EDA), our team moved to the core predictive task of this project. The primary question we aim to answer is: “Can we accurately predict customer churn using demographic characteristics and consumption behavior?”

To ensure the robustness of our predictions, our team decided to train and compare two distinct supervised learning models:

Logistic Regression: Serving as our baseline model due to its high interpretability and efficiency.

Random Forest: Employed as an advanced ensemble method to capture non-linear relationships and complex interactions between variables that a linear model might miss.

4.2 Data Preparation for Modeling

To maintain consistency across our project workflow, we utilized the data_cleaned.csv generated in Part 1. Our team applied a 70/30 split to create independent training and testing sets, ensuring that our model evaluations reflect real-world performance.

library(caret)
library(randomForest)
library(pROC)
library(dplyr)

# Load the cleaned dataset
model_data <- read.csv("data_cleaned.csv", stringsAsFactors = TRUE)  

# Ensure the target variable is a factor
model_data$Churn <- factor(model_data$Churn, levels = c("Yes", "No"))

# Data Partitioning (70% Train, 30% Test)
set.seed(2025) 
trainIndex <- createDataPartition(model_data$Churn, p = 0.7, list = FALSE)
trainData <- model_data[trainIndex, ]
testData <- model_data[-trainIndex, ]

cat("Training samples:", nrow(trainData), "\n")

## Training samples: 4924

cat("Testing samples:", nrow(testData), "\n")

## Testing samples: 2108

4.3 Model Training

We implement 5-fold cross-validation to ensure our results are generalizable and to prevent overfitting.

fitControl <- trainControl (method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

# model1-Logistic Regression

set.seed (2025)
logit_model <- train (Churn ~ ., 
                     data = trainData, 
                     method = "glm", 
                     family = "binomial", 
                     trControl = fitControl, 
                     metric = "ROC")

#  model2-Random Forest
set.seed (2025)
rf_model <- train (Churn ~ ., 
                  data = trainData, 
                  method = "rf", 
                  trControl = fitControl, 
                  metric = "ROC",
                  tuneLength = 3)

4.4. Model Evaluation

We evaluated both models using AUC (Area Under the Curve) and confusion matrices. A higher AUC indicates stronger discrimination ability.

#  ROC 
prob_logit <- predict (logit_model, testData, type = "prob")
prob_rf    <- predict (rf_model, testData, type = "prob")

# predict classfication
pred_logit <- predict (logit_model, testData)
pred_rf    <- predict (rf_model, testData)


cm_logit <- confusionMatrix (pred_logit, testData$Churn, mode = "everything")
cm_rf    <- confusionMatrix (pred_rf, testData$Churn, mode = "everything")


print (cm_logit)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  304  147
##        No   256 1401
##                                           
##                Accuracy : 0.8088          
##                  95% CI : (0.7914, 0.8254)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 7.089e-16       
##                                           
##                   Kappa : 0.4776          
##                                           
##  Mcnemar's Test P-Value : 7.454e-08       
##                                           
##             Sensitivity : 0.5429          
##             Specificity : 0.9050          
##          Pos Pred Value : 0.6741          
##          Neg Pred Value : 0.8455          
##               Precision : 0.6741          
##                  Recall : 0.5429          
##                      F1 : 0.6014          
##              Prevalence : 0.2657          
##          Detection Rate : 0.1442          
##    Detection Prevalence : 0.2139          
##       Balanced Accuracy : 0.7239          
##                                           
##        'Positive' Class : Yes             
##

print (cm_rf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  252  106
##        No   308 1442
##                                          
##                Accuracy : 0.8036         
##                  95% CI : (0.786, 0.8204)
##     No Information Rate : 0.7343         
##     P-Value [Acc > NIR] : 6.717e-14      
##                                          
##                   Kappa : 0.4312         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.4500         
##             Specificity : 0.9315         
##          Pos Pred Value : 0.7039         
##          Neg Pred Value : 0.8240         
##               Precision : 0.7039         
##                  Recall : 0.4500         
##                      F1 : 0.5490         
##              Prevalence : 0.2657         
##          Detection Rate : 0.1195         
##    Detection Prevalence : 0.1698         
##       Balanced Accuracy : 0.6908         
##                                          
##        'Positive' Class : Yes            
##

4.4.1 ROC Curve Comparison

The ROC curve shows model performance under different thresholds. We plotted both curves on the same graph for direct comparison.

roc_logit <- roc (testData$Churn, prob_logit$Yes, levels = c ("No", "Yes"), direction = "<")
roc_rf    <- roc (testData$Churn, prob_rf$Yes, levels = c ("No", "Yes"), direction = "<")

# draw
plot (roc_logit, col = "#E69F00", lwd = 2, main = "ROC Curve Comparison: Logit vs RF",
     print.auc = TRUE, print.auc.y = 0.4, legacy.axes = TRUE)
plot (roc_rf, col = "#56B4E9", lwd = 2, add = TRUE, 
     print.auc = TRUE, print.auc.y = 0.3)
legend ("bottomright", legend = c ("Logistic Regression", "Random Forest"), 
       col = c ("#E69F00", "#56B4E9"), lwd = 2)

Interpretation—Both models show strong predictive performance, with AUC values above 0.80. Logistic Regression achieved an AUC of approximately 0.84, slightly higher than Random Forest. This suggests that, for this dataset, a simple linear model is already sufficient.

4.5 Key Findings and Business Insights

To understand the “why” behind customer churn, our team examined the variable importance derived from the Random Forest model. This allows us to move beyond simple prediction into actionable business intelligence.

# Visualizing top predictors
varImpPlot(rf_model$finalModel, main = "Top Predictors of Churn (Random Forest)")

Our analysis strongly confirms several critical drivers identified during EDA:

-Tenure is Paramount: New customers represent the highest risk; churn probability drops significantly as customer loyalty (tenure) increases.

-Price Sensitivity: Financial variables rank high in importance, indicating that cost-related pressure is a primary reason for leaving.

-Contractual Commitment: Customers on month-to-month plans are significantly more volatile than those on one- or two-year contracts.

-The Fiber Optic Paradox: Despite higher speeds, Fiber Optic users show higher churn rates, suggesting a potential gap between service cost and perceived value.

4.6 Classification Conclusion and Recommendations

Based on the modeling results, our team draws the following conclusions:

-Model Selection: We recommend Logistic Regression for immediate deployment. It delivers strong, competitive performance (AUC ≈ 0.84) with the added benefit of low computational cost and clear interpretability for stakeholders.

-Retention Strategy: Management should focus retention efforts on the first 0–12 months of the customer lifecycle. Specifically, implementing incentive programs to migrate users from month-to-month plans to long-term contracts could substantially reduce overall churn risk.

==================================================

5. Regression Analysis: Deconstructing Pricing Drivers

==================================================

Research Question: “How do individual service subscriptions quantitatively influence Monthly Charges after controlling for account profiles?”

5.1 Objective and Rational

While the previous section focused on predicting the behavior of customers (Churn), our team now shifts focus to auditing the financial structure of the business. The aim of this section is to quantify how various service subscriptions and customer attributes influence the monthly fee charged. By constructing and interpreting a Linear Regression model, our team provides the business with transparent pricing insights and identifies the primary levers for revenue optimization.

5.2 Data and Preparation

The analysis uses the pre-cleaned Telco customer churn dataset (data_cleaned.csv), which contains records of approximately 7,000 customers. Each observation describes demographics, contract details, service subscriptions, and billing information. Since the data have already been cleaned by the team in previous parts, our preparation focuses on correctly encoding categorical variables.

library(tidyverse)  
library(car)       
library(Metrics)   

# Load the cleaned data
telco <- read_csv('data_cleaned.csv')

# Encoding categorical variables as factors
telco <- telco %>%
  mutate(across(c(gender, SeniorCitizen, Partner, Dependents, PhoneService,
                  MultipleLines, InternetService, OnlineSecurity, OnlineBackup,
                  DeviceProtection, TechSupport, StreamingTV, StreamingMovies,
                  Contract, PaperlessBilling, PaymentMethod, Churn), as.factor))

# Ensuring numerical columns are correctly typed
telco <- telco %>% mutate(across(c(tenure, MonthlyCharges, TotalCharges), as.numeric))

glimpse(telco)

## Rows: 7,032
## Columns: 20
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…

Key variables include tenure, Contract, and a series of binary service indicators such as PhoneService, MultipleLines, and InternetService. The MonthlyCharges variable is continuous and serves as the dependent variable in our regression models.

5.3 Exploratory Analysis of Monthly Charges

Before fitting any models, we explore the distribution of monthly charges and their relationship with tenure. The average customer pays around RM 65 per month, with a range between RM 18 and RM 119.

summary_stats <- summary(telco$MonthlyCharges)
summary_table <- tibble(
  Statistic = c("Min", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max"),
  Value     = as.numeric(summary_stats)
)
knitr::kable(summary_table, digits = 2, caption = "Summary of Monthly Charges")

Summary of Monthly Charges
Statistic	Value
Min	18.25
1st Qu.	35.59
Median	70.35
Mean	64.80
3rd Qu.	89.86
Max	118.75

We also examined the correlation between a customer’s tenure and their monthly fee.

ggplot(telco, aes(x = tenure, y = MonthlyCharges)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +
  labs(title = "Relationship Between Tenure and Monthly Charges",
       x = "Customer Tenure (months)", y = "Monthly Charges") +
  theme_minimal()

cor_test <- cor.test(telco$tenure, telco$MonthlyCharges)
cor_test

## 
##  Pearson's product-moment correlation
## 
## data:  telco$tenure and telco$MonthlyCharges
## t = 21.359, df = 7030, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2247854 0.2686849
## sample estimates:
##       cor 
## 0.2468618

The correlation coefficient is relatively low (~0.25) but statistically significant. This indicates a modest positive association: customers with longer tenure tend to pay slightly higher fees, likely due to additional service subscriptions over time.

5.4 Building Linear Regression Models

We fit two multiple linear regression models to understand the pricing structure:

1.Full model: Includes demographics, contract details, billing setup, and all service indicators.

2，Pricing model: Focuses on service subscription variables while controlling for tenure and Contract.

full_model <- lm(MonthlyCharges ~ gender + SeniorCitizen + Partner + Dependents +
                   tenure + PhoneService + MultipleLines + InternetService +
                   OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
                   StreamingTV + StreamingMovies + Contract +
                   PaperlessBilling + PaymentMethod,
                 data = telco)
summary(full_model)

## 
## Call:
## lm(formula = MonthlyCharges ~ gender + SeniorCitizen + Partner + 
##     Dependents + tenure + PhoneService + MultipleLines + InternetService + 
##     OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod, data = telco)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2296 -0.6166 -0.0055  0.6079  4.8413 
## 
## Coefficients:
##                                        Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                           2.496e+01  5.825e-02  428.509   <2e-16
## genderMale                            2.492e-02  2.447e-02    1.018    0.309
## SeniorCitizenYes                      1.434e-02  3.560e-02    0.403    0.687
## PartnerYes                           -4.216e-02  2.959e-02   -1.425    0.154
## DependentsYes                         1.416e-02  3.140e-02    0.451    0.652
## tenure                                1.455e-04  8.376e-04    0.174    0.862
## PhoneServiceYes                       2.006e+01  4.815e-02  416.574   <2e-16
## MultipleLinesYes                      5.019e+00  2.954e-02  169.908   <2e-16
## InternetServiceFiber optic            2.496e+01  3.488e-02  715.626   <2e-16
## InternetServiceNo                    -2.505e+01  4.874e-02 -513.867   <2e-16
## OnlineSecurityYes                     5.013e+00  3.217e-02  155.850   <2e-16
## OnlineBackupYes                       4.996e+00  3.023e-02  165.278   <2e-16
## DeviceProtectionYes                   5.020e+00  3.131e-02  160.305   <2e-16
## TechSupportYes                        5.033e+00  3.279e-02  153.491   <2e-16
## StreamingTVYes                        9.970e+00  3.205e-02  311.106   <2e-16
## StreamingMoviesYes                    9.966e+00  3.207e-02  310.727   <2e-16
## ContractOne year                      1.019e-02  3.832e-02    0.266    0.790
## ContractTwo year                     -2.564e-02  4.648e-02   -0.552    0.581
## PaperlessBillingYes                  -2.071e-02  2.735e-02   -0.757    0.449
## PaymentMethodCredit card (automatic)  5.085e-05  3.710e-02    0.001    0.999
## PaymentMethodElectronic check        -1.940e-02  3.636e-02   -0.534    0.594
## PaymentMethodMailed check            -1.399e-02  3.950e-02   -0.354    0.723
##                                         
## (Intercept)                          ***
## genderMale                              
## SeniorCitizenYes                        
## PartnerYes                              
## DependentsYes                           
## tenure                                  
## PhoneServiceYes                      ***
## MultipleLinesYes                     ***
## InternetServiceFiber optic           ***
## InternetServiceNo                    ***
## OnlineSecurityYes                    ***
## OnlineBackupYes                      ***
## DeviceProtectionYes                  ***
## TechSupportYes                       ***
## StreamingTVYes                       ***
## StreamingMoviesYes                   ***
## ContractOne year                        
## ContractTwo year                        
## PaperlessBillingYes                     
## PaymentMethodCredit card (automatic)    
## PaymentMethodElectronic check           
## PaymentMethodMailed check               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 7010 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 2.88e+05 on 21 and 7010 DF,  p-value: < 2.2e-16

Many demographic variables have p-values above 0.05, while specific service indicators are highly significant. This supports the hypothesis that monthly charges are essentially the sum of base fees plus subscribed services.

5.5 Multicollinearity Check

We verified that multicollinearity is not inflating standard errors using Variance Inflation Factors (VIFs).

vif(full_model)

##                      GVIF Df GVIF^(1/(2*Df))
## gender           1.001769  1        1.000884
## SeniorCitizen    1.153168  1        1.073857
## Partner          1.462369  1        1.209284
## Dependents       1.380811  1        1.175079
## tenure           2.827609  1        1.681549
## PhoneService     1.354768  1        1.163945
## MultipleLines    1.423682  1        1.193182
## InternetService  3.783812  2        1.394705
## OnlineSecurity   1.415235  1        1.189637
## OnlineBackup     1.380823  1        1.175084
## DeviceProtection 1.480118  1        1.216601
## TechSupport      1.481353  1        1.217108
## StreamingTV      1.625973  1        1.275137
## StreamingMovies  1.634755  1        1.278575
## Contract         2.567229  2        1.265803
## PaperlessBilling 1.208329  1        1.099240
## PaymentMethod    1.580562  3        1.079283

Values are under commonly cited thresholds, suggesting stable estimated coefficients.

5.6 Pricing Model

We specify a refined pricing model including only the variables most directly related to the monthly fee.

pricing_model <- lm(MonthlyCharges ~ tenure + Contract + PhoneService + MultipleLines +
                      InternetService + OnlineSecurity + OnlineBackup +
                      DeviceProtection + TechSupport + StreamingTV +
                      StreamingMovies,
                    data = telco)
summary(pricing_model)

## 
## Call:
## lm(formula = MonthlyCharges ~ tenure + Contract + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies, 
##     data = telco)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2425 -0.6160 -0.0047  0.6063  4.8075 
## 
## Coefficients:
##                              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 2.495e+01  4.550e-02  548.265   <2e-16 ***
## tenure                     -4.821e-07  7.952e-04   -0.001    1.000    
## ContractOne year            1.262e-02  3.792e-02    0.333    0.739    
## ContractTwo year           -2.299e-02  4.571e-02   -0.503    0.615    
## PhoneServiceYes             2.006e+01  4.804e-02  417.534   <2e-16 ***
## MultipleLinesYes            5.017e+00  2.943e-02  170.454   <2e-16 ***
## InternetServiceFiber optic  2.495e+01  3.401e-02  733.750   <2e-16 ***
## InternetServiceNo          -2.504e+01  4.760e-02 -526.106   <2e-16 ***
## OnlineSecurityYes           5.014e+00  3.194e-02  156.970   <2e-16 ***
## OnlineBackupYes             4.995e+00  3.019e-02  165.446   <2e-16 ***
## DeviceProtectionYes         5.020e+00  3.126e-02  160.597   <2e-16 ***
## TechSupportYes              5.034e+00  3.256e-02  154.609   <2e-16 ***
## StreamingTVYes              9.967e+00  3.190e-02  312.458   <2e-16 ***
## StreamingMoviesYes          9.964e+00  3.197e-02  311.698   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 7018 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 4.655e+05 on 13 and 7018 DF,  p-value: < 2.2e-16

The Adjusted R² is very high (~0.9988), explaining over 99.8% of the variation in charges, confirming a deterministic pricing structure driven by selected services.

5.7 Interpreting the Coefficients

We tabulate the coefficients for significant predictors (p < 0.05).

coef_df <- broom::tidy(pricing_model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(Effect = ifelse(estimate > 0, "Positive", "Negative")) %>%
  select(term, estimate, conf.low, conf.high, p.value, Effect)

knitr::kable(coef_df, digits = 3,
             col.names = c("Predictor", "Estimate", "CI Lower", "CI Upper", "p-value", "Effect"),
             caption = "Coefficients for the pricing model (significant predictors).")

Coefficients for the pricing model (significant predictors).
Predictor	Estimate	CI Lower	CI Upper	p-value	Effect
tenure	0.000	-0.002	0.002	1.000	Negative
ContractOne year	0.013	-0.062	0.087	0.739	Positive
ContractTwo year	-0.023	-0.113	0.067	0.615	Negative
PhoneServiceYes	20.058	19.964	20.152	0.000	Positive
MultipleLinesYes	5.017	4.959	5.075	0.000	Positive
InternetServiceFiber optic	24.954	24.887	25.020	0.000	Positive
InternetServiceNo	-25.043	-25.136	-24.950	0.000	Negative
OnlineSecurityYes	5.014	4.951	5.077	0.000	Positive
OnlineBackupYes	4.995	4.936	5.054	0.000	Positive
DeviceProtectionYes	5.020	4.959	5.082	0.000	Positive
TechSupportYes	5.034	4.971	5.098	0.000	Positive
StreamingTVYes	9.967	9.905	10.030	0.000	Positive
StreamingMoviesYes	9.964	9.901	10.026	0.000	Positive

5.8 Model Performance

We quantified prediction error using RMSE and MAE.

pred <- predict(pricing_model, newdata = telco)
rmse_val <- sqrt(mean((telco$MonthlyCharges - pred)^2))

The RMSE (~1.02) and MAE (~0.78) indicate that the model’s predictions are on average within about RM 1 of the actual monthly charges. Given that charges range from RM 18 to RM 118, such a small error underscores the deterministic nature of the pricing structure.

5.9 Diagnostics

The diagnostic plot compares actual versus predicted charges.

ggplot(data.frame(Actual = telco$MonthlyCharges, Predicted = pred), aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, colour = 'red', linetype = 'dashed') +
  labs(title = 'Actual vs Predicted Monthly Charges',
       x = 'Actual Monthly Charges',
       y = 'Predicted Monthly Charges') +
  theme_minimal()

5.10 Key Findings and Business Implications

Our regression analysis reveals a transparent and additive pricing structure. The team has identified the following strategic takeaways:

-Service-Driven Pricing: Subscriptions to phone services, multiple lines, fiber optic internet, and optional add-ons (like Streaming TV) have fixed and statistically significant effects on the monthly fee.

-Demographic Neutrality: After controlling for service mix and contract length, demographics such as gender or having a partner do not influence the price. This confirms that the company maintains billing equity across all customer segments.

-The Role of Tenure: We found a positive but very small tenure coefficient, suggesting only minor price adjustments or upgrades occur over time. The specific service mix is a much stronger determinant of the bill than tenure alone.

-Actionable Revenue Levers: Because our pricing is so deterministic (R-Squared ≈ 0.99), the company can confidently offer “service-based” bundled discounts (e.g., discounting streaming services for high-risk churn groups identified in Part 4) to increase perceived value without disrupting the core pricing logic.

5.11 Regression Conclusion

By integrating our classification and regression findings, our team provides a dual-action strategy: we can now identify who is likely to leave and precisely how we can adjust their monthly bill through targeted service bundles to retain them.

==================================================

6.Final Conclusion

==================================================

6.1 Project Synthesis

Our project successfully moved beyond simple data observation to create a cohesive framework for customer retention. By merging classification techniques with a detailed pricing audit through regression, our team has built a comprehensive picture of the customer lifecycle. We did not just identify who is likely to leave; we uncovered the underlying financial mechanics that influence that decision.

6.2 Classification Insights and the Strategic Pivot

Our classification analysis provided a reliable foundation for identifying at-risk customers. Both the Logistic Regression and Random Forest models demonstrated strong predictive power, achieving AUC values around 0.84. While the Random Forest model offered a slight edge in capturing complex, non-linear interactions, both models were consistent in identifying the primary drivers of churn.

According to the Variable Importance Plot, the top three predictors are:

-Tenure: Long-term loyalty significantly reduces the likelihood of leaving.

-Contract Type: Month-to-month plans are high-risk zones.

-Monthly Charges: Financial pressure is a direct trigger for churn.

This finding led our team to a vital Strategic Question: If Monthly Charges are so critical to churn, how exactly does the company calculate these fees? This question served as the logical bridge to our regression analysis.

6.3 Deconstructing the Pricing Structure

The regression model provided remarkable clarity, achieving an Adjusted R-squared of 0.9988. This confirms that the company’s pricing structure is highly deterministic and based almost entirely on the specific services a customer selects.

Our findings from this audit include:

-Service Mix vs. Tenure: While the scatter plot shows a slight positive correlation between tenure and monthly charges (indicated by the blue trend line), the high variance at every level suggests that the specific service mix (e.g., Fiber Optic, Streaming) is a much stronger determinant of price than tenure alone.

-Primary Cost Drivers: Fiber Optic service and Streaming add-ons are the most significant contributors to the monthly fee.

Billing Equity: Demographic factors such as Gender and Partner status showed negligible coefficients, ensuring that the company maintains pricing consistency and equity across different customer segments.

6.4 Data-Driven Business Recommendations

Because the pricing logic is so consistent and transparent, the company can confidently implement targeted retention strategies without disrupting its core business model. Based on our integrated analysis, we recommend the following:

-Implement Service-Based Discounts: Instead of general price cuts, the company should offer targeted bundles (e.g., discounting streaming services for long-tenure customers) to lower the Monthly Charges for at-risk users.

-Focus on High-Value Churners: Priority should be given to Fiber Optic users in their first 12 months, as they face the highest costs and show the highest risk of churn.

-Contract Incentivization: Use the deterministic nature of the pricing model to create transparent “upgrade paths” that move customers from month-to-month plans to long-term contracts in exchange for localized service discounts.

By linking the “who” (classification) with the “how much” (regression), we have provided a roadmap that allows the company to balance revenue goals with a fairer, more personalized customer experience.

Predicting Customer Churn and Analyzing Pricing Factors: A Deep Dive into Telecommunications User Behavior

2026-01-07

==================================================

# Project Overview & Team Contributions

==================================================

==================================================

1. Introduction and Project Objective

==================================================

==================================================

2. Data cleaning and Preprocessing

==================================================

2.1 Objective and Strategy

2.2 Environment Setup and Data Import

2.3 Initial Diagnostics

2.4 Execution of Cleaning Steps

2.5 Final Validation and Export

==================================================

3. Exploratory Data Analysis (EDA)

==================================================

3.1 Introduction and Objective

3.2 Environment and Data Acquisition

Loading the raw data for exploration

3.3.1 Quality Check: Missing and Unique Values

3.4 Univariate Analysis: Understanding the Baseline

3.4.1 Target Variable: Churn

3.4.2 Distribution of Monthly Charges and Tenure

3.5 Bivariate Analysis: Correlating Features with Churn

3.5.1 Financial Drivers: Charges vs. Churn

3.5.2 Loyalty Drivers: Tenure vs. Churn

3.6 Categorical Insights: Contract and Services

3.6.1 Contract Influence

3.6.2 Internet Service and Payment Methods

3.7 Key EDA Insights for Modeling

3.8 EDA Conclusion

==================================================

4. Classification Analysis: Predictive Modeling for Churn

==================================================

4.1 Objective and Modeling Strategy

4.2 Data Preparation for Modeling

4.3 Model Training

4.4. Model Evaluation

4.4.1 ROC Curve Comparison

4.5 Key Findings and Business Insights

4.6 Classification Conclusion and Recommendations

==================================================

5. Regression Analysis: Deconstructing Pricing Drivers

==================================================

5.1 Objective and Rational

5.2 Data and Preparation

5.3 Exploratory Analysis of Monthly Charges

5.4 Building Linear Regression Models

5.5 Multicollinearity Check

5.6 Pricing Model

5.7 Interpreting the Coefficients

5.8 Model Performance

5.9 Diagnostics

5.10 Key Findings and Business Implications

5.11 Regression Conclusion

==================================================

6.Final Conclusion

==================================================

6.1 Project Synthesis

6.2 Classification Insights and the Strategic Pivot

6.3 Deconstructing the Pricing Structure

6.4 Data-Driven Business Recommendations