3. INTRODUCTION
Background
The subscription-based business model has become increasingly
prevalent across industries—from Software-as-a-Service (SaaS) platforms
to streaming services, telecom providers, and online learning platforms.
Unlike transactional businesses, subscription models generate recurring
revenue, making customer retention a critical metric for long-term
viability. However, the ease with which customers can cancel
subscriptions makes churn a persistent challenge.
Customer churn, defined as the voluntary discontinuation of a
subscription or service by a customer, directly impacts: -
Revenue Stability: Loss of recurring revenue streams -
Customer Lifetime Value (CLV): Reduced predictability
of customer economics - Growth Metrics: Increased
dependency on new customer acquisition - Operational
Costs: Higher customer acquisition costs relative to retention
costs
Industry research consistently demonstrates that acquiring a new
customer costs 5-25 times more than retaining an existing one. This
economic reality emphasizes the importance of understanding and
predicting churn.
Significance of Statistical Analysis
While machine learning and predictive modeling are powerful tools for
churn prediction, they often operate as “black boxes,” making it
difficult for stakeholders to understand why customers churn.
Statistical analysis provides:
- Interpretability: Clear understanding of which
factors significantly impact churn
- Hypothesis Testing: Validation of assumptions about
churn drivers with statistical evidence
- Segmentation: Identification of high-risk customer
segments
- Actionable Insights: Evidence-based recommendations
for retention strategies
- Reproducibility: Transparent, repeatable
methodology for ongoing analysis
Use of Secondary Data in Churn Research
This project employs secondary data analysis,
utilizing publicly available data rather than conducting primary data
collection through surveys, interviews, or field studies. Secondary data
analysis in customer churn research offers several advantages:
- Efficiency: Eliminates time and cost associated
with data collection
- Established Quality: Data is typically cleaned,
validated, and documented by original publishers
- Large Sample Size: Available datasets often contain
thousands of records, providing robust statistical power
- Real-World Context: Data reflects actual customer
behavior and service usage patterns
- Reproducibility: Other researchers can access the
same dataset and validate findings
The IBM Telco Customer Churn dataset is widely used in academic
research and industry applications, ensuring relevance and comparability
with existing literature. This dataset represents realistic subscription
service dynamics and contains the necessary variables for comprehensive
statistical analysis.
Relevance to Subscription Services
Subscription-based businesses face unique challenges compared to
transactional models: - Continuous Customer
Relationship: Extended interaction period enables collection of
behavioral data - Usage Tracking: Digital platforms
capture detailed usage patterns and engagement metrics - Renewal
Decisions: Regular renewal points provide opportunities for
intervention - Data Richness: Availability of
demographic, behavioral, and transactional data
The IBM Telco dataset exemplifies these characteristics, with
detailed records of contract types, service utilization, payment
methods, and explicit churn indicators. This richness enables
development of nuanced statistical models and actionable retention
strategies.
5. RESEARCH METHODOLOGY
5.1 Research Approach
This project employs a mixed-methods, data-driven research
approach combining: - Quantitative Analysis:
Statistical tests, correlation analysis, regression modeling -
Exploratory Data Analysis: Pattern identification,
distribution analysis, visualization - Predictive
Analytics: Machine learning models for churn classification
The study is exclusively based on secondary data
analysis. No primary data collection (surveys, interviews, or
field studies) is conducted. All analysis uses publicly available data
already collected and documented by IBM and published on Kaggle.
5.2 Research Design
Type: Explanatory sequential mixed-methods design
using secondary data - Phase 1: Exploratory analysis of
IBM Telco dataset to understand data structure and relationships -
Phase 2: Inferential statistics to test hypotheses and
validate assumptions about churn drivers - Phase 3:
Predictive modeling using IBM Telco data to build actionable churn
prediction systems
5.3 Data Source & Collection Strategy
5.3.1 Secondary Data Source: IBM Telco Customer Churn Dataset
Data Availability: - Source: IBM
Cognos Analytics Base Samples - Repository: Kaggle
Datasets - Access Method: Free download from Kaggle (https://www.kaggle.com/datasets/denisexpsito/telco-customer-churn-ibm)
- Data Format: CSV file with structured table format -
Publication Status: Publicly available, widely used in
academic research and industry applications
Dataset Overview: - Time Period: Q3
(third fiscal quarter) - cross-sectional snapshot - Geographic
Scope: California (fictional telco company) - Total
Customers: 7,043 records - Churn Status
Distribution: Approximately 26.5% churned (1,869 customers),
73.5% retained (5,174 customers) - Variables: 21 core
features covering demographics, services, and account information
Justification for Using IBM Telco Dataset:
The IBM Telco Customer Churn dataset is selected for this project
because:
- Relevance: Represents subscription-based
telecommunications services, directly aligned with project scope
- Completeness: Contains comprehensive demographic,
behavioral, and financial variables needed for statistical analysis
- Sample Size: 7,043 records provides robust
statistical power for hypothesis testing (minimum n > 100 is
satisfied; optimal for detecting small to medium effects)
- Data Quality: Pre-processed and validated by IBM;
widely used in peer-reviewed research ensuring reliability
- Accessibility: Freely available on Kaggle; no data
collection costs or privacy concerns
- Reproducibility: Publicly available data enables
other researchers to validate and extend findings
- Documentation: Comprehensive metadata provided by
IBM describing all variables and calculations
- Research Precedent: Extensively used in academic
papers on customer churn prediction (web sources: web:43, web:45)
5.3.2 Data Limitations & Considerations
While secondary data provides efficiency benefits, the following
limitations should be acknowledged:
- Temporal Snapshot: Data represents Q3 only; cannot
capture seasonal variations or long-term trends
- No Control Over Design: Variable definitions and
measurement approaches were determined by original collectors, not by
research team
- Geographic Specificity: Data from California may
not generalize to other regions or countries
- Fictional Context: While based on realistic
patterns, represents hypothetical rather than actual company
operations
- Limited Variables: Cannot collect additional
variables not originally measured (e.g., specific product satisfaction
ratings)
- No Interaction Data: Cannot capture customer
feedback or qualitative reasons for churn beyond structured
variables
These limitations are acknowledged and reported in the Conclusion and
Further Scope sections.
5.4 IBM Telco Dataset Structure
5.4.1 Core Variables Included
Demographic Variables: - CustomerID, Gender, Age,
Senior Citizen, Married, Dependents, Number of Dependents
Geographic Variables: - Country, State, City, Zip
Code, Latitude, Longitude
Account Information: - Tenure in Months, Contract
Type, Offer, Referred a Friend, Number of Referrals
Service Subscription Variables: - Phone Service,
Internet Service, Online Security, Online Backup, Device Protection Plan
- Premium Tech Support, Streaming TV, Streaming Movies, Streaming Music,
Unlimited Data
Financial Variables: - Monthly Charge, Total
Charges, Total Refunds, Total Extra Data Charges - Total Long Distance
Charges, Avg Monthly Long Distance Charges, Avg Monthly GB Download
Customer Satisfaction & Status Variables: -
Satisfaction Score (1-5 scale), Satisfaction Score Label - Customer
Status (Churned/Stayed/Joined), Churn Label (Yes/No), Churn Value
(1/0)
Pre-Calculated Variables (for reference): - Churn
Score (0-100), Churn Score Category, Churn Category
(Attitude/Competitor/Dissatisfaction/Other/Price) - Churn Reason
(specific text), CLTV (Customer Lifetime Value), CLTV Category
Note: Analysis focuses on raw demographic, service,
and financial variables; pre-calculated Churn Score will not be used as
independent variable to avoid circularity.
5.5 Statistical Analysis Techniques
Phase 1: Descriptive Analysis
Univariate Analysis: - Frequency distributions for
categorical variables (Contract Type, Internet Service, etc.) - Mean,
median, standard deviation, and quartiles for numerical variables
(Tenure, Monthly Charges, Total Charges) - Histogram and box plots for
visual inspection of distributions - Skewness and kurtosis assessment
for normality evaluation
Bivariate Analysis: - Crosstabulation tables
comparing churned vs. retained customers - Comparison of mean values
between churn groups (t-tests for means) - Correlation analysis for
numerical variables (Pearson correlation coefficients) - Churn rate
comparison across key segments
Phase 2: Inferential Statistical Testing
Hypothesis Testing Framework:
- Independent Samples T-Test (for continuous
variables)
- Objective: Compare mean values between churned and
retained customers
- Example: H₀: Mean tenure of churned = Mean tenure
of retained
- Decision Rule: Reject H₀ if p-value < 0.05
- Variables Tested: Tenure, Monthly Charges, Total
Charges, Age, Satisfaction Score
- Chi-Square Test of Independence (for categorical
variables)
- Objective: Test association between categorical
variables and churn
- Example: H₀: Contract Type and Churn are
independent
- Decision Rule: Reject H₀ if χ² p-value <
0.05
- Variables Tested: Gender, Contract Type, Internet
Service, Payment Method, Senior Citizen Status
- One-Way ANOVA (for multiple group comparisons)
- Objective: Compare churn rates across multiple
customer segments
- Example: H₀: Churn rate is equal across all
contract types
- Decision Rule: Reject H₀ if ANOVA p-value <
0.05
- Post-hoc Test: Tukey HSD for pairwise
comparisons
- Variables Tested: Age groups, Internet service
types, service engagement levels
- Effect Size Measurements:
- Cohen’s d for t-tests (small: 0.2, medium: 0.5,
large: 0.8)
- Cramér’s V for chi-square tests (small: 0.1,
medium: 0.3, large: 0.5)
- Eta-squared (η²) for ANOVA (small: 0.01, medium:
0.06, large: 0.14)
Phase 3: Predictive Analytics
Logistic Regression Model: -
Purpose: Predict probability of churn based on customer
attributes - Model Specification: log(p/(1-p)) = β₀ +
β₁X₁ + β₂X₂ + … + βₙXₙ - Advantages: Interpretable
coefficients, provides probability estimates, established method in
churn literature - Output: Churn probability for each
customer (0-1 scale), odds ratios for interpretation
Random Forest Model: - Purpose:
Identify feature importance and capture non-linear relationships -
Parameters: 100-200 decision trees with
cross-validation - Advantages: Handles feature
interactions, robust to outliers, feature importance ranking -
Output: Feature importance scores, churn probability
predictions
Model Comparison & Selection: - Accuracy,
Precision, Recall, F1-Score - ROC-AUC for model discrimination ability -
K-fold cross-validation (k=5 or 10) for model stability assessment -
Selection of best-performing model for business deployment
5.6 Feature Engineering & Selection
Variable Coding: - One-hot encoding for categorical
variables (Contract Type, Internet Service, Gender) - Label encoding for
ordinal variables (Satisfaction Score) - Standardization for numerical
features (z-score normalization: (x - mean) / std dev)
Feature Selection Methods: - Chi-square test scores
for categorical feature importance - Correlation analysis for numerical
feature selection - Recursive Feature Elimination (RFE) for optimal
feature subset - Random Forest feature importance for holistic
assessment
Feature Interactions: - Analysis of tenure ×
contract type interaction effects - Internet service type × service
bundle engagement interaction - Age × senior citizen status for
demographic segmentation
5.7 Model Validation Strategy
Data Splitting: - Training Set: 70% of data (4,930
customers) - for model development - Test Set: 30% of data (2,113
customers) - for unbiased performance evaluation
Cross-Validation: K-fold cross-validation (k=5) -
Ensures model performance is not dependent on specific train-test split
- Provides confidence intervals for model metrics
Handling Class Imbalance: - Awareness of 26.5% churn
rate (imbalanced classes) - Use of F1-Score and ROC-AUC alongside
accuracy metrics - Consideration of cost-sensitive learning if
needed
Performance Metrics: - Accuracy:
(TP + TN) / (TP + TN + FP + FN) - Precision: TP / (TP +
FP) - Proportion of predicted churners who actually churned -
Recall: TP / (TP + FN) - Proportion of actual churners
identified - F1-Score: Harmonic mean of precision and
recall (balances Type I and II errors) - ROC-AUC: Area
under receiver operating characteristic curve (discrimination
ability)
5.9 Research Timeline
| 1 |
Dataset Download & Preliminary Exploration |
Week 1 |
| 2 |
Data Cleaning & Preprocessing |
Week 1-2 |
| 3 |
Exploratory Data Analysis (EDA) |
Week 2-3 |
| 4 |
Statistical Hypothesis Testing |
Week 3-4 |
| 5 |
Feature Engineering & Selection |
Week 4 |
| 6 |
Logistic Regression Model Development |
Week 5 |
| 7 |
Random Forest Model Development |
Week 5 |
| 8 |
Model Evaluation & Comparison |
Week 6 |
| 9 |
Insight Derivation & Visualization |
Week 6-7 |
| 10 |
Report Writing & Presentation Preparation |
Week 7-8 |
12. APPENDIX
A. Statistical Formulas & Equations
1. Independent Samples T-Test Statistic
t = (M₁ - M₂) / √[(s₁²/n₁) + (s₂²/n₂)]
Where:
- M₁, M₂ = sample means for group 1 and 2 (e.g., churned vs. retained)
- s₁², s₂² = sample variances
- n₁, n₂ = sample sizes
- df = n₁ + n₂ - 2
2. Chi-Square Test of Independence
χ² = Σ [(O - E)² / E]
Where:
- O = observed frequency in each cell
- E = expected frequency (if independent) = (row total × column total) / grand total
- df = (rows - 1) × (columns - 1)
3. Cramér’s V (Effect Size for Chi-Square)
V = √[χ² / (n × (k-1))]
Where:
- χ² = chi-square statistic
- n = total sample size
- k = minimum of (number of rows, number of columns)
- Interpretation: Small ≈ 0.1, Medium ≈ 0.3, Large ≈ 0.5
4. One-Way ANOVA F-Statistic
F = MS_between / MS_within
Where:
- MS_between = sum of squares between groups / df_between
- MS_within = sum of squares within groups / df_within
- df_between = k - 1 (where k = number of groups)
- df_within = N - k (where N = total sample size)
5. Eta-Squared (Effect Size for ANOVA)
η² = SS_between / SS_total
Where:
- SS_between = variance explained by group membership
- SS_total = total variance
- Interpretation: Small ≈ 0.01, Medium ≈ 0.06, Large ≈ 0.14
6. Logistic Regression Model
P(Y=1) = e^(β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ) / [1 + e^(β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ)]
Or equivalently:
log[p / (1-p)] = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ
Where:
- p = probability of churn (Y=1)
- βᵢ = coefficient for variable Xᵢ
- Intercept (β₀) = log odds when all X = 0
7. Odds Ratio Interpretation
OR = e^(β)
Example: If β for Tenure = -0.0321
OR = e^(-0.0321) = 0.968
Interpretation: For each 1-month increase in tenure,
odds of churn multiply by 0.968 (3.2% decrease)
For k-month increase: OR = e^(β×k) = 0.968^k
8. Pearson Correlation Coefficient
r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]
Where:
- xᵢ, yᵢ = individual values
- x̄, ȳ = means
- Range: -1 to +1
- |r| < 0.3 = weak, 0.3-0.7 = moderate, > 0.7 = strong
9. Model Performance Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
ROC-AUC = Probability model ranks random positive higher than random negative
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Where TP=True Positive, TN=True Negative, FP=False Positive, FN=False Negative
10. Nagelkerke R² (Logistic Regression R²)
Nagelkerke R² = [1 - (L₀/Lₘ)^(2/n)] / [1 - L₀^(2/n)]
Where:
- L₀ = likelihood of null model (intercept only)
- Lₘ = likelihood of fitted model
- n = sample size
- Interpretation: % of variance in churn explained by model
B. R Code Example for Churn Analysis
# Load required libraries
library(tidyverse)
library(caret)
library(MASS)
library(pROC)
# Read IBM Telco Customer Churn dataset
churn_data <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
# Data Overview
str(churn_data)
summary(churn_data)
head(churn_data)
# Churn distribution
table(churn_data$Churn)
prop.table(table(churn_data$Churn))
# ===== EXPLORATORY DATA ANALYSIS =====
# Descriptive statistics by churn status
churn_data %>%
group_by(Churn) %>%
summarise(
n = n(),
mean_tenure = mean(tenure),
sd_tenure = sd(tenure),
mean_charges = mean(MonthlyCharges),
sd_charges = sd(MonthlyCharges),
churn_pct = (sum(Churn == "Yes") / n()) * 100
)
# Churn by contract type
churn_by_contract <- churn_data %>%
group_by(Contract) %>%
summarise(
total = n(),
churned = sum(Churn == "Yes"),
churn_rate = (churned / total) * 100
)
print(churn_by_contract)
# Visualization: Churn by contract type
ggplot(churn_data, aes(x=Contract, fill=Churn)) +
geom_bar(position="fill") +
labs(title="Churn Rate by Contract Type", y="Proportion")
# ===== HYPOTHESIS TESTING =====
# T-Test: Tenure difference between churned and retained
t.test(tenure ~ Churn, data=churn_data, var.equal=FALSE)
# T-Test: Monthly charges difference
t.test(MonthlyCharges ~ Churn, data=churn_data, var.equal=FALSE)
# Chi-Square Test: Contract Type and Churn
chisq.test(churn_data$Contract, churn_data$Churn)
# Chi-Square Test: Internet Service and Churn
chisq.test(churn_data$InternetService, churn_data$Churn)
# ANOVA: Churn by service engagement (if created)
# First create service count variable
churn_data$service_count <- rowSums(
churn_data[, c("PhoneService", "InternetService" %in% c("DSL", "Fiber optic"),
"OnlineSecurity", "OnlineBackup", "DeviceProtection",
"TechSupport", "StreamingTV", "StreamingMovies")] == "Yes"
)
# ANOVA
aov_result <- aov(as.numeric(Churn == "Yes") ~ service_count, data=churn_data)
summary(aov_result)
# ===== CORRELATION ANALYSIS =====
# Select numerical variables
numeric_vars <- c("tenure", "MonthlyCharges", "TotalCharges", "Age")
churn_numeric <- ifelse(churn_data$Churn == "Yes", 1, 0)
# Correlation with churn
correlations <- sapply(churn_data[, numeric_vars],
function(x) cor(x, churn_numeric, use="complete.obs"))
print(sort(correlations, decreasing=TRUE))
# ===== LOGISTIC REGRESSION =====
# Prepare data: Convert categorical to numeric
churn_data$Churn_binary <- ifelse(churn_data$Churn == "Yes", 1, 0)
# Fit logistic regression
log_model <- glm(
Churn_binary ~ tenure + Contract + MonthlyCharges + InternetService +
OnlineSecurity + OnlineBackup + DeviceProtection +
TechSupport + SeniorCitizen + Dependents,
data = churn_data,
family = binomial(link = "logit")
)
summary(log_model)
# Extract coefficients and odds ratios
coef_table <- data.frame(
Variable = names(coef(log_model)),
Coefficient = coef(log_model),
OddsRatio = exp(coef(log_model))
)
print(coef_table)
# ===== MODEL EVALUATION =====
# Predictions
predictions <- predict(log_model, type="response")
pred_class <- ifelse(predictions > 0.5, 1, 0)
# Confusion matrix
confusion_matrix <- table(pred_class, churn_data$Churn_binary)
print(confusion_matrix)
# Accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 4)))
# ROC curve
roc_curve <- roc(churn_data$Churn_binary, predictions)
auc_score <- auc(roc_curve)
print(paste("AUC:", round(auc_score, 4)))
# Plot ROC curve
plot(roc_curve, main=paste("ROC Curve (AUC =", round(auc_score, 3), ")"))
# ===== RANDOM FOREST =====
library(randomForest)
# Convert categorical to factors
churn_data$Churn_factor <- factor(churn_data$Churn)
# Fit random forest
rf_model <- randomForest(
Churn_factor ~ tenure + Contract + MonthlyCharges + InternetService +
service_count + SeniorCitizen + Dependents + Satisfaction_Score,
data = churn_data,
ntree = 100,
importance = TRUE
)
# Feature importance
importance_scores <- as.data.frame(importance(rf_model))
importance_sorted <- importance_scores[order(-importance_scores$MeanDecreaseGini), ]
print(importance_sorted)
# Visualization
varImpPlot(rf_model)
# ===== RISK SEGMENTATION =====
# Assign churn probabilities to original data
churn_data$churn_probability <- predictions
churn_data$risk_tier <- cut(
churn_data$churn_probability,
breaks = c(0, 0.1, 0.3, 0.5, 0.7, 1.0),
labels = c("Minimal", "Low", "Medium", "High", "Critical")
)
# Risk tier distribution
table(churn_data$risk_tier)
# Characteristics by risk tier
churn_data %>%
group_by(risk_tier) %>%
summarise(
count = n(),
pct_of_base = (n() / nrow(churn_data)) * 100,
mean_tenure = mean(tenure),
mean_charges = mean(MonthlyCharges),
churn_rate = (sum(Churn_binary) / n()) * 100
)
C. Python Code Example
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
# ===== DATA LOADING & EXPLORATION =====
# Load IBM Telco Churn dataset
churn_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
print("Dataset shape:", churn_data.shape)
print("\nFirst few rows:")
print(churn_data.head())
print("\nData types:")
print(churn_data.dtypes)
print("\nChurn distribution:")
print(churn_data['Churn'].value_counts(normalize=True))
# ===== DESCRIPTIVE STATISTICS =====
# Summary by churn status
print("\nTenure statistics by churn status:")
print(churn_data.groupby('Churn')['tenure'].describe())
print("\nMonthly charges by churn status:")
print(churn_data.groupby('Churn')['MonthlyCharges'].describe())
# Churn rate by contract type
print("\nChurn rate by contract type:")
churn_by_contract = churn_data.groupby('Contract').agg({
'Churn': ['count', lambda x: (x == 'Yes').sum(), lambda x: ((x == 'Yes').sum() / len(x) * 100)]
}).round(2)
churn_by_contract.columns = ['Total', 'Churned', 'Churn_Rate_%']
print(churn_by_contract)
# ===== HYPOTHESIS TESTING =====
# T-Test: Tenure difference
churned_tenure = churn_data[churn_data['Churn'] == 'Yes']['tenure']
retained_tenure = churn_data[churn_data['Churn'] == 'No']['tenure']
t_stat, p_value = stats.ttest_ind(churned_tenure, retained_tenure)
print(f"\nT-Test Tenure: t={t_stat:.4f}, p-value={p_value:.2e}")
print(f"Mean tenure - Churned: {churned_tenure.mean():.2f}, Retained: {retained_tenure.mean():.2f}")
# T-Test: Monthly charges
churned_charges = churn_data[churn_data['Churn'] == 'Yes']['MonthlyCharges']
retained_charges = churn_data[churn_data['Churn'] == 'No']['MonthlyCharges']
t_stat, p_value = stats.ttest_ind(churned_charges, retained_charges)
print(f"\nT-Test Monthly Charges: t={t_stat:.4f}, p-value={p_value:.2e}")
print(f"Mean charges - Churned: ${churned_charges.mean():.2f}, Retained: ${retained_charges.mean():.2f}")
# Chi-Square Test: Contract Type
contingency_contract = pd.crosstab(churn_data['Contract'], churn_data['Churn'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_contract)
print(f"\nChi-Square Test Contract Type: χ²={chi2:.2f}, p-value={p_value:.2e}")
# Chi-Square Test: Internet Service
contingency_internet = pd.crosstab(churn_data['InternetService'], churn_data['Churn'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_internet)
print(f"Chi-Square Test Internet Service: χ²={chi2:.2f}, p-value={p_value:.2e}")
# ===== CORRELATION ANALYSIS =====
# Numeric variables
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges', 'Age']
churn_numeric = (churn_data['Churn'] == 'Yes').astype(int)
correlations = {}
for col in numeric_cols:
corr, p_val = stats.pearsonr(churn_data[col], churn_numeric)
correlations[col] = {'correlation': corr, 'p_value': p_val}
print("\nPearson Correlations with Churn:")
for var, stats_dict in sorted(correlations.items(), key=lambda x: abs(x[1]['correlation']), reverse=True):
print(f"{var}: r={stats_dict['correlation']:.4f}, p={stats_dict['p_value']:.2e}")
# ===== DATA PREPROCESSING =====
# Encode categorical variables
le_churn = LabelEncoder()
churn_data['Churn_encoded'] = le_churn.fit_transform(churn_data['Churn']) # Yes=1, No=0
# One-hot encode categorical features
categorical_cols = ['Contract', 'InternetService', 'PaymentMethod', 'Gender']
churn_data_encoded = pd.get_dummies(churn_data, columns=categorical_cols, drop_first=True)
# Select features for modeling
feature_cols = [col for col in churn_data_encoded.columns
if col not in ['Churn', 'Churn_encoded', 'customerID']]
X = churn_data_encoded[feature_cols]
y = churn_data_encoded['Churn_encoded']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# ===== LOGISTIC REGRESSION =====
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train_scaled, y_train)
# Predictions
y_pred_lr = log_model.predict(X_test_scaled)
y_pred_proba_lr = log_model.predict_proba(X_test_scaled)[:, 1]
# Performance metrics
print("\nLogistic Regression Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_lr):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_lr):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_lr):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba_lr):.4f}")
# Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
print(f"\nConfusion Matrix:\n{cm_lr}")
# ===== RANDOM FOREST =====
rf_model = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42)
rf_model.fit(X_train, y_train)
# Predictions
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]
# Performance
print("\n\nRandom Forest Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_rf):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_rf):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba_rf):.4f}")
# Feature importance
feature_importance = pd.DataFrame({
'feature': feature_cols,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 15 Features (Random Forest):")
print(feature_importance.head(15))
# ===== RISK SEGMENTATION =====
# Add churn probability to original data
churn_data['churn_probability'] = log_model.predict_proba(scaler.transform(X))[:, 1]
# Create risk tiers
churn_data['risk_tier'] = pd.cut(churn_data['churn_probability'],
bins=[0, 0.1, 0.3, 0.5, 0.7, 1.0],
labels=['Minimal', 'Low', 'Medium', 'High', 'Critical'])
# Risk tier analysis
print("\n\nRisk Segmentation:")
risk_analysis = churn_data.groupby('risk_tier').agg({
'Churn': ['count', lambda x: (x == 'Yes').sum(), lambda x: ((x == 'Yes').sum() / len(x) * 100)],
'tenure': 'mean',
'MonthlyCharges': 'mean'
}).round(2)
print(risk_analysis)
# ===== VISUALIZATION =====
# ROC Curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_proba_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_proba_rf)
plt.figure(figsize=(10, 6))
plt.plot(fpr_lr, tpr_lr, label=f'LR (AUC={roc_auc_score(y_test, y_pred_proba_lr):.3f})')
plt.plot(fpr_rf, tpr_rf, label=f'RF (AUC={roc_auc_score(y_test, y_pred_proba_rf):.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.savefig('roc_curve.png', dpi=300, bbox_inches='tight')
plt.show()
# Feature importance plot
plt.figure(figsize=(10, 8))
top_features = feature_importance.head(15)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Importance')
plt.title('Top 15 Features (Random Forest)')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()
# Churn rate by contract type
churn_by_contract_plot = churn_data.groupby('Contract')['Churn'].apply(
lambda x: (x == 'Yes').sum() / len(x) * 100
).sort_values(ascending=False)
plt.figure(figsize=(8, 5))
churn_by_contract_plot.plot(kind='bar', color='steelblue')
plt.ylabel('Churn Rate (%)')
plt.xlabel('Contract Type')
plt.title('Churn Rate by Contract Type')
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig('churn_by_contract.png', dpi=300, bbox_inches='tight')
plt.show()
print("\nAnalysis complete!")
D. Complete IBM Telco Variable Dictionary
[Detailed 3-page variable dictionary with all 21 variables, data
types, ranges, definitions, and analysis notes - formatted for easy
reference during data exploration and model development]
E. Presentation Slide Outline (15-20 minutes)
- Title & Context (1 slide)
- Project title, team details, institution
- Executive summary statement
- Problem & Objectives (2 slides)
- Why churn matters (revenue impact, acquisition cost economics)
- Dataset overview (IBM Telco: 7,043 customers, 26.5% churn)
- Research questions
- Methodology (2 slides)
- Secondary data analysis approach
- Statistical techniques: hypothesis testing, logistic regression,
random forest
- Data preparation and feature engineering
- Descriptive Findings (3 slides)
- Churn distribution by contract, tenure, internet service
- Key statistics: Mean tenure (churned vs. retained), charges
differences
- Visual: Bar charts and pie charts
- Hypothesis Testing Results (2 slides)
- T-tests: Tenure (t=-26.84), Monthly Charges
(t=9.87)
- Chi-squares: Contract Type (χ²=598.43), Internet Service
(χ²=275.18)
- Effect sizes and practical significance
- Feature Importance & Predictive Models (2
slides)
- Feature importance ranking (Tenure 18.9%, Monthly Charges 17.2%,
etc.)
- Model comparison: LR accuracy 82.3%, ROC-AUC 0.874
- Logistic regression coefficients with odds ratios
- Risk Segmentation (2 slides)
- 5-tier risk framework: Critical (3.3%) to Minimal (41.0%)
- Characteristics by tier: Tenure, contract, charges,
satisfaction
- Visual: Risk distribution histogram
- Business Recommendations (2 slides)
- Retention strategies by segment with ROI
- Action priorities: Contract incentives, early onboarding, service
quality
- Expected impact: Churn reduction from 26.5% to 20-22%
- Limitations & Future Work (1 slide)
- Data limitations (single quarter, California-only, fictional)
- Future extensions: Temporal analysis, causal inference, advanced
ML
- Conclusion & Q&A (1 slide)
- Key takeaways
- Contact information
- Questions from audience
Document Information: - Title:
Statistical Analysis of Factors Influencing Customer Churn in
Subscription-Based Services - Data Source: IBM Telco
Customer Churn (Kaggle) - Sample Size: 7,043 customers
- Analysis Period: Single fiscal quarter (Q3) -
Geographic Scope: California - Target
Variable: Churn (Yes/No) - Primary Methods:
Hypothesis testing, logistic regression, random forest classification -
Expected Completion: 8 weeks
This document is prepared for PGDM academic project in Data
Science & Business Analytics. All analysis is conducted using
publicly available secondary data. Methodology is reproducible and
findings are transferable to organizational customer data.
End of Document