Part 1: Data Preparation and Exploratory Analysis

Introduction to the Dataset

For this assignment, I am using the Telco Customer Churn dataset, which contains detailed information about telecom customers and whether they have churned (left the service). The dataset includes demographic information, contract details, payment method types, and monthly service charges. Since customer retention is a critical metric in the telecom industry, analyzing churn trends helps companies develop better retention strategies.

Research Questions

  1. What factors contribute to customer churn? Understanding which features most influence churn can help in crafting targeted retention campaigns.
  2. How does contract type impact churn? Customers with different contract types (month-to-month, one-year, two-year) might exhibit varying churn rates.
  3. Does the interaction between contract type and monthly charges affect churn probability? If an interaction exists, it suggests that the impact of monthly charges on churn differs based on contract type.
# Load the Customer Churn dataset
telco_data <- read_csv("telco-data.csv.csv") %>%
  mutate(Churn = if_else(Churn == "Yes", 1, 0),  # Convert Churn to binary (1 = churn, 0 = no churn)
         Contract = as.numeric(factor(Contract)),  # Convert contract type to numeric
         PaymentMethod = as.numeric(factor(PaymentMethod)))  # Convert payment method to numeric
## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl  (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the structure of the dataset
glimpse(telco_data)
## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 2, 1, 1, 3, 2, 3,…
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <dbl> 3, 4, 4, 1, 3, 3, 2, 4, 3, 1, 4, 2, 2, 1, 3, 2, 4, 1,…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <dbl> 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,…

Exploratory Data Analysis

Before diving into modeling, it is essential to explore the data to identify patterns and relationships.

# Summary statistics
summary(telco_data)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies       Contract   
##  Length:7043        Length:7043        Length:7043        Min.   :1.00  
##  Class :character   Class :character   Class :character   1st Qu.:1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :1.00  
##                                                           Mean   :1.69  
##                                                           3rd Qu.:2.00  
##                                                           Max.   :3.00  
##                                                                         
##  PaperlessBilling   PaymentMethod   MonthlyCharges    TotalCharges   
##  Length:7043        Min.   :1.000   Min.   : 18.25   Min.   :  18.8  
##  Class :character   1st Qu.:2.000   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Median :3.000   Median : 70.35   Median :1397.5  
##                     Mean   :2.574   Mean   : 64.76   Mean   :2283.3  
##                     3rd Qu.:3.000   3rd Qu.: 89.85   3rd Qu.:3794.7  
##                     Max.   :4.000   Max.   :118.75   Max.   :8684.8  
##                                                      NA's   :11      
##      Churn       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2654  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
## 
# Visualizing Churn by Contract Type
ggplot(telco_data, aes(x = as.factor(Contract), fill = as.factor(Churn))) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Contract Type", x = "Contract Type", y = "Proportion")

# Visualizing Churn by Monthly Charges
ggplot(telco_data, aes(x = MonthlyCharges, fill = as.factor(Churn))) +
  geom_density(alpha = 0.5) +
  labs(title = "Distribution of Monthly Charges by Churn Status", x = "Monthly Charges", y = "Density")

Part 2: Logistic Regression Models

Overview of Logistic Regression

Logistic regression is commonly used when the dependent variable is binary. Since our target variable (Churn) is either 1 (churned) or 0 (not churned), logistic regression is appropriate for this analysis. The logit model estimates the probability of churn as a function of multiple predictors, transforming the linear combination of features into a probability via the logistic function.

Model 1: Baseline Logistic Regression (Churn ~ Tenure + MonthlyCharges)

This baseline model evaluates the impact of tenure (how long a customer has been with the service) and monthly charges on churn probability.

baseline_model <- glm(Churn ~ tenure + MonthlyCharges, data = telco_data, family = binomial)
summary(baseline_model)
## 
## Call:
## glm(formula = Churn ~ tenure + MonthlyCharges, family = binomial, 
##     data = telco_data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.802436   0.086557  -20.82   <2e-16 ***
## tenure         -0.054850   0.001689  -32.47   <2e-16 ***
## MonthlyCharges  0.032954   0.001299   25.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8150.1  on 7042  degrees of freedom
## Residual deviance: 6394.4  on 7040  degrees of freedom
## AIC: 6400.4
## 
## Number of Fisher Scoring iterations: 5

Model 2: Adding Contract Type as a Predictor

Contract type is an important variable, as longer contracts (one-year, two-year) might reduce churn probability. This model extends the baseline model by incorporating Contract.

model2 <- glm(Churn ~ tenure + MonthlyCharges + Contract, data = telco_data, family = binomial)
summary(model2)
## 
## Call:
## glm(formula = Churn ~ tenure + MonthlyCharges + Contract, family = binomial, 
##     data = telco_data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.573358   0.118770  -4.827 1.38e-06 ***
## tenure         -0.035787   0.002037 -17.569  < 2e-16 ***
## MonthlyCharges  0.028621   0.001349  21.212  < 2e-16 ***
## Contract       -1.030966   0.071445 -14.430  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8150.1  on 7042  degrees of freedom
## Residual deviance: 6146.1  on 7039  degrees of freedom
## AIC: 6154.1
## 
## Number of Fisher Scoring iterations: 6

Model 3: Adding Interaction Term (MonthlyCharges * Contract)

To explore whether the effect of MonthlyCharges on churn depends on Contract type, we introduce an interaction term.

model3 <- glm(Churn ~ tenure + MonthlyCharges * Contract, data = telco_data, family = binomial)
summary(model3)
## 
## Call:
## glm(formula = Churn ~ tenure + MonthlyCharges * Contract, family = binomial, 
##     data = telco_data)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              0.196469   0.256383   0.766 0.443493    
## tenure                  -0.036163   0.002056 -17.592  < 2e-16 ***
## MonthlyCharges           0.018817   0.003118   6.036 1.58e-09 ***
## Contract                -1.660611   0.205560  -8.078 6.56e-16 ***
## MonthlyCharges:Contract  0.007896   0.002312   3.415 0.000638 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8150.1  on 7042  degrees of freedom
## Residual deviance: 6133.5  on 7038  degrees of freedom
## AIC: 6143.5
## 
## Number of Fisher Scoring iterations: 7

Part 3: Model Comparison using Likelihood Ratio Test, AIC, and BIC

Why Compare Models?

Model comparison helps determine the best-fit model while balancing complexity and performance. We use: - Likelihood Ratio Test (LRT): Compares nested models to test if adding predictors significantly improves model fit. - AIC (Akaike Information Criterion): Evaluates model fit by penalizing complexity. - BIC (Bayesian Information Criterion): Similar to AIC but more conservative in penalizing complexity.

# Compare models using Likelihood Ratio Test
anova(baseline_model, model2, model3, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Churn ~ tenure + MonthlyCharges
## Model 2: Churn ~ tenure + MonthlyCharges + Contract
## Model 3: Churn ~ tenure + MonthlyCharges * Contract
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      7040     6394.4                          
## 2      7039     6146.1  1  248.273 < 2.2e-16 ***
## 3      7038     6133.5  1   12.552 0.0003957 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare models using AIC and BIC
model_comparison <- tibble(
  Model = c("Baseline", "Model 2", "Model 3"),
  AIC = c(AIC(baseline_model), AIC(model2), AIC(model3)),
  BIC = c(BIC(baseline_model), BIC(model2), BIC(model3))
)

print(model_comparison)
## # A tibble: 3 × 3
##   Model      AIC   BIC
##   <chr>    <dbl> <dbl>
## 1 Baseline 6400. 6421.
## 2 Model 2  6154. 6182.
## 3 Model 3  6144. 6178.

Interpretation of Results

  • If Model 3 (with interaction) has the lowest AIC/BIC, it suggests that the interaction between MonthlyCharges and Contract is important for predicting churn.
  • If Model 2 has a lower AIC/BIC than Model 1, but Model 3 does not significantly improve further, the interaction term may not be necessary.
  • The Likelihood Ratio Test indicates whether adding Contract and the interaction term significantly improve the model.

Conclusion and Business Implications

This analysis provides actionable insights for customer retention strategies: - Tenure matters: Customers with shorter tenure are more likely to churn, suggesting that early retention strategies (e.g., discounts or loyalty rewards) could be beneficial. - Pricing strategy impact: Higher monthly charges correlate with higher churn rates. The company may explore flexible pricing options or value-added services to retain high-paying customers. - Contract importance: Customers with longer contracts have lower churn probability, indicating that incentivizing annual or two-year contracts may reduce churn rates.

Final Thoughts

Using logistic regression with MLE and model selection techniques, we identified key churn predictors and determined the best-fit model. This framework can be extended to include additional variables, refine segmentation, and optimize retention strategies.