Introduction

Stepwise logistic regression is a method used to automatically select a subset of predictors that best explain the variation in a binary outcome.

It works by adding or removing predictors one at a time based on a criterion like AIC (Akaike Information Criterion). The goal is to balance model fit and complexity.

Understanding the AIC (Akaike Information Criterion)

Formula:

AIC = –2 * log-likelihood + 2 * k

Where:

  • log-likelihood measures how well the model fits the data.
  • k is the number of estimated parameters in the model (including the intercept).

Explanation:

AIC is a metric used to compare models by balancing model fit and model complexity.

  • The first term (–2 * log-likelihood) rewards models that fit the data well.
  • The second term (2 * k) penalizes models that are too complex (i.e., those with more predictors).

Lower AIC values indicate better models. During stepwise regression, AIC helps determine whether adding or removing a variable improves the model.

Unlike a likelihood-ratio test, AIC can be used to compare models that are not nested, as long as they are fit to the same observations. This makes it especially helpful for variable selection.
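To make the formula concrete, here is a minimal sketch that computes AIC by hand and compares it with R's built-in AIC() function. The data are simulated and the names toy, x1, x2, and fit are purely illustrative; they are not part of the churn data used later.

```r
# Simulated data (hypothetical): one informative predictor, one pure-noise predictor
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.0 * x1))
toy <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# AIC by hand: -2 * log-likelihood + 2 * k
log_lik    <- as.numeric(logLik(fit))
k          <- length(coef(fit))   # intercept + 2 slopes = 3
manual_aic <- -2 * log_lik + 2 * k

manual_aic
AIC(fit)   # built-in value; should agree with manual_aic
```

The two numbers should match, which confirms that the AIC reported in a glm summary is exactly the formula above.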

Forward selection starts with no predictors and adds the most helpful ones.

Backward elimination starts with all predictors and removes the least helpful.

Stepwise does both—adds and removes as needed.

This helps us identify the most informative predictors without overfitting the model with unnecessary variables.
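As a rough illustration of the three search directions, the sketch below runs step() on simulated data. Everything here (toy, x1 to x4, null_fit, full_fit, search_scope) is a hypothetical stand-in, not the churn analysis itself; only x1 and x2 actually drive the outcome.

```r
# Simulated data (hypothetical): x1 and x2 matter, x3 and x4 are noise
set.seed(123)
n   <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.3 + 1.2 * toy$x1 - 0.8 * toy$x2))

null_fit <- glm(y ~ 1, data = toy, family = binomial)
full_fit <- glm(y ~ x1 + x2 + x3 + x4, data = toy, family = binomial)

# The scope tells step() the smallest and largest models it may visit
search_scope <- list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4)

forward_fit  <- step(null_fit, scope = search_scope, direction = "forward", trace = 0)
backward_fit <- step(full_fit, direction = "backward", trace = 0)
both_fit     <- step(null_fit, scope = search_scope, direction = "both", trace = 0)

# Compare the AIC of the start and end points of each search
sapply(list(null = null_fit, full = full_fit, forward = forward_fit,
            backward = backward_fit, both = both_fit), AIC)
```

Each search can only move to a model with a lower AIC than the one it started from, so the selected models should never have a higher AIC than their starting points.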

Task 1: Prepare the Data

Read in the data and display the first 6 rows

# df is assumed to be loaded already; the read step is not shown in this section
set.seed(123)
head(df)
##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service             DSL             No          Yes               No
## 2               No             DSL            Yes           No              Yes
## 3               No             DSL            Yes          Yes               No
## 4 No phone service             DSL            Yes           No              Yes
## 5               No     Fiber optic             No           No               No
## 6              Yes     Fiber optic             No           No              Yes
##   TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No          No              No Month-to-month              Yes
## 2          No          No              No       One year               No
## 3          No          No              No Month-to-month              Yes
## 4         Yes          No              No       One year               No
## 5          No          No              No Month-to-month              Yes
## 6          No         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn
## 1          Electronic check          29.85        29.85    No
## 2              Mailed check          56.95      1889.50    No
## 3              Mailed check          53.85       108.15   Yes
## 4 Bank transfer (automatic)          42.30      1840.75    No
## 5          Electronic check          70.70       151.65   Yes
## 6          Electronic check          99.65       820.50   Yes

Create a summary of the data to ensure the variables are stored correctly.

summary(df)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##     Churn          
##  Length:7043       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

We need to convert character variables to factors:

# install.packages("dplyr")  # run once if dplyr is not installed
library(dplyr)
df <- df |> 
  mutate(across(where(is.character), as.factor))

summary(df)
##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2810   No                 :2785   Month-to-month:3875  
##  No internet service:1526   No internet service:1526   One year      :1473  
##  Yes                :2707   Yes                :2732   Two year      :1695  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling                   PaymentMethod  MonthlyCharges  
##  No :2872         Bank transfer (automatic):1544   Min.   : 18.25  
##  Yes:4171         Credit card (automatic)  :1522   1st Qu.: 35.50  
##                   Electronic check         :2365   Median : 70.35  
##                   Mailed check             :1612   Mean   : 64.76  
##                                                    3rd Qu.: 89.85  
##                                                    Max.   :118.75  
##                                                                    
##   TotalCharges    Churn     
##  Min.   :  18.8   No :5174  
##  1st Qu.: 401.4   Yes:1869  
##  Median :1397.5             
##  Mean   :2283.3             
##  3rd Qu.:3794.7             
##  Max.   :8684.8             
##  NA's   :11

Churn is the target variable and is now also a factor. We set 'No' as the reference (baseline) level, so the model estimates the log-odds of 'Yes':

df$Churn <- relevel(df$Churn, ref = 'No')
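To see what relevel() does, here is a small sketch on a made-up factor (churn and churn_flipped are illustrative names only). With a factor response, glm() treats the first level as the baseline and models the probability of the other level.

```r
# A tiny made-up factor, not the real Churn column
churn <- factor(c("Yes", "No", "No", "Yes", "No"))
levels(churn)                        # alphabetical by default: "No" "Yes"

# Setting the reference explicitly; "No" was already first, so nothing changes
churn <- relevel(churn, ref = "No")
levels(churn)[1]                     # the reference level

# Choosing "Yes" as the reference instead would make the model predict "No"
churn_flipped <- relevel(churn, ref = "Yes")
levels(churn_flipped)                # "Yes" "No"
```

Because "No" already sorts before "Yes", the relevel() call here is a safeguard that makes the baseline explicit rather than a change to the data.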

head(df)
##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service             DSL             No          Yes               No
## 2               No             DSL            Yes           No              Yes
## 3               No             DSL            Yes          Yes               No
## 4 No phone service             DSL            Yes           No              Yes
## 5               No     Fiber optic             No           No               No
## 6              Yes     Fiber optic             No           No              Yes
##   TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No          No              No Month-to-month              Yes
## 2          No          No              No       One year               No
## 3          No          No              No Month-to-month              Yes
## 4         Yes          No              No       One year               No
## 5          No          No              No Month-to-month              Yes
## 6          No         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn
## 1          Electronic check          29.85        29.85    No
## 2              Mailed check          56.95      1889.50    No
## 3              Mailed check          53.85       108.15   Yes
## 4 Bank transfer (automatic)          42.30      1840.75    No
## 5          Electronic check          70.70       151.65   Yes
## 6          Electronic check          99.65       820.50   Yes
summary(df)
##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2810   No                 :2785   Month-to-month:3875  
##  No internet service:1526   No internet service:1526   One year      :1473  
##  Yes                :2707   Yes                :2732   Two year      :1695  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling                   PaymentMethod  MonthlyCharges  
##  No :2872         Bank transfer (automatic):1544   Min.   : 18.25  
##  Yes:4171         Credit card (automatic)  :1522   1st Qu.: 35.50  
##                   Electronic check         :2365   Median : 70.35  
##                   Mailed check             :1612   Mean   : 64.76  
##                                                    3rd Qu.: 89.85  
##                                                    Max.   :118.75  
##                                                                    
##   TotalCharges    Churn     
##  Min.   :  18.8   No :5174  
##  1st Qu.: 401.4   Yes:1869  
##  Median :1397.5             
##  Mean   :2283.3             
##  3rd Qu.:3794.7             
##  Max.   :8684.8             
##  NA's   :11

Now we drop the customerID column, which is a unique identifier rather than a predictor, and omit rows with missing values (the 11 rows where TotalCharges is NA):

df <- df |> select(-customerID)
df <- na.omit(df)
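Before dropping rows it can help to count how many will be lost. The sketch below uses a tiny made-up data frame (toy and toy_clean are hypothetical names) to show one way to check.

```r
# Made-up data with one missing value
toy <- data.frame(
  tenure       = c(1, 34, 0, 45),
  TotalCharges = c(29.85, 1889.50, NA, 1840.75)
)

colSums(is.na(toy))           # missing values per column
sum(!complete.cases(toy))     # rows that na.omit() will drop

toy_clean <- na.omit(toy)
nrow(toy) - nrow(toy_clean)   # rows actually dropped
```

Running the same checks on df before and after na.omit() is a quick way to confirm that only the expected rows were removed.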

Task 2: The Full Model

To begin the stepwise logistic regression process, we define two models:

  • A null model that includes only the intercept (no predictors), representing the baseline churn rate.

  • A full model that includes all available predictors.

These models set the boundaries for the stepwise algorithm to search for the best subset of variables by comparing improvements in model fit.

# Null model (intercept only)
null_model <- glm(Churn ~ 1, data = df, family = binomial)

# Full model (the . tells R to include all predictors)
full_model <- glm(Churn ~ ., data = df, family = binomial)

We can view the summary of each model before we start the stepwise regression:

summary(null_model)
## 
## Call:
## glm(formula = Churn ~ 1, family = binomial, data = df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.016      0.027  -37.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8143.4  on 7031  degrees of freedom
## Residual deviance: 8143.4  on 7031  degrees of freedom
## AIC: 8145.4
## 
## Number of Fisher Scoring iterations: 4
summary(full_model)
## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = df)
## 
## Coefficients: (7 not defined because of singularities)
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           1.165e+00  8.151e-01   1.430  0.15284    
## genderMale                           -2.183e-02  6.480e-02  -0.337  0.73619    
## SeniorCitizen                         2.168e-01  8.453e-02   2.564  0.01033 *  
## PartnerYes                           -3.840e-04  7.783e-02  -0.005  0.99606    
## DependentsYes                        -1.485e-01  8.973e-02  -1.655  0.09796 .  
## tenure                               -6.059e-02  6.236e-03  -9.716  < 2e-16 ***
## PhoneServiceYes                       1.715e-01  6.487e-01   0.264  0.79153    
## MultipleLinesNo phone service                NA         NA      NA       NA    
## MultipleLinesYes                      4.484e-01  1.773e-01   2.530  0.01142 *  
## InternetServiceFiber optic            1.747e+00  7.981e-01   2.190  0.02855 *  
## InternetServiceNo                    -1.786e+00  8.073e-01  -2.213  0.02691 *  
## OnlineSecurityNo internet service            NA         NA      NA       NA    
## OnlineSecurityYes                    -2.054e-01  1.787e-01  -1.150  0.25031    
## OnlineBackupNo internet service              NA         NA      NA       NA    
## OnlineBackupYes                       2.604e-02  1.754e-01   0.148  0.88197    
## DeviceProtectionNo internet service          NA         NA      NA       NA    
## DeviceProtectionYes                   1.474e-01  1.764e-01   0.836  0.40339    
## TechSupportNo internet service               NA         NA      NA       NA    
## TechSupportYes                       -1.805e-01  1.806e-01  -0.999  0.31759    
## StreamingTVNo internet service               NA         NA      NA       NA    
## StreamingTVYes                        5.905e-01  3.263e-01   1.810  0.07035 .  
## StreamingMoviesNo internet service           NA         NA      NA       NA    
## StreamingMoviesYes                    5.993e-01  3.267e-01   1.834  0.06658 .  
## ContractOne year                     -6.608e-01  1.076e-01  -6.142 8.15e-10 ***
## ContractTwo year                     -1.357e+00  1.764e-01  -7.691 1.46e-14 ***
## PaperlessBillingYes                   3.424e-01  7.450e-02   4.596 4.31e-06 ***
## PaymentMethodCredit card (automatic) -8.779e-02  1.141e-01  -0.770  0.44156    
## PaymentMethodElectronic check         3.045e-01  9.450e-02   3.222  0.00127 ** 
## PaymentMethodMailed check            -5.759e-02  1.149e-01  -0.501  0.61627    
## MonthlyCharges                       -4.034e-02  3.176e-02  -1.270  0.20392    
## TotalCharges                          3.289e-04  7.063e-05   4.657 3.20e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8143.4  on 7031  degrees of freedom
## Residual deviance: 5826.3  on 7008  degrees of freedom
## AIC: 5874.3
## 
## Number of Fisher Scoring iterations: 6

Full Model Summary (Interpretation and Notes)

The full logistic regression model includes all available predictors to estimate the probability of customer churn. Many of these predictors are categorical, and R automatically creates dummy variables to represent their levels (thanks R). For example:

  • genderMale represents the comparison between male customers and the reference group (female).

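  • InternetServiceFiber optic compares customers with fiber optic service to those in the reference level (“DSL”, the first level alphabetically). Customers with no internet service have their own coefficient, InternetServiceNo.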

  • Binary variables such as SeniorCitizen are included directly as numeric predictors.

An example interpretation: the coefficient for SeniorCitizen is approximately 0.217. This indicates that, holding all other variables constant, being a senior citizen is associated with an increase in the log-odds of churn. The associated p-value is 0.0103, suggesting the relationship is statistically significant at the 5% level.

After exponentiating, \(e^{0.217} \approx 1.24\), meaning the odds of churn for senior citizens are about 1.24 times the odds for non-senior customers. This is a 24% increase in the odds of churn, not a 124% increase, because \(\text{Percent increase in odds} = (1.24 - 1) \times 100 = 24\%\). Note that this statement is about odds, not probability: senior citizens have an estimated 24% higher odds of churning than non-seniors, and the corresponding change in churn probability depends on the baseline probability.
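The arithmetic can be checked in a few lines of R. The coefficient 0.217 comes from the output above; the baseline probability p0 = 0.25 is a hypothetical value chosen only to show that a change in odds is not the same as a change in probability.

```r
beta <- 0.217                          # SeniorCitizen coefficient from the full model

odds_ratio      <- exp(beta)
pct_change_odds <- (odds_ratio - 1) * 100

round(odds_ratio, 2)                   # 1.24
round(pct_change_odds)                 # 24

# Odds versus probability, for an assumed baseline churn probability
p0    <- 0.25                          # hypothetical non-senior churn probability
odds0 <- p0 / (1 - p0)
odds1 <- odds0 * odds_ratio            # odds for an otherwise identical senior
p1    <- odds1 / (1 + odds1)

c(baseline = p0, senior = p1)
(p1 / p0 - 1) * 100                    # relative change in probability, below 24%
```

With the fitted model available, exp(coef(full_model)) would give the odds ratios for all predictors at once.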

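Some coefficients are listed as NA, which indicates perfect collinearity: the dummy variable is an exact linear combination of columns already in the model. Here, the “No internet service” level of OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies carries the same information as InternetService = “No”, and MultipleLines = “No phone service” carries the same information as PhoneService = “No”. R drops these redundant coefficients automatically during fitting and reports how many in the “not defined because of singularities” note.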
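The sketch below reproduces this behavior on simulated data. The variables internet, security, and y are made up for illustration; security is built so that its “No internet service” level coincides exactly with internet = “No”.

```r
# Simulated data (hypothetical) with a deliberately redundant factor level
set.seed(1)
n <- 200
internet <- factor(sample(c("DSL", "Fiber optic", "No"), n, replace = TRUE))
security <- factor(ifelse(internet == "No",
                          "No internet service",
                          sample(c("No", "Yes"), n, replace = TRUE)))
y   <- rbinom(n, 1, 0.3)
toy <- data.frame(y, internet, security)

# The cross-tabulation shows the overlap: one level determines the other
table(toy$internet, toy$security)

fit <- glm(y ~ internet + security, data = toy, family = binomial)
coef(fit)   # the coefficient for "securityNo internet service" is NA
```

A cross-tabulation like this one, applied to the real columns, is a quick way to find which levels are responsible for NA coefficients.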

What if One Level of a Categorical Variable Is Significant but Another Is Not?

This is a common and important situation in regression modeling. Suppose you have a categorical variable like Contract with three levels:

  • Month-to-month (reference level)
  • One year
  • Two year

Your regression output might show:

Predictor           Coefficient   p-value
ContractOne year    –0.50         0.08
ContractTwo year    –1.20         0.001

How to Interpret This:

  • Significance is relative to the reference level. Here, both “One year” and “Two year” contracts are being compared to “Month-to-month”.
  • The model suggests that “Two year” contracts significantly reduce churn, while the effect of “One year” contracts is not statistically significant at the \(5\%\) level.
  • This does not mean that “One year” contracts have no effect — only that their effect is not strong enough to rule out chance based on the current data.

What to do?

Option A: Keep All Levels
  • This is usually the best choice unless you have a specific reason to simplify.
  • Retains the full structure of the categorical variable.

Option B: Collapse Categories
  • If two levels behave similarly, you can recode them into one (e.g., group “One year” and “Two year” into “Long-term contract”).
  • Simplifies the model and may make effects more detectable.

Option C: Use Regularization (like LASSO)
  • With many categories or sparse data, regularization will shrink less useful coefficients (possibly to zero).
  • Allows the model to decide which levels are worth keeping.
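As an illustration of Option B, here is one way to collapse levels in base R. The vector contract is a small made-up example and “Long-term contract” is just the label suggested above; whether collapsing is appropriate depends on whether the levels really behave alike.

```r
# Made-up contract values, not the real column
contract <- factor(c("Month-to-month", "One year", "Two year",
                     "One year", "Month-to-month"))

# Assigning the same label to several levels merges them into one
contract_collapsed <- contract
merge_these <- levels(contract_collapsed) %in% c("One year", "Two year")
levels(contract_collapsed)[merge_these] <- "Long-term contract"

# Check the mapping from old levels to new levels
table(original = contract, collapsed = contract_collapsed)
```

Keeping the original column and writing the result to a new one makes it easy to compare models fitted with either version.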


Summary

Even if only one level of a categorical variable is statistically significant, the variable as a whole may still be useful. Do not drop the entire variable just because some levels are not significant — this can remove meaningful structure and information from your model.
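One way to judge a categorical variable as a whole, rather than level by level, is a likelihood-ratio test that drops all of its dummy variables at once. The sketch below does this with drop1() on simulated data; toy, contract, tenure, and churn are hypothetical and only mimic the structure of the real data.

```r
# Simulated data (hypothetical): a three-level factor plus one numeric predictor
set.seed(7)
n        <- 600
contract <- factor(sample(c("Month-to-month", "One year", "Two year"),
                          n, replace = TRUE))
tenure   <- runif(n, 0, 72)
eta      <- -0.2 - 0.3 * (contract == "One year") -
            1.2 * (contract == "Two year") - 0.02 * tenure
churn    <- rbinom(n, 1, plogis(eta))
toy      <- data.frame(churn, contract, tenure)

fit <- glm(churn ~ contract + tenure, data = toy, family = binomial)

summary(fit)$coefficients           # one Wald test per level

lrt <- drop1(fit, test = "Chisq")   # one likelihood-ratio test per variable
lrt                                 # contract is tested on 2 degrees of freedom
```

The row for contract answers the question "does the variable matter at all?", which is the relevant question when deciding whether to keep it.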

Finally, note the AIC value reported for the full model. In the next step, the stepwise procedure will attempt to reduce this AIC by selectively adding or removing predictors to improve model parsimony and performance.
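The reported AIC values can be reproduced from the deviances in the summaries above. For a logistic regression on individual 0/1 outcomes, the residual deviance equals –2 * log-likelihood, so AIC is the residual deviance plus 2 * k. The numbers below are copied from the output; the variable names are ours.

```r
# Values copied from the model summaries above
null_deviance <- 8143.4
full_deviance <- 5826.3

# k = estimated parameters = null df - residual df + 1 (for the intercept)
null_k <- 7031 - 7031 + 1   # 1
full_k <- 7031 - 7008 + 1   # 24

null_aic <- null_deviance + 2 * null_k
full_aic <- full_deviance + 2 * full_k

null_aic   # 8145.4, matching the null model summary
full_aic   # 5874.3, matching the full model summary
```

The full model pays a penalty of 48 for its 24 parameters but still has a much lower AIC than the null model because its deviance is so much smaller.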

Task 3: Stepwise Regression - Backward Elimination

With direction = "backward", R's step() function starts with the full model and at each step:

  • Evaluates the effect of removing each predictor,

  • Chooses the removal that most reduces AIC (or increases it the least),

  • Stops when removing any further predictor would increase AIC.
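The loop described above can be sketched by hand with drop1(), which reports the AIC that would result from removing each term. This is an illustration on simulated data, not the internals of step(); toy, start_fit, and manual_fit are hypothetical names.

```r
# Simulated data (hypothetical): x1 and x2 matter, x3 and x4 are noise
set.seed(2024)
n   <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.3 + 1.2 * toy$x1 - 0.8 * toy$x2))

start_fit  <- glm(y ~ x1 + x2 + x3 + x4, data = toy, family = binomial)
manual_fit <- start_fit

# Repeat while there is at least one term left to remove
while (length(attr(terms(manual_fit), "term.labels")) > 0) {
  candidates <- drop1(manual_fit)        # row 1 is "<none>" (keep everything)
  best <- which.min(candidates$AIC)      # removal giving the lowest AIC
  if (best == 1) break                   # no removal lowers AIC: stop
  term_to_drop <- rownames(candidates)[best]
  manual_fit <- update(manual_fit, as.formula(paste(". ~ . -", term_to_drop)))
}

# Compare with the built-in search
step_fit <- step(start_fit, direction = "backward", trace = 0)
formula(manual_fit)
formula(step_fit)
```

On this data the manual loop and step() should arrive at the same model, which is a useful sanity check on the description of the algorithm.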

# We use trace = 1 to print each step of the search;
# set trace = 0 to suppress this output
backward_model <- step(full_model, direction = "backward", trace = 1)
## Start:  AIC=5874.27
## Churn ~ gender + SeniorCitizen + Partner + Dependents + tenure + 
##     PhoneService + MultipleLines + InternetService + OnlineSecurity + 
##     OnlineBackup + DeviceProtection + TechSupport + StreamingTV + 
##     StreamingMovies + Contract + PaperlessBilling + PaymentMethod + 
##     MonthlyCharges + TotalCharges
## 
## 
## Step:  AIC=5874.27
## Churn ~ gender + SeniorCitizen + Partner + Dependents + tenure + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies + 
##     Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + 
##     TotalCharges
## 
##                    Df Deviance    AIC
## - Partner           1   5826.3 5872.3
## - OnlineBackup      1   5826.3 5872.3
## - gender            1   5826.4 5872.4
## - DeviceProtection  1   5827.0 5873.0
## - TechSupport       1   5827.3 5873.3
## - OnlineSecurity    1   5827.6 5873.6
## - MonthlyCharges    1   5827.9 5873.9
## <none>                  5826.3 5874.3
## - Dependents        1   5829.0 5875.0
## - StreamingTV       1   5829.6 5875.6
## - StreamingMovies   1   5829.6 5875.6
## - InternetService   1   5831.1 5877.1
## - SeniorCitizen     1   5832.8 5878.8
## - MultipleLines     2   5846.8 5890.8
## - PaperlessBilling  1   5847.5 5893.5
## - PaymentMethod     3   5852.5 5894.5
## - TotalCharges      1   5849.1 5895.1
## - Contract          2   5908.5 5952.5
## - tenure            1   5937.9 5983.9
## 
## Step:  AIC=5872.27
## Churn ~ gender + SeniorCitizen + Dependents + tenure + MultipleLines + 
##     InternetService + OnlineSecurity + OnlineBackup + DeviceProtection + 
##     TechSupport + StreamingTV + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges
## 
##                    Df Deviance    AIC
## - OnlineBackup      1   5826.3 5870.3
## - gender            1   5826.4 5870.4
## - DeviceProtection  1   5827.0 5871.0
## - TechSupport       1   5827.3 5871.3
## - OnlineSecurity    1   5827.6 5871.6
## - MonthlyCharges    1   5827.9 5871.9
## <none>                  5826.3 5872.3
## - StreamingTV       1   5829.6 5873.6
## - Dependents        1   5829.6 5873.6
## - StreamingMovies   1   5829.6 5873.6
## - InternetService   1   5831.1 5875.1
## - SeniorCitizen     1   5832.9 5876.9
## - MultipleLines     2   5846.8 5888.8
## - PaperlessBilling  1   5847.5 5891.5
## - PaymentMethod     3   5852.5 5892.5
## - TotalCharges      1   5849.1 5893.1
## - Contract          2   5908.5 5950.5
## - tenure            1   5938.7 5982.7
## 
## Step:  AIC=5870.29
## Churn ~ gender + SeniorCitizen + Dependents + tenure + MultipleLines + 
##     InternetService + OnlineSecurity + DeviceProtection + TechSupport + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + MonthlyCharges + TotalCharges
## 
##                    Df Deviance    AIC
## - gender            1   5826.4 5868.4
## - DeviceProtection  1   5827.7 5869.7
## <none>                  5826.3 5870.3
## - TechSupport       1   5829.6 5871.6
## - Dependents        1   5829.6 5871.6
## - OnlineSecurity    1   5830.6 5872.6
## - MonthlyCharges    1   5832.9 5874.9
## - SeniorCitizen     1   5832.9 5874.9
## - StreamingTV       1   5838.0 5880.0
## - StreamingMovies   1   5838.6 5880.6
## - MultipleLines     2   5847.8 5887.8
## - InternetService   1   5847.3 5889.3
## - PaperlessBilling  1   5847.6 5889.6
## - PaymentMethod     3   5852.5 5890.5
## - TotalCharges      1   5849.2 5891.2
## - Contract          2   5908.6 5948.6
## - tenure            1   5938.7 5980.7
## 
## Step:  AIC=5868.41
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines + 
##     InternetService + OnlineSecurity + DeviceProtection + TechSupport + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + MonthlyCharges + TotalCharges
## 
##                    Df Deviance    AIC
## - DeviceProtection  1   5827.8 5867.8
## <none>                  5826.4 5868.4
## - TechSupport       1   5829.7 5869.7
## - Dependents        1   5829.8 5869.8
## - OnlineSecurity    1   5830.7 5870.7
## - MonthlyCharges    1   5833.0 5873.0
## - SeniorCitizen     1   5833.0 5873.0
## - StreamingTV       1   5838.1 5878.1
## - StreamingMovies   1   5838.7 5878.7
## - MultipleLines     2   5847.9 5885.9
## - InternetService   1   5847.4 5887.4
## - PaperlessBilling  1   5847.7 5887.7
## - PaymentMethod     3   5852.6 5888.6
## - TotalCharges      1   5849.3 5889.3
## - Contract          2   5908.6 5946.6
## - tenure            1   5938.9 5978.9
## 
## Step:  AIC=5867.84
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines + 
##     InternetService + OnlineSecurity + TechSupport + StreamingTV + 
##     StreamingMovies + Contract + PaperlessBilling + PaymentMethod + 
##     MonthlyCharges + TotalCharges
## 
##                    Df Deviance    AIC
## <none>                  5827.8 5867.8
## - Dependents        1   5831.2 5869.2
## - MonthlyCharges    1   5833.4 5871.4
## - TechSupport       1   5834.1 5872.1
## - SeniorCitizen     1   5834.5 5872.5
## - OnlineSecurity    1   5836.0 5874.0
## - StreamingTV       1   5838.7 5876.7
## - StreamingMovies   1   5839.3 5877.3
## - MultipleLines     2   5848.2 5884.2
## - PaperlessBilling  1   5848.9 5886.9
## - PaymentMethod     3   5853.8 5887.8
## - TotalCharges      1   5850.6 5888.6
## - InternetService   1   5852.5 5890.5
## - Contract          2   5909.0 5945.0
## - tenure            1   5940.6 5978.6

The final model, stored in the backward_model object, is the result of applying backward elimination to the full logistic regression model. Starting with all available predictors, the algorithm iteratively removed variables that did not contribute meaningfully to model performance, as judged by the Akaike Information Criterion (AIC).

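The predictors retained in this model are those that, among the models visited by the search, provide the best balance between model fit and complexity as measured by AIC. Each retained variable lowers AIC relative to the model without it. This is a weaker requirement than statistical significance at the 5% level: Dependents, for example, is retained even though its p-value is about 0.066.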

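This model is more parsimonious than the full model (AIC 5867.8 versus 5874.3) and keeps only the predictors that lower AIC. A lower AIC reduces the risk of overfitting but does not rule it out, so performance should still be checked on data that was not used for selection.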

You can now interpret the coefficients, assess model performance, or use it for prediction and model evaluation.
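As a rough sketch of those next steps, the code below fits a small logistic model on simulated data, converts coefficients to odds ratios, and produces predicted probabilities and classes. The data, the variable names, and the 0.5 cutoff are all assumptions made for illustration; the same calls would apply to backward_model.

```r
# Simulated data (hypothetical), loosely shaped like the churn problem
set.seed(99)
n   <- 400
toy <- data.frame(tenure = runif(n, 0, 72), senior = rbinom(n, 1, 0.16))
toy$churn <- rbinom(n, 1, plogis(0.2 - 0.04 * toy$tenure + 0.3 * toy$senior))

fit <- glm(churn ~ tenure + senior, data = toy, family = binomial)

# Interpretation: odds ratios instead of log-odds
odds_ratios <- exp(coef(fit))
odds_ratios

# Prediction: probabilities, then classes at an assumed 0.5 cutoff
pred_prob  <- predict(fit, type = "response")
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)

# Evaluation: confusion matrix and accuracy on the training data
table(actual = toy$churn, predicted = pred_class)
mean(pred_class == toy$churn)
```

Accuracy computed on the same data used for fitting tends to be optimistic, so a held-out test set gives a fairer picture.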

summary(backward_model)
## 
## Call:
## glm(formula = Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines + 
##     InternetService + OnlineSecurity + TechSupport + StreamingTV + 
##     StreamingMovies + Contract + PaperlessBilling + PaymentMethod + 
##     MonthlyCharges + TotalCharges, family = binomial, data = df)
## 
## Coefficients: (4 not defined because of singularities)
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           6.360e-01  5.152e-01   1.234 0.217094    
## SeniorCitizen                         2.167e-01  8.399e-02   2.580 0.009868 ** 
## DependentsYes                        -1.496e-01  8.141e-02  -1.838 0.066054 .  
## tenure                               -6.060e-02  6.210e-03  -9.758  < 2e-16 ***
## MultipleLinesNo phone service         1.367e-01  2.424e-01   0.564 0.572692    
## MultipleLinesYes                      3.709e-01  9.472e-02   3.916 9.01e-05 ***
## InternetServiceFiber optic            1.367e+00  2.761e-01   4.953 7.32e-07 ***
## InternetServiceNo                    -1.403e+00  3.148e-01  -4.456 8.37e-06 ***
## OnlineSecurityNo internet service            NA         NA      NA       NA    
## OnlineSecurityYes                    -2.827e-01  9.892e-02  -2.858 0.004266 ** 
## TechSupportNo internet service               NA         NA      NA       NA    
## TechSupportYes                       -2.541e-01  1.021e-01  -2.490 0.012791 *  
## StreamingTVNo internet service               NA         NA      NA       NA    
## StreamingTVYes                        4.452e-01  1.351e-01   3.294 0.000987 ***
## StreamingMoviesNo internet service           NA         NA      NA       NA    
## StreamingMoviesYes                    4.545e-01  1.344e-01   3.382 0.000719 ***
## ContractOne year                     -6.543e-01  1.074e-01  -6.092 1.11e-09 ***
## ContractTwo year                     -1.346e+00  1.762e-01  -7.640 2.17e-14 ***
## PaperlessBillingYes                   3.409e-01  7.443e-02   4.580 4.64e-06 ***
## PaymentMethodCredit card (automatic) -8.769e-02  1.140e-01  -0.769 0.441714    
## PaymentMethodElectronic check         3.024e-01  9.444e-02   3.202 0.001363 ** 
## PaymentMethodMailed check            -5.878e-02  1.147e-01  -0.512 0.608461    
## MonthlyCharges                       -2.501e-02  1.057e-02  -2.366 0.017962 *  
## TotalCharges                          3.281e-04  7.055e-05   4.652 3.29e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8143.4  on 7031  degrees of freedom
## Residual deviance: 5827.8  on 7012  degrees of freedom
## AIC: 5867.8
## 
## Number of Fisher Scoring iterations: 6
formula(backward_model)
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines + 
##     InternetService + OnlineSecurity + TechSupport + StreamingTV + 
##     StreamingMovies + Contract + PaperlessBilling + PaymentMethod + 
##     MonthlyCharges + TotalCharges

Final Logistic Regression Model (Interpretation)

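The model predicts the probability that a customer churns based on the set of predictors selected through backward elimination. The note that 4 coefficients were not defined because of singularities has a separate cause: the "No internet service" levels of OnlineSecurity, TechSupport, StreamingTV, and StreamingMovies identify exactly the same customers as InternetServiceNo, so their dummy columns are perfectly collinear with it and R reports NA for them. This is perfect multicollinearity, not perfect separation, and it does not change the fitted probabilities.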
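A coefficient is reported as NA when its column in the design matrix is an exact linear combination of earlier columns. The following sketch reproduces that situation on synthetic data (all variable names and values are invented for illustration): the "No internet service" level of one factor marks exactly the same rows as the "No" level of another.

```r
set.seed(1)
n <- 200

# Internet type, and an add-on service that only exists for internet customers
internet <- factor(sample(c("DSL", "Fiber optic", "No"), n, replace = TRUE))
security <- ifelse(internet == "No",
                   "No internet service",
                   sample(c("No", "Yes"), n, replace = TRUE))
security <- factor(security, levels = c("No", "No internet service", "Yes"))

# Outcome unrelated to the predictors; only the design matrix matters here
churn <- rbinom(n, 1, 0.3)

fit <- glm(churn ~ internet + security, family = binomial)
aliased <- names(coef(fit))[is.na(coef(fit))]
print(coef(fit))
print(aliased)
```

The dummy column for security = "No internet service" is identical to the one for internet = "No", so glm() keeps the first and leaves the second undefined.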

Key Observations

1. Model Fit

  • Null deviance: 8143.4 (model with intercept only)
  • Residual deviance: 5827.8 (after fitting predictors)
  • AIC: 5867.8

The residual deviance is 2315.6 lower than the null deviance, at the cost of 19 additional parameters (7031 versus 7012 residual degrees of freedom), so the final model fits much better than the intercept-only model.
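The comparison with the null model can be made formal with a likelihood-ratio test: the drop in deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of added parameters. A short sketch using only the numbers printed in the summary:

```r
# Values copied from summary(backward_model)
null_deviance     <- 8143.4
residual_deviance <- 5827.8
null_df           <- 7031
residual_df       <- 7012

lr_stat <- null_deviance - residual_deviance  # drop in deviance
lr_df   <- null_df - residual_df              # parameters added beyond the intercept

# Upper-tail probability of the chi-squared reference distribution
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

With the fitted objects available, anova(backward_model, test = "Chisq") gives a term-by-term version of the same comparison.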


2. Significant Predictors (p < 0.05)

Here are some key variables that are statistically significant and likely important for churn prediction:

  • SeniorCitizen: Being a senior citizen increases the odds of churn (p = 0.0099)
  • tenure: Longer tenure is associated with lower odds of churn (p < 2e-16)
  • InternetServiceFiber optic: Customers with fiber optic service are more likely to churn (p = 7.32e-07)
  • InternetServiceNo: Customers with no internet service are less likely to churn (p = 8.37e-06)
  • OnlineSecurityYes: Having online security decreases churn odds (p = 0.0043)
  • TechSupportYes: Having tech support decreases churn odds (p = 0.0128)
  • StreamingTVYes, StreamingMoviesYes: Usage of streaming services is associated with higher churn (p = 0.00099 and p = 0.00072)
  • ContractTwo year: Having a two-year contract is strongly associated with lower churn (p = 2.17e-14)
  • PaperlessBillingYes: Associated with increased churn (p = 4.64e-06)
  • MonthlyCharges, TotalCharges: Both significant; holding the other predictors fixed, MonthlyCharges has a negative coefficient (p = 0.018) and TotalCharges a positive one (p = 3.29e-06)
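The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read. The sketch below copies a few estimates from the summary output by hand; with the fitted object available, exp(coef(backward_model)) would do the same for every coefficient.

```r
# Estimates copied from summary(backward_model) (log-odds scale)
estimates <- c(
  tenure                       = -6.060e-02,
  `InternetServiceFiber optic` =  1.367,
  `ContractTwo year`           = -1.346,
  PaperlessBillingYes          =  3.409e-01
)

# Odds ratio for a one-unit change, holding other predictors fixed
odds_ratios <- exp(estimates)
print(round(odds_ratios, 2))

# A numeric predictor can be scaled first: 12 additional months of tenure
tenure_12_months <- unname(exp(12 * estimates["tenure"]))
print(round(tenure_12_months, 2))
```

The tenure odds ratio is about 0.94, meaning roughly 6% lower odds of churn for each additional month. The two-year contract odds ratio of about 0.26 is relative to the reference contract level, not to all other customers.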

3. Non-significant Predictors (p > 0.05)

  • DependentsYes (p = 0.066)
  • MultipleLinesNo phone service
  • PaymentMethodCredit card (automatic)
  • PaymentMethodMailed check

These remain in the model because step() adds or drops whole terms, not individual factor levels: MultipleLines and PaymentMethod each have another level that is significant, and dropping Dependents would raise the AIC slightly (5869.2 versus 5867.8).


Summary

The final model:

  • Includes relevant service usage and account features (e.g., streaming services, contract type, internet type)
  • Captures behavioral and financial patterns (e.g., billing type, total charges)
  • Reflects known business patterns (e.g., customers on longer contracts churn less)

This model is now ready for evaluation using prediction metrics or for comparison against models in Python with L1/L2 regularization.

Task 4: Comparing Logistic Regression Models Across R and Python

Your task is to compare the logistic regression model created in R using backward elimination with logistic regression models built in Python using the scikit-learn library. You will:

  1. Recreate a logistic regression model in Python using the same predictors selected in your final R model or by choosing predictors using Python and methods discussed in class.
  2. Train two regularized models in Python:
    • One using Lasso (L1 penalty)
    • One using Ridge (L2 penalty)
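  3. Use cross-validation to determine the optimal value of the regularization parameter (C in LogisticRegression, the inverse of the regularization strength, so smaller values mean a stronger penalty; LogisticRegressionCV searches over a grid of C values supplied through its Cs argument).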
  4. Evaluate and compare the models based on classification performance (e.g., accuracy, precision, recall, ROC AUC).
  5. Interpret and report which model performs best and why. Discuss how regularization affects model complexity and variable selection.

You may use tools such as:

  • LogisticRegressionCV for automatic hyperparameter tuning
  • Pipeline and StandardScaler to scale your features (especially important for regularized models)
  • classification_report and roc_auc_score for performance metrics
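For the R side of the comparison, the same metrics can be computed in base R so that both platforms are judged on equal terms. The sketch below is one possible helper, not a standard function: the name classification_metrics, the 0.5 threshold, and the four-row toy input are assumptions for illustration. The ROC AUC uses the rank (Mann-Whitney) formula.

```r
# y: 0/1 outcomes; prob: predicted probabilities of class 1
classification_metrics <- function(y, prob, threshold = 0.5) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & y == 1)
  fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1)
  tn <- sum(pred == 0 & y == 0)

  # AUC = probability that a random positive is scored above a random negative
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  auc <- (sum(rank(prob)[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  c(accuracy  = (tp + tn) / length(y),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn),
    roc_auc   = auc)
}

# Toy example: two negatives followed by two positives
y    <- c(0, 0, 1, 1)
prob <- c(0.1, 0.4, 0.35, 0.8)
m <- classification_metrics(y, prob)
print(m)
```

For the churn model, prob would come from predict(backward_model, newdata = ..., type = "response") on held-out rows, with Churn recoded to 0/1 first.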

This task assesses your ability to apply statistical modeling techniques, evaluate models across platforms, and interpret regularization in practice.

Task 5: Addressing Class Imbalance

In this dataset, the Churn variable is imbalanced: most customers do not churn. A model fit to such data can score well on accuracy simply by favoring the majority class, so accuracy alone can be misleading and recall for churners may be poor.

Your task is to:

  1. Balance the training data using random oversampling of the minority class.
  2. Re-run the modeling process:
    • Fit the full model
    • Perform backward elimination
    • Evaluate the final model
  3. Compare the results:
    • Do you observe the same set of predictors in the final model?
    • Are the coefficient signs or significance levels different?
    • Does the AIC improve or worsen?
    • How does the model’s performance change?

You can use the following R code to oversample the minority class:

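# Install ROSE only if it is not already available (or use "DMwR" for SMOTE)
if (!requireNamespace("ROSE", quietly = TRUE)) install.packages("ROSE")

library(ROSE)

# Create a balanced dataset using random oversampling.
# If you hold out a test set, pass only the training rows as `data`
# so that duplicated minority rows cannot leak into the test set.
balanced_df <- ovun.sample(Churn ~ ., data = df, method = "over", seed = 123)$data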

# Proceed with the same modeling steps on balanced_df:
# - full_model <- glm(..., data = balanced_df, ...)
# - backward_model <- step(...)
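Random oversampling can also be done in base R, which makes it explicit that only the training rows are resampled. The sketch below uses a synthetic data frame (toy, with an invented predictor x and a 75/25 class split) and a hypothetical 70/30 train/test split; it is an illustration of the idea, not a replacement for ovun.sample.

```r
set.seed(123)

# Synthetic imbalanced data: about 25% "Yes"
toy <- data.frame(
  x = rnorm(1000),
  Churn = factor(sample(c("No", "Yes"), 1000, replace = TRUE, prob = c(0.75, 0.25)))
)

# Split first, so the test set keeps the original class proportions
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Resample minority rows with replacement until both classes have equal counts
counts <- table(train$Churn)
minority <- names(counts)[which.min(counts)]
n_extra <- max(counts) - min(counts)
minority_rows <- which(train$Churn == minority)
extra_rows <- sample(minority_rows, n_extra, replace = TRUE)
train_balanced <- train[c(seq_len(nrow(train)), extra_rows), ]

print(table(train$Churn))
print(table(train_balanced$Churn))
```

Models fit on train_balanced should still be evaluated on the untouched test set. Because oversampling changes the class proportions, the intercept and the predicted probabilities are shifted toward the minority class.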