Stepwise logistic regression is a method used to automatically select a subset of predictors that best explain the variation in a binary outcome.
It works by adding or removing predictors one at a time based on a criterion like AIC (Akaike Information Criterion). The goal is to balance model fit and complexity.
Formula:
AIC = –2 * log-likelihood + 2 * k
Where:
- log-likelihood measures how well the model fits the data.
- k is the number of estimated parameters in the model (including the intercept).
Explanation:
AIC is a metric used to compare models by balancing model fit and model complexity.
Lower AIC values indicate better models. During stepwise regression, AIC helps determine whether adding or removing a variable improves the model.
Unlike the likelihood-ratio test, AIC can be used even when models are not nested and is especially helpful for variable selection.
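To make the formula concrete, here is a minimal sketch that computes AIC by hand and checks it against R's built-in AIC(). It uses the built-in mtcars data purely as a stand-in (an assumption for illustration; it is not the churn data used later).

```r
# Fit a small logistic regression on R's built-in mtcars data
# (am: 0 = automatic, 1 = manual; wt: weight in 1000 lbs).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

log_lik <- as.numeric(logLik(fit))  # maximized log-likelihood
k <- length(coef(fit))              # intercept + one slope = 2 parameters

# AIC = -2 * log-likelihood + 2 * k
manual_aic <- -2 * log_lik + 2 * k

# The hand-computed value should agree with R's own AIC()
all.equal(manual_aic, AIC(fit))
```

Adding a predictor always raises the log-likelihood (or leaves it unchanged), so the 2 * k term is what lets AIC penalize a variable that does not improve fit enough to justify the extra parameter.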
Forward selection starts with no predictors and adds the most helpful ones.
Backward elimination starts with all predictors and removes the least helpful.
Stepwise does both—adds and removes as needed.
This helps us identify the most informative predictors without overfitting the model with unnecessary variables.
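The three search directions can be sketched with step() on a small simulated dataset. Everything here (the data frame sim, the predictors x1 to x3, and the coefficients used to simulate y) is hypothetical and chosen only so the example is self-contained.

```r
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Only x1 and x2 truly affect the outcome; x3 is pure noise.
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

null_fit <- glm(y ~ 1, data = sim, family = binomial)
full_fit <- glm(y ~ x1 + x2 + x3, data = sim, family = binomial)

# The scope tells step() the smallest and largest models it may visit.
search_scope <- list(lower = formula(null_fit), upper = formula(full_fit))

# Forward: start empty, add terms while AIC decreases.
forward_fit <- step(null_fit, scope = search_scope,
                    direction = "forward", trace = 0)

# Backward: start full, drop terms while AIC decreases.
backward_fit <- step(full_fit, direction = "backward", trace = 0)

# Both: at each step consider additions and deletions.
both_fit <- step(null_fit, scope = search_scope,
                 direction = "both", trace = 0)

formula(forward_fit)
formula(backward_fit)
formula(both_fit)
```

The three searches need not end at the same model in general, which is one reason to treat a stepwise result as a candidate model rather than a final answer.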
Read in the data and display the first 6 rows
set.seed(123)
head(df)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service DSL No Yes No
## 2 No DSL Yes No Yes
## 3 No DSL Yes Yes No
## 4 No phone service DSL Yes No Yes
## 5 No Fiber optic No No No
## 6 Yes Fiber optic No No Yes
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No No Month-to-month Yes
## 2 No No No One year No
## 3 No No No Month-to-month Yes
## 4 Yes No No One year No
## 5 No No No Month-to-month Yes
## 6 No Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 1 Electronic check 29.85 29.85 No
## 2 Mailed check 56.95 1889.50 No
## 3 Mailed check 53.85 108.15 Yes
## 4 Bank transfer (automatic) 42.30 1840.75 No
## 5 Electronic check 70.70 151.65 Yes
## 6 Electronic check 99.65 820.50 Yes
Create a summary of the data to ensure the variables are stored correctly.
summary(df)
## customerID gender SeniorCitizen Partner
## Length:7043 Length:7043 Min. :0.0000 Length:7043
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1621
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Dependents tenure PhoneService MultipleLines
## Length:7043 Min. : 0.00 Length:7043 Length:7043
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7043 Length:7043 Min. : 18.25 Min. : 18.8
## Class :character Class :character 1st Qu.: 35.50 1st Qu.: 401.4
## Mode :character Mode :character Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
## Churn
## Length:7043
## Class :character
## Mode :character
##
##
##
##
We need to convert character variables to factors:
# install.packages("dplyr")  # run once if dplyr is not installed
library(dplyr)
df <- df |>
mutate(across(where(is.character), as.factor))
summary(df)
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies Contract
## No :2810 No :2785 Month-to-month:3875
## No internet service:1526 No internet service:1526 One year :1473
## Yes :2707 Yes :2732 Two year :1695
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges
## No :2872 Bank transfer (automatic):1544 Min. : 18.25
## Yes:4171 Credit card (automatic) :1522 1st Qu.: 35.50
## Electronic check :2365 Median : 70.35
## Mailed check :1612 Mean : 64.76
## 3rd Qu.: 89.85
## Max. :118.75
##
## TotalCharges Churn
## Min. : 18.8 No :5174
## 1st Qu.: 401.4 Yes:1869
## Median :1397.5
## Mean :2283.3
## 3rd Qu.:3794.7
## Max. :8684.8
## NA's :11
Churn is the target variable and it is also a factor. We will set 'No' as the baseline (reference) level, so the model estimates the probability of Churn = 'Yes':
df$Churn <- relevel(df$Churn, ref = 'No')
head(df)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service DSL No Yes No
## 2 No DSL Yes No Yes
## 3 No DSL Yes Yes No
## 4 No phone service DSL Yes No Yes
## 5 No Fiber optic No No No
## 6 Yes Fiber optic No No Yes
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No No Month-to-month Yes
## 2 No No No One year No
## 3 No No No Month-to-month Yes
## 4 Yes No No One year No
## 5 No No No Month-to-month Yes
## 6 No Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 1 Electronic check 29.85 29.85 No
## 2 Mailed check 56.95 1889.50 No
## 3 Mailed check 53.85 108.15 Yes
## 4 Bank transfer (automatic) 42.30 1840.75 No
## 5 Electronic check 70.70 151.65 Yes
## 6 Electronic check 99.65 820.50 Yes
summary(df)
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies Contract
## No :2810 No :2785 Month-to-month:3875
## No internet service:1526 No internet service:1526 One year :1473
## Yes :2707 Yes :2732 Two year :1695
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges
## No :2872 Bank transfer (automatic):1544 Min. : 18.25
## Yes:4171 Credit card (automatic) :1522 1st Qu.: 35.50
## Electronic check :2365 Median : 70.35
## Mailed check :1612 Mean : 64.76
## 3rd Qu.: 89.85
## Max. :118.75
##
## TotalCharges Churn
## Min. : 18.8 No :5174
## 1st Qu.: 401.4 Yes:1869
## Median :1397.5
## Mean :2283.3
## 3rd Qu.:3794.7
## Max. :8684.8
## NA's :11
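Why the reference level matters can be seen in a tiny sketch. For a binomial glm() with a factor response, R treats the first level as "failure" and models the probability of the other level. The vector below is made up for illustration.

```r
# Toy response: 2 of 5 customers churn (hypothetical data)
churn <- factor(c("Yes", "No", "No", "Yes", "No"))
levels(churn)                 # factor levels sort alphabetically

# Make "No" the reference level explicitly (it already is here,
# but being explicit protects against other level orderings).
churn <- relevel(churn, ref = "No")

# Intercept-only model: the intercept is the log-odds of "Yes".
fit <- glm(churn ~ 1, family = binomial)
p_yes <- plogis(coef(fit)[[1]])   # convert log-odds back to a probability

p_yes
mean(churn == "Yes")              # same value: the observed churn rate
```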
Now, we will drop the Customer ID column and omit any missing values
df <- df |> select(-customerID)
df <- na.omit(df)
To begin the stepwise logistic regression process, we define two models:
A null model that includes only the intercept (no predictors), representing the baseline churn rate.
A full model that includes all available predictors.
These models set the boundaries for the stepwise algorithm to search for the best subset of variables by comparing improvements in model fit.
# Null model (intercept only)
null_model <- glm(Churn ~ 1, data = df, family = binomial)
# Full model (the . tells R to include all predictors)
full_model <- glm(Churn ~ ., data = df, family = binomial)
We can view the summary of each model before we start the stepwise regression:
summary(null_model)
##
## Call:
## glm(formula = Churn ~ 1, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.016 0.027 -37.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8143.4 on 7031 degrees of freedom
## Residual deviance: 8143.4 on 7031 degrees of freedom
## AIC: 8145.4
##
## Number of Fisher Scoring iterations: 4
summary(full_model)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = df)
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.165e+00 8.151e-01 1.430 0.15284
## genderMale -2.183e-02 6.480e-02 -0.337 0.73619
## SeniorCitizen 2.168e-01 8.453e-02 2.564 0.01033 *
## PartnerYes -3.840e-04 7.783e-02 -0.005 0.99606
## DependentsYes -1.485e-01 8.973e-02 -1.655 0.09796 .
## tenure -6.059e-02 6.236e-03 -9.716 < 2e-16 ***
## PhoneServiceYes 1.715e-01 6.487e-01 0.264 0.79153
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 4.484e-01 1.773e-01 2.530 0.01142 *
## InternetServiceFiber optic 1.747e+00 7.981e-01 2.190 0.02855 *
## InternetServiceNo -1.786e+00 8.073e-01 -2.213 0.02691 *
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes -2.054e-01 1.787e-01 -1.150 0.25031
## OnlineBackupNo internet service NA NA NA NA
## OnlineBackupYes 2.604e-02 1.754e-01 0.148 0.88197
## DeviceProtectionNo internet service NA NA NA NA
## DeviceProtectionYes 1.474e-01 1.764e-01 0.836 0.40339
## TechSupportNo internet service NA NA NA NA
## TechSupportYes -1.805e-01 1.806e-01 -0.999 0.31759
## StreamingTVNo internet service NA NA NA NA
## StreamingTVYes 5.905e-01 3.263e-01 1.810 0.07035 .
## StreamingMoviesNo internet service NA NA NA NA
## StreamingMoviesYes 5.993e-01 3.267e-01 1.834 0.06658 .
## ContractOne year -6.608e-01 1.076e-01 -6.142 8.15e-10 ***
## ContractTwo year -1.357e+00 1.764e-01 -7.691 1.46e-14 ***
## PaperlessBillingYes 3.424e-01 7.450e-02 4.596 4.31e-06 ***
## PaymentMethodCredit card (automatic) -8.779e-02 1.141e-01 -0.770 0.44156
## PaymentMethodElectronic check 3.045e-01 9.450e-02 3.222 0.00127 **
## PaymentMethodMailed check -5.759e-02 1.149e-01 -0.501 0.61627
## MonthlyCharges -4.034e-02 3.176e-02 -1.270 0.20392
## TotalCharges 3.289e-04 7.063e-05 4.657 3.20e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8143.4 on 7031 degrees of freedom
## Residual deviance: 5826.3 on 7008 degrees of freedom
## AIC: 5874.3
##
## Number of Fisher Scoring iterations: 6
The full logistic regression model includes all available predictors to estimate the probability of customer churn. Many of these predictors are categorical, and R automatically creates dummy variables to represent their levels (thanks R). For example:
genderMale represents the comparison between male customers and the reference group (female).
InternetServiceFiber optic compares customers with fiber optic service to those in the reference level (“DSL”). The “No” level has its own coefficient, InternetServiceNo, which is also measured against “DSL”.
Binary variables such as SeniorCitizen are included directly as numeric predictors.
An example interpretation: the coefficient for
SeniorCitizen is approximately 0.217. This indicates that,
holding all other variables constant, being a senior citizen is
associated with an increase in the log-odds of churn.
The associated p-value is 0.0103, suggesting the relationship is
statistically significant at the 5% level.
After exponentiating, \(e^{0.217} \approx 1.24\), meaning the odds of churn for senior citizens are 1.24 times the odds for non-senior customers. This translates to a 24% increase in the odds of churn, not a 124% increase, because: \(\text{Percent increase in odds} = (1.24 - 1) \times 100 = 24\%\). So, holding the other predictors constant, senior citizens have an estimated 24% higher odds of churn than non-seniors. This is a statement about odds, not about the probability of churning.
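The arithmetic can be checked in a few lines. The coefficient 0.217 comes from the output above; the 20% baseline churn probability is a hypothetical value chosen only to show how a change in odds maps to a change in probability.

```r
beta_senior <- 0.217                    # SeniorCitizen coefficient from the summary
odds_ratio <- exp(beta_senior)          # multiplicative change in the odds
pct_change_odds <- (odds_ratio - 1) * 100

# Hypothetical baseline: a non-senior customer with a 20% churn probability
p_base <- 0.20
odds_base <- p_base / (1 - p_base)      # probability -> odds
odds_senior <- odds_base * odds_ratio   # apply the odds ratio
p_senior <- odds_senior / (1 + odds_senior)  # odds -> probability

round(c(odds_ratio = odds_ratio,
        pct_change_odds = pct_change_odds,
        p_senior = p_senior), 3)
```

Under this assumed baseline, the churn probability rises by less than 24% in relative terms, which is why the interpretation is best phrased in terms of odds.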
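Some coefficients are listed as NA, which indicates that R detected perfect collinearity among the dummy variables. Here, every “No internet service” level flags exactly the same customers as InternetServiceNo, and the MultipleLines level “No phone service” flags the same customers as PhoneService = No, so these columns carry no additional information. R leaves the redundant coefficients undefined rather than estimating them.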
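The same behaviour can be reproduced on a small simulated dataset. The variables below (internet, security, y) are hypothetical stand-ins that mimic the structure of InternetService and OnlineSecurity; the outcome is random noise because only the design matrix matters here.

```r
set.seed(42)
n <- 300
internet <- factor(sample(c("DSL", "Fiber optic", "No"), n, replace = TRUE))

# Customers without internet can only have "No internet service" here,
# so this level duplicates the internet == "No" indicator exactly.
security <- factor(ifelse(internet == "No",
                          "No internet service",
                          sample(c("No", "Yes"), n, replace = TRUE)))

toy <- data.frame(y = rbinom(n, 1, 0.3), internet, security)
fit <- glm(y ~ internet + security, data = toy, family = binomial)

# The cross-tabulation shows the perfect overlap between the two levels.
table(toy$internet, toy$security)

# The later of the two identical columns is reported as NA.
coef(fit)
```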
What if One Level of a Categorical Variable Is Significant but Another Is Not?
This is a common and important situation in regression modeling.
Suppose you have a categorical variable like Contract with
three levels: Month-to-month (the reference level), One year, and Two year.
Your regression output might show:
| Predictor | Coefficient | p-value |
|---|---|---|
| ContractOne year | –0.50 | 0.08 |
| ContractTwo year | –1.20 | 0.001 |
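How to Interpret This:
- Each coefficient compares one level with the reference level (Month-to-month), not with the other listed level.
- In this illustration, Two year differs from Month-to-month at the 5% level (p = 0.001), while One year does not (p = 0.08).
- A non-significant level does not mean that Contract as a whole is uninformative.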
What to do?
Option A: Keep All Levels
- This is usually the best choice unless you have a specific reason to simplify.
- Retains the full structure of the categorical variable.
Option B: Collapse Categories
- If two levels behave similarly, you can recode them into one (e.g., group “One year” and “Two year” into “Long-term contract”).
- Simplifies the model and may make effects more detectable.
Option C: Use Regularization (like LASSO)
- With many categories or sparse data, regularization will shrink less useful coefficients (possibly to zero).
- Allows the model to decide which levels are worth keeping.
Summary
Even if only one level of a categorical variable is statistically significant, the variable as a whole may still be useful. Do not drop the entire variable just because some levels are not significant — this can remove meaningful structure and information from your model.
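One way to judge a categorical variable as a whole, rather than level by level, is a likelihood-ratio test on all of its dummy variables at once. The sketch below uses drop1() with test = "Chisq" on simulated data; the contract factor and its effect sizes are hypothetical.

```r
set.seed(7)
n <- 600
contract <- factor(sample(c("Month-to-month", "One year", "Two year"),
                          n, replace = TRUE))

# Simulated log-odds: a small One year effect and a larger Two year effect
eta <- -0.5 + c(0, -0.3, -1.2)[as.integer(contract)]
toy <- data.frame(y = rbinom(n, 1, plogis(eta)), contract)

fit <- glm(y ~ contract, data = toy, family = binomial)

# Per-level Wald tests: each level versus the reference level
summary(fit)$coefficients

# Whole-variable test: does dropping contract (2 df) worsen the fit?
whole_var <- drop1(fit, test = "Chisq")
whole_var
```

The Df column for contract is 2 because both dummy variables are removed together, which is also how step() treats a factor during the search.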
Finally, note the AIC value reported for the full model. In the next step, the stepwise procedure will attempt to reduce this AIC by selectively adding or removing predictors to improve model parsimony and performance.
Backward elimination with the step() function starts with the full model and at each step:
Evaluates the effect of removing each predictor,
Chooses the removal that gives the lowest AIC,
Stops when no single removal would lower the AIC any further.
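The loop described above can be written out by hand with drop1() and update(). This is a simplified sketch of the idea on simulated data (the data frame sim and its columns are hypothetical), not a copy of the internals of step(), which handles additional cases such as zero-df terms.

```r
set.seed(2)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# x1 and x2 drive the outcome; x3 and x4 are noise.
sim$y <- rbinom(n, 1, plogis(0.3 + 1.0 * sim$x1 - 0.9 * sim$x2))

fit <- glm(y ~ x1 + x2 + x3 + x4, data = sim, family = binomial)

repeat {
  # AIC of the current model ("<none>") and of each single-term deletion
  candidates <- drop1(fit)
  best <- rownames(candidates)[which.min(candidates$AIC)]
  if (best == "<none>") break   # no deletion lowers AIC: stop
  fit <- update(fit, as.formula(paste(". ~ . -", best)))
}

# Compare with the built-in backward search
reference <- step(glm(y ~ x1 + x2 + x3 + x4, data = sim, family = binomial),
                  direction = "backward", trace = 0)

formula(fit)
formula(reference)
```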
#we will use the option trace = 1 so that we can see each step
# if you do not want to see it, set trace = 0
backward_model <- step(full_model, direction = "backward", trace = 1)
## Start: AIC=5874.27
## Churn ~ gender + SeniorCitizen + Partner + Dependents + tenure +
## PhoneService + MultipleLines + InternetService + OnlineSecurity +
## OnlineBackup + DeviceProtection + TechSupport + StreamingTV +
## StreamingMovies + Contract + PaperlessBilling + PaymentMethod +
## MonthlyCharges + TotalCharges
##
##
## Step: AIC=5874.27
## Churn ~ gender + SeniorCitizen + Partner + Dependents + tenure +
## MultipleLines + InternetService + OnlineSecurity + OnlineBackup +
## DeviceProtection + TechSupport + StreamingTV + StreamingMovies +
## Contract + PaperlessBilling + PaymentMethod + MonthlyCharges +
## TotalCharges
##
## Df Deviance AIC
## - Partner 1 5826.3 5872.3
## - OnlineBackup 1 5826.3 5872.3
## - gender 1 5826.4 5872.4
## - DeviceProtection 1 5827.0 5873.0
## - TechSupport 1 5827.3 5873.3
## - OnlineSecurity 1 5827.6 5873.6
## - MonthlyCharges 1 5827.9 5873.9
## <none> 5826.3 5874.3
## - Dependents 1 5829.0 5875.0
## - StreamingTV 1 5829.6 5875.6
## - StreamingMovies 1 5829.6 5875.6
## - InternetService 1 5831.1 5877.1
## - SeniorCitizen 1 5832.8 5878.8
## - MultipleLines 2 5846.8 5890.8
## - PaperlessBilling 1 5847.5 5893.5
## - PaymentMethod 3 5852.5 5894.5
## - TotalCharges 1 5849.1 5895.1
## - Contract 2 5908.5 5952.5
## - tenure 1 5937.9 5983.9
##
## Step: AIC=5872.27
## Churn ~ gender + SeniorCitizen + Dependents + tenure + MultipleLines +
## InternetService + OnlineSecurity + OnlineBackup + DeviceProtection +
## TechSupport + StreamingTV + StreamingMovies + Contract +
## PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges
##
## Df Deviance AIC
## - OnlineBackup 1 5826.3 5870.3
## - gender 1 5826.4 5870.4
## - DeviceProtection 1 5827.0 5871.0
## - TechSupport 1 5827.3 5871.3
## - OnlineSecurity 1 5827.6 5871.6
## - MonthlyCharges 1 5827.9 5871.9
## <none> 5826.3 5872.3
## - StreamingTV 1 5829.6 5873.6
## - Dependents 1 5829.6 5873.6
## - StreamingMovies 1 5829.6 5873.6
## - InternetService 1 5831.1 5875.1
## - SeniorCitizen 1 5832.9 5876.9
## - MultipleLines 2 5846.8 5888.8
## - PaperlessBilling 1 5847.5 5891.5
## - PaymentMethod 3 5852.5 5892.5
## - TotalCharges 1 5849.1 5893.1
## - Contract 2 5908.5 5950.5
## - tenure 1 5938.7 5982.7
##
## Step: AIC=5870.29
## Churn ~ gender + SeniorCitizen + Dependents + tenure + MultipleLines +
## InternetService + OnlineSecurity + DeviceProtection + TechSupport +
## StreamingTV + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod + MonthlyCharges + TotalCharges
##
## Df Deviance AIC
## - gender 1 5826.4 5868.4
## - DeviceProtection 1 5827.7 5869.7
## <none> 5826.3 5870.3
## - TechSupport 1 5829.6 5871.6
## - Dependents 1 5829.6 5871.6
## - OnlineSecurity 1 5830.6 5872.6
## - MonthlyCharges 1 5832.9 5874.9
## - SeniorCitizen 1 5832.9 5874.9
## - StreamingTV 1 5838.0 5880.0
## - StreamingMovies 1 5838.6 5880.6
## - MultipleLines 2 5847.8 5887.8
## - InternetService 1 5847.3 5889.3
## - PaperlessBilling 1 5847.6 5889.6
## - PaymentMethod 3 5852.5 5890.5
## - TotalCharges 1 5849.2 5891.2
## - Contract 2 5908.6 5948.6
## - tenure 1 5938.7 5980.7
##
## Step: AIC=5868.41
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines +
## InternetService + OnlineSecurity + DeviceProtection + TechSupport +
## StreamingTV + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod + MonthlyCharges + TotalCharges
##
## Df Deviance AIC
## - DeviceProtection 1 5827.8 5867.8
## <none> 5826.4 5868.4
## - TechSupport 1 5829.7 5869.7
## - Dependents 1 5829.8 5869.8
## - OnlineSecurity 1 5830.7 5870.7
## - MonthlyCharges 1 5833.0 5873.0
## - SeniorCitizen 1 5833.0 5873.0
## - StreamingTV 1 5838.1 5878.1
## - StreamingMovies 1 5838.7 5878.7
## - MultipleLines 2 5847.9 5885.9
## - InternetService 1 5847.4 5887.4
## - PaperlessBilling 1 5847.7 5887.7
## - PaymentMethod 3 5852.6 5888.6
## - TotalCharges 1 5849.3 5889.3
## - Contract 2 5908.6 5946.6
## - tenure 1 5938.9 5978.9
##
## Step: AIC=5867.84
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines +
## InternetService + OnlineSecurity + TechSupport + StreamingTV +
## StreamingMovies + Contract + PaperlessBilling + PaymentMethod +
## MonthlyCharges + TotalCharges
##
## Df Deviance AIC
## <none> 5827.8 5867.8
## - Dependents 1 5831.2 5869.2
## - MonthlyCharges 1 5833.4 5871.4
## - TechSupport 1 5834.1 5872.1
## - SeniorCitizen 1 5834.5 5872.5
## - OnlineSecurity 1 5836.0 5874.0
## - StreamingTV 1 5838.7 5876.7
## - StreamingMovies 1 5839.3 5877.3
## - MultipleLines 2 5848.2 5884.2
## - PaperlessBilling 1 5848.9 5886.9
## - PaymentMethod 3 5853.8 5887.8
## - TotalCharges 1 5850.6 5888.6
## - InternetService 1 5852.5 5890.5
## - Contract 2 5909.0 5945.0
## - tenure 1 5940.6 5978.6
The final model, stored in the backward_model object, is the result of applying backward elimination to the full logistic regression model. Starting with all available predictors, the algorithm iteratively removed variables that did not contribute meaningfully to model performance, as judged by the Akaike Information Criterion (AIC).
The predictors retained in this model are those that collectively provide the best balance between model fit and complexity found by the search. Removing any single retained variable would raise the AIC, although that does not guarantee that each one is statistically significant on its own.
This model is more parsimonious than the full model, which lowers the risk of overfitting, and it keeps the predictors that help explain variation in the outcome variable.
You can now interpret the coefficients, assess model performance, or use it for prediction and model evaluation.
summary(backward_model)
##
## Call:
## glm(formula = Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines +
## InternetService + OnlineSecurity + TechSupport + StreamingTV +
## StreamingMovies + Contract + PaperlessBilling + PaymentMethod +
## MonthlyCharges + TotalCharges, family = binomial, data = df)
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.360e-01 5.152e-01 1.234 0.217094
## SeniorCitizen 2.167e-01 8.399e-02 2.580 0.009868 **
## DependentsYes -1.496e-01 8.141e-02 -1.838 0.066054 .
## tenure -6.060e-02 6.210e-03 -9.758 < 2e-16 ***
## MultipleLinesNo phone service 1.367e-01 2.424e-01 0.564 0.572692
## MultipleLinesYes 3.709e-01 9.472e-02 3.916 9.01e-05 ***
## InternetServiceFiber optic 1.367e+00 2.761e-01 4.953 7.32e-07 ***
## InternetServiceNo -1.403e+00 3.148e-01 -4.456 8.37e-06 ***
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes -2.827e-01 9.892e-02 -2.858 0.004266 **
## TechSupportNo internet service NA NA NA NA
## TechSupportYes -2.541e-01 1.021e-01 -2.490 0.012791 *
## StreamingTVNo internet service NA NA NA NA
## StreamingTVYes 4.452e-01 1.351e-01 3.294 0.000987 ***
## StreamingMoviesNo internet service NA NA NA NA
## StreamingMoviesYes 4.545e-01 1.344e-01 3.382 0.000719 ***
## ContractOne year -6.543e-01 1.074e-01 -6.092 1.11e-09 ***
## ContractTwo year -1.346e+00 1.762e-01 -7.640 2.17e-14 ***
## PaperlessBillingYes 3.409e-01 7.443e-02 4.580 4.64e-06 ***
## PaymentMethodCredit card (automatic) -8.769e-02 1.140e-01 -0.769 0.441714
## PaymentMethodElectronic check 3.024e-01 9.444e-02 3.202 0.001363 **
## PaymentMethodMailed check -5.878e-02 1.147e-01 -0.512 0.608461
## MonthlyCharges -2.501e-02 1.057e-02 -2.366 0.017962 *
## TotalCharges 3.281e-04 7.055e-05 4.652 3.29e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8143.4 on 7031 degrees of freedom
## Residual deviance: 5827.8 on 7012 degrees of freedom
## AIC: 5867.8
##
## Number of Fisher Scoring iterations: 6
formula(backward_model)
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines +
## InternetService + OnlineSecurity + TechSupport + StreamingTV +
## StreamingMovies + Contract + PaperlessBilling + PaymentMethod +
## MonthlyCharges + TotalCharges
The model predicts the probability that a customer churns based on a refined set of predictors selected through backward elimination. The note that 4 coefficients were not defined because of singularities is a separate matter: it reflects perfect collinearity among the remaining dummy variables (each “No internet service” level duplicates InternetServiceNo), not variables removed by the stepwise search and not perfect separation.
The residual deviance falls from 8143.4 for the null model to 5827.8 for the final model, and the AIC falls from 8145.4 to 5867.8, which suggests that the final model fits much better than the null model.
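The improvement over the null model can be checked with a likelihood-ratio test computed directly from the deviances and degrees of freedom printed in summary(backward_model). The sketch below only re-uses those reported numbers; with the fitted objects available, anova(null_model, backward_model, test = "Chisq") is the usual way to run the same comparison.

```r
# Values copied from the summary output above
null_deviance  <- 8143.4
null_df        <- 7031
resid_deviance <- 5827.8
resid_df       <- 7012

# Likelihood-ratio statistic: the drop in deviance
lr_stat <- null_deviance - resid_deviance
df_diff <- null_df - resid_df          # parameters added beyond the intercept

# Upper-tail chi-squared probability for the observed drop
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

c(lr_stat = lr_stat, df = df_diff, p_value = p_value)
```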
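Key predictors that are statistically significant at the 5% level and likely important for churn prediction include tenure, Contract (both levels), InternetService (both levels), PaperlessBilling, TotalCharges, MultipleLinesYes, StreamingTV, StreamingMovies, OnlineSecurity, TechSupport, SeniorCitizen, MonthlyCharges, and PaymentMethodElectronic check.
Other terms, such as DependentsYes (p = 0.066), PaymentMethodCredit card (automatic), PaymentMethodMailed check, and MultipleLinesNo phone service, were retained because removing their parent variable did not lower the AIC, even though their individual effects are not statistically significant at the 5% level.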
The final model:
This model is now ready for evaluation using prediction metrics or for comparison against models in Python with L1/L2 regularization.
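A minimal evaluation sketch in base R is shown below. It uses a simulated churn-like dataset (sim, x1, x2 and the 0.5 cutoff are all assumptions for illustration) and evaluates on the same data used for fitting, which tends to be optimistic; a held-out test set would be the more honest choice.

```r
set.seed(3)
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$churn <- factor(ifelse(rbinom(n, 1, plogis(-1 + 1.5 * sim$x1)) == 1,
                           "Yes", "No"),
                    levels = c("No", "Yes"))

fit <- glm(churn ~ x1 + x2, data = sim, family = binomial)

# Predicted probability of the non-reference level ("Yes")
prob_yes <- predict(fit, type = "response")

# Classify with a 0.5 cutoff and build a confusion matrix
pred <- factor(ifelse(prob_yes > 0.5, "Yes", "No"), levels = c("No", "Yes"))
conf_mat <- table(Predicted = pred, Actual = sim$churn)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# AUC from the rank-sum (Mann-Whitney) identity, no extra packages needed
is_yes <- sim$churn == "Yes"
n_yes <- sum(is_yes)
n_no <- sum(!is_yes)
auc <- (sum(rank(prob_yes)[is_yes]) - n_yes * (n_yes + 1) / 2) / (n_yes * n_no)

conf_mat
c(accuracy = accuracy, auc = auc)
```

With the real data, replacing fit by backward_model and sim by df would apply the same steps to the churn model.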
Comparing Logistic Regression Models Across R and Python
Your task is to compare the logistic regression model created
in R using backward elimination with logistic regression models
built in Python using the scikit-learn
library. You will:
- Tune the regularization strength (C in LogisticRegression, or Cs in LogisticRegressionCV).

You may use tools such as:
- LogisticRegressionCV for automatic hyperparameter tuning,
- Pipeline and StandardScaler to scale your features (especially important for regularized models),
- classification_report and roc_auc_score for performance metrics.
This task assesses your ability to apply statistical modeling techniques, evaluate models across platforms, and interpret regularization in practice.
Addressing Class Imbalance
In this dataset, the Churn variable is imbalanced,
meaning that the majority of customers do not churn. This can bias the
logistic regression model and lead to misleading performance
metrics.
Your task is to:
You can use the following R code to oversample the minority class:
# Install if necessary (run once)
# install.packages("ROSE") # Or use "DMwR" for SMOTE (archived on CRAN; "smotefamily" is an alternative)
library(ROSE)
# Create a balanced dataset using random oversampling
balanced_df <- ovun.sample(Churn ~ ., data = df, method = "over", seed = 123)$data
# Proceed with the same modeling steps on balanced_df:
# - full_model <- glm(..., data = balanced_df, ...)
# - backward_model <- step(...)
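To show what random oversampling does conceptually, here is a base-R sketch on a made-up data frame (toy, with 80 "No" and 20 "Yes" rows). It illustrates the idea of resampling minority rows with replacement and is not a reproduction of how ROSE implements ovun.sample().

```r
set.seed(123)
# Hypothetical imbalanced data: 80 non-churners, 20 churners
toy <- data.frame(x = rnorm(100),
                  Churn = factor(rep(c("No", "Yes"), times = c(80, 20))))

minority_rows <- which(toy$Churn == "Yes")
majority_rows <- which(toy$Churn == "No")

# Draw extra minority rows (with replacement) until the classes match
n_extra <- length(majority_rows) - length(minority_rows)
extra_rows <- sample(minority_rows, n_extra, replace = TRUE)

balanced_toy <- toy[c(majority_rows, minority_rows, extra_rows), ]

table(toy$Churn)           # before balancing
table(balanced_toy$Churn)  # after balancing
```

Because the added rows are exact copies, oversampling is normally applied to the training portion only, so that duplicated customers do not appear in both the training and evaluation data.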