Logistic Regression


1. Introduction to Logistic Regression

Logistic regression, while not requiring the strict assumptions of linear regression, still operates under several key assumptions to ensure valid estimation and interpretation of results.


1. The Dependent Variable is Binary

Why It Matters: Logistic regression models the probability of one of two outcomes. If the outcome variable is not binary, such as a nominal or ordinal variable with more than two categories, multinomial or ordinal logistic regression is more appropriate.


2. Independence of Observations

Why It Matters: Violation of this assumption, such as in repeated measures or clustered data, can lead to biased standard errors and incorrect inferences.

Solutions for Violations: - Use techniques like generalized estimating equations (GEE) or mixed-effects logistic regression for dependent data.
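
As a minimal sketch (not run here), a mixed-effects logistic model with the lme4 package might look like this, assuming a hypothetical data frame df with binary outcome y, predictors x1 and x2, and a clustering variable cluster_id:

# Random-intercept logistic regression for clustered data (hypothetical names)
library(lme4)
m_mixed <- glmer(y ~ x1 + x2 + (1 | cluster_id), data = df, family = binomial)
summary(m_mixed)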


3. Linearity of the Logit

Why It Matters: If the relationship between predictors and the logit is non-linear, the model might produce biased estimates.

How to Check: - Test for non-linearity with a Box-Tidwell-style check, adding an x·log(x) term for each continuous predictor (see the sketch below). - Use polynomial terms, splines, or transformations to model non-linear relationships.
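
A minimal sketch of the Box-Tidwell-style check, assuming a hypothetical data frame df with binary outcome y and a strictly positive continuous predictor x:

# A significant I(x * log(x)) term suggests the logit is non-linear in x
m_bt <- glm(y ~ x + I(x * log(x)), data = df, family = binomial)
summary(m_bt)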


4. No Perfect Multicollinearity

Why It Matters: Multicollinearity inflates the standard errors of coefficients, leading to unreliable estimates.

How to Check: - Compute the Variance Inflation Factor (VIF) for each predictor \(j\): \[ \text{VIF}_j = \frac{1}{1 - R_j^2} \] where \(R_j^2\) is the \(R^2\) from regressing predictor \(j\) on the remaining predictors. - A VIF > 5 (or 10) suggests potential multicollinearity.

Solutions for Violations: - Remove or combine collinear predictors.


5. No Omitted Variables (Specification Bias)

Why It Matters: Leaving out relevant predictors can bias the estimated coefficients of the variables that are included.

How to Address: - Use domain knowledge to guide variable selection. - Compare candidate models using measures like AIC or BIC, as sketched below.
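
A minimal sketch of such a comparison, using the Default data introduced later in this section:

# Compare nested specifications; lower AIC/BIC indicates a better trade-off
# between fit and complexity
m1 <- glm(default ~ balance, data = Default, family = "binomial")
m2 <- glm(default ~ balance + income, data = Default, family = "binomial")
AIC(m1, m2)
BIC(m1, m2)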


6. No Outliers or High-Leverage Points

Why It Matters: Outliers can distort coefficients, leading to misleading results.

How to Check: - Examine deviance residuals or Cook’s distance to identify potential outliers. - Use influence plots to assess high-leverage points.

Solutions for Violations: - Remove or adjust for outliers.


7. Sufficient Sample Size

Why It Matters: Small sample sizes can lead to overfitting or unstable coefficient estimates.

Solutions for Violations: - Combine categories or predictors to reduce model complexity. - Use penalized regression methods for small datasets.
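
A minimal sketch of penalized (ridge) logistic regression, using the Default data introduced later in this section and assuming the glmnet package is installed:

# Ridge-penalized logistic regression; alpha = 1 would give the lasso instead
library(glmnet)
X <- model.matrix(default ~ balance + income, data = Default)[, -1]
cv_fit <- cv.glmnet(X, Default$default, family = "binomial", alpha = 0)
coef(cv_fit, s = "lambda.min")  # coefficients at the CV-chosen penalty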


8. Homoscedasticity (Variance Assumption)

Why It Matters: Unlike linear regression, logistic regression does not assume constant error variance; the variance of a binary outcome is determined by its mean, \(p(1 - p)\). Raw residual plots are therefore hard to read for binary data.

How to Check: - Prefer binned residual plots, which average residuals within groups of similar fitted values, over raw residual plots.
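
A minimal sketch of a binned residual plot, assuming the arm package is installed and a fitted binomial glm object named model (fitted later in this section):

# Binned residual plot: averages residuals within bins of fitted values
library(arm)
binnedplot(fitted(model), residuals(model, type = "response"))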


9. Independent Errors

Why It Matters: Correlated errors violate the assumptions of the model and can result in inefficient estimates.

Solutions for Violations: - Use robust standard errors or a clustered sandwich estimator to account for correlated errors.
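
A minimal sketch of cluster-robust standard errors with the sandwich and lmtest packages, assuming a fitted model m and a hypothetical clustering variable cluster_id:

# Re-test coefficients using a clustered sandwich variance estimator
library(sandwich)
library(lmtest)
coeftest(m, vcov = vcovCL(m, cluster = ~ cluster_id))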




Step-by-Step Explanation of Logistic Regression

Step 1: Load the Dataset

# Load necessary library
library(ISLR)  # to load the Default dataset

# Load the Default dataset
data(Default)

# Inspect the structure of the dataset
str(Default)
## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
# Ensure the outcome is a factor (it already is here, so this is a safeguard)
Default$default <- as.factor(Default$default)
# Check for the balance between the outcome categories
table(Default$default)
## 
##   No  Yes 
## 9667  333

In the Default dataset, the term default refers to whether an individual has failed to repay their credit card debt. It is a binary variable indicating the following:

  • default = "Yes": The individual has defaulted on their credit card payment obligations.
  • default = "No": The individual has not defaulted and is meeting their payment obligations.

This variable serves as the dependent variable in the logistic regression model, where we aim to estimate the factors (e.g., balance, income) that affect the likelihood of defaulting.

The predictors used in the model below are:

  • balance (credit card balance)
  • income (annual income)

Step 2: Fit the Logistic Regression Model

Fit a logistic regression model to estimate the effects of balance (credit card balance) and income on the likelihood of default:

# Fit the logistic regression model
model <- glm(default ~ balance + income, data = Default, family = "binomial")

# View the summary of the model
summary(model)
## 
## Call:
## glm(formula = default ~ balance + income, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Step 3: Interpretation

Model Summary

The model estimates the likelihood of default (default), i.e., the probability that a person will fail to meet their payment obligations, based on two predictors: balance and income.

Statistical Significance:

  • P-values for all predictors are extremely small (less than \(0.001\)), indicating that both predictors significantly influence the likelihood of default.

1. Intercept: -11.54

To calculate the probability of default when balance and income are 0:

probability = 1 / (1 + exp(-(-11.54))) 
probability  # approximately 0
## [1] 9.732792e-06

The probability of default when both balance and income are zero is virtually 0.


2. Balance: 0.005647

To interpret this in terms of odds ratio:

odds_ratio_balance = exp(0.005647) 
odds_ratio_balance #≈ 1.0057
## [1] 1.005663

Real-world Interpretation:

  • If someone’s credit card balance increases by $100, the log-odds of defaulting increase by 0.5647, so the odds of defaulting multiply by exp(0.5647) ≈ 1.76, an increase of approximately 76% (see the check below).
  • This means that, as expected, a higher credit card balance increases the likelihood of default, which aligns with intuition: larger balances may indicate higher financial strain.

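A quick check of this arithmetic:

# Odds ratio for a $100 increase in balance
exp(0.005647 * 100)  # about 1.76, i.e., roughly a 76% increase in the odds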


3. Income: 0.00002081

To interpret this in terms of odds ratio:

odds_ratio_income = exp(0.00002081)
odds_ratio_income #≈ 1.00002081
## [1] 1.000021

Real-world Interpretation:

  • The effect of income on default is extremely small per dollar: even a large increase in income changes the probability of default only minimally.
  • Note that the coefficient is positive, so higher income is associated with slightly higher odds of default in this model, not lower. Its effect is far less pronounced than that of balance, possibly because factors such as spending patterns or existing debt outweigh income in influencing default decisions.

Summary of Coefficients:

  • Intercept: -11.54 (baseline log-odds when both predictors are 0)
  • balance: 0.005647 (odds ratio ≈ 1.0057 per dollar)
  • income: 0.00002081 (odds ratio ≈ 1.00002 per dollar)


How the Log-Odds Translate to Probability:

In logistic regression, we estimate the log-odds of the outcome, but we can convert this to the probability using the logistic function.

Formula:

\[ \text{Probability of default} = \frac{1}{1 + \exp(-( \text{Intercept} + \text{balance coefficient} \times \text{balance} + \text{income coefficient} \times \text{income}))} \]

For example, if a person has: - balance = 2000 - income = 50000

We can calculate the probability of default as follows:

log_odds = -11.54 + 0.005647 * 2000 + 0.00002081 * 50000
probability = 1 / (1 + exp(-log_odds))
probability
## [1] 0.6887968

This will give us the probability of that person defaulting based on the model’s estimates.
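
The same probability can be obtained with predict(), which is the more idiomatic approach; a small sketch with a hypothetical new observation:

# Predicted probability of default for a new person (hypothetical values)
new_person <- data.frame(balance = 2000, income = 50000)
predict(model, newdata = new_person, type = "response")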


Final Notes:


Model Fit

Deviance:

  • Null Deviance: \(2920.6\) (model without predictors). This is the deviance we would expect with no explanatory variables in the model.
  • Residual Deviance: \(1579.0\) (model with predictors).
    • A significant reduction in deviance shows that the predictors (balance and income) greatly improve the model.

AIC (Akaike Information Criterion): \(1585\)

  • A lower AIC value indicates a better-fitting model compared to other models fitted to the same data.

Overall Fit: - The large reduction in deviance and low p-values indicate that the model is effective at explaining the likelihood of default based on balance and income.


Step 4: Evaluate Model Fit

Null vs. Residual Deviance:
anova(model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: default
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                     9999     2920.7              
## balance  1  1324.20      9998     1596.5 < 2.2e-16 ***
## income   1    17.49      9997     1579.0 2.895e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Analysis of Deviance table provides insights into the statistical significance of each predictor in a logistic regression model by comparing the deviance of the model at each step. Here’s a breakdown of the output:

Understanding the Table Columns:

  • Df: the degrees of freedom used by the added term.
  • Deviance: the reduction in deviance achieved by adding that term.
  • Resid. Df / Resid. Dev: the degrees of freedom and deviance remaining after the term is added.
  • Pr(>Chi): the p-value of a chi-squared test on the deviance reduction.

Interpretation of the Output:

1. NULL Model (No predictors):

  • Residual Deviance: 2920.7
    • This is the deviance of the null model, which includes no predictors (only the intercept).
    • This is the baseline measure of model fit, meaning the deviance when we have no explanatory variables.

2. Adding the balance variable:

  • Deviance (reduction): 1324.20
  • Residual Df: 9998
  • Residual Dev: 1596.5
  • p-value: \(< 2.2 \times 10^{-16}\)
    • The addition of balance reduces the deviance significantly from 2920.7 to 1596.5. This shows that balance is a very important predictor in the model and improves the model fit.
    • The p-value for balance is very small (\(< 2.2 \times 10^{-16}\)), which indicates that balance is highly significant in predicting the likelihood of default. We can conclude that balance has a very strong influence on the outcome.

3. Adding the income variable:

  • Deviance (reduction): 17.49
  • Residual Df: 9997
  • Residual Dev: 1579.0
  • p-value: \(2.895 \times 10^{-5}\)
    • Adding income further reduces the deviance from 1596.5 to 1579.0, showing that income also contributes to improving the model fit, though to a lesser extent than balance.
    • The p-value for income is \(2.895 \times 10^{-5}\), which is still quite small and indicates that income is statistically significant in predicting default, but its effect is less strong than balance.

McFadden’s Pseudo R²:
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
pR2(model)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
##  -789.4831351 -1460.3248557  1341.6834411     0.4593784     0.1255572 
##          r2CU 
##     0.4957247
  • For all of these pseudo-R² measures, values closer to 1 indicate better model fit.

The Pseudo-R² values are used to assess the fit of logistic regression models, but they are not interpreted in the same way as R² in linear regression. Since logistic regression does not have a direct R², various pseudo-R² measures can be used to compare models or assess model fit. Let’s break down the output and interpret each value:


Interpreting the values:

1. Log-Likelihood (llh and llhNull):

  • llh: \(-789.4831\)
    • This is the log-likelihood of the fitted model, which includes the predictors (balance and income).
    • A higher (less negative) log-likelihood suggests a better model fit.
  • llhNull: \(-1460.3249\)
    • This is the log-likelihood of the null model (model with only the intercept, no predictors).
    • A significant difference between the log-likelihood of the null and the fitted model indicates that the fitted model is much better at explaining the data than the null model.

2. G² (Likelihood Ratio Test Statistic):

  • G²: \(1341.6834\)
    • The G² statistic tests whether the addition of the predictors (balance and income) significantly improves the model fit compared to the null model.
    • This large value suggests a significant improvement of the fitted model over the null model, indicating that both balance and income are important predictors.

Pseudo-R² Measures:

3. McFadden’s R²:

  • McFadden’s R²: \(0.4594\)
    • McFadden’s R² is one of the most commonly used pseudo-R² measures. It is calculated as: \[ R^2 = 1 - \frac{\text{llh}}{\text{llhNull}} \]
    • Interpretation: A McFadden’s R² of 0.4594 means the fitted model improves on the null model’s log-likelihood by about 46%. This is not “variance explained” as in linear regression; for McFadden’s R², values between 0.2 and 0.4 already represent an excellent fit, so 0.4594 indicates a very good fit (a quick check of the value appears below).
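
A quick verification of McFadden’s R² from the log-likelihoods reported above:

# McFadden's R² = 1 - llh / llhNull
1 - (-789.4831 / -1460.3249)  # about 0.4594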

Summary:

Overall, the pseudo-R² values suggest that the model performs well and provides a good explanation of the likelihood of default; among the reported measures, r2ML (Cox and Snell) is the most conservative, while McFadden’s R² and r2CU (Nagelkerke) both indicate a strong fit.


Step 5: Proportionate Change in Odds

To make the coefficients more interpretable, calculate the proportional change in odds for significant predictors:

# Compute odds ratios and proportional changes
odds_ratios <- exp(coef(model))
proportional_changes <- (odds_ratios - 1) * 100
proportional_changes
##   (Intercept)       balance        income 
## -99.999027167   0.566307789   0.002080919

Interpretation: - For every 1-unit ($1) increase in balance, the odds of default increase by about 0.57%. - For every 1-unit ($1) increase in income, the odds of default increase by about 0.002%; the coefficient is positive, but the per-dollar effect is negligible.


Step 6: Discuss Limitations of Estimation

  • Scale of Predictors: The effects of income appear small due to its large scale (e.g., thousands of dollars). Scaling predictors might improve interpretability.
  • Linearity of the Logit: Logistic regression assumes the relationship between the predictors and the log-odds is linear. Nonlinear effects could require transformations or more advanced models.

Scaling Predictors for Better Estimation

Why Scale Predictors?

When predictors have vastly different scales, logistic regression coefficients may become difficult to interpret, and the fitting process might encounter numerical stability issues. Scaling predictors standardizes their ranges, improving the interpretability of coefficients in terms of relative effect sizes.

For example: - In this model, balance is measured in dollars, while income is also in dollars but spans a much larger range. - The difference in scales makes it hard to compare their effects directly or to interpret the coefficients side by side.

How to Scale Predictors in R?

One common approach is z-score scaling, where predictors are transformed to have a mean of 0 and a standard deviation of 1. This can be done using the scale() function:

# Scale the predictors
Default$balance_scaled <- scale(Default$balance)
Default$income_scaled <- scale(Default$income)
summary(Default$balance_scaled)
##        V1          
##  Min.   :-1.72700  
##  1st Qu.:-0.73110  
##  Median :-0.02427  
##  Mean   : 0.00000  
##  3rd Qu.: 0.68415  
##  Max.   : 3.76037
sd(Default$balance_scaled)
## [1] 1
summary(Default$income_scaled)
##        V1          
##  Min.   :-2.45527  
##  1st Qu.:-0.91301  
##  Median : 0.07766  
##  Mean   : 0.00000  
##  3rd Qu.: 0.77161  
##  Max.   : 3.00205
sd(Default$income_scaled)
## [1] 1
# Fit the logistic regression model with scaled predictors
model_scaled <- glm(default ~ balance_scaled + income_scaled, 
                    family = "binomial", data = Default)

summary(model_scaled)
## 
## Call:
## glm(formula = default ~ balance_scaled + income_scaled, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.12557    0.18756 -32.659  < 2e-16 ***
## balance_scaled  2.73159    0.10998  24.836  < 2e-16 ***
## income_scaled   0.27752    0.06649   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Interpreting the Scaled Coefficients

The results of this logistic regression model estimate the relationship between the probability of default and two predictors: scaled balance (balance_scaled) and scaled income (income_scaled). Both predictors were standardized, meaning their values were scaled to have a mean of 0 and a standard deviation of 1, to improve interpretability and comparability.


Interpretation of the Coefficients:

  1. Intercept (-6.12557):
    • The log-odds of defaulting when both balance_scaled and income_scaled are at their mean (i.e., 0 after scaling) is -6.12557.
    • The corresponding probability of defaulting is very low, as: \[ \text{Probability} = \frac{e^{-6.12557}}{1 + e^{-6.12557}} \approx 0.00218 \, (0.22\%). \]

  2. Balance (balance_scaled: 2.73159):
    • For every 1 standard deviation increase in the scaled balance, the log-odds of defaulting increase by 2.73159.
    • Translating this to odds: \[ \text{Odds Ratio} = e^{2.73159} \approx 15.35. \] This means that for every 1 standard deviation increase in balance, the odds of defaulting multiply by approximately 15.35, roughly a 15-fold increase.

  3. Income (income_scaled: 0.27752):
    • For every 1 standard deviation increase in the scaled income, the log-odds of defaulting increase by 0.27752.
    • Translating this to odds: \[ \text{Odds Ratio} = e^{0.27752} \approx 1.32. \] This indicates that for every 1 standard deviation increase in income, the odds of defaulting increase by about 32%. This is a smaller effect compared to balance_scaled.
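
As a consistency check (not part of the original output), each scaled coefficient should equal the corresponding unscaled coefficient multiplied by that predictor’s standard deviation:

# Scaled coefficient = unscaled coefficient * predictor standard deviation
coef(model)["balance"] * sd(Default$balance)  # about 2.73, matches balance_scaled
coef(model)["income"] * sd(Default$income)    # about 0.28, matches income_scaled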

Statistical Significance:

  • Both balance_scaled and income_scaled remain highly significant (p < 0.001); z-score scaling changes the units of the coefficients but not their z-values or p-values.

Model Fit:

  • The null deviance (2920.6), residual deviance (1579.0), and AIC (1585) are identical to those of the unscaled model, since scaling is a linear transformation of the predictors.

Key Insights:

  1. Scaled Balance is the most influential predictor of default. Larger balances (relative to the mean and standard deviation of the dataset) drastically increase the likelihood of default.
  2. Scaled Income has a smaller but still significant effect, with higher income increasing the odds of default slightly.
  3. The intercept suggests that, on average, default probabilities are very low when predictors are at their mean levels.

Diagnostics for Logistic Regression

To evaluate your model’s performance and ensure the assumptions are met, consider the following diagnostic tools:

1. Goodness-of-Fit Tests

  • Hosmer-Lemeshow Test: Tests whether the predicted probabilities match the observed outcomes. A p-value > 0.05 suggests no significant lack of fit.
library(ResourceSelection)
## ResourceSelection 0.3-6   2023-06-27
# hoslem.test() expects a numeric 0/1 outcome; passing the factor directly
# triggers the warning "'-' not meaningful for factors" and returns NA
default_numeric <- as.numeric(Default$default) - 1  # "No" -> 0, "Yes" -> 1
hoslem.test(default_numeric, fitted(model_scaled))
hoslem.test(default_numeric, fitted(model))

With this correction the test returns a chi-squared statistic and p-value. Note that the outcome here is heavily imbalanced (far more “No” than “Yes” cases), so the test’s grouping should be interpreted with caution, although the test is commonly used in this setting.

2. Residual Analysis

  • Deviance residuals can be plotted to check for outliers or poorly predicted cases.
plot(residuals(model_scaled, type = "deviance"))

- Look for residuals that are extremely large, as they may indicate potential outliers.

3. Influence Measures

  • Identify influential data points that may disproportionately affect the model.
library(car)
## Loading required package: carData
influencePlot(model_scaled)  # plot(model, which = 4) does not work here

##         StudRes          Hat       CookD
## 3249  0.9961308 7.942615e-03 0.001716759
## 4160  3.7308471 4.638519e-05 0.015887189
## 5371 -0.7897206 9.125176e-03 0.001126548
## 9539  3.7207714 5.096523e-05 0.016789495
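
To look at the flagged observations directly, one can index the data frame with the row numbers reported above:

# Inspect the rows flagged as influential or high-leverage
Default[c(3249, 4160, 5371, 9539), c("default", "balance", "income")]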

4. Multicollinearity Check

  • Multicollinearity among predictors can distort the estimation of coefficients.
  • Use Variance Inflation Factor (VIF) to assess multicollinearity.
library(car)
vif(model_scaled)
## balance_scaled  income_scaled 
##       1.045605       1.045605
  • A VIF > 5 (or 10) indicates potential multicollinearity.