Logistic Regression


1. Introduction to Logistic Regression

Logistic regression, while not requiring the strict assumptions of linear regression, still operates under several key assumptions to ensure valid estimation and interpretation of results.


1. The Dependent Variable is Binary

Why It Matters: Logistic regression models the probability of one of two outcomes. If the outcome variable is not binary, such as a nominal or ordinal variable with more than two categories, multinomial or ordinal logistic regression is more appropriate.


2. Independence of Observations

Why It Matters: Violation of this assumption, such as in repeated measures or clustered data, can lead to biased standard errors and incorrect inferences.

Solutions for Violations: - Use techniques like generalized estimating equations (GEE) or mixed-effects logistic regression for dependent data.
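
As a minimal sketch (not run here), a mixed-effects logistic model with the lme4 package might look like this, assuming a hypothetical data frame df with binary outcome y, predictors x1 and x2, and a clustering variable cluster_id:

# Random-intercept logistic regression for clustered data (hypothetical names)
library(lme4)
m_mixed <- glmer(y ~ x1 + x2 + (1 | cluster_id), data = df, family = binomial)
summary(m_mixed)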


3. Linearity of the Logit

Why It Matters: If the relationship between predictors and the logit is non-linear, the model might produce biased estimates.

How to Check: - Test for non-linearity with a Box-Tidwell-style check, adding an x·log(x) term for each continuous predictor (see the sketch below). - Use polynomial terms, splines, or transformations to model non-linear relationships.
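
A minimal sketch of the Box-Tidwell-style check, assuming a hypothetical data frame df with binary outcome y and a strictly positive continuous predictor x:

# A significant I(x * log(x)) term suggests the logit is non-linear in x
m_bt <- glm(y ~ x + I(x * log(x)), data = df, family = binomial)
summary(m_bt)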


4. No Perfect Multicollinearity

Why It Matters: Multicollinearity inflates the standard errors of coefficients, leading to unreliable estimates.

How to Check: - Compute the Variance Inflation Factor (VIF) for each predictor \(j\): \[ \text{VIF}_j = \frac{1}{1 - R_j^2} \] where \(R_j^2\) is the \(R^2\) from regressing predictor \(j\) on the remaining predictors. - A VIF > 5 (or 10) suggests potential multicollinearity.

Solutions for Violations: - Remove or combine collinear predictors.


5. No Omitted Variables (Specification Bias)

Why It Matters: Leaving out relevant predictors can bias the estimated coefficients of the variables that are included.

How to Address: - Use domain knowledge to guide variable selection. - Compare candidate models using measures like AIC or BIC, as sketched below.
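
A minimal sketch of such a comparison, using the Default data introduced later in this section:

# Compare nested specifications; lower AIC/BIC indicates a better trade-off
# between fit and complexity
m1 <- glm(default ~ balance, data = Default, family = "binomial")
m2 <- glm(default ~ balance + income, data = Default, family = "binomial")
AIC(m1, m2)
BIC(m1, m2)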


6. No Outliers or High-Leverage Points

Why It Matters: Outliers can distort coefficients, leading to misleading results.

How to Check: - Examine deviance residuals or Cook’s distance to identify potential outliers. - Use influence plots to assess high-leverage points.

Solutions for Violations: - Remove or adjust for outliers.


7. Sufficient Sample Size

Why It Matters: Small sample sizes can lead to overfitting or unstable coefficient estimates.

Solutions for Violations: - Combine categories or predictors to reduce model complexity. - Use penalized regression methods for small datasets.
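
A minimal sketch of penalized (ridge) logistic regression, using the Default data introduced later in this section and assuming the glmnet package is installed:

# Ridge-penalized logistic regression; alpha = 1 would give the lasso instead
library(glmnet)
X <- model.matrix(default ~ balance + income, data = Default)[, -1]
cv_fit <- cv.glmnet(X, Default$default, family = "binomial", alpha = 0)
coef(cv_fit, s = "lambda.min")  # coefficients at the CV-chosen penalty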


8. Homoscedasticity (Variance Assumption)

Why It Matters: Unlike linear regression, logistic regression does not assume constant error variance; the variance of a binary outcome is determined by its mean, \(p(1 - p)\). Raw residual plots are therefore hard to read for binary data.

How to Check: - Prefer binned residual plots, which average residuals within groups of similar fitted values, over raw residual plots.
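
A minimal sketch of a binned residual plot, assuming the arm package is installed and a fitted binomial glm object named model (fitted later in this section):

# Binned residual plot: averages residuals within bins of fitted values
library(arm)
binnedplot(fitted(model), residuals(model, type = "response"))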


9. Independent Errors

Why It Matters: Correlated errors violate the assumptions of the model and can result in inefficient estimates.

Solutions for Violations: - Use robust standard errors or a clustered sandwich estimator to account for correlated errors.
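
A minimal sketch of cluster-robust standard errors with the sandwich and lmtest packages, assuming a fitted model m and a hypothetical clustering variable cluster_id:

# Re-test coefficients using a clustered sandwich variance estimator
library(sandwich)
library(lmtest)
coeftest(m, vcov = vcovCL(m, cluster = ~ cluster_id))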




Step-by-Step Explanation of Logistic Regression

Step 1: Load the Dataset

# Load necessary library
library(ISLR)  # to load the Default dataset

# Load the Default dataset
data(Default)

# Inspect the structure of the dataset
str(Default)
## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
# Ensure the outcome is a factor (it already is here, so this is a safeguard)
Default$default <- as.factor(Default$default)
# Check for the balance between the outcome categories
table(Default$default)
## 
##   No  Yes 
## 9667  333

In the Default dataset, the term default refers to whether an individual has failed to repay their credit card debt. It is a binary variable indicating the following:

  • default = "Yes": The individual has defaulted on their credit card payment obligations.
  • default = "No": The individual has not defaulted and is meeting their payment obligations.

This variable serves as the dependent variable in the logistic regression model, where we aim to estimate the factors (e.g., balance, income) that affect the likelihood of defaulting.

The predictors used in the model below are:

  • balance (credit card balance)
  • income (annual income)

Step 2: Fit the Logistic Regression Model

Fit a logistic regression model to estimate the effects of balance (credit card balance) and income on the likelihood of default:

# Fit the logistic regression model
model <- glm(default ~ balance + income, data = Default, family = "binomial")

# View the summary of the model
summary(model)
## 
## Call:
## glm(formula = default ~ balance + income, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Step 3: Interpretation

Model Summary

The model estimates the likelihood of default (default), i.e., the probability that a person will fail to meet their payment obligations, based on two predictors: balance and income.

Statistical Significance:

  • P-values for all predictors are extremely small (less than \(0.001\)), indicating that both predictors significantly influence the likelihood of default.

1. Intercept: -11.54

To calculate the probability of default when balance and income are 0:

probability = 1 / (1 + exp(-(-11.54))) 
probability  # approximately 0
## [1] 9.732792e-06

The probability of default when both balance and income are zero is virtually 0.


2. Balance: 0.005647

To interpret this in terms of odds ratio:

odds_ratio_balance = exp(0.005647) 
odds_ratio_balance #≈ 1.0057
## [1] 1.005663

Real-world Interpretation:

  • If someone’s credit card balance increases by $100, the log-odds of defaulting increase by 0.5647, so the odds of defaulting multiply by exp(0.5647) ≈ 1.76, an increase of approximately 76% (see the check below).
  • This means that, as expected, a higher credit card balance increases the likelihood of default, which aligns with intuition: larger balances may indicate higher financial strain.

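A quick check of this arithmetic:

# Odds ratio for a $100 increase in balance
exp(0.005647 * 100)  # about 1.76, i.e., roughly a 76% increase in the odds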


3. Income: 0.00002081

To interpret this in terms of odds ratio:

odds_ratio_income = exp(0.00002081)
odds_ratio_income #≈ 1.00002081
## [1] 1.000021

Real-world Interpretation:

  • The effect of income on default is extremely small per dollar: even a large increase in income changes the probability of default only minimally.
  • Note that the coefficient is positive, so higher income is associated with slightly higher odds of default in this model, not lower. Its effect is far less pronounced than that of balance, possibly because factors such as spending patterns or existing debt outweigh income in influencing default decisions.

Summary of Coefficients:

  • Intercept: -11.54 (baseline log-odds when both predictors are 0)
  • balance: 0.005647 (odds ratio ≈ 1.0057 per dollar)
  • income: 0.00002081 (odds ratio ≈ 1.00002 per dollar)


How the Log-Odds Translate to Probability:

In logistic regression, we estimate the log-odds of the outcome, but we can convert this to the probability using the logistic function.

Formula:

\[ \text{Probability of default} = \frac{1}{1 + \exp(-( \text{Intercept} + \text{balance coefficient} \times \text{balance} + \text{income coefficient} \times \text{income}))} \]

For example, if a person has: - balance = 2000 - income = 50000

We can calculate the probability of default as follows:

log_odds = -11.54 + 0.005647 * 2000 + 0.00002081 * 50000
probability = 1 / (1 + exp(-log_odds))
probability
## [1] 0.6887968

This will give us the probability of that person defaulting based on the model’s estimates.
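
The same probability can be obtained with predict(), which is the more idiomatic approach; a small sketch with a hypothetical new observation:

# Predicted probability of default for a new person (hypothetical values)
new_person <- data.frame(balance = 2000, income = 50000)
predict(model, newdata = new_person, type = "response")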


Final Notes:


Model Fit

Deviance:

  • Null Deviance: \(2920.6\) (model without predictors). This is the deviance we would expect with no explanatory variables in the model.
  • Residual Deviance: \(1579.0\) (model with predictors).
    • A significant reduction in deviance shows that the predictors (balance and income) greatly improve the model.

AIC (Akaike Information Criterion): \(1585\)

  • A lower AIC value indicates a better-fitting model compared to other models fitted to the same data.

Overall Fit: - The large reduction in deviance and low p-values indicate that the model is effective at explaining the likelihood of default based on balance and income.


Step 4: Evaluate Model Fit

Null vs. Residual Deviance:
anova(model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: default
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                     9999     2920.7              
## balance  1  1324.20      9998     1596.5 < 2.2e-16 ***
## income   1    17.49      9997     1579.0 2.895e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Analysis of Deviance table provides insights into the statistical significance of each predictor in a logistic regression model by comparing the deviance of the model at each step. Here’s a breakdown of the output:

Understanding the Table Columns:

  • Df: the degrees of freedom used by the added term.
  • Deviance: the reduction in deviance achieved by adding that term.
  • Resid. Df / Resid. Dev: the degrees of freedom and deviance remaining after the term is added.
  • Pr(>Chi): the p-value of a chi-squared test on the deviance reduction.

Interpretation of the Output:

1. NULL Model (No predictors):

  • Residual Deviance: 2920.7
    • This is the deviance of the null model, which includes no predictors (only the intercept).
    • This is the baseline measure of model fit, meaning the deviance when we have no explanatory variables.

2. Adding the balance variable:

  • Deviance (reduction): 1324.20
  • Residual Df: 9998
  • Residual Dev: 1596.5
  • p-value: \(< 2.2 \times 10^{-16}\)
    • The addition of balance reduces the deviance significantly from 2920.7 to 1596.5. This shows that balance is a very important predictor in the model and improves the model fit.
    • The p-value for balance is very small (\(< 2.2 \times 10^{-16}\)), which indicates that balance is highly significant in predicting the likelihood of default. We can conclude that balance has a very strong influence on the outcome.

3. Adding the income variable:

  • Deviance (reduction): 17.49
  • Residual Df: 9997
  • Residual Dev: 1579.0
  • p-value: \(2.895 \times 10^{-5}\)
    • Adding income further reduces the deviance from 1596.5 to 1579.0, showing that income also contributes to improving the model fit, though to a lesser extent than balance.
    • The p-value for income is \(2.895 \times 10^{-5}\), which is still quite small and indicates that income is statistically significant in predicting default, but its effect is less strong than balance.

McFadden’s Pseudo R²:
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
pR2(model)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
##  -789.4831351 -1460.3248557  1341.6834411     0.4593784     0.1255572 
##          r2CU 
##     0.4957247
  • For all of these pseudo-R² measures, values closer to 1 indicate better model fit.

The Pseudo-R² values are used to assess the fit of logistic regression models, but they are not interpreted in the same way as R² in linear regression. Since logistic regression does not have a direct R², various pseudo-R² measures can be used to compare models or assess model fit. Let’s break down the output and interpret each value:


Interpreting the values:

1. Log-Likelihood (llh and llhNull):

  • llh: \(-789.4831\)
    • This is the log-likelihood of the fitted model, which includes the predictors (balance and income).
    • A higher (less negative) log-likelihood suggests a better model fit.
  • llhNull: \(-1460.3249\)
    • This is the log-likelihood of the null model (model with only the intercept, no predictors).
    • A significant difference between the log-likelihood of the null and the fitted model indicates that the fitted model is much better at explaining the data than the null model.

2. G² (Likelihood Ratio Test Statistic):

  • G²: \(1341.6834\)
    • The G² statistic tests whether the addition of the predictors (balance and income) significantly improves the model fit compared to the null model.
    • This large value suggests a significant improvement of the fitted model over the null model, indicating that both balance and income are important predictors.

Pseudo-R² Measures:

3. McFadden’s R²:

  • McFadden’s R²: \(0.4594\)
    • McFadden’s R² is one of the most commonly used pseudo-R² measures. It is calculated as: \[ R^2 = 1 - \frac{\text{llh}}{\text{llhNull}} \]
    • Interpretation: A McFadden’s R² of 0.4594 means the fitted model improves on the null model’s log-likelihood by about 46%. This is not “variance explained” as in linear regression; for McFadden’s R², values between 0.2 and 0.4 already represent an excellent fit, so 0.4594 indicates a very good fit (a quick check of the value appears below).
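
A quick verification of McFadden’s R² from the log-likelihoods reported above:

# McFadden's R² = 1 - llh / llhNull
1 - (-789.4831 / -1460.3249)  # about 0.4594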

Summary:

Overall, the pseudo-R² values suggest that the model performs well and provides a good explanation of the likelihood of default; among the reported measures, r2ML (Cox and Snell) is the most conservative, while McFadden’s R² and r2CU (Nagelkerke) both indicate a strong fit.


Step 5: Proportionate Change in Odds

To make the coefficients more interpretable, calculate the proportional change in odds for significant predictors:

# Compute odds ratios and proportional changes
odds_ratios <- exp(coef(model))
proportional_changes <- (odds_ratios - 1) * 100
proportional_changes
##   (Intercept)       balance        income 
## -99.999027167   0.566307789   0.002080919

Interpretation: - For every 1-unit ($1) increase in balance, the odds of default increase by about 0.57%. - For every 1-unit ($1) increase in income, the odds of default increase by about 0.002%; the coefficient is positive, but the per-dollar effect is negligible.


Step 6: Discuss Limitations of Estimation

  • Scale of Predictors: The effects of income appear small due to its large scale (e.g., thousands of dollars). Scaling predictors might improve interpretability.
  • Linearity of the Logit: Logistic regression assumes the relationship between the predictors and the log-odds is linear. Nonlinear effects could require transformations or more advanced models.

Scaling Predictors for Better Estimation

Why Scale Predictors?

When predictors have vastly different scales, logistic regression coefficients may become difficult to interpret, and the fitting process might encounter numerical stability issues. Scaling predictors standardizes their ranges, improving the interpretability of coefficients in terms of relative effect sizes.

For example: - In this model, balance is measured in dollars, while income is also in dollars but spans a much larger range. - The difference in scales makes it hard to compare their effects directly or to interpret the coefficients side by side.

How to Scale Predictors in R?

One common approach is z-score scaling, where predictors are transformed to have a mean of 0 and a standard deviation of 1. This can be done using the scale() function:

# Scale the predictors
Default$balance_scaled <- scale(Default$balance)
Default$income_scaled <- scale(Default$income)
summary(Default$balance_scaled)
##        V1          
##  Min.   :-1.72700  
##  1st Qu.:-0.73110  
##  Median :-0.02427  
##  Mean   : 0.00000  
##  3rd Qu.: 0.68415  
##  Max.   : 3.76037
sd(Default$balance_scaled)
## [1] 1
summary(Default$income_scaled)
##        V1          
##  Min.   :-2.45527  
##  1st Qu.:-0.91301  
##  Median : 0.07766  
##  Mean   : 0.00000  
##  3rd Qu.: 0.77161  
##  Max.   : 3.00205
sd(Default$income_scaled)
## [1] 1
# Fit the logistic regression model with scaled predictors
model_scaled <- glm(default ~ balance_scaled + income_scaled, 
                    family = "binomial", data = Default)

summary(model_scaled)
## 
## Call:
## glm(formula = default ~ balance_scaled + income_scaled, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.12557    0.18756 -32.659  < 2e-16 ***
## balance_scaled  2.73159    0.10998  24.836  < 2e-16 ***
## income_scaled   0.27752    0.06649   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Interpreting the Scaled Coefficients

The results of this logistic regression model estimate the relationship between the probability of default and two predictors: scaled balance (balance_scaled) and scaled income (income_scaled). Both predictors were standardized, meaning their values were scaled to have a mean of 0 and a standard deviation of 1, to improve interpretability and comparability.


Interpretation of the Coefficients:

  1. Intercept (-6.12557):
    • The log-odds of defaulting when both balance_scaled and income_scaled are at their mean (i.e., 0 after scaling) is -6.12557.
    • The corresponding probability of defaulting is very low, as: \[ \text{Probability} = \frac{e^{-6.12557}}{1 + e^{-6.12557}} \approx 0.00218 \, (0.22\%). \]

  2. Balance (balance_scaled: 2.73159):
    • For every 1 standard deviation increase in the scaled balance, the log-odds of defaulting increase by 2.73159.
    • Translating this to odds: \[ \text{Odds Ratio} = e^{2.73159} \approx 15.35. \] This means that for every 1 standard deviation increase in balance, the odds of defaulting multiply by approximately 15.35, roughly a 15-fold increase.

  3. Income (income_scaled: 0.27752):
    • For every 1 standard deviation increase in the scaled income, the log-odds of defaulting increase by 0.27752.
    • Translating this to odds: \[ \text{Odds Ratio} = e^{0.27752} \approx 1.32. \] This indicates that for every 1 standard deviation increase in income, the odds of defaulting increase by about 32%. This is a smaller effect compared to balance_scaled.
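
As a consistency check (not part of the original output), each scaled coefficient should equal the corresponding unscaled coefficient multiplied by that predictor’s standard deviation:

# Scaled coefficient = unscaled coefficient * predictor standard deviation
coef(model)["balance"] * sd(Default$balance)  # about 2.73, matches balance_scaled
coef(model)["income"] * sd(Default$income)    # about 0.28, matches income_scaled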

Statistical Significance:

  • Both balance_scaled and income_scaled remain highly significant (p < 0.001); z-score scaling changes the units of the coefficients but not their z-values or p-values.

Model Fit:

  • The null deviance (2920.6), residual deviance (1579.0), and AIC (1585) are identical to those of the unscaled model, since scaling is a linear transformation of the predictors.

Key Insights:

  1. Scaled Balance is the most influential predictor of default. Larger balances (relative to the mean and standard deviation of the dataset) drastically increase the likelihood of default.
  2. Scaled Income has a smaller but still significant effect, with higher income increasing the odds of default slightly.
  3. The intercept suggests that, on average, default probabilities are very low when predictors are at their mean levels.

Diagnostics for Logistic Regression

To evaluate your model’s performance and ensure the assumptions are met, consider the following diagnostic tools:

1. Goodness-of-Fit Tests

  • Hosmer-Lemeshow Test: Tests whether the predicted probabilities match the observed outcomes. A p-value > 0.05 suggests no significant lack of fit.
library(ResourceSelection)
## ResourceSelection 0.3-6   2023-06-27
# hoslem.test() expects a numeric 0/1 outcome; passing the factor directly
# triggers the warning "'-' not meaningful for factors" and returns NA
default_numeric <- as.numeric(Default$default) - 1  # "No" -> 0, "Yes" -> 1
hoslem.test(default_numeric, fitted(model_scaled))
hoslem.test(default_numeric, fitted(model))

With this correction the test returns a chi-squared statistic and p-value. Note that the outcome here is heavily imbalanced (far more “No” than “Yes” cases), so the test’s grouping should be interpreted with caution, although the test is commonly used in this setting.

2. Residual Analysis

  • Deviance residuals can be plotted to check for outliers or poorly predicted cases.
plot(residuals(model_scaled, type = "deviance"))

- Look for residuals that are extremely large, as they may indicate potential outliers.

3. Influence Measures

  • Identify influential data points that may disproportionately affect the model.
library(car)
## Loading required package: carData
influencePlot(model_scaled)  # plot(model, which = 4) does not work here

##         StudRes          Hat       CookD
## 3249  0.9961308 7.942615e-03 0.001716759
## 4160  3.7308471 4.638519e-05 0.015887189
## 5371 -0.7897206 9.125176e-03 0.001126548
## 9539  3.7207714 5.096523e-05 0.016789495
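
To look at the flagged observations directly, one can index the data frame with the row numbers reported above:

# Inspect the rows flagged as influential or high-leverage
Default[c(3249, 4160, 5371, 9539), c("default", "balance", "income")]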

4. Multicollinearity Check

  • Multicollinearity among predictors can distort the estimation of coefficients.
  • Use Variance Inflation Factor (VIF) to assess multicollinearity.
library(car)
vif(model_scaled)
## balance_scaled  income_scaled 
##       1.045605       1.045605
  • A VIF > 5 (or 10) indicates potential multicollinearity.