Logistic regression, while not requiring the strict assumptions of linear regression, still operates under several key assumptions to ensure valid estimation and interpretation of results.
The outcome variable must be binary (e.g., default vs. no default). Why It Matters: If the outcome variable is not binary, logistic regression may not be the appropriate method.
Observations should be independent of one another. Why It Matters: Violation of this assumption, such as in repeated measures or clustered data, can lead to biased standard errors and incorrect inferences.
Solutions for Violations: - Use techniques like generalized estimating equations (GEE) or mixed-effects logistic regression for dependent data.
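As a minimal sketch of the second option, a mixed-effects logistic regression can be fit with lme4::glmer(). The Default data used below have no natural clustering, so the region grouping variable is purely hypothetical and only illustrates the syntax.
# Sketch: random-intercept logistic regression for clustered data.
# 'region' is a hypothetical cluster id -- the Default data are not clustered.
library(ISLR)
library(lme4)
data(Default)
set.seed(1)
Default$region <- factor(sample(1:10, nrow(Default), replace = TRUE))
model_mixed <- glmer(default ~ scale(balance) + scale(income) + (1 | region),
                     data = Default, family = binomial)
summary(model_mixed)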
The model assumes a linear relationship between continuous predictors and the log-odds (logit) of the outcome. Why It Matters: If the relationship between predictors and the logit is non-linear, the model might produce biased estimates.
How to Check: - Test for non-linearity by including polynomial or interaction terms. - Use splines or transformations for non-linear relationships.
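For instance, a quadratic term for balance can be compared against the linear specification with a likelihood-ratio test. This is a sketch using the ISLR Default data analyzed later in this section.
# Sketch: test linearity of the logit by adding a quadratic term for balance.
library(ISLR)
data(Default)
model_linear <- glm(default ~ balance + income, data = Default, family = binomial)
model_quad   <- glm(default ~ poly(balance, 2) + income, data = Default, family = binomial)
# A non-significant p-value suggests the linear term is adequate.
anova(model_linear, model_quad, test = "Chisq")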
Why It Matters: Multicollinearity inflates the standard errors of coefficients, leading to unreliable estimates.
How to Check: - Compute the Variance Inflation Factor (VIF) for each predictor: \[ \text{VIF}_j = \frac{1}{1 - R_j^2}, \] where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors. - A VIF > 5 (or 10) suggests potential multicollinearity.
Solutions for Violations: - Remove or combine collinear predictors.
How to Address: - Use domain knowledge to guide variable selection. - Compare models using measures like AIC or BIC.
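A minimal sketch of an AIC/BIC comparison, again using the Default data; the candidate models below are illustrative, not a recommendation.
# Sketch: compare candidate models; lower AIC/BIC indicates a better
# trade-off between fit and complexity.
library(ISLR)
data(Default)
m1 <- glm(default ~ balance,                    data = Default, family = binomial)
m2 <- glm(default ~ balance + income,           data = Default, family = binomial)
m3 <- glm(default ~ balance + income + student, data = Default, family = binomial)
AIC(m1, m2, m3)
BIC(m1, m2, m3)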
Why It Matters: - Outliers can distort coefficients, leading to misleading results.
How to Check: - Examine deviance residuals or Cook’s distance to identify potential outliers. - Use influence plots to assess high-leverage points.
Solutions for Violations: - Remove or adjust for outliers.
Why It Matters: Small sample sizes can lead to overfitting or unstable coefficient estimates.
Solutions for Violations: - Combine categories or predictors to reduce model complexity. - Use penalized regression methods for small datasets.
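A penalized fit could be sketched with the glmnet package as follows; the lasso penalty (alpha = 1) is an assumption here, and ridge (alpha = 0) or elastic net would work the same way.
# Sketch: cross-validated lasso-penalized logistic regression.
library(ISLR)
library(glmnet)
data(Default)
x <- model.matrix(default ~ balance + income + student, data = Default)[, -1]
y <- Default$default
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.min")  # coefficients at the lambda minimizing CV deviance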
How to Check: - Examine residual plots to ensure that error variance is consistent across levels of the predictors.
Why It Matters: Correlated errors violate the assumptions of the model and can result in inefficient estimates.
Solutions for Violations: - Use robust standard errors or a clustered sandwich estimator to account for correlated errors.
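A sketch with the sandwich and lmtest packages is shown below; the clustered variant is commented out because the Default data contain no cluster identifier, so the region variable there is hypothetical.
# Sketch: robust (sandwich) standard errors for a logistic regression.
library(ISLR)
library(sandwich)
library(lmtest)
data(Default)
fit <- glm(default ~ balance + income, data = Default, family = binomial)
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))  # heteroskedasticity-robust SEs
# Clustered version (hypothetical 'region' cluster id):
# coeftest(fit, vcov = vcovCL(fit, cluster = Default$region))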
Why It Matters: Using an inappropriate link function (e.g., probit) can lead to incorrect estimates.
Solutions: - Ensure the logit function is a reasonable choice for your data.
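One simple check, sketched below, is to refit the model with a probit link and compare information criteria; similar values suggest the choice of link makes little practical difference.
# Sketch: compare logit and probit links via AIC.
library(ISLR)
data(Default)
fit_logit  <- glm(default ~ balance + income, data = Default,
                  family = binomial(link = "logit"))
fit_probit <- glm(default ~ balance + income, data = Default,
                  family = binomial(link = "probit"))
AIC(fit_logit, fit_probit)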
# Load necessary library
library(ISLR)  # to load the Default dataset
# Load the Default dataset
data(Default)
# Inspect the structure of the dataset
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
# Ensure the outcome is stored as a factor (it already is in this dataset)
Default$default <- as.factor(Default$default)
# Check for the balance between the outcome categories
table(Default$default)
##
## No Yes
## 9667 333
In the Default dataset, the term default refers to whether an individual has failed to repay their credit card debt. It is a binary variable indicating the following: default = "Yes" means the individual has defaulted on their credit card payment obligations, while default = "No" means the individual has not defaulted and is meeting their payment obligations. This variable serves as the dependent variable in the logistic regression model, where we aim to estimate the factors (e.g., balance, income) that affect the likelihood of defaulting.
Fit a logistic regression model to estimate the effects of balance (credit card balance) and income on the likelihood of default:
# Fit the logistic regression model
model <- glm(default ~ balance + income, data = Default, family = "binomial")
# View the summary of the model
summary(model)
##
## Call:
## glm(formula = default ~ balance + income, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The model estimates the likelihood of default (default) based on two predictors: balance and income.

The intercept represents the log-odds that a person defaults (i.e., fails to meet their payment obligations) when balance and income are both equal to 0. This is a hypothetical scenario, as it doesn't make sense for someone to have a balance of $0 and an income of $0 in practice. The log-odds of default when balance = 0 and income = 0 are \(-11.54\), which corresponds to a very low probability of defaulting. This very low probability comes from the negative value of the intercept. To calculate the probability of default when balance and income are 0:
probability = 1 / (1 + exp(-(-11.54)))
probability  # approximately 0
## [1] 9.732792e-06
The probability of default when both balance and income are zero is virtually 0.

For each one-unit (one-dollar) increase in balance (credit card balance), the log-odds of default increase by 0.005647. To interpret this in terms of an odds ratio:
odds_ratio_balance = exp(0.005647)
odds_ratio_balance #≈ 1.0057
## [1] 1.005663
In other words, each additional dollar of balance multiplies the odds of default by roughly 1.0057; a higher balance may indicate greater financial pressure.

For each one-unit (one-dollar) increase in income, the log-odds of default increase by 0.00002081. To interpret this in terms of an odds ratio:
odds_ratio_income = exp(0.00002081)
odds_ratio_income #≈ 1.00002081
## [1] 1.000021
In logistic regression, we estimate the log-odds of the outcome, but we can convert this to the probability using the logistic function.
\[ \text{Probability of default} = \frac{1}{1 + \exp(-( \text{Intercept} + \text{balance coefficient} \times \text{balance} + \text{income coefficient} \times \text{income}))} \]
For example, if a person has balance = 2000 and income = 50000, we can calculate the probability of default as follows:
log_odds = -11.54 + 0.005647 * 2000 + 0.00002081 * 50000
probability = 1 / (1 + exp(-log_odds))
probability
## [1] 0.6887968
This will give us the probability of that person defaulting based on the model’s estimates.
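The same probability can be obtained from the fitted model with predict(); this is a sketch, and its result may differ slightly from the hand calculation because the coefficients above are rounded.
# Model-based probability for balance = 2000 and income = 50000.
new_person <- data.frame(balance = 2000, income = 50000)
predict(model, newdata = new_person, type = "response")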
Overall Fit: - The large reduction in deviance and low p-values indicate that the model is effective at explaining the likelihood of default based on balance and income.
anova(model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: default
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 9999 2920.7
## balance 1 1324.20 9998 1596.5 < 2.2e-16 ***
## income 1 17.49 9997 1579.0 2.895e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Analysis of Deviance table provides insights into the statistical significance of each predictor in a logistic regression model by comparing the deviance of the model at each step. Here’s a breakdown of the output:
- The balance variable: balance reduces the deviance significantly, from 2920.7 to 1596.5. This shows that balance is a very important predictor in the model and improves the model fit. The p-value for balance is very small (\(< 2.2 \times 10^{-16}\)), which indicates that balance is highly significant in predicting the likelihood of default. We can conclude that balance has a very strong influence on the outcome.
- The income variable: income further reduces the deviance from 1596.5 to 1579.0, showing that income also contributes to improving the model fit, though to a lesser extent than balance. The p-value for income is \(2.895 \times 10^{-5}\), which is still quite small and indicates that income is statistically significant in predicting default, but its effect is less strong than that of balance.

library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -789.4831351 -1460.3248557 1341.6834411 0.4593784 0.1255572
## r2CU
## 0.4957247
The Pseudo-R² values are used to assess the fit of logistic regression models, but they are not interpreted in the same way as R² in linear regression. Since logistic regression does not have a direct R², various pseudo-R² measures can be used to compare models or assess model fit. Let’s break down the output and interpret each value:
- llh and llhNull: the log-likelihoods of the fitted model (with balance and income) and of the null model.
- G2 = 1341.68: the likelihood-ratio statistic shows that including the predictors (balance and income) significantly improves the model fit compared to the null model, confirming that balance and income are important predictors.
- McFadden's R² = 0.459: the model explains about 46% of the variation in the outcome (default) compared to the null model. This is considered a good fit in logistic regression (values between 0.2 and 0.4 typically represent good fits).
- r2ML = 0.126 (Cox & Snell) and r2CU = 0.496 (Nagelkerke): these alternative pseudo-R² measures likewise indicate that the model captures a meaningful share of the variation in default compared to the null model.

Overall, the pseudo-R² values suggest that the model performs well and provides a good explanation for the likelihood of default, with McFadden's R² being a more conservative estimate of fit than Nagelkerke's.
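As a cross-check (a minimal sketch), McFadden's R² can be reproduced directly from the log-likelihoods of the fitted and null models:
# McFadden's R-squared = 1 - logLik(full) / logLik(null); should match pR2() above.
null_model <- glm(default ~ 1, data = Default, family = binomial)
1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))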
To make the coefficients more interpretable, calculate the proportional change in odds for significant predictors:
# Compute odds ratios and proportional changes
odds_ratios <- exp(coef(model))
proportional_changes <- (odds_ratios - 1) * 100
proportional_changes
## (Intercept) balance income
## -99.999027167 0.566307789 0.002080919
Interpretation: - For every 1-unit increase in balance, the odds of default increase by about 0.57%. - For every 1-unit increase in income, the odds of default increase by about 0.002%.

The effects of income appear small due to its large scale (e.g., thousands of dollars), so scaling predictors might improve interpretability. When predictors have vastly different scales, logistic regression coefficients may become difficult to interpret, and the fitting process might encounter numerical stability issues. Scaling predictors standardizes their ranges, improving the interpretability of coefficients in terms of relative effect sizes.
For example: - In your model, balance is measured in dollars, while income is also in dollars but represents a larger range. - The difference in scales may lead to challenges in comparing their effects directly or interpreting the coefficients.
One common approach is z-score scaling, where
predictors are transformed to have a mean of 0 and a standard deviation
of 1. This can be done using the scale()
function:
# Scale the predictors
Default$balance_scaled <- scale(Default$balance)
Default$income_scaled <- scale(Default$income)
summary(Default$balance_scaled)
## V1
## Min. :-1.72700
## 1st Qu.:-0.73110
## Median :-0.02427
## Mean : 0.00000
## 3rd Qu.: 0.68415
## Max. : 3.76037
sd(Default$balance_scaled)
## [1] 1
summary(Default$income_scaled)
## V1
## Min. :-2.45527
## 1st Qu.:-0.91301
## Median : 0.07766
## Mean : 0.00000
## 3rd Qu.: 0.77161
## Max. : 3.00205
sd(Default$income_scaled)
## [1] 1
# Fit the logistic regression model with scaled predictors
model_scaled <- glm(default ~ balance_scaled + income_scaled,
family = "binomial", data = Default)
summary(model_scaled)
##
## Call:
## glm(formula = default ~ balance_scaled + income_scaled, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.12557 0.18756 -32.659 < 2e-16 ***
## balance_scaled 2.73159 0.10998 24.836 < 2e-16 ***
## income_scaled 0.27752 0.06649 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The results of this logistic regression model estimate the
relationship between the probability of default and two predictors:
scaled balance (balance_scaled
) and
scaled income (income_scaled
). Both
predictors were standardized, meaning their values were scaled to have a
mean of 0 and a standard deviation of 1, to improve interpretability and
comparability.
The intercept: the log-odds of default when both balance_scaled and income_scaled are at their mean (i.e., 0 after scaling) is -6.12557. Because the predictors are on the same standardized scale, their coefficients are directly comparable, and the larger effect (2.73159 vs. 0.27752 per standard deviation) belongs to balance_scaled.

- balance_scaled (z = 24.836, p < 2e-16): Extremely strong evidence that balance impacts the probability of default.
- income_scaled (z = 4.174, p = 2.99e-05): Significant evidence that income impacts the probability of default, though the effect size is smaller compared to balance.

To evaluate your model's performance and ensure the assumptions are met, consider the following diagnostic tools:
library(ResourceSelection)
## ResourceSelection 0.3-6 2023-06-27
hoslem.test(Default$default, fitted(model_scaled))
## Warning in Ops.factor(1, y): '-' not meaningful for factors
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: Default$default, fitted(model_scaled)
## X-squared = NA, df = 8, p-value = NA
hoslem.test(Default$default, fitted(model))
## Warning in Ops.factor(1, y): '-' not meaningful for factors
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: Default$default, fitted(model)
## X-squared = NA, df = 8, p-value = NA
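The NA results and warnings above occur because default is stored as a factor, while hoslem.test() expects a numeric 0/1 outcome. A corrected call might look like the sketch below (output not shown):
# Recode the outcome to 0/1 before running the Hosmer-Lemeshow test.
y_numeric <- ifelse(Default$default == "Yes", 1, 0)
hoslem.test(y_numeric, fitted(model), g = 10)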
# there are far more "No" than "Yes" observations, but the test is still commonly used
plot(residuals(model_scaled, type = "deviance"))
- Look for residuals that are extremely large, as they may indicate
potential outliers.
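To follow up on the plot, the indices of observations with large deviance residuals can be listed directly; the |residual| > 2 cutoff below is a common rule of thumb, not a strict threshold.
# Identify observations whose absolute deviance residual exceeds 2.
dev_res <- residuals(model_scaled, type = "deviance")
which(abs(dev_res) > 2)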
library(car)
## Loading required package: carData
influencePlot(model_scaled)  # plot(model, which = 4) does not work here
## StudRes Hat CookD
## 3249 0.9961308 7.942615e-03 0.001716759
## 4160 3.7308471 4.638519e-05 0.015887189
## 5371 -0.7897206 9.125176e-03 0.001126548
## 9539 3.7207714 5.096523e-05 0.016789495
library(car)
vif(model_scaled)
## balance_scaled income_scaled
## 1.045605 1.045605