Many economic decisions are binary in nature:
Labor force participation (Yes/No)
Purchase decisions (Buy/Not Buy)
Default on loans (Default/Not Default)
Migration decisions (Move/Stay)
Consider the following linear regression model:
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i \]
where \(P(Y=1 \mid X_1,X_2,\dots,X_k) = \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + \dots + \beta_k X_{k}\).
\[ E(Y)= 1 \cdot Pr(Y=1) + 0 \cdot Pr(Y=0) = Pr(Y=1) \]
In the multiple regression model with a binary dependent variable, we have \[ E(Y_i | X_{1i},\dots,X_{ki})=Pr(Y_i=1 | X_{1i},\dots,X_{ki}) = \beta_0 + \beta_1 X_{1i} + \dots+\beta_k X_{ki} \] This means that \(\beta_j\) can be interpreted as the change in the probability that \(Y_i=1\), holding the other \(k-1\) regressors constant. Just as in ordinary multiple regression, the \(\beta_j\) can be estimated using OLS, and the robust standard error formulas can be used for hypothesis testing and the computation of confidence intervals.
In other words: the population coefficient \(\beta_j\) equals the change in the probability that \(Y_i=1\) associated with a unit change in \(X_j\): \[ \frac{ \partial Pr(Y_i=1 | X_{1i},\dots,X_{ki}) }{\partial X_j } = \beta_j \]
Most individuals who want to buy a house apply for a mortgage at a bank, but not all mortgage applications are approved.
What determines whether a mortgage application is approved or denied?
deny=1 if mortgage application is denied
pi_ratio (pirat): anticipated monthly loan payments divided by monthly income
black=1 if applicant is black, =0 if applicant is white
Research Question: Does the payment-to-income ratio affect whether a mortgage application is denied?
## deny pirat hirat lvrat chist mhist phist unemp selfemp insurance condomin
## 1 no 0.221 0.221 0.8000000 5 2 no 3.9 no no no
## 2 no 0.265 0.265 0.9218750 2 2 no 3.2 no no no
## 3 no 0.372 0.248 0.9203980 1 2 no 3.2 no no no
## 4 no 0.320 0.250 0.8604651 1 2 no 4.3 no no no
## 5 no 0.360 0.350 0.6000000 1 1 no 3.2 no no no
## 6 no 0.240 0.170 0.5105263 1 1 no 3.9 no no no
## afam single hschool
## 1 no no yes
## 2 no yes yes
## 3 no no yes
## 4 no no yes
## 5 no no yes
## 6 no no yes
## deny pirat hirat lvrat chist
## no :2095 Min. :0.0000 Min. :0.0000 Min. :0.0200 1:1353
## yes: 285 1st Qu.:0.2800 1st Qu.:0.2140 1st Qu.:0.6527 2: 441
## Median :0.3300 Median :0.2600 Median :0.7795 3: 126
## Mean :0.3308 Mean :0.2553 Mean :0.7378 4: 77
## 3rd Qu.:0.3700 3rd Qu.:0.2988 3rd Qu.:0.8685 5: 182
## Max. :3.0000 Max. :3.0000 Max. :1.9500 6: 201
## mhist phist unemp selfemp insurance condomin
## 1: 747 no :2205 Min. : 1.800 no :2103 no :2332 no :1694
## 2:1571 yes: 175 1st Qu.: 3.100 yes: 277 yes: 48 yes: 686
## 3: 41 Median : 3.200
## 4: 21 Mean : 3.774
## 3rd Qu.: 3.900
## Max. :10.600
## afam single hschool
## no :2041 no :1444 no : 39
## yes: 339 yes: 936 yes:2341
We want to model the variable deny, an indicator for
whether an applicant’s mortgage application has been accepted (deny =
no) or denied (deny = yes).
Specifically we want to model the relationship between
deny and pirat (anticipated monthly payments
as a ratio of income). It is straightforward to translate this into the
following simple regression model:
\[ deny = \beta_0 + \beta_1 \cdot P/I \ ratio + u \]
Estimate this model just like any other linear regression using
lm().
# load the AER package, which provides the HMDA data (and loads lmtest/sandwich)
library(AER)
data(HMDA)
# convert 'deny' to numeric
HMDA$deny <- as.numeric(HMDA$deny) - 1
# estimate a simple linear probability model
denymod1 <- lm(deny ~ pirat, data = HMDA)
denymod1
##
## Call:
## lm(formula = deny ~ pirat, data = HMDA)
##
## Coefficients:
## (Intercept) pirat
## -0.07991 0.60353
Now plot the data together with the estimated regression line and see how they compare:
# plot the data
plot(x = HMDA$pirat,
y = HMDA$deny,
main = "Scatterplot Mortgage Application Denial and
the Payment-to-Income Ratio",
xlab = "P/I ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.8)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add the estimated regression line
abline(denymod1,
lwd = 1.8,
col = "steelblue")Results show that a payment to income ratio of 1 is associated with an expected probability of mortgage application denial of roughly 60 percent.
There is a positive relationship between the payment-to-income ratio and the probability of a denied mortgage application so individuals with a high ratio of loan payments to income are more likely to be rejected.
Based on the estimation results from the LPM, the coefficient on the P/I ratio is statistically significantly different from 0 at the 1 percent level.
Recall that the theoretical model is: \[ E(Y_i | X_{1i})=Pr(Y_i=1 | X_{1i}) = \beta_0 + \beta_1 X_{1i} \] where the only regressor is \(X_1\), the P/I ratio in the mortgage example.
The population coefficient \(\beta_1\) equals the change in the probability that \(Y_i=1\) associated with a unit change in \(X_1\): \[ \frac{ \partial Pr(Y_i=1 | X_{1i}) }{\partial X_1 } = \beta_1 \] In the mortgage application example, we obtained \(\hat{\beta}_1=0.60\).
For P/I ratios larger than about 1.8, the estimated model predicts a denial probability greater than 1!
For small values of the P/I ratio (below roughly 0.13), the predicted probability of denial is even negative, so the model has no meaningful interpretation there.
In the linear probability model the predicted probability can be below 0 or above 1!
So how should we interpret the coefficient estimate \(\hat{\beta}\) for the P/I ratio?
This is a major flaw of the linear probability model: it assumes that the conditional probability function is linear.
As a consequence, it does not restrict \(P(Y=1 | X_1,\dots,X_k)\) to lie between 0 and 1.
This circumstance calls for an approach that uses a nonlinear function to model the conditional probability function of a binary dependent variable.
Whenever the dependent variable is dichotomous, e.g. smoker vs. non-smoker or employed vs. unemployed, the linear probability model suffers from several problems:
Non-normality of the errors \(U_i\)
Heteroskedastic variances of the errors,
Non-fulfillment of \(0 \leq E(Y_i|X_i) \leq 1\),
Constant marginal effects,
Questionable value of \(R^2\) as a measure of goodness of fit.
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + u_i \]
The variance of a Bernoulli random variable is: \[ Var(Y)=Pr(Y=1) * [1- Pr(Y=1)] \] We can use this to find the conditional variance of the error term.
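Concretely, since \(Y_i\) given the regressors is Bernoulli with success probability \(p_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}\), the conditional variance of the error is \[ Var(u_i \mid X_{1i},\dots,X_{ki}) = p_i (1 - p_i), \] which varies with the regressors: the LPM errors are heteroskedastic by construction, which is why robust standard errors are needed.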
Probabilities cannot be less than 0 or greater than 1
To address this problem, we will consider nonlinear probability models
\[ Pr(Y_i=1) = G(Z) \] where \(Z=\beta_0 + \beta_1X_{1i} + \dots + \beta_k X_{ki}\) and \(0 \leq G(Z) \leq 1\).
We will consider two nonlinear functions:
Probit: \(G(Z)=\Phi(Z)\), the standard normal CDF.
Logit: \(G(Z)=\frac{1}{1+e^{-Z}}\).
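Both functions are available in base R; a minimal sketch of evaluating them on a grid:
# evaluate the two link functions G(Z) on a grid of Z values
z <- seq(-4, 4, by = 0.1)
G_probit <- pnorm(z)            # standard normal CDF, Phi(z)
G_logit  <- 1 / (1 + exp(-z))   # logistic CDF, identical to plogis(z)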
Probit regression models the probability that \(Y=1\) using the cumulative standard normal distribution function \(\Phi(Z)\), evaluated at \(Z=\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}\).
Since \(\Phi(z) = Pr(Z \leq z)\), the predicted probabilities of the probit model are between 0 and 1.
Example:
Suppose we have only one regressor and \(Z=-2 + 3 X_1\).
We want to know the probability that \(Y=1\) when \(X_1=0.4\).
\(z=-2 + 3*0.4 = -0.8\).
\(Pr(Y=1)=Pr(Z \leq -0.8) = \Phi(-0.8) = 0.2119\)
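The same calculation in R, using pnorm() for \(\Phi\):
# probability that Y = 1 when X1 = 0.4 and Z = -2 + 3 * X1
z <- -2 + 3 * 0.4
pnorm(z)   # 0.2118554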
The probit model is derived from an underlying latent variable model:
\[ y_i^* = x_i'\beta + \epsilon_i \]
where:
\(y_i^*\) is an unobserved latent variable
\(x_i\) is a vector of covariates
\(\beta\) is a vector of parameters
\(\epsilon_i \sim N(0,1)\) (standard normal)
We observe:
\[ y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\\\ 0 & \text{if } y_i^* \leq 0 \end{cases} \]
This leads to:
\[ P(y_i = 1|x_i) = P(y_i^* > 0|x_i) = P(x_i'\beta + \epsilon_i > 0) \] \[ = P(\epsilon_i > -x_i'\beta) = 1 - \Phi(-x_i'\beta) = \Phi(x_i'\beta) \]
where \(\Phi(\cdot)\) is the standard normal CDF and the last step uses its symmetry.
The probability that Y = 1 given X is: \[ P(Y = 1|X) = \Phi(X\beta) \] where:
\(\Phi\) is the cumulative distribution function of the standard normal distribution
\(\beta\) are the coefficients
X are the predictors
In probit regression, the cumulative standard normal distribution function \(\Phi(.)\) is used to model the regression function when the dependent variable is binary.
\[ E(Y | X) = P(Y=1 | X) = \Phi(\beta_0 + \beta_1 X) \]
where \(\beta_0 + \beta_1X\) plays the role of a quantile \(z\). Recall that \[ \Phi(z)=P(Z\leq z), Z \sim N(0,1) \]
The probit coefficient \(\beta_1\) is the change in \(z\) associated with a one-unit change in \(X\). Although the effect of a change in \(X\) on \(z\) is linear, the link between \(X\) and the dependent variable \(Y\) is nonlinear since \(\Phi\) is a nonlinear function of \(X\).
For a model with multiple regressors, the analysis proceeds in the same way.
For a binary variable \(Y\),
\[ Y=\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + u \] with \[ P(Y=1 | X_1,X_2,\dots,X_k) = \Phi(\beta_0 + \beta_1 X_1 + \dots+ \beta_k X_k) \] is the population Probit model with multiple regressors \(X_1, X_2, \dots, X_k\), where \(\Phi(\cdot)\) is the cumulative standard normal distribution function. We then calculate the predicted probability that \(Y=1\) given \(X_1,X_2,\dots,X_k\) in two steps:
Compute \(z= \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\)
Look up \(\Phi(z)\) by calling
pnorm().
In the above model, \(\beta_j\) is the effect on \(z\) of a one-unit change in regressor \(X_j\), holding constant all other \(k-1\) regressors.
Estimate the probit model using the mortgage data:
# estimate the simple probit model
denyprobit <- glm(deny ~ pirat,
family = binomial(link = "probit"),
data = HMDA)
coeftest(denyprobit, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.19415 0.18901 -11.6087 < 2.2e-16 ***
## pirat 2.96787 0.53698 5.5269 3.259e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
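With these estimates, the two-step calculation from above gives, for instance for an applicant with a P/I ratio of 0.3:
# step 1: compute z at pirat = 0.3 from the estimated coefficients
z <- -2.19415 + 2.96787 * 0.3
# step 2: look up the standard normal CDF
pnorm(z)   # approx. 0.096, a predicted denial probability of about 10 percent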
Plot the data with the estimated regression line:
# plot data
plot(x = HMDA$pirat,
y = HMDA$deny,
main = "Probit Model of the Probability of Denial, Given P/I Ratio",
xlab = "P/I ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.85)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(denyprobit, list(pirat = x), type = "response")
lines(x, y, lwd = 1.5, col = "steelblue")
The estimated regression function has a stretched “S”-shape which is typical for the CDF of a continuous random variable with a symmetric PDF, like that of a normal random variable.
The function is clearly nonlinear and flattens out for large and small values of P/I ratio.
The functional form thus also ensures that the predicted conditional probabilities of a denial lie between 0 and 1.
We can also include additional regressors. The output below shows results from a linear probability model regressing deny on pirat and black (= 1 if the applicant is black, note the t test of an lm() fit):
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.090514 0.033430 -2.7076 0.006826 **
## pirat 0.559195 0.103671 5.3939 7.575e-08 ***
## blackyes 0.177428 0.025055 7.0815 1.871e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Used when outcome variable is binary (0/1)
Common in economics for analyzing:
Labor force participation
Consumer choice
Market entry decisions
Default probability
\[ P(Y=1|X) = \frac{e^{X\beta}}{1 + e^{X\beta}} = \Lambda(X\beta) \]
Comparing Binary Choice Models:
Linear Probability Model (LPM): \[ P(Y=1|X) = X\beta \]
Probit Model:
\[ P(Y=1 | X) = \Phi(X \beta) \]
The logit model uses the CDF of a standard logistic distribution, whereas the probit model uses the standard normal distribution.
We can also rewrite the logit model as:
\[ \log\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right) = X\beta \]
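This follows directly from the model, since dividing \(P(Y=1|X)\) by \(1-P(Y=1|X)\) cancels the common denominator: \[ \frac{P(Y=1|X)}{1-P(Y=1|X)} = e^{X\beta} \quad\Rightarrow\quad \log\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right) = X\beta, \] so the logit coefficients act linearly on the log-odds scale.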
Estimate the logit model using our mortgage data:
denylogit <- glm(deny ~ pirat,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.02843 0.35898 -11.2218 < 2.2e-16 ***
## pirat 5.88450 1.00015 5.8836 4.014e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
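As in the probit case, predicted probabilities follow in two steps; plogis() is the logistic CDF:
# predicted denial probability at pirat = 0.3 from the estimated logit model
z <- -4.02843 + 5.88450 * 0.3
plogis(z)   # approx. 0.094, i.e. 1 / (1 + exp(-z))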
Plot the data with the estimated regression line:
# plot data
plot(x = HMDA$pirat,
y = HMDA$deny,
main = "Probit and Logit Models of the Probability of Denial, Given P/I Ratio",
xlab = "P/I ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.9)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_probit <- predict(denyprobit, list(pirat = x), type = "response")
y_logit <- predict(denylogit, list(pirat = x), type = "response")
lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)
# add a legend
legend("topleft",
horiz = TRUE,
legend = c("Probit", "Logit"),
col = c("steelblue", "black"),
       lty = c(1, 2))
Now also estimate a Logit regression with multiple regressors:
# estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ pirat + black,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.12556 0.34597 -11.9245 < 2.2e-16 ***
## pirat 5.37036 0.96376 5.5723 2.514e-08 ***
## blackyes 1.27278 0.14616 8.7081 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compute predictions for a P/I ratio of 0.3 and the difference in denial probabilities between black and white applicants:
# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denylogit2,
newdata = data.frame("black" = c("no", "yes"),
"pirat" = c(0.3, 0.3)),
type = "response")
predictions
## 1 2
## 0.07485143 0.22414592
# 2. compute the difference in denial probabilities
diff(predictions)
## 2
## 0.1492945
As in the Probit model, all model coefficients are highly significant, and the estimate for the coefficient on the P/I ratio is positive.
We also computed the predicted probability of denial for two hypothetical applicants that differ in race and have the same P/I ratio 0.3.
White applicants face a denial probability of only 7.5 percent, while black applicants are rejected with a probability of 22.4 percent, a difference of about 14.9 percentage points.
The Probit model and the Logit model deliver only approximations to the unknown population regression function \(E(Y | X)\).
It is not obvious how to decide which model to use in practice.
The linear probability model has the clear drawback of not being able to capture the nonlinear nature of the population regression function, and it may predict probabilities outside the interval \([0,1]\).
Probit and Logit models are harder to interpret but capture the nonlinearities better than the linear approach: both produce predicted probabilities that lie inside the interval \([0,1]\).
Predictions of all three models are often close to each other. There is no one method that is always preferred over the other.
The logistic regression model is specified as:
\[ P(Y=1|X) = \frac{e^{X\beta}}{1 + e^{X\beta}} = \Lambda(X\beta) \]
where:
\(Y\) is the binary outcome
\(X\) is the vector of covariates
\(\beta\) are the coefficients
\(\Lambda(\cdot)\) is the logistic cumulative distribution function
Generate a sample of size \(n=2000\) with the variables education, experience, female and age.
# Generate sample data
set.seed(123)
n <- 2000
# Generate explanatory variables
data <- data.frame(
education = rnorm(n, mean = 16, sd = 2),
experience = rnorm(n, mean = 10, sd = 5),
female = rbinom(n, 1, 0.5),
age = rnorm(n, mean = 40, sd = 10)
)
Generate a variable for employment status:
# Generate probability of working
z <- with(data,
-2 + 0.3*education + 0.2*experience - 0.5*female - 0.02*age)
prob <- 1/(1 + exp(-z))
data$working <- rbinom(n, 1, prob)
Use the simulated sample to estimate a logit model where
working is the dependent variable:
# Estimate logit model
logit_model <- glm(working ~ education + experience + female + age,
family = binomial(link = "logit"),
                    data = data)
Get the log-odds coefficients with summary():
summary(logit_model)
##
## Call:
## glm(formula = working ~ education + experience + female + age,
## family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.48735 1.15126 -2.161 0.0307 *
## education 0.25462 0.06464 3.939 8.19e-05 ***
## experience 0.21925 0.02756 7.954 1.80e-15 ***
## female -0.42654 0.25740 -1.657 0.0975 .
## age 0.01027 0.01276 0.804 0.4212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 600.21 on 1999 degrees of freedom
## Residual deviance: 512.37 on 1995 degrees of freedom
## AIC: 522.37
##
## Number of Fisher Scoring iterations: 7
Interpretation of the raw coefficients:
For a one-unit increase in X, the change in the log-odds of Y=1 is given by the coefficient estimate.
Example: the education coefficient of 0.255 means that each additional year of education increases the log-odds of working by 0.255, holding the other regressors constant.
# Calculate odds ratios
odds_ratios <- exp(coef(logit_model))
odds_ratios_ci <- exp(confint(logit_model))
# Combine results
odds_results <- data.frame(
OR = odds_ratios,
CI_lower = odds_ratios_ci[,1],
CI_upper = odds_ratios_ci[,2]
)
library(knitr)   # for kable()
kable(odds_results, digits = 3)
| | OR | CI_lower | CI_upper |
|---|---|---|---|
| (Intercept) | 0.083 | 0.009 | 0.798 |
| education | 1.290 | 1.138 | 1.466 |
| experience | 1.245 | 1.181 | 1.316 |
| female | 0.653 | 0.391 | 1.077 |
| age | 1.010 | 0.985 | 1.036 |
What are Odds?
Odds represent the ratio of the probability of success to the probability of failure:
\[ Odds = \frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1-P(Y=1)} \]
Example:
If \(P(Y=1) = 0.75\)
Then \(Odds = 0.75/(1-0.75) = 3\)
Interpretation: the event is 3 times as likely to occur as not to occur
What are Odds Ratios?
The odds ratio is the ratio of two odds:
\[ OR = \frac{Odds_1}{Odds_0} = \frac{P(Y=1|X=x+1)/P(Y=0|X=x+1)}{P(Y=1|X=x)/P(Y=0|X=x)} \]
The logistic regression model is:
\[ P(Y=1|X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}} \]
Deriving the Odds Ratio:
Write out the odds for X = x+1: \[ Odds_1 = \frac{P(Y=1|X=x+1)}{1-P(Y=1|X=x+1)} = e^{\beta_0 + \beta_1(x+1)} \]
Write out the odds for X = x: \[ Odds_0 = \frac{P(Y=1|X=x)}{1-P(Y=1|X=x)} = e^{\beta_0 + \beta_1x} \]
Take the ratio: \[ OR = \frac{Odds_1}{Odds_0} = \frac{e^{\beta_0 + \beta_1(x+1)}}{e^{\beta_0 + \beta_1x}} = e^{\beta_1} \]
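The average marginal effects (AMEs) reported below can be obtained, for example, with the margins package; a sketch, assuming that package (any AME routine would do):
# average marginal effects of the logit model (assumes the 'margins' package)
library(margins)
summary(margins(logit_model))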
## factor AME SE z p lower upper
## age 0.0003 0.0004 0.8026 0.4222 -0.0005 0.0011
## education 0.0079 0.0021 3.7487 0.0002 0.0038 0.0120
## experience 0.0068 0.0010 6.6597 0.0000 0.0048 0.0088
## female -0.0132 0.0081 -1.6413 0.1007 -0.0291 0.0026
Interpretation:
Shows change in probability for a one-unit change in X
Evaluated at observed values and averaged
Example: the education AME of 0.008 means that, on average, one more year of education raises the probability of working by about 0.8 percentage points.
# Create prediction data
new_data <- expand.grid(
education = seq(min(data$education),
max(data$education),
length.out = 100),
experience = mean(data$experience),
female = c(0, 1),
age = mean(data$age)
)
# Generate predictions
new_data$pred_prob <- predict(logit_model,
newdata = new_data,
type = "response")
# Plot (uses ggplot2)
library(ggplot2)
ggplot(new_data, aes(x = education, y = pred_prob, color = factor(female))) +
geom_line() +
labs(title = "Predicted Probability of Working by Education and Gender",
x = "Years of Education",
y = "Probability of Working",
color = "Female") +
  theme_minimal()
# Generate sample data
set.seed(123)
n <- 1000
# Create education variable
education <- rnorm(n, mean = 16, sd = 2)
# Generate probability of employment
z <- -2 + 0.3 * education
prob <- 1/(1 + exp(-z))
employed <- rbinom(n, 1, prob)
# Create dataset
data <- data.frame(
education = education,
employed = employed
)
# Fit logistic regression
model <- glm(employed ~ education,
family = binomial(link = "logit"),
             data = data)
# Calculate odds ratio for education
beta <- coef(model)[2] # coefficient for education
odds_ratio <- exp(beta)
# Calculate confidence interval
ci <- exp(confint(model))[2,]
# Create results table
results <- data.frame(
Coefficient = beta,
Odds_Ratio = odds_ratio,
CI_Lower = ci[1],
CI_Upper = ci[2]
)
kable(results, digits = 3)
| | Coefficient | Odds_Ratio | CI_Lower | CI_Upper |
|---|---|---|---|---|
| education | 0.362 | 1.437 | 1.255 | 1.653 |
Interpretation:
The odds ratio of 1.437 means that for each additional year of education, the odds of employment increase by 43.7%.
Create a sequence of education values and calculate predicted probabilities using these values:
# Create sequence of education values
edu_seq <- seq(min(education), max(education), length.out = 100)
# Calculate predicted probabilities
pred_prob <- predict(model,
newdata = data.frame(education = edu_seq),
type = "response")Calculate odds:
Create a plot to compare odds and odds ratios:
# Create plot data
plot_data <- data.frame(
Education = edu_seq,
Probability = pred_prob,
Odds = odds
)
# Plot both probability and odds
p1 <- ggplot(plot_data, aes(x = Education, y = Probability)) +
geom_line() +
labs(title = "Predicted Probability vs. Education",
y = "Probability of Employment") +
theme_minimal()
p2 <- ggplot(plot_data, aes(x = Education, y = Odds)) +
geom_line() +
labs(title = "Odds vs. Education",
y = "Odds of Employment") +
theme_minimal()
# Display plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)
Odds ratios multiply for unit changes in X
For k-unit change: \(OR^k\)
Example: 2-year increase in education: \(OR = (e^{\beta_1})^2\)
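For example, using the estimated education odds ratio from above:
# odds ratio for a 2-year increase in education
odds_ratio^2   # 1.437^2, approx. 2.06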
OR > 1: Positive association
OR < 1: Negative association
OR = 1: No association
Reciprocal for the opposite comparison: the OR for a one-unit decrease in \(X\) is \(1/e^{\beta_1}\)
Key Differences:
Odds ratio remains constant
Probability changes vary by education level
Maximum probability change occurs at P = 0.5
# Function to calculate probability change
calc_prob_change <- function(x, beta0 = -2, beta1 = 0.3) {
p1 <- 1/(1 + exp(-(beta0 + beta1*(x+1))))
p0 <- 1/(1 + exp(-(beta0 + beta1*x)))
return(p1 - p0)
}
# Calculate probability changes at different education levels
edu_levels <- seq(10, 20, by = 2)
prob_changes <- sapply(edu_levels, calc_prob_change)
# Create comparison table
comparison <- data.frame(
Education_Level = edu_levels,
Probability_Change = prob_changes,
Odds_Ratio = rep(odds_ratio, length(edu_levels))
)
kable(comparison, digits = 3)
| Education_Level | Probability_Change | Odds_Ratio |
|---|---|---|
| 10 | 0.055 | 1.437 |
| 12 | 0.038 | 1.437 |
| 14 | 0.024 | 1.437 |
| 16 | 0.014 | 1.437 |
| 18 | 0.008 | 1.437 |
| 20 | 0.005 | 1.437 |
Odds can exceed 1
Probability is bounded [0,1]
Relationship: \(P = \frac{Odds}{1 + Odds}\) (see the conversion sketch below)
“Times more likely” refers to odds, not probability
Don't confuse percent changes in odds with percent changes in probability
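A minimal sketch of converting between the two scales, using the relationship above:
# converting between probability and odds
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds(0.75)   # 3
odds_to_prob(3)      # 0.75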
When to Use Ordinal Logistic Regression:
Dependent variable has more than two ordered categories
Assumes proportional odds across categories
Examples:
Education levels (High School, Bachelors, Masters, PhD)
Survey responses (Strongly Disagree to Strongly Agree)
Credit ratings (AAA, AA, A, BBB, etc.)
Customer satisfaction surveys
Product quality control
Also suitable for ordered outcomes like:
Educational grades
Disease severity
\[ P(Y \leq j|X) = \frac{e^{\alpha_j - X\beta}}{1 + e^{\alpha_j - X\beta}} \]
where:
\(j\) is the category
\(\alpha_j\) are the cutpoints
\(\beta\) are the coefficients
X are the predictors
Generate a sample of 1000 observations.
The variables in the sample will be salary,
work_hours, experience,
training.
# Generate sample data
n <- 1000
# Create base variables
salary <- rnorm(n, mean = 50000, sd = 10000)
work_hours <- rnorm(n, mean = 40, sd = 5)
experience <- rpois(n, lambda = 10)
training <- rbinom(n, 1, 0.3)
# Create the data frame first with raw values
mydata <- data.frame(
salary = salary,
work_hours = work_hours,
experience = experience,
training = training
)
Standardized versions of salary and work_hours (salary_std, work_hours_std) and an ordered satisfaction factor are then added in a step not shown here; str(mydata) then gives:
## 'data.frame': 1000 obs. of 7 variables:
## $ salary : num 44395 47698 65587 50705 51293 ...
## $ work_hours : num 35 34.8 39.9 39.3 27.3 ...
## $ experience : int 8 10 8 13 10 8 14 9 12 9 ...
## $ training : int 1 0 0 1 0 1 0 0 0 0 ...
## $ salary_std : num -0.5814 -0.2484 1.5555 0.0548 0.1141 ...
## $ work_hours_std: num -1.0283 -1.072 -0.0599 -0.173 -2.567 ...
## $ satisfaction : Ord.factor w/ 4 levels "Very Dissatisfied"<..: 1 2 1 3 2 1 2 1 1 1 ...
Now that we have generated our sample, we can estimate the model with ordinal logit.
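The estimation call itself is not shown above; a sketch of what produced the output below, assuming polr() from the MASS package and a model object named model:
# fit the proportional-odds (ordinal logit) model; polr() is from MASS
library(MASS)
model <- polr(satisfaction ~ salary_std + experience + work_hours_std + training,
              data = mydata, method = "logistic")
summary(model)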
## Call:
## polr(formula = satisfaction ~ salary_std + experience + work_hours_std +
## training, data = mydata, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## salary_std 0.42255 0.06232 6.7807
## experience 0.18257 0.02064 8.8462
## work_hours_std -0.01522 0.06130 -0.2483
## training 0.96524 0.13310 7.2519
##
## Intercepts:
## Value Std. Error t value
## Very Dissatisfied|Dissatisfied 1.8396 0.2184 8.4241
## Dissatisfied|Satisfied 3.7021 0.2434 15.2103
## Satisfied|Very Satisfied 5.5191 0.2891 19.0880
##
## Residual Deviance: 2118.656
## AIC: 2132.656
The coefficient estimates with p-values:
| | Value | Std. Error | t value | p value |
|---|---|---|---|---|
| salary_std | 0.423 | 0.062 | 6.781 | 0.000 |
| experience | 0.183 | 0.021 | 8.846 | 0.000 |
| work_hours_std | -0.015 | 0.061 | -0.248 | 0.804 |
| training | 0.965 | 0.133 | 7.252 | 0.000 |
| Very Dissatisfied\|Dissatisfied | 1.840 | 0.218 | 8.424 | 0.000 |
| Dissatisfied\|Satisfied | 3.702 | 0.243 | 15.210 | 0.000 |
| Satisfied\|Very Satisfied | 5.519 | 0.289 | 19.088 | 0.000 |
# Calculate odds ratios and confidence intervals
odds_ratios <- exp(coef(model))
ci <- exp(confint(model))
# Combine results
or_results <- data.frame(
OR = odds_ratios,
CI_lower = ci[,1],
CI_upper = ci[,2]
)
kable(or_results, digits = 3)
| | OR | CI_lower | CI_upper |
|---|---|---|---|
| salary_std | 1.526 | 1.351 | 1.726 |
| experience | 1.200 | 1.153 | 1.250 |
| work_hours_std | 0.985 | 0.873 | 1.111 |
| training | 2.625 | 2.024 | 3.411 |
# Create interpretation table
interpret <- data.frame(
Variable = names(odds_ratios),
OR = odds_ratios,
Interpretation = paste("For one unit increase,",
"odds of higher satisfaction",
ifelse(odds_ratios > 1, "increase", "decrease"),
"by",
round(abs(1-odds_ratios)*100, 1),
"%")
)
kable(interpret, digits = 3)
| | Variable | OR | Interpretation |
|---|---|---|---|
| salary_std | salary_std | 1.526 | For one unit increase, odds of higher satisfaction increase by 52.6 % |
| experience | experience | 1.200 | For one unit increase, odds of higher satisfaction increase by 20 % |
| work_hours_std | work_hours_std | 0.985 | For one unit increase, odds of higher satisfaction decrease by 1.5 % |
| training | training | 2.625 | For one unit increase, odds of higher satisfaction increase by 162.5 % |
The ordinal logit model can be expressed as:
\[ \log\left(\frac{P(Y_i \leq j)}{P(Y_i > j)}\right) = \alpha_j - X_i\beta \]
Where:
\(Y_i\) is the response for observation \(i\)
\(j\) represents the category threshold
\(X_i\) are the predictor variables
\(\beta\) are the coefficients
\(\alpha_j\) are the threshold parameters
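Individual category probabilities then follow as differences of adjacent cumulative probabilities: \[ P(Y_i = j) = \Lambda(\alpha_j - X_i\beta) - \Lambda(\alpha_{j-1} - X_i\beta), \] where \(\Lambda\) is the logistic CDF and \(\alpha_0 = -\infty\), \(\alpha_J = +\infty\) for the lowest and highest categories.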
temp: Temperature in Celsius (15-25°C)
Represents the fermentation/storage temperature of the wine
Range: 15.0°C to 24.9°C
Median: 19.8°C
contact: Contact time in hours (1-24 hours)
Represents the time the wine spends in contact with grape skins/oak
Range: 1 to 24 hours
Median: 12 hours
rating: Wine quality rating (1-5 scale)
1: Poor
2: Fair
3: Good
4: Very Good
5: Excellent
The data simulates a wine tasting experiment where different combinations of temperature and contact time are tested to see their effect on wine quality ratings.
The ratings are weighted to follow a roughly normal distribution centered around rating 3 (Good).
## temp contact rating
## 1 17.9 6 1
## 2 22.9 23 3
## 3 19.1 15 1
## 4 23.8 13 2
## 5 24.4 10 3
## 6 15.5 21 2
# Fit ordinal logistic regression model
model <- polr(rating ~ temp + contact, data = wine, Hess = TRUE)
# Display summary
summary(model)
## Call:
## polr(formula = rating ~ temp + contact, data = wine, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## temp -0.03336 0.04767 -0.6997
## contact -0.01374 0.01880 -0.7306
##
## Intercepts:
## Value Std. Error t value
## 1|2 -3.1019 1.0385 -2.9868
## 2|3 -1.7432 1.0218 -1.7061
## 3|4 0.0281 1.0164 0.0276
## 4|5 1.0191 1.0204 0.9987
##
## Residual Deviance: 587.4065
## AIC: 599.4065
Negative coefficients indicate lower probability of higher ratings
Positive coefficients indicate higher probability of higher ratings
For a one-unit increase in a predictor, the effect operates on the cumulative log-odds: \[\log\left(\frac{P(Y \leq j)}{P(Y > j)}\right) = \alpha_j - \beta X\]
# extract the slope coefficients
coef(model)
## temp contact
## -0.03335622 -0.01373835
# Calculate odds ratios and confidence intervals
ci <- confint(model)
odds_ratios <- exp(coef(model))
ci_exp <- exp(ci)
# Combine results
or_table <- cbind(OR = odds_ratios,
LowerCI = ci_exp[,1],
UpperCI = ci_exp[,2])
print(or_table)
## OR LowerCI UpperCI
## temp 0.9671940 0.8807487 1.062027
## contact 0.9863556 0.9505159 1.023337
# Create new data for predictions
new_data <- expand.grid(
temp = c(min(wine$temp), mean(wine$temp), max(wine$temp)),
contact = c(min(wine$contact), mean(wine$contact), max(wine$contact))
)
# Get predictions
preds <- predict(model, newdata = new_data, type = "probs")
round(preds, 3)
## 1 2 3 4 5
## 1 0.070 0.156 0.406 0.190 0.178
## 2 0.082 0.176 0.413 0.175 0.154
## 3 0.095 0.195 0.416 0.160 0.134
## 4 0.081 0.174 0.413 0.177 0.156
## 5 0.094 0.194 0.416 0.161 0.135
## 6 0.109 0.213 0.414 0.146 0.117
## 7 0.093 0.193 0.416 0.162 0.136
## 8 0.109 0.213 0.414 0.146 0.117
## 9 0.125 0.233 0.408 0.132 0.102
The model AIC:
AIC(model)
## [1] 599.4065
# Deviance-based check of fit (a rough diagnostic; a formal proportional
# odds test would use e.g. the Brant test)
sf <- summary(model)
pval <- pchisq(sf$deviance, sf$df.residual, lower.tail = FALSE)
cat("Proportional Odds Test p-value:", pval)
## Proportional Odds Test p-value: 3.467484e-41
Ordinal logistic regression preserves order information
Interpretation through:
Log-odds coefficients
Odds ratios
Predicted probabilities
Important assumptions:
Proportional odds
No multicollinearity
Linear Interpretation
Logit effects are non-linear
Marginal effects vary across X
Significance vs. Magnitude
Statistical significance ≠ Practical significance
Consider effect sizes
Sample Specific
Effects depend on sample characteristics
Be careful with extrapolation
Discrete Changes
For binary/categorical variables
Different from continuous marginal effects
Standard Errors
Delta method
Bootstrap methods
Model Diagnostics
Goodness of fit
Classification accuracy
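As an illustration of the last point, a minimal sketch of in-sample classification accuracy, refitting the education/employment logit simulated earlier (named logit_fit here to avoid clashing with the wine model):
# refit the education/employment logit and compute classification accuracy
logit_fit <- glm(employed ~ education, family = binomial(link = "logit"), data = data)
phat <- predict(logit_fit, type = "response")           # predicted probabilities
pred_class <- as.numeric(phat > 0.5)                    # classify at a 0.5 cutoff
table(Predicted = pred_class, Actual = data$employed)   # confusion matrix
mean(pred_class == data$employed)                       # overall accuracy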