Many economic decisions are binary in nature:
Labor force participation (Yes/No)
Purchase decisions (Buy/Not Buy)
Default on loans (Default/Not Default)
Migration decisions (Move/Stay)
Consider the following linear regression model:
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i \]
where \(P(Y=1 \mid X_1,X_2,\dots,X_k) = \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + \dots + \beta_k X_{k}\).
\[ E(Y)= 1 \cdot Pr(Y=1) + 0 \cdot Pr(Y=0) = Pr(Y=1) \]
In the multiple regression model with a binary dependent variable, we have \[ E(Y_i | X_{1i},\dots,X_{ki})=Pr(Y_i=1 | X_{1i},\dots,X_{ki}) = \beta_0 + \beta_1 X_{1i} + \dots+\beta_k X_{ki} \] This means that \(\beta_j\) can be interpreted as the change in the probability that \(Y_i=1\), holding the other \(k-1\) regressors constant. Just as in ordinary multiple regression, the \(\beta_j\) can be estimated using OLS, and the robust standard error formulas can be used for hypothesis testing and the computation of confidence intervals.
In other words: the population coefficient \(\beta_j\) equals the change in the probability that \(Y_i=1\) associated with a unit change in \(X_j\): \[ \frac{ \partial Pr(Y_i=1 | X_{1i},\dots,X_{ki}) }{\partial X_j } = \beta_j \]
Most individuals who want to buy a house apply for a mortgage at a bank, but not all mortgage applications are approved.
What determines whether a mortgage application is approved or denied?
deny=1 if mortgage application is denied
pi_ratio (pirat): anticipated monthly loan payments divided by monthly income
black=1 if applicant is black, =0 if applicant is white
Research Question: Does the payment-to-income ratio affect whether a mortgage application is denied?
## deny pirat hirat lvrat chist mhist phist unemp selfemp insurance condomin
## 1 no 0.221 0.221 0.8000000 5 2 no 3.9 no no no
## 2 no 0.265 0.265 0.9218750 2 2 no 3.2 no no no
## 3 no 0.372 0.248 0.9203980 1 2 no 3.2 no no no
## 4 no 0.320 0.250 0.8604651 1 2 no 4.3 no no no
## 5 no 0.360 0.350 0.6000000 1 1 no 3.2 no no no
## 6 no 0.240 0.170 0.5105263 1 1 no 3.9 no no no
## afam single hschool
## 1 no no yes
## 2 no yes yes
## 3 no no yes
## 4 no no yes
## 5 no no yes
## 6 no no yes
## deny pirat hirat lvrat chist
## no :2095 Min. :0.0000 Min. :0.0000 Min. :0.0200 1:1353
## yes: 285 1st Qu.:0.2800 1st Qu.:0.2140 1st Qu.:0.6527 2: 441
## Median :0.3300 Median :0.2600 Median :0.7795 3: 126
## Mean :0.3308 Mean :0.2553 Mean :0.7378 4: 77
## 3rd Qu.:0.3700 3rd Qu.:0.2988 3rd Qu.:0.8685 5: 182
## Max. :3.0000 Max. :3.0000 Max. :1.9500 6: 201
## mhist phist unemp selfemp insurance condomin
## 1: 747 no :2205 Min. : 1.800 no :2103 no :2332 no :1694
## 2:1571 yes: 175 1st Qu.: 3.100 yes: 277 yes: 48 yes: 686
## 3: 41 Median : 3.200
## 4: 21 Mean : 3.774
## 3rd Qu.: 3.900
## Max. :10.600
## afam single hschool
## no :2041 no :1444 no : 39
## yes: 339 yes: 936 yes:2341
We want to model the variable deny, an indicator for
whether an applicant’s mortgage application has been accepted (deny =
no) or denied (deny = yes).
Specifically we want to model the relationship between
deny and pirat (anticipated monthly payments
as a ratio of income). It is straightforward to translate this into the
following simple regression model:
\[ deny = \beta_0 + \beta_1 \cdot P/I \ ratio + u \]
Estimate this model just like any other linear regression using
lm().
# load the AER package, which provides the HMDA data (and loads lmtest/sandwich)
library(AER)
data(HMDA)
# convert 'deny' to numeric
HMDA$deny <- as.numeric(HMDA$deny) - 1
# estimate a simple linear probability model
denymod1 <- lm(deny ~ pirat, data = HMDA)
denymod1
##
## Call:
## lm(formula = deny ~ pirat, data = HMDA)
##
## Coefficients:
## (Intercept) pirat
## -0.07991 0.60353
Now plot the data together with the estimated regression line and see how they compare:
# plot the data
plot(x = HMDA$pirat,
y = HMDA$deny,
main = "Scatterplot Mortgage Application Denial and
the Payment-to-Income Ratio",
xlab = "P/I ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.8)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add the estimated regression line
abline(denymod1,
lwd = 1.8,
col = "steelblue")Results show that a payment to income ratio of 1 is associated with an expected probability of mortgage application denial of roughly 60 percent.
There is a positive relationship between the payment-to-income ratio and the probability of a denied mortgage application so individuals with a high ratio of loan payments to income are more likely to be rejected.
Based on the estimation results from the LPM, the coefficient on the P/I ratio is statistically significantly different from 0 at the 1 percent level.
Recall that the theoretical model is: \[ E(Y_i | X_{1i})=Pr(Y_i=1 | X_{1i}) = \beta_0 + \beta_1 X_{1i} \] where the only regressor is \(X_1\), the P/I ratio in the mortgage example.
The population coefficient \(\beta_1\) equals the change in the probability that \(Y_i=1\) associated with a unit change in \(X_1\): \[ \frac{ \partial Pr(Y_i=1 | X_{1i}) }{\partial X_1 } = \beta_1 \] In the mortgage application example, we obtained \(\hat{\beta}_1=0.60\).
For P/I ratios larger than about 1.8, the estimated model predicts a denial probability greater than 1!
For small values of the P/I ratio (below roughly 0.13), the predicted probability of denial is even negative, so the model has no meaningful interpretation there.
In the linear probability model the predicted probability can be below 0 or above 1!
So how should we interpret the coefficient estimate \(\hat{\beta}\) for the P/I ratio?
This is a major flaw of the linear probability model: it assumes that the conditional probability function is linear.
As a consequence, it does not restrict \(P(Y=1 | X_1,\dots,X_k)\) to lie between 0 and 1.
This circumstance calls for an approach that uses a nonlinear function to model the conditional probability function of a binary dependent variable.
Whenever the dependent variable is dichotomous, e.g. smoker vs. non-smoker or employed vs. unemployed, the linear probability model suffers from several problems:
Non-normality of the errors \(U_i\)
Heteroskedastic variances of the errors,
Non-fulfillment of \(0 \leq E(Y_i|X_i) \leq 1\),
Constant marginal effects,
Questionable value of \(R^2\) as a measure of goodness of fit.
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + u_i \]
The variance of a Bernoulli random variable is: \[ Var(Y)=Pr(Y=1) * [1- Pr(Y=1)] \] We can use this to find the conditional variance of the error term.
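Concretely, since \(Y_i\) given the regressors is Bernoulli with success probability \(p_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}\), the conditional variance of the error is \[ Var(u_i \mid X_{1i},\dots,X_{ki}) = p_i (1 - p_i), \] which varies with the regressors: the LPM errors are heteroskedastic by construction, which is why robust standard errors are needed.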
Probabilities cannot be less than 0 or greater than 1
To address this problem, we will consider nonlinear probability models
\[ Pr(Y_i=1) = G(Z) \] where \(Z=\beta_0 + \beta_1X_{1i} + \dots + \beta_k X_{ki}\) and \(0 \leq G(Z) \leq 1\).
We will consider two nonlinear functions:
Probit: \(G(Z)=\Phi(Z)\), the standard normal CDF.
Logit: \(G(Z)=\frac{1}{1+e^{-Z}}\).
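Both functions are available in base R; a minimal sketch of evaluating them on a grid:
# evaluate the two link functions G(Z) on a grid of Z values
z <- seq(-4, 4, by = 0.1)
G_probit <- pnorm(z)            # standard normal CDF, Phi(z)
G_logit  <- 1 / (1 + exp(-z))   # logistic CDF, identical to plogis(z)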
Probit regression models the probability that \(Y=1\) using the cumulative standard normal distribution function \(\Phi(Z)\), evaluated at \(Z=\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}\).
Since \(\Phi(z) = Pr(Z \leq z)\), the predicted probabilities of the probit model are between 0 and 1.
Example:
Suppose we have only one regressor and \(Z=-2 + 3 X_1\).
We want to know the probability that \(Y=1\) when \(X_1=0.4\).
\(z=-2 + 3*0.4 = -0.8\).
\(Pr(Y=1)=Pr(Z \leq -0.8) = \Phi(-0.8) = 0.2119\)
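The same calculation in R, using pnorm() for \(\Phi\):
# probability that Y = 1 when X1 = 0.4 and Z = -2 + 3 * X1
z <- -2 + 3 * 0.4
pnorm(z)   # 0.2118554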
The probit model is derived from an underlying latent variable model:
\[ y_i^* = x_i'\beta + \epsilon_i \]
where:
\(y_i^*\) is an unobserved latent variable
\(x_i\) is a vector of covariates
\(\beta\) is a vector of parameters
\(\epsilon_i \sim N(0,1)\) (standard normal)
We observe:
\[ y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\\\ 0 & \text{if } y_i^* \leq 0 \end{cases} \]
This leads to:
\[ P(y_i = 1|x_i) = P(y_i^* > 0|x_i) = P(x_i'\beta + \epsilon_i > 0) \] \[ = P(\epsilon_i > -x_i'\beta) = 1 - \Phi(-x_i'\beta) = \Phi(x_i'\beta) \]
where \(\Phi(\cdot)\) is the standard normal CDF and the last step uses its symmetry.
The probability that Y = 1 given X is: \[ P(Y = 1|X) = \Phi(X\beta) \] where:
\(\Phi\) is the cumulative distribution function of the standard normal distribution
\(\beta\) are the coefficients
X are the predictors
In probit regression, the cumulative standard normal distribution function \(\Phi(.)\) is used to model the regression function when the dependent variable is binary.
\[ E(Y | X) = P(Y=1 | X) = \Phi(\beta_0 + \beta_1 X) \]
where \(\beta_0 + \beta_1X\) plays the role of a quantile \(z\). Recall that \[ \Phi(z)=P(Z\leq z), Z \sim N(0,1) \]
The probit coefficient \(\beta_1\) is the change in \(z\) associated with a one-unit change in \(X\). Although the effect of a change in \(X\) on \(z\) is linear, the link between \(X\) and the dependent variable \(Y\) is nonlinear since \(\Phi\) is a nonlinear function of \(X\).
For a model with multiple regressors, the analysis proceeds in the same way.
For a binary variable \(Y\),
\[ Y=\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + u \] with \[ P(Y=1 | X_1,X_2,\dots,X_k) = \Phi(\beta_0 + \beta_1 X_1 + \dots+ \beta_k X_k) \] is the population Probit model with multiple regressors \(X_1, X_2, \dots, X_k\), where \(\Phi(\cdot)\) is the cumulative standard normal distribution function. We then calculate the predicted probability that \(Y=1\) given \(X_1,X_2,\dots,X_k\) in two steps:
Compute \(z= \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\)
Look up \(\Phi(z)\) by calling
pnorm().
In the above model, \(\beta_j\) is the effect on \(z\) of a one-unit change in regressor \(X_j\), holding constant all other \(k-1\) regressors.
Estimate the probit model using the mortgage data:
# estimate the simple probit model
denyprobit <- glm(deny ~ pirat,
family = binomial(link = "probit"),
data = HMDA)
coeftest(denyprobit, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.19415 0.18901 -11.6087 < 2.2e-16 ***
## pirat 2.96787 0.53698 5.5269 3.259e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
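With these estimates, the two-step calculation from above gives, for instance for an applicant with a P/I ratio of 0.3:
# step 1: compute z at pirat = 0.3 from the estimated coefficients
z <- -2.19415 + 2.96787 * 0.3
# step 2: look up the standard normal CDF
pnorm(z)   # approx. 0.096, a predicted denial probability of about 10 percent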
Plot the data with the estimated regression line:
# plot data
plot(x = HMDA$pirat,
y = HMDA$deny,
main = "Probit Model of the Probability of Denial, Given P/I Ratio",
xlab = "P/I ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.85)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(denyprobit, list(pirat = x), type = "response")
lines(x, y, lwd = 1.5, col = "steelblue")
The estimated regression function has a stretched “S”-shape which is typical for the CDF of a continuous random variable with a symmetric PDF, like that of a normal random variable.
The function is clearly nonlinear and flattens out for large and small values of P/I ratio.
The functional form thus also ensures that the predicted conditional probabilities of a denial lie between 0 and 1.
We can also include additional regressors. The output below shows results from a linear probability model regressing deny on pirat and black (= 1 if the applicant is black, note the t test of an lm() fit):
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.090514 0.033430 -2.7076 0.006826 **
## pirat 0.559195 0.103671 5.3939 7.575e-08 ***
## blackyes 0.177428 0.025055 7.0815 1.871e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Used when outcome variable is binary (0/1)
Common in economics for analyzing:
Labor force participation
Consumer choice
Market entry decisions
Default probability
\[ P(Y=1|X) = \frac{e^{X\beta}}{1 + e^{X\beta}} = \Lambda(X\beta) \]
Comparing Binary Choice Models:
Linear Probability Model (LPM): \[ P(Y=1|X) = X\beta \]
Probit Model:
\[ P(Y=1 | X) = \Phi(X \beta) \]
The logit model uses the CDF of a standard logistic distribution, whereas the probit model uses the standard normal distribution.
We can also rewrite the logit model as:
\[ \log\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right) = X\beta \]
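This follows directly from the model, since dividing \(P(Y=1|X)\) by \(1-P(Y=1|X)\) cancels the common denominator: \[ \frac{P(Y=1|X)}{1-P(Y=1|X)} = e^{X\beta} \quad\Rightarrow\quad \log\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right) = X\beta, \] so the logit coefficients act linearly on the log-odds scale.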
Estimate the logit model using our mortgage data:
denylogit <- glm(deny ~ pirat,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.02843 0.35898 -11.2218 < 2.2e-16 ***
## pirat 5.88450 1.00015 5.8836 4.014e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
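As in the probit case, predicted probabilities follow in two steps; plogis() is the logistic CDF:
# predicted denial probability at pirat = 0.3 from the estimated logit model
z <- -4.02843 + 5.88450 * 0.3
plogis(z)   # approx. 0.094, i.e. 1 / (1 + exp(-z))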
Plot the data with the estimated regression line:
# plot data
plot(x = HMDA$pirat,
y = HMDA$deny,
main = "Probit and Logit Models of the Probability of Denial, Given P/I Ratio",
xlab = "P/I ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.9)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_probit <- predict(denyprobit, list(pirat = x), type = "response")
y_logit <- predict(denylogit, list(pirat = x), type = "response")
lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)
# add a legend
legend("topleft",
horiz = TRUE,
legend = c("Probit", "Logit"),
col = c("steelblue", "black"),
       lty = c(1, 2))
Now also estimate a Logit regression with multiple regressors:
# estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ pirat + black,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.12556 0.34597 -11.9245 < 2.2e-16 ***
## pirat 5.37036 0.96376 5.5723 2.514e-08 ***
## blackyes 1.27278 0.14616 8.7081 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compute predictions for a P/I ratio of 0.3 and the difference in denial probabilities between black and white applicants:
# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denylogit2,
newdata = data.frame("black" = c("no", "yes"),
"pirat" = c(0.3, 0.3)),
type = "response")
predictions
## 1 2
## 0.07485143 0.22414592
# 2. compute the difference in denial probabilities
diff(predictions)
## 2
## 0.1492945
As in the Probit model, all model coefficients are highly significant, and the estimate for the coefficient on the P/I ratio is positive.
We also computed the predicted probability of denial for two hypothetical applicants that differ in race and have the same P/I ratio 0.3.
White applicants face a denial probability of only 7.5 percent, while black applicants are rejected with a probability of 22.4 percent, a difference of about 14.9 percentage points.
The Probit model and the Logit model deliver only approximations to the unknown population regression function \(E(Y | X)\).
It is not obvious how to decide which model to use in practice.
The linear probability model has the clear drawback of not being able to capture the nonlinear nature of the population regression function, and it may predict probabilities outside the interval \([0,1]\).
Probit and Logit models are harder to interpret but capture the nonlinearities better than the linear approach: both produce predicted probabilities that lie inside the interval \([0,1]\).
Predictions of all three models are often close to each other. There is no one method that is always preferred over the other.
The logistic regression model is specified as:
\[ P(Y=1|X) = \frac{e^{X\beta}}{1 + e^{X\beta}} = \Lambda(X\beta) \]
where:
\(Y\) is the binary outcome
\(X\) is the vector of covariates
\(\beta\) are the coefficients
\(\Lambda(\cdot)\) is the logistic cumulative distribution function
Generate a sample of size \(n=2000\) with the variables education, experience, female and age.
# Generate sample data
set.seed(123)
n <- 2000
# Generate explanatory variables
data <- data.frame(
education = rnorm(n, mean = 16, sd = 2),
experience = rnorm(n, mean = 10, sd = 5),
female = rbinom(n, 1, 0.5),
age = rnorm(n, mean = 40, sd = 10)
)
Generate a variable for employment status:
# Generate probability of working
z <- with(data,
-2 + 0.3*education + 0.2*experience - 0.5*female - 0.02*age)
prob <- 1/(1 + exp(-z))
data$working <- rbinom(n, 1, prob)
Use the simulated sample to estimate a logit model where
working is the dependent variable:
# Estimate logit model
logit_model <- glm(working ~ education + experience + female + age,
family = binomial(link = "logit"),
                    data = data)
Get the log-odds coefficients with summary():
summary(logit_model)
##
## Call:
## glm(formula = working ~ education + experience + female + age,
## family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.48735 1.15126 -2.161 0.0307 *
## education 0.25462 0.06464 3.939 8.19e-05 ***
## experience 0.21925 0.02756 7.954 1.80e-15 ***
## female -0.42654 0.25740 -1.657 0.0975 .
## age 0.01027 0.01276 0.804 0.4212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 600.21 on 1999 degrees of freedom
## Residual deviance: 512.37 on 1995 degrees of freedom
## AIC: 522.37
##
## Number of Fisher Scoring iterations: 7
Interpretation of the raw coefficients:
For a one-unit increase in X, the change in the log-odds of Y=1 is given by the coefficient estimate.
Example: the education coefficient of 0.255 means that each additional year of education increases the log-odds of working by 0.255, holding the other regressors constant.
# Calculate odds ratios
odds_ratios <- exp(coef(logit_model))
odds_ratios_ci <- exp(confint(logit_model))
# Combine results
odds_results <- data.frame(
OR = odds_ratios,
CI_lower = odds_ratios_ci[,1],
CI_upper = odds_ratios_ci[,2]
)
library(knitr)   # for kable()
kable(odds_results, digits = 3)
| | OR | CI_lower | CI_upper |
|---|---|---|---|
| (Intercept) | 0.083 | 0.009 | 0.798 |
| education | 1.290 | 1.138 | 1.466 |
| experience | 1.245 | 1.181 | 1.316 |
| female | 0.653 | 0.391 | 1.077 |
| age | 1.010 | 0.985 | 1.036 |
What are Odds?
Odds represent the ratio of the probability of success to the probability of failure:
\[ Odds = \frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1-P(Y=1)} \]
Example:
If \(P(Y=1) = 0.75\)
Then \(Odds = 0.75/(1-0.75) = 3\)
Interpretation: the event is 3 times as likely to occur as not to occur
What are Odds Ratios?
The odds ratio is the ratio of two odds:
\[ OR = \frac{Odds_1}{Odds_0} = \frac{P(Y=1|X=x+1)/P(Y=0|X=x+1)}{P(Y=1|X=x)/P(Y=0|X=x)} \]
The logistic regression model is:
\[ P(Y=1|X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}} \]
Deriving the Odds Ratio:
Write out the odds for X = x+1: \[ Odds_1 = \frac{P(Y=1|X=x+1)}{1-P(Y=1|X=x+1)} = e^{\beta_0 + \beta_1(x+1)} \]
Write out the odds for X = x: \[ Odds_0 = \frac{P(Y=1|X=x)}{1-P(Y=1|X=x)} = e^{\beta_0 + \beta_1x} \]
Take the ratio: \[ OR = \frac{Odds_1}{Odds_0} = \frac{e^{\beta_0 + \beta_1(x+1)}}{e^{\beta_0 + \beta_1x}} = e^{\beta_1} \]
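The average marginal effects (AMEs) reported below can be obtained, for example, with the margins package; a sketch, assuming that package (any AME routine would do):
# average marginal effects of the logit model (assumes the 'margins' package)
library(margins)
summary(margins(logit_model))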
## factor AME SE z p lower upper
## age 0.0003 0.0004 0.8026 0.4222 -0.0005 0.0011
## education 0.0079 0.0021 3.7487 0.0002 0.0038 0.0120
## experience 0.0068 0.0010 6.6597 0.0000 0.0048 0.0088
## female -0.0132 0.0081 -1.6413 0.1007 -0.0291 0.0026
Interpretation:
Shows change in probability for a one-unit change in X
Evaluated at observed values and averaged
Example: the education AME of 0.008 means that, on average, one more year of education raises the probability of working by about 0.8 percentage points.
# Create prediction data
new_data <- expand.grid(
education = seq(min(data$education),
max(data$education),
length.out = 100),
experience = mean(data$experience),
female = c(0, 1),
age = mean(data$age)
)
# Generate predictions
new_data$pred_prob <- predict(logit_model,
newdata = new_data,
type = "response")
# Plot (uses ggplot2)
library(ggplot2)
ggplot(new_data, aes(x = education, y = pred_prob, color = factor(female))) +
geom_line() +
labs(title = "Predicted Probability of Working by Education and Gender",
x = "Years of Education",
y = "Probability of Working",
color = "Female") +
  theme_minimal()
# Generate sample data
set.seed(123)
n <- 1000
# Create education variable
education <- rnorm(n, mean = 16, sd = 2)
# Generate probability of employment
z <- -2 + 0.3 * education
prob <- 1/(1 + exp(-z))
employed <- rbinom(n, 1, prob)
# Create dataset
data <- data.frame(
education = education,
employed = employed
)
# Fit logistic regression
model <- glm(employed ~ education,
family = binomial(link = "logit"),
             data = data)
# Calculate odds ratio for education
beta <- coef(model)[2] # coefficient for education
odds_ratio <- exp(beta)
# Calculate confidence interval
ci <- exp(confint(model))[2,]
# Create results table
results <- data.frame(
Coefficient = beta,
Odds_Ratio = odds_ratio,
CI_Lower = ci[1],
CI_Upper = ci[2]
)
kable(results, digits = 3)
| | Coefficient | Odds_Ratio | CI_Lower | CI_Upper |
|---|---|---|---|---|
| education | 0.362 | 1.437 | 1.255 | 1.653 |
Interpretation:
The odds ratio of 1.437 means that for each additional year of education, the odds of employment increase by 43.7%.
Create a sequence of education values and calculate predicted probabilities using these values:
# Create sequence of education values
edu_seq <- seq(min(education), max(education), length.out = 100)
# Calculate predicted probabilities
pred_prob <- predict(model,
newdata = data.frame(education = edu_seq),
type = "response")Calculate odds:
Create a plot to compare odds and odds ratios:
# Create plot data
plot_data <- data.frame(
Education = edu_seq,
Probability = pred_prob,
Odds = odds
)
# Plot both probability and odds
p1 <- ggplot(plot_data, aes(x = Education, y = Probability)) +
geom_line() +
labs(title = "Predicted Probability vs. Education",
y = "Probability of Employment") +
theme_minimal()
p2 <- ggplot(plot_data, aes(x = Education, y = Odds)) +
geom_line() +
labs(title = "Odds vs. Education",
y = "Odds of Employment") +
theme_minimal()
# Display plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)
Odds ratios multiply for unit changes in X
For k-unit change: \(OR^k\)
Example: 2-year increase in education: \(OR = (e^{\beta_1})^2\)
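For example, using the estimated education odds ratio from above:
# odds ratio for a 2-year increase in education
odds_ratio^2   # 1.437^2, approx. 2.06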
OR > 1: Positive association
OR < 1: Negative association
OR = 1: No association
Reciprocal for the opposite comparison: the OR for a one-unit decrease in \(X\) is \(1/e^{\beta_1}\)
Key Differences:
Odds ratio remains constant
Probability changes vary by education level
Maximum probability change occurs at P = 0.5
# Function to calculate probability change
calc_prob_change <- function(x, beta0 = -2, beta1 = 0.3) {
p1 <- 1/(1 + exp(-(beta0 + beta1*(x+1))))
p0 <- 1/(1 + exp(-(beta0 + beta1*x)))
return(p1 - p0)
}
# Calculate probability changes at different education levels
edu_levels <- seq(10, 20, by = 2)
prob_changes <- sapply(edu_levels, calc_prob_change)
# Create comparison table
comparison <- data.frame(
Education_Level = edu_levels,
Probability_Change = prob_changes,
Odds_Ratio = rep(odds_ratio, length(edu_levels))
)
kable(comparison, digits = 3)
| Education_Level | Probability_Change | Odds_Ratio |
|---|---|---|
| 10 | 0.055 | 1.437 |
| 12 | 0.038 | 1.437 |
| 14 | 0.024 | 1.437 |
| 16 | 0.014 | 1.437 |
| 18 | 0.008 | 1.437 |
| 20 | 0.005 | 1.437 |
Odds can exceed 1
Probability is bounded [0,1]
Relationship: \(P = \frac{Odds}{1 + Odds}\) (see the conversion sketch below)
“Times more likely” refers to odds, not probability
Don't confuse percent changes in odds with percent changes in probability
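A minimal sketch of converting between the two scales, using the relationship above:
# converting between probability and odds
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds(0.75)   # 3
odds_to_prob(3)      # 0.75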
When to Use Ordinal Logistic Regression:
Dependent variable has more than two ordered categories
Assumes proportional odds across categories
Examples:
Education levels (High School, Bachelors, Masters, PhD)
Survey responses (Strongly Disagree to Strongly Agree)
Credit ratings (AAA, AA, A, BBB, etc.)
Customer satisfaction surveys
Product quality control
Also suitable for ordered outcomes like:
Educational grades
Disease severity
\[ P(Y \leq j|X) = \frac{e^{\alpha_j - X\beta}}{1 + e^{\alpha_j - X\beta}} \]
where:
\(j\) is the category
\(\alpha_j\) are the cutpoints
\(\beta\) are the coefficients
X are the predictors
Generate a sample of 1000 observations.
The variables in the sample will be salary,
work_hours, experience,
training.
# Generate sample data
n <- 1000
# Create base variables
salary <- rnorm(n, mean = 50000, sd = 10000)
work_hours <- rnorm(n, mean = 40, sd = 5)
experience <- rpois(n, lambda = 10)
training <- rbinom(n, 1, 0.3)
# Create the data frame first with raw values
mydata <- data.frame(
salary = salary,
work_hours = work_hours,
experience = experience,
training = training
)
Standardized versions of salary and work_hours (salary_std, work_hours_std) and an ordered satisfaction factor are then added in a step not shown here; str(mydata) then gives:
## 'data.frame': 1000 obs. of 7 variables:
## $ salary : num 44395 47698 65587 50705 51293 ...
## $ work_hours : num 35 34.8 39.9 39.3 27.3 ...
## $ experience : int 8 10 8 13 10 8 14 9 12 9 ...
## $ training : int 1 0 0 1 0 1 0 0 0 0 ...
## $ salary_std : num -0.5814 -0.2484 1.5555 0.0548 0.1141 ...
## $ work_hours_std: num -1.0283 -1.072 -0.0599 -0.173 -2.567 ...
## $ satisfaction : Ord.factor w/ 4 levels "Very Dissatisfied"<..: 1 2 1 3 2 1 2 1 1 1 ...
Now that we have generated our sample, we can estimate the model with ordinal logit.
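The estimation call itself is not shown above; a sketch of what produced the output below, assuming polr() from the MASS package and a model object named model:
# fit the proportional-odds (ordinal logit) model; polr() is from MASS
library(MASS)
model <- polr(satisfaction ~ salary_std + experience + work_hours_std + training,
              data = mydata, method = "logistic")
summary(model)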
## Call:
## polr(formula = satisfaction ~ salary_std + experience + work_hours_std +
## training, data = mydata, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## salary_std 0.42255 0.06232 6.7807
## experience 0.18257 0.02064 8.8462
## work_hours_std -0.01522 0.06130 -0.2483
## training 0.96524 0.13310 7.2519
##
## Intercepts:
## Value Std. Error t value
## Very Dissatisfied|Dissatisfied 1.8396 0.2184 8.4241
## Dissatisfied|Satisfied 3.7021 0.2434 15.2103
## Satisfied|Very Satisfied 5.5191 0.2891 19.0880
##
## Residual Deviance: 2118.656
## AIC: 2132.656
The coefficient estimates with p-values:
| | Value | Std. Error | t value | p value |
|---|---|---|---|---|
| salary_std | 0.423 | 0.062 | 6.781 | 0.000 |
| experience | 0.183 | 0.021 | 8.846 | 0.000 |
| work_hours_std | -0.015 | 0.061 | -0.248 | 0.804 |
| training | 0.965 | 0.133 | 7.252 | 0.000 |
| Very Dissatisfied\|Dissatisfied | 1.840 | 0.218 | 8.424 | 0.000 |
| Dissatisfied\|Satisfied | 3.702 | 0.243 | 15.210 | 0.000 |
| Satisfied\|Very Satisfied | 5.519 | 0.289 | 19.088 | 0.000 |
# Calculate odds ratios and confidence intervals
odds_ratios <- exp(coef(model))
ci <- exp(confint(model))
# Combine results
or_results <- data.frame(
OR = odds_ratios,
CI_lower = ci[,1],
CI_upper = ci[,2]
)
kable(or_results, digits = 3)
| | OR | CI_lower | CI_upper |
|---|---|---|---|
| salary_std | 1.526 | 1.351 | 1.726 |
| experience | 1.200 | 1.153 | 1.250 |
| work_hours_std | 0.985 | 0.873 | 1.111 |
| training | 2.625 | 2.024 | 3.411 |
# Create interpretation table
interpret <- data.frame(
Variable = names(odds_ratios),
OR = odds_ratios,
Interpretation = paste("For one unit increase,",
"odds of higher satisfaction",
ifelse(odds_ratios > 1, "increase", "decrease"),
"by",
round(abs(1-odds_ratios)*100, 1),
"%")
)
kable(interpret, digits = 3)
| | Variable | OR | Interpretation |
|---|---|---|---|
| salary_std | salary_std | 1.526 | For one unit increase, odds of higher satisfaction increase by 52.6 % |
| experience | experience | 1.200 | For one unit increase, odds of higher satisfaction increase by 20 % |
| work_hours_std | work_hours_std | 0.985 | For one unit increase, odds of higher satisfaction decrease by 1.5 % |
| training | training | 2.625 | For one unit increase, odds of higher satisfaction increase by 162.5 % |
The ordinal logit model can be expressed as:
\[ \log\left(\frac{P(Y_i \leq j)}{P(Y_i > j)}\right) = \alpha_j - X_i\beta \]
Where:
\(Y_i\) is the response for observation \(i\)
\(j\) represents the category threshold
\(X_i\) are the predictor variables
\(\beta\) are the coefficients
\(\alpha_j\) are the threshold parameters
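Individual category probabilities then follow as differences of adjacent cumulative probabilities: \[ P(Y_i = j) = \Lambda(\alpha_j - X_i\beta) - \Lambda(\alpha_{j-1} - X_i\beta), \] where \(\Lambda\) is the logistic CDF and \(\alpha_0 = -\infty\), \(\alpha_J = +\infty\) for the lowest and highest categories.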
temp: Temperature in Celsius (15-25°C)
Represents the fermentation/storage temperature of the wine
Range: 15.0°C to 24.9°C
Median: 19.8°C
contact: Contact time in hours (1-24 hours)
Represents the time the wine spends in contact with grape skins/oak
Range: 1 to 24 hours
Median: 12 hours
rating: Wine quality rating (1-5 scale)
1: Poor
2: Fair
3: Good
4: Very Good
5: Excellent
The data simulates a wine tasting experiment where different combinations of temperature and contact time are tested to see their effect on wine quality ratings.
The ratings are weighted to follow a roughly normal distribution centered around rating 3 (Good).
## temp contact rating
## 1 17.9 6 1
## 2 22.9 23 3
## 3 19.1 15 1
## 4 23.8 13 2
## 5 24.4 10 3
## 6 15.5 21 2
# Fit ordinal logistic regression model
model <- polr(rating ~ temp + contact, data = wine, Hess = TRUE)
# Display summary
summary(model)
## Call:
## polr(formula = rating ~ temp + contact, data = wine, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## temp -0.03336 0.04767 -0.6997
## contact -0.01374 0.01880 -0.7306
##
## Intercepts:
## Value Std. Error t value
## 1|2 -3.1019 1.0385 -2.9868
## 2|3 -1.7432 1.0218 -1.7061
## 3|4 0.0281 1.0164 0.0276
## 4|5 1.0191 1.0204 0.9987
##
## Residual Deviance: 587.4065
## AIC: 599.4065
Negative coefficients indicate lower probability of higher ratings
Positive coefficients indicate higher probability of higher ratings
For a one-unit increase in a predictor, the effect operates on the cumulative log-odds: \[\log\left(\frac{P(Y \leq j)}{P(Y > j)}\right) = \alpha_j - \beta X\]
# extract the slope coefficients
coef(model)
## temp contact
## -0.03335622 -0.01373835
# Calculate odds ratios and confidence intervals
ci <- confint(model)
odds_ratios <- exp(coef(model))
ci_exp <- exp(ci)
# Combine results
or_table <- cbind(OR = odds_ratios,
LowerCI = ci_exp[,1],
UpperCI = ci_exp[,2])
print(or_table)
## OR LowerCI UpperCI
## temp 0.9671940 0.8807487 1.062027
## contact 0.9863556 0.9505159 1.023337
# Create new data for predictions
new_data <- expand.grid(
temp = c(min(wine$temp), mean(wine$temp), max(wine$temp)),
contact = c(min(wine$contact), mean(wine$contact), max(wine$contact))
)
# Get predictions
preds <- predict(model, newdata = new_data, type = "probs")
round(preds, 3)
## 1 2 3 4 5
## 1 0.070 0.156 0.406 0.190 0.178
## 2 0.082 0.176 0.413 0.175 0.154
## 3 0.095 0.195 0.416 0.160 0.134
## 4 0.081 0.174 0.413 0.177 0.156
## 5 0.094 0.194 0.416 0.161 0.135
## 6 0.109 0.213 0.414 0.146 0.117
## 7 0.093 0.193 0.416 0.162 0.136
## 8 0.109 0.213 0.414 0.146 0.117
## 9 0.125 0.233 0.408 0.132 0.102
The model AIC:
AIC(model)
## [1] 599.4065
# Deviance-based check of fit (a rough diagnostic; a formal proportional
# odds test would use e.g. the Brant test)
sf <- summary(model)
pval <- pchisq(sf$deviance, sf$df.residual, lower.tail = FALSE)
cat("Proportional Odds Test p-value:", pval)
## Proportional Odds Test p-value: 3.467484e-41
Ordinal logistic regression preserves order information
Interpretation through:
Log-odds coefficients
Odds ratios
Predicted probabilities
Important assumptions:
Proportional odds
No multicollinearity
Linear Interpretation
Logit effects are non-linear
Marginal effects vary across X
Significance vs. Magnitude
Statistical significance ≠ Practical significance
Consider effect sizes
Sample Specific
Effects depend on sample characteristics
Be careful with extrapolation
Discrete Changes
For binary/categorical variables
Different from continuous marginal effects
Standard Errors
Delta method
Bootstrap methods
Model Diagnostics
Goodness of fit
Classification accuracy
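As an illustration of the last point, a minimal sketch of in-sample classification accuracy, refitting the education/employment logit simulated earlier (named logit_fit here to avoid clashing with the wine model):
# refit the education/employment logit and compute classification accuracy
logit_fit <- glm(employed ~ education, family = binomial(link = "logit"), data = data)
phat <- predict(logit_fit, type = "response")           # predicted probabilities
pred_class <- as.numeric(phat > 0.5)                    # classify at a 0.5 cutoff
table(Predicted = pred_class, Actual = data$employed)   # confusion matrix
mean(pred_class == data$employed)                       # overall accuracy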