Binary Choice Models

Why Binary Choice Models?

Many economic decisions are binary in nature:

  1. Labor Economics
    • Employment decisions
    • Job search outcomes
  2. Industrial Organization
    • Market entry/exit
    • Technology adoption
  3. Financial Economics
    • Default prediction
    • Investment decisions

The Linear Probability Model (LPM)

Consider the following linear regression model:

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i \]

where \(P(Y=1 \mid X_1,X_2,...,X_k) = \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + \dots + \beta_k X_{k}\).

\[ E(Y)= 1*Pr(Y=1) + 0*Pr(Y=0) = Pr(Y=1) \]

In the multiple regression model with a binary dependent variable, we have \[ E(Y_i | X_{1i},....,X_{ki})=Pr(Y_i=1 | X_{1i},....,X_{ki}) = \beta_0 + \beta_1 X_{1i} + ...+\beta_k X_{ki} \] This means that \(\beta_j\) can be interpreted as the change in the probability that \(Y_i=1\), holding constant the other \(k-1\) regressors. Just as in ordinary multiple regression, the \(\beta_j\) can be estimated using OLS, and the robust standard error formulas can be used for hypothesis testing and computation of confidence intervals.

In other words: The population coefficient \(\beta_j\) equals the change in the probability that \(Y_i=1\) associated with a unit change in \(X_j\). \[ \frac{ \partial Pr(Y_i=1 | X_{1i},....,X_{ki}) }{\partial X_j } = \beta_j \]

An Application of LPM to Mortgage Data

# load `AER` package and attach the HMDA data
library(AER)
data(HMDA)

# inspect the data
head(HMDA)
##   deny pirat hirat     lvrat chist mhist phist unemp selfemp insurance condomin
## 1   no 0.221 0.221 0.8000000     5     2    no   3.9      no        no       no
## 2   no 0.265 0.265 0.9218750     2     2    no   3.2      no        no       no
## 3   no 0.372 0.248 0.9203980     1     2    no   3.2      no        no       no
## 4   no 0.320 0.250 0.8604651     1     2    no   4.3      no        no       no
## 5   no 0.360 0.350 0.6000000     1     1    no   3.2      no        no       no
## 6   no 0.240 0.170 0.5105263     1     1    no   3.9      no        no       no
##   afam single hschool
## 1   no     no     yes
## 2   no    yes     yes
## 3   no     no     yes
## 4   no     no     yes
## 5   no     no     yes
## 6   no     no     yes
summary(HMDA)
##   deny          pirat            hirat            lvrat        chist   
##  no :2095   Min.   :0.0000   Min.   :0.0000   Min.   :0.0200   1:1353  
##  yes: 285   1st Qu.:0.2800   1st Qu.:0.2140   1st Qu.:0.6527   2: 441  
##             Median :0.3300   Median :0.2600   Median :0.7795   3: 126  
##             Mean   :0.3308   Mean   :0.2553   Mean   :0.7378   4:  77  
##             3rd Qu.:0.3700   3rd Qu.:0.2988   3rd Qu.:0.8685   5: 182  
##             Max.   :3.0000   Max.   :3.0000   Max.   :1.9500   6: 201  
##  mhist    phist          unemp        selfemp    insurance  condomin  
##  1: 747   no :2205   Min.   : 1.800   no :2103   no :2332   no :1694  
##  2:1571   yes: 175   1st Qu.: 3.100   yes: 277   yes:  48   yes: 686  
##  3:  41              Median : 3.200                                   
##  4:  21              Mean   : 3.774                                   
##                      3rd Qu.: 3.900                                   
##                      Max.   :10.600                                   
##   afam      single     hschool   
##  no :2041   no :1444   no :  39  
##  yes: 339   yes: 936   yes:2341  
##                                  
##                                  
##                                  
## 

We want to model the variable deny, an indicator for whether an applicant’s mortgage application has been accepted (deny = no) or denied (deny = yes).

Specifically, we want to model the relationship between deny and pirat (anticipated monthly loan payments as a ratio of income). It is straightforward to translate this into the following simple regression model:

\[ deny = \beta_0 + \beta_1 \cdot \text{P/I ratio} + u \]

Estimate this model just like any other linear regression using lm().

# convert 'deny' to numeric
HMDA$deny <- as.numeric(HMDA$deny) - 1

# estimate a simple linear probability model
denymod1 <- lm(deny ~ pirat, data = HMDA)
denymod1
## 
## Call:
## lm(formula = deny ~ pirat, data = HMDA)
## 
## Coefficients:
## (Intercept)        pirat  
##    -0.07991      0.60353

Now plot the data together with the estimated regression line and see how they compare:

# plot the data
plot(x = HMDA$pirat, 
     y = HMDA$deny,
     main = "Scatterplot Mortgage Application Denial and 
                                    the Payment-to-Income Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.8)

# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")

# add the estimated regression line
abline(denymod1, 
       lwd = 1.8, 
       col = "steelblue")

Results show that a payment-to-income ratio of 1 is associated with an expected probability of mortgage application denial of roughly 50 percent (\(-0.080 + 0.604 \approx 0.52\)).

There is a positive relationship between the payment-to-income ratio and the probability of a denied mortgage application: individuals with a high ratio of loan payments to income are more likely to be rejected.

Based on the estimation results from the LPM, we reject the hypothesis that the coefficient on the P/I ratio is 0 at the 1 percent level.
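This can be verified with heteroskedasticity-robust standard errors, using the same coeftest() pattern applied to the probit and logit fits later in this section (the test output is not reproduced here):

# robust t test of the LPM coefficients
coeftest(denymod1, vcov. = vcovHC, type = "HC1")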

Limitations of the Linear Probability Model

Recall that the theoretical model is: \[ E(Y_i | X_{1i})=Pr(Y_i=1 | X_{1i}) = \beta_0 + \beta_1 X_{1i} \] where the only regressor is \(X_1\) which is the P/I ratio in the mortgage example.

The population coefficient \(\beta_j\) equals the change in the probability that \(Y_i=1\) associated with a unit change in \(X_j\). \[ \frac{ \partial Pr(Y_i=1 | X_{1i}) }{\partial X_1 } = \beta_1 \] In the mortgage application example, we got \(\hat{\beta_1}=0.60\).

For P/I ratios larger than roughly 1.8, the estimated model predicts that the probability of a mortgage application getting turned down is larger than 1!

For small values of the P/I ratio (below roughly 0.13, given the estimates above), the predicted probability of denial is negative, so the model has no meaningful interpretation there.

In the linear probability model the predicted probability can be below 0 or above 1!

So how should we interpret the coefficient estimate \(\hat{\beta}\) for the P/I ratio?

Limitations of the Linear Probability Model

This is a major flaw of the linear probability model: it assumes the conditional probability function to be linear.

In other words, it does not restrict \(P(Y=1 | X_1,..,X_k)\) to be between 0 and 1.

This circumstance calls for an approach that uses a nonlinear function to model the conditional probability function of a binary dependent variable.

Violation of the measurement scale: if the dependent variable is dichotomous, e.g. smoker and non-smoker, or employed and unemployed, fitted values outside the unit interval are not valid probabilities.

Limitations of the Linear Probability Model

Problems with LPM:

  1. Non-normality of the errors \(u_i\)

  2. Heteroskedastic variances of the errors,

  3. Non-fulfillment of \(0 \leq E(Y_i|X_i) \leq 1\),

  4. Constant marginal effects,

  5. Questionable value of \(R^2\) as a measure of goodness of fit.

Limitations of LPM: Heteroskedasticity

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + u_i \]

The variance of a Bernoulli random variable is: \[ Var(Y)=Pr(Y=1) * [1- Pr(Y=1)] \] We can use this to find the conditional variance of the error term.
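Concretely, since \(Y_i\) given the regressors is Bernoulli with success probability \(p_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}\), the conditional variance of the error term is \[ Var(u_i | X_{1i},\dots,X_{ki}) = p_i(1-p_i), \] which depends on the regressors. The LPM errors are therefore heteroskedastic by construction, which is why robust standard errors are used throughout this section.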

Alternatives to LPM: Nonlinear probability models

\[ Pr(Y_i=1) = G(Z) \] where \(Z=\beta_0 + \beta_1X_{1i} + .... + \beta_k X_{ki}\) and \(0 \leq G(Z) \leq 1\).

We will consider two nonlinear functions:

  1. Probit: \(G(Z)=\Phi(Z)\), the standard normal CDF.

  2. Logit: \(G(Z)=\frac{1}{1+e^{-Z}}\), the standard logistic CDF.

The Probit Model

Probit regression models the probability that \(Y=1\) using the cumulative standard normal distribution function.

The Probit Model: Latent Variable Approach

The probit model is derived from an underlying latent variable model:

\[ y_i^* = x_i'\beta + \epsilon_i \]

where \(y_i^*\) is an unobserved (latent) propensity, \(x_i\) is the vector of regressors, and \(\epsilon_i \sim N(0,1)\).

The Probit Model: Binary Outcome Rule

We observe:

\[ y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \leq 0 \end{cases} \]

This leads to:

\[ P(y_i = 1|x_i) = P(y_i^* > 0|x_i) = P(x_i'\beta + \epsilon_i > 0) \] \[ = P(\epsilon_i > -x_i'\beta) = \Phi(x_i'\beta) \]

where \(\Phi(\cdot)\) is the standard normal CDF; the last equality uses \(\epsilon_i \sim N(0,1)\) and the symmetry of the normal distribution, so that \(P(\epsilon_i > -x_i'\beta) = 1 - \Phi(-x_i'\beta) = \Phi(x_i'\beta)\).

The probability that \(Y = 1\) given \(X\) is: \[ P(Y = 1|X) = \Phi(X\beta) \] where \(X\beta\) is the linear index and \(\Phi(\cdot)\) is the standard normal CDF.

In probit regression, the cumulative standard normal distribution function \(\Phi(.)\) is used to model the regression function when the dependent variable is binary.

\[ E(Y | X) = P(Y=1 | X) = \Phi(\beta_0 + \beta_1 X) \]

where \(\beta_0 + \beta_1X\) plays the role of a quantile \(z\). Recall that \[ \Phi(z)=P(Z\leq z), Z \sim N(0,1) \]

The Probit Model: Binary Outcome Rule

For a binary variable \(Y\),

\[ Y=\beta_0 + \beta_1 X_1 + .... + \beta_k X_k + u \] with \[ P(Y=1 | X_1,X_2,...,X_k) = \Phi(\beta_0 + \beta_1 X_1 + ...+ \beta_k X_k) \] is the population Probit model with multiple regressors \(X_1, X_2, ..., X_k\) and \(\Phi(.)\) is the cumulative standard normal function. Then we calculate the predicted probability that \(Y=1\) given \(X_1,X_2,...,X_k\) in two steps:

  1. Compute \(z= \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\)

  2. Look up \(\Phi(z)\) by calling pnorm().

In the above model, \(\beta_j\) is the effect on \(z\) of a one-unit change in regressor \(X_j\), holding constant all other \(k-1\) regressors.
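As a quick illustration of the two steps with hypothetical coefficient values (these are not estimates from any model in this section):

# hypothetical coefficients, for illustration only
beta0 <- -2.2
beta1 <- 3.0

# step 1: compute the linear index z at X1 = 0.3
z <- beta0 + beta1 * 0.3

# step 2: evaluate the standard normal CDF at z
pnorm(z)  # predicted P(Y = 1 | X1 = 0.3), approximately 0.097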

Application of Probit to Mortgage Data

Estimate the probit model using the mortgage data:

# estimate the simple probit model
denyprobit <- glm(deny ~ pirat, 
                  family = binomial(link = "probit"), 
                  data = HMDA)

coeftest(denyprobit, vcov. = vcovHC, type = "HC1")
## 
## z test of coefficients:
## 
##             Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -2.19415    0.18901 -11.6087 < 2.2e-16 ***
## pirat        2.96787    0.53698   5.5269 3.259e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plot the data with the estimated regression line:

# plot data
plot(x = HMDA$pirat, 
     y = HMDA$deny,
     main = "Probit Model of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.85)

# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")

# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(denyprobit, list(pirat = x), type = "response")

lines(x, y, lwd = 1.5, col = "steelblue")

We can also add regressors to the model. The output below is from a linear probability model that regresses deny on pirat and black, an indicator for whether the applicant is African American (the t test and the coefficient magnitudes identify this as an LPM fit, shown here for comparison with the probit results).
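The chunk that produced this output is not shown. A likely reconstruction, assuming the HMDA variable afam is first renamed to black (the name used both in this output and in the logit code below):

# rename 'afam' to 'black' (assumed step; 'black' is used from here on)
colnames(HMDA)[colnames(HMDA) == "afam"] <- "black"

# linear probability model with the race indicator added
denymod2 <- lm(deny ~ pirat + black, data = HMDA)
coeftest(denymod2, vcov. = vcovHC, type = "HC1")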

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -0.090514   0.033430 -2.7076  0.006826 ** 
## pirat        0.559195   0.103671  5.3939 7.575e-08 ***
## blackyes     0.177428   0.025055  7.0815 1.871e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Logit Model

The Logistic Function

\[ P(Y=1|X) = \frac{e^{X\beta}}{1 + e^{X\beta}} = \Lambda(X\beta) \]

Linear Probability vs. Logit

Comparing Binary Choice Models:

  1. Linear Probability Model (LPM): \[ P(Y=1|X) = X\beta \]

  2. Probit Model: \[ P(Y=1 | X) = \Phi(X \beta) \]

  3. Logit Model: \[ P(Y=1|X) = \frac{e^{X\beta}}{1 + e^{X\beta}} = \Lambda(X\beta) \]

Logit and Probit models are similar. The main difference is that they use different CDFs.

The logit model uses the CDF of a standard logistic distribution, whereas the probit model uses the standard normal distribution.

We can also rewrite the logit model as:

\[ \log\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right) = X\beta \]
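This follows directly from the logistic form: the odds implied by the model are \[ \frac{P(Y=1|X)}{1-P(Y=1|X)} = \frac{e^{X\beta}/(1+e^{X\beta})}{1/(1+e^{X\beta})} = e^{X\beta}, \] and taking logs gives the log-odds form above.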

An Application of Logit to Mortgage Data

Estimate the logit model using our mortgage data:

denylogit <- glm(deny ~ pirat, 
                 family = binomial(link = "logit"), 
                 data = HMDA)

coeftest(denylogit, vcov. = vcovHC, type = "HC1")
## 
## z test of coefficients:
## 
##             Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -4.02843    0.35898 -11.2218 < 2.2e-16 ***
## pirat        5.88450    1.00015   5.8836 4.014e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plot the data with the estimated regression line:

# plot data
plot(x = HMDA$pirat, 
     y = HMDA$deny,
     main = "Probit and Logit Models of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.9)

# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")

# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_probit <- predict(denyprobit, list(pirat = x), type = "response")
y_logit <- predict(denylogit, list(pirat = x), type = "response")

lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)

# add a legend
legend("topleft",
       horiz = TRUE,
       legend = c("Probit", "Logit"),
       col = c("steelblue", "black"), 
       lty = c(1, 2))

Now also estimate a Logit regression with multiple regressors:

# estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ pirat + black, 
                  family = binomial(link = "logit"), 
                  data = HMDA)

coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
## 
## z test of coefficients:
## 
##             Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -4.12556    0.34597 -11.9245 < 2.2e-16 ***
## pirat        5.37036    0.96376   5.5723 2.514e-08 ***
## blackyes     1.27278    0.14616   8.7081 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compute predictions for P/I ratio = 0.3 and compute difference in probabilities:

# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denylogit2, 
                       newdata = data.frame("black" = c("no", "yes"), 
                                            "pirat" = c(0.3, 0.3)),
                       type = "response")

predictions
##          1          2 
## 0.07485143 0.22414592
# 2. Compute difference in probabilities
diff(predictions)
##         2 
## 0.1492945

As in the probit model, all coefficients are highly significant, and the estimate of the coefficient on the P/I ratio is positive.

We also computed the predicted probability of denial for two hypothetical applicants that differ in race but have the same P/I ratio of 0.3.

White applicants face a denial probability of only 7.5 percent, while black applicants are rejected with a probability of 22.4 percent.

Probit vs. Logit

As the plot above shows, probit and logit produce very similar predicted probabilities in practice; the choice between them is largely a matter of convention, and the logit model offers the convenient odds-ratio interpretation developed below.

Logit Model: Understanding the Coefficients

The logistic regression model is specified as:

\[ P(Y=1|X) = \frac{e^{X\beta}}{1 + e^{X\beta}} = \Lambda(X\beta) \]

where \(\Lambda(\cdot)\) is the CDF of the standard logistic distribution and \(X\beta\) is the linear index.

Example: Labor Market Participation

Generate a sample of size \(n=2000\) with variables education, experience, female, and age.

# Generate sample data
set.seed(123)
n <- 2000

# Generate explanatory variables
data <- data.frame(
  education = rnorm(n, mean = 16, sd = 2),
  experience = rnorm(n, mean = 10, sd = 5),
  female = rbinom(n, 1, 0.5),
  age = rnorm(n, mean = 40, sd = 10)
)

Generate a variable for employment status:

# Generate probability of working
z <- with(data, 
         -2 + 0.3*education + 0.2*experience - 0.5*female - 0.02*age)
prob <- 1/(1 + exp(-z))
data$working <- rbinom(n, 1, prob)
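A quick sanity check on the simulated outcome (this call and its result are not in the original; with these parameter values the average probability of working is quite high, which helps explain the small marginal effects reported later):

# share of simulated individuals who are working (expect roughly 0.95)
mean(data$working)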

Use the simulated sample to estimate a logit model where working is the dependent variable:

# Estimate logit model
logit_model <- glm(working ~ education + experience + female + age,
                   family = binomial(link = "logit"),
                   data = data)

Get the log-odds coefficients:

# Display coefficients
summary(logit_model)
## 
## Call:
## glm(formula = working ~ education + experience + female + age, 
##     family = binomial(link = "logit"), data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.48735    1.15126  -2.161   0.0307 *  
## education    0.25462    0.06464   3.939 8.19e-05 ***
## experience   0.21925    0.02756   7.954 1.80e-15 ***
## female      -0.42654    0.25740  -1.657   0.0975 .  
## age          0.01027    0.01276   0.804   0.4212    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 600.21  on 1999  degrees of freedom
## Residual deviance: 512.37  on 1995  degrees of freedom
## AIC: 522.37
## 
## Number of Fisher Scoring iterations: 7

Interpretation of the raw coefficients: each coefficient is the change in the log-odds of working associated with a one-unit increase in the corresponding regressor, holding the others constant. The magnitudes are not directly interpretable as probability changes.

Odds Ratios: Converting to Odds Ratios

# Calculate odds ratios (kable() below requires the knitr package)
library(knitr)
odds_ratios <- exp(coef(logit_model))
odds_ratios_ci <- exp(confint(logit_model))

# Combine results
odds_results <- data.frame(
  OR = odds_ratios,
  CI_lower = odds_ratios_ci[,1],
  CI_upper = odds_ratios_ci[,2]
)

kable(odds_results, digits = 3)
|             |    OR | CI_lower | CI_upper |
|:------------|------:|---------:|---------:|
| (Intercept) | 0.083 |    0.009 |    0.798 |
| education   | 1.290 |    1.138 |    1.466 |
| experience  | 1.245 |    1.181 |    1.316 |
| female      | 0.653 |    0.391 |    1.077 |
| age         | 1.010 |    0.985 |    1.036 |

Understanding Odds and Odds Ratios

What are Odds?

Odds represent the ratio of the probability of success to the probability of failure:

\[ Odds = \frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1-P(Y=1)} \]

Example: if \(P(Y=1)=0.8\), then \(P(Y=0)=0.2\) and the odds are \(0.8/0.2 = 4\), i.e., success is four times as likely as failure.

What are Odds Ratios?

The odds ratio is the ratio of two odds:

\[ OR = \frac{Odds_1}{Odds_0} = \frac{P(Y=1|X=x+1)/P(Y=0|X=x+1)}{P(Y=1|X=x)/P(Y=0|X=x)} \]

Deriving the Odds Ratios in Logistic Regression

The logistic regression model is:

\[ P(Y=1|X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}} \]

Deriving the Odds Ratio:

  1. Write out the odds for X = x+1: \[ Odds_1 = \frac{P(Y=1|X=x+1)}{1-P(Y=1|X=x+1)} = e^{\beta_0 + \beta_1(x+1)} \]

  2. Write out the odds for X = x: \[ Odds_0 = \frac{P(Y=1|X=x)}{1-P(Y=1|X=x)} = e^{\beta_0 + \beta_1x} \]

  3. Take the ratio: \[ OR = \frac{Odds_1}{Odds_0} = \frac{e^{\beta_0 + \beta_1(x+1)}}{e^{\beta_0 + \beta_1x}} = e^{\beta_1} \]
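A quick numerical check of this algebra, using hypothetical coefficients (plogis() is the logistic CDF in base R):

# hypothetical coefficients
beta0 <- -2
beta1 <- 0.3

# odds of success at a given x
odds_at <- function(x) {
  p <- plogis(beta0 + beta1 * x)  # P(Y = 1 | X = x)
  p / (1 - p)
}

odds_at(11) / odds_at(10)  # odds ratio for a one-unit increase
exp(beta1)                 # identical, as the derivation shows: about 1.35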

Average Marginal Effects (AME)

# Calculate average marginal effects (requires the margins package)
library(margins)
mfx <- margins(logit_model)
summary(mfx)
##      factor     AME     SE       z      p   lower  upper
##         age  0.0003 0.0004  0.8026 0.4222 -0.0005 0.0011
##   education  0.0079 0.0021  3.7487 0.0002  0.0038 0.0120
##  experience  0.0068 0.0010  6.6597 0.0000  0.0048 0.0088
##      female -0.0132 0.0081 -1.6413 0.1007 -0.0291 0.0026

Interpretation: each AME is the change in the probability of working for a one-unit increase in the regressor, averaged over the sample. For example, one additional year of education raises the probability of working by about 0.8 percentage points on average.

Visualizing Marginal Effects: Predicted Probabilities Across Education Levels

# Create prediction data
new_data <- expand.grid(
  education = seq(min(data$education), 
                 max(data$education), 
                 length.out = 100),
  experience = mean(data$experience),
  female = c(0, 1),
  age = mean(data$age)
)

# Generate predictions
new_data$pred_prob <- predict(logit_model, 
                            newdata = new_data, 
                            type = "response")

# Plot (requires ggplot2)
library(ggplot2)
ggplot(new_data, aes(x = education, y = pred_prob, color = factor(female))) +
  geom_line() +
  labs(title = "Predicted Probability of Working by Education and Gender",
       x = "Years of Education",
       y = "Probability of Working",
       color = "Female") +
  theme_minimal()

Example with Simulated Data

# Generate sample data
set.seed(123)
n <- 1000

# Create education variable
education <- rnorm(n, mean = 16, sd = 2)

# Generate probability of employment
z <- -2 + 0.3 * education
prob <- 1/(1 + exp(-z))
employed <- rbinom(n, 1, prob)

# Create dataset
data <- data.frame(
  education = education,
  employed = employed
)

# Fit logistic regression
model <- glm(employed ~ education, 
            family = binomial(link = "logit"), 
            data = data)

Calculating and Interpreting Odds Ratios

# Calculate odds ratio for education
beta <- coef(model)[2]  # coefficient for education
odds_ratio <- exp(beta)

# Calculate confidence interval
ci <- exp(confint(model))[2,]

# Create results table
results <- data.frame(
  Coefficient = beta,
  Odds_Ratio = odds_ratio,
  CI_Lower = ci[1],
  CI_Upper = ci[2]
)

kable(results, digits = 3)
|           | Coefficient | Odds_Ratio | CI_Lower | CI_Upper |
|:----------|------------:|-----------:|---------:|---------:|
| education |       0.362 |      1.437 |    1.255 |    1.653 |

Interpretation: each additional year of education multiplies the odds of employment by about 1.44, i.e., raises them by roughly 44 percent (95% CI: 25.5 to 65.3 percent).

Visualizing Odds Ratios

Create a sequence of education values and calculate predicted probabilities using these values:

# Create sequence of education values
edu_seq <- seq(min(education), max(education), length.out = 100)

# Calculate predicted probabilities
pred_prob <- predict(model, 
                    newdata = data.frame(education = edu_seq),
                    type = "response")

Calculate odds:

# Calculate odds
odds <- pred_prob/(1-pred_prob)

Create a plot to compare odds and odds ratios:

# Create plot data
plot_data <- data.frame(
  Education = edu_seq,
  Probability = pred_prob,
  Odds = odds
)

# Plot both probability and odds
p1 <- ggplot(plot_data, aes(x = Education, y = Probability)) +
  geom_line() +
  labs(title = "Predicted Probability vs. Education",
       y = "Probability of Employment") +
  theme_minimal()

p2 <- ggplot(plot_data, aes(x = Education, y = Odds)) +
  geom_line() +
  labs(title = "Odds vs. Education",
       y = "Odds of Employment") +
  theme_minimal()

# Display plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)

Important Properties of Odds Ratios

  1. Multiplicative Nature: the odds ratio for a \(k\)-unit change in \(X\) is \(e^{k\beta_1} = (e^{\beta_1})^k\).
  2. Symmetry Around 1: odds ratios of 2 and 0.5 represent effects of equal size in opposite directions; \(OR > 1\) raises the odds, \(OR < 1\) lowers them.

Comparing Probability Changes vs Odds Ratios

Key Differences: the odds ratio is constant across levels of \(X\), while the implied change in probability varies with the level of \(X\), as the computation below illustrates.

# Function to calculate probability change
calc_prob_change <- function(x, beta0 = -2, beta1 = 0.3) {
  p1 <- 1/(1 + exp(-(beta0 + beta1*(x+1))))
  p0 <- 1/(1 + exp(-(beta0 + beta1*x)))
  return(p1 - p0)
}

# Calculate probability changes at different education levels
edu_levels <- seq(10, 20, by = 2)
prob_changes <- sapply(edu_levels, calc_prob_change)

# Create comparison table
comparison <- data.frame(
  Education_Level = edu_levels,
  Probability_Change = prob_changes,
  Odds_Ratio = rep(odds_ratio, length(edu_levels))
)

kable(comparison, digits = 3)
| Education_Level | Probability_Change | Odds_Ratio |
|----------------:|-------------------:|-----------:|
|              10 |              0.055 |      1.437 |
|              12 |              0.038 |      1.437 |
|              14 |              0.024 |      1.437 |
|              16 |              0.014 |      1.437 |
|              18 |              0.008 |      1.437 |
|              20 |              0.005 |      1.437 |

Common Misconceptions

  1. Odds vs. Probability: an odds ratio of 2 does not mean the probability doubles; odds and probabilities are close only when the event is rare.
  2. Interpretation Pitfalls: odds ratios are easily misread as ratios of probabilities (risk ratios), which overstates effects when the outcome is common.

Practical Guidelines: When to Use Odds Ratios

  1. Advantages:
    • Constant across levels of X
    • Easy to combine across studies
    • Standard in medical research
  2. Disadvantages:
    • Less intuitive than probabilities
    • Can be misleading for rare events
    • May exaggerate effects

Alternative Measures:

  1. Risk Ratios:
    • Ratio of probabilities
    • More intuitive for some audiences
  2. Marginal Effects:
    • Change in probability
    • Varies across X
    • More relevant for policy

Ordinal Logistic Regression (Ordinal Logit)

When to Use Ordinal Logistic Regression: when the dependent variable consists of ordered categories (e.g., satisfaction ratings or quality scores) whose spacing is unknown, so that treating it as continuous, or as unordered categories, would discard information.

Proportional Odds

\[ P(Y \leq j|X) = \frac{e^{\alpha_j - X\beta}}{1 + e^{\alpha_j - X\beta}} \]

where \(j = 1,\dots,J-1\) indexes the ordered categories, \(\alpha_j\) are category-specific intercepts (cutpoints), and the slope vector \(\beta\) is the same for every \(j\) (the proportional odds assumption). This matches the parameterization in the Mathematical Framework section below.

Example 1: Job Satisfaction Analysis

Generate a sample of 1000 observations.

The variables in the sample will be salary, work_hours, experience, training.

# Generate sample data
n <- 1000

# Create base variables
salary <- rnorm(n, mean = 50000, sd = 10000)
work_hours <- rnorm(n, mean = 40, sd = 5)
experience <- rpois(n, lambda = 10)
training <- rbinom(n, 1, 0.3)

# Create the data frame first with raw values
mydata <- data.frame(
    salary = salary,
    work_hours = work_hours,
    experience = experience,
    training = training
)
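The chunk that standardizes salary and work hours and generates the ordinal satisfaction variable is not shown; the str() output below implies steps along the following lines (the latent-index coefficients and cutpoints here are illustrative guesses, not the values that produced the output):

# standardize the continuous variables (implied by the *_std columns below)
mydata$salary_std <- as.numeric(scale(mydata$salary))
mydata$work_hours_std <- as.numeric(scale(mydata$work_hours))

# illustrative latent index with logistic noise, cut into 4 ordered levels
latent <- with(mydata,
               0.4 * salary_std + 0.2 * experience -
                 0.02 * work_hours_std + 1.0 * training + rlogis(n))
mydata$satisfaction <- cut(latent,
                           breaks = c(-Inf, 2, 3.5, 5, Inf),
                           labels = c("Very Dissatisfied", "Dissatisfied",
                                      "Satisfied", "Very Satisfied"),
                           ordered_result = TRUE)

# inspect the result
str(mydata)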
## 'data.frame':    1000 obs. of  7 variables:
##  $ salary        : num  44395 47698 65587 50705 51293 ...
##  $ work_hours    : num  35 34.8 39.9 39.3 27.3 ...
##  $ experience    : int  8 10 8 13 10 8 14 9 12 9 ...
##  $ training      : int  1 0 0 1 0 1 0 0 0 0 ...
##  $ salary_std    : num  -0.5814 -0.2484 1.5555 0.0548 0.1141 ...
##  $ work_hours_std: num  -1.0283 -1.072 -0.0599 -0.173 -2.567 ...
##  $ satisfaction  : Ord.factor w/ 4 levels "Very Dissatisfied"<..: 1 2 1 3 2 1 2 1 1 1 ...

Now that we generated our sample, we can estimate the model with ordinal logit:
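The estimation call itself is not shown, but it can be read off the Call line of the output; polr() comes from the MASS package:

# fit the proportional odds model with polr() from MASS
library(MASS)
model <- polr(satisfaction ~ salary_std + experience + work_hours_std + training,
              data = mydata, method = "logistic")
summary(model)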

## Call:
## polr(formula = satisfaction ~ salary_std + experience + work_hours_std + 
##     training, data = mydata, method = "logistic")
## 
## Coefficients:
##                   Value Std. Error t value
## salary_std      0.42255    0.06232  6.7807
## experience      0.18257    0.02064  8.8462
## work_hours_std -0.01522    0.06130 -0.2483
## training        0.96524    0.13310  7.2519
## 
## Intercepts:
##                                Value   Std. Error t value
## Very Dissatisfied|Dissatisfied  1.8396  0.2184     8.4241
## Dissatisfied|Satisfied          3.7021  0.2434    15.2103
## Satisfied|Very Satisfied        5.5191  0.2891    19.0880
## 
## Residual Deviance: 2118.656 
## AIC: 2132.656
## 
## 
## |                                    |  Value| Std. Error| t value| p value|
## |:-----------------------------------|------:|----------:|-------:|-------:|
## |salary_std                          |  0.423|      0.062|   6.781|   0.000|
## |experience                          |  0.183|      0.021|   8.846|   0.000|
## |work_hours_std                      | -0.015|      0.061|  -0.248|   0.804|
## |training                            |  0.965|      0.133|   7.252|   0.000|
## |Very Dissatisfied|Dissatisfied      |  1.840|      0.218|   8.424|   0.000|
## |Dissatisfied|Satisfied              |  3.702|      0.243|  15.210|   0.000|
## |Satisfied|Very Satisfied            |  5.519|      0.289|  19.088|   0.000|
## [1] "Odds Ratios with 95% Confidence Intervals:"
## 
## 
## |               |    OR| CI_lower| CI_upper|
## |:--------------|-----:|--------:|--------:|
## |salary_std     | 1.526|    1.351|    1.726|
## |experience     | 1.200|    1.153|    1.250|
## |work_hours_std | 0.985|    0.873|    1.111|
## |training       | 2.625|    2.024|    3.411|

Coefficient Interpretation

  1. Direction:
    • Positive coefficient: Higher values of X associated with higher satisfaction
    • Negative coefficient: Higher values of X associated with lower satisfaction
  2. Magnitude:
    • Represents change in log-odds for one unit increase in X
    • Must be exponentiated for odds ratios
# Calculate odds ratios and confidence intervals
odds_ratios <- exp(coef(model))
ci <- exp(confint(model))

# Combine results
or_results <- data.frame(
  OR = odds_ratios,
  CI_lower = ci[,1],
  CI_upper = ci[,2]
)

kable(or_results, digits = 3)
|                |    OR | CI_lower | CI_upper |
|:---------------|------:|---------:|---------:|
| salary_std     | 1.526 |    1.351 |    1.726 |
| experience     | 1.200 |    1.153 |    1.250 |
| work_hours_std | 0.985 |    0.873 |    1.111 |
| training       | 2.625 |    2.024 |    3.411 |

Interpreting Odds Ratios For Continuous Variables:

# Create interpretation table
interpret <- data.frame(
  Variable = names(odds_ratios),
  OR = odds_ratios,
  Interpretation = paste("For one unit increase,",
                        "odds of higher satisfaction",
                        ifelse(odds_ratios > 1, "increase", "decrease"),
                        "by",
                        round(abs(1-odds_ratios)*100, 1),
                        "%")
)

kable(interpret, digits = 3)
|                | Variable       |    OR | Interpretation |
|:---------------|:---------------|------:|:---------------|
| salary_std     | salary_std     | 1.526 | For one unit increase, odds of higher satisfaction increase by 52.6 % |
| experience     | experience     | 1.200 | For one unit increase, odds of higher satisfaction increase by 20 % |
| work_hours_std | work_hours_std | 0.985 | For one unit increase, odds of higher satisfaction decrease by 1.5 % |
| training       | training       | 2.625 | For one unit increase, odds of higher satisfaction increase by 162.5 % |

Mathematical Framework

The ordinal logit model can be expressed as:

\[ \log\left(\frac{P(Y_i \leq j)}{P(Y_i > j)}\right) = \alpha_j - X_i\beta \]

where \(\alpha_j\) are the cutpoint (threshold) parameters for category \(j\) and \(\beta\) is the slope vector common to all categories. Note the minus sign in front of \(X_i\beta\): this is the parameterization polr() uses, so positive coefficients shift probability toward higher categories.

Example 2: Wine Tasting

  1. temp: Temperature in Celsius (15-25°C)
    • Represents the fermentation/storage temperature of the wine
    • Range: 15.0°C to 24.9°C
    • Median: 19.8°C
  2. contact: Contact time in hours (1-24 hours)
    • Represents the time the wine spends in contact with grape skins/oak
    • Range: 1 to 24 hours
    • Median: 12 hours
  3. rating: Wine quality rating (1-5 scale)
    • 1: Poor
    • 2: Fair
    • 3: Good
    • 4: Very Good
    • 5: Excellent

The data simulates a wine tasting experiment where different combinations of temperature and contact time are tested to see their effect on wine quality ratings.

The ratings are weighted to follow a roughly normal distribution centered around rating 3 (Good).

Example 2: Wine Tasting
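The code that simulates the wine data is not shown; a minimal sketch consistent with the description above (the seed, sample size, and rating weights are assumptions, so it will not reproduce the printed rows exactly):

# assumed simulation; will not reproduce the printed output exactly
set.seed(42)
n_wine <- 200
wine <- data.frame(
  temp = round(runif(n_wine, 15, 25), 1),         # storage temperature, Celsius
  contact = sample(1:24, n_wine, replace = TRUE)  # skin-contact hours
)

# ratings weighted toward 3 ("Good"), stored as an ordered factor for polr()
wine$rating <- factor(sample(1:5, n_wine, replace = TRUE,
                             prob = c(0.1, 0.25, 0.3, 0.25, 0.1)),
                      ordered = TRUE)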

# Display first few rows of the wine dataset
head(wine)
##   temp contact rating
## 1 17.9       6      1
## 2 22.9      23      3
## 3 19.1      15      1
## 4 23.8      13      2
## 5 24.4      10      3
## 6 15.5      21      2
# Fit ordinal logistic regression model
model <- polr(rating ~ temp + contact, data = wine, Hess = TRUE)

# Display summary
summary(model)
## Call:
## polr(formula = rating ~ temp + contact, data = wine, Hess = TRUE)
## 
## Coefficients:
##            Value Std. Error t value
## temp    -0.03336    0.04767 -0.6997
## contact -0.01374    0.01880 -0.7306
## 
## Intercepts:
##     Value   Std. Error t value
## 1|2 -3.1019  1.0385    -2.9868
## 2|3 -1.7432  1.0218    -1.7061
## 3|4  0.0281  1.0164     0.0276
## 4|5  1.0191  1.0204     0.9987
## 
## Residual Deviance: 587.4065 
## AIC: 599.4065

Coefficient Interpretation

# Extract and display coefficients
coef(model)
##        temp     contact 
## -0.03335622 -0.01373835

Odds Ratios

# Calculate odds ratios and confidence intervals
ci <- confint(model)
odds_ratios <- exp(coef(model))
ci_exp <- exp(ci)

# Combine results
or_table <- cbind(OR = odds_ratios, 
                 LowerCI = ci_exp[,1], 
                 UpperCI = ci_exp[,2])
print(or_table)
##                OR   LowerCI  UpperCI
## temp    0.9671940 0.8807487 1.062027
## contact 0.9863556 0.9505159 1.023337

Predicted Probabilities

# Create new data for predictions
new_data <- expand.grid(
  temp = c(min(wine$temp), mean(wine$temp), max(wine$temp)),
  contact = c(min(wine$contact), mean(wine$contact), max(wine$contact))
)

# Get predictions
preds <- predict(model, newdata = new_data, type = "probs")
round(preds, 3)
##       1     2     3     4     5
## 1 0.070 0.156 0.406 0.190 0.178
## 2 0.082 0.176 0.413 0.175 0.154
## 3 0.095 0.195 0.416 0.160 0.134
## 4 0.081 0.174 0.413 0.177 0.156
## 5 0.094 0.194 0.416 0.161 0.135
## 6 0.109 0.213 0.414 0.146 0.117
## 7 0.093 0.193 0.416 0.162 0.136
## 8 0.109 0.213 0.414 0.146 0.117
## 9 0.125 0.233 0.408 0.132 0.102

Model Diagnostics

# Calculate AIC
AIC(model)
## [1] 599.4065
# Deviance-based goodness-of-fit check
# (note: this is not a formal test of the proportional odds assumption)
sf <- summary(model)
pval <- pchisq(sf$deviance, sf$df.residual, lower.tail = FALSE)
cat("Deviance GOF p-value:", pval)
## Deviance GOF p-value: 3.467484e-41
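A more direct check of the proportional odds assumption is a Brant test; a minimal sketch, assuming the brant package is installed:

# Brant test of the proportional odds assumption (assumes the 'brant' package)
library(brant)
brant(model)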

Key Takeaways

Common Pitfalls: What to Watch Out For

  1. Linear Interpretation

    • Logit effects are non-linear

    • Marginal effects vary across X

  2. Significance vs. Magnitude

    • Statistical significance ≠ Practical significance

    • Consider effect sizes

  3. Sample Specificity

    • Effects depend on sample characteristics

    • Be careful with extrapolation

Further Considerations

  1. Discrete Changes

    • For binary/categorical variables

    • Different from continuous marginal effects

  2. Standard Errors

    • Delta method

    • Bootstrap methods

  3. Model Diagnostics

    • Goodness of fit

    • Classification accuracy

Key Points to Remember

  1. Interpretation:
    • Coefficients represent log-odds
    • Odds ratios show multiplicative effects
    • Predicted probabilities show absolute effects
  2. Assumptions:
    • Test parallel regression assumption
    • Consider alternatives if violated
  3. Reporting:
    • Include multiple types of effects
    • Use visualizations
    • Report model diagnostics