Lecture 10: Probit and Logit Regression (Binary DVs)

Slides for Week 10 is available here.

For the lab, we will be using the code materials from this book: https://www.econometrics-with-r.org/11-2-palr.html

Install and load all necessary libraries

# install.packages("AER")
library(stargazer)
library(car)
library(sandwich)
# load `AER` package and attach the HMDA data
library(AER)
data(HMDA)

Exploratory Data Analysis and Variable Transformation

head(HMDA)
##   deny pirat hirat     lvrat chist mhist phist unemp selfemp insurance condomin
## 1   no 0.221 0.221 0.8000000     5     2    no   3.9      no        no       no
## 2   no 0.265 0.265 0.9218750     2     2    no   3.2      no        no       no
## 3   no 0.372 0.248 0.9203980     1     2    no   3.2      no        no       no
## 4   no 0.320 0.250 0.8604651     1     2    no   4.3      no        no       no
## 5   no 0.360 0.350 0.6000000     1     1    no   3.2      no        no       no
## 6   no 0.240 0.170 0.5105263     1     1    no   3.9      no        no       no
##   afam single hschool
## 1   no     no     yes
## 2   no    yes     yes
## 3   no     no     yes
## 4   no     no     yes
## 5   no     no     yes
## 6   no     no     yes
# rename the variable 'afam' for consistency
colnames(HMDA)[colnames(HMDA) == "afam"] <- "black"
summary(HMDA)
##   deny          pirat            hirat            lvrat        chist   
##  no :2095   Min.   :0.0000   Min.   :0.0000   Min.   :0.0200   1:1353  
##  yes: 285   1st Qu.:0.2800   1st Qu.:0.2140   1st Qu.:0.6527   2: 441  
##             Median :0.3300   Median :0.2600   Median :0.7795   3: 126  
##             Mean   :0.3308   Mean   :0.2553   Mean   :0.7378   4:  77  
##             3rd Qu.:0.3700   3rd Qu.:0.2988   3rd Qu.:0.8685   5: 182  
##             Max.   :3.0000   Max.   :3.0000   Max.   :1.9500   6: 201  
##  mhist    phist          unemp        selfemp    insurance  condomin  
##  1: 747   no :2205   Min.   : 1.800   no :2103   no :2332   no :1694  
##  2:1571   yes: 175   1st Qu.: 3.100   yes: 277   yes:  48   yes: 686  
##  3:  41              Median : 3.200                                   
##  4:  21              Mean   : 3.774                                   
##                      3rd Qu.: 3.900                                   
##                      Max.   :10.600                                   
##  black      single     hschool   
##  no :2041   no :1444   no :  39  
##  yes: 339   yes: 936   yes:2341  
##                                  
##                                  
##                                  
## 

Probit Regression

We now estimate a simple Probit model of the probability of a mortgage denial.

# estimate the simple probit model
denyprobit <- glm(deny ~ pirat, 
                  family = binomial(link = "probit"), 
                  data = HMDA)

coeftest(denyprobit, vcov. = vcovHC, type = "HC1")
## 
## z test of coefficients:
## 
##             Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -2.19415    0.18901 -11.6087 < 2.2e-16 ***
## pirat        2.96787    0.53698   5.5269 3.259e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We use predict() to compute the predicted change in the denial probability when P/I ratio increased from 0.3 to 0.4

# 1. compute predictions for P/I ratio = 0.3, 0.4
predictions <- predict(denyprobit, 
                       newdata = data.frame("pirat" = c(0.3, 0.4)),
                       type = "response")

# 2. Compute difference in probabilities
diff(predictions)
##          2 
## 0.06081433

We find that an increase in the payment-to-income ratio from 0.3 to 0.4 is predicted to increase the probability of denial by approximately 6.2 %.

We continue by using an augmented Probit model to estimate the effect of race on the probability of a mortgage application denial.

denyprobit2 <- glm(deny ~ pirat + black, 
                   family = binomial(link = "probit"), 
                   data = HMDA)

coeftest(denyprobit2, vcov. = vcovHC, type = "HC1")
## 
## z test of coefficients:
## 
##              Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -2.258787   0.176608 -12.7898 < 2.2e-16 ***
## pirat        2.741779   0.497673   5.5092 3.605e-08 ***
## blackyes     0.708155   0.083091   8.5227 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\[\begin{align} \widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} P/I \ ratio + \underset{(0.08)}{0.71} black). \tag{11.6} \end{align}\]

While all coefficients are highly significant, both the estimated coefficients on the payments-to-income ratio and the indicator for African American descent are positive. Again, the coefficients are difficult to interpret but they indicate that, first, African Americans have a higher probability of denial than white applicants, holding constant the payments-to-income ratio and second, applicants with a high payments-to-income ratio face a higher risk of being rejected.

How big is the estimated difference in denial probabilities between two hypothetical applicants with the same payments-to-income ratio? As before, we may use predict() to compute this difference.

# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denyprobit2, 
                       newdata = data.frame("black" = c("no", "yes"), 
                                            "pirat" = c(0.3, 0.3)),
                       type = "response")

# 2. compute difference in probabilities
diff(predictions)
##         2 
## 0.1578117

In this case, the estimated difference in denial probabilities is about 15.8%.

Logit Regression

It is fairly easy to estimate a Logit regression model using R.

denylogit <- glm(deny ~ pirat, 
                 family = binomial(link = "logit"), 
                 data = HMDA)

coeftest(denylogit, vcov. = vcovHC, type = "HC1")
## 
## z test of coefficients:
## 
##             Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -4.02843    0.35898 -11.2218 < 2.2e-16 ***
## pirat        5.88450    1.00015   5.8836 4.014e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# plot data
plot(x = HMDA$pirat, 
     y = HMDA$deny,
     main = "Probit and Logit Models Model of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.9)

# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")

# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_probit <- predict(denyprobit, list(pirat = x), type = "response")
y_logit <- predict(denylogit, list(pirat = x), type = "response")

lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)

# add a legend
legend("topleft",
       horiz = TRUE,
       legend = c("Probit", "Logit"),
       col = c("steelblue", "black"), 
       lty = c(1, 2))

Both models produce very similar estimates of the probability that a mortgage application will be denied depending on the applicants payment-to-income ratio.

Following the book we extend the simple Logit model of mortgage denial with the additional regressor black.

# estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ pirat + black, 
                  family = binomial(link = "logit"), 
                  data = HMDA)
coeftest(denylogit2, vcov. = vcovHC, type = "HC1") 
## 
## z test of coefficients:
## 
##             Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -4.12556    0.34597 -11.9245 < 2.2e-16 ***
## pirat        5.37036    0.96376   5.5723 2.514e-08 ***
## blackyes     1.27278    0.14616   8.7081 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# same as summary function with robust standard errors

As for the Probit model, all model coefficients are highly significant and we obtain positive estimates for the coefficients on P/I ratio and black. For comparison we compute the predicted probability of denial for two hypothetical applicants that differ in race and have a P/I ratio of 0.3 .

# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denylogit2, 
                       newdata = data.frame("black" = c("no", "yes"), 
                                            "pirat" = c(0.3, 0.3)),
                       type = "response")
summary(predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07485 0.11218 0.14950 0.14950 0.18682 0.22415
# 2. Compute difference in probabilities
diff(predictions)
##         2 
## 0.1492945

We find that the white applicant faces a denial probability of only 7.5%, while the African American is rejected with a probability of 22.4%, a difference of 14.9%.

Application to Boston HMDA Data

# Mean P/I ratio
mean(HMDA$pirat)
## [1] 0.3308136
# inhouse expense-to-total-income ratio
mean(HMDA$hirat)
## [1] 0.2553461
# loan-to-value ratio
mean(HMDA$lvrat)
## [1] 0.7377759
# consumer credit score
mean(as.numeric(HMDA$chist))
## [1] 2.116387
# mortgage credit score
mean(as.numeric(HMDA$mhist))
## [1] 1.721008
# public bad credit record
mean(as.numeric(HMDA$phist)-1)
## [1] 0.07352941
# denied mortgage insurance
prop.table(table(HMDA$insurance))
## 
##         no        yes 
## 0.97983193 0.02016807
# self-employed
prop.table(table(HMDA$selfemp))
## 
##        no       yes 
## 0.8836134 0.1163866
# single
prop.table(table(HMDA$single))
## 
##        no       yes 
## 0.6067227 0.3932773
# high school diploma
prop.table(table(HMDA$hschool))
## 
##         no        yes 
## 0.01638655 0.98361345
# unemployment rate
mean(HMDA$unemp)
## [1] 3.774496
# condominium
prop.table(table(HMDA$condomin))
## 
##        no       yes 
## 0.7117647 0.2882353
# black
prop.table(table(HMDA$black))
## 
##       no      yes 
## 0.857563 0.142437
# deny
prop.table(table(HMDA$deny))
## 
##        no       yes 
## 0.8802521 0.1197479

Before estimating the models we transform the loan-to-value ratio (lvrat) into a factor variable, where \[\begin{align*} lvrat = \begin{cases} \text{low} & \text{if} \ \ lvrat < 0.8, \\ \text{medium} & \text{if} \ \ 0.8 \leq lvrat \leq 0.95, \\ \text{high} & \text{if} \ \ lvrat > 0.95 \end{cases} \end{align*}\] and convert both credit scores to numeric variables.

# define low, medium and high loan-to-value ratio
HMDA$lvrat <- factor(
  ifelse(HMDA$lvrat < 0.8, "low",
  ifelse(HMDA$lvrat >= 0.8 & HMDA$lvrat <= 0.95, "medium", "high")),
  levels = c("low", "medium", "high"))


# convert credit scores to numeric
HMDA$mhist <- as.numeric(HMDA$mhist)
HMDA$chist <- as.numeric(HMDA$chist)

Next we reproduce the estimation results presented in Table 11.2 of the book.

# estimate all 5 models for the denial probability

logit_HMDA <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist 
                  + insurance + selfemp, 
                  family = binomial(link = "logit"), 
                  data = HMDA)

probit_HMDA_1 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist 
                     + insurance + selfemp, 
                     family = binomial(link = "probit"), 
                     data = HMDA)

probit_HMDA_2 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist 
                     + insurance + selfemp + single + hschool + unemp, 
                     family = binomial(link = "probit"), 
                     data = HMDA)

probit_HMDA_3 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist 
                     + phist + insurance + selfemp + single + hschool + unemp + condomin 
                     + I(mhist==3) + I(mhist==4) + I(chist==3) + I(chist==4) + I(chist==5) 
                     + I(chist==6), 
                     family = binomial(link = "probit"), 
                     data = HMDA)

probit_HMDA_4 <- glm(deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist 
                     + insurance + selfemp + single + hschool + unemp, 
                     family = binomial(link = "probit"), 
                     data = HMDA)

Just as in previous chapters, we store heteroskedasticity-robust standard errors of the coefficient estimators in a list which is then used as the argument se in stargazer().

rob_se <- list(sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1"))))


logit_HMDA$AIC <- AIC(logit_HMDA)
probit_HMDA_1$AIC <- AIC(probit_HMDA_1)
probit_HMDA_2$AIC <- AIC(probit_HMDA_2)
probit_HMDA_3$AIC <- AIC(probit_HMDA_3)
probit_HMDA_4$AIC <- AIC(probit_HMDA_4)

logit_HMDA$BIC <- BIC(logit_HMDA)
probit_HMDA_1$BIC <- BIC(probit_HMDA_1)
probit_HMDA_2$BIC <- BIC(probit_HMDA_2)
probit_HMDA_3$BIC <- BIC(probit_HMDA_3)
probit_HMDA_4$BIC <- BIC(probit_HMDA_4)


# logit model

stargazer(logit_HMDA, 
          digits = 3, out = "logit.html",
          type = "html", 
          header = FALSE,
          title = "Logit Regressions of Mortgage Denial Using the Boston HMDA data",
          dep.var.caption = "Dependent Variable: Mortgage Denial",
          se = rob_se,
          keep.stat = c("n","aic","bic"),
          star.char = c("+", "*", "**", "***"),
          star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
          notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"), 
          notes.append = FALSE,
          model.numbers = FALSE,
          column.labels = c( "(1)"))
Logit Regressions of Mortgage Denial Using the Boston HMDA data
Dependent Variable: Mortgage Denial
deny
(1)
blackyes 0.688***
(0.183)
pirat 4.764***
(1.332)
hirat -0.109
(1.298)
lvratmedium 0.464**
(0.160)
lvrathigh 1.495***
(0.325)
chist 0.290***
(0.039)
mhist 0.279*
(0.138)
phistyes 1.226***
(0.203)
insuranceyes 4.548***
(0.576)
selfempyes 0.666**
(0.214)
Constant -5.707***
(0.484)
Observations 2,380
Akaike Inf. Crit. 1,293.273
Bayesian Inf. Crit. 1,356.797
Note: (+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)
# probit model
stargazer(probit_HMDA_1, 
          probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,  
          digits = 3, out = "probit.html",
          type = "html", 
          header = FALSE,
          title = "Logit and Probit Regressions of Mortgage Denial Using the Boston HMDA data",
          dep.var.caption = "Dependent Variable: Mortgage Denial",
          se = rob_se,
          keep.stat = c("n","aic","bic"),
          star.char = c("+", "*", "**", "***"),
          star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
          notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"), 
          notes.append = FALSE,
          model.numbers = FALSE,
          column.labels = c( "(2)", "(3)", "(4)", "(5)"))
Logit and Probit Regressions of Mortgage Denial Using the Boston HMDA data
Dependent Variable: Mortgage Denial
deny
(2) (3) (4) (5)
blackyes 0.389* 0.371*** 0.363*** 0.246*
(0.183) (0.099) (0.100) (0.101)
pirat 2.442+ 2.464*** 2.622*** 2.572***
(1.332) (0.673) (0.654) (0.665)
hirat -0.185 -0.302 -0.502 -0.538
(1.298) (0.689) (0.689) (0.715)
lvratmedium 0.214 0.216** 0.215** 0.216*
(0.160) (0.082) (0.082) (0.084)
lvrathigh 0.791* 0.795*** 0.836*** 0.788***
(0.325) (0.183) (0.184) (0.185)
chist 0.155*** 0.158*** 0.344*** 0.158
(0.039) (0.021) (0.021) (0.108)
mhist 0.148 0.110 0.162* 0.111
(0.138) (0.073) (0.076) (0.104)
phistyes 0.697*** 0.702*** 0.717*** 0.705***
(0.203) (0.114) (0.115) (0.116)
insuranceyes 2.557*** 2.585*** 2.589*** 2.590***
(0.576) (0.305) (0.299) (0.306)
selfempyes 0.359+ 0.346** 0.342** 0.348**
(0.214) (0.113) (0.116) (0.116)
singleyes 0.229 0.230** 0.226**
(0.080) (0.086)
hschoolyes -0.613 -0.604** -0.620**
(0.229) (0.237)
unemp 0.030 0.028 0.030+
(0.018) (0.018)
condominyes -0.055
I(mhist == 3) -0.107
I(mhist == 4) -0.383
I(chist == 3) -0.226
I(chist == 4) -0.251
I(chist == 5) -0.789
I(chist == 6) -0.905
blackyes:pirat -0.579
blackyes:hirat 1.232
Constant -3.041*** -2.575*** -2.896*** -2.543***
(0.484) (0.250) (0.350) (0.404)
Observations 2,380 2,380 2,380 2,380
Akaike Inf. Crit. 1,295.694 1,285.227 1,292.129 1,288.664
Bayesian Inf. Crit. 1,359.218 1,366.075 1,413.401 1,381.062
Note: (+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)

Exponentiate Coefficients for logit_HMDA and Show Odds Ratios in Stargazer

# Exponentiate the coefficients and compute robust SE for the logit model
logit_OR <- exp(coef(logit_HMDA))
logit_OR_se <- exp(coef(logit_HMDA)) * sqrt(diag(vcovHC(logit_HMDA, type = "HC1")))

# Create a dummy model with exponentiated coefficients to use in stargazer
logit_HMDA_OR <- logit_HMDA
logit_HMDA_OR$coefficients <- logit_OR

# Stargazer table for exponentiated log odds (odds ratios)
stargazer(logit_HMDA_OR,
          type = "html",
          out = "logit_odds_ratios.html",
          title = "Exponentiated Coefficients (Odds Ratios) for Logit Model of Mortgage Denial",
          se = list(logit_OR_se),
          digits = 3,
          keep.stat = c("n","aic","bic"),
          dep.var.caption = "Dependent Variable: Mortgage Denial",
          star.char = c("+", "*", "**", "***"),
          star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
          notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"),
          notes.append = FALSE,
          model.numbers = FALSE,
          column.labels = c("Logit OR"))
## 
## <table style="text-align:center"><caption><strong>Exponentiated Coefficients (Odds Ratios) for Logit Model of Mortgage Denial</strong></caption>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>Dependent Variable: Mortgage Denial</td></tr>
## <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>deny</td></tr>
## <tr><td style="text-align:left"></td><td>Logit OR</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">blackyes</td><td>1.991<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.363)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">pirat</td><td>117.263</td></tr>
## <tr><td style="text-align:left"></td><td>(156.217)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">hirat</td><td>0.897</td></tr>
## <tr><td style="text-align:left"></td><td>(1.164)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">lvratmedium</td><td>1.590<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.255)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">lvrathigh</td><td>4.458<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(1.448)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">chist</td><td>1.337<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.052)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">mhist</td><td>1.322<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.182)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">phistyes</td><td>3.407<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.693)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">insuranceyes</td><td>94.459<sup>+</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(54.373)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">selfempyes</td><td>1.947<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.416)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>0.003<sup>*</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.002)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>2,380</td></tr>
## <tr><td style="text-align:left">Akaike Inf. Crit.</td><td>1,293.273</td></tr>
## <tr><td style="text-align:left">Bayesian Inf. Crit.</td><td>1,356.797</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right">(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)</td></tr>
## </table>