SPP608 Statistics in R: Probit and Logit Regression

Lecture 10: Probit and Logit Regression (Binary DVs)

Slides for Week 10 is available here.

For the lab, we will be using the code materials from this book: https://www.econometrics-with-r.org/11-2-palr.html

Install and load all necessary libraries

# install.packages("AER")
library(stargazer)
library(car)
library(sandwich)
# load `AER` package and attach the HMDA data
library(AER)
data(HMDA)

Exploratory Data Analysis and Variable Transformation

head(HMDA)

##   deny pirat hirat     lvrat chist mhist phist unemp selfemp insurance condomin
## 1   no 0.221 0.221 0.8000000     5     2    no   3.9      no        no       no
## 2   no 0.265 0.265 0.9218750     2     2    no   3.2      no        no       no
## 3   no 0.372 0.248 0.9203980     1     2    no   3.2      no        no       no
## 4   no 0.320 0.250 0.8604651     1     2    no   4.3      no        no       no
## 5   no 0.360 0.350 0.6000000     1     1    no   3.2      no        no       no
## 6   no 0.240 0.170 0.5105263     1     1    no   3.9      no        no       no
##   afam single hschool
## 1   no     no     yes
## 2   no    yes     yes
## 3   no     no     yes
## 4   no     no     yes
## 5   no     no     yes
## 6   no     no     yes

# rename the variable 'afam' for consistency
colnames(HMDA)[colnames(HMDA) == "afam"] <- "black"
summary(HMDA)

##   deny          pirat            hirat            lvrat        chist   
##  no :2095   Min.   :0.0000   Min.   :0.0000   Min.   :0.0200   1:1353  
##  yes: 285   1st Qu.:0.2800   1st Qu.:0.2140   1st Qu.:0.6527   2: 441  
##             Median :0.3300   Median :0.2600   Median :0.7795   3: 126  
##             Mean   :0.3308   Mean   :0.2553   Mean   :0.7378   4:  77  
##             3rd Qu.:0.3700   3rd Qu.:0.2988   3rd Qu.:0.8685   5: 182  
##             Max.   :3.0000   Max.   :3.0000   Max.   :1.9500   6: 201  
##  mhist    phist          unemp        selfemp    insurance  condomin  
##  1: 747   no :2205   Min.   : 1.800   no :2103   no :2332   no :1694  
##  2:1571   yes: 175   1st Qu.: 3.100   yes: 277   yes:  48   yes: 686  
##  3:  41              Median : 3.200                                   
##  4:  21              Mean   : 3.774                                   
##                      3rd Qu.: 3.900                                   
##                      Max.   :10.600                                   
##  black      single     hschool   
##  no :2041   no :1444   no :  39  
##  yes: 339   yes: 936   yes:2341  
##                                  
##                                  
##                                  
##

Probit Regression

We now estimate a simple Probit model of the probability of a mortgage denial.

# estimate the simple probit model
denyprobit <- glm(deny ~ pirat, 
                  family = binomial(link = "probit"), 
                  data = HMDA)

coeftest(denyprobit, vcov. = vcovHC, type = "HC1")

## 
## z test of coefficients:
## 
##             Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -2.19415    0.18901 -11.6087 < 2.2e-16 ***
## pirat        2.96787    0.53698   5.5269 3.259e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We use predict() to compute the predicted change in the denial probability when P/I ratio increased from 0.3 to 0.4

# 1. compute predictions for P/I ratio = 0.3, 0.4
predictions <- predict(denyprobit, 
                       newdata = data.frame("pirat" = c(0.3, 0.4)),
                       type = "response")

# 2. Compute difference in probabilities
diff(predictions)

##          2 
## 0.06081433

We find that an increase in the payment-to-income ratio from 0.3 to 0.4 is predicted to increase the probability of denial by approximately 6.2 %.

We continue by using an augmented Probit model to estimate the effect of race on the probability of a mortgage application denial.

denyprobit2 <- glm(deny ~ pirat + black, 
                   family = binomial(link = "probit"), 
                   data = HMDA)

coeftest(denyprobit2, vcov. = vcovHC, type = "HC1")

## 
## z test of coefficients:
## 
##              Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -2.258787   0.176608 -12.7898 < 2.2e-16 ***
## pirat        2.741779   0.497673   5.5092 3.605e-08 ***
## blackyes     0.708155   0.083091   8.5227 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\[\begin{align} \widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} P/I \ ratio + \underset{(0.08)}{0.71} black). \tag{11.6} \end{align}\]

While all coefficients are highly significant, both the estimated coefficients on the payments-to-income ratio and the indicator for African American descent are positive. Again, the coefficients are difficult to interpret but they indicate that, first, African Americans have a higher probability of denial than white applicants, holding constant the payments-to-income ratio and second, applicants with a high payments-to-income ratio face a higher risk of being rejected.

How big is the estimated difference in denial probabilities between two hypothetical applicants with the same payments-to-income ratio? As before, we may use predict() to compute this difference.

# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denyprobit2, 
                       newdata = data.frame("black" = c("no", "yes"), 
                                            "pirat" = c(0.3, 0.3)),
                       type = "response")

# 2. compute difference in probabilities
diff(predictions)

##         2 
## 0.1578117

In this case, the estimated difference in denial probabilities is about 15.8%.

Logit Regression

It is fairly easy to estimate a Logit regression model using R.

denylogit <- glm(deny ~ pirat, 
                 family = binomial(link = "logit"), 
                 data = HMDA)

coeftest(denylogit, vcov. = vcovHC, type = "HC1")

## 
## z test of coefficients:
## 
##             Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -4.02843    0.35898 -11.2218 < 2.2e-16 ***
## pirat        5.88450    1.00015   5.8836 4.014e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# plot data
plot(x = HMDA$pirat, 
     y = HMDA$deny,
     main = "Probit and Logit Models Model of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.9)

# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")

# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_probit <- predict(denyprobit, list(pirat = x), type = "response")
y_logit <- predict(denylogit, list(pirat = x), type = "response")

lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)

# add a legend
legend("topleft",
       horiz = TRUE,
       legend = c("Probit", "Logit"),
       col = c("steelblue", "black"), 
       lty = c(1, 2))

Both models produce very similar estimates of the probability that a mortgage application will be denied depending on the applicants payment-to-income ratio.

Following the book we extend the simple Logit model of mortgage denial with the additional regressor black.

# estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ pirat + black, 
                  family = binomial(link = "logit"), 
                  data = HMDA)
coeftest(denylogit2, vcov. = vcovHC, type = "HC1")

## 
## z test of coefficients:
## 
##             Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -4.12556    0.34597 -11.9245 < 2.2e-16 ***
## pirat        5.37036    0.96376   5.5723 2.514e-08 ***
## blackyes     1.27278    0.14616   8.7081 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# same as summary function with robust standard errors

As for the Probit model, all model coefficients are highly significant and we obtain positive estimates for the coefficients on P/I ratio and black. For comparison we compute the predicted probability of denial for two hypothetical applicants that differ in race and have a P/I ratio of 0.3 .

# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denylogit2, 
                       newdata = data.frame("black" = c("no", "yes"), 
                                            "pirat" = c(0.3, 0.3)),
                       type = "response")
summary(predictions)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07485 0.11218 0.14950 0.14950 0.18682 0.22415

# 2. Compute difference in probabilities
diff(predictions)

##         2 
## 0.1492945

We find that the white applicant faces a denial probability of only 7.5%, while the African American is rejected with a probability of 22.4%, a difference of 14.9%.

Application to Boston HMDA Data

# Mean P/I ratio
mean(HMDA$pirat)

## [1] 0.3308136

# inhouse expense-to-total-income ratio
mean(HMDA$hirat)

## [1] 0.2553461

# loan-to-value ratio
mean(HMDA$lvrat)

## [1] 0.7377759

# consumer credit score
mean(as.numeric(HMDA$chist))

## [1] 2.116387

# mortgage credit score
mean(as.numeric(HMDA$mhist))

## [1] 1.721008

# public bad credit record
mean(as.numeric(HMDA$phist)-1)

## [1] 0.07352941

# denied mortgage insurance
prop.table(table(HMDA$insurance))

## 
##         no        yes 
## 0.97983193 0.02016807

# self-employed
prop.table(table(HMDA$selfemp))

## 
##        no       yes 
## 0.8836134 0.1163866

# single
prop.table(table(HMDA$single))

## 
##        no       yes 
## 0.6067227 0.3932773

# high school diploma
prop.table(table(HMDA$hschool))

## 
##         no        yes 
## 0.01638655 0.98361345

# unemployment rate
mean(HMDA$unemp)

## [1] 3.774496

# condominium
prop.table(table(HMDA$condomin))

## 
##        no       yes 
## 0.7117647 0.2882353

# black
prop.table(table(HMDA$black))

## 
##       no      yes 
## 0.857563 0.142437

# deny
prop.table(table(HMDA$deny))

## 
##        no       yes 
## 0.8802521 0.1197479

Before estimating the models we transform the loan-to-value ratio (lvrat) into a factor variable, where \[\begin{align*} lvrat = \begin{cases} \text{low} & \text{if} \ \ lvrat < 0.8, \\ \text{medium} & \text{if} \ \ 0.8 \leq lvrat \leq 0.95, \\ \text{high} & \text{if} \ \ lvrat > 0.95 \end{cases} \end{align*}\] and convert both credit scores to numeric variables.

# define low, medium and high loan-to-value ratio
HMDA$lvrat <- factor(
  ifelse(HMDA$lvrat < 0.8, "low",
  ifelse(HMDA$lvrat >= 0.8 & HMDA$lvrat <= 0.95, "medium", "high")),
  levels = c("low", "medium", "high"))


# convert credit scores to numeric
HMDA$mhist <- as.numeric(HMDA$mhist)
HMDA$chist <- as.numeric(HMDA$chist)

Next we reproduce the estimation results presented in Table 11.2 of the book.

# estimate all 5 models for the denial probability

logit_HMDA <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist 
                  + insurance + selfemp, 
                  family = binomial(link = "logit"), 
                  data = HMDA)

probit_HMDA_1 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist 
                     + insurance + selfemp, 
                     family = binomial(link = "probit"), 
                     data = HMDA)

probit_HMDA_2 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist 
                     + insurance + selfemp + single + hschool + unemp, 
                     family = binomial(link = "probit"), 
                     data = HMDA)

probit_HMDA_3 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist 
                     + phist + insurance + selfemp + single + hschool + unemp + condomin 
                     + I(mhist==3) + I(mhist==4) + I(chist==3) + I(chist==4) + I(chist==5) 
                     + I(chist==6), 
                     family = binomial(link = "probit"), 
                     data = HMDA)

probit_HMDA_4 <- glm(deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist 
                     + insurance + selfemp + single + hschool + unemp, 
                     family = binomial(link = "probit"), 
                     data = HMDA)

Just as in previous chapters, we store heteroskedasticity-robust standard errors of the coefficient estimators in a list which is then used as the argument se in stargazer().

rob_se <- list(sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1"))))


logit_HMDA$AIC <- AIC(logit_HMDA)
probit_HMDA_1$AIC <- AIC(probit_HMDA_1)
probit_HMDA_2$AIC <- AIC(probit_HMDA_2)
probit_HMDA_3$AIC <- AIC(probit_HMDA_3)
probit_HMDA_4$AIC <- AIC(probit_HMDA_4)

logit_HMDA$BIC <- BIC(logit_HMDA)
probit_HMDA_1$BIC <- BIC(probit_HMDA_1)
probit_HMDA_2$BIC <- BIC(probit_HMDA_2)
probit_HMDA_3$BIC <- BIC(probit_HMDA_3)
probit_HMDA_4$BIC <- BIC(probit_HMDA_4)


# logit model

stargazer(logit_HMDA, 
          digits = 3, out = "logit.html",
          type = "html", 
          header = FALSE,
          title = "Logit Regressions of Mortgage Denial Using the Boston HMDA data",
          dep.var.caption = "Dependent Variable: Mortgage Denial",
          se = rob_se,
          keep.stat = c("n","aic","bic"),
          star.char = c("+", "*", "**", "***"),
          star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
          notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"), 
          notes.append = FALSE,
          model.numbers = FALSE,
          column.labels = c( "(1)"))

**Logit Regressions of Mortgage Denial Using the Boston HMDA data**

	Dependent Variable: Mortgage Denial

	deny
	(1)

blackyes	0.688^***
	(0.183)

pirat	4.764^***
	(1.332)

hirat	-0.109
	(1.298)

lvratmedium	0.464^**
	(0.160)

lvrathigh	1.495^***
	(0.325)

chist	0.290^***
	(0.039)

mhist	0.279^*
	(0.138)

phistyes	1.226^***
	(0.203)

insuranceyes	4.548^***
	(0.576)

selfempyes	0.666^**
	(0.214)

Constant	-5.707^***
	(0.484)


Observations	2,380
Akaike Inf. Crit.	1,293.273
Bayesian Inf. Crit.	1,356.797

Note:	(+ p<0.1; * p<0.05; p<0.01; * p<0.001)

# probit model
stargazer(probit_HMDA_1, 
          probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,  
          digits = 3, out = "probit.html",
          type = "html", 
          header = FALSE,
          title = "Logit and Probit Regressions of Mortgage Denial Using the Boston HMDA data",
          dep.var.caption = "Dependent Variable: Mortgage Denial",
          se = rob_se,
          keep.stat = c("n","aic","bic"),
          star.char = c("+", "*", "**", "***"),
          star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
          notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"), 
          notes.append = FALSE,
          model.numbers = FALSE,
          column.labels = c( "(2)", "(3)", "(4)", "(5)"))

**Logit and Probit Regressions of Mortgage Denial Using the Boston HMDA data**

	Dependent Variable: Mortgage Denial

	deny
	(2)	(3)	(4)	(5)

blackyes	0.389^*	0.371^***	0.363^***	0.246^*
	(0.183)	(0.099)	(0.100)	(0.101)

pirat	2.442⁺	2.464^***	2.622^***	2.572^***
	(1.332)	(0.673)	(0.654)	(0.665)

hirat	-0.185	-0.302	-0.502	-0.538
	(1.298)	(0.689)	(0.689)	(0.715)

lvratmedium	0.214	0.216^**	0.215^**	0.216^*
	(0.160)	(0.082)	(0.082)	(0.084)

lvrathigh	0.791^*	0.795^***	0.836^***	0.788^***
	(0.325)	(0.183)	(0.184)	(0.185)

chist	0.155^***	0.158^***	0.344^***	0.158
	(0.039)	(0.021)	(0.021)	(0.108)

mhist	0.148	0.110	0.162^*	0.111
	(0.138)	(0.073)	(0.076)	(0.104)

phistyes	0.697^***	0.702^***	0.717^***	0.705^***
	(0.203)	(0.114)	(0.115)	(0.116)

insuranceyes	2.557^***	2.585^***	2.589^***	2.590^***
	(0.576)	(0.305)	(0.299)	(0.306)

selfempyes	0.359⁺	0.346^**	0.342^**	0.348^**
	(0.214)	(0.113)	(0.116)	(0.116)

singleyes		0.229	0.230^**	0.226^**
			(0.080)	(0.086)

hschoolyes		-0.613	-0.604^**	-0.620^**
			(0.229)	(0.237)

unemp		0.030	0.028	0.030⁺
			(0.018)	(0.018)

condominyes			-0.055


I(mhist == 3)			-0.107


I(mhist == 4)			-0.383


I(chist == 3)			-0.226


I(chist == 4)			-0.251


I(chist == 5)			-0.789


I(chist == 6)			-0.905


blackyes:pirat				-0.579


blackyes:hirat				1.232


Constant	-3.041^***	-2.575^***	-2.896^***	-2.543^***
	(0.484)	(0.250)	(0.350)	(0.404)


Observations	2,380	2,380	2,380	2,380
Akaike Inf. Crit.	1,295.694	1,285.227	1,292.129	1,288.664
Bayesian Inf. Crit.	1,359.218	1,366.075	1,413.401	1,381.062

Note:	(+ p<0.1; * p<0.05; p<0.01; * p<0.001)

Exponentiate Coefficients for logit_HMDA and Show Odds Ratios in Stargazer

# Exponentiate the coefficients and compute robust SE for the logit model
logit_OR <- exp(coef(logit_HMDA))
logit_OR_se <- exp(coef(logit_HMDA)) * sqrt(diag(vcovHC(logit_HMDA, type = "HC1")))

# Create a dummy model with exponentiated coefficients to use in stargazer
logit_HMDA_OR <- logit_HMDA
logit_HMDA_OR$coefficients <- logit_OR

# Stargazer table for exponentiated log odds (odds ratios)
stargazer(logit_HMDA_OR,
          type = "html",
          out = "logit_odds_ratios.html",
          title = "Exponentiated Coefficients (Odds Ratios) for Logit Model of Mortgage Denial",
          se = list(logit_OR_se),
          digits = 3,
          keep.stat = c("n","aic","bic"),
          dep.var.caption = "Dependent Variable: Mortgage Denial",
          star.char = c("+", "*", "**", "***"),
          star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
          notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"),
          notes.append = FALSE,
          model.numbers = FALSE,
          column.labels = c("Logit OR"))

## 
## <table style="text-align:center"><caption><strong>Exponentiated Coefficients (Odds Ratios) for Logit Model of Mortgage Denial</strong></caption>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>Dependent Variable: Mortgage Denial</td></tr>
## <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>deny</td></tr>
## <tr><td style="text-align:left"></td><td>Logit OR</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">blackyes</td><td>1.991<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.363)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">pirat</td><td>117.263</td></tr>
## <tr><td style="text-align:left"></td><td>(156.217)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">hirat</td><td>0.897</td></tr>
## <tr><td style="text-align:left"></td><td>(1.164)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">lvratmedium</td><td>1.590<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.255)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">lvrathigh</td><td>4.458<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(1.448)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">chist</td><td>1.337<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.052)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">mhist</td><td>1.322<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.182)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">phistyes</td><td>3.407<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.693)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">insuranceyes</td><td>94.459<sup>+</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(54.373)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">selfempyes</td><td>1.947<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.416)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>0.003<sup>*</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.002)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>2,380</td></tr>
## <tr><td style="text-align:left">Akaike Inf. Crit.</td><td>1,293.273</td></tr>
## <tr><td style="text-align:left">Bayesian Inf. Crit.</td><td>1,356.797</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right">(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)</td></tr>
## </table>

SPP608 Statistics in R: Probit and Logit Regression

Viviana Wu

4/8/2025

Lecture 10: Probit and Logit Regression (Binary DVs)

Probit Regression

Logit Regression

Application to Boston HMDA Data