So far, in all of the models we have examined, the dependent variable y has been a quantitative variable, e.g., wages, GPA, prices, etc.
Can we explain a qualitative (i.e., binary or dummy) variable using multiple regression?
A binary dependent variable takes the values y = 1 or y = 0; e.g., it may indicate whether an adult has a high school education, whether a household owns a house, whether an adult is married, owns a car, etc.
The case y = 1 is called a success, whereas y = 0 is called a failure.
What happens if we regress a 0/1 variable on a set of independent variables? How can we interpret regression coefficients?
2 Linear Probability Model
Under the standard assumptions, the conditional expectation of the dependent variable can be written as follows:
E(y|x) = P(y=1|x) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k
The probability of success is given by p(x) = P(y = 1|x). The expression above states that the success probability is a linear function of the x variables.
Slope coefficients are now interpreted as the change in the probability of success:
\Delta P(y=1|x) = \beta_j \Delta x_j
If we use OLS estimation then the OLS sample regression equation is given by
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + \cdots + \hat{\beta}_kx_k
- In this estimated equation, \hat{y} is the predicted probability of success.
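The output below comes from estimating an LPM for married women's labor force participation (the Mroz data). A minimal sketch of the estimation call is given here; the object name lpm1 (reused later) and the wooldridge package as the source of the mroz data frame are assumptions.
Code
library(wooldridge)   # assumed source of the mroz data frame
data(mroz)
lpm1 <- lm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6, data = mroz)
summary(lpm1)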
Call:
lm(formula = inlf ~ nwifeinc + educ + exper + expersq + age +
kidslt6 + kidsge6, data = mroz)
Residuals:
Min 1Q Median 3Q Max
-0.93432 -0.37526 0.08833 0.34404 0.99417
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5855192 0.1541780 3.798 0.000158 ***
nwifeinc -0.0034052 0.0014485 -2.351 0.018991 *
educ 0.0379953 0.0073760 5.151 3.32e-07 ***
exper 0.0394924 0.0056727 6.962 7.38e-12 ***
expersq -0.0005963 0.0001848 -3.227 0.001306 **
age -0.0160908 0.0024847 -6.476 1.71e-10 ***
kidslt6 -0.2618105 0.0335058 -7.814 1.89e-14 ***
kidsge6 0.0130122 0.0131960 0.986 0.324415
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4271 on 745 degrees of freedom
Multiple R-squared: 0.2642, Adjusted R-squared: 0.2573
F-statistic: 38.22 on 7 and 745 DF, p-value: < 2.2e-16
All variables are individually statistically significant except kidsge6. All coefficients have the expected signs based on standard economic theory and intuition.
Interpretation of estimated coefficients:
For example, the coefficient estimate on educ, 0.038, implies that, ceteris paribus, an additional year of education increases the predicted probability of labor force participation by 0.038.
The coefficient estimate on nwifeinc: if non-wife household income increases by 10 units (i.e., $10,000), the probability of labor force participation falls by 0.034.
exper enters quadratically, which means that the effect of past experience on the probability of labor force participation is diminishing.
The number of young children has a big impact on labor force participation. The coefficient estimate on kidslt6 is -0.262, which means that, ceteris paribus, having one additional child under six years old reduces the probability of participation by 0.262.
2.2 Shortcomings of LPM
The predicted probability of success is given by \hat{y}, and it can take values outside the [0, 1] interval. Obviously, this contradicts the rules of probability.
In the example out of 753 observations, 16 have \widehat{inlf} < 0 and 17 have \widehat{inlf} > 1.
If these are relatively few, they can be interpreted as 0 and 1, respectively.
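These counts can be verified directly from the fitted values; a minimal sketch, assuming the LPM object lpm1 from above:
Code
sum(fitted(lpm1) < 0)   # number of fitted probabilities below 0
sum(fitted(lpm1) > 1)   # number of fitted probabilities above 1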
Nevertheless, the major shortcoming of the LPM is not implausible probability predictions. The major problem is that a probability cannot be linearly related to the independent variables for all of their possible values.
In the example, the model predicts that the effect of going from zero children to one young child reduces the probability of working by 0.262.
This is also the predicted drop if the woman goes from one child to two, or from two to three, etc.
It seems more realistic that the first small child would reduce the probability by a large amount, while subsequent children would have a smaller marginal effect.
Thus, the relationship may be nonlinear.
Code
library(ggplot2)
ggplot(mroz, aes(x = kidslt6, y = inlf)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
Despite these shortcomings, the LPM is useful and often applied in economics.
It usually works well for values of the independent variables that are near the averages in the sample.
In the previous example, 96% of the women have either no children or one child under six. Thus, the coefficient estimate on kidslt6 (-0.262) essentially measures the impact of the first young child on the probability of labor force participation.
Therefore, we should not use this estimate for changes from 3 to 4 children, or 4 to 5, etc.
The LPM is heteroskedastic
Recall that y is a binary variable following a Bernoulli distribution with variance given by:
Var(u|x) = Var(y|x) = p(x) \times [1 - p(x)]
Since p(x) is a linear function of the x variables, Var(u|x) is not constant.
We learned that in this case OLS is unbiased and consistent but inefficient. The Gauss-Markov theorem fails, and the usual standard errors and inference procedures are not valid.
It is possible to find more efficient estimators than OLS.
Code
plot(fitted(lpm1), resid(lpm1))
abline(0, 0, col = "red")
There are two widely used binary dependent variable models: logit and probit
As stated above, the LPM is heteroskedastic. In order to conduct hypothesis tests and construct confidence intervals for the marginal effects of the explanatory variables on the outcome variable, we must first correct for heteroskedasticity. We can use the White estimator to correct for heteroskedasticity.
We compute the White heteroskedasticity-consistent variance/covariance matrix of the coefficients with a call to vcovHC() (which stands for variance/covariance heteroskedasticity-consistent), provided by the sandwich package.
Then we call coeftest() (from the lmtest package) to use this estimate of the variance/covariance matrix to properly compute the standard errors, t-statistics, and p-values for the coefficients.
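A minimal sketch of these two steps, assuming the LPM object lpm1 from above:
Code
library(sandwich)
library(lmtest)
# White heteroskedasticity-consistent variance/covariance matrix
vc <- vcovHC(lpm1, type = "HC1")
# coefficient tests using the robust standard errors
coeftest(lpm1, vcov = vc)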
In probit regression, the cumulative standard normal distribution function \Phi(.) is used to model the regression function when the dependent variable is binary, that is, we assume
E(Y|X) = P(Y=1|X) = \Phi(\beta_0 + \beta_1X)
\beta_0 + \beta_1X plays the role of a quantile z. Remember that \Phi(z) = P(Z \leq z), where Z \sim N(0,1), so the probit coefficient \beta_1 is the change in z associated with a one-unit change in X.
Although the effect on z of a change in X is linear, the link between z and the dependent variable Y is nonlinear, since \Phi is a nonlinear function of X.
Assume that Y is a binary variable. The model
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_kX_k + u
with
P(Y=1|X_1, X_2, \ldots, X_k) = \Phi(\beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_kX_k)
is the population probit model with multiple regressors.
Look up \Phi(z) in a standard normal table or use the pnorm() function in R.
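For instance, the success probability implied by a hypothetical index value z = 0.5 is \Phi(0.5) \approx 0.691:
Code
pnorm(0.5)   # probability implied by a hypothetical index z = 0.5; about 0.691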
In R, probit models can be estimated using the function glm() from the package stats; via the argument family we specify that we want to use a probit link function.
Code
inlf_probit <- glm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6,
                   family = binomial(link = "probit"), data = mroz)
summary(inlf_probit)
Call:
glm(formula = inlf ~ nwifeinc + educ + exper + expersq + age +
kidslt6 + kidsge6, family = binomial(link = "probit"), data = mroz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2700736 0.5080782 0.532 0.59503
nwifeinc -0.0120236 0.0049392 -2.434 0.01492 *
educ 0.1309040 0.0253987 5.154 2.55e-07 ***
exper 0.1233472 0.0187587 6.575 4.85e-11 ***
expersq -0.0018871 0.0005999 -3.145 0.00166 **
age -0.0528524 0.0084624 -6.246 4.22e-10 ***
kidslt6 -0.8683247 0.1183773 -7.335 2.21e-13 ***
kidsge6 0.0360056 0.0440303 0.818 0.41350
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1029.7 on 752 degrees of freedom
Residual deviance: 802.6 on 745 degrees of freedom
AIC: 818.6
Number of Fisher Scoring iterations: 4
The magnitudes of the coefficients have no direct intuitive interpretation: they measure the change in the z-index (probit) or in the log odds (logit) associated with a one-unit increase in each explanatory variable.
The marginal effects of a regression are estimates of the impact that one-unit increases in the explanatory variables have on the outcome variable.
In a logistic regression, the marginal effects are defined as the impact that one-unit increases in the explanatory variables have on the probability that the outcome variable is equal to 1.0.
The function maBina() from the package erer can be used to compute the marginal effects of the logistic regression as follows.
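A minimal sketch is given below. It assumes a logit counterpart inlf_logit with the same specification as the probit above (this object is also used in the goodness-of-fit and confusion-matrix steps that follow), and it fits the model with x = TRUE because maBina() uses the stored model matrix.
Code
library(erer)
# logit model with the same specification as the probit above (assumed);
# x = TRUE stores the model matrix, which maBina() uses
inlf_logit <- glm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6,
                  family = binomial(link = "logit"), data = mroz, x = TRUE)
maBina(inlf_logit)   # marginal effects evaluated at the means of the regressors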
Hosmer-Lemeshow test of goodness of fit using hoslem.test() from the ResourceSelection package.
Code
library(ResourceSelection)
hoslem.test(mroz$inlf, inlf_logit$fitted, g = 10)
Hosmer and Lemeshow goodness of fit (GOF) test
data: mroz$inlf, inlf_logit$fitted
X-squared = 12.851, df = 8, p-value = 0.1171
4.3 Confusion Matrix
A confusion matrix gives a short comparison of the actual values of the outcome variable and the values predicted by the model.
The predicted probabilities P(Y=1|X) are converted into predicted 0/1 values; the confusion matrix compares these with the actual values and reports how often we correctly identified Y=1 and Y=0.
The code below creates predicted values for Y based on a cutoff value of 0.5 for the predicted probability.
Code
pred <- as.numeric(inlf_logit$fitted >= 0.5)
The code below computes the confusion matrix along with several other model performance measures. The confusionMatrix() function is in the caret package.
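A minimal sketch, assuming pred and mroz$inlf are the 0/1 vectors defined above:
Code
library(caret)
# compare predicted classes with the actual outcomes (positive class: inlf = 1)
confusionMatrix(factor(pred), factor(mroz$inlf), positive = "1")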