L11 (Lab) - Regression with a Binary Dependent Variable

In this lab exercise, you will use command glm to run regressions for the probit model and the logit model.

Employment during the Great Recession

In April 2008, the unemployment rate in the United States stood at 5.0%. By April 2009, it had increased to 9.0%, and it had increased further, to 10.0%, by October 2009. Were some groups of workers more likely to lose their jobs than others during the Great Recession? For example, were young workers more likely to lose their jobs than middle-aged workers? What about workers with a college degree versus those without a degree or women versus men?

The data set used for this exercise is employment_08_09.xlsx from E11.1 of Stock and Watson (2020, e4). These data file contains data on 5412 workers who were survey in the April 2008 Current Population Survey and reported that they were employed. These workers were surveyed one year later, in April 2009, and asked about their employment status (employed, unemployed, or out of the labor force). The data set also includes various demographic measures for each individual. A detailed description of the data is given in employment_08_09.pdf, available in LMS.

Clear the Workspace

rm(list=ls())

Install and Load Needed Packages

Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).

library(openxlsx)         # load package; to read xlsx file

## Warning: package 'openxlsx' was built under R version 4.3.3

library(lmtest)           # load package; to conduct hypothesis test using robust SE
library(sandwich)         # load package; to compute heteroskedasticity robust standard errors
library(car)              # load package; to conduct hypothesis for multiple coefficients
library(stargazer)        # load package; to put regression results into a single stargazer table

Import Data: employment_08_09

id <- "13Eed2So0p3Z22FD36VgCsL63VUTaVPUd"
EMP <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(EMP)

## 'data.frame':    5412 obs. of  21 variables:
##  $ age         : num  53 39 41 27 29 50 27 24 63 43 ...
##  $ race        : num  1 1 1 1 3 3 1 1 2 1 ...
##  $ earnwke     : num  NA NA 500 520 615 ...
##  $ employed    : num  1 1 1 1 1 1 1 0 1 1 ...
##  $ unemployed  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ married     : num  1 1 1 1 0 1 0 0 0 0 ...
##  $ union       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ ne_states   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ so_states   : num  0 0 1 1 1 0 1 1 1 1 ...
##  $ ce_states   : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ we_states   : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ government  : num  0 0 0 0 1 1 0 1 0 0 ...
##  $ private     : num  0 0 1 1 0 0 1 0 1 0 ...
##  $ self        : num  1 1 0 0 0 0 0 0 0 1 ...
##  $ educ_lths   : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ educ_hs     : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ educ_somecol: num  1 0 1 0 0 0 0 0 0 1 ...
##  $ educ_aa     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ educ_bac    : num  0 1 0 0 0 1 1 0 0 0 ...
##  $ educ_adv    : num  0 0 0 0 1 0 0 1 0 0 ...
##  $ female      : num  0 1 1 0 0 1 0 1 1 1 ...

attach(EMP)

Description of main variables:

employed: \(=1\) if employed in 2009.
unemployed: \(=1\) if unemployed in 2009.
age: age.
female: \(=1\) if female.
married: \(=1\) if married.
race: \(= 1\) if self-identified race = white (only); \(= 2\) if self-identified race = black (only); \(= 3\) if self-identified race was not white (only) or black (only).
earnwke: average weekly earnings.
educ_lths: \(=1\) if highest level of education is less than a high school graduate.
educ_hs: \(=1\) if highest level of education is a high school graduate.
educ_somecol: \(=1\) if highest level of education is some college.
educ_aa: \(=1\) if highest level of education is AA degree.
educ_hs: \(=1\) if highest level of education is BA or BS degree.
educ_hs: \(=1\) if highest level of education is advanced degree.

What fraction of workers in the sample were employed in April 2009? Use your answer to compute a 95% confidence interval for the probability that a worker was employed in April 2009, conditional on being employed in April 2008.

[Ans] 87.55% reported being employed. A 95% confidence interval is 86.67% to 88.43%, which is very narrow because the sample size is large, n = 5412.

n <- dim(EMP)[1]   # sample size

# Employment rate
emp.rt <- sum(employed)/n  
emp.rt

## [1] 0.8754619

# 95% CI for the employment rate
sd.emp <- sd(employed)        # sd for the variable "employed"
sd.emp.rt <- sd.emp/sqrt(n)   # sd of emp.rt, which is the average of "employed"
sd.emp.rt

## [1] 0.004488807

LCL <- emp.rt - 1.96*sd.emp.rt
UCL <- emp.rt + 1.96*sd.emp.rt
LCL; UCL          # 95% CI

## [1] 0.8666639

## [1] 0.88426

Regress \(employed\) on \(age\) and \(age^2\), using a linear probability model.

b(i). Based on this regression, was age a statistically significant determinant of employment in April 2009? Is there evidence of a nonlinear effect of age on the probability of being employed?

[Ans] Based on both the t test and the F test, age was a statistically significant determinant of employment in April 2009. In addition, the coefficient on \(age^2\) is negative and statistically significant at the 1% level; thus age appears to have nonlinear effect on the probability of employment.

Remarks:

Use heteroskedasticity robust standard errors for hypothesis tests.
What is the value of \(age\) that maximizes the probability of employment?

## Linear Probability Model
model <- employed~age+I(age^2)   # regression equation
fit.lpm <- lm(model)             # OLS regression
coeftest(fit.lpm, vcov=vcovHC, type="HC1")  #use heteroskedasticity robust SE

## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)  3.0749e-01  6.6858e-02  4.5992 4.338e-06 ***
## age          2.8272e-02  3.2812e-03  8.6165 < 2.2e-16 ***
## I(age^2)    -3.2663e-04  3.8773e-05 -8.4243 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Test joint significance of age and age^2
linearHypothesis(fit.lpm, c("age=0", "I(age^2)=0"), white.adjust=c("hc1"))

## Linear hypothesis test
## 
## Hypothesis:
## age = 0
## I(age^2) = 0
## 
## Model 1: restricted model
## Model 2: employed ~ age + I(age^2)
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1   5411                        
## 2   5409  2 37.502 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

b(ii). Compute the predicted probability of employment for a 20-year-old worker, a 40-year-old worker, and a 60-year-old worker.

[Ans] \(Pr(Emp.| Age = 20) = 0.74\); \(Pr(Emp. | Age = 40) = 0.92\); \(Pr(Emp. | Age = 60) =0.83\).

# Predicted probability of employment for a 20/40/60-year-old worker
b0 <- fit.lpm$coefficients[1]  # estimated intercept
b1 <- fit.lpm$coefficients[2]  # estimated coef. for age
b2 <- fit.lpm$coefficients[3]  # estimated coef. for age^2

emp.20 <- b0 + b1*20 + b2*20^2 # Predicted probability for age 20
emp.40 <- b0 + b1*40 + b2*40^2 # Predicted probability for age 40
emp.60 <- b0 + b1*60 + b2*60^2 # Predicted probability for age 60

emp.20; emp.40; emp.60

## (Intercept) 
##   0.7422841

## (Intercept) 
##   0.9157685

## (Intercept) 
##   0.8279458

Regress Employed on Age and Age2, using a probit model. Compute the predicted probability of employment for a 20-year-old worker, a 40-year-old worker, and a 60-year-old worker.

[Ans] \(Pr(Emp.| Age = 20) = 0.73\); \(Pr(Emp. | Age = 40) = 0.91\); \(Pr(Emp. | Age = 60) =0.83\).

## Probit Model
fit.probit <- glm(model, family = binomial(link = "probit")) # probit regression
coeftest(fit.probit, vcov. = vcovHC, type = "HC1") #use heteroskedasticity robust SE

## 
## z test of coefficients:
## 
##               Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -1.2579285  0.2521475 -4.9889 6.074e-07 ***
## age          0.1217230  0.0130507  9.3269 < 2.2e-16 ***
## I(age^2)    -0.0014125  0.0001576 -8.9622 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Predicted probability of employment for a 20/40/60-year-old worker
b0 <- fit.probit$coefficients[1]  # estimated intercept
b1 <- fit.probit$coefficients[2]  # estimated coef. for age
b2 <- fit.probit$coefficients[3]  # estimated coef. for age^2

emp.20 <- pnorm(b0 + b1*20 + b2*20^2) # Predicted probability for age 20
emp.40 <- pnorm(b0 + b1*40 + b2*40^2) # Predicted probability for age 40
emp.60 <- pnorm(b0 + b1*60 + b2*60^2) # Predicted probability for age 60
emp.20; emp.40; emp.60

## (Intercept) 
##   0.7295817

## (Intercept) 
##   0.9116616

## (Intercept) 
##   0.8316237

Regress Employed on Age and Age2, using a logit model. Compute the predicted probability of employment for a 20-year-old worker, a 40-year-old worker, and a 60-year-old worker.

[Ans] \(Pr(Emp.| Age = 20) = 0.73\); \(Pr(Emp. | Age = 40) = 0.91\); \(Pr(Emp. | Age = 60) =0.83\).

## Logit Model
fit.logit <- glm(model, family = binomial(link = "logit")) # logit regression
coeftest(fit.logit, vcov. = vcovHC, type = "HC1") #use heteroskedasticity robust SE

## 
## z test of coefficients:
## 
##                Estimate  Std. Error z value  Pr(>|z|)    
## (Intercept) -2.48975412  0.44627864 -5.5789  2.42e-08 ***
## age          0.22546624  0.02345904  9.6111 < 2.2e-16 ***
## I(age^2)    -0.00262366  0.00028483 -9.2112 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Predicted probability of employment for a 20/40/60-year-old worker
b0 <- fit.logit$coefficients[1]  # estimated intercept
b1 <- fit.logit$coefficients[2]  # estimated coef. for age
b2 <- fit.logit$coefficients[3]  # estimated coef. for age^2

emp.20 <- plogis(b0 + b1*20 + b2*20^2) # Predicted probability for age 20
emp.40 <- plogis(b0 + b1*40 + b2*40^2) # Predicted probability for age 40
emp.60 <- plogis(b0 + b1*60 + b2*60^2) # Predicted probability for age 60
emp.20; emp.40; emp.60

## (Intercept) 
##    0.725141

## (Intercept) 
##   0.9114157

## (Intercept) 
##   0.8310454

Are there important differences in your answers to (b)–(d)? Explain.

[Ans] The results for the probit and logit models are essentially identical. The predicted probability of employmnet based on linear probability model are approximately 1% larger than the probit/logit estimates for individuals with age 20 and 40. But the age effects for the probit, logit, and linear probability models exhibit the same pattern across ages.

f(i). The data set includes variables measuring the workers’ educational attainment, sex, race, marital status, and weekly earnings in April 2008. Run a regression including these variables and investigate whether the conclusion on the effect of age on employment from (b)-(d) are affected by omitted variable bias.

First, generate dummy variables for race and workers’ educational attainment.

# Generate dummy variable for race #
white <- as.numeric(race==1, 1, 0)

# Generate dummy varialbes for education
educ_1 <- as.numeric(educ_lths==1|educ_hs==1, 1, 0)
educ_2 <- as.numeric(educ_somecol==1|educ_aa==1, 1, 0)
educ_3 <- as.numeric(educ_bac==1, 1, 0)
educ_4 <- as.numeric(educ_adv==1, 1, 0)

Second, define a model with additional control variables.

# Regression Model 
model2 <- employed ~ age + I(age^2) + white + 
  female + married + earnwke + educ_2 + educ_3 + educ_4

Third, run a linear probability model and compute the predicted probabilities of employment for different age groups.

## (1) Linear Probability Model
fit2.lpm <- lm(model2)
coeftest(fit2.lpm, vcov=vcovHC, type="HC1")  #use heteroskedasticity robust SE

## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)  3.0768e-01  7.1698e-02  4.2913 1.811e-05 ***
## age          2.5037e-02  3.5642e-03  7.0244 2.453e-12 ***
## I(age^2)    -2.9133e-04  4.2093e-05 -6.9212 5.074e-12 ***
## white        2.6387e-02  1.3635e-02  1.9353 0.0530179 .  
## female      -3.1619e-03  1.0073e-02 -0.3139 0.7536049    
## married     -1.9204e-03  1.0429e-02 -0.1841 0.8539147    
## earnwke      3.5660e-05  9.4368e-06  3.7788 0.0001595 ***
## educ_2       3.6365e-02  1.2159e-02  2.9906 0.0027981 ** 
## educ_3       2.0172e-02  1.4107e-02  1.4300 0.1527957    
## educ_4       3.3132e-02  1.6187e-02  2.0468 0.0407282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Predicted probability of employment for a 20/40/60-year-old worker
b0 <- fit2.lpm$coefficients[1]  # estimated intercept
b1 <- fit2.lpm$coefficients[2]  # estimated coef. for age
b2 <- fit2.lpm$coefficients[3]  # estimated coef. for age^2

emp.20 <- b0 + b1*20 + b2*20^2 # Predicted probability for age 20
emp.40 <- b0 + b1*40 + b2*40^2 # Predicted probability for age 40
emp.60 <- b0 + b1*60 + b2*60^2 # Predicted probability for age 60

emp.20; emp.40; emp.60

## (Intercept) 
##   0.6918811

## (Intercept) 
##   0.8430173

## (Intercept) 
##   0.7610885

Fourth, run a probit model and compute the predicted probabilities of employment for different age groups.

## (2) Probit Model
fit2.probit <- glm(model2, family = binomial(link = "probit"))
coeftest(fit2.probit, vcov. = vcovHC, type = "HC1")

## 
## z test of coefficients:
## 
##                Estimate  Std. Error z value  Pr(>|z|)    
## (Intercept) -1.2632e+00  2.7905e-01 -4.5270 5.984e-06 ***
## age          1.0645e-01  1.4858e-02  7.1647 7.793e-13 ***
## I(age^2)    -1.2505e-03  1.7830e-04 -7.0132 2.330e-12 ***
## white        1.1946e-01  6.1721e-02  1.9355 0.0529253 .  
## female      -1.4886e-02  5.0238e-02 -0.2963 0.7669872    
## married     -1.5055e-02  5.3251e-02 -0.2827 0.7773974    
## earnwke      2.1254e-04  6.0857e-05  3.4924 0.0004787 ***
## educ_2       1.7472e-01  5.8153e-02  3.0045 0.0026601 ** 
## educ_3       8.9064e-02  6.9582e-02  1.2800 0.2005456    
## educ_4       1.5967e-01  9.1349e-02  1.7479 0.0804821 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Predicted probability of employment for a 20/40/60-year-old worker
b0 <- fit2.probit$coefficients[1]  # estimated intercept
b1 <- fit2.probit$coefficients[2]  # estimated coef. for age
b2 <- fit2.probit$coefficients[3]  # estimated coef. for age^2

emp.20 <- pnorm(b0 + b1*20 + b2*20^2) # Predicted probability for age 20
emp.40 <- pnorm(b0 + b1*40 + b2*40^2) # Predicted probability for age 40
emp.60 <- pnorm(b0 + b1*60 + b2*60^2) # Predicted probability for age 60

emp.20; emp.40; emp.60

## (Intercept) 
##   0.6426879

## (Intercept) 
##   0.8399315

## (Intercept) 
##     0.73314

Finally, run a logit model and compute the predicted probabilities of employment for different age groups.

## (3) Logit Model
fit2.logit <- glm(model2, family = binomial(link = "logit"))
coeftest(fit2.logit, vcov. = vcovHC, type = "HC1")

## 
## z test of coefficients:
## 
##                Estimate  Std. Error z value  Pr(>|z|)    
## (Intercept) -2.44291507  0.49775097 -4.9079 9.205e-07 ***
## age          0.19334995  0.02701894  7.1561 8.301e-13 ***
## I(age^2)    -0.00227840  0.00032515 -7.0072 2.432e-12 ***
## white        0.22533632  0.11358848  1.9838 0.0472786 *  
## female      -0.02203740  0.09409693 -0.2342 0.8148306    
## married     -0.03340792  0.10062645 -0.3320 0.7398897    
## earnwke      0.00043180  0.00012111  3.5655 0.0003632 ***
## educ_2       0.31539872  0.10835023  2.9109 0.0036037 ** 
## educ_3       0.13857721  0.13053679  1.0616 0.2884195    
## educ_4       0.27806106  0.17694785  1.5714 0.1160829    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Predicted probability of employment for a 20/40/60-year-old worker
b0 <- fit2.logit$coefficients[1]  # estimated intercept
b1 <- fit2.logit$coefficients[2]  # estimated coef. for age
b2 <- fit2.logit$coefficients[3]  # estimated coef. for age^2

emp.20 <- pnorm(b0 + b1*20 + b2*20^2) # Predicted probability for age 20
emp.40 <- pnorm(b0 + b1*40 + b2*40^2) # Predicted probability for age 40
emp.60 <- pnorm(b0 + b1*60 + b2*60^2) # Predicted probability for age 60

emp.20; emp.40; emp.60

## (Intercept) 
##   0.6959274

## (Intercept) 
##   0.9500809

## (Intercept) 
##   0.8304216

Results from (b)-(f) can be summarized in the following table:

## Summary of the regression results ##
rob_se <- list(sqrt(diag(vcovHC(fit.lpm, type = "HC1"))),
               sqrt(diag(vcovHC(fit.probit, type = "HC1"))),
               sqrt(diag(vcovHC(fit.logit, type = "HC1"))),
               sqrt(diag(vcovHC(fit2.lpm, type = "HC1"))),
               sqrt(diag(vcovHC(fit2.probit, type = "HC1"))),
               sqrt(diag(vcovHC(fit2.logit, type = "HC1"))))

stargazer(fit.lpm, fit.probit, fit.logit, fit2.lpm, fit2.probit, fit2.logit, title="Regression Results: Employment_08_09", type="text", se = rob_se, df=FALSE, digits=2)

## 
## Regression Results: Employment
## =================================================================================
##                                          Dependent variable:                     
##                     -------------------------------------------------------------
##                                               employed                           
##                        OLS      probit   logistic     OLS      probit   logistic 
##                        (1)        (2)       (3)       (4)        (5)       (6)   
## ---------------------------------------------------------------------------------
## age                  0.03***    0.12***   0.23***   0.03***    0.11***   0.19*** 
##                      (0.003)    (0.01)    (0.02)    (0.004)    (0.01)    (0.03)  
##                                                                                  
## I(age2)             -0.0003*** -0.001*** -0.003*** -0.0003*** -0.001*** -0.002***
##                      (0.0000)  (0.0002)  (0.0003)   (0.0000)  (0.0002)  (0.0003) 
##                                                                                  
## white                                                0.03*      0.12*    0.23**  
##                                                      (0.01)    (0.06)    (0.11)  
##                                                                                  
## female                                               -0.003     -0.01     -0.02  
##                                                      (0.01)    (0.05)    (0.09)  
##                                                                                  
## married                                              -0.002     -0.02     -0.03  
##                                                      (0.01)    (0.05)    (0.10)  
##                                                                                  
## earnwke                                            0.0000***  0.0002*** 0.0004***
##                                                     (0.0000)  (0.0001)  (0.0001) 
##                                                                                  
## educ_2                                              0.04***    0.17***   0.32*** 
##                                                      (0.01)    (0.06)    (0.11)  
##                                                                                  
## educ_3                                                0.02      0.09      0.14   
##                                                      (0.01)    (0.07)    (0.13)  
##                                                                                  
## educ_4                                               0.03**     0.16*     0.28   
##                                                      (0.02)    (0.09)    (0.18)  
##                                                                                  
## Constant             0.31***   -1.26***  -2.49***   0.31***   -1.26***  -2.44*** 
##                       (0.07)    (0.25)    (0.45)     (0.07)    (0.28)    (0.50)  
##                                                                                  
## ---------------------------------------------------------------------------------
## Observations          5,412      5,412     5,412     4,773      4,773     4,773  
## R2                     0.02                           0.03                       
## Adjusted R2            0.02                           0.03                       
## Log Likelihood                 -1,986.92 -1,986.43            -1,733.87 -1,733.62
## Akaike Inf. Crit.              3,979.84  3,978.86             3,487.73  3,487.25 
## Residual Std. Error    0.33                           0.33                       
## F Statistic          54.22***                       15.19***                     
## =================================================================================
## Note:                                                 *p<0.1; **p<0.05; ***p<0.01

f(ii). Based on the logit regression, discuss the characteristics of workers who were hurt most by the Great Recession.

[Ans] The effect of Age on employment is increasing until approximately Age = 47.5 (\(-\hat\beta_{age}/(2*\hat\beta_{age^2})\)) and then declining; this suggests that younger and older workers are less likely to remain employed than middle-aged workers. The regression suggests that workers with the education levels shown in the table were more likely to remain employed than those in the omitted education category (no college). Workers with higher wages were more likely to remain employed.

Exercise-I. The results in (a)–(f) were based on the probability of employment. Workers who are not employed can either be (i) unemployed or (ii) out the labor force. Do the conclusions in (a)–(f) also hold for workers who became unemployed? (Hint: Use the binary variable Unemployed instead of Employed.)

Exercise-II. These results have covered employment transitions during the Great Recession, but what about transitions during normal times? Use the data file Employment_06_07, which measures the same variables but for the years 2006–2007, and analyze these data. Comment on the differences in employment transitions during recessions and normal times.

## Import data: employmnet_06_07
id <- "1wOEv1oQLRqtD5MEcQ7jxrX6fMZIcrw9M"
EMP_0607 <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)

L11 (Lab) - Regression with a Binary Dependent Variable

Rui Fan

Employment during the Great Recession

Clear the Workspace

Install and Load Needed Packages

Import Data: employment_08_09