Labor Econ: HW4

Replicate my results in Quarto or R Markdown (Rmd), and state your answer clearly wherever a question asks for one.

When submitting your HW, you should submit a ZIP file containing both the qmd (or Rmd) source and the generated HTML file.

We’ll be using the dataset PSID1976 from package AER. PSID1976 contains cross-section data originating from the 1976 Panel Study of Income Dynamics (PSID).

library(AER)
library(ggplot2)
library(dplyr)
data("PSID1976")
set.seed(2020)

You may want to view the documentation of the dataset by running ?PSID1976. The data come from Mroz (1987), published in Econometrica.

Part I

  1. Use the summary() function to summarize the dataset PSID1976.
 participation     hours          youngkids         oldkids     
 no :325       Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
 yes:428       1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
               Median : 288.0   Median :0.0000   Median :1.000  
               Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
               3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
               Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
      age          education          wage           repwage         hhours    
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175  
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928  
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164  
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267  
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553  
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010  
      hage         heducation        hwage            fincome     
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500  
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428  
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880  
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081  
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200  
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000  
      tax           meducation       feducation         unemp         city    
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000   no :269  
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500   yes:484  
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500            
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624            
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000            
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000            
   experience    college   hcollege 
 Min.   : 0.00   no :541   no :458  
 1st Qu.: 4.00   yes:212   yes:295  
 Median : 9.00                      
 Mean   :10.63                      
 3rd Qu.:15.00                      
 Max.   :45.00                      
  2. Regress log(wage) on education using lm(). You will get an error. Why? Explain using the participation variable.
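
The mechanics behind this kind of error can be reproduced on a toy data frame. The sketch below is only an illustration on made-up numbers (the data frame `toy` and its values are hypothetical, not taken from PSID1976); it assumes that zero wages are what breaks the log transformation.

```r
# Hypothetical toy data: two rows have a recorded wage of 0.
toy <- data.frame(
  education = c(10, 12, 12, 14, 16, 12),
  wage      = c(0, 3.2, 0, 4.1, 6.5, 2.8)
)

log(0)  # -Inf: the log of a zero wage is not a finite number

# lm() cannot fit a response that contains -Inf, so it stops with an error.
# tryCatch() captures the error object instead of halting the script.
fit_all <- tryCatch(
  lm(log(wage) ~ education, data = toy),
  error = function(e) e
)
inherits(fit_all, "error")
conditionMessage(fit_all)

# Restricting to strictly positive wages gives a well-defined response.
fit_pos <- lm(log(wage) ~ education, data = subset(toy, wage > 0))
coef(fit_pos)
```

Note that `-Inf` is not `NA`, so the default `na.action` does not silently drop those rows; they have to be filtered out explicitly.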

  3. Regress log(wage) on education using only the women who participated in the labor force (participation == "yes"), and state the estimated return to an additional year of education.

(Intercept)   education 
 -0.1851968   0.1086487 
  4. Plot log(wage) against education along with the fitted regression line from Q3. (Note: I use ggplot2 for plotting here. You may use base R plot() if you like.)

  5. Regress education on feducation (father’s years of education) and comment on the regression table. Specifically, if we use feducation as the IV, will it satisfy the relevance restriction? How about the as-good-as-random assignment and exclusion restrictions?

Call:
lm(formula = education ~ feducation, data = subset(PSID1976, 
    participation == "yes"))

Residuals:
    Min      1Q  Median      3Q     Max 
-8.4704 -1.1231 -0.1231  0.9546  5.9546 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.23705    0.27594  37.099   <2e-16 ***
feducation   0.26944    0.02859   9.426   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.081 on 426 degrees of freedom
Multiple R-squared:  0.1726,    Adjusted R-squared:  0.1706 
F-statistic: 88.84 on 1 and 426 DF,  p-value: < 2.2e-16
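
For reading a first-stage table like the one above, the following sketch on simulated data may help. The variables `z`, `u` and `x` are hypothetical stand-ins, not PSID1976 columns. It shows that with a single instrument the first-stage F-statistic is the square of the instrument's t-statistic; a common rule of thumb treats an F above roughly 10 as evidence against a weak instrument. Relevance is the only one of the three IV conditions that a regression like this can check.

```r
set.seed(2020)
n <- 500
z <- rnorm(n)                     # hypothetical instrument
u <- rnorm(n)                     # unobserved confounder
x <- 1 + 0.5 * z + u + rnorm(n)   # endogenous regressor, shifted by z

first_stage <- lm(x ~ z)
fs <- summary(first_stage)

t_z <- fs$coefficients["z", "t value"]   # t-statistic on the instrument
f_z <- unname(fs$fstatistic["value"])    # overall F of the first stage

# With one instrument and an intercept, F equals t squared.
c(t = t_z, F = f_z, t_squared = t_z^2)
```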
  6. Use feducation as an IV for education in estimating the effect of schooling years on log(wage).

Call:
ivreg(formula = log(wage) ~ education | feducation, data = subset(PSID1976, 
    participation == "yes"))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0870 -0.3393  0.0525  0.4042  2.0677 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.44110    0.44610   0.989   0.3233  
education    0.05917    0.03514   1.684   0.0929 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6894 on 426 degrees of freedom
Multiple R-Squared: 0.09344,    Adjusted R-squared: 0.09131 
Wald test: 2.835 on 1 and 426 DF,  p-value: 0.09294 
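
To see what a just-identified IV estimate computes, here is a minimal sketch on simulated data, using base R only. All names (`z`, `u`, `x`, `y`) and the data-generating process are assumptions made for illustration, with a true effect of 0.10 and an unobserved confounder that pushes OLS upward. With one instrument and one endogenous regressor, the IV slope is the ratio cov(z, y) / cov(z, x), which matches the slope from running two-stage least squares by hand.

```r
set.seed(2020)
n <- 2000
z <- rnorm(n)                                        # hypothetical instrument
u <- rnorm(n)                                        # unobserved "ability"
x <- 12 + 0.8 * z + u + rnorm(n)                     # schooling-like regressor
y <- 0.5 + 0.10 * x + 0.5 * u + rnorm(n, sd = 0.5)   # true effect of x is 0.10

# OLS picks up the confounder u and overstates the effect in this setup.
b_ols <- unname(coef(lm(y ~ x))["x"])

# IV as a ratio of covariances (reduced form over first stage).
b_iv_ratio <- cov(z, y) / cov(z, x)

# The same number via two stages: fit x on z, then y on the fitted values.
x_hat <- fitted(lm(x ~ z))
b_iv_2sls <- unname(coef(lm(y ~ x_hat))["x_hat"])

c(ols = b_ols, iv_ratio = b_iv_ratio, iv_2sls = b_iv_2sls)
```

The second-stage lm() reproduces the point estimate only. Its standard errors are not valid for inference, which is one reason to use ivreg() for the actual exercise.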
  7. Repeat Q6 using mother’s education (meducation) as the IV and comment on the results.

Call:
ivreg(formula = log(wage) ~ education | meducation, data = subset(PSID1976, 
    participation == "yes"))

Residuals:
     Min       1Q   Median       3Q      Max 
-3.14184 -0.34291  0.05939  0.39750  2.05410 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.70217    0.48510   1.447    0.148
education    0.03855    0.03823   1.008    0.314

Residual standard error: 0.6987 on 426 degrees of freedom
Multiple R-Squared: 0.06881,    Adjusted R-squared: 0.06663 
Wald test: 1.017 on 1 and 426 DF,  p-value: 0.3138 

Part II

  1. Create a data frame df for all married women that were employed. This data frame will be used for the following exercises. (That is, you should filter the dataset using the condition participation=="yes". I prefer dplyr::tibble() when dealing with data frames. You can use data.frame() if you like.)
# A tibble: 428 × 21
   participation hours youngkids oldkids   age education  wage repwage hhours
   <fct>         <int>     <int>   <int> <int>     <int> <dbl>   <dbl>  <int>
 1 yes            1610         1       0    32        12  3.35    2.65   2708
 2 yes            1656         0       2    30        12  1.39    2.65   2310
 3 yes            1980         1       3    35        12  4.55    4.04   3072
 4 yes             456         0       3    34        12  1.10    3.25   1920
 5 yes            1568         1       2    31        14  4.59    3.6    2000
 6 yes            2032         0       0    54        12  4.74    4.7    1040
 7 yes            1440         0       2    37        16  8.33    5.95   2670
 8 yes            1020         0       0    54        12  7.84    9.98   4120
 9 yes            1458         0       2    48        12  2.13    0      1995
10 yes            1600         0       2    39        12  4.69    4.15   2100
# … with 418 more rows, and 12 more variables: hage <int>, heducation <int>,
#   hwage <dbl>, fincome <int>, tax <dbl>, meducation <int>, feducation <int>,
#   unemp <dbl>, city <fct>, experience <int>, college <fct>, hcollege <fct>
  2. Regress log(wage) on education, experience and experience^2. What is the OLS estimate of the return to education?
    (Intercept)       education      experience I(experience^2) 
  -0.5220405591    0.1074896390    0.0415665105   -0.0008111931 
  3. Regress education on experience, experience^2, feducation and meducation. Comment on your results.
    (Intercept)      experience I(experience^2)      feducation      meducation 
    9.102640110     0.045225423    -0.001009091     0.189548410     0.157597033 
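
When commenting on a first stage with two instruments, a joint test is usually more informative than the individual t-statistics, because correlated instruments can each look weaker than they are together. The sketch below is one way to run such a test in base R on simulated data; `z1`, `z2`, `w` and `x` are hypothetical names, and the data-generating process is an assumption.

```r
set.seed(2020)
n  <- 1000
z1 <- rnorm(n)                   # hypothetical instrument 1
z2 <- 0.5 * z1 + rnorm(n)        # hypothetical instrument 2, correlated with z1
w  <- runif(n, 0, 30)            # exogenous control (experience-like)
x  <- 10 + 0.4 * z1 + 0.3 * z2 + 0.05 * w + rnorm(n, sd = 1.5)

# First stage without and with the instruments; the controls stay in both.
restricted   <- lm(x ~ w + I(w^2))
unrestricted <- lm(x ~ w + I(w^2) + z1 + z2)

# F-test of H0: the coefficients on z1 and z2 are both zero.
joint   <- anova(restricted, unrestricted)
f_joint <- joint$F[2]
p_joint <- joint$`Pr(>F)`[2]
c(F = f_joint, p = p_joint)
```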
  4. Use ivreg() to estimate the return to education, using both feducation and meducation as instruments for education and keeping experience and experience^2 as exogenous controls.
    (Intercept)       education      experience I(experience^2) 
   0.0481003046    0.0613966279    0.0441703943   -0.0008989696 
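
The sketch below spells out two-stage least squares with two instruments and exogenous controls on simulated data. The variable names and the data-generating process are assumptions, not the homework data. The key point is that the exogenous controls appear in both stages. The optional last step compares against AER::ivreg(), whose formula lists the regressors before `|` and all exogenous variables (controls plus instruments) after it; it runs only if AER is installed.

```r
set.seed(2020)
n  <- 1000
z1 <- rnorm(n); z2 <- rnorm(n)   # two hypothetical instruments
w  <- runif(n, 0, 30)            # exogenous control (experience-like)
u  <- rnorm(n)                   # unobserved confounder
x  <- 10 + 0.4 * z1 + 0.3 * z2 + 0.05 * w + u + rnorm(n)
y  <- 0.2 + 0.08 * x + 0.04 * w - 0.0008 * w^2 + 0.5 * u + rnorm(n, sd = 0.5)
sim <- data.frame(y, x, w, z1, z2)

# Stage 1: endogenous regressor on the controls and both instruments.
stage1 <- lm(x ~ w + I(w^2) + z1 + z2, data = sim)
sim$x_hat <- fitted(stage1)

# Stage 2: outcome on the fitted values and the same controls.
stage2 <- lm(y ~ x_hat + w + I(w^2), data = sim)
b_2sls <- unname(coef(stage2)["x_hat"])
b_2sls

# Optional cross-check of the point estimate against AER::ivreg().
if (requireNamespace("AER", quietly = TRUE)) {
  iv <- AER::ivreg(y ~ x + w + I(w^2) | w + I(w^2) + z1 + z2, data = sim)
  print(all.equal(unname(coef(iv)["x"]), b_2sls))
}
```

As with any manual second stage, only the coefficients carry over; the standard errors printed by summary(stage2) ignore the first-stage estimation and should not be reported.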
  5. Regress log(wage) on education, experience, experience^2 and residuals from the model estimated in Q3. Use your result to test for the endogeneity of education. Can you conclude that education is endogenous?

Call:
lm(formula = log(wage) ~ education + experience + I(experience^2) + 
    residuals(model3), data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.03743 -0.30775  0.04191  0.40361  2.33303 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.0481003  0.3945753   0.122 0.903033    
education          0.0613966  0.0309849   1.981 0.048182 *  
experience         0.0441704  0.0132394   3.336 0.000924 ***
I(experience^2)   -0.0008990  0.0003959  -2.271 0.023672 *  
residuals(model3)  0.0581666  0.0348073   1.671 0.095441 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.665 on 423 degrees of freedom
Multiple R-squared:  0.1624,    Adjusted R-squared:  0.1544 
F-statistic:  20.5 on 4 and 423 DF,  p-value: 1.888e-15
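
The regression above is a control-function form of the endogeneity test: the first-stage residuals are added to the structural equation, and the t-test on their coefficient tests the null that the regressor is exogenous. A minimal sketch on simulated data follows; the names and the data-generating process are assumptions. It also shows why the education coefficient in the augmented regression coincides with the 2SLS estimate.

```r
set.seed(2020)
n  <- 1000
z1 <- rnorm(n); z2 <- rnorm(n)   # two hypothetical instruments
w  <- runif(n, 0, 30)            # exogenous control
u  <- rnorm(n)                   # unobserved confounder
x  <- 10 + 0.4 * z1 + 0.3 * z2 + 0.05 * w + u + rnorm(n)
y  <- 0.2 + 0.08 * x + 0.04 * w - 0.0008 * w^2 + 0.5 * u + rnorm(n, sd = 0.5)
sim <- data.frame(y, x, w, z1, z2)

# First stage: keep both the fitted values and the residuals.
stage1 <- lm(x ~ w + I(w^2) + z1 + z2, data = sim)
sim$x_hat <- fitted(stage1)
sim$v_hat <- residuals(stage1)

# Augmented regression: original regressors plus the first-stage residuals.
aug <- lm(y ~ x + w + I(w^2) + v_hat, data = sim)

# Manual 2SLS second stage, for comparison of the point estimate.
stage2 <- lm(y ~ x_hat + w + I(w^2), data = sim)

b_aug  <- unname(coef(aug)["x"])
b_2sls <- unname(coef(stage2)["x_hat"])
all.equal(b_aug, b_2sls)   # the two coefficients coincide

# The t-test on v_hat is the endogeneity test (H0: x is exogenous).
summary(aug)$coefficients["v_hat", ]
```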
  6. Compare your estimates of the return to education from Q2 (OLS) and Q4 (IV). Which estimate is higher? Why?
model_ols$coefficients
    (Intercept)       education      experience I(experience^2) 
  -0.5220405591    0.1074896390    0.0415665105   -0.0008111931 
model_iv3$coefficients
    (Intercept)       education      experience I(experience^2) 
   0.0481003046    0.0613966279    0.0441703943   -0.0008989696