MOOC Econometrics

Test Exercise 4

Notes:

• See website for how to submit your answers and how feedback is organized • This exercise uses the datafile TestExer4 Wage and requires a computer. • The dataset TestExer4 Wage is available on the website.

Goals and skills being used:

• Obtain insight in consequences of endogeneity • Practice with identifying causes of endogeneity • Practice with identifying valid instruments • Obtain hands-on experience with applying 2SLS and the Sargan test

Questions

A challenging and very relevant economic problem is the measurement of the returns to schooling. In this question we will use the following variables on 3010 US men:

• logw: log wage • educ: number of years of schooling • age: age of the individual in years • exper: working experience in years • smsa: dummy indicating whether the individual lived in a metropolitan area • south: dummy indicating whether the individual lived in the south • nearc: dummy indicating whether the individual lived near a 4-year college • dadeduc: education of the individual’s father (in years) • momeduc: education of the individual’s mother (in years)

This data is a selection of the data used by D. Card (1995)1

(a) Use OLS to estimate the parameters of the model logw = β1 + β2educ + β3exper + β4exper2 + β5smsa + β6south + ε.Give an interpretation to the estimated β2 coefficient.

First we need to load the data and then fit the model

library(ggplot2)
library(dplyr)
library(lmtest)
library(readxl)
df <- read_excel("FINALDATA4.xls", col_names = TRUE) 
head(df)
## # A tibble: 6 x 9
##    logw  educ   age exper  smsa south nearc daded momed
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  6.31     7    29    16     1     0     0  9.94  10.2
## 2  6.18    12    27     9     1     0     0  8      8  
## 3  6.58    12    34    16     1     0     0 14     12  
## 4  5.52    11    27    10     1     0     1 11     12  
## 5  6.59    12    34    16     1     0     1  8      7  
## 6  6.21    12    26     8     1     0     1  9     12

We need to add the squared variable:

df$nbr <- 1:nrow(df)
df$exper2 <- df$exper^2
df$age2 <- df$age^2
str(df)
## tibble [3,010 x 12] (S3: tbl_df/tbl/data.frame)
##  $ logw  : num [1:3010] 6.31 6.18 6.58 5.52 6.59 ...
##  $ educ  : num [1:3010] 7 12 12 11 12 12 18 14 12 12 ...
##  $ age   : num [1:3010] 29 27 34 27 34 26 33 29 28 29 ...
##  $ exper : num [1:3010] 16 9 16 10 16 8 9 9 10 11 ...
##  $ smsa  : num [1:3010] 1 1 1 1 1 1 1 1 1 1 ...
##  $ south : num [1:3010] 0 0 0 0 0 0 0 0 0 0 ...
##  $ nearc : num [1:3010] 0 0 0 1 1 1 1 1 1 1 ...
##  $ daded : num [1:3010] 9.94 8 14 11 8 9 14 14 12 12 ...
##  $ momed : num [1:3010] 10.2 8 12 12 7 ...
##  $ nbr   : int [1:3010] 1 2 3 4 5 6 7 8 9 10 ...
##  $ exper2: num [1:3010] 256 81 256 100 256 64 81 81 100 121 ...
##  $ age2  : num [1:3010] 841 729 1156 729 1156 ...
df %>% ggplot(aes(nbr, logw)) + geom_point()

So after that, we already continue with fitting the model:

fit <- lm(logw~ educ + exper + exper2+smsa+south, data= df)
summary(fit)

Call:
lm(formula = logw ~ educ + exper + exper2 + smsa + south, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.71487 -0.22987  0.02268  0.24898  1.38552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.6110144  0.0678950  67.914  < 2e-16 ***
educ         0.0815797  0.0034990  23.315  < 2e-16 ***
exper        0.0838357  0.0067735  12.377  < 2e-16 ***
exper2      -0.0022021  0.0003238  -6.800 1.26e-11 ***
smsa         0.1508006  0.0158360   9.523  < 2e-16 ***
south       -0.1751761  0.0146486 -11.959  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3813 on 3004 degrees of freedom
Multiple R-squared:  0.2632,    Adjusted R-squared:  0.2619 
F-statistic: 214.6 on 5 and 3004 DF,  p-value: < 2.2e-16

So the interpretation for B2 :

  • The 0.082 educ coefficient means that with each additional year of schooling, the log wage increases by about 0.082. Therefore, with each additional year of schooling the wage increases by about exp(0.082), or by 1.085 or ~8.5%.

(b) OLS may be inconsistent in this case as educ and exper may be endogenous. Give a reason why this may be the case. Also indicate whether the estimate in part (a) is still useful.

#Rpta.

# Experience and education are variables that are not enough to fit the model. They need to be integrated by data on family background, wealth and social class.

#The OLS model is still useful, as indicated by p-values.

#As we say It is possible the wage, experience and education variables to be affected by some other variable (i.e. ability, social class, family support, etc.) in a way, such as, a higher ability to lead to a higher wage, longer education and less experience (due to long education as mixed effect) and vice versa.

#In this case, these variables would be endogenous and the OLS estimates would be biased and inconsistent, therefore not useful anymore, because they are correlated with the errors.

resid_fitted <- (fit$residuals)
df$Resid_fit <- resid_fitted
head(df)
## # A tibble: 6 x 13
##    logw  educ   age exper  smsa south nearc daded momed   nbr exper2  age2
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>  <dbl> <dbl>
## 1  6.31     7    29    16     1     0     0  9.94  10.2     1    256   841
## 2  6.18    12    27     9     1     0     0  8      8       2     81   729
## 3  6.58    12    34    16     1     0     0 14     12       3    256  1156
## 4  5.52    11    27    10     1     0     1 11     12       4    100   729
## 5  6.59    12    34    16     1     0     1  8      7       5    256  1156
## 6  6.21    12    26     8     1     0     1  9     12       6     64   676
## # ... with 1 more variable: Resid_fit <dbl>
df %>% ggplot(aes(nbr, Resid_fit)) + geom_point()

(c) Give a motivation why age and age2 can be used as instruments for exper and exper2

# Rpta.

# Age is exogenous as it cannot be influenced by the people, and it is also obviously related to experience as younger people cannot have a very long experience. Age can explain how many years of working experience a man has. On the other hand it should not influence wage, so it qualifies as a potential instrument of experience.

# So it’s a good instrument for the experience variable. And the same applies for their squared values.

df %>% ggplot(aes(exper2, logw)) + geom_point()

df %>% ggplot(aes(age2,logw)) + geom_point()

(d) Run the first-stage regression for educ for the two-stage least squares estimation of the parameters in the model above when age, age2, nearc, dadeduc, and momeduc are used as additional instruments. What do you conclude about the suitability of these instruments for schooling?

# Rpta.

fit2 <-lm(formula = educ ~ age + age2 + smsa + south + nearc + daded + momed, data = df)
summary(fit2)

Call:
lm(formula = educ ~ age + age2 + smsa + south + nearc + daded + 
    momed, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2777  -1.5450  -0.2224   1.6957   7.2250 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.652354   3.976343  -1.421 0.155277    
age          0.989610   0.278714   3.551 0.000390 ***
age2        -0.017019   0.004838  -3.518 0.000441 ***
smsa         0.529566   0.101504   5.217 1.94e-07 ***
south       -0.424851   0.091037  -4.667 3.19e-06 ***
nearc        0.264554   0.099085   2.670 0.007626 ** 
daded        0.190443   0.015611  12.199  < 2e-16 ***
momed        0.234515   0.017028  13.773  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.326 on 3002 degrees of freedom
Multiple R-squared:  0.2466,    Adjusted R-squared:  0.2448 
F-statistic: 140.4 on 7 and 3002 DF,  p-value: < 2.2e-16

A shown by the model summary, the additional instruments are significant regressors of education. The additional instruments (age, age2, nearc, daded, and momed) are significantly correlated with the education. This is especially true about the later two (daded and momed, education of father and mother) due to their high t-statistics, which makes perfect sense as highly educated parents are more likely to support and promote their children education as well.

Model characteristics: R2 = 0.2466

So, the instrument variables and the endogenous variable educ are significantly related.

Now fitting values for education:

educ_f<- fitted.values(fit2)
df$educ_fitted <- educ_f
summary(df$educ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   12.00   13.00   13.26   16.00   18.00
summary(df$educ_fitted)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.87   12.53   13.39   13.26   14.14   17.10
with(df, plot(nbr, educ, type='l', col='blue', main="Education", xlab='# number'))

with(df, plot(nbr, educ_fitted, type='l', col='red'))
legend("bottomleft", c("Actual","Fitted"), horiz=TRUE, 
       lty=c(1,1), lwd=c(2,2), col=c("blue","red"), bg="grey")

(e) Estimate the parameters of the model for log wage using two-stage least squares where you correct for the endogeneity of education and experience. Compare your result to the estimate in part (a).

library(AER)
fit3 <- ivreg(formula = logw ~ educ + exper + exper2 + smsa + south |  age + age2 + smsa + south + nearc + daded + momed, data = df)

summary(fit3)

Call:
ivreg(formula = logw ~ educ + exper + exper2 + smsa + south | 
    age + age2 + smsa + south + nearc + daded + momed, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7494 -0.2360  0.0266  0.2498  1.3468 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.4169039  0.1154208  38.268  < 2e-16 ***
educ         0.0998429  0.0065738  15.188  < 2e-16 ***
exper        0.0728669  0.0167134   4.360 1.35e-05 ***
exper2      -0.0016393  0.0008381  -1.956   0.0506 .  
smsa         0.1349370  0.0167695   8.047 1.21e-15 ***
south       -0.1589869  0.0156854 -10.136  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3844 on 3004 degrees of freedom
Multiple R-Squared: 0.2512, Adjusted R-squared: 0.2499 
Wald test: 175.9 on 5 and 3004 DF,  p-value: < 2.2e-16 
logw_f<- fitted.values(fit3)

df$logw_fitted <- educ_f
summary(df$logw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.605   5.977   6.287   6.262   6.564   7.785
summary(df$logw_fitted)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.87   12.53   13.39   13.26   14.14   17.10
with(df, plot(nbr, logw, type='l', col='gray', main="Log Wage", xlab='# number'))

with(df, plot(nbr, logw_fitted, type='l', col='green'))
legend("bottomleft", c("Actual","Fitted"), horiz=TRUE, 
       lty=c(1,1), lwd=c(2,2), col=c("gray","green"), bg="white")

(f) Perform the Sargan test for validity of the instruments. What is your conclusion?

final_r <- residuals.lm(fit3)
df$final_residual <- final_r

fres <-lm(formula = final_residual ~ smsa + south + age + age2 + nearc + daded + momed, data = df)
sargan.tstat = nrow(df) * summary(fres)$r.squared
sargan.tstat 
## [1] 3.702389
residualPlot(fres)