Lab 7: Exercise

E7.2

In the empirical exercises on earnings and height in Chapters 4 and 5, you estimated a relatively large and statistically significant effect of a worker’s height on his or her earnings. One explanation for this result is omitted variable bias: Height is correlated with an omitted factor that affects earnings. For example, Case and Paxson (2008) suggest that cognitive ability (or intelligence) is the omitted factor. The mechanism they describe is straightforward: Poor nutrition and other harmful environmental factors in utero and in early childhood have, on average, deleterious effects on both cognitive and physical development. Cognitive ability affects earnings later in life and thus is an omitted variable in the regression.

Start the project by clearing the workspace. Then load the R packages car, lmtest, sandwich, and openxlsx, and read in the data Earnings_and_Height.

rm(list=ls()) 
library(car)

## Loading required package: carData

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(sandwich)
library(openxlsx)

## Warning: package 'openxlsx' was built under R version 4.3.3

# Google Drive file ID of the Earnings_and_Height workbook
id <- "1XKjDOQBJcxwslhwipkJAF2qLNmFW9Bfu"
earn <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download", id),
                  sheet = 1, startRow = 1, colNames = TRUE, rowNames = FALSE)
str(earn)

## 'data.frame':    17870 obs. of  11 variables:
##  $ sex       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ age       : num  48 41 26 37 35 25 29 44 50 38 ...
##  $ mrd       : num  1 6 1 1 6 6 1 4 6 1 ...
##  $ educ      : num  13 12 16 16 16 15 16 18 14 12 ...
##  $ cworker   : num  1 1 1 1 1 1 1 3 2 4 ...
##  $ region    : num  3 2 1 2 1 4 2 4 3 3 ...
##  $ race      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ earnings  : num  84055 14021 84055 84055 28560 ...
##  $ height    : num  65 65 60 67 68 63 67 65 67 66 ...
##  $ weight    : num  133 155 108 150 180 101 150 125 129 110 ...
##  $ occupation: num  1 1 1 1 1 1 1 1 1 1 ...

(a) Suppose that the mechanism described above is correct. Explain how this leads to omitted variable bias in the OLS regression of Earnings on Height. Does the bias lead the estimated slope to be too large or too small? [Hint: Review Equation (6.1).]

[Ans] From Key Concept 6.1, omitted variable bias arises if (1) $X$ is correlated with the omitted variable and (2) the omitted variable is a determinant of the dependent variable $Y$. When both conditions hold, the omitted variable is part of the error term $u$, so $X$ is correlated with $u$ (i.e., $\rho_{Xu} \neq 0$).

The mechanism described in the problem explains why height ($X$) and cognitive ability (the omitted variable) are correlated and why cognitive ability is a determinant of earnings ($Y$). The mechanism suggests that height and cognitive ability are positively correlated and that cognitive ability has a positive effect on earnings. Thus, $X$ is positively correlated with the error term $u$ in the regression of $Earnings$ on $Height$ ($\rho_{Xu} > 0$), which leads to a positive bias in the estimated slope: the OLS coefficient on $Height$ is too large.
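
The direction of the bias can be read off the omitted variable bias formula that the hint refers to. As a reminder, here is a sketch of that standard result for the regression of $Earnings$ on $Height$, where $\beta_1$ is the true slope, $u$ is the error term that absorbs cognitive ability, and $\sigma_u$ and $\sigma_X$ are the standard deviations of $u$ and $X$:

$$\hat{\beta}_1 \xrightarrow{\;p\;} \beta_1 + \rho_{Xu}\,\frac{\sigma_u}{\sigma_X}$$

Because the mechanism implies $\rho_{Xu} > 0$ and both standard deviations are positive, the second term is positive, so $\hat{\beta}_1$ tends to overstate $\beta_1$ even in large samples.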

If the mechanism described above is correct, the estimated effect of height on earnings should disappear if a variable measuring cognitive ability is included in the regression. Unfortunately, there isn’t a direct measure of cognitive ability in the data set, but the data set does include years of education for each individual. Because students with higher cognitive ability are more likely to attend school longer, years of education might serve as a control variable for cognitive ability; in this case, including education in the regression will eliminate, or at least attenuate, the omitted variable bias problem.
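
The logic of using a proxy as a control variable can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names mirror the exercise, but every parameter (the effect of ability, the noise levels, the sample size) is an assumption chosen for illustration, and height is given no direct effect on earnings at all.

```r
# Simulated illustration only: every number below is made up,
# not estimated from Earnings_and_Height.
set.seed(1)
n <- 20000
ability  <- rnorm(n)                                       # unobserved cognitive ability
height   <- 65 + ability + rnorm(n, sd = 3)                # height rises with ability
educ     <- 13 + 2 * ability + rnorm(n, sd = 1.5)          # noisy proxy for ability
earnings <- 30000 + 5000 * ability + rnorm(n, sd = 10000)  # height has NO direct effect

b_short <- coef(lm(earnings ~ height))["height"]            # ability omitted
b_proxy <- coef(lm(earnings ~ height + educ))["height"]     # proxy control added
b_full  <- coef(lm(earnings ~ height + ability))["height"]  # infeasible benchmark

round(c(short = unname(b_short), proxy = unname(b_proxy), full = unname(b_full)), 1)
```

Under these assumed parameters the population slope on height is 500 with no controls and roughly 190 once the proxy is included, while the true effect is 0. The proxy attenuates the bias but does not remove it, because education measures ability with error.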

Use the years of education variable ($educ$) to construct four indicator variables for whether a worker has less than a high school diploma ($LT\_HS = 1$ if $educ < 12$, $=0$ otherwise), a high school diploma ($HS = 1$ if $educ = 12$, $=0$ otherwise), some college ($Some\_Col = 1$ if $12 < educ < 16$, $=0$ otherwise), or a bachelor’s degree or higher ($College = 1$ if $educ \geq 16$, $=0$ otherwise).

We first create the dummy variables for education based on years of education:

# One indicator per education category; as.numeric() turns TRUE/FALSE into 1/0
earn$lt_hs    <- as.numeric(earn$educ < 12)
earn$hs       <- as.numeric(earn$educ == 12)
earn$some_col <- as.numeric(earn$educ > 12 & earn$educ < 16)
earn$college  <- as.numeric(earn$educ >= 16)
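
A quick sanity check is that the four indicators partition the sample, so that each worker falls in exactly one category. The sketch below shows the idea on a toy vector of education values; the vector `educ` and the data frame `dummies` are hypothetical and stand in for the columns of `earn`.

```r
# Toy education values (hypothetical), chosen to hit every category boundary.
educ <- c(8, 11, 12, 13, 15, 16, 18)

dummies <- data.frame(
  lt_hs    = as.numeric(educ < 12),
  hs       = as.numeric(educ == 12),
  some_col = as.numeric(educ > 12 & educ < 16),
  college  = as.numeric(educ >= 16)
)

# Each row should have exactly one indicator equal to 1.
stopifnot(all(rowSums(dummies) == 1))

colSums(dummies)  # number of observations in each category
```

The same `rowSums()` check could be applied to the four dummy columns of `earn`, assuming `educ` has no missing values.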

(b) Focusing first on women only, run a regression of (1) $Earnings$ on $Height$ and (2) $Earnings$ on $Height$, including $LT\_HS$, $HS$, and $Some\_Col$ as control variables.

earn.female <- subset(earn, sex==0)
fit.1 <- lm(earnings~height, data=earn.female)
summary(fit.1)

## 
## Call:
## lm(formula = earnings ~ height, data = earn.female)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42748 -22006  -7466  36641  46865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12650.9     6383.7   1.982   0.0475 *  
## height         511.2       98.9   5.169  2.4e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26800 on 9972 degrees of freedom
## Multiple R-squared:  0.002672,   Adjusted R-squared:  0.002572 
## F-statistic: 26.72 on 1 and 9972 DF,  p-value: 2.396e-07

fit.2 <- lm(earnings~height+lt_hs+hs+some_col, data=earn.female)
summary(fit.2)

## 
## Call:
## lm(formula = earnings ~ height + lt_hs + hs + some_col, data = earn.female)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54537 -19082  -5808  24386  58676 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50749.52    6013.68   8.439   <2e-16 ***
## height         135.14      92.55   1.460    0.144    
## lt_hs       -31857.81     963.77 -33.055   <2e-16 ***
## hs          -20417.89     626.18 -32.607   <2e-16 ***
## some_col    -12649.07     685.30 -18.458   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24920 on 9969 degrees of freedom
## Multiple R-squared:  0.1382, Adjusted R-squared:  0.1378 
## F-statistic: 399.6 on 4 and 9969 DF,  p-value: < 2.2e-16

Compare the estimated coefficient on Height in regressions (1) and (2). Is there a large change in the coefficient? Has it changed in a way consistent with the cognitive ability explanation? Explain.

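[Ans] The estimated coefficient on height falls by approximately 74%, from 511 to 135, when the education variables are added as control variables in the regression, and in the summary output above it is no longer statistically significant at the 5% level (t = 1.46, p = 0.144). This is consistent with positive omitted variable bias in (1).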

The regression omits the control variable $College$. Why?

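[Ans] Every worker falls in exactly one education category, so $LT\_HS + HS + Some\_Col + College = 1$ for every observation. $College$ is therefore perfectly collinear with the other education regressors and the constant regressor (the dummy variable trap), and one of the four indicators must be left out.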
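
The consequence of including all four indicators together with an intercept can be seen in a short sketch. The data here are simulated (the outcome `y` is pure noise and `educ` is drawn at random), so only the collinearity pattern matters, not the estimates.

```r
# Simulated data (hypothetical): only the collinearity pattern matters here.
set.seed(2)
educ <- sample(8:18, 200, replace = TRUE)  # made-up years of education
y    <- rnorm(200)                         # placeholder outcome

lt_hs    <- as.numeric(educ < 12)
hs       <- as.numeric(educ == 12)
some_col <- as.numeric(educ > 12 & educ < 16)
college  <- as.numeric(educ >= 16)

# The four dummies always sum to the constant regressor.
stopifnot(all(lt_hs + hs + some_col + college == 1))

# lm() cannot separate the last collinear regressor and reports NA for it.
fit_all <- lm(y ~ lt_hs + hs + some_col + college)
coef(fit_all)
```

R does not raise an error in this situation; it silently drops the redundant regressor, which is why the coefficient on `college` comes back as `NA`.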

Test the joint null hypothesis that the coefficients on the education variables are equal to 0.

[Ans] Using heteroskedasticity-robust (HC3) standard errors, the F-statistic is $577$ with a $p$-value that is essentially 0. Therefore, we reject the null hypothesis at the 5% significance level and conclude that at least one of the coefficients on the education dummies is nonzero.

## F-test using robust SE ##
linearHypothesis(fit.2, c("lt_hs=0", "hs=0", "some_col=0"), white.adjust=c("hc3"))
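
The same kind of robust joint test can also be run with the lmtest and sandwich packages loaded earlier, by comparing a restricted and an unrestricted model. The sketch below uses simulated stand-in data rather than `earn`; the object names (`sim`, `fit_r`, `fit_u`) and all parameters are assumptions made for illustration.

```r
library(lmtest)    # waldtest()
library(sandwich)  # vcovHC()

# Simulated stand-in data (all parameters hypothetical), not Earnings_and_Height.
set.seed(3)
n    <- 2000
educ <- sample(8:18, n, replace = TRUE)
sim <- data.frame(
  height   = rnorm(n, mean = 65, sd = 3),
  lt_hs    = as.numeric(educ < 12),
  hs       = as.numeric(educ == 12),
  some_col = as.numeric(educ > 12 & educ < 16)
)
# Error spread grows with education, so the errors are heteroskedastic.
sim$earnings <- 50000 - 30000 * sim$lt_hs - 20000 * sim$hs -
  12000 * sim$some_col + rnorm(n, sd = 1500 * educ)

fit_r <- lm(earnings ~ height, data = sim)                          # restricted
fit_u <- lm(earnings ~ height + lt_hs + hs + some_col, data = sim)  # unrestricted

# Wald F-test of the three exclusion restrictions with an HC3 covariance matrix.
wt <- waldtest(fit_r, fit_u,
               vcov = function(m) vcovHC(m, type = "HC3"), test = "F")
wt
```

Applied to the real data, the restricted and unrestricted models would be the two regressions (1) and (2), and the statistic should be close to the one reported by `linearHypothesis()` with the same HC3 adjustment.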

Discuss the values of the estimated coefficients on $LT\_HS$, $HS$, and $Some\_Col$. (Each of the estimated coefficients is negative, and the coefficient on $LT\_HS$ is more negative than the coefficient on $HS$, which in turn is more negative than the coefficient on $Some\_Col$. Why? What do the coefficients measure?)

[Ans] The coefficients measure the difference in average earnings relative to the omitted category, which is $College$, holding height constant. Thus, the estimated coefficient on the “Less than High School” regressor implies that workers with less than a high school education on average earn $\$31,858$ less per year than college graduates of the same height; workers with a high school education on average earn $\$20,418$ less per year than college graduates; and workers with some college on average earn $\$12,649$ less per year than college graduates. The coefficients become less negative as education rises because each successive group is closer to the college-graduate benchmark.
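
Writing out the fitted regression (2) for women, using the estimates in the summary output above, makes the comparison with the omitted category explicit:

$$\widehat{Earnings} = 50{,}749.52 + 135.14\,Height - 31{,}857.81\,LT\_HS - 20{,}417.89\,HS - 12{,}649.07\,Some\_Col$$

For a college graduate all three indicators equal 0, so each coefficient is a gap relative to that group. Gaps between two included categories follow by subtraction. For example, for two women of the same height, the predicted difference between a high school graduate and a woman with some college is

$$-20{,}417.89 - (-12{,}649.07) = -7{,}768.82.$$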

(c) Repeat (b), using data for men.

earn.male <- subset(earn, sex==1)
fit.1.male <- lm(earnings~height, data=earn.male)
summary(fit.1.male)

## 
## Call:
## lm(formula = earnings ~ height, data = earn.male)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50158 -22373  -8118  33091  59228 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43130.3     7068.5  -6.102  1.1e-09 ***
## height        1306.9      100.8  12.969  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26670 on 7894 degrees of freedom
## Multiple R-squared:  0.02086,    Adjusted R-squared:  0.02074 
## F-statistic: 168.2 on 1 and 7894 DF,  p-value: < 2.2e-16

fit.2.male <- lm(earnings~height+lt_hs+hs+some_col, data=earn.male)
summary(fit.2.male)

## 
## Call:
## lm(formula = earnings ~ height + lt_hs + hs + some_col, data = earn.male)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58009 -19026  -5227  21320  60912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9862.7     6697.0   1.473    0.141    
## height         744.7       94.7   7.864 4.22e-15 ***
## lt_hs       -31400.5      978.4 -32.092  < 2e-16 ***
## hs          -20345.8      692.4 -29.383  < 2e-16 ***
## some_col    -12610.9      758.1 -16.636  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24620 on 7891 degrees of freedom
## Multiple R-squared:  0.1658, Adjusted R-squared:  0.1654 
## F-statistic:   392 on 4 and 7891 DF,  p-value: < 2.2e-16

Compare the estimated coefficient on Height in regressions (1) and (2). Is there a large change in the coefficient? Has it changed in a way consistent with the cognitive ability explanation? Explain.

[Ans] The estimated coefficient on height falls by approximately $43\%$, from $1307$ to $745$, when the education variables are added, although it remains statistically significant. This is consistent with positive omitted variable bias in the simple regression (1).
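
The size of the change in the height coefficient can be computed directly from the reported estimates. The helper `pct_drop()` below is a hypothetical convenience function, and the inputs are the coefficients on height copied from the four summary outputs above.

```r
# Percentage fall in a coefficient when controls are added (hypothetical helper).
pct_drop <- function(before, after) 100 * (before - after) / before

women <- pct_drop(511.2, 135.14)   # women: regression (1) vs. regression (2)
men   <- pct_drop(1306.9, 744.7)   # men:   regression (1) vs. regression (2)

round(c(women = women, men = men), 1)
```

The coefficient falls by about 74% for women and about 43% for men, so controlling for education removes most of the estimated height effect for women but well under half of it for men.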

The regression omits the control variable $College$. Why?

[Ans] Same answer as (b): $College$ is perfectly collinear with the other education regressors and the constant regressor.

Test the joint null hypothesis that the coefficients on the education variables are equal to 0.

[Ans] Using heteroskedasticity-robust (HC3) standard errors, the F-statistic is $500$ with a $p$-value that is essentially 0. Therefore, we reject the null hypothesis at the 5% significance level and conclude that at least one of the coefficients on the education dummies is nonzero.

## F-test using robust SE ##
linearHypothesis(fit.2.male, c("lt_hs=0", "hs=0", "some_col=0"), white.adjust=c("hc3"))

Discuss the values of the estimated coefficients on $LT\_HS$, $HS$, and $Some\_Col$. (Each of the estimated coefficients is negative, and the coefficient on $LT\_HS$ is more negative than the coefficient on $HS$, which in turn is more negative than the coefficient on $Some\_Col$. Why? What do the coefficients measure?)

[Ans] The coefficients measure the difference in average earnings relative to the omitted category, which is $College$, holding height constant. Thus, the estimated coefficient on the “Less than High School” regressor implies that workers with less than a high school education on average earn $\$31,401$ less per year than college graduates of the same height; workers with a high school education on average earn $\$20,346$ less per year than college graduates; and workers with some college on average earn $\$12,611$ less per year than college graduates. These gaps are very close to the ones estimated for women.