In this lab exercise, you will learn:
The data set used for this exercise is Birthweight_Smoking.xlsx from E5.3 of Stock and Watson (2020, e4). Birthweight_Smoking.xlsx is from the 1989 linked National Natality-Mortality Detail files, which contains a census of infant births and deaths. This data set was provided by Professors Douglas Almond, Kenneth Chay, and David Lee, and is a subset of the data used in their paper “The Costs of Low Birth Weight,” Quarterly Journal of Economics, August 2005, 120(3): 1031-1083. A detailed description is given in Birthweight_Smoking_Description.pdf, available in LMS.
rm(list=ls())
Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).
#install.packages("openxlsx") # install R package "openxlsx"
#install.packages("stargazer") # install R package "stargazer"
library(openxlsx) # load package; to read xlsx file
## Warning: package 'openxlsx' was built under R version 4.3.3
library(stargazer) # load package; to put regression results into a single stargazer table
id <- "1IL42szr5_GLat_hqY30yJVV_JVHxHEmO"
bw <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),
sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(bw)
## 'data.frame': 3000 obs. of 12 variables:
## $ nprevist : num 12 5 12 13 9 11 12 10 13 10 ...
## $ alcohol : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre1 : num 1 0 1 1 1 1 1 1 1 1 ...
## $ tripre2 : num 0 1 0 0 0 0 0 0 0 0 ...
## $ tripre3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre0 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ birthweight: num 4253 3459 2920 2600 3742 ...
## $ smoker : num 1 0 1 0 0 0 1 0 0 0 ...
## $ unmarried : num 1 0 0 0 0 0 0 0 0 0 ...
## $ educ : num 12 16 11 17 13 16 14 13 17 14 ...
## $ age : num 27 24 23 28 27 33 24 38 29 28 ...
## $ drinks : num 0 0 0 0 0 0 0 0 0 0 ...
Description of main variables:
summary(bw$birthweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 425 3062 3420 3383 3750 5755
sd(bw$birthweight) # standard deviation of birthweight
## [1] 592.1629
plot(density(bw$birthweight)) #plot the empirical density function of birthweight
abline(v=mean(bw$birthweight), col="red") #add a line indicating the mean of birthweight
Question-1: What is the average birth weight for smoker group? How about the non-smoker group? What’s the difference in means between smokers and non-smokers?
Question-2: How to test if the difference in means between smokers and non-smokers is significantly different from zero?
Construct a scatterplot of birth weight (\(birthweight\)) on the dummy variable for smoking (\(smoker\)).
plot(x=bw$smoker, y=bw$birthweight, main="Infant Birthweight and Smoking")
To identify the relationship between birth weight and smoking, you estimate the following linear regression model: \[birthweight_i = \beta_0 + \beta_1 \cdot smoker_i + u_i,\] where \(i=1,...,3000\). The dummy variable, smoker, has only two values: 1 and 0.
fit.sk <- lm(birthweight~smoker, data=bw)
summary(fit.sk)
##
## Call:
## lm(formula = birthweight ~ smoker, data = bw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3007.06 -313.06 26.94 366.94 2322.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3432.06 11.87 289.115 <2e-16 ***
## smoker -253.23 26.95 -9.396 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared: 0.0286, Adjusted R-squared: 0.02828
## F-statistic: 88.28 on 1 and 2998 DF, p-value: < 2.2e-16
How to interpret \(\hat\beta_0\) and \(\hat\beta_1\)?
Hint:
You might indicate the average of birthweight for each subgroup in the plot as follows:
plot(x=bw$smoker, y=bw$birthweight, main="Infant Birthweight and Smoking")
bw.sk <- subset(bw, smoker==1) # get the subset of data for smoker
bw.nsk <- subset(bw, smoker==0) # get the subset of data for non-smoker
m.bw.sk <- mean(bw.sk$birthweight) # average birth weight for smoker
m.bw.nsk <- mean(bw.nsk$birthweight) # average birth weight for non-smoker
points(x=c(1), y=c(m.bw.sk), pch=17, col="red")
points(x=c(0), y=c(m.bw.nsk), pch=17, col="blue")
abline(h=m.bw.nsk, col="blue")
abline(h=m.bw.sk, col="red")
Question-3: Create a dummy variable for non-smokers. How will the estimated coefficients change if you regress birth weight on non-smokers?
The dummy variable for non-smokers is defined as follows: \[nsmoker_i = \begin{cases} 1, & \text{if individual i is a non-smoker} \\ 0, & otherwise \end{cases}\]
confint(fit.sk)
## 2.5 % 97.5 %
## (Intercept) 3408.7840 3455.3359
## smoker -306.0736 -200.3831
confint(fit.sk, "smoker")
## 2.5 % 97.5 %
## smoker -306.0736 -200.3831
confint(fit.sk, "smoker", level=0.9)
## 5 % 95 %
## smoker -297.5733 -208.8834
Using function lm, we get the results for testing the significance of the coefficients, including standard errors, t-statistic, and p-value. For a one-sided significance test (\(H_a: \beta>0\) or \(H_a: \beta<0\)), the \(t\)-statistic is unchanged, but the \(p\)-value will be different.
Note: The \(p\)-value of a one-sided test is not always a half of the p-value for a two-sided test. (Why?)
Example: If \(t=1.2\), what is the p-value for a lower-tailed test? What about the p-value for a two-sided test?
pnorm(1.2) # p-value for a lower-tailed test
## [1] 0.8849303
2*pnorm(-1.2) # p-value for a two-sided test
## [1] 0.2301393
In this section, we will practice how to conduct hypothesis tests for a non-zero valued null hypothesis.
summary(fit.sk)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3432.0600 11.87090 289.115391 0.000000e+00
## smoker -253.2284 26.95149 -9.395709 1.090663e-20
## Obtain estimate for beta_1 and SE for this estimate
b.sk <- summary(fit.sk)$coefficients[2,1]
se.sk <- summary(fit.sk)$coefficients[2,2]
## Compute t-statistic
t.sk <- (b.sk+250)/se.sk
print(t.sk)
## [1] -0.1197838
## Find p-value
pnorm(t.sk)
## [1] 0.4523272
Conclusion: Since the p-value is much greater than \(0.05\), we do not reject the null hypothesis at 5% significance level.
Think about the economic value of education: if there were no expected economic value-added to receiving university education, you probably would not be reading this script right now. A starting point to empirically verify such a relation is to have data on working individuals. More precisely, we need data on wages and education of workers in order to estimate a model like \[wage_i = \beta_0 + \beta_1 \cdot education_i + u_i.\] What can be presumed about this relation? It is likely that, on average, higher educated workers earn more than workers with less education, so we expect to estimate an upward sloping regression line. Also, it seems plausible that earnings of better educated workers have a higher dispersion than those of low-skilled workers: solid education is not a guarantee for a high salary so even highly qualified workers take on low-income jobs. However, they are more likely to meet the requirements for the well-paid jobs than workers with less education for whom opportunities in the labor market are much more limited.
To verify this empirically we may use real data on hourly earnings and the number of years of education of employees. Such data can be found in CPSSWEducation. This data set is part of the package AER and comes from the Current Population Survey (CPS) which is conducted periodically by the Bureau of Labor Statistics in the United States.
rm(list=ls())
Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).
#install.packages("AER") # install R package "AER"
#install.packages("sandwich") # install R package "sandwich"
#install.packages("lmtest") # install R package "lmtest"
library(AER) # load package; to get data "CPSSWEducation"
library(sandwich) # load package; to compute heteroskedasticity robust standard errors
library(lmtest) # load package; to conduct hypothesis test using robust SE
data("CPSSWEducation")
attach(CPSSWEducation)
str(CPSSWEducation)
## 'data.frame': 2950 obs. of 4 variables:
## $ age : int 30 30 30 30 30 30 30 29 29 30 ...
## $ gender : Factor w/ 2 levels "female","male": 2 1 1 1 1 1 2 2 2 1 ...
## $ earnings : num 34.6 19.2 13.7 13.9 19.2 ...
## $ education: int 16 16 12 13 16 12 12 16 16 12 ...
Description of variables:
Find mean, median, and standard deviation.
summary(earnings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.137 10.577 14.615 16.743 20.192 97.500
sd(earnings)
## [1] 9.402076
summary(education)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 12.00 13.00 13.55 16.00 18.00
sd(education)
## [1] 2.3141
Construct a scatterplot.
plot(education, earnings, main="Return to Education")
Hypothesis B: Solid education is not a guarantee for a high salary so even highly qualified workers take on low-income jobs.
hat_u <- fit.edu$residuals
plot(x=education, y=hat_u, main="The regression residuals vs. education")
The easiest way to test for heteroskedasticity is to get a good look at your data. Ideally, you want your data to all follow a pattern of a line (the dispersion of data does not change across \(X\)), but sometimes it doesn’t. The quickest way to identify heteroskedastic data is to see the shape of the plotted data. For example, if the data generally is cone shaped, heteroskedasticity is likely.
The plot indicates that there is heteroskedasticity: if we assume the regression line to be a reasonably good representation of the conditional mean function \(E(earnings_i | education_i)\), the dispersion of hourly earnings around that function clearly increases with the level of education, i.e., the variance of the distribution of earnings increases. In other words: the variance of the errors (the errors made in explaining earnings by education) increases with education so that the regression errors are heteroskedastic.
This example makes a case that the assumption of homoskedasticity is doubtful in economic applications. Should we care about heteroskedasticity? Yes, we should. This is because heteroskedasticity can have serious negative consequences in hypothesis testing, if we ignore it.
The heteroskedasticity robust SEs for estimated coefficients are often referred to as the Eicker-Huber-White standard errors, the most frequently cited paper on this is White (1980). In R, vcovHC computes the variance-covariance matrix of coefficient estimates. We are interested in the square root of the diagonal elements of this matrix, i.e., the standard error estimates.
The variance-covariance matrix of coefficient estimates is defined as: \[ var\left(\begin{array}{c} \hat\beta_0 \\ \hat\beta_1 \end{array}\right) =\left(\begin{array}{cc} var(\hat\beta_0) & cov(\hat\beta_0, \hat\beta_1) \\ cov(\hat\beta_0, \hat\beta_1) & var(\hat\beta_1) \end{array}\right) \] So vcovHC gives us an estimate of this matrix. We are mostly interested in the diagonal elements of the estimated matrix.
vcov.fit <- vcovHC(fit.edu, type = "HC3") # variance-covariance matrix for beta_0 and beta_1
print(vcov.fit)
## (Intercept) education
## (Intercept) 0.85894706 -0.065865676
## education -0.06586568 0.005185651
robust_se <- sqrt(diag(vcov.fit)) # robust SE for beta_0 and beta_1
print(robust_se)
## (Intercept) education
## 0.92679397 0.07201146
robust_se[2] # robust SE for beta_1
## education
## 0.07201146
Now assume we want to generate a coefficient summary as provided by summary but with robust standard errors of the coefficient estimators, robust \(t\)-statistics and corresponding \(p\)-values for the regression model linear_model. This can be done using coeftest from the package lmtest, see ?coeftest. Further we specify in the argument vcov that vcovHC, the Eicker-Huber-White estimate of the variance matrix we have computed before, should be used.
coeftest(fit.edu, vcov=vcovHC, type="HC3")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.134371 0.926794 -3.382 0.0007291 ***
## education 1.466925 0.072011 20.371 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How severe are the implications of using homoskedasticity-only standard errors in the presence of heteroskedasticity? The answer is: it depends. As mentioned above we face the risk of drawing wrong conclusions when conducting significance tests.
MacKinnon, James G, and Halbert White. 1985. “Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties.” Journal of Econometrics 29 (3): 305–25.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica 48 (4): pp. 817–38.