L5 (Lab) - Inference for Simple OLS Regression

In this lab exercise, you will learn:

Regression with dummy variables
- How to interpret coefficients
- How to construct confidence intervals and conduct hypothesis tests
Heteroskedasticity
- Visual inspection for heteroskedasticity
- How to obtain robust standard errors (SE): vcovHC
- How to conduct significance test using robust SE: coeftest

Exercise (Dummy Variable): Birth Weight and Smoking

The data set used for this exercise is Birthweight_Smoking.xlsx from E5.3 of Stock and Watson (2020, e4). Birthweight_Smoking.xlsx is from the 1989 linked National Natality-Mortality Detail files, which contains a census of infant births and deaths. This data set was provided by Professors Douglas Almond, Kenneth Chay, and David Lee, and is a subset of the data used in their paper “The Costs of Low Birth Weight,” Quarterly Journal of Economics, August 2005, 120(3): 1031-1083. A detailed description is given in Birthweight_Smoking_Description.pdf, available in LMS.

Clear the Workspace

rm(list=ls())

Install and Load Needed Packages

Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).

#install.packages("openxlsx")   # install R package "openxlsx"
#install.packages("stargazer")   # install R package "stargazer"
library(openxlsx)               # load package; to read xlsx file

## Warning: package 'openxlsx' was built under R version 4.3.3

library(stargazer)              # load package; to put regression results into a single stargazer table

Import the Birthweight_Smoking Data

id <- "1IL42szr5_GLat_hqY30yJVV_JVHxHEmO"
bw <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),
                 sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(bw)

## 'data.frame':    3000 obs. of  12 variables:
##  $ nprevist   : num  12 5 12 13 9 11 12 10 13 10 ...
##  $ alcohol    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre1    : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ tripre2    : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ tripre3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre0    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ birthweight: num  4253 3459 2920 2600 3742 ...
##  $ smoker     : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ unmarried  : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ educ       : num  12 16 11 17 13 16 14 13 17 14 ...
##  $ age        : num  27 24 23 28 27 33 24 38 29 28 ...
##  $ drinks     : num  0 0 0 0 0 0 0 0 0 0 ...

Description of main variables:

Birthweight and Smoking
- birthweight: birth weight of infant (in grams)
- smoker: indicator =1 if the mother smoked during pregnancy and 0, otherwise.
Mother’s Attributes
- age: age
- educ: years of educational attainment (more than 16 years coded as 17)
- unmarried: indicator =1 if mother is unmarried

Summary Statistics: birthweight

summary(bw$birthweight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     425    3062    3420    3383    3750    5755

sd(bw$birthweight)              # standard deviation of birthweight

## [1] 592.1629

plot(density(bw$birthweight))   #plot the empirical density function of birthweight
abline(v=mean(bw$birthweight), col="red")   #add a line indicating the mean of birthweight

Question-1: What is the average birth weight for smoker group? How about the non-smoker group? What’s the difference in means between smokers and non-smokers?

[Ans] To obtain the summary statistics for a subgroup of data, you might first use subset to get the subset of data, and then apply functions such as summary, sd and var. For example, find the average birth weight for smoker group and non-smoker group, respectively:

bw.sk <- subset(bw, smoker==1)    # get the subset of data for smoker
bw.nsk <- subset(bw, smoker==0)   # get the subset of data for non-smoker

m.bw.sk <- mean(bw.sk$birthweight)   # average birth weight for smoker
m.bw.nsk <- mean(bw.nsk$birthweight) # average birth weight for non-smoker
print(m.bw.sk)

## [1] 3178.832

print(m.bw.nsk)

## [1] 3432.06

m.bw.nsk - m.bw.sk                   # the difference in means

## [1] 253.2284

Question-2: How to test if the difference in means between smokers and non-smokers is significantly different from zero?

[Ans] First find the standard errors for m.bw.sk and m.bw.nsk. Then, compute the t statistic as follows and find the p-value (Note: This is a two-tailed test.) \[ t = \frac{m.bw.sk - m.bw.nsk}{\sqrt{SE(m.bw.sk)^2 + SE(m.bw.nsk)^2}} \]

1. Scatterplot

Construct a scatterplot of birth weight ($birthweight$) on the dummy variable for smoking ($smoker$).

plot(x=bw$smoker, y=bw$birthweight, main="Infant Birthweight and Smoking")

2. OLS regression: birthweight vs. smoker

To identify the relationship between birth weight and smoking, you estimate the following linear regression model: \[birthweight_i = \beta_0 + \beta_1 \cdot smoker_i + u_i,\] where $i=1,...,3000$. The dummy variable, smoker, has only two values: 1 and 0.

fit.sk <- lm(birthweight~smoker, data=bw)
summary(fit.sk)

## 
## Call:
## lm(formula = birthweight ~ smoker, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3007.06  -313.06    26.94   366.94  2322.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3432.06      11.87 289.115   <2e-16 ***
## smoker       -253.23      26.95  -9.396   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared:  0.0286, Adjusted R-squared:  0.02828 
## F-statistic: 88.28 on 1 and 2998 DF,  p-value: < 2.2e-16

How to interpret $\hat\beta_0$ and $\hat\beta_1$?

Hint:

For $smoker_i=1$, $E(birthweight_i) = \beta_0 + \beta_1$, given that $E(u_i)=0$ (why?).
For $smoker_i=0$, $E(birthweight_i) = \beta_0 + \beta_1 \times 0 = \beta_0$.
Thus, $\beta_1 = E(birthweight_i| smoker_i = 1) - E(birthweight_i| smoker_i = 0)$.

You might indicate the average of birthweight for each subgroup in the plot as follows:

plot(x=bw$smoker, y=bw$birthweight, main="Infant Birthweight and Smoking")
points(x=c(1), y=c(m.bw.sk), pch=17, col="red")
points(x=c(0), y=c(m.bw.nsk), pch=17, col="blue")
abline(h=m.bw.nsk, col="blue")
abline(h=m.bw.sk, col="red")

Question-3: Create a dummy variable for non-smokers. How will the estimated coefficients change if you regress birth weight on non-smokers?

The dummy variable for non-smokers is defined as follows: \[nsmoker_i = \begin{cases} 1, & \text{if individual i is a non-smoker} \\ 0, & otherwise \end{cases}\]

bw$nsmoker <- 1-bw$smoker  # add the new dummy variable, non-smoker, to data "bw"
str(bw)

## 'data.frame':    3000 obs. of  13 variables:
##  $ nprevist   : num  12 5 12 13 9 11 12 10 13 10 ...
##  $ alcohol    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre1    : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ tripre2    : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ tripre3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre0    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ birthweight: num  4253 3459 2920 2600 3742 ...
##  $ smoker     : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ unmarried  : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ educ       : num  12 16 11 17 13 16 14 13 17 14 ...
##  $ age        : num  27 24 23 28 27 33 24 38 29 28 ...
##  $ drinks     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nsmoker    : num  0 1 0 1 1 1 0 1 1 1 ...

Then, estimate the following model: \[birthweight_i = \beta_0 + \beta_1 \cdot nsmoker_i + u_i.\]

fit.nsk <- lm(birthweight~nsmoker, data=bw)
summary(fit.nsk)

## 
## Call:
## lm(formula = birthweight ~ nsmoker, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3007.06  -313.06    26.94   366.94  2322.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3178.83      24.20 131.376   <2e-16 ***
## nsmoker       253.23      26.95   9.396   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared:  0.0286, Adjusted R-squared:  0.02828 
## F-statistic: 88.28 on 1 and 2998 DF,  p-value: < 2.2e-16

Compare results from two regressions: with smoker vs. with nsmoker

stargazer(fit.sk, fit.nsk, 
          title="Regression Results", type="text",
          df=FALSE, digits=3)

## 
## Regression Results
## ================================================
##                         Dependent variable:     
##                     ----------------------------
##                             birthweight         
##                          (1)            (2)     
## ------------------------------------------------
## smoker               -253.228***                
##                        (26.951)                 
##                                                 
## nsmoker                             253.228***  
##                                      (26.951)   
##                                                 
## Constant             3,432.060***  3,178.832*** 
##                        (11.871)      (24.196)   
##                                                 
## ------------------------------------------------
## Observations            3,000          3,000    
## R2                      0.029          0.029    
## Adjusted R2             0.028          0.028    
## Residual Std. Error    583.730        583.730   
## F Statistic           88.279***      88.279***  
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

We find:

$\beta_{1,sk} = -\beta_{1,nsk}$

$\beta_{0,sk} = \beta_{0,nsk} + \beta_{1,nsk}$

This is because: \[ \begin{eqnarray*} birthweight_i &=& \beta_{0,nsk} + \beta_{1,nsk} \cdot nsmoker_i + u_i\notag\\ &=& \beta_{0,nsk} + \beta_{1,nsk} \cdot (1-smoker_i) + u_i\notag\\ &=& (\underbrace{\beta_{0,nsk} + \beta_{1,nsk}}_{\beta_{0,sk}}) + (\underbrace{- \beta_{1,nsk}}_{\beta_{1,sk}}) \cdot smoker_i + u_i\notag\\ \end{eqnarray*} \]

3. Construct Confidence Intervals

95% CI for both intercept and slope

confint(fit.sk)

##                 2.5 %    97.5 %
## (Intercept) 3408.7840 3455.3359
## smoker      -306.0736 -200.3831

95% CI for a specific coefficient

confint(fit.sk, "smoker")

##            2.5 %    97.5 %
## smoker -306.0736 -200.3831

90% CI for a specific coefficient

confint(fit.sk, "smoker", level=0.9)

##              5 %      95 %
## smoker -297.5733 -208.8834

4. Hypothesis Test

Using function lm, we get the results for testing the significance of the coefficients, including standard errors, t-statistic, and p-value. For a one-sided significance test ($H_a: \beta>0$ or $H_a: \beta<0$), the $t$-statistic is unchanged, but the $p$-value will be different.

Note: The $p$-value of a one-sided test is not always a half of the p-value for a two-sided test. (Why?)

Example: If $t=1.2$, what is the p-value for a lower-tailed test? What about the p-value for a two-sided test?

pnorm(1.2)     # p-value for a lower-tailed test

## [1] 0.8849303

2*pnorm(-1.2)  # p-value for a two-sided test

## [1] 0.2301393

In this section, we will practice how to conduct hypothesis tests for a non-zero valued null hypothesis.

A lower-tailed test: $H_0: \beta_{1,sk}=-250$ vs. $H_a: \beta_{1,sk}<-250$

summary(fit.sk)$coefficients

##              Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 3432.0600   11.87090 289.115391 0.000000e+00
## smoker      -253.2284   26.95149  -9.395709 1.090663e-20

## Obtain estimate for beta_1 and SE for this estimate 
b.sk <- summary(fit.sk)$coefficients[2,1]  
se.sk <- summary(fit.sk)$coefficients[2,2]

  ## Compute t-statistic
t.sk <- (b.sk+250)/se.sk
print(t.sk)

## [1] -0.1197838

  ## Find p-value
pnorm(t.sk)

## [1] 0.4523272

Conclusion: Since the p-value is much greater than $0.05$, we do not reject the null hypothesis at 5% significance level.

Exercise (Heteroskedasticity): Return to Education

Think about the economic value of education: if there were no expected economic value-added to receiving university education, you probably would not be reading this script right now. A starting point to empirically verify such a relation is to have data on working individuals. More precisely, we need data on wages and education of workers in order to estimate a model like \[wage_i = \beta_0 + \beta_1 \cdot education_i + u_i.\] What can be presumed about this relation? It is likely that, on average, higher educated workers earn more than workers with less education, so we expect to estimate an upward sloping regression line. Also, it seems plausible that earnings of better educated workers have a higher dispersion than those of low-skilled workers: solid education is not a guarantee for a high salary so even highly qualified workers take on low-income jobs. However, they are more likely to meet the requirements for the well-paid jobs than workers with less education for whom opportunities in the labor market are much more limited.

To verify this empirically we may use real data on hourly earnings and the number of years of education of employees. Such data can be found in CPSSWEducation. This data set is part of the package AER and comes from the Current Population Survey (CPS) which is conducted periodically by the Bureau of Labor Statistics in the United States.

Clear the Workspace

rm(list=ls())

Install and Load Needed Packages

Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).

#install.packages("AER")        # install R package "AER"
#install.packages("sandwich")   # install R package "sandwich"
#install.packages("lmtest")     # install R package "lmtest"

library(AER)                   # load package; to get data "CPSSWEducation"
library(sandwich)                   # load package; to compute heteroskedasticity robust standard errors
library(lmtest)                   # load package; to conduct hypothesis test using robust SE

Import the CPSSWEducation Data

data("CPSSWEducation")
attach(CPSSWEducation)
str(CPSSWEducation)

## 'data.frame':    2950 obs. of  4 variables:
##  $ age      : int  30 30 30 30 30 30 30 29 29 30 ...
##  $ gender   : Factor w/ 2 levels "female","male": 2 1 1 1 1 1 2 2 2 1 ...
##  $ earnings : num  34.6 19.2 13.7 13.9 19.2 ...
##  $ education: int  16 16 12 13 16 12 12 16 16 12 ...

Description of variables:

age: age in years.
gender: factor indicating gender.
earnings: average hourly earnings (sum of annual pretax wages, salaries, tips, and bonuses, divided by the number of hours worked annually).
education: number of years of education.

Summary Statistics: Earnings vs. Education

Find mean, median, and standard deviation.

summary(earnings)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.137  10.577  14.615  16.743  20.192  97.500

sd(earnings)

## [1] 9.402076

summary(education)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   12.00   13.00   13.55   16.00   18.00

sd(education)

## [1] 2.3141

Construct a scatterplot.

plot(education, earnings, main="Return to Education")

1. Is Education Positively Correlated to Earnings?

Hypothesis A: On average, higher educated workers earn more than workers with less education.

Estimate the following model: \[wage_i = \beta_0 + \beta_1 \cdot education_i + u_i.\]

fit.edu <- lm(earnings ~ education)
summary(fit.edu)

## 
## Call:
## lm(formula = earnings ~ education)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.270  -5.355  -1.513   3.194  77.164 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.13437    0.95925  -3.268   0.0011 ** 
## education    1.46693    0.06978  21.021   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.769 on 2948 degrees of freedom
## Multiple R-squared:  0.1304, Adjusted R-squared:  0.1301 
## F-statistic: 441.9 on 1 and 2948 DF,  p-value: < 2.2e-16

Note: The function summary estimates the homoskedasticity-only standard error.

The estimated regression equation states that, on average, an additional year of education increases a worker’s hourly earnings by about $1.47. To test if the slope coefficient is greater than zero, we find that the $p$-value is 0.

1-pnorm(21.021)

## [1] 0

Therefore, we reject the null hypothesis that the coefficient on education is greater than zero at the 5% significance level.

The regression results support Hypothesis A that Education is positively correlated to earnings.

Plot the fitted regression line.

plot(education, earnings, main="Returns to Education")
abline(fit.edu, col = "red")

fit.edu.noi <- lm(earnings ~ education - 1 )
summary(fit.edu.noi)

## 
## Call:
## lm(formula = earnings ~ education - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.655  -5.371  -1.725   3.083  77.625 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## education  1.24216    0.01176   105.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.784 on 2949 degrees of freedom
## Multiple R-squared:  0.7908, Adjusted R-squared:  0.7907 
## F-statistic: 1.115e+04 on 1 and 2949 DF,  p-value: < 2.2e-16

plot(education, earnings, main="Returns to Education", xlim=c(0,20))
abline(fit.edu, col = "red")
abline(fit.edu.noi, col = "blue")

2. Heteroskedasticity

Hypothesis B: Solid education is not a guarantee for a high salary so even highly qualified workers take on low-income jobs.

2.1 Do a visual inspection: plot residuals vs. education

hat_u <- fit.edu$residuals
plot(x=education, y=hat_u, main="The regression residuals vs. education")

The easiest way to test for heteroskedasticity is to get a good look at your data. Ideally, you want your data to all follow a pattern of a line (the dispersion of data does not change across $X$), but sometimes it doesn’t. The quickest way to identify heteroskedastic data is to see the shape of the plotted data. For example, if the data generally is cone shaped, heteroskedasticity is likely.

The plot indicates that there is heteroskedasticity: if we assume the regression line to be a reasonably good representation of the conditional mean function $E(earnings_i | education_i)$, the dispersion of hourly earnings around that function clearly increases with the level of education, i.e., the variance of the distribution of earnings increases. In other words: the variance of the errors (the errors made in explaining earnings by education) increases with education so that the regression errors are heteroskedastic.

This example makes a case that the assumption of homoskedasticity is doubtful in economic applications. Should we care about heteroskedasticity? Yes, we should. This is because heteroskedasticity can have serious negative consequences in hypothesis testing, if we ignore it.

2.2 Obtain heteroskedasticity robust standard errors

The heteroskedasticity robust SEs for estimated coefficients are often referred to as the Eicker-Huber-White standard errors, the most frequently cited paper on this is White (1980). In R, vcovHC computes the variance-covariance matrix of coefficient estimates. We are interested in the square root of the diagonal elements of this matrix, i.e., the standard error estimates.

The variance-covariance matrix of coefficient estimates is defined as: \[ var\left(\begin{array}{c} \hat\beta_0 \\ \hat\beta_1 \end{array}\right) =\left(\begin{array}{cc} var(\hat\beta_0) & cov(\hat\beta_0, \hat\beta_1) \\ cov(\hat\beta_0, \hat\beta_1) & var(\hat\beta_1) \end{array}\right) \] So vcovHC gives us an estimate of this matrix. We are mostly interested in the diagonal elements of the estimated matrix.

vcov.fit <- vcovHC(fit.edu, type = "HC3")  # variance-covariance matrix for beta_0 and beta_1
print(vcov.fit)

##             (Intercept)    education
## (Intercept)  0.85894706 -0.065865676
## education   -0.06586568  0.005185651

robust_se <- sqrt(diag(vcov.fit))  # robust SE for beta_0 and beta_1
print(robust_se)

## (Intercept)   education 
##  0.92679397  0.07201146

robust_se[2]    # robust SE for beta_1

##  education 
## 0.07201146

2.3 The significance test using robust standard errors

Now assume we want to generate a coefficient summary as provided by summary but with robust standard errors of the coefficient estimators, robust $t$-statistics and corresponding $p$-values for the regression model linear_model. This can be done using coeftest from the package lmtest, see ?coeftest. Further we specify in the argument vcov that vcovHC, the Eicker-Huber-White estimate of the variance matrix we have computed before, should be used.

coeftest(fit.edu, vcov=vcovHC, type="HC3")

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -3.134371   0.926794  -3.382 0.0007291 ***
## education    1.466925   0.072011  20.371 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How severe are the implications of using homoskedasticity-only standard errors in the presence of heteroskedasticity? The answer is: it depends. As mentioned above we face the risk of drawing wrong conclusions when conducting significance tests.

Reference:

MacKinnon, James G, and Halbert White. 1985. “Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties.” Journal of Econometrics 29 (3): 305–25.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica 48 (4): pp. 817–38.
https://www.econometrics-with-r.org/5-4-hah.html

L5 (Lab) - Inference for Simple OLS Regression

Rui Fan

Exercise (Dummy Variable): Birth Weight and Smoking

Clear the Workspace

Install and Load Needed Packages

Import the Birthweight_Smoking Data

Summary Statistics: birthweight

1. Scatterplot

2. OLS regression: birthweight vs. smoker

Compare results from two regressions: with smoker vs. with nsmoker

3. Construct Confidence Intervals

4. Hypothesis Test

Exercise (Heteroskedasticity): Return to Education

Clear the Workspace

Install and Load Needed Packages

Import the CPSSWEducation Data

Summary Statistics: Earnings vs. Education

1. Is Education Positively Correlated to Earnings?

2. Heteroskedasticity

2.1 Do a visual inspection: plot residuals vs. education

2.2 Obtain heteroskedasticity robust standard errors

2.3 The significance test using robust standard errors

Reference: