L7 (Lab) - Inference for Multiple Regressions

Birth Weight and Smoking

The data set used for this exercise is Birthweight_Smoking.xlsx from E5.3 of Stock and Watson (2020, e4). Birthweight_Smoking.xlsx is from the 1989 linked National Natality-Mortality Detail files, which contains a census of infant births and deaths. This data set was provided by Professors Douglas Almond, Kenneth Chay, and David Lee, and is a subset of the data used in their paper “The Costs of Low Birth Weight,” Quarterly Journal of Economics, August 2005, 120(3): 1031-1083. A detailed description is given in Birthweight_Smoking_Description.pdf, available in LMS.

Clear the Workspace

rm(list=ls())

Install and Load Needed Packages

Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).

library(openxlsx)               # load package; to read xlsx file

## Warning: package 'openxlsx' was built under R version 4.3.3

library(sandwich)               # load package; to compute heteroskedasticity robust standard errors
library(lmtest)                 # load package; to conduct hypothesis test using robust SE
library(car)                    # load package; to conduct hypothesis for multiple coefficients
library(stargazer)              # load package; to put regression results into a single stargazer table

Import the Birthweight_Smoking Data

id <- "1IL42szr5_GLat_hqY30yJVV_JVHxHEmO"
bw <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id), sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(bw)

## 'data.frame':    3000 obs. of  12 variables:
##  $ nprevist   : num  12 5 12 13 9 11 12 10 13 10 ...
##  $ alcohol    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre1    : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ tripre2    : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ tripre3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre0    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ birthweight: num  4253 3459 2920 2600 3742 ...
##  $ smoker     : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ unmarried  : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ educ       : num  12 16 11 17 13 16 14 13 17 14 ...
##  $ age        : num  27 24 23 28 27 33 24 38 29 28 ...
##  $ drinks     : num  0 0 0 0 0 0 0 0 0 0 ...

Description of main variables:

birthweight: birth weight of infant (in grams)
smoker: indicator =1 if the mother smoked during pregnancy and 0, otherwise.
alcohol: indicator = 1 if mother drank alcohol during pregnancy
nprevist: total number of prenatal visits
age: age
educ: years of educational attainment (more than 16 years coded as 17)
unmarried: indicator =1 if mother is unmarried.

Consider the following multiple regression models: \[ birthweight_i = \beta_0 + \beta_1 smoker_i + \beta_2 alcohol_i + \beta_3 nprevist_i + \beta_4 unmarried_i + \beta_5 age_i + \beta_6 educ_i + u_i \]

We run the OLS regression and estimate the coeffients:

## Run the OLS regressions ##
fit <- lm(birthweight~smoker+alcohol+nprevist+unmarried+age+educ, data=bw)
summary(fit)

## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried + 
##     age + educ, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2820.01  -304.82    22.59   357.29  2355.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3199.426     83.337  38.392  < 2e-16 ***
## smoker      -176.959     27.474  -6.441 1.38e-10 ***
## alcohol      -14.758     75.839  -0.195    0.846    
## nprevist      29.775      2.926  10.175  < 2e-16 ***
## unmarried   -199.320     28.396  -7.019 2.75e-12 ***
## age           -2.494      2.268  -1.099    0.272    
## educ           0.238      5.542   0.043    0.966    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 565.8 on 2993 degrees of freedom
## Multiple R-squared:  0.08901,    Adjusted R-squared:  0.08719 
## F-statistic: 48.74 on 6 and 2993 DF,  p-value: < 2.2e-16

Can we say that the effect of drinking on birthweight is insignificant due to imperfect multicollinearity?

To answer this question, we check the sample correlations between independent variables using cor. Because there is no large sample correlation between alcohol and any other independent variable, we say the insignificance of alcohol is not due to imperfect multicollinearity.

Note: For a multiple regression with \(k\) independent variables, the variance of \(\hat{\beta}_j, j=1,...,k\) is: \[var(\hat{\beta}_j) = \frac{\sigma_u^2}{n(1-R_j^2)\sigma_{x_j}^2},\] where \(\sigma_u^2\) is the variance of the error term, \(\sigma_{x_j}^2\) the variance of \(x_j\), and \(R_j^2\) the \(R\)-squared of regressing \(x_j\) on \((1, x_1,..,x_{j-1},x_{j+1},...,x_{k})\). If \(R_j^2=0\), it implies that \(x_j\) has zero sample correlation with every other independent variable; if \(R_j^2=1\), it implies perfect collinearity.

## A list of independent variables ##
x.list <- c("smoker", "alcohol", "nprevist", "unmarried", "age", "educ")

## Correlation Matrix ##
cor(bw[,x.list])

##               smoker      alcohol    nprevist   unmarried        age
## smoker     1.0000000  0.120898079 -0.10863524  0.23774397 -0.1437645
## alcohol    0.1208981  1.000000000 -0.04254006  0.05119108  0.0381249
## nprevist  -0.1086352 -0.042540063  1.00000000 -0.23318800  0.1468002
## unmarried  0.2377440  0.051191080 -0.23318800  1.00000000 -0.4136319
## age       -0.1437645  0.038124905  0.14680023 -0.41363187  1.0000000
## educ      -0.2326451  0.004910194  0.21020526 -0.34133354  0.4457169
##                   educ
## smoker    -0.232645112
## alcohol    0.004910194
## nprevist   0.210205261
## unmarried -0.341333536
## age        0.445716896
## educ       1.000000000

Obtain heteroskedasticity robust standard errors.

If homoskedasticity assumption is not valid, i.e., \(var(u_i)\) is not constant across \(i\), using the default SE of \(\hat\beta\) for confidence intervals or hypothesis test could be problematic: The estimates of the SE is biased, thus invalidating statistical tests of significance that assume that the modelling errors all have the same variance. Thus, under the null hypothesis, the usual \(t\) test no longer follows the \(t\) distribution, and \(F\) test is no longer valid.

To obtain the heteroskedasticity robust standard errors, we use the vcovHC function from R package sandwich, and then compute the square root of it.

## Obtain robust variance
var.r <- vcovHC(fit, type = "HC3")  

## Obtain robust SE
se.r <- sqrt(diag(var.r))  

## Find robust SE of the OLS estimator for variable "smoker"
se.r["smoker"]

##   smoker 
## 27.40443

## Summary of results
stargazer(fit, fit, se=list(NULL, se.r), column.labels=c("default","robust"), type="text", digits=2, omit.stat=c("f", "ser"))

## 
## =========================================
##                  Dependent variable:     
##              ----------------------------
##                      birthweight         
##                 default        robust    
##                   (1)            (2)     
## -----------------------------------------
## smoker         -176.96***    -176.96***  
##                 (27.47)        (27.40)   
##                                          
## alcohol          -14.76        -14.76    
##                 (75.84)        (74.22)   
##                                          
## nprevist        29.78***      29.78***   
##                  (2.93)        (3.61)    
##                                          
## unmarried      -199.32***    -199.32***  
##                 (28.40)        (31.07)   
##                                          
## age              -2.49          -2.49    
##                  (2.27)        (2.45)    
##                                          
## educ              0.24          0.24     
##                  (5.54)        (5.55)    
##                                          
## Constant      3,199.43***    3,199.43*** 
##                 (83.34)        (90.83)   
##                                          
## -----------------------------------------
## Observations     3,000          3,000    
## R2                0.09          0.09     
## Adjusted R2       0.09          0.09     
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01

From the table above, we observe that:
- Two sets of standard errors are not very different.
- Heteroskedasticity-robust SE can be either larger or smaller than the usual SE.

The use of robust standard errors will not change the coefficient estimates provided by OLS, but they will change the standard errors and give you reasonable accurate p-values.

If we can compute robust SE regardless of heteroskedasticity, why do we bother with the usual SE at all?
- The heteroskedasticity-robust SE and robust t tests are justified only when the sample size is large.
- With smaller sample sizes, Heteroskedasticity-robust statistics may have more bias than the usual statistics.

Construct a 95% confidence interval for the effect of smoking on birth weight. How about a 95% CI using robust standard errors?

The 95% confidence interval using usual SE:

confint(fit)

##                   2.5 %      97.5 %
## (Intercept) 3036.023237 3362.829598
## smoker      -230.828310 -123.089391
## alcohol     -163.459664  133.943006
## nprevist      24.037200   35.513014
## unmarried   -254.997302 -143.641683
## age           -6.941573    1.954517
## educ         -10.628659   11.104696

The 95% confidence interval using robust SE:

coefci(fit, vcov=vcovHC, type="HC3")

##                   2.5 %      97.5 %
## (Intercept) 3021.321591 3377.531243
## smoker      -230.692271 -123.225429
## alcohol     -160.294688  130.778030
## nprevist      22.696472   36.853742
## unmarried   -260.241126 -138.397859
## age           -7.299868    2.312812
## educ         -10.637517   11.113554

Also, we can calculate confidence intervals for a specific coefficient at any confidence level.

For example, a 99% CI for the coefficient on “smoker”:
- Using usual SE: confint(fit, “smoker”, level=0.99).
- Using robust SE: coefci(fit, , vcov=vcovHC, type=“HC3”, “smoker”, level=0.99).

In-class exercise: Consider the coefficient on “unmarried”.

Construct a 90% CI for the coefficient using robust SE.
Is the coefficient statistically significant? What is the \(t\)-statistic and p-value for this significant test?
Your roommate doesn’t know anything about econometrics and asks you the meaning of this coefficient. How will you explain the estimate on unmarried to him/her?

Hypothesis Tests for the Multiple Regression Model. \[ birthweight_i = \beta_0 + \beta_1 smoker_i + \beta_2 alcohol_i + \beta_3 nprevist_i + \beta_4 unmarried_i + \beta_5 age_i + \beta_6 educ_i + u_i. \]

(1) Test the joint significance of the coefficients for smoker and alcohol.

\[H_0: \beta_{1} = \beta_{2}=0 \quad vs. \quad H_a: \text{At least one of the coefficients is non-zero.}\]

## F-test using usual (default) SE ##
linearHypothesis(fit, c("smoker=0", "alcohol=0"))

## Linear hypothesis test
## 
## Hypothesis:
## smoker = 0
## alcohol = 0
## 
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age + 
##     educ
## 
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1   2995 971581039                                  
## 2   2993 958012642  2  13568397 21.195 7.239e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## F-test using robust SE ##
linearHypothesis(fit, c("smoker=0", "alcohol=0"), white.adjust=c("hc3"))

## Linear hypothesis test
## 
## Hypothesis:
## smoker = 0
## alcohol = 0
## 
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age + 
##     educ
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1   2995                        
## 2   2993  2 21.204 7.173e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(2) Test the joint significance of all coefficients (except for the intercept).

\[H_0: \beta_1 = 0, \beta_2=0, ..., \beta_6=0 \quad vs. \quad H_a: \text{At least one of the coefficients is non-zero.}\]

One-way to conduct this F test is to use the function summary:

summary(fit)

## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried + 
##     age + educ, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2820.01  -304.82    22.59   357.29  2355.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3199.426     83.337  38.392  < 2e-16 ***
## smoker      -176.959     27.474  -6.441 1.38e-10 ***
## alcohol      -14.758     75.839  -0.195    0.846    
## nprevist      29.775      2.926  10.175  < 2e-16 ***
## unmarried   -199.320     28.396  -7.019 2.75e-12 ***
## age           -2.494      2.268  -1.099    0.272    
## educ           0.238      5.542   0.043    0.966    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 565.8 on 2993 degrees of freedom
## Multiple R-squared:  0.08901,    Adjusted R-squared:  0.08719 
## F-statistic: 48.74 on 6 and 2993 DF,  p-value: < 2.2e-16

Another way is to use the function linearHypothesis:

## A list of coefficients in Model 2 ##
coefs <- names(coef(fit))

## F-test using usual (default) SE ##
linearHypothesis(fit, coefs[-1])  # "-1" means "excluding the intercept"

## Linear hypothesis test
## 
## Hypothesis:
## smoker = 0
## alcohol = 0
## nprevist = 0
## unmarried = 0
## age = 0
## educ = 0
## 
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age + 
##     educ
## 
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1   2999 1051620004                                  
## 2   2993  958012642  6  93607362 48.741 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## F-test using robust SE ##
linearHypothesis(fit, coefs[-1], white.adjust=c("hc3"))

## Linear hypothesis test
## 
## Hypothesis:
## smoker = 0
## alcohol = 0
## nprevist = 0
## unmarried = 0
## age = 0
## educ = 0
## 
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age + 
##     educ
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1   2999                        
## 2   2993  6 37.359 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(3) Test single restrictions on multiple coefficients.

For example, unmarried is a control variable that captures the effects of several factors that might differ between married and unmarried mothers such as age, education, income, diet and other health factors, and so forth. So, we can test if the effect of unmarried is equal to the effect of educ: \[H_0: \beta_{4} = \beta_{6} \quad vs. \quad H_a: \beta_{4} \neq \beta_{6}\]

One-way is to perform the test directly using the function linearHypothesis:

## F-test using usual (default) SE ##
linearHypothesis(fit, c("unmarried=educ"))

## Linear hypothesis test
## 
## Hypothesis:
## unmarried - educ = 0
## 
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age + 
##     educ
## 
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1   2994 974080516                                  
## 2   2993 958012642  1  16067874 50.199 1.725e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## F-test using robust SE ##
linearHypothesis(fit, c("unmarried=educ"), white.adjust=c("hc3"))

## Linear hypothesis test
## 
## Hypothesis:
## unmarried - educ = 0
## 
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age + 
##     educ
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1   2994                        
## 2   2993  1 42.178 9.723e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The second way is to rearrange the regression so that the restriction becomes a restriction on a single coefficient. To do this, we subtract and add \(\beta_6 \cdot unmarried_i\) to the original model: \[\begin{aligned} birthweight_i &= \beta_0 + \beta_1 smoker_i + \beta_2 alcohol_i + \beta_3 nprevist_i + \beta_4 unmarried_i + \beta_5 age_i + \beta_6 educ_i + u_i \\ &= \beta_0 + \beta_1 smoker_i + \beta_2 alcohol_i + \beta_3 nprevist_i + {\color{red}{(\beta_4-\beta_6)}} unmarried_i + \beta_5 age_i + \beta_6 {\color{red}{(unmarried_i + educ_i)}} + u_i \\ &= \beta_0 + \beta_1 smoker_i + \beta_2 alcohol_i + \beta_3 nprevist_i + {\color{red}{\gamma_1}} unmarried_i + \beta_5 age_i + \beta_6 {\color{red}{W_i}} + u_i \end{aligned}\]

## Generate a new variable: W ##
bw$W <- bw$unmarried + bw$educ

## Regression for the transformed model ##
fit.tr <- lm(birthweight~smoker+alcohol+nprevist+unmarried+age+W, data=bw)
summary(fit.tr)

## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried + 
##     age + W, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2820.01  -304.82    22.59   357.29  2355.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3199.426     83.337  38.392  < 2e-16 ***
## smoker      -176.959     27.474  -6.441 1.38e-10 ***
## alcohol      -14.758     75.839  -0.195    0.846    
## nprevist      29.775      2.926  10.175  < 2e-16 ***
## unmarried   -199.558     28.166  -7.085 1.72e-12 ***
## age           -2.494      2.268  -1.099    0.272    
## W              0.238      5.542   0.043    0.966    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 565.8 on 2993 degrees of freedom
## Multiple R-squared:  0.08901,    Adjusted R-squared:  0.08719 
## F-statistic: 48.74 on 6 and 2993 DF,  p-value: < 2.2e-16

## Test gamma_1 using usual (default) SE ##
coeftest(fit.tr)

## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 3199.42642   83.33679 38.3915 < 2.2e-16 ***
## smoker      -176.95885   27.47381 -6.4410 1.378e-10 ***
## alcohol      -14.75833   75.83874 -0.1946    0.8457    
## nprevist      29.77511    2.92637 10.1747 < 2.2e-16 ***
## unmarried   -199.55751   28.16574 -7.0851 1.725e-12 ***
## age           -2.49353    2.26853 -1.0992    0.2718    
## W              0.23802    5.54208  0.0429    0.9657    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Test gamma_1 using heteroskedasticity robust SE ##
coeftest(fit.tr, vcov=vcovHC, type="HC3")

## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 3199.42642   90.83473 35.2225 < 2.2e-16 ***
## smoker      -176.95885   27.40443 -6.4573 1.239e-10 ***
## alcohol      -14.75833   74.22458 -0.1988    0.8424    
## nprevist      29.77511    3.61015  8.2476 2.391e-16 ***
## unmarried   -199.55751   30.72740 -6.4944 9.723e-11 ***
## age           -2.49353    2.45127 -1.0172    0.3091    
## W              0.23802    5.54660  0.0429    0.9658    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In-class exercise: Conduct the following test \[H_0: \beta_{smoker} = \beta_{alcohol} \quad vs. \quad H_a: \beta_{smoker} \neq \beta_{alcohol}\]