In this lab exercise, you will learn:
The data set used for this exercise is Birthweight_Smoking.xlsx from E5.3 of Stock and Watson (2020, e4). Birthweight_Smoking.xlsx is from the 1989 linked National Natality-Mortality Detail files, which contains a census of infant births and deaths. This data set was provided by Professors Douglas Almond, Kenneth Chay, and David Lee, and is a subset of the data used in their paper “The Costs of Low Birth Weight,” Quarterly Journal of Economics, August 2005, 120(3): 1031-1083. A detailed description is given in Birthweight_Smoking_Description.pdf, available in LMS.
rm(list=ls())
Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).
library(openxlsx) # load package; to read xlsx file
## Warning: package 'openxlsx' was built under R version 4.3.3
library(sandwich) # load package; to compute heteroskedasticity robust standard errors
library(lmtest) # load package; to conduct hypothesis test using robust SE
library(car) # load package; to conduct hypothesis for multiple coefficients
library(stargazer) # load package; to put regression results into a single stargazer table
id <- "1IL42szr5_GLat_hqY30yJVV_JVHxHEmO"
bw <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id), sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(bw)
## 'data.frame': 3000 obs. of 12 variables:
## $ nprevist : num 12 5 12 13 9 11 12 10 13 10 ...
## $ alcohol : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre1 : num 1 0 1 1 1 1 1 1 1 1 ...
## $ tripre2 : num 0 1 0 0 0 0 0 0 0 0 ...
## $ tripre3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre0 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ birthweight: num 4253 3459 2920 2600 3742 ...
## $ smoker : num 1 0 1 0 0 0 1 0 0 0 ...
## $ unmarried : num 1 0 0 0 0 0 0 0 0 0 ...
## $ educ : num 12 16 11 17 13 16 14 13 17 14 ...
## $ age : num 27 24 23 28 27 33 24 38 29 28 ...
## $ drinks : num 0 0 0 0 0 0 0 0 0 0 ...
Description of main variables:
Consider the following multiple regression models: \[ birthweight_i = \beta_0 + \beta_1 smoker_i + \beta_2 alcohol_i + \beta_3 nprevist_i + \beta_4 unmarried_i + \beta_5 age_i + \beta_6 educ_i + u_i \]
We run the OLS regression and estimate the coeffients:
## Run the OLS regressions ##
fit <- lm(birthweight~smoker+alcohol+nprevist+unmarried+age+educ, data=bw)
summary(fit)
##
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried +
## age + educ, data = bw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2820.01 -304.82 22.59 357.29 2355.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3199.426 83.337 38.392 < 2e-16 ***
## smoker -176.959 27.474 -6.441 1.38e-10 ***
## alcohol -14.758 75.839 -0.195 0.846
## nprevist 29.775 2.926 10.175 < 2e-16 ***
## unmarried -199.320 28.396 -7.019 2.75e-12 ***
## age -2.494 2.268 -1.099 0.272
## educ 0.238 5.542 0.043 0.966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 565.8 on 2993 degrees of freedom
## Multiple R-squared: 0.08901, Adjusted R-squared: 0.08719
## F-statistic: 48.74 on 6 and 2993 DF, p-value: < 2.2e-16
- Can we say that the effect of drinking on birthweight is insignificant due to imperfect multicollinearity?
To answer this question, we check the sample correlations between independent variables using cor. Because there is no large sample correlation between alcohol and any other independent variable, we say the insignificance of alcohol is not due to imperfect multicollinearity.
Note: For a multiple regression with \(k\) independent variables, the variance of \(\hat{\beta}_j, j=1,...,k\) is: \[var(\hat{\beta}_j) = \frac{\sigma_u^2}{n(1-R_j^2)\sigma_{x_j}^2},\] where \(\sigma_u^2\) is the variance of the error term, \(\sigma_{x_j}^2\) the variance of \(x_j\), and \(R_j^2\) the \(R\)-squared of regressing \(x_j\) on \((1, x_1,..,x_{j-1},x_{j+1},...,x_{k})\). If \(R_j^2=0\), it implies that \(x_j\) has zero sample correlation with every other independent variable; if \(R_j^2=1\), it implies perfect collinearity.
## A list of independent variables ##
x.list <- c("smoker", "alcohol", "nprevist", "unmarried", "age", "educ")
## Correlation Matrix ##
cor(bw[,x.list])
## smoker alcohol nprevist unmarried age
## smoker 1.0000000 0.120898079 -0.10863524 0.23774397 -0.1437645
## alcohol 0.1208981 1.000000000 -0.04254006 0.05119108 0.0381249
## nprevist -0.1086352 -0.042540063 1.00000000 -0.23318800 0.1468002
## unmarried 0.2377440 0.051191080 -0.23318800 1.00000000 -0.4136319
## age -0.1437645 0.038124905 0.14680023 -0.41363187 1.0000000
## educ -0.2326451 0.004910194 0.21020526 -0.34133354 0.4457169
## educ
## smoker -0.232645112
## alcohol 0.004910194
## nprevist 0.210205261
## unmarried -0.341333536
## age 0.445716896
## educ 1.000000000
- Obtain heteroskedasticity robust standard errors.
If homoskedasticity assumption is not valid, i.e., \(var(u_i)\) is not constant across \(i\), using the default SE of \(\hat\beta\) for confidence intervals or hypothesis test could be problematic: The estimates of the SE is biased, thus invalidating statistical tests of significance that assume that the modelling errors all have the same variance. Thus, under the null hypothesis, the usual \(t\) test no longer follows the \(t\) distribution, and \(F\) test is no longer valid.
To obtain the heteroskedasticity robust standard errors, we use the vcovHC function from R package sandwich, and then compute the square root of it.
## Obtain robust variance
var.r <- vcovHC(fit, type = "HC3")
## Obtain robust SE
se.r <- sqrt(diag(var.r))
## Find robust SE of the OLS estimator for variable "smoker"
se.r["smoker"]
## smoker
## 27.40443
## Summary of results
stargazer(fit, fit, se=list(NULL, se.r), column.labels=c("default","robust"), type="text", digits=2, omit.stat=c("f", "ser"))
##
## =========================================
## Dependent variable:
## ----------------------------
## birthweight
## default robust
## (1) (2)
## -----------------------------------------
## smoker -176.96*** -176.96***
## (27.47) (27.40)
##
## alcohol -14.76 -14.76
## (75.84) (74.22)
##
## nprevist 29.78*** 29.78***
## (2.93) (3.61)
##
## unmarried -199.32*** -199.32***
## (28.40) (31.07)
##
## age -2.49 -2.49
## (2.27) (2.45)
##
## educ 0.24 0.24
## (5.54) (5.55)
##
## Constant 3,199.43*** 3,199.43***
## (83.34) (90.83)
##
## -----------------------------------------
## Observations 3,000 3,000
## R2 0.09 0.09
## Adjusted R2 0.09 0.09
## =========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
- Construct a 95% confidence interval for the effect of smoking on birth weight. How about a 95% CI using robust standard errors?
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 3036.023237 3362.829598
## smoker -230.828310 -123.089391
## alcohol -163.459664 133.943006
## nprevist 24.037200 35.513014
## unmarried -254.997302 -143.641683
## age -6.941573 1.954517
## educ -10.628659 11.104696
coefci(fit, vcov=vcovHC, type="HC3")
## 2.5 % 97.5 %
## (Intercept) 3021.321591 3377.531243
## smoker -230.692271 -123.225429
## alcohol -160.294688 130.778030
## nprevist 22.696472 36.853742
## unmarried -260.241126 -138.397859
## age -7.299868 2.312812
## educ -10.637517 11.113554
Also, we can calculate confidence intervals for a specific coefficient at any confidence level.
- Hypothesis Tests for the Multiple Regression Model. \[ birthweight_i = \beta_0 + \beta_1 smoker_i + \beta_2 alcohol_i + \beta_3 nprevist_i + \beta_4 unmarried_i + \beta_5 age_i + \beta_6 educ_i + u_i. \]
\[H_0: \beta_{1} = \beta_{2}=0 \quad vs. \quad H_a: \text{At least one of the coefficients is non-zero.}\]
## F-test using usual (default) SE ##
linearHypothesis(fit, c("smoker=0", "alcohol=0"))
## Linear hypothesis test
##
## Hypothesis:
## smoker = 0
## alcohol = 0
##
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age +
## educ
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2995 971581039
## 2 2993 958012642 2 13568397 21.195 7.239e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## F-test using robust SE ##
linearHypothesis(fit, c("smoker=0", "alcohol=0"), white.adjust=c("hc3"))
## Linear hypothesis test
##
## Hypothesis:
## smoker = 0
## alcohol = 0
##
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age +
## educ
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 2995
## 2 2993 2 21.204 7.173e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[H_0: \beta_1 = 0, \beta_2=0, ..., \beta_6=0 \quad vs. \quad H_a: \text{At least one of the coefficients is non-zero.}\]
summary(fit)
##
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried +
## age + educ, data = bw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2820.01 -304.82 22.59 357.29 2355.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3199.426 83.337 38.392 < 2e-16 ***
## smoker -176.959 27.474 -6.441 1.38e-10 ***
## alcohol -14.758 75.839 -0.195 0.846
## nprevist 29.775 2.926 10.175 < 2e-16 ***
## unmarried -199.320 28.396 -7.019 2.75e-12 ***
## age -2.494 2.268 -1.099 0.272
## educ 0.238 5.542 0.043 0.966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 565.8 on 2993 degrees of freedom
## Multiple R-squared: 0.08901, Adjusted R-squared: 0.08719
## F-statistic: 48.74 on 6 and 2993 DF, p-value: < 2.2e-16
## A list of coefficients in Model 2 ##
coefs <- names(coef(fit))
## F-test using usual (default) SE ##
linearHypothesis(fit, coefs[-1]) # "-1" means "excluding the intercept"
## Linear hypothesis test
##
## Hypothesis:
## smoker = 0
## alcohol = 0
## nprevist = 0
## unmarried = 0
## age = 0
## educ = 0
##
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age +
## educ
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2999 1051620004
## 2 2993 958012642 6 93607362 48.741 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## F-test using robust SE ##
linearHypothesis(fit, coefs[-1], white.adjust=c("hc3"))
## Linear hypothesis test
##
## Hypothesis:
## smoker = 0
## alcohol = 0
## nprevist = 0
## unmarried = 0
## age = 0
## educ = 0
##
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age +
## educ
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 2999
## 2 2993 6 37.359 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For example, unmarried is a control variable that captures the effects of several factors that might differ between married and unmarried mothers such as age, education, income, diet and other health factors, and so forth. So, we can test if the effect of unmarried is equal to the effect of educ: \[H_0: \beta_{4} = \beta_{6} \quad vs. \quad H_a: \beta_{4} \neq \beta_{6}\]
## F-test using usual (default) SE ##
linearHypothesis(fit, c("unmarried=educ"))
## Linear hypothesis test
##
## Hypothesis:
## unmarried - educ = 0
##
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age +
## educ
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2994 974080516
## 2 2993 958012642 1 16067874 50.199 1.725e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## F-test using robust SE ##
linearHypothesis(fit, c("unmarried=educ"), white.adjust=c("hc3"))
## Linear hypothesis test
##
## Hypothesis:
## unmarried - educ = 0
##
## Model 1: restricted model
## Model 2: birthweight ~ smoker + alcohol + nprevist + unmarried + age +
## educ
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 2994
## 2 2993 1 42.178 9.723e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Generate a new variable: W ##
bw$W <- bw$unmarried + bw$educ
## Regression for the transformed model ##
fit.tr <- lm(birthweight~smoker+alcohol+nprevist+unmarried+age+W, data=bw)
summary(fit.tr)
##
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried +
## age + W, data = bw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2820.01 -304.82 22.59 357.29 2355.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3199.426 83.337 38.392 < 2e-16 ***
## smoker -176.959 27.474 -6.441 1.38e-10 ***
## alcohol -14.758 75.839 -0.195 0.846
## nprevist 29.775 2.926 10.175 < 2e-16 ***
## unmarried -199.558 28.166 -7.085 1.72e-12 ***
## age -2.494 2.268 -1.099 0.272
## W 0.238 5.542 0.043 0.966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 565.8 on 2993 degrees of freedom
## Multiple R-squared: 0.08901, Adjusted R-squared: 0.08719
## F-statistic: 48.74 on 6 and 2993 DF, p-value: < 2.2e-16
## Test gamma_1 using usual (default) SE ##
coeftest(fit.tr)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3199.42642 83.33679 38.3915 < 2.2e-16 ***
## smoker -176.95885 27.47381 -6.4410 1.378e-10 ***
## alcohol -14.75833 75.83874 -0.1946 0.8457
## nprevist 29.77511 2.92637 10.1747 < 2.2e-16 ***
## unmarried -199.55751 28.16574 -7.0851 1.725e-12 ***
## age -2.49353 2.26853 -1.0992 0.2718
## W 0.23802 5.54208 0.0429 0.9657
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Test gamma_1 using heteroskedasticity robust SE ##
coeftest(fit.tr, vcov=vcovHC, type="HC3")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3199.42642 90.83473 35.2225 < 2.2e-16 ***
## smoker -176.95885 27.40443 -6.4573 1.239e-10 ***
## alcohol -14.75833 74.22458 -0.1988 0.8424
## nprevist 29.77511 3.61015 8.2476 2.391e-16 ***
## unmarried -199.55751 30.72740 -6.4944 9.723e-11 ***
## age -2.49353 2.45127 -1.0172 0.3091
## W 0.23802 5.54660 0.0429 0.9658
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In-class exercise: Conduct the following test \[H_0: \beta_{smoker} = \beta_{alcohol} \quad vs. \quad H_a: \beta_{smoker} \neq \beta_{alcohol}\]