To set this (and the following couple of tutorials) in context, suppose the Federal Government feels that there is a need to reduce the incidence of cigarette smoking.
Apart from using taxation increases, the Government is considering an advertising campaign to highlight the detrimental effects of smoking.
We have been hired as consultants to assess the causal impact of mothers’ smoking habits on children’s birthweight (a proxy for a baby’s health).
The Government has asked us to quantify the effect of smoking during pregnancy so that it can use these results in its ad campaign.
So, we are going to use the data set provided to advise Scott Morrison on the impact of smoking whilst pregnant on a baby’s birthweight.
To avoid any accusations of “fake news” in the ad campaign we need to be careful in determining the true causal effect.
To install the estimatr package please use the same procedure that we have used previously in RStudio.
Note that once you have installed a package in R you should not have to install it again; it will now reside in your R package library.
Next, load the required stargazer, AER and estimatr packages
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
and create a dataframe called mydata1 by reading the tute7_smoke.csv data file into R.
The tute7_smoke.csv file is a micro dataset with the following 13 variables:
stargazer(mydata1,type = "text", nobs = FALSE, mean.sd = TRUE, median = TRUE,
iqr = TRUE, digits=3, omit = "earnings",
title = "Summary statistics for baby's birthweight data"
)
##
## Summary statistics for baby's birthweight data
## ==================================================================
## Statistic Mean St. Dev. Min Pctl(25) Median Pctl(75) Max
## ------------------------------------------------------------------
## id 1,500.500 866.170 1 750.8 1,500.5 2,250.2 3,000
## birthweight 3,382.934 592.163 425 3,062 3,420 3,750 5,755
## smoker 0.194 0.395 0 0 0 0 1
## alcohol 0.019 0.138 0 0 0 0 1
## drinks 0.058 0.688 0 0 0 0 21
## nprevisit 10.992 3.672 0 9 12 13 35
## tripre1 0.804 0.397 0 1 1 1 1
## tripre2 0.153 0.360 0 0 0 0 1
## tripre3 0.033 0.179 0 0 0 0 1
## tripre0 0.010 0.100 0 0 0 0 1
## unmarried 0.227 0.419 0 0 0 0 1
## educ 12.907 2.167 0 12 12 14 17
## age 26.889 5.362 14 23 27 31 44
## ------------------------------------------------------------------
On average, a baby in the sample weighs 3,383 grams; 19% of mothers smoke, 2% drink alcohol and 23% are unmarried, and the average mother has 11 pre-natal visits, 12.9 years of educational attainment and is 27 years old.
Is there a difference in birthweight between mothers who smoked during pregnancy and those who did not?
## [1] 3178.832
## [1] 3432.06
## [1] -253.2284
##
## Welch Two Sample t-test
##
## data: mydata1$birthweight[mydata1$smoker == 1] and mydata1$birthweight[mydata1$smoker == 0]
## t = -9.4414, df = 887.15, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -305.8685 -200.5882
## sample estimates:
## mean of x mean of y
## 3178.832 3432.060
The difference in probability densities in the plot above highlights a leftward shift in the distribution of birthweight for mothers who smoke.
From the two-sample t-test, the difference in mean birthweight between babies with smoking and non-smoking mothers is -253 grams, a difference that is statistically significant with a p-value less than 0.0001, and with a 95% CI of [-306, -201].
It’s a large difference: the difference is \(100 \times 253/3383=7.5\%\) of the sample mean.
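As a quick arithmetic check, the reported interval can be rebuilt from the numbers printed above. This is a minimal sketch in base R that uses only the t-test output (the two means, the t statistic and the degrees of freedom); the variable names are illustrative and are not part of the tutorial code.

```r
# Values copied from the Welch t-test output above
diff_means <- 3178.832 - 3432.060   # smokers minus non-smokers
t_stat     <- -9.4414
df_welch   <- 887.15

# Standard error implied by t = difference / se
se_diff <- diff_means / t_stat

# 95% confidence interval: difference +/- critical value * se
t_crit <- qt(0.975, df = df_welch)
ci <- diff_means + c(-1, 1) * t_crit * se_diff

# Difference as a percentage of the overall sample mean (3382.934)
pct_of_mean <- 100 * abs(diff_means) / 3382.934

round(c(se = se_diff, lower = ci[1], upper = ci[2], pct = pct_of_mean), 2)
```

Up to rounding of the printed inputs, this should reproduce the interval shown in the t-test output and the 7.5% figure.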
Let’s use a simple linear regression (SLR) line to look at the association between a mother smoking and the baby’s birthweight.
\[Birthweight_i = \underbrace{\beta_0 + \beta_1\, smoking_i}_{\scriptsize{\text{deterministic}}} +U_i \tag{1}\]
Recall that the error term, \(U_i\), in the Population Regression Line (PRL) captures all other factors not included in the deterministic part of the PRL.
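When the regressor is a 0/1 variable, the OLS intercept is the mean of the group coded 0 and the OLS slope is the difference between the two group means. The sketch below illustrates this on simulated data; the data frame sim and all the numbers used to generate it are made up for illustration and are not the tutorial dataset.

```r
set.seed(1)

# Simulated stand-in data (hypothetical values, not tute7_smoke.csv)
n <- 500
smoker <- rbinom(n, 1, 0.2)
birthweight <- 3400 - 250 * smoker + rnorm(n, sd = 580)
sim <- data.frame(birthweight, smoker)

# Simple linear regression on the 0/1 regressor
fit <- lm(birthweight ~ smoker, data = sim)

# Group means computed directly
mean_non <- mean(sim$birthweight[sim$smoker == 0])
mean_smk <- mean(sim$birthweight[sim$smoker == 1])

# The intercept matches the non-smoker mean and the slope the difference in means
coef(fit)
c(intercept = mean_non, slope = mean_smk - mean_non)
```

So for Model 1 the estimated slope is simply the difference in mean birthweight between the two groups of mothers.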
So the question is: do we have a causal relationship using Model 1?
This idea was discussed in the last question of Tutorial 6, where we looked at the causal relationship between the number of police and homicides in a cross-section of UK counties in 2012.
Looking back at the dataset that we are using now, there could be other variables (characteristics of, or choices made by, the mother-to-be) that may impact on a baby’s birthweight but are not included in Model 1.
\(\dots\) back to the available dataset.
Consider whether drinking alcohol while pregnant, attending pre-natal visits and the mother’s level of education may impact on the baby’s birthweight.
Remember the “brief”: we are attempting to advise the Government on the (causal) effect of smoking while pregnant.
So now we are trying to determine whether we should use the results of Model 1 to advise the Government of the partial effect that smoking while pregnant will have on the (average) birthweight of the child.
The variables we are considering are:
alcohol: \(\quad\) equals one if mother drank alcohol during pregnancy, 0 otherwise
tripre0: \(\quad\) equals one if no prenatal visits, 0 otherwise
educ: \(\quad\) years of educational attainment of mother
Do these variables (that are not in the SLR, Model 1) have an influence on the partial (or marginal) effect of smoking while pregnant on birthweight?
We will next look at these graphically and conduct two-sample t-tests.
t-test of difference in drinking for smokers and non-smokers
estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|
0.0420984 | 0.0532646 | 0.0111663 | 4.40434 | 1.24e-05 | 643.3413 | 0.023329 | 0.0608677 | Welch Two Sample t-test | two.sided |
t-test of difference in pre natal care for smokers and non-smokers
estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|
0.0153062 | 0.0223368 | 0.0070306 | 2.405874 | 0.0164026 | 672.7704 | 0.0028144 | 0.0277979 | Welch Two Sample t-test | two.sided |
t-test of difference in attained education level for smokers and non-smokers
estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|
-1.274535 | 11.87973 | 13.15426 | -15.80019 | 0 | 1164.658 | -1.432801 | -1.116268 | Welch Two Sample t-test | two.sided |
Alcohol: From the two-sample t-test, the difference in the mean of alcohol between smoking and non-smoking mothers is 0.042 (a mother who smokes is 4.2 percentage points more likely to drink alcohol during pregnancy), a statistically significant difference with a p-value less than 0.0001.
Pre-Natal Care: the two-sample t-test shows the difference in the mean of tripre0 between smoking and non-smoking mothers is 0.015 (a mother who smokes is 1.5 percentage points more likely to have had no prenatal visits, i.e. less prenatal care), a statistically significant difference with a p-value of about 0.016.
Education: From the two-sample t-test, the difference in the mean of educ between smoking and non-smoking mothers is -1.27 (1.27 fewer years of educational attainment if a mother smokes), a statistically significant difference with a p-value less than 0.0001.
Therefore, the t-tests indicate that alcohol, tripre0 and educ all differ between smoking and non-smoking mothers; if these variables also impact on birthweight, leaving them out of Model 1 matters.
Discuss the relationship each variable, alcohol, tripre0 and educ, would have, if any, with smoker.
We would expect alcohol to be positively related to smoking,
pre-natal care to be negatively related to smoking, and
education to be negatively related to smoking.
If these variables should be in the model then estimates from our Simple Linear Regression Line may be biased.
We will look at this more closely in the last question.
In this case the assumption \(\small{E\left(U_i \vert \bf{X} \right)=0}\) is violated, leading to Omitted Variable Bias (OVB).
It can be shown (as stated in L6 slide 20) that in this case \(\small{\hat{\beta_1} \rightarrow \beta_1 + \rho_{X_U} \dfrac{\sigma_U}{\sigma_X}}\), so two things have to happen for our estimate \(\small{\hat{\beta_1}}\) to be biased: the omitted variable should be in the model, and the omitted variable is correlated with the included X variable in the model.
The size of the bias depends on the size of \(\small{\rho_{X_U}}\) and the direction of the bias depends on the sign of \(\small{\rho_{X_U}}\), which can be thought of as the sign of the correlation between the omitted variable and the included X multiplied by the expected sign of the \(\beta\) that the omitted X variable would take.
From what we have discussed already, omitting:
alcohol would result in a \(\small{({\color{blue}{+}}) \times ({\color{red}{-}}) \rightarrow \text{ negative bias in } \hat{\beta_1}}\)
prenatal care; a \(\small{({\color{red}{-}}) \times ({\color{blue}{+}}) \rightarrow \text{ negative bias in } \hat{\beta_1}}\)
and, education; a \(\small{({\color{red}{-}}) \times ({\color{blue}{+}}) \rightarrow \text{ negative bias in } \hat{\beta_1}}\)
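The sign rule can be illustrated with a small simulation. The sketch below assumes a made-up data-generating process in which education is lower for smokers and raises birthweight (the coefficients 30 and -200 are arbitrary assumptions, not estimates from the tutorial data), and compares the short regression (educ omitted) with the long regression (educ included).

```r
set.seed(42)

# Hypothetical data-generating process (all numbers are assumptions)
n <- 5000
smoker <- rbinom(n, 1, 0.2)
educ <- 13 - 1.3 * smoker + rnorm(n, sd = 2)          # smokers have less education
birthweight <- 3000 + 30 * educ - 200 * smoker + rnorm(n, sd = 500)

short <- lm(birthweight ~ smoker)          # educ omitted
long  <- lm(birthweight ~ smoker + educ)   # educ included
aux   <- lm(educ ~ smoker)                 # how educ moves with smoker

b_short <- coef(short)[["smoker"]]
b_long  <- coef(long)[["smoker"]]
b_educ  <- coef(long)[["educ"]]
delta   <- coef(aux)[["smoker"]]

# In-sample identity: short slope = long slope + (educ coefficient) * (aux slope)
bias <- b_short - b_long
c(short = b_short, long = b_long, bias = bias, product = b_educ * delta)
```

Here the education coefficient is positive and the auxiliary slope is negative, so their product (the bias) is negative and the short regression overstates the harm from smoking, matching the \((-) \times (+)\) case above.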
Before exploring this further, Q2 introduces the issue of heteroskedasticity.
We are assuming that all the error terms in the Population Regression Line (PRL) have the same variance, \(\small{Var\left(U_i \lvert \bf{X} \right) = \sigma^2}\), i.e. they are homoskedastic.
If this assumption does not hold and the errors are heteroskedastic, then the OLS estimators are no longer minimum variance and the usual OLS standard errors are incorrect, so the results of hypothesis tests and confidence interval estimation may be wrong.
Heteroskedastic errors can be defined as \(\small{Var\left(U_i \lvert \bf{X} \right) = \sigma^2 (\bf{\cdot})}\) where \(\small{(\bf{\cdot})}\) is some function of the X’s. For example, the functional form of the heteroskedasticity could be \(\small{\sigma^2 (X_i)}\), \(\small{\sigma^2 (X^2_i)}\), \(\small{\sigma^2 (\frac{1}{X_i})}\), \(\dots\)
A prominent econometrician, Halbert White, developed an estimator that “washes out” any functional form of heteroskedasticity from the variance-covariance matrix of the OLS estimators.
The resulting OLS standard errors are known as White standard errors; we can now safely conduct any inference (providing the other assumptions hold) and basically not have to worry whether the errors are heteroskedastic or not.
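For reference, in the simple regression case White’s estimator of the variance of the slope is commonly written as below. This is the basic form (often labelled HC0), given here as background in the notation of this tutorial, with \(\hat{U}_i\) denoting the OLS residuals; it is not taken from the lecture slides.

```latex
\[
\widehat{Var}\left(\hat{\beta_1}\right)
  = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 \hat{U}_i^2}
         {\left[\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2\right]^2}
\]
```

Each squared residual is weighted by how far \(X_i\) is from its mean, so no particular functional form for the heteroskedasticity has to be specified.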
There are a number of ways to address this issue in R; one of the most straightforward ways is to use the estimatr package.
The lm_robust() command in this package automatically provides Heteroskedasticity-Consistent (HC) standard errors (se).
The default option for lm_robust() is the HC2 version of White’s variance-covariance estimator, which includes a small-sample adjustment.
While the estimatr package is easy to use there is one drawback: it is not compatible with stargazer, so we need a little more work to obtain the correct HC se’s in the stargazer regression tables.
Since the HC estimator only operates on the variance-covariance matrix of the OLS estimators, we can use the usual lm() command to estimate the model, then re-estimate the model using lm_robust(), save the estimated coefficient standard errors, then insert these in stargazer; e.g.
# lm() does not account for any heteroskedasticity
reg1 <- lm(birthweight ~ smoker, data = mydata1)
# lm_robust() (from the estimatr package) uses HC2 standard errors by default
lmout <- lm_robust(birthweight ~ smoker, data = mydata1)
# save the HC standard errors
se_hc <- lmout$std.error
# tabulate results in stargazer
# note we now have to specify which s.e. belong to each column
# stargazer automatically uses the s.e. from the lm() fit (NULL keeps them),
# but cannot read an lm_robust() object, so reg1 is entered twice and
# the HC s.e. are supplied for the second column via se = list(NULL, se_hc)
stargazer(reg1, reg1, type = "text", se = list(NULL, se_hc),
          column.labels = c("OLS se", "HC se"))
Results are
##
## ============================================================
## Dependent variable:
## ----------------------------
## birthweight
## OLS se HC se
## (1) (2)
## ------------------------------------------------------------
## smoker -253.228*** -253.228***
## (26.951) (26.821)
##
## Constant 3,432.060*** 3,432.060***
## (11.871) (11.889)
##
## ------------------------------------------------------------
## Observations 3,000 3,000
## R2 0.029 0.029
## Adjusted R2 0.028 0.028
## Residual Std. Error (df = 2998) 583.730 583.730
## F Statistic (df = 1; 2998) 88.279*** 88.279***
## ============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
There is no change to the parameter estimates, but you can see a change (in this case only a small one) in the estimated standard errors.
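With a single 0/1 regressor, the HC2 standard error of the slope coincides with the standard error used in the Welch two-sample t-test, which is why the HC se of 26.821 equals 253.2284/9.4414 from the earlier t-test output. The sketch below checks this by hand in base R on simulated data; the data and error variances are made up, and the sandwich calculation is a hand-rolled illustration rather than the code estimatr runs internally.

```r
set.seed(7)

# Simulated stand-in data with a different error spread in each group
n <- 1000
smoker <- rbinom(n, 1, 0.2)
birthweight <- 3400 - 250 * smoker +
  rnorm(n, sd = ifelse(smoker == 1, 650, 550))

fit <- lm(birthweight ~ smoker)
X <- model.matrix(fit)   # n x 2 design matrix (intercept, smoker)
u <- resid(fit)          # OLS residuals
h <- hatvalues(fit)      # leverage of each observation

# HC2 sandwich: (X'X)^-1 [X' diag(u^2 / (1 - h)) X] (X'X)^-1
bread    <- solve(crossprod(X))
meat_hc2 <- crossprod(X * (u / sqrt(1 - h)))
vcov_hc2 <- bread %*% meat_hc2 %*% bread
se_hc2   <- sqrt(diag(vcov_hc2))

# Welch standard error of the difference in means
y1 <- birthweight[smoker == 1]
y0 <- birthweight[smoker == 0]
se_welch <- sqrt(var(y1) / length(y1) + var(y0) / length(y0))

c(hc2_slope_se = unname(se_hc2[2]), welch_se = se_welch)
```

The usual lm() standard error instead pools the two group variances, which is the source of the small gap between the two columns of the table.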
On to Q3: here we will introduce the Multiple Regression Model, start to “build” a model, look at the effects of OVB, and have another look at not taking heteroskedasticity into account \(\dots\) so quite a bit to get through here.