Consider the following regression results: \[ \begin{aligned} \widehat{AHE} = & 0.33 + 10.42 \cdot College - 4.57 \cdot Female + 0.61 \cdot Age + 0.74 \cdot Northeast \\ &(1.47)\hskip10pt (0.29)\hskip50pt (0.29) \hskip50pt (0.05) \hskip40pt (0.47) \\ & - 1.54 \cdot Midwest - 0.44 \cdot South,~~ n = 7178,~~SER=12.01,~~R^2=0.185 \\ & \hskip10pt (0.40) \hskip60pt(0.37) \end{aligned} \] \[ \begin{aligned} & AHE = \mbox{average hourly earnings}\\ & College = \mbox{binary variable (1 if college, 0 if high school)}\\ & Female = \mbox{binary variable (1 if female, 0 if male)}\\ & Age = (in years)\\ & Northeast = \mbox{binary variable(1 if Region = Northeast, 0 otherwise)}\\ & Midwest = \mbox{binary variable (1 if Region = Midwest, 0 otherwise)}\\ & South = \mbox{binary variable (1 if Region = South, 0 otherwise)}\\ & West = \mbox{binary variable (1 if Region = West, 0 otherwise)} \end{aligned} \]
Note that each observation belongs to exactly one of the four regions listed above. Explain why the regression model drops \(West\) from the set of regressors.
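West is the omitted (baseline) category. Because every observation belongs to exactly one of the four regions, \(Northeast + Midwest + South + West = 1\) for all observations: if Northeast, Midwest, and South all equal 0, then West must equal 1, and if any one of them equals 1, West must equal 0. Including all four dummies together with the intercept would therefore cause perfect multicollinearity (the dummy variable trap), so one region has to be dropped. The coefficients on the remaining region dummies are then interpreted as differences relative to the West.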
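The dummy variable trap can be seen directly in R. The sketch below uses simulated data (the variable names and sample are hypothetical, not the earnings data behind the regression above) and shows that `lm()` cannot estimate all four region coefficients once an intercept is included.

```r
# Simulated data (hypothetical, not the AHE sample): each observation
# belongs to exactly one of four regions.
set.seed(42)
n <- 200
region <- sample(c("Northeast", "Midwest", "South", "West"), n, replace = TRUE)
d <- data.frame(
  y         = rnorm(n),
  northeast = as.numeric(region == "Northeast"),
  midwest   = as.numeric(region == "Midwest"),
  south     = as.numeric(region == "South"),
  west      = as.numeric(region == "West")
)

# The four dummies always sum to 1, which is exactly the intercept column.
dummy_sum <- with(d, northeast + midwest + south + west)

# With an intercept and all four dummies, lm() reports NA for the
# aliased coefficient (here the last one entered, west).
fit_all <- lm(y ~ northeast + midwest + south + west, data = d)
print(coef(fit_all))
```

Dropping any one of the four dummies (or the intercept) removes the collinearity; which one is dropped only changes the reference group.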
Do there appear to be important regional differences? Conduct a t-test for each region.
For Northeast, t = (0.74 - 0)/SE = 0.74/0.47 = 1.57. Since |t| < 1.96, earnings in the Northeast are not significantly different from those in the West at the 5% level.
For Midwest, t = (-1.54 - 0)/SE = -1.54/0.40 = -3.85. Since |t| > 1.96, earnings in the Midwest are significantly lower than in the West at the 5% level.
For South, t = (-0.44 - 0)/SE = -0.44/0.37 = -1.19. Since |t| < 1.96, earnings in the South are not significantly different from those in the West at the 5% level. Only the Midwest therefore shows a statistically significant regional difference relative to the West.
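The three t-statistics can be reproduced in a few lines of R. This is a minimal sketch that only uses the coefficients and standard errors reported in the regression above; the object names are illustrative.

```r
# Reported coefficients and standard errors for the region dummies
coefs <- c(Northeast = 0.74, Midwest = -1.54, South = -0.44)
ses   <- c(Northeast = 0.47, Midwest = 0.40,  South = 0.37)

# t-statistic for H0: coefficient = 0, and two-sided test at the 5% level
t_stats     <- coefs / ses
significant <- abs(t_stats) > 1.96

print(round(t_stats, 2))
print(significant)
```

Each test compares one region with the West separately. A joint test of all three region coefficients would require an F-statistic, which cannot be computed from the reported standard errors alone.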
Juanita is a 38-year-old female college graduate from the South. Molly is a 28-year-old female college graduate from the West. Jennifer is a 28-year-old female college graduate from the Midwest. Construct a 95% confidence interval for the difference in expected earnings between Juanita and Molly. It is given that \(Cov(\hat{\beta}_a,\hat{\beta}_s)=0.02\), where \(\hat{\beta}_a\) and \(\hat{\beta}_s\) are estimators for the coefficient of \(age\) and \(South\), respectively.
Expected AHE of Juanita = 0.33 + 10.42 - 4.57 + 0.61 × 38 - 0.44; expected AHE of Molly = 0.33 + 10.42 - 4.57 + 0.61 × 28.
Then the estimated difference is $ \widehat{D} = 10\hat{\beta}_a + \hat{\beta}_s = 0.61 \times (38-28) + (-0.44) = 5.66 $.
Because \(\widehat{D}\) is the linear combination \(10\hat{\beta}_a + \hat{\beta}_s\), \[ \begin{aligned} SE(\widehat{D}) & = \sqrt{Var(10\hat{\beta}_a + \hat{\beta}_s)} \\ & = \sqrt{10^2 \times Var(\hat{\beta}_a) + Var(\hat{\beta}_s) + 2 \times 10 \times Cov(\hat{\beta}_a,\hat{\beta}_s)} \\ & = \sqrt{100 \times 0.05^2 + 0.37^2 + 20 \times 0.02} = \sqrt{0.7869} \\ & = 0.8871. \end{aligned} \]
Therefore, the 95% confidence interval is:
$ \widehat{D} \pm 1.96 \times SE(\widehat{D}) = 5.66 \pm 1.74 = [3.92, 7.40] $.
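The difference between Juanita and Molly is a weighted sum of two coefficients, ten times the age coefficient plus the South coefficient, so its variance follows the usual rule for linear combinations, \(w' V w\). The sketch below evaluates that rule in R from the reported numbers; the object names are illustrative.

```r
# Reported estimates, standard errors, and covariance
b_age   <- 0.61;  se_age   <- 0.05
b_south <- -0.44; se_south <- 0.37
cov_as  <- 0.02

# Weights: Juanita is 10 years older and lives in the South
w <- c(age = 38 - 28, south = 1)

# Variance-covariance matrix of (beta_age, beta_south)
V <- matrix(c(se_age^2, cov_as,
              cov_as,   se_south^2), nrow = 2, byrow = TRUE)

d_hat <- sum(w * c(b_age, b_south))      # point estimate of the difference
se_d  <- sqrt(drop(t(w) %*% V %*% w))    # standard error of w'beta
ci    <- d_hat + c(-1, 1) * 1.96 * se_d  # 95% confidence interval

print(round(c(estimate = d_hat, se = se_d, lower = ci[1], upper = ci[2]), 4))
```

The weight of 10 on the age coefficient matters: it enters the variance squared and also scales the covariance term.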
Consider the regression model \(Y_i=\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i\). Transform the regression so that you can use a \(t\)-statistic to test the following.
$ \gamma = \beta_1 - \beta_2 $ , $ \beta_1 = \gamma + \beta_2 $
\[
\begin{aligned}
Y_i & = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \\
& = \beta_0 + (\gamma + \beta_2) X_{1i} + \beta_2 X_{2i} + u_i \\
& = \beta_0 + \gamma X_{1i} + \beta_2 (X_{1i} + X_{2i}) + u_i \\
& = \beta_0 + \gamma X_{1i} + \beta_2 W_i + u_i
\end{aligned}
\]
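In this case, the dependent variable is \(Y_i\) and the regressors are \(X_{1i}\) and \(W_i = X_{1i} + X_{2i}\). The \(t\)-statistic on the coefficient of \(X_{1i}\) tests \(\gamma = 0\), which is equivalent to \(\beta_1 = \beta_2\).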
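A quick simulation can confirm that the transformed regression recovers \(\gamma = \beta_1 - \beta_2\) as the coefficient on \(X_{1i}\). The data below are simulated under assumed parameter values (\(\beta_1 = \beta_2 = 2\)); nothing here comes from a real data set.

```r
# Simulated data with beta1 = beta2 = 2, so gamma = 0 holds by construction
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

fit_orig  <- lm(y ~ x1 + x2)   # original regression
w         <- x1 + x2           # W_i = X_1i + X_2i
fit_trans <- lm(y ~ x1 + w)    # transformed regression

gamma_hat <- coef(fit_trans)[["x1"]]
diff_hat  <- coef(fit_orig)[["x1"]] - coef(fit_orig)[["x2"]]

# t-statistic for H0: gamma = 0, read straight from the regression output
t_gamma <- summary(fit_trans)$coefficients["x1", "t value"]

print(c(gamma_hat = gamma_hat, beta1_minus_beta2 = diff_hat, t = t_gamma))
```

The two estimates agree up to floating-point error because the transformation is only a reparameterization of the same model.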
$ \gamma = \beta_1 + 2\beta_2 $ , $ \beta_1 = \gamma - 2\beta_2 $ \[ \begin{aligned} Y_i & = \beta_0 + (\gamma - 2\beta_2) X_{1i} + \beta_2 X_{2i} + u_i \\ & = \beta_0 + \gamma X_{1i} + \beta_2 (X_{2i} - 2X_{1i}) + u_i \\ & = \beta_0 + \gamma X_{1i} + \beta_2 W_i + u_i \end{aligned} \]
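Therefore, the dependent variable is \(Y_i\) and the regressors are \(X_{1i}\) and \(W_i = X_{2i} - 2X_{1i}\). The \(t\)-statistic on the coefficient of \(X_{1i}\) tests \(\gamma = 0\), which is equivalent to \(\beta_1 + 2\beta_2 = 0\).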
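$ \beta_1 + \beta_2 = 1 $ , $ \gamma = \beta_1 + \beta_2 $ , $ \beta_1 = \gamma - \beta_2 $ \[ \begin{aligned} Y_i & = \beta_0 + (\gamma - \beta_2) X_{1i} + \beta_2 X_{2i} + u_i \\ & = \beta_0 + \gamma X_{1i} - \beta_2 (X_{1i} - X_{2i}) + u_i \\ & = \beta_0 + \gamma X_{1i} - \beta_2 W_i + u_i \end{aligned} \]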
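Therefore, the dependent variable is \(Y_i\) and the regressors are \(X_{1i}\) and \(W_i = X_{1i} - X_{2i}\). The hypothesis \(\beta_1 + \beta_2 = 1\) is equivalent to \(\gamma = 1\), which is tested with \(t = (\hat{\gamma} - 1)/SE(\hat{\gamma})\).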
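For the restriction \(\beta_1 + \beta_2 = 1\), the null value of \(\gamma\) is 1 rather than 0, so the t-statistic printed by `summary()` cannot be used as is. The sketch below, on simulated data with assumed coefficients 0.3 and 0.7, builds the statistic by hand.

```r
# Simulated data with beta1 + beta2 = 0.3 + 0.7 = 1 (H0 true by construction)
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.5 + 0.3 * x1 + 0.7 * x2 + rnorm(n)

w   <- x1 - x2              # W_i = X_1i - X_2i
fit <- lm(y ~ x1 + w)       # coefficient on x1 is gamma, on w is -beta2
est <- summary(fit)$coefficients

gamma_hat <- est["x1", "Estimate"]
t_stat    <- (gamma_hat - 1) / est["x1", "Std. Error"]  # H0: gamma = 1

# Cross-check against the original parameterization
fit_orig <- lm(y ~ x1 + x2)
print(c(gamma_hat = gamma_hat,
        beta1_plus_beta2 = sum(coef(fit_orig)[c("x1", "x2")]),
        t = t_stat))
```

An equivalent route is to subtract \(X_{1i}\) from both sides of the regression, so that the coefficient on \(X_{1i}\) becomes \(\gamma - 1\) and the default t-statistic applies.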
Sales in a company are $156 in 2013 and increase to $158 in 2014.
Compute the percentage increase in sales, using the usual formula \(100 \times \frac{(Sales_{2014}-Sales_{2013})}{Sales_{2013}}\). Compare this value to the approximation \(100\times (\ln(Sales_{2014})-\ln(Sales_{2013}))\).
For the actual percentage increase,
((158 - 156)/156)*100
## [1] 1.282051
For the approximation, which requires the natural logarithm (`log()` in R, not `log10()`),
100*(log(158) - log(156))
## [1] 1.273903
b. Repeat (a), assuming that $Sales_{2013}=200$ and $Sales_{2014}=300$, respectively. Can you confirm the increased approximation errors?
For the actual percentage increase,
100*((300-200)/200)
## [1] 50
For the approximation,
100*(log(300) - log(200))
## [1] 40.54651
The approximation rests on two facts: ln(1+a) is very close to "a" when "a" is a small number, and ln(x) - ln(y) is equal to ln(x/y).
For Sales 2013 = 200 and Sales 2014 = 300,
ln(Sales 2014) - ln(Sales 2013)
=ln(Sales 2014/Sales 2013)
=ln((Sales 2013 + Sales 2014 - Sales 2013)/Sales 2013)
=ln(1+(Sales 2014 - Sales 2013)/Sales 2013)
In this case, "a" is (Sales 2014 - Sales 2013)/Sales 2013 = 0.5, which is not a small number, and therefore
ln(1+(Sales 2014 - Sales 2013)/Sales 2013) = ln(1.5) = 0.405 is a poor approximation to (Sales 2014 - Sales 2013)/Sales 2013 = 0.5.
The approximation error is about 9.45 percentage points (50 versus 40.55), compared with less than 0.01 percentage points in (a), where "a" = 2/156 ≈ 0.0128. This confirms that the approximation error increases with the size of the change.
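To see how quickly the log approximation deteriorates, the sketch below tabulates the exact percentage change and the log approximation for a range of growth rates, including the two cases from this exercise. The grid of growth rates is an arbitrary illustrative choice.

```r
# Growth rates a = (new - old) / old; 2/156 and 0.50 are the two cases above
growth <- c(0.01, 2 / 156, 0.05, 0.10, 0.25, 0.50)

exact_pct  <- 100 * growth          # usual percentage change
approx_pct <- 100 * log1p(growth)   # 100 * ln(1 + a), natural log
error_pp   <- exact_pct - approx_pct

print(round(data.frame(growth, exact_pct, approx_pct, error_pp), 4))
```

The error is roughly \(100 \times a^2/2\) for small \(a\), which explains why it is negligible for a 1% change but large for a 50% change.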
Use the birthweight_smoking.csv data set and answer the following questions.
rm(list=ls())
dat = read.csv('birthweight_smoking.csv')
m1 = lm(birthweight ~ smoker, data = dat)
s.m1 = summary(m1)
print(s.m1)
##
## Call:
## lm(formula = birthweight ~ smoker, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3007.06 -313.06 26.94 366.94 2322.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3432.06 11.87 289.115 <2e-16 ***
## smoker -253.23 26.95 -9.396 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared: 0.0286, Adjusted R-squared: 0.02828
## F-statistic: 88.28 on 1 and 2998 DF, p-value: < 2.2e-16
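Therefore, i. The estimated effect of smoking on birth weight is -253.23: on average, babies of mothers who smoke weigh about 253 grams less than babies of mothers who do not.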
m2 = lm(birthweight ~ smoker + alcohol + nprevist, data = dat)
s.m2 = summary(m2)
print(s.m2)
##
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2733.53 -307.57 21.42 358.09 2192.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3051.249 34.016 89.701 < 2e-16 ***
## smoker -217.580 26.680 -8.155 5.07e-16 ***
## alcohol -30.491 76.234 -0.400 0.689
## nprevist 34.070 2.855 11.933 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 570.5 on 2996 degrees of freedom
## Multiple R-squared: 0.07285, Adjusted R-squared: 0.07192
## F-statistic: 78.47 on 3 and 2996 DF, p-value: < 2.2e-16
Therefore, ii. The estimated effect of smoking on birth weight is -217.580; the coefficients on alcohol and nprevist are -30.491 and 34.070, respectively.
m3 = lm(birthweight ~ smoker + alcohol + nprevist + unmarried, data = dat)
s.m3 = summary(m3)
print(s.m3)
##
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2798.81 -309.22 25.37 361.80 2363.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3134.400 35.656 87.907 < 2e-16 ***
## smoker -175.377 27.099 -6.472 1.13e-10 ***
## alcohol -21.083 75.607 -0.279 0.78
## nprevist 29.603 2.898 10.213 < 2e-16 ***
## unmarried -187.133 26.007 -7.195 7.84e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 565.7 on 2995 degrees of freedom
## Multiple R-squared: 0.08861, Adjusted R-squared: 0.08739
## F-statistic: 72.79 on 4 and 2995 DF, p-value: < 2.2e-16
Therefore, iii. The estimated effect of smoking on birth weight is -175.377; the coefficients on alcohol, nprevist, and unmarried are -21.083, 29.603, and -187.133, respectively.
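It can help to line up the smoker coefficient across the three specifications. The sketch below only re-enters the estimates and standard errors printed above; the data frame and column names are illustrative.

```r
# Smoker coefficient and standard error from regressions (i), (ii), (iii)
smoker <- data.frame(
  model    = c("i", "ii", "iii"),
  estimate = c(-253.23, -217.580, -175.377),
  se       = c(26.95, 26.680, 27.099)
)

# Change relative to the previous specification, and size relative to (i)
smoker$change_from_prev <- c(NA, diff(smoker$estimate))
smoker$pct_of_model_i   <- 100 * smoker$estimate / smoker$estimate[1]

print(smoker)
```

The estimate shrinks in magnitude each time controls are added, while its standard error stays almost unchanged.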
b. Construct a 95% confidence interval for the effect of smoking on birth weight, using each of the regressions above.
i. The 95% confidence interval is [-306.052, -200.408].
-253.23 - 1.96*26.95
## [1] -306.052
-253.23 + 1.96*26.95
## [1] -200.408
ii. The 95% confidence interval is [-269.8728, -165.2872].
-217.580 - 1.96*26.680
## [1] -269.8728
-217.580 + 1.96*26.680
## [1] -165.2872
iii. The 95% confidence interval is [-228.491, -122.263].
-175.377 - 1.96*27.099
## [1] -228.491
-175.377 + 1.96*27.099
## [1] -122.263
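The six calculations above follow one pattern, so they can be wrapped in a small helper. `ci95` is a hypothetical name introduced here for illustration; it uses the normal critical value 1.96, as in the calculations above.

```r
# Hypothetical helper: estimate +/- z * standard error
ci95 <- function(estimate, se, z = 1.96) {
  c(lower = estimate - z * se, upper = estimate + z * se)
}

# Smoker coefficient and standard error from each regression
cis <- rbind(
  i   = ci95(-253.23,  26.95),
  ii  = ci95(-217.580, 26.680),
  iii = ci95(-175.377, 27.099)
)

print(round(cis, 3))
```

With the fitted objects available, `confint(m1, "smoker")` would give a similar interval based on the t distribution; with roughly 3000 degrees of freedom the difference from 1.96 is negligible.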
c. Does the coefficient on *Smoker* in regression (i) suffer from omitted variable bias? Explain.
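Yes, it appears so. The coefficient on *Smoker* is -253.23 in (i) and -217.580 in (ii), so the estimated effect shrinks in magnitude once alcohol and nprevist are moved out of the error term and into the model. Omitted variable bias arises when an omitted variable both helps determine birth weight and is correlated with smoking. The coefficient on nprevist is strongly significant in (ii), and the change in the *Smoker* coefficient suggests that smoking is correlated with at least one of the added variables. Regression (i) therefore overstates the negative effect of smoking, and its *Smoker* coefficient suffers from omitted variable bias.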
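The mechanics of omitted variable bias can be illustrated with the in-sample identity: short coefficient = long coefficient + (coefficient on the omitted variable) × (slope from regressing the omitted variable on the included one). The sketch below uses simulated data with made-up parameter values; it is not the birth weight data, and `visits` is a hypothetical stand-in for an omitted control.

```r
# Simulated data: 'visits' raises weight and is lower among smokers
set.seed(3)
n      <- 1000
smoker <- rbinom(n, 1, 0.2)
visits <- 11 - 2 * smoker + rnorm(n, sd = 3)
weight <- 3000 - 180 * smoker + 30 * visits + rnorm(n, sd = 500)

short <- lm(weight ~ smoker)            # omits visits
long  <- lm(weight ~ smoker + visits)   # includes visits
aux   <- lm(visits ~ smoker)            # omitted variable on included one

# Omitted variable bias term implied by the identity
bias <- coef(long)[["visits"]] * coef(aux)[["smoker"]]

print(c(short = coef(short)[["smoker"]],
        long  = coef(long)[["smoker"]],
        bias  = bias))
```

Because the omitted variable raises weight and is negatively related to smoking in this setup, the bias term is negative and the short regression exaggerates the harm from smoking, the same direction as the change from (i) to (ii).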
d. Does the coefficient on *Smoker* in regression (ii) suffer from omitted variable bias? Explain.
Yes, it appears so. When "Unmarried" is added in (iii), the coefficient on *Smoker* changes substantially, from -217.580 to -175.377. The coefficient on "Unmarried" is itself large and statistically significant (-187.133, t = -7.195), so marital status is related to birth weight, and the change in the *Smoker* coefficient indicates that it is also correlated with smoking. Thus, *Smoker* in (ii) suffers from omitted variable bias.
e. A family advocacy group notes that the large coefficient in (iii) suggests that public policies that encourage marriage will lead, on average, to healthier babies. Do you agree? Explain.
No. "Unmarried" is a control variable, added to reduce the omitted variable bias in the coefficient on *Smoker*. Its own coefficient is likely to pick up the effect of many other omitted factors that are correlated with marital status and also affect birth weight, so it does not have a causal interpretation. The regression is therefore not sufficient evidence that being unmarried causes lower birth weight, or that encouraging marriage would lead to healthier babies.