Name: [Wang Kewen]

Student ID Number: [400268580]

  1. Data were collected from a random sample of 220 home sales from a community in 2013. Let \(Price\) denote the selling price (in $1000s), \(BDR\) denote the number of bedrooms, \(Bath\) denote the number of bathrooms, \(Hsize\) denote the size of the house (in square feet), \(Lsize\) denote the lot size (in square feet), \(Age\) denote the age of the house (in years), and \(Poor\) denote a binary variable that is equal to 1 if the condition of the house is reported as “poor”. An estimated regression yields \[ \begin{aligned} \widehat{Price} =& 119.2 + 0.445 BDR + 26.4 Bath + 0.136 Hsize + 0.002Lsize \\ & + 0.090 Age - 43.8 Poor,~~~\overline{R}^2=0.72,~ SER = 41.5. \end{aligned} \]

    1. Suppose a homeowner converts part of an existing family room in her house into a new bathroom. What is the expected increase in the value of the house?

      The expected increase in the value of the house is 26.4 × 1000 = $26,400.

    2. Suppose a homeowner adds a new bathroom to her house, which increases the size of the house by 100 square feet. What is the expected increase in the value of the house?

      The expected increase in the value of the house is 26.4 × 1000 + 0.136 × 100 × 1000 = 26,400 + 13,600 = $40,000.

    3. What is the loss in value if a homeowner lets his house run down, so that its condition becomes “poor”?

      When the condition becomes “poor”, the variable \(Poor\) in the regression changes from 0 to 1, so the expected loss in the value of the house is 43.8 × 1000 = $43,800.

    4. Compute the \(R^2\) for the regression.

      Since \(R^2=1-SSR/TSS\) and \(\bar{R}^2=1-\frac{n-1}{n-k-1}\frac{SSR}{TSS}=0.72\), with \(n=220\) and \(k=6\) we have \(0.72=1-\frac{219}{213}\frac{SSR}{TSS}\), so \(SSR/TSS=0.28\times 213/219=59.64/219\approx 0.2723\). Therefore \(R^2=1-59.64/219=159.36/219\approx 0.7277\).
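
The arithmetic in parts 1 to 4 can be double-checked with a short R sketch. The coefficient values, n = 220, and k = 6 are taken from the estimated regression above; the variable names are only illustrative.

```r
# Coefficients from the estimated regression (Price is measured in $1000s)
b_bath  <- 26.4
b_hsize <- 0.136
b_poor  <- -43.8

gain_bath      <- b_bath * 1000                    # part 1: one more bathroom, house size unchanged
gain_bath_size <- (b_bath + b_hsize * 100) * 1000  # part 2: one more bathroom plus 100 sq ft
loss_poor      <- -b_poor * 1000                   # part 3: condition becomes "poor"

# Part 4: recover R^2 from the adjusted R^2
n <- 220; k <- 6; adj_r2 <- 0.72
r2 <- 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)

print(c(gain_bath, gain_bath_size, loss_poor))
print(round(r2, 4))
```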

  2. Use the data file caschool.csv for this question. A detailed description of the data set is given in caschool_description.pdf. In this exercise, you will investigate the relationship between the class size and students’ performance.

    1. Run the following regression model and report the coefficient estimates, (heteroskedasticity robust) standard errors, adjusted \(R^2\), and SER. \[ \begin{aligned} \widehat{TestScore}_i = \beta_0 + \beta_1 STR_i + u_i \end{aligned} \]
rm(list=ls())
library(sandwich)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
```r
rm(list=ls())

cadata = read.csv('Caschool.csv')

m = lm(testscr ~ str, data = cadata)
s.m = summary(m)
print(s.m)
```

```
## 
## Call:
## lm(formula = testscr ~ str, data = cadata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
```
    
  The intercept estimate is 698.9330 and the slope estimate on str is -2.2798. The adjusted R-squared is 0.04897 and the SER (residual standard error) is 18.58.
m$cov = vcovHC(m, type="HC1")
print(coeftest(m, vcov=m$cov))
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 698.93295   10.36436 67.4362 < 2.2e-16 ***
## str          -2.27981    0.51949 -4.3886 1.447e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  The (heteroskedasticity-robust) standard error is 0.51949 for str and 10.36436 for the intercept.
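
To make the HC1 standard errors less of a black box, here is a minimal sketch that builds the HC1 covariance matrix by hand in base R. It uses simulated data rather than the caschool file, so all variable names and numbers in it are hypothetical; the formula is the usual sandwich form with the n/(n-k) small-sample scaling.

```r
# Simulated data (hypothetical): the error variance grows with x
set.seed(1)
n <- 200
x <- runif(n, 14, 26)
y <- 700 - 2 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)
X <- model.matrix(fit)   # n x 2 design matrix (intercept, x)
e <- resid(fit)          # OLS residuals
k <- ncol(X)             # number of estimated coefficients

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # X' diag(e^2) X: each row of X is scaled by its residual
V_hc1 <- n / (n - k) * bread %*% meat %*% bread

se_hc1 <- sqrt(diag(V_hc1))        # heteroskedasticity-robust standard errors
se_ols <- sqrt(diag(vcov(fit)))    # classical (homoskedasticity-only) standard errors
print(rbind(classical = se_ols, HC1 = se_hc1))
```

If the sandwich package is available, `sqrt(diag(vcovHC(fit, type = "HC1")))` should agree with `se_hc1` on the same fitted model.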

b. Find the variable for the percentage of English learners. Add the variable as a control variable and run the regression:
$$
\begin{aligned}
    \widehat{TestScore}_i = \beta_0 + \beta_1 STR_i + \beta_3 PctEL + u_i
\end{aligned}
$$
Report the coefficient estimates, (heteroskedasticity robust) standard errors, adjusted $R^2$, and SER.
m1 = lm(testscr ~ str+el_pct, data = cadata)
s.m = summary(m1)
print(s.m)
## 
## Call:
## lm(formula = testscr ~ str + el_pct, data = cadata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.845 -10.240  -0.308   9.815  43.461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
## str          -1.10130    0.38028  -2.896  0.00398 ** 
## el_pct       -0.64978    0.03934 -16.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16
   The intercept estimate is 686.03225, and the slope estimates on str and el_pct are -1.10130 and -0.64978. The adjusted R-squared is 0.4237 and the residual standard error is 14.46.
m1$cov = vcovHC(m1, type="HC1")
print(coeftest(m1, vcov=m1$cov))
## 
## t test of coefficients:
## 
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 686.032249   8.728224  78.5993  < 2e-16 ***
## str          -1.101296   0.432847  -2.5443  0.01131 *  
## el_pct       -0.649777   0.031032 -20.9391  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  The (heteroskedasticity-robust) standard error is 0.031032 for el_pct, 0.432847 for str, and 8.728224 for the intercept.

c. Compare the coefficient estimate of $STR$ in (a) and (b). Does the direction of the omitted variable bias coincide with your intuition?

  Omitting the percentage of English learners produces a negative omitted variable bias: the coefficient estimate on str is -2.28 in part (a) and rises to -1.10 in part (b). This is the direction we would expect. Districts with larger classes tend to have a higher share of English learners, so str and el_pct are positively correlated, and according to the regression in (b) el_pct is negatively related to test scores. The str coefficient in part (a) therefore picks up part of the negative effect of el_pct, so the magnitude of the effect of str on test scores is overestimated when el_pct is not included.
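
The direction of the bias can be illustrated with the in-sample omitted variable identity: the short-regression slope equals the long-regression slope plus the coefficient on the omitted variable times the slope from regressing the omitted variable on the included one. The sketch below uses simulated data, with `x1` and `x2` as hypothetical stand-ins for str and el_pct; none of the numbers come from the caschool data.

```r
# Simulated data (hypothetical stand-ins for STR and PctEL)
set.seed(42)
n  <- 500
x1 <- rnorm(n)                             # included regressor
x2 <- 0.5 * x1 + rnorm(n)                  # omitted regressor, positively correlated with x1
y  <- 1 - 1.0 * x1 - 0.6 * x2 + rnorm(n)   # both regressors lower y

short <- unname(coef(lm(y ~ x1))["x1"])    # regression that omits x2
long  <- coef(lm(y ~ x1 + x2))             # regression that includes x2
delta <- unname(coef(lm(x2 ~ x1))["x1"])   # slope of omitted on included regressor

bias <- unname(long["x2"]) * delta         # omitted variable bias in the short slope
print(c(short = short, long = unname(long["x1"]), bias = bias))
```

With a positive `delta` and a negative coefficient on `x2`, the bias term is negative, which mirrors the pattern between parts (a) and (b).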

d. In addition to the percentage of English learners, we now use the percentage eligible for subsidized lunch as a control variable. Find the relevant variable from the data set and run the regression model. Report the results. Is the coefficient of lunch subsidy statistically significant?
$$
\begin{aligned}
    \widehat{TestScore}_i = \beta_0 + \beta_1 STR_i + \beta_3 PctEL + \beta_4 LchPct + u_i
\end{aligned}
$$
m2 = lm(testscr ~ str+el_pct+meal_pct, data = cadata)
s.m = summary(m2)
print(s.m)
## 
## Call:
## lm(formula = testscr ~ str + el_pct + meal_pct, data = cadata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.849  -5.151  -0.308   5.243  31.501 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 700.14996    4.68569 149.423  < 2e-16 ***
## str          -0.99831    0.23875  -4.181 3.54e-05 ***
## el_pct       -0.12157    0.03232  -3.762 0.000193 ***
## meal_pct     -0.54735    0.02160 -25.341  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.08 on 416 degrees of freedom
## Multiple R-squared:  0.7745, Adjusted R-squared:  0.7729 
## F-statistic: 476.3 on 3 and 416 DF,  p-value: < 2.2e-16
  The intercept is 700.14996, and the slopes on str, el_pct, and meal_pct are -0.99831, -0.12157, and -0.54735, respectively. The adjusted R-squared is 0.7729 and the residual standard error is 9.08.
    
m2$cov = vcovHC(m2, type="HC1")
print(coeftest(m2, vcov=m2$cov))
## 
## t test of coefficients:
## 
##               Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept) 700.149965   5.568450 125.7352 < 2.2e-16 ***
## str          -0.998309   0.270080  -3.6963 0.0002480 ***
## el_pct       -0.121573   0.032832  -3.7029 0.0002418 ***
## meal_pct     -0.547346   0.024107 -22.7046 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
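  The (heteroskedasticity-robust) standard errors for str, el_pct, and meal_pct are 0.270080, 0.032832, and 0.024107, respectively (5.568450 for the intercept). The coefficient on the lunch subsidy variable (meal_pct) is statistically significant: its robust t-statistic is -22.70, with a p-value below 2.2e-16.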
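
As a quick check of the significance claim, the robust t-statistic and its two-sided p-value can be recomputed from the reported estimate and standard error. This is only a sketch: the inputs are copied from the output above, and the 5% level is an assumed choice.

```r
# Values copied from the robust coefficient table for meal_pct
est <- -0.547346
se  <- 0.024107
df  <- 416                                # residual degrees of freedom of the regression

t_stat <- est / se                        # robust t-statistic
p_val  <- 2 * pt(-abs(t_stat), df = df)   # two-sided p-value from the t distribution
crit   <- qt(0.975, df = df)              # 5% two-sided critical value (assumed level)

print(c(t = t_stat, critical = crit, p = p_val))
```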
  3. Use the data file birthweight_smoking.csv for this question. A detailed description of the data set is given in birthweight_smoking_description.pdf. In this exercise, you will investigate the effect of smoking on baby’s birthweight.

    1. Regress \(Birthweight\) on \(Smoker\). What is the estimated effect of smoking on birth weight?
rm(list=ls())

## data 
dat = read.csv('birthweight_smoking.csv')
m = lm(birthweight~smoker, data = dat)
summary(m)
## 
## Call:
## lm(formula = birthweight ~ smoker, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3007.06  -313.06    26.94   366.94  2322.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3432.06      11.87 289.115   <2e-16 ***
## smoker       -253.23      26.95  -9.396   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared:  0.0286, Adjusted R-squared:  0.02828 
## F-statistic: 88.28 on 1 and 2998 DF,  p-value: < 2.2e-16
  The estimated effect of smoking on birth weight is -253.23 g: on average, babies of mothers who smoke weigh 253.23 g less than babies of mothers who do not smoke.
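
Because smoker is a binary regressor, the slope in this simple regression is just the difference between the two group means. A minimal sketch on simulated data (the variable names mirror the data set, but every number here is hypothetical):

```r
# Simulated data (hypothetical): birth weight in grams, smoker is a 0/1 dummy
set.seed(123)
n <- 1000
smoker      <- rbinom(n, 1, 0.2)
birthweight <- 3400 - 250 * smoker + rnorm(n, sd = 580)

slope      <- unname(coef(lm(birthweight ~ smoker))["smoker"])
diff_means <- mean(birthweight[smoker == 1]) - mean(birthweight[smoker == 0])

print(c(slope = slope, diff_means = diff_means))
```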

b. Regress $Birthweight$ on $Smoker, Alcohol,$ and $Nprevist$. 


```r
m1 = lm(birthweight ~ smoker+alcohol+nprevist, data=dat)
summary(m1)
```

```
## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2733.53  -307.57    21.42   358.09  2192.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3051.249     34.016  89.701  < 2e-16 ***
## smoker      -217.580     26.680  -8.155 5.07e-16 ***
## alcohol      -30.491     76.234  -0.400    0.689    
## nprevist      34.070      2.855  11.933  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 570.5 on 2996 degrees of freedom
## Multiple R-squared:  0.07285,    Adjusted R-squared:  0.07192 
## F-statistic: 78.47 on 3 and 2996 DF,  p-value: < 2.2e-16
```

    i. Explain why the exclusion of $Alcohol$ and $Nprevist$ could lead to omitted variable bias in the regression estimated in (a)

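     Omitting a variable causes omitted variable bias when the variable is (1) correlated with the included regressor and (2) a determinant of the dependent variable. Alcohol and Nprevist are plausibly correlated with smoking, and both may affect birth weight. Consistent with this, the slope on smoker is -253.23 in part (a) and becomes -217.58 after alcohol and Nprevist are added. In the regression in (b), Nprevist is strongly related to birth weight (p < 2e-16), while the coefficient on alcohol is not statistically significant (p = 0.689). Excluding these variables, Nprevist in particular, could therefore lead to omitted variable bias in (a).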
    
    ii. Is the estimated effect of smoking on birth weight substantially different from the regression that excludes $Alcohol$ and $Nprevist$? Does the regression in (a) seem to suffer from omitted variable bias?
    
      Yes. The estimated effect of smoking on birth weight changes from -253.23 g in (a) to -217.58 g in (b), a reduction in magnitude of about 36 g (roughly 14%), once $Alcohol$ and $Nprevist$ are included. The regression in (a) therefore seems to suffer from omitted variable bias.
    
    iii. Jane smoked during her pregnancy, did not drink alcohol, and had 8 prenatal care visits. Use the regression to predict the birth weight of Jane's child.
    
       Predicted Birthweight = 3051.249 - 217.580 × Smoker - 30.491 × Alcohol + 34.070 × Nprevist. With Smoker=1, Alcohol=0, and Nprevist=8, the predicted birth weight is 3051.249 - 217.580 + 34.070 × 8 = 3106.229 g.
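
The same prediction can be written as an inner product of the coefficient vector and Jane's regressor values. The coefficients below are copied from the regression output in (b); the vector names are illustrative only.

```r
# Coefficients copied from the regression output in (b)
b <- c(intercept = 3051.249, smoker = -217.580, alcohol = -30.491, nprevist = 34.070)

# Jane: smoker, no alcohol, 8 prenatal visits (the leading 1 multiplies the intercept)
jane <- c(1, 1, 0, 8)

pred <- sum(b * jane)   # predicted birth weight in grams
print(pred)
```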
       
    iv. Compute $R^2$ and $\overline{R}^2$. Why are they so similar?
    
        According to the regression output in (b), $R^2$ is 0.07285 and $\overline{R}^2$ is 0.07192.
        The residual degrees of freedom are 2996, which equals n-k-1, and k=3 because there are three regressors in the model. From 2996=n-3-1 we get n=3000, so the sample size is 3000. By definition, $\overline{R}^2 = 1-\frac{n-1}{n-k-1}\frac{SSR}{TSS}$ and $R^2 = 1-\frac{SSR}{TSS}$, so the only difference between them is the factor (n-1)/(n-k-1).
        With n=3000 and k=3, this factor is 2999/2996=1.00100134, which is very close to 1 because n is large relative to k; therefore, the two measures are very similar.
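
A short numerical check of this relationship, using the $R^2$ and degrees of freedom printed in the summary output for regression (b) (variable names are illustrative):

```r
# Values taken from the summary output of regression (b)
r2    <- 0.07285
df    <- 2996      # residual degrees of freedom, n - k - 1
k     <- 3
n     <- df + k + 1

factor <- (n - 1) / (n - k - 1)       # degrees-of-freedom adjustment
adj_r2 <- 1 - factor * (1 - r2)       # adjusted R^2 implied by R^2

print(c(n = n, factor = factor, adj_r2 = adj_r2))
```

The implied adjusted $R^2$ should round to the 0.07192 reported by R, up to rounding of the inputs.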
    
    v. How should you interpret the coefficient on $Nprevist$? Does the coefficient measure a causal effect of prenatal visits on birth weight? If not, what does it measure?
    
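        We should interpret the coefficient (34.070) as the partial effect of Nprevist, holding the other regressors (smoker and alcohol) fixed: one additional prenatal visit is associated with a 34.070 g higher birth weight. The coefficient probably does not measure a purely causal effect of prenatal visits on birth weight, because the number of visits is likely correlated with factors left in the error term, such as the mother's health, income, and access to health care. It is better read as a partial association that also picks up the influence of these factors.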
    
c. Estimate the coefficient on $Smoking$ for the multiple regression model in (b), using the three-step process of the Frisch-Waugh theorem. Verify that the three-step process yields the same estimated coefficient for $Smoking$ as that obtained in (b). 

        Step 1: 
m1 = lm(smoker ~ alcohol + nprevist, data=dat)
tilde.smoker = resid(m1)
        Step 2: 
m2 = lm(birthweight ~ alcohol + nprevist, data=dat)
tilde.birthweight = resid(m2)
        Step 3: 
m3 = lm(tilde.birthweight ~ tilde.smoker)
s.m3 = summary(m3)
print(s.m3)
## 
## Call:
## lm(formula = tilde.birthweight ~ tilde.smoker)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2733.53  -307.57    21.42   358.09  2192.70 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.346e-13  1.041e+01   0.000        1    
## tilde.smoker -2.176e+02  2.667e+01  -8.158 4.95e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 570.3 on 2998 degrees of freedom
## Multiple R-squared:  0.02172,    Adjusted R-squared:  0.02139 
## F-statistic: 66.55 on 1 and 2998 DF,  p-value: 4.955e-16
        Therefore, we see that the coefficient on $\tilde{smoker}$ in Step 3 (-217.58) equals the OLS coefficient on $smoker$ in the original multiple regression model in (b).
        
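m.full = lm(birthweight ~ smoker + alcohol + nprevist, data=dat)  # refit the model from (b), since m1 was overwritten in Step 1
cat("Original Coefficient: ", round(coef(m.full)["smoker"],4),"\n")
## Original Coefficient:  -217.5801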
cat("Step 3   Coefficient: ", round(coef(m3)[2],4),"\n")
## Step 3   Coefficient:  -217.5801
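
The equality can also be checked programmatically rather than by eye. The sketch below repeats the three steps on simulated data (`z1`, `z2`, and `x` are hypothetical stand-ins for alcohol, nprevist, and smoker) and additionally shows why the Step 3 standard error (26.67) is slightly smaller than the one in (b) (26.68): both regressions have the same residuals, but Step 3 divides the sum of squared residuals by n-2 instead of n-k-1.

```r
# Simulated data (hypothetical stand-ins for the birth weight variables)
set.seed(7)
n  <- 300
z1 <- rbinom(n, 1, 0.1)                           # stand-in for alcohol
z2 <- rpois(n, 10)                                # stand-in for nprevist
x  <- rbinom(n, 1, plogis(-1 + z1 - 0.05 * z2))   # stand-in for smoker
y  <- 3000 - 200 * x - 30 * z1 + 30 * z2 + rnorm(n, sd = 500)
k  <- 3                                           # regressors in the full model

full  <- lm(y ~ x + z1 + z2)          # multiple regression
x_res <- resid(lm(x ~ z1 + z2))       # Step 1: partial the controls out of x
y_res <- resid(lm(y ~ z1 + z2))       # Step 2: partial the controls out of y
fwl   <- lm(y_res ~ x_res)            # Step 3: regress residuals on residuals

b_full  <- unname(coef(full)["x"])
b_fwl   <- unname(coef(fwl)["x_res"])
se_full <- coef(summary(full))["x", "Std. Error"]
se_fwl  <- coef(summary(fwl))["x_res", "Std. Error"]

print(all.equal(b_full, b_fwl))                                     # same coefficient
print(all.equal(se_full, se_fwl * sqrt((n - 2) / (n - k - 1))))     # SEs differ only by the df correction
```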