FINAL EXAM

Loading package:

library(dplyr)
library(dslabs)
library(ISLR2)
library(matlib)
library(wooldridge)
library(car)
library(lmtest)
library(sandwich)

Chapter 7

1. SLEEP75

data(sleep75)
model1 <- lm(sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)
summary(model1)

## 
## Call:
## lm(formula = sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2378.00  -243.29     6.74   259.24  1350.19 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3840.83197  235.10870  16.336   <2e-16 ***
## totwrk        -0.16342    0.01813  -9.013   <2e-16 ***
## educ         -11.71332    5.86689  -1.997   0.0463 *  
## age           -8.69668   11.20746  -0.776   0.4380    
## I(age^2)       0.12844    0.13390   0.959   0.3378    
## male          87.75243   34.32616   2.556   0.0108 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 417.7 on 700 degrees of freedom
## Multiple R-squared:  0.1228, Adjusted R-squared:  0.1165 
## F-statistic: 19.59 on 5 and 700 DF,  p-value: < 2.2e-16

The male coefficient stands at 87.75, suggesting that, on average, men sleep approximately one and a half hours more per week compared to women in a similar situation. The t-value for male, roughly 2.56, is quite close to the critical value of 2.58 for a two-sided alternative, indicating strong evidence for a gender-based difference.
The t statistic for totwrk is around -9.06, signifying high statistical significance. The coefficient suggests that an additional hour of work correlates with roughly 9.8 minutes less sleep.

model1 <- lm(sleep ~ totwrk + educ + male, data = sleep75)
summary(model1)

## 
## Call:
## lm(formula = sleep ~ totwrk + educ + male, data = sleep75)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2380.27  -239.15     6.74   257.31  1370.63 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3747.51727   81.00609  46.262  < 2e-16 ***
## totwrk        -0.16734    0.01794  -9.329  < 2e-16 ***
## educ         -13.88479    5.65757  -2.454  0.01436 *  
## male          90.96919   34.27441   2.654  0.00813 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 418 on 702 degrees of freedom
## Multiple R-squared:  0.1193, Adjusted R-squared:  0.1155 
## F-statistic: 31.69 on 3 and 702 DF,  p-value: < 2.2e-16

The R^2 is pretty the same, when we run a regression excluding both the ‘age’ and ‘age squared’ terms. In a model containing both these terms, age’s influence would only be considered null if the coefficients for both age and age squared equate to zero.

3. GPA 2

data("gpa2")
model2 <- lm(sat ~ hsize + I(hsize^2) + female + black + I(female*black), data = gpa2)
summary(model2)

## 
## Call:
## lm(formula = sat ~ hsize + I(hsize^2) + female + black + I(female * 
##     black), data = gpa2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -570.45  -89.54   -5.24   85.41  479.13 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1028.0972     6.2902 163.445  < 2e-16 ***
## hsize               19.2971     3.8323   5.035 4.97e-07 ***
## I(hsize^2)          -2.1948     0.5272  -4.163 3.20e-05 ***
## female             -45.0915     4.2911 -10.508  < 2e-16 ***
## black             -169.8126    12.7131 -13.357  < 2e-16 ***
## I(female * black)   62.3064    18.1542   3.432 0.000605 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 133.4 on 4131 degrees of freedom
## Multiple R-squared:  0.08578,    Adjusted R-squared:  0.08468 
## F-statistic: 77.52 on 5 and 4131 DF,  p-value: < 2.2e-16

The absolute value of the t statistic for hsize^2 exceeds four, providing robust evidence for its relevance in the equation. This determination is achieved by identifying the turning point, which maximizes satˆ (keeping other factors constant): 19.3/(2⋅ 2.19) ≈ 4.41. Given that hsize is scaled in hundreds, the optimal graduating class size is estimated to be around 441.
The difference in SAT scores for nonblack females, as indicated by the coefficient on female (given black = 0), is approximately 45 points lower compared to nonblack males. With a t statistic of roughly -10.51, this difference holds high statistical significance, possibly amplified by the considerably large sample size.
When considering female = 0, the coefficient on black implies that a black male’s estimated SAT score is nearly 170 points lower than that of a comparable nonblack male. The t statistic surpasses 13 in absolute value, decisively rejecting the hypothesis that there exists no ceteris paribus difference.
By substituting black = 1, female = 1 for black females, and black = 0, female = 1 for nonblack females, the resulting difference calculates to -169.81 + 62.31 = -107.50. As this estimation relies on two coefficients, generating a t statistic directly from this information isn’t feasible. A simpler method involves creating dummy variables for three out of the four race/gender categories and setting nonblack females as the reference group. Then, we can derive the desired t statistic from the coefficient related to the black female dummy variable.

C1. GPA 1

data("gpa1")
model3 <- lm (colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll, data = gpa1)
summary (model3)

## 
## Call:
## lm(formula = colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll, 
##     data = gpa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78149 -0.25726 -0.02121  0.24691  0.74432 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.255554   0.335392   3.744 0.000268 ***
## PC           0.151854   0.058716   2.586 0.010762 *  
## hsGPA        0.450220   0.094280   4.775 4.61e-06 ***
## ACT          0.007724   0.010678   0.723 0.470688    
## mothcoll    -0.003758   0.060270  -0.062 0.950376    
## fathcoll     0.041800   0.061270   0.682 0.496265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3344 on 135 degrees of freedom
## Multiple R-squared:  0.2222, Adjusted R-squared:  0.1934 
## F-statistic: 7.713 on 5 and 135 DF,  p-value: 2.083e-06

The impact of PC remains largely consistent with equation (7.6), maintaining substantial significance with a t(pc) value approximately around 2.58.

linearHypothesis(model3, c("mothcoll = 0", "fathcoll = 0"))

## Linear hypothesis test
## 
## Hypothesis:
## mothcoll = 0
## fathcoll = 0
## 
## Model 1: restricted model
## Model 2: colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    137 15.149                           
## 2    135 15.094  2  0.054685 0.2446 0.7834

Conducting the F test to assess the combined significance of mothcoll and fathcoll, with 2 and 135 degrees of freedom, yields a result of about 0.24 with a p-value close to 0.78. This joint insignificance of these variables doesn’t notably affect the estimates of the other coefficients when included in the regression.

model3 <- lm (colGPA ~ PC + hsGPA + I(hsGPA^2) + ACT + mothcoll + fathcoll, data = gpa1)
summary (model3)

## 
## Call:
## lm(formula = colGPA ~ PC + hsGPA + I(hsGPA^2) + ACT + mothcoll + 
##     fathcoll, data = gpa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78998 -0.24327 -0.00648  0.26179  0.72231 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  5.040328   2.443038   2.063   0.0410 *
## PC           0.140446   0.058858   2.386   0.0184 *
## hsGPA       -1.802520   1.443552  -1.249   0.2140  
## I(hsGPA^2)   0.337341   0.215711   1.564   0.1202  
## ACT          0.004786   0.010786   0.444   0.6580  
## mothcoll     0.003091   0.060110   0.051   0.9591  
## fathcoll     0.062761   0.062401   1.006   0.3163  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3326 on 134 degrees of freedom
## Multiple R-squared:  0.2361, Adjusted R-squared:  0.2019 
## F-statistic: 6.904 on 6 and 134 DF,  p-value: 2.088e-06

Upon including hsGPA^2 in the regression, its coefficient stands at approximately 0.337, with a t-statistic of around 1.56. The coefficient for hsGPA itself is roughly -1.803. This situation borders on the edge of significance. The quadratic relationship in hsGPA presents a U-shape, manifesting around hsGPA* = 2.68, which poses challenges in interpretation. Despite the addition of hsGPA^2 serving as a straightforward robustness check, the primary finding regarding the coefficient of PC diminishes to about 0.140 but remains statistically significant.

C2: WAGE2

data("wage2")
model4 <- lm(log(wage) ~ educ +exper +tenure + married + black + south + urban, data = wage2)
summary(model4)

## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black + 
##     south + urban, data = wage2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98069 -0.21996  0.00707  0.24288  1.22822 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.395497   0.113225  47.653  < 2e-16 ***
## educ         0.065431   0.006250  10.468  < 2e-16 ***
## exper        0.014043   0.003185   4.409 1.16e-05 ***
## tenure       0.011747   0.002453   4.789 1.95e-06 ***
## married      0.199417   0.039050   5.107 3.98e-07 ***
## black       -0.188350   0.037667  -5.000 6.84e-07 ***
## south       -0.090904   0.026249  -3.463 0.000558 ***
## urban        0.183912   0.026958   6.822 1.62e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3655 on 927 degrees of freedom
## Multiple R-squared:  0.2526, Adjusted R-squared:  0.2469 
## F-statistic: 44.75 on 7 and 927 DF,  p-value: < 2.2e-16

The black coefficient indicates that, under constant levels of other factors, black men earn roughly 18.8% less compared to nonblack men. With a t statistic of approximately -4.95, this difference holds significant statistical importance.

model4 <- lm(log(wage) ~ educ +exper + I(exper^2) + tenure + I(tenure^2)
             + married + black + south + urban, data = wage2)
summary(model4)

## 
## Call:
## lm(formula = log(wage) ~ educ + exper + I(exper^2) + tenure + 
##     I(tenure^2) + married + black + south + urban, data = wage2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98236 -0.21972 -0.00036  0.24078  1.25127 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.3586756  0.1259143  42.558  < 2e-16 ***
## educ         0.0642761  0.0063115  10.184  < 2e-16 ***
## exper        0.0172146  0.0126138   1.365 0.172665    
## I(exper^2)  -0.0001138  0.0005319  -0.214 0.830622    
## tenure       0.0249291  0.0081297   3.066 0.002229 ** 
## I(tenure^2) -0.0007964  0.0004710  -1.691 0.091188 .  
## married      0.1985470  0.0391103   5.077 4.65e-07 ***
## black       -0.1906636  0.0377011  -5.057 5.13e-07 ***
## south       -0.0912153  0.0262356  -3.477 0.000531 ***
## urban        0.1854241  0.0269585   6.878 1.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3653 on 925 degrees of freedom
## Multiple R-squared:  0.255,  Adjusted R-squared:  0.2477 
## F-statistic: 35.17 on 9 and 925 DF,  p-value: < 2.2e-16

linearHypothesis(model4, c("I(exper^2)=0", "I(tenure^2)=0" ))

## Linear hypothesis test
## 
## Hypothesis:
## I(exper^2) = 0
## I(tenure^2) = 0
## 
## Model 1: restricted model
## Model 2: log(wage) ~ educ + exper + I(exper^2) + tenure + I(tenure^2) + 
##     married + black + south + urban
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    927 123.82                           
## 2    925 123.42  2   0.39756 1.4898  0.226

The F statistic evaluating the joint significance of exper^2 and tenure^2, with 2 and 925 degrees of freedom, equates to approximately 1.49, resulting in a p-value around 0.226. Since the p-value exceeds 0.20, these quadratic terms are collectively insignificant at the 20% significance level.

model4 <- lm(log(wage) ~ educ +exper +tenure + married 
             + black + south + urban + I(black*educ), data = wage2)
summary(model4)

## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black + 
##     south + urban + I(black * educ), data = wage2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97782 -0.21832  0.00475  0.24136  1.23226 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.374817   0.114703  46.859  < 2e-16 ***
## educ             0.067115   0.006428  10.442  < 2e-16 ***
## exper            0.013826   0.003191   4.333 1.63e-05 ***
## tenure           0.011787   0.002453   4.805 1.80e-06 ***
## married          0.198908   0.039047   5.094 4.25e-07 ***
## black            0.094809   0.255399   0.371 0.710561    
## south           -0.089450   0.026277  -3.404 0.000692 ***
## urban            0.183852   0.026955   6.821 1.63e-11 ***
## I(black * educ) -0.022624   0.020183  -1.121 0.262603    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3654 on 926 degrees of freedom
## Multiple R-squared:  0.2536, Adjusted R-squared:  0.2471 
## F-statistic: 39.32 on 8 and 926 DF,  p-value: < 2.2e-16

Upon introducing the interaction black ⋅ educ into the equation from the first part, the coefficient for this interaction measures roughly -0.0226 (standard error ≈ 0.0202). Consequently, the estimated point suggests that the increase in returns per year of education for black men is around 2.3 percentage points lower than for nonblack men (approximately 6.7%). While this discrepancy is noteworthy if reflective of population differences, the t statistic stands at approximately 1.12 in absolute value, insufficient to reject the null hypothesis that the relationship between education and returns remains independent of race.

Chapter 8

1

and (iii) In Chapter 5, the homoskedasticity assumption wasn’t crucial for demonstrating OLS consistency. However, it’s recognized that heteroskedasticity disrupts the validity of statistical inference using typical t and F statistics, even with large sample sizes. Since heteroskedasticity breaches the Gauss-Markov assumptions, OLS ceases to be Best Linear Unbiased Estimator (BLUE).

5

No. The standard errors for each coefficient, whether usual or heteroskedasticity-robust, show very similar practical results.
The impact translates to −.029(4) = −.116, indicating a decrease in the probability of smoking by about .116.
As is customary, we determine the turning point within the quadratic: .020/[2(.00026)] ≈ 38.46, roughly around 38 and a half years.
With other variables in the equation held constant, an individual in a state with restaurant smoking restrictions faces a reduction of .101 in the likelihood of smoking. This resembles the impact of obtaining four more years of education.
smokes=−.656+.069xlog(67.44)+.012xlog(6,500)−.029x(16)+.020x(77)+.00026x(77^2 )−.0052. Hence, the estimated smoking probability for this individual aligns closely with zero. (In fact, this individual isn’t a smoker, validating the equation’s prediction for this specific observation.)

C4

data ("vote1")
model5 <- lm (voteA ~ prtystrA + democA + lexpendA + lexpendB, data = vote1)
summary(model5)

## 
## Call:
## lm(formula = voteA ~ prtystrA + democA + lexpendA + lexpendB, 
##     data = vote1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.576  -4.864  -1.146   4.903  24.566 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.66142    4.73604   7.952 2.56e-13 ***
## prtystrA     0.25192    0.07129   3.534  0.00053 ***
## democA       3.79294    1.40652   2.697  0.00772 ** 
## lexpendA     5.77929    0.39182  14.750  < 2e-16 ***
## lexpendB    -6.23784    0.39746 -15.694  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.573 on 168 degrees of freedom
## Multiple R-squared:  0.8012, Adjusted R-squared:  0.7964 
## F-statistic: 169.2 on 4 and 168 DF,  p-value: < 2.2e-16

Estimated equation: voteA = 37.66 + .252 prtystrA + 3.793 democA + 5.779 log(expendA) − 6.238 log(expendB) + u (4.74) (.071) (1.407) (0.392) (0.397)

When regressing the u on all explanatory variables, we find an R-squared close to zero due to rounding errors in computer output. OLS operates by selecting estimates, ˆβj, ensuring that residuals are uncorrelated with each independent variable and maintain a zero sample average.

bptest(model5)

## 
##  studentized Breusch-Pagan test
## 
## data:  model5
## BP = 9.0934, df = 4, p-value = 0.05881

The F statistic for joint significant (with 4 and 168 df) is about 2.33 with p-value ≈ .058. Therefore, there is some evidence of heteroskedasticity, but not quite at the 5% level.

bptest (model5, ~ prtystrA*democA*lexpendA*lexpendB + I(prtystrA^2) + I(democA^2) + I(lexpendA^2) + I(lexpendB^2), data = vote1)

## 
##  studentized Breusch-Pagan test
## 
## data:  model5
## BP = 43.327, df = 18, p-value = 0.0007194

The F test, with 2 and 170 df, is about 2.79 with p-value ≈ .065. This is slightly less evidence of heteroskedasticity than provided by the B-P test, but the conclusion is very similar.

C13: FERTIL2

data("fertil2")
model6 <- lm(children ~ age + agesq + educ + electric + urban, data = fertil2 )
summary(model6)

## 
## Call:
## lm(formula = children ~ age + agesq + educ + electric + urban, 
##     data = fertil2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9012 -0.7136 -0.0039  0.7119  7.4318 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.2225162  0.2401888 -17.580  < 2e-16 ***
## age          0.3409255  0.0165082  20.652  < 2e-16 ***
## agesq       -0.0027412  0.0002718 -10.086  < 2e-16 ***
## educ        -0.0752323  0.0062966 -11.948  < 2e-16 ***
## electric    -0.3100404  0.0690045  -4.493 7.20e-06 ***
## urban       -0.2000339  0.0465062  -4.301 1.74e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 4352 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.5734, Adjusted R-squared:  0.5729 
## F-statistic:  1170 on 5 and 4352 DF,  p-value: < 2.2e-16

coeftest(model6, vcov=vcovHC(model6, type = "HC0"))

## 
## t test of coefficients:
## 
##                Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept) -4.22251623  0.24368307 -17.3279 < 2.2e-16 ***
## age          0.34092552  0.01916146  17.7923 < 2.2e-16 ***
## agesq       -0.00274121  0.00035027  -7.8260 6.278e-15 ***
## educ        -0.07523232  0.00630336 -11.9353 < 2.2e-16 ***
## electric    -0.31004041  0.06390411  -4.8517 1.267e-06 ***
## urban       -0.20003386  0.04543962  -4.4022 1.097e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Some robust standard are smaller than the non-robust standard errors.

bptest(model6)

## 
##  studentized Breusch-Pagan test
## 
## data:  model6
## BP = 1065, df = 5, p-value < 2.2e-16

Ho: There is no heteroskedasticity H1: There is heteroskedasticity

In conclusion, as the p-value falls below alpha at a 95% confidence level, there exists substantial statistical evidence to reject the null hypothesis, indicating the presence of heteroskedasticity within the model.

usqr=(summary(model6)$residual)^2
yhat=model6$fitted.values
yhatsqr=model6$fitted.values^2
model6WT=summary(lm(usqr~yhat+yhatsqr,data=fertil2))$r.squared
model6WT*4361

## [1] 1090.862

1-pchisq(1093.183,2)

## [1] 0

Ho: There is no heteroskedasticity H1: There is heteroskedasticity

As the p-value is below alpha at a 95% confidence level, there exists adequate statistical evidence to reject the null hypothesis, signifying the presence of heteroskedasticity in the model.

No, since the difference between the robust and non robust standar errors was really small.

Chapter 9

1: CEO Salary2

If β 6 or β 7, denoting the population parameters on ‘ceoten2’ and ‘comten2’, respectively, do not equal zero, it suggests a potential mismatch in the functional form. To evaluate this, we conduct a joint significance test using the R-squared form of the F test: F = [(.375 − .353)/(1 − .375)][(177 – 8)/2] ≈ 2.97. With degrees of freedom 2 and approaching infinity, the 10% critical value is 2.30, and the 5% critical value is 3.00. Consequently, the p-value slightly exceeds .05, indicating reasonable evidence of functional form mismatch. However, whether this influences the estimated partial effects for different levels of the explanatory variables is a separate consideration.

5

The sample selection could be endogenous here. If colleges with high crime rates are inclined to conceal crime statistics due to their influence on prospective students, the likelihood of being in the sample is inversely linked to the crime equation’s u. In essence, higher u, indicating more crime for a given school size, leads to a reduced probability of the school disclosing its crime data.

C3: JTRAIN

data("jtrain")
jtrain <- jtrain %>% filter(year == 1988)
model7 <- lm(lscrap ~ grant, data = jtrain)
summary(model7)

## 
## Call:
## lm(formula = lscrap ~ grant, data = jtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4043 -0.9536 -0.0465  0.9636  2.8103 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.4085     0.2406   1.698   0.0954 .
## grant         0.0566     0.4056   0.140   0.8895  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.423 on 52 degrees of freedom
##   (103 observations deleted due to missingness)
## Multiple R-squared:  0.0003744,  Adjusted R-squared:  -0.01885 
## F-statistic: 0.01948 on 1 and 52 DF,  p-value: 0.8895

If grants were allocated to firms based on their characteristics or those of their workers, it’s plausible for the grant to be correlated with factors influencing productivity, which are encapsulated within u in the basic regression model.
log(scrap) = 0.406 + 0.057*grant (0.24) (0.405) n = 54, R^2 = 0.00003 The coefficient on grant is actually positive, but not statistically different from zero.

model8 <- lm(lscrap ~ grant + lscrap_1, data = jtrain)
summary(model8)

## 
## Call:
## lm(formula = lscrap ~ grant + lscrap_1, data = jtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9146 -0.1763  0.0057  0.2308  1.5991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02124    0.08910   0.238   0.8126    
## grant       -0.25397    0.14703  -1.727   0.0902 .  
## lscrap_1     0.83116    0.04444  18.701   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5127 on 51 degrees of freedom
##   (103 observations deleted due to missingness)
## Multiple R-squared:  0.8728, Adjusted R-squared:  0.8678 
## F-statistic: 174.9 on 2 and 51 DF,  p-value: < 2.2e-16

The t statistic for H0: β grant = 0 is -1.73. We use the 5% critical value for 40 df in Table G.2: -1.68. Because t = −1.73 < −1.68, we reject H0 in favor of H1: β grant < 0 at the 5% level.
The t statistic is (.831 – 1)/.044≈ −3.84, which is a strong rejection of H0

coeftest(model7, vcov = vcovHC(model7, type = "HC0"))

## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.40853    0.25801  1.5833   0.1194
## grant        0.05660    0.36390  0.1555   0.8770

coeftest(model8, vcov = vcovHC(model8, type = "HC0"))

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  0.021237   0.097032  0.2189   0.82763    
## grant       -0.253970   0.142249 -1.7854   0.08014 .  
## lscrap_1     0.831161   0.071469 11.6297 5.862e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

, the t statistic for grant88 is −1.79, so the coefficient is even more significantly less than zero when we use the heteroskedasticity-robust standard error. The t statistic for H0: log(scrap87) which β = 1 is (.831 – 1)/.071 ≈−2.38, which is notably smaller than before, but it is still pretty significant.

C4: INFMRT

data ("infmrt")
model9 <- lm(infmort ~ lpcinc + lphysic + lpopul + DC, data = infmrt ) 
summary(model9)

## 
## Call:
## lm(formula = infmort ~ lpcinc + lphysic + lpopul + DC, data = infmrt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6894 -0.8973 -0.1412  0.7054  3.1705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.2125     6.7267   5.383 5.09e-07 ***
## lpcinc       -2.3964     0.8866  -2.703  0.00812 ** 
## lphysic      -1.5548     0.7734  -2.010  0.04718 *  
## lpopul        0.5755     0.1365   4.215 5.60e-05 ***
## DC           13.9632     1.2466  11.201  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.255 on 97 degrees of freedom
## Multiple R-squared:  0.6432, Adjusted R-squared:  0.6285 
## F-statistic: 43.72 on 4 and 97 DF,  p-value: < 2.2e-16

The coefficient for DC implies that if a state shared the exact per capita income, per capita physicians, and population as Washington D.C., the predicted infant mortality rate for D.C. would be roughly 16 deaths per 1000 live births higher. This signifies a substantial and notable “D.C. effect.”
In the regression from part (i), including or excluding a single observation, such as D.C., yields matching coefficients and standard errors for the other variables. However, the R-squared values differ notably because predicting the infant mortality rate perfectly for D.C. impacts these metrics. Verifying that the residual for the D.C. observation is zero is advisable.

Final_exam_FI

Vu Hong Ha

2023-12-29