Loading package:
library(dplyr)
library(dslabs)
library(ISLR2)
library(matlib)
library(wooldridge)
library(car)
library(lmtest)
library(sandwich)
data(sleep75)
model1 <- lm(sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)
summary(model1)
##
## Call:
## lm(formula = sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2378.00 -243.29 6.74 259.24 1350.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3840.83197 235.10870 16.336 <2e-16 ***
## totwrk -0.16342 0.01813 -9.013 <2e-16 ***
## educ -11.71332 5.86689 -1.997 0.0463 *
## age -8.69668 11.20746 -0.776 0.4380
## I(age^2) 0.12844 0.13390 0.959 0.3378
## male 87.75243 34.32616 2.556 0.0108 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 417.7 on 700 degrees of freedom
## Multiple R-squared: 0.1228, Adjusted R-squared: 0.1165
## F-statistic: 19.59 on 5 and 700 DF, p-value: < 2.2e-16
model1 <- lm(sleep ~ totwrk + educ + male, data = sleep75)
summary(model1)
##
## Call:
## lm(formula = sleep ~ totwrk + educ + male, data = sleep75)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2380.27 -239.15 6.74 257.31 1370.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3747.51727 81.00609 46.262 < 2e-16 ***
## totwrk -0.16734 0.01794 -9.329 < 2e-16 ***
## educ -13.88479 5.65757 -2.454 0.01436 *
## male 90.96919 34.27441 2.654 0.00813 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 418 on 702 degrees of freedom
## Multiple R-squared: 0.1193, Adjusted R-squared: 0.1155
## F-statistic: 31.69 on 3 and 702 DF, p-value: < 2.2e-16
data("gpa2")
model2 <- lm(sat ~ hsize + I(hsize^2) + female + black + I(female*black), data = gpa2)
summary(model2)
##
## Call:
## lm(formula = sat ~ hsize + I(hsize^2) + female + black + I(female *
## black), data = gpa2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -570.45 -89.54 -5.24 85.41 479.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1028.0972 6.2902 163.445 < 2e-16 ***
## hsize 19.2971 3.8323 5.035 4.97e-07 ***
## I(hsize^2) -2.1948 0.5272 -4.163 3.20e-05 ***
## female -45.0915 4.2911 -10.508 < 2e-16 ***
## black -169.8126 12.7131 -13.357 < 2e-16 ***
## I(female * black) 62.3064 18.1542 3.432 0.000605 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 133.4 on 4131 degrees of freedom
## Multiple R-squared: 0.08578, Adjusted R-squared: 0.08468
## F-statistic: 77.52 on 5 and 4131 DF, p-value: < 2.2e-16
The absolute value of the t statistic for hsize^2 exceeds four, providing robust evidence for its relevance in the equation. This determination is achieved by identifying the turning point, which maximizes satˆ (keeping other factors constant): 19.3/(2⋅ 2.19) ≈ 4.41. Given that hsize is scaled in hundreds, the optimal graduating class size is estimated to be around 441.
The difference in SAT scores for nonblack females, as indicated by the coefficient on female (given black = 0), is approximately 45 points lower compared to nonblack males. With a t statistic of roughly -10.51, this difference holds high statistical significance, possibly amplified by the considerably large sample size.
When considering female = 0, the coefficient on black implies that a black male’s estimated SAT score is nearly 170 points lower than that of a comparable nonblack male. The t statistic surpasses 13 in absolute value, decisively rejecting the hypothesis that there exists no ceteris paribus difference.
By substituting black = 1, female = 1 for black females, and black = 0, female = 1 for nonblack females, the resulting difference calculates to -169.81 + 62.31 = -107.50. As this estimation relies on two coefficients, generating a t statistic directly from this information isn’t feasible. A simpler method involves creating dummy variables for three out of the four race/gender categories and setting nonblack females as the reference group. Then, we can derive the desired t statistic from the coefficient related to the black female dummy variable.
data("gpa1")
model3 <- lm (colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll, data = gpa1)
summary (model3)
##
## Call:
## lm(formula = colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll,
## data = gpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78149 -0.25726 -0.02121 0.24691 0.74432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.255554 0.335392 3.744 0.000268 ***
## PC 0.151854 0.058716 2.586 0.010762 *
## hsGPA 0.450220 0.094280 4.775 4.61e-06 ***
## ACT 0.007724 0.010678 0.723 0.470688
## mothcoll -0.003758 0.060270 -0.062 0.950376
## fathcoll 0.041800 0.061270 0.682 0.496265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3344 on 135 degrees of freedom
## Multiple R-squared: 0.2222, Adjusted R-squared: 0.1934
## F-statistic: 7.713 on 5 and 135 DF, p-value: 2.083e-06
linearHypothesis(model3, c("mothcoll = 0", "fathcoll = 0"))
## Linear hypothesis test
##
## Hypothesis:
## mothcoll = 0
## fathcoll = 0
##
## Model 1: restricted model
## Model 2: colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 137 15.149
## 2 135 15.094 2 0.054685 0.2446 0.7834
model3 <- lm (colGPA ~ PC + hsGPA + I(hsGPA^2) + ACT + mothcoll + fathcoll, data = gpa1)
summary (model3)
##
## Call:
## lm(formula = colGPA ~ PC + hsGPA + I(hsGPA^2) + ACT + mothcoll +
## fathcoll, data = gpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78998 -0.24327 -0.00648 0.26179 0.72231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.040328 2.443038 2.063 0.0410 *
## PC 0.140446 0.058858 2.386 0.0184 *
## hsGPA -1.802520 1.443552 -1.249 0.2140
## I(hsGPA^2) 0.337341 0.215711 1.564 0.1202
## ACT 0.004786 0.010786 0.444 0.6580
## mothcoll 0.003091 0.060110 0.051 0.9591
## fathcoll 0.062761 0.062401 1.006 0.3163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3326 on 134 degrees of freedom
## Multiple R-squared: 0.2361, Adjusted R-squared: 0.2019
## F-statistic: 6.904 on 6 and 134 DF, p-value: 2.088e-06
data("wage2")
model4 <- lm(log(wage) ~ educ +exper +tenure + married + black + south + urban, data = wage2)
summary(model4)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98069 -0.21996 0.00707 0.24288 1.22822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.395497 0.113225 47.653 < 2e-16 ***
## educ 0.065431 0.006250 10.468 < 2e-16 ***
## exper 0.014043 0.003185 4.409 1.16e-05 ***
## tenure 0.011747 0.002453 4.789 1.95e-06 ***
## married 0.199417 0.039050 5.107 3.98e-07 ***
## black -0.188350 0.037667 -5.000 6.84e-07 ***
## south -0.090904 0.026249 -3.463 0.000558 ***
## urban 0.183912 0.026958 6.822 1.62e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3655 on 927 degrees of freedom
## Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
## F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16
model4 <- lm(log(wage) ~ educ +exper + I(exper^2) + tenure + I(tenure^2)
+ married + black + south + urban, data = wage2)
summary(model4)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + I(exper^2) + tenure +
## I(tenure^2) + married + black + south + urban, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98236 -0.21972 -0.00036 0.24078 1.25127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3586756 0.1259143 42.558 < 2e-16 ***
## educ 0.0642761 0.0063115 10.184 < 2e-16 ***
## exper 0.0172146 0.0126138 1.365 0.172665
## I(exper^2) -0.0001138 0.0005319 -0.214 0.830622
## tenure 0.0249291 0.0081297 3.066 0.002229 **
## I(tenure^2) -0.0007964 0.0004710 -1.691 0.091188 .
## married 0.1985470 0.0391103 5.077 4.65e-07 ***
## black -0.1906636 0.0377011 -5.057 5.13e-07 ***
## south -0.0912153 0.0262356 -3.477 0.000531 ***
## urban 0.1854241 0.0269585 6.878 1.12e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3653 on 925 degrees of freedom
## Multiple R-squared: 0.255, Adjusted R-squared: 0.2477
## F-statistic: 35.17 on 9 and 925 DF, p-value: < 2.2e-16
linearHypothesis(model4, c("I(exper^2)=0", "I(tenure^2)=0" ))
## Linear hypothesis test
##
## Hypothesis:
## I(exper^2) = 0
## I(tenure^2) = 0
##
## Model 1: restricted model
## Model 2: log(wage) ~ educ + exper + I(exper^2) + tenure + I(tenure^2) +
## married + black + south + urban
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 927 123.82
## 2 925 123.42 2 0.39756 1.4898 0.226
model4 <- lm(log(wage) ~ educ +exper +tenure + married
+ black + south + urban + I(black*educ), data = wage2)
summary(model4)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban + I(black * educ), data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.97782 -0.21832 0.00475 0.24136 1.23226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.374817 0.114703 46.859 < 2e-16 ***
## educ 0.067115 0.006428 10.442 < 2e-16 ***
## exper 0.013826 0.003191 4.333 1.63e-05 ***
## tenure 0.011787 0.002453 4.805 1.80e-06 ***
## married 0.198908 0.039047 5.094 4.25e-07 ***
## black 0.094809 0.255399 0.371 0.710561
## south -0.089450 0.026277 -3.404 0.000692 ***
## urban 0.183852 0.026955 6.821 1.63e-11 ***
## I(black * educ) -0.022624 0.020183 -1.121 0.262603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3654 on 926 degrees of freedom
## Multiple R-squared: 0.2536, Adjusted R-squared: 0.2471
## F-statistic: 39.32 on 8 and 926 DF, p-value: < 2.2e-16
No. The standard errors for each coefficient, whether usual or heteroskedasticity-robust, show very similar practical results.
The impact translates to −.029(4) = −.116, indicating a decrease in the probability of smoking by about .116.
As is customary, we determine the turning point within the quadratic: .020/[2(.00026)] ≈ 38.46, roughly around 38 and a half years.
With other variables in the equation held constant, an individual in a state with restaurant smoking restrictions faces a reduction of .101 in the likelihood of smoking. This resembles the impact of obtaining four more years of education.
smokes=−.656+.069xlog(67.44)+.012xlog(6,500)−.029x(16)+.020x(77)+.00026x(77^2 )−.0052. Hence, the estimated smoking probability for this individual aligns closely with zero. (In fact, this individual isn’t a smoker, validating the equation’s prediction for this specific observation.)
data ("vote1")
model5 <- lm (voteA ~ prtystrA + democA + lexpendA + lexpendB, data = vote1)
summary(model5)
##
## Call:
## lm(formula = voteA ~ prtystrA + democA + lexpendA + lexpendB,
## data = vote1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.576 -4.864 -1.146 4.903 24.566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.66142 4.73604 7.952 2.56e-13 ***
## prtystrA 0.25192 0.07129 3.534 0.00053 ***
## democA 3.79294 1.40652 2.697 0.00772 **
## lexpendA 5.77929 0.39182 14.750 < 2e-16 ***
## lexpendB -6.23784 0.39746 -15.694 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.573 on 168 degrees of freedom
## Multiple R-squared: 0.8012, Adjusted R-squared: 0.7964
## F-statistic: 169.2 on 4 and 168 DF, p-value: < 2.2e-16
When regressing the u on all explanatory variables, we find an R-squared close to zero due to rounding errors in computer output. OLS operates by selecting estimates, ˆβj, ensuring that residuals are uncorrelated with each independent variable and maintain a zero sample average.
bptest(model5)
##
## studentized Breusch-Pagan test
##
## data: model5
## BP = 9.0934, df = 4, p-value = 0.05881
bptest (model5, ~ prtystrA*democA*lexpendA*lexpendB + I(prtystrA^2) + I(democA^2) + I(lexpendA^2) + I(lexpendB^2), data = vote1)
##
## studentized Breusch-Pagan test
##
## data: model5
## BP = 43.327, df = 18, p-value = 0.0007194
data("fertil2")
model6 <- lm(children ~ age + agesq + educ + electric + urban, data = fertil2 )
summary(model6)
##
## Call:
## lm(formula = children ~ age + agesq + educ + electric + urban,
## data = fertil2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9012 -0.7136 -0.0039 0.7119 7.4318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.2225162 0.2401888 -17.580 < 2e-16 ***
## age 0.3409255 0.0165082 20.652 < 2e-16 ***
## agesq -0.0027412 0.0002718 -10.086 < 2e-16 ***
## educ -0.0752323 0.0062966 -11.948 < 2e-16 ***
## electric -0.3100404 0.0690045 -4.493 7.20e-06 ***
## urban -0.2000339 0.0465062 -4.301 1.74e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.452 on 4352 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.5734, Adjusted R-squared: 0.5729
## F-statistic: 1170 on 5 and 4352 DF, p-value: < 2.2e-16
coeftest(model6, vcov=vcovHC(model6, type = "HC0"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.22251623 0.24368307 -17.3279 < 2.2e-16 ***
## age 0.34092552 0.01916146 17.7923 < 2.2e-16 ***
## agesq -0.00274121 0.00035027 -7.8260 6.278e-15 ***
## educ -0.07523232 0.00630336 -11.9353 < 2.2e-16 ***
## electric -0.31004041 0.06390411 -4.8517 1.267e-06 ***
## urban -0.20003386 0.04543962 -4.4022 1.097e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bptest(model6)
##
## studentized Breusch-Pagan test
##
## data: model6
## BP = 1065, df = 5, p-value < 2.2e-16
In conclusion, as the p-value falls below alpha at a 95% confidence level, there exists substantial statistical evidence to reject the null hypothesis, indicating the presence of heteroskedasticity within the model.
usqr=(summary(model6)$residual)^2
yhat=model6$fitted.values
yhatsqr=model6$fitted.values^2
model6WT=summary(lm(usqr~yhat+yhatsqr,data=fertil2))$r.squared
model6WT*4361
## [1] 1090.862
1-pchisq(1093.183,2)
## [1] 0
As the p-value is below alpha at a 95% confidence level, there exists adequate statistical evidence to reject the null hypothesis, signifying the presence of heteroskedasticity in the model.
If β 6 or β 7, denoting the population parameters on ‘ceoten2’ and ‘comten2’, respectively, do not equal zero, it suggests a potential mismatch in the functional form. To evaluate this, we conduct a joint significance test using the R-squared form of the F test: F = [(.375 − .353)/(1 − .375)][(177 – 8)/2] ≈ 2.97. With degrees of freedom 2 and approaching infinity, the 10% critical value is 2.30, and the 5% critical value is 3.00. Consequently, the p-value slightly exceeds .05, indicating reasonable evidence of functional form mismatch. However, whether this influences the estimated partial effects for different levels of the explanatory variables is a separate consideration.
The sample selection could be endogenous here. If colleges with high crime rates are inclined to conceal crime statistics due to their influence on prospective students, the likelihood of being in the sample is inversely linked to the crime equation’s u. In essence, higher u, indicating more crime for a given school size, leads to a reduced probability of the school disclosing its crime data.
data("jtrain")
jtrain <- jtrain %>% filter(year == 1988)
model7 <- lm(lscrap ~ grant, data = jtrain)
summary(model7)
##
## Call:
## lm(formula = lscrap ~ grant, data = jtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4043 -0.9536 -0.0465 0.9636 2.8103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4085 0.2406 1.698 0.0954 .
## grant 0.0566 0.4056 0.140 0.8895
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.423 on 52 degrees of freedom
## (103 observations deleted due to missingness)
## Multiple R-squared: 0.0003744, Adjusted R-squared: -0.01885
## F-statistic: 0.01948 on 1 and 52 DF, p-value: 0.8895
If grants were allocated to firms based on their characteristics or those of their workers, it’s plausible for the grant to be correlated with factors influencing productivity, which are encapsulated within u in the basic regression model.
log(scrap) = 0.406 + 0.057*grant (0.24) (0.405) n = 54, R^2 = 0.00003 The coefficient on grant is actually positive, but not statistically different from zero.
model8 <- lm(lscrap ~ grant + lscrap_1, data = jtrain)
summary(model8)
##
## Call:
## lm(formula = lscrap ~ grant + lscrap_1, data = jtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9146 -0.1763 0.0057 0.2308 1.5991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02124 0.08910 0.238 0.8126
## grant -0.25397 0.14703 -1.727 0.0902 .
## lscrap_1 0.83116 0.04444 18.701 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5127 on 51 degrees of freedom
## (103 observations deleted due to missingness)
## Multiple R-squared: 0.8728, Adjusted R-squared: 0.8678
## F-statistic: 174.9 on 2 and 51 DF, p-value: < 2.2e-16
The t statistic for H0: β grant = 0 is -1.73. We use the 5% critical value for 40 df in Table G.2: -1.68. Because t = −1.73 < −1.68, we reject H0 in favor of H1: β grant < 0 at the 5% level.
The t statistic is (.831 – 1)/.044≈ −3.84, which is a strong rejection of H0
coeftest(model7, vcov = vcovHC(model7, type = "HC0"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40853 0.25801 1.5833 0.1194
## grant 0.05660 0.36390 0.1555 0.8770
coeftest(model8, vcov = vcovHC(model8, type = "HC0"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.021237 0.097032 0.2189 0.82763
## grant -0.253970 0.142249 -1.7854 0.08014 .
## lscrap_1 0.831161 0.071469 11.6297 5.862e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data ("infmrt")
model9 <- lm(infmort ~ lpcinc + lphysic + lpopul + DC, data = infmrt )
summary(model9)
##
## Call:
## lm(formula = infmort ~ lpcinc + lphysic + lpopul + DC, data = infmrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6894 -0.8973 -0.1412 0.7054 3.1705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.2125 6.7267 5.383 5.09e-07 ***
## lpcinc -2.3964 0.8866 -2.703 0.00812 **
## lphysic -1.5548 0.7734 -2.010 0.04718 *
## lpopul 0.5755 0.1365 4.215 5.60e-05 ***
## DC 13.9632 1.2466 11.201 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.255 on 97 degrees of freedom
## Multiple R-squared: 0.6432, Adjusted R-squared: 0.6285
## F-statistic: 43.72 on 4 and 97 DF, p-value: < 2.2e-16
The coefficient for DC implies that if a state shared the exact per capita income, per capita physicians, and population as Washington D.C., the predicted infant mortality rate for D.C. would be roughly 16 deaths per 1000 live births higher. This signifies a substantial and notable “D.C. effect.”
In the regression from part (i), including or excluding a single observation, such as D.C., yields matching coefficients and standard errors for the other variables. However, the R-squared values differ notably because predicting the infant mortality rate perfectly for D.C. impacts these metrics. Verifying that the residual for the D.C. observation is zero is advisable.