2.23 Get data:
setwd("/Users/traves/Dropbox/SM339/day15/")
ME = read.csv("MathEnrollment.csv")
summary(ME)
## Ayear Fall Spring
## Min. :2001 Min. :248 Min. :206
## 1st Qu.:2004 1st Qu.:266 1st Qu.:238
## Median :2006 Median :286 Median :254
## Mean :2006 Mean :286 Mean :258
## 3rd Qu.:2008 3rd Qu.:302 3rd Qu.:286
## Max. :2011 Max. :343 Max. :308
attach(ME)
(a) Remove data for 2003:
MER = ME[-which(Ayear == 2003), ]
e.fit = lm(Spring ~ Fall, data = MER)
summary(e.fit)
##
## Call:
## lm(formula = Spring ~ Fall, data = MER)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.50 -17.35 -6.06 22.71 29.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.009 106.724 5.13 0.00089 ***
## Fall -1.048 0.381 -2.75 0.02487 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.9 on 8 degrees of freedom
## Multiple R-squared: 0.487, Adjusted R-squared: 0.423
## F-statistic: 7.59 on 1 and 8 DF, p-value: 0.0249
plot(MER$Spring ~ MER$Fall, main = "Math enrollment", xlab = "Fall enrollment",
ylab = "Spring enrollment")
abline(e.fit)
plot(rstandard(e.fit) ~ e.fit$fitted, main = "Residuals vs. Fitted", ylab = "Standardized Residuals",
xlab = "Fitted values")
require(lattice)
## Loading required package: lattice
require(latticeExtra)
## Loading required package: latticeExtra
## Loading required package: RColorBrewer
densityplot(rstandard(e.fit))
plot(rstandard(e.fit) ~ MER$Ayear, main = "Residuals vs. Academic Year", ylab = "Standardized Residuals",
xlab = "Academic Year")
The residuals seem to be bimodal: most are either too high or too low, but few fitted values are right on. The spread of the residuals seems constant (does not depend on the fitted value), but there is a clear trend when plotting the residuals against academic year. The model predictions are too low at the start of the data range and get larger, eventually rising to overshoot the actual values. This trend suggests that the residuals are not independent of one another and so we should be careful to add a caveat when we state the conclusions of any statistical analysis based on this data and this model.
(b) Variability in model:
anova(e.fit)
## Analysis of Variance Table
##
## Response: Spring
## Df Sum Sq Mean Sq F value Pr(>F)
## Fall 1 4721 4721 7.59 0.025 *
## Residuals 8 4976 622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
r2 = 4721.1/(4721.1 + 4976.5)
r2
## [1] 0.4868
Despite the limitations of the model, the model shows that Fall enrollment can be used to explain over 48% of the variation in Spring enrollment.
(c) See (b) above.
(d) We can test for whether there is a significant linear association between Fall and Spring enrollment in three ways. I'll mention all of them but you'd normally just do one test.
First we could use a t-test to check whether \( \beta_1 = 0 \) or not. From the summary of the linear model, we see that the test statistic is -2.755 (on 8 degrees of freedom), giving a p-value of 0.02487. So we'd reject the null hypotheis (\( \beta_1=0 \)) at any level of significance > 0.02487. At these levels, we'd conclude that the data supports the conclusion of a linear relationship between Fall and Spring enrollments.
Instead we could have used the F-test on variance. Here the test statistic is F = 7.5895 (distributed with 1 and 8 degrees of freedom) and the p-value is again 0.02487. Finally, we could use the hypothesis test based on the correlation coefficient:
r = -sqrt(r2)
n = length(MER$Fall)
t = r * sqrt(n - 2)/sqrt(1 - r^2) # = -2.75
p = 2 * pt(t, df = n - 2)
p
## [1] 0.02487
Again the p-value is 0.02487.
(e) Find a 95% CI for the slope of the model:
summary(e.fit)
##
## Call:
## lm(formula = Spring ~ Fall, data = MER)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.50 -17.35 -6.06 22.71 29.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.009 106.724 5.13 0.00089 ***
## Fall -1.048 0.381 -2.75 0.02487 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.9 on 8 degrees of freedom
## Multiple R-squared: 0.487, Adjusted R-squared: 0.423
## F-statistic: 7.59 on 1 and 8 DF, p-value: 0.0249
beta1hat = -1.0483
SE = 0.3805
tstar = qt(0.975, df = n - 2)
low = beta1hat - tstar * SE
high = beta1hat + tstar * SE
print(c(low, high))
## [1] -1.9257 -0.1709
The 95% CI for the slope is (-1.93,-0.17). That is, we are 95% confident that for every extra student in the Fall, there will be between 0.17 and 1.93 fewer students in the Spring (an odd conclusion).
2.24 (a) Predict Spring enrollment when Fall enrollment is 290:
new = data.frame(Fall = 290)
predict(e.fit, newdata = new)
## 1
## 244
Using the model, we predict Spring enrollment of 244.
(b) 95% CI for mean Spring enrollment when Fall enrollment is 290:
predict(e.fit, newdata = new, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 244 223.7 264.3
The 95% for mean Spring enrollment is (224,264) when Fall enrollment is 290.
(c) 95% CI for Spring enrollment when Fall enrollment is 290 (a prediction interval):
predict(e.fit, newdata = new, interval = "predict", level = 0.95)
## fit lwr upr
## 1 244 183 305
The 95% for Spring enrollment is (183,305) when Fall enrollment is 290.
(d) Use the interval from part © – we are predicting a new value for a particular Spring enrollment, not the average of Spring enrollments.