Solutions to Homework from Day 15

2.23 Get data:

setwd("/Users/traves/Dropbox/SM339/day15/")
ME = read.csv("MathEnrollment.csv")
summary(ME)

##      Ayear           Fall         Spring   
##  Min.   :2001   Min.   :248   Min.   :206  
##  1st Qu.:2004   1st Qu.:266   1st Qu.:238  
##  Median :2006   Median :286   Median :254  
##  Mean   :2006   Mean   :286   Mean   :258  
##  3rd Qu.:2008   3rd Qu.:302   3rd Qu.:286  
##  Max.   :2011   Max.   :343   Max.   :308

attach(ME)

(a) Remove data for 2003:

MER = ME[-which(Ayear == 2003), ]
e.fit = lm(Spring ~ Fall, data = MER)
summary(e.fit)

## 
## Call:
## lm(formula = Spring ~ Fall, data = MER)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30.50 -17.35  -6.06  22.71  29.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  548.009    106.724    5.13  0.00089 ***
## Fall          -1.048      0.381   -2.75  0.02487 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 24.9 on 8 degrees of freedom
## Multiple R-squared: 0.487,   Adjusted R-squared: 0.423 
## F-statistic: 7.59 on 1 and 8 DF,  p-value: 0.0249

plot(MER$Spring ~ MER$Fall, main = "Math enrollment", xlab = "Fall enrollment", 
    ylab = "Spring enrollment")
abline(e.fit)

plot of chunk unnamed-chunk-2

plot(rstandard(e.fit) ~ e.fit$fitted, main = "Residuals vs. Fitted", ylab = "Standardized Residuals", 
    xlab = "Fitted values")

plot of chunk unnamed-chunk-2

require(lattice)

## Loading required package: lattice

require(latticeExtra)

## Loading required package: latticeExtra

## Loading required package: RColorBrewer

densityplot(rstandard(e.fit))

plot of chunk unnamed-chunk-2

plot(rstandard(e.fit) ~ MER$Ayear, main = "Residuals vs. Academic Year", ylab = "Standardized Residuals", 
    xlab = "Academic Year")

plot of chunk unnamed-chunk-2

The residuals seem to be bimodal: most are either too high or too low, but few fitted values are right on. The spread of the residuals seems constant (does not depend on the fitted value), but there is a clear trend when plotting the residuals against academic year. The model predictions are too low at the start of the data range and get larger, eventually rising to overshoot the actual values. This trend suggests that the residuals are not independent of one another and so we should be careful to add a caveat when we state the conclusions of any statistical analysis based on this data and this model.

(b) Variability in model:

anova(e.fit)

## Analysis of Variance Table
## 
## Response: Spring
##           Df Sum Sq Mean Sq F value Pr(>F)  
## Fall       1   4721    4721    7.59  0.025 *
## Residuals  8   4976     622                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

r2 = 4721.1/(4721.1 + 4976.5)
r2

## [1] 0.4868

Despite the limitations of the model, the model shows that Fall enrollment can be used to explain over 48% of the variation in Spring enrollment.

(c) See (b) above.

(d) We can test for whether there is a significant linear association between Fall and Spring enrollment in three ways. I'll mention all of them but you'd normally just do one test.

First we could use a t-test to check whether \( \beta_1 = 0 \) or not. From the summary of the linear model, we see that the test statistic is -2.755 (on 8 degrees of freedom), giving a p-value of 0.02487. So we'd reject the null hypotheis (\( \beta_1=0 \)) at any level of significance > 0.02487. At these levels, we'd conclude that the data supports the conclusion of a linear relationship between Fall and Spring enrollments.

Instead we could have used the F-test on variance. Here the test statistic is F = 7.5895 (distributed with 1 and 8 degrees of freedom) and the p-value is again 0.02487. Finally, we could use the hypothesis test based on the correlation coefficient:

r = -sqrt(r2)
n = length(MER$Fall)
t = r * sqrt(n - 2)/sqrt(1 - r^2)  # = -2.75
p = 2 * pt(t, df = n - 2)
p

## [1] 0.02487

Again the p-value is 0.02487.

(e) Find a 95% CI for the slope of the model:

summary(e.fit)

## 
## Call:
## lm(formula = Spring ~ Fall, data = MER)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30.50 -17.35  -6.06  22.71  29.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  548.009    106.724    5.13  0.00089 ***
## Fall          -1.048      0.381   -2.75  0.02487 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 24.9 on 8 degrees of freedom
## Multiple R-squared: 0.487,   Adjusted R-squared: 0.423 
## F-statistic: 7.59 on 1 and 8 DF,  p-value: 0.0249

beta1hat = -1.0483
SE = 0.3805
tstar = qt(0.975, df = n - 2)
low = beta1hat - tstar * SE
high = beta1hat + tstar * SE
print(c(low, high))

## [1] -1.9257 -0.1709

The 95% CI for the slope is (-1.93,-0.17). That is, we are 95% confident that for every extra student in the Fall, there will be between 0.17 and 1.93 fewer students in the Spring (an odd conclusion).

2.24 (a) Predict Spring enrollment when Fall enrollment is 290:

new = data.frame(Fall = 290)
predict(e.fit, newdata = new)

##   1 
## 244

Using the model, we predict Spring enrollment of 244.

(b) 95% CI for mean Spring enrollment when Fall enrollment is 290:

predict(e.fit, newdata = new, interval = "confidence", level = 0.95)

##   fit   lwr   upr
## 1 244 223.7 264.3

The 95% for mean Spring enrollment is (224,264) when Fall enrollment is 290.

(c) 95% CI for Spring enrollment when Fall enrollment is 290 (a prediction interval):

predict(e.fit, newdata = new, interval = "predict", level = 0.95)

##   fit lwr upr
## 1 244 183 305

The 95% for Spring enrollment is (183,305) when Fall enrollment is 290.

(d) Use the interval from part © – we are predicting a new value for a particular Spring enrollment, not the average of Spring enrollments.