Homework 3 Solutions

1. (a)

Get Data:

setwd("/Users/traves/Dropbox/SM339/homework 3")
CL = read.csv("Cereal.csv")
summary(CL)
##                  Cereal      Calories       Sugar           Fiber      
##  100% Bran          : 1   Min.   : 50   Min.   : 0.00   Min.   : 0.00  
##  All Bran Xtra Fiber: 1   1st Qu.: 90   1st Qu.: 1.75   1st Qu.: 1.00  
##  Batman             : 1   Median :104   Median : 5.00   Median : 3.00  
##  Bran Buds          : 1   Mean   :102   Mean   : 5.71   Mean   : 3.59  
##  Bran Flakes        : 1   3rd Qu.:110   3rd Qu.: 9.07   3rd Qu.: 4.25  
##  Capt. Crunch       : 1   Max.   :160   Max.   :15.00   Max.   :14.00  
##  (Other)            :30
attach(CL)

Fit the basic model:

CF = lm(Calories ~ Fiber)
plot(Calories ~ Fiber, pch = 19, col = "red", main = "Calories vs Fiber")
abline(CF, col = "blue", lwd = 3)

plot of chunk unnamed-chunk-2

CC = cor(Fiber, Calories)
CC
## [1] -0.715
CC^2
## [1] 0.5112

The plot shows that Calories and Fiber are NEGATIVELY correlated. The correlation coefficient is approximately -0.715 and the coefficient of determination is approximately 0.51.

(b) Get fitted regression line:

summary(CF)
## 
## Call:
## lm(formula = Calories ~ Fiber)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.36  -7.36  -4.01   1.41  55.80 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  117.363      3.722   31.54  < 2e-16 ***
## Fiber         -4.388      0.736   -5.96  9.6e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.7 on 34 degrees of freedom
## Multiple R-squared:  0.511,  Adjusted R-squared:  0.497 
## F-statistic: 35.6 on 1 and 34 DF,  p-value: 9.6e-07

The equation of the regression line is Calories = 117.36 - 4.39*Fiber.

(c) Plot the residuals:

plot(rstandard(CF) ~ CF$fitted, col = "red", pch = 19, main = "Standardized Residuals vs. Fitted Values", 
    xlab = "Fitted Values of Basic Linear Model", ylab = "Standardized Residuals")
abline(0, 0, col = "blue")

plot of chunk unnamed-chunk-4

We see how the spread of the residuals increases with increasing fitted values. Let's fit the new linear model Calories~sqrt(Fiber).

CFs = lm(Calories ~ sqrt(Fiber))
summary(CFs)
## 
## Call:
## lm(formula = Calories ~ sqrt(Fiber))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.46 -11.69  -1.23   2.24  60.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   121.69       5.55   21.92  < 2e-16 ***
## sqrt(Fiber)   -12.62       2.93   -4.31  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.1 on 34 degrees of freedom
## Multiple R-squared:  0.353,  Adjusted R-squared:  0.334 
## F-statistic: 18.6 on 1 and 34 DF,  p-value: 0.000132

The fitted line for the new model has equation Calories = 121.690 - 12.622 sqrt(Fiber).

(d) Now for the residuals:

plot(Calories ~ sqrt(Fiber), pch = 19, col = "red", main = "Calories vs. Sqrt(Fiber)")
abline(CFs, col = "blue", lwd = 3)

plot of chunk unnamed-chunk-6

plot(rstandard(CFs) ~ CFs$fitted, main = "Standardized residuals for the new model", 
    xlab = "Fitted values of the new model", ylab = "Standardized residuals")
abline(0, 0, col = "blue", lwd = 3)

plot of chunk unnamed-chunk-6

The outliers seem to have standardized residuals larger than 2:

rstandard(CFs)[which(rstandard(CFs) > 2)]
##     4     9    27 
## 2.029 2.590 3.375
CL[which(rstandard(CFs) > 2), ]
##                  Cereal Calories Sugar Fiber
## 4            Just Right      140     9     2
## 9      Kenmei Rice Bran      150     9     2
## 27 Mueslix Crispy Blend      160    13     3

The three outliers are Just Right, Kenmei Rice Bran, and Mueslix Crispy Blend.

(e) The residual for Uncle Sam is 21.7 calories.

CFs$residual[which(Cereal == "Uncle Sam")]
##   28 
## 21.7

(f) Model is Calories = 121.69 - 12.622*sqrt(Fiber)

Plot:

plot(Calories ~ Fiber, pch = 19, col = "red", main = "Calories vs. Fiber for Breakfast Cereals", 
    xlab = "Fiber in grams", ylab = "Calories")
o = order(Fiber)
fits = 121.69 - 12.622 * sqrt(Fiber)
lines(Fiber[o], fits[o], col = "blue", lwd = 2)

plot of chunk unnamed-chunk-9

(g) We test the two hypotheses

\( H_0 \): \( \beta_1 = 0 \)

\( H_1 \): \( \beta_1 \neq 0 \)

at the 5% level:

summary(CFs)
## 
## Call:
## lm(formula = Calories ~ sqrt(Fiber))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.46 -11.69  -1.23   2.24  60.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   121.69       5.55   21.92  < 2e-16 ***
## sqrt(Fiber)   -12.62       2.93   -4.31  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.1 on 34 degrees of freedom
## Multiple R-squared:  0.353,  Adjusted R-squared:  0.334 
## F-statistic: 18.6 on 1 and 34 DF,  p-value: 0.000132
B1 = -12.622
SE = 2.929
test.stat = (B1 - 0)/SE  # test.stat is negative
test.stat
## [1] -4.309
p = 2 * (pt(test.stat, df = length(Calories) - 2))
p
## [1] 0.0001323

The test statistic is \( \hat{\beta}_1/SE_{\hat{\beta}_1} \) = -4.309 and the p-value is 0.0001325. Since p<\( \alpha \) = 0.05, we reject the null hypothesis and conclude that \( \beta_1 \neq 0 \). Thus, the data supports the claim that Fiber is a useful variable when trying to predict Calories.

(h) We compute a 95% confidence interval for \( \beta_1 \).

B1 = -12.622
SE = 2.929
tstar = qt(0.975, df = length(Calories) - 2)
low = B1 - tstar * SE
high = B1 + tstar * SE
print(c(low, high))
## [1] -18.57  -6.67

A 95% confidence interval for the coefficient of sqrt(Fiber) from out model in part © is (-18.6, -6.7). That is, we are 95% confident that for each extra gram of Fiber in a serving, the calorie count drops by between 6.7 and 18.6 calories.

2. (a) 95% CI for true mean of cereals with 4 grams of fiber (i.e. sqrt(Fiber) = 2 grams):

new = data.frame(Fiber = 4)
predict(CFs, newdata = new, interval = "confidence", level = 0.95)
##     fit   lwr upr
## 1 96.45 89.86 103

The confidence interval for the mean number of calories for cereals with 4 grams of fiber is (89.85, 103.04). Let's reproduce this calculation without using the predict function (as a check that everything is correct):

fit = 121.69 - 12.622 * sqrt(4)
n = length(Calories)
RSE = 18.09
xstar = 2
xbar = mean(sqrt(Fiber))
top = (xstar - xbar)^2
bot = sum((sqrt(Fiber) - xbar)^2)
tstar = qt(0.975, df = n - 2)
SEU = RSE * sqrt((1/n) + (top/bot))
low = predict(CFs, newdata = new) - tstar * SEU
high = predict(CFs, newdata = new) + tstar * SEU
print(c(low, high))
##      1      1 
##  89.85 103.04

(b) In this case we need a prediction interval:

new = data.frame(Fiber = 4)
predict(CFs, newdata = new, interval = "predict", level = 0.95)
##     fit   lwr   upr
## 1 96.45 59.11 133.8

A 95% prediction interval (i.e. a 95% confidence interval for the number of calories in the new cereal) is (59.11,133.79).

(c) Test

\( H_0 \): true coefficient of sqrt(Fiber) = 0

\( H_1 \): true coefficient of sqrt(Fiber) \( \neq \) 0

by computing the test statistic F = MSM/MSE and computing p = Pr(\( F_{1,n-2}> \) test.statistic).

anova(CFs)
## Analysis of Variance Table
## 
## Response: Calories
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## sqrt(Fiber)  1   6073    6073    18.6 0.00013 ***
## Residuals   34  11121     327                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test.statistic = 18.566  # F-value in anova table
p = 1 - pf(test.statistic, df1 = 1, df2 = n - 2)
p  # can also see the p-value in the anova table
## [1] 0.0001325

Since the p value is quite low we would reject the null hypothesis and conclude that sqrt(Fiber) is a useful variable when trying to predict calories of cereals (p = 0.0001325).

(d) The percentage of total variation in Calories explained by the model involving sqrt(Fiber) is given by the coefficient of determination.

cor(Calories, sqrt(Fiber))^2
## [1] 0.3532

So just over 35% of the total variation is explained by the model involving sqrt(Fiber).