1. (a)
Get Data:
setwd("/Users/traves/Dropbox/SM339/homework 3")
CL = read.csv("Cereal.csv")
summary(CL)
## Cereal Calories Sugar Fiber
## 100% Bran : 1 Min. : 50 Min. : 0.00 Min. : 0.00
## All Bran Xtra Fiber: 1 1st Qu.: 90 1st Qu.: 1.75 1st Qu.: 1.00
## Batman : 1 Median :104 Median : 5.00 Median : 3.00
## Bran Buds : 1 Mean :102 Mean : 5.71 Mean : 3.59
## Bran Flakes : 1 3rd Qu.:110 3rd Qu.: 9.07 3rd Qu.: 4.25
## Capt. Crunch : 1 Max. :160 Max. :15.00 Max. :14.00
## (Other) :30
attach(CL)
Fit the basic model:
CF = lm(Calories ~ Fiber)
plot(Calories ~ Fiber, pch = 19, col = "red", main = "Calories vs Fiber")
abline(CF, col = "blue", lwd = 3)
CC = cor(Fiber, Calories)
CC
## [1] -0.715
CC^2
## [1] 0.5112
The plot shows that Calories and Fiber are NEGATIVELY correlated. The correlation coefficient is approximately -0.715 and the coefficient of determination is approximately 0.51.
(b) Get fitted regression line:
summary(CF)
##
## Call:
## lm(formula = Calories ~ Fiber)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.36 -7.36 -4.01 1.41 55.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.363 3.722 31.54 < 2e-16 ***
## Fiber -4.388 0.736 -5.96 9.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.7 on 34 degrees of freedom
## Multiple R-squared: 0.511, Adjusted R-squared: 0.497
## F-statistic: 35.6 on 1 and 34 DF, p-value: 9.6e-07
The equation of the regression line is Calories = 117.36 - 4.39*Fiber.
(c) Plot the residuals:
plot(rstandard(CF) ~ CF$fitted, col = "red", pch = 19, main = "Standardized Residuals vs. Fitted Values",
xlab = "Fitted Values of Basic Linear Model", ylab = "Standardized Residuals")
abline(0, 0, col = "blue")
We see how the spread of the residuals increases with increasing fitted values. Let's fit the new linear model Calories~sqrt(Fiber).
CFs = lm(Calories ~ sqrt(Fiber))
summary(CFs)
##
## Call:
## lm(formula = Calories ~ sqrt(Fiber))
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.46 -11.69 -1.23 2.24 60.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.69 5.55 21.92 < 2e-16 ***
## sqrt(Fiber) -12.62 2.93 -4.31 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.1 on 34 degrees of freedom
## Multiple R-squared: 0.353, Adjusted R-squared: 0.334
## F-statistic: 18.6 on 1 and 34 DF, p-value: 0.000132
The fitted line for the new model has equation Calories = 121.690 - 12.622 sqrt(Fiber).
(d) Now for the residuals:
plot(Calories ~ sqrt(Fiber), pch = 19, col = "red", main = "Calories vs. Sqrt(Fiber)")
abline(CFs, col = "blue", lwd = 3)
plot(rstandard(CFs) ~ CFs$fitted, main = "Standardized residuals for the new model",
xlab = "Fitted values of the new model", ylab = "Standardized residuals")
abline(0, 0, col = "blue", lwd = 3)
The outliers seem to have standardized residuals larger than 2:
rstandard(CFs)[which(rstandard(CFs) > 2)]
## 4 9 27
## 2.029 2.590 3.375
CL[which(rstandard(CFs) > 2), ]
## Cereal Calories Sugar Fiber
## 4 Just Right 140 9 2
## 9 Kenmei Rice Bran 150 9 2
## 27 Mueslix Crispy Blend 160 13 3
The three outliers are Just Right, Kenmei Rice Bran, and Mueslix Crispy Blend.
(e) The residual for Uncle Sam is 21.7 calories.
CFs$residual[which(Cereal == "Uncle Sam")]
## 28
## 21.7
(f) Model is Calories = 121.69 - 12.622*sqrt(Fiber)
Plot:
plot(Calories ~ Fiber, pch = 19, col = "red", main = "Calories vs. Fiber for Breakfast Cereals",
xlab = "Fiber in grams", ylab = "Calories")
o = order(Fiber)
fits = 121.69 - 12.622 * sqrt(Fiber)
lines(Fiber[o], fits[o], col = "blue", lwd = 2)
(g) We test the two hypotheses
\( H_0 \): \( \beta_1 = 0 \)
\( H_1 \): \( \beta_1 \neq 0 \)
at the 5% level:
summary(CFs)
##
## Call:
## lm(formula = Calories ~ sqrt(Fiber))
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.46 -11.69 -1.23 2.24 60.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.69 5.55 21.92 < 2e-16 ***
## sqrt(Fiber) -12.62 2.93 -4.31 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.1 on 34 degrees of freedom
## Multiple R-squared: 0.353, Adjusted R-squared: 0.334
## F-statistic: 18.6 on 1 and 34 DF, p-value: 0.000132
B1 = -12.622
SE = 2.929
test.stat = (B1 - 0)/SE # test.stat is negative
test.stat
## [1] -4.309
p = 2 * (pt(test.stat, df = length(Calories) - 2))
p
## [1] 0.0001323
The test statistic is \( \hat{\beta}_1/SE_{\hat{\beta}_1} \) = -4.309 and the p-value is 0.0001325. Since p<\( \alpha \) = 0.05, we reject the null hypothesis and conclude that \( \beta_1 \neq 0 \). Thus, the data supports the claim that Fiber is a useful variable when trying to predict Calories.
(h) We compute a 95% confidence interval for \( \beta_1 \).
B1 = -12.622
SE = 2.929
tstar = qt(0.975, df = length(Calories) - 2)
low = B1 - tstar * SE
high = B1 + tstar * SE
print(c(low, high))
## [1] -18.57 -6.67
A 95% confidence interval for the coefficient of sqrt(Fiber) from out model in part © is (-18.6, -6.7). That is, we are 95% confident that for each extra gram of Fiber in a serving, the calorie count drops by between 6.7 and 18.6 calories.
2. (a) 95% CI for true mean of cereals with 4 grams of fiber (i.e. sqrt(Fiber) = 2 grams):
new = data.frame(Fiber = 4)
predict(CFs, newdata = new, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 96.45 89.86 103
The confidence interval for the mean number of calories for cereals with 4 grams of fiber is (89.85, 103.04). Let's reproduce this calculation without using the predict function (as a check that everything is correct):
fit = 121.69 - 12.622 * sqrt(4)
n = length(Calories)
RSE = 18.09
xstar = 2
xbar = mean(sqrt(Fiber))
top = (xstar - xbar)^2
bot = sum((sqrt(Fiber) - xbar)^2)
tstar = qt(0.975, df = n - 2)
SEU = RSE * sqrt((1/n) + (top/bot))
low = predict(CFs, newdata = new) - tstar * SEU
high = predict(CFs, newdata = new) + tstar * SEU
print(c(low, high))
## 1 1
## 89.85 103.04
(b) In this case we need a prediction interval:
new = data.frame(Fiber = 4)
predict(CFs, newdata = new, interval = "predict", level = 0.95)
## fit lwr upr
## 1 96.45 59.11 133.8
A 95% prediction interval (i.e. a 95% confidence interval for the number of calories in the new cereal) is (59.11,133.79).
(c) Test
\( H_0 \): true coefficient of sqrt(Fiber) = 0
\( H_1 \): true coefficient of sqrt(Fiber) \( \neq \) 0
by computing the test statistic F = MSM/MSE and computing p = Pr(\( F_{1,n-2}> \) test.statistic).
anova(CFs)
## Analysis of Variance Table
##
## Response: Calories
## Df Sum Sq Mean Sq F value Pr(>F)
## sqrt(Fiber) 1 6073 6073 18.6 0.00013 ***
## Residuals 34 11121 327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test.statistic = 18.566 # F-value in anova table
p = 1 - pf(test.statistic, df1 = 1, df2 = n - 2)
p # can also see the p-value in the anova table
## [1] 0.0001325
Since the p value is quite low we would reject the null hypothesis and conclude that sqrt(Fiber) is a useful variable when trying to predict calories of cereals (p = 0.0001325).
(d) The percentage of total variation in Calories explained by the model involving sqrt(Fiber) is given by the coefficient of determination.
cor(Calories, sqrt(Fiber))^2
## [1] 0.3532
So just over 35% of the total variation is explained by the model involving sqrt(Fiber).