Question 1.1:
The incorrect answer is c: when X = 2, \( \hat{Y} \) = 100 + 15(2) = 130, not 110.
Question 1.2:
The slope of the regression line is the coefficient of WingLength, 0.4674.
Question 1.3:
The intercept of the least squares regression line is the coefficient of the Constant, 1.3655.
Question 1.4:
The size of the typical error when predicting weight from wing length is the regression standard error, which is the square root of the mean square residual error: \( \sqrt{1.96} = 1.40 \).
Question 1.5:
According to the regression output, the residual error has 114 degrees of freedom.
Question 1.6:
Given the regression equation \( \hat{Y} = 25 + 7X \) and data point \( (x_1,y_1)= (10,100) \), the fitted value is \( \hat{y}_1 = 25 + 7(10) = 95 \) and the residual is \( y_1 - \hat{y}_1 = 100-95 = 5 \).
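The arithmetic above can be checked in R. This is only a small sketch; the names fitted_value, x1, y1, y1_hat and resid1 are made up for illustration and do not come from the question.

```r
# Regression equation from the question: Y-hat = 25 + 7X
fitted_value <- function(x) 25 + 7 * x

x1 <- 10   # observed predictor value
y1 <- 100  # observed response value

y1_hat <- fitted_value(x1)  # fitted value on the line at x1
resid1 <- y1 - y1_hat       # residual = observed minus fitted

y1_hat
resid1
```

A positive residual means the observed point lies above the regression line.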
Question 1.7:
The answer is c: a plot of the residuals is not helpful in determining the independence of observations; you really need to think about the context of the study.
Question 1.8:
a. Load some data:
setwd("/Users/traves/Dropbox/SM339/day09")
CD <- read.csv("Cereal.csv")
summary(CD)
##                  Cereal      Calories       Sugar           Fiber
##  100% Bran          : 1   Min.   : 50   Min.   : 0.00   Min.   : 0.00
##  All Bran Xtra Fiber: 1   1st Qu.: 90   1st Qu.: 1.75   1st Qu.: 1.00
##  Batman             : 1   Median :104   Median : 5.00   Median : 3.00
##  Bran Buds          : 1   Mean   :102   Mean   : 5.71   Mean   : 3.59
##  Bran Flakes        : 1   3rd Qu.:110   3rd Qu.: 9.07   3rd Qu.: 4.25
##  Capt. Crunch       : 1   Max.   :160   Max.   :15.00   Max.   :14.00
##  (Other)            :30
names(CD)
## [1] "Cereal" "Calories" "Sugar" "Fiber"
head(CD)
##                  Cereal Calories Sugar Fiber
## 1 Common Sense Oat Bran      100     6     3
## 2            Product 19      100     3     1
## 3   All Bran Xtra Fiber       50     0    14
## 4            Just Right      140     9     2
## 5     Original Oat Bran       70     5    10
## 6             Heartwise       90     5     6
attach(CD)
Graph calories vs. sugar:
plot(Calories ~ Sugar, col = "red", pch = 19, xlab = "Sugar in grams", ylab = "Calories",
main = "Calories vs. Sugar")
There is a general trend: more sugar tends to be associated with more calories. However, the variance of the data is so high that it is hard to decide on a model that might fit the data.
b. We fit the model
Calories = \( \beta_0 \) + \( \beta_1 \cdot \) Sugar + \( \epsilon \),
with \( \epsilon \sim N(0,\sigma) \).
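To see what this model assumes, here is a minimal simulation sketch. The parameter values (beta0 = 90, beta1 = 2.5, sigma = 20), the sample size, and the uniform spread of sugar values are all hypothetical choices made for illustration, not estimates from the cereal data.

```r
set.seed(339)  # arbitrary seed so the sketch is reproducible

n     <- 1000  # hypothetical sample size
beta0 <- 90    # hypothetical intercept
beta1 <- 2.5   # hypothetical slope
sigma <- 20    # hypothetical standard deviation of the errors

sugar    <- runif(n, min = 0, max = 15)     # made-up sugar values in grams
eps      <- rnorm(n, mean = 0, sd = sigma)  # errors: epsilon ~ N(0, sigma)
calories <- beta0 + beta1 * sugar + eps     # response generated by the model

sim_fit <- lm(calories ~ sugar)  # least squares fit to the simulated data
coef(sim_fit)                    # estimates should land near beta0 and beta1
```

With data generated this way, the least squares estimates vary from sample to sample but cluster around the true intercept and slope.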
Fit model:
CSline = lm(Calories ~ Sugar)
summary(CSline)
##
## Call:
## lm(formula = Calories ~ Sugar)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -37.43  -9.83   0.24   8.91  40.32
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   87.428      5.163   16.93   <2e-16 ***
## Sugar          2.481      0.707    3.51   0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.3 on 34 degrees of freedom
## Multiple R-squared: 0.266, Adjusted R-squared: 0.244
## F-statistic: 12.3 on 1 and 34 DF, p-value: 0.0013
The least squares regression line is
\( \widehat{\text{Calories}} \) = 87.428 + 2.481*Sugar.
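Several numbers in the summary output can be cross-checked against each other. The sketch below uses only rounded values copied from the output; the variable names are made up for illustration, and the sample size of 36 is inferred from the summary(CD) listing (six cereals shown individually plus 30 under "(Other)").

```r
n        <- 6 + 30  # cereals in the data set, inferred from summary(CD)
df_resid <- n - 2   # two coefficients (intercept and slope) are estimated

t_slope <- 2.481 / 0.707  # t value = estimate / standard error
F_stat  <- t_slope^2      # with one predictor, F equals the squared slope t value
r       <- sqrt(0.266)    # correlation; positive because the slope is positive

df_resid
round(t_slope, 2)
round(F_stat, 1)
round(r, 3)
```

Small differences from the printed output are possible because the inputs here are already rounded.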
c. The slope value is 2.481 and the units are calories/g of Sugar. The value indicates that for every extra gram of sugar in the cereal, we expect 2.481 additional calories.
Question 1.9:
a. The fitted model is \( \widehat{\text{Calories}} \) = 87.428 + 2.481*Sugar. If Sugar = 10 then the fitted value is \( \widehat{\text{Calories}} = 87.428 + 2.481(10) \approx 112.24 \).
Here's a way to do this using only R functions. Pass predict the fitted model CSline together with the newdata argument, a data.frame holding the predictor values for which you want predictions.
predict(CSline, newdata = data.frame(Sugar = c(10)))
## 1
## 112.2
b. The model predicts that Cheerios should have 87.428 + 2.481*1 = 89.909 calories. The residual is 110 - 89.909 = 20.091 calories.
110 - predict(CSline, newdata = data.frame(Sugar = c(1)))
## 1
## 20.09
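The two predict results can also be reproduced by hand from the rounded coefficients. In this sketch the names b0, b1 and predict_calories are made up for illustration. Because the coefficients are rounded to three decimals, the last digit may differ slightly from what predict returns with full precision.

```r
b0 <- 87.428  # intercept, rounded, from summary(CSline)
b1 <- 2.481   # slope, rounded, from summary(CSline)

# Fitted calories for a given amount of sugar (grams)
predict_calories <- function(sugar) b0 + b1 * sugar

pred_10        <- predict_calories(10)  # part a: cereal with 10 g of sugar
pred_1         <- predict_calories(1)   # part b: Cheerios, 1 g of sugar
resid_cheerios <- 110 - pred_1          # observed 110 calories minus fitted

round(pred_10, 1)
round(resid_cheerios, 2)
```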
c.
plot(Calories ~ Sugar, col = "red", pch = 19, xlab = "Sugar in grams", ylab = "Calories",
main = "Calories vs. Sugar")
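# Fitted regression line (blue) with a band of +/- 1 residual standard error (green)
lines(Sugar, fitted(CSline), col = "blue", lwd = 4)
RSE <- summary(CSline)$sigma  # residual standard error, about 19.27
lines(Sugar, fitted(CSline) + RSE, col = "green", lwd = 2)
lines(Sugar, fitted(CSline) - RSE, col = "green", lwd = 2)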
The linear regression model does seem to capture the general trend of the data, and the green lines drawn one residual standard error above and below the regression line do seem to enclose most of the data points. Still, I'm nervous about the large amount of scatter away from the fitted line.
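The visual impression can be backed up by counting how many points fall within one residual standard error of the line. The sketch below does this only for the six cereals printed by head(CD), using the rounded coefficients and RSE = 19.27, so it illustrates the calculation rather than summarizing all 36 cereals. The data frame name six and its added columns are made up for illustration.

```r
# The six rows shown by head(CD)
six <- data.frame(
  Cereal   = c("Common Sense Oat Bran", "Product 19", "All Bran Xtra Fiber",
               "Just Right", "Original Oat Bran", "Heartwise"),
  Calories = c(100, 100, 50, 140, 70, 90),
  Sugar    = c(6, 3, 0, 9, 5, 5)
)

b0  <- 87.428  # intercept, rounded, from summary(CSline)
b1  <- 2.481   # slope, rounded, from summary(CSline)
RSE <- 19.27   # residual standard error used in the plot

six$Fitted   <- b0 + b1 * six$Sugar         # value on the regression line
six$Residual <- six$Calories - six$Fitted   # observed minus fitted
six$InBand   <- abs(six$Residual) <= RSE    # TRUE if between the green lines

six
mean(six$InBand)  # share of these six cereals inside the band
```

With the fitted model in the workspace, mean(abs(resid(CSline)) <= summary(CSline)$sigma) gives the same share for the full data set. If the errors are roughly normal, about 68% of the points should fall inside the band.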