Solutions to Homework from Day 09

Question 1.1:

The incorrect answer is c: when X = 2, \( \hat{Y} \) = 100 + 15(2) = 130, not 110.

Question 1.2:

The slope of the regression line is the coefficient of WingLength, 0.4674.

Question 1.3:

The intercept of the least squares regression line is the coefficient of the Constant, 1.3655.

Question 1.4:

The size of the typical error when predicting weight from wing length is the regression standard error, the square root of the mean square residual error: \( \sqrt{1.96} = 1.4 \).

Question 1.5:

According to the regression output, the residual error has 114 degrees of freedom.
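
The quantities in Questions 1.4 and 1.5 are linked: in a simple linear regression with n observations, the residual degrees of freedom are n - 2, and the residual standard error is the square root of the sum of squared residuals divided by those degrees of freedom. Here is a minimal sketch of that relationship in R. It uses the built-in cars data as a stand-in (an assumption made so the code runs on its own; it is not the sparrow data from the question, so the numbers differ).

```r
# Stand-in example: built-in 'cars' data, NOT the sparrow data from the homework.
fit <- lm(dist ~ speed, data = cars)

n   <- nrow(cars)               # number of observations
df  <- df.residual(fit)         # residual degrees of freedom (n - 2 here)
sse <- sum(residuals(fit)^2)    # sum of squared residuals
mse <- sse / df                 # mean square residual error
rse <- sqrt(mse)                # residual standard error

c(n = n, df = df)
all.equal(rse, summary(fit)$sigma)  # compare with the value summary() reports
```

The last line checks the hand computation against the residual standard error that summary() prints.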

Question 1.6:

Given the regression equation \( \hat{Y} = 25 + 7X \) and data point \( (x_1,y_1)= (10,100) \), the fitted value is \( \hat{y}_1 = 25 + 7(10) = 95 \) and the residual is \( y_1 - \hat{y}_1 = 100-95 = 5 \).
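
The same arithmetic can be checked in R. This is just a sketch; the variable names (b0, b1, x1, y1, fitted_y1, residual_y1) are made up for illustration.

```r
# Coefficients of the regression equation given in the question
b0 <- 25
b1 <- 7

# Observed data point (x1, y1)
x1 <- 10
y1 <- 100

fitted_y1   <- b0 + b1 * x1    # fitted value from the regression equation
residual_y1 <- y1 - fitted_y1  # residual = observed minus fitted

fitted_y1
residual_y1
```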

Question 1.7:

The answer is c: a plot of the residuals is not helpful in determining the independence of observations; you really need to think about the context of the study.

Question 1.8:

a. Load some data:

setwd("/Users/traves/Dropbox/SM339/day09")
CD <- read.csv("Cereal.csv")
summary(CD)
##                  Cereal      Calories       Sugar           Fiber      
##  100% Bran          : 1   Min.   : 50   Min.   : 0.00   Min.   : 0.00  
##  All Bran Xtra Fiber: 1   1st Qu.: 90   1st Qu.: 1.75   1st Qu.: 1.00  
##  Batman             : 1   Median :104   Median : 5.00   Median : 3.00  
##  Bran Buds          : 1   Mean   :102   Mean   : 5.71   Mean   : 3.59  
##  Bran Flakes        : 1   3rd Qu.:110   3rd Qu.: 9.07   3rd Qu.: 4.25  
##  Capt. Crunch       : 1   Max.   :160   Max.   :15.00   Max.   :14.00  
##  (Other)            :30
names(CD)
## [1] "Cereal"   "Calories" "Sugar"    "Fiber"
head(CD)
##                  Cereal Calories Sugar Fiber
## 1 Common Sense Oat Bran      100     6     3
## 2            Product 19      100     3     1
## 3   All Bran Xtra Fiber       50     0    14
## 4            Just Right      140     9     2
## 5     Original Oat Bran       70     5    10
## 6             Heartwise       90     5     6
attach(CD)

Graph calories vs. sugar:

plot(Calories ~ Sugar, col = "red", pch = 19, xlab = "Sugar in grams", ylab = "Calories", 
    main = "Calories vs. Sugar")

[Figure: scatterplot of Calories vs. Sugar]

There is a general trend that increasing sugar is associated with increased calories, but the scatter in the data is so large that it is hard to decide on a model that might fit the data.

b. We fit the model

Calories = \( \beta_0 \) + \( \beta_1 \cdot \) Sugar + \( \epsilon \),

with \( \epsilon \sim N(0,\sigma) \).

Fit model:

CSline = lm(Calories ~ Sugar)
summary(CSline)
## 
## Call:
## lm(formula = Calories ~ Sugar)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37.43  -9.83   0.24   8.91  40.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   87.428      5.163   16.93   <2e-16 ***
## Sugar          2.481      0.707    3.51   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 19.3 on 34 degrees of freedom
## Multiple R-squared: 0.266,   Adjusted R-squared: 0.244 
## F-statistic: 12.3 on 1 and 34 DF,  p-value: 0.0013

The least squares regression line is

\( \widehat{\text{Calories}} \) = 87.428 + 2.481*Sugar.

c. The slope is 2.481, in units of calories per gram of sugar. It indicates that each additional gram of sugar in a cereal is associated with an expected increase of 2.481 calories.
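
One way to see this interpretation is to compare the predictions for two cereals that differ by exactly 1 gram of sugar. The sketch below uses the rounded coefficients from the output above, so it runs without Cereal.csv. The helper predicted_calories is a made-up name, and the sugar values 5 and 6 are arbitrary.

```r
# Coefficients copied (rounded) from the summary(CSline) output
b0 <- 87.428
b1 <- 2.481

# Hypothetical helper: predicted calories for a given amount of sugar (grams)
predicted_calories <- function(sugar) b0 + b1 * sugar

# Difference between predictions for cereals that differ by 1 g of sugar
predicted_calories(6) - predicted_calories(5)
```

Whatever starting value is chosen, the difference equals the slope (up to floating-point rounding).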

Question 1.9:

a. The fitted model is \( \widehat{\text{Calories}} \) = 87.428 + 2.481*Sugar. If Sugar = 10 then the fitted value is \( \widehat{\text{Calories}} \) = 87.428 + 2.481*10 \( \approx \) 112.24.

Here's a way to do this using only R functions. Pass predict the linear model CSline and the argument newdata, a data.frame containing the values of the explanatory variable for which you want predictions.

predict(CSline, newdata = data.frame(Sugar = c(10)))
##     1 
## 112.2
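
predict can return several predictions at once, and its results should agree with a hand computation from the estimated coefficients. The sketch below shows both. It uses the built-in cars data as a stand-in (an assumption, so the block runs without Cereal.csv); the names fit, new_speeds, by_predict and by_hand are illustrative.

```r
# Stand-in example: built-in 'cars' data, so the block runs without Cereal.csv.
fit <- lm(dist ~ speed, data = cars)

# Ask for predictions at several values of the explanatory variable at once.
# The column name in newdata must match the variable name used in the formula.
new_speeds <- data.frame(speed = c(10, 15, 20))
by_predict <- predict(fit, newdata = new_speeds)

# The same predictions computed by hand from the estimated coefficients.
b <- coef(fit)                            # b[1] = intercept, b[2] = slope
by_hand <- b[1] + b[2] * new_speeds$speed

all.equal(unname(by_predict), unname(by_hand))
```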

b. The model predicts that Cheerios should have 87.428 + 2.481*1 = 89.909 calories. The residual is 110 - 89.909 = 20.091 calories.

110 - predict(CSline, newdata = data.frame(Sugar = c(1)))
##     1 
## 20.09
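
For an observation that was used to fit the model, residuals() returns the same quantity directly. This is a small sketch, again on the built-in cars data as a stand-in rather than the cereal data; the names fit, by_subtraction and by_residuals are illustrative.

```r
# Stand-in example: built-in 'cars' data, NOT the cereal data.
fit <- lm(dist ~ speed, data = cars)

# Residual for the first observation: observed response minus predicted response
by_subtraction <- cars$dist[1] - predict(fit, newdata = cars[1, ])

# The same residual as stored in the fitted model
by_residuals <- residuals(fit)[1]

all.equal(unname(by_subtraction), unname(by_residuals))
```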

c.

plot(Calories ~ Sugar, col = "red", pch = 19, xlab = "Sugar in grams", ylab = "Calories", 
    main = "Calories vs. Sugar")
lines(Sugar, CSline$fitted, col = "blue", lwd = 4)
RSE = 19.27
lines(Sugar, CSline$fitted + RSE, col = "green", lwd = 2)
lines(Sugar, CSline$fitted - RSE, col = "green", lwd = 2)

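[Figure: Calories vs. Sugar with the fitted regression line (blue) and lines 1 residual standard error above and below it (green)]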

The linear regression model does seem to capture the general trend of the data, and the band between the green lines (drawn 1 residual standard error above and below the regression line) contains most of the data points. Still, I'm nervous about the large amount of scatter around the fitted line.
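
A variation on this plot reads the residual standard error from the fitted model instead of typing it in, and counts the proportion of points that fall inside the band. The sketch below uses the built-in cars data as a stand-in (an assumption, so it runs without Cereal.csv); the names fit, rse, b and within_band are illustrative.

```r
# Stand-in example: built-in 'cars' data, so the block runs without Cereal.csv.
fit <- lm(dist ~ speed, data = cars)
rse <- summary(fit)$sigma   # residual standard error, read from the fit
b   <- coef(fit)            # b[1] = intercept, b[2] = slope

plot(dist ~ speed, data = cars, col = "red", pch = 19)
abline(fit, col = "blue", lwd = 4)                         # fitted line
abline(a = b[1] + rse, b = b[2], col = "green", lwd = 2)   # 1 RSE above
abline(a = b[1] - rse, b = b[2], col = "green", lwd = 2)   # 1 RSE below

# Proportion of observations lying within 1 RSE of the fitted line
within_band <- mean(abs(residuals(fit)) <= rse)
within_band
```

Using abline avoids any dependence on the order of the x values, and within_band puts a number on "most of the data points".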