Day 14 homework solutions

Exercise 2.13: Breakfast Cereal

Get Data:

setwd("/Users/traves/Dropbox/SM339/day14")
CL = read.csv("Cereal.csv")
summary(CL)
##                  Cereal      Calories       Sugar           Fiber      
##  100% Bran          : 1   Min.   : 50   Min.   : 0.00   Min.   : 0.00  
##  All Bran Xtra Fiber: 1   1st Qu.: 90   1st Qu.: 1.75   1st Qu.: 1.00  
##  Batman             : 1   Median :104   Median : 5.00   Median : 3.00  
##  Bran Buds          : 1   Mean   :102   Mean   : 5.71   Mean   : 3.59  
##  Bran Flakes        : 1   3rd Qu.:110   3rd Qu.: 9.07   3rd Qu.: 4.25  
##  Capt. Crunch       : 1   Max.   :160   Max.   :15.00   Max.   :14.00  
##  (Other)            :30
attach(CL)

(a) Fit the model Calories~Sugar:

CS = lm(Calories ~ Sugar)
summary(CS)
## 
## Call:
## lm(formula = Calories ~ Sugar)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37.43  -9.83   0.24   8.91  40.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   87.428      5.163   16.93   <2e-16 ***
## Sugar          2.481      0.707    3.51   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 19.3 on 34 degrees of freedom
## Multiple R-squared: 0.266,   Adjusted R-squared: 0.244 
## F-statistic: 12.3 on 1 and 34 DF,  p-value: 0.0013

The coefficient \( \hat{\beta}_1 \) of Sugar in the fitted model is 2.4808 and its standard error is 0.7074. This gives a test statistic of 2.4808/0.7074 = 3.507 which, under the null hypothesis \( \beta_1 = 0 \), follows a Student's t distribution with n-2 = 34 degrees of freedom. Computing the two-sided p-value gives:

p = 2 * (1 - pt(3.507, df = 34))
# p = 0.001295724

So we would reject the null hypothesis (that \( \beta_1 = 0 \)) at any significance level larger than 0.001295724. We conclude that Sugar is a useful predictor of Calories in the linear model (p-value 0.0013).
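The hand computation above retypes rounded values from the summary table. As a minimal sketch, the same test can be done by pulling the estimate and standard error straight from the fitted model. The data below are synthetic stand-ins (they are not the values in Cereal.csv), and the names `sugar`, `calories` and `fit` are chosen only for this illustration.

```r
# Synthetic stand-in data: 36 made-up cereals, NOT the real Cereal.csv values.
set.seed(1)
sugar <- runif(36, min = 0, max = 15)
calories <- 87 + 2.5 * sugar + rnorm(36, sd = 19)
fit <- lm(calories ~ sugar)

# coef(summary(fit)) is a matrix with columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coefs <- coef(summary(fit))
b1 <- coefs["sugar", "Estimate"]
se <- coefs["sugar", "Std. Error"]

t_stat <- b1 / se              # test statistic for H0: beta_1 = 0
df <- df.residual(fit)         # n - 2 residual degrees of freedom
p_manual <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# The hand-computed p-value should agree with the one lm() reports.
all.equal(p_manual, coefs["sugar", "Pr(>|t|)"])
```

Working from the model object avoids rounding error in the retyped numbers and keeps the degrees of freedom tied to the data actually used in the fit.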

(b) Let's compute the 95% CI for \( \beta_1 \):

B1 = 2.4808
SE = 0.7074
tstar = qt(0.975, df = 34)
low = B1 - tstar * SE
high = B1 + tstar * SE
print(c(low, high))
## [1] 1.043 3.918

A 95% confidence interval for the slope of Sugar in the regression model is (1.04, 3.92). That is, we are 95% confident that each extra gram of sugar is associated with an increase of between 1.04 and 3.92 calories in the mean calorie content.
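R's built-in `confint()` computes the same kind of interval directly from a fitted `lm` object. The sketch below checks that it matches the estimate ± t* × SE recipe used above. It runs on synthetic data (not Cereal.csv), and the variable names are assumptions made for the example.

```r
# Synthetic stand-in data, NOT the real Cereal.csv values.
set.seed(2)
sugar <- runif(36, min = 0, max = 15)
calories <- 87 + 2.5 * sugar + rnorm(36, sd = 19)
fit <- lm(calories ~ sugar)

# Manual interval: estimate +/- t* x standard error.
est <- coef(summary(fit))["sugar", c("Estimate", "Std. Error")]
tstar <- qt(0.975, df = df.residual(fit))
manual_ci <- unname(est["Estimate"] + c(-1, 1) * tstar * est["Std. Error"])

# Built-in interval for the same coefficient.
builtin_ci <- unname(confint(fit, "sugar", level = 0.95)[1, ])

# The two approaches should give the same endpoints.
all.equal(manual_ci, builtin_ci)
```

With the real data attached as above, `confint(CS, "Sugar")` would be the corresponding call.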

Exercise 2.14: Textbook Prices

Load Data:

setwd("/Users/traves/Dropbox/SM339/day14")
TP <- read.csv("TextPrices.csv")
summary(TP)
##      Pages          Price       
##  Min.   :  51   Min.   :  4.25  
##  1st Qu.: 212   1st Qu.: 17.59  
##  Median : 456   Median : 55.12  
##  Mean   : 464   Mean   : 65.02  
##  3rd Qu.: 672   3rd Qu.: 95.75  
##  Max.   :1060   Max.   :169.75
attach(TP)

Fit a line of best fit for the model Price ~ Pages:

PP = lm(Price ~ Pages)
plot(Price ~ Pages, col = "red", pch = 19, main = "Textbook prices versus Number of Pages", 
    ylab = "Price in dollars")
abline(PP, col = "blue", lwd = 3)

[Figure: scatterplot "Textbook prices versus Number of Pages" with the fitted regression line]

summary(PP)
## 
## Call:
## lm(formula = Price ~ Pages)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65.47 -12.32  -0.58  15.30  72.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4223    10.4637   -0.33     0.75    
## Pages         0.1473     0.0193    7.65  2.5e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 29.8 on 28 degrees of freedom
## Multiple R-squared: 0.677,   Adjusted R-squared: 0.665 
## F-statistic: 58.6 on 1 and 28 DF,  p-value: 2.45e-08

The fitted model is: Price = -3.42 + 0.147*Pages.

(a) Let's check whether the fitted model indicates that Pages is a useful predictor of price. That is, we test whether the coefficient of Pages in the model is 0 (null hypothesis) or non-zero (alternative hypothesis). We do the test several ways:

# T-test for $\beta_1 = 0$ vs. $\beta_1 \neq 0$
# test statistic = $\hat{\beta}_1/SE(\hat{\beta}_1)$
test.statistic = 0.14733/0.01925  # = 7.653506
n = length(Pages)  # n = 30
p = 2 * (1 - pt(test.statistic, df = n - 2))  # two-sided p-value: prob(|T| > observed test.statistic) = 2.450631e-08
p
## [1] 2.451e-08

Since the p-value is so low (2.450631e-08), we reject the null hypothesis. The data strongly support the alternative hypothesis that the slope coefficient is non-zero, so Pages is a useful predictor of a textbook's price.
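A side note on computing very small p-values: the form `1 - pt(...)` subtracts two numbers that are nearly equal, so it loses precision as the tail probability shrinks. The `lower.tail = FALSE` argument of `pt()` asks for the upper tail directly. The sketch below compares the two forms at the observed statistic and at a hypothetical, much larger statistic (`t_big` is made up purely to show the effect).

```r
df <- 28

# At the observed statistic both forms agree to many digits.
t_obs <- 0.14733 / 0.01925
p_obs_naive <- 2 * (1 - pt(t_obs, df = df))
p_obs_stable <- 2 * pt(t_obs, df = df, lower.tail = FALSE)

# Hypothetical, much larger statistic (not from the textbook data).
t_big <- 40
p_naive <- 2 * (1 - pt(t_big, df = df))                  # pt() rounds to 1 here
p_stable <- 2 * pt(t_big, df = df, lower.tail = FALSE)   # upper tail computed directly

print(c(naive = p_naive, stable = p_stable))
```

For this homework the difference is immaterial, but the `lower.tail = FALSE` form is the safer habit when test statistics are large.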

Here's a second way to test that the linear term is useful. We'll use ANOVA.

anova(PP)
## Analysis of Variance Table
## 
## Response: Price
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## Pages      1  51877   51877    58.6 2.5e-08 ***
## Residuals 28  24799     886                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table also reports a p-value of 2.452e-08. Since this is so low we'd reject the null hypothesis (that \( \beta_1 = 0 \)) and conclude that the slope coefficient is non-zero, and that Pages is a useful predictor of a textbook's price.
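The agreement between the two tests is not a coincidence: in a simple linear regression with one predictor, the ANOVA F statistic is the square of the slope's t statistic, and the two p-values coincide. A short check using the rounded estimate and standard error reported by summary(PP) (so the last digits may differ slightly from R's full-precision output):

```r
# Values copied from the summary(PP) output above (rounded).
t_stat <- 0.14733 / 0.01925   # slope estimate / standard error
df_resid <- 28                # residual degrees of freedom, n - 2

f_stat <- t_stat^2            # should be close to the F value of 58.6

# Two-sided t-test p-value and the F-test p-value on (1, 28) df.
p_from_t <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
p_from_f <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(c(F = f_stat, p_t = p_from_t, p_F = p_from_f))
```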

(b) Let's get a 95% confidence interval for the population slope coefficient (that is, for the true value of \( \beta_1 \)).

tstar = qt(0.975, df = n - 2)
SE = 0.01925
B1 = 0.14733
low = B1 - tstar * SE
high = B1 + tstar * SE
print(c(low, high))
## [1] 0.1079 0.1868

A 95% confidence interval for the population slope (the additional price in dollars for each page of the book) is (0.108, 0.187), that is, between 10.8 and 18.7 cents per page. Note that zero is not in this interval, which is consistent with rejecting \( \beta_1 = 0 \) at the 5% level and with the conclusion that Pages is a useful predictor of Price.