Exercise 2.13: Breakfast Cereal
Get Data:
setwd("/Users/traves/Dropbox/SM339/day14")
CL = read.csv("Cereal.csv")
summary(CL)
## Cereal Calories Sugar Fiber
## 100% Bran : 1 Min. : 50 Min. : 0.00 Min. : 0.00
## All Bran Xtra Fiber: 1 1st Qu.: 90 1st Qu.: 1.75 1st Qu.: 1.00
## Batman : 1 Median :104 Median : 5.00 Median : 3.00
## Bran Buds : 1 Mean :102 Mean : 5.71 Mean : 3.59
## Bran Flakes : 1 3rd Qu.:110 3rd Qu.: 9.07 3rd Qu.: 4.25
## Capt. Crunch : 1 Max. :160 Max. :15.00 Max. :14.00
## (Other) :30
attach(CL)
(a) Fit the model Calories~Sugar:
CS = lm(Calories ~ Sugar)
summary(CS)
##
## Call:
## lm(formula = Calories ~ Sugar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.43 -9.83 0.24 8.91 40.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.428 5.163 16.93 <2e-16 ***
## Sugar 2.481 0.707 3.51 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.3 on 34 degrees of freedom
## Multiple R-squared: 0.266, Adjusted R-squared: 0.244
## F-statistic: 12.3 on 1 and 34 DF, p-value: 0.0013
The coefficient \( \hat{\beta}_1 \) of Sugar in the fitted model is 2.4808 and its standard error is 0.7074. This gives a test statistic of 2.4808/0.7074 = 3.507 which, under the null hypothesis \( \beta_1 = 0 \), follows a Student's t distribution with n - 2 = 34 degrees of freedom. The two-sided p-value is:
p = 2 * (1 - pt(3.507, df = 34))
# p = 0.001295724
So we would reject the null hypothesis (that \( \beta_1 = 0 \)) at any significance level greater than 0.0013. We conclude that Sugar is a useful predictor of Calories in the linear model (p-value 0.0013).
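The same calculation can be wrapped in a small helper so it can be reused for other slopes. This is only a sketch: `slope_t_test` is a hypothetical name, not a function from R or from the course materials, and it takes the estimate and standard error as typed-in numbers read off the summary table. It uses `lower.tail = FALSE` instead of `1 - pt(...)`, which keeps more precision when the p-value is very small.

```r
# Hypothetical helper (not part of base R): two-sided t-test for a slope,
# given the estimate, its standard error, and the residual degrees of freedom.
slope_t_test <- function(estimate, se, df) {
  t_stat  <- estimate / se
  # Upper-tail probability, doubled for the two-sided alternative.
  p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  list(t = t_stat, p = p_value)
}

# Values read from summary(CS) above.
res <- slope_t_test(2.4808, 0.7074, df = 34)
print(res)
```

The result should agree with the `Sugar` row of the coefficient table up to rounding.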
(b) Let's compute the 95% CI for \( \beta_1 \):
B1 = 2.4808
SE = 0.7074
tstar = qt(0.975, df = 34)
low = B1 - tstar * SE
high = B1 + tstar * SE
print(c(low, high))
## [1] 1.043 3.918
A 95% confidence interval for the slope of Sugar in the regression model is (1.04, 3.92). That is, we are 95% confident that each extra gram of sugar is associated with an average increase of between 1.04 and 3.92 calories.
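R can also produce this interval directly from a fitted model with `confint()`, which avoids retyping rounded coefficients. The sketch below checks that the built-in function matches the manual "estimate plus or minus t* times SE" recipe. Since Cereal.csv is not available here, it uses simulated data; the data frame `sim`, its columns `x` and `y`, and the model `fit` are made up for illustration and are not the cereal data.

```r
# Simulated data for illustration only; these are NOT the Cereal.csv values.
set.seed(1)
sim <- data.frame(x = runif(36, min = 0, max = 15))
sim$y <- 90 + 2.5 * sim$x + rnorm(36, sd = 20)

fit <- lm(y ~ x, data = sim)

# Manual interval: read the estimate and SE from the coefficient table.
coefs  <- coef(summary(fit))
b1     <- coefs["x", "Estimate"]
se     <- coefs["x", "Std. Error"]
tstar  <- qt(0.975, df = df.residual(fit))
manual <- c(b1 - tstar * se, b1 + tstar * se)

# Built-in interval for the same coefficient.
builtin <- confint(fit, "x", level = 0.95)

print(manual)
print(builtin)
```

With the cereal model, the analogous call would be `confint(CS, "Sugar", level = 0.95)`.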
Exercise 2.14: Textbook Prices
Load Data:
setwd("/Users/traves/Dropbox/SM339/day14")
TP <- read.csv("TextPrices.csv")
summary(TP)
## Pages Price
## Min. : 51 Min. : 4.25
## 1st Qu.: 212 1st Qu.: 17.59
## Median : 456 Median : 55.12
## Mean : 464 Mean : 65.02
## 3rd Qu.: 672 3rd Qu.: 95.75
## Max. :1060 Max. :169.75
attach(TP)
Fit the least-squares line for the model Price ~ Pages:
PP = lm(Price ~ Pages)
plot(Price ~ Pages, col = "red", pch = 19, main = "Textbook prices versus Number of Pages",
ylab = "Price in dollars")
abline(PP, col = "blue", lwd = 3)
summary(PP)
##
## Call:
## lm(formula = Price ~ Pages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.47 -12.32 -0.58 15.30 72.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4223 10.4637 -0.33 0.75
## Pages 0.1473 0.0193 7.65 2.5e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.8 on 28 degrees of freedom
## Multiple R-squared: 0.677, Adjusted R-squared: 0.665
## F-statistic: 58.6 on 1 and 28 DF, p-value: 2.45e-08
The fitted model is: Price = -3.42 + 0.147*Pages.
(a) Let's check whether the fitted model indicates that Pages is a useful predictor of price. That is, we test whether the coefficient of Pages in the model is 0 (null hypothesis) or non-zero (alternative hypothesis). We do the test several ways:
# t-test of H0: beta_1 = 0 vs. Ha: beta_1 != 0
# test statistic = (estimate of beta_1) / SE(estimate of beta_1)
test.statistic = 0.14733/0.01925 # = 7.653506
n = length(Pages) # n = 30
p = 2 * (1 - pt(test.statistic, df = n - 2)) # two-sided p-value: P(|T| > observed test statistic) = 2.450631e-08
p
## [1] 2.451e-08
Since the p-value is so small (2.45e-08), we reject the null hypothesis. The data strongly support the alternative hypothesis that the slope coefficient is non-zero, so Pages is a useful predictor of a textbook's price.
Here's a second way to test that the linear term is useful. We'll use ANOVA.
anova(PP)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Pages 1 51877 51877 58.6 2.5e-08 ***
## Residuals 28 24799 886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table also reports a p-value of 2.452e-08. Since this is so low we'd reject the null hypothesis (that \( \beta_1 = 0 \)) and conclude that the slope coefficient is non-zero, and that Pages is a useful predictor of a textbook's price.
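The agreement between the two tests is not a coincidence. In a simple linear regression with a single predictor, the ANOVA F statistic is the square of the slope's t statistic, and the F test on (1, n - 2) degrees of freedom gives the same p-value as the two-sided t-test. A quick check, using the rounded coefficient values typed in from the summary above (so the last digits may differ slightly from the ANOVA table):

```r
# Rounded values read from summary(PP): slope estimate and its standard error.
t_stat <- 0.14733 / 0.01925
f_stat <- t_stat^2            # about 58.6, matching the F value in the table

# Two-sided t-test p-value on 28 degrees of freedom.
p_from_t <- 2 * pt(t_stat, df = 28, lower.tail = FALSE)

# Upper-tail F-test p-value on (1, 28) degrees of freedom.
p_from_f <- pf(f_stat, df1 = 1, df2 = 28, lower.tail = FALSE)

print(c(t = t_stat, F = f_stat))
print(c(p_from_t = p_from_t, p_from_f = p_from_f))
```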
(b) Let's get a 95% confidence interval for the population slope coefficient (that is, for the true value of \( \beta_1 \)).
tstar = qt(0.975, df = n - 2)
SE = 0.01925
B1 = 0.14733
low = B1 - tstar * SE
high = B1 + tstar * SE
print(c(low, high))
## [1] 0.1079 0.1868
A 95% confidence interval for the population slope (the additional price in dollars for each additional page) is (0.108, 0.187), that is, between 10.8 and 18.7 cents per page. Zero is not in this interval, which is consistent with the test in part (a): Pages is a useful predictor of Price.