require(mosaic)
require(Stat2Data)
data(TextPrices)
xyplot(Price ~ Pages, data=TextPrices)
#Since it looks linear, continue with a simple linear regression model.
slr=lm(Price ~ Pages, data=TextPrices)
summary(slr)
##
## Call:
## lm(formula = Price ~ Pages, data = TextPrices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.475 -12.324 -0.584 15.304 72.991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.42231 10.46374 -0.327 0.746
## Pages 0.14733 0.01925 7.653 2.45e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared: 0.6766, Adjusted R-squared: 0.665
## F-statistic: 58.57 on 1 and 28 DF, p-value: 2.452e-08
confint(slr, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -24.8563229 18.011694
## Pages 0.1078959 0.186761
# Make a data frame of the Pages that you want to predict
pages.new = data.frame("Pages" = c(450))
predict(slr, pages.new, interval="confidence")
## fit lwr upr
## 1 62.87549 51.73074 74.02024
predict(slr, pages.new, interval="predict")
## fit lwr upr
## 1 62.87549 0.9035981 124.8474
The 95% confidence interval for the individual price of a particular 450-age textbook in the population is [.90, 124.85].
The midpoints of these two intervals are the same. This makes sense because the two types of intervals are essentially the same except that the prediction interval has a larger standard error. Moreover, both intervals are centered around the single predicted value calculated from the model. Thus, no matter how wide the intervals are, their centers must be equal.
The widths of these two intervals are very different. This makes sense because the standard error for the prediction interval is much larger and therefore has a wider interval. The standard error is larger than that for the confidence interval because it is more difficult to predict one individual response than to predict a mean response.
The mean number of pages, which is 464, would produce the narrowest possible prediction interval for its price because the mean minimizes the standard error in the model.
pages.new1 = data.frame("Pages" = c(1500))
predict(slr, pages.new1, interval="predict")
## fit lwr upr
## 1 217.5704 143.3587 291.782
data(ChildSpeaks)
xyplot(Gesell ~ Age, data=ChildSpeaks, type=c("p","r"))
mod=lm(Gesell ~ Age, data=ChildSpeaks)
summary(mod)
##
## Call:
## lm(formula = Gesell ~ Age, data = ChildSpeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.604 -8.731 1.396 4.523 30.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.8738 5.0678 21.681 7.31e-15 ***
## Age -1.1270 0.3102 -3.633 0.00177 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.02 on 19 degrees of freedom
## Multiple R-squared: 0.41, Adjusted R-squared: 0.3789
## F-statistic: 13.2 on 1 and 19 DF, p-value: 0.001769
# The residuals from your regression
residuals(mod)
## 1 2 3 4 5 6
## 2.0309931 -9.5721288 -15.6039514 -8.7309404 9.0309931 -0.3340623
## 7 8 9 10 11 12
## 3.4119599 2.5230375 3.1420707 6.6659377 11.0150818 -3.7309404
## 13 14 15 16 17 18
## -15.6039514 -13.4769625 4.5230375 1.3960486 8.6500264 -5.5403062
## 19 20 21
## 30.2849710 -11.4769625 1.3960486
# The fitted values from your regression
fitted.values(mod)
## 1 2 3 4 5 6 7
## 92.96901 80.57213 98.60395 99.73094 92.96901 87.33406 89.58804
## 8 9 10 11 12 13 14
## 97.47696 100.85793 87.33406 101.98492 99.73094 98.60395 97.47696
## 15 16 17 18 19 20 21
## 97.47696 98.60395 96.34997 62.54031 90.71503 97.47696 98.60395
ChildSpeaks.small=subset(ChildSpeaks, Age < 42)
xyplot(Gesell ~ Age, data=ChildSpeaks.small, type=c("p","r"))
mod1=lm(Gesell ~ Age, data=ChildSpeaks.small)
summary(mod1)
##
## Call:
## lm(formula = Gesell ~ Age, data = ChildSpeaks.small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.838 -8.477 1.779 4.688 28.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.6299 7.1619 14.749 1.71e-11 ***
## Age -0.7792 0.5167 -1.508 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.11 on 18 degrees of freedom
## Multiple R-squared: 0.1122, Adjusted R-squared: 0.06284
## F-statistic: 2.274 on 1 and 18 DF, p-value: 0.1489
Once we remove the data case for the child who took 42 months to speak, we see from the scatterplot that the negative relationship between age and Gesell score is less intense. In fact, our fitted regression model equation is now \(Gesell = 105.6299 - 0.7792*Age\), which tells us that the slope coefficient for age has become less negative compared to that in question 2.30. The value of r-squared has also decreased to 0.1122.
ChildSpeaks.smaller=subset(ChildSpeaks, Age < 26)
xyplot(Gesell ~ Age, ChildSpeaks.smaller, type=c("p","r"))
mod2=lm(Gesell ~ Age, data=ChildSpeaks.smaller)
summary(mod2)
##
## Call:
## lm(formula = Gesell ~ Age, data = ChildSpeaks.smaller)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.991 -7.599 -1.078 5.270 24.618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.86180 8.02578 12.19 7.88e-10 ***
## Age -0.08707 0.62174 -0.14 0.89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.48 on 17 degrees of freedom
## Multiple R-squared: 0.001152, Adjusted R-squared: -0.0576
## F-statistic: 0.01961 on 1 and 17 DF, p-value: 0.8903
When we removed the child who took 26 months to speak, we see that the linear relationship between age and Gesell score has become almost completely flat. The scatterplot shows that there is a very small negative relationship between age and Gesell score. In fact, the slope coefficient has increased to a very small negative number and the value of r-squared has decreased to 0.001152. Interestingly, once we remove two outliers, it seems that there is almost no relationship between the age one begins to speak and Gesell aptitude score. This information leads us to believe that our initial hypothesis was wrong.