HW#3

Reading: Ch. 2.2 - 2.5
Exercises to hand in: 2.14, 2.15, 2.30, 2.31, 2.32

require(mosaic)
require(Stat2Data)

Exercise 2.14

data(TextPrices)
xyplot(Price ~ Pages, data=TextPrices)

#Since it looks linear, continue with a simple linear regression model.
slr=lm(Price ~ Pages, data=TextPrices)
summary(slr)

## 
## Call:
## lm(formula = Price ~ Pages, data = TextPrices)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.475 -12.324  -0.584  15.304  72.991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.42231   10.46374  -0.327    0.746    
## Pages        0.14733    0.01925   7.653 2.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared:  0.6766, Adjusted R-squared:  0.665 
## F-statistic: 58.57 on 1 and 28 DF,  p-value: 2.452e-08

Our hypothesis is that the number of pages in a textbook has a positive linear relationship with the price of the textbook. In particular, we are testing \(H_0: \beta_1 = 0 vs. H_1: \beta_1 \neq 0\). From our summary command, we see that the t-statistic is 7.653 with a statistically significant p-value. These calculations mean that the number of pages is a useful predictor of textbook price.

confint(slr, level=0.95)

##                   2.5 %    97.5 %
## (Intercept) -24.8563229 18.011694
## Pages         0.1078959  0.186761

The 95% confidence interval for the population slope coefficient is [0.1078959, 0.186761]. This means that we are 95% confident that the population parameter is within the interval [0.1078959, 0.186761]. Similarly, 95% of the data is within the confidence interval.

Exercise 2.15

# Make a data frame of the Pages that you want to predict
pages.new = data.frame("Pages" = c(450))
predict(slr, pages.new, interval="confidence")

##        fit      lwr      upr
## 1 62.87549 51.73074 74.02024

The 95% confidence interval for the mean price of a 450-page textbook in the population is [51.73, 74.02].

predict(slr, pages.new, interval="predict")

##        fit       lwr      upr
## 1 62.87549 0.9035981 124.8474

The 95% confidence interval for the individual price of a particular 450-age textbook in the population is [.90, 124.85].
The midpoints of these two intervals are the same. This makes sense because the two types of intervals are essentially the same except that the prediction interval has a larger standard error. Moreover, both intervals are centered around the single predicted value calculated from the model. Thus, no matter how wide the intervals are, their centers must be equal.
The widths of these two intervals are very different. This makes sense because the standard error for the prediction interval is much larger and therefore has a wider interval. The standard error is larger than that for the confidence interval because it is more difficult to predict one individual response than to predict a mean response.
The mean number of pages, which is 464, would produce the narrowest possible prediction interval for its price because the mean minimizes the standard error in the model.

pages.new1 = data.frame("Pages" = c(1500))
predict(slr, pages.new1, interval="predict")

##        fit      lwr     upr
## 1 217.5704 143.3587 291.782

The 95% prediction interval for the price of a particular 1500-page textbook in the population is [143.36, 291.78]. We do not really have 95% confidence in this interval because we are asking the model to predict a price outside of the range of data available. Based on the limited data we have and the model we have created based on that data, we are able to predict that the price of an individual 1500-page textbook is somewhere between [143.36, 291.78], but we are not sure if our model is a good fit for any textbooks with greater than 1060 pages (the maximum number of pages in our population).

Exercise 2.30

data(ChildSpeaks)

I would expect to see a negative relationship between the age at which a child first speaks and his or her score on the Gesell Aptitude Test because the sooner a child starts to speak, the sooner he or she can learn from his or her environment, which I presume will build aptitude. Thus, I expect to see the children who began speaking at younger ages with higher Gesell scores.

xyplot(Gesell ~ Age, data=ChildSpeaks, type=c("p","r"))

Based on the scatterplot, the age of first speaking appears to have a negative correlation with Gesell score and therefore seems to be a useful predictor of the Gesell score.

mod=lm(Gesell ~ Age, data=ChildSpeaks)
summary(mod)

## 
## Call:
## lm(formula = Gesell ~ Age, data = ChildSpeaks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.604  -8.731   1.396   4.523  30.285 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 109.8738     5.0678  21.681 7.31e-15 ***
## Age          -1.1270     0.3102  -3.633  0.00177 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.02 on 19 degrees of freedom
## Multiple R-squared:   0.41,  Adjusted R-squared:  0.3789 
## F-statistic:  13.2 on 1 and 19 DF,  p-value: 0.001769

The regression equation is \(Gesell = 109.8738 - 1.127*Age\). The value of r-squared is 0.41. The relationship between these two variables is statistically significant because both have very small p-values.

# The residuals from your regression
residuals(mod)

##           1           2           3           4           5           6 
##   2.0309931  -9.5721288 -15.6039514  -8.7309404   9.0309931  -0.3340623 
##           7           8           9          10          11          12 
##   3.4119599   2.5230375   3.1420707   6.6659377  11.0150818  -3.7309404 
##          13          14          15          16          17          18 
## -15.6039514 -13.4769625   4.5230375   1.3960486   8.6500264  -5.5403062 
##          19          20          21 
##  30.2849710 -11.4769625   1.3960486

# The fitted values from your regression
fitted.values(mod)

##         1         2         3         4         5         6         7 
##  92.96901  80.57213  98.60395  99.73094  92.96901  87.33406  89.58804 
##         8         9        10        11        12        13        14 
##  97.47696 100.85793  87.33406 101.98492  99.73094  98.60395  97.47696 
##        15        16        17        18        19        20        21 
##  97.47696  98.60395  96.34997  62.54031  90.71503  97.47696  98.60395

Child #19 has the largest residual value, which is 30.2849710. The unusual thing about this child is that despite what our fitted model might allude, he or she is not the youngest child and yet has the highest Gesell score.

Exercise 2.31

ChildSpeaks.small=subset(ChildSpeaks, Age < 42)
xyplot(Gesell ~ Age, data=ChildSpeaks.small, type=c("p","r"))

mod1=lm(Gesell ~ Age, data=ChildSpeaks.small)
summary(mod1)

## 
## Call:
## lm(formula = Gesell ~ Age, data = ChildSpeaks.small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.838  -8.477   1.779   4.688  28.617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 105.6299     7.1619  14.749 1.71e-11 ***
## Age          -0.7792     0.5167  -1.508    0.149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.11 on 18 degrees of freedom
## Multiple R-squared:  0.1122, Adjusted R-squared:  0.06284 
## F-statistic: 2.274 on 1 and 18 DF,  p-value: 0.1489

Once we remove the data case for the child who took 42 months to speak, we see from the scatterplot that the negative relationship between age and Gesell score is less intense. In fact, our fitted regression model equation is now \(Gesell = 105.6299 - 0.7792*Age\), which tells us that the slope coefficient for age has become less negative compared to that in question 2.30. The value of r-squared has also decreased to 0.1122.

Exercise 2.32

ChildSpeaks.smaller=subset(ChildSpeaks, Age < 26)
xyplot(Gesell ~ Age, ChildSpeaks.smaller, type=c("p","r"))

mod2=lm(Gesell ~ Age, data=ChildSpeaks.smaller)
summary(mod2)

## 
## Call:
## lm(formula = Gesell ~ Age, data = ChildSpeaks.smaller)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.991  -7.599  -1.078   5.270  24.618 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.86180    8.02578   12.19 7.88e-10 ***
## Age         -0.08707    0.62174   -0.14     0.89    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.48 on 17 degrees of freedom
## Multiple R-squared:  0.001152,   Adjusted R-squared:  -0.0576 
## F-statistic: 0.01961 on 1 and 17 DF,  p-value: 0.8903

When we removed the child who took 26 months to speak, we see that the linear relationship between age and Gesell score has become almost completely flat. The scatterplot shows that there is a very small negative relationship between age and Gesell score. In fact, the slope coefficient has increased to a very small negative number and the value of r-squared has decreased to 0.001152. Interestingly, once we remove two outliers, it seems that there is almost no relationship between the age one begins to speak and Gesell aptitude score. This information leads us to believe that our initial hypothesis was wrong.