Hw #3

Reading: Ch. 2
Exercises to hand in: 2.12, 2.14, 2.15, 2.30, 2.44

library(tidyverse)

library(Stat2Data)
library(skimr)

Exercise 2.12

2.12 Partitioning variability. The sum of squares for the regression model, SSModel, for the regression of Y on X was 110, the sum of squared errors, SSE, was 40, and the total sum of squares, SSTotal, was 150. Calculate and interpret the value of r^2.

r^2=SSM/SST=1-(SSE/SST)

rsq<-110/150
rsq2<-1-(40/150)
rsq

## [1] 0.7333333

rsq2

## [1] 0.7333333

Plugging the given values of SSModel, SSE, and SSTotal into either formula gives r^2 = .7333. This means that about 73.3% of the variability in the response variable Y is explained by the regression on X.
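As a quick sanity check on the arithmetic above, the sketch below confirms that the three sums of squares are consistent with the partition SSTotal = SSModel + SSE and that both formulas give the same r^2. The variable names are my own, chosen only for illustration.

```r
# Sums of squares given in Exercise 2.12
ss_model <- 110
ss_error <- 40
ss_total <- 150

# The partition SSTotal = SSModel + SSE should hold
partition_ok <- isTRUE(all.equal(ss_model + ss_error, ss_total))

# Two equivalent ways of computing r^2
r2_from_model <- ss_model / ss_total
r2_from_error <- 1 - ss_error / ss_total

partition_ok
all.equal(r2_from_model, r2_from_error)
```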

Exercise 2.14

data(TextPrices)
head(TextPrices)

##   Pages  Price
## 1   600  95.00
## 2    91  19.95
## 3   200  51.50
## 4   400 128.50
## 5   521  96.00
## 6   315  48.50

attach(TextPrices)

2.14 Textbook prices. Exercise 1.26 examined data on the price and number of pages for a random sample of 30 textbooks from the Cal Poly campus bookstore. The data are stored in the file TextPrices and appear in Table 1.5.

A. Perform a significance test to address the students’ question of whether the number of pages is a useful predictor of a textbook’s price. Report the hypotheses, test statistic, and p-value, along with your conclusion.

Null hypothesis: there is no linear relationship between the price of textbooks and the number of pages, i.e., the slope equals 0 (\[\beta_1=0\])

Alternative hypothesis: there is a linear relationship between the price of textbooks and the number of pages, i.e., the slope does not equal 0 (\[\beta_1 \neq 0\])

modPages<-lm(Price~Pages)
summary(modPages)

## 
## Call:
## lm(formula = Price ~ Pages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.475 -12.324  -0.584  15.304  72.991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.42231   10.46374  -0.327    0.746    
## Pages        0.14733    0.01925   7.653 2.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared:  0.6766, Adjusted R-squared:  0.665 
## F-statistic: 58.57 on 1 and 28 DF,  p-value: 2.452e-08

The test statistic for the slope is t = 7.653 on 28 degrees of freedom (equivalently, F = 58.57 = t^2 on 1 and 28 degrees of freedom). The corresponding p-value is 2.45*10^-8.

Conclusion: Since the p-value is far below alpha = .05, we reject the null hypothesis. There is strong evidence of a linear relationship between the number of pages in a textbook and its price, so the number of pages is a useful predictor of price.
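The slope test can also be reproduced by hand from the Pages row of the summary output. The sketch below is only an illustration: it uses the rounded estimate and standard error as printed, so its results agree with the output only up to rounding, and the variable names are my own.

```r
# Values copied from the Pages row of the summary output (rounded as printed)
slope_est <- 0.14733
slope_se  <- 0.01925
df_resid  <- 28   # n - 2 = 30 - 2

# t statistic for H0: beta_1 = 0
t_stat <- slope_est / slope_se

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

t_stat      # close to the printed t value of 7.653
t_stat^2    # close to the printed F-statistic of 58.57
p_value     # very small, in line with the printed p-value
```

In simple linear regression the F-test and the two-sided t-test for the slope are the same test, which is why both rows of the output report the same p-value.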

B. Determine a 95% confidence interval for the population slope coefficient. Also explain what this slope coefficient means in the context of these data.

confint(modPages, 'Pages', level=0.95)

##           2.5 %   97.5 %
## Pages 0.1078959 0.186761

The 95% confidence interval for the slope is (.1079, .1868). The slope is the change in mean price per additional page, so we are 95% confident that each additional page is associated with an increase of between about $0.11 and $0.19 in the average textbook price.
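The interval from confint() is the slope estimate plus or minus a t critical value times its standard error. A minimal sketch of that calculation, assuming the rounded values printed in the summary output (the variable names are illustrative):

```r
# Rounded values from the Pages row of the summary output
slope_est <- 0.14733
slope_se  <- 0.01925
df_resid  <- 28

# Critical value for a 95% interval with 28 degrees of freedom
t_crit <- qt(0.975, df = df_resid)

# Estimate plus or minus the margin of error
slope_ci <- slope_est + c(-1, 1) * t_crit * slope_se
slope_ci
```

Up to rounding of the inputs, this should agree with the confint() output above.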

Exercise 2.15

2.15 Textbook prices (continued). Refer to Exercise 2.14 on prices and numbers of pages in textbooks.

A. Determine a 95% confidence interval for the mean price of a 450-page textbook in the population.

pagebook<-data.frame(Pages=450)
predict(modPages,newdata=pagebook,interval = "confidence")

##        fit      lwr      upr
## 1 62.87549 51.73074 74.02024

We are 95% confident that the mean price of all 450-page textbooks is between 51.73 and 74.02 dollars.

B. Determine a 95% confidence interval for the price of a particular 450-page textbook in the population.

pagebook<-data.frame(Pages=450)
predict(modPages,newdata=pagebook,interval = "prediction")

##        fit       lwr      upr
## 1 62.87549 0.9035981 124.8474

We are 95% confident that a particular textbook with 450 pages will cost between 0.90 and 124.85 dollars.

C. How do the midpoints of these two intervals compare? Explain why this makes sense.

(124.8474+.9035981)/2

## [1] 62.8755

(51.73074+74.02024)/2

## [1] 62.87549

modPages

## 
## Call:
## lm(formula = Price ~ Pages)
## 
## Coefficients:
## (Intercept)        Pages  
##     -3.4223       0.1473

-3.4223+.1473*450

## [1] 62.8627

The midpoints of these two intervals are the same: both equal the fitted value \[\hat{y}=62.88\] dollars. This makes sense because both intervals are centered on the regression line's prediction at 450 pages and differ only in their margins of error. Plugging 450 pages into the printed equation gives $62.86, which differs slightly from the midpoint only because the printed coefficients are rounded.

D. How do the widths of these two intervals compare? Explain why this makes sense.

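The interval for a single textbook is wider than the interval for the mean price (about 0.90 to 124.85 versus 51.73 to 74.02 dollars). This makes sense because predicting an individual price must account for the variation of individual books around the regression line in addition to the uncertainty in estimating the line itself, whereas the interval for the mean reflects only the latter. For the same model, x value, and confidence level, a prediction interval is therefore always wider than the confidence interval for the mean.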
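The extra width can be checked numerically from the printed output. The sketch below backs the two standard errors out of the interval half-widths and compares the prediction standard error with sqrt(s^2 + SE_mean^2), where s is the residual standard error. It relies on rounded printed values, so agreement is only approximate, and the variable names are my own.

```r
# Values printed earlier in this exercise
s        <- 29.76                    # residual standard error
df_resid <- 28
ci_mean  <- c(51.73074, 74.02024)    # interval for the mean price at 450 pages
pi_indiv <- c(0.9035981, 124.8474)   # interval for a single 450-page book

t_crit <- qt(0.975, df = df_resid)

# Back out the standard errors from the interval half-widths
se_mean <- diff(ci_mean) / 2 / t_crit
se_pred <- diff(pi_indiv) / 2 / t_crit

# Prediction variance = error variance of one observation + variance of the fitted mean
se_pred_check <- sqrt(s^2 + se_mean^2)

c(se_mean = se_mean, se_pred = se_pred, se_pred_check = se_pred_check)
```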

E. What value for number of pages would produce the narrowest possible prediction interval for its price? Explain.

mean(Pages)

## [1] 464.5333

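A value of 464.5333 pages will produce the narrowest possible prediction interval for price. This is the mean number of pages in the sample, and the standard error of a prediction grows with the squared distance between the chosen x value and the sample mean of x, so it is smallest when that distance is zero.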
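To illustrate, the half-width of the 95% prediction interval can be written as a function of the number of pages. This is a sketch under stated assumptions: it uses the rounded values printed above, recovers Sxx (the sum of squared deviations of Pages) from the relation SE(slope) = s / sqrt(Sxx), and the function name pi_half_width is hypothetical.

```r
s        <- 29.76      # residual standard error
slope_se <- 0.01925    # standard error of the slope
n        <- 30         # number of textbooks
x_bar    <- 464.5333   # mean(Pages)

# SE(slope) = s / sqrt(Sxx), so Sxx can be recovered from the printed output
sxx <- (s / slope_se)^2

t_crit <- qt(0.975, df = n - 2)

# Half-width of the 95% prediction interval at x_star pages
pi_half_width <- function(x_star) {
  t_crit * s * sqrt(1 + 1 / n + (x_star - x_bar)^2 / sxx)
}

pages_grid <- c(200, 450, x_bar, 800, 1500)
widths <- pi_half_width(pages_grid)
data.frame(pages = pages_grid, half_width = widths)

# The smallest half-width on the grid occurs at the sample mean
pages_grid[which.min(widths)]
```

The half-widths at 450 and 1500 pages should be close to those implied by the predict() output in parts B and F.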

F. Determine a 95% prediction interval for the price of a particular 1500-page textbook in the population. Do you really have 95% confidence in this interval? Explain.

pagebook<-data.frame(Pages=1500)
predict(modPages,newdata=pagebook,interval = "prediction")

##        fit      lwr     upr
## 1 217.5704 143.3587 291.782

The 95% prediction interval for the price of a particular 1500-page book is 143.36 to 291.78 dollars. Formally, we would say that we are 95% confident that a particular 1500-page book costs between 143.36 and 291.78 dollars, but I do not really have 95% confidence in this interval: 1500 pages is higher than anything in the dataset, so this is an extrapolation, and there is no evidence that the linear relationship continues to hold that far out.

Exercise 2.30

data(ChildSpeaks)
head(ChildSpeaks)

##   Child Age Gesell
## 1     1  15     95
## 2     2  26     71
## 3     3  10     83
## 4     4   9     91
## 5     5  15    102
## 6     6  20     87

attach(ChildSpeaks)

2.30 Child first speaks-full data. Use the data in Table 2.2 and ChildSpeaks to consider a model to predict age at first speaking using the Gesell score.

A. Before you analyze the data, would you expect to see a positive relationship, negative relationship, or no relationship between these variables? Provide a rationale for your choice.

Comments: I would expect a negative relationship between the age of the child in months when they first speak and the Gesell score. This is because the Gesell Aptitude Test is used to measure a child’s level of cognitive development, and it is reasonable to assume that those who speak at a younger age will have a higher score on the Gesell test. Thus there will be a negative relationship.

B. Produce a scatterplot of these data and comment on whether age of first speaking appears to be a useful predictor of the Gesell aptitude score.

ggplot(data = ChildSpeaks, aes(x = Age, y = Gesell)) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm",   # Add linear regression line
              se = FALSE)

Comments: Age at first speaking appears to be an okay predictor of the Gesell score. Most of the points on the scatterplot fall close to the linear regression line.

C. Report the regression equation and the value of r^2. Also, determine whether the relationship between these variables is statistically significant.

modGesell<-lm(Age~Gesell)
summary(modGesell)

## 
## Call:
## lm(formula = Age ~ Gesell)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2612 -4.5321 -0.3495  3.3735 14.2806 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  48.4546     9.4769   5.113 6.18e-05 ***
## Gesell       -0.3638     0.1001  -3.633  0.00177 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.263 on 19 degrees of freedom
## Multiple R-squared:   0.41,  Adjusted R-squared:  0.3789 
## F-statistic:  13.2 on 1 and 19 DF,  p-value: 0.001769

Comments: I used the Gesell score as the predictor variable because that is what the description of the question says to do, even though part B uses it as the response variable. The linear regression equation is \[\widehat{Age}=48.4546-.3638*Gesell\] and r^2 is .41, which means that 41% of the variation in Age is explained by the model. The p-value is .001769. Even though the r^2 is not very high, it is reasonable for a single predictor, and since the p-value is below the alpha of .05 we can say that the relationship between these variables is statistically significant.
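The same fit can be obtained without attach() by passing the data frame to lm() directly. The sketch below rebuilds the data from the ChildSpeaks listing printed in this document (the name child_speaks is my own) and also checks that, in simple linear regression, r^2 equals the squared correlation.

```r
# Data copied from the ChildSpeaks listing in this document
child_speaks <- data.frame(
  Age    = c(15, 26, 10, 9, 15, 20, 18, 11, 8, 20, 7,
             9, 10, 11, 11, 10, 12, 42, 17, 11, 10),
  Gesell = c(95, 71, 83, 91, 102, 87, 93, 100, 104, 94, 113,
             96, 83, 84, 102, 100, 105, 57, 121, 86, 100)
)

# Same model as modGesell, but with the data passed explicitly
fit <- lm(Age ~ Gesell, data = child_speaks)
coef(fit)

# In simple linear regression, r^2 equals the squared correlation
r2_from_fit <- summary(fit)$r.squared
r2_from_cor <- cor(child_speaks$Age, child_speaks$Gesell)^2
c(r2_from_fit = r2_from_fit, r2_from_cor = r2_from_cor)
```

Passing data = explicitly avoids name clashes when several attached data sets share column names.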

D. Which child has the largest (in absolute value) residual? Explain what is unusual about that child.

residuals(modGesell)

##          1          2          3          4          5          6 
##  1.1040820  3.3734635 -8.2612273 -6.3510211  3.6505124  3.1938758 
##          7          8          9         10         11         12 
##  3.3765304 -1.0770392 -2.6219361  5.7403062 -0.3479541 -4.5321422 
##         13         14         15         16         17         18 
## -8.2612273 -6.8974515 -0.3494876 -2.0770392  1.7418397 14.2806027 
##         19         20         21 
## 12.5622520 -6.1699000 -2.0770392

ChildSpeaks

##    Child Age Gesell
## 1      1  15     95
## 2      2  26     71
## 3      3  10     83
## 4      4   9     91
## 5      5  15    102
## 6      6  20     87
## 7      7  18     93
## 8      8  11    100
## 9      9   8    104
## 10    10  20     94
## 11    11   7    113
## 12    12   9     96
## 13    13  10     83
## 14    14  11     84
## 15    15  11    102
## 16    16  10    100
## 17    17  12    105
## 18    18  42     57
## 19    19  17    121
## 20    20  11     86
## 21    21  10    100

Child number 18 has the largest residual in absolute value (14.28). What is unusual about this child is that they took 42 months to speak and got a very low Gesell score of 57. This child is a clear outlier, since the majority of the children spoke before they were 20 months old and no one else scored below 70 on the Gesell test.
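Rather than scanning the residuals by eye, the largest one can be located with which.max(). A minimal sketch, again rebuilding the data from the listing above under the illustrative name child_speaks:

```r
# Data copied from the ChildSpeaks listing in this document
child_speaks <- data.frame(
  Child  = 1:21,
  Age    = c(15, 26, 10, 9, 15, 20, 18, 11, 8, 20, 7,
             9, 10, 11, 11, 10, 12, 42, 17, 11, 10),
  Gesell = c(95, 71, 83, 91, 102, 87, 93, 100, 104, 94, 113,
             96, 83, 84, 102, 100, 105, 57, 121, 86, 100)
)

fit <- lm(Age ~ Gesell, data = child_speaks)
res <- residuals(fit)

# Index of the residual that is largest in absolute value
worst <- which.max(abs(res))

child_speaks[worst, ]   # the row for that child
res[[worst]]            # the residual itself
```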

Exercise 2.44

For this question, use R as a calculator to show your arithmetic. 2.44 Gate count-Computing the least squares line from descriptive statistics.

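Many libraries have gates that automatically count persons as they leave the building, thus making it easy to find out how many persons used the library in a given year. (Of course, someone who enters, leaves, and enters again is counted twice.) Researchers conducted a survey of liberal arts college libraries and found the following descriptive statistics on enrollment and gate count (values as used in the calculation in part A): enrollment has mean 2009 and standard deviation 657, gate count has mean 247235 and standard deviation 104807, and the correlation between them is r = .701.

The least squares regression line can be computed from five descriptive statistics, as follows:

\[\hat{\beta}_1=r\cdot\frac{s_y}{s_x} \qquad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\cdot\bar{x}\]

where \[\bar{x}\], \[\bar{y}\], s_x, and s_y are the sample means and standard deviations for the predictor and response, respectively, and r is their correlation.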

A. Find the equation of the least squares line for predicting the gate count from enrollment.

r<-.701
Beta1hat<-r*(104807/657)
Beta0hat<-247235-Beta1hat*2009
Beta0hat

## [1] 22576.49

Beta1hat

## [1] 111.826

A. The least squares line for predicting gate count from enrollment is \[\widehat{GateCount}=22576.49+111.826*Enrollment\].

B. What percentage of the variation in the gate counts is explained by enrollments?

r*r

## [1] 0.491401

Comments: r^2 is the square of the correlation, .701*.701 = .491401. So about 49.14% of the variation in gate count is explained by enrollment.

C. Predict the number of persons who will use the library at a small liberal arts college with an enrollment of 1445.

22576.49+111.826*1445

## [1] 184165.1

I predict that the number of persons who will use the library (i.e., the gate count) at a college with 1445 students will be about 184165 people.

D. One of the reporting colleges has an enrollment of 2200 and a gate count of 130,000. Find the value of the residual for this college.

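pred<-22576.49+111.826*2200
res<-130000-pred
res

## [1] -138593.7

The residual for this college is -138593.7 (observed minus predicted), so its gate count is about 138594 below what the model predicts for an enrollment of 2200.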
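The whole calculation can be collected into one small helper that builds the line from the five summary statistics and then takes the residual as observed minus predicted, the usual convention. This is only a sketch: the function name ls_line_from_summary and the variable names are hypothetical, and it carries full precision rather than the rounded coefficients, so results may differ from the ones above in the last digit.

```r
# Least squares line from summary statistics:
# slope = r * s_y / s_x, intercept = y_bar - slope * x_bar
ls_line_from_summary <- function(x_bar, y_bar, s_x, s_y, r) {
  slope <- r * s_y / s_x
  intercept <- y_bar - slope * x_bar
  c(intercept = intercept, slope = slope)
}

# Summary statistics used in part A (x = enrollment, y = gate count)
gate_line <- ls_line_from_summary(x_bar = 2009, y_bar = 247235,
                                  s_x = 657, s_y = 104807, r = 0.701)
gate_line

# Residual for the college with enrollment 2200 and gate count 130,000
observed   <- 130000
predicted  <- gate_line[["intercept"]] + gate_line[["slope"]] * 2200
resid_2200 <- observed - predicted   # residual = observed - predicted
resid_2200
```

A negative residual means the observed gate count falls below the fitted line.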