The following are the R commands and answers to the in-class handout.
Load the data and create a scatterplot:
bac<-read.file("/home/emesekennedy/Data/Ch10/bac.txt")
## Reading data with read.table()
xyplot(BAC~Beers, data=bac)
As the scatterplot shows, there is a fairly strong positive linear relationship between the number of beers drunk and blood alcohol content.
cor(BAC~Beers, data=bac)
## [1] 0.8943381
\(r=.8943\), which means that the linear relationship between the two variables is strong and positive since the \(r\) value is close to 1.
cor(BAC~Beers, data=bac)^2
## [1] 0.7998407
\(r^2=.7998\), which means that 79.98% of the variation in blood alcohol content can be explained by the least-squares regression line on the number of beers.
Fit a regression line:
y<-lm(BAC~Beers, data=bac)
Display the slope and intercept of the line:
y
##
## Call:
## lm(formula = BAC ~ Beers, data = bac)
##
## Coefficients:
## (Intercept) Beers
## -0.01270 0.01796
The least squares regression line is \(\hat{y}=.01796\text{Beers}-.0127\)
Create a function out of the regression line:
f<-makeFun(y)
Plot the line on the scatterplot:
plotFun(f(Beers)~Beers, data=bac, add=T)
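To see what the function created from the regression line computes, here is a minimal base-R sketch built from the rounded coefficients printed above. The name `bac_line` is a hypothetical stand-in for `f`, and the results differ slightly from the model's own because of rounding.

```r
# Coefficients copied from the lm() printout above (rounded as displayed).
intercept <- -0.01270
slope     <-  0.01796

# Hypothetical stand-in for the function returned by makeFun(y):
# it evaluates the fitted line at any number of beers.
bac_line <- function(Beers) intercept + slope * Beers

bac_line(5)        # predicted BAC after 5 beers
bac_line(c(1, 9))  # works on vectors as well
```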
Plot the residuals:
mplot(y, which=1)
## [[1]]
The residuals look random with no outliers, which means that the least-squares regression line fits the data well.
Create a Normal quantile plot of the residuals:
mplot(y, which=2)
## [[1]]
The residuals appear fairly close to Normal, so it is appropriate to use the inference procedures from Chapter 10 on the regression line.
\(H_0: \rho=0\) (no correlation between beers and BAC)
\(H_a: \rho>0\)
Get the statistics for the regression line:
summary(y)
##
## Call:
## lm(formula = BAC ~ Beers, data = bac)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.027118 -0.017350 0.001773 0.008623 0.041027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.012701 0.012638 -1.005 0.332
## Beers 0.017964 0.002402 7.480 2.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02044 on 14 degrees of freedom
## Multiple R-squared: 0.7998, Adjusted R-squared: 0.7855
## F-statistic: 55.94 on 1 and 14 DF, p-value: 2.969e-06
2.969/2
## [1] 1.4845
The value of the test statistic is \(t=7.48\) and the \(P\)-value is \(1.4845\times 10^{-6}\).
The \(P\)-value is very small (\(P<.0001\)), so we can reject the null hypothesis at the .01% significance level. This means that the data provide very strong evidence to conclude that there is a positive correlation between the number of beers and blood alcohol content.
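The one-sided \(P\)-value can also be computed directly from the test statistic instead of halving the two-sided value by hand. This is a sketch that assumes the rounded \(t\) value and the degrees of freedom from the summary output, so the last digits may differ slightly.

```r
# Test statistic and degrees of freedom from the summary() output.
t_stat <- 7.480
df     <- 14

# Upper-tail area of the t distribution: the P-value for H_a: rho > 0.
p_one_sided <- pt(t_stat, df = df, lower.tail = FALSE)

# The two-sided P-value reported by summary() is twice the one-sided value.
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

p_one_sided
p_two_sided
```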
predict(y, data.frame(Beers=5), interval="confidence", level=.9)
## fit lwr upr
## 1 0.07711821 0.06808261 0.0861538
The 90% confidence interval for the mean blood alcohol content corresponding to 5 beers is \((.068, .086)\), which means that we are 90% confident that the average blood alcohol content after 5 beers is between .068 and .086.
predict(y, data.frame(Beers=5), interval="prediction", level=.9)
## fit lwr upr
## 1 0.07711821 0.03999884 0.1142376
The 90% prediction interval for blood alcohol content corresponding to 5 beers is \((.04, .114)\). This means that there is a 90% chance that a person who drinks 5 beers will have a blood alcohol content between .04 and .114.
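The confidence and prediction intervals share the same center and differ only in width. As a rough check, assuming the usual simple-regression formulas, the prediction half-width can be recovered from the confidence half-width and the residual standard error (0.02044 in the summary output). The variable names are illustrative.

```r
# Values copied from the output above.
s      <- 0.02044                      # residual standard error
ci     <- c(0.06808261, 0.0861538)     # 90% confidence interval at Beers = 5
pi_out <- c(0.03999884, 0.1142376)     # 90% prediction interval at Beers = 5

t_star <- qt(0.95, df = 14)            # critical value for 90% intervals

# Standard error of the estimated mean response, backed out of the CI.
se_mean <- (diff(ci) / 2) / t_star

# A single new observation adds the residual variance s^2.
se_pred <- sqrt(s^2 + se_mean^2)

pred_half_width <- t_star * se_pred
pred_half_width        # reconstructed half-width
diff(pi_out) / 2       # half-width reported by predict()
```

The two half-widths agree to about five decimal places; the small gap comes from the rounding of `s`.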
xyplot(BAC~Beers, data=bac, panel=panel.lmbands)
No, the student cannot be confident that he won’t be arrested if he drives after 5 beers and is stopped because there is a good chance that his blood alcohol content will be over .08.
We can find a 90% confidence interval for the slope of the regression line using the formula estimate \(\pm t^*\) SE. Find \(t^*\):
xqt(.95, df=14)
## [1] 1.76131
Find the interval:
.017964-1.761*.0024
## [1] 0.0137376
.017964+1.761*.0024
## [1] 0.0221904
The 90% confidence interval for the slope is \((.0137, .0222)\), which means that we are 90% confident that each additional beer increases blood alcohol content by between .0137 and .0222 on average.
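The same interval can be computed in one step with base R's `qt()`, without rounding \(t^*\) or the standard error. This is a sketch using the estimate and standard error from the summary output; the variable names are illustrative.

```r
# Slope estimate, standard error, and degrees of freedom from summary(y).
estimate <- 0.017964
se       <- 0.002402
df       <- 14
level    <- 0.90

# Critical value t* leaves (1 - level)/2 in each tail.
t_star <- qt(1 - (1 - level) / 2, df = df)

# estimate +/- t* SE
ci <- estimate + c(-1, 1) * t_star * se
round(ci, 4)

# With the fitted model object available, confint(y, level = 0.90)
# reports the same kind of interval for every coefficient.
```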
Load the data:
tuition<-read.file("/home/emesekennedy/Data/Ch10/tuition.txt", header=T, sep="\t")
## Reading data with read.table()
xyplot(Year.2008~Year.2000, data=tuition)
The relationship between the tuition for the two years is strong, positive, and linear.
y<-lm(Year.2008~Year.2000, data=tuition)
y
##
## Call:
## lm(formula = Year.2008 ~ Year.2000, data = tuition)
##
## Coefficients:
## (Intercept) Year.2000
## 1132.750 1.692
The equation of the least-squares regression line is \(\hat{y}=1.692x+1132.75\) where \(x\) is the tuition for 2000.
Create a function from the line:
f<-makeFun(y)
Add the line to the scatterplot:
plotFun(f(Year.2000)~Year.2000, data=tuition, add=T)
cor(Year.2008~Year.2000, data=tuition)
## [1] 0.8844314
cor(Year.2008~Year.2000, data=tuition)^2
## [1] 0.7822189
The correlation is \(r=.8844\), and \(r^2=.7822\). This means that 78.22% of the variation in the 2008 tuition is explained by the least-squares regression line.
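As a quick check on which value is which, squaring the correlation printed above should reproduce the squared value, and taking the square root should go back. A minimal sketch using the printed numbers:

```r
# Values copied from the cor() output above.
r         <- 0.8844314
r_squared <- r^2

r_squared          # about 0.7822, the proportion of variation explained
sqrt(0.7822189)    # about 0.8844, recovering r (positive because the slope is positive)
```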
mplot(y, which=1)
## [[1]]
The residuals look fairly random, but there are a couple large values.
mplot(y, which=2)
## [[1]]
The residuals look very close to Normal.
\(H_0: \rho=0\) (no correlation between the tuitions for 2000 and 2008)
\(H_a: \rho>0\)
summary(y)
##
## Call:
## lm(formula = Year.2008 ~ Year.2000, data = tuition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2727.22 -691.07 64.44 750.01 2521.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1132.7501 701.4152 1.615 0.116
## Year.2000 1.6924 0.1604 10.552 8.75e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1134 on 31 degrees of freedom
## Multiple R-squared: 0.7822, Adjusted R-squared: 0.7752
## F-statistic: 111.3 on 1 and 31 DF, p-value: 8.746e-12
8.746/2
## [1] 4.373
The test statistic is \(t=10.552\) and the \(P\)-value is \(4.373\times 10^{-12}\). We can conclude that the data provides very strong evidence that there is a positive correlation between the tuition for 2000 and 2008.
The formula for the 95% confidence interval for the slope is estimate \(\pm t^*\) SE. Find \(t^*\):
xqt(.975, df=31)
## [1] 2.039513
1.6924-2.04*.1604
## [1] 1.365184
1.6924+2.04*.1604
## [1] 2.019616
The 95% confidence interval for the slope is \((1.3652, 2.0196)\). This means that we are 95% confident that every $1 increase in 2000 tuition is associated with an average increase in 2008 tuition of between $1.37 and $2.02, or that every $1000 increase in 2000 tuition is associated with an average increase in 2008 tuition of between $1365 and $2020.
predict(y, data.frame(Year.2000=5100), interval="prediction", level=.95)
## fit lwr upr
## 1 9763.775 7397.608 12129.94
The fitted value is 9764, and the 95% prediction interval is \((7398, 12130)\). This means that there is a 95% chance that the 2008 tuition for Stat U will be between $7398 and $12130.
predict(y, data.frame(Year.2000=8700), interval="prediction", level=.95)
## fit lwr upr
## 1 15856.26 13084.74 18627.79
The fitted value is 15856, and the 95% prediction interval is \((13085, 18628)\). This means that there is a 95% chance that the 2008 tuition for Moneypit U will be between $13085 and $18628.
The prediction for part (c) might not be as accurate because $8700 is outside of the range of 2000 tuition values for the data. This is an example of extrapolation.
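The two prediction intervals can be compared side by side using only the numbers printed above. This sketch (column names are illustrative) shows that the interval at $8700 is wider, as expected for a value farther from the mean of the 2000 tuitions. The extra width does not account for the risk that the linear pattern fails outside the observed range.

```r
# Fitted values and 95% prediction limits copied from the predict() output.
intervals <- data.frame(
  tuition_2000 = c(5100, 8700),
  fit          = c(9763.775, 15856.26),
  lwr          = c(7397.608, 13084.74),
  upr          = c(12129.94, 18627.79)
)

# Width of each prediction interval.
intervals$width <- intervals$upr - intervals$lwr

# Fitted values recomputed from the rounded coefficients in summary(y);
# they match the predict() output up to rounding.
intervals$line <- 1132.7501 + 1.6924 * intervals$tuition_2000

intervals
```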