I want to explore the relationship between fluid ounces of beer consumed and number of hot wings consumed using a simple linear regression. I posit that the variables will show a positive relationship because drinking beer would cool down your mouth after eating hot wings.

library(resampledata)
## 
## Attaching package: 'resampledata'
## The following object is masked from 'package:datasets':
## 
##     Titanic
data(Beerwings)
names(Beerwings)
## [1] "ID"       "Hotwings" "Beer"     "Gender"
attach(Beerwings)
mymod<-lm(Beer~Hotwings)
mymod
## 
## Call:
## lm(formula = Beer ~ Hotwings)
## 
## Coefficients:
## (Intercept)     Hotwings  
##       3.040        1.941
plot(Hotwings,Beer)
abline(mymod)

Now that I have a decent picture of what my regression model looks like, I want to check out how closely my two variables are correlated.

cor(Beer,Hotwings)
## [1] 0.7841224
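
As a quick check on how this correlation relates to the fitted line, here is a minimal sketch. It assumes the resampledata package is installed, and it refits the model under an illustrative name, fit, passing data = Beerwings instead of relying on attach(). In simple linear regression, the squared correlation equals the model's R-squared, and the slope equals the correlation scaled by the ratio of the two standard deviations.

```r
library(resampledata)
data(Beerwings)

# Refit the same model; `fit` is an illustrative name, not from the original code
fit <- lm(Beer ~ Hotwings, data = Beerwings)

r <- cor(Beerwings$Beer, Beerwings$Hotwings)

# Squared correlation vs. the R-squared stored in the model summary
r_squared_from_cor <- r^2
r_squared_from_fit <- summary(fit)$r.squared

# Slope recovered from the correlation and the two standard deviations
slope_from_cor <- r * sd(Beerwings$Beer) / sd(Beerwings$Hotwings)
slope_from_fit <- unname(coef(fit)["Hotwings"])

# Both comparisons should report TRUE
all.equal(r_squared_from_cor, r_squared_from_fit)
all.equal(slope_from_cor, slope_from_fit)
```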

Next, I use a prediction interval to predict the beer consumption of a single person who eats a given number of hot wings (here, 10). A confidence interval, by contrast, estimates the mean amount of beer consumed across everyone who eats 10 hot wings.

newdata<-data.frame(Hotwings=10)
beerpredict<-predict(mymod, newdata, interval="prediction")
beerpredict
##        fit      lwr      upr
## 1 22.44788 6.831517 38.06425
beerconfid<-predict(mymod, newdata, interval="confidence")
beerconfid
##        fit      lwr      upr
## 1 22.44788 19.42369 25.47207

These results show that the prediction interval is much wider than the confidence interval, which matches the general result we proved in class. The difference arises because predicting a single new observation carries the variance of that observation's own error term in addition to the variance of the estimated mean response. The R code below confirms this by computing the width of each interval:

beerpredict %*% c(0,-1,1)
##       [,1]
## 1 31.23273
beerconfid %*% c(0,-1,1)
##       [,1]
## 1 6.048387
beerpredict[1]==beerconfid[1]
## [1] TRUE

The prediction interval is indeed wider and the two intervals are centered around the same value.
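
To make the variance comparison concrete, the two intervals can also be rebuilt by hand from the standard simple-regression formulas. This is only a sketch: it assumes the resampledata package is installed, and the names fit, se_mean, se_pred, conf_int, and pred_int are illustrative rather than part of the original analysis. The only difference between the two standard errors is the extra residual variance term added for a single new observation.

```r
library(resampledata)
data(Beerwings)

# Illustrative refit of the same model using the data argument
fit <- lm(Beer ~ Hotwings, data = Beerwings)

x  <- Beerwings$Hotwings
x0 <- 10                        # hot wings value used in the text
n  <- length(x)
s  <- summary(fit)$sigma        # residual standard error
t_crit <- qt(0.975, df = n - 2) # critical value for a 95% interval

# Fitted value at x0: intercept + slope * x0
y_hat <- sum(coef(fit) * c(1, x0))

# Standard error of the estimated mean response at x0
se_mean <- s * sqrt(1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))

# Standard error for one new observation: add the residual variance s^2
se_pred <- sqrt(s^2 + se_mean^2)

conf_int <- y_hat + c(-1, 1) * t_crit * se_mean
pred_int <- y_hat + c(-1, 1) * t_crit * se_pred

conf_int
pred_int
```

Because se_pred always exceeds se_mean by the s^2 term, the prediction interval is wider at every value of Hotwings, not only at 10.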

summary(mymod)
## 
## Call:
## lm(formula = Beer ~ Hotwings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.566  -4.537  -0.122   3.671  17.789 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0404     3.7235   0.817    0.421    
## Hotwings      1.9408     0.2903   6.686 2.95e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.479 on 28 degrees of freedom
## Multiple R-squared:  0.6148, Adjusted R-squared:  0.6011 
## F-statistic:  44.7 on 1 and 28 DF,  p-value: 2.953e-07

This R summary of the model yields some key information about the relationship between our variables. In class, we discussed the F-test, which tells us whether or not a linear relationship exists between our explanatory variable and response variable. The null hypothesis is that there is no linear relationship. Thus, our tiny p-value (2.95x10^(-7)) leads us to reject the null hypothesis in favor of the existence of a linear relationship.
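
The F-test p-value can be recovered directly from the stored F statistic, as in the minimal sketch below. It assumes the resampledata package is installed and uses illustrative names (fit, f_stat, p_from_f). With a single explanatory variable, the F statistic equals the square of the slope's t statistic, so the F-test and the t-test on Hotwings give the same p-value.

```r
library(resampledata)
data(Beerwings)

# Illustrative refit of the same model using the data argument
fit <- lm(Beer ~ Hotwings, data = Beerwings)
model_summary <- summary(fit)

# fstatistic is a named vector: value, numdf, dendf
f_stat <- model_summary$fstatistic

# Upper-tail probability of the F distribution gives the p-value
p_from_f <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                      lower.tail = FALSE))

# t statistic and p-value for the slope, taken from the coefficient table
t_slope <- model_summary$coefficients["Hotwings", "t value"]
p_from_t <- model_summary$coefficients["Hotwings", "Pr(>|t|)"]

# Both comparisons should report TRUE in simple linear regression
all.equal(unname(f_stat["value"]), t_slope^2)
all.equal(p_from_f, p_from_t)
```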