library(resampledata)
##
## Attaching package: 'resampledata'
## The following object is masked from 'package:datasets':
##
## Titanic
data(Beerwings)
attach(Beerwings)
In section 3.5, we learned how to test the significance of the slope (and the intercept, but that is less important).
I will use the Beerwings data set as an example. It contains 30 observations of how many hot wings a person ate, how much beer they drank (in ounces), and the person's gender. For this example, Hotwings will be my explanatory variable and Beer will be the response.
I will test the significance of the slope. The null hypothesis is \({\beta_1} = 0\), with an alternative hypothesis of \({\beta_1} \neq 0\).
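For reference (these are the standard simple linear regression results, not anything specific to this data set), the model is
\[
y = \beta_0 + \beta_1 x + \epsilon,
\]
and the test statistic
\[
t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}
\]
has a t-distribution with \(n - 2\) degrees of freedom under the null hypothesis.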
Beerwings.mod <- lm(Beer ~ Hotwings, data = Beerwings)
Beerwings.mod
##
## Call:
## lm(formula = Beer ~ Hotwings, data = Beerwings)
##
## Coefficients:
## (Intercept)     Hotwings
##       3.040        1.941
summary(Beerwings.mod)
##
## Call:
## lm(formula = Beer ~ Hotwings, data = Beerwings)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -18.566  -4.537  -0.122   3.671  17.789
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0404     3.7235   0.817    0.421
## Hotwings      1.9408     0.2903   6.686 2.95e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.479 on 28 degrees of freedom
## Multiple R-squared: 0.6148, Adjusted R-squared: 0.6011
## F-statistic: 44.7 on 1 and 28 DF, p-value: 2.953e-07
The p-value for the slope is 2.95e-07, far below any common significance level, so I reject the null hypothesis in favor of the alternative, \({\beta_1} \neq 0\). There is strong evidence of a linear relationship between the number of hot wings eaten and the amount of beer consumed.
confint(Beerwings.mod)
##                 2.5 %    97.5 %
## (Intercept) -4.586851 10.667590
## Hotwings     1.346131  2.535371
I am 95% confident that the true slope \({\beta_1}\) lies in [1.346131, 2.535371]; that is, each additional hot wing is associated with roughly 1.35 to 2.54 additional ounces of beer.
If I tested the significance of the y-intercept, with a null hypothesis of \({\beta_0} = 0\) against an alternative of \({\beta_0} \neq 0\), it can be seen from the output above that the p-value is 0.421. Any legitimate significance level would have us fail to reject the null, as the p-value is very large. Furthermore, I can be 95% confident that the true intercept \({\beta_0}\) lies in [-4.586851, 10.667590]; note that this interval contains 0, which is consistent with failing to reject.
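As a quick sanity check (my own addition, rebuilding the intervals from the rounded estimates and standard errors printed in the summary above), these confidence intervals are just estimate plus or minus the critical t value times the standard error:

tstar <- qt(0.975, df = 28)          # critical t value, about 2.048
1.9408 + c(-1, 1) * tstar * 0.2903   # slope CI, approximately [1.346, 2.535]
3.0404 + c(-1, 1) * tstar * 3.7235   # intercept CI, approximately [-4.587, 10.668]

Up to rounding, these match the confint() output.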
In this section we learned how to choose an x-value and calculate the “distance value” for it. Once we have the distance value, we can calculate a confidence interval for the mean value of y when x equals the chosen value, as well as a prediction interval for an individual y at that x. Both intervals are centered at the same place (\(\hat{y}\)), but the prediction interval will be wider, because the variance of a single observation is larger than the variance of the mean of all of the observations.
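To make that concrete, here are the standard formulas (the “distance value” terminology follows the section; \(x_0\) is the chosen x-value and \(s\) is the residual standard error):
\[
\text{distance value} = \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\]
\[
\text{CI: } \hat{y} \pm t_{\alpha/2,\,n-2}\, s \sqrt{\text{distance value}}, \qquad \text{PI: } \hat{y} \pm t_{\alpha/2,\,n-2}\, s \sqrt{1 + \text{distance value}}.
\]
The extra 1 under the square root is exactly why the prediction interval is wider.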
For the Beerwings example, I'll choose x = 5 hot wings and calculate the confidence and prediction intervals in R.
newBeerwings <- data.frame(Hotwings = 5)
(predBeer <- predict(Beerwings.mod, newBeerwings, interval = "prediction"))
##        fit       lwr      upr
## 1 12.74413 -3.366303 28.85455
(confBeer <- predict(Beerwings.mod, newBeerwings, interval = "confidence"))
##        fit      lwr      upr
## 1 12.74413 7.762078 17.72617
The prediction interval is [-3.366303, 28.85455]. The confidence interval is [7.762078, 17.72617].
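To tie this back to the distance-value formulas above, here is a hand computation of the prediction interval (my own check; it recomputes everything from the raw data rather than relying on predict()'s interval argument):

x0 <- 5
dist.val <- 1/length(Hotwings) +
  (x0 - mean(Hotwings))^2 / sum((Hotwings - mean(Hotwings))^2)
s <- summary(Beerwings.mod)$sigma             # residual standard error, 7.479
yhat <- predict(Beerwings.mod, newBeerwings)  # fitted value, 12.74413
yhat + c(-1, 1) * qt(0.975, df = 28) * s * sqrt(1 + dist.val)  # should match predBeer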
Time to check that the prediction interval is bigger.
confBeer %*% c(0, -1, 1)  # confidence interval width
##       [,1]
## 1 9.964095
predBeer %*% c(0, -1, 1)  # prediction interval width
##       [,1]
## 1 32.22086
The prediction interval is about 32.22 / 9.96 ≈ 3.2 times as wide as the confidence interval, over three times bigger.
Now to check if they are centered at the same place.
confBeer[1] == predBeer[1]
## [1] TRUE
Woo hoo!
In this section, we learned about total variation, explained variation, unexplained variation, and the simple coefficient of determination (\(r^2\)). The simple coefficient of determination tells us the percent of the variability in the response that is explained by its linear relationship with the predictor(s).
cor(Beer, Hotwings)
## [1] 0.7841224
Note that this is the correlation \(r\), not \(r^2\). It is positive, so Beer and Hotwings have a positive association, and it is moderately close to one. Squaring it gives \(r^2 \approx 0.615\), which matches the Multiple R-squared of 0.6148 in the summary above: about 61% of the variability in beer consumption is explained by the model.
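As a one-line check (my own addition), squaring the correlation should reproduce the Multiple R-squared from the model summary:

cor(Beer, Hotwings)^2  # should match the Multiple R-squared, 0.6148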
The F-test is just a different way to test whether there is a linear relationship between x and y, in our case Hotwings and Beer. Looking back at the summary of our linear model, the F-statistic p-value is 2.953e-07, the same as the p-value we got from the t-test on the slope. This agreement holds only for simple linear regression, where the F statistic is the square of the slope's t statistic (here, \(6.686^2 \approx 44.7\)).
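To verify that connection in R (my own check, using the components of the summary object):

t.val <- summary(Beerwings.mod)$coefficients["Hotwings", "t value"]
t.val^2                                     # about 44.7
summary(Beerwings.mod)$fstatistic["value"]  # the F statistic from the summary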