The following is an example of how to fit a linear regression model using the Beerwings dataset from the resampledata package:
library(resampledata)
##
## Attaching package: 'resampledata'
## The following object is masked from 'package:datasets':
##
## Titanic
data("Beerwings")
First, we look at the first few entries in the dataset and the names of its variables.
head(Beerwings)
## ID Hotwings Beer Gender
## 1 1 4 24 F
## 2 2 5 0 F
## 3 3 5 12 F
## 4 4 6 12 F
## 5 5 7 12 F
## 6 6 7 12 F
names(Beerwings)
## [1] "ID" "Hotwings" "Beer" "Gender"
Next, we make a scatterplot of the data to see whether the number of hot wings eaten is associated with the ounces of beer consumed.
plot(Beerwings$Hotwings, Beerwings$Beer, xlab = "Number of Hot Wings Eaten", ylab = "Ounces of Beer Consumed")
In the scatterplot, we can see an upward trend, that is, a positive association between the two variables. The points cluster fairly tightly around a straight line, so using a linear model for this dataset is reasonable.
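Before fitting, we can quantify the strength of that linear association with the correlation coefficient (a supplementary step not in the original walkthrough; it should come out to about 0.784, whose square matches the R-squared of 0.6148 reported in the model summary below):
cor(Beerwings$Hotwings, Beerwings$Beer)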
We now fit the linear model for the data, using a formula with the data argument so that we can make predictions from the model later:
BeerWings <- lm(Beer ~ Hotwings, data = Beerwings)
BeerWings
##
## Call:
## lm(formula = Beer ~ Hotwings, data = Beerwings)
##
## Coefficients:
## (Intercept) Hotwings
## 3.040 1.941
Our regression line is as follows: Predicted Beer = 3.040 + 1.941*Hotwings
We interpret the slope by saying that eating one more hot wing is associated with an increase of 1.941 fluid ounces of beer consumed, on average. The intercept would be the predicted beer consumption for someone who ate zero hot wings, but since no one in our dataset ate zero hot wings, interpreting it would be an extrapolation.
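If we also want an interval estimate for the slope, confint computes confidence intervals directly from the fitted model (a supplementary step, not part of the original walkthrough):
confint(BeerWings) # 95% confidence intervals for the intercept and slope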
Then, we redraw the scatterplot with the regression line added.
plot(Beerwings$Hotwings, Beerwings$Beer, xlab = "Number of Hot Wings Eaten", ylab = "Ounces of Beer Consumed")
abline(3.040, 1.941)
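Rather than retyping the rounded coefficients, we can also pass the fitted model to abline, which draws the line from the exact coefficients:
abline(BeerWings) # equivalent, using the model's unrounded coefficients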
We can then predict the ounces of beer someone would drink given that they ate 9 hot wings, and calculate the residual for individual 11, who ate exactly 9 hot wings.
nineWings <- 3.040+1.941*9
nineWings
## [1] 20.509
residNine <- Beerwings[11, 3] - nineWings # observed beer for individual 11 minus predicted
residNine
## [1] 3.491
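Equivalently, we can let R do both steps with predict and residuals; these use the unrounded coefficients, so the results may differ slightly from the hand calculation above (a supplementary sketch, not part of the original walkthrough):
predict(BeerWings, newdata = data.frame(Hotwings = 9)) # predicted beer for 9 hot wings
residuals(BeerWings)[11] # residual for individual 11, who ate 9 hot wings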
Though using a linear model for these data seems reasonable, we should check the model assumptions to be sure. First, we check whether the errors are normally distributed with a mean near zero.
resids <- BeerWings$residuals
qqnorm(resids)
qqline(resids)
mean(resids)
## [1] -1.777875e-16
The residuals follow the line fairly well, so assuming normally distributed errors is reasonable. The mean of the residuals is essentially zero; in fact, for a least-squares fit with an intercept, the residuals always average to exactly zero up to floating-point error.
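For a more formal normality check, we could also run a Shapiro-Wilk test on the residuals (a supplementary step, not part of the original walkthrough; a large p-value is consistent with normally distributed errors):
shapiro.test(resids)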
Now, we check for homoscedasticity.
plot(BeerWings$residuals ~ Beerwings$Hotwings)
abline(0,0)
This plot shows no obvious evidence of heteroscedasticity: the spread of the residuals looks roughly constant across the range of hot wings eaten. We can also treat the residuals as independent because each observation comes from a different individual.
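R can also produce its standard diagnostic plots directly from the fitted model, covering both of the checks above in one step (a supplementary sketch):
par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a 2-by-2 grid
plot(BeerWings) # residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1)) # reset the plotting layout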
Lastly, we can look at a summary of our model.
summary(BeerWings)
##
## Call:
## lm(formula = Beer ~ Hotwings, data = Beerwings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.566 -4.537 -0.122 3.671 17.789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0404 3.7235 0.817 0.421
## Hotwings 1.9408 0.2903 6.686 2.95e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.479 on 28 degrees of freedom
## Multiple R-squared: 0.6148, Adjusted R-squared: 0.6011
## F-statistic: 44.7 on 1 and 28 DF, p-value: 2.953e-07
The summary tells us many things. One is the residual standard error (the estimated standard deviation of the errors, not their variance), which here is 7.479. We also see that the model has 28 degrees of freedom because two parameters are estimated from the 30 individuals in the dataset. Finally, the Multiple R-squared of 0.6148 tells us that the model explains about 61% of the variability in beer consumption, and the very small p-value for the slope indicates a statistically significant linear association.
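If we need these quantities programmatically, they can be extracted from the summary object (a supplementary sketch; these component names are part of R's summary.lm return value):
modsum <- summary(BeerWings)
modsum$sigma # residual standard error, 7.479
modsum$r.squared # multiple R-squared, 0.6148
modsum$coefficients # estimates, standard errors, t values, p-values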