The dataset I will be working with is Beerwings. The response variable is beer consumption, measured in fluid ounces. The predictor variable is the number of hot wings consumed.
First, we need to call up the dataset.
library (resampledata)
data ("Beerwings")
attach(Beerwings)
Now that the dataset has been called and attached, it’s time to move on to making a scatterplot.
plot (Hotwings, Beer)
By default, R uses the names of the variables in the dataset to label the axes. However, the default labels can be changed by simply adding a few extra arguments, as shown below:
plot (Hotwings, Beer, ylab = "Beer Consumed (fl-oz)",
xlab = "Hot Wings Consumed", main = "Beer Consumption and Hot Wing Consumption")
Labels for the x and y axes are defined by xlab and ylab, respectively. The title of the plot is defined by main. Relabelling the axes makes plots easier for others to understand.
From the scatterplot we can see an upward trend that appears to be linear. The data points are consistently packed. For the most part, data points follow the trend.
Linear regression can be used to create a model for this data. The “lm” command creates the model.
BeerMDL <- lm(Beer ~ Hotwings)
BeerMDL
##
## Call:
## lm(formula = Beer ~ Hotwings)
##
## Coefficients:
## (Intercept) Hotwings
## 3.040 1.941
From the output we can see the intercept is at 3.040, and the slope is 1.941.
The slope means that for each additional hot wing consumed, the amount of beer consumed increases by 1.941 fluid-ounces. The positive slope is consistent with the upward trend observed in the scatterplot.
In this case, the intercept cannot be interpreted. There are no data points where the explanatory variable, the number of hot wings consumed, is close to or equal to zero.
The equation for the regression line is: Beer = 1.941 * Hotwings + 3.040
Now that we have created the regression model, we can graph the line on the scatterplot. To do this, we first need to create the plot. Then, we use the abline command. Note that the intercept is the first argument, followed by the slope.
plot (Hotwings, Beer, ylab = "Beer Consumed (fl-oz)",
xlab = "Hot Wings Consumed", main = "Beer Consumption and Hot Wing Consumption")
abline (3.040, 1.941)
Now let’s check for the normality of errors. To do this, we need to look at residuals for all of our data. A residual is the observed value minus the expected value. Do not calculate the residuals by hand; R can do this very quickly, using the method shown below:
BeerResid <- BeerMDL$residuals
BeerResid stores the residual values for each of the 30 data points. We can create a histogram of the residuals to see if they are normally distributed.
hist(BeerResid)
From the histogram, we can see the residuals appear to follow an approximately normal distribution.
Another useful tool we can use is a Q-Q plot, which plots the quantiles of the residuals against the quantiles of a normal distribution. Residuals will follow a straight line if they are normally distributed. Creating this plot is very simple, using the commands shown below:
qqnorm(BeerResid)
qqline(BeerResid)
The residuals are fairly close to the line, which is a good indicator of normalilty.
Our original scatterplot can also be useful here to check for heteroscedasticity. The vertical spread of the data points should be similar for different amounts of hot wings consumed.
plot (Hotwings, Beer, ylab = "Beer Consumed (fl-oz)",
xlab = "Hot Wings Consumed", main = "Beer Consumption and Hot Wing Consumption")
From this plot, there aren’t obvious signs of heteroscedasticity.
Another way to check for heteroscedasticity is to plot the residuals against values for hot wings consumed.
plot (BeerMDL$residuals ~ Hotwings)
abline(0,0)
The vertical spread is also fairly consistent here, thus showing no signs of heteroscedasticity.
The model summary can give us a lot of useful information.
summary(BeerMDL)
##
## Call:
## lm(formula = Beer ~ Hotwings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.566 -4.537 -0.122 3.671 17.789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0404 3.7235 0.817 0.421
## Hotwings 1.9408 0.2903 6.686 2.95e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.479 on 28 degrees of freedom
## Multiple R-squared: 0.6148, Adjusted R-squared: 0.6011
## F-statistic: 44.7 on 1 and 28 DF, p-value: 2.953e-07
We can see the intercept and slope of our regression model listed under “Estimate Std.” The Residual Standard error is the Mean Square Error.
Just to be sure, we can calculate the MSE by hand.
sqrt(sum(BeerResid^2)/28)
## [1] 7.479343
The answer is the same as the value given in the model summary.