This R guide will cover material from chapter 3 of “Forecasting, Time Series, and Regression” by Bowerman, O’Connell, and Koehler called Simple Linear Regression.
In chapter 3, we began by talking about simple linear regression models. Simple linear regression examines the relationship between an independent variable, x, and a dependent variable, y, by looking at a scatterplot of the data and approximating a straight line through the points. The equation for the model is: \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
In this equation, \(\beta_0 + \beta_1 x_i\) is the mean value of the dependent variable y when the independent variable takes the value \(x_i\). \(\beta_0\) is the y-intercept, or the mean value of y when x equals 0. \(\beta_1\) is the slope. It is the change in the mean value of y with every one-unit increase in x. \(\epsilon_i\) is the error term that accounts for things other than x that affect y.
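For reference, the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that R will compute for us come from the standard least-squares formulas: \[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]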
To create this model in R, we must first select a data set. We are going to use the built-in cars data set, which lists the speed of cars (mph) and the distances (ft) taken to stop. We call up this data and tell R we want to use it by using the data and attach commands as follows:
data(cars)
attach(cars)
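The attach command puts the columns of cars on R's search path so we can refer to speed and dist by name. If you prefer not to attach, you can always reference the columns explicitly instead; for example:
cars$speed  # same values as speed after attach(cars)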
If we want to see our variable names and a snippet of the data in the set, we will use the head() command with the name of our data as the argument. We can use the summary() function to see a short analysis of our data, including the min, max, mean, median, etc. of our variables. We can use this to get an idea of the range of our data points.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
In this example, both of our variables are quantitative variables since they are described using numbers rather than words.
We are now going to create a scatterplot of the speed of cars vs. their stopping distances to view the relationship between these two variables. In this example, we will say that the stopping distance depends on the speed, so the stopping distance is the dependent variable (y) and the speed is the independent variable (x). To create a scatterplot, we use the plot function with two arguments in the order (predictor variable, response variable). We can also label our axes using the ylab and xlab specifiers and put a title on the graph using the main specifier, as follows:
plot(speed, dist, ylab = "Car Stopping Distance (ft)", xlab = "Speed (mph)", main = "Car Speed vs. Stopping Distance")
We now want to find the line that describes the relationship between our two variables, the line described by the simple linear regression model. We create this regression line using the lm function and two arguments, the response ~ the predictor, as follows:
mymod <- lm(dist ~ speed)
mymod
##
## Call:
## lm(formula = dist ~ speed)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
After we run our regression model, we can interpret our results. Our fitted regression equation from these results is \(\widehat{dist} = -17.579 + 3.932 \cdot speed\) (the error term drops out because the fitted line predicts the mean). With a slope of 3.932, we say that with every one-unit (mph) increase in speed, our stopping distance increases by 3.932 feet on average.
In this case, we will not interpret our y-intercept because a stopping distance at a speed of 0 does not make sense; you would not have a stopping distance if you were not moving to begin with. (The negative estimate, -17.579 ft, is a reminder that the model should not be extrapolated to speeds near 0.)
We will now add our regression line to the plot by first plotting the data and then using the abline function. The abline function requires two arguments, the values of the intercept and the slope that we got from our regression model. Within the function, they must be placed in the order (intercept, slope).
plot(speed, dist, ylab = "Car Stopping Distance", xlab = "Speed", main = "Car Speed vs. Stopping Distance")
abline(-17.579,3.932)
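As a shortcut, abline can also take the fitted model object itself and read the intercept and slope from it, so the following line is equivalent:
abline(mymod)  # pulls the coefficients out of the lm object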
So, the relationship between speed and stopping distance is best described by the line seen here. The equation of this line is the fitted line from our simple linear regression model: \(\widehat{dist} = -17.579 + 3.932 \cdot speed\).
Another way we can examine the relationship between speed and stopping distance is to look at the correlation between them. We use the variable r to represent correlation. In our case, this will measure the strength and direction of the linear relationship between the speed and stopping distance. We use the cor() function followed by our two variables:
cor(speed, dist)
## [1] 0.8068949
Because this is positive and pretty close to 1, there is a strong, positive, linear relationship between the two variables. The closer to 1, the more strongly correlated (linearly related) two variables are. And because they are positively correlated, this means that they move in the same direction: as the speed increases, the stopping distance increases.
When we create regression models, we make multiple assumptions about the error term.
The first assumption we make is that the error term is normally distributed. We can check this assumption by creating a histogram of the residuals and checking its shape:
myresids <- mymod$residuals
hist(myresids)
This histogram looks decently bell-shaped, but we do have a bit of a right skew. In the context of our example, this skew means that for some observations the actual stopping distance is much larger than the model predicts. We would expect there to be more variability in stopping distances for higher speeds just due to variability in the way people stop (slamming on brakes, coming slowly to a stop, etc.) Differences in cars could also account for some of this variability just based on how well the cars stop.
If we have trouble judging whether our error terms follow a normal distribution from a histogram, we can also make a normal Q-Q plot, which plots the quantiles of our residuals against the quantiles of a normal distribution; if the residuals are normal, the points will fall close to a straight line. We use the qqnorm and qqline functions with our residuals as the argument to see this comparison:
qqnorm(myresids)
qqline(myresids)
The data points follow the straight line pretty closely, but the points in the upper tail drift away from the line, consistent with the right skew we saw in the histogram.
Another assumption we make when we create regression models is homoscedasticity, meaning that the variance of the residuals will be equal for all predictor values. We can check this assumption by plotting our residuals against our independent variable (speed, in this case):
plot(mymod$residuals ~ speed)
abline(0,0)
We have issues here because the vertical spread varies for different speeds. For slower speeds, there is very little variability, while for higher speeds, there is more variability. This makes sense in the context of the problem because faster speeds can result in different stopping distances based on whether a person slams on his or her brakes or comes to a stop more slowly, whereas slower speeds result in more constant stopping distances.
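Base R also provides built-in diagnostic plots for fitted lm objects; for example, the first one plots the residuals against the fitted values, which is another common way to check this assumption:
plot(mymod, which = 1)  # residuals vs. fitted values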
Another assumption we have about the error term is that all error terms are independent of each other. Intuitively, this makes sense because the observations are taken randomly, which would suggest independence.
Our mean square error is the point estimate of the variance of the error terms. Although we can calculate it by hand, we will just use the summary function.
summary(mymod)
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The value labeled “Residual standard error,” 15.38, is the point estimate of the standard deviation of the error terms. Squaring it gives our MSE, the point estimate of the variance: \(15.38^2 \approx 236.5\).
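As a quick check, we can also compute the MSE by hand from the residuals:
sse <- sum(resid(mymod)^2)       # sum of squared residuals
mse <- sse / df.residual(mymod)  # divide by n - 2 = 48 degrees of freedom
sqrt(mse)                        # should match the residual standard error, 15.38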
We can also find our \(R^2\) value by looking at the summary of our model:
summary(mymod)
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
At the bottom of our summary, we find our \(R^2\) value of .6511. If we take the square root of this, we recover our correlation, r, as we previously saw. The quantity \(R^2\) is the simple coefficient of determination, which is the proportion of the total variation in our response, stopping distance, that is explained by its linear relationship with speed. This value can only be between 0 and 1, and the closer it is to 1, the more accurately the values of speed in our model predict our stopping distance values. Because our \(R^2\) value is .6511, speed explains about 65% of the variation in stopping distance, so we conclude it is a reasonably good predictor.
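We can verify the relationship between r and \(R^2\) directly; note that the square root recovers r only because the slope is positive (a negative slope would give \(r = -\sqrt{R^2}\)):
sqrt(summary(mymod)$r.squared)  # approximately 0.8069, matching cor(speed, dist)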
In simple linear regression, we use hypothesis tests to see if there is a significant relationship between our two variables. Our null hypothesis is that there is no linear relationship between them, and our alternative hypothesis is that there is one.
We can also perform hypothesis tests based on the summary of our model. Let’s look at it one more time:
summary(mymod)
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
In an F-test, we test the significance of the relationship between our x and y variables, so between speed and stopping distance in this case. In this test, our null hypothesis is that our slope is 0, meaning that speed and stopping distance have no linear relationship. Our alternative hypothesis is that our slope is something other than 0, meaning they do have one. Our summary function has already run the test for us, and we can find our test statistic in the bottom row: 89.57. We can also see that with a p-value of \(1.49 \times 10^{-12}\), we can safely reject the null hypothesis and conclude that there is a linear relationship between speed and stopping distances. Note that the denominator degrees of freedom here is 48 because we estimated 2 parameters from 50 observations (n - 2 = 48).
We can also perform a t-test with this information, but for simple linear regression, our t-test and F-test are equivalent. We have the same null hypothesis that our slope is zero and the two variables have no relationship, and our alternative hypothesis is still that the slope will be something other than zero and the two variables do have a relationship. We find our test statistic at the intersection of the row labeled “speed” and the column labeled “t value” and get a value of 9.464. If we pay close attention, we will notice that this is the square root of the test statistic in our F-test; this is because, in simple linear regression, squaring a t statistic with 48 degrees of freedom gives an F statistic with 1 and 48 degrees of freedom. So we get the same p-value of \(1.49 \times 10^{-12}\) and can reject our null hypothesis in favor of the alternative hypothesis, concluding that there is a significant linear relationship between speed and stopping distances.
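We can pull these statistics out of the summary object to confirm the relationship for ourselves:
s <- summary(mymod)
s$fstatistic                          # F = 89.57 on 1 and 48 df
s$coefficients["speed", "t value"]^2  # 9.464^2, which reproduces the F statistic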
We can also create a confidence interval for our slope using the confint() function, specifying the confidence level we want, as seen below:
confint(mymod, level = .95)
## 2.5 % 97.5 %
## (Intercept) -31.167850 -3.990340
## speed 3.096964 4.767853
The confidence intervals for both the y-intercept and the slope are given above. We see the confidence interval for our y-intercept is (-31.167850, -3.990340), but because it does not make sense to interpret the y-intercept in this example, we will not interpret the confidence interval. The confidence interval for the slope is (3.096964, 4.767853). This means that we are 95% confident that as the speed increases by one mile per hour, the mean stopping distance increases by an amount between 3.096964 and 4.767853 feet. Or, in other words, we are 95% confident that the true slope is between 3.096964 and 4.767853.
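If we want a different confidence level, or just the interval for one coefficient, confint accepts a parm argument naming the coefficient:
confint(mymod, parm = "speed", level = .99)  # 99% interval for the slope only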
We can create prediction and confidence intervals for the stopping distance when the speed is 15 mph:
newdata <- data.frame(speed = 15)
predy <- predict(mymod, newdata, interval = "prediction")
confy <- predict(mymod, newdata, interval = "confidence")
confy
## fit lwr upr
## 1 41.40704 37.02115 45.79292
predy
## fit lwr upr
## 1 41.40704 10.17482 72.63925
Our confidence interval says that when the speed is fixed at 15 mph, we can be 95% confident that the mean stopping distance lies between 37.02115 and 45.79292 ft. The prediction interval says that when the speed is fixed at 15 mph, we can be 95% confident that the stopping distance of an individual car lies between 10.17482 and 72.63925 ft.
Intuitively, our prediction interval should be wider because there is more variability in predicting an individual observation than in estimating an average. We can check this by computing each interval's width; multiplying by c(0, -1, 1) returns upr - lwr:
confy %*% c(0, -1, 1)
## [,1]
## 1 8.771769
predy %*% c(0, -1, 1)
## [,1]
## 1 62.46443
Our intuition is correct; the width of our confidence interval is 8.771769 ft, while the width of our prediction interval is 62.46443 ft.
That’s all; you’re done!