This R guide focuses on simple linear regression, a model with two quantitative variables: one used as the predictor and the other as the response. The predictor is the variable used to explain or predict the response; the response is the variable whose values we are trying to predict.
First, we’ll need to learn how to visually determine if simple linear regression can be applied to the data by using a scatterplot. Then we will need to determine the direction and strength of the linear relationship of the data by looking at correlation. Finally, we’ll fit a regression line to the data, learn how to interpret it, calculate residuals, use it to make predictions, and create and interpret both confidence and prediction intervals.
The data we will use for this is cars from the R package datasets. This data is cross-sectional, meaning it was collected at a single point in time rather than over a period of time, so time is not one of the variables and it is appropriate to analyze the data with regression. The data set contains two variables: the speed a car was traveling when it began to stop and the distance the car needed to come to a complete stop. The data were recorded in the 1920s, so some of the stopping distances might seem large for the relative speed of the car. From this data set, we want to determine whether the speed of a car at the time of stopping affects the distance it takes the car to fully stop.
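If you want more background on the data set, such as the units of each variable and the original source, you can open its help page from the R console:

?cars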
The first thing we’ll want to do is investigate what our data looks like. The first command, names(), will display the names of all of the variables in the data set. The second command, head(), will show us the first few rows of the data set. Both commands take the name of the data set as their argument.
library(datasets)
data(cars)
names(cars)
## [1] "speed" "dist"
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
We see that our data set is made up of two variables, speed and distance, and that each car corresponds to one row of data.
It’s always a good idea to investigate the data you are working with before you start any analysis so that you can be certain you are analyzing it appropriately.
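For example, we can confirm how many cars are in the data set with nrow(), which returns the number of rows:

nrow(cars)
## [1] 50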
Now, we’ll look at visualizing the data through a scatterplot. To do this, we use the command plot(). The arguments of this function are both quantitative variables and usually take the form of plot(predictor, response). In our case, since we want to determine the effect of car speed on stopping distance we would type:
plot(cars$speed, cars$dist)
We can also add axis labels to make the graph more informative. To do this, we can use the arguments xlab and ylab for the x and y axis labels, respectively.
plot(cars$speed, cars$dist, xlab = "Car Speed at Stopping (in mph)",
ylab = "Stopping Distance (in feet)")
You can also add a title to the graph using the argument main, if you choose. The command would then look like this:
plot(cars$speed, cars$dist, xlab = "Car Speed at Stopping (in mph)",
ylab = "Stopping Distance (in feet)", main = "The Effect of Car Speed on Stopping Distance")
We can use this scatterplot to examine the trend of the data in order to form a reasonable expectation for the correlation. The data follow an upward trend here, meaning that higher car speeds at stopping time are generally associated with longer stopping distances. Additionally, the trend seems to be relatively linear, with the points situated fairly close together.
Then, to support our examination of the scatterplot we can check the correlation of the two variables by using the function cor(). Here, the arguments follow the same pattern as the plot() command, cor(predictor, response):
cor(cars$speed, cars$dist)
## [1] 0.8068949
We receive a fairly high, positive correlation of .807, meaning the data do seem to exhibit a strong, positive, linear relationship, which supports our findings from examining the scatterplot.
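If we also want a formal test of whether this correlation differs from zero, the cor.test() function performs one (output not shown here); in simple linear regression, its t-statistic and p-value match those of the t-test for the slope, which we will see in the model summary below.

cor.test(cars$speed, cars$dist)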
So, now that we’ve examined our data we can begin creating the regression model. We can do this using the lm() command which uses the response and predictor variables in this way: lm(response ~ predictor). Here, lm stands for linear model.
Knowing this, we can create a model using our response and predictor variables:
carmod <- lm(dist ~ speed, data = cars)
We can also print out a summary of our model using the summary() command which takes our model as its argument:
summary(carmod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The summary tells us a lot and interpreting its output is possibly one of the most important aspects of building any model, so, how do we do it?
The summary begins by reminding us of the model that we created and then gives a five-number summary of the residuals, which we will discuss later. The next part of the summary is the information on the regression coefficients. Here, we find the least squares point estimates of the coefficients and their standard errors. Least squares chooses the line that minimizes the sum of the squared vertical distances between the data points and the regression line. Using this information given by the summary, we can also write our regression equation: \(\widehat{Distance} = -17.6 + 3.9 \cdot CarSpeed\)
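If we only need the coefficient estimates, we can extract them from the model with coef() instead of retyping them; rounding to four decimal places reproduces the values in the summary:

round(coef(carmod), 4)
## (Intercept)    speed 
##    -17.5791   3.9324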
So, now that we have the regression equation, how can we interpret the regression coefficients?
First, you always need to check whether the intercept coefficient makes sense contextually before you decide whether or not to interpret it. We see here that it does not, because the intercept would predict a stopping distance of -17.6 feet for a car going 0 mph, and a stopping distance cannot be negative. However, we can interpret the car speed coefficient, or the slope. This coefficient tells us that for every increase of 1 mph in car speed, the predicted stopping distance of the car will increase by 3.9 feet.
In many cases, we do not want to report only a point estimate for a regression coefficient, just as we usually do not report only a point estimate for a prediction, because doing so creates a false sense of certainty. Thus, just as we create confidence or prediction intervals for predictions, we can also create confidence intervals for regression coefficients.
To do this, we can use the command confint(model, level). The level argument specifies the confidence level you want, e.g. 90%, 95%, etc. So, we’ll create a 95% confidence interval for the speed regression coefficient and the intercept like this:
confint(carmod, level = 0.95)
##                  2.5 %    97.5 %
## (Intercept) -31.167850 -3.990340
## speed         3.096964  4.767853
We can interpret the output by saying that we are 95% confident that, as the speed of a car at stopping increases by one mph, the stopping distance required by this car to come to a complete stop increases by between 3.097 and 4.768 feet on average.
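If we wanted a different confidence level, say 90%, we would only need to change the level argument (output not shown):

confint(carmod, level = 0.90)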
Like the interpretation of the point estimate of the intercept, the interpretation of the confidence interval of the intercept does not make sense contextually so we will not interpret it.
Our summary also gives the results of a t-test for the predictor variable and the intercept of our model. The result of each t-test indicates whether the term in question makes our model significantly better than it would be without that term. For the slope, the t-test tests the null hypothesis that there is no linear relationship between the predictor and the response variable, \(\beta_1 = 0\), against the alternative hypothesis that there is a linear relationship between them, \(\beta_1 \neq 0\). The summary output gives us both the test statistic and the p-value for each test. Because the p-value for the car speed hypothesis test is less than even a very small alpha level of .001, we conclude that speed significantly impacts the stopping distance of a car. However, if the p-value had been large, we would have concluded that there is no evidence of a linear relationship between the speed of a car and its stopping distance.
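Note that the t value reported in the summary is simply the coefficient estimate divided by its standard error; we can verify this for speed:

3.9324 / 0.4155
## [1] 9.46426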
You might also notice that there is a hypothesis test done for the intercept as well, but this test is generally ignored because intercepts are usually kept in models regardless of their significance.
Next, the summary gives us residual standard error and two different values of R-squared. The value of R-squared that we focus on is the Multiple R-squared and it tells us how much of the variability in our response variable is explained by our model. From this example, we can see that R-squared is .6511 meaning that car speed explains 65.11% of the variability in car stopping distance.
Finally, the summary output gives us the results of the F-test for the model. Similar to the t-test, the result of the F-test indicates whether the relationship between the predictor and response variable is significantly linear. To be explicit, the F-test tests the null hypothesis that \(\beta_1 = 0\) against the alternative hypothesis that \(\beta_1 \neq 0\). For our model, the F-statistic is 89.57, which we would compare to an F distribution with 1 and 48 (the number of observations minus 2) degrees of freedom. Making that comparison, we see that the F-statistic is significant, with a p-value of less than even .001. This means that the regression relationship is significant; thus, there is a significant linear relationship between the speed of a car and its stopping distance.
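In simple linear regression, the F-test is equivalent to the t-test for the slope; in fact, the F-statistic is just the square of the slope’s t-statistic:

9.464^2
## [1] 89.5673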
So, now that we’ve created our regression model, we might want to see the regression line with our data to help visualize the relationship between car speed and stopping distance. We can do that by using the command abline(). This command takes the arguments of abline(intercept, slope). We’ll add it to our scatterplot like this:
plot(cars$speed, cars$dist, xlab = "Car Speed at Stopping (in mph)",
ylab = "Stopping Distance (in feet)")
abline(-17.5791, 3.9324)
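As a shortcut, abline() also accepts a fitted model object directly, so we could replace the last line above with:

abline(carmod)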
It is important to note that the plot() command must come before the abline() command since you need to have an already existing plot to add the line to.
From our model, we can obtain residuals. A residual is the observed value minus the predicted value at a certain x value. The residuals are important because they describe how well our model fits the data, since they represent the vertical distance from the regression line to each data point, and because they help us determine whether using linear regression to model the data is acceptable. If we wanted to find the residual for the thirteenth car in our data set, we could use the command residuals(model), which extracts the residuals from the model, and then specify that we want the thirteenth one like this:
residuals(carmod)[13]
##       13 
## -9.60981
Alternatively, we can extract the residuals by using model$residuals.
carmod$residuals[13]
##       13 
## -9.60981
As you can see, we have extracted the residual of the thirteenth car in two different ways, but achieved the same answer in either case.
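As a sanity check, this matches the definition of a residual: subtracting the thirteenth car’s fitted value from its observed stopping distance gives the same number.

cars$dist[13] - fitted(carmod)[13]
##       13 
## -9.60981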
The residual we find for the thirteenth car is -9.61. Since it is negative, our model has actually overpredicted the stopping distance of the thirteenth car by 9.61 feet.
One thing we need to do before we go any further is check the assumptions of the regression model, because violations of these assumptions could indicate that there are problems with our model and its output. We already know that the errors will be independent, since each measurement was taken from a different car, so our model passes one condition.
We can check that the errors come from a normal distribution by creating a normal quantile (Q-Q) plot. We need two commands to do this: qqnorm(), which plots the observed quantiles of our residuals against the theoretical quantiles of a normal distribution, and qqline(), which adds a reference line showing where the points should fall if the residuals were normally distributed. Both commands take the residuals of our model as their argument.
qqnorm(carmod$residuals)
qqline(carmod$residuals)
Since most of the residuals seem to stick pretty close to the line, we can say this assumption is mostly met. However, take note of the points in the upper right corner of the plot. Those points do not seem to come from a normal distribution, but since there are only a few, we won’t worry about them too much.
We can check for equal variance by plotting the residuals against car speed. We add a horizontal line at 0 to better visualize the spread of our residuals around 0.
plot(cars$speed, carmod$residuals, xlab = "Car Speed at Stopping (in mph)",
ylab = "Model Residuals")
abline(0, 0)
It seems that there could be a slight issue with this condition because the variance of the residuals for cars at lower speeds appears to be less than the variance of the residuals for cars at higher speeds. We should watch out for any problems that this might cause, such as inaccurate confidence and prediction intervals due to incorrectly estimated variances. However, at all speeds, the residuals look to be roughly centered around 0.
All of these assumption checks are convincing enough to allow us to trust our regression model, while remembering that problems could arise from the possible lack of equal variance.
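R can also produce a standard set of diagnostic plots, including a residuals-versus-fitted plot and a normal Q-Q plot like the ones we made by hand, by calling plot() on the model itself:

par(mfrow = c(2, 2))  # arrange the diagnostic plots in a 2-by-2 grid
plot(carmod)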
One of the great things about making models is that they can often help us make predictions. We might want to predict the stopping distance of a car going 22 mph at the time of stopping. We can do this by plugging 22 into our regression equation:
-17.5791 + 3.9324 * 22
## [1] 68.9337
We’ve now predicted that a car going 22 mph at the time of stopping will take 68.93 feet to stop completely. This is a point estimate of the stopping distance for a car going 22 mph.
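To avoid rounding error from retyping the coefficients, we can also compute the same prediction directly from the stored model; notice that the last digit differs slightly from the hand calculation above because the summary output rounds the coefficients:

unname(coef(carmod)[1] + coef(carmod)[2] * 22)
## [1] 68.9339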
We have just used our model to make a prediction for stopping distance, given that the speed of the car is within our range of speed values, or our experimental region. However, we can also make a prediction interval and a confidence interval for this estimate.
First, let’s make a prediction interval for the stopping distance of an individual car going 22 mph. We start by creating a new data frame with our one value of interest in the speed column, since speed is our predictor variable. We then use this in our predict() command, which takes the form predict(model, newDataFrame, interval, level). The interval argument specifies the type of interval you are creating and can be either "prediction" (which may be abbreviated "predict") or "confidence" for a prediction or confidence interval, respectively. Then, the level argument specifies your confidence level. The default is a 95% confidence level, so, if level is not specified, you will receive a 95% prediction or confidence interval, depending on what was specified for the interval argument.
newCar <- data.frame(speed = 22)
predInt <- predict(carmod, newCar, interval = "predict")
predInt
##       fit      lwr      upr
## 1 68.9339 37.22044 100.6474
We’ve created the prediction interval which gave us a point estimate of 68.93 (the same as our point estimate above) and the prediction interval of [37.22, 100.65]. Our prediction interval is a 95% prediction interval because this is the default and we did not specify a confidence level.
From this prediction interval, we can say that we are 95% confident that an individual car going 22 mph at the time of stopping will take between 37.22 and 100.65 feet to come to a complete stop.
Now, we can create a confidence interval for the mean stopping distance of a car going 22 mph. We use the same predict command here, but specify the interval as confidence.
confInt <- predict(carmod, newCar, interval = "confidence")
confInt
##       fit     lwr      upr
## 1 68.9339 61.8963 75.97149
We can see that this interval again gives us the same point estimate. However, our confidence interval is [61.9, 75.97], narrower than our prediction interval. This is a characteristic difference between a prediction interval and a confidence interval at the same confidence level: the interval for an individual car must account for car-to-car variability on top of the uncertainty in the estimated mean, so it is always wider. Again, our confidence level is 95% since that is the default for the function and we did not specify one.
We can interpret this confidence interval by saying we are 95% confident that the mean stopping distance for a car going 22 mph at stopping is between 61.9 and 75.97 feet.
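To quantify the difference in precision, we can compute the width of each interval from the bounds above; the prediction interval is roughly 63 feet wide, while the confidence interval is only about 14 feet wide.

predInt[, "upr"] - predInt[, "lwr"]  # width of the prediction interval
confInt[, "upr"] - confInt[, "lwr"]  # width of the confidence interval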