Introduction

This R Guide will focus on the content from Chapter 3 of Forecasting, Time Series, and Regression. The topic of this chapter is Simple Linear Regression. We will be using RStudio to build a simple linear regression model step by step, as well as focusing on the interpretation and significance of the various components of our model. Several key points throughout this R Guide will be: examining the data, building and interpreting the regression equation, checking the model assumptions, reading the model summary, and constructing confidence and prediction intervals.

Examining Data

In this R Guide I will be using the built-in cars data set, which contains the speed of cars in miles per hour and the corresponding distance, in feet, that it takes each car to stop. Since this is a built-in data set, it is extremely easy to access, and we will attach it right away to make future commands easier.

data(cars)
attach(cars)
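
As a quick first look, the str() and summary() commands (both standard base R functions) show the structure and basic summary statistics of the data set.

str(cars)      # 50 observations of two numeric variables: speed and dist
summary(cars)  # minimum, quartiles, median, mean, and maximum for each variable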

Before jumping right into linear regression, it is important to look at the data and see if there is a potential linear relationship. The simplest and most efficient way to do this is through a scatterplot. This graph will allow us to see the relationship between two quantitative variables. In this case our two variables are speed and distance, and for this model we will use speed as the explanatory variable while distance will be the response.

plot(speed, dist, ylab = "Stopping Distance (Feet)",
     xlab = "Car Speed (MPH)",
     main = "Stopping Distance of Cars Based on Speed")

From this plot we can notice several things about our data. First, there appears to be a clear relationship between the speed of the car and the distance it takes for that car to stop. Based on our plot we can see that as speed increases, the distance to stop also increases, so we have a positive association between the two variables. Another component of this plot to look at is the spread of the points. It is important to look at whether these points are closely packed together or spread apart. For our data, the points are fairly spread out around the overall trend, telling us that there is a fair amount of variability, which is important to take into account when considering our linear regression model.

Regression Model

Seeing as how we are doing all of our calculations in R, creating the linear regression model is a simple one-step command. This linear regression model will be used to describe the relationship between the predictor variable (speed) and the response variable (distance).

car.mod <- lm(dist ~ speed)
car.mod
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
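
If we want to reuse these numbers later without retyping them, the coef() function pulls the estimated coefficients out of the model object as a named vector.

coef(car.mod)           # named vector containing (Intercept) and speed
coef(car.mod)["speed"]  # just the slope estimate, approximately 3.932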

Building and Interpreting the Regression Equation

We can then use these coefficients in the linear regression equation. The general simple linear regression equation is: \[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\] The intercept coefficient is the beta-zero value, and the speed coefficient is the beta-one value, which is the slope of the regression equation. Inputting the values thus yields the equation: \[\hat{y}_i = -17.579 + 3.932 x_i\]

Another way of writing it would be: \[\widehat{dist}_i = -17.579 + 3.932 \, (speed_i)\]

While it is great to have a linear regression equation, it is more important to understand its components and what they tell us about the relationship in the data. This is done by interpreting the coefficients of the linear regression model. Just like in a typical y = mx + b equation, m is the slope of the line and b is the y-intercept. With linear regression, it is usually simplest to interpret the slope first. In this case the slope is 3.932, meaning that for every one mile per hour increase in speed, we estimate that it will take the car an additional 3.932 feet to come to a stop.

Interpreting the intercept can be more difficult, as it is important to think about what the y-intercept means in the context of the problem. The y-intercept can only be interpreted if the predictor variable can realistically equal zero and if the intercept then makes sense logically. In this equation it is possible for the x-value to equal 0, as that would simply mean the car is going 0 miles per hour, or in other words is already stopped. However, the intercept of -17.579 does not make sense logically when x equals zero: it would mean that a car going 0 mph takes -17.579 feet to stop. Since a negative stopping distance is impossible, the y-intercept cannot be meaningfully interpreted in this instance.
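
As a quick numerical check of the slope interpretation, we can compare the predicted stopping distances at two speeds that differ by exactly one mile per hour; the difference should come out to the slope. The speeds 10 and 11 below are chosen purely for illustration.

(-17.579 + 3.932*11) - (-17.579 + 3.932*10)  # the difference is 3.932 feet, the slope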

Regression Visualization

Now that we have our linear regression equation, it can be helpful to plot it with our original scatterplot to see how the two are related.

plot(speed, dist, xlab = "Car Speed (MPH)", 
     ylab = "Stopping Distance (Feet)",
     main = "Stopping Distance of Cars Based on Speed")
abline(-17.579, 3.932, col = "blue")
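
As a side note, abline() can also take the fitted model object directly, which draws the same line without retyping (and rounding) the coefficients.

abline(car.mod, col = "blue")  # equivalent line using the unrounded coefficients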

This line does a good job of letting us visualize the linear relationship in our data. One thing that may be clearer now that we have drawn the regression line is the amount of scatter in our data: while there is a clear linear relationship, the points are not packed around the line as closely as we might like.

Prediction and Residuals

We can now use the linear regression equation to make predictions about what the stopping distance will be for a car going a particular speed. For example, we can predict how far it takes for a car going 8 mph to stop.

-17.579+3.932*8
## [1] 13.877
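
Instead of plugging the rounded coefficients in by hand, we can also let R compute the fitted value with the predict() function; the result differs very slightly from 13.877 because R uses the unrounded coefficients.

predict(car.mod, newdata = data.frame(speed = 8))  # approximately 13.88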

Looking at our data, the observed stopping distance for a car going 8 mph was 16 feet, as shown in row 5.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
cars[5,2]
## [1] 16

Thus we can calculate the residual value by taking the observed value minus the predicted value.

cars[5,2]-(-17.579+3.932*8)
## [1] 2.123

The residual value of 2.123 indicates that our predicted value was 2.123 feet too small. We could do this for every observation in the data set, and these residual values will be needed when checking our model assumptions.
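
We can also double-check this against the residual that R stores in the fitted model object. It should be close to our hand calculation, differing slightly because R keeps the unrounded coefficients.

residuals(car.mod)[5]  # approximately 2.12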

Model Assumptions

When modeling with simple linear regression, we make several assumptions that need to hold in order for the model to be appropriate.

Homoscedasticity

One of the assumptions we need to make is that the data displays homoscedasticity. What this means is that the variability of the response stays roughly constant across all of our x-values. One quick way to do a simple check is by referring to the scatterplot of our data.

plot(speed, dist, xlab = "Speed (MPH)", 
     ylab = "Distance (Feet)",
     main = "Stopping Distance of Cars Based on Speed")
abline(0,4, col = "green")
abline(-40,4, col = "green")

I plotted two green parallel lines along with our scatterplot to give us a better idea of how the variance changes throughout our data. (Please note that the green lines hold no statistical significance; they are only included as a reference point.) Though there are a few outliers at the upper end, it doesn't look like enough to cause major concern about our homoscedasticity. We can check this assumption in another way by plotting the residual values against speed.

plot(car.mod$residuals ~ speed)
abline(0, 0)
abline(20, 0, col = "red")
abline(-20, 0, col = "red")

Looking at this plot, I feel like it confirms what we saw in our scatterplot. There are a few outliers on the upper end, but nothing that would cause enough concern for us to say this violates our assumption. (Similar to the green lines above, I simply included the red lines as a reference point; they hold no statistical significance.)
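
R also provides built-in diagnostic plots for lm objects. The first of these plots the residuals against the fitted values, which is another standard way to look for non-constant variance.

plot(car.mod, which = 1)  # residuals vs. fitted values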

Normal Distribution

Another assumption we make is that our residual errors follow a normal distribution. Our quick check for this assumption will be to plot the residuals as a histogram and make sure its shape approximately matches that of a normal distribution.

car.resids <- car.mod$residuals
hist(car.resids)

Looking at the histogram, it appears to match the bell-shaped curve that we know a normal distribution has. Just like with the homoscedasticity check, we can go a little more in-depth and use a qqnorm plot to confirm what we see in our histogram.

qqnorm(car.resids)
qqline(car.resids)

I like the way this qqnorm plot looks. Most of our points fall either on or near the line, which is what we look for in a qqnorm plot, so I feel it confirms our assumption that the errors follow a normal distribution.
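
For those who want a more formal check, base R includes the Shapiro-Wilk normality test, which tests the null hypothesis that the residuals come from a normal distribution. A small p-value would be evidence against normality, while a large one would give us no reason to doubt it.

shapiro.test(car.resids)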

Model Summary

Knowing that our model assumptions check out, let’s take a look at a summary of what we know about our linear regression model so far.

summary(car.mod)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

A simple summary command gives us a ton of information about our model. There is a lot to look at, so it is best to take it one step at a time so we don't miss anything and so we can understand what each piece is telling us.

Estimate

Under the Coefficients heading, the first thing to look at is the Estimate column. This gives us the same information we saw when we first created the model: the B0 and B1 values that we use to write our equation. Since I covered this when we first calculated our linear regression equation, I will move past it for now.

T value and Pr(>|t|)

Found two and three columns over from our point estimates, the t value and Pr(>|t|) columns are important for testing the significance of each beta value. The t value is our test statistic, which we use to test H0: B1 = 0. If we fail to reject this null hypothesis, it suggests that B1 may be 0 and that our predictor holds no significance in modeling our data. If that were the case in simple linear regression, we would have to look for a new predictor variable, since the current one would not be statistically significant. To carry out the test by hand, we would compare |t| to t[alpha/2] with (n-2) degrees of freedom. These calculations can all be done manually, but the result is also included for us in the summary under the Pr(>|t|) column. This column gives the p-value for testing our t value against the two-sided alternative hypothesis Ha: B1 ≠ 0, which we compare to an alpha level of .05. The more significant our predictor variable is, the smaller we expect this p-value to be. In this case our p-value is 1.49 x 10^-12, which is an extremely small value and indicates that speed is a significant predictor of stopping distance.

We can carry out the same procedure for the B0 value. The p-value for the intercept is much higher at .0123. Depending on how precise we want our model to be, it may make sense to drop the y-intercept, or to leave it in while understanding that it does not hold major significance. This also relates back to what was discussed in the Building and Interpreting the Regression Equation section, where we looked at interpreting the intercept value and determined that we could not in this model. In this case we will leave it in, understanding that it is not essential to our model.
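
To see where the speed row comes from, we can reproduce the t value and its two-sided p-value by hand from the estimate and standard error in the summary, using n - 2 = 48 degrees of freedom.

t.stat <- 3.9324 / 0.4155                          # estimate divided by its standard error
t.stat                                             # approximately 9.46, matching the summary
2 * pt(abs(t.stat), df = 48, lower.tail = FALSE)   # two-sided p-value, approximately 1.5e-12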

R-squared Value

Towards the bottom of the summary we can find the Multiple R-squared value. The R-squared value is calculated by dividing the explained variation by the total variation, and it tells us what proportion of the variation in the response is explained by the linear regression model. R-squared ranges between 0 and 1, and a higher value indicates a stronger linear relationship between the predictor and response variables. When looking at speed as a predictor of stopping distance, we get an R-squared value of .6511. This means that 65.11% of the variation in stopping distance is explained by the speed the car is traveling. Another related value is the Adjusted R-squared. This value is very similar to the regular R-squared, but it makes an adjustment that lowers the value based on how many predictor variables we use. This is done because it is theoretically possible to add enough variables to a model that the resulting R-squared approaches 1. The adjusted R-squared takes this into account so we can get a better idea of how much of the relationship our predictors actually explain, which matters most when we add more predictors in multiple linear regression.
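
We can also recompute the R-squared value directly from its definition, using the residual sum of squares (unexplained variation) and the total sum of squares (total variation).

sse <- sum(residuals(car.mod)^2)    # unexplained (residual) variation
sst <- sum((dist - mean(dist))^2)   # total variation in stopping distance
1 - sse/sst                         # approximately 0.6511, the Multiple R-squared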

Correlation Coefficient

Related to the R-squared value above, we can also calculate a simple correlation coefficient that measures the strength of the linear relationship between x and y. The correlation coefficient, denoted r, can be calculated by taking the square root of the R-squared value and giving it the same sign as the B1 value. That is, if B1 is positive r will be positive, and if B1 is negative then r will be negative. Even simpler than taking the square root ourselves, we can let R do the work for us by using the cor command.

cor(speed, dist)
## [1] 0.8068949

A value closer to one indicates a stronger positive correlation, so a value of .8069 points towards a strong positive correlation. However, strong correlation does not necessarily indicate a cause-and-effect relationship; it simply means that x and y tend to move in the same direction (i.e., as speed increases, stopping distance also increases).
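
As a quick consistency check, squaring this correlation should give back the Multiple R-squared value from the model summary.

cor(speed, dist)^2  # approximately 0.6511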

F-Test

One final thing to look at in the linear regression summary is the F statistic at the very bottom. The purpose of the F-test is to test the significance of the overall relationship between x and y. Our null hypothesis for the F-test is H0: B1 = 0, meaning that if the null hypothesis holds then the relationship between x and y is not significant. We are given the F statistic at the bottom of the summary, and we can either compare it to the critical F value with 1 and (n-2) degrees of freedom, or use the p-value given to us to decide whether to reject the null hypothesis. The summary gives us the p-value right next to the F statistic, and we reject the null hypothesis if the p-value is less than our chosen alpha level. In this summary our p-value is 1.49 x 10^-12, which is small enough that we can reject the null hypothesis and conclude that there is a significant relationship between speed and stopping distance.
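
If we wanted to carry out the comparison ourselves, we could look up the critical F value with 1 and 48 degrees of freedom at the .05 level, or recompute the p-value from the F statistic reported in the summary.

qf(0.95, df1 = 1, df2 = 48)                        # critical value, approximately 4.04
pf(89.57, df1 = 1, df2 = 48, lower.tail = FALSE)   # p-value, approximately 1.5e-12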

Intervals

Confidence Interval for Slope

Something we may be interested in is creating a confidence interval for our slope coefficient. Though we have our B1 estimate, it is based only on the data we currently have. An interval gives us a better idea of the range in which the true value of B1 is likely to fall.

confint(car.mod, level = .95)
##                  2.5 %    97.5 %
## (Intercept) -31.167850 -3.990340
## speed         3.096964  4.767853

In this example we use a level of .95, meaning we are 95% confident that the true slope B1 falls between 3.0970 and 4.7679. This interval is just another way of learning more about the relationship between our two variables and how precise our estimate of B1 is.
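
This interval can also be reproduced by hand from the slope estimate and its standard error, using the t critical value with 48 degrees of freedom.

3.9324 + c(-1, 1) * qt(0.975, df = 48) * 0.4155  # approximately (3.097, 4.768)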

Confidence and Prediction Intervals

One reason to do linear regression is that we can then take the results and use them to make forecasts about future data points. One way we can do this is through confidence and prediction intervals. While before we just plugged a speed value into our equation to get a point estimate, a better practice is to use confidence and prediction intervals, which provide a range that accounts for some of the variability in our data. To compute these intervals, we first create a new data frame containing the value we want our intervals to be based around.

newdata.car <- data.frame(speed = 18)

For this guide I chose a speed value of 18, since it falls roughly in the middle of our preexisting speed data. We can then use this new data to create our confidence and prediction intervals.

Prediction Interval

First let's look at the prediction interval.

(pred <- predict(car.mod, newdata.car, interval="predict") )
##        fit      lwr      upr
## 1 53.20426 21.89839 84.51014

It is important to understand the difference between a confidence interval and a prediction interval, as they tell us two different things but are easily confused. A prediction interval is a 100(1-alpha)% interval for an individual value of y at a given value of the predictor x. Another way to think about it: if the speed of one particular car is some given value, let's predict the distance it takes that car to stop. Based on the alpha level chosen, we are 100(1-alpha)% confident that the individual value will fall in this interval. So, for a speed value of 18 and an alpha level of .05 (the default in R), we get a prediction interval of [21.89839, 84.51014], meaning we are 95% confident that the stopping distance of an individual car traveling 18 mph will be between 21.898 and 84.510 feet.

Confidence Interval

Next we will look at a confidence interval for the same speed value of 18.

(conf <- predict(car.mod, newdata.car, interval="confidence") )
##        fit      lwr      upr
## 1 53.20426 48.32138 58.08715

While similar to a prediction interval, the confidence interval provides an interval for the mean value of y at a given value of x. So rather than looking at a single individual prediction, a confidence interval focuses on the average response at that x value. In this example, where we again use a speed of 18, we get a confidence interval of [48.32138, 58.08715], meaning we are 95% confident that the average stopping distance for all cars going 18 mph is between 48.321 and 58.087 feet.

One thing to notice when comparing our confidence and prediction intervals is that our prediction interval is much wider than our confidence interval.

conf %*% c(0, -1, 1)
##      [,1]
## 1 9.76577
pred %*% c(0, -1, 1)
##       [,1]
## 1 62.61175

Our confidence interval has a width of about 9.766 feet, while our prediction interval has a width of about 62.612 feet. The reason for this is that there is much more variability in a single observation than in the mean of all y values at a given x. Thus, we would always expect the prediction interval to be wider than the confidence interval when both are calculated at the same alpha level.