Introducing the Data

Our goal today is to study the fundamentals of simple linear regression using a motivating example. The data set we will use includes two quantitative variables: the speeds and stopping distances of cars in the 1920s. Our first objective is to describe the linear relationship between these two variables. We would like to explore how the speed of a car could influence its stopping distance, so we will define our response variable and predictor as stopping distance and speed, respectively.

Let’s call our data set from the R library and attach the headers of the data so that we don’t have to call the whole set every time we want to use a variable name. The library() and data() functions specify which R package and data set to pull from. Our data comes from the “cars” set in the “datasets” library, so we input those words into their respective functions. The “cars” input is also used for the attach() function so that R will store the names of our variables. To figure out what those names are, we use the head() function, which reveals the first six rows of the data set.

library(datasets)
data("cars")
attach(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Now we know that the observed speeds are under the header “speed” and the observed stopping distances are under the name “dist.” We will use these from now on when referring to our predictor and response variable, respectively. As an example of how to read this data, we would say that row 4 represents an observation of a car going 7 mph that has a stopping distance of 22 feet.
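
Before moving on, it can be reassuring to confirm how much data we are working with. This is a small extra check, not part of the original walkthrough: the dim() function reports the number of rows and columns.

dim(cars)
## [1] 50  2

So we have 50 observations of our 2 variables.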

Model Basics

Constructing the Model

We will name the model that uses speed as the predictor and stopping distance as the response “model1.” Using the general formula for a simple linear regression and substituting in our response and predictor, we get a regression equation that looks like this: \[\widehat{StoppingDistance}= \hat{\beta_0}+\hat{\beta_1} Speed\] The lm() function with inputs (response ~ predictor) will give us the point estimates of our regression parameters \(\hat{\beta_0}\) and \(\hat{\beta_1}\). These represent the y-intercept and the slope of the line of best fit, respectively, given our observed data. Since we will want to call up the model later without typing all of this out again, we can save it as model1 using the assignment operator “<-”. We ask R to display the results by restating the name of the stored object on its own, like this:

model1 <- lm(dist ~ speed)
model1
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

The regression equation above is the key to interpreting these coefficient outputs. Let’s start with \(\hat{\beta_0}\): intuitively, this number would be our predicted stopping distance given that a car’s speed was 0. However, there are a number of reasons why this interpretation is not applicable to our model.

First of all, our response variable can never take on a negative value because it represents a distance. Additionally, it is obvious that if a car is already stationary (that is, its speed is 0) then its stopping distance is 0 by definition. Lastly, we do not observe any instances where the explanatory variable is zero or close enough to zero to accurately predict a response. Predicting outside the experimental region is called extrapolation, and we want to avoid it. For these reasons, we cannot assign a meaningful interpretation to our intercept point estimate.
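
To see just how far a speed of 0 lies outside our experimental region, we can quickly check the observed range of the predictor (a small aside using the attached variable):

range(speed)
## [1]  4 25

The slowest observed speed is 4 mph, so a prediction at 0 mph would indeed be extrapolation.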

We mentioned above that the value 3.932 represents the point estimate of the slope of the prediction line. Thus, we predict that a car’s stopping distance will increase by 3.932 feet for every additional mile per hour at which the car is travelling.

Thus, our final model looks like this: \[\widehat{StoppingDistance} = -17.579 + 3.932 * Speed\]
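
As an aside, we don’t have to retype the rounded coefficients by hand; R stores them inside the fitted model. A minimal sketch using the coef() function:

coef(model1)            # named vector of the intercept and slope estimates
coef(model1)["speed"]   # just the slope, roughly 3.932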

Point Prediction

With all of this information, we are now capable of producing and interpreting a prediction using our linear regression equation. All we have to do is plug in a value for the predictor “speed.” If we want to predict the stopping distance for a car that is going 8mph, we can just type out the equation with 8 as the value for “Speed”:

-17.579 + 3.932 * 8
## [1] 13.877

To summarize in words, we predict that it will take 13.877 feet for a car to stop when it’s travelling at a speed of 8mph. We can compare this prediction to our observed data, too. Notice that when we used the head() function above, we saw that row 5 of our data set contained the observed stopping distance of a car going 8mph. So we know that the stopping distance was 16 feet for a car that was travelling at 8mph, but our prediction from the linear regression model did not yield that value. Let’s explore that difference in the next section.
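
As a sanity check, we can also ask R to make the same prediction for us. This is a small sketch using the predict() function, which we will cover in more detail later:

predict(model1 , data.frame(speed = 8))   # roughly 13.88; the tiny difference from our
                                          # hand calculation comes from rounding the coefficients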

Testing Model Assumptions: Residuals and Error

To achieve a visual comparison of our prediction line to the observed data, we can plot it onto the scatter plot from before using the parameters of our regression equation. We copy our plot() function from earlier and then run the abline() function with inputs (intercept , slope) to superimpose the line onto the scatter plot.

plot(speed , dist , main = "Stopping Distances by Speed" , xlab = "Speed (mph)" , ylab = "Distance (ft)")
abline(-17.579 , 3.932)
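
As a side note, abline() also accepts the fitted model object directly, which saves us from typing the rounded coefficients. A small alternative sketch:

plot(speed , dist , main = "Stopping Distances by Speed" , xlab = "Speed (mph)" , ylab = "Distance (ft)")
abline(model1)   # draws the fitted line using the stored, unrounded coefficients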

Now it is abundantly clear that our observed data points do not fall on our prediction line. The differences between observations and our own predictions are called residuals and are denoted \(e_i=y_i-\hat{y_i}\) . Our prediction above yielded a stopping distance of 13.877 ft when the car was moving at 8 mph. The stopping distance that was actually observed for a car going 8mph was 16 ft. Then our residual value is:

resid8 <- 16 - 13.877
resid8
## [1] 2.123
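
We could also pull this residual straight from the fitted model. A brief sketch, using the fact (visible in the head() output above) that the 8 mph observation sits in row 5:

model1$residuals[5]   # about 2.12; the small difference from our hand calculation
                      # again comes from rounding the coefficients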

Under the assumptions of the model, error terms \(\epsilon_i\) are statistically independent samples of a normal distribution with mean \({\mu}=0\) and variance \({\sigma}^2\). \({\sigma}^2\) represents the variance of the population of stopping distance values at speed \(x\) .

Let’s test this assumption by creating a qq-normal plot. R will calculate all of our residuals with “model1$residuals” and we’ll store those values under the name “distresid”. The result will reveal how closely our residuals follow a normal distribution. The qqnorm() function with input (distresid) will create the plot for us, and if the fit is close, the points will all lie nearly on the line plotted by the qqline() function with the same input.

distresid <- model1$residuals
qqnorm(distresid)
qqline(distresid)

Here we see that the points fall along a straight line for the most part, so we can take the normality assumption to be reasonably satisfied.

Another assumption that is easy to test in R is the constant variance assumption. Our error terms should have a constant variance at all speeds. To test this, we can compare the residuals to our observed speeds by creating a scatter plot and examining the spread of the residuals around 0 at each speed. Here we’re looking for any pattern in the spread of the residuals across the range of observed speeds. If no pattern forms, we would say our data is homoscedastic, which means that the constant variance assumption is met.

plot(distresid ~ speed, ylab = "Residuals", xlab = "Speed (mph)", main = "Residual Plot")
abline(0,0)

This plot shows a slight trend of increased variance as the speed increases. We would call this phenomenon heteroscedasticity and say that the constant variance assumption is not met. There are a few reasons why this data in particular might not meet the constant variance assumption. Depending on the quality of the tread on the tires or the road conditions during observation, the variance in stopping distance could differ significantly at higher speeds compared to slower ones. For the sake of continuity, we’ll move forward with the example and keep this information in the back of our minds as we explore other parts of our model.

Correlation and Significance

Mean Square Error and Correlation

We can make a point estimate of the error term variance \({\sigma}^2\); it is called the Mean Square Error and is denoted \(s^2\), and its square root \(s\) is called the residual standard error. Calculating these by hand takes time and effort, so it would be very convenient to have a shortcut. Instead of taking the sum of all the squared residuals and dividing by 2 less than our number of observations (accounting for the two parameters that we estimated), it’s much easier to just pull the value out of the R summary of our model.
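
For completeness, here is a sketch of that long-hand calculation, which we can check against the summary output below:

sse <- sum(model1$residuals^2)    # sum of squared residuals
mse <- sse / (length(dist) - 2)   # divide by n - 2 to account for the two estimated parameters
sqrt(mse)                         # the residual standard error s; should match the 15.38 reported below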

summary(model1)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

As you can see, we ended up with a lot more information than we asked for, but the residual standard error \(s\) (the square root of the Mean Square Error) pops up below the “Signif. codes” section as “Residual standard error: 15.38”. While all of these fresh statistics are here, let’s take a moment to dissect each one of them on this summary.

A common indicator of the utility of the model is \(R^2\), which is called the coefficient of determination. We can find it listed in the summary as “Multiple R-squared”. It measures the proportion of the variation in stopping distances that is explained by the linear relationship with speed. It follows that we desire an \(R^2\) value as close to 1 as possible. Our \(R^2\) value of 0.6511 tells us that speed is a reasonably good predictor of stopping distance, since about 65% of the variation in stopping distance is explained by its linear relationship with speed.

Another simple indicator of utility is the correlation (denoted \(r\)), which in a simple linear regression is the square root of \(R^2\) carrying the sign of the slope. It can be calculated in R using the cor() function and it takes on values between -1 and 1. The inputs of the cor() function are (predictor , response), although the order does not matter since correlation is symmetric. In our case, it represents how strong the positive correlation is between speed and stopping distance.

cor(speed , dist)
## [1] 0.8068949

This correlation is quite high and tells us that speed and stopping distance have a relatively strong tendency to move together linearly with a positive slope.
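
As a quick consistency check, the correlation really is the square root of the R-squared reported in the summary. A one-line sketch:

sqrt(summary(model1)$r.squared)   # roughly 0.807, matching cor(speed , dist)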

The t-test

The “Coefficients” section explores the significance of each of the estimated parameters in our linear regression model. If the parameters are not significant, the model does not provide us with a useful description of the linear relationship between the predictor and the response. For example, a hypothesis test can determine whether the speed of the car is a significant predictor of the stopping distance or not. In a t-test, we construct a test statistic or t-value \(t\). If the p-value, which here is twice the area under the t-distribution curve with 48 degrees of freedom beyond \(|t|\), is less than a chosen significance level \({\alpha}\), we can reject the null hypothesis \({\beta_1} = 0\) in favor of the alternative that \({\beta_1}\neq0\) at the \({\alpha}\) significance level.

The t-value for each parameter is listed under the “t value” column of the summary, and immediately to the right of them, under “Pr(>|t|)”, are the p-values described above. The important thing to take away from these values is that the p-value for our predictor variable, speed, is very low. This tells us that we can reject the null hypothesis with a very high degree of confidence. Our conclusion is that there is strong evidence that speed has a significant linear relationship with stopping distance.

Take note that the p-value for the y-intercept is quite a bit larger. Why might this be? We discussed earlier the logical reasons why our y-intercept parameter has no meaningful interpretation since the stopping distance should be 0 if the speed of the car is 0. Depending on the circumstances surrounding our analysis, we might consider dropping the y-intercept since we fail to reject the null hypothesis at an \({\alpha}\) = 0.01 significance level.
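
If we did decide to drop the intercept, R can fit a through-the-origin model by appending - 1 to the formula. This is shown purely for illustration (model_noint is a name we are making up here); we keep the intercept for the rest of this walkthrough:

model_noint <- lm(dist ~ speed - 1)   # forces the fitted line through the origin
model_noint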

The F-test

Furthermore, an F-test performs a very similar analysis to the t-test, except it considers the significance of the model as a whole in explaining the variation in the response. In a simple linear regression like the one we’ve presented, the p-values and null hypothesis rejection decisions are equivalent. Thus, our conclusion for the F-test is the same as our conclusion from the t-test in this case: the speed of the car has a significant linear relationship with the car’s stopping distance.
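
We can verify this equivalence directly: in a simple linear regression, the F-statistic is the square of the slope’s t-value. A small check:

anova(model1)   # ANOVA table reporting the same F-statistic and p-value as the summary
9.464^2         # roughly 89.57, matching the F-statistic above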

Confidence and Prediction Intervals

Confidence Interval for a Model Parameter

Now that we’re sure that our model makes useful predictions, we can create meaningful confidence intervals for the regression parameters. The confint() function will yield the limits of a 95% confidence interval by default for the model that we put into it.

confint(model1)
##                  2.5 %    97.5 %
## (Intercept) -31.167850 -3.990340
## speed         3.096964  4.767853

Our main interval of interest is the one for the slope parameter \({\beta_1}\) corresponding to the speed variable. This output tells us that we can be 95% confident that the coefficient describing the relationship between speed and stopping distance takes on a value between 3.096964 and 4.767853. We will ignore the y-intercept interval since the value itself has no meaningful interpretation.

We can change the confidence level using the level argument, expressed as a proportion (for example, level = 0.9 for a 90% interval):

confint(model1 , level = 0.9)
##                    5 %      95 %
## (Intercept) -28.914514 -6.243676
## speed         3.235501  4.629317

Notice that this 90% interval is narrower than the 95% interval above: an interval with a higher degree of confidence must be wider. This is because we can be more certain that the true \({\beta_1}\) lies within a larger interval. Conversely, we cannot be as confident that \({\beta_1}\) falls within a very small range of values without more information.

Confidence Interval for the Mean Response

Our next objective is to get an idea of the mean stopping distance given a specified speed. Let’s say that we want to know the mean stopping distance given a speed of 10mph. The predict() function will produce the limits of a (1-\({\alpha}\))% confidence interval, but we first need to specify at which speed we’re looking to estimate the mean stopping distance. To do this, we use the data.frame() function and input the name of the variable and the value we specify (speed = 10).

con10 <- data.frame(speed = 10)
predict(model1 , con10 , interval = "confidence")
##        fit      lwr      upr
## 1 21.74499 15.46192 28.02807

Notice that the inputs of the predict() function are in the order (model, specified point, type of interval). This set of code produced the bounds of the 95% confidence interval for the mean stopping distance for a car going 10mph under the titles “lwr” and “upr”, respectively. The main takeaway is that we are 95% confident that the mean stopping distance for a car going 10mph is between 15.46192 ft and 28.02807 ft.

Prediction Interval

If we want to make an interval that describes a point estimate of an individual response value at a specified explanatory value, we can use the same two functions, but adjust the input of the predict() function to read interval = "prediction".

con10 <- data.frame(speed = 10)
predict(model1 , con10 , interval = "prediction")
##        fit       lwr      upr
## 1 21.74499 -9.809601 53.29959

This means that we can predict with 95% confidence that the stopping distance of a car that is going 10mph will be between -9.809601 ft and 53.29959 ft. We don’t need a calculator or computer to see that this prediction interval is quite a bit wider than the confidence interval. This difference is due to the fact that we have shifted our target from an average to a point prediction. Instead of looking for the mean of the population of responses, we’re estimating the individual value of the response variable at the specified predictor value (i.e. an individual stopping distance at the specified speed).
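
For reference, the widths of the two intervals come from their standard error formulas; the prediction interval carries an extra 1 under the square root, which accounts for the variability of an individual observation around the mean response: \[\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\sqrt{\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}} \quad \text{(mean response)}\] \[\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\sqrt{1+\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}} \quad \text{(individual prediction)}\]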

We need to make note that the lower limit of this prediction interval is negative, which calls into question whether this prediction interval offers us any useful information. We might have to lower our confidence level so that the interval narrows to values that make sense, since a stopping distance can never be negative.
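
If we do want a narrower interval, predict() accepts a level argument just like confint() does. A small sketch at 80% confidence:

predict(model1 , con10 , interval = "prediction" , level = 0.8)   # produces a narrower interval than the default 95%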

Conclusion

This concludes our overview of the simple linear regression model! I think we covered all of the most important topics and their applications. Thanks for following along!