Simple linear regression is a way to estimate a linear relationship between two variables. The independent variable is called the predictor, while the dependent variable is called the response. The equation for a simple linear regression model is: \[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i\] where \(\hat{\beta_0}\) is the intercept and \(\hat{\beta_1}\) is the slope of the fitted line. We will use the “cars” data set to conduct our analysis. First, let’s look at the data set as a whole to see what it is telling us, and attach it for easy access later on.
data(cars)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
attach(cars)
This provides data on the relationship between the speed of a car (mph) and the distance it takes to stop (ft). One detail worth noting is that this data was recorded in the 1920s.
For our model we will use speed as the predictor variable and stopping distance as the response.
One simple way to check whether two variables are linearly related is to compute their correlation in R. A correlation close to 1 (or -1) indicates a strong positive (or negative) linear relationship, while a value near 0 indicates little linear relationship.
cor(dist,speed)
## [1] 0.8068949
A correlation of about 0.81 is not extremely close to 1, but it does indicate a fairly strong positive linear relationship, so we will move forward and see how accurate a linear regression model we can create.
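If we want a formal significance test rather than just the raw correlation value, R’s built-in cor.test function provides one; a minimal sketch:
# Test H0: true correlation = 0 against the two-sided alternative
cor.test(dist, speed)
In simple linear regression, the t statistic and p-value from this test coincide with those of the slope coefficient we will see later in the model summary.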
Now we examine a scatter plot of the recorded data to visualize whether a linear model is a viable option. Notice that in the formula, the response comes first and the predictor second (dist~speed). It is also good practice to supply your own axis titles.
plot(dist~speed, ylab="Stopping Distance (ft)", xlab = "Car Speed (mph)", main= "Relationship Between Car Speed and Stopping Distance")
From this scatter plot we can see a positive trend, which makes intuitive sense: a car that is going faster will need a greater distance to stop. So we expect our linear model to have a positive slope, and we would assume that the intercept will be near zero.
Next we create our model. All we need to do is tell R to make a linear model with the speed of the car as the predictor and the stopping distance as the response.
mymod<- lm(dist~speed)
Now call up the linear model by name. In this case it is named mymod.
mymod
##
## Call:
## lm(formula = dist ~ speed)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
From this we can see that the slope of our linear model is 3.932, meaning that each additional mph of speed adds nearly 4 more feet of stopping distance. The intercept is -17.579, but we cannot read too much into it: taken literally, it says a car moving at 0 mph takes -17.579 ft to stop. The intercept does not make physical sense here because we have no data points at or near 0 mph, so the model is only reliable within the range of observed speeds. Although the intercept is a necessary part of the model, it does not need to make numerical sense on its own. The fitted model is: \[\hat{y_i} = -17.579 + 3.932 x_i\]
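Rather than retyping these numbers later, we can pull the estimates directly from the fitted model; a small sketch:
b <- coef(mymod)    # named vector of the fitted coefficients
b["(Intercept)"]    # -17.579
b["speed"]          # 3.932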
We will now add our newly found regression line to the scatter plot we created earlier to visualize whether or not the model is an accurate estimate. We will use the same scatter plot, but with the addition of the “abline” function that adds our regression line.
plot(dist~speed, ylab="Stopping Distance (ft)", xlab = "Car Speed (mph)", main= "Relationship Between Car Speed and Stopping Distance")
abline(-17.579,3.932)
Comparing the regression line to the data points, it appears to estimate the stopping distance reasonably well across all of the recorded speeds.
Now let’s use our new regression model to see how accurate it is. We will compare the output of our model to the actual value of a data point. At random, let’s choose the 15th observation and look up the values of that observation.
cars[15,"speed"]
## [1] 12
cars[15,"dist"]
## [1] 28
Here we see that the car in this specific observation was moving at 12 mph and took 28 feet to stop. Using our linear regression model and the speed value of the observation (the predictor variable \(x_i\)), we will see what response we get. We use the coef function, multiplying the first coefficient (the intercept \(\hat{\beta_0}\)) by 1 and the second coefficient (the slope \(\hat{\beta_1}\)) by our given predictor value of 12.
coef(mymod)%*%c(1, 12)
## [,1]
## [1,] 29.60981
We see that our model estimated the stopping distance to be about 29.6 feet compared to the actual 28 feet. The estimate is off by about 1.6 feet, which is still a fairly small margin of error.
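Equivalently, R’s predict function handles this bookkeeping for us; a minimal sketch using the same model:
# Fitted stopping distance at speed = 12 mph
predict(mymod, newdata = data.frame(speed = 12))   # about 29.60981, matching the value above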
Next we will look at the residuals of our model, which are the differences between the actual observed values and our predicted values. This is an example of the residual of the 15th observation.
cars[15,"dist"]-coef(mymod)%*%c(1,cars[15,"speed"])
## [,1]
## [1,] -1.60981
This shows that the model overestimates the stopping distance of this observation by about 1.6 ft.
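Note that R stores every residual in the fitted model, so we could also read this value off directly:
residuals(mymod)[15]   # -1.60981, the residual of the 15th observation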
Now we look at all of the residuals over the whole data set. One of the assumptions of simple linear regression is that the residuals are normally distributed. To check whether this model satisfies that assumption, we compare the quantiles of the residuals against the quantiles of a normal distribution using a Q-Q plot. The residuals should follow the line.
resids<- mymod$residuals
qqnorm(resids)
qqline(resids)
For the most part, the residuals seem to follow the line throughout the quantiles, but it should be noted that they veer from the line in the highest quantiles.
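As a supplementary numeric check (not part of the original output), the Shapiro-Wilk test can be run on the residuals; a small p-value would suggest a departure from normality:
shapiro.test(resids)   # tests H0: the residuals come from a normal distribution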
Another assumption is that the variance of the residuals is constant. We can check this by plotting the residuals against the predictor values.
plot(resids~speed,xlab="Speed (mph)",ylab="Residuals",main="Variance of Residuals Compared to Predictor")
abline(0,0)
The variance seems constant at the lower speeds but appears to increase somewhat at higher speeds, so the model may be less reliable as speed increases. There are, however, no dramatic signs of heteroscedasticity.
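R can also produce standard diagnostic plots directly from the fitted model if we want to double-check this visually; a sketch:
plot(mymod, which = 1)   # residuals vs fitted values
plot(mymod, which = 3)   # scale-location plot, useful for spotting non-constant variance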
We will use the summary function in R to obtain a number of useful values, and we will then discuss each section separately.
We will test the accuracy of our model by looking at the mean squared error (MSE), the sum of the squared residuals divided by the degrees of freedom. Since each residual is squared, estimates that are far off hurt the overall accuracy of the model much more than estimates that are only slightly off. We could compute the MSE by hand, but the summary function reports its square root, the residual standard error, for us.
summary(mymod)
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Here we can see that the residual standard error, which is the square root of the MSE, is 15.38. This means a typical prediction misses the observed stopping distance by roughly 15 feet, which is fairly large relative to the stopping distances in the data, so the model’s predictions are not especially precise. One hypothesis for this large error is the inaccuracy of the model at greater speeds; a few large misses alone can inflate this value. We also note that there are 48 degrees of freedom because we have 50 observations and estimate two parameters: 50 - 2 = 48.
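To verify the relationship between the MSE and the residual standard error, we can compute both by hand; a sketch:
mse <- sum(resids^2) / (50 - 2)   # sum of squared residuals over the 48 degrees of freedom
sqrt(mse)                         # about 15.38, the residual standard error reported above
sigma(mymod)                      # the same value, extracted directly from the model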
We can also use the information within the summary function to complete our hypothesis test. First we state the null hypothesis: there is no linear relationship between the speed of a car and the distance it takes to stop (\(\beta_1 = 0\)). The alternative hypothesis is that there is a linear relationship (\(\beta_1 \neq 0\)). In the speed row we can see that our test statistic is 9.464 with 48 degrees of freedom (50 observations minus 2 estimated parameters). The last column gives a p-value of 1.49e-12. With such a small p-value we can safely reject the null hypothesis in favor of the alternative: we believe there is a linear relationship between the two variables.
Another way to test the linear relationship between the two variables is the F-test. This test uses the F distribution to compare the variation explained by our model with the residual variation. As in the original hypothesis test, the null hypothesis is that there is no linear relationship between the two variables and the alternative is that there is one, and we need a small p-value to reject the null in favor of the alternative. At the bottom of the summary output we see that our F-statistic is 89.57 with 1 and 48 degrees of freedom, with a p-value of 1.49e-12. This is small enough to reject the null hypothesis and conclude that there is a linear relationship between a car’s speed and its stopping distance. Notice that the F-test p-value is identical to that of our original t-test. This is expected in simple linear regression, where the F-statistic is the square of the slope’s t-statistic: 9.464^2 is approximately 89.57.
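We can confirm this relationship between the two statistics directly in R; a small sketch:
tval <- coef(summary(mymod))["speed", "t value"]   # t statistic of the slope
tval^2                                             # about 89.57, matching the F-statistic
summary(mymod)$fstatistic                          # F value with its degrees of freedom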
The simple coefficient of determination, \(R^2\), is the proportion of the total variation in the 50 observed values of the dependent variable (stopping distance) that is explained by our simple linear regression model. Values range from 0 to 1 (0% to 100%), and we want to be as close to 1 as possible, since that would mean our model explains almost all of the total variance. We see that our R-squared value is 0.6511 and the adjusted R-squared is 0.6438. The difference between the two is that the latter is adjusted for the number of predictor variables; since we only have one predictor, we don’t expect the difference to be large. At around 65%, the model explains a majority of the variance, but there is room for improvement.
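In simple linear regression, \(R^2\) is also exactly the square of the correlation coefficient we computed at the start, which we can verify; a sketch:
cor(dist, speed)^2   # about 0.6511, matching the Multiple R-squared above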
Now we will look at how to create a confidence interval using our linear regression model of the cars data set. Let’s create a 90% confidence interval of the linear relationship between a car’s speed (the predictor) and the car’s stopping distance (the response).
confint(mymod,level=0.9)
## 5 % 95 %
## (Intercept) -28.914514 -6.243676
## speed 3.235501 4.629317
Here, instead of getting a single value for the intercept and slope, we get a range (the confidence interval). With this information, we can say with 90% confidence that the true slope lies between about 3.24 and 4.63 additional feet of stopping distance per mph, and that the true intercept lies between about -28.91 and -6.24.
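To see where these bounds come from, the slope interval can be rebuilt by hand as the estimate plus or minus a t quantile times the standard error; a sketch:
est <- coef(summary(mymod))["speed", "Estimate"]     # 3.9324
se  <- coef(summary(mymod))["speed", "Std. Error"]   # 0.4155
est + c(-1, 1) * qt(0.95, df = 48) * se              # about 3.2355 and 4.6293, matching confint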
To compare the width of a prediction interval (PI) and a confidence interval (CI), we first have to choose a set data point within the cars data set. We will set the predictor value at 9 mph because we know it is within our data set. Then we look at the prediction interval.
newdata1<-data.frame(speed=9)
(predi <- predict(mymod, newdata1, interval="predict") )
## fit lwr upr
## 1 17.81258 -13.87225 49.49741
The prediction interval provides a very wide window. We will now compare it to the confidence interval.
(confi<- predict(mymod,newdata1, interval = "confidence"))
## fit lwr upr
## 1 17.81258 10.90512 24.72005
Here we have a much smaller interval. This is to be expected: the prediction interval must cover a single new observation at 9 mph, so it includes the scatter of individual cars around the regression line, while the confidence interval only needs to cover the mean stopping distance at 9 mph. We can also find the exact widths of these intervals for easier comparison.
confi %*% c(0, -1, 1)
## [,1]
## 1 13.81493
predi %*% c(0, -1, 1)
## [,1]
## 1 63.36966
Now you can easily see the vastly different widths of the two intervals. Because the prediction interval must also account for the variability of an individual observation, it will always be wider than the confidence interval.
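As an optional extension (not in the original analysis), we can plot both intervals across the full range of speeds to see the prediction band sit outside the confidence band everywhere; a sketch:
grid <- data.frame(speed = seq(min(speed), max(speed), length.out = 100))
conf_band <- predict(mymod, grid, interval = "confidence")
pred_band <- predict(mymod, grid, interval = "predict")
plot(dist ~ speed, xlab = "Car Speed (mph)", ylab = "Stopping Distance (ft)")
abline(mymod)                                                             # fitted regression line
matlines(grid$speed, conf_band[, c("lwr", "upr")], lty = 2, col = "blue") # confidence band
matlines(grid$speed, pred_band[, c("lwr", "upr")], lty = 3, col = "red")  # prediction band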
One last step is to verify that the two intervals are centered at the same value. This can be done with a simple test.
predi[1]==confi[1]
## [1] TRUE
With the TRUE output we know the intervals are centered at the same point: the fitted value from our regression line at 9 mph.
After analyzing the relationship between a car’s speed and its stopping distance, it is safe to say that a linear relationship is present. Although our model does not predict stopping distance with great precision, our hypothesis tests provide strong evidence of a linear relationship between the two variables.