set.seed(1234)
library(tidyverse)
library(lmtest)
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook Chapter 3 (visualization, quality evaluation of the model, and residual analysis).
The cell below imports the data and prints a summary of the data it contains.
data(cars)
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
As we can see in the output above, the cars dataset
contains two variables: speed and dist. More
specifically, the dataset records the speeds of cars (in mph) and the
distances (in ft) they required to stop.
Using these two fields, we can build a linear model that fits the data to a line using the following equation:
\[ \hat{y} = a_0 + a_1 x \]
In this case, \(\hat{y}\) (the
dependent variable, or the values predicted by the model) will be
distance, \(x\) (the
independent variable of the model) will be speed, and \(a_0\) and \(a_1\) will be the fitted values of the
\(y\)-intercept and slope,
respectively. The model is produced in the code cell below and defined
as lin_reg:
lin_reg <- lm(dist ~ speed, data=cars)
lin_reg
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
Before we evaluate the performance of the model, we must check that the assumptions of linear regression hold; if they do not, our model could be unreliable or even misleading. The four assumptions checked below are the following:

1. Linearity: the relationship between speed and dist is linear.
2. Normality: the residuals of the model are normally distributed.
3. Homoscedasticity: the variance of the residuals is constant across fitted values.
4. Independence: the residuals are uncorrelated with the fitted values.
To check whether speed and distance
have a linear relationship, we can create a scatter plot overlaid with
the line produced by the fitted values of \(a_0\) and \(a_1\) from the model:
ggplot(data=cars, aes(x=speed, y=dist)) +
geom_point() +
geom_abline(slope=lin_reg$coefficients[2],
intercept=lin_reg$coefficients[1],
color='red')
Based on this visual analysis, the line provides a reasonable fit to the data, and the two variables do appear to exhibit a linear relationship.
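As a complementary check (a sketch not included in the original analysis), ggplot2 can fit and overlay the same least-squares line directly via geom_smooth(method = "lm"), and the Pearson correlation gives a single numeric summary of the strength of the linear association:
# Equivalent plot: let ggplot2 fit and draw the least-squares line itself
ggplot(data=cars, aes(x=speed, y=dist)) +
geom_point() +
geom_smooth(method="lm", se=FALSE, color='red')
# Pearson correlation between speed and dist (values near 1 indicate a strong positive linear association)
cor(cars$speed, cars$dist)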
To check if the residuals are normally distributed, we can make a histogram:
ggplot(lin_reg, aes(x = .resid)) +
geom_histogram(binwidth = 5)
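Another common visual check, offered here as a supplementary sketch rather than as part of the original analysis, is a normal Q-Q plot of the residuals; points lying close to the reference line indicate approximate normality:
# Normal Q-Q plot of the residuals (points near the reference line suggest normality)
ggplot(lin_reg, aes(sample = .resid)) +
stat_qq() +
stat_qq_line()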
While visual analysis suggests that the distribution of the residuals looks roughly normal, we can check this statistically via a Shapiro-Wilk test. The alternative hypothesis of this test is that the data do not follow a normal distribution, and thus the desired result is a \(p\)-value > 0.05.
shapiro.test(lin_reg$residuals)
##
## Shapiro-Wilk normality test
##
## data: lin_reg$residuals
## W = 0.94509, p-value = 0.02152
Here, however, the \(p\)-value (0.02152) is below 0.05, so the Shapiro-Wilk test rejects normality at the 5% level. The residuals therefore show a mild departure from normality, and this assumption is only approximately satisfied.
To check if the homoscedasticity condition is met, we can create a plot of the residuals as a function of the fitted values:
ggplot(lin_reg, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
While the spread of the residuals looks roughly similar across all fitted values, there does appear to be a wider spread towards the right side of the plot. Since the visual analysis does not provide full confidence that this assumption is met, we can test it statistically via a Breusch-Pagan test. The alternative hypothesis of this test is that the variance of the residuals differs for different fitted values (i.e. the data are not homoscedastic), and thus, once again, the desired result of the test is a \(p\)-value > 0.05:
bptest(lin_reg)
##
## studentized Breusch-Pagan test
##
## data: lin_reg
## BP = 3.2149, df = 1, p-value = 0.07297
With a \(p\)-value of 0.07297 > 0.05, we fail to reject the null hypothesis of constant variance, so the homoscedasticity assumption appears to hold.
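As an additional diagnostic (a sketch, not part of the original analysis), a scale-location plot of the square root of the absolute standardized residuals against the fitted values makes changes in spread easier to see; a roughly flat trend in the points supports homoscedasticity:
# Scale-location plot: a flat trend in spread supports constant variance
ggplot(lin_reg, aes(x = .fitted, y = sqrt(abs(.stdresid)))) +
geom_point() +
geom_smooth(se = FALSE)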
Based on the residual plot above, we can also confirm that there is no strong correlation between the residuals and the fitted values. This is further evidenced by the fact that the correlation coefficient between the two is essentially zero (for a least-squares fit this correlation is zero by construction, up to floating-point error):
cor(lin_reg$residuals, lin_reg$fitted.values)
## [1] 4.628907e-17
As such, we confirm that the final assumption of linear regression is met.
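Independence of the errors can also be probed with a Durbin-Watson test for serial correlation in the residuals, available in the already-loaded lmtest package. This is a supplementary check rather than part of the original analysis; its null hypothesis is that the residuals are not autocorrelated, so a large \(p\)-value is the desired outcome:
# Durbin-Watson test for first-order autocorrelation in the residuals
dwtest(lin_reg)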
Having checked the assumptions of linear regression (with only a mild concern about residual normality), we can now evaluate the performance of the model:
summary(lin_reg)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
From the above output we see that speed is indeed a
statistically significant predictor of distance, and the
model achieves a reasonably high \(R^2\) value. More specifically,
speed accounts for roughly 65% of the variance observed in
distance (multiple \(R^2\) = 0.6511; adjusted \(R^2\) = 0.6438). For a
model with only a single predictor, this is quite a good fit.
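As a supplementary look at the precision of these estimates (not part of the original analysis), confidence intervals for the intercept and slope can be obtained with confint():
# 95% confidence intervals for the model coefficients
confint(lin_reg, level = 0.95)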
Given that our model appears broadly reliable, we
can create a function that uses the outputs of the model to predict the
stopping distance of a car given any speed. The cell below creates a
function predict_stopping_distance and uses it to predict the
stopping distance of a car travelling 100 mph (note that this is well
outside the 4-25 mph range observed in the data, so the result is an
extrapolation and should be treated with caution):
predict_stopping_distance <- function(car_speed){
# Apply the fitted line: predicted distance = intercept + slope * speed
return(unname(lin_reg$coefficients[2] * car_speed + lin_reg$coefficients[1]))
}
predict_stopping_distance(100)
## [1] 375.6618
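An equivalent and arguably more idiomatic approach (shown here as a sketch, not a replacement for the function above) is R's built-in predict(), which can also return a prediction interval around the point estimate:
# Built-in prediction with a 95% prediction interval for a car travelling 100 mph
predict(lin_reg, newdata = data.frame(speed = 100), interval = "prediction")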