Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
library(ggplot2)
cars_data <- cars
cars_model <- lm(dist ~ speed, data = cars_data)
cars_model
##
## Call:
## lm(formula = dist ~ speed, data = cars_data)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932
If \(x\) represents the speed of the car and \(y\) represents the stopping distance, then based on the regression analysis above we can represent stopping distance as a function of speed with the equation \(y = 3.932x - 17.579\) or, equivalently, \(y = -17.579 + 3.932x\).
Since \(17.579/3.932 \approx 4.47\), the fitted line crosses \(y = 0\) at about 4.5 miles per hour. Taken literally, the model predicts that a car traveling at that speed would stop instantaneously, which is a consequence of the negative intercept rather than a physically meaningful result. The slope means that each one mile per hour increase in speed is associated with an average increase in stopping distance of about 3.9 feet.
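The arithmetic above can be reproduced directly from the fitted coefficients. The following is a minimal sketch; the names `b0`, `b1`, `zero_dist_speed`, and the two illustrative speeds in `new_speeds` are assumptions introduced here and are not part of the original analysis.

```r
# Refit the model so this snippet runs on its own
cars_model <- lm(dist ~ speed, data = cars)

# Extract the intercept and slope as plain numbers
b0 <- unname(coef(cars_model)["(Intercept)"])
b1 <- unname(coef(cars_model)["speed"])

# Speed at which the fitted line crosses dist = 0
zero_dist_speed <- -b0 / b1

# Predicted stopping distances at two illustrative speeds (hypothetical values)
new_speeds <- data.frame(speed = c(10, 20))
predicted <- unname(predict(cars_model, newdata = new_speeds))

zero_dist_speed
predicted
```

Because the model is linear, the two predictions differ by exactly ten times the slope.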
The scatterplot for the cars data is shown below.
ggplot(cars_data, aes(x = speed, y = dist)) + geom_point() + labs(title = "Car Speed vs. Stopping Distance", x = "Speed (Miles/Hour)", y = "Stopping Distance (Feet)")
The same scatterplot with the linear model overlaid is shown below.
ggplot(cars_data, aes(x = speed, y = dist)) + geom_point() +
  labs(title = "Car Speed vs. Stopping Distance", x = "Speed (Miles/Hour)", y = "Stopping Distance (Feet)") +
  # slope and intercept are constants, so they are set outside aes()
  geom_abline(slope = coef(cars_model)[2], intercept = coef(cars_model)[1])
summary(cars_model)
##
## Call:
## lm(formula = dist ~ speed, data = cars_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The median residual (-2.272) is close to 0, and the first and third quartiles (-9.525 and 9.215) are nearly equal in magnitude, both of which are desirable for a good linear model. The minimum and maximum residuals (-29.069 and 43.201) are less symmetric: the largest positive residual is about 14 feet greater in magnitude than the largest negative one.
The t value for the slope is 9.464, so the slope estimate is nearly ten times its standard error, which is a good sign for the quality of our model. The t value for the y-intercept is much smaller in magnitude (-2.601), indicating more uncertainty in the y-intercept estimate.
The p-value for the slope is extremely small, approximately \(1.5 \times 10^{-12}\). This is strong evidence of a linear relationship between car speed and stopping distance. The p-value for the y-intercept (0.0123) is below 0.05, which is evidence that the true y-intercept is not 0.
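The t values and p-values in the summary can be recomputed by hand, which makes their meaning more concrete. This is a sketch under the usual assumptions of the two-sided t test; the variable names (`coef_table`, `t_manual`, `p_manual`, `conf_int`) are introduced here for illustration, and the 95% confidence level is an assumed choice.

```r
# Refit the model so this snippet runs on its own
cars_model <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(cars_model))

# t value = estimate divided by its standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
df_resid <- df.residual(cars_model)
p_manual <- 2 * pt(abs(t_manual), df = df_resid, lower.tail = FALSE)

# 95% confidence intervals for the intercept and slope
conf_int <- confint(cars_model, level = 0.95)

t_manual
p_manual
conf_int
```

The confidence intervals express the same uncertainty as the t values: a wide interval relative to the estimate corresponds to a small t value.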
The multiple R-squared value of 0.6511 means that 65.11% of the variability in stopping distance is explained by the variation in car speed.
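As a check on this interpretation, R-squared can be computed from the residual and total sums of squares. The sketch below uses names (`ss_res`, `ss_tot`, `r_squared`, `r_cor`) chosen here for illustration; for a simple linear regression the result should also equal the squared correlation between speed and distance.

```r
# Refit the model so this snippet runs on its own
cars_model <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the model
ss_res <- sum(residuals(cars_model)^2)

# Total variation in stopping distance around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# Proportion of variation explained by the model
r_squared <- 1 - ss_res / ss_tot

# With a single predictor, this matches the squared correlation
r_cor <- cor(cars$speed, cars$dist)^2

r_squared
r_cor
```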
The residual plot for this linear model is shown below.
ggplot(data = cars_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals")
We see from the plot that the spread of the residuals increases as the fitted values, and therefore the speeds, increase, which is not ideal for our model. Ideally, the residuals would have constant variance across the whole range of fitted values.
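The visual impression of non-constant variance can be supplemented with a numerical check. The sketch below is not part of the original analysis: it hand-computes the studentized (Koenker) form of the Breusch-Pagan statistic using only base R, by regressing the squared residuals on speed. The names `aux_data`, `aux_model`, `bp_stat`, and `bp_p` are assumptions made for this example.

```r
# Refit the model so this snippet runs on its own
cars_model <- lm(dist ~ speed, data = cars)

# Auxiliary data: squared residuals alongside the predictor
aux_data <- data.frame(
  sq_resid = residuals(cars_model)^2,
  speed = cars$speed
)

# Auxiliary regression: do the squared residuals depend on speed?
aux_model <- lm(sq_resid ~ speed, data = aux_data)

# Test statistic: n times the R-squared of the auxiliary regression,
# compared against a chi-squared distribution with 1 degree of freedom
bp_stat <- nrow(aux_data) * summary(aux_model)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

bp_stat
bp_p
```

A small p-value would support the conclusion that the residual variance changes with speed; a larger one would mean the pattern in the plot is suggestive but not statistically conclusive at this sample size.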
We can further investigate the residuals with a Q-Q plot, shown below.
ggplot(data = cars_model, aes(sample = .resid)) + stat_qq() + stat_qq_line() + labs(title = "Q-Q Residual Plot", x = "Theoretical Quantiles", y = "Sample Quantiles")
We see from our Q-Q plot that both ends of the residual distribution diverge from the straight line, which suggests that the residuals are not normally distributed. The right tail is heavier than expected and the left tail is lighter than expected, implying a right-skewed distribution.
Although they add little to the analysis above, all four default diagnostic plots are shown below.
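The reading of the Q-Q plot can be cross-checked numerically. As a sketch that goes beyond the original analysis, the code below applies the Shapiro-Wilk normality test from base R to the residuals and computes a simple moment-based sample skewness; the names `res`, `sw`, and `skew` are introduced here for illustration.

```r
# Refit the model so this snippet runs on its own
cars_model <- lm(dist ~ speed, data = cars)
res <- residuals(cars_model)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)

# Moment-based sample skewness: positive values indicate a longer right tail
centered <- res - mean(res)
skew <- mean(centered^3) / mean(centered^2)^1.5

sw$statistic
sw$p.value
skew
```

A small Shapiro-Wilk p-value together with a positive skewness would agree with the right-skewed shape described above.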
par(mfrow = c(2,2))
plot(cars_model)
In conclusion, the linear model we developed is a decent fit for the data, as indicated by the moderate R-squared of 0.6511 and the very low slope p-value of \(1.49 \times 10^{-12}\). However, it is not an ideal fit for the data as the residuals do not display constant variance and the Q-Q plot of the residuals does not form a straight line.