Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
To begin, I loaded the cars dataset and plotted the response (stopping distance) against the predictor (speed). The scatterplot shows a fairly strong, positive relationship between speed and stopping distance: as speed increases, stopping distance also increases. This suggests that a linear regression is a reasonable starting model.
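library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()
data("cars")
# Scatterplot of stopping distance vs. speed with the least-squares line overlaid
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  geom_smooth(method='lm')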
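To put a number on the "fairly strong, positive" relationship seen in the plot, the Pearson correlation can be computed directly. This is a minimal sketch using only base R; the variable name `r` is my own choice.

```r
# Pearson correlation between speed and stopping distance
data("cars")
r <- cor(cars$speed, cars$dist)
r

# For a simple linear regression, r^2 equals the model's Multiple R-squared
r^2
```

A value of `r` close to 1 would support the visual impression of a strong positive linear association.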
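I fit a simple linear regression to the data; the summary and resulting equation are displayed below. The summary indicates that speed is a highly significant predictor of stopping distance (p = 1.49e-12), that each one-unit increase in speed is associated with an increase of about 3.93 units in stopping distance, and that the model explains roughly 65% of the variation in stopping distance (Multiple R-squared = 0.6511).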
cars.lm <- lm(dist ~ speed, data=cars)
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
\[\widehat{distance}=speed \times 3.93 - 17.58\]
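As a quick check on the fitted equation, the coefficients can be pulled from the model object and used for prediction. The sketch below refits the same model; the speeds in `new_speeds` are hypothetical values chosen only for illustration, and they sit inside the range of speeds observed in the data.

```r
data("cars")
cars.lm <- lm(dist ~ speed, data = cars)

# Intercept and slope, matching the equation above after rounding
coef(cars.lm)

# Predicted stopping distance, with 95% confidence intervals for the mean
# response, at two illustrative (hypothetical) speeds
new_speeds <- data.frame(speed = c(10, 20))
pred <- predict(cars.lm, newdata = new_speeds, interval = "confidence")
pred
```

Plugging speed = 20 into the equation by hand (3.93 × 20 − 17.58) should agree with the fitted value in the second row, up to rounding of the coefficients.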
Next, I plotted the residuals against the fitted values. Ideally, we would see a randomly distributed set of points with no apparent pattern, roughly equally distributed above and below 0. While this appears to be the case, the spread of the residuals also appears to increase as the fitted values increase. This is not ideal. Although the increase is modest, it suggests that the constant-variance assumption may not fully hold, so the model may have difficulty explaining some of the variation in stopping distance, particularly at higher speeds.
ggplot() +
geom_point(aes(fitted(cars.lm), resid(cars.lm)))
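The visual impression of increasing spread can be supplemented with a formal check. The sketch below is not part of the textbook analysis: it hand-computes a studentized Breusch-Pagan style statistic in base R by regressing the squared residuals on speed, which is one common way to test for non-constant variance. The names `aux`, `bp_stat`, and `bp_pval` are my own.

```r
data("cars")
cars.lm <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor
aux_data <- data.frame(sq_resid = resid(cars.lm)^2, speed = cars$speed)
aux <- lm(sq_resid ~ speed, data = aux_data)

# Test statistic: n * R-squared of the auxiliary regression,
# compared to a chi-squared distribution with 1 degree of freedom
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_pval)
```

A small p-value would be evidence against constant variance; a borderline one would be consistent with the mild fanning described above.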
The Q-Q plot tells a similar story. Several points in the upper tail depart from the reference line, which corresponds to a handful of large positive residuals that the model has difficulty capturing. The Q-Q plot also indicates that perhaps a different model may be more appropriate. A thorough investigation would compare this model to an exponential model and a quadratic model.
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
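A numerical companion to the Q-Q plot is the Shapiro-Wilk test, available in base R as `shapiro.test()`. This is an optional addition rather than part of the original analysis, and with only 50 observations it should be read alongside the plot, not in place of it.

```r
data("cars")
cars.lm <- lm(dist ~ speed, data = cars)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(resid(cars.lm))
sw
```

A small p-value would agree with the departure seen in the upper tail of the Q-Q plot.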
Overall, this appears to be a valid simple linear regression, with the caveat that the variability of the residuals increases with the fitted values. We could perhaps address this issue by transforming either the predictor or the response variable, or by adding more predictors.
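The alternatives mentioned above could be explored along the following lines. This is a rough sketch, not a full comparison: `cars.quad` adds a squared speed term, and `cars.log` models log stopping distance, which corresponds to an exponential relationship on the original scale. Both model names are my own.

```r
data("cars")
cars.lm   <- lm(dist ~ speed, data = cars)               # original linear model
cars.quad <- lm(dist ~ speed + I(speed^2), data = cars)  # quadratic in speed
cars.log  <- lm(log(dist) ~ speed, data = cars)          # exponential form

# The linear model is nested in the quadratic one, so an F-test applies
anova(cars.lm, cars.quad)

# AIC is comparable here because both models share the same response
AIC(cars.lm, cars.quad)

# The log model has a different response scale, so its AIC and R-squared
# are not directly comparable; inspect its coefficients and residuals instead
summary(cars.log)$coefficients
```

For the log model, repeating the residual-versus-fitted and Q-Q plots would show whether the transformation reduces the fanning in the residuals.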