Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
# Creating the model
data(cars)
cars_model <- lm(dist ~ speed, data = cars)
# Visualizing the model
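library(ggplot2)   # ggplot(), geom_point(), geom_smooth()
library(magrittr)  # provides the %>% pipe
cars %>%
  ggplot(mapping = aes(x = speed, y = dist)) +
  geom_point(color = 'darkgreen') +
  # least-squares line, drawn without the confidence band
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = 'red') +
  labs(title = 'Linear Model',
       subtitle = 'stopping distance as a function of speed',
       x = 'speed',
       y = 'stopping distance') +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5))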
# Model information
summary(cars_model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
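As we can see from the summary() call, the y-intercept is -17.5790949 and the slope is 3.9324088. Since the dataset records speed in mph and stopping distance in ft, each additional 1 mph of speed is associated with about 3.93 ft of additional stopping distance. The final regression model, with coefficients rounded to four decimal places, is: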
\[\widehat{\text{dist}} = 3.9324 \times \text{speed} - 17.5791\]
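As a quick check of the fitted equation, the sketch below predicts the stopping distance at one speed in two ways: by plugging the coefficients into the equation by hand, and with `predict()`. The speed of 15 and the names `new_speed`, `by_hand`, and `by_predict` are illustrative choices, not part of the original analysis.

```r
# Refit the model so this snippet runs on its own
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Hypothetical speed, chosen only for illustration
new_speed <- 15

# Prediction from the equation: intercept + slope * speed
b <- coef(cars_model)
by_hand <- unname(b["(Intercept)"] + b["speed"] * new_speed)

# The same prediction using predict()
by_predict <- unname(predict(cars_model, newdata = data.frame(speed = new_speed)))

# The two values agree (roughly 41.4)
c(by_hand = by_hand, by_predict = by_predict)
```

Using `predict()` with a `newdata` data frame is usually preferable in practice, since it avoids copying rounded coefficients by hand.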
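Residuals: The median is -2.272, which is fairly close to 0 relative to the spread of the residuals, and the magnitudes of the first and third quartiles are very close (9.525 vs. 9.215). Both are consistent with residuals spread roughly symmetrically around 0, which is what a well-fitting model should produce. The minimum and maximum differ more in magnitude (29.069 vs. 43.201), so the largest positive residuals reach further than the largest negative ones.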
t and p values of the slope: The t value of the slope is 9.464, meaning the estimate is about 9.5 times its standard error (3.9324 / 0.4155). By the rule of thumb that a ratio of 5 to 10 is good, the slope is estimated precisely. Meanwhile, the p value of the slope is very small (1.49e-12). That means that if there were no linear relationship between the variables (a true slope of 0), the probability of observing a t value at least as extreme as 9.464 in absolute value would be about 1.49e-12. Thus, there is strong evidence of a linear relationship between stopping distance and speed.
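The t and p values for the slope can be reproduced from the coefficient table, which may help make their meaning concrete. This is a minimal sketch: the names `coefs`, `t_slope`, and `p_slope` are introduced here for illustration, and the p value is computed as a two-sided tail probability of the t distribution with the model's residual degrees of freedom.

```r
# Refit the model so this snippet runs on its own
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(cars_model)$coefficients

# t value = estimate divided by its standard error
t_slope <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p value: probability of |t| at least this large when the true slope is 0
p_slope <- 2 * pt(abs(t_slope), df = df.residual(cars_model), lower.tail = FALSE)

c(t_value = t_slope, p_value = p_slope)
```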
Residual standard error and the quartiles: If the residuals are normally distributed, the first and third quartiles should sit at about 0.674 times the residual standard error on either side of 0. With a residual standard error of 15.38, that gives expected magnitudes of about 10.4. The observed quartile magnitudes (9.525 and 9.215) are slightly smaller but close to that value, so this check does not by itself point to a problem with the model.
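One way to make this comparison concrete, assuming normally distributed residuals, is to compute the quartile a normal distribution with the model's residual standard error would have (`qnorm(0.75)` times the standard error) and set it beside the observed quartiles. The names `rse`, `expected_q3`, and `observed` are introduced only for this sketch.

```r
# Refit the model so this snippet runs on its own
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Residual standard error reported by summary() (about 15.38)
rse <- sigma(cars_model)

# Third quartile of a normal distribution with mean 0 and sd = rse;
# the first quartile is the same value with a negative sign
expected_q3 <- qnorm(0.75) * rse

# Observed first and third quartiles of the residuals
observed <- quantile(resid(cars_model), probs = c(0.25, 0.75))

expected_q3
observed
```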
\(R^2\) values: The multiple \(R^2\) is 0.6511 and the adjusted \(R^2\) is 0.6438. So, around 65% of the variation in stopping distance is explained by the variation in speed.
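To show where these numbers come from, the sketch below recomputes both \(R^2\) values from the residuals, using the standard definitions (one minus the ratio of residual to total sum of squares, with the usual degrees-of-freedom adjustment). The variable names are illustrative and do not appear in the original analysis.

```r
# Refit the model so this snippet runs on its own
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the model
ss_res <- sum(resid(cars_model)^2)

# Total variation of stopping distance around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# Multiple R-squared: share of the variation explained by speed
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors (p = 1 here)
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```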
If the residuals are scattered around 0 with no pattern and are approximately normally distributed, the assumptions behind the linear model are reasonable for these data. Let’s check this using 2 plots: the residuals plot (residuals against fitted values) and the quantile-quantile (Q-Q) plot.
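# The residuals plot (qplot() is deprecated in recent ggplot2 releases, so ggplot() is used)
ggplot(data.frame(fitted = fitted(cars_model), resid = resid(cars_model)),
       aes(x = fitted, y = resid)) +
  geom_point() +
  # a least-squares line through residuals vs. fitted values is always y = 0
  geom_hline(yintercept = 0, color = 'blue')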
# The Q-Q plot
qqnorm(resid(cars_model))
qqline(resid(cars_model))
The residuals plot looks to be free of any obvious pattern, although the width of the band of points around the line is not totally consistent, so the variance of the residuals may not be perfectly constant. Overall, it looks pretty good. The Q-Q plot makes us more confident that the residuals are approximately normally distributed: the points are mostly on the line, and the shape and density of the 2 tails are similar.