Question

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Answer

library(tidyverse)

Model

cars_data <- cars

cars_data_lm <- lm(dist ~ speed, data = cars_data)
summary(cars_data_lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

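Using the coefficient estimates, let \(x\) represent the speed of the car (mph) and \(y\) the stopping distance (ft). Then \[ y = -17.5791 + 3.9324x \] Setting the predicted stopping distance to zero (\(y = 0\)) gives \[ \begin{aligned} 0 &= -17.5791 + 3.9324x \\ x &= \frac{17.5791}{3.9324} \\ x &\approx 4.4703 \end{aligned} \] The fitted line therefore predicts a stopping distance of zero at approximately 4.5 mph, and negative distances below that speed. Since a negative stopping distance is not physically meaningful, the model should not be extrapolated to very low speeds.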
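
The zero crossing computed above can be checked directly from the fitted coefficients instead of the rounded values. This is a minimal sketch; the names `b0`, `b1`, and `zero_speed` are introduced here for illustration only, and the model is refit so that the snippet runs on its own.

```r
# Refit the model so the snippet is self-contained
cars_data_lm <- lm(dist ~ speed, data = cars)

# Extract the unrounded intercept and slope (illustrative names)
b0 <- coef(cars_data_lm)[["(Intercept)"]]
b1 <- coef(cars_data_lm)[["speed"]]

# Speed at which the fitted line crosses dist = 0
zero_speed <- -b0 / b1
zero_speed

# Slowest speed actually observed, for comparison with the crossing point
min(cars$speed)
```

Comparing `zero_speed` with the slowest observed speed shows how close the crossing point lies to the edge of the data.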

Also, since the slope of the model is positive (3.9324), stopping distance increases with speed: each 1 mph increase in speed is associated with an increase of about 3.9 ft in the predicted stopping distance.
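
The slope interpretation can be illustrated by predicting the stopping distance at two speeds that are 1 mph apart. The sketch below assumes the speeds 10 and 11 mph purely as an example; `new_speeds` and `pred` are illustrative names, and `confint()` is added to show the uncertainty around the slope.

```r
# Refit the model so the snippet is self-contained
cars_data_lm <- lm(dist ~ speed, data = cars)

# Two example speeds 1 mph apart (values chosen only for illustration)
new_speeds <- data.frame(speed = c(10, 11))
pred <- predict(cars_data_lm, newdata = new_speeds)
pred

# The difference between the two predictions equals the slope
# (dist is recorded in feet in the cars dataset)
diff(pred)

# 95% confidence interval for the slope
confint(cars_data_lm, "speed", level = 0.95)
```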

The p-value for speed, Pr(>|t|), is very low (1.49e-12), which indicates that speed is a statistically significant predictor of the stopping distance of the car.

The multiple R-squared of this model is 0.6511, which tells us that the model explains about 65% of the variance in stopping distance.
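
To make the meaning of R-squared concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a small sketch; `ss_res`, `ss_tot`, and `r_squared` are names chosen here for illustration.

```r
# Refit the model so the snippet is self-contained
cars_data_lm <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the model
ss_res <- sum(resid(cars_data_lm)^2)

# Total variation of stopping distance around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# Proportion of variance explained
r_squared <- 1 - ss_res / ss_tot
r_squared

# With a single predictor, this equals the squared correlation
cor(cars$speed, cars$dist)^2
```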

Visualization

Plot of car speed vs car stopping distance with linear regression

cars_data %>% ggplot(aes(x = speed, y = dist)) +
  geom_point() +
  # Constant slope and intercept go outside aes() so a single line is drawn
  geom_abline(slope = coef(cars_data_lm)[["speed"]],
              intercept = coef(cars_data_lm)[["(Intercept)"]]) +
  labs(title = 'Stopping Distance vs Speed',
       x = 'Speed (mph)',
       y = 'Stopping Distance (ft)')

The observed points scatter around the fitted regression line, with wider scatter at higher speeds.

Residual Analysis

cars_data_lm %>% ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  labs(title = 'Residual vs Fitted',
       x = 'Fitted',
       y = 'Residual')

The plot shows that the spread of the residuals increases with the fitted values. Because the fitted values increase with speed, the difference between the predicted and the actual stopping distance tends to grow as speed increases, so the residual variance is not constant.
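
The visual impression of growing spread can be given a rough numerical check using only base R. The sketch below is an informal diagnostic, not a formal test: it compares the residual standard deviation in the lower and upper halves of the fitted values, and regresses the absolute residuals on the fitted values. The names `half`, `spread_by_half`, and `abs_trend` are illustrative.

```r
# Refit the model so the snippet is self-contained
cars_data_lm <- lm(dist ~ speed, data = cars)

res <- resid(cars_data_lm)
fit <- fitted(cars_data_lm)

# Split observations at the median fitted value
half <- ifelse(fit > median(fit), "upper", "lower")

# Residual standard deviation in each half
spread_by_half <- tapply(res, half, sd)
spread_by_half

# Rough trend check: a positive slope here suggests that the size of
# the residuals grows with the fitted values
abs_trend <- lm(abs(res) ~ fit)
coef(abs_trend)[["fit"]]
```

A clearly larger standard deviation in the upper half, or a positive slope in `abs_trend`, would be consistent with what the residual plot suggests.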

Examining the Q-Q plot will show whether the residuals are normally distributed.

cars_data_lm %>% ggplot(aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = 'Q-Q Plot',
       x = 'Theoretical Quantiles',
       y = 'Sample Quantiles')

Toward the right of the Q-Q plot, the points diverge from the reference line. This indicates that the residuals are not normally distributed and are actually right skewed.
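
The reading of the Q-Q plot can be complemented with a numerical check. As a sketch, the Shapiro-Wilk test from base R (`shapiro.test()`) tests the null hypothesis that the residuals are normally distributed, and a simple moment-based sample skewness indicates the direction of any asymmetry. The names `sw` and `skew` are illustrative, and the skewness formula used here is one common definition among several.

```r
# Refit the model so the snippet is self-contained
cars_data_lm <- lm(dist ~ speed, data = cars)
res <- resid(cars_data_lm)

# Shapiro-Wilk normality test on the residuals;
# a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$statistic
sw$p.value

# Moment-based sample skewness; a positive value indicates right skew
skew <- mean((res - mean(res))^3) / mean((res - mean(res))^2)^1.5
skew
```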

Note: the behavior of the residuals could also have been visualized with the base R diagnostic plots:

par(mfrow = c(2, 2))
plot(cars_data_lm)

In conclusion, with a multiple R-squared of 0.6511 and a p-value for speed of 1.49e-12, the linear model is a reasonable representation of the relationship between stopping distance and speed. However, because the spread of the residuals grows with speed and the residuals are right skewed, a simple linear model is not the ideal model to use.