Question

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Work and Answer

Linear Model

First, I loaded the cars dataset. Then I built the linear model with stopping distance (dist) as the dependent variable and speed as the independent variable.

#load cars data
data(cars)

#linear model
cars_model <- lm(dist ~ speed, data = cars)

summary(cars_model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Model Quality and Visualization

My model’s intercept looks odd because it is negative (-17.58), which would mean a negative stopping distance at zero speed, and that doesn’t make physical sense. The intercept is statistically significant at the 5% level (p = 0.0123), but zero speed lies outside the range of speeds in the data, so I treat it as part of the fitted line rather than as a real stopping distance. As for speed, every extra mile per hour adds almost 4 feet of stopping distance (3.93). Since the slope’s p-value is very close to 0 (1.49e-12), the relationship between speed and stopping distance is statistically significant.
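To back up the reading of the coefficients, here is a small sketch that checks the observed speed range and puts confidence intervals around the estimates. The 15 mph prediction is only an illustrative value I picked, not something from the textbook.

```r
# Refit the model so this block runs on its own
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Observed speed range: zero is outside it, so the intercept is an extrapolation
speed_range <- range(cars$speed)
print(speed_range)

# 95% confidence intervals for the intercept and the slope
coef_ci <- confint(cars_model, level = 0.95)
print(coef_ci)

# Predicted mean stopping distance at an illustrative speed of 15 mph
pred_15 <- predict(cars_model, newdata = data.frame(speed = 15),
                   interval = "confidence")
print(pred_15)
```

If the slope's interval stays well above zero, that agrees with the very small p-value reported in the summary.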

The R^2 value is 0.6511, which means that about 65.11% of the variance in stopping distance is explained by the model. The Adjusted R^2 is slightly lower at 0.6438. The two are close, which is expected because the model has only one predictor, so the penalty for extra predictors is small.
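The two R^2 values can be reproduced by hand from the sums of squares, which makes it clearer where the adjustment comes from. This is a minimal sketch; the variable names (sse, sst, n, p) are my own.

```r
# Refit the model so this block runs on its own
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

sse <- sum(residuals(cars_model)^2)          # residual sum of squares
sst <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
n <- nrow(cars)                              # number of observations
p <- 1                                       # number of predictors (speed)

# R^2: share of the variance in dist explained by the model
r_squared <- 1 - sse / sst

# Adjusted R^2: same idea, penalized for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(c(r_squared = r_squared, adj_r_squared = adj_r_squared))
```

With n = 50 and p = 1 the penalty factor is only 49/48, which is why the two values are so close.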

Scatter plot of Distance vs Speed.

library(ggplot2)

ggplot(cars, aes(x = speed, y = dist)) + 
  geom_point() +  
  geom_smooth(method = "lm", col = "orange") +  
  ggtitle("Stopping Distance (ft) vs Speed (mph)") +  
  xlab("Speed (mph)") +  
  ylab("Stopping Distance (ft)") 
## `geom_smooth()` using formula = 'y ~ x'

Residual Analysis

The residuals range from -29.069 to 43.201. The middle 50% of the residuals fall between -9.525 and 9.215, so for about half of the observations the prediction is within roughly 10 feet of the observed stopping distance. The median of the residuals is close to zero (-2.272), which suggests that the model is not systematically over- or under-predicting the stopping distances, although the largest positive residual (43.201) is bigger in magnitude than the largest negative one (-29.069), which hints at some right skew.

The Residual Standard Error is 15.38. In the context of the cars dataset, it means that the typical prediction by the model is off by about 15.38 feet.
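The residual numbers quoted in this section can be recomputed directly from the model object. The sketch below is only a cross-check of the summary output; the names res, five_num, and rse are mine.

```r
# Refit the model so this block runs on its own
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

res <- residuals(cars_model)

# Min, first quartile, median, third quartile, and max of the residuals
five_num <- quantile(res)
print(five_num)

# Residual standard error: sqrt of SSE over the residual degrees of freedom
rse <- sqrt(sum(res^2) / df.residual(cars_model))
print(rse)

# With an intercept in the model, the residuals average to zero by construction
print(mean(res))
```

Because the mean of the residuals is always zero here, the median is the more informative number when checking for lopsided errors.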

Diagnostic Plots and QQ-Plot for Residuals

#base R diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(cars_model)
par(mfrow = c(1, 1))  #reset the plotting layout

#residuals v fitted 
ggplot(cars, aes(x = fitted(cars_model), y = residuals(cars_model))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype="dashed", color = "orange") +
  ggtitle("Residuals vs Fitted Values") +
  xlab("Fitted Values") +
  ylab("Residuals")

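#normal Q-Q plot of standardized residuals, with a reference line
ggplot(data.frame(std_resid = rstandard(cars_model)), aes(sample = std_resid)) +
  stat_qq() +
  stat_qq_line(color = "orange") +
  ggtitle("Normal Q-Q Plot") +
  xlab("Theoretical Quantiles") +
  ylab("Standardized residuals")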

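The residuals are mostly well-aligned along the reference line on the QQ-Plot, suggesting that they’re approximately normally distributed, although the largest positive residuals (up to 43.201) may pull away from it in the upper tail. Approximate normality of the residuals supports the validity of the regression model, but it is only one of the model’s assumptions; linearity and constant variance are judged from the Residuals vs Fitted plot.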
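A Q-Q plot is a visual judgment, so a formal test can be a useful second opinion. The Shapiro-Wilk test below is my own addition and not part of the textbook's analysis; a small p-value would count as evidence against normal residuals.

```r
# Refit the model so this block runs on its own
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(cars_model))
print(sw)

# Compare the p-value against the usual 0.05 cutoff
reject_normality <- sw$p.value < 0.05
print(reject_normality)
```

With only 50 observations, I would read the test together with the plot rather than rely on either one alone.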