Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from Chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

Load cars

data(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Plot cars to see all observations, with the best-fit line

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
ggplot(cars, aes(x= speed, y= dist))+
  geom_point() +
  ggtitle("Cars") +
  labs(x = "Speed", y = "Distance") +
  geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'

Linear model for speed vs distance

# Make linear model of distance by speed
cars.lm <- lm(dist ~ speed, data=cars)
cars.lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
# Get summary of our model
summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Residuals

  • When we use speed as a predictor of stopping distance, the residuals are centered close to 0: the median is -2.272. As Chapter 3, Pg. 20 states - “If the line is a good fit with the data, we would expect residual values that are normally distributed around a mean of zero… a good model would tend to have a median value near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude.” In other words, the quartiles and the extremes should sit roughly equally far from zero. The quartiles do (-9.525 and 9.215), but the maximum (43.201) is further out than the minimum (-29.069), so the residual distribution is somewhat skewed to the right (as we can see in the scatterplot above, a few points lie well above the line).
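
As a numeric companion to the summary above, this sketch pulls the residual quantiles directly from the fitted model and compares their magnitudes. The names `res`, `five_num`, and `magnitude_ratios` are illustrative and not from the textbook. Ratios close to 1 would indicate the symmetry the textbook describes.

```r
# Refit the model so this block is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Residuals and their min, quartiles, median, and max
res <- resid(cars.lm)
five_num <- quantile(res)
five_num

# Compare the size of the lower and upper halves of the distribution
magnitude_ratios <- c(
  min_vs_max = abs(five_num[["0%"]]) / abs(five_num[["100%"]]),
  q1_vs_q3   = abs(five_num[["25%"]]) / abs(five_num[["75%"]])
)
magnitude_ratios
```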

Standard Error & R-squared values

  • The textbook states that “For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient.” This ratio is the t value (the test statistic), which for speed is 9.464 (3.9324 / 0.4155), with a tiny p-value of 1.49e-12 (flagged by the three asterisks, ***). Because the p-value is so small, we can say there’s strong evidence of a linear relationship between a car’s speed and its stopping distance. Additionally, the R-squared (0.6511) and adjusted R-squared (0.6438) values show that our model explains about 65% of the variation in stopping distance.
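
To make the link between these summary numbers explicit, here is a minimal sketch that recomputes the t value and R-squared by hand from the fitted model. The variable names (`coefs`, `t_speed`, `r_squared`) are illustrative only.

```r
# Refit the model so this block is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(cars.lm))

# The t value is the estimate divided by its standard error
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]
t_speed

# R-squared from its definition: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(resid(cars.lm)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - ss_res / ss_tot
r_squared
```

Both values should agree with the `t value` and `Multiple R-squared` entries printed by `summary(cars.lm)`.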

Residuals vs. fitted values

plot(fitted(cars.lm),resid(cars.lm))
abline(h=0, col="red")

  • Looking at this plot, more of the residuals fall below zero than above it, while a few large positive residuals sit well above the line, which we can confirm by looking at our initial scatterplot. If the model fit the data well, we’d expect the residuals to be symmetrically and normally distributed around 0, which doesn’t seem to be the case.
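
The visual impression can be checked with a simple count. Least-squares residuals from a model with an intercept average to zero by construction, so any asymmetry shows up in how many residuals fall on each side of zero rather than in the mean. This is a sketch with illustrative names (`n_negative`, `n_positive`, `mean_res`).

```r
# Refit the model so this block is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)
res <- resid(cars.lm)

# How many observations lie below and above the fitted line
n_negative <- sum(res < 0)
n_positive <- sum(res > 0)
c(negative = n_negative, positive = n_positive)

# The mean residual is zero up to floating-point error
mean_res <- mean(res)
mean_res
```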

Quantile-vs-Quantile

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

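  • Our points seem to be very close to the line at first glance, but they actually dip below it between the -1 and 1 theoretical quantiles and then curve sharply upward between quantiles 1 and 2. That upward bend in the upper tail is consistent with the right skew seen in the residual summary.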
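
The Q-Q plot is a visual check. One possible numeric complement, which is not part of the original analysis, is the Shapiro-Wilk normality test, available as `shapiro.test()` in base R’s stats package. A small p-value would point to a departure from normality in the residuals. The name `sw` is illustrative.

```r
# Refit the model so this block is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(cars.lm))
sw$statistic  # W statistic, values near 1 are consistent with normality
sw$p.value
```

With only 50 observations, the test and the plot are best read together rather than either one alone.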

Analysis

par(mfrow=c(2,2))
plot(cars.lm)

  • Overall, the model seems decent for predicting a car’s stopping distance from its speed. I’d be curious to know what other variables affect stopping distance, but that would require a multiple regression analysis with additional predictors.
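
To illustrate what using the model for prediction looks like, the sketch below calls `predict()` on two example speeds. The speeds 10 and 20 and the names `new_speeds` and `pred` are arbitrary choices for illustration, not values from the assignment.

```r
# Refit the model so this block is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Hypothetical speeds to predict for (same units as cars$speed)
new_speeds <- data.frame(speed = c(10, 20))

# Point predictions plus 95% prediction intervals (columns fit, lwr, upr)
pred <- predict(cars.lm, newdata = new_speeds, interval = "prediction")
pred
```

The width of the `lwr` to `upr` range gives a sense of how much individual stopping distances can vary around the fitted line.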