Using the “cars” dataset in R, build a linear model for stopping
distance as a function of speed and replicate the analysis of your
textbook chapter 3 (visualization, quality evaluation of the model, and
residual analysis).
Load cars
data(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
Plot cars to see all observations, with the best-fit line
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
ggplot(cars, aes(x= speed, y= dist))+
geom_point() +
ggtitle("Cars") +
labs(x = "Speed", y = "Distance") +
geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'

Linear model for speed vs distance
# Make linear model of distance by speed
cars.lm <- lm(dist ~ speed, data=cars)
cars.lm
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932
# Get summary of our model
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Residuals
- When we use speed as a predictor of stopping distance, the residuals
are centered close to 0: the median is -2.272. As Chapter 3, Pg. 20
states, “If the line is a good fit with the data, we would expect
residual values that are normally distributed around a mean of zero… a
good model would tend to have a median value near zero, minimum and
maximum values of roughly the same magnitude, and first and third
quartile values of roughly the same magnitude.” Our quartiles meet that
standard (-9.525 and 9.215), but the maximum (43.201) is farther from
zero than the minimum (-29.069), so the residuals are somewhat skewed to
the right. The scatterplot above shows why: a few observations sit well
above the fitted line.
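The comparison of quartiles and extremes can also be done directly on the residuals instead of reading them off the summary. This is a minimal sketch; the names `res`, `q`, `tail_ratio`, and `quartile_ratio` are ours and are not from the textbook.

```r
# Refit the model so this block runs on its own
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)
res <- resid(cars.lm)

# Five-number summary of the residuals; these are the same values
# that summary(cars.lm) prints in its "Residuals" block
q <- quantile(res)
print(round(q, 3))

# Ratios near 1 mean the two sides have roughly the same magnitude
tail_ratio     <- unname(abs(q["100%"]) / abs(q["0%"]))
quartile_ratio <- unname(abs(q["75%"]) / abs(q["25%"]))
print(round(c(tail = tail_ratio, quartile = quartile_ratio), 3))
```

A quartile ratio near 1 together with a tail ratio well above 1 would point to a distribution that is balanced in the middle but stretched on the positive side.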
Standard Error & R-squared values
- The textbook states that “For a good model, we typically would like
to see a standard error that is at least five to ten times smaller than
the corresponding coefficient.” This ratio of coefficient to standard
error is the test statistic (t value). For speed it is 9.464, with a tiny
p-value of 1.49e-12 (flagged by the three asterisks, ***). Because the
p-value is so small, there is strong evidence of a linear relationship
between a car’s speed and its stopping distance. Additionally, the
R-squared (0.6511) and adjusted R-squared (0.6438) values show that our
model explains about 65% of the variation in stopping distance.
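To make the link between the "five to ten times smaller" rule and the t value concrete, we can pull the coefficient table out of the summary object and compute the ratio by hand. This is a sketch; the variable names `s`, `coefs`, and `ratio` are our own.

```r
# Refit the model so this block runs on its own
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)
s <- summary(cars.lm)

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(s)

# Estimate divided by its standard error reproduces the t value column
ratio <- coefs[, "Estimate"] / coefs[, "Std. Error"]
print(round(ratio, 3))
print(isTRUE(all.equal(unname(ratio), unname(coefs[, "t value"]))))

# R-squared and adjusted R-squared are stored on the summary object
print(round(c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared), 4))
```

The same table shows that the intercept's standard error is only about 2.6 times smaller than its estimate, so the rule of thumb is met for the speed slope but not for the intercept.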
Residuals
plot(fitted(cars.lm),resid(cars.lm))
abline(h=0, col="red")

- Looking at this plot, the residuals lean towards the negative side,
with a few large positive values pulling the other way, which matches
what we see in our initial scatterplot. If the model fit the data well,
we’d expect the residuals to be normally distributed around 0, which
doesn’t seem to be the case.
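The visual impression from the residual plot can be checked with a quick count. This is a rough sketch, not a formal test, and the names `n_neg`, `n_pos`, and `largest` are ours.

```r
# Refit the model so this block runs on its own
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)
res <- resid(cars.lm)

# How many residuals fall below and above the zero line
n_neg <- sum(res < 0)
n_pos <- sum(res > 0)
print(c(negative = n_neg, positive = n_pos))

# With an intercept in the model, least squares forces the mean residual
# to be (numerically) zero, so any imbalance shows up in the counts and
# in the size of the extremes, not in the mean
print(mean(res))

# The three largest positive residuals, labelled by observation number
largest <- head(sort(res, decreasing = TRUE), 3)
print(round(largest, 2))
```

Since the mean residual is zero by construction, more points below the line have to be offset by fewer but larger residuals above it.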
Quantile-vs-Quantile
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

- Our points seem to be very close to the line at first glance, but
they actually curve downwards between the -1 and 1 theoretical
quantiles, and then curve sharply upwards between the 1 and 2 quantiles.
This suggests the residuals are not quite normally distributed.
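To put numbers on how far the points sit from the reference line, we can rebuild the line that `qqline()` draws and measure each point's vertical distance from it. This is a sketch under one assumption: that `qqline()` is left at its default, a line through the first and third quartiles. The names `qq`, `slope`, `intercept`, `dev`, and `band` are ours.

```r
# Refit the model so this block runs on its own
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)
res <- resid(cars.lm)

# Q-Q coordinates without drawing: x = theoretical, y = sample quantiles
qq <- qqnorm(res, plot.it = FALSE)

# Default qqline(): the line through the 25% and 75% quantile points
y_q <- quantile(res, c(0.25, 0.75))
x_q <- qnorm(c(0.25, 0.75))
slope     <- unname(diff(y_q) / diff(x_q))
intercept <- unname(y_q[1] - slope * x_q[1])

# Vertical distance of each point from the line (positive = above it)
dev <- qq$y - (intercept + slope * qq$x)

# Average distance in three bands of the theoretical quantile
band <- cut(qq$x, breaks = c(-Inf, -1, 1, Inf))
print(round(tapply(dev, band, mean), 2))
```

A clearly positive average in the upper band would agree with the upward curve seen at the right end of the plot.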
Analysis
par(mfrow=c(2,2))
plot(cars.lm)

- Overall, the model seems decent for predicting a car’s stopping
distance from its speed. I’d be curious to know what other variables
affect stopping distance, but that would require a multiple regression
analysis.
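As an illustration of using the fitted model for prediction, the sketch below estimates stopping distances for a few speeds. The speeds in `new_speeds` are hypothetical values chosen for the example. In the `cars` data, speed is in mph and distance is in feet.

```r
# Refit the model so this block runs on its own
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) within the range of the observed data
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals (fit, lwr, upr)
pred <- predict(cars.lm, newdata = new_speeds, interval = "prediction")
print(cbind(new_speeds, round(pred, 1)))

# The same point prediction by hand: intercept + slope * speed
by_hand <- unname(coef(cars.lm)[1] + coef(cars.lm)[2] * 15)
print(round(by_hand, 1))
```

The prediction intervals are wide, which fits a residual standard error of 15.38: the line captures the trend, but individual stopping distances still vary a lot around it.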