Introduction

library(dplyr)
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…

Step 1: Visualize the Data

plot(dist ~ speed, cars)

We can see that distance tends to increase as speed increases.

Step 2: Building a Linear Model

cars.lm <- lm(dist ~ speed, data = cars)
cars.lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

Here, the y-intercept is \(a_0 = -17.579\) and the slope is \(a_1 = 3.932\), so the fitted regression model is:

\(\widehat{\text{dist}} = 3.932 \cdot \text{speed} - 17.579\)
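To make the origin of these coefficients concrete, here is a minimal sketch that recomputes them from the closed-form least-squares formulas and then uses the fitted model for one prediction. The names `slope`, `intercept`, `fit`, and `pred_15`, and the example speed of 15, are illustrative choices rather than part of the analysis above.

```r
# Closed-form least-squares estimates for a one-predictor model,
# computed on the built-in cars data set.
slope <- cov(cars$speed, cars$dist) / var(cars$speed)
intercept <- mean(cars$dist) - slope * mean(cars$speed)

# Refit the same model so this block is self-contained.
fit <- lm(dist ~ speed, data = cars)

# The hand-computed values should agree with lm()'s coefficients.
all.equal(unname(coef(fit)), c(intercept, slope))

# Predicted stopping distance at an illustrative speed of 15.
pred_15 <- predict(fit, newdata = data.frame(speed = 15))
pred_15
```

Plugging 15 into the equation by hand gives roughly 41.4, which is what `predict()` should return up to rounding.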

Plotting the original data along with the fitted line:

plot(dist ~ speed, cars)
abline(cars.lm, col="blue")

Step 3: Evaluating the Quality of the Model

Let’s take a closer look at the statistics of the linear model we just fit:

summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The residual quartiles are roughly symmetric around zero, although the maximum (43.201) lies farther from the median than the minimum (-29.069), which hints at a right skew. The p-value for speed (1.49e-12) is extremely small: if speed had no effect on stopping distance, a t statistic as large in magnitude as 9.464 would almost never occur, so the slope is statistically significant. The multiple R-squared of 0.6511 means that about 65.11% of the variability in distance is explained by speed.
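The headline numbers in the summary can be reproduced by hand, which may help when interpreting them. The sketch below recomputes the t value and p-value for speed and the multiple R-squared; the variable names (`fit`, `coefs`, `t_speed`, `p_speed`, `rss`, `tss`, `r_squared`) are introduced here for illustration only.

```r
# Refit the model so this block is self-contained.
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

# t value = estimate divided by its standard error.
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value from the t distribution with the residual
# degrees of freedom (48 for this model).
p_speed <- 2 * pt(abs(t_speed), df = df.residual(fit), lower.tail = FALSE)

# R-squared = 1 - (residual sum of squares / total sum of squares).
rss <- sum(resid(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - rss / tss

c(t = t_speed, p = p_speed, r_squared = r_squared)
```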

Step 4: Residual Analysis

plot(density(resid(cars.lm)))

The density is close to a normal curve but slightly right-skewed, as we’d expect from the maximum and minimum residuals.

Residuals Plot

plot(fitted(cars.lm), resid(cars.lm), xlab="Fitted", ylab="Residuals")
abline(h = 0, col = "blue")

We can see that the residuals do not appear to be evenly scattered above and below 0. This suggests that using speed as the sole predictor does not sufficiently explain the data.
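A rough numeric companion to the plot is to count residuals on each side of zero and to compare their spread across the lower and upper halves of the fitted values. This is only an informal sketch, not a formal test; the names `res`, `sign_counts`, `half`, and `spread_by_half`, and the median split itself, are assumptions made for illustration.

```r
# Refit the model so this block is self-contained.
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Count residuals below, exactly at, and above zero.
sign_counts <- table(factor(sign(res),
                            levels = c(-1, 0, 1),
                            labels = c("below", "zero", "above")))
sign_counts

# Split observations at the median fitted value and compare the
# standard deviation of the residuals in each half.
half <- ifelse(fitted(fit) <= median(fitted(fit)), "lower", "upper")
spread_by_half <- tapply(res, half, sd)
spread_by_half
```

Markedly different counts or spreads would be consistent with the uneven pattern seen in the plot.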

Quantile-versus-Quantile (Q-Q) Plot

qqnorm(resid(cars.lm))
qqline(resid(cars.lm), col="blue")

If the residuals were normally distributed, we would expect the points in this figure to follow the straight reference line. That is not the case, as we can see from the points at the top right, which depart from the line. This further indicates that the residuals are not normally distributed.
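The visual impression from the Q-Q plot can be complemented with a formal normality test. The sketch below uses the Shapiro-Wilk test from base R's `stats` package; it is an optional addition rather than part of the original analysis, and the names `fit` and `sw` are illustrative.

```r
# Refit the model so this block is self-contained.
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution.
sw <- shapiro.test(resid(fit))

sw$statistic  # W statistic; values near 1 are consistent with normality
sw$p.value    # a small p-value is evidence against normality
```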

Finally, we can look at the above two plots alongside a Scale-Location plot and a Residuals vs. Leverage plot by arranging the plotting area as a 2-by-2 grid with par(mfrow=c(2,2)) and then calling the plot function on the linear model:

par(mfrow=c(2,2))
plot(cars.lm)
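Because `par(mfrow = ...)` changes the layout for every later plot on the same device, it can be convenient to restore the previous settings afterwards. The following sketch does that and also shows how a single diagnostic panel might be drawn using the `which` argument of `plot()` for `lm` objects; `fit` and `old_par` are illustrative names.

```r
# Refit the model so this block is self-contained.
fit <- lm(dist ~ speed, data = cars)

# par() returns the previous settings, so they can be restored later.
old_par <- par(mfrow = c(2, 2))
plot(fit)       # the four default diagnostic plots in one 2-by-2 grid
par(old_par)    # back to the layout that was active before

# Draw only one diagnostic; which = 3 selects the Scale-Location plot.
plot(fit, which = 3)
```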

Conclusion

Given that the residuals are not normally distributed, we conclude that speed alone is not sufficient to explain stopping distance.