library(dplyr)
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…
plot(dist ~ speed, cars)
We can see that distance tends to increase as speed increases.
cars.lm <- lm(dist ~ speed, data = cars)
cars.lm
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932
Here, the y-intercept is \(a_0 = -17.579\) and the slope is \(a_1 = 3.932\), thus the final regression model is:
\(\widehat{\text{dist}} = 3.932 \times \text{speed} - 17.579\)
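To make the fitted equation concrete, here is a minimal sketch that predicts the stopping distance for one new speed, both with `predict()` and by hand from the coefficients. The speed of 15 and the names `new_speed` and `by_hand` are illustrative assumptions, not part of the original analysis.

```r
# Refit the model so this snippet runs on its own.
cars.lm <- lm(dist ~ speed, data = cars)

# Hypothetical new observation: a speed of 15 (illustrative value).
new_speed <- data.frame(speed = 15)

# Prediction from predict() ...
pred <- predict(cars.lm, newdata = new_speed)

# ... and the same value computed by hand as a_0 + a_1 * speed.
a <- coef(cars.lm)
by_hand <- a[["(Intercept)"]] + a[["speed"]] * 15

print(c(predict = unname(pred), by_hand = by_hand))
```

Both values should agree, since `predict()` evaluates the same linear equation.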
Plotting the original data along with the fitted line:
plot(dist ~ speed, cars)
abline(cars.lm, col="blue")
Let’s take a closer look at the statistics of the linear model we just fit:
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Judging by the quartiles, the residuals appear roughly normally distributed, although the maximum (43.201) lies farther from the median than the minimum (-29.069). The p-value for speed (1.49e-12) is extremely small: if speed had no effect on distance, a t statistic as extreme as 9.464 would almost never be observed, so the slope is significantly different from zero. The R-squared value means 65.11% of the variability in distance is explained by the variation in speed.
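As a rough check on how these summary statistics arise, the sketch below recomputes R-squared as one minus the ratio of the residual sum of squares to the total sum of squares, and pulls the p-value for speed out of the coefficient table. The variable names (`rss`, `tss`, `r2_by_hand`, `p_speed`) are assumptions introduced for illustration.

```r
# Refit the model so this snippet runs on its own.
cars.lm <- lm(dist ~ speed, data = cars)
s <- summary(cars.lm)

# R-squared by hand: 1 - RSS / TSS.
rss <- sum(resid(cars.lm)^2)                  # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r2_by_hand <- 1 - rss / tss

# p-value for the speed coefficient, taken from the coefficient table.
p_speed <- s$coefficients["speed", "Pr(>|t|)"]

print(c(r2_summary = s$r.squared, r2_by_hand = r2_by_hand, p_speed = p_speed))
```

The two R-squared values should match the "Multiple R-squared" line in the summary output.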
plot(density(resid(cars.lm)))
The density is close to normal but slightly right-skewed, as we’d expect from the maximum and minimum residuals.
plot(fitted(cars.lm), resid(cars.lm), xlab="Fitted", ylab="Residuals")
abline(0,0, col="blue")
We can see that the residuals are not evenly scattered above and below 0. This suggests that a straight-line model with speed as the sole predictor does not sufficiently explain the data.
qqnorm(resid(cars.lm))
qqline(resid(cars.lm), col="blue")
If the residuals were normally distributed, we would expect the points plotted in this figure to follow the straight line. That is not the case for the points at the top right, which further indicates that the residuals are not normally distributed.
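The visual impression from the Q-Q plot can be complemented with a formal test. The sketch below uses the Shapiro-Wilk test (`shapiro.test()` from base R's stats package), which is an addition here and not part of the original analysis; a small p-value would be evidence against normally distributed residuals.

```r
# Refit the model so this snippet runs on its own.
cars.lm <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal.
sw <- shapiro.test(resid(cars.lm))

print(sw)
```

With only 50 observations the test has limited power, so it is best read alongside the plots rather than in place of them.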
Finally, we can look at the above two plots alongside a Scale-Location plot and a Residuals vs. Leverage plot by setting a 2-by-2 plot layout with par and then calling the plot function on the linear model:
par(mfrow=c(2,2))
plot(cars.lm)
Given that the residuals are not normally distributed, we can conclude that speed alone is not sufficient to explain stopping distance.