We are using the “cars” dataset to make a linear model for stopping distance as a function of speed.
library(datasets)
library(ggplot2)
library(tidyverse)
library(lmtest)
library(nortest)
cars <- as.data.frame(cars)
First, we will make sure the data follows a linear pattern:
ggplot(cars) + geom_point(aes(x = speed, y = dist))
The data seems to follow a linear pattern, though it does seem to fan out a bit. We will need to keep an eye out for violations of the constant variance assumption.
lm_cars <- lm(dist ~ speed, data = cars)
summary(lm_cars)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
ggplot(cars) + geom_point(aes(x = speed, y = dist)) + geom_smooth(aes(x = speed, y = dist),method = 'lm')
The summary of the model checks all the boxes we’re looking for. The residuals are fairly symmetric, with the 1st quartile and 3rd quartile being about equal, and the median being near 0. The coefficient for speed is easily different from 0 with a t value near 10. Lastly, the R squared value indicates a strong fit.
plot(lm_cars)
There are two issues that are not concerning, but should be noted, and one that is concerning. First, the small deviations from normality are not really a concern with only 50 observations. There is one point (49) that is somewhat influential, so it would be nice to know some other features of this car if we had access to them. Finally, the residuals exhibit non constant variance and some curvature. This will need to be addressed, espcially because if we use some domain knowledge, we know that momentum varies quadratically with velocity.
diagnostics <- data.frame(fitted = lm_cars$fitted.values, residual_abs = abs(lm_cars$residuals),
residuals = lm_cars$residuals)
ggplot(diagnostics) + geom_point(aes(x = fitted, y = residual_abs)) +
labs(x= 'Fitted value', y = 'Absolute Value of Residuals')
ggplot(diagnostics) + geom_histogram(aes( x = residuals), color = 'black', fill = 'darkblue', binwidth = 4)
bptest(lm_cars)
##
## studentized Breusch-Pagan test
##
## data: lm_cars
## BP = 3.2149, df = 1, p-value = 0.07297
ad.test(lm_cars$residuals)
##
## Anderson-Darling normality test
##
## data: lm_cars$residuals
## A = 0.79406, p-value = 0.0369
If normality were the only issue, we would not change or reject the model because of the skew. We would merely note that the standard error of the beta coefficient is should be higher. However, the heteroskedacity is more of a concern, even given the higher p value from its test. This is because of our domain knowledge, and our likely desire to use our model to exrapolate stopping distance.
We will transform the response variable as a remedial measure.
cars <- cars %>%
mutate(
dist_sqrt = sqrt(dist)
)
ggplot(cars) + geom_point(aes(x = speed, y = dist_sqrt)) + geom_smooth(aes(x = speed, y = dist_sqrt),method = 'lm')
lm_cars_sqrt <- lm(dist_sqrt ~ speed, data = cars)
summary(lm_cars_sqrt)
##
## Call:
## lm(formula = dist_sqrt ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0684 -0.6983 -0.1799 0.5909 3.1534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.27705 0.48444 2.636 0.0113 *
## speed 0.32241 0.02978 10.825 1.77e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 48 degrees of freedom
## Multiple R-squared: 0.7094, Adjusted R-squared: 0.7034
## F-statistic: 117.2 on 1 and 48 DF, p-value: 1.773e-14
plot(lm_cars_sqrt)
bptest(lm_cars_sqrt)
##
## studentized Breusch-Pagan test
##
## data: lm_cars_sqrt
## BP = 0.011192, df = 1, p-value = 0.9157
ad.test(lm_cars_sqrt$residuals)
##
## Anderson-Darling normality test
##
## data: lm_cars_sqrt$residuals
## A = 0.39752, p-value = 0.3551
The transformation fixes both issues, and improves the fit. This remedial measure appears to have been a success.