Introduction
Using the “cars” dataset in R, build a linear model of stopping distance as a function of speed and analyze the regression model through visualization, model quality evaluation, and residual analysis.
Libraries
library(tidyverse)
Data
data(cars)
head(cars, 6)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
# Scatter plot of stopping distance (ft) against speed (mph)
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point(color = "steelblue") +
  theme_minimal() +
  labs(x = "Speed (mph)", y = "Stopping Distance (ft)")
Linear Regression Model - One-Factor
cars.lm <- lm(dist~speed, cars)
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
cars.res <- resid(cars.lm)
# Residual analysis plot: residuals against observed stopping distance
ggplot(data = cars, aes(x = dist, y = cars.res)) +
  geom_hline(yintercept = 0) +
  geom_point(color = "steelblue") +
  theme_minimal() +
  labs(x = "Stopping Distance (ft)", y = "Residuals")
# Q-Q plot of the residuals
ggplot(cars, aes(sample = cars.res)) +
  stat_qq(color = "steelblue") +
  stat_qq_line() +
  theme_minimal() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
Analysis
Summary Statistics: The summary of the linear model shows that the residual median (-2.272) is close to zero. In addition, the 1st and 3rd quartiles have nearly the same magnitude (-9.525 and 9.215), but the minimum and maximum do not (-29.069 and 43.201). This suggests that the residuals depart from normality in the tails of the data.
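The residual figures discussed above can be reproduced directly from the fitted model. The following is a minimal sketch using only base R; the names `res.q`, `res.mean`, and `tail.ratio` are introduced here for illustration and do not appear in the original analysis.

```r
# Refit the model from the report so this block runs on its own
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)
cars.res <- resid(cars.lm)

# Five-number summary of the residuals; these are the values that
# summary(cars.lm) prints under "Residuals"
res.q <- quantile(cars.res, probs = c(0, 0.25, 0.5, 0.75, 1))
print(round(res.q, 3))

# For a least-squares fit with an intercept, the residual mean is zero
# up to floating-point error, so the median is the informative center
res.mean <- mean(cars.res)
print(res.mean)

# Ratio of the largest positive residual to the largest negative one;
# a value well above 1 points to a longer upper tail
tail.ratio <- unname(abs(res.q["100%"]) / abs(res.q["0%"]))
print(round(tail.ratio, 2))
```

A ratio near 1 would indicate tails of similar length; here the maximum is noticeably larger in magnitude than the minimum, which is the asymmetry noted in the summary.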
The ratio of the intercept estimate to its standard error (the t value) is about 2.6 in magnitude, which is less than five and indicates that this parameter is estimated with considerable uncertainty. The ratio of the slope estimate to its standard error is approximately 9.5, indicating comparatively little variability in this parameter.
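The estimate-to-standard-error ratios are the t values in the coefficient table, and they can be computed by hand as a check. The sketch below also prints 95% confidence intervals with `confint()` as one way of seeing how much each parameter may vary; the interval level and the names `coefs`, `ratio`, and `ci` are choices made here, not part of the original analysis.

```r
# Refit the model from the report so this block runs on its own
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(cars.lm))

# Ratio of each estimate to its standard error; this equals the t value
ratio <- coefs[, "Estimate"] / coefs[, "Std. Error"]
print(round(ratio, 3))

# 95% confidence intervals for the intercept and the slope
# (the 0.95 level is an assumption made for this sketch)
ci <- confint(cars.lm, level = 0.95)
print(round(ci, 3))
```

A wide interval relative to the estimate corresponds to a small ratio, as with the intercept; a narrow one corresponds to a large ratio, as with the slope.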
The p-value for speed (1.49e-12) is approximately zero, indicating that speed is relevant in the model and statistically significant.
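The significance and fit figures can be pulled out of the summary object instead of being read off the printed output. This is a minimal sketch; the names `cars.sum`, `p.speed`, `r2`, `adj.r2`, and `res.se` are illustrative and not taken from the original code.

```r
# Refit the model from the report so this block runs on its own
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)
cars.sum <- summary(cars.lm)

# p-value for the speed coefficient
p.speed <- coef(cars.sum)["speed", "Pr(>|t|)"]
print(signif(p.speed, 3))

# Share of the variance in stopping distance explained by speed
r2 <- cars.sum$r.squared
adj.r2 <- cars.sum$adj.r.squared
print(round(c(r.squared = r2, adj.r.squared = adj.r2), 4))

# Residual standard error, in the units of the response
res.se <- cars.sum$sigma
print(round(res.se, 2))
```

Read together, these values separate two questions: whether speed matters at all (the p-value) and how much of the variation in stopping distance it leaves unexplained (one minus R-squared).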
Lastly, we review the residual visualizations. In the first plot (Residual Analysis Plot), the residuals do not show constant variability. Beyond a stopping distance of about 65 ft, the residuals trend above zero. In the second plot (Q-Q Plot), the residuals follow the normal line up to about theoretical quantile 1. Beyond this quantile, the residuals are no longer normal and tail off from the Q-Q line.
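One point worth checking when reading the first plot is that residuals plotted against the observed response are positively correlated with it by construction, whereas residuals plotted against the fitted values are not. The sketch below illustrates this numerically under the same model; it is an added diagnostic rather than part of the original analysis, and the names `cars.fit`, `cor.obs`, and `cor.fit` are introduced here.

```r
# Refit the model from the report so this block runs on its own
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)
cars.res <- resid(cars.lm)
cars.fit <- fitted(cars.lm)

# Correlation of the residuals with the observed stopping distance;
# for a least-squares fit this equals sqrt(1 - R^2)
cor.obs <- cor(cars.res, cars$dist)
r2 <- summary(cars.lm)$r.squared
print(round(c(cor.obs = cor.obs, sqrt.1.minus.r2 = sqrt(1 - r2)), 4))

# Correlation of the residuals with the fitted values; this is zero
# up to floating-point error for a least-squares fit with an intercept
cor.fit <- cor(cars.res, cars.fit)
print(cor.fit)
```

Because of this built-in correlation, some upward trend at large stopping distances is expected in a residuals-versus-response plot. Plotting the residuals against `fitted(cars.lm)` or against `speed` is a common complementary view for judging whether the variability is constant.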
Conclusion
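Based on the analysis above, we conclude that speed is not a sufficient predictor of stopping distance in a one-factor linear model: although speed is statistically significant, the residuals show non-constant variability and depart from normality in the upper tail.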