Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
library(tidyverse)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
The scatter plot below shows a positive, roughly linear trend between speed and stopping distance in the cars dataset.
cars %>%
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Cars",
       x = "Speed", y = "Distance") +
  theme_minimal()
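To put a number on the trend seen in the scatter plot, one option is the Pearson correlation coefficient. The following is a small sketch using base R's `cor()`; the variable name `r` is only illustrative and is not part of the original analysis.

```r
# Pearson correlation between speed and stopping distance (built-in cars data)
r <- cor(cars$speed, cars$dist)
r

# For a simple linear regression, the squared correlation equals R-squared
r^2
```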
cars_lm <- lm(speed ~ dist, data = cars)
cars_lm
##
## Call:
## lm(formula = speed ~ dist, data = cars)
##
## Coefficients:
## (Intercept) dist
## 8.2839 0.1656
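The task statement asks for stopping distance as a function of speed, whereas the formula above has speed on the left-hand side. A minimal sketch of the model in the stated direction might look like the following; the name `dist_lm` and the example speeds are assumptions introduced here for illustration.

```r
# Hypothetical alternative fit: stopping distance as the response, speed as the predictor
dist_lm <- lm(dist ~ speed, data = cars)
coef(dist_lm)

# Predicted stopping distance for a few illustrative speeds (values chosen arbitrarily)
predict(dist_lm, newdata = data.frame(speed = c(10, 15, 20)))
```

The coefficients and residuals of this fit are on the scale of distance rather than speed, so they should not be compared directly with the numbers reported for the original formula.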
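In the summary of our linear regression model below, the minimum and maximum residuals, and likewise the first and third quartiles, have roughly similar magnitudes, and the median is close to zero. This suggests the model is good, but let's do some more evaluation. Each standard error is about 9.5 times smaller than its corresponding coefficient (this ratio is the t value). The p-value for dist (1.49e-12) shows that the probability of this variable being irrelevant is very low. Lastly, R-squared is 0.65, which means the model explains about 65% of the variation in the response. Note that the formula speed ~ dist treats speed as the response; for a single-predictor model, R-squared is the same in either direction. Overall, I would say this is a good model.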
summary(cars_lm)
##
## Call:
## lm(formula = speed ~ dist, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5293 -2.1550 0.3615 2.4377 6.4179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
## dist 0.16557 0.01749 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
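The quantities discussed above can be recomputed by hand as a sanity check. The sketch below refits the same formula under the illustrative name `fit` so that it runs on its own, then derives the t values as estimate divided by standard error and R-squared as one minus the ratio of the residual to the total sum of squares.

```r
# Refit the same model as above (the name `fit` is illustrative)
fit <- lm(speed ~ dist, data = cars)
coefs <- coef(summary(fit))  # matrix with Estimate, Std. Error, t value, Pr(>|t|)

# t value = estimate / standard error, i.e. how many times smaller the SE is
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
t_manual

# R-squared = 1 - SS_residual / SS_total
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((cars$speed - mean(cars$speed))^2)
r2_manual <- 1 - ss_res / ss_tot
r2_manual
```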
In the residual plot below, the variance of the residuals is not uniform across the fitted values, which suggests that our explanatory variable probably does not fully explain the data. In the quantile-quantile plot, however, we see that the residuals are approximately normally distributed. Therefore, I would say that overall this is a good model.
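# Residuals vs. fitted values; a tibble makes the plotted columns explicit
tibble(fitted = fitted(cars_lm), resid = resid(cars_lm)) %>%
  ggplot(aes(fitted, resid)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Residual Analysis",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()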
# Normal Q-Q plot of the residuals
tibble(resid = resid(cars_lm)) %>%
  ggplot(aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot") +
  theme_minimal()
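As a numerical complement to the two plots, one could run a Shapiro-Wilk test on the residuals and compare their spread across low and high fitted values. These are optional checks that are not part of the analysis above; the model is refit under the illustrative name `fit` so the snippet is self-contained, and the median split is an arbitrary, crude choice.

```r
# Refit the same model as above (the name `fit` is illustrative)
fit <- lm(speed ~ dist, data = cars)
res <- resid(fit)

# Shapiro-Wilk normality test; a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)
sw$p.value

# Crude check for non-uniform variance: residual SD below vs. above the median fitted value
upper_half <- fitted(fit) > median(fitted(fit))
spread <- tapply(res, upper_half, sd)
spread
```

Clearly different standard deviations in the two halves would be consistent with the non-uniform variance seen in the residual plot, although a formal test would be needed to confirm it.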