Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
data("cars")
cars %>% head() %>% kable()
| speed | dist |
|---|---|
| 4 | 2 |
| 4 | 10 |
| 7 | 4 |
| 7 | 22 |
| 8 | 16 |
| 9 | 10 |
p <- ggplot(cars, aes(speed, dist)) +
geom_point() +
ggtitle('Stopping Distance VS Speed') +
xlab("Speed") +
ylab("Stopping Distance") +
scale_x_continuous(limits = c(0, 26)) +
scale_y_continuous(limits = c(0, 125)) +
theme_tufte()
p
From the plotted relationship we see that the relationship looks roughly linear. There could be a bit of a quadratic relationship because as speed increases there looks to be a supra-linear increase in stopping distance. This would be consitant with what we know from physics where the breaking distance is a quadratic function of speed. Because observed speeds in this data set are low \(>30mph\) this effect is deminished. Also, we would expec a car goign 0 mph to stop at zero distance. A quick eyeball of the data shows that a trend line would not do this.
Taking the square root of stoppind distance should fix this.
mod <- lm(dist~speed, cars)
summary(mod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
We can see that the model does a good job of fitting the data, with an \(R^2\) of 65. Also both points coefficents are significant. Given this significance, it is also not surprising the F statistic is also significant.
So even with the suspected non-linear relationship, the model still does well.
The thing that makes the least sense is the intercept value, which should be zero. The fact that it is so negative suggests that the data is non-linear.
p + geom_smooth(method='lm', se = FALSE, color = 'grey')
## Warning: Removed 2 rows containing missing values (geom_smooth).
This plot further confirms there may be slight non-linearity.
df <- augment(mod) # https://stackoverflow.com/questions/36731027/how-can-i-plot-the-residuals-of-lm-with-ggplot
ggplot(df, aes(.fitted, .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
theme_tufte()
The residual plot shows heterosckedasticity as the error seems to be related to the fitted values.
# Thanks to this site for the function:
# https://stackoverflow.com/questions/4357031/qqnorm-and-qqline-in-ggplot2
ggQQ <- function(LM) # argument: a linear model
{
y <- quantile(LM$resid[!is.na(LM$resid)], c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]
p <- ggplot(LM, aes(sample=.resid)) +
stat_qq() +
geom_abline(slope = slope, intercept = int, color="light grey") +
theme_tufte()
return(p)
}
ggQQ(mod)
The qqplot also shows that our errors are not normally distributed because of the deviance from the line at both the start and end.
transCars <- cars
transCars$dist = sqrt(transCars$dist)
p <- ggplot(transCars, aes(speed, dist)) +
geom_point() +
ggtitle('Stopping Distance VS Speed') +
xlab("Speed") +
ylab("Stopping Distance") +
theme_tufte()
p
p + geom_smooth(method='lm', se = FALSE, color = 'grey')
mod <- lm(dist~speed, transCars)
summary(mod)
##
## Call:
## lm(formula = dist ~ speed, data = transCars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0684 -0.6983 -0.1799 0.5909 3.1534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.27705 0.48444 2.636 0.0113 *
## speed 0.32241 0.02978 10.825 1.77e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 48 degrees of freedom
## Multiple R-squared: 0.7094, Adjusted R-squared: 0.7034
## F-statistic: 117.2 on 1 and 48 DF, p-value: 1.773e-14
df <- augment(mod) # https://stackoverflow.com/questions/36731027/how-can-i-plot-the-residuals-of-lm-with-ggplot
ggplot(df, aes(.fitted, .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
theme_tufte()
ggQQ(mod)
The scatter plot and residual plot both look better. The \(R^2\) is higher, howver the QQPlot still indicates there may be some issues. Over all I think it’s a better model.