Problem
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
attach(cars)
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
Plotting the data shows that stopping distance tends to increase as speed increases.
plot(speed, dist, xlab='Speed (mph)', ylab='Distance (ft)',
main='Distance vs. Speed')
We fit a one-factor regression to model distance as a function of speed.
lm <- lm(dist ~ speed, data=cars)
summary(lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
Y-intercept = -17.5791, slope = 3.9324
The equation for the regression model is
\[ \widehat{\text{dist}} = -17.5791 + 3.9324 \times \text{speed} \]
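As a quick check of the fitted equation, the coefficients can be pulled from the model object and used to predict a stopping distance. This is only a minimal sketch: the object name `fit` and the example speed of 20 mph are illustrative choices, not part of the original analysis.

```r
# Refit the model under a hypothetical name so the example is self-contained
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line
b <- coef(fit)

# Prediction by hand from the regression equation
# (speed = 20 mph is an arbitrary example value)
by_hand <- unname(b["(Intercept)"] + b["speed"] * 20)

# The same prediction through predict()
by_predict <- unname(predict(fit, newdata = data.frame(speed = 20)))

print(round(b, 4))
print(c(by_hand = by_hand, by_predict = by_predict))
```

Both values should agree, at roughly 61.07 ft for a speed of 20 mph.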
Residuals:
The 1Q and 3Q values are close in magnitude (-9.525 and 9.215), but the minimum residual is -29.069 whereas the maximum is 43.201, and the median is -2.272 rather than 0. This asymmetry suggests that the residuals are skewed to the right and therefore not exactly normally distributed. The residuals also seem to grow as speed increases.
Coefficients
For a good model, we would like to see a standard error that is at least 5 to 10 times smaller than the corresponding coefficient.
The standard error of the intercept is only about 2.6 times smaller than its estimate, whereas the standard error of speed is about 9.5 times smaller. These values suggest that the slope estimate shows little variability but the intercept estimate can vary significantly.
The p-values of both coefficients are below 0.05 (0.0123 for the intercept and 1.49e-12 for speed), so there is little probability that either coefficient is irrelevant to the model. The evidence is much stronger for the slope than for the intercept.
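The ratio of each coefficient to its standard error can be computed directly from the coefficient table rather than read off by eye. A minimal sketch, assuming the model is refit under the hypothetical name `fit`:

```r
# Refit the model under a hypothetical name so the example is self-contained
fit <- lm(dist ~ speed, data = cars)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
ctab <- coef(summary(fit))

# How many times smaller each standard error is than its estimate
ratio <- abs(ctab[, "Estimate"]) / ctab[, "Std. Error"]
print(round(ratio, 2))

# These ratios are the absolute t values from the summary
print(all.equal(ratio, abs(ctab[, "t value"])))
```

Because the ratio is just the absolute t value, the "5 to 10 times smaller" rule of thumb amounts to asking for a t value of at least 5 to 10 in magnitude.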
R-squared
The multiple R-squared of 0.6511 means that the model explains 65.11% of the variation in stopping distance.
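To make the "explained variation" interpretation concrete, the same figure can be rebuilt from the residual and total sums of squares. This is a sketch; the names `fit`, `sse`, `sst` and `r2` are chosen here for illustration.

```r
# Refit the model under a hypothetical name so the example is self-contained
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation left unexplained by the model
sse <- sum(resid(fit)^2)

# Total sum of squares: variation of dist around its mean
sst <- sum((cars$dist - mean(cars$dist))^2)

# Share of the total variation that the model accounts for
r2 <- 1 - sse / sst
print(r2)

# Should match the value reported by summary()
print(all.equal(r2, summary(fit)$r.squared))
```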
plot(speed, dist, xlab='Speed (mph)', ylab='Distance (ft)',
main='Distance vs. Speed')
abline(lm)
plot(fitted(lm), resid(lm))
abline(a=0, b=0, col='red')
The residuals tend to grow in magnitude as we move to the right, so they are not uniformly scattered above and below zero. The spread of the errors is not constant: predictions for larger fitted distances are less reliable than those for smaller ones.
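The pattern in the residual plot can also be checked numerically. The sketch below counts how often the model underpredicts and overpredicts, and compares the residual spread for the lower and upper halves of the fitted values. The split at the median fitted value is an arbitrary illustrative choice, not a formal test, and `fit`, `res` and `upper` are hypothetical names.

```r
# Refit the model under a hypothetical name so the example is self-contained
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Residual = observed - fitted, so a positive residual is an underprediction
# and a negative residual is an overprediction
print(c(under = sum(res > 0), over = sum(res < 0)))

# Compare the residual spread below and above the median fitted value
upper <- fitted(fit) > median(fitted(fit))
print(c(sd_lower = sd(res[!upper]), sd_upper = sd(res[upper])))
```

A noticeably larger standard deviation in the upper half would be consistent with the residuals fanning out toward the right of the plot.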
Q-Q plot
qqnorm(resid(lm), col='blue')
qqline(resid(lm), col='red')
Although the points in the middle lie fairly close to the line, the points at the ends diverge noticeably. The tails of the distribution are heavier than those of a normal distribution, so the residuals are not normally distributed.
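A formal test can complement the visual reading of the Q-Q plot. The sketch below applies the Shapiro-Wilk test from base R. This test is not part of the original analysis and is added only as an illustration; `fit` and `sw` are hypothetical names.

```r
# Refit the model under a hypothetical name so the example is self-contained
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution
sw <- shapiro.test(resid(fit))
print(sw)

# A small p-value (for example below 0.05) is evidence against normality
print(sw$p.value < 0.05)
```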
We conclude that speed alone is insufficient to explain the data as a predictor of stopping distance. Other factors may need to be considered to predict the stopping distance accurately.