Question:
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
Answer:
Now, let’s start replicating the analysis in the reading by taking a look at the data. (I like using ggplot, so the output will differ in aesthetics from the textbook’s.)
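A minimal sketch of that first look, assuming the ggplot2 package is installed. The object name `scatter` and the axis labels are illustrative choices; the units (mph and ft) come from the `cars` dataset documentation.

```r
library(ggplot2)

# `cars` ships with base R: 50 observations of speed (mph) and dist (ft)
scatter <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")

print(scatter)
```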
So, there does appear to be a positive relationship between speed and stopping distance. Now, let’s create a linear model to see what we can determine.
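The call shown in the output below can be reproduced with a sketch like this one. The object name `cars_lm` is an assumption, since the original code chunk is not shown.

```r
# Fit stopping distance as a linear function of speed
cars_lm <- lm(dist ~ speed, data = cars)

# Printing the model shows the call and the two fitted coefficients
print(cars_lm)
```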
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932
We can generate some summary statistics to determine how well the model fits the data.
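A sketch of how that summary could be generated. The names `cars_lm` and `cars_summary` are illustrative, and the model is refit here so the snippet runs on its own.

```r
cars_lm <- lm(dist ~ speed, data = cars)

# summary() adds residual quantiles, standard errors, t tests, R-squared and the F test
cars_summary <- summary(cars_lm)
print(cars_summary)
```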
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The residuals appear to be roughly balanced around zero; however, the maximum (43.2) has a larger magnitude than the minimum (-29.1). The standard error for the intercept is about 2.6 times smaller than the magnitude of its estimate, and the standard error for the speed coefficient is about 9.46 times smaller than its estimate. Both coefficients are also statistically significant: the intercept at the 0.05 level (\(p = 0.0123\), so \(0.01 \le p \le 0.05\)) and speed far beyond the 0.001 level (\(p = 1.49 \times 10^{-12}\)). The \({R}^{2}\) value of 0.651 means that the model explains 65.1 percent of the variation in stopping distance.
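The ratios and the \({R}^{2}\) value quoted above can be checked by hand. This is a small sketch using only base R; the variable names are illustrative.

```r
cars_lm <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(cars_lm))

# |estimate| / standard error reproduces the absolute t values in the summary
ratio <- abs(coef_table[, "Estimate"]) / coef_table[, "Std. Error"]

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(cars_lm)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - ss_res / ss_tot

print(round(ratio, 2))
print(round(r_squared, 3))
```

The ratios should match the absolute values in the `t value` column, and `r_squared` should match the `Multiple R-squared` line of the summary output.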
Now, we can plot the residuals against the fitted values.
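One way to draw this plot with ggplot2 is sketched below. The data frame `diag_df` and the dashed zero line are illustrative choices, not necessarily what produced the original figure.

```r
library(ggplot2)

cars_lm <- lm(dist ~ speed, data = cars)

# Collect fitted values and residuals in one data frame for plotting
diag_df <- data.frame(fitted = fitted(cars_lm), resid = residuals(cars_lm))

resid_plot <- ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +  # reference line at zero
  labs(x = "Fitted values", y = "Residuals")

print(resid_plot)
```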
The residuals in this plot appear to have roughly constant variability, although the spread may increase slightly as the fitted values increase. We should monitor that carefully, since it would suggest a more complex relationship between these variables. We should also view the distribution of the residuals.
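A sketch of one common way to view that distribution, assuming the plot meant here is a normal Q-Q plot of the residuals (the original figure is not shown, so this is a guess). `stat_qq()` and `stat_qq_line()` are ggplot2 functions.

```r
library(ggplot2)

cars_lm <- lm(dist ~ speed, data = cars)
resid_df <- data.frame(resid = residuals(cars_lm))

# Points near the reference line suggest approximately normal residuals
qq_plot <- ggplot(resid_df, aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical quantiles", y = "Sample quantiles")

print(qq_plot)
```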
From this plot we can see that the points follow a slightly convex curve (U-shaped, opening upward), so the residuals may have some right skew, but this appears to be minor. Here is a histogram of the residuals to confirm our findings.
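A minimal sketch of such a histogram follows. The choice of 10 bins is an assumption made for illustration, since the original binning is not shown.

```r
library(ggplot2)

cars_lm <- lm(dist ~ speed, data = cars)
resid_df <- data.frame(resid = residuals(cars_lm))

# A longer right tail in this histogram would be consistent with right skew
resid_hist <- ggplot(resid_df, aes(x = resid)) +
  geom_histogram(bins = 10) +
  labs(x = "Residuals", y = "Count")

print(resid_hist)
```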
Overall, given the steps provided in the reading, I would say the model does an okay job of predicting the response variable. It’s obviously not perfect, but there do not appear to be any extreme issues with the given predictor variable. As previously mentioned, we should keep an eye on the plot of residuals vs. fitted values to confirm that the variability stays constant.