Homework 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

To begin, I loaded the cars dataset and plotted the predictor (speed) against the response (stopping distance). The scatterplot shows a fairly strong, positive association between the two, which suggests that a linear regression is a reasonable starting model: as speed increases, stopping distance tends to increase as well.

library(ggplot2)  # provides ggplot(), geom_point() and geom_smooth()
data("cars")

ggplot(cars, aes(speed, dist)) +
  geom_point() +
  geom_smooth(method='lm')
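
To put a number on the visual impression of a "fairly strong, positive" association, one option is the Pearson correlation coefficient. This is a minimal sketch rather than part of the textbook analysis; the variable name `r` is just an illustrative choice.

```r
data("cars")

# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
r

# With a single predictor, the squared correlation equals the R-squared
# that a linear regression of dist on speed reports
r^2
```

A value of `r` close to 1 would support the reading of the scatterplot above.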

I fitted a simple linear regression to the data and displayed the summary and resulting equation below. The summary indicates the following:

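The residual quartiles are roughly symmetric about zero (1Q = -9.53, median = -2.27, 3Q = 9.22), which is consistent with approximately normal residuals and is a positive indication that the model is valid. The maximum (43.20) is farther from zero than the minimum (-29.07), which hints at a heavier upper tail.
The speed coefficient is positive (3.93) with a p-value of 1.49e-12. Because this is far below 0.01, the relationship between speed and distance is highly statistically significant. Note that the p-value measures the strength of the evidence for a nonzero slope, not the size of the effect.
The intercept (-17.58) is not physically meaningful, since stopping distance cannot be less than 0, but this is OK. It is a reminder that the regression should not be extrapolated far beyond the range of speeds observed in the data.
The \(R^2\) of \(\approx 65\%\) indicates that speed explains about 65% of the variance in stopping distance, a reasonably strong fit for a single predictor. The positive direction of the relationship comes from the sign of the slope, not from \(R^2\).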

cars.lm <- lm(dist ~ speed, data=cars)
summary(cars.lm)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
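
The figures quoted from the summary can also be pulled out programmatically instead of being read off the printout. The sketch below assumes the same model as above; the names `coefs`, `r2` and `ci` are illustrative only.

```r
data("cars")
cars.lm <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix: estimate, std. error, t value, p-value
coefs <- coef(summary(cars.lm))
coefs["speed", c("Estimate", "Pr(>|t|)")]

# Proportion of variance in dist explained by speed
r2 <- summary(cars.lm)$r.squared
r2

# 95% confidence intervals for the intercept and slope
ci <- confint(cars.lm, level = 0.95)
ci
```

The confidence interval for the slope is a useful complement to the p-value because it shows the plausible size of the effect, not only whether it differs from zero.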

\[\widehat{distance}=speed \times 3.93 - 17.58\]
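
As a small worked example of using this equation, the sketch below predicts stopping distance at a few speeds. The speeds 10, 15 and 20 are hypothetical values chosen for illustration (they fall inside the observed range of the data), and `new_speeds`, `pred` and `by_hand` are illustrative names.

```r
data("cars")
cars.lm <- lm(dist ~ speed, data = cars)

# Hypothetical speeds chosen for illustration only
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals
pred <- predict(cars.lm, newdata = new_speeds, interval = "prediction")
cbind(new_speeds, pred)

# The same point predictions computed by hand from the fitted coefficients
b <- coef(cars.lm)
by_hand <- unname(b["(Intercept)"] + b["speed"] * new_speeds$speed)
by_hand
```

The prediction intervals are wide relative to the point estimates, which reflects the residual standard error of 15.38 reported in the summary.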

Next, I plotted the residuals against the fitted values. Ideally, the points would be scattered randomly with no apparent pattern, roughly evenly above and below 0. That is broadly the case here, but the spread of the residuals also appears to increase as the fitted values increase. This is not ideal: it suggests non-constant variance (heteroscedasticity), which violates one of the assumptions of linear regression. The increase is mild, but it indicates that the model may have difficulty explaining some of the variation in stopping distance at higher speeds.

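ggplot() +
  geom_point(aes(fitted(cars.lm), resid(cars.lm))) +
  geom_hline(yintercept = 0, linetype = "dashed") +  # reference line at zero
  labs(x = "Fitted values", y = "Residuals")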
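
The visual impression of increasing spread can be backed by a rough numerical check. The sketch below is not part of the textbook analysis: it regresses the squared residuals on speed and uses n times the R-squared of that auxiliary regression as a test statistic, which is the studentized (Koenker) form of the Breusch-Pagan test. The names `sq_res`, `aux.lm`, `bp_stat` and `bp_pval` are illustrative.

```r
data("cars")
cars.lm <- lm(dist ~ speed, data = cars)

# Squared residuals act as a proxy for the error variance at each observation
sq_res <- resid(cars.lm)^2

# Auxiliary regression: does the squared residual depend on speed?
aux.lm <- lm(sq_res ~ speed, data = cars)

# Test statistic: n * R^2, compared against a chi-squared distribution
# with one degree of freedom (one predictor in the auxiliary regression)
n <- nrow(cars)
bp_stat <- n * summary(aux.lm)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_pval)
```

A small p-value would count as evidence against constant variance. With only 50 observations the test has limited power, so it is best read alongside the plot rather than in place of it.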

The Q-Q plot tells a similar story. Several of the largest residuals, in the upper tail, depart from the reference line, so the model has difficulty capturing the longest stopping distances. The Q-Q plot also suggests that a different model may be more appropriate. A thorough investigation would compare this model to an exponential model and a quadratic model.

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
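
A formal normality test can complement the Q-Q plot. The sketch below uses the Shapiro-Wilk test from base R. Choosing this particular test is an assumption on my part, not something the textbook chapter prescribes, and `sw` is an illustrative name.

```r
data("cars")
cars.lm <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(cars.lm))
sw

# Individual components if needed elsewhere
sw$statistic
sw$p.value
```

A small p-value would agree with the departure seen in the upper tail of the Q-Q plot. A large one would only mean that the test found no evidence against normality.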

Overall, this appears to be a valid simple linear regression, with the caveat that the variability of the residuals increases with the fitted values. We could perhaps address this issue by transforming either the predictor or the response variable, or by adding more predictors.
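
As a starting point for that follow-up, the sketch below fits the quadratic and exponential alternatives mentioned earlier and compares them with the linear model. This is one possible setup under my own assumptions: the "exponential" model is fitted as a linear regression on log(dist), and the model names `cars.quad` and `cars.exp` are illustrative.

```r
data("cars")
cars.lm   <- lm(dist ~ speed, data = cars)
cars.quad <- lm(dist ~ speed + I(speed^2), data = cars)  # quadratic in speed
cars.exp  <- lm(log(dist) ~ speed, data = cars)          # exponential via log response

# The linear model is nested in the quadratic one, so an F-test applies
anova(cars.lm, cars.quad)

# In-sample RMSE on the original distance scale, so all three are comparable.
# exp(fitted()) back-transforms the log model; this estimates the median rather
# than the mean of dist, so treat the comparison as rough.
rmse <- function(pred) sqrt(mean((cars$dist - pred)^2))
fit_rmse <- c(
  linear      = rmse(fitted(cars.lm)),
  quadratic   = rmse(fitted(cars.quad)),
  exponential = rmse(exp(fitted(cars.exp)))
)
fit_rmse
```

The R-squared of the log model is not directly comparable with the other two because its response is on a different scale, which is why the comparison here uses RMSE on the original scale. In-sample error always favors the more flexible model, so a fuller comparison would also repeat the residual and Q-Q checks on each candidate.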

Brian Weinfeld

November 5, 2018