DATA 605, HW 11: Linear Modeling

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

1. Visualization

The cars dataset has two variables: speed in mph and stopping distance (dist) in feet.

The cars seem to have been traveling fairly slowly when the data were collected. Speeds range from 4 to 25 mph, with a mean of 15.4 mph.

summary(cars$speed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    15.0    15.4    19.0    25.0

boxplot(cars$speed)

Stopping distance ranges from 2 to 120 ft, with a mean of 43 ft and a median of 36 ft. The gap between the mean and the median suggests that the distribution is right-skewed and pulled upward by large values, which we also see in the boxplot.

summary(cars$dist)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   26.00   36.00   42.98   56.00  120.00

boxplot(cars$dist)
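As a numeric complement to the boxplot, the sketch below lists the points that fall outside the whiskers and measures the gap between the mean and the median. It uses only base R; the variable names `dist_outliers` and `mean_median_gap` are illustrative and not part of the original analysis.

```r
# Points beyond 1.5 * IQR from the hinges, the same rule boxplot() uses
# to draw individual outlier points.
dist_outliers <- boxplot.stats(cars$dist)$out
dist_outliers

# A positive gap between mean and median is a rough sign of right skew.
mean_median_gap <- mean(cars$dist) - median(cars$dist)
mean_median_gap
```

The only flagged value is the maximum stopping distance of 120 ft, so the skew comes from one extreme point plus a generally longer upper tail.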

When we plot stopping distance against speed, we see a positive, fairly strong linear relationship. As speed increases, stopping distance also tends to increase.

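library(ggplot2)       # provides qplot(), ggplot(), geom_abline(), geom_qq()

x <- cars$speed        # car speed (mph)
y <- cars$dist         # stopping distance (ft)
cars_lm <- lm(y ~ x)   # simple linear model: stopping distance as a function of speed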

qplot(x, y, 
      ylab="Stopping Distance (ft)", xlab="Speed (mph)", 
      main="Cars Speed vs. Stopping Distance", ymin=-10) +
  geom_abline(intercept = cars_lm$coefficients[1], 
              slope = cars_lm$coefficients[2])
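The strength of the relationship seen in the scatterplot can be quantified with the Pearson correlation coefficient. The following is a minimal sketch in base R; the names `r` and `fit` are illustrative, and `fit` refits the same model using the column names directly so that the block runs on its own.

```r
# Pearson correlation between speed and stopping distance.
r <- cor(cars$speed, cars$dist)
r

# For a linear model with a single predictor, the squared correlation
# equals the model's R-squared.
fit <- lm(dist ~ speed, data = cars)
all.equal(r^2, summary(fit)$r.squared)
```

A correlation of about 0.81 is consistent with the clear but imperfect upward trend in the plot.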

2. Quality Analysis

We can evaluate the quality of the linear model by examining the summary output.

Intercept: \(-17.5791\). According to the model, a car going 0 mph would have a stopping distance of about \(-18\) ft. This is impossible, and 0 mph lies outside the observed speed range (4 to 25 mph), so the intercept is not meaningful on its own for this model.
Slope: \(3.9324\). For every 1 mph increase in a car’s speed, the model suggests that its stopping distance increases by about 4 feet.
Standard error: \(6.7584\) (intercept) and \(0.4155\) (slope). The slope estimate is about 9.5 times its standard error, so it is estimated precisely. The intercept is only about 2.6 times its standard error in magnitude, so it is estimated with considerably more uncertainty.
P-value: \(1.49 \times 10^{-12}\) for the slope (three significance stars) and \(0.0123\) for the intercept (one star, significant at the 0.05 level but not at the 0.01 level). The very small p-value for the slope means that a slope this far from zero would be extremely unlikely if speed had no linear relationship with stopping distance, so speed is a relevant predictor in the model.
R-squared: \(0.6511\); adjusted \(R^2\): \(0.6438\). This means that speed explains about 65% of the variation in stopping distance, so other variables may be needed to explain the remaining variation.
Degrees of freedom: \(48\). There were \(50\) observations used to generate the model, and two coefficients (intercept and slope) were estimated, leaving \(50 - 2 = 48\) residual degrees of freedom. This is a small dataset.

summary(cars_lm)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## x             3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
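Several of the figures in the summary output can be reproduced by hand, which helps confirm how they are read. The sketch below is illustrative rather than part of the original analysis: it refits the same model as `dist ~ speed` so that the block is self-contained, and the names `fit`, `t_manual`, `df_manual`, `ci`, and `pred_20` are assumptions introduced here.

```r
# Same model as cars_lm, refit with the column names.
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

# Each t value is the estimate divided by its standard error.
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
t_manual

# Residual degrees of freedom = observations - estimated coefficients.
n <- nrow(cars)
df_manual <- n - length(coef(fit))
c(n = n, df = df_manual)

# 95% confidence intervals for the intercept and slope.
ci <- confint(fit, level = 0.95)
ci

# Predicted stopping distance (ft) for a car traveling 20 mph.
pred_20 <- predict(fit, newdata = data.frame(speed = 20))
pred_20
```

The prediction at 20 mph is the intercept plus 20 times the slope, roughly 61 ft, which falls within the range of observed stopping distances.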

3. Residuals Analysis

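When we plot the residuals against the fitted values, more points fall below zero than above zero, which matches the negative median residual (\(-2.27\)) in the summary output. The model therefore overestimates stopping distance for most cars, while a smaller number of large positive residuals (up to 43.2 ft) mark cars whose stopping distance it substantially underestimates. The Q-Q plot provides a check on whether the residuals are approximately normally distributed.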

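library(gridExtra)   # provides grid.arrange()

# Fitted values go on the x-axis and residuals on the y-axis.
plot4 <- qplot(cars_lm$fitted.values, cars_lm$residuals, xlab="Fitted Values", ylab="Residuals")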

plot5 <- ggplot() + geom_qq(aes(sample = cars_lm$residuals))

grid.arrange(plot4, plot5, ncol=1, nrow=2)
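The visual impressions from the two plots can be backed up with a few numbers. The following is a rough sketch in base R, not part of the original analysis. The names `fit`, `res`, `n_below`, and `n_above` are illustrative, and the Shapiro-Wilk test is one common choice of normality test that is assumed here, not one the source uses.

```r
# Same model as cars_lm, refit so the block runs on its own.
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Count residuals on each side of zero.
n_below <- sum(res < 0)
n_above <- sum(res > 0)
c(below = n_below, above = n_above)

# Least squares with an intercept forces the residuals to sum to
# (numerically) zero, so the many small negative residuals are offset
# by fewer but larger positive ones.
sum(res)

# Shapiro-Wilk normality test as a numeric complement to the Q-Q plot.
shapiro.test(res)
```

Because the residuals always sum to zero, the model is not biased on average. The imbalance in counts reflects a right-skewed residual distribution.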

4. Summary

Speed appears to be a fairly good predictor of stopping distance, which makes intuitive sense. However, we could improve our model by collecting more observations and accounting for other important factors, like the friction of the driving surface.