Load Packages:

Below, we load the packages required for data analysis and visualization.

library(tidyverse)

Question:

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Answer:

Load Data:

Below, we load the “cars” dataset.

data(cars)

Plot Data:

Next we plot the data.

ggplot(cars, aes(x=speed, y=dist)) + 
    geom_point(color = "darkblue") +
    labs(title = "Stopping Distance as a Function of Speed",
         x = "Speed", y = "Stopping Distance") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"))

The relationship appears roughly linear: as speed increases, stopping distance tends to increase as well. So we proceed to fit a linear model.
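As a quick numeric companion to the visual impression above, the Pearson correlation between the two variables can be computed in base R. This is only a rough sketch: the variable name `speed_dist_cor` is introduced here for illustration, and a correlation coefficient summarizes linear association without proving that a straight line is the right functional form.

```r
# Load the built-in dataset (speed in mph, dist in ft).
data(cars)

# Pearson correlation between speed and stopping distance.
# Values near +1 indicate a strong positive linear association.
speed_dist_cor <- cor(cars$speed, cars$dist)
speed_dist_cor
```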

Fit Linear Model:

cars_lm <- lm(dist ~ speed, data = cars)
cars_lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

The y-intercept is -17.579, and the slope is 3.932. So we write the regression model as:

\(y = \text{dist} = f(x)\), where \(x = \text{speed}\)

\(\hat{y} = -17.579 + 3.932x\)
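To make the fitted equation concrete, the sketch below evaluates it at one speed, first by hand from the coefficients and then with `predict()`. The speed of 15 is a hypothetical value chosen only for illustration, and the names `new_speed`, `by_hand`, and `by_predict` are not part of the original analysis. With the rounded coefficients, the equation gives about 41.4 for this speed.

```r
# Refit the model so this block runs on its own.
data(cars)
cars_lm <- lm(dist ~ speed, data = cars)

# Hypothetical speed used only as an example input.
new_speed <- 15

# Prediction by hand: intercept + slope * speed.
b <- coef(cars_lm)
by_hand <- unname(b["(Intercept)"] + b["speed"] * new_speed)

# The same prediction using predict() with a one-row data frame.
by_predict <- unname(predict(cars_lm, newdata = data.frame(speed = new_speed)))

c(by_hand = by_hand, by_predict = by_predict)
```

Both values should agree, since `predict()` applies the same linear equation to the new data.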

Replot Data with Linear Model:

We then replot the original data with the fitted line.

ggplot(cars, aes(x=speed, y=dist)) + 
    geom_point(color="darkblue") +
    geom_smooth(linewidth = 1, method = lm, se = FALSE, color="lightblue") + 
    labs(title = "Stopping Distance as a Function of Speed",
         x = "Speed", y = "Stopping Distance") + 
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"))
## `geom_smooth()` using formula = 'y ~ x'

Evaluate Model Quality:

Now we can start to evaluate the model quality.

summary(cars_lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

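The median residual (-2.272) is near zero, and the first and third quartiles (-9.525 and 9.215) are of about the same magnitude. The minimum and maximum residuals (-29.069 and 43.201) are only roughly comparable in magnitude, with the largest positive residual noticeably further from zero than the largest negative one.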

The standard error for speed (0.4155) is about 9.5 times smaller than its coefficient estimate (3.9324), which is good; this ratio is the t value reported in the summary. The p-value for the coefficient (1.49e-12) is very small, so a slope this far from zero would be very unlikely if there were no linear relationship at all. The \(R^2\) value of 0.6511 indicates that about 65.11% of the variability in stopping distance is explained by speed.

So far, all of this is indicative of a good model.
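The figures quoted from the summary can be reproduced directly from the fitted model, which helps show where they come from. The sketch below is one way to do that in base R; the names `coefs`, `t_speed`, `rss`, `tss`, `r_squared`, and `ci_speed` are introduced here for illustration, and the 95% confidence level is an assumed choice rather than something specified in the original analysis.

```r
# Refit the model so this block runs on its own.
data(cars)
cars_lm <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix (Estimate, Std. Error, t value, Pr(>|t|)).
coefs <- coef(summary(cars_lm))

# t value for speed: the estimate divided by its standard error.
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# R-squared: 1 minus the ratio of residual to total sum of squares.
rss <- sum(resid(cars_lm)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - rss / tss

# Confidence interval for the slope (95% level assumed here).
ci_speed <- confint(cars_lm, "speed", level = 0.95)

t_speed
r_squared
ci_speed
```

An interval for the slope that excludes zero tells the same story as the small p-value, while also giving a sense of how precisely the slope is estimated.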

Residual Analysis, Part 1:

ggplot(cars_lm, aes(x = .fitted, y = .resid)) +
    geom_point(color="darkblue") +
    geom_hline(yintercept = 0, linewidth = 1, color = "lightblue") +
    labs(title = "Residual vs. Fitted Values",
         x = "Fitted Values", y= "Residuals") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"))

Plotting the residuals against the fitted values shows that the residuals are scattered fairly evenly above and below zero. There are no obvious patterns here that would lead us to reconsider the model.
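A visual check like this can be complemented with a rough numeric one. The sketch below is an informal heuristic and not a formal test: it confirms that the residuals average to zero (which holds by construction for a least-squares fit with an intercept) and compares the spread of the residuals for lower and higher fitted values. The split at the median fitted value and the names `res`, `fit`, `upper_half`, `sd_lower`, and `sd_upper` are assumptions made for illustration.

```r
# Refit the model so this block runs on its own.
data(cars)
cars_lm <- lm(dist ~ speed, data = cars)

res <- resid(cars_lm)
fit <- fitted(cars_lm)

# Mean residual: zero up to floating-point error for OLS with an intercept.
mean_res <- mean(res)

# Split observations at the median fitted value (an arbitrary cut point)
# and compare the residual standard deviation in each group.
upper_half <- fit > median(fit)
sd_lower <- sd(res[!upper_half])  # fitted values at or below the median
sd_upper <- sd(res[upper_half])   # fitted values above the median

c(mean_res = mean_res, sd_lower = sd_lower, sd_upper = sd_upper)
```

If the two standard deviations differ substantially, that would hint at non-constant variance, which the plot alone may not make obvious.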

Residual Analysis, Part 2:

qqnorm(resid(cars_lm))
qqline(resid(cars_lm))

Lastly, we use a quantile-quantile (Q-Q) plot to check whether the residuals from our model are normally distributed. We see that both tails of the residuals deviate from the straight line they would follow if the residuals were normally distributed. This suggests that the normality assumption is not fully met, so a simple linear model in speed alone doesn’t fully explain stopping distance, and we could still benefit from a better model even though all the earlier indicators looked good.
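The reading of the Q-Q plot can be supplemented with a formal normality test. The sketch below uses the Shapiro-Wilk test from base R's `stats` package; this test is not part of the original analysis and is offered only as one possible cross-check, with the name `sw` introduced for illustration. A small p-value would be evidence against normally distributed residuals.

```r
# Refit the model so this block runs on its own.
data(cars)
cars_lm <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal.
sw <- shapiro.test(resid(cars_lm))
sw
```

With only 50 observations, the test and the plot are best read together rather than treating either as decisive.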