Regression Analysis in R

Introduction

The cars dataset contains 50 observations of the speeds (mph) of various cars along with their stopping distances (ft). Because the data were gathered in the 1920s, the regression parameters for these data will not extrapolate to cars currently on the road. However, if the relationship between stopping distance and speed is linear for these century-old cars, then there’s reason to believe that the relationship between speed and stopping distance for cars today may also be linear. That’s because the basic design of brakes has not changed fundamentally: car brakes still act by generating friction against a rotating wheel.

Preliminary analysis

Before proceeding with the regression, let’s perform a “sanity check” on the data. Are there any values that don’t make sense? How many values are missing?

library(dplyr)
filter(cars, is.na(speed) | is.na(dist))

## [1] speed dist 
## <0 rows> (or 0-length row.names)

There are no missing values in the data set.
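As a cross-check that does not depend on dplyr, the same question can be answered in base R. This is a minimal sketch; the variable name `na_counts` is introduced here for illustration only.

```r
# Count missing values in each column of the built-in cars data frame
na_counts <- colSums(is.na(cars))
na_counts

# TRUE if any value anywhere in the data frame is missing
anyNA(cars)
```

A count of zero for both columns, together with `anyNA(cars)` returning `FALSE`, agrees with the empty result from `filter()`.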

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

The values in the column speed are reasonable for cars of the 1920s. The maximum value in dist of 120 feet is more than twice the value of Q3 and is somewhat suspect, both for statistical reasons and based on my knowledge of cars. Perhaps this row represents a car with malfunctioning brakes. What speed is associated with the maximum stopping distance?

filter(cars, dist == 120)

##   speed dist
## 1    24  120

It’s somewhat reassuring that this maximum stopping distance is associated with a speed near the maximum. There are no other suspicious values.
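One common way to make the "statistical reasons" concrete is the 1.5 × IQR rule of thumb, which is an assumption added here rather than something the analysis depends on. With the quartiles reported by `summary(cars)` (26 and 56), the upper fence is 56 + 1.5 × 30 = 101 ft, so a stopping distance of 120 ft lies beyond it. The names `upper_fence` and `flagged` are illustrative.

```r
# Quartiles and interquartile range of stopping distance
q <- quantile(cars$dist, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Upper fence under the 1.5 * IQR convention (a rule of thumb, not a formal test)
upper_fence <- unname(q[2]) + 1.5 * iqr
upper_fence

# Rows whose stopping distance lies above the fence
flagged <- cars[cars$dist > upper_fence, ]
flagged
```

A point flagged this way is only a candidate for closer inspection; the rule says nothing about whether the measurement is wrong.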

We can visually examine a scatterplot of the data to determine whether it exhibits a roughly linear shape.

library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  stat_smooth(method = 'lm', se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

The data appear roughly linear. Other conditions for linear regression seem plausible: the observations are independent, since we can assume they come from different cars. The variability of the data seems roughly constant across the graph, although there may be a slight increase in variability as speed increases. We’ll examine later whether the residuals are approximately normally distributed.
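The visual impression of linearity can be supplemented with a single number. The sketch below computes the Pearson correlation between the two columns. The name `r` is illustrative, and a high correlation alone does not establish that a straight line is the right shape.

```r
# Pearson correlation between speed (mph) and stopping distance (ft)
r <- cor(cars$speed, cars$dist)
r

# In a simple linear regression with one predictor,
# the square of r equals the model's R-squared
r^2
```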

The linear model

The linear model for this data appears below.

cars.lm <- lm(cars$dist ~ cars$speed)
cars.lm

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932

The linear model fit to this data is $\widehat{dist} = 3.932 \times speed - 17.579$.

These data can help answer the question, “Given that a car is traveling at a certain speed, what distance will it need to stop?” Here dist is the response variable, and speed is the explanatory variable, so the model is specified as lm(cars$dist ~ cars$speed). Based on the model, a 1 mph difference in speed is associated with a 3.932 ft difference in stopping distance. Since the slope is positive, the two variables increase or decrease together: faster cars require greater distance to stop. The intercept is the stopping distance the model predicts for a car traveling at 0 mph. Since a car moving at 0 mph is already stopped (and the predicted value, -17.579 ft, is negative), the intercept in this model is not meaningful.
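To see the fitted equation used for prediction, here is a minimal sketch. It refits the same model with the `data =` form of the formula, which is a change from the call above, made so that `predict()` can match the `speed` column in `newdata`. The speeds 10, 15, and 20 mph are hypothetical values chosen for illustration, and the names `fit`, `new_speeds`, and `predicted` are not from the original analysis.

```r
# Same model as above, written with a data argument so that predict()
# can look up the 'speed' column in newdata
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph), all within the observed range of 4 to 25 mph
new_speeds <- data.frame(speed = c(10, 15, 20))

# Predicted stopping distances (ft) from the fitted line
predicted <- predict(fit, newdata = new_speeds)
predicted
```

By the fitted equation, a car traveling at 15 mph has a predicted stopping distance of 3.932 × 15 - 17.579 ≈ 41.4 ft.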

Evaluating the linear model

Summary statistics for the linear model appear below.

summary(cars.lm)

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

If the model is successful, then it should account for all the variation in dist that can be explained by variation in speed. That means the residuals should be random noise centered at zero, and they should not reflect some other pattern not accounted for by the model. For a least-squares fit with an intercept, the mean of the residuals is zero by construction, so the median is the more informative check: at -2.272 it is close to zero relative to the residual standard error of 15.38. The quartiles are roughly symmetric (1Q = -9.525, 3Q = 9.215), but the extremes are not (Min = -29.069, Max = 43.201), which points to a longer right tail. On balance, the model is a reasonable representation of the data, with some right skew in the residuals.

The summary statistics for coefficients show that the estimates for both the intercept, $b_{0}$, and the slope, $b_{1}$, are statistically significant at the 0.05 level: if the true value of either parameter were zero, estimates this far from zero would be unlikely. This is indicated by the last column, Pr(>|t|), which contains the p-values for both parameters (0.0123 for the intercept and 1.49e-12 for the slope). The standard error of the slope (0.4155) is small relative to its estimate (3.9324), so the slope is estimated fairly precisely. The multiple R-squared indicates that about 65% of the variation in stopping distance is explained by speed.
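Confidence intervals give a more direct picture of how precisely the coefficients are estimated than p-values alone. The sketch below uses `confint()` from base R on a refit of the same model. The names `fit` and `ci` are illustrative, and the intervals rely on the usual normality assumption for the residuals.

```r
# Refit the same model using the data argument
fit <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
ci
```

An interval for the slope that lies entirely above zero is consistent with the very small p-value in the summary table.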

What is the effect of the outlier discussed in the preliminary analysis? Let’s examine the summary statistics for the linear model with the outlier removed.

cars2 <- cars %>%
  filter(dist < 120)

cars2.lm <- lm(cars2$dist ~ cars2$speed)

summary(cars2.lm)

## 
## Call:
## lm(formula = cars2$dist ~ cars2$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.789  -9.149  -1.672   8.013  43.048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.0021     6.2951  -2.224    0.031 *  
## cars2$speed   3.6396     0.3918   9.290 3.26e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.1 on 47 degrees of freedom
## Multiple R-squared:  0.6474, Adjusted R-squared:  0.6399 
## F-statistic: 86.31 on 1 and 47 DF,  p-value: 3.262e-12

The p-values for both parameters of the linear model for cars2 are greater than those for cars, and the R-squared values are slightly lower. These results are surprising to me. I had reasoned that since the outlier resulted in a large residual, removing it would mean the model could explain more of the variability in dist. Instead, it seems to explain less. Why is this?
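One possible explanation, offered tentatively: R-squared equals 1 - SSR/SST, where SSR is the residual sum of squares and SST is the total sum of squares of dist around its mean. Removing the 120 ft observation shrinks SSR, but because that point also lies far from the mean stopping distance, it shrinks SST as well, and R-squared depends on the ratio of the two. The sketch below computes both pieces for each data set. The helper `r2_parts` and the names `full` and `trimmed` are hypothetical, and `subset()` is used in place of the dplyr pipeline so that the block runs in base R.

```r
# Return the components of R-squared = 1 - SSR / SST for a data frame
# that has 'dist' and 'speed' columns
r2_parts <- function(d) {
  fit <- lm(dist ~ speed, data = d)
  ssr <- sum(resid(fit)^2)               # residual sum of squares
  sst <- sum((d$dist - mean(d$dist))^2)  # total sum of squares
  c(SSR = ssr, SST = sst, R2 = 1 - ssr / sst)
}

full    <- r2_parts(cars)                       # all 50 observations
trimmed <- r2_parts(subset(cars, dist < 120))   # outlier removed

rbind(full, trimmed)
```

If both SSR and SST fall but SST falls proportionally more, R-squared decreases even though the residual standard error improves (15.38 versus 14.1 in the summaries above).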

Residual analysis

We can plot the residuals to examine them further.

ggplot(data = cars.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed', color = 'blue') +
  xlab("Fitted values") +
  ylab("Residuals")

Based on the scatterplot, the residuals do appear to have a mean near 0. The residuals show somewhat greater variability at greater fitted values, which is consistent with the slight increase in spread noted in the preliminary analysis. Apart from this, there is no evident pattern (such as curvature) in the residuals, which suggests that the model has captured the systematic relationship between dist and speed.
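The impression that the residuals spread out as the fitted values grow can be checked informally with a rank correlation between the absolute residuals and the fitted values. This is a rough sketch of one possible check, not a formal heteroscedasticity test such as Breusch-Pagan. The names `fit` and `spread_check` are illustrative.

```r
# Refit the same model using the data argument
fit <- lm(dist ~ speed, data = cars)

# Spearman rank correlation between the size of each residual and its
# fitted value; exact = FALSE requests the approximate p-value, since an
# exact one cannot be computed when the data contain ties
spread_check <- cor.test(abs(resid(fit)), fitted(fit),
                         method = "spearman", exact = FALSE)
spread_check$estimate
spread_check$p.value
```

A clearly positive correlation would support the visual impression of increasing spread; a value near zero would suggest the variance is roughly constant.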

A Q-Q plot can help us assess the normality of the residuals.

ggplot(data = cars.lm, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()

The Q-Q plot shows that, while the residuals are very nearly normal near the middle of their distribution, they exhibit some deviation from normality in the tails. Note that these are the tails of the residual distribution (the largest positive and negative residuals), not the lowest and highest speeds. This suggests that the point estimates from the model remain usable, but that we should treat p-values, confidence intervals, and prediction intervals that rely on normality with some reservations.
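A numerical companion to the Q-Q plot is the Shapiro-Wilk test, available in base R as `shapiro.test()`. The sketch below applies it to the residuals of a refit of the same model. The names `fit` and `sw` are illustrative, and with only 50 observations the test should be read alongside the plot rather than in place of it.

```r
# Refit the same model using the data argument
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test of the null hypothesis that the residuals are
# drawn from a normal distribution
sw <- shapiro.test(resid(fit))
sw
```

A small p-value would indicate a departure from normality, in line with the deviation seen in the tails of the Q-Q plot.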

Conclusion

The linear model fit to this data set does a reasonably good job of accounting for changes in stopping distance based on changes in speed, explaining about 65% of the variation. While there is some deviation from normality in the residuals, this deviation occurs in the tails of the residual distribution, so inferences that rely on normality should be treated with some caution. Because the data include no speeds above 25 mph, the model should not be used to extrapolate stopping distances for 1920s cars traveling faster than that.