Overview

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

We will perform our analysis using the lm function from base R.

Data Exploration

The cars data frame has two columns, speed and dist (stopping distance). We’ll use dist as the response and speed as the predictor in the linear model function.
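Before fitting anything, it can help to confirm the column names and ranges directly. The following is a minimal sketch using only base R functions; nothing here is specific to the textbook’s method.

```r
# Inspect the built-in cars data before modelling
data(cars)
str(cars)      # structure: number of observations and column types
head(cars)     # first six rows
summary(cars)  # minimum, quartiles, mean and maximum of each column
```

The column names reported by `str(cars)` are the ones that must appear in the model formula.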

data(cars)

model <- lm(dist ~ speed, data=cars)

plot(cars)
abline(model)

The scatter plot shows a general positive correlation between a car’s speed and its stopping distance. The line shows the model’s predicted stopping distance at each speed. For low to moderate speeds (5 to 20), the measured values hover fairly close to the model’s prediction. However, the spread widens as speed increases, and near the top speed of 25 the stopping distances vary considerably more.
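To put a number on the visual impression of a positive relationship, one option is the Pearson correlation coefficient. This is a small sketch with base R; the variable name `r` is only illustrative.

```r
# Quantify the linear association seen in the scatter plot
data(cars)
range(cars$speed)                 # observed speed range
r <- cor(cars$speed, cars$dist)   # Pearson correlation coefficient
r                                 # values near +1 indicate a strong positive linear relationship
```

A correlation coefficient only measures linear association; it says nothing about whether the scatter around the line is constant across speeds.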

Judge the model’s usefulness by examining its statistics

summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Residuals

Min      1Q  Median      3Q     Max 
 -29.069  -9.525  -2.272   9.215  43.201 

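Fortunately, the median value (-2.272) is not too far from 0, and the 1Q and 3Q values (-9.525 and 9.215) are nearly symmetric, suggesting the model has some viability. The min and max values (-29.069 and 43.201) are less balanced: the largest positive residual is about 1.5 times the size of the largest negative one, which hints at a longer right tail. Symmetry alone would not establish that the residuals are normally distributed, but there are no signs here that this model can’t be used to relate speed and stopping distance, so we can continue.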
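The five numbers in the Residuals block can be reproduced directly from the fitted model, which makes it easier to compare them. A minimal sketch, refitting the same model so the block runs on its own; `res` is an illustrative name.

```r
# Reproduce the residual summary from the fitted model
model <- lm(dist ~ speed, data = cars)
res <- resid(model)

quantile(res)          # Min, 1Q, Median, 3Q, Max, as printed by summary(model)
mean(res)              # essentially zero for a least-squares fit with an intercept
abs(max(res) / min(res))  # ratio of largest positive to largest negative residual
```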

Coefficients:

             Estimate Std. Error t value Pr(>|t|)    
 (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
 speed         3.9324     0.4155   9.464 1.49e-12 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the model, the y-intercept is -17.58, which doesn’t make physical sense since a car at speed = 0 should also have a stopping distance of 0. The data contain no speeds below 4, so the intercept is an extrapolation outside the observed range. The slope of this model is 3.93, indicating that for every increase of one speed unit, the predicted stopping distance increases by 3.93 units. The standard error for the intercept is 6.76, which is only about 2.6 times smaller than the magnitude of the intercept itself (a standard error that is 5 to 10 times smaller than its coefficient is a sign of a better-determined estimate). The speed’s standard error is closer to ideal, as it’s 9.46 times smaller than the 3.93 coefficient. So the intercept is estimated with considerable uncertainty, while the slope is estimated fairly precisely.
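The ratio of each coefficient to its standard error is what the summary reports in the t value column. The sketch below recomputes it and, as an additional view not used in the original analysis, prints 95% confidence intervals with `confint`. The names `coefs` and `ratio` are illustrative.

```r
# Compare each estimate with its standard error
model <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(model))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

ratio <- coefs[, "Estimate"] / coefs[, "Std. Error"]
ratio                           # matches the "t value" column

confint(model, level = 0.95)    # 95% confidence intervals for both coefficients
```

A wide interval for the intercept and a comparatively narrow one for speed would tell the same story as the ratios.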

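The Pr(>|t|) column gives, for each coefficient, the probability of seeing a t value at least this extreme if the true coefficient were zero. For the y-intercept this p-value is 0.0123, which is significant at the 0.05 level but not at 0.01. In contrast, the p-value for speed is extremely low (1.49e-12), which is strong evidence that speed is relevant in this model. The asterisks give a quick visual summary of each coefficient’s significance level.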
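The p-values can be recomputed from the t values and the residual degrees of freedom, which makes their meaning concrete: they are two-sided tail probabilities of a t distribution. A minimal sketch; `p_manual` is an illustrative name.

```r
# Recompute the two-sided p-values from the t statistics
model <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(model))
df <- df.residual(model)        # observations minus estimated coefficients

p_manual <- 2 * pt(-abs(coefs[, "t value"]), df = df)
cbind(reported = coefs[, "Pr(>|t|)"], manual = p_manual)
```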

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

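For the last summary statistics, we start with the residual standard error, 15.38, which measures the typical size of a residual; the 48 degrees of freedom are the 50 observations minus the 2 estimated coefficients. The rule of thumb applied here is that, for normally distributed residuals, the 1Q and 3Q values should be about 1.5 times this standard error. Since they are around -9.5 and 9.2, far from 1.5 × 15.38 ≈ 23.07, by that rule they don’t suggest a normal distribution of residuals. The rule is rough, though: the quartiles of a normal distribution sit at about ±0.674 standard deviations, or about ±10.4 here, which is close to the observed values, so the quartiles alone don’t settle the question.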
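The residual standard error and the quartile comparison can be checked by hand. This sketch computes the standard error from the residuals and then asks what quartiles a normal distribution with that standard deviation would have; using `qnorm` for this comparison is an assumption of the sketch, not a step taken from the textbook.

```r
# Residual standard error and a normal-quartile comparison
model <- lm(dist ~ speed, data = cars)
res <- resid(model)

rse <- sqrt(sum(res^2) / df.residual(model))  # residual standard error
rse
sigma(model)                                  # same quantity, extracted from the fit

qnorm(c(0.25, 0.75), mean = 0, sd = rse)      # quartiles of a normal with this sd
quantile(res, c(0.25, 0.75))                  # observed 1Q and 3Q
```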

The multiple R-squared statistic shows this model explains about 65% of the variance in stopping distance.
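R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it that way; `ss_res`, `ss_tot` and `r2` are illustrative names.

```r
# Multiple R-squared computed from sums of squares
model <- lm(dist ~ speed, data = cars)

ss_res <- sum(resid(model)^2)                   # variation left unexplained by the model
ss_tot <- sum((cars$dist - mean(cars$dist))^2)  # total variation in stopping distance
r2 <- 1 - ss_res / ss_tot
r2
summary(model)$r.squared                        # value reported by summary(model)
```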

Residual Analysis

To further assist in evaluating this model, we plot the residuals against the fitted (predicted) values, with a horizontal line at zero for reference.

plot(fitted(model), resid(model))
abline(0,0)

We notice the spread of the residuals increasing from left to right. Again, this is another confirmation that this model can account for some but not all of the variation in the measured values, and that its predictions are less reliable at higher speeds.
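A crude numeric companion to the plot is to compare the residual spread for the lower and upper halves of the fitted values. This is only an informal check under the assumption that a median split is a reasonable cut; it is not a formal test of constant variance, and the names `fit`, `res` and `low` are illustrative.

```r
# Informal check: is the residual spread larger at higher fitted values?
model <- lm(dist ~ speed, data = cars)
fit <- fitted(model)
res <- resid(model)

low <- fit <= median(fit)   # TRUE for the lower half of fitted values
sd_low  <- sd(res[low])     # spread of residuals at lower predicted distances
sd_high <- sd(res[!low])    # spread of residuals at higher predicted distances
c(low = sd_low, high = sd_high)
```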

A QQ plot offers another perspective: it shows the residuals following the reference line until a certain point, then diverging from it at the upper right.

qqnorm(resid(model))
qqline(resid(model))
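The visual impression from the QQ plot can be supplemented with a formal normality test. The Shapiro-Wilk test from base R is one option; it is not part of the original analysis and is offered here only as a sketch.

```r
# Shapiro-Wilk test of normality on the residuals
model <- lm(dist ~ speed, data = cars)
sw <- shapiro.test(resid(model))
sw            # prints the W statistic and its p-value
sw$p.value    # a small p-value is evidence against normally distributed residuals
```

With only 50 observations the test has limited power, so it is best read alongside the QQ plot rather than in place of it.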