Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis in chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

# get dataset
data(cars)

# plot relationship
plot(cars[, "speed"], cars[, "dist"], main = "Stopping Distance vs. Speed",
     xlab = "Speed", ylab = "Stopping Distance")

This shows that dist tends to increase as speed increases. If we superimpose a straight line on the plot, the relationship between the predictor (speed) and the dependent variable (dist) looks roughly linear. It is not perfectly linear, however: once speed reaches the 15 to 20 range, the values spread out more widely.
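To put a number on how close to linear the relationship looks, one option is the Pearson correlation between the two columns. This is only a minimal sketch; the variable name `r` is chosen here for illustration.

```r
# Pearson correlation between speed and stopping distance
data(cars)
r <- cor(cars$speed, cars$dist)

r    # values near 1 indicate a strong positive linear association
r^2  # with a single predictor, this equals the R-squared of lm(dist ~ speed)
```

A high correlation supports fitting a straight line, but it says nothing about whether the spread around that line is constant.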

Let’s create the model!

# define model
model <- lm(dist ~ speed, data = cars)

# plot model
plot(dist ~ speed, data=cars, 
     main="Simple Linear Model: stopping distance as a function of Speed",
     xlab="Speed", ylab="Stopping Distance")
abline(model) # add line

Plotting the simple linear regression (SLR) with its fitted line accentuates the spread in values, even more so once speed reaches the 20 to 25 range.
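One rough way to check the visual impression of growing spread is to compare the standard deviation of the residuals across speed bands. The break points below are an arbitrary assumption made for this sketch, and `speed_band` and `spread` are illustrative names.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)

# Group observations into speed bands (break points are an arbitrary choice)
speed_band <- cut(cars$speed, breaks = c(0, 10, 15, 20, 25))

# Standard deviation of the residuals within each band
spread <- tapply(resid(model), speed_band, sd)
spread
```

If the spread really does grow with speed, the values for the higher bands should tend to be larger than those for the lower bands.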

Model statistics

Let’s use the summary() function to generate model statistics.

summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Here the summary() function first echoes the call to lm(), so we can see exactly which model these statistics pertain to.
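The individual numbers in the printed summary can also be pulled out of the summary object, which is handy when they are needed in later calculations. A small sketch, with `s` as an illustrative variable name:

```r
data(cars)
model <- lm(dist ~ speed, data = cars)
s <- summary(model)

s$r.squared         # Multiple R-squared
s$adj.r.squared     # Adjusted R-squared
s$sigma             # residual standard error
df.residual(model)  # residual degrees of freedom
```

These should match the last three lines of the printed summary above, up to rounding.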

Residuals

summary(summary(model)$residuals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -29.069  -9.525  -2.272   0.000   9.215  43.201

This shows the residuals, which are the differences between the actual values and the fitted values on the regression line. Min is the smallest residual and Max is the largest, which together give us a good idea of the spread of the residuals. We also see the quartiles and the median. A good model should have a median of around 0 and min and max values of roughly the same magnitude.

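Here the median (-2.272) is not too far from zero, which suggests a decent model. The maximum (43.201) is noticeably larger in magnitude than the minimum (-29.069), though, so the residuals are not quite symmetric.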
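To make the definition concrete, the residuals can be recomputed by hand as actual minus fitted values and compared with what resid() returns. This is a sketch; `manual_resid` is an illustrative name.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)

# Residual = actual stopping distance minus fitted stopping distance
manual_resid <- cars$dist - fitted(model)
all.equal(unname(manual_resid), unname(resid(model)))

# With an intercept in the model, least-squares residuals average to zero
mean(resid(model))

# Ratio of largest to smallest residual in magnitude (1 would mean symmetric extremes)
max(resid(model)) / abs(min(resid(model)))
```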

Coefficients

summary(model)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -17.579095  6.7584402 -2.601058 1.231882e-02
## speed         3.932409  0.4155128  9.463990 1.489836e-12

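This shows the estimated coefficient values. The Std. Error column shows the standard error of each coefficient estimate. A good model will typically have a standard error that is at least five to ten times smaller than the corresponding coefficient. For example, the ratio of the speed coefficient to its standard error is \[3.932409/0.4155128 = 9.46399004\] so the standard error is about 9.5 times smaller than the coefficient, which suggests relatively little variability in the slope estimate. For the intercept the ratio is only about 2.6 (17.579095/6.7584402), which suggests the intercept estimate is considerably more variable.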
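The coefficient-to-standard-error ratio can be computed for every row of the coefficient table at once. A minimal sketch, with `coefs` and `ratio` as illustrative names:

```r
data(cars)
model <- lm(dist ~ speed, data = cars)
coefs <- summary(model)$coefficients

# Magnitude of each estimate relative to its standard error
ratio <- abs(coefs[, "Estimate"]) / coefs[, "Std. Error"]
ratio
```

Note that this ratio is the absolute value of the t value column, so the rule of thumb is another way of reading the t statistics.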

The \(Pr(>|t|)\) column gives the probability of observing a test statistic (\(t\) value) as extreme as or more extreme than the one observed (otherwise known as the p-value). For speed, this means that the probability of observing a \(t\) value of magnitude 9.46 or more (assuming there is NO linear relationship between speed and stopping distance) is 1.489836e-12. Since this value is so small, we can say there is strong evidence that a linear relationship exists between speed and stopping distance. However, to know whether the relationship is actually linear, we need to do some more checks on the validity of the model.
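The p-value for speed can be reproduced from the t value and the residual degrees of freedom using the t distribution. This sketch assumes the usual two-sided test; `t_speed` and `p_manual` are illustrative names.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)
coefs <- summary(model)$coefficients

t_speed <- coefs["speed", "t value"]
df <- df.residual(model)  # 48 for this model

# Two-sided p-value: probability of |t| at least this large under the null
p_manual <- 2 * pt(abs(t_speed), df = df, lower.tail = FALSE)
p_manual
```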

Residual Analysis

Let’s plot the residuals.

plot(fitted(model), resid(model), main = "Residuals", 
     xlab = "Fitted Model", ylab = "Residuals")
abline(0,0)

We notice the spread of the residuals increasing from left to right, which means the variance is not constant across the fitted values. This is another indication that the model accounts for some but not all of the variation in the measured values.
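As an informal numeric companion to the plot, one could look at the rank correlation between the fitted values and the absolute residuals. This is a rough diagnostic chosen for this sketch, not a formal test, and `trend` is an illustrative name.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)

# Rank correlation between fitted values and residual magnitude;
# a clearly positive value is consistent with spread growing left to right
trend <- cor(fitted(model), abs(resid(model)), method = "spearman")
trend
```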

Let’s plot a Q-Q plot!

qqnorm(resid(model))
qqline(resid(model))

If the residuals were normally distributed, we would expect the points plotted in this figure to follow a straight line. We can see that they tail off a bit in the upper portion. The right side indicates that the right tail is heavier than that of a normal distribution, whereas the left tail is lighter. This pattern is indicative of a right-skewed distribution.
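The visual reading of the Q-Q plot can be backed up with a couple of numbers. The sketch below computes the sample skewness from its moment definition (base R has no built-in skewness function) and runs a Shapiro-Wilk normality test; both are additions for illustration rather than part of the textbook analysis, and `skew` and `sw` are illustrative names.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)
r <- resid(model)

# Moment-based sample skewness: positive values indicate a longer right tail
skew <- mean((r - mean(r))^3) / mean((r - mean(r))^2)^1.5
skew

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(r)
sw$p.value
```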