CUNY SPS DATA 605 - Assignment 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Load Data

# Load cars
data(cars)
# Print the first 6 rows
head(cars, 6)

Sanity Check

attach(cars)
# Check min and max and mean
min(speed); max(speed); mean(speed)

## [1] 4

## [1] 25

## [1] 15

min(dist); max(dist); mean(dist)

## [1] 2

## [1] 120

## [1] 43

Nothing obviously wacky here so let’s move on…
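A slightly broader sanity check can also confirm the structure of the data and the absence of missing values. This is a minimal sketch using only base R; the variable name `n_missing` is our own.

```r
# Structural sanity checks on the built-in cars data frame
data(cars)

# Column types and dimensions
str(cars)

# Count missing values in each column (n_missing is an illustrative name)
n_missing <- colSums(is.na(cars))
n_missing

# Min, quartiles, median, mean and max for each column in one call
summary(cars)
```

`summary(cars)` reports the same min, max and mean values as the individual calls above, without needing `attach()`.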

Visualize the Data

Check if a linear model seems appropriate.

# Create scatter plot
plot(speed, dist, main = "Stopping Distance as a Function of Speed", ylab = "Distance", xlab = "Speed")

The relationship looks roughly linear: as speed increases, so does the stopping distance. So a linear model seems appropriate.
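The visual impression can be backed up with a number. As a rough sketch, the Pearson correlation between the two columns measures the strength of the linear association; the name `r` is our own.

```r
# Pearson correlation between speed and stopping distance
data(cars)
r <- cor(cars$speed, cars$dist)
r
```

A value close to 1 is consistent with the upward linear trend in the scatter plot, although it cannot rule out curvature on its own.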

Build Linear Model

# Distance as a function of speed
cars.lm <- lm(dist ~ speed)
cars.lm

## 
## Call:
## lm(formula = dist ~ speed)
## 
## Coefficients:
## (Intercept)        speed  
##      -17.58         3.93

Our linear regression model is: \[ stopping\_distance = -17.58 + 3.93 \times speed \]
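To see the equation in use, the sketch below predicts stopping distance at a few speeds and checks the result against the formula applied by hand. The speeds 10, 15 and 20 and the names `fit`, `new_speeds`, `pred` and `manual` are illustrative choices, not part of the assignment.

```r
# Refit the model with an explicit data argument so the block is self-contained
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds at which to predict stopping distance
new_speeds <- data.frame(speed = c(10, 15, 20))
pred <- predict(fit, newdata = new_speeds)
pred

# Same calculation by hand: intercept + slope * speed
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed
all.equal(unname(pred), manual)
```

Because the intercept is negative, the line predicts a negative stopping distance at very low speeds, so the equation should only be trusted within the observed speed range of 4 to 25.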

Plot with fitted line

plot(speed, dist, main = "Stopping Distance as a Function of Speed", ylab = "Distance", xlab = "Speed")
abline(cars.lm)

Evaluate the Quality of the Model

summary(cars.lm)

## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29.07  -9.53  -2.27   9.21  43.20 
## 
## Coefficients:
##             Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)  -17.579      6.758   -2.60           0.012 *  
## speed          3.932      0.416    9.46 0.0000000000015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 48 degrees of freedom
## Multiple R-squared:  0.651,  Adjusted R-squared:  0.644 
## F-statistic: 89.6 on 1 and 48 DF,  p-value: 0.00000000000149

hist(cars.lm$residuals)

mean(cars.lm$residuals)

## [1] 0.000000000000000087

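The histogram shows residuals that are roughly normally distributed and centered on zero. The residual mean of 0.000000000000000087 is zero up to floating-point error, which is expected for any least-squares fit that includes an intercept, so it does not by itself tell us about the quality of the fit. A few large positive residuals in the 40+ bin of the histogram make the maximum residual (43.20) larger in magnitude than the minimum (-29.07), but the 1st and 3rd quartiles (-9.53 and 9.21) are of almost equal magnitude, so the bulk of the distribution is close to symmetric.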
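The histogram can be supplemented with a couple of numeric checks. The sketch below confirms that the residuals sum to (numerically) zero, which least squares guarantees whenever the model has an intercept, and runs a Shapiro-Wilk test as one possible formal check of normality. The choice of test and the names `fit`, `res` and `sw` are our own.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# With an intercept in the model, the residuals sum to zero up to rounding error
sum(res)

# Min, quartiles and max of the residuals, comparable to the summary() output
quantile(res)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(res)
sw
```

With only 50 observations, the test result is best read alongside the histogram and Q-Q plot rather than as a verdict on its own.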

According to our textbook, we typically want the standard error to be “at least five to ten times smaller than the corresponding coefficient”. Here the standard error for speed, 0.42, is about 9.46 times smaller than the coefficient, 3.93, which indicates that the slope is estimated with little variability. The standard error for the intercept, 6.76, is only about 2.6 times smaller in magnitude than the coefficient, -17.58, so the intercept is estimated less precisely than the slope and this estimate may vary.
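The ratios quoted here can be computed directly from the coefficient table. A minimal sketch, with `fit`, `ctab` and `ratio` as our own names:

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- coef(summary(fit))

# How many times smaller the standard error is than the coefficient (in magnitude)
ratio <- abs(ctab[, "Estimate"]) / ctab[, "Std. Error"]
ratio
```

These ratios are the absolute values of the t statistics in the `summary()` output, which is why the figure for the intercept should be read as 2.6 and not -2.6.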

The p-value for speed, 0.0000000000015, is strong evidence that the slope differs from zero, i.e., that speed is relevant to the model. The p-value for the intercept, 0.012, is significant at the 0.05 level but not at the 0.01 level, so the evidence that the intercept differs from zero is weaker.
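Confidence intervals give a complementary view of the same information. As a sketch, `confint()` returns an interval for each coefficient; the 95% level and the names `fit` and `ci` are our own choices.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
ci
```

An interval that excludes zero corresponds to a p-value below 0.05 for that coefficient, and the width of the interval shows how precisely the coefficient is estimated.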

The \(R^2\) value of 0.6511 indicates that the model explains 65.11% of the variation in stopping distance.
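To make the meaning of \(R^2\) concrete, it can be recomputed from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, with `fit`, `ss_res`, `ss_tot` and `r2` as our own names:

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the model
ss_res <- sum(resid(fit)^2)

# Total variation of stopping distance around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared from the definition; compare with summary(fit)$r.squared
r2 <- 1 - ss_res / ss_tot
r2
```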

Plot the Residuals

plot(fitted(cars.lm), resid(cars.lm))

There is no apparent pattern in the plotted residuals, which is consistent with the linear model being a good fit.
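Patterns are easier to judge with a reference line and a smoother added to the plot. The sketch below is one way to do this in base R; the axis labels, title and the names `fit`, `fv`, `res` and `sm` are our own.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)
fv <- fitted(fit)
res <- resid(fit)

# Residuals against fitted values
plot(fv, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")

# Dashed horizontal line at zero for reference
abline(h = 0, lty = 2)

# Lowess smoother: it should stay close to the zero line if there is no trend
sm <- lowess(fv, res)
lines(sm)
```

If the smoother bends away from zero, or the vertical spread of the points changes noticeably from left to right, that would suggest curvature or non-constant variance.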

Q-Q Plot

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

We can see that overall the sample quantiles fall close to the reference line, so the residuals are approximately normal. The main departure comes from the few large positive residuals at the upper end.
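To see which observations sit furthest from the fitted line, the residuals can be ranked by absolute size. This is a minimal sketch; showing three rows is an arbitrary choice and the names `fit`, `res`, `top` and `out` are our own.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Row indices of the three largest residuals in absolute value
top <- order(abs(res), decreasing = TRUE)[1:3]

# Show those rows next to their fitted values and residuals
out <- cbind(cars[top, ], fitted = fitted(fit)[top], residual = res[top])
out
```

The first row of `out` is the observation behind the maximum residual of 43.20 reported in the model summary.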

Conclusion

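Overall the linear model is a good fit for this data: the slope is highly significant and the model explains about 65% of the variation in stopping distance. The main departure is a few outliers at the upper end.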