Problem

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

Data Analysis

Load the cars dataset into R and take a quick look at the data.

# Load the Cars dataset into R.
cars_dataset <- datasets::cars
# Summarize the structure of the data.
str(cars_dataset)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

As we can see above, the cars dataset comprises 50 observations of 2 variables: speed (the independent variable) and stopping distance, dist (the dependent variable). Each row records a car's speed and the distance it took that car to stop.

summary(cars_dataset)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Is the relationship between the independent and dependent variables linear?

plot(dist ~ speed, data = cars_dataset, col = 'blue',
     pch = 16, xlab = 'Speed (Independent Variable)', ylab = 'Stopping Distance (Dependent Variable)',
     main = 'Relationship Between Speed and Stopping Distance')
abline(lm(dist ~ speed, data = cars_dataset), col = 'red')

From the above scatter plot we can see that distance and speed are positively correlated and, whilst not perfectly linear, the relationship appears to be approximately linear.
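The visual impression from the scatter plot can be backed up with a number. The following is a minimal sketch, assuming only the built-in cars data; the variable name r is illustrative and not part of the original analysis. It computes the Pearson correlation between speed and stopping distance. For a simple linear regression with one predictor, the square of this correlation equals the model's R-squared.

```r
# Load the data (same dataset as used throughout this analysis).
cars_dataset <- datasets::cars

# Pearson correlation between speed and stopping distance.
# Values near +1 indicate a strong positive linear association.
r <- cor(cars_dataset$speed, cars_dataset$dist)
print(r)

# With a single predictor, r^2 is the share of variance a straight line can explain.
print(r^2)
```

A correlation close to 1 supports fitting a straight line, although it cannot reveal curvature on its own, so the scatter plot remains the primary check.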

Linear Regression Analysis

# Fit stopping distance (dist) as a function of speed, as stated in the problem.
cars_dataset_lm <- lm(dist ~ speed, data = cars_dataset)
summary(cars_dataset_lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars_dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Linear Model Summary Analysis

  1. The Multiple R-squared value above measures how well the model describes the data. For this model, the R-squared value is 0.6511, which means that the model explains 65.11% of the variation in stopping distance.

  2. The p-value for the speed coefficient is 1.49e-12. If speed had no linear effect on stopping distance, an estimate this far from zero would be extremely unlikely, so speed is highly relevant in modeling stopping distance.

  3. For a well-fitted linear regression model, the residuals should be roughly normally distributed and centered on zero. The median residual for this model is -2.272, which is small relative to the residual range of -29.069 to 43.201. The largest positive residual is noticeably bigger in magnitude than the largest negative one, which hints at some right skew.
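To make the meaning of R-squared concrete, it can be recomputed by hand from the residuals. This is a minimal sketch, assuming the model regresses stopping distance on speed as stated in the problem; the names fit, ss_res, ss_tot and r_squared are illustrative and not part of the original analysis.

```r
# Refit the model so the example is self-contained.
cars_dataset <- datasets::cars
fit <- lm(dist ~ speed, data = cars_dataset)

# Residual sum of squares: variation the model leaves unexplained.
ss_res <- sum(residuals(fit)^2)

# Total sum of squares: variation of dist around its own mean.
ss_tot <- sum((cars_dataset$dist - mean(cars_dataset$dist))^2)

# R-squared is the share of total variation that the model accounts for.
r_squared <- 1 - ss_res / ss_tot
print(r_squared)

# Should agree with the value reported by summary().
print(all.equal(r_squared, summary(fit)$r.squared))
```

Seen this way, an R-squared of about 0.65 means that roughly a third of the variation in stopping distance is left in the residuals.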

 

Residual Analysis

par(mfrow = c(2, 2))
plot(cars_dataset_lm)

par(mfrow = c(1, 1))

From the Residuals vs Fitted plot above we can see that the residuals are not uniformly scattered above and below zero. As the fitted values increase, the residuals appear to grow as well, which suggests that the variance of the errors is not constant.

The Normal Q-Q plot above shows the two ends diverging from the diagonal line, which indicates that the residuals are not normally distributed.
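The two visual readings above can be complemented with simple numeric checks. The sketch below is one possible approach rather than part of the original analysis: it assumes the model regresses stopping distance on speed, uses the Shapiro-Wilk test from base R for normality, and uses the correlation between absolute residuals and fitted values as a crude indicator of growing spread. The names fit, res, normality and spread_trend are illustrative.

```r
# Refit the model so the example is self-contained.
cars_dataset <- datasets::cars
fit <- lm(dist ~ speed, data = cars_dataset)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value is evidence against normally distributed residuals.
normality <- shapiro.test(res)
print(normality)

# Crude spread check: a positive correlation between |residual| and fitted value
# suggests the residuals fan out as the fitted values increase.
spread_trend <- cor(abs(res), fitted(fit))
print(spread_trend)
```

With only 50 observations these checks have limited power, so they are best read alongside the diagnostic plots rather than in place of them.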

Conclusion

From the above analysis, we can conclude that speed alone is an insufficient predictor of stopping distance. To get a more accurate picture, we would most likely need to introduce additional predictors to our model, such as weather, car weight, and the effectiveness of the car’s brake pads.
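One way to see how much uncertainty a speed-only model leaves is to look at prediction intervals. The following is a minimal sketch, assuming the model regresses stopping distance on speed; the speeds 10, 15 and 20 and the names fit, new_speeds and pred are chosen purely for illustration.

```r
# Refit the model so the example is self-contained.
cars_dataset <- datasets::cars
fit <- lm(dist ~ speed, data = cars_dataset)

# Hypothetical speeds at which to predict stopping distance.
new_speeds <- data.frame(speed = c(10, 15, 20))

# 95% prediction intervals: the range expected to contain a new observation.
pred <- predict(fit, newdata = new_speeds, interval = "prediction", level = 0.95)

# Columns: fit (point prediction), lwr and upr (interval bounds).
print(cbind(new_speeds, pred))
```

Wide intervals around each point prediction are consistent with the conclusion that predictors beyond speed would be needed for more precise estimates.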