Introduction

Using the “cars” dataset in R, build a linear model of stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook: visualization, quality evaluation of the model, and residual analysis.

Dataset

The cars dataset has two columns, “speed” and “dist”, which record a car's speed (in mph) and the distance it takes the car to stop (in ft).

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Visualization

Speed is the predictor variable and stopping distance is the system response.

plot(cars$speed, cars$dist, xlab='Speed', ylab='Stopping Distance', 
     main='Stopping Distance vs. Speed')

Linear Model

# A linear model
cars_linear <- lm(cars$dist ~ cars$speed)
cars_linear
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932
# Line of best fit.
plot(cars$speed, cars$dist, xlab='Speed', ylab='Stopping Distance', 
     main='Stopping Distance vs. Speed')
abline(cars_linear)

# Evaluation of the Linear Model
summary(cars_linear)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Quality Evaluation of the Model

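  1. The speed coefficient is about 9.5 times its standard error (3.9324/0.4155 ≈ 9.46, which is the t value in the summary). In other words, the standard error is roughly 9.5 times smaller than the coefficient, which is good as explained in the book.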

  2. Speed is highly significant in modeling stopping distance: its p-value of 1.49e-12 means that a slope estimate this far from zero would be very unlikely if speed had no linear relationship with stopping distance.

  3. The intercept is also statistically significant at the 0.05 level, though less strongly than speed: its p-value is 0.0123.

  4. The model explains 65.11% of the variation in stopping distance: multiple R-squared = 0.6511.

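  5. The residual quartiles are roughly symmetric about zero (1Q = -9.525, 3Q = 9.215) and the median is close to zero (-2.272), which is consistent with approximately normal residuals. The maximum (43.201) is larger in magnitude than the minimum (-29.069), which hints at a longer upper tail.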
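
The figures quoted in the list above can be pulled directly from the summary object rather than read off the printout. The following is a minimal sketch, not part of the original analysis: the names `fit`, `coefs`, `ratio`, and `r2_manual` are chosen here for illustration, and the model is refit with the `data = cars` form so that the coefficient row is simply named `speed`.

```r
# Refit the same model; 'fit' is an illustrative name, not from the original report.
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(s)

# Ratio of each estimate to its standard error (this is the t value)
ratio <- coefs[, "Estimate"] / coefs[, "Std. Error"]
print(ratio)

# p-values for the intercept and the speed coefficient
p_values <- coefs[, "Pr(>|t|)"]
print(p_values)

# R-squared computed by hand as 1 - SS_res / SS_tot,
# compared with the value reported by summary()
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r2_manual <- 1 - ss_res / ss_tot
print(c(manual = r2_manual, reported = s$r.squared))
```

Computing the ratio and R-squared by hand makes it easier to see where each number in the summary comes from.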

Residual Analysis

plot(cars_linear$fitted.values, cars_linear$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0)

The residuals of the linear model are scattered around zero, although the model does seem to overpredict more often than it underpredicts. Because the dataset is small (50 observations), this imbalance might be smoothed out with more data.
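
The over- versus underprediction balance can also be checked numerically instead of by eye. This is a rough sketch under the assumption that a negative residual (observed distance below the fitted value) counts as an overprediction; `fit`, `res`, `n_over`, and `n_under` are illustrative names that do not appear in the original report.

```r
# Refit the model so that this snippet runs on its own.
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Negative residual: observed < fitted, so the model overpredicted.
# Positive residual: observed > fitted, so the model underpredicted.
n_over <- sum(res < 0)
n_under <- sum(res > 0)
print(c(overpredicted = n_over, underpredicted = n_under))

# With an intercept in the model, least-squares residuals average to zero,
# so the median is the more informative measure of imbalance here.
print(mean(res))
print(median(res))
```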

Normal Q-Q Plot

qqnorm(cars_linear$residuals)
qqline(cars_linear$residuals)

From the Q-Q plot, there is some divergence at the very end of the upper tail, but most of the residuals lie close to the line. This implies a largely normal distribution.
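
A formal test can complement the visual reading of the Q-Q plot. The sketch below uses the Shapiro-Wilk test (`shapiro.test` from base R's stats package), which is not part of the original analysis and is offered only as one possible cross-check. Its null hypothesis is that the residuals come from a normal distribution, so a small p-value would count as evidence against normality. `fit` and `sw` are illustrative names.

```r
# Refit the model so that this snippet runs on its own.
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk normality test on the residuals
sw <- shapiro.test(residuals(fit))
print(sw)

# The test statistic W and the p-value are also available as list elements
print(sw$statistic)
print(sw$p.value)
```

With only 50 observations, the test and the plot are best read together rather than treating either one as decisive.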

Conclusion

From the overall analysis, speed is a good predictor of stopping distance, and the model fits reasonably well and largely satisfies the assumptions of a linear regression model.
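
As an illustration of using the model as a predictor, the sketch below estimates stopping distances for two example speeds (10 and 20 mph, chosen arbitrarily here) together with 95% prediction intervals. It assumes the model is refit with the `dist ~ speed, data = cars` form, because `predict()` cannot match `newdata` to a formula written as `cars$dist ~ cars$speed`. `fit`, `new_speeds`, and `pred` are illustrative names.

```r
# Refit using the data-argument form so predict() can use new data.
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) at which to predict stopping distance
new_speeds <- data.frame(speed = c(10, 20))

# Point predictions with 95% prediction intervals
# (columns: fit, lwr, upr)
pred <- predict(fit, newdata = new_speeds,
                interval = "prediction", level = 0.95)
print(pred)
```

The width of the intervals gives a sense of how much individual stopping distances can vary around the fitted line.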