Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
The cars dataset has two columns, “speed” and “dist”, which record a car's speed and the distance it takes for the car to stop.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Speed is the predictor variable and stopping distance is the system response.
plot(cars$speed, cars$dist, xlab='Speed', ylab='Stopping Distance',
main='Stopping Distance vs. Speed')
# A linear model
cars_linear <- lm(cars$dist ~ cars$speed)
cars_linear
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept) cars$speed
## -17.579 3.932
# Line of best fit.
plot(cars$speed, cars$dist, xlab='Speed', ylab='Stopping Distance',
main='Stopping Distance vs. Speed')
abline(cars_linear)
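As a rough check on what the fitted line means, the coefficients can be used to predict a stopping distance by hand. The sketch below refits the same model using the `data =` form of `lm()` (an equivalent formulation, assumed here only so that `predict()` can accept new data). The speed of 20 and the names `fit`, `new_speed`, `pred_model`, and `pred_hand` are illustrative and not part of the original analysis.

```r
# Same model as cars_linear, written with data = so predict() can take new data
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speed chosen purely for illustration
new_speed <- data.frame(speed = 20)

# Prediction taken from the model object
pred_model <- unname(predict(fit, newdata = new_speed))

# The same prediction by hand: intercept + slope * speed
b <- coef(fit)
pred_hand <- unname(b["(Intercept)"] + b["speed"] * 20)

print(c(model = pred_model, by_hand = pred_hand))
```

With the coefficients printed above, this works out to roughly -17.579 + 3.932 * 20, or about 61.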
# Evaluation of the Linear Model
summary(cars_linear)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The speed coefficient is about 9.5 times its standard error (3.9324/0.4155 ≈ 9.46, the t value in the table), which is good as explained in the book.
Speed is highly relevant in modeling stopping distance: the p-value of 1.49e-12 is the probability of seeing a coefficient estimate this far from zero if speed actually had no effect on stopping distance.
The intercept is also relevant in the model, though less strongly: its p-value is 0.0123, which is significant at the 0.05 level.
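The ratio quoted above can be reproduced directly from the coefficient table. This is a minimal sketch that refits the model so it runs on its own; the names `fit`, `coefs`, `est`, `se`, and `t_ratio` are illustrative.

```r
# Refit the model so this block is self-contained
fit <- lm(cars$dist ~ cars$speed)

# Coefficient table: one row per term, columns for estimate, SE, t, p
coefs <- summary(fit)$coefficients

# Row 2 is the speed term
est <- coefs[2, "Estimate"]
se  <- coefs[2, "Std. Error"]

# Ratio of the estimate to its standard error
t_ratio <- est / se

print(t_ratio)               # ratio computed by hand
print(coefs[2, "t value"])   # t value reported by summary()
```

The two printed values should agree, since the t value in the summary is defined as the estimate divided by its standard error.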
The model explains 65.11% of the data’s variation: multiple R-squared = 0.6511
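To see where that figure comes from, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes nothing beyond the built-in `cars` data; the variable names are illustrative.

```r
# Refit the model so this block is self-contained
fit <- lm(cars$dist ~ cars$speed)

# Variation left unexplained by the model
ss_res <- sum(residuals(fit)^2)

# Total variation of stopping distance around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# Share of the variation explained by the model
r_squared <- 1 - ss_res / ss_tot

print(r_squared)
print(summary(fit)$r.squared)  # value reported by summary()
```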
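The residual summary is consistent with a roughly normal distribution: the median (-2.272) is close to zero and the first and third quartiles (-9.525 and 9.215) are nearly equal in magnitude, although the maximum (43.201) is farther from zero than the minimum (-29.069).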
plot(cars_linear$fitted.values, cars_linear$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0)
The residuals appear roughly normally distributed around zero, with no obvious pattern against the fitted values; the model does seem to overpredict more often than it underpredicts. Because the dataset is small, this might be smoothed out with a larger sample.
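The impression of more overprediction than underprediction can be checked by counting residual signs. In this sketch a negative residual (observed minus fitted) is taken to mean the model overpredicted; the names `res`, `n_over`, and `n_under` are illustrative.

```r
# Refit the model so this block is self-contained
fit <- lm(cars$dist ~ cars$speed)
res <- residuals(fit)

# Negative residual: fitted value above the observed distance (overprediction)
n_over  <- sum(res < 0)

# Positive residual: fitted value below the observed distance (underprediction)
n_under <- sum(res > 0)

print(c(overpredicted = n_over, underpredicted = n_under))

# The mean residual is zero (up to rounding) for a model with an intercept,
# so any imbalance is in the counts, not in the average
print(mean(res))
```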
qqnorm(cars_linear$residuals)
qqline(cars_linear$residuals)
From the Q-Q plot, there is some divergence at the very end of the upper tail, but most of the residuals lie close to the line and are evenly spread around it. This implies a largely normal distribution.
From the overall analysis, speed is a good predictor of stopping distance, and the model fits well and reasonably satisfies the assumptions of linear regression.
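One way to see which observation drives the upper-tail departure is to look up the largest residual. This is a small sketch for inspection only, not part of the textbook procedure; the names `res` and `i` are illustrative.

```r
# Refit the model so this block is self-contained
fit <- lm(cars$dist ~ cars$speed)
res <- residuals(fit)

# Position of the largest positive residual
i <- which.max(res)

# The observation behind it, and the size of its residual
print(cars[i, ])
print(max(res))
```

The largest residual matches the Max value in the residual summary above.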