Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
library(tidyverse)
library(moderndive)
library(skimr)
Let’s consider a simple example of how the speed of a car affects its stopping distance, that is, how far it travels before it comes to a stop. To examine this relationship, we will use the ‘cars’ dataset.
cars_regression <- cars %>%
select(speed, dist)
cars_regression %>% skim()
Name | Piped data |
Number of rows | 50 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
speed | 0 | 1 | 15.40 | 5.29 | 4 | 12 | 15 | 19 | 25 | ▂▅▇▇▃ |
dist | 0 | 1 | 42.98 | 25.77 | 2 | 26 | 36 | 56 | 120 | ▅▇▅▂▁ |
How does the speed of a car affect its stopping distance, that is, how far it travels before it comes to a stop?
Exploratory analysis of variables
\(y:\) Dependent Variable - \(dist\)
\(x:\) Independent Variable - \(speed\)
Investigate correlation between these variables
An initial scatterplot of stopping distance as a function of speed indicates that stopping distance tends to increase as speed increases, as expected.
The plot suggests that the relationship is approximately linear.
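The code for the scatterplot is not shown in the text, so the following is only a sketch of how it might be produced, together with the correlation coefficient for the two variables. The object names `speed_dist_cor` and `speed_plot` are hypothetical, and the axis labels assume the units documented for the dataset (mph and feet).

```r
library(ggplot2)  # already attached when tidyverse is loaded

# Strength of the linear association between speed (mph) and distance (ft)
speed_dist_cor <- cor(cars$speed, cars$dist)
speed_dist_cor

# Scatterplot with a least-squares line overlaid (no confidence band)
speed_plot <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")
speed_plot
```

A correlation close to 1 supports the visual impression that a straight line is a reasonable first model.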
## `geom_smooth()` using formula 'y ~ x'
The output of the model indicates a linear function as:
\[ stopping\ distance = -17.579 + (3.932 * speed) \]
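A y-intercept of -17.579 does seem peculiar.
Taken literally, the model predicts that a car at speed 0 would stop in -17.579 feet, which is physically impossible: a car that is not moving needs no distance to stop. The intercept should therefore be read as a property of the fitted line rather than as a real stopping distance, since a speed of 0 lies below the slowest observed speed (4 mph).
The slope of 3.932 means that each additional 1 mph of speed is associated with an increase of about 3.932 feet in stopping distance, on average.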
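To make the intercept and slope concrete, here is the arithmetic using the rounded coefficients from the fitted line. The predicted stopping distance reaches zero at

\[ speed = 17.579 / 3.932 \approx 4.47 \]

so any speed below about 4.47 mph gives a negative prediction. Even at the slowest observed speed of 4 mph the line predicts

\[ stopping\ distance = -17.579 + (3.932 * 4) = -1.851 \]

For the slope, a 10 mph increase in speed corresponds to a predicted increase of

\[ 3.932 * 10 = 39.32 \]

feet in stopping distance.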
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | -17.579 | 6.758 | -2.601 | 0.012 | -35.707 | 0.548 |
speed | 3.932 | 0.416 | 9.464 | 0.000 | 2.818 | 5.047 |
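The code that fits `speed_model` is not shown in the text. A minimal sketch follows, assuming an ordinary least-squares fit with `lm()`. The table's column layout looks like output from moderndive's `get_regression_table()`, and its interval columns appear to correspond to a 99% confidence level rather than the default 95%; both points are inferred from the numbers, not stated in the source.

```r
# Fit stopping distance (ft) as a linear function of speed (mph)
speed_model <- lm(dist ~ speed, data = cars)

# Point estimates: intercept and slope
round(coef(speed_model), 3)

# Estimates, standard errors, t statistics, and p-values
round(summary(speed_model)$coefficients, 3)

# 99% confidence intervals (level inferred from the table, an assumption)
round(confint(speed_model, level = 0.99), 3)
```

Note that the 99% interval for the intercept includes zero, while the interval for the speed coefficient does not.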
Coefficient
Also shown in the estimate column. This portion of the output shows the estimated coefficient values.
Std. Error
For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient.
For example:
The standard error for \(speed\) is about 9.45 times smaller than the coefficient value (3.932 / 0.416 ≈ 9.45).
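The same ratio can be computed for every coefficient at once. This is a sketch that refits the model so the block runs on its own; `coef_table` and `se_ratio` are hypothetical names. With unrounded values the ratio for speed comes out near 9.46, which is the absolute value of the t statistic in the table.

```r
speed_model <- lm(dist ~ speed, data = cars)

# Matrix with columns "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coef_table <- summary(speed_model)$coefficients

# How many times smaller each standard error is than its coefficient
se_ratio <- abs(coef_table[, "Estimate"]) / coef_table[, "Std. Error"]
round(se_ratio, 2)
```

By this rule of thumb the speed coefficient is well estimated, while the intercept, with a ratio of roughly 2.6, is not.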
P-value
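The p-value shows the probability of observing an estimate at least this far from zero if the true coefficient were zero, that is, if the term had no role in the model. A small p-value is evidence that the corresponding coefficient is relevant. This value is also known as the significance of the coefficient.
The p-value for the intercept is 0.012. The p-value for speed is reported as 0.000, meaning it is smaller than 0.0005 after rounding to three decimals.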
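As a sketch of where these p-values come from, the two-sided p-value can be recomputed from each t statistic and the residual degrees of freedom. The names `t_stats` and `p_manual` are hypothetical, and the model is refit so the block is self-contained.

```r
speed_model <- lm(dist ~ speed, data = cars)
coef_table <- summary(speed_model)$coefficients

# t statistic = estimate / standard error
t_stats <- coef_table[, "t value"]

# Residual degrees of freedom: 50 observations minus 2 estimated coefficients
resid_df <- df.residual(speed_model)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_stats), df = resid_df)
signif(p_manual, 3)
```

The unrounded value for speed is far below 0.001, which is why the table displays it as 0.000.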
Residuals
The residuals are the differences between the actual measured values and the corresponding values on the fitted regression line.
For a good model, the residual values should be approximately normally distributed around a mean of zero. In this case, even though the mean of the residuals is not exactly zero, it is close enough that we can still expect an approximately normal distribution.
That is, a good model’s residuals should be roughly balanced around, and not too far away from, the mean of zero.
R-Squared & Residual Standard Error (RSE)
These final few lines in the output provide some statistical information about the quality of the regression model’s fit to the data.
summary(speed_model)$r.squared
## [1] 0.6510794
summary(speed_model)$sigma
## [1] 15.37959
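Both quantities can be rebuilt from the residuals, which may help in reading them: R-squared is the share of the variation in stopping distance that the line accounts for (about 65% here), and the RSE is the typical size of a prediction error (about 15.4 feet). The sketch below uses hypothetical names `rss`, `tss`, `r_squared`, and `rse`.

```r
speed_model <- lm(dist ~ speed, data = cars)
res <- residuals(speed_model)

# Residual sum of squares and total sum of squares
rss <- sum(res^2)
tss <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: proportion of variance explained by the model
r_squared <- 1 - rss / tss

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(rss / df.residual(speed_model))

c(r_squared = r_squared, rse = rse)
```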
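Minimum and Maximum
For a good model, we would like to see minimum and maximum residual values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude. In the summary below, the quartiles are similar in size (-9.53 and 9.21), but the maximum (43.20) is noticeably larger in magnitude than the minimum (-29.07).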
summary(speed_model_points$residual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -29.06900 -9.52550 -2.27200 -0.00004 9.21450 43.20100
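The object `speed_model_points` is not defined in the text shown. It was presumably created with moderndive's `get_regression_points(speed_model)`; the sketch below builds a base-R stand-in with the same kind of columns, rounded to three decimals. That rounding is an assumption, though it would explain why the reported mean is -0.00004 rather than exactly zero.

```r
speed_model <- lm(dist ~ speed, data = cars)

# Stand-in table of observed values, fitted values, and residuals
speed_model_points <- data.frame(
  ID       = seq_len(nrow(cars)),
  dist     = cars$dist,
  speed    = cars$speed,
  dist_hat = round(fitted(speed_model), 3),     # predicted distance
  residual = round(residuals(speed_model), 3)   # observed minus predicted
)

head(speed_model_points, 3)
summary(speed_model_points$residual)
```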
Residual Visual Relationship
Distribution of Residuals
Investigate potential relationships between the residuals and all explanatory/predictor variables.
Residuals vs Fitted
If the residuals show a visible pattern rather than a random scatter around zero, we may be able to construct a model that produces tighter residual values and better predictions.
Residual values greater than zero mean that the regression model predicted a value that was too small compared to the actual measured value, and negative values indicate that the regression model predicted a value that was too large.
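A residuals-versus-fitted plot can be drawn directly from the model, as in this base-R sketch. The names `fitted_vals`, `res`, and `sign_counts` are hypothetical. The final table counts how often the model under-predicted (positive residual) and over-predicted (negative residual).

```r
speed_model <- lm(dist ~ speed, data = cars)
fitted_vals <- fitted(speed_model)
res <- residuals(speed_model)  # actual value minus fitted value

# Residuals against fitted values, with a reference line at zero
plot(fitted_vals, res,
     xlab = "Fitted stopping distance (ft)",
     ylab = "Residual (ft)",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Positive residual: prediction too small; negative: prediction too large
sign_counts <- table(ifelse(res > 0, "under-predicted", "over-predicted"))
sign_counts
```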
QQ Plot
If the residuals were normally distributed, we would expect the points plotted in this figure to follow a straight line. In this case, the points do roughly follow a straight line.
This suggests that speed alone may be a sufficient predictor to explain the data.
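A Q-Q plot of the residuals can also be drawn on its own, as sketched below with base R. The name `qq` is hypothetical. The points at the upper end are worth comparing against the reference line, since the largest residual (43.20) is bigger in magnitude than the smallest (-29.07).

```r
speed_model <- lm(dist ~ speed, data = cars)
res <- residuals(speed_model)

# Sample quantiles of the residuals against theoretical normal quantiles
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")

# Reference line through the first and third quartiles
qqline(res)
```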
plot(speed_model)
plot(speed_model_points)
In the call below, the new values are wrapped in a data frame with a column named speed. We do this so that predict() knows that 8, 21, and 50 are values of speed and how to use them with the model stored in speed_model.
predict(speed_model, newdata = data.frame(speed = c(8, 21, 50)))
## 1 2 3
## 13.88018 65.00149 179.04134
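\(stopping\ distance = -17.579 + (3.932 * speed)\)
\(stopping\ distance = -17.579 + (3.932 * 8) = 13.877\)
\(stopping\ distance = -17.579 + (3.932 * 21) = 64.993\)
\(stopping\ distance = -17.579 + (3.932 * 50) = 179.021\)
These hand calculations use the rounded coefficients, so they differ slightly from the predict() output.
Note: a speed of 50 is outside the range of the observed data (4 to 25), so that prediction is an extrapolation.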
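One way to guard against extrapolation is to flag new speeds that fall outside the observed range before trusting the prediction. This is a sketch only; `new_speeds`, `predicted_dist`, and `in_range` are hypothetical names.

```r
speed_model <- lm(dist ~ speed, data = cars)

# New speeds to predict for
new_speeds <- data.frame(speed = c(8, 21, 50))
new_speeds$predicted_dist <- predict(speed_model, newdata = new_speeds)

# Flag values outside the speeds seen in the data
speed_range <- range(cars$speed)
new_speeds$in_range <- new_speeds$speed >= speed_range[1] &
  new_speeds$speed <= speed_range[2]

new_speeds
```

Rows where `in_range` is FALSE rely on the line continuing unchanged beyond the data, which the dataset cannot confirm.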
Overall, car speed appears to be a good predictor of stopping distance. The linear regression model does have some flaws, particularly in the intercept value and in the predictions at higher speeds.