library(ggplot2)
library(dplyr)Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
SOLUTION:
Below we can see that the cars dataset contains 50 observations of 2 variables: speed and dist.
Each observation represents the speed of a given car and the distance that it takes for the car to stop.
Show some of data in cars dataset
head(cars, 10)
#> speed dist
#> 1 4 2
#> 2 4 10
#> 3 7 4
#> 4 7 22
#> 5 8 16
#> 6 9 10
#> 7 10 18
#> 8 10 26
#> 9 10 34
#> 10 11 17Structure of the cars dataset
str(cars)
#> 'data.frame': 50 obs. of 2 variables:
#> $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
#> $ dist : num 2 10 4 22 16 10 18 26 34 17 ...Show some summary statistics about the dataset
summary(cars)
#> speed dist
#> Min. : 4.0 Min. : 2.00
#> 1st Qu.:12.0 1st Qu.: 26.00
#> Median :15.0 Median : 36.00
#> Mean :15.4 Mean : 42.98
#> 3rd Qu.:19.0 3rd Qu.: 56.00
#> Max. :25.0 Max. :120.00Determine whether or not it looks as though a linear relationship exists between the predictor “speed” and the output value “stopping distance” (dist).
ggplot(cars, aes(x = speed, y = dist, color = speed)) +
geom_point(shape = 1, alpha = 0.6) +
geom_smooth(method="lm", se=FALSE) +
labs(title = "Car Speed vs Stopping Distance") +
xlab("Speed") +
ylab("Stopping Distance")
This figure shows that the “stopping distance” tends to increase as the speed increases, as we expected. After superimposing a straight line on this scatter plot, we see that the relationship between the predictor (the speed) and the output (the stopping distance) is roughly linear and positive. It is not perfectly linear, however.
Our next step is to develop a regression model that will help us quantify the degree of linearity in the relationship between the output and the predictor.
Generate a linear model using the lm function
model = lm (cars$dist ~ cars$speed, data = cars)Show the statistics summary of the model
mod_summary <- summary(model)
mod_summary
#>
#> Call:
#> lm(formula = cars$dist ~ cars$speed, data = cars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -29.069 -9.525 -2.272 9.215 43.201
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -17.5791 6.7584 -2.601 0.0123 *
#> cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 15.38 on 48 degrees of freedom
#> Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
#> F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12Show the resulting linear equation for the model
intercept <- round(model$coefficients[1], 4)
slope <- round(model$coefficients[2], 4)\(\widehat{dist} = {-17.5791} + {3.9324} * {speed}\)
Evaluate the Residuals generated by the model
summary(mod_summary$residuals)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -29.069 -9.525 -2.272 0.000 9.215 43.201When we look at the residual values reported by summary(), we can see that we have a relatively good model because the median value is near zero, the minimum and maximum values are of roughly 10 units apart in magnitude, and first and third quartile values are of roughly the same magnitude. Therefore, for this model, the residual values are not too far off what we would expect for Gaussian-distributed numbers. But we will need to run additional tests to verify this.
Evaluate the Coefficients generated by the model
mod_summary$coefficients
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -17.579095 6.7584402 -2.601058 1.231882e-02
#> cars$speed 3.932409 0.4155128 9.463990 1.489836e-12For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient. In this case, our standard error is about 2.6 time smaller in magnitude than the Intercept coefficient. The standard error for the slope is roughly 9.46 times smaller.
The value of Pr(>|t|) for speed is very small, so the probability that speed is not relevant to the model is very tiny. The value of Pr(>|t|) for distance is small (0.0123), which represent a small probability that the distance is not relevant to the model.
Evaluate the Residual standard error generated by the model
mod_summary$sigma
#> [1] 15.37959If the residuals are distributed normally, the first and third quantiles of the previous residuals should be about 1.5 times this standard error. For this model, the residuals are roughly 1.5 time the standard error.
The Multiple R-squared value is a statistical measure of how well the model describes the data. The reported R-Squared of 0.6511 from the summary for this model means that the model explains 65.11 percent of the data’s variation.
Plot the Fitted Values versus the Residual Values
ggplot(model, aes(.fitted, .resid)) +
geom_point(size=2) +
labs(title = "Fitted Values vs Residuals") +
labs(x = "Fitted Values") +
labs(y = "Residuals") +
geom_abline(slope = 0) From the plot above, we see that the residuals tend to increase as we move to the right. Additionally, the residuals are not uniformly scattered above and below zero. Overall, this plot tells us that using the speed as the sole predictor in the regression model does not sufficiently or fully explain the data. There might be other additional factors that affect the distance needed to stop a car, such as terrain conditions, tire thread, weather conditions, etc.
qqnorm(resid(model))
qqline(resid(model))The Q-Q Plot above, shows how the data, for the most part, follows along model until a it starts diverging for higher values of speed, which can be seen near the upper right of the plot.