Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
Dependent variable: Stopping distance (dist, in feet)
Independent variable: Speed (speed, in miles per hour)
Load the cars data
# Load packages (cars ships with base R's datasets package, so these are optional here)
library(readr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the cars dataset
data(cars)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
First, we plot a scatter diagram to check whether a linear relationship exists between the speed of the car and its stopping distance.
The figure below shows that stopping distance tends to increase as the speed of the car increases, as expected, so there is a relationship to model.
library(ggplot2)
library(dplyr)
# Scatter plot
plot(cars$speed, cars$dist, main = "Stopping Distance vs Speed", xlab = "Speed", ylab = "Stopping Distance")
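To put a number on the trend seen in the scatter plot, the Pearson correlation between the two variables can be computed. This is a minimal sketch using base R's `cor()`; the variable name `r` is introduced here for illustration only.

```r
# Pearson correlation between speed and stopping distance
data(cars)
r <- cor(cars$speed, cars$dist)

r     # strength of the linear association (between -1 and 1)
r^2   # share of the variance in dist that a straight-line fit on speed can explain
```

A value close to 1 supports modeling the relationship with a straight line, although it does not by itself rule out curvature.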
Below is the simple linear regression model of the data, with speed as the independent variable and dist as the response variable. The fitted equation is given after the output.
# Create a linear model predicting stopping distance based on speed
cars.lm <- lm(dist ~ speed, data = cars)
cars.lm
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
dist = −17.579 + 3.932 × speed
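As an illustration of how the fitted equation is used, the sketch below predicts the stopping distance at 15 mph, once with `predict()` and once by plugging the coefficients into the equation by hand. The speed of 15 and the names `new_speed`, `pred_model`, and `pred_manual` are arbitrary choices for this example and are not part of the original analysis.

```r
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Hypothetical speed (mph) chosen for illustration; it lies inside the observed range
new_speed <- data.frame(speed = 15)

# Prediction from the fitted model object
pred_model <- predict(cars.lm, newdata = new_speed)

# Same prediction from the coefficients: intercept + slope * speed
b <- coef(cars.lm)
pred_manual <- b[["(Intercept)"]] + b[["speed"]] * 15

pred_model
pred_manual
```

Both values should agree, which confirms that `predict()` is simply evaluating the fitted line.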
# Summary of the model
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Interpretation of the model output
# Scatter plot with regression line
plot(cars$speed, cars$dist, main = "Stopping Distance vs Speed", xlab = "Speed", ylab = "Stopping Distance")
abline(cars.lm, col = "red")
Statistical Significance: The p-value for the speed coefficient is very small (1.49e-12, t = 9.464), indicating that the relationship between speed and stopping distance is statistically significant at any conventional level.
Practical Significance: The slope implies that each additional 1 mph of speed is associated with about 3.93 more feet of stopping distance, which is a meaningful positive relationship. The negative intercept, however, has no real-world meaning: at a speed of zero the model predicts a stopping distance of about -17.6 feet, which is impossible. The model should therefore not be extrapolated beyond the range of speeds from which it was derived (4 to 25 mph). Even at the lowest observed speed of 4 mph the fitted value is still slightly negative, so predictions near the lower end of the range also call for caution.
Fit of the Model: The Multiple R-squared value is 0.6511, indicating that approximately 65.11% of the variance in stopping distance is explained by speed. The Adjusted R-squared value, which adjusts for the number of predictors in the model, is 0.6438; it is only slightly lower because the model has a single predictor, and it confirms that the model explains roughly two-thirds of the variance in stopping distance.
Residuals: The residuals range from about -29.07 to 43.20, which is wide relative to the residual standard error of 15.38 and suggests some large prediction errors in the model.
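A confidence interval for the slope gives the same conclusion in a different form and also shows how precisely the slope is estimated. The sketch below uses base R's `confint()`; the names `ci` and `slope_ci` are illustrative.

```r
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the intercept and the slope
ci <- confint(cars.lm, level = 0.95)
ci

# The slope is significant at the 5% level if its interval excludes zero
slope_ci <- ci["speed", ]
slope_ci[1] > 0 || slope_ci[2] < 0
```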
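The extrapolation caveat can be checked directly by comparing the range of observed speeds with the model's prediction at a speed of zero. This is a small sketch; `speed_range` and `pred_zero` are names introduced for this example.

```r
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Range of speeds actually observed in the data
speed_range <- range(cars$speed)
speed_range

# The prediction at speed 0 is just the intercept: outside the data and negative
pred_zero <- predict(cars.lm, newdata = data.frame(speed = 0))
pred_zero

# Predictions at the two ends of the observed range, for comparison
predict(cars.lm, newdata = data.frame(speed = speed_range))
```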
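To make the two R-squared figures less of a black box, they can be recomputed from the residual and total sums of squares. The sketch below assumes the usual definitions; `sse`, `sst`, `r2`, and `adj_r2` are illustrative names.

```r
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

sse <- sum(residuals(cars.lm)^2)              # unexplained variation
sst <- sum((cars$dist - mean(cars$dist))^2)   # total variation
r2  <- 1 - sse / sst

n <- nrow(cars)   # number of observations
p <- 1            # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r_squared = r2, adj_r_squared = adj_r2)
```

The results should match the values reported by `summary(cars.lm)`.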
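The residual figures in the summary can be reproduced from the model object, which helps when judging how large a "large" residual is. A minimal sketch, with `res` and `rse` as illustrative names:

```r
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

res <- residuals(cars.lm)
summary(res)   # quartiles of the residuals; the mean is essentially zero by construction

# Residual standard error: sqrt(SSE / residual degrees of freedom)
rse <- sqrt(sum(res^2) / df.residual(cars.lm))
rse
```

Comparing individual residuals with `rse` gives a rough sense of scale: a residual several times larger than `rse` would be unusual under the model.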
par(mfrow=c(2,2))
plot(cars.lm)
Below is the analysis of the residual plots above:
Residuals vs Fitted: There is no strong pattern in the residuals, which suggests that a linear relationship is a reasonable description across the range of fitted values. However, a few outliers are present.
Normal Q-Q Plot: The residuals largely follow the theoretical line, indicating that they are approximately normally distributed. There are a few deviations at the high end (the right side of the plot), which point to some unusually large positive residuals.
Scale-Location: The spread of residuals appears roughly constant across the range of fitted values, which is consistent with homoscedasticity. Again, a few outliers are visible.
Residuals vs Leverage: Most data points have low leverage, but a few points have higher leverage, which could potentially be influential. The Cook’s distance lines don’t indicate any points with unduly large influence on the model.
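The visual impressions above can be complemented with numbers. The sketch below runs a Shapiro-Wilk test on the residuals and lists Cook's distances above a cutoff of 4/n. The cutoff is a common rule of thumb assumed for this example, not something taken from the original analysis, and `sw`, `cd`, `threshold`, and `flagged` are illustrative names.

```r
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: a small p-value is evidence against normally distributed residuals
sw <- shapiro.test(residuals(cars.lm))
sw

# Cook's distance for every observation
cd <- cooks.distance(cars.lm)

# Flag observations above the 4/n rule of thumb (an assumption of this sketch)
threshold <- 4 / nrow(cars)
flagged <- which(cd > threshold)
cd[flagged]
```

Points flagged here are candidates for a closer look rather than automatic removals; the plots and the tests are best read together.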
Overall, the model seems to fit the data reasonably well, though the outliers are worth investigating further to determine whether they should be kept in the model.
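One way to follow up on the outliers is to list the observations with the largest standardized residuals and check how much the coefficients change when they are left out. This is only a sensitivity sketch: the choice of the three most extreme points and the names `std_res`, `top`, and `cars.lm.trim` are assumptions made for illustration.

```r
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Standardized residuals; values beyond about +/-2 deserve a closer look
std_res <- rstandard(cars.lm)
top <- order(abs(std_res), decreasing = TRUE)[1:3]

# The three most extreme observations with their fitted values
data.frame(cars[top, ], fitted = fitted(cars.lm)[top], std_resid = std_res[top])

# Sensitivity check: refit without those observations and compare coefficients
cars.lm.trim <- lm(dist ~ speed, data = cars[-top, ])
rbind(all_data = coef(cars.lm), trimmed = coef(cars.lm.trim))
```

If the coefficients barely move, the outliers have little effect on the conclusions; a large shift would argue for reporting both fits.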