Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
pkges <- c("ggplot2", "dplyr")
# Loop through the packages: install any that are missing, then load each one
for (p in pkges) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p) # Install only when the package is not already available
  }
  library(p, character.only = TRUE) # Load the package
}
There is no need to load the dataset because ‘cars’ is built into R.
The dataset has 2 columns and 50 rows. The explanatory variable used in this model is ‘speed’ and the response variable is ‘dist’ (stopping distance).
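The stated dimensions and column names can be confirmed directly. The following is a minimal sketch that uses only base R on the built-in `cars` data; the variable names `dims` and `col_names` are introduced here for illustration and are not part of the original analysis.

```r
# `cars` ships with base R (datasets package), so nothing has to be imported.
dims <- dim(cars)         # c(number of rows, number of columns)
col_names <- names(cars)  # original column names, before any renaming

print(dims)
print(col_names)
str(cars)                 # compact overview of column types and first values
```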
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
# Rename the column names
Cars_DF <- cars %>%
rename("SPEED" = "speed", "DISTANCE" = "dist")
Cars_DF
## SPEED DISTANCE
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## 11 11 28
## 12 12 14
## 13 12 20
## 14 12 24
## 15 12 28
## 16 13 26
## 17 13 34
## 18 13 34
## 19 13 46
## 20 14 26
## 21 14 36
## 22 14 60
## 23 14 80
## 24 15 20
## 25 15 26
## 26 15 54
## 27 16 32
## 28 16 40
## 29 17 32
## 30 17 40
## 31 17 50
## 32 18 42
## 33 18 56
## 34 18 76
## 35 18 84
## 36 19 36
## 37 19 46
## 38 19 68
## 39 20 32
## 40 20 48
## 41 20 52
## 42 20 56
## 43 20 64
## 44 22 66
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85
A scatter plot is used to check whether a linear relationship exists between the predictor and the response. The plot indicates a linear relationship between speed and distance: as speed increases, the stopping distance also increases.
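The visual impression of a linear trend can be backed by a number. As a rough sketch, the Pearson correlation between the two columns measures the strength of the linear association; it is computed here on the original `cars` columns (`speed`, `dist`) so the snippet runs on its own, and the names `r` and `ct` are illustrative only.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
print(round(r, 3))

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(cars$speed, cars$dist)
print(ct$p.value)
```

For a simple linear regression with one predictor, the square of this correlation equals the Multiple R-squared reported by `summary()` for the fitted model.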
par(bg="gray")
plot(Cars_DF, xlab = "SPEED (MPH)", ylab = "DISTANCE (FT)",
col="purple", las = 1, main = "STOPPING DISTANCE vs SPEED")
grid()
Linear_model <- lm(DISTANCE ~ SPEED, data = Cars_DF)
Linear_model
##
## Call:
## lm(formula = DISTANCE ~ SPEED, data = Cars_DF)
##
## Coefficients:
## (Intercept) SPEED
## -17.579 3.932
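One way to read these coefficients is to plug a speed into the fitted line, distance = intercept + slope × speed. The sketch below does this by hand for an arbitrarily chosen speed of 15 mph and compares the result with `predict()`. It refits the same model on the original `cars` columns so that it is self-contained; `fit`, `b`, `manual`, and `via_predict` are illustrative names, not objects from the analysis above.

```r
# Same model as above, fitted on the original column names
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)

# Prediction by hand: intercept + slope * speed
manual <- unname(b["(Intercept)"] + b["speed"] * 15)

# The same prediction through predict()
via_predict <- unname(predict(fit, newdata = data.frame(speed = 15)))

print(manual)
print(via_predict)
```

Both values should agree, at roughly 41.4 ft. The negative intercept has no physical meaning here, since a speed of 0 mph lies outside the observed range of 4 to 25 mph.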
par(bg="gray")
plot(Cars_DF, xlab = "SPEED (MPH)", ylab = "DISTANCE (FT)",
col="purple", las = 1, main = "STOPPING DISTANCE vs SPEED")
abline(Linear_model, col="green")
grid()
Outlined below is a summary of the Linear Model.
summary(Linear_model)
##
## Call:
## lm(formula = DISTANCE ~ SPEED, data = Cars_DF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## SPEED 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
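The two headline quality measures in this summary can be reproduced from the residuals, which helps in interpreting them. The sketch below assumes the usual definitions, R-squared = 1 - SSE/SST and residual standard error = sqrt(SSE / (n - 2)) for a model with two estimated coefficients. It refits the model on the original `cars` columns, and all variable names are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

sse <- sum(resid(fit)^2)                     # unexplained (residual) sum of squares
sst <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
n <- nrow(cars)

r_squared <- 1 - sse / sst    # share of variance explained by speed
rse <- sqrt(sse / (n - 2))    # residual standard error on n - 2 degrees of freedom

print(round(r_squared, 4))
print(round(rse, 2))
```

These should match the Multiple R-squared (0.6511) and the residual standard error (15.38 on 48 degrees of freedom) shown above.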
par(bg="gray")
plot(fitted(Linear_model), resid(Linear_model),
  xlab = "FITTED VALUES", ylab = "RESIDUALS",
  main = "LINEAR MODEL (RESIDUALS vs FITTED)", col = "purple")
abline(h = 0, col = "green") # Horizontal reference line at zero
par(bg="gray")
qqnorm(resid(Linear_model), col = "purple")
qqline(resid(Linear_model), col = "green")
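A Q-Q plot is judged by eye, so a formal test can serve as a complement. As one possible check (not part of the textbook procedure, and offered only as a sketch), the Shapiro-Wilk test from base R's `stats` package tests the null hypothesis that the residuals come from a normal distribution. The model is refitted on the original `cars` columns so the snippet is self-contained; `fit` and `sw` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(fit))
print(sw)
```

With only 50 observations the test has limited power, so its result is best read together with the Q-Q plot rather than in place of it.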
par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0),
mar = c(4.1, 4.1, 2.1, 1.1))
plot(Linear_model, col = "purple")
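My analysis of the linear model indicates a positive, approximately linear relationship between the explanatory variable (speed) and the response variable (stopping distance). The estimated slope is 3.93 ft of additional stopping distance per mph, it is highly significant (p = 1.49e-12), and speed explains about 65% of the variance in stopping distance (R-squared = 0.6511). The residual distribution appears approximately normal, although the residual range (-29.07 to 43.20, with a median of -2.27) hints at a mild right skew.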