Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook Chapter 3 (visualization, quality evaluation of the model, and residual analysis).
library(tidyverse)
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
plot(cars[, "speed"], cars[, "dist"],
     main = "Relationship Between Speed and Distance",
     xlab = "Speed", ylab = "Distance")
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm")
There appears to be a positive, roughly linear relationship between speed and stopping distance: distance increases as speed increases.
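As a quick numerical check on this visual impression (a small addition to the original plots), the Pearson correlation between the two variables can be computed; for a simple linear regression its square equals the Multiple R-squared reported in the model summary below.
# Pearson correlation between speed and stopping distance
cor(cars$speed, cars$dist)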
# set.seed() is not strictly required here, since lm() is deterministic
set.seed(123)
model1 <- lm(dist ~ speed, data = cars)
summary(model1)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
With a p-value well below 0.05, speed is a statistically significant predictor of stopping distance. The multiple R-squared of 0.6511 means that about 65% of the variance in distance is explained by speed (the adjusted R-squared of 0.6438 applies a small penalty for model complexity), which suggests the linear model is a reasonably good fit. The median residual of -2.272 is also close to zero, another favorable sign. Finally, the t-value for speed is 9.464, meaning the coefficient estimate is roughly 9.5 times larger than its standard error; as a rule of thumb, a standard error five to ten times smaller than its coefficient indicates a well-determined estimate.
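The quantities discussed above can also be extracted from the fitted model programmatically rather than read off the printed summary; this supplementary snippet (not part of the original write-up) uses only standard R accessors.
# Coefficient table: estimates, standard errors, t values, p-values
coef(summary(model1))

# Adjusted R-squared and residual standard error
summary(model1)$adj.r.squared
summary(model1)$sigma

# 95% confidence intervals for the intercept and the speed coefficient
confint(model1, level = 0.95)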
# Base-R diagnostic plots for the fitted model
plot(model1)

# ggplot() fortifies the lm object, so the model-frame columns speed and dist are available
ggplot(model1, aes(x = speed, y = dist)) + geom_point() + geom_smooth(method = "lm")
ggplot(model1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residual vs. Fitted Values Plot") +
  xlab("Fitted values") +
  ylab("Residuals")
# Base-R version of the residual vs. fitted plot
plot(fitted(model1), resid(model1))

# Scatter plot of the raw data with the fitted regression line overlaid
plot(dist ~ speed, data = cars)
abline(model1)
ggplot(data = model1, aes(x = .resid)) +
  geom_histogram(binwidth = 10) +
  xlab("Residuals")

# Base-R histogram of the residuals
hist(model1$residuals, breaks = 10)
# Normal Q-Q plot of the residuals
qqnorm(resid(model1))
qqline(resid(model1))

# All four base-R diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(model1)
The residual plots give a mixed picture of the fit. In the Residual vs. Fitted Values plot, the residuals are scattered around the zero line with no discernible pattern, which supports the linearity and constant-variance assumptions. The histogram of residuals, however, is right-skewed (positively skewed), with its center to the left of zero, suggesting the residuals are not normally distributed. The Q-Q plot reinforces this: its right tail diverges from the reference line. Taken altogether, while the initial summary measures of the regression model show a promising fit, the plots indicate that the residuals are not normally distributed, so statistical inferences made from the model may be biased or invalid. The model could possibly be improved by log-transforming the dependent variable dist or by removing outliers from the dataset, as sketched below.
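To follow up on that suggestion, here is a minimal sketch (not part of the original analysis) that formally tests residual normality with a Shapiro-Wilk test and fits a log-transformed alternative; model2 is a hypothetical name introduced only for this comparison.
# Formal normality check on the residuals of the original model
shapiro.test(resid(model1))

# Candidate improvement: log-transform the response variable
model2 <- lm(log(dist) ~ speed, data = cars)
summary(model2)

# Re-run the diagnostic plots and the normality test for the transformed model
par(mfrow = c(2, 2))
plot(model2)
shapiro.test(resid(model2))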