Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
#plot(cars$speed, cars$dist)
library(ggplot2)   # load ggplot2 in case it is not already attached
p1 <- ggplot(cars, aes(speed, dist)) +
  geom_point() +
  labs(title = "Stopping distance ~ speed",
       x = "Speed (mph)",
       y = "Stopping distance (ft)")
p1
attach(cars)                    # make speed and dist available by name
cars.lm <- lm(dist ~ speed)     # simple linear regression of stopping distance on speed
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The least-squares regression line from the fitted model is: \[ \widehat{\text{stoppingDistance}} = -17.5791 + 3.9324 \times \text{speed} \]
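As a quick sanity check of this equation (a sketch; the speed of 21 mph used below is just an illustrative value), the coefficients can be read directly off the fitted object and used for a point prediction:
coef(cars.lm)                                        # intercept and slope reported in the summary
predict(cars.lm, newdata = data.frame(speed = 21))   # predicted stopping distance at 21 mph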
The median of the residuals is fairly close to zero, and the quartiles and min/max values are roughly symmetric in magnitude. The standard error for speed, 0.4155, is 9.46 times smaller than the corresponding coefficient, 3.9324; that ratio is the reported t-value. The p-value for speed, which gives the probability of seeing a t-value at least this extreme if speed had no real effect, is \(1.49 \times 10^{-12}\), which is very small. The corresponding probability for the intercept is 0.0123 (\(\approx 1.2\%\)). The \(R^2 = 65.11\%\) means that 65.11% of the variability in stopping distance is explained by the speed variable.
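The numbers quoted above can be reproduced by hand from the coefficient table (a small sketch; coef(summary(...)) returns the Estimate / Std. Error / t value / Pr(>|t|) matrix):
coefs <- coef(summary(cars.lm))
coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]   # ~9.46, the t-value for speed
cor(cars$speed, cars$dist)^2                                # ~0.6511, the multiple R-squared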
#plot(cars.lm)
#plot(speed,dist)
#abline(cars.lm)
# Overlay the least-squares regression line on the scatterplot
p1 + geom_smooth(method = 'lm', se = FALSE, color = 'blue')
#plot(fitted(cars.lm), resid(cars.lm))
#
#plot(cars.lm$fitted.values, cars.lm$residuals, xlab='Fitted Values', ylab='Residuals')
#abline(0,0)
#
#plot_ss(cars.lm$fitted.values, cars.lm$residuals, data = cars)
# plot_ss() is a lab helper function (e.g. from the statsr/DATA606 course packages);
# here it plots the residuals against the fitted values and shades the squared residuals
plot_ss(cars.lm$fitted.values, cars.lm$residuals, data = cars, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -3.672e-15 8.543e-17
##
## Sum of Squares: 11353.52
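The Sum of Squares reported by plot_ss is the residual sum of squares of the fitted line, so it should match the value computed directly from the model (a quick check, not part of the original lab code):
sum(resid(cars.lm)^2)   # residual sum of squares, ~11353.5
deviance(cars.lm)       # same quantity via the deviance() accessor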
# Histogram of the residuals as a rough check for normality
hist(cars.lm$residuals)
# Normal Q-Q plot of the residuals with a reference line
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
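As a complement to the histogram and Q-Q plot, a formal normality test can be applied to the residuals (a sketch; shapiro.test() is in base R's stats package and is not part of the original analysis):
shapiro.test(resid(cars.lm))   # a large p-value is consistent with roughly normal residuals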
We can spot a few outliers; however, the residuals show a reasonably normal distribution. It is not a perfect fit, but it is a good starting point for predicting stopping distances, and I would say that the linear model seems appropriate.
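To make the outlier remark concrete, one common (illustrative) rule of thumb is to flag observations whose standardized residuals exceed 2 in absolute value:
which(abs(rstandard(cars.lm)) > 2)   # rows of cars with unusually large residuals
# plot(cars.lm) would also highlight the most extreme/influential points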