library(tidyverse)
library(psych)
library(car)
library(olsrr)
Introduction
This study examines the relationship between a car's speed and its stopping distance. The speed variable is measured in mph and the dist variable in feet. The purpose of this section is to build a model for stopping distance as a function of speed. The cars dataset consists of 50 observations of two variables: speed and dist.
Cars Dataset
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
describe(cars)
## vars n mean sd median trimmed mad min max range skew
## speed 1 50 15.40 5.29 15 15.47 5.93 4 25 21 -0.11
## dist 2 50 42.98 25.77 36 40.88 23.72 2 120 118 0.76
## kurtosis se
## speed -0.67 0.75
## dist 0.12 3.64
Visualize the Data
ggplot(cars, aes(speed, dist)) +
geom_point(aes(color=speed, alpha = speed)) +
xlab("Speed (mph)") +
ylab("Stopping distance (ft)") +
ggtitle("Speed vs Stopping Distance") +
theme_light()
The scatter plot shows that stopping distance tends to increase as speed increases.
The Linear Model
\[y = b_0 + b_1x_1\]
where \(x_1\) is the input to the system, \(b_0\) is the y-intercept of the line, \(b_1\) is the slope, and \(y\) is the output the model predicts.
Using R's lm() function, we can fit a linear model to our data. R computes the values of \(b_0\) and \(b_1\) using the least squares method.
cars.lm <- lm(dist ~ speed, cars)
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
In this model, the \(y\)-intercept is \(b_0 = -17.5791\) and the slope is \(b_1 = 3.9324\).
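As a cross-check, the same two coefficients can be recovered from the closed-form least squares formulas for a single predictor. The following is a minimal sketch using base R only; the names `b0_manual`, `b1_manual`, and `fit` are illustrative and do not appear elsewhere in this report.

```r
# Sketch: reproduce the least squares estimates by hand (base R only).
x <- cars$speed
y <- cars$dist

# Slope: sample covariance of x and y divided by the sample variance of x.
b1_manual <- cov(x, y) / var(x)

# Intercept: chosen so the line passes through the point of means.
b0_manual <- mean(y) - b1_manual * mean(x)

# Fit the same model with lm() for comparison.
fit <- lm(dist ~ speed, data = cars)
manual <- c(b0_manual, b1_manual)

# unname() drops the coefficient labels so only the values are compared.
all.equal(manual, unname(coef(fit)))
```

If the two approaches agree, `all.equal()` returns `TRUE`, which suggests that lm() is doing nothing more exotic here than ordinary least squares.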
intercept <- coef(cars.lm)[1]
slope <- coef(cars.lm)[2]
# Plot the observed data and overlay the fitted least squares line
ggplot(cars, aes(speed, dist)) +
geom_point() +
geom_abline(slope = slope, intercept = intercept)
The linear regression model is: \[dist = -17.5791 + 3.9324 \times speed\]
Model Evaluation
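To illustrate how the fitted equation is used, the sketch below predicts stopping distance for two speeds chosen only for illustration (10 and 20 mph, both inside the observed range of 4 to 25 mph). The names `fit`, `new_speeds`, `pred`, and `by_hand` are hypothetical helpers introduced for this example.

```r
# Sketch: predict stopping distance from the fitted line.
fit <- lm(dist ~ speed, data = cars)

# Illustrative speeds in mph; both lie within the observed data range.
new_speeds <- data.frame(speed = c(10, 20))

# predict() evaluates b0 + b1 * speed for each row of newdata.
pred <- predict(fit, newdata = new_speeds)
pred

# The same calculation by hand for 20 mph, using the stored coefficients.
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * 20
by_hand  # about 61.07 feet
```

Because the intercept is negative, the line returns negative distances for very low speeds, so predictions outside the observed speed range should be treated with caution.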
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
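The model estimates that stopping distance increases by about 3.93 feet for each additional 1 mph of speed.
The least squares line explains about 65% of the variance in stopping distance (\(R^2 = 0.6511\), adjusted \(R^2 = 0.6438\)).
The \(p\)-values indicate that both the intercept (\(p = 0.0123\)) and the slope (\(p = 1.49 \times 10^{-12}\)) are statistically significant at the 0.05 level.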
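The multiple and adjusted \(R^2\) values printed by summary() can be recomputed from the residuals, which makes the difference between the two explicit. This is a minimal sketch; `ss_res`, `ss_tot`, `r2`, and `adj_r2` are illustrative names, and `p` is the number of predictors (one, speed).

```r
# Sketch: recompute R-squared and adjusted R-squared by hand.
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation the line does not explain.
ss_res <- sum(residuals(fit)^2)

# Total sum of squares: variation of dist around its mean.
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# Multiple R-squared: share of the total variation explained by the model.
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes for the number of predictors p.
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(c(r2 = r2, adj_r2 = adj_r2), 4)  # 0.6511 and 0.6438
```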
Residuals
If the linear model is a good fit for the data, we would expect the residuals to be normally distributed around a mean of zero. In the output of summary(), we would therefore expect the median residual to be near zero and the quartiles to be roughly symmetric about it.
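These expectations can also be checked numerically. The sketch below inspects the center of the residuals and, as one possible formal check that is not part of the original analysis, applies the Shapiro-Wilk test from base R. The names `fit` and `res` are illustrative.

```r
# Sketch: numeric checks on the residuals of the fitted model.
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# For a least squares fit with an intercept, the mean residual is zero
# up to floating-point error, so the median is the more informative value.
mean(res)
median(res)  # matches the Median in the summary() residual table

# Shapiro-Wilk test (an added assumption, not the author's method):
# a small p-value suggests the residuals depart from normality.
shapiro.test(res)
```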
ols_plot_resid_fit(cars.lm)
Looking at the residual plot, we see that the residuals tend to decrease as we move right.
crPlots(cars.lm)
The component-plus-residual (CR) plot also shows that the residuals tend to deviate from the linear regression line.
ols_plot_resid_qq(cars.lm)
The Q-Q plot also shows that there are some outlier values.
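To see which observations sit behind the outlying points, the standardized residuals can be inspected directly. This is a rough sketch: the cutoff of 2 is a common rule of thumb assumed here rather than a value taken from the analysis above, and `std_res`, `flagged`, and `worst` are illustrative names.

```r
# Sketch: list the observations with unusually large residuals.
fit <- lm(dist ~ speed, data = cars)

# Standardized residuals: raw residuals scaled by their estimated
# standard deviation, so they are comparable across observations.
std_res <- rstandard(fit)

# Assumed rule of thumb: flag observations with |standardized residual| > 2.
flagged <- which(abs(std_res) > 2)
cars[flagged, ]

# The single largest raw residual (the Max of 43.201 in summary()).
worst <- which.max(abs(residuals(fit)))
cars[worst, ]  # row 49: speed 24, dist 120
```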
ols_plot_resid_hist(cars.lm)
Ref: Linear Regression Using R, David J. Lilja; olsrr residual diagnostics vignette, https://cran.r-project.org/web/packages/olsrr/vignettes/residual_diagnostics.html