Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
# Load the cars dataset into a variable as a data frame
cars <- datasets::cars
# Let's investigate the data
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00  
library(psych)  # describe() comes from the psych package
describe(cars)
##       vars  n  mean    sd median trimmed   mad min max range  skew kurtosis
## speed    1 50 15.40  5.29     15   15.47  5.93   4  25    21 -0.11    -0.67
## dist     2 50 42.98 25.77     36   40.88 23.72   2 120   118  0.76     0.12
##         se
## speed 0.75
## dist  3.64
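Before fitting anything, a quick scatter plot gives a first look at the relationship between the two variables. This is just a minimal base-R sketch; the axis labels reflect the documented units of the cars data (mph and feet).
# Quick visual check: stopping distance vs. speed (base R)
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "cars dataset: distance vs. speed")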
# Use lm() to fit the linear regression model of stopping distance on speed
cars.lm <- lm(dist ~ speed, data=cars)
# The summary gives the coefficients (intercept and slope) along with fit statistics for evaluating the relationship between distance and speed.
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
The key figure is the p-value for speed (1.49e-12), which is far below 0.05, so the relationship is statistically significant. The intercept is -17.5791 and the slope is 3.9324, meaning the predicted stopping distance increases by about 3.93 feet for each additional mile per hour of speed.
The adjusted R^2 is 0.6438, so speed explains roughly 64% of the variation in stopping distance.
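As a quick sanity check of the fitted equation, dist ≈ -17.58 + 3.93 × speed, we can predict the stopping distance at a chosen speed; at 20 mph, for example, the model predicts roughly -17.58 + 3.93 × 20 ≈ 61 feet. A minimal sketch using predict():
# Predicted stopping distance at 20 mph (about 61 ft from the fitted equation)
predict(cars.lm, newdata = data.frame(speed = 20))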
# ggplot with the regression line added by stat_smooth, without extracting the slope and intercept manually
library(ggplot2)
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
# ggplot with the slope and intercept extracted from cars.lm
intercept <- coef(cars.lm)[1]
slope <- coef(cars.lm)[2]
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_abline(slope = slope, intercept = intercept)
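It can also help to see how precise the coefficient estimates are. As a small supplementary check (not part of the original write-up), confint() reports confidence intervals for both coefficients:
# 95% confidence intervals for the intercept and slope of cars.lm
confint(cars.lm)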
In order to evaluate whether the linear model is reliable, three conditions need to be checked: (a) linearity, (b) nearly normal residuals, and (c) constant variability of the residuals.
Let's check for linearity and constant variability with a residuals-vs-fitted plot:
ggplot(data = cars.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
There is no obvious pattern in the plot above, and the spread of the residuals stays roughly constant across the fitted values, which supports both linearity and constant variability.
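For completeness, base R can produce the standard diagnostic panels directly from the fitted model; this is a minimal sketch of that alternative route, not part of the original ggplot-based analysis.
# Standard lm diagnostics: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(cars.lm)
par(mfrow = c(1, 1))  # reset the plotting layout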
Check for Normal Residuals:
ggplot(data = cars.lm, aes(x = .resid)) +
  geom_histogram(binwidth = 10) +
  xlab("Residuals")
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
The residuals appear roughly normally distributed, although the histogram and Q-Q plot show a few large positive residuals in the right tail. The residuals do not need to be perfectly normal, only approximately so, and these deviations are mild. Taken together with the earlier checks, I would conclude that the linear model is a reasonable fit for this data.
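If a more formal check of residual normality is wanted, a Shapiro-Wilk test can supplement the plots. This is only an optional sketch; with n = 50 the visual assessment above usually carries more weight.
# Optional: Shapiro-Wilk test on the residuals; a small p-value would suggest departure from normality
shapiro.test(resid(cars.lm))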