Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
As the str() output shows, the data frame has 50 observations of 2 variables, speed and dist.
Scatter plot
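# speed is the predictor (x) and dist is the response (y), matching the model dist ~ speed
ggplot(cars, aes(x = speed, y = dist)) + geom_point() + ggtitle("Association between speed and stopping distance")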
cor(cars)
##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000
corrplot(cor(cars), type = 'upper')
smod<-lm(dist ~ speed, data = cars)
summary(smod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
As the summary of the linear regression model shows:
slope = 3.9324
intercept = -17.5791
Therefore, the fitted regression equation is dist = -17.5791 + 3.9324 * speed.
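To illustrate how the fitted equation is used, the sketch below compares predict() with the formula evaluated by hand. The three speeds are hypothetical values chosen only for illustration, and the model is refit so the block runs on its own.

```r
# Refit the model so this block is self-contained
smod <- lm(dist ~ speed, data = cars)

# Extract the fitted coefficients by name
b0 <- coef(smod)[["(Intercept)"]]
b1 <- coef(smod)[["speed"]]

# Hypothetical speeds (not taken from the assignment), for illustration only
new_speeds <- data.frame(speed = c(10, 15, 20))

# Predicted stopping distance from predict() and from the equation by hand
pred_model  <- predict(smod, newdata = new_speeds)
pred_manual <- b0 + b1 * new_speeds$speed

print(cbind(new_speeds, pred_model, pred_manual))
```

Both columns should agree, since predict() evaluates the same linear equation.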
plot(cars, xlab = "Speed", ylab = "Distance")
abline(smod) # overlay the fitted regression line
ols_plot_resid_hist(smod) # histogram of the residuals
ols_plot_resid_fit(smod) # residuals vs. fitted values (check for heteroscedasticity)
ols_plot_cooksd_chart(smod) # Cook's distance for identifying influential observations
The histogram of the residuals suggests that their distribution is slightly right-skewed.
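The visual impression of skewness can be backed up with numbers. The sketch below is one possible check and not part of the original analysis: it computes the moment-based sample skewness of the residuals by hand and runs the base R shapiro.test() for normality.

```r
# Refit the model so this block is self-contained
smod <- lm(dist ~ speed, data = cars)
res  <- residuals(smod)

# Moment-based sample skewness; OLS residuals with an intercept have mean zero,
# so raw moments equal central moments here. Positive values mean right skew.
skew <- mean(res^3) / mean(res^2)^1.5

# Shapiro-Wilk normality test on the residuals
sw <- shapiro.test(res)

print(skew)
print(sw)
```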
The p-value for the speed coefficient is 1.49e-12, which means that stopping distance is statistically significantly associated with speed.
The multiple R-squared is 0.6511, which means that the model explains about 65% of the variation in stopping distance. The adjusted R-squared, which penalizes for the number of predictors, is 0.6438.
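As a check on where these figures come from, the sketch below recomputes R-squared and adjusted R-squared from the sums of squares, using the standard textbook formulas. The variable names are illustrative.

```r
# Refit the model so this block is self-contained
smod <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares
ss_res <- sum(residuals(smod)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the variation in dist explained by the model
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared with n observations and p predictors
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```

With a single predictor, R-squared also equals the squared correlation between speed and dist.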
Distance and speed are strongly correlated (r = 0.81). According to the slope, each 1-unit increase in speed is associated with an average increase of about 3.93 units in stopping distance.
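The correlation coefficient and the regression slope are related but are not the same quantity. In simple linear regression, the slope equals the correlation multiplied by the ratio of the standard deviations, which the sketch below verifies.

```r
# Pearson correlation between speed and dist
r <- cor(cars$speed, cars$dist)

# Slope implied by the correlation: r * sd(y) / sd(x)
slope_from_r <- r * sd(cars$dist) / sd(cars$speed)

# Slope estimated by lm()
slope_lm <- coef(lm(dist ~ speed, data = cars))[["speed"]]

print(c(r = r, slope_from_r = slope_from_r, slope_lm = slope_lm))
```

The correlation is unitless and bounded by 1 in absolute value, while the slope is expressed in units of dist per unit of speed.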
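The residuals-versus-fitted plot from ols_plot_resid_fit() suggests that the model has a heteroscedasticity problem: the spread of the residuals does not appear constant across the fitted values.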
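A formal test can complement the visual check. The sketch below is an assumption about how one might do this, not part of the original analysis: it computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R, by regressing the squared residuals on speed and comparing n times the auxiliary R-squared to a chi-squared distribution with 1 degree of freedom.

```r
# Refit the model so this block is self-contained
smod <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor
aux_data <- data.frame(e2 = residuals(smod)^2, speed = cars$speed)
aux <- lm(e2 ~ speed, data = aux_data)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary regression
bp_stat <- nrow(cars) * summary(aux)$r.squared

# p-value from the chi-squared distribution (1 df, one predictor)
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value would support the heteroscedasticity reading of the plot, and a large one would suggest the visual pattern is weak evidence.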
The Cook’s distance chart flags two influential observations in the data, located at rows 23 and 49.
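The flagged rows can also be listed numerically with the base R function cooks.distance(). The cutoff of 4/n used below is a common rule of thumb and is an assumption here, so the flagged rows may differ under other cutoffs.

```r
# Refit the model so this block is self-contained
smod <- lm(dist ~ speed, data = cars)

# Cook's distance for every observation
cd <- cooks.distance(smod)

# Rule-of-thumb cutoff (assumed): 4 divided by the number of observations
threshold <- 4 / nrow(cars)

# Rows whose Cook's distance exceeds the cutoff, with their data values
flagged <- which(cd > threshold)
print(cbind(cars[flagged, ], cooks_d = cd[flagged]))
```

Refitting the model without the flagged rows and comparing coefficients is one way to gauge how much they influence the fit.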