Data605 - HW11 Regression
Assignment
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
Initial Setup
I will use Tidymodels to do this homework because I would like to practice the mechanics of the framework and also because I can expand on it if necessary.
#libraries to use
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.0 ✔ recipes 1.0.1
## ✔ dials 1.0.0 ✔ rsample 1.1.0
## ✔ dplyr 1.0.9 ✔ tibble 3.1.8
## ✔ ggplot2 3.3.6 ✔ tidyr 1.2.0
## ✔ infer 1.0.3 ✔ tune 1.0.0
## ✔ modeldata 1.0.0 ✔ workflows 1.0.0
## ✔ parsnip 1.0.1 ✔ workflowsets 1.0.0
## ✔ purrr 0.3.4 ✔ yardstick 1.0.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ✔ stringr 1.4.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
rm(list = ls())
Load data
# We will convert to a tibble to use it under tidymodels
df <- as_tibble(cars)
head(df)
## # A tibble: 6 × 2
##   speed  dist
##   <dbl> <dbl>
## 1     4     2
## 2     4    10
## 3     7     4
## 4     7    22
## 5     8    16
## 6     9    10
Visualize data
plot(df$speed, df$dist, main="Stopping Distance vs Speed",
     xlab="Car Speed", ylab="Stopping Distance", pch=19)
We can see a fairly linear relationship between car speed and stopping distance, which suggests that a linear regression model may be a good choice for this data.
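As a quick numeric check on the visual impression, the Pearson correlation between the two columns can be computed directly. The following is a minimal sketch in base R on the built-in `cars` data; a value near 1 would support the choice of a linear model.

```r
# Pearson correlation between speed and stopping distance (built-in cars data)
r <- cor(cars$speed, cars$dist)
r

# In simple linear regression, r^2 equals the multiple R-squared of the fit
r^2
```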
Linear Regression
Fit the regression
lm_fit <- linear_reg() %>%
  set_engine(engine="lm") %>%
  fit(dist ~ speed, data = df)
Review regression fitting results
summary(lm_fit$fit)
##
## Call:
## stats::lm(formula = dist ~ speed, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
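We observe an adjusted R-squared of 0.6438 (multiple R-squared 0.6511), so speed alone explains roughly 65% of the variance in stopping distance. This is a reasonable fit, and it tells us there is a fairly strong linear relationship between speed and stopping distance.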
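To make the two R-squared figures concrete, they can be reproduced from the residual and total sums of squares. This is a sketch that refits the same formula with base `lm()` so it runs on its own; the names `rss`, `tss`, `r2` and `adj_r2` are illustrative and not part of the original analysis.

```r
# Refit with base lm() so the sketch is self-contained (same formula as above)
fit <- lm(dist ~ speed, data = cars)

rss <- sum(resid(fit)^2)                     # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
n <- nrow(cars)                              # number of observations
p <- 1                                       # number of predictors (speed)

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r_squared = r2, adj_r_squared = adj_r2)
```

The adjusted value is slightly lower because it penalizes the model for each predictor it uses; with a single predictor and 50 observations the penalty is small.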
If we review the statistics for the predictor, in this case speed, we see that its coefficient has a t-value of 9.464 and a p-value of 1.49e-12, far below any conventional significance level. This tells us that speed is a statistically significant predictor of stopping distance.
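The t-value and p-value for speed can be recomputed from the estimate and its standard error, which shows where the numbers in the summary come from. This is a minimal base R sketch; the 95% confidence interval at the end is an addition for illustration and does not appear in the summary output.

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))  # matrix: Estimate, Std. Error, t value, Pr(>|t|)

est <- coefs["speed", "Estimate"]
se  <- coefs["speed", "Std. Error"]

# t statistic is the estimate divided by its standard error
t_val <- est / se

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_val <- 2 * pt(abs(t_val), df = df.residual(fit), lower.tail = FALSE)

c(t_value = t_val, p_value = p_val)

# 95% confidence interval for the slope
ci <- confint(fit, "speed", level = 0.95)
ci
```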
ggplot(data = df,
       mapping = aes(x = lm_fit$fit$fitted.values, y = dist)) +
  geom_point(color = '#006EA1', alpha = 0.25) +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Fitted vs Actual Stopping Distance',
       x = 'Fitted Stopping Distance',
       y = 'Actual Stopping Distance')
A plot of fitted vs actual values shows the points lying reasonably close to the 45-degree line, which tells us the fitted values are close to the actual values.
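One way to put a number on "close to the 45-degree line" is the root mean squared error of the fit, and the same model can also be used to predict stopping distance at speeds of interest. The sketch below uses base R; the speeds in `new_speeds` are hypothetical values chosen for illustration.

```r
fit <- lm(dist ~ speed, data = cars)

# Root mean squared error of the fitted values, in the units of dist
rmse <- sqrt(mean(resid(fit)^2))
rmse

# Hypothetical speeds at which to predict stopping distance
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals
preds <- predict(fit, newdata = new_speeds,
                 interval = "prediction", level = 0.95)
cbind(new_speeds, preds)
```

The prediction intervals are wide relative to the point estimates, which is consistent with a residual standard error of about 15.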
Residual Analysis
res <- resid(lm_fit$fit)
#produce residual vs. fitted plot
plot(fitted(lm_fit$fit), res)
#add a horizontal line at 0
abline(0,0)
#create Q-Q plot for residuals
qqnorm(res)
#add a straight diagonal line to the plot
qqline(res)
#Create density plot of residuals
plot(density(res))
par(mfrow=c(2,2))
plot(lm_fit$fit)
The residual plot shows a larger variance at the high end of the fitted values, which suggests the residual variance is not constant.
This is evidenced again in the Normal Q-Q plot, where the upper tail deviates from the reference line.
Finally, the density plot of the residuals shows that they are fairly normal with a small right skew.
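The visual impressions above can be backed up with a few numeric checks. The following is a sketch in base R only: a Shapiro-Wilk test for normality, a sample skewness computed by hand, and a hand-rolled Breusch-Pagan-style check (the studentized form, n times the R-squared of an auxiliary regression of the squared residuals on speed). The names `aux_df`, `bp_stat` and `bp_p` are illustrative, and a packaged test could be substituted for the hand-rolled one.

```r
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(res)

# Sample skewness: a positive value is consistent with a right skew
skew <- mean((res - mean(res))^3) / mean((res - mean(res))^2)^1.5

# Auxiliary regression of squared residuals on the predictor
aux_df <- data.frame(res2 = res^2, speed = cars$speed)
aux <- lm(res2 ~ speed, data = aux_df)

# Studentized Breusch-Pagan statistic, compared to chi-squared with 1 df
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, skewness = skew, bp_stat = bp_stat, bp_p = bp_p)
```

A small Breusch-Pagan p-value would support the non-constant variance seen in the residual plot, and a small Shapiro-Wilk p-value would support the departure from normality seen in the Q-Q plot.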