Data605 - HW11 Regression

Assignment

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Initial Setup

I will use tidymodels for this homework, both to practice the mechanics of the framework and because the workflow can be extended later if necessary.

#libraries to use

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

## ✔ broom        1.0.0     ✔ recipes      1.0.1
## ✔ dials        1.0.0     ✔ rsample      1.1.0
## ✔ dplyr        1.0.9     ✔ tibble       3.1.8
## ✔ ggplot2      3.3.6     ✔ tidyr        1.2.0
## ✔ infer        1.0.3     ✔ tune         1.0.0
## ✔ modeldata    1.0.0     ✔ workflows    1.0.0
## ✔ parsnip      1.0.1     ✔ workflowsets 1.0.0
## ✔ purrr        0.3.4     ✔ yardstick    1.0.0

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ✔ stringr 1.4.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()

rm(list = ls())

Load data

# We will convert to a tibble to use it under tidymodels
df <- as_tibble(cars)
head(df)

## # A tibble: 6 × 2
##   speed  dist
##   <dbl> <dbl>
## 1     4     2
## 2     4    10
## 3     7     4
## 4     7    22
## 5     8    16
## 6     9    10

Visualize data

plot(df$speed, df$dist, main="Stopping Distance vs Speed",
     xlab="Car Speed", ylab="Stopping Distance", pch=19)

We can see a fairly linear relationship between a car's speed and its stopping distance. This tells us a linear regression model may be a good choice to model the relationship.
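
As a quick numeric companion to the scatter plot, the strength of the linear association can be summarized with the Pearson correlation. This is a minimal sketch using the built-in `cars` data directly rather than the `df` tibble, so it runs on its own; the variable name `r` is just illustrative.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
print(round(r, 4))

# For a model with a single predictor, the squared correlation
# equals the R-squared of the fitted regression line
print(round(r^2, 4))
```

A value close to 1 supports what the plot suggests visually, although correlation alone does not say whether a straight line is the best functional form.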

Linear Regression

Fit the regression

lm_fit <- linear_reg() %>%
  set_engine(engine="lm") %>%
  fit(dist ~ speed, data = df)

Review regression fitting results

summary(lm_fit$fit)

## 
## Call:
## stats::lm(formula = dist ~ speed, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

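We observe an adjusted R-squared of 0.6438 (multiple R-squared of 0.6511), so speed alone explains about 65% of the variance in stopping distance. This is not a terrible fit, and it tells us there is a fairly strong relationship between speed and stopping distance.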
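
To make the R-squared figures less of a black box, they can be recomputed from the residuals. The sketch below refits the model with base `lm()` on the built-in `cars` data (an assumption made so the block is self-contained; it should match the engine fit stored in `lm_fit$fit`). The names `ss_res`, `ss_tot`, `r2`, and `adj_r2` are illustrative.

```r
# Refit the same model with base R (assumed equivalent to lm_fit$fit)
fit <- lm(dist ~ speed, data = cars)

ss_res <- sum(resid(fit)^2)                      # variation left unexplained by the model
ss_tot <- sum((cars$dist - mean(cars$dist))^2)   # total variation in stopping distance
n <- nrow(cars)                                  # number of observations
p <- 1                                           # number of predictors

r2 <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes for the number of predictors

print(round(c(r_squared = r2, adj_r_squared = adj_r2), 4))
```

With only one predictor the adjustment is small, which is why the two values in the summary are so close.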

If we review the statistics for the predictor, in this case speed, we see a t-value of 9.464 with a p-value of 1.49e-12, which is effectively 0. This tells us that speed is a statistically significant predictor of stopping distance.
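
The t-value and p-value in the coefficient table follow directly from the estimate and its standard error. A minimal sketch, again refitting with base `lm()` on `cars` so the block stands alone (the names `est`, `se`, `t_val`, and `p_val` are illustrative):

```r
fit <- lm(dist ~ speed, data = cars)

# Coefficient matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

est <- coefs["speed", "Estimate"]
se  <- coefs["speed", "Std. Error"]

t_val <- est / se                                     # t statistic for H0: slope = 0
p_val <- 2 * pt(-abs(t_val), df = df.residual(fit))   # two-sided p-value

print(c(t_value = t_val, p_value = p_val))

# 95% confidence interval for the slope
print(confint(fit, "speed", level = 0.95))
```

The confidence interval is a useful complement to the p-value because it shows the plausible range for the change in stopping distance per unit of speed.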

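# Fitted vs actual plot: the x-axis holds the model's fitted values, not speed
ggplot(data = df,
       mapping = aes(x = lm_fit$fit$fitted.values, y = dist)) +
  geom_point(color = '#006EA1', alpha = 0.25) +
  # 45-degree reference line where fitted == actual
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression: Fitted vs Actual Stopping Distance',
       x = 'Fitted Stopping Distance',
       y = 'Actual Stopping Distance')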

A plot of fitted versus actual values shows the points scattered around the 45-degree line, which tells us the fitted values track the actual stopping distances reasonably well.
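
How close the fitted values are to the actual values can also be put into a single number. The sketch below computes the root mean squared error (RMSE) and mean absolute error (MAE) on the training data, using a base `lm()` refit on `cars` as a stand-in for `lm_fit$fit`; the names `errors`, `rmse`, and `mae` are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Difference between actual and fitted stopping distances
errors <- cars$dist - fitted(fit)

rmse <- sqrt(mean(errors^2))   # typical size of an error, penalizing large misses
mae  <- mean(abs(errors))      # average absolute miss

print(round(c(rmse = rmse, mae = mae), 2))
```

Note that this RMSE divides by n, so it comes out slightly smaller than the residual standard error in the model summary, which divides by the residual degrees of freedom (n - 2).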

Residual Analysis

res <- resid(lm_fit$fit)

#produce residual vs. fitted plot
plot(fitted(lm_fit$fit), res)
#add a horizontal line at 0 
abline(0,0)

#create Q-Q plot for residuals
qqnorm(res)

#add a straight diagonal line to the plot
qqline(res)

#Create density plot of residuals
plot(density(res))

par(mfrow=c(2,2))
plot(lm_fit$fit)

The residuals vs fitted plot shows a larger variance at the high end of the fitted values.

This is evidenced again in the Normal Q-Q plot, where the upper tail deviates from the reference line.

Finally, the density plot of the residuals shows that they are fairly normal with a small right skew.
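
The visual checks above can be backed up with simple numeric ones. This is a hedged sketch, not part of the textbook analysis: it runs a Shapiro-Wilk normality test on the residuals and compares the residual spread below and above the median fitted value as a rough look at non-constant variance. The base `lm()` refit and the names `sw`, `upper`, and `spread` are assumptions made for illustration.

```r
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(res)
print(sw)

# Rough check for changing variance:
# compare residual spread for low vs high fitted values
upper <- fitted(fit) > median(fitted(fit))
spread <- c(lower_half = sd(res[!upper]), upper_half = sd(res[upper]))
print(round(spread, 2))
```

A small p-value from the test, or a clearly larger spread in the upper half, would agree with the reading of the plots; a formal heteroscedasticity test would be the next step if a firmer conclusion were needed.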