602 HW 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

The cars data set compares notes the relationship between one independent variable (Speed), and one dependent variable (stopping distance).

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

cars <- datasets::cars
head(cars)

cars %>% 
  ggplot(aes(x = speed, y = dist)) + geom_jitter()

At first glance, we do see that there is a positive relationship between these two variables. Lets use linear modeling to investigate the strength of this relationship.

cars.model <- lm(dist ~ speed, cars)

summary(cars.model)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

As our R Squared is above .5, and our p-value is well below .05, we can say that speed is relatively predictive of stopping distance.

Below, the added line shows what the predicted value for a given speed:

cars %>% 
  ggplot(aes(x = speed, y = dist)) + 
  geom_jitter() +
  geom_abline(slope = cars.model$coefficients[2], intercept = cars.model$coefficients[1], show.legend = TRUE)

Before saying for sure that our model is valid, we need to demonstrate 3 things:

Variability

The below plot compares the residuals to the fitted value. In order to be valid, we would not expect to see any pattern or trend in this plot. As it stands, we see for extreme values, the model is less accurate, however, it is very valid for the bulk of our fitted values.

ggplot(data = cars.model, aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  geom_smooth() + 
  geom_hline(yintercept = 0)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Normal residuals

we will also test to ensure that our residuals are fairly normally distributed.

In the first plot below, we see that the bulk of the residuals are in fact normally distributed, with some exceptions. We have slightly more residuals, where the model predicted a stopping distance 40 feet longer than expected. these are likely due to the extreme values we saw above. Overall this passes this test.

ggplot(data = cars.model, aes(x = .resid)) +
  geom_histogram(binwidth = 10)

We can also plot these on a QQ plot. Again, this passes our test.

qqnorm(resid(cars.model))
qqline(resid(cars.model))