The cars data set compares notes the relationship between one independent variable (Speed), and one dependent variable (stopping distance).
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cars <- datasets::cars
head(cars)
cars %>%
ggplot(aes(x = speed, y = dist)) + geom_jitter()
At first glance, we do see that there is a positive relationship between these two variables. Lets use linear modeling to investigate the strength of this relationship.
cars.model <- lm(dist ~ speed, cars)
summary(cars.model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
As our R Squared is above .5, and our p-value is well below .05, we can say that speed is relatively predictive of stopping distance.
Below, the added line shows what the predicted value for a given speed:
cars %>%
ggplot(aes(x = speed, y = dist)) +
geom_jitter() +
geom_abline(slope = cars.model$coefficients[2], intercept = cars.model$coefficients[1], show.legend = TRUE)
Before saying for sure that our model is valid, we need to demonstrate 3 things:
The below plot compares the residuals to the fitted value. In order to be valid, we would not expect to see any pattern or trend in this plot. As it stands, we see for extreme values, the model is less accurate, however, it is very valid for the bulk of our fitted values.
ggplot(data = cars.model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
we will also test to ensure that our residuals are fairly normally distributed.
In the first plot below, we see that the bulk of the residuals are in fact normally distributed, with some exceptions. We have slightly more residuals, where the model predicted a stopping distance 40 feet longer than expected. these are likely due to the extreme values we saw above. Overall this passes this test.
ggplot(data = cars.model, aes(x = .resid)) +
geom_histogram(binwidth = 10)
We can also plot these on a QQ plot. Again, this passes our test.
qqnorm(resid(cars.model))
qqline(resid(cars.model))