library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
The cars data is loaded with the data() function, which creates a data.frame object called cars containing the two columns of interest.
data(cars)
Let’s take a quick look at the data:
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
As we can see, the data contains two columns: the speed of a car and the distance required for it to stop. The exercise here is to see to what degree there is a relationship between a car’s speed and its stopping distance, and whether we can effectively predict a car’s stopping distance given its speed.
First, let’s simply plot the two variables against each other to get a visual sense of the relationship between them, if any:
ggplot(data=cars, aes(x=speed, y=dist)) +
geom_point()
Despite some variability, at a glance it does appear that as car speed increases, so does the distance required for it to stop.
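To put a rough number on that visual impression (a supplementary check, not part of the textbook walkthrough), we can compute the sample correlation between the two variables:
# Pearson correlation between speed and stopping distance; for a simple linear
# regression this equals the square root of the R-squared reported below (~0.81)
cor(cars$speed, cars$dist)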
To see the strength of that relationship, we can fit a linear model using R’s built-in lm() function.
model <- lm(dist~speed, data=cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
A few notes at a glance:
- The slope estimate (3.9324) is large relative to its standard error (0.4155):
3.9324 / 0.4155
## [1] 9.46426
…the standard error is nearly 10 times smaller than the estimate. Per section 3.3 in our textbook, this means that “there is relatively little variability in the slope estimate,” a signifier of good fit.
- We know our residuals should be roughly normally distributed around 0. We see a median quite close to 0, symmetrical quartiles, and a somewhat symmetrical minimum and maximum. We can explore this in greater detail later.
- The minuscule p-value for speed leaves little doubt that its effect on stopping distance is statistically significant.
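As a supplementary check on that precision (not part of the textbook steps), confint() turns the same estimates and standard errors into 95% confidence intervals for the coefficients:
# 95% confidence intervals for the intercept and slope; given the summary above,
# the slope interval should sit roughly at 3.93 +/- 2 * 0.42
confint(model)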
Let’s re-plot the data with our model superimposed:
# The model drawn by geom_smooth(method='lm') should be exactly the same as the one built above using lm()
ggplot(data=cars, aes(x=speed, y=dist)) +
geom_point() +
geom_smooth(method='lm', formula=y~x)
Again, a visual gut-check tells us this model does a pretty good job of predicting a car’s stopping distance based on its speed.
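To illustrate that predictive use concretely (a hypothetical example, not part of the assignment), predict() applies the fitted coefficients to a new speed value:
# Predicted stopping distance for a hypothetical speed of 21 mph:
# roughly -17.58 + 3.93 * 21, or about 65 ft, given the coefficients above
predict(model, newdata = data.frame(speed = 21))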
Now, as a last step, we need to analyze our residuals more closely. The residuals describe the difference between the predicted values in our model and the actual values in our data, and they should follow some general principles in order for our model results to be trustworthy.
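As a quick sanity check (not in the original analysis), the residuals can be recomputed directly as observed minus fitted values:
# resid(model) should match the observed stopping distances minus the model’s
# fitted values; all.equal() should return TRUE
all.equal(resid(model), cars$dist - fitted(model))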
Using the code from the book, we can generate a plot of our residuals against the model’s fitted values:
plot(fitted(model), resid(model))
This chart shows no visible pattern, which is a good sign for the model fit. For a good model, we should expect the residuals to look random when plotted against the fitted values; otherwise, there may be other, more important explanatory factors we’ve missed.
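The same diagnostic can be drawn with ggplot2, which is already loaded; this is just an alternative rendering of the base-R plot above:
# Residuals vs. fitted values, with a dashed reference line at zero
ggplot(data.frame(fitted = fitted(model), resid = resid(model)),
       aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed')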
Next, we can look at a Q-Q plot to get a better sense of how our residuals are distributed:
qqnorm(resid(model))
qqline(resid(model))
Because the points on this chart hew closely to the reference line, it tells us our residuals are close to being normally distributed, another strong indicator of our model’s good fit.