Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).
data(cars)
print(sum(is.na(cars$speed)))
## [1] 0
print(sum(is.na(cars$dist)))
## [1] 0
There are no missing values.
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
sum(abs((cars$speed - mean(cars$speed)) / sd(cars$speed)) > 2.5)
## [1] 0
sum(abs((cars$dist - mean(cars$dist)) / sd(cars$dist)) > 2.5)
## [1] 1
Using a cutoff of 2.5 standard deviations from the mean, there are no outliers in speed and one potential outlier in stopping distance.
cars[abs((cars$dist - mean(cars$dist)) / sd(cars$dist)) > 2.5,]
## speed dist
## 49 24 120
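The z-score screen used above can be wrapped in a small helper so the same rule is applied to both columns. This is only a sketch: the function name `z_outliers` and its `threshold` argument are hypothetical, and the 2.5 cutoff is simply the one used in this analysis.

```r
data(cars)

# Hypothetical helper: return the positions of values whose absolute
# z-score exceeds `threshold` (2.5 matches the cutoff used in this analysis).
z_outliers <- function(x, threshold = 2.5) {
  z <- (x - mean(x)) / sd(x)
  which(abs(z) > threshold)
}

speed_out <- z_outliers(cars$speed)  # no rows flagged
dist_out  <- z_outliers(cars$dist)   # flags row 49

print(speed_out)
print(cars[dist_out, ])
```

A fixed z-score cutoff is a convention rather than a formal test, so the flagged row is best treated as a point to inspect, not one to remove automatically.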
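library(ggplot2)
library(dplyr)  # assumed source of the %>% pipe; the original setup code is not shown
cars %>%
  ggplot(aes(x = speed, y = dist)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm")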
## `geom_smooth()` using formula 'y ~ x'
Based on the scatterplot above, there is a strong positive linear relationship between a vehicle's speed and its stopping distance.
cor(cars$speed, cars$dist)
## [1] 0.8068949
The correlation coefficient is about 0.81, which indicates a strong, though far from perfect, positive linear relationship.
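To see how precisely the correlation is estimated, one option is base R's `cor.test()`, which reports a confidence interval and a p-value alongside the coefficient. This step is not part of the original analysis and is shown only as a possible complement.

```r
data(cars)

# Pearson correlation test (the default method) between speed and distance
ct <- cor.test(cars$speed, cars$dist)

r  <- unname(ct$estimate)  # sample correlation coefficient
ci <- ct$conf.int          # 95% confidence interval for the correlation

cat(sprintf("r = %.3f, 95%% CI = [%.3f, %.3f], p = %.3g\n",
            r, ci[1], ci[2], ct$p.value))
```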
model <- lm(dist ~ speed, data=cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
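We can see from the above summary that the estimated slope is 3.93, so each additional 1 mph of speed is associated with about 3.93 ft of additional stopping distance, and this coefficient is highly significant (p = 1.49e-12). The intercept of -17.58 is also significant at the 5% level (p = 0.0123), but it has no physical interpretation because a stopping distance cannot be negative. The multiple R-squared of 0.6511 means that speed explains about 65% of the variance in stopping distance, and the residual standard error is 15.38 on 48 degrees of freedom. The residuals have a median of -2.27 and range from -29.07 to 43.20, so they are not quite symmetric around zero.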
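The quantities reported by `summary(model)` can also be extracted programmatically, which is convenient for checking them or reusing them later. The sketch below refits the same model so that it is self-contained; the speed of 20 mph used for the prediction is an arbitrary illustrative value, not one taken from the analysis.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)

coefs <- coef(model)                  # named vector: (Intercept), speed
ci    <- confint(model, level = 0.95) # 95% confidence intervals for both coefficients
r2    <- summary(model)$r.squared     # multiple R-squared

# In simple linear regression, R-squared equals the squared correlation
# between the predictor and the response.
r2_matches_cor <- isTRUE(all.equal(r2, cor(cars$speed, cars$dist)^2))

# Prediction interval for a hypothetical speed of 20 mph
new_speed <- data.frame(speed = 20)
pred <- predict(model, newdata = new_speed, interval = "prediction")

print(coefs)
print(ci)
print(r2)
print(r2_matches_cor)
print(pred)
```

The prediction interval is wider than a confidence interval for the mean response because it also accounts for the scatter of individual observations around the line.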
ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
From the residuals plot above, we see that the residuals are mostly scattered around zero with no obvious curvature. However, their spread appears to increase with the fitted values, which suggests slight heteroscedasticity.
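The visual impression of heteroscedasticity can be checked numerically. The sketch below computes Koenker's studentized version of the Breusch-Pagan test by hand in base R: the squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared against a chi-squared distribution with one degree of freedom. This test is an addition for illustration and is not part of the original analysis; the variable names are arbitrary.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor
aux_data <- data.frame(e2 = resid(model)^2, speed = cars$speed)
aux <- lm(e2 ~ speed, data = aux_data)

# Koenker (studentized) Breusch-Pagan statistic: n * R^2 of the auxiliary fit
n       <- nrow(cars)
bp_stat <- n * summary(aux)$r.squared

# One regressor in the auxiliary model, so the reference distribution has df = 1
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat(sprintf("BP statistic = %.4f, p-value = %.4f\n", bp_stat, bp_p))
```

A small p-value would support the presence of non-constant variance; a borderline value would be consistent with the "slight" heteroscedasticity seen in the plot.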
qqnorm(resid(model))
qqline(resid(model))
Using the Q-Q plot, we can see that the residuals are not exactly normally distributed: the points in the upper tail rise above the reference line, which means the residual distribution has a heavier right tail (right skew).
ggplot(data = model, aes(x = .resid)) +
  geom_histogram(binwidth = 8)
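The graphical normality checks can be complemented with a formal test and a numeric measure of asymmetry. As a sketch, the code below applies base R's `shapiro.test()` to the residuals and computes a simple moment-based skewness; neither step appears in the original analysis, and the skewness formula here is the plain (uncorrected) sample version.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)
res   <- resid(model)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)

# Moment-based sample skewness; OLS residuals with an intercept have mean zero,
# so the central moments reduce to plain means of powers.
skew <- mean(res^3) / mean(res^2)^1.5

cat(sprintf("Shapiro-Wilk W = %.4f, p-value = %.4f\n", sw$statistic, sw$p.value))
cat(sprintf("Residual skewness = %.3f\n", skew))
```

A positive skewness value agrees with the heavier right tail seen in the Q-Q plot and histogram.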