# Load cars
data(cars)
# Print the first 6 rows
head(cars, 6)
attach(cars)
# Check min and max and mean
min(speed); max(speed); mean(speed)
## [1] 4
## [1] 25
## [1] 15
min(dist); max(dist); mean(dist)
## [1] 2
## [1] 120
## [1] 43
Nothing obviously wacky here so let’s move on…
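The same sanity check can be done in one pass. This is a minimal sketch using only base R and the built-in `cars` data; the variable name `na_counts` is chosen here for illustration.

```r
# Load the built-in dataset (50 observations of speed and stopping distance)
data(cars)

# Structure: confirms two numeric columns, speed and dist
str(cars)

# Min, quartiles, median, mean and max for each column in one call
summary(cars)

# Count missing values per column; zeros mean nothing needs handling
na_counts <- colSums(is.na(cars))
na_counts
```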
Check if a linear model seems appropriate.
# Create scatter plot
plot(speed, dist, main = "Stopping Distance as a Function of Speed", ylab = "Distance", xlab = "Speed")
The relationship looks roughly linear: as speed increases, so does the stopping distance. So a linear model seems appropriate.
# Distance as a function of speed
cars.lm <- lm(dist ~ speed)
cars.lm
##
## Call:
## lm(formula = dist ~ speed)
##
## Coefficients:
## (Intercept) speed
## -17.58 3.93
Our linear regression model is: \[ stopping\_distance = -17.58 + 3.93 \times speed \]
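To see what the equation implies for particular speeds, the model can be asked for predictions. The sketch below refits the model with an explicit `data = cars` argument so that it runs on its own; the names `fit`, `new_speeds`, `pred` and `by_hand`, and the two speeds, are assumptions made for illustration.

```r
data(cars)

# Same model as above, fitted without relying on attach()
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds, both inside the observed range of 4 to 25
new_speeds <- data.frame(speed = c(10, 20))

# Predicted stopping distance for each speed
pred <- predict(fit, newdata = new_speeds)
pred

# The same calculation by hand: intercept + slope * speed
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed
by_hand
```

With the rounded coefficients, a speed of 20 gives roughly -17.58 + 3.93 × 20 ≈ 61. Predictions outside the observed speed range should be treated with caution: at very low speeds the negative intercept produces a negative stopping distance.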
plot(speed, dist, main = "Stopping Distance as a Function of Speed", ylab = "Distance", xlab = "Speed")
abline(cars.lm)
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.07 -9.53 -2.27 9.21 43.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.579 6.758 -2.60 0.012 *
## speed 3.932 0.416 9.46 0.0000000000015 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 48 degrees of freedom
## Multiple R-squared: 0.651, Adjusted R-squared: 0.644
## F-statistic: 89.6 on 1 and 48 DF, p-value: 0.00000000000149
hist(cars.lm$residuals)
mean(cars.lm$residuals)
## [1] 0.000000000000000087
The histogram shows residuals that are close to normally distributed around zero. The mean of 0.000000000000000087 is zero up to floating-point error, which is expected for any least-squares fit that includes an intercept, so on its own it says nothing about the quality of the fit. A few outliers in the 40+ bin of the histogram explain why the maximum residual (43.20) is much larger in magnitude than the minimum (-29.07), but the 1st and 3rd quartiles (-9.53 and 9.21) are of almost equal magnitude.
According to our textbook, we would typically want the standard error to be “at least five to ten times smaller than the corresponding coefficient”. In this case the standard error for speed, 0.42, is about 9.46 times smaller than the coefficient, 3.93, so the slope is estimated precisely. The standard error for the intercept, 6.76, is only about 2.6 times smaller in magnitude than the coefficient, -17.58. The intercept is therefore estimated less precisely than the slope, and this estimate may vary more from sample to sample.
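The ratios quoted above can be read straight from the coefficient table. A minimal sketch, assuming the model is refitted under the illustrative name `fit`; `coefs` and `ratio` are also names chosen here.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix with columns
# Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# Magnitude of each coefficient relative to its standard error
ratio <- abs(coefs[, "Estimate"]) / coefs[, "Std. Error"]
ratio

# This ratio is the absolute value of the t statistic in the summary
all.equal(ratio, abs(coefs[, "t value"]))
```

Because the ratio is just the absolute t value, the textbook's "five to ten times smaller" rule of thumb amounts to asking for |t| of at least 5 to 10.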
The p-values for the coefficients, 0.0000000000015 for speed and 0.012 for the intercept, are both below 0.05, so both coefficients are statistically significantly different from zero. The evidence is far stronger for speed than for the intercept.
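Confidence intervals give a complementary view of the same information. The sketch below uses base R's `confint()` at the conventional 95% level; the level and the names `fit` and `ci` are assumptions for illustration.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
ci

# The p-values from the coefficient table, for comparison
coef(summary(fit))[, "Pr(>|t|)"]
```

An interval that excludes zero corresponds to a p-value below 0.05 for that coefficient. The interval for the intercept is wide relative to the one for speed, which reflects its larger standard error.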
The \(R^2\) value of 0.6511 indicates that the model explains 65.11% of the variation in stopping distance.
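The \(R^2\) value can be reproduced from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch; `fit`, `ss_res`, `ss_tot` and `r2` are names chosen for illustration.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the model
ss_res <- sum(resid(fit)^2)

# Total variation in stopping distance around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# Share of variation explained by the model
r2 <- 1 - ss_res / ss_tot
r2

# Matches the value reported by summary()
summary(fit)$r.squared

# With a single predictor, it also equals the squared correlation
cor(cars$speed, cars$dist)^2
```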
plot(fitted(cars.lm), resid(cars.lm))
There is no apparent pattern in the plotted residuals, which supports the linear model as a good fit.
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
We can see that, overall, the sample quantiles fall close to the reference line, so they follow the theoretical quantiles of a normal distribution reasonably well.
Overall the linear model is a good fit for this data, except for a few outliers at the upper end.
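To see which observations are behind the outliers, the residuals can be ranked by magnitude. This is a minimal sketch; the choice of the top three and the names `fit`, `res` and `top` are assumptions for illustration.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

res <- resid(fit)

# Row indices of the three residuals with the largest magnitude
top <- order(abs(res), decreasing = TRUE)[1:3]

# Show those rows next to their fitted values and residuals
cbind(cars[top, ], fitted = fitted(fit)[top], residual = res[top])
```

The largest residual belongs to observation 49 (speed 24, distance 120), which is the maximum residual of 43.20 reported in the model summary.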