Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
There are 50 observations of speed and stopping distance.
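The summary table appears without the call that produced it. A minimal sketch that would generate such a table and confirm the observation count, assuming the built-in `cars` data frame from the `datasets` package (the variable names `cars_summary`, `n_obs`, and `n_vars` are illustrative):

```r
# Assumes the built-in `cars` data frame (datasets package, loaded by default)
data(cars)

# Minimum, quartiles, median, mean, and maximum for each column
cars_summary <- summary(cars)
print(cars_summary)

# Number of observations (rows) and variables (columns)
n_obs <- nrow(cars)
n_vars <- ncol(cars)
cat("Observations:", n_obs, " Variables:", n_vars, "\n")
```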
# scatterplot
require(stats); require(graphics)
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", las = 1)
title(main = "Cars dataset: Stopping distance vs. Speed")
## [1] 120
One observation, with a stopping distance of 120 ft, stands out as a potential outlier.
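The observation in question can be located directly in the data. This is one assumed way to do it, since the original call is not shown; `max_dist` and `max_row` are illustrative names:

```r
# Largest stopping distance in the data
max_dist <- max(cars$dist)

# Full row (speed and dist) for the observation with that maximum
max_row <- cars[which.max(cars$dist), ]
print(max_row)
```

Looking at the whole row shows the speed at which the extreme stopping distance was recorded, which helps judge whether the point is unusual for its speed or merely large.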
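# Fit stopping distance as a linear function of speed
LinearModel <- lm(dist ~ speed, data = cars)
intercept <- round(LinearModel$coefficients[1], 3)
slope <- round(LinearModel$coefficients[2], 3)
# Fitted equation, used as the plot title
formula <- paste("dist = ", intercept, " + ", slope, "*", "speed + error")
# Refer to the columns through `cars` so the plot works without attach(cars)
plot(x = cars$speed, y = cars$dist,
     main = paste("Cars dataset: ", formula))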
abline(reg = LinearModel, col = "blue")
The intercept is negative, which means that at very low speeds, the predicted stopping distance would be negative.
(Of course, in reality, this is not possible…)
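To see where the negative predictions occur, the fitted line can be evaluated at a few low speeds. This is a rough sketch: the speeds 1, 2, and 4 mph are hypothetical illustration values, and `fit`, `low_pred`, and `zero_speed` are names introduced here.

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)

# Predicted stopping distance at a few low speeds (hypothetical values, mph)
low_speeds <- data.frame(speed = c(1, 2, 4))
low_pred <- predict(fit, newdata = low_speeds)
print(cbind(low_speeds, predicted_dist = low_pred))

# Speed at which the fitted line crosses zero: -intercept / slope
zero_speed <- unname(-coef(fit)[1] / coef(fit)[2])
cat("Fitted line crosses zero at about", round(zero_speed, 2), "mph\n")
```

With the coefficients reported in the model summary, the crossing point is about 17.58 / 3.93, or roughly 4.5 mph, which sits at the very bottom of the observed speed range (minimum 4 mph). The negative predictions therefore arise at the edge of, or outside, the data the line was fitted to.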
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -29.06908  -9.52532  -2.27185   9.21472  43.20128
##
## Coefficients:
##               Estimate Std. Error  t value           Pr(>|t|)
## (Intercept) -17.579095   6.758440 -2.60106           0.012319 *
## speed         3.932409   0.415513  9.46399 0.0000000000014898 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.3796 on 48 degrees of freedom
## Multiple R-squared: 0.651079, Adjusted R-squared: 0.64381
## F-statistic: 89.5671 on 1 and 48 DF, p-value: 0.00000000000148984
The slope on speed is highly significant (\(p \approx 1.5 \times 10^{-12}\), far below the 0.001 level), while the intercept is significant at the 0.05 level but not at the 0.01 level (\(p \approx 0.0123\)).
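The p-values behind these statements can be pulled out of the model summary programmatically rather than read off the printout. A small sketch, assuming the same `lm(dist ~ speed, data = cars)` fit; `fit`, `coef_table`, `p_values`, and `ci_95` are illustrative names, and the confidence intervals are an optional extra not shown in the original output:

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|t|)"]
print(p_values)

# Compare each p-value against common significance levels
print(p_values < 0.05)
print(p_values < 0.01)

# 95% confidence intervals for the intercept and slope
ci_95 <- confint(fit, level = 0.95)
print(ci_95)
```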
The \(R^2\) of about 0.65 indicates that the model explains about 65 percent of the variance in stopping distance, which is reasonably good.
This figure corresponds to a correlation, \(R = \sqrt{R^2}\), of about 0.81 between the variables.
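For a simple regression with one predictor, \(R^2\) equals the square of the Pearson correlation between the two variables. The relationship can be checked with a short sketch (`fit`, `r_squared`, and `r` are names introduced here):

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)

# R-squared as reported by the model summary
r_squared <- summary(fit)$r.squared

# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)

cat("R-squared:       ", round(r_squared, 4), "\n")
cat("Correlation:     ", round(r, 4), "\n")
cat("Correlation^2:   ", round(r^2, 4), "\n")
```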
Residual = resid(LinearModel)
Fitted = fitted(LinearModel)
plot(Fitted, Residual, main = "Cars dataset: Fitted vs. Residuals", xlab = "Fitted Stopping Distance")
abline(h = 0, col = "blue")
The residuals do not show any recognizable pattern, though the variance at higher speeds does appear to be larger than that at lower speeds, which may indicate heteroscedasticity.
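One rough, informal way to put a number on the visual impression of unequal spread is to compare the standard deviation of the residuals in the lower and upper halves of the fitted values. This is only a sketch and not a formal test of heteroscedasticity; the split at the median fitted value is an arbitrary choice made here for illustration.

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Split observations at the median fitted value (arbitrary, illustrative cut)
upper_half <- fitted(fit) > median(fitted(fit))

# Standard deviation of residuals in each half (FALSE = lower, TRUE = upper)
spread <- tapply(res, upper_half, sd)
names(spread) <- c("lower fitted half", "upper fitted half")
print(spread)
```

A noticeably larger value in the upper half would be in line with the widening spread seen in the plot.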
The QQ plot indicates some outliers at the upper end (actual stopping distance well above model prediction),
which may call normality into question.
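The code for the QQ plot is not shown in this section. A minimal sketch that would produce a normal QQ plot of the residuals with base R's `qqnorm` and `qqline`, assuming the same fitted model (the title text and the names `fit`, `res`, and `qq` are illustrative):

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Sample quantiles of the residuals against theoretical normal quantiles
qq <- qqnorm(res, main = "Cars dataset: Normal Q-Q plot of residuals")

# Reference line through the first and third quartiles
qqline(res, col = "blue")
```

Points that rise well above the reference line at the right-hand end correspond to observations whose actual stopping distance is far above the model's prediction.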
Residual <- resid(LinearModel)
# Histogram with 8 breaks, overlaid with a normal density using the residuals' mean and sd
hist(Residual, main = "Histogram of Residuals - 8 breaks", ylab = "Density",
     ylim = c(0, 0.05), prob = TRUE, breaks = 8, col = "lightblue")
curve(dnorm(x, mean = mean(Residual), sd = sd(Residual)), col = "red", add = TRUE)
# Same histogram with 15 breaks
hist(Residual, main = "Histogram of Residuals - 15 breaks", ylab = "Density",
     ylim = c(0, 0.05), prob = TRUE, breaks = 15, col = "lightblue")
curve(dnorm(x, mean = mean(Residual), sd = sd(Residual)), col = "red", add = TRUE)
While the mean of the residuals is zero by construction, the median is -2.27.
The number of observations for which the residual is negative is 27,
while the number of cases for which the residual is positive is 23.
These figures are consistent with the graphs shown above.
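The mean, median, and sign counts quoted above can be computed directly from the residuals. A short sketch, assuming the same fitted model (`fit`, `res`, and the summary variable names are introduced here):

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Mean is zero (up to floating-point error) for least squares with an intercept
mean_res <- mean(res)
median_res <- median(res)

# Counts of negative and positive residuals
n_negative <- sum(res < 0)
n_positive <- sum(res > 0)

cat("Mean:", signif(mean_res, 3), " Median:", round(median_res, 2), "\n")
cat("Negative:", n_negative, " Positive:", n_positive, "\n")
```

A negative median combined with a zero mean means that a smaller number of large positive residuals balances a larger number of small negative ones, which is the right skew visible in the histograms.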
Because the sample is small (50 observations), it is difficult to judge from the plots above whether the residuals are normally distributed.
\(H_0\): The residuals are normal.
\(H_A\): The residuals are not normal.
##
## Shapiro-Wilk normality test
##
## data: Residual
## W = 0.9450906, p-value = 0.0215246
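Because the p-value (0.0215) is below 0.05, the null hypothesis of normality is rejected at the 5% significance level. This suggests that the residuals depart from the normal distribution, so the normality condition of linear regression is in doubt, and the inference that relies on it (the t-tests and p-values above) should be read with some caution.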
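The call that produced the test output is not shown. A minimal sketch using base R's `shapiro.test`, assuming the same fitted model; the 0.05 threshold in `alpha` matches the level used in the text, and `fit`, `res`, `sw`, and `reject_null` are illustrative names:

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw)

# Decision at the 5% significance level
alpha <- 0.05
reject_null <- sw$p.value < alpha
cat("Reject normality at alpha =", alpha, ":", reject_null, "\n")
```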
Although the conditions for linear regression are questionable, since a few outliers cause the residuals to fail the normality test, the model is significant. It attributes about \({\frac{2}{3}}\) of the variance in stopping distance to speed. The other \({\frac{1}{3}}\) of the variance remains unexplained by this model and must reflect factors that have not been modeled, or random variation.