HW11 - Regression

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

There are 50 observations, each pairing a speed (in mph) with a stopping distance (in ft).
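The summary above can be reproduced with base R alone. This is a minimal sketch, assuming the built-in `cars` data frame from the `datasets` package; the original code is not shown in the source, and the name `n_obs` is illustrative.

```r
# Load the built-in dataset (speed in mph, stopping distance in ft).
data(cars)

# Minimum, quartiles, median, mean, and maximum for each column.
print(summary(cars))

# Number of observations (rows) and variables (columns).
n_obs <- nrow(cars)
print(dim(cars))
```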

3.2 The Linear Model Function
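The model evaluated in section 3.3 is fitted with the call `lm(dist ~ speed, data = cars)`. A minimal sketch of that fit, together with a scatterplot and the fitted line, might look like the following; the object name `cars_lm` and the plot labels are illustrative assumptions, not taken from the original code.

```r
# Fit stopping distance as a linear function of speed.
cars_lm <- lm(dist ~ speed, data = cars)

# Scatterplot of the raw data with the fitted regression line overlaid.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "cars: stopping distance vs. speed")
abline(cars_lm, col = "red", lwd = 2)

# Estimated intercept and slope.
print(coef(cars_lm))
```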

3.3 Evaluating the Quality of the Model

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -29.06908  -9.52532  -2.27185   9.21472  43.20128 
## 
## Coefficients:
##               Estimate Std. Error  t value           Pr(>|t|)    
## (Intercept) -17.579095   6.758440 -2.60106           0.012319 *  
## speed         3.932409   0.415513  9.46399 0.0000000000014898 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.3796 on 48 degrees of freedom
## Multiple R-squared:  0.651079,   Adjusted R-squared:  0.64381 
## F-statistic: 89.5671 on 1 and 48 DF,  p-value: 0.00000000000148984

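The slope on speed is highly significant (\(p \approx 1.5 \times 10^{-12}\), far below the 0.001 level): each additional unit of speed adds about 3.93 units of stopping distance. The intercept is significant at the 5% level (\(p \approx 0.012\)) but not at the 1% level.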
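The significance statements above can be checked directly against the coefficient table. The sketch below assumes the same `lm(dist ~ speed, data = cars)` fit; the names `cars_lm`, `coef_table`, and `p_values` are illustrative, and the confidence intervals are an extra not reported in the original output.

```r
cars_lm <- lm(dist ~ speed, data = cars)

# Coefficient table: estimate, standard error, t value, p-value.
coef_table <- coef(summary(cars_lm))
p_values <- coef_table[, "Pr(>|t|)"]
print(p_values)

# Which coefficients pass each significance threshold?
print(p_values < 0.05)
print(p_values < 0.01)

# 95% confidence intervals for the intercept and slope.
print(confint(cars_lm, level = 0.95))
```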

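The \(R^2\) of about 0.65 indicates that the model explains about 65 percent of the variance in stopping distance, which is reasonably good for a single predictor.
Because the model has only one predictor, this figure corresponds to a correlation of \(R = \sqrt{0.651} \approx 0.81\) between speed and stopping distance.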
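In simple linear regression, \(R^2\) equals the square of the Pearson correlation between predictor and response. A short sketch of that check, with illustrative variable names:

```r
cars_lm <- lm(dist ~ speed, data = cars)

# R-squared as reported by the model summary.
r_squared <- summary(cars_lm)$r.squared

# Pearson correlation between speed and stopping distance.
r <- cor(cars$speed, cars$dist)

print(r_squared)
print(r)

# With a single predictor, the squared correlation matches R-squared.
print(all.equal(r^2, r_squared))
```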

3.4 Residual Analysis

Plot residuals

The residuals show no obvious trend around zero, but their spread appears to be larger at higher speeds than at lower speeds, which may indicate heteroscedasticity.
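A residuals-versus-fitted plot, plus a rough numeric comparison of residual spread, is one way to examine this. The sketch below is an informal check rather than a formal test of heteroscedasticity; splitting the data at the median speed is an arbitrary choice made for illustration, and all object names are assumptions.

```r
cars_lm <- lm(dist ~ speed, data = cars)
res <- resid(cars_lm)

# Residuals against fitted values; a funnel shape suggests non-constant variance.
plot(fitted(cars_lm), res,
     xlab = "Fitted stopping distance (ft)", ylab = "Residual (ft)",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Standard deviation of residuals below vs. at or above the median speed.
high_speed <- cars$speed >= median(cars$speed)
spread <- tapply(res, high_speed, sd)
print(spread)
```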

QQ plot

The QQ plot indicates some outliers at the upper end (actual stopping distance well above model prediction),
which may call normality into question.
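A sketch of the QQ plot using base R, followed by a listing of the observations with the largest positive residuals, which are the points that lift off the line at the upper end. Object names and the choice of showing three observations are illustrative.

```r
cars_lm <- lm(dist ~ speed, data = cars)
res <- resid(cars_lm)

# Normal QQ plot of the residuals with a reference line.
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res, col = "red")

# The three largest positive residuals and the rows they come from.
largest <- sort(res, decreasing = TRUE)[1:3]
print(largest)
print(cars[as.integer(names(largest)), ])
```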

Plot histograms of Residuals

While the mean of the residuals is zero by construction (the model includes an intercept), the median is -2.27.
The number of observations for which the residual is negative is 27,
while the number of cases for which the residual is positive is 23.
These figures are consistent with the mild right skew seen in the graphs above.
Because the sample size is so small, it is difficult to judge from the plots alone whether the residuals are normally distributed.
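The counts and summary figures quoted above can be recomputed from the fitted model. A minimal sketch, with illustrative names and an arbitrary choice of `breaks = 10` for the histogram:

```r
cars_lm <- lm(dist ~ speed, data = cars)
res <- resid(cars_lm)

# Histogram of the residuals.
hist(res, breaks = 10, xlab = "Residual (ft)", main = "Histogram of residuals")

# Mean and median of the residuals.
print(c(mean = mean(res), median = median(res)))

# Number of negative and positive residuals.
n_negative <- sum(res < 0)
n_positive <- sum(res > 0)
print(c(negative = n_negative, positive = n_positive))
```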

Shapiro-Wilk test

\(H_0\): The residuals are normally distributed.

\(H_A\): The residuals are not normally distributed.

## 
##  Shapiro-Wilk normality test
## 
## data:  Residual
## W = 0.9450906, p-value = 0.0215246

Because the p-value (0.0215) is below 0.05, the null hypothesis of normality is rejected at the 5% significance level, though it would not be rejected at the 1% level. This suggests that the residuals depart from the normal distribution, so the normality condition for linear regression is in doubt.
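The test above can be run with `shapiro.test()` from base R's `stats` package. In this sketch the names `res`, `sw`, `alpha`, and `reject_normality` are illustrative, and the 0.05 threshold mirrors the 95% confidence level used in the text.

```r
cars_lm <- lm(dist ~ speed, data = cars)
res <- resid(cars_lm)

# Shapiro-Wilk test of the null hypothesis that the residuals are normal.
sw <- shapiro.test(res)
print(sw)

# Decision at the 5% significance level.
alpha <- 0.05
reject_normality <- sw$p.value < alpha
print(reject_normality)
```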

Conclusion

The conditions for linear regression are questionable: a few outliers at the upper end cause the residuals to fail the normality test, and the residual spread may grow with speed. Even so, the model is highly significant, and it seems adequate for showing that about \({\frac{2}{3}}\) of the variance in stopping distance is attributable to speed. The other \({\frac{1}{3}}\) of the variance remains unexplained by this model and is attributable to factors which have not been modeled or to random variation.