DATA605_HW11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Model Creation

We are creating a model where stopping distance is being predicted by car speed.

cars.lm <- lm(cars$dist ~ cars$speed)

Model Visualization

plot(cars$dist ~ cars$speed)
abline(cars.lm)

Model Evaluation

The first step is examining the residuals. A residual is the difference between an actual value and the value the model predicts (actual minus fitted). A negative residual means the model over-predicted the actual value, and a positive residual means it under-predicted the actual value. A good model would have a median residual that is close to zero, a minimum and maximum of roughly the same magnitude, and a Q1 and Q3 of roughly the same magnitude. We want the residuals to be normally distributed and centered on zero.

Based on an inspection of the summary statistics of the residuals, the residuals are close to normally distributed. The median is -2.272, which is close to zero relative to the spread of the residuals. The magnitudes of Q1 and Q3 are almost identical. The magnitude of the max is about 50% larger than that of the min, but I think that is still close enough.
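
The residual summary statistics discussed above can be reproduced directly from the fitted model. This is a minimal sketch: the name `fit` and the `data = cars` formula interface are choices made for this example rather than the `cars.lm` call used elsewhere in this report, although they fit the same model.

```r
# Refit the same model; `fit` is a name chosen for this sketch
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Min, Q1, median, Q3 and max of the residuals
# (the same five numbers summary() reports under "Residuals")
res_q <- quantile(res)
print(round(res_q, 3))

# How much larger the biggest positive residual is than the
# biggest negative one, as a ratio of magnitudes
max_to_min <- max(res) / abs(min(res))
print(round(max_to_min, 2))
```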

Next we look at the Standard Error. We are hoping to see values that are 5-10X smaller than our coefficients. We also want to examine the p-value, shown as ‘Pr(>|t|)’, which is the probability of observing a ‘t value’ at least as extreme as ours assuming that there is no linear relationship between the model predictor and response variable.

The Standard Error for our Intercept is only ~3X smaller than our Estimate, while the Standard Error for Speed is ~10X smaller, which is what we want to see. We don’t have to calculate these ratios by hand because they are shown for us as the ‘t value’ (the Estimate divided by its Standard Error).

We can reject the null hypothesis that the true coefficient is zero for both our intercept and our speed coefficient at the 5% significance level (95% confidence level).
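
To make the link between the estimates, standard errors, t values and p-values concrete, the sketch below recomputes the last two columns of the coefficient table by hand. The names `fit`, `t_manual`, `p_manual` and `ci` are illustrative, and the model is refit with the `data = cars` interface so the block runs on its own.

```r
# Refit the same model; `fit` is a name chosen for this sketch
fit <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix:
# columns are Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# t value = estimate divided by its standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual df
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)
print(cbind(t_manual, p_manual))

# 95% confidence intervals; an interval that excludes zero agrees
# with rejecting the null hypothesis at the 5% level
ci <- confint(fit, level = 0.95)
print(ci)
```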

Then we want to examine the ‘Residual standard error’ (RSE), which is an estimate of the standard deviation of our model residuals. We can do another check on the normality of our residual distribution by comparing the Q1 and Q3 values with the RSE. For normally distributed residuals, Q1 and Q3 should fall at about ±0.67X the RSE.

Our RSE is 15.38, so we would expect Q1 and Q3 to be near ±10.4. The actual values of -9.525 and 9.215 are slightly smaller in magnitude but close, which is consistent with my earlier assertion that the residuals are roughly normal.
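
As a numerical check on the quartile comparison, recall that the quartiles of a normal distribution sit at about ±0.674 standard deviations from the mean (`qnorm(0.75)`). The sketch below compares that expectation with the observed residual quartiles; `fit`, `rse`, `expected_q3` and `observed` are names chosen for this example.

```r
# Refit the same model; `fit` is a name chosen for this sketch
fit <- lm(dist ~ speed, data = cars)

# Residual standard error: sqrt(sum of squared residuals / residual df)
rse <- sigma(fit)
rse_manual <- sqrt(sum(resid(fit)^2) / df.residual(fit))

# Where Q3 would fall if the residuals were exactly normal with sd = rse
# (Q1 would be the same value with a negative sign)
expected_q3 <- qnorm(0.75) * rse

# Observed first and third quartiles of the residuals
observed <- quantile(resid(fit), c(0.25, 0.75))

print(round(c(rse = rse, expected_q3 = expected_q3), 3))
print(round(observed, 3))
```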

We have 48 degrees of freedom because we have 50 observations and 2 estimated parameters (the intercept and the slope).

The Multiple R-squared is the variation explained by our linear model divided by the total variation. We can see that 65% of the variation in stopping distance is explained by speed. The Adjusted R-squared is the R-squared with an adjustment for the number of predictors. Our model has only one predictor, so the adjustment is small.

Lastly we look at the F-statistic, which compares our model against a model with just the intercept in it. Because our model has a single predictor, the F-statistic equals the t value for speed squared (9.464² ≈ 89.57), and the two p-values are identical.
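
The degrees-of-freedom arithmetic can be confirmed from the model object. A small sketch, with `fit`, `n_obs`, `n_par` and `df_res` as illustrative names:

```r
# Refit the same model; `fit` is a name chosen for this sketch
fit <- lm(dist ~ speed, data = cars)

n_obs  <- nrow(cars)          # number of observations
n_par  <- length(coef(fit))   # intercept + slope
df_res <- df.residual(fit)    # residual degrees of freedom reported by R

print(c(n_obs = n_obs, n_par = n_par, df_res = df_res))
```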
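
Both R-squared values can be rebuilt from sums of squares, which makes the definition above easier to see. This is a sketch under the usual definitions (R-squared as one minus residual over total variation); the variable names are chosen for this example.

```r
# Refit the same model; `fit` is a name chosen for this sketch
fit <- lm(dist ~ speed, data = cars)

sse <- sum(resid(fit)^2)                      # unexplained variation
sst <- sum((cars$dist - mean(cars$dist))^2)   # total variation

# Share of total variation explained by the model
r2 <- 1 - sse / sst

# Adjusted R-squared penalizes for the number of predictors p
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```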
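
The two readings of the F-statistic (the squared slope t value, and a comparison against an intercept-only model) can be checked side by side. In this sketch `fit`, `null_fit` and `cmp` are illustrative names, and the intercept-only model `dist ~ 1` is fit explicitly for the comparison.

```r
# Full model and intercept-only model; names chosen for this sketch
fit      <- lm(dist ~ speed, data = cars)
null_fit <- lm(dist ~ 1, data = cars)

# F-statistic reported by summary()
f_stat <- summary(fit)$fstatistic[["value"]]

# t value of the slope; its square matches F when there is one predictor
t_slope <- coef(summary(fit))["speed", "t value"]

# Explicit nested-model comparison; row 2 holds the F test
cmp <- anova(null_fit, fit)

print(c(f_stat = f_stat, t_squared = t_slope^2, f_anova = cmp$F[2]))
```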

summary(cars.lm)

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Residual Analysis

First we want to plot the fitted values against the residuals. We are hoping to see no evident pattern, points evenly distributed above and below zero and an even dispersion of points along the X axis.

We do not see an evident pattern and the points are evenly dispersed along the X axis but not above and below zero.

plot(fitted(cars.lm), resid(cars.lm))
abline(h = 0)  # horizontal reference line at zero
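
The balance of points above and below zero can also be counted rather than judged by eye. A minimal sketch, with `fit`, `n_above` and `n_below` as names chosen for this example:

```r
# Refit the same model; `fit` is a name chosen for this sketch
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Number of observations the model under-predicts (positive residual)
# and over-predicts (negative residual)
n_above <- sum(res > 0)
n_below <- sum(res < 0)

print(c(above_zero = n_above, below_zero = n_below))
```

An even split would put about 25 points on each side of the zero line.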

Q-Q Plot

The quantile versus quantile plot gives another check on the normality of our residuals. If they are normally distributed, we expect them to plot in a straight line.

These residuals look normal except at the right tail, which is a little heavier than we would expect, though not severely so.

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

Lastly we can plot multiple checks at the same time, including two new ones: the Scale-Location plot, which is another way to examine the residuals (the square root of the absolute standardized residuals against the fitted values), and the Residuals vs Leverage plot, which allows you to examine outliers and influential points.

These two new plots give us no new information. There are no outliers and our residuals don’t appear to violate normality.

par(mfrow=c(2,2))
plot(cars.lm)
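
The Residuals vs Leverage panel draws Cook's distance contours, and the same quantities can be listed numerically. The sketch below is one hedged way to do that: the 0.5 cutoff mirrors the inner contour that `plot.lm` draws by default and is a rule of thumb, not a formal test, and `fit`, `cd`, `lev`, `top3` and `flagged` are names chosen for this example.

```r
# Refit the same model; `fit` is a name chosen for this sketch
fit <- lm(dist ~ speed, data = cars)

cd  <- cooks.distance(fit)   # influence of each observation on the fit
lev <- hatvalues(fit)        # leverage of each observation

# The three most influential observations by Cook's distance
top3 <- head(sort(cd, decreasing = TRUE), 3)
print(round(top3, 3))

# Observations beyond the 0.5 contour (rule-of-thumb cutoff, an assumption)
flagged <- which(cd > 0.5)
print(flagged)
```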

William Aiken

4/10/2022
