Homework 11

Tyler Baker

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

The data

First I must load the cars data and get a feel for it.

data(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
?cars

Visualization

We should begin by making a simple scatterplot.

library(ggplot2)  # provides ggplot(), aes(), geom_point(), and geom_smooth()
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point()

It appears that there is a strong positive relationship between speed and stopping distance.

Visualizing the Linear Model

Here I will use ggplot’s built-in functions to overlay a linear regression line on the previous scatterplot.

ggplot(cars, aes(x = speed, y= dist))+
  geom_point()+
  geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

The Linear Model

Here we will fit the linear model exactly and discuss how good a fit it is.

The Model itself

attach(cars)
cars.lm <- lm(dist ~ speed)
cars.lm
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

Thus, the y-intercept is -17.579 and the slope is 3.932, so the fitted line is dist = -17.579 + 3.932 × speed.
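As a quick check on how to read these coefficients, the sketch below refits the same model and predicts the stopping distance at one speed. The name fit and the speed of 10 are chosen only for illustration and are not part of the assignment.

```r
# Refit the model with an explicit data argument (illustrative name: fit).
fit <- lm(dist ~ speed, data = cars)

# Prediction from the fitted line: intercept + slope * speed.
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["speed"] * 10)

# The same prediction using predict(); speed = 10 is an arbitrary example value.
by_predict <- unname(predict(fit, newdata = data.frame(speed = 10)))

c(by_hand = by_hand, by_predict = by_predict)
```

Both values should agree and land near 21.7, since -17.579 + 3.932 × 10 is about 21.74.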

The Justification

Now let’s see if the linear model is justified.

summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Residuals

First we check the residuals. We want them to look close to normal, with a median close to 0.

With a median of -2.272, the residuals are centered close to 0. The first and third quartiles should be nearly equidistant from 0, and they are: |Q1| = 9.525 is about equal to |Q3| = 9.215. The min and max should follow the same rule, but here |max| = 43.201 is greater than |min| = 29.069, so we have some pull to the right.

So for the residuals we have checked almost every box, and the one unchecked box wasn’t too bad. We look like we’re in good shape, but we need to keep investigating.
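The symmetry checks above can be sketched directly from the residuals. This is a minimal illustration, assuming the same model is refit under the hypothetical name fit; the two "gap" quantities are my own shorthand, not standard statistics.

```r
# Refit the model (illustrative name: fit) and pull out the residuals.
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Five-number summary: min, Q1, median, Q3, max.
q <- quantile(res)
names(q) <- c("min", "q1", "median", "q3", "max")

# Values near 0 suggest symmetry; a large positive extreme_gap suggests a right tail.
symmetry <- c(
  quartile_gap = unname(abs(q["q3"]) - abs(q["q1"])),
  extreme_gap  = unname(abs(q["max"]) - abs(q["min"]))
)

round(q, 3)
round(symmetry, 3)
```

The quartile gap should be small, while the gap between the extremes should be much larger, matching the pull to the right noted above.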

Estimated Coeff.

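For a good model we want each standard error to be at least five to ten times smaller than the corresponding coefficient. For speed, the standard error (0.4155) is 9.464 times smaller than the estimate (3.9324), so we have another box checked. The intercept does less well: its standard error (6.7584) is only about 2.6 times smaller than its estimate (-17.5791). Lastly, we have to note that speed has a significance code of ***, meaning its p-value (1.49e-12) is below 0.001, so speed is a highly significant predictor of stopping distance. Another good thing for our model.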
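The ratio used in this rule of thumb can be computed from the coefficient table. The sketch below assumes the model is refit as fit (an illustrative name); the ratio is the absolute value of the t value that summary() already reports.

```r
# Refit the model (illustrative name: fit) and extract the coefficient table.
fit <- lm(dist ~ speed, data = cars)
ctab <- coef(summary(fit))

# How many times smaller is each standard error than its estimate?
ratio <- abs(ctab[, "Estimate"]) / ctab[, "Std. Error"]

round(ratio, 3)
```

A ratio between five and ten (or higher) satisfies the rule; a smaller ratio means the estimate is less precisely determined.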

The last statistics

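Here we want the first and third quartiles of the residuals to be about 1.5 times the residual standard error. With a residual standard error of 15.38, that would be about 23.07, but our quartiles are -9.525 and 9.215, so this is not what we are getting. Looking at the R^2, we see that our model explains about 65% of the variation in stopping distance.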
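These last statistics can be pulled out of the summary object rather than read off the printout. This is a small sketch under the assumption that the model is refit as fit; the names rse, target, quartiles, and r2 are illustrative only.

```r
# Refit the model (illustrative name: fit) and keep its summary.
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

rse <- s$sigma           # residual standard error
target <- 1.5 * rse      # the rule-of-thumb value used in the text
quartiles <- quantile(resid(fit), c(0.25, 0.75))  # Q1 and Q3 of the residuals
r2 <- s$r.squared        # multiple R-squared

round(c(rse = rse, target = target, r2 = r2), 4)
round(quartiles, 3)
```

Comparing abs(quartiles) with target shows how far the residual quartiles fall from the rule-of-thumb value.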

The residual plot

plot(fitted(cars.lm), resid(cars.lm))

Here we should see residuals scattered evenly above and below 0, with no obvious pattern. That is not quite our case: there are more negative residuals than positive ones, which matches the negative median in the summary.
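To make the balance above and below 0 easier to judge, one option is to add a reference line at 0 and count the residuals on each side. The sketch below assumes the model is refit as fit; the axis labels and the names n_neg and n_pos are illustrative.

```r
# Refit the model (illustrative name: fit) and pull out the residuals.
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Residuals against fitted values, with a dashed reference line at 0.
plot(fitted(fit), res,
     xlab = "Fitted stopping distance", ylab = "Residual")
abline(h = 0, lty = 2)

# Count how many residuals fall below and above 0.
n_neg <- sum(res < 0)
n_pos <- sum(res > 0)
c(negative = n_neg, positive = n_pos)
```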

The QQ plot

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

Ideally our points will be very close to the line. In our case the points are indeed very close to the line over most of its length.

Conclusion

In conclusion, most of the needed boxes were checked, so this linear model is justified. The remaining discrepancies likely come from input factors we do not have, such as vehicle weight, brake condition, and weather conditions. With all of that being said, I am happy to say that speed does impact stopping distance.