library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
The cars data is loaded with the data() function, which creates a data.frame object called cars containing the two columns of interest.
data(cars)
Let’s take a quick look at the data:
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
As we can see, the data contains two columns: the speed of a car and the distance required for it to stop. The exercise here is to see to what degree there is a relationship between a car’s speed and its stopping distance, and whether we can effectively predict a car’s stopping distance given its speed.
First, let’s simply plot the two variables against each other to get a visual sense of the relationship between them, if any:
ggplot(data=cars, aes(x=speed, y=dist)) +
geom_point()
Despite some variability, at a glance it does appear that as car speed increases, so does the distance required for it to stop.
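To put a rough number on that visual impression (a supplementary check, not part of the textbook walkthrough), we can compute the sample correlation between the two variables:
# Pearson correlation between speed and stopping distance; for a simple linear
# regression this equals the square root of the R-squared reported below (~0.81)
cor(cars$speed, cars$dist)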
To see the strength of that relationship, we can fit a linear model using R’s built-in lm() function.
model <- lm(dist~speed, data=cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
A few notes at a glance:
- The slope estimate (3.9324) is large relative to its standard error (0.4155):
3.9324 / 0.4155
## [1] 9.46426
…the standard error is nearly 10 times smaller than the estimate. Per section 3.3 in our textbook, this means that “there is relatively little variability in the slope estimate,” a signifier of good fit.
- We know our residuals should be roughly normally distributed around 0. We see a median quite close to 0, symmetrical quartiles, and a somewhat symmetrical minimum and maximum. We can explore this in greater detail later.
- The minuscule p-value for speed leaves little doubt that its effect on stopping distance is statistically significant.
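As a supplementary check on that precision (not part of the textbook steps), confint() turns the same estimates and standard errors into 95% confidence intervals for the coefficients:
# 95% confidence intervals for the intercept and slope; given the summary above,
# the slope interval should sit roughly at 3.93 +/- 2 * 0.42
confint(model)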
Let’s re-plot the data with our model superimposed:
# The model drawn by geom_smooth(method='lm') should be exactly the same as the one built above using lm()
ggplot(data=cars, aes(x=speed, y=dist)) +
geom_point() +
geom_smooth(method='lm', formula=y~x)
Again, a visual gut-check tells us this model does a pretty good job of predicting a car’s stopping distance based on its speed.
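To illustrate that predictive use concretely (a hypothetical example, not part of the assignment), predict() applies the fitted coefficients to a new speed value:
# Predicted stopping distance for a hypothetical speed of 21 mph:
# roughly -17.58 + 3.93 * 21, or about 65 ft, given the coefficients above
predict(model, newdata = data.frame(speed = 21))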
Now, as a last step, we need to analyze our residuals more closely. The residuals describe the difference between the predicted values in our model and the actual values in our data, and they should follow some general principles in order for our model results to be trustworthy.
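As a quick sanity check (not in the original analysis), the residuals can be recomputed directly as observed minus fitted values:
# resid(model) should match the observed stopping distances minus the model’s
# fitted values; all.equal() should return TRUE
all.equal(resid(model), cars$dist - fitted(model))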
Using the code from the book, we can generate a plot of our residuals against the model’s fitted values:
plot(fitted(model), resid(model))
This chart shows no visible pattern, which is a good sign for the model fit. For a good model, we should expect the residuals to look random when plotted against the fitted values; otherwise, there may be other, more important explanatory factors we’ve missed.
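The same diagnostic can be drawn with ggplot2, which is already loaded; this is just an alternative rendering of the base-R plot above:
# Residuals vs. fitted values, with a dashed reference line at zero
ggplot(data.frame(fitted = fitted(model), resid = resid(model)),
       aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed')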
Next, we can look at a Q-Q plot to get a better sense of how our residuals are distributed:
qqnorm(resid(model))
qqline(resid(model))
Because the points on this chart hew closely to the reference line, it tells us our residuals are close to being normally distributed, another strong indicator of our model’s good fit.