Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

Loading data

library(ggplot2)   # plotting functions used below
library(magrittr)  # provides the %>% pipe
data(cars)
print(sum(is.na(cars$speed)))
## [1] 0
print(sum(is.na(cars$dist)))
## [1] 0

There are no missing values.

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
sum(abs((cars$speed - mean(cars$speed)) / sd(cars$speed)) > 2.5)
## [1] 0
sum(abs((cars$dist - mean(cars$dist)) / sd(cars$dist)) > 2.5)
## [1] 1

Using a cutoff of 2.5 standard deviations from the mean, there aren’t any outliers in speed. There is one potential outlier in stopping distance.
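The same z-score check can be written once for both columns. The sketch below assumes only base R; the names z, cutoff, and outlier_counts are illustrative and not part of the original analysis.

```r
data(cars)

# scale() centers each column and divides by its sample standard deviation,
# which gives the same z-scores as (x - mean(x)) / sd(x).
z <- scale(cars)

# Count the observations in each column that lie beyond the 2.5 cutoff.
cutoff <- 2.5
outlier_counts <- colSums(abs(z) > cutoff)
print(outlier_counts)
```

This should reproduce the counts reported above for speed and dist.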

cars[abs((cars$dist - mean(cars$dist)) / sd(cars$dist)) > 2.5,]
##    speed dist
## 49    24  120
cars %>%
  ggplot(aes(x=speed, y=dist)) +
  geom_point(position="jitter") +
  geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

Based on the scatterplot above, we can see that there is a strong positive correlation between speed and stopping distance of a vehicle.

cor(cars$speed, cars$dist)
## [1] 0.8068949

The correlation coefficient is about 0.81, which confirms a strong, though not perfect, positive linear relationship.
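One way to put the correlation in perspective is to square it. In a linear regression with a single predictor, the squared Pearson correlation equals the model's R-squared, so a correlation of about 0.81 corresponds to roughly 65% of the variance in stopping distance being explained by speed. A minimal sketch of that check, where fit and r_squared are illustrative names:

```r
data(cars)

r <- cor(cars$speed, cars$dist)

# Simple linear regression with one predictor.
fit <- lm(dist ~ speed, data = cars)
r_squared <- summary(fit)$r.squared

# For a one-predictor linear model, r^2 and R-squared coincide.
print(c(r = r, r_squared_from_cor = r^2, r_squared_from_lm = r_squared))
```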

Building a model

model <- lm(dist ~ speed, data=cars)
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

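We can see from the above summary that:

The estimated slope is 3.93, so each additional mph of speed is associated with about 3.93 ft of additional stopping distance. This coefficient is highly significant (p = 1.49e-12).

The estimated intercept is -17.58 (p = 0.0123). It has no physical interpretation, because a speed of 0 lies outside the observed range of 4 to 25 mph.

The model explains about 65% of the variance in stopping distance (R-squared of 0.6511, adjusted R-squared of 0.6438).

The residual standard error is 15.38 on 48 degrees of freedom.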
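The quantities in the summary can also be pulled out programmatically. The sketch below uses base R's coef(), confint(), and predict(); the speed of 20 mph and the name new_speed are hypothetical choices made only for illustration.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)

# Point estimates and 95% confidence intervals for intercept and slope.
print(coef(model))
print(confint(model, level = 0.95))

# Predicted stopping distance at a hypothetical speed of 20 mph,
# with a 95% prediction interval for a single new observation.
new_speed <- data.frame(speed = 20)
pred <- predict(model, newdata = new_speed, interval = "prediction")
print(pred)
```

The prediction interval is wider than a confidence interval for the mean response would be, because it also accounts for the scatter of individual observations around the fitted line.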

Analyzing residuals

ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

From the residuals plot above, we see that the residuals are, for the most part, centered around zero. However, their spread is not constant across the fitted values, which indicates slight heteroscedasticity.
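The visual impression of heteroscedasticity can be backed up with a formal check. The sketch below is one possible approach, not part of the original analysis: a Breusch-Pagan-style test in Koenker's studentized form, computed by hand in base R. It regresses the squared residuals on speed and compares n times the R-squared of that auxiliary regression to a chi-squared distribution with 1 degree of freedom. The names aux_data, aux, bp_stat, and bp_p are illustrative.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor.
aux_data <- data.frame(sq_resid = resid(model)^2, speed = cars$speed)
aux <- lm(sq_resid ~ speed, data = aux_data)

# LM statistic: n * R-squared of the auxiliary regression,
# compared to a chi-squared distribution with 1 degree of freedom
# (one predictor in the auxiliary regression).
n <- nrow(aux_data)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value would be evidence against constant residual variance. A large one would mean the pattern seen in the plot is too weak to distinguish from noise with only 50 observations.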

qqnorm(resid(model))
qqline(resid(model))

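Using the Q-Q plot, we can see that our residuals are not quite normally distributed. The right tail is heavier than a normal distribution would give, which is consistent with the largest residual (43.2) lying farther from zero than the smallest (-29.1).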
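As a complement to the Q-Q plot, normality of the residuals can also be tested formally. The sketch below uses the Shapiro-Wilk test from base R's stats package; it is an added check rather than part of the original analysis, and sw is an illustrative name.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution.
sw <- shapiro.test(resid(model))
print(sw)
```

A small p-value would support the conclusion drawn from the Q-Q plot that the residuals depart from normality.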

ggplot(data = model, aes(x = .resid)) +
  geom_histogram(binwidth=8)