HW 11: Simple Linear Regression

Cars Dataset

library(tidyverse)

df <- datasets::cars

summary(df)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Visualize

plot(cars$speed, cars$dist, xlab="speed (mph)", ylab="stopping distance (ft)", main="Cars")

Build model

cars.lm <- lm(dist ~ speed, data=cars)

cars.lm

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

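\[ \widehat{dist} = -17.579 + 3.932 \times speed \] For every 1 mph increase in speed, the predicted stopping distance increases by about 3.9 feet. Taken literally, the intercept says that a car traveling at 0 mph has a stopping distance of -17.6 feet, which is physically impossible. A speed of 0 mph lies outside the observed range of 4 to 25 mph, so the intercept is an extrapolation and has no practical interpretation here.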
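As a rough illustration of how the fitted equation is used, the sketch below refits the model and predicts stopping distance at two speeds inside the observed range. The speeds 10 and 20 mph and the names `fit`, `new_speeds`, `pred`, and `by_hand` are chosen for illustration only and are not part of the assignment.

```r
# Refit the same model so this block runs on its own
fit <- lm(dist ~ speed, data = cars)

# Illustrative speeds (an assumption), both inside the observed 4 to 25 mph range
new_speeds <- data.frame(speed = c(10, 20))

# Predictions from predict()
pred <- predict(fit, newdata = new_speeds)

# The same predictions computed by hand as intercept + slope * speed
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed

print(pred)
print(by_hand)
```

Using the coefficients printed above, the two predictions come out to roughly 21.7 and 61.1 feet. Predicting at speeds far outside 4 to 25 mph would run into the same extrapolation problem as the intercept.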

plot(dist ~ speed, data=cars)
abline(cars.lm)

Model Quality Evaluation

summary(cars.lm)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

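The median residual is -2.27. That is small relative to the residual standard error of 15.38, but it is below zero, which hints at a mild right skew.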
The first and third quartiles are of similar magnitudes.
The minimum and maximum residuals are not of similar magnitudes.
The ratio of the speed coefficient to its standard error is 3.9324/0.4155 = 9.46, which is the t value reported in the table. The standard error is about 9.5 times smaller than the coefficient, so there is relatively little variability in the slope estimate.
The p-value of the speed coefficient is very small (1.49e-12), so there is strong evidence of a linear relationship between speed and stopping distance.
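If the residuals are normally distributed, the first and third quartiles should lie about 0.67 residual standard errors on either side of zero, because the quartiles of a standard normal distribution are at about ±0.674. The observed quartiles are slightly smaller in magnitude than that but roughly in line with it, so the departure from normality shows up in the tails rather than in the quartiles:

\[ -9.525 \approx -0.674 \times 15.38 = -10.37 \]

\[ 9.215 \approx 0.674 \times 15.38 = 10.37 \]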
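For normally distributed residuals, the quartiles sit at about ±0.674 standard deviations from zero. A minimal sketch of that comparison follows. It refits the model so that it runs on its own, and the names `fit`, `res`, `s`, `expected_q`, and `observed_q` are illustrative.

```r
# Refit the same model so this block runs on its own
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Residual standard error (15.38 in the summary output)
s <- sigma(fit)

# Quartiles expected under normality:
# qnorm(0.25) and qnorm(0.75) are about -0.674 and 0.674
expected_q <- qnorm(c(0.25, 0.75)) * s

# Observed first and third quartiles of the residuals
observed_q <- quantile(res, probs = c(0.25, 0.75))

print(round(expected_q, 2))
print(round(observed_q, 2))
```

This is only an informal check. The quartiles say nothing about the tails, which is where the minimum and maximum residuals differ most.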

The multiple R-squared is 0.6511, which means that 65.11% of the variability in stopping distance is explained by speed.
The F-statistic adds little here since the model has only one predictor: it equals the square of the t value for speed (9.464^2 ≈ 89.57) and has the same p-value.
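The quantities in the summary can be reproduced by hand, which may make the relationships between them easier to see. The sketch below is one way to do that, not the only one. All variable names are illustrative, and the 95% confidence interval at the end is an addition that does not appear in the output above.

```r
# Refit the same model so this block runs on its own
fit <- lm(dist ~ speed, data = cars)
sm <- summary(fit)

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
rss <- sum(resid(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r2_by_hand <- 1 - rss / tss

# With a single predictor, R-squared is also the squared correlation
r2_from_cor <- cor(cars$speed, cars$dist)^2

# t value for the slope: estimate divided by its standard error
coefs <- coef(sm)
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# F-statistic reported by summary(); with one predictor it equals t^2
f_stat <- sm$fstatistic[["value"]]

print(c(r2_by_hand = r2_by_hand, r2_from_cor = r2_from_cor))
print(c(t_speed = t_speed, t_squared = t_speed^2, f_stat = f_stat))

# 95% confidence interval for the slope (not part of the original output)
print(confint(fit, "speed", level = 0.95))
```

A confidence interval for the slope that stays well away from zero tells the same story as the large t value and the small p-value.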

Diagnostic plots

par(mfrow=c(2,2))
plot(cars.lm)

In the Q-Q plot, the right tail is “heavier” than what would be expected for residuals that are normally distributed. The distribution is right-skewed.
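A numeric check can back up the visual reading of the Q-Q plot. The sketch below computes a moment-based sample skewness of the residuals and runs a Shapiro-Wilk test. Neither is part of the original analysis; they are swapped in here as one possible way to quantify the skew, and the names `fit`, `res`, `skew`, and `sw` are illustrative.

```r
# Refit the same model so this block runs on its own
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Moment-based sample skewness: positive values indicate a longer right tail
centered <- res - mean(res)
skew <- mean(centered^3) / mean(centered^2)^1.5

# Shapiro-Wilk test of normality; a small p-value suggests non-normal residuals
sw <- shapiro.test(res)

print(skew)
print(sw)
```

A positive skewness agrees with the heavier right tail seen in the plot. With only 50 observations, the test result should be read alongside the plot and not on its own.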