DATA605 Homework11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

data(cars)
cars_df <- cars

head(cars_df, 10)

##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17

str(cars_df)

## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

These data were recorded in the 1920s and give the speed of cars and the distances taken to stop. The dataset contains 50 observations of two numeric variables: speed, measured in miles per hour (mph), and dist, the stopping distance measured in feet (ft).

library(ggplot2)

ggplot(cars_df, aes(speed, dist)) + 
  geom_point(size = 2, alpha = .4) +
  geom_smooth(method = "lm", se = FALSE, alpha = .2) +
  labs(title = "Speed vs Stopping Distance", 
       x = "Speed (mph)", 
       y = "Stopping distance (ft)")

lm_cars <- lm(speed~dist, data = cars_df)

summary(lm_cars)

## 
## Call:
## lm(formula = speed ~ dist, data = cars_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Interpreting the model

The call lm(speed ~ dist, data = cars_df) treats speed as the response and stopping distance as the predictor, so the fitted line from the output above is:

\(speed = 8.28391 + 0.16557 * dist\)

This implies that:

1. Each additional foot of stopping distance is associated with an increase of 0.16557 mph in the predicted speed.

2. The stopping distance is clearly relevant in this model because its p-value (1.49e-12) is very near zero. The intercept's p-value (1.44e-12) is also very near zero.

3. The model produced a Multiple R-squared of 0.6511, implying that about 65% of the variation in the response is accounted for by the least-squares line.
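The assignment asks for stopping distance as a function of speed, whereas the call lm(speed ~ dist, ...) above uses speed as the response. As a point of comparison, here is a minimal sketch of the fit in the other direction. The names lm_dist, r2_dist and r2_cor are introduced here for illustration and are not part of the original analysis.

```r
# Sketch: regress stopping distance on speed, the direction the assignment describes.
# `lm_dist` is a hypothetical name used only in this example.
data(cars)
lm_dist <- lm(dist ~ speed, data = cars)

# Intercept (ft) and slope (ft per mph) of the fitted line
coef(lm_dist)

# In simple linear regression, R-squared equals the squared correlation,
# so it is the same whichever variable is treated as the response.
r2_dist <- summary(lm_dist)$r.squared
r2_cor  <- cor(cars$speed, cars$dist)^2
all.equal(r2_dist, r2_cor)
```

In this direction the slope is about 3.93 ft of stopping distance per additional mph and the intercept is about -17.58 ft, while the R-squared stays at 0.6511.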

# Plot the residuals against the predictor (dist), because the model is speed ~ dist
ggplot(data = cars_df, aes(x = dist, y = lm_cars$residuals)) + 
  geom_point(size = 2, alpha = .3) + 
  # Horizontal reference line at zero
  geom_hline(yintercept = 0, color = "blue") +
  theme(panel.grid.major = element_line(color = "green")) +
  labs(title = "Stopping Distance vs Model Residuals", 
       x = "Stopping distance (ft)", 
       y = "Model Residuals")

Using qqnorm to check whether the residuals are approximately normally distributed.

qqnorm(lm_cars$residuals)
qqline(lm_cars$residuals)

We can observe that the residuals are approximately normally distributed, although they depart somewhat from the reference line in the tails.

Reviewing further using a histogram:

hist(lm_cars$residuals, main="Histogram of Linear model Residuals", xlab="Residuals")

The histogram above suggests that the residuals are roughly, though not perfectly, normally distributed.
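The Q-Q plot and histogram are visual checks. A numeric complement, which is not part of the original analysis, is the Shapiro-Wilk test from base R. The sketch below refits the same model so that it runs on its own. The name sw is an assumption made for this example.

```r
# Sketch: Shapiro-Wilk normality test on the residuals of the model fitted above.
data(cars)
lm_cars <- lm(speed ~ dist, data = cars)

# `sw` is a hypothetical name for the test result
sw <- shapiro.test(resid(lm_cars))

sw$statistic  # W statistic; values close to 1 are consistent with normality
sw$p.value    # a small p-value would be evidence against normality
```

With only 50 observations the test has limited power, so it is best read alongside the plots rather than in place of them.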

Testing further using inference.

Inference:

H0: There is no relationship between speed and stopping distance

HA: There is a positive relationship (correlation) between speed and stopping distance
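One way to test these hypotheses directly is a one-sided correlation test. The following is a minimal sketch using cor.test from base R with alternative = "greater", which matches the positive direction stated in HA. The name ct is introduced here for illustration.

```r
# Sketch: one-sided Pearson correlation test of H0 (no relationship)
# against HA (positive relationship) between speed and stopping distance.
data(cars)

# `ct` is a hypothetical name for the test result
ct <- cor.test(cars$speed, cars$dist, alternative = "greater")

ct$estimate  # sample correlation between speed and dist
ct$p.value   # one-sided p-value; a small value favours HA
```

The t-test on the slope reported by summary() is two-sided, so its p-value is about twice the one-sided value returned here.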

Wrapping up: a second look at the model using the summary() function.

summary(lm_cars)

## 
## Call:
## lm(formula = speed ~ dist, data = cars_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

As already noted: