Question

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Answer

# load libraries
library(ggplot2)

Visualize Data

# cars ships with base R (datasets package); review it briefly
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
# create scatter plot and regression line for visualization

theme_set(theme_bw())
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Cars Dataset",
       subtitle = "Linear Model",
       x = "Speed (mph)",
       y = "Stopping distance (ft)")

Linear Model Function

# fit a model in which speed (predictor) explains stopping distance (response)
cars_lm <- lm(dist ~ speed, data = cars)
cars_lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

Evaluation of the Model

# summary of cars linear model
summary(cars_lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The Call line confirms that the model was fit with the intended formula, dist ~ speed. The median residual is -2.272, which is small relative to the residual range (-29.069 to 43.201), and the first and third quartiles (-9.525 and 9.215) are roughly symmetric about zero, as expected for a reasonable fit. The slope estimate of 3.9324 means that each additional mph of speed adds about 3.9 ft of stopping distance, and its p-value (1.49e-12) shows that the slope is significantly different from zero. The R-squared of 0.6511 indicates that speed explains about 65% of the variance in stopping distance.
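The quantities read off the summary can also be pulled out of the fitted object programmatically. The following is a minimal sketch, assuming a model of stopping distance on speed as the question asks; the object names (`fit`, `new_speed`) and the 20 mph example value are illustrative and not part of the original analysis.

```r
# Assumption: stopping distance (dist) modeled as a function of speed,
# as stated in the question. `fit` is an illustrative name.
fit <- lm(dist ~ speed, data = cars)
fit_summary <- summary(fit)

# Coefficient table: estimates, standard errors, t values, p-values
coef_table <- coef(fit_summary)
slope <- coef_table["speed", "Estimate"]
slope_p <- coef_table["speed", "Pr(>|t|)"]

# Share of the variance in dist explained by speed
r_squared <- fit_summary$r.squared

# 95% confidence intervals for the intercept and slope
ci <- confint(fit, level = 0.95)

# Predicted stopping distance at a hypothetical speed of 20 mph,
# with a 95% prediction interval
new_speed <- data.frame(speed = 20)
pred <- predict(fit, newdata = new_speed, interval = "prediction")

print(coef_table)
print(r_squared)
print(ci)
print(pred)
```

The prediction interval is wider than a confidence interval for the mean response because it also accounts for the scatter of individual observations around the line.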

Residual Analysis

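# residuals vs. fitted values; an adequate linear fit scatters evenly around zero
# (a least-squares line through these points is flat at zero by construction,
# so a reference line at zero is drawn instead)
ggplot(cars, aes(fitted(cars_lm), resid(cars_lm))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Cars Dataset",
       subtitle = "Residual Analysis",
       x = "Fitted",
       y = "Residual")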

qqnorm(resid(cars_lm))
qqline(resid(cars_lm))

The residuals appear to be approximately normally distributed: the points in the Q-Q plot follow the reference line, with only a slight divergence at the upper end. Based on the quality evaluation and the residual analysis, I would say this is a reasonably good model.
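The visual reading of the Q-Q plot can be complemented with a formal check. The sketch below is one possible approach, not part of the original analysis: it assumes the model of stopping distance on speed stated in the question and applies the Shapiro-Wilk test from base R to the residuals. The names `fit`, `res`, and `sw` are illustrative.

```r
# Assumption: stopping distance (dist) modeled as a function of speed.
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Least-squares residuals average to zero (up to floating-point error)
# whenever the model includes an intercept
res_mean <- mean(res)

# Shapiro-Wilk test: the null hypothesis is that the residuals come from
# a normal distribution, so a small p-value is evidence against normality
sw <- shapiro.test(res)

print(res_mean)
print(sw)
```

A formal test and a Q-Q plot can disagree in small samples, so the test result is best read alongside the plot rather than in place of it.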