Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
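The code chunk that produced the output above is not shown. A minimal sketch that should give the same first six rows and column summaries, assuming the built-in `cars` data frame from the `datasets` package:

```r
# `cars` ships with base R (package `datasets`), so nothing needs installing.
data(cars)

# First six rows of the data frame.
head(cars)

# Minimum, quartiles, median, mean and maximum of each column.
summary(cars)
```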

Let us build a linear model and find the best fitting line.

## 
## Call:
## lm(formula = cars$dist ~ cars$speed, data = cars)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932
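The call in the output writes the formula as `cars$dist ~ cars$speed`. A slightly more idiomatic sketch names the columns directly and lets `data = cars` resolve them. The object name `fit` is an assumption, not something taken from the original code:

```r
# Fit stopping distance (dist) as a linear function of speed.
# Columns are looked up in `cars` through the `data` argument.
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the best-fitting line.
coef(fit)
```

The coefficients are the same either way. The difference shows up later: with bare column names in the formula, `predict()` can be given new speeds through its `newdata` argument.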

Let us look at a plot of the two variables. Speed is the independent variable and stopping distance is the dependent variable.
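The plot itself is not reproduced here. One way to draw it, as a sketch, is a scatter plot with the fitted line on top. The axis labels use the units given in the dataset documentation (mph and ft); the title, colour and the name `fit` are assumptions:

```r
# Refit the model so this chunk runs on its own.
fit <- lm(dist ~ speed, data = cars)

# Scatter plot: independent variable on x, dependent variable on y.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed")

# Overlay the least-squares line from the fitted model.
abline(fit, col = "red")
```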

There appears to be a positive correlation between the two variables. Let us evaluate the linear model we have.

## 
## Call:
## lm(formula = cars$dist ~ cars$speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
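A report like the one above is what `summary()` prints for an `lm` object. A short sketch, again assuming the model is stored under the hypothetical name `fit`:

```r
fit <- lm(dist ~ speed, data = cars)

# Full report: residual quantiles, coefficient table, R-squared, F-statistic.
s <- summary(fit)
s

# Individual pieces can be pulled out of the summary object.
s$sigma      # residual standard error
s$r.squared  # multiple R-squared
s$df[2]      # residual degrees of freedom
```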

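The residual quartiles are roughly symmetric around zero (1Q = -9.525, 3Q = 9.215) and the median (-2.272) is close to zero, which is consistent with approximately normal residuals. The maximum (43.201) is larger in magnitude than the minimum (-29.069), which hints at a somewhat longer right tail.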

The book says: “For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient”.

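For the speed coefficient, the standard error (0.4155) is about 9.5 times smaller than the estimate (3.9324), which is within the range the book recommends. For the intercept, the standard error (6.7584) is only about 2.6 times smaller than the estimate (-17.5791), so the intercept is estimated less precisely.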
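The book's rule of thumb can be checked directly by dividing each estimate by its standard error. The sketch below does this from the coefficient table; the names `fit`, `coefs` and `ratio` are assumptions:

```r
fit <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- coef(summary(fit))

# How many times smaller the standard error is than the estimate.
# In absolute value this is the same number as the t value.
ratio <- abs(coefs[, "Estimate"]) / coefs[, "Std. Error"]
round(ratio, 2)
```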

The p-value for the speed coefficient is 1.49e-12. This is the probability of seeing a t value at least this extreme if speed had no linear relationship with stopping distance, so speed is very relevant in modeling stopping distance.

The p-value of the intercept is 0.0123, so the intercept is significant at the 0.05 level, although the evidence is much weaker than for speed.
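To see where these p-values come from, they can be recomputed from the t values and the residual degrees of freedom as two-sided tail probabilities of the t distribution. This is a sketch with assumed variable names:

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))

# Residual degrees of freedom: 50 observations minus 2 coefficients.
df <- df.residual(fit)

# Two-sided p-value: probability of a t value at least this far from zero.
p_manual <- 2 * pt(-abs(coefs[, "t value"]), df = df)

# Side-by-side comparison with the values reported by summary().
cbind(reported = coefs[, "Pr(>|t|)"], manual = p_manual)
```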

The multiple R-squared is 0.6511, which means that this model explains 65.11% of the variation in stopping distance.
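The R-squared value can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The following sketch uses assumed names (`sse`, `sst`, `r2`):

```r
fit <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the model.
sse <- sum(residuals(fit)^2)

# Total variation of stopping distance around its mean.
sst <- sum((cars$dist - mean(cars$dist))^2)

# Share of the variation explained by the model.
r2 <- 1 - sse / sst
r2
```

With a single predictor, this is also the square of the correlation between speed and distance, `cor(cars$speed, cars$dist)^2`.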

Let us plot the residuals.

The residuals at the extremes do not appear to have the same variance as the rest, which suggests that the constant-variance assumption may not fully hold.
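The residual plot is not reproduced here. A common way to draw one, sketched under the assumption that the model is stored as `fit`, is to plot residuals against fitted values and add a reference line at zero:

```r
fit <- lm(dist ~ speed, data = cars)

# Residuals against fitted values; an even band around zero
# would indicate constant variance.
plot(fitted(fit), resid(fit),
     xlab = "Fitted stopping distance (ft)",
     ylab = "Residuals (ft)",
     main = "Residuals vs. fitted values")

# Reference line at zero residual.
abline(h = 0, lty = 2)
```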

Apart from some deviation at the tails, the normal Q-Q plot of the residuals follows the theoretical line, so the residuals are reasonably normally distributed.
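A normal Q-Q plot of this kind can be drawn with base R's `qqnorm()` and `qqline()`. This is a sketch; the plot title and the name `fit` are assumptions:

```r
fit <- lm(dist ~ speed, data = cars)

# Sample quantiles of the residuals against theoretical normal quantiles.
qqnorm(resid(fit), main = "Normal Q-Q plot of residuals")

# Reference line through the first and third quartiles.
qqline(resid(fit))
```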