FZahir_Assign11

Problem

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis in chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

Load the data

attach(cars)
summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Visualize the data

We can see that the stopping distance tends to increase as the speed increases.

plot(speed, dist, xlab='Speed (mph)', ylab='Distance (ft)', 
     main='Distance vs. Speed')

Linear Regression model

We fit a one-factor linear regression that models stopping distance as a function of speed.

lm <- lm(dist ~ speed, data=cars)
summary(lm)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Y-intercept = -17.5791, slope = 3.9324

The equation for the regression model is

\[ \widehat{\text{dist}} = -17.5791 + 3.9324 \times \text{speed} \]
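As a quick illustration of how the equation is used, the sketch below predicts the stopping distance at one speed, both with `predict()` and by hand from the coefficients. The object name `fit` and the speed of 20 mph are assumptions chosen for this example, not part of the original analysis.

```r
# Refit the model from the report; `fit` is a name chosen for this sketch.
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speed (mph) at which to predict stopping distance.
new_speed <- data.frame(speed = 20)

# Prediction from the fitted model object.
pred <- predict(fit, newdata = new_speed)

# The same prediction computed directly from intercept + slope * speed.
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["speed"] * 20)

print(pred)
print(by_hand)
```

Both values should agree, at roughly 61 ft for a speed of 20 mph.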

Evaluating the quality of the model

Residuals:

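The first and third quartiles (-9.525 and 9.215) are roughly equal in magnitude, but the minimum residual is -29.069 whereas the maximum is 43.201, and the median is -2.272 rather than 0. The residuals are therefore skewed toward large positive values, and their spread seems to increase as speed increases, so they are not exactly normally distributed.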
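The five-number summary discussed above can be pulled directly from the model rather than read off the printed output. This is a minimal sketch; `fit`, `res`, `q`, and `asymmetry` are names assumed for the example.

```r
# Refit the model from the report; `fit` is a name chosen for this sketch.
fit <- lm(dist ~ speed, data = cars)

# Residuals: observed distance minus fitted distance.
res <- resid(fit)

# Min, quartiles, and max of the residuals.
q <- quantile(res)
print(round(q, 3))

# Ratio of the largest positive residual to the largest negative one;
# a value well above 1 points to a longer right tail.
asymmetry <- max(res) / abs(min(res))
print(asymmetry)
```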

Coefficients

For a good model, we would like to see standard errors that are at least 5 to 10 times smaller than the corresponding coefficients.

The standard error of the intercept is only about 2.6 times smaller than its estimate, whereas the standard error of the speed coefficient is roughly 9.5 times smaller. These values suggest that the slope estimate shows little variability but the intercept estimate can vary significantly.

The p-values of both coefficients are small (0.0123 for the intercept and 1.49e-12 for speed), so both are statistically significant at the 5% level, and speed in particular is very unlikely to be irrelevant to the model.
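The ratios and p-values discussed in this section can be computed from the coefficient table. The sketch below assumes the names `fit`, `ctab`, and `ratio`; the 0.05 threshold is the conventional significance level, used here only for illustration.

```r
# Refit the model from the report; `fit` is a name chosen for this sketch.
fit <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
ctab <- coef(summary(fit))

# How many times smaller each standard error is than its estimate.
# This is the absolute value of the t statistic.
ratio <- abs(ctab[, "Estimate"]) / ctab[, "Std. Error"]
print(round(ratio, 2))

# Which coefficients are significant at the 5% level.
print(ctab[, "Pr(>|t|)"] < 0.05)
```

The ratio for each coefficient equals the magnitude of its t value in the summary output, which is why the 5-to-10-times rule of thumb and the t statistics tell the same story.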

R-squared

A multiple R-squared of 0.6511 means that the model explains 65.11% of the variation in stopping distance.
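To make the meaning of this number concrete, the sketch below recomputes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The names `fit`, `sse`, `sst`, and `r2` are assumptions for this example.

```r
# Refit the model from the report; `fit` is a name chosen for this sketch.
fit <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the model (residual sum of squares).
sse <- sum(resid(fit)^2)

# Total variation of stopping distance around its mean.
sst <- sum((cars$dist - mean(cars$dist))^2)

# Share of the total variation explained by the model.
r2 <- 1 - sse / sst

print(r2)
print(summary(fit)$r.squared)  # value reported by summary()
```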

Plotting the linear model

plot(speed, dist, xlab='Speed (mph)', ylab='Distance (ft)', 
     main='Distance vs. Speed')
abline(lm)

Residual analysis

plot(fitted(lm), resid(lm))
abline(a=0, b=0, col='red')

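The spread of the residuals tends to increase as we move to the right, toward larger fitted values, so the residuals are not uniformly scattered above and below zero. For a well-fitting model we would expect the residuals to be scattered evenly around zero, with the model overpredicting as often as it underpredicts at every fitted value; the widening spread here suggests that the variance is not constant.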
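The visual impression from the residual plot can be backed by a rough numerical check. The sketch below is an informal diagnostic, not a formal test taken from the textbook: it compares the residual standard deviation in the lower and upper halves of the fitted values, and computes a Spearman rank correlation between the absolute residuals and the fitted values. All object names are assumptions for this example.

```r
# Refit the model from the report; `fit` is a name chosen for this sketch.
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)
fv  <- fitted(fit)

# Split observations at the median fitted value.
lower_half <- fv <= median(fv)

# Residual spread in each half; a clearly larger value in the
# upper half is consistent with a widening spread.
sd_low  <- sd(res[lower_half])
sd_high <- sd(res[!lower_half])
print(c(low = sd_low, high = sd_high))

# Rank correlation between residual magnitude and fitted value.
# exact = FALSE avoids the warning about ties in the data.
rho <- cor.test(abs(res), fv, method = "spearman", exact = FALSE)
print(rho$estimate)
print(rho$p.value)
```

A positive correlation with a small p-value would support the reading that the residual spread grows with the fitted stopping distance.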

Q-Q plot

qqnorm(resid(lm), col='blue')
qqline(resid(lm), col='red')

Although the points in the middle lie fairly close to the line, the points at the ends diverge significantly. The tails of the distribution are heavier than those of a normal distribution, so the residuals are not normally distributed.
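The Q-Q plot can be complemented with a formal normality test. The sketch below uses the Shapiro-Wilk test from base R, which is an addition for illustration and not part of the textbook analysis; `fit` and `sw` are assumed names.

```r
# Refit the model from the report; `fit` is a name chosen for this sketch.
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution.
sw <- shapiro.test(resid(fit))

print(sw$statistic)  # W statistic; values near 1 are consistent with normality
print(sw$p.value)    # a small p-value is evidence against normality
```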

Conclusion

We conclude that speed alone is insufficient to fully explain the variation in stopping distance: the model accounts for about 65% of it, and the residuals are neither evenly spread nor normally distributed. Other factors may therefore need to be considered to predict stopping distance accurately.