## installed_and_loaded.packages.
## prettydoc TRUE
## dplyr TRUE
## ggplot2 TRUE
## datasets TRUE
## gvlma TRUE
## MASS TRUE
## lmtest TRUE
## car TRUE
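The chunk that produced the table above is not shown. A minimal sketch of one way it could have been generated (an assumed reconstruction, not the original code):
# A sketch only (the original loading chunk is not shown): attach each package
# and record whether it loaded, mirroring the TRUE column printed above.
pkgs <- c("prettydoc", "dplyr", "ggplot2", "datasets", "gvlma", "MASS", "lmtest", "car")
data.frame(installed_and_loaded.packages = sapply(pkgs, require, character.only = TRUE))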
Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Cars Data Set
cars {datasets}   R Documentation
Speed and Stopping Distances of Cars
Description: The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
Format: A data frame with 50 observations on 2 variables.
[,1]  speed  numeric  Speed (mph)
[,2]  dist   numeric  Stopping distance (ft)
Source: Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
cars_df <- cars
glimpse(cars_df)
## Observations: 50
## Variables: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13...
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28...
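As a quick range check before modeling (not part of the original output), summary() confirms that speed runs from 4 to 25 mph and dist from 2 to 120 ft:
# Quick range check: speed spans 4-25 mph, dist spans 2-120 ft,
# with no missing values in either column.
summary(cars_df)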
Question
Is a car’s speed predictive of its stopping distance?
Model Summary
cars_model <- lm(dist ~ speed, cars_df)
summary(cars_model)
##
## Call:
## lm(formula = dist ~ speed, data = cars_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
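As a complement to the p-values reported above (not part of the original write-up), confidence intervals for both coefficients can be read directly off the fitted model:
# 95% confidence intervals for the intercept and the speed coefficient.
confint(cars_model, level = 0.95)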
intercept <- coef(cars_model)[1]
slope <- coef(cars_model)[2]
a <- ggplot(cars_df, aes(speed, dist))
a + geom_point() + geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)
Model Interpretation
This linear model is expressed as \(\widehat{StoppingDistance} = -17.5791 + 3.9324 \times Speed\)
For each additional mile per hour of speed, the model predicts an increase of about 3.9 feet in stopping distance.
In this model, multiple \(R^2\) is 0.6511, which means that the least-squares line accounts for approximately \(65\%\) of the variation in stopping distance.
The p-value for speed is essentially zero and the p-value for the intercept is about 1%, so both coefficients are statistically significant at the 0.05 level.
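To make the interpretation concrete, a quick worked example (not part of the original analysis): at 20 mph the fitted line gives \(-17.5791 + 3.9324 \times 20 \approx 61\) feet, which predict() reproduces.
# Fitted stopping distance at 20 mph, returned with a 95% prediction interval.
predict(cars_model, newdata = data.frame(speed = 20), interval = "prediction")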
Model Diagnostics
Let’s assess if this linear model is reliable.
Linearity: Do the variables have a linear relationship?
The Component+Residual plot shows some deviation from a linear relationship.
crPlots(cars_model)
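As an additional check on linearity (a sketch, not part of the original analysis), the straight-line fit can be compared against a model that adds a quadratic speed term; a small p-value in the F-test would indicate meaningful curvature.
# Nested-model F-test: does adding a quadratic speed term improve the fit?
cars_model_quad <- lm(dist ~ speed + I(speed^2), data = cars_df)
anova(cars_model, cars_model_quad)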
Normality: Are the model’s residuals distributed nearly normally?
No, per the histogram and Q-Q plot, the residuals are not normally distributed; there are some outliers on the right side of the distribution.
sresid <- studres(cars_model)
hist(sresid, freq = FALSE, breaks = 10, main = "Distribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)
plot(cars_model, which = 2)
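A formal complement to the visual checks (not part of the original analysis) is the Shapiro-Wilk test on the residuals; a small p-value would agree with the skewness seen in the histogram.
# Shapiro-Wilk test of normality on the model residuals;
# a small p-value indicates non-normal residuals.
shapiro.test(residuals(cars_model))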
Homoscedasticity: Is there constant variability among the residuals?
Based on the residuals-vs-fitted plot, the spread of the residuals increases somewhat with the fitted values.
The Non-constant Variance Score Test has a p-value below .05, so we reject the null hypothesis of constant variance (homoscedasticity).
plot(cars_model, which = 1)
ncvTest(cars_model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 4.650233 Df = 1 p = 0.03104933
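As a cross-check on the score test (not part of the original analysis), the Breusch-Pagan test from the already-loaded lmtest package offers a second test of the constant-variance null:
# Breusch-Pagan test; the null hypothesis is constant error variance.
library(lmtest)
bptest(dist ~ speed, data = cars_df)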
Independence: Are the data from a random sample and not from a time series?
The Durbin-Watson test’s p-value is above .05, so we fail to reject the null hypothesis of no autocorrelation; the residuals appear to be independent.
durbinWatsonTest(cars_model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1604322 1.676225 0.168
## Alternative hypothesis: rho != 0
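For completeness (not part of the original analysis), the same Durbin-Watson statistic is available through lmtest’s dwtest():
# Durbin-Watson test via lmtest; the null hypothesis is no first-order autocorrelation.
library(lmtest)
dwtest(dist ~ speed, data = cars_df)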
Conclusion
Based on my diagnostics and the gvlma function, the conditions for linear regression have not been met: the global test, the skewness check, and the heteroscedasticity check all fail at the 0.05 level.
gvlma(cars_model)
##
## Call:
## lm(formula = dist ~ speed, data = cars_df)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = cars_model)
##
## Value p-value Decision
## Global Stat 15.801 0.003298 Assumptions NOT satisfied!
## Skewness 6.528 0.010621 Assumptions NOT satisfied!
## Kurtosis 1.661 0.197449 Assumptions acceptable.
## Link Function 2.329 0.126998 Assumptions acceptable.
## Heteroscedasticity 5.283 0.021530 Assumptions NOT satisfied!
plot(gvlma(cars_model))
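One natural follow-up suggested by the failed skewness and heteroscedasticity checks (a sketch only, not part of the original analysis) is to refit with a variance-stabilizing square-root transform of the response and rerun gvlma:
# Hypothetical remedial model: square-root transform of stopping distance,
# then re-check the regression assumptions with gvlma.
library(gvlma)
cars_model_sqrt <- lm(sqrt(dist) ~ speed, data = cars_df)
gvlma(cars_model_sqrt)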
References
Global Validation of Linear Models Assumptions
http://r-statistics.co/Assumptions-of-Linear-Regression.html
https://www.statmethods.net/stats/regression.html
http://ademos.people.uic.edu/Chapter12.html#121_example_2:_a_thanksgiving_not_to_remember,_part_2
http://www.statisticshowto.com/durbin-watson-test-coefficient/
http://www.ianruginski.com/regressionassumptionswithR_tutorial.html