## installed_and_loaded.packages.
## prettydoc TRUE
## dplyr TRUE
## ggplot2 TRUE
## datasets TRUE
## gvlma TRUE
## MASS TRUE
## lmtest TRUE
## car TRUE
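The chunk that produced the table above is not shown. A minimal sketch of one way it could have been generated (an assumed reconstruction, not the original code):
# A sketch only (the original loading chunk is not shown): attach each package
# and record whether it loaded, mirroring the TRUE column printed above.
pkgs <- c("prettydoc", "dplyr", "ggplot2", "datasets", "gvlma", "MASS", "lmtest", "car")
data.frame(installed_and_loaded.packages = sapply(pkgs, require, character.only = TRUE))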
Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Cars Data Set
cars {datasets}   R Documentation
Speed and Stopping Distances of Cars
Description: The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
Format: A data frame with 50 observations on 2 variables.
[,1]  speed  numeric  Speed (mph)
[,2]  dist   numeric  Stopping distance (ft)
Source: Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
cars_df <- cars
glimpse(cars_df)
## Observations: 50
## Variables: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13...
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28...
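As a quick range check before modeling (not part of the original output), summary() confirms that speed runs from 4 to 25 mph and dist from 2 to 120 ft:
# Quick range check: speed spans 4-25 mph, dist spans 2-120 ft,
# with no missing values in either column.
summary(cars_df)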
Question
Is a car’s speed predictive of its stopping distance?
Model Summary
cars_model <- lm(dist ~ speed, cars_df)
summary(cars_model)
##
## Call:
## lm(formula = dist ~ speed, data = cars_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
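As a complement to the p-values reported above (not part of the original write-up), confidence intervals for both coefficients can be read directly off the fitted model:
# 95% confidence intervals for the intercept and the speed coefficient.
confint(cars_model, level = 0.95)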
intercept <- coef(cars_model)[1]
slope <- coef(cars_model)[2]
a <- ggplot(cars_df, aes(speed, dist))
a + geom_point() + geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)
Model Interpretation
This linear model is expressed as \(\widehat{StoppingDistance} = -17.5791 + 3.9324 \times Speed\)
For each additional mile per hour of speed, the model predicts an increase of about 3.9 feet in stopping distance.
In this model, multiple \(R^2\) is 0.6511, which means that the least-squares line accounts for approximately \(65\%\) of the variation in stopping distance.
The p-value for speed is essentially zero and the p-value for the intercept is about 1%, so both coefficients are statistically significant at the 0.05 level.
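To make the interpretation concrete, a quick worked example (not part of the original analysis): at 20 mph the fitted line gives \(-17.5791 + 3.9324 \times 20 \approx 61\) feet, which predict() reproduces.
# Fitted stopping distance at 20 mph, returned with a 95% prediction interval.
predict(cars_model, newdata = data.frame(speed = 20), interval = "prediction")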
Model Diagnostics
Let’s assess if this linear model is reliable.
Linearity: Do the variables have a linear relationship?
The Component+Residual plot shows some deviation from a linear relationship.
crPlots(cars_model)
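As an additional check on linearity (a sketch, not part of the original analysis), the straight-line fit can be compared against a model that adds a quadratic speed term; a small p-value in the F-test would indicate meaningful curvature.
# Nested-model F-test: does adding a quadratic speed term improve the fit?
cars_model_quad <- lm(dist ~ speed + I(speed^2), data = cars_df)
anova(cars_model, cars_model_quad)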
Normality: Are the model’s residuals distributed nearly normally?
No, per the histogram and Q-Q plot, the residuals are not normally distributed; there are some outliers on the right side of the distribution.
sresid <- studres(cars_model)
hist(sresid, freq = FALSE, breaks = 10, main = "Distribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)
plot(cars_model, which = 2)
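A formal complement to the visual checks (not part of the original analysis) is the Shapiro-Wilk test on the residuals; a small p-value would agree with the skewness seen in the histogram.
# Shapiro-Wilk test of normality on the model residuals;
# a small p-value indicates non-normal residuals.
shapiro.test(residuals(cars_model))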
Homoscedasticity: Is there constant variability among the residuals?
Based on the residuals-vs-fitted plot, the spread of the residuals increases somewhat with the fitted values.
The Non-constant Variance Score Test has a p-value below .05, so we reject the null hypothesis of constant variance (homoscedasticity).
plot(cars_model, which = 1)
ncvTest(cars_model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 4.650233 Df = 1 p = 0.03104933
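As a cross-check on the score test (not part of the original analysis), the Breusch-Pagan test from the already-loaded lmtest package offers a second test of the constant-variance null:
# Breusch-Pagan test; the null hypothesis is constant error variance.
library(lmtest)
bptest(dist ~ speed, data = cars_df)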
Independence: Are the data from a random sample and not from a time series?
The Durbin-Watson test’s p-value is above .05, so we fail to reject the null hypothesis of no autocorrelation; the residuals appear to be independent.
durbinWatsonTest(cars_model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1604322 1.676225 0.168
## Alternative hypothesis: rho != 0
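For completeness (not part of the original analysis), the same Durbin-Watson statistic is available through lmtest’s dwtest():
# Durbin-Watson test via lmtest; the null hypothesis is no first-order autocorrelation.
library(lmtest)
dwtest(dist ~ speed, data = cars_df)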
Conclusion
Based on my diagnostics and the gvlma function, the conditions for linear regression have not been met: the global test, the skewness check, and the heteroscedasticity check all fail at the 0.05 level.
gvlma(cars_model)
##
## Call:
## lm(formula = dist ~ speed, data = cars_df)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = cars_model)
##
## Value p-value Decision
## Global Stat 15.801 0.003298 Assumptions NOT satisfied!
## Skewness 6.528 0.010621 Assumptions NOT satisfied!
## Kurtosis 1.661 0.197449 Assumptions acceptable.
## Link Function 2.329 0.126998 Assumptions acceptable.
## Heteroscedasticity 5.283 0.021530 Assumptions NOT satisfied!
plot(gvlma(cars_model))
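One natural follow-up suggested by the failed skewness and heteroscedasticity checks (a sketch only, not part of the original analysis) is to refit with a variance-stabilizing square-root transform of the response and rerun gvlma:
# Hypothetical remedial model: square-root transform of stopping distance,
# then re-check the regression assumptions with gvlma.
library(gvlma)
cars_model_sqrt <- lm(sqrt(dist) ~ speed, data = cars_df)
gvlma(cars_model_sqrt)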
References
Global Validation of Linear Models Assumptions
http://r-statistics.co/Assumptions-of-Linear-Regression.html
https://www.statmethods.net/stats/regression.html
http://ademos.people.uic.edu/Chapter12.html#121_example_2:_a_thanksgiving_not_to_remember,_part_2
http://www.statisticshowto.com/durbin-watson-test-coefficient/
http://www.ianruginski.com/regressionassumptionswithR_tutorial.html