library(tidyverse)
library(MASS)
library(fueleconomy)
library(lmtest)
library(car)
library(gvlma)
Question
Can hp in a vehicle be predictive of MPG?
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Model Summary
model <- lm(mtcars$mpg ~ mtcars$hp, data = mtcars)
summary(model)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## mtcars$hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
intercept <- coef(model)[1]
slope <- coef(model)[2]
ggplot(model, aes(mtcars$hp, mtcars$mpg)) +
geom_point() +
geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)
The simple linear model is expressed as:
\[\widehat{mpg} = 30.09886 + -0.06823{*hp}\] The p-values seem to indicate that the variables are siginificant for the model. The model shows a -0.06823 decrease in mpg per HP increase. The least-squares line accounts for \(R^2\) of \(0.6024\).
Model Validation
res <- residuals(model)
res <- as.data.frame(res)
ggplot(res, aes(res)) +
geom_histogram(fill = 'blue', alpha=0.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot(fitted(model),resid(model))
qqnorm(resid(model))
qqline(resid(model))
crPlots(model)
The Q-Q plot, histogram, and CR plots seem to show that the residuals are not normally distributed.
Outlier Analysis:
outlierTest(model)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## Maserati Bora 2.568387 0.015632 0.50021
cutoff <- 4/((nrow(mtcars)-length(model$coefficients)-2))
plot(model, which=4, cook.levels=cutoff)
gvlma(model)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp, data = mtcars)
##
## Coefficients:
## (Intercept) mtcars$hp
## 30.09886 -0.06823
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = model)
##
## Value p-value Decision
## Global Stat 18.845445 0.0008428 Assumptions NOT satisfied!
## Skewness 2.977924 0.0844075 Assumptions acceptable.
## Kurtosis 0.005681 0.9399169 Assumptions acceptable.
## Link Function 12.369182 0.0004365 Assumptions NOT satisfied!
## Heteroscedasticity 3.492659 0.0616415 Assumptions acceptable.
According to gvlma - conditions for this linear model have not been met.
Ref:
https://stackoverflow.com/questions/43252474/using-and-interpreting-output-from-gvlma https://conservancy.umn.edu/bitstream/handle/11299/189222/LinearRegression_fulltext.pdf https://ademos.people.uic.edu/Chapter12.html#42_testing_for_outliers:_outliertest()