We will use a simple regression model to walk through the ten assumptions of linear regression modelling. We select the best model by the adjusted coefficient of determination \(\mathcal{R_{adj}}^2\), using the built-in mtcars dataset to predict “mpg”.
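# Best-subset search over all candidate regressors, scored by adjusted R^2
models_adj2 <- leaps::leaps(x = mtcars[-1], y = mtcars$mpg, method = "adjr2")
# Row of the selection matrix with the highest adjusted R^2
index <- which.max(models_adj2$adjr2)
# Names of the regressors selected in that row
regressors <- names(mtcars[-1])[models_adj2$which[index, ]]
# Fit mpg on the selected regressors
model <- lm(reformulate(regressors, response = "mpg"), data = mtcars)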
The resulting formula is “mpg ~ disp + hp + wt + qsec + am”.
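For reference, the selection metric can be written out explicitly. The symbols \(n\) (number of observations) and \(p\) (number of regressors, excluding the intercept) are notation introduced here, not taken from the code above:

\[
\mathcal{R_{adj}}^2 = 1 - \left(1 - \mathcal{R}^2\right)\frac{n - 1}{n - p - 1}
\]

For the model above, \(n = 32\) and \(p = 5\), so the penalty factor is \(31/26\). Unlike plain \(\mathcal{R}^2\), this quantity can decrease when a regressor that adds little explanatory power is included.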
round(mean(model$residuals))
## [1] 0
plot(model,1)
The residuals show no clear pattern around zero, so we accept that this assumption is met.
In the next plot we observe that, apart from lag 0, where the autocorrelation is always 1, all other bars stay within the significance bounds.
acf(model$residuals)
We can also verify this with the Durbin-Watson test; a p-value above 0.05 means we do not reject the null hypothesis of no autocorrelation.
lmtest::dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 1.674, p-value = 0.08441
## alternative hypothesis: true autocorrelation is greater than 0
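To make the DW statistic less of a black box, here is a minimal sketch that computes it by hand as the sum of squared successive residual differences divided by the sum of squared residuals. It refits the model from the formula stated earlier so that it is self-contained; the variable names `e` and `dw` are chosen here for illustration.

```r
# Refit the model using the formula reported earlier in this post
model <- lm(mpg ~ disp + hp + wt + qsec + am, data = mtcars)

# Residuals in the row order of the data
e <- resid(model)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

The value should agree with the DW figure reported by `lmtest::dwtest`. Values near 2 suggest no first-order autocorrelation, while values towards 0 suggest positive autocorrelation.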
We test the correlation between the residuals and each of the non-factor regressors:
cor.test(mtcars$disp,resid(model))$p.value
## [1] 1
cor.test(mtcars$hp,resid(model))$p.value
## [1] 1
cor.test(mtcars$wt,resid(model))$p.value
## [1] 1
cor.test(mtcars$qsec,resid(model))$p.value
## [1] 1
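All p-values are 1, so the assumption is met here as well. This is expected: least-squares residuals are orthogonal to every regressor included in the model, so their sample correlation is zero by construction.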
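The p-values of exactly 1 can be checked directly. The sketch below, which assumes the same formula as above, multiplies the transposed design matrix by the residual vector; every entry should be zero up to floating-point error. The names `X` and `xte` are illustrative.

```r
# Refit the model using the formula reported earlier in this post
model <- lm(mpg ~ disp + hp + wt + qsec + am, data = mtcars)

# Design matrix, including the intercept column
X <- model.matrix(model)

# X'e: one entry per column of the design matrix
xte <- drop(crossprod(X, resid(model)))
signif(xte, 3)
```

The entry for the intercept column is the sum of the residuals, which is why their mean is also zero for any model fitted with an intercept.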
This holds: the dataset has 32 observations, far more than the 5 regressors in the model.
dim(mtcars)
## [1] 32 11
The variance inflation factor (VIF) of a regressor is \(\displaystyle \frac{1}{1-\mathcal{R}^2}\), where \(\mathcal{R}^2\) comes from regressing that regressor on all the other regressors in the model (mpg is not involved). The lower the VIF, the better.
car::vif(model)
## disp hp wt qsec am
## 9.071960 5.195144 7.170802 3.791426 2.887334
Only qsec and am fall below 5 (my threshold), and am is a binary variable; disp, hp and wt exceed the threshold. Nonetheless, this dataset is small, so we accept these values.
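As a sanity check on the definition, the VIFs can be reproduced without `car`. This is a minimal sketch under the assumption that the regressors are the five listed in the formula above; `vif_manual` is a name introduced here.

```r
# Regressors of the selected model, as listed in the formula above
regressors <- c("disp", "hp", "wt", "qsec", "am")

# For each regressor, regress it on the remaining ones and apply 1 / (1 - R^2)
vif_manual <- sapply(regressors, function(v) {
  others <- setdiff(regressors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 3)
```

For a model with only numeric terms, these values should match the `car::vif(model)` output above.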
We already discussed this in previous sections.
Let’s use the Shapiro-Wilk test for normality of the residuals.
shapiro.test(resid(model))
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.95389, p-value = 0.1858
With a p-value of 0.1858, above 0.05, we do not reject normality of the residuals, so the assumption is met, although the sample is small.
var(mtcars$disp)
## [1] 15360.8
var(mtcars$hp)
## [1] 4700.867
var(mtcars$wt)
## [1] 0.957379
var(mtcars$qsec)
## [1] 3.193166
We see that weight has a low variance, partly because wt is measured in units of 1000 lbs, but we will keep it since the dataset is small. We also excluded the factor variable am from this check.
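Since raw variances depend on each variable's units, a scale-free view can help when comparing them. The sketch below computes the same variances in one call and adds the coefficient of variation (standard deviation divided by the mean). The coefficient of variation is an addition made here for illustration, not part of the original check.

```r
# Numeric regressors checked above (am excluded, as in the text)
num_vars <- c("disp", "hp", "wt", "qsec")

# Variances, equivalent to the individual var() calls above
variances <- sapply(mtcars[num_vars], var)

# Coefficient of variation: spread relative to the mean, unit-free
cv <- sapply(mtcars[num_vars], sd) / sapply(mtcars[num_vars], mean)

round(variances, 3)
round(cv, 3)
```

All four variances are strictly positive, which is what the assumption of variability in the regressors requires.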