We are going to use a simple regression model to go through the ten assumptions of linear regression modelling. We will find the best model using the \(R^2_{adj}\) metric and the built-in mtcars dataset to predict “mpg”.

# Exhaustive subset search over the regressors, ranked by adjusted R^2
models_adj2 <- leaps::leaps(x = mtcars[-1], y = mtcars$mpg, method = "adjr2")
# Index of the subset with the highest adjusted R^2
index <- which.max(models_adj2$adjr2)
# Names of the regressors in the winning subset ($which holds logical rows)
regressors <- names(mtcars[-1])[models_adj2$which[index, ]]
# Fit the selected model
model <- lm(reformulate(regressors, response = "mpg"), data = mtcars)
  1. The regression model is linear in parameters.

The resulting formula is “mpg ~ disp + hp + wt + qsec + am”.
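Linearity in parameters does not rule out nonlinear transformations of the regressors. As a hypothetical side illustration (not part of the selected model), a quadratic term in wt still gives a model that is linear in its coefficients:

# Linear in the parameters b0, b1, b2 even though wt enters squared:
# mpg = b0 + b1*wt + b2*wt^2 + error
quad_model <- lm(mpg ~ wt + I(wt^2), data = mtcars)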

  1. The mean of the residuals is zero
round(mean(model$residuals))
## [1] 0
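The rounding above hides the actual magnitude; for OLS with an intercept the residuals sum to exactly zero by construction, so the unrounded mean sits at machine precision. A stricter check, as a sketch:

# TRUE within numerical tolerance; the intercept guarantees a zero mean
all.equal(mean(resid(model)), 0)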
  1. Equal variance of the residuals, i.e. no heteroscedasticity present.
plot(model,1)

The points appear randomly scattered around zero, so we accept that this assumption is met.
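As a complementary formal check beyond the plot, the Breusch-Pagan test from the same lmtest package used below tests the null hypothesis of homoscedasticity; a sketch:

# A large p-value supports constant variance of the residuals
lmtest::bptest(model)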

  1. No autocorrelation present in the residuals (lag dependence).

In the next plot we observe that beyond lag 0, whose autocorrelation is 1 by definition, all other bars fall below the significance threshold.

acf(model$residuals)

We can also verify this with the Durbin-Watson test.

lmtest::dwtest(model)
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.674, p-value = 0.08441
## alternative hypothesis: true autocorrelation is greater than 0
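Another option, using only base R, is the Ljung-Box test on the residuals, whose null hypothesis is no autocorrelation; a minimal sketch:

# Failing to reject the null means no evidence of lag-1 autocorrelation
Box.test(resid(model), lag = 1, type = "Ljung-Box")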
  1. The predictor variables and the residuals must be uncorrelated.

We will perform some tests with the non-factor regressors.

cor.test(mtcars$disp,resid(model))$p.value
## [1] 1
cor.test(mtcars$hp,resid(model))$p.value
## [1] 1
cor.test(mtcars$wt,resid(model))$p.value
## [1] 1
cor.test(mtcars$qsec,resid(model))$p.value
## [1] 1

All p-values equal 1, so this assumption is also met; in fact, OLS residuals are orthogonal to the included regressors by construction.
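The four calls above can also be collapsed into one pass; a minimal sketch using sapply:

# p-values of the correlation tests for each non-factor regressor
sapply(c("disp", "hp", "wt", "qsec"),
       function(v) cor.test(mtcars[[v]], resid(model))$p.value)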

  1. The number of observations in the data must be greater than the number of regressors (predictors).

This is TRUE: mtcars has 32 observations and only 10 candidate regressors.

dim(mtcars)
## [1] 32 11
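A direct sketch of the comparison, using the fitted model's coefficient count:

# 32 observations vs. 5 regressors (6 coefficients minus the intercept)
nrow(mtcars) > length(coef(model)) - 1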
  1. No multicollinearity.

The variance inflation factor (VIF) of a regressor is \(\displaystyle \frac{1}{1-R_j^2}\), where \(R_j^2\) is computed by regressing the \(j\)-th regressor on all the other regressors; the lower the VIF figure, the better.

car::vif(model)
##     disp       hp       wt     qsec       am 
## 9.071960 5.195144 7.170802 3.791426 2.887334

disp, hp and wt exceed my threshold of 5, while qsec and am fall below it, and am is a factor; but nonetheless this dataset is small, so we accept these values.
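To connect the output with the formula above, the VIF of disp can be reproduced by hand; a minimal sketch regressing disp on the other regressors of the model:

# Should match the 9.071960 reported by car::vif for disp
r2_disp <- summary(lm(disp ~ hp + wt + qsec + am, data = mtcars))$r.squared
1 / (1 - r2_disp)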

  1. The model must be correctly specified, i.e. if Y and X have an inverse relationship, the term 1/X must appear explicitly in the model.

We already discussed this in previous sections.
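For instance, if mpg were inversely related to disp, a hypothetical respecified model would state the reciprocal term explicitly:

# Hypothetical illustration only; not the model selected above
inv_model <- lm(mpg ~ I(1/disp), data = mtcars)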

  1. The residuals must be normally distributed around zero.

Let’s use a Shapiro–Wilk test for normality.

shapiro.test(resid(model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model)
## W = 0.95389, p-value = 0.1858

The normality assumption is met (p-value = 0.1858, so we fail to reject normality), although the sample is small.
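A visual complement to the Shapiro–Wilk test is the normal Q-Q plot of the standardized residuals, available from the same plot method used earlier:

# Points close to the diagonal support the normality assumption
plot(model, 2)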

  1. The variability of the regressors must be positive (greater than zero).
var(mtcars$disp)
## [1] 15360.8
var(mtcars$hp)
## [1] 4700.867
var(mtcars$wt)
## [1] 0.957379
var(mtcars$qsec)
## [1] 3.193166

We see that wt has relatively low variance, but all variances are strictly positive, so the assumption holds; we keep wt since the dataset is small. Note that we excluded the factor variable am from this check.
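The same check can be run in one pass over the numeric regressors; a minimal sketch:

# All TRUE: every numeric regressor has strictly positive variance
sapply(mtcars[c("disp", "hp", "wt", "qsec")], var) > 0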