Multicollinearity and AIC

Daniele Polidori

2020-05-28

Introduction

In the real world, data scientists often use regression to draw conclusions about a particular dataset, which makes it one of the most powerful tools in the field. Unfortunately, most real-world datasets are multivariate, and this can cause problems for regression: in an integrated system, a change in one part implies a change in the variables connected with that part, so the predictors are often correlated with one another.

In this introduction to multicollinearity and AIC, I will explain how to use stepwise regression to select predictors and how to use the AIC to compare models.

For this analysis I will use the airquality dataset, which is built into R. The dataset contains 153 observations of 6 variables.

Model building

First of all, we load the necessary libraries:

    library(tidyverse) 
    library(gridExtra)

Then we load the dataset:

    data(airquality)
    attach(airquality)
    dim(airquality)
#> [1] 153   6
    head(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

At this point, we create two models: model1, which uses Temp, Wind and Solar.R as predictors, and model2, which uses the same predictors together with all their interactions.

model1 <- lm(Ozone ~ Temp + Wind + Solar.R)
summary(model1) ## no interaction
#> 
#> Call:
#> lm(formula = Ozone ~ Temp + Wind + Solar.R)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -40.485 -14.219  -3.551  10.097  95.619 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
#> Temp          1.65209    0.25353   6.516 2.42e-09 ***
#> Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
#> Solar.R       0.05982    0.02319   2.580  0.01124 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 21.18 on 107 degrees of freedom
#>   (42 observations deleted due to missingness)
#> Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
#> F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16
model2 <- lm(Ozone ~ Temp * Wind * Solar.R)
summary(model2) 
#> 
#> Call:
#> lm(formula = Ozone ~ Temp * Wind * Solar.R)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -38.080 -11.192  -1.656   6.613  91.357 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)
#> (Intercept)       -7.139e+01  1.079e+02  -0.661    0.510
#> Temp               1.348e+00  1.476e+00   0.913    0.363
#> Wind               4.329e+00  8.888e+00   0.487    0.627
#> Solar.R           -6.647e-01  5.876e-01  -1.131    0.261
#> Temp:Wind         -7.262e-02  1.255e-01  -0.578    0.564
#> Temp:Solar.R       1.098e-02  7.790e-03   1.409    0.162
#> Wind:Solar.R       3.389e-02  5.184e-02   0.654    0.515
#> Temp:Wind:Solar.R -5.604e-04  7.005e-04  -0.800    0.426
#> 
#> Residual standard error: 19.2 on 103 degrees of freedom
#>   (42 observations deleted due to missingness)
#> Multiple R-squared:  0.6883, Adjusted R-squared:  0.6671 
#> F-statistic: 32.49 on 7 and 103 DF,  p-value: < 2.2e-16

As we can see from the p-values, all the predictors in model1 are significant. In model2, however, once the interactions among the variables are included, none of the predictors is significant. This is very likely caused by multicollinearity: the interaction terms are strongly correlated with the main effects, which inflates the standard errors of the coefficients. At the same time, accounting for the interactions (model2) raises the R^2 of the model from 0.6059 to 0.6883, meaning that the interactions do help explain additional variation in Ozone.
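
One way to check this suspicion (this check is an addition and was not part of the original analysis) is to look at how strongly the columns of the design matrix of model2 are correlated with each other:

    # Correlations among the columns of model2's design matrix (intercept dropped).
    # Values close to 1 indicate that the interaction terms are nearly linear
    # combinations of the main effects, i.e. multicollinearity.
    X <- model.matrix(model2)[, -1]
    round(cor(X), 2)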

Stepwise Model

At this point we can run a stepwise selection to choose, among the candidate predictors, those with the strongest relationship with the dependent variable.

model3 <- step(model2)
#> Start:  AIC=663.69
#> Ozone ~ Temp * Wind * Solar.R
#> 
#>                     Df Sum of Sq   RSS    AIC
#> - Temp:Wind:Solar.R  1    235.97 38205 662.37
#> <none>                           37969 663.69
#> 
#> Step:  AIC=662.37
#> Ozone ~ Temp + Wind + Solar.R + Temp:Wind + Temp:Solar.R + Wind:Solar.R
#> 
#>                Df Sum of Sq   RSS    AIC
#> - Wind:Solar.R  1    429.42 38635 661.61
#> <none>                      38205 662.37
#> - Temp:Solar.R  1   1574.75 39780 664.86
#> - Temp:Wind     1   2748.20 40954 668.08
#> 
#> Step:  AIC=661.61
#> Ozone ~ Temp + Wind + Solar.R + Temp:Wind + Temp:Solar.R
#> 
#>                Df Sum of Sq   RSS    AIC
#> <none>                      38635 661.61
#> - Temp:Solar.R  1    2141.1 40776 665.60
#> - Temp:Wind     1    4339.8 42975 671.43
summary(model3)
#> 
#> Call:
#> lm(formula = Ozone ~ Temp + Wind + Solar.R + Temp:Wind + Temp:Solar.R)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -38.398 -10.889  -2.445   7.132  93.485 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -1.368e+02  6.414e+01  -2.133 0.035252 *  
#> Temp          2.451e+00  8.250e-01   2.971 0.003678 ** 
#> Wind          1.115e+01  4.259e+00   2.617 0.010182 *  
#> Solar.R      -3.531e-01  1.750e-01  -2.018 0.046184 *  
#> Temp:Wind    -1.863e-01  5.425e-02  -3.434 0.000852 ***
#> Temp:Solar.R  5.717e-03  2.370e-03   2.412 0.017589 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 19.18 on 105 degrees of freedom
#>   (42 observations deleted due to missingness)
#> Multiple R-squared:  0.6828, Adjusted R-squared:  0.6677 
#> F-statistic: 45.21 on 5 and 105 DF,  p-value: < 2.2e-16

As we can see, after the stepwise selection drops the three-way interaction and Wind:Solar.R, all the remaining predictors and interactions are significant.
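
As an additional check (not in the original post), we can compare the reduced model with the full interaction model through a partial F-test, to confirm that the dropped terms do not explain a significant amount of extra variation:

    # Partial F-test: does the full interaction model (model2) fit significantly
    # better than the model selected by step() (model3)?
    anova(model3, model2)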

AIC

By definition, the Akaike information criterion (AIC) is an estimator of out-of-sample prediction error, and therefore of the relative quality of statistical models for a given dataset. We can use the AIC to compare models: the one with the lowest AIC is the best.
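
For a model fitted by maximum likelihood, AIC = 2k - 2*log(L), where k is the number of estimated parameters and L is the maximized likelihood. As a quick sanity check (this snippet is an addition to the original post), we can reproduce the value for model3 by hand:

    # AIC = 2k - 2*logLik, where k counts the estimated parameters
    # (the regression coefficients plus the residual variance).
    k <- length(coef(model3)) + 1
    2 * k - 2 * as.numeric(logLik(model3))   # should match AIC(model3)
    AIC(model3)

We can now compare the three models: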

AIC <- c(AIC(model1), AIC(model2), AIC(model3))
difference <- AIC - min(AIC)
difference
#> [1] 20.098924  2.071651  0.000000

As we can see, model3 (the stepwise model) has the lowest AIC and is therefore the best of the three.
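
As a complement (not part of the original analysis), the AIC differences can be converted into Akaike weights, which express the relative support for each model within the candidate set:

    # Akaike weights: proportional to exp(-0.5 * Delta_AIC), normalized to sum to 1.
    weights <- exp(-0.5 * difference) / sum(exp(-0.5 * difference))
    round(weights, 3)

With the differences above, almost all of the weight is shared between model3 and model2, while model1 receives essentially none.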

Conclusion

In conclusion, multicollinearity is a very common problem in data science, but there are several tools to deal with it. Stepwise regression allows us to find the predictors with the strongest relationship with the dependent variable, and the AIC helps us determine the best model, which here is also the model that handles the multicollinearity best.

References

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/AIC
https://www.r-bloggers.com/how-do-i-interpret-the-aic/
https://leanpub.com/regmods/read#leanpub-auto-adjustment