In the real world, as data scientists we often use regression to draw conclusions about a particular dataset; as a result, regression is a very powerful tool used by data scientists everywhere. Unfortunately, the vast majority of datasets are multivariate, and this can create problems when using regression: in an integrated system, a variation in one part implies a variation in the variables connected to that part.
In this introduction to multicollinearity and the AIC, I am going to explain how to use stepwise regression to select predictors and how to use the AIC to compare models.
For this analysis I will use the airquality dataset, which ships with base R. This dataset contains 153 observations of 6 variables.
First of all, note that this analysis only needs functions from base R (lm, step and AIC), so no additional libraries are required. We start by loading the dataset:
data(airquality)   # load the built-in dataset
attach(airquality) # attach it so the columns can be used directly in the model formulas below
dim(airquality)    # rows and columns
#> [1] 153 6
head(airquality)   # first six rows
#> Ozone Solar.R Wind Temp Month Day
#> 1 41 190 7.4 67 5 1
#> 2 36 118 8.0 72 5 2
#> 3 12 149 12.6 74 5 3
#> 4 18 313 11.5 62 5 4
#> 5 NA NA 14.3 56 5 5
#> 6 28 NA 14.9 66 5 6
At this point we are going to create two models: Model1, which considers the variables Temp, Wind and Solar.R, and Model2, which considers the same three variables plus the interactions among them.
model1 <- lm(Ozone ~ Temp + Wind + Solar.R)
summary(model1) ## no interaction
#>
#> Call:
#> lm(formula = Ozone ~ Temp + Wind + Solar.R)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -40.485 -14.219 -3.551 10.097 95.619
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -64.34208 23.05472 -2.791 0.00623 **
#> Temp 1.65209 0.25353 6.516 2.42e-09 ***
#> Wind -3.33359 0.65441 -5.094 1.52e-06 ***
#> Solar.R 0.05982 0.02319 2.580 0.01124 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 21.18 on 107 degrees of freedom
#> (42 observations deleted due to missingness)
#> Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
#> F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
model2 <- lm(Ozone ~ Temp * Wind * Solar.R)
summary(model2) ## with interactions
#>
#> Call:
#> lm(formula = Ozone ~ Temp * Wind * Solar.R)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -38.080 -11.192 -1.656 6.613 91.357
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -7.139e+01 1.079e+02 -0.661 0.510
#> Temp 1.348e+00 1.476e+00 0.913 0.363
#> Wind 4.329e+00 8.888e+00 0.487 0.627
#> Solar.R -6.647e-01 5.876e-01 -1.131 0.261
#> Temp:Wind -7.262e-02 1.255e-01 -0.578 0.564
#> Temp:Solar.R 1.098e-02 7.790e-03 1.409 0.162
#> Wind:Solar.R 3.389e-02 5.184e-02 0.654 0.515
#> Temp:Wind:Solar.R -5.604e-04 7.005e-04 -0.800 0.426
#>
#> Residual standard error: 19.2 on 103 degrees of freedom
#> (42 observations deleted due to missingness)
#> Multiple R-squared: 0.6883, Adjusted R-squared: 0.6671
#> F-statistic: 32.49 on 7 and 103 DF, p-value: < 2.2e-16
As we can see from the p-values, in the first model all the predictors are significant, whereas in Model2, once the interactions among the variables are included, none of the predictors is significant. This phenomenon is most likely caused by multicollinearity: the interaction terms are built from the main effects and are therefore strongly correlated with them. At the same time, accounting for the interactions (Model2) raises the R^2 from 0.6059 to 0.6883, which is encouraging: including the interactions increases the share of variation explained by the model.
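To make the multicollinearity concrete, a quick check (not part of the original analysis; the aq data frame and product columns below are purely illustrative) is to look at the correlations between the main effects and the raw interaction terms built from them:
aq <- na.omit(airquality[, c("Temp", "Wind", "Solar.R")]) # complete cases for the three predictors
aq$Temp_Wind <- aq$Temp * aq$Wind     # product term corresponding to the Temp:Wind interaction
aq$Temp_Solar <- aq$Temp * aq$Solar.R # likewise for Temp:Solar.R
round(cor(aq), 2) # raw product terms are typically highly correlated with the variables they are built from
Centering the predictors before forming the products is a common way to reduce this kind of structural collinearity.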
At this point we can apply stepwise selection to Model2 in order to keep only the independent variables and interactions with the strongest relationship with the dependent variable.
model3 <- step(model2)
#> Start: AIC=663.69
#> Ozone ~ Temp * Wind * Solar.R
#>
#> Df Sum of Sq RSS AIC
#> - Temp:Wind:Solar.R 1 235.97 38205 662.37
#> <none> 37969 663.69
#>
#> Step: AIC=662.37
#> Ozone ~ Temp + Wind + Solar.R + Temp:Wind + Temp:Solar.R + Wind:Solar.R
#>
#> Df Sum of Sq RSS AIC
#> - Wind:Solar.R 1 429.42 38635 661.61
#> <none> 38205 662.37
#> - Temp:Solar.R 1 1574.75 39780 664.86
#> - Temp:Wind 1 2748.20 40954 668.08
#>
#> Step: AIC=661.61
#> Ozone ~ Temp + Wind + Solar.R + Temp:Wind + Temp:Solar.R
#>
#> Df Sum of Sq RSS AIC
#> <none> 38635 661.61
#> - Temp:Solar.R 1 2141.1 40776 665.60
#> - Temp:Wind 1 4339.8 42975 671.43
summary(model3)
#>
#> Call:
#> lm(formula = Ozone ~ Temp + Wind + Solar.R + Temp:Wind + Temp:Solar.R)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -38.398 -10.889 -2.445 7.132 93.485
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.368e+02 6.414e+01 -2.133 0.035252 *
#> Temp 2.451e+00 8.250e-01 2.971 0.003678 **
#> Wind 1.115e+01 4.259e+00 2.617 0.010182 *
#> Solar.R -3.531e-01 1.750e-01 -2.018 0.046184 *
#> Temp:Wind -1.863e-01 5.425e-02 -3.434 0.000852 ***
#> Temp:Solar.R 5.717e-03 2.370e-03 2.412 0.017589 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 19.18 on 105 degrees of freedom
#> (42 observations deleted due to missingness)
#> Multiple R-squared: 0.6828, Adjusted R-squared: 0.6677
#> F-statistic: 45.21 on 5 and 105 DF, p-value: < 2.2e-16
As we can see, the model selected by the stepwise procedure has significant interactions and significant predictors.
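As an extra check (not in the original analysis, just a sketch), Model1 is nested in the stepwise model and both are fitted to the same 111 complete cases, so a partial F-test can confirm that the two retained interaction terms add explanatory power:
anova(model1, model3) # partial F-test: main effects only vs. main effects plus the two retained interactions
A small p-value here would be consistent with the significant interaction coefficients reported by summary(model3).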
By definition, the Akaike information criterion (AIC) is an estimator of out-of-sample prediction error and therefore of the relative quality of statistical models for a given set of data. As a result, we can use the AIC to compare models: the one with the lowest AIC is the best.
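Concretely, AIC = 2k - 2*log(L), where k is the number of estimated parameters and L the maximized likelihood. As a minimal sketch (not in the original analysis), we can reproduce R's AIC() by hand for Model1 via logLik():
ll <- logLik(model1)       # maximized log-likelihood of the Gaussian fit
k <- attr(ll, "df")        # number of estimated parameters (coefficients plus the residual variance)
2 * k - 2 * as.numeric(ll) # manual AIC; should coincide with AIC(model1)
With the definition in hand, we compute the AIC of the three models and look at their differences from the smallest value: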
AIC <- c(AIC(model1), AIC(model2), AIC(model3))
difference <- AIC - min(AIC)
difference
#> [1] 20.098924 2.071651 0.000000
As we can see, the third model has the lowest AIC and is therefore the best of the three. As a rule of thumb, models within about 2 AIC units of the best one have comparable support, so Model2 is almost as good, while Model1, roughly 20 units away, is clearly worse.
In the end, we can say that multicollinearity is a very common problem in data science, but there are several tools to deal with it. Stepwise regression is a technique that allows us to find the predictors with the strongest relationship with the dependent variable, and the AIC helps us determine the best model and therefore the best way of dealing with multicollinearity.