Correlations: airquality and swiss data
cor(airquality[, 1:4], use = "complete")
## Ozone Solar.R Wind Temp
## Ozone 1.0000000 0.3483417 -0.6124966 0.6985414
## Solar.R 0.3483417 1.0000000 -0.1271835 0.2940876
## Wind -0.6124966 -0.1271835 1.0000000 -0.4971897
## Temp 0.6985414 0.2940876 -0.4971897 1.0000000
cor(swiss, use = "complete")
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
ANOVA, or how to compare nested models (Analysis of Variance table)
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- lm(Fertility ~ Agriculture + Examination + Education, data = swiss)
fit5 <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data = swiss)
anova(fit1, fit3, fit5)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture
## Model 2: Fertility ~ Agriculture + Examination + Education
## Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 6283.1
## 2 43 3180.9 2 3102.2 30.211 8.638e-09 ***
## 3 41 2105.0 2 1075.9 10.477 0.0002111 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RSS: residual sum of squares. Df: degrees of freedom, i.e. the number of regression coefficients added relative to the previous model.
Result: including all five variables is warranted; each nested addition significantly reduces the residual sum of squares.
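To see where the F values in the table come from, here is a small sketch (reusing fit1, fit3 and fit5 from above) that reproduces the Model 1 vs. Model 2 test by hand; note that anova() uses the residual mean square of the largest model as the denominator.
# Reproduce the Model 1 -> Model 2 F test from the table by hand
rss1 <- deviance(fit1)   # 6283.1
rss3 <- deviance(fit3)   # 3180.9
rss5 <- deviance(fit5)   # 2105.0
f_12 <- ((rss1 - rss3) / 2) / (rss5 / df.residual(fit5))
f_12                                                 # ~30.21, matching the table
pf(f_12, 2, df.residual(fit5), lower.tail = FALSE)   # ~8.6e-09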
“All models are wrong, some models are useful”
The higher the adjusted R-squared of a model, the better the model.
The adjusted R-squared uses variances (mean squares) instead of raw sums of squares: adjusted R2 = 1 - [RSS/(n - p - 1)] / [TSS/(n - 1)]. It therefore takes into account the sample size n and the number of predictor variables p. Unlike R2, the adjusted R2 can actually increase with fewer variables or a smaller sample. When comparing models with different sample sizes or different numbers of predictors, always look at the adjusted R2, not R2. If two models are tied on adjusted R2, take the one with fewer variables, since it is the simpler model.
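As a quick illustration (a sketch reusing the nested fits from the ANOVA section above), both quantities can be pulled from summary() and compared side by side:
# Compare R-squared and adjusted R-squared for the nested fits above
sapply(list(fit1 = fit1, fit3 = fit3, fit5 = fit5),
       function(m) c(R2 = summary(m)$r.squared,
                     adj.R2 = summary(m)$adj.r.squared))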
The VIF is the factor by which the variance of the i-th regression coefficient is inflated compared to the ideal setting where that regressor is orthogonal to the other regressors (the square root of the VIF is the corresponding increase in the standard error).
library(car)
fit <- lm(Fertility ~ ., data = swiss)
# vif(fit)         # variance inflation factors
sqrt(vif(fit))     # inflation of the standard errors
## Agriculture Examination Education Catholic
## 1.511334 1.917138 1.665816 1.391819
## Infant.Mortality
## 1.052398
Education and Examination have high VIFs and are correlated with each other; Infant.Mortality has a low VIF and is likely unrelated to the other variables.
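As a sanity check (a sketch using the same fit), VIF_j equals 1 / (1 - R2_j), where R2_j is the R-squared from regressing the j-th predictor on the remaining predictors:
# Manual VIF for Education: regress it on the other predictors
r2_edu <- summary(lm(Education ~ Agriculture + Examination + Catholic +
                       Infant.Mortality, data = swiss))$r.squared
1 / (1 - r2_edu)   # should equal vif(fit)["Education"]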