En muchas situaciones se dispone de un conjunto grande de posibles variables explicativas, una posible pregunta sería saber que variables deben entrar y cuáles no. Los métodos de selección de variables se encargan de abordar el problema de construcción o selección del modelo. En general, si se incluyen cada vez más variables en un modelo de regresión, el ajuste a los datos mejora, aumenta la cantidad de parámetros a estimar, pero disminuye su precisión individual (mayor varianza) y por tanto la de la función de regresión estimada, se produce un sobreajuste. Por el contrario, si se incluyen menos variables de las necesarias en el modelo, las varianzas se reducen, pero los sesgos aumentaran obteniéndose una mala descripción de los datos. Por otra parte, algunas variables predictoras pueden perjudicar la confiabilidad del modelo, especialmente si están correlacionadas con otras. De esta manera, el objetivo de los métodos de selección de variables es buscar un modelo que se ajuste bien a los datos y que a la vez sea posible buscar un equilibrio entre bondad de ajuste y sencillez
rm(list=ls(all=TRUE))
setwd("G:/UAAAN/MATERIAS/2019/REGRESION APLICADA/LECCION 10")
library(readxl)
## Warning: package 'readxl' was built under R version 3.5.3
Datos <- read_excel("ESTACIONES-1.xlsx")
attach(Datos)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.3
## corrplot 0.84 loaded
head( Datos )
## # A tibble: 6 x 12
## Bio1 Bio12 Pprimavera Pverano Potoño Pinvierno Tprimavera Tverano Totoño
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 183 634 135 226 216 68 196 248. 188.
## 2 231 260 26 131 79 24 253. 283. 225
## 3 220 258 45 122 61 30 237. 273 214
## 4 213 341 32 189 95 25 233. 264. 207
## 5 231 187 30 87 57 23 248 280 224
## 6 222 263 37 115 75 36 242. 273. 217
## # ... with 3 more variables: Tinvierno <dbl>, Altitud <dbl>, Rend <dbl>
Corr<-cor(Datos, method = "pearson") # Hacer la correlacion
round(Corr, digits = 3) # Redondear a tres digitos
## Bio1 Bio12 Pprimavera Pverano Potoño Pinvierno Tprimavera
## Bio1 1.000 -0.793 -0.495 -0.753 -0.769 -0.388 0.987
## Bio12 -0.793 1.000 0.587 0.958 0.957 0.491 -0.777
## Pprimavera -0.495 0.587 1.000 0.353 0.736 0.389 -0.487
## Pverano -0.753 0.958 0.353 1.000 0.847 0.350 -0.723
## Potoño -0.769 0.957 0.736 0.847 1.000 0.491 -0.752
## Pinvierno -0.388 0.491 0.389 0.350 0.491 1.000 -0.503
## Tprimavera 0.987 -0.777 -0.487 -0.723 -0.752 -0.503 1.000
## Tverano 0.928 -0.817 -0.287 -0.867 -0.711 -0.280 0.899
## Totoño 0.982 -0.775 -0.451 -0.768 -0.731 -0.234 0.945
## Tinvierno 0.784 -0.442 -0.626 -0.290 -0.550 -0.313 0.785
## Altitud -0.832 0.624 0.073 0.731 0.513 -0.024 -0.769
## Rend -0.394 0.535 0.320 0.519 0.469 0.332 -0.383
## Tverano Totoño Tinvierno Altitud Rend
## Bio1 0.928 0.982 0.784 -0.832 -0.394
## Bio12 -0.817 -0.775 -0.442 0.624 0.535
## Pprimavera -0.287 -0.451 -0.626 0.073 0.320
## Pverano -0.867 -0.768 -0.290 0.731 0.519
## Potoño -0.711 -0.731 -0.550 0.513 0.469
## Pinvierno -0.280 -0.234 -0.313 -0.024 0.332
## Tprimavera 0.899 0.945 0.785 -0.769 -0.383
## Tverano 1.000 0.946 0.507 -0.913 -0.378
## Totoño 0.946 1.000 0.741 -0.897 -0.404
## Tinvierno 0.507 0.741 1.000 -0.488 -0.316
## Altitud -0.913 -0.897 -0.488 1.000 0.313
## Rend -0.378 -0.404 -0.316 0.313 1.000
write.csv(Corr, file = "cuadro_de_correlacion.csv") # Genera un excel
corrplot(Corr,tl.col = "black")
corrplot(Corr,method = "ellipse", tl.col = "black")
corrplot.mixed(Corr, tl.col = "black")
corrplot.mixed(Corr, lower.col = "black", number.cex = 1.1, tl.col = "black")
corrplot(Corr, order = "FPC")
corrplot(Corr, order = "hclust", addrect = 2, tl.col = "black")
Los datos se colectaron de diferentes fuentes como artículos científicos y tesis del todo el país, y se pretende generar un modelo para estimar Rendimiento (Rend) de aceite de orégano (Lippia graveolens), a partir de datos climáticos (precipitación y temperatura) y altitud, los cuales se obtuvieron de cada sitio
regvacia<-lm(formula = Rend~1,Datos); summary(regvacia)
##
## Call:
## lm(formula = Rend ~ 1, data = Datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.980 -0.620 -0.380 0.695 1.320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9800 0.1808 16.48 2.65e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7883 on 18 degrees of freedom
regcompleta<-lm(formula = Rend~.,Datos); summary(regcompleta)
##
## Call:
## lm(formula = Rend ~ ., data = Datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79148 -0.27745 -0.02194 0.27453 0.87607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.740130 22.708407 0.737 0.485
## Bio1 0.339717 0.555228 0.612 0.560
## Bio12 0.067321 0.152001 0.443 0.671
## Pprimavera -0.050680 0.137724 -0.368 0.724
## Pverano -0.051371 0.154881 -0.332 0.750
## Potoño -0.091134 0.148324 -0.614 0.558
## Pinvierno -0.079826 0.215226 -0.371 0.722
## Tprimavera 0.179072 0.151477 1.182 0.276
## Tverano -0.469772 0.509104 -0.923 0.387
## Totoño 0.439659 0.595052 0.739 0.484
## Tinvierno -0.600166 0.553185 -1.085 0.314
## Altitud -0.004398 0.004727 -0.931 0.383
##
## Residual standard error: 0.725 on 7 degrees of freedom
## Multiple R-squared: 0.671, Adjusted R-squared: 0.1541
## F-statistic: 1.298 on 11 and 7 DF, p-value: 0.3762
regforw<-step(regvacia,
scope = list(lower=regvacia, upper=regcompleta),direction = "forward")
## Start: AIC=-8.07
## Rend ~ 1
##
## Df Sum of Sq RSS AIC
## + Bio12 1 3.1979 7.9867 -12.4666
## + Pverano 1 3.0114 8.1732 -12.0280
## + Potoño 1 2.4594 8.7252 -10.7863
## + Totoño 1 1.8239 9.3607 -9.4504
## + Bio1 1 1.7369 9.4477 -9.2748
## + Tprimavera 1 1.6438 9.5408 -9.0885
## + Tverano 1 1.5949 9.5897 -8.9911
## + Pinvierno 1 1.2336 9.9510 -8.2886
## + Pprimavera 1 1.1450 10.0396 -8.1202
## <none> 11.1846 -8.0681
## + Tinvierno 1 1.1155 10.0691 -8.0644
## + Altitud 1 1.0936 10.0910 -8.0231
##
## Step: AIC=-12.47
## Rend ~ Bio12
##
## Df Sum of Sq RSS AIC
## <none> 7.9867 -12.467
## + Potoño 1 0.242742 7.7439 -11.053
## + Tverano 1 0.118836 7.8678 -10.752
## + Tinvierno 1 0.088185 7.8985 -10.678
## + Pinvierno 1 0.071127 7.9155 -10.637
## + Tprimavera 1 0.029123 7.9575 -10.536
## + Bio1 1 0.027406 7.9593 -10.532
## + Altitud 1 0.008109 7.9786 -10.486
## + Pverano 1 0.006372 7.9803 -10.482
## + Totoño 1 0.003244 7.9834 -10.474
## + Pprimavera 1 0.000633 7.9860 -10.468
summary(regforw)
##
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8256 -0.3766 -0.1594 0.4509 1.3556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1154455 0.3667890 5.767 2.28e-05 ***
## Bio12 0.0021987 0.0008427 2.609 0.0183 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6854 on 17 degrees of freedom
## Multiple R-squared: 0.2859, Adjusted R-squared: 0.2439
## F-statistic: 6.807 on 1 and 17 DF, p-value: 0.01833
Los resutados indican que de todas las variables la unica seleccionada por el procedimiento stepwise es Bio12, en este caso es precipitacion anual (mm); es decir el rendimiento de aceite de oregano puede ser estimado a patir de Bio12
regstep<-step(regvacia,
scope = list(lower=regvacia, upper=regcompleta),direction = "both")
## Start: AIC=-8.07
## Rend ~ 1
##
## Df Sum of Sq RSS AIC
## + Bio12 1 3.1979 7.9867 -12.4666
## + Pverano 1 3.0114 8.1732 -12.0280
## + Potoño 1 2.4594 8.7252 -10.7863
## + Totoño 1 1.8239 9.3607 -9.4504
## + Bio1 1 1.7369 9.4477 -9.2748
## + Tprimavera 1 1.6438 9.5408 -9.0885
## + Tverano 1 1.5949 9.5897 -8.9911
## + Pinvierno 1 1.2336 9.9510 -8.2886
## + Pprimavera 1 1.1450 10.0396 -8.1202
## <none> 11.1846 -8.0681
## + Tinvierno 1 1.1155 10.0691 -8.0644
## + Altitud 1 1.0936 10.0910 -8.0231
##
## Step: AIC=-12.47
## Rend ~ Bio12
##
## Df Sum of Sq RSS AIC
## <none> 7.9867 -12.4666
## + Potoño 1 0.2427 7.7439 -11.0531
## + Tverano 1 0.1188 7.8678 -10.7515
## + Tinvierno 1 0.0882 7.8985 -10.6776
## + Pinvierno 1 0.0711 7.9155 -10.6366
## + Tprimavera 1 0.0291 7.9575 -10.5360
## + Bio1 1 0.0274 7.9593 -10.5319
## + Altitud 1 0.0081 7.9786 -10.4859
## + Pverano 1 0.0064 7.9803 -10.4818
## + Totoño 1 0.0032 7.9834 -10.4744
## + Pprimavera 1 0.0006 7.9860 -10.4681
## - Bio12 1 3.1979 11.1846 -8.0681
summary(regstep)
##
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8256 -0.3766 -0.1594 0.4509 1.3556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1154455 0.3667890 5.767 2.28e-05 ***
## Bio12 0.0021987 0.0008427 2.609 0.0183 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6854 on 17 degrees of freedom
## Multiple R-squared: 0.2859, Adjusted R-squared: 0.2439
## F-statistic: 6.807 on 1 and 17 DF, p-value: 0.01833
Como se se puede constatar, ambos procedimientos generan el mismo modelo. No obstante, a este modelo es necesario comprobar, la normalidad, homosedasticidad y e independencia. Estos apartados ya se trataron anteriormente.
plot(Bio12, Rend, pch=1, lwd = 5)
fit<-lm(Rend~Bio12, data = Datos)
abline(fit, col="red")
summary(fit)
##
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8256 -0.3766 -0.1594 0.4509 1.3556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1154455 0.3667890 5.767 2.28e-05 ***
## Bio12 0.0021987 0.0008427 2.609 0.0183 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6854 on 17 degrees of freedom
## Multiple R-squared: 0.2859, Adjusted R-squared: 0.2439
## F-statistic: 6.807 on 1 and 17 DF, p-value: 0.01833
Ahora gneraremos un modelo mediante Boostrap, usando 100 iteraciones. Puede cambiar el valor (B = 100) para generarlo con 1000 iteraciones
library(bootStepAIC)
## Warning: package 'bootStepAIC' was built under R version 3.5.3
## Loading required package: MASS
library(MASS)
lmFit <-lm(formula = Rend~.,Datos)
boot.stepAIC(regstep, Datos)
##
## Summary of Bootstrapping the 'stepAIC()' procedure for
##
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
##
## Bootstrap samples: 100
## Direction: backward
## Penalty: 2 * df
##
## Covariates selected
## (%)
## Bio12 98
## Null 2
##
## Coefficients Sign
## + (%) - (%)
## Bio12 100 0
##
## Stat Significance
## (%)
## Bio12 75.51
##
##
## The stepAIC() for the original data-set gave
##
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
##
## Coefficients:
## (Intercept) Bio12
## 2.115446 0.002199
##
##
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Rend ~ Bio12
##
## Final Model:
## Rend ~ Bio12
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 17 7.98667 -12.46664
boot.stepAIC(regstep, Datos, B = 100, alpha = 0.05, direction = "forward",
k = 2, verbose = T)
##
## 1 replicate finished
## 2 replicate finished
## 3 replicate finished
## 4 replicate finished
## 5 replicate finished
## 6 replicate finished
## 7 replicate finished
## 8 replicate finished
## 9 replicate finished
## 10 replicate finished
## 11 replicate finished
## 12 replicate finished
## 13 replicate finished
## 14 replicate finished
## 15 replicate finished
## 16 replicate finished
## 17 replicate finished
## 18 replicate finished
## 19 replicate finished
## 20 replicate finished
## 21 replicate finished
## 22 replicate finished
## 23 replicate finished
## 24 replicate finished
## 25 replicate finished
## 26 replicate finished
## 27 replicate finished
## 28 replicate finished
## 29 replicate finished
## 30 replicate finished
## 31 replicate finished
## 32 replicate finished
## 33 replicate finished
## 34 replicate finished
## 35 replicate finished
## 36 replicate finished
## 37 replicate finished
## 38 replicate finished
## 39 replicate finished
## 40 replicate finished
## 41 replicate finished
## 42 replicate finished
## 43 replicate finished
## 44 replicate finished
## 45 replicate finished
## 46 replicate finished
## 47 replicate finished
## 48 replicate finished
## 49 replicate finished
## 50 replicate finished
## 51 replicate finished
## 52 replicate finished
## 53 replicate finished
## 54 replicate finished
## 55 replicate finished
## 56 replicate finished
## 57 replicate finished
## 58 replicate finished
## 59 replicate finished
## 60 replicate finished
## 61 replicate finished
## 62 replicate finished
## 63 replicate finished
## 64 replicate finished
## 65 replicate finished
## 66 replicate finished
## 67 replicate finished
## 68 replicate finished
## 69 replicate finished
## 70 replicate finished
## 71 replicate finished
## 72 replicate finished
## 73 replicate finished
## 74 replicate finished
## 75 replicate finished
## 76 replicate finished
## 77 replicate finished
## 78 replicate finished
## 79 replicate finished
## 80 replicate finished
## 81 replicate finished
## 82 replicate finished
## 83 replicate finished
## 84 replicate finished
## 85 replicate finished
## 86 replicate finished
## 87 replicate finished
## 88 replicate finished
## 89 replicate finished
## 90 replicate finished
## 91 replicate finished
## 92 replicate finished
## 93 replicate finished
## 94 replicate finished
## 95 replicate finished
## 96 replicate finished
## 97 replicate finished
## 98 replicate finished
## 99 replicate finished
## 100 replicate finished
##
## Summary of Bootstrapping the 'stepAIC()' procedure for
##
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
##
## Bootstrap samples: 100
## Direction: forward
## Penalty: 2 * df
##
## Covariates selected
## (%)
## Bio12 100
##
## Coefficients Sign
## + (%) - (%)
## Bio12 100 0
##
## Stat Significance
## (%)
## Bio12 75
##
##
## The stepAIC() for the original data-set gave
##
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
##
## Coefficients:
## (Intercept) Bio12
## 2.115446 0.002199
##
##
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Rend ~ Bio12
##
## Final Model:
## Rend ~ Bio12
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 17 7.98667 -12.46664
Como se puede ver, la variable seleccionada con este método es la misma que la seleccionada con los procedimientos anteriores