Introducción

En muchas situaciones se dispone de un conjunto grande de posibles variables explicativas, una posible pregunta sería saber que variables deben entrar y cuáles no. Los métodos de selección de variables se encargan de abordar el problema de construcción o selección del modelo. En general, si se incluyen cada vez más variables en un modelo de regresión, el ajuste a los datos mejora, aumenta la cantidad de parámetros a estimar, pero disminuye su precisión individual (mayor varianza) y por tanto la de la función de regresión estimada, se produce un sobreajuste. Por el contrario, si se incluyen menos variables de las necesarias en el modelo, las varianzas se reducen, pero los sesgos aumentaran obteniéndose una mala descripción de los datos. Por otra parte, algunas variables predictoras pueden perjudicar la confiabilidad del modelo, especialmente si están correlacionadas con otras. De esta manera, el objetivo de los métodos de selección de variables es buscar un modelo que se ajuste bien a los datos y que a la vez sea posible buscar un equilibrio entre bondad de ajuste y sencillez

Cargando las librerías y los datos

rm(list=ls(all=TRUE))

setwd("G:/UAAAN/MATERIAS/2019/REGRESION APLICADA/LECCION 10")
library(readxl)
## Warning: package 'readxl' was built under R version 3.5.3
Datos <- read_excel("ESTACIONES-1.xlsx")
attach(Datos)

library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.3
## corrplot 0.84 loaded
head( Datos )
## # A tibble: 6 x 12
##    Bio1 Bio12 Pprimavera Pverano Potoño Pinvierno Tprimavera Tverano Totoño
##   <dbl> <dbl>      <dbl>   <dbl>  <dbl>     <dbl>      <dbl>   <dbl>  <dbl>
## 1   183   634        135     226    216        68       196     248.   188.
## 2   231   260         26     131     79        24       253.    283.   225 
## 3   220   258         45     122     61        30       237.    273    214 
## 4   213   341         32     189     95        25       233.    264.   207 
## 5   231   187         30      87     57        23       248     280    224 
## 6   222   263         37     115     75        36       242.    273.   217 
## # ... with 3 more variables: Tinvierno <dbl>, Altitud <dbl>, Rend <dbl>

Correlación en forma de tablas

Corr<-cor(Datos, method = "pearson")                  # Hacer la correlacion
round(Corr, digits = 3)                               # Redondear a tres digitos
##              Bio1  Bio12 Pprimavera Pverano Potoño Pinvierno Tprimavera
## Bio1        1.000 -0.793     -0.495  -0.753 -0.769    -0.388      0.987
## Bio12      -0.793  1.000      0.587   0.958  0.957     0.491     -0.777
## Pprimavera -0.495  0.587      1.000   0.353  0.736     0.389     -0.487
## Pverano    -0.753  0.958      0.353   1.000  0.847     0.350     -0.723
## Potoño     -0.769  0.957      0.736   0.847  1.000     0.491     -0.752
## Pinvierno  -0.388  0.491      0.389   0.350  0.491     1.000     -0.503
## Tprimavera  0.987 -0.777     -0.487  -0.723 -0.752    -0.503      1.000
## Tverano     0.928 -0.817     -0.287  -0.867 -0.711    -0.280      0.899
## Totoño      0.982 -0.775     -0.451  -0.768 -0.731    -0.234      0.945
## Tinvierno   0.784 -0.442     -0.626  -0.290 -0.550    -0.313      0.785
## Altitud    -0.832  0.624      0.073   0.731  0.513    -0.024     -0.769
## Rend       -0.394  0.535      0.320   0.519  0.469     0.332     -0.383
##            Tverano Totoño Tinvierno Altitud   Rend
## Bio1         0.928  0.982     0.784  -0.832 -0.394
## Bio12       -0.817 -0.775    -0.442   0.624  0.535
## Pprimavera  -0.287 -0.451    -0.626   0.073  0.320
## Pverano     -0.867 -0.768    -0.290   0.731  0.519
## Potoño      -0.711 -0.731    -0.550   0.513  0.469
## Pinvierno   -0.280 -0.234    -0.313  -0.024  0.332
## Tprimavera   0.899  0.945     0.785  -0.769 -0.383
## Tverano      1.000  0.946     0.507  -0.913 -0.378
## Totoño       0.946  1.000     0.741  -0.897 -0.404
## Tinvierno    0.507  0.741     1.000  -0.488 -0.316
## Altitud     -0.913 -0.897    -0.488   1.000  0.313
## Rend        -0.378 -0.404    -0.316   0.313  1.000
write.csv(Corr, file = "cuadro_de_correlacion.csv")   # Genera un excel 

Correlación en forma gráfica

corrplot(Corr,tl.col = "black")

corrplot(Corr,method =  "ellipse", tl.col = "black")

corrplot.mixed(Corr, tl.col = "black")

corrplot.mixed(Corr, lower.col = "black", number.cex = 1.1, tl.col = "black")

corrplot(Corr, order = "FPC")

corrplot(Corr, order = "hclust", addrect = 2, tl.col = "black")

Los datos se colectaron de diferentes fuentes como artículos científicos y tesis del todo el país, y se pretende generar un modelo para estimar Rendimiento (Rend) de aceite de orégano (Lippia graveolens), a partir de datos climáticos (precipitación y temperatura) y altitud, los cuales se obtuvieron de cada sitio

Stepwise: Paso 1

Hacer un modelo sin variables independientes (vacía)

regvacia<-lm(formula = Rend~1,Datos); summary(regvacia)
## 
## Call:
## lm(formula = Rend ~ 1, data = Datos)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.980 -0.620 -0.380  0.695  1.320 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9800     0.1808   16.48 2.65e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7883 on 18 degrees of freedom

Stepwise: Paso 2

Hacer un modelo con todas variables independientes (completa)

regcompleta<-lm(formula = Rend~.,Datos); summary(regcompleta)
## 
## Call:
## lm(formula = Rend ~ ., data = Datos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79148 -0.27745 -0.02194  0.27453  0.87607 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.740130  22.708407   0.737    0.485
## Bio1         0.339717   0.555228   0.612    0.560
## Bio12        0.067321   0.152001   0.443    0.671
## Pprimavera  -0.050680   0.137724  -0.368    0.724
## Pverano     -0.051371   0.154881  -0.332    0.750
## Potoño      -0.091134   0.148324  -0.614    0.558
## Pinvierno   -0.079826   0.215226  -0.371    0.722
## Tprimavera   0.179072   0.151477   1.182    0.276
## Tverano     -0.469772   0.509104  -0.923    0.387
## Totoño       0.439659   0.595052   0.739    0.484
## Tinvierno   -0.600166   0.553185  -1.085    0.314
## Altitud     -0.004398   0.004727  -0.931    0.383
## 
## Residual standard error: 0.725 on 7 degrees of freedom
## Multiple R-squared:  0.671,  Adjusted R-squared:  0.1541 
## F-statistic: 1.298 on 11 and 7 DF,  p-value: 0.3762

Stepwise: Paso 3

Hacer un modelo con el procedimiento Forward

regforw<-step(regvacia, 
              scope = list(lower=regvacia, upper=regcompleta),direction = "forward")
## Start:  AIC=-8.07
## Rend ~ 1
## 
##              Df Sum of Sq     RSS      AIC
## + Bio12       1    3.1979  7.9867 -12.4666
## + Pverano     1    3.0114  8.1732 -12.0280
## + Potoño      1    2.4594  8.7252 -10.7863
## + Totoño      1    1.8239  9.3607  -9.4504
## + Bio1        1    1.7369  9.4477  -9.2748
## + Tprimavera  1    1.6438  9.5408  -9.0885
## + Tverano     1    1.5949  9.5897  -8.9911
## + Pinvierno   1    1.2336  9.9510  -8.2886
## + Pprimavera  1    1.1450 10.0396  -8.1202
## <none>                    11.1846  -8.0681
## + Tinvierno   1    1.1155 10.0691  -8.0644
## + Altitud     1    1.0936 10.0910  -8.0231
## 
## Step:  AIC=-12.47
## Rend ~ Bio12
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    7.9867 -12.467
## + Potoño      1  0.242742 7.7439 -11.053
## + Tverano     1  0.118836 7.8678 -10.752
## + Tinvierno   1  0.088185 7.8985 -10.678
## + Pinvierno   1  0.071127 7.9155 -10.637
## + Tprimavera  1  0.029123 7.9575 -10.536
## + Bio1        1  0.027406 7.9593 -10.532
## + Altitud     1  0.008109 7.9786 -10.486
## + Pverano     1  0.006372 7.9803 -10.482
## + Totoño      1  0.003244 7.9834 -10.474
## + Pprimavera  1  0.000633 7.9860 -10.468
summary(regforw)
## 
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8256 -0.3766 -0.1594  0.4509  1.3556 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.1154455  0.3667890   5.767 2.28e-05 ***
## Bio12       0.0021987  0.0008427   2.609   0.0183 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6854 on 17 degrees of freedom
## Multiple R-squared:  0.2859, Adjusted R-squared:  0.2439 
## F-statistic: 6.807 on 1 and 17 DF,  p-value: 0.01833

Los resutados indican que de todas las variables la unica seleccionada por el procedimiento stepwise es Bio12, en este caso es precipitacion anual (mm); es decir el rendimiento de aceite de oregano puede ser estimado a patir de Bio12

Hacer un modelo con el procedimiento Stepwise

regstep<-step(regvacia, 
              scope = list(lower=regvacia, upper=regcompleta),direction = "both")
## Start:  AIC=-8.07
## Rend ~ 1
## 
##              Df Sum of Sq     RSS      AIC
## + Bio12       1    3.1979  7.9867 -12.4666
## + Pverano     1    3.0114  8.1732 -12.0280
## + Potoño      1    2.4594  8.7252 -10.7863
## + Totoño      1    1.8239  9.3607  -9.4504
## + Bio1        1    1.7369  9.4477  -9.2748
## + Tprimavera  1    1.6438  9.5408  -9.0885
## + Tverano     1    1.5949  9.5897  -8.9911
## + Pinvierno   1    1.2336  9.9510  -8.2886
## + Pprimavera  1    1.1450 10.0396  -8.1202
## <none>                    11.1846  -8.0681
## + Tinvierno   1    1.1155 10.0691  -8.0644
## + Altitud     1    1.0936 10.0910  -8.0231
## 
## Step:  AIC=-12.47
## Rend ~ Bio12
## 
##              Df Sum of Sq     RSS      AIC
## <none>                     7.9867 -12.4666
## + Potoño      1    0.2427  7.7439 -11.0531
## + Tverano     1    0.1188  7.8678 -10.7515
## + Tinvierno   1    0.0882  7.8985 -10.6776
## + Pinvierno   1    0.0711  7.9155 -10.6366
## + Tprimavera  1    0.0291  7.9575 -10.5360
## + Bio1        1    0.0274  7.9593 -10.5319
## + Altitud     1    0.0081  7.9786 -10.4859
## + Pverano     1    0.0064  7.9803 -10.4818
## + Totoño      1    0.0032  7.9834 -10.4744
## + Pprimavera  1    0.0006  7.9860 -10.4681
## - Bio12       1    3.1979 11.1846  -8.0681
summary(regstep)
## 
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8256 -0.3766 -0.1594  0.4509  1.3556 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.1154455  0.3667890   5.767 2.28e-05 ***
## Bio12       0.0021987  0.0008427   2.609   0.0183 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6854 on 17 degrees of freedom
## Multiple R-squared:  0.2859, Adjusted R-squared:  0.2439 
## F-statistic: 6.807 on 1 and 17 DF,  p-value: 0.01833

Como se se puede constatar, ambos procedimientos generan el mismo modelo. No obstante, a este modelo es necesario comprobar, la normalidad, homosedasticidad y e independencia. Estos apartados ya se trataron anteriormente.

El modelo final

plot(Bio12, Rend, pch=1, lwd = 5)
fit<-lm(Rend~Bio12, data = Datos)
abline(fit, col="red")

summary(fit)
## 
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8256 -0.3766 -0.1594  0.4509  1.3556 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.1154455  0.3667890   5.767 2.28e-05 ***
## Bio12       0.0021987  0.0008427   2.609   0.0183 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6854 on 17 degrees of freedom
## Multiple R-squared:  0.2859, Adjusted R-squared:  0.2439 
## F-statistic: 6.807 on 1 and 17 DF,  p-value: 0.01833

Ahora gneraremos un modelo mediante Boostrap, usando 100 iteraciones. Puede cambiar el valor (B = 100) para generarlo con 1000 iteraciones

Generar un modelo con Boostrap

library(bootStepAIC)
## Warning: package 'bootStepAIC' was built under R version 3.5.3
## Loading required package: MASS
library(MASS)

lmFit <-lm(formula = Rend~.,Datos)
boot.stepAIC(regstep, Datos)
## 
## Summary of Bootstrapping the 'stepAIC()' procedure for
## 
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
## 
## Bootstrap samples: 100 
## Direction: backward 
## Penalty: 2 * df
## 
## Covariates selected
##       (%)
## Bio12  98
## Null    2
## 
## Coefficients Sign
##       + (%) - (%)
## Bio12   100     0
## 
## Stat Significance
##         (%)
## Bio12 75.51
## 
## 
## The stepAIC() for the original data-set gave
## 
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
## 
## Coefficients:
## (Intercept)        Bio12  
##    2.115446     0.002199  
## 
## 
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Rend ~ Bio12
## 
## Final Model:
## Rend ~ Bio12
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev       AIC
## 1                         17    7.98667 -12.46664
boot.stepAIC(regstep, Datos, B = 100, alpha = 0.05, direction = "forward",
             k = 2, verbose = T)
## 
##  1 replicate finished
##  2 replicate finished
##  3 replicate finished
##  4 replicate finished
##  5 replicate finished
##  6 replicate finished
##  7 replicate finished
##  8 replicate finished
##  9 replicate finished
##  10 replicate finished
##  11 replicate finished
##  12 replicate finished
##  13 replicate finished
##  14 replicate finished
##  15 replicate finished
##  16 replicate finished
##  17 replicate finished
##  18 replicate finished
##  19 replicate finished
##  20 replicate finished
##  21 replicate finished
##  22 replicate finished
##  23 replicate finished
##  24 replicate finished
##  25 replicate finished
##  26 replicate finished
##  27 replicate finished
##  28 replicate finished
##  29 replicate finished
##  30 replicate finished
##  31 replicate finished
##  32 replicate finished
##  33 replicate finished
##  34 replicate finished
##  35 replicate finished
##  36 replicate finished
##  37 replicate finished
##  38 replicate finished
##  39 replicate finished
##  40 replicate finished
##  41 replicate finished
##  42 replicate finished
##  43 replicate finished
##  44 replicate finished
##  45 replicate finished
##  46 replicate finished
##  47 replicate finished
##  48 replicate finished
##  49 replicate finished
##  50 replicate finished
##  51 replicate finished
##  52 replicate finished
##  53 replicate finished
##  54 replicate finished
##  55 replicate finished
##  56 replicate finished
##  57 replicate finished
##  58 replicate finished
##  59 replicate finished
##  60 replicate finished
##  61 replicate finished
##  62 replicate finished
##  63 replicate finished
##  64 replicate finished
##  65 replicate finished
##  66 replicate finished
##  67 replicate finished
##  68 replicate finished
##  69 replicate finished
##  70 replicate finished
##  71 replicate finished
##  72 replicate finished
##  73 replicate finished
##  74 replicate finished
##  75 replicate finished
##  76 replicate finished
##  77 replicate finished
##  78 replicate finished
##  79 replicate finished
##  80 replicate finished
##  81 replicate finished
##  82 replicate finished
##  83 replicate finished
##  84 replicate finished
##  85 replicate finished
##  86 replicate finished
##  87 replicate finished
##  88 replicate finished
##  89 replicate finished
##  90 replicate finished
##  91 replicate finished
##  92 replicate finished
##  93 replicate finished
##  94 replicate finished
##  95 replicate finished
##  96 replicate finished
##  97 replicate finished
##  98 replicate finished
##  99 replicate finished
##  100 replicate finished
## 
## Summary of Bootstrapping the 'stepAIC()' procedure for
## 
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
## 
## Bootstrap samples: 100 
## Direction: forward 
## Penalty: 2 * df
## 
## Covariates selected
##       (%)
## Bio12 100
## 
## Coefficients Sign
##       + (%) - (%)
## Bio12   100     0
## 
## Stat Significance
##       (%)
## Bio12  75
## 
## 
## The stepAIC() for the original data-set gave
## 
## Call:
## lm(formula = Rend ~ Bio12, data = Datos)
## 
## Coefficients:
## (Intercept)        Bio12  
##    2.115446     0.002199  
## 
## 
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Rend ~ Bio12
## 
## Final Model:
## Rend ~ Bio12
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev       AIC
## 1                         17    7.98667 -12.46664

Como se puede ver, la variable seleccionada con este método es la misma que la seleccionada con los procedimientos anteriores