10/10, 2018

Sometimes information criteria cannot be used

  • The information criterion (IC) cannot be computed
  • The assumptions behind information criteria are not met
  • Comparing different types of models (GLM vs. GAM vs. RPART, etc.)
  • Different transformations are applied to the response variable (GLM)
  • Several polynomial terms (models should not be averaged)

Alternatives to information-criterion methods

  • Simulate the arrival of new data
    • Cross-validation
    • Bootstrapping
    • Leave-one-out

k-fold cross-validation

  • Randomly split the dataset into \(k\) groups (folds)
  • Train the models on \(k-1\) groups
  • Test on the held-out group \(k_i\)
  • Repeat so that each group is held out once, then average a performance measure such as \(R^2\)
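The steps above can be sketched in base R. This is a minimal illustration, not the code used in the slides: the helper name `kfold_r2` is hypothetical, and the out-of-sample \(R^2\) is taken here as the squared correlation between observed and predicted values, which is an assumption about how the measure is defined.

```r
# Minimal k-fold cross-validation sketch (hypothetical helper, base R only)
kfold_r2 <- function(formula, data, k = 5) {
  n <- nrow(data)
  # Randomly assign every row to one of k folds of near-equal size
  fold <- sample(rep_len(seq_len(k), n))
  response <- all.vars(formula)[1]
  sapply(seq_len(k), function(i) {
    train <- data[fold != i, , drop = FALSE]  # k - 1 folds for training
    test  <- data[fold == i, , drop = FALSE]  # held-out fold for testing
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    # Assumed performance measure: squared correlation observed vs. predicted
    cor(test[[response]], pred)^2
  })
}

set.seed(2018)
r2 <- kfold_r2(mpg ~ hp, data = mtcars, k = 5)
r2        # one R^2 per fold
mean(r2)  # averaged performance
```

Because the folds are drawn at random, the individual values change with the seed; only their average is meant to be compared between models.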

Back to the hp example

  • Let's see how to evaluate a single model, \(mpg = \beta_1 hp + c\)
  • \(R^2\) = 0.6
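For reference, the value of 0.6 appears to be the ordinary in-sample \(R^2\) of the linear model fitted to the full `mtcars` data. A minimal sketch of how that number can be obtained, assuming that interpretation:

```r
# In-sample R^2 of mpg ~ hp, fitted on all 32 rows of mtcars
fit <- lm(mpg ~ hp, data = mtcars)
r2_in_sample <- summary(fit)$r.squared
round(r2_in_sample, 2)
```

This figure is computed on the same data used to fit the model, which is why cross-validation is used to estimate performance on data the model has not seen.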

Step 1: K folds

Split the dataset into K parts

  • In this case K = 5
    • We split the dataset into 5 parts of (nearly) equal size
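A short sketch of the split, assuming the `mtcars` data used in the hp example. With 32 rows and K = 5 the folds cannot be exactly equal, so two folds get 7 rows and three get 6.

```r
set.seed(2018)
k <- 5
n <- nrow(mtcars)                        # 32 observations
# Repeat the labels 1..k until n rows are covered, then shuffle them
fold <- sample(rep_len(seq_len(k), n))
fold_sizes <- table(fold)
fold_sizes                               # sizes 7, 7, 6, 6, 6
```

Which rows land in which fold depends on the seed, but the fold sizes do not.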

Step 2: train and test for each fold

Fold 1

  • Rsq = c(0.61)

Fold 2

  • Rsq = c(0.61, 0.65)

Fold 3

  • Rsq = c(0.61, 0.65, 0.89)

Fold 4

  • Rsq = c(0.61, 0.65, 0.89, 0.6)

Fold 5

  • Rsq = c(0.61, 0.65, 0.89, 0.6, 0.67), mean = 0.68

Repeated k-fold cross-validation

  • Repeat the whole procedure n times, reshuffling the folds each time
  • 10-times-repeated 5-fold cross-validation = 50 \(R^2\) values
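The count of 50 values can be illustrated with a base R sketch. The helper `one_repeat` is hypothetical, and \(R^2\) is again assumed to be the squared correlation between observed and predicted values in the held-out fold.

```r
# One pass of k-fold CV: returns one R^2 per fold (hypothetical helper)
one_repeat <- function(formula, data, k) {
  fold <- sample(rep_len(seq_len(k), nrow(data)))
  response <- all.vars(formula)[1]
  sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[fold != i, ])
    test <- data[fold == i, ]
    cor(test[[response]], predict(fit, newdata = test))^2
  })
}

set.seed(2018)
k <- 5
repeats <- 10
# Each call reshuffles the folds; the result is a k x repeats matrix
r2 <- replicate(repeats, one_repeat(mpg ~ hp, mtcars, k))
dim(r2)     # 5 rows (folds) by 10 columns (repeats)
length(r2)  # 50 values in total
mean(r2)    # overall average
```

Repeating the split reduces the dependence of the average on any single random partition.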

Repeated k-fold cross-validation (cont.)

Rsquared Resample
0.825 Fold1.Rep01
0.732 Fold2.Rep01
0.769 Fold3.Rep01
0.882 Fold4.Rep01
0.678 Fold5.Rep01
0.709 Fold1.Rep02
0.599 Fold2.Rep02
0.915 Fold3.Rep02
0.779 Fold4.Rep02
0.683 Fold5.Rep02
0.440 Fold1.Rep03
0.714 Fold2.Rep03
0.707 Fold3.Rep03
0.860 Fold4.Rep03
0.648 Fold5.Rep03
0.900 Fold1.Rep04
0.643 Fold2.Rep04
0.609 Fold3.Rep04
0.993 Fold4.Rep04
0.638 Fold5.Rep04
0.789 Fold1.Rep05
0.930 Fold2.Rep05
0.812 Fold3.Rep05
0.711 Fold4.Rep05
0.623 Fold5.Rep05
0.699 Fold1.Rep06
0.766 Fold2.Rep06
0.788 Fold3.Rep06
0.678 Fold4.Rep06
0.818 Fold5.Rep06
0.755 Fold1.Rep07
0.474 Fold2.Rep07
0.893 Fold3.Rep07
0.657 Fold4.Rep07
0.751 Fold5.Rep07
0.611 Fold1.Rep08
0.642 Fold2.Rep08
0.857 Fold3.Rep08
0.768 Fold4.Rep08
0.852 Fold5.Rep08
0.952 Fold1.Rep09
0.828 Fold2.Rep09
0.799 Fold3.Rep09
0.709 Fold4.Rep09
0.422 Fold5.Rep09
0.561 Fold1.Rep10
0.712 Fold2.Rep10
0.568 Fold3.Rep10
0.883 Fold4.Rep10
0.745 Fold5.Rep10
  • Mean \(R^2\) across the 50 resamples = 0.735468

Selecting models using repeated k-fold cross-validation

Candidate models:

  • \(mpg = \beta_1hp + c\)
  • \(mpg = \beta_1hp + \beta_2hp^2 + c\)
  • \(mpg = \beta_1hp + \beta_2hp^2 + \beta_3hp^3 + c\)
  • \(mpg = \beta_1hp + \beta_2hp^2 + \beta_3hp^3 + \beta_4hp^4 + c\)
  • \(mpg = \beta_1hp + \beta_2hp^2 + \beta_3hp^3 + \beta_4hp^4 + \beta_5hp^5 + c\)
  • \(mpg = \beta_1hp + \beta_2hp^2 + \beta_3hp^3 + \beta_4hp^4 + \beta_5hp^5 + \beta_6hp^6 + c\)

Selecting by AICc

library(MuMIn)  # provides model.sel()
data("mtcars")

fit1 <- lm(mpg ~ hp, data = mtcars)
fit2 <- lm(mpg ~ hp + I(hp^2), data = mtcars)
fit3 <- lm(mpg ~ hp + I(hp^2) + I(hp^3), data = mtcars)
fit4 <- lm(mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4), data = mtcars)
fit5 <- lm(mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5), data = mtcars)
fit6 <- lm(mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5) + I(hp^6), data = mtcars)

# Rank the candidate models by AICc
models <- list(fit1, fit2, fit3, fit4, fit5, fit6)
SelectedMods <- model.sel(models)

Selecting by AICc (cont.)

(Intercept) hp I(hp^2) I(hp^3) I(hp^4) I(hp^5) I(hp^6) AICc delta weight
40.41 -0.21 0.00 NA NA NA NA 169.08 0.00 0.70
44.22 -0.29 0.00 0 NA NA NA 171.32 2.24 0.23
45.36 -0.33 0.00 0 0 NA NA 174.36 5.28 0.05
61.80 -0.96 0.01 0 0 0 NA 177.45 8.37 0.01
-62.95 4.81 -0.09 0 0 0 0 178.28 9.20 0.01
30.10 -0.07 NA NA NA NA NA 182.10 13.01 0.00
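The AICc, delta and weight columns above can be recomputed by hand in base R, which may help show what each column means. This is a sketch using the standard small-sample correction \(AICc = -2\log L + 2k + \frac{2k(k+1)}{n-k-1}\) and Akaike weights \(w_i = e^{-\Delta_i/2} / \sum_j e^{-\Delta_j/2}\); the helper name `aicc` is hypothetical and the code does not depend on MuMIn.

```r
# Hypothetical helper: AICc from the log-likelihood of a fitted model
aicc <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")  # estimated parameters, residual variance included
  n  <- nobs(fit)
  -2 * as.numeric(ll) + 2 * k + 2 * k * (k + 1) / (n - k - 1)
}

# Polynomial models of degree 1 to 6 in hp
poly_terms <- c("hp", paste0("I(hp^", 2:6, ")"))
fits <- lapply(1:6, function(d) {
  lm(reformulate(poly_terms[seq_len(d)], response = "mpg"), data = mtcars)
})

scores <- sapply(fits, aicc)
delta  <- scores - min(scores)                    # distance to the best model
weight <- exp(-delta / 2) / sum(exp(-delta / 2))  # Akaike weights
round(data.frame(degree = 1:6, AICc = scores, delta, weight), 2)
```

The rows here are ordered by polynomial degree rather than by AICc, so they need to be sorted by `delta` to compare them with the table.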

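Selecting by n-times-repeated K-fold cross-validation

  • For a single model
library(caret)  # trainControl(), train()
library(dplyr)  # %>%, select()

set.seed(2018)
# 5-fold cross-validation repeated 50 times
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 50)

# Keep the performance measures of every resample
DF <- train(mpg ~ hp, data = mtcars, method = "lm", trControl = ctrl)$resample

DF <- DF %>% select(Rsquared, Resample)
  • Now select the best model yourselves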

Solved exercise

form1 <- "mpg ~ hp"
form2 <- "mpg ~ hp + I(hp^2)"
form3 <- "mpg ~ hp + I(hp^2) + I(hp^3)"
form4 <- "mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4)"
form5 <- "mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5)"
form6 <- "mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5) + I(hp^6)"

forms <- list(form1, form2, form3, form4, form5, form6)
# K: number of regression coefficients in each model (intercept included)
K <- 2:7

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 50)

Continued

library(purrr)  # map(), map2()

set.seed(2018)
# For each formula: fit with repeated CV, then summarise the resampled R^2
Tests <- forms %>% 
    map(~train(as.formula(.x), data = mtcars, method = "lm", 
        trControl = ctrl)) %>% 
    map(~as.data.frame(.x$resample)) %>% 
    map(~summarise(.x, mean = mean(Rsquared, na.rm = TRUE), 
        sd = sd(Rsquared, na.rm = TRUE))) %>% 
    map2(.y = forms, ~mutate(.x, model = .y)) %>% 
    bind_rows() %>% 
    mutate(K = K) %>% 
    arrange(desc(mean))

Continued

mean sd model K
0.781 0.175 mpg ~ hp + I(hp^2) 3
0.772 0.176 mpg ~ hp + I(hp^2) + I(hp^3) 4
0.718 0.135 mpg ~ hp 2
0.649 0.329 mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) 5
0.639 0.299 mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5) 6
0.635 0.296 mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5) + I(hp^6) 7

Article discussion

Fixed vs. random variables

  • Fixed (continuous or categorical)
    • Expected to have a predictable, systematic influence on what we want to explain. All levels of the factor are represented (e.g., gender)
  • Random
    • Expected to have an unpredictable, idiosyncratic influence. Not all levels of the factor are represented (e.g., not every individual). Written as A + Error(B)
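The notation A + Error(B) looks like the formula syntax of R's `aov()`, where `Error()` names the grouping factor treated as a source of random variation. A minimal sketch under that assumption, using the built-in `CO2` dataset with `Treatment` as the fixed factor and `Plant` as the random one (this pairing is chosen only for illustration):

```r
# Treatment is a fixed factor; Plant (12 sampled individuals) goes in Error()
co2 <- as.data.frame(CO2)
fit_aov <- aov(uptake ~ Treatment + Error(Plant), data = co2)
summary(fit_aov)  # Treatment is tested against between-plant variation
names(fit_aov)    # one fitted stratum per error level
```

Since each plant receives only one treatment, the treatment effect is assessed in the between-plant stratum rather than against the within-plant residuals.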

CO2 example

Plant Type Treatment conc uptake
Qn1 Quebec nonchilled 95 16.0
Qn1 Quebec nonchilled 175 30.4
Qn1 Quebec nonchilled 250 34.8
Qn1 Quebec nonchilled 350 37.2
Qn1 Quebec nonchilled 500 35.3
Qn1 Quebec nonchilled 675 39.2
Qn1 Quebec nonchilled 1000 39.7
Qn2 Quebec nonchilled 95 13.6
Qn2 Quebec nonchilled 175 27.3
Qn2 Quebec nonchilled 250 37.1
Qn2 Quebec nonchilled 350 41.8
Qn2 Quebec nonchilled 500 40.6
Qn2 Quebec nonchilled 675 41.4
Qn2 Quebec nonchilled 1000 44.3
Qn3 Quebec nonchilled 95 16.2
Qn3 Quebec nonchilled 175 32.4
Qn3 Quebec nonchilled 250 40.3
Qn3 Quebec nonchilled 350 42.1
Qn3 Quebec nonchilled 500 42.9
Qn3 Quebec nonchilled 675 43.9
Qn3 Quebec nonchilled 1000 45.5
Qc1 Quebec chilled 95 14.2
Qc1 Quebec chilled 175 24.1
Qc1 Quebec chilled 250 30.3
Qc1 Quebec chilled 350 34.6
Qc1 Quebec chilled 500 32.5
Qc1 Quebec chilled 675 35.4
Qc1 Quebec chilled 1000 38.7
Qc2 Quebec chilled 95 9.3
Qc2 Quebec chilled 175 27.3
Qc2 Quebec chilled 250 35.0
Qc2 Quebec chilled 350 38.8
Qc2 Quebec chilled 500 38.6
Qc2 Quebec chilled 675 37.5
Qc2 Quebec chilled 1000 42.4
Qc3 Quebec chilled 95 15.1
Qc3 Quebec chilled 175 21.0
Qc3 Quebec chilled 250 38.1
Qc3 Quebec chilled 350 34.0
Qc3 Quebec chilled 500 38.9
Qc3 Quebec chilled 675 39.6
Qc3 Quebec chilled 1000 41.4
Mn1 Mississippi nonchilled 95 10.6
Mn1 Mississippi nonchilled 175 19.2
Mn1 Mississippi nonchilled 250 26.2
Mn1 Mississippi nonchilled 350 30.0
Mn1 Mississippi nonchilled 500 30.9
Mn1 Mississippi nonchilled 675 32.4
Mn1 Mississippi nonchilled 1000 35.5
Mn2 Mississippi nonchilled 95 12.0
Mn2 Mississippi nonchilled 175 22.0
Mn2 Mississippi nonchilled 250 30.6
Mn2 Mississippi nonchilled 350 31.8
Mn2 Mississippi nonchilled 500 32.4
Mn2 Mississippi nonchilled 675 31.1
Mn2 Mississippi nonchilled 1000 31.5
Mn3 Mississippi nonchilled 95 11.3
Mn3 Mississippi nonchilled 175 19.4
Mn3 Mississippi nonchilled 250 25.8
Mn3 Mississippi nonchilled 350 27.9
Mn3 Mississippi nonchilled 500 28.5
Mn3 Mississippi nonchilled 675 28.1
Mn3 Mississippi nonchilled 1000 27.8
Mc1 Mississippi chilled 95 10.5
Mc1 Mississippi chilled 175 14.9
Mc1 Mississippi chilled 250 18.1
Mc1 Mississippi chilled 350 18.9
Mc1 Mississippi chilled 500 19.5
Mc1 Mississippi chilled 675 22.2
Mc1 Mississippi chilled 1000 21.9
Mc2 Mississippi chilled 95 7.7
Mc2 Mississippi chilled 175 11.4
Mc2 Mississippi chilled 250 12.3
Mc2 Mississippi chilled 350 13.0
Mc2 Mississippi chilled 500 12.5
Mc2 Mississippi chilled 675 13.7
Mc2 Mississippi chilled 1000 14.4
Mc3 Mississippi chilled 95 10.6
Mc3 Mississippi chilled 175 18.0
Mc3 Mississippi chilled 250 17.9
Mc3 Mississippi chilled 350 17.9
Mc3 Mississippi chilled 500 17.9
Mc3 Mississippi chilled 675 18.9
Mc3 Mississippi chilled 1000 19.9

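Models

library(lme4)   # lmer()
library(MuMIn)  # dredge()
# Fixed effects only
mod1 <- lm(uptake ~ Type * Treatment + I(log(conc)) + conc, data = CO2)
# Same fixed effects plus a random intercept for each plant
mod2 <- lmer(uptake ~ Type * Treatment + I(log(conc)) + conc + (1 | Plant), 
    data = CO2)
# dredge() requires na.action = "na.fail"
options(na.action = "na.fail")
# Fit all subsets of mod1 with at most round(n / 10) terms
Seleccion <- dredge(mod1, m.lim = c(0, round(nrow(CO2)/10)))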

Selection

(Intercept) conc I(log(conc)) Treatment Type Treatment:Type df logLik AICc delta weight
-57.247 -0.025 17.789 + + + 7 -231.721 478.915 0.000 0.997
-55.608 -0.025 17.789 + + NA 6 -238.828 490.747 11.832 0.003
-14.037 NA 8.484 + + + 6 -246.017 505.124 26.209 0.000
-12.398 NA 8.484 + + NA 5 -251.194 513.157 34.242 0.000
-59.037 -0.025 17.789 NA + NA 5 -260.654 532.077 53.161 0.000
-15.827 NA 8.484 NA + NA 4 -268.437 545.381 66.466 0.000
27.621 0.018 NA + + + 6 -267.101 547.292 68.377 0.000
29.260 0.018 NA + + NA 5 -270.310 551.389 72.474 0.000
25.830 0.018 NA NA + NA 4 -282.035 572.577 93.662 0.000
-61.937 -0.025 17.789 + NA NA 5 -289.240 589.249 110.334 0.000
35.333 NA NA + + + 5 -291.877 594.523 115.608 0.000
-18.727 NA 8.484 + NA NA 4 -293.361 595.227 116.312 0.000
36.973 NA NA + + NA 4 -293.686 595.879 116.964 0.000
-65.367 -0.025 17.789 NA NA NA 4 -297.079 602.664 123.749 0.000
-22.157 NA 8.484 NA NA NA 3 -300.526 607.352 128.436 0.000
33.543 NA NA NA + NA 3 -300.801 607.901 128.986 0.000
22.930 0.018 NA + NA NA 4 -301.408 611.323 132.408 0.000
19.500 0.018 NA NA NA NA 3 -307.409 621.118 142.203 0.000
30.643 NA NA + NA NA 3 -314.173 634.646 155.730 0.000
27.213 NA NA NA NA NA 2 -318.682 641.512 162.596 0.000

(+ = factor term included in the model, NA = term not included)