Exámen Parcial II

Contexto

Un grupo de investigadores recolecta información de 386 individuos; de estos, algunos son diagnosticados con la enfermedad (columna y = yes) y otros no (columna y = no). Para todos los individuos se registró la presencia o ausencia de 20 síntomas diferentes; cada síntoma \(x_j = 1\) si el síntoma está presente y \(x_j = 0\) si no lo está \((j=1,2,\ldots,20)\).

El Problema

Se requiere determinar, con una alta precisión, el diagnóstico de la próxima persona a partir de los síntomas reportados. Por la naturaleza de la variable respuesta, pueden utilizarse los modelos de Regresión Logística, CART, Random Forests y SVM.

El otro requerimiento es que, en la medida de lo posible, el número de variables predictoras utilizadas para determinar si el diagnóstico de la enfermedad es yes o no debe ser el menor.

Contexto Analítico

El conjunto de datos pueden leerse en R utilizando los siguientes comandos:

Las variables relevantes en este caso son:

id: identificador del individuo
y: diagnóstico (yes, no)
xj: síntoma \(j\) (1: presente; 0: ausente)

Informe comparativo de los modelos de Regresión logística, CART, Random Forests y SVM con la intencionalidad de escoger aquel que proporcione el mejor accuracy para el testing con el menor número de variables predictoras.

Análisis exploratorio

Se realizó un primer ejercicio donde se analizó la distribución de la variable respuesta, y se determinaron las proporciones de las categorías SI/NO. De esto se tiene que para un modelo predictivo útil se debe tener un determinado nivel basal, es decir el porcentaje más alto de la categoría de la variable respuesta, para el caso del estudio en cuestión nos encontramos con un 58% de respuestas sí.

datos <- read.table('https://www.dropbox.com/s/bxea58hj951vi7k/datos-p1-caso2.txt?dl=1', header = TRUE)

###### Análisis exploratorio de los datos #####

# Resumen del set de datos

datos$y <- if_else(datos$y == "yes", "Si", "No")

attach(datos)

datos$id<-NULL

datos %<>% mutate_if(is.numeric, as.factor )


# Distribución de variables respuesta

ggplot(data = datos, aes(x = y, y = ..count.., fill = y)) +
  geom_bar() +
  scale_fill_manual(values = c("gray50", "orangered2")) +
  labs(title = "Distribución de la variable respuesta") +ylim(0,250)+
  theme_bw() +
  theme(legend.position = "bottom")

Contraste de proporción

## Contraste de proporciones

datos_cualitativos <-  datos %>%
  gather(key = "variable", value = "grupo",-y)

# Se añade un identificador formado por el nombre de la variable y el grupo 
datos_cualitativos <- datos_cualitativos %>%
  mutate(variable_grupo = paste(variable, grupo, sep = "_"))

# Función que calcula el test de proporciones para la columna "Survived" de un df
test_proporcion <- function(df){
  n_si <- sum(df$y == "Si") 
  n_no <- sum(df$y == "No")
  n_total <- n_si + n_no
  test <- prop.test(x = n_si, n = n_total, p = 0.58)
  prop_si <- n_si/n_total
  return(data.frame(p_value = test$p.value, prop_si))
}

# Se agrupan los datos por "variable_grupo" y se aplica a cada grupo la función
# test_proporcion()

analisis_prop <- datos_cualitativos%>%
  group_by(variable_grupo) %>%
  nest() %>%
  arrange(variable_grupo) %>%
  mutate(prop_test = map(.x = data, .f = test_proporcion)) %>%
  unnest(prop_test) %>%
  arrange(p_value) %>% 
  dplyr::select(variable_grupo,p_value, prop_si)

kable(analisis_prop, format = "simple", align = c("c"))

variable_grupo	p_value	prop_si
x6_1	0.0000000	0.8734177
x14_1	0.0000000	0.8848921
x2_0	0.0000000	0.3191489
x2_1	0.0000000	0.8282828
x3_1	0.0000000	0.8290155
x3_0	0.0000000	0.3316062
x15_1	0.0000000	0.8609272
x10_1	0.0000000	0.8490566
x8_1	0.0000000	0.8229167
x8_0	0.0000000	0.3402062
x1_1	0.0000000	0.8201058
x1_0	0.0000000	0.3502538
x13_1	0.0000000	0.8472222
x12_1	0.0000000	0.8529412
x7_1	0.0000000	0.8417266
x6_0	0.0000000	0.3771930
x19_1	0.0000000	0.8394161
x4_1	0.0000000	0.8000000
x10_0	0.0000000	0.3920705
x15_0	0.0000000	0.4000000
x5_1	0.0000000	0.7906977
x4_0	0.0000000	0.3883495
x14_0	0.0000001	0.4089069
x11_1	0.0000001	0.7891566
x18_1	0.0000005	0.7843137
x5_0	0.0000008	0.4112150
x13_0	0.0000008	0.4214876
x9_1	0.0000012	0.7643678
x12_0	0.0000029	0.4320000
x11_0	0.0000032	0.4227273
x7_0	0.0000040	0.4331984
x17_1	0.0000043	0.7643312
x16_1	0.0000054	0.7666667
x19_0	0.0000073	0.4377510
x9_0	0.0000120	0.4292453
x18_0	0.0000476	0.4463519
x17_0	0.0001496	0.4541485
x16_0	0.0003049	0.4618644
x20_1	0.0005360	0.7313433
x20_0	0.0120987	0.5000000

Se realiza un test de contraste de proporción con la intención de encontrar los p valor para identificar si las proporciones de la variable respuesta se alejan de lo esperado de acuerdo al nivel basal. Podemos concluir de acuerdo a la tabla que cuanto menor es el P valor mayor es la evidencia de que la proporción de si o no en la variable respuesta se alejan de lo esperando indicando que no existe relación entre las variables predictoras y la variable respuesta. Para el caso concreto del estudio se encuentra que las variables predictoras no están relacionadas entre sí.

seleccion de predictores

########## División de los datos para entrenamiento y test ################

set.seed(123)

# Se crean los índices de las observaciones de entrenamiento

train <- createDataPartition(y = datos$y, p = 0.8, list = FALSE, times = 1)
datos_train <- datos[train, ]
datos_test  <- datos[-train, ]

########### Preprocesado de los datos ##############

objeto_recipe <- recipe(formula = y ~ .,
                        data =  datos_train)

# los predictores incluidos en el modelo, no se detecta ninguno con varianza 
#cero o próxima a cero.

objeto_recipe <- objeto_recipe %>% step_nzv(all_predictors())

### Binarizacion de las variables

objeto_recipe <- objeto_recipe %>% step_dummy(all_nominal(), -all_outcomes())

# Se entrena el objeto recipe
trained_recipe <- prep(objeto_recipe, training = datos_train)

# Se aplican las transformaciones al conjunto de entrenamiento y de test
datos_train_prep <- bake(trained_recipe, new_data = datos_train)
datos_test_prep  <- bake(trained_recipe, new_data = datos_test)


######### Seleccion de predictores #########

#Método de Correlación de variables para seleccionar predictores
#Estimamos coeficiente de phi entre y & Xi; i=1,...,20
set.seed(1)
coef_phi <- sapply(1:20, function(j) DescTools::Assocs(with(datos, table(datos[,1+j], y)))[1])
names(coef_phi) <- paste0("x", 1:20)
sort(coef_phi)

      x20       x16       x17       x18        x9       x11        x5 
0.2231660 0.3010498 0.3087477 0.3349730 0.3378817 0.3675988 0.3821920 
      x19        x7       x12        x4       x13       x10       x15 
0.3894417 0.3973712 0.4074547 0.4161185 0.4172020 0.4557566 0.4557988 
      x14        x1        x8        x6        x3        x2 
0.4629860 0.4759306 0.4890539 0.4944164 0.5039526 0.5156583

#Visualizamos variables con coeficientes de phi mayores a 0.45
plot(sort(coef_phi), las = 1, pch = 16, ylab = "Coeficiente Phi", xlab = "")
abline(h = 0.45, col = 3, lwd = 2)

# las cuales son:
paste0("x", which(coef_phi >0.45))

[1] "x1"  "x2"  "x3"  "x6"  "x8"  "x10" "x14" "x15"

#Método de Regresion logística para seleccionar predictores

d <- read.table('https://www.dropbox.com/s/bxea58hj951vi7k/datos-p1-caso2.txt?dl=1', header = TRUE)

## Omitimos el idenditicados de cada paciente de los datos
d$id<-NULL
d$y<-ifelse(d$y=="yes",1,0)

set.seed(21)
#Ajustamos los datos a un modelo en el que se incluyen todas las variables
model1 <- glm(y~., data = d,family = "binomial")
#Ajustamos un modelo sin variables, solo el intercepto
model.ini <- glm(y ~ 1, data = d, family = binomial)
#Realizamos los modelos posibles paso a paso
library(MASS)
mod.final <- stepAIC(model.ini, 
                     scope = list(upper = model1), 
                     direction = "both")

Start:  AIC=527.11
y ~ 1

       Df Deviance    AIC
+ x2    1   417.07 421.07
+ x3    1   421.80 425.80
+ x6    1   422.21 426.21
+ x8    1   428.08 432.08
+ x1    1   433.28 437.28
+ x14   1   433.43 437.43
+ x15   1   438.11 442.11
+ x10   1   438.97 442.97
+ x13   1   452.61 456.61
+ x4    1   455.36 459.36
+ x12   1   455.52 459.52
+ x7    1   459.42 463.42
+ x19   1   462.05 466.05
+ x5    1   466.37 470.37
+ x11   1   470.71 474.71
+ x9    1   479.64 483.64
+ x18   1   479.87 483.87
+ x17   1   486.99 490.99
+ x16   1   488.77 492.77
+ x20   1   505.30 509.30
<none>      525.11 527.11

Step:  AIC=421.07
y ~ x2

       Df Deviance    AIC
+ x15   1   359.93 365.93
+ x14   1   373.04 379.04
+ x13   1   383.01 389.01
+ x12   1   383.30 389.30
+ x6    1   385.61 391.61
+ x10   1   390.73 396.73
+ x3    1   392.86 398.86
+ x8    1   393.35 399.35
+ x19   1   393.78 399.78
+ x7    1   395.10 401.10
+ x1    1   396.12 402.12
+ x17   1   396.42 402.42
+ x11   1   397.68 403.68
+ x16   1   398.71 404.71
+ x4    1   399.76 405.76
+ x5    1   400.35 406.35
+ x18   1   401.21 407.21
+ x20   1   409.60 415.60
+ x9    1   410.13 416.13
<none>      417.07 421.07
- x2    1   525.11 527.11

Step:  AIC=365.93
y ~ x2 + x15

       Df Deviance    AIC
+ x6    1   345.45 353.45
+ x10   1   347.43 355.43
+ x4    1   348.54 356.54
+ x7    1   349.09 357.09
+ x5    1   349.92 357.92
+ x14   1   350.47 358.47
+ x3    1   351.44 359.44
+ x8    1   352.12 360.12
+ x1    1   352.33 360.33
+ x19   1   352.90 360.90
+ x17   1   352.93 360.93
+ x9    1   355.54 363.54
+ x12   1   355.68 363.68
+ x18   1   356.66 364.66
+ x16   1   357.02 365.02
+ x11   1   357.52 365.52
+ x13   1   357.65 365.65
<none>      359.93 365.93
+ x20   1   359.89 367.89
- x15   1   417.07 421.07
- x2    1   438.11 442.11

Step:  AIC=353.45
y ~ x2 + x15 + x6

       Df Deviance    AIC
+ x4    1   337.76 347.76
+ x10   1   339.01 349.01
+ x17   1   339.02 349.02
+ x7    1   339.48 349.48
+ x14   1   340.73 350.73
+ x19   1   340.78 350.78
+ x5    1   341.49 351.49
+ x3    1   341.53 351.53
+ x16   1   342.80 352.80
+ x8    1   343.40 353.40
<none>      345.45 353.45
+ x1    1   343.56 353.56
+ x12   1   343.64 353.64
+ x9    1   343.85 353.85
+ x18   1   344.02 354.02
+ x11   1   344.12 354.12
+ x13   1   344.27 354.27
+ x20   1   345.31 355.31
- x6    1   359.93 365.93
- x2    1   375.19 381.19
- x15   1   385.61 391.61

Step:  AIC=347.76
y ~ x2 + x15 + x6 + x4

       Df Deviance    AIC
+ x17   1   332.04 344.04
+ x10   1   332.95 344.95
+ x14   1   333.26 345.26
+ x7    1   333.39 345.39
+ x19   1   333.81 345.81
+ x16   1   334.76 346.76
<none>      337.76 347.76
+ x5    1   335.76 347.76
+ x12   1   335.84 347.84
+ x3    1   336.33 348.33
+ x1    1   336.60 348.60
+ x8    1   336.74 348.74
+ x13   1   336.78 348.78
+ x11   1   336.85 348.85
+ x18   1   336.88 348.88
+ x20   1   337.54 349.54
+ x9    1   337.71 349.71
- x4    1   345.45 353.45
- x6    1   348.54 356.54
- x2    1   354.76 362.76
- x15   1   374.97 382.97

Step:  AIC=344.04
y ~ x2 + x15 + x6 + x4 + x17

       Df Deviance    AIC
+ x10   1   327.89 341.89
+ x7    1   327.97 341.97
+ x14   1   328.94 342.94
+ x19   1   329.76 343.76
+ x5    1   329.92 343.92
+ x3    1   330.03 344.03
<none>      332.04 344.04
+ x16   1   330.93 344.93
+ x12   1   331.11 345.11
+ x1    1   331.20 345.20
+ x8    1   331.30 345.30
+ x13   1   331.32 345.32
+ x11   1   331.69 345.69
+ x18   1   331.94 345.94
+ x20   1   331.94 345.94
+ x9    1   331.97 345.97
- x17   1   337.76 347.76
- x4    1   339.02 349.02
- x6    1   342.61 352.61
- x2    1   347.80 357.80
- x15   1   361.42 371.42

Step:  AIC=341.89
y ~ x2 + x15 + x6 + x4 + x17 + x10

       Df Deviance    AIC
+ x7    1   324.95 340.95
+ x14   1   325.60 341.60
<none>      327.89 341.89
+ x19   1   326.47 342.47
+ x5    1   326.54 342.54
+ x3    1   326.91 342.91
+ x16   1   327.07 343.07
+ x12   1   327.35 343.35
+ x13   1   327.46 343.46
+ x8    1   327.47 343.47
+ x1    1   327.51 343.51
+ x11   1   327.59 343.59
+ x20   1   327.84 343.84
+ x18   1   327.87 343.87
+ x9    1   327.89 343.89
- x10   1   332.04 344.04
- x17   1   332.95 344.95
- x4    1   333.29 345.29
- x6    1   334.54 346.54
- x2    1   339.29 351.29
- x15   1   353.98 365.98

Step:  AIC=340.95
y ~ x2 + x15 + x6 + x4 + x17 + x10 + x7

       Df Deviance    AIC
+ x14   1   322.77 340.77
<none>      324.95 340.95
- x7    1   327.89 341.89
- x10   1   327.97 341.97
+ x19   1   324.16 342.16
+ x16   1   324.17 342.17
+ x13   1   324.36 342.36
+ x3    1   324.41 342.41
+ x5    1   324.65 342.65
+ x12   1   324.69 342.69
+ x20   1   324.70 342.70
+ x1    1   324.76 342.76
+ x11   1   324.79 342.79
+ x8    1   324.89 342.89
+ x9    1   324.91 342.91
+ x18   1   324.95 342.95
- x4    1   329.63 343.63
- x6    1   329.81 343.81
- x17   1   329.82 343.82
- x2    1   335.88 349.88
- x15   1   348.05 362.05

Step:  AIC=340.77
y ~ x2 + x15 + x6 + x4 + x17 + x10 + x7 + x14

       Df Deviance    AIC
<none>      322.77 340.77
- x14   1   324.95 340.95
- x10   1   325.09 341.09
- x7    1   325.60 341.60
+ x16   1   322.18 342.18
- x6    1   326.21 342.21
+ x19   1   322.33 342.33
+ x5    1   322.46 342.46
+ x3    1   322.49 342.49
+ x13   1   322.63 342.63
+ x20   1   322.65 342.65
- x17   1   326.65 342.65
+ x18   1   322.66 342.66
+ x1    1   322.68 342.68
+ x9    1   322.72 342.72
+ x11   1   322.76 342.76
+ x8    1   322.76 342.76
+ x12   1   322.77 342.77
- x4    1   327.44 343.44
- x2    1   332.99 348.99
- x15   1   336.27 352.27

#El modelo final sugiere que 8 variables son suficientes para predecir 
#que una persona tenga la enfermedad, lo que permite discriminar entre grupos
#estas variables son:
  mod.final[1]

$coefficients
(Intercept)          x2         x15          x6          x4 
 -1.8740693   1.0942832   1.2845260   0.6934984   0.6806736 
        x17         x10          x7         x14 
  0.5967528   0.5343521   0.5826155   0.5809925

La función de la correlación de Phi es determinar si existe una relación lineal entre dos variables a nivel nominal con dos valores cada una (dicotómico) y que esta relación no sea debida al azar, buscando así, saber si existe relación entre el par de variables.

Por otro lado, la Regresión Logística es una técnica estadística multivariante que nos permite estimar la relación existente entre una variable dependiente no métrica, en particular dicotómica y un conjunto de variables independientes métricas o no métricas.

En esta sección se aplicaron dos métodos que dieran indicios de la seleccion de los mejores predictores posibles, en primera instancia se esimó en coeficiente de correlacion de phi entre la respuesta y cada una de las 20 variables, resultando que las variables “x1”, “x2”, “x3”, “x6”, “x8”, “x10”, “x14”, “x15” presentaron los ocho coeficientes más altos, sugiriendo que cada uno de estos muestra una mayor asociación entre los dos grupos de pacientes, en comapració con las variables restantes.

El segundo método aplicado, consistió en el ajuste de mosdelos de regresión logistica, iniciando desde un modelo con todas las variables hasta sin ellas, resultado un mejor modelo basado en el criterio AIC con ocho predictores, que sugieren una alta discriminación entre los dos grupos de pacientes, las variables correpondieron a “x2”, “x15”, “x6”, “x4”, “x17”, “x10”, “x7”, “x14”.

################# Modelos con predictores selectos con el coeficiente de phi #################################

########################## SVM lineal phi #################################

# PARALELIZACIÓN DE PROCESO
#===============================================================================
getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

hiperparametros <- data.frame(C = c(0.001, 0.01, 0.1, 0.5, 1, 10))

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)

for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros)) 
}

seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================

#"x1"  "x2"  "x3"  "x6"  "x8"  "x10" "x14" "x15"

set.seed(346)
modelo_svmlineal_phi <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x6_X1 + x10_X1 +
                              x14_X1 + x15_X1, data = datos_train_prep,
                            method = "svmLinear",
                            tuneGrid = hiperparametros,
                            metric = "Accuracy",
                            trControl = control_train)

stopImplicitCluster()


##################### Regresión logística #######################

# PARALELIZACIÓN DE PROCESO
#===============================================================================
getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- data.frame(parameter = "none")

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_logistic_phi <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x6_X1 + x10_X1 +
                             x14_X1 + x15_X1, data = datos_train_prep,
                           method = "glm",
                           tuneGrid = hiperparametros,
                           metric = "Accuracy",
                           trControl = control_train,
                           family = "binomial")
## stop parallel workers computing 
stopImplicitCluster()


####################### Árbol de clasificación simple ####################

# PARALELIZACIÓN DE PROCESO
#===============================================================================

getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- data.frame(parameter = "none")

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_C50Tree_phi <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x6_X1 + x10_X1 +
                            x14_X1 + x15_X1,data = datos_train_prep,
                          method = "C5.0Tree",
                          tuneGrid = hiperparametros,
                          metric = "Accuracy",
                          trControl = control_train)

## stop parallel workers computing 
stopImplicitCluster()


###################### RandomForest ########################

# PARALELIZACIÓN DE PROCESO
#===============================================================================
getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- expand.grid(mtry = c(3, 4, 5, 7),
                               min.node.size = c(2, 3, 4, 5, 10, 15, 20, 30),
                               splitrule = "gini")

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_rf_phi <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x6_X1 + x10_X1 +
                       x14_X1 + x15_X1, data = datos_train_prep,
                     method = "ranger",
                     tuneGrid = hiperparametros,
                     metric = "Accuracy",
                     trControl = control_train,
                     # Número de árboles ajustados
                     num.trees = 500)

## stop parallel workers computing 
stopImplicitCluster()

######################## Gradient Boosting #######################

# PARALELIZACIÓN DE PROCESO
#===============================================================================

getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- expand.grid(interaction.depth = c(1, 2),
                               n.trees = c(500, 1000, 2000),
                               shrinkage = c(0.001, 0.01, 0.1),
                               n.minobsinnode = c(2, 5, 15))

set.seed(226)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(226)
modelo_boost_phi <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x6_X1 + x10_X1 +
                          x14_X1 + x15_X1, data = datos_train_prep,
                        method = "gbm",
                        tuneGrid = hiperparametros,
                        metric = "Accuracy",
                        trControl = control_train,
                        # Número de árboles ajustados
                        distribution = "adaboost",
                        verbose = FALSE)

## stop parallel workers computing 
stopImplicitCluster()

##################### SVM Radial  ##########################

# PARALELIZACIÓN DE PROCESO
#===============================================================================

getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- expand.grid(sigma = c(0.001, 0.01, 0.1, 0.5, 1),
                               C = c(1 , 20, 50, 100, 200, 500, 700))

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_svmrad_phi <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x6_X1 + x10_X1 +
                           x14_X1 + x15_X1, data = datos_train_prep,
                         method = "svmRadial",
                         tuneGrid = hiperparametros,
                         metric = "Accuracy",
                         trControl = control_train)

## stop parallel workers computing 
stopImplicitCluster()



# step
# #"x2"  "x15"  "x6"  "x4"  "x17"  "x10" "x7" "x14"

########################## Modelos con predictores selectos con step #################################

########################## SVM lineal step #################################

# PARALELIZACIÓN DE PROCESO
#===============================================================================
getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

hiperparametros <- data.frame(C = c(0.001, 0.01, 0.1, 0.5, 1, 10))

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)

for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros)) 
}

seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================

#"x2"  "x15"  "x6"  "x4"  "x17"  "x10" "x7" "x14"

set.seed(346)
modelo_svmlineal_step <- train(y ~ x2_X1 + x4_X1 + x6_X1 + x7_X1 + x10_X1 +
                                x14_X1 + x15_X1 + x17_X1, data = datos_train_prep,
                              method = "svmLinear",
                              tuneGrid = hiperparametros,
                              metric = "Accuracy",
                              trControl = control_train)

stopImplicitCluster()


##################### Regresión logística #######################

# PARALELIZACIÓN DE PROCESO
#===============================================================================
getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- data.frame(parameter = "none")

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_logistic_step <- train(y ~ x2_X1 + x4_X1 + x6_X1 + x7_X1 + x10_X1 +
                               x14_X1 + x15_X1 + x17_X1, data = datos_train_prep,
                             method = "glm",
                             tuneGrid = hiperparametros,
                             metric = "Accuracy",
                             trControl = control_train,
                             family = "binomial")
## stop parallel workers computing 
stopImplicitCluster()


####################### Árbol de clasificación simple ####################

# PARALELIZACIÓN DE PROCESO
#===============================================================================

getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- data.frame(parameter = "none")

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_C50Tree_step <- train(y ~ x2_X1 + x4_X1 + x6_X1 + x7_X1 + x10_X1 +
                              x14_X1 + x15_X1 + x17_X1,data = datos_train_prep,
                            method = "C5.0Tree",
                            tuneGrid = hiperparametros,
                            metric = "Accuracy",
                            trControl = control_train)

## stop parallel workers computing 
stopImplicitCluster()


###################### RandomForest ########################

# PARALELIZACIÓN DE PROCESO
#===============================================================================
getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- expand.grid(mtry = c(3, 4, 5, 7),
                               min.node.size = c(2, 3, 4, 5, 10, 15, 20, 30),
                               splitrule = "gini")

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_rf_step <- train(y ~ x2_X1 + x4_X1 + x6_X1 + x7_X1 + x10_X1 +
                         x14_X1 + x15_X1 + x17_X1, data = datos_train_prep,
                       method = "ranger",
                       tuneGrid = hiperparametros,
                       metric = "Accuracy",
                       trControl = control_train,
                       # Número de árboles ajustados
                       num.trees = 500)

## stop parallel workers computing 
stopImplicitCluster()

######################## Gradient Boosting #######################

# PARALELIZACIÓN DE PROCESO
#===============================================================================

getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- expand.grid(interaction.depth = c(1, 2),
                               n.trees = c(500, 1000, 2000),
                               shrinkage = c(0.001, 0.01, 0.1),
                               n.minobsinnode = c(2, 5, 15))

set.seed(226)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(226)
modelo_boost_step <- train(y ~ x2_X1 + x4_X1 + x6_X1 + x7_X1 + x10_X1 +
                            x14_X1 + x15_X1 + x17_X1, data = datos_train_prep,
                          method = "gbm",
                          tuneGrid = hiperparametros,
                          metric = "Accuracy",
                          trControl = control_train,
                          # Número de árboles ajustados
                          distribution = "adaboost",
                          verbose = FALSE)

## stop parallel workers computing 
stopImplicitCluster()


##################### SVM Radial 2 ##########################

# PARALELIZACIÓN DE PROCESO
#===============================================================================

getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- expand.grid(sigma = c(0.001, 0.01, 0.1, 0.5, 1),
                               C = c(1 , 20, 50, 100, 200, 500, 700))

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_svmrad_step <- train(y ~ x2_X1 + x4_X1 + x6_X1 + x7_X1 + x10_X1 +
                             x14_X1 + x15_X1 + x17_X1, data = datos_train_prep,
                           method = "svmRadial",
                           tuneGrid = hiperparametros,
                           metric = "Accuracy",
                           trControl = control_train)

## stop parallel workers computing 
stopImplicitCluster()


### otro predictores con phi

# x2+x3+x6+x8+x1+x14+x15+x10+x13+x4


###### Modelos con predictores selectos con phi 10 variables #######

########################## SVM lineal otro #################################

# PARALELIZACIÓN DE PROCESO
#===============================================================================
getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

hiperparametros <- data.frame(C = c(0.001, 0.01, 0.1, 0.5, 1, 10))

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)

for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros)) 
}

seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================

# x2+x3+x6+x8+x1+x14+x15+x10+x13+x4

set.seed(346)
modelo_svmlineal_otro <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x4_X1 + x6_X1 +
                                 x8_X1 + x10_X1 + x13_X1 + x14_X1 + x15_X1, data = datos_train_prep,
                               method = "svmLinear",
                               tuneGrid = hiperparametros,
                               metric = "Accuracy",
                               trControl = control_train)

stopImplicitCluster()


##################### Regresión logística #######################

# PARALELIZACIÓN DE PROCESO
#===============================================================================
getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- data.frame(parameter = "none")

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_logistic_otro <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x4_X1 + x6_X1 +
                                x8_X1 + x10_X1 + x13_X1 + x14_X1 + x15_X1, data = datos_train_prep,
                              method = "glm",
                              tuneGrid = hiperparametros,
                              metric = "Accuracy",
                              trControl = control_train,
                              family = "binomial")
## stop parallel workers computing 
stopImplicitCluster()

####################### Árbol de clasificación simple ####################

# PARALELIZACIÓN DE PROCESO
#===============================================================================

getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- data.frame(parameter = "none")

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_C50Tree_otro <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x4_X1 + x6_X1 +
                               x8_X1 + x10_X1 + x13_X1 + x14_X1 + x15_X1,data = datos_train_prep,
                             method = "C5.0Tree",
                             tuneGrid = hiperparametros,
                             metric = "Accuracy",
                             trControl = control_train)

## stop parallel workers computing 
stopImplicitCluster()

###################### RandomForest ########################

# PARALELIZACIÓN DE PROCESO
#===============================================================================
getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- expand.grid(mtry = c(3, 4, 5, 7),
                               min.node.size = c(2, 3, 4, 5, 10, 15, 20, 30),
                               splitrule = "gini")

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_rf_otro <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x4_X1 + x6_X1 +
                          x8_X1 + x10_X1 + x13_X1 + x14_X1 + x15_X1, data = datos_train_prep,
                        method = "ranger",
                        tuneGrid = hiperparametros,
                        metric = "Accuracy",
                        trControl = control_train,
                        # Número de árboles ajustados
                        num.trees = 500)

## stop parallel workers computing 
stopImplicitCluster()

######################## Gradient Boosting #######################

# PARALELIZACIÓN DE PROCESO
#===============================================================================

getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- expand.grid(interaction.depth = c(1, 2),
                               n.trees = c(500, 1000, 2000),
                               shrinkage = c(0.001, 0.01, 0.1),
                               n.minobsinnode = c(2, 5, 15))

set.seed(226)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(226)
modelo_boost_otro <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x4_X1 + x6_X1 +
                             x8_X1 + x10_X1 + x13_X1 + x14_X1 + x15_X1, data = datos_train_prep,
                           method = "gbm",
                           tuneGrid = hiperparametros,
                           metric = "Accuracy",
                           trControl = control_train,
                           # Número de árboles ajustados
                           distribution = "adaboost",
                           verbose = FALSE)

## stop parallel workers computing 
stopImplicitCluster()

##################### SVM Radial 2 ##########################

# PARALELIZACIÓN DE PROCESO
#===============================================================================

getDoParWorkers()

[1] 1

registerDoParallel(1)

# HIPERPARÁMETROS, NÚMERO DE REPETICIONES Y SEMILLAS PARA CADA REPETICIÓN
#===============================================================================
particiones  <- 10
repeticiones <- 5

# Hiperparámetros
hiperparametros <- expand.grid(sigma = c(0.001, 0.01, 0.1, 0.5, 1),
                               C = c(1 , 20, 50, 100, 200, 500, 700))

set.seed(123)
seeds <- vector(mode = "list", length = (particiones * repeticiones) + 1)
for (i in 1:(particiones * repeticiones)) {
  seeds[[i]] <- sample.int(1000, nrow(hiperparametros))
}
seeds[[(particiones * repeticiones) + 1]] <- sample.int(1000, 1)

# DEFINICIÓN DEL ENTRENAMIENTO
#===============================================================================
control_train <- trainControl(method = "repeatedcv", number = particiones,
                              repeats = repeticiones, seeds = seeds,
                              returnResamp = "final", verboseIter = FALSE,
                              allowParallel = TRUE)

# AJUSTE DEL MODELO
# ==============================================================================
set.seed(342)
modelo_svmrad_otro <- train(y ~ x1_X1 + x2_X1 + x3_X1 + x4_X1 + x6_X1 +
                              x8_X1 + x10_X1 + x13_X1 + x14_X1 + x15_X1, data = datos_train_prep,
                            method = "svmRadial",
                            tuneGrid = hiperparametros,
                            metric = "Accuracy",
                            trControl = control_train)

## stop parallel workers computing 
stopImplicitCluster()

modelos_phi <- list(SVMlineal_phi = modelo_svmlineal_phi,
                logistic_phi = modelo_logistic_phi,
                arbol_phi = modelo_C50Tree_phi, 
                rf_phi = modelo_rf_phi, 
                boosting_phi = modelo_boost_phi, 
                SVMradial_phi = modelo_svmrad_phi)

resultados_resamples_phi <- resamples(modelos_phi)

# Se trasforma el dataframe devuelto por resamples() para separar el nombre del
# modelo y las métricas en columnas distintas.

metricas_resamples_phi <- resultados_resamples_phi$values %>%
  gather(key = "modelo", value = "valor", -Resample) %>%
  separate(col = "modelo", into = c("modelo", "metrica"),
           sep = "~", remove = TRUE)


metricas_resamples_phi %>% filter(metrica == "Accuracy") %>%
  group_by(modelo) %>% 
  mutate(media = mean(valor, na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.1, alpha = 0.6) +
  scale_y_continuous(limits = c(0.4, 1)) +
  # Accuracy basal
  geom_hline(yintercept = 0.58, linetype = "dashed") +
  annotate(geom = "text", y = 0.58 , x = 6.5, label = "Accuracy basal") +
  theme_bw() +
  labs(title = "Validación: Accuracy medio repeated-CV",
       subtitle = "Modelos seleccionado por phi (8 variables) ordenados por media") +
  coord_flip() +
  theme(legend.position = "none")

# Error de test

predicciones_phi <- extractPrediction(
  models = modelos_phi,
  testX = datos_test_prep %>% dplyr::select(-y),
  testY = datos_test_prep$y
)


pred1 <- predict(modelo_svmlineal_phi, datos_test_prep)

pred2 <- predict(modelo_svmrad_phi, datos_test_prep)

predicciones_phi$pred[is.na(predicciones_phi$pred)] <- c(paste(pred1),paste(pred2))


metricas_predicciones_phi <- predicciones_phi %>%
  mutate(acierto = ifelse(obs == pred, TRUE, FALSE)) %>%
  group_by(object, dataType) %>%
  summarise(accuracy = mean(acierto))

kable(metricas_predicciones_phi %>%
  spread(key = dataType, value = accuracy) %>%
  arrange(desc(Test)), format = "simple", align = c("c"))

object	Test	Training
logistic_phi	0.8815789	0.8258065
rf_phi	0.8684211	0.8258065
arbol_phi	0.8552632	0.8258065
SVMlineal_phi	0.8552632	0.8258065
SVMradial_phi	0.8552632	0.8258065
boosting_phi	0.8026316	0.8290323

ggplot(data = metricas_predicciones_phi,
       aes(x = reorder(object, accuracy), y = accuracy,
           color = dataType, label = round(accuracy, 2))) +
  geom_point(size = 8) +
  scale_color_manual(values = c("orangered2", "gray50")) +
  geom_text(color = "white", size = 3) +
  scale_y_continuous(limits = c(0.5, 1)) +
  # Accuracy basal
  geom_hline(yintercept = 0.58, linetype = "dashed") +
  annotate(geom = "text", y = 0.58 , x = 6.5, label = "Accuracy basal") +
  coord_flip() +
  labs(title = "Accuracy de entrenamiento y test", 
       x = "modelo") +
  theme_bw() + 
  theme(legend.position = "bottom")

Para el proceso de la evaluación del modelo a partir de la correlación de phi como estrategia para la selección de variables, se ejecutó un modelo con 8 variables predictoras, para este caso particular se idéntifica que el mayor accuracy ocurre para el modelo aplicando el algoritmo logístico con un valor de 0.83 para el training y un 0.88 para el test. La gráfica y la tabla permiten identificar los otros valores para cada algoritmo evaluado.

modelo otro

modelos_otro <- list(SVMlineal_otro = modelo_svmlineal_otro,
                logistic_otro = modelo_logistic_otro,
                arbol_otro = modelo_C50Tree_otro,
                rf_otro = modelo_rf_otro,
                boosting_otro = modelo_boost_otro,
                SVMradial_otro = modelo_svmrad_otro)

resultados_resamples_otro <- resamples(modelos_otro)

metricas_resamples_otro <- resultados_resamples_otro$values %>%
  gather(key = "modelo", value = "valor", -Resample) %>%
  separate(col = "modelo", into = c("modelo", "metrica"),
           sep = "~", remove = TRUE)


metricas_resamples_otro %>% filter(metrica == "Accuracy") %>%
  group_by(modelo) %>% 
  mutate(media = mean(valor, na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.1, alpha = 0.6) +
  scale_y_continuous(limits = c(0.4, 1)) +
  # Accuracy basal
  geom_hline(yintercept = 0.58, linetype = "dashed") +
  annotate(geom = "text", y = 0.58 , x = 6.5, label = "Accuracy basal") +
  theme_bw() +
  labs(title = "Validación: Accuracy medio repeated-CV",
       subtitle = "Modelos ordenados por media") +
  coord_flip() +
  theme(legend.position = "none")

predicciones_otro <- extractPrediction(
  models = modelos_otro,
  testX = datos_test_prep %>% dplyr::select(-y),
  testY = datos_test_prep$y
)

pred1 <- predict(modelo_svmlineal_otro, datos_test_prep)

pred2 <- predict(modelo_svmrad_otro, datos_test_prep)

predicciones_otro$pred[is.na(predicciones_otro$pred)] <- c(paste(pred1),paste(pred2))

metricas_predicciones_otro <- predicciones_otro %>%
  mutate(acierto = ifelse(obs == pred, TRUE, FALSE)) %>%
  group_by(object, dataType) %>%
  summarise(accuracy = mean(acierto))

kable(metricas_predicciones_otro %>%
  spread(key = dataType, value = accuracy) %>%
  arrange(desc(Test)), format = "simple", align = c("c"))

object	Test	Training
rf_otro	0.8815789	0.8741935
logistic_otro	0.8684211	0.8322581
SVMradial_otro	0.8684211	0.8483871
arbol_otro	0.8552632	0.8258065
SVMlineal_otro	0.8552632	0.8258065
boosting_otro	0.8289474	0.8322581

ggplot(data = metricas_predicciones_otro,
       aes(x = reorder(object, accuracy), y = accuracy,
           color = dataType, label = round(accuracy, 2))) +
  geom_point(size = 8) +
  scale_color_manual(values = c("orangered2", "gray50")) +
  geom_text(color = "white", size = 3) +
  scale_y_continuous(limits = c(0.5, 1)) +
  # Accuracy basal
  geom_hline(yintercept = 0.58, linetype = "dashed") +
  annotate(geom = "text", y = 0.58 , x = 6.5, label = "Accuracy basal") +
  coord_flip() +
  labs(title = "Accuracy de entrenamiento y test", 
       x = "modelo") +
  theme_bw() + 
  theme(legend.position = "bottom")

En un siguiente proceso de selección de variable con el método de correlación de phi, se ejecutó un modelo con 10 variables predictoras, para este segundo caso de estudio, se idéntica que el mayor accuracy ocurre para el modelo aplicando el algoritmo logístico con un valor de 0.87 para el training y un 0.88 para el test. Es evidente que mejora el proceso en el entrenamiento del modelo, pero al realizar la predicción arroja un accuracy igual al del modelo con 8 variables, podría pensarse que con esta estrategia de selección sigue funcionando el modelo con 8 variables dado que igualan en el accuracy del test y este presenta un menor número de predictoras. La gráfica y la tabla permiten identificar los otros valores para cada algoritmo evaluado.

Modelo step

modelos_step <- list(SVMlineal_step = modelo_svmlineal_step,
                logistic_step = modelo_logistic_step, 
                arbol_step = modelo_C50Tree_step, 
                rf_step = modelo_rf_step,
                boosting_step = modelo_boost_step, 
                SVMradial_step = modelo_svmrad_step)


resultados_resamples_step <- resamples(modelos_step)


metricas_resamples_step <- resultados_resamples_step$values %>%
  gather(key = "modelo", value = "valor", -Resample) %>%
  separate(col = "modelo", into = c("modelo", "metrica"),
           sep = "~", remove = TRUE)


metricas_resamples_step %>% filter(metrica == "Accuracy") %>%
  group_by(modelo) %>% 
  mutate(media = mean(valor, na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.1, alpha = 0.6) +
  scale_y_continuous(limits = c(0.4, 1)) +
  # Accuracy basal
  geom_hline(yintercept = 0.58, linetype = "dashed") +
  annotate(geom = "text", y = 0.58 , x = 6.5, label = "Accuracy basal") +
  theme_bw() +
  labs(title = "Validación: Accuracy medio repeated-CV",
       subtitle = "Modelos ordenados por media") +
  coord_flip() +
  theme(legend.position = "none")

predicciones_step <- extractPrediction(
  models = modelos_step,
  testX = datos_test_prep %>% dplyr::select(-y),
  testY = datos_test_prep$y
)

pred1 <- predict(modelo_svmlineal_step, datos_test_prep)

pred2 <- predict(modelo_svmrad_step, datos_test_prep)

predicciones_step$pred[is.na(predicciones_step$pred)] <- c(paste(pred1),paste(pred2))


metricas_predicciones_step <- predicciones_step %>%
  mutate(acierto = ifelse(obs == pred, TRUE, FALSE)) %>%
  group_by(object, dataType) %>%
  summarise(accuracy = mean(acierto))

kable(metricas_predicciones_step %>%
  spread(key = dataType, value = accuracy) %>%
  arrange(desc(Test)), format = "simple", align = c("c"))

object	Test	Training
logistic_step	0.9078947	0.8387097
SVMradial_step	0.9078947	0.8225806
rf_step	0.8815789	0.8419355
SVMlineal_step	0.8815789	0.8419355
arbol_step	0.8552632	0.8258065
boosting_step	0.8552632	0.8354839

ggplot(data = metricas_predicciones_step,
       aes(x = reorder(object, accuracy), y = accuracy,
           color = dataType, label = round(accuracy, 2))) +
  geom_point(size = 8) +
  scale_color_manual(values = c("orangered2", "gray50")) +
  geom_text(color = "white", size = 3) +
  scale_y_continuous(limits = c(0.5, 1)) +
  # Accuracy basal
  geom_hline(yintercept = 0.58, linetype = "dashed") +
  annotate(geom = "text", y = 0.58 , x = 6.5, label = "Accuracy basal") +
  coord_flip() +
  labs(title = "Accuracy de entrenamiento y test", 
       x = "modelo") +
  theme_bw() + 
  theme(legend.position = "bottom")

Por último, se realizó un proceso de selección de variable con el método de stepAIC, dicho proceso generó un modelo con 8 variable predictoras con el AIC más eficiente, para este caso de estudio, se idéntifica que el mayor accuracy ocurre para el modelo aplicando el algoritmo logístico con un valor de 0.84 para el training y un 0.91 para el test. Es evidente que el proceso en el entrenamiento del modelo disminuye sin diferencias grandes con respecto a los otros dos modelos trabajados, pero al realizar la predicción arroja el mejor accuracy.

De este último proceso se determina y escoge el mejor modelo a trabajar y el que se sugiere como modelo final para la predicción solicitada, el modelo logistico con ocho variables y un accuracy de 0.91 al evaluar datos desconocidos.

La gráfica y la tabla permiten identificar los otros valores para cada algoritmo evaluado.