Step Wise Selection

Este es un algoritmo que nos permite encontrar el mejor modelo para predecir una variable Y contra todas las posibles combinaciones de X, este algoritmo es eficiente y es considerado de fuerza bruta, pero si existen muchas variables puede ser un poco costoso computacionalmente, para ello haremos las siguientes pruebas utilizando el dataset de Boston

summary(Boston)
      crim                zn             indus            chas              nox               rm       
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000   Min.   :0.3850   Min.   :3.561  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000   1st Qu.:0.4490   1st Qu.:5.886  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000   Median :0.5380   Median :6.208  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917   Mean   :0.5547   Mean   :6.285  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.623  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000   Max.   :0.8710   Max.   :8.780  
      age              dis              rad              tax           ptratio          black       
 Min.   :  2.90   Min.   : 1.130   Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 77.50   Median : 3.207   Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 68.57   Mean   : 3.795   Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :100.00   Max.   :12.127   Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  

Ahora veamos la lista de las variables:

features <- colnames(Boston)
features <- setdiff(features , "medv")
features
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"     "rad"     "tax"    
[11] "ptratio" "black"   "lstat"  

Fordward Step Wise

El es un algoritmo basado en el Step Wise que consiste en crear un modelo utilizando una variable, cualquier de \(X_{0}..X_{p}\) donde p es la cantidad de variables que tenemos, y se toma el modelo com el menor RSS or mayor \(R^2\) como el ganador, luego haremos la misma prueba con todos los modelos posibles de dos variables, y asi sucesivamente hasta que tengamos los mejores modelos basados en 1, 2, 3, .., 10 variables, con ese listado de modelos haremos la ultima prueba utilizando kfold y veremos cual de los 10 modelos es el ganador.

Empesemos por definir una funcion que segun el dataset y las variables a probar genere un modelo y calcule el RSS:

generate_rss <- function(features,dataset,label){
  formula <- paste0(features, collapse="+")
  formula <- paste0(label,"~",formula,
                    collapse = "")
  fit <- lm(formula,data=dataset)
  test_pred <- predict(fit,dataset)
  se <- sum((test_pred-dataset[,label])^2)
  return(se)
}
generate_rss(c("crim"), Boston, "medv")
[1] 36275.51

Ahora creamos una funcion que nos permita generar modelos desde 1 hasta 10 variables y hacer la prueba del modelo para generar el RSS:

generate_models <- function(n,dataset,label){
  features <- colnames(dataset)
  features <- setdiff(features , label)
  features_subset<-
    combn(features,n,simplify = FALSE)
  metric <- lapply(features_subset,
                   get_metric,
                   dataset=dataset,
                   label=label)
  index<-
    which.min(metric %>% unlist() )
  return(features_subset[[index]])
  
}
n_features <- ncol(Boston)-1
list_models<-
  lapply(1:n_features,
         generate_models,
         dataset=Boston,
         label = 'medv')
list_models
[[1]]
[1] "lstat"

[[2]]
[1] "rm"    "lstat"

[[3]]
[1] "rm"      "ptratio" "lstat"  

[[4]]
[1] "rm"      "dis"     "ptratio" "lstat"  

[[5]]
[1] "nox"     "rm"      "dis"     "ptratio" "lstat"  

[[6]]
[1] "chas"    "nox"     "rm"      "dis"     "ptratio" "lstat"  

[[7]]
[1] "chas"    "nox"     "rm"      "dis"     "ptratio" "black"   "lstat"  

[[8]]
[1] "zn"      "chas"    "nox"     "rm"      "dis"     "ptratio" "black"   "lstat"  

[[9]]
[1] "crim"    "chas"    "nox"     "rm"      "dis"     "rad"     "ptratio" "black"   "lstat"  

[[10]]
 [1] "crim"    "zn"      "nox"     "rm"      "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"  

[[11]]
 [1] "crim"    "zn"      "chas"    "nox"     "rm"      "dis"     "rad"     "tax"     "ptratio" "black"  
[11] "lstat"  

[[12]]
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "dis"     "rad"     "tax"     "ptratio"
[11] "black"   "lstat"  

[[13]]
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"     "rad"     "tax"    
[11] "ptratio" "black"   "lstat"  

Estos son los 13 mejores modelos desde 1 hasta 13 variables. Ahora nos toca probar cual es el mejor de los 13 modelos, para ello crearemos una funcion loocv para la prueba y una funcion kfolds para hacer la prueba, utilizaremos \(K=10\),

library(caret)
Loading required package: lattice

Attaching package: ‘lattice’

The following object is masked from ‘package:boot’:

    melanoma

Loading required package: ggplot2
loocv <- function(index, dataset, label, features){
  train <- dataset[-index,]
  test <- dataset[index,]
  formula <- paste0(label, "~", paste0(features, collapse="+"), collapse = "")
  fit <- lm(formula, data=train)
  
  test_pred <- predict(fit, test)
  se <- (test_pred-test[,label])^2
  
  return(se)
}
folds <- function(k, dataset, label, features){
  se_folds_all <- lapply(createFolds(dataset[,label], k = k, list = TRUE, returnTrain = FALSE), 
                         loocv, 
                         label=label, 
                         dataset=dataset,
                         features=features)
  return(sum(se_folds_all %>% unlist()))
}

Finalmente hacemos la prueba:

list_models_rss <- lapply(list_models,
               folds,
               k=10,
               dataset=Boston,
               label="medv"
               )
list_models[which.min(list_models_rss)]
[[1]]
 [1] "crim"    "zn"      "nox"     "rm"      "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"  

Finalmente ese es el mejor modelo.

Backward Step Wise

Al igual que el algoritmo anterior, solo que aca empesamos por definir un modelo con todas las variables, luego una con 12 variables y asi sucesivamente hasta llegar a modelos de una variable, donde ganar el que tenga el menor RSS o el mayor \(R^2\), para ello ya tenemos varias funcioens diferentes por lo que solamente invertiremos la funcion generadora de modelos:

backward_list_models<-
  lapply(n_features:1,
         generate_models,
         dataset=Boston,
         label = 'medv')
backward_list_models
[[1]]
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"     "rad"     "tax"    
[11] "ptratio" "black"   "lstat"  

[[2]]
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "dis"     "rad"     "tax"     "ptratio"
[11] "black"   "lstat"  

[[3]]
 [1] "crim"    "zn"      "chas"    "nox"     "rm"      "dis"     "rad"     "tax"     "ptratio" "black"  
[11] "lstat"  

[[4]]
 [1] "crim"    "zn"      "nox"     "rm"      "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"  

[[5]]
[1] "crim"    "chas"    "nox"     "rm"      "dis"     "rad"     "ptratio" "black"   "lstat"  

[[6]]
[1] "zn"      "chas"    "nox"     "rm"      "dis"     "ptratio" "black"   "lstat"  

[[7]]
[1] "chas"    "nox"     "rm"      "dis"     "ptratio" "black"   "lstat"  

[[8]]
[1] "chas"    "nox"     "rm"      "dis"     "ptratio" "lstat"  

[[9]]
[1] "nox"     "rm"      "dis"     "ptratio" "lstat"  

[[10]]
[1] "rm"      "dis"     "ptratio" "lstat"  

[[11]]
[1] "rm"      "ptratio" "lstat"  

[[12]]
[1] "rm"    "lstat"

[[13]]
[1] "lstat"

Y con esto realizamos la misma prueba para ver cual es el mejor modelo:

backward_list_models_rss <- lapply(backward_list_models,
               folds,
               k=10,
               dataset=Boston,
               label="medv"
               )
backward_list_models[which.min(backward_list_models_rss)]
[[1]]
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "dis"     "rad"     "tax"     "ptratio"
[11] "black"   "lstat"  

Como vemos obtenemos el mismo resultado que con el algoritmo anterior.

LS0tCnRpdGxlOiAiRmlhYmlsaWRhZCwgSG9qYSBkZSBUcmFiYWpvIDYiCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KCiMjIyBTdGVwIFdpc2UgU2VsZWN0aW9uCgpFc3RlIGVzIHVuIGFsZ29yaXRtbyBxdWUgbm9zIHBlcm1pdGUgZW5jb250cmFyIGVsIG1lam9yIG1vZGVsbyBwYXJhIHByZWRlY2lyIHVuYSB2YXJpYWJsZSBZIGNvbnRyYSB0b2RhcyBsYXMgcG9zaWJsZXMgY29tYmluYWNpb25lcyBkZSBYLCBlc3RlIGFsZ29yaXRtbyBlcyBlZmljaWVudGUgeSBlcyBjb25zaWRlcmFkbyBkZSBmdWVyemEgYnJ1dGEsIHBlcm8gc2kgZXhpc3RlbiBtdWNoYXMgdmFyaWFibGVzIHB1ZWRlIHNlciB1biBwb2NvIGNvc3Rvc28gY29tcHV0YWNpb25hbG1lbnRlLCBwYXJhIGVsbG8gaGFyZW1vcyBsYXMgc2lndWllbnRlcyBwcnVlYmFzIHV0aWxpemFuZG8gZWwgZGF0YXNldCBkZSBgQm9zdG9uYAoKYGBge3J9CnN1bW1hcnkoQm9zdG9uKQpgYGAKCkFob3JhIHZlYW1vcyBsYSBsaXN0YSBkZSBsYXMgdmFyaWFibGVzOgoKYGBge3J9CmZlYXR1cmVzIDwtIGNvbG5hbWVzKEJvc3RvbikKZmVhdHVyZXMgPC0gc2V0ZGlmZihmZWF0dXJlcyAsICJtZWR2IikKZmVhdHVyZXMKYGBgCgojIyMgRm9yZHdhcmQgU3RlcCBXaXNlCgpFbCBlcyB1biBhbGdvcml0bW8gYmFzYWRvIGVuIGVsIFN0ZXAgV2lzZSBxdWUgY29uc2lzdGUgZW4gY3JlYXIgdW4gbW9kZWxvIHV0aWxpemFuZG8gdW5hIHZhcmlhYmxlLCBjdWFscXVpZXIgZGUgJFhfezB9Li5YX3twfSQgZG9uZGUgcCBlcyBsYSBjYW50aWRhZCBkZSB2YXJpYWJsZXMgcXVlIHRlbmVtb3MsIHkgc2UgdG9tYSBlbCBtb2RlbG8gY29tIGVsIG1lbm9yIFJTUyBvciBtYXlvciAkUl4yJCBjb21vIGVsIGdhbmFkb3IsIGx1ZWdvIGhhcmVtb3MgbGEgbWlzbWEgcHJ1ZWJhIGNvbiB0b2RvcyBsb3MgbW9kZWxvcyBwb3NpYmxlcyBkZSBkb3MgdmFyaWFibGVzLCB5IGFzaSBzdWNlc2l2YW1lbnRlIGhhc3RhIHF1ZSB0ZW5nYW1vcyBsb3MgbWVqb3JlcyBtb2RlbG9zIGJhc2Fkb3MgZW4gMSwgMiwgMywgLi4sIDEwIHZhcmlhYmxlcywgY29uIGVzZSBsaXN0YWRvIGRlIG1vZGVsb3MgaGFyZW1vcyBsYSB1bHRpbWEgcHJ1ZWJhIHV0aWxpemFuZG8ga2ZvbGQgeSB2ZXJlbW9zIGN1YWwgZGUgbG9zIDEwIG1vZGVsb3MgZXMgZWwgZ2FuYWRvci4KCkVtcGVzZW1vcyBwb3IgZGVmaW5pciB1bmEgZnVuY2lvbiBxdWUgc2VndW4gZWwgZGF0YXNldCB5IGxhcyB2YXJpYWJsZXMgYSBwcm9iYXIgZ2VuZXJlIHVuIG1vZGVsbyB5IGNhbGN1bGUgZWwgUlNTOgoKYGBge3J9CmdlbmVyYXRlX3JzcyA8LSBmdW5jdGlvbihmZWF0dXJlcyxkYXRhc2V0LGxhYmVsKXsKICBmb3JtdWxhIDwtIHBhc3RlMChmZWF0dXJlcywgY29sbGFwc2U9IisiKQogIGZvcm11bGEgPC0gcGFzdGUwKGxhYmVsLCJ+Iixmb3JtdWxhLAogICAgICAgICAgICAgICAgICAgIGNvbGxhcHNlID0gIiIpCiAgZml0IDwtIGxtKGZvcm11bGEsZGF0YT1kYXRhc2V0KQogIHRlc3RfcHJlZCA8LSBwcmVkaWN0KGZpdCxkYXRhc2V0KQogIHNlIDwtIHN1bSgodGVzdF9wcmVkLWRhdGFzZXRbLGxhYmVsXSleMikKICByZXR1cm4oc2UpCn0KCmdlbmVyYXRlX3JzcyhjKCJjcmltIiksIEJvc3RvbiwgIm1lZHYiKQpgYGAKQWhvcmEgY3JlYW1vcyB1bmEgZnVuY2lvbiBxdWUgbm9zIHBlcm1pdGEgZ2VuZXJhciBtb2RlbG9zIGRlc2RlIDEgaGFzdGEgMTAgdmFyaWFibGVzIHkgaGFjZXIgbGEgcHJ1ZWJhIGRlbCBtb2RlbG8gcGFyYSBnZW5lcmFyIGVsIFJTUzoKCmBgYHtyfQpnZW5lcmF0ZV9tb2RlbHMgPC0gZnVuY3Rpb24obixkYXRhc2V0LGxhYmVsKXsKICBmZWF0dXJlcyA8LSBjb2xuYW1lcyhkYXRhc2V0KQogIGZlYXR1cmVzIDwtIHNldGRpZmYoZmVhdHVyZXMgLCBsYWJlbCkKICBmZWF0dXJlc19zdWJzZXQ8LQogICAgY29tYm4oZmVhdHVyZXMsbixzaW1wbGlmeSA9IEZBTFNFKQogIG1ldHJpYyA8LSBsYXBwbHkoZmVhdHVyZXNfc3Vic2V0LAogICAgICAgICAgICAgICAgICAgZ2V0X21ldHJpYywKICAgICAgICAgICAgICAgICAgIGRhdGFzZXQ9ZGF0YXNldCwKICAgICAgICAgICAgICAgICAgIGxhYmVsPWxhYmVsKQogIGluZGV4PC0KICAgIHdoaWNoLm1pbihtZXRyaWMgJT4lIHVubGlzdCgpICkKICByZXR1cm4oZmVhdHVyZXNfc3Vic2V0W1tpbmRleF1dKQogIAp9CgpuX2ZlYXR1cmVzIDwtIG5jb2woQm9zdG9uKS0xCgpsaXN0X21vZGVsczwtCiAgbGFwcGx5KDE6bl9mZWF0dXJlcywKICAgICAgICAgZ2VuZXJhdGVfbW9kZWxzLAogICAgICAgICBkYXRhc2V0PUJvc3RvbiwKICAgICAgICAgbGFiZWwgPSAnbWVkdicpCgpsaXN0X21vZGVscwpgYGAKCkVzdG9zIHNvbiBsb3MgMTMgbWVqb3JlcyBtb2RlbG9zIGRlc2RlIDEgaGFzdGEgMTMgdmFyaWFibGVzLiBBaG9yYSBub3MgdG9jYSBwcm9iYXIgY3VhbCBlcyBlbCBtZWpvciBkZSBsb3MgMTMgbW9kZWxvcywgcGFyYSBlbGxvIGNyZWFyZW1vcyB1bmEgZnVuY2lvbiBsb29jdiBwYXJhIGxhIHBydWViYSB5IHVuYSBmdW5jaW9uIGtmb2xkcyBwYXJhIGhhY2VyIGxhIHBydWViYSwgdXRpbGl6YXJlbW9zICRLPTEwJCwgCgoKCmBgYHtyfQpsaWJyYXJ5KGNhcmV0KQpsb29jdiA8LSBmdW5jdGlvbihpbmRleCwgZGF0YXNldCwgbGFiZWwsIGZlYXR1cmVzKXsKICB0cmFpbiA8LSBkYXRhc2V0Wy1pbmRleCxdCiAgdGVzdCA8LSBkYXRhc2V0W2luZGV4LF0KICBmb3JtdWxhIDwtIHBhc3RlMChsYWJlbCwgIn4iLCBwYXN0ZTAoZmVhdHVyZXMsIGNvbGxhcHNlPSIrIiksIGNvbGxhcHNlID0gIiIpCiAgZml0IDwtIGxtKGZvcm11bGEsIGRhdGE9dHJhaW4pCiAgCiAgdGVzdF9wcmVkIDwtIHByZWRpY3QoZml0LCB0ZXN0KQogIHNlIDwtICh0ZXN0X3ByZWQtdGVzdFssbGFiZWxdKV4yCiAgCiAgcmV0dXJuKHNlKQp9Cgpmb2xkcyA8LSBmdW5jdGlvbihrLCBkYXRhc2V0LCBsYWJlbCwgZmVhdHVyZXMpewogIHNlX2ZvbGRzX2FsbCA8LSBsYXBwbHkoY3JlYXRlRm9sZHMoZGF0YXNldFssbGFiZWxdLCBrID0gaywgbGlzdCA9IFRSVUUsIHJldHVyblRyYWluID0gRkFMU0UpLCAKICAgICAgICAgICAgICAgICAgICAgICAgIGxvb2N2LCAKICAgICAgICAgICAgICAgICAgICAgICAgIGxhYmVsPWxhYmVsLCAKICAgICAgICAgICAgICAgICAgICAgICAgIGRhdGFzZXQ9ZGF0YXNldCwKICAgICAgICAgICAgICAgICAgICAgICAgIGZlYXR1cmVzPWZlYXR1cmVzKQogIHJldHVybihzdW0oc2VfZm9sZHNfYWxsICU+JSB1bmxpc3QoKSkpCn0KYGBgCgpGaW5hbG1lbnRlIGhhY2Vtb3MgbGEgcHJ1ZWJhOgoKYGBge3J9Cmxpc3RfbW9kZWxzX3JzcyA8LSBsYXBwbHkobGlzdF9tb2RlbHMsCiAgICAgICAgICAgICAgIGZvbGRzLAogICAgICAgICAgICAgICBrPTEwLAogICAgICAgICAgICAgICBkYXRhc2V0PUJvc3RvbiwKICAgICAgICAgICAgICAgbGFiZWw9Im1lZHYiCiAgICAgICAgICAgICAgICkKCmxpc3RfbW9kZWxzW3doaWNoLm1pbihsaXN0X21vZGVsc19yc3MpXQpgYGAKRmluYWxtZW50ZSBlc2UgZXMgZWwgbWVqb3IgbW9kZWxvLgoKIyMjIEJhY2t3YXJkIFN0ZXAgV2lzZQoKQWwgaWd1YWwgcXVlIGVsIGFsZ29yaXRtbyBhbnRlcmlvciwgc29sbyBxdWUgYWNhIGVtcGVzYW1vcyBwb3IgZGVmaW5pciB1biBtb2RlbG8gY29uIHRvZGFzIGxhcyB2YXJpYWJsZXMsIGx1ZWdvIHVuYSBjb24gMTIgdmFyaWFibGVzIHkgYXNpIHN1Y2VzaXZhbWVudGUgaGFzdGEgbGxlZ2FyIGEgbW9kZWxvcyBkZSB1bmEgdmFyaWFibGUsIGRvbmRlIGdhbmFyIGVsIHF1ZSB0ZW5nYSBlbCBtZW5vciBSU1MgbyBlbCBtYXlvciAkUl4yJCwgcGFyYSBlbGxvIHlhIHRlbmVtb3MgdmFyaWFzIGZ1bmNpb2VucyBkaWZlcmVudGVzIHBvciBsbyBxdWUgc29sYW1lbnRlIGludmVydGlyZW1vcyBsYSBmdW5jaW9uIGdlbmVyYWRvcmEgZGUgbW9kZWxvczoKCmBgYHtyfQpiYWNrd2FyZF9saXN0X21vZGVsczwtCiAgbGFwcGx5KG5fZmVhdHVyZXM6MSwKICAgICAgICAgZ2VuZXJhdGVfbW9kZWxzLAogICAgICAgICBkYXRhc2V0PUJvc3RvbiwKICAgICAgICAgbGFiZWwgPSAnbWVkdicpCgpiYWNrd2FyZF9saXN0X21vZGVscwpgYGAKWSBjb24gZXN0byByZWFsaXphbW9zIGxhIG1pc21hIHBydWViYSBwYXJhIHZlciBjdWFsIGVzIGVsIG1lam9yIG1vZGVsbzoKCmBgYHtyfQpiYWNrd2FyZF9saXN0X21vZGVsc19yc3MgPC0gbGFwcGx5KGJhY2t3YXJkX2xpc3RfbW9kZWxzLAogICAgICAgICAgICAgICBmb2xkcywKICAgICAgICAgICAgICAgaz0xMCwKICAgICAgICAgICAgICAgZGF0YXNldD1Cb3N0b24sCiAgICAgICAgICAgICAgIGxhYmVsPSJtZWR2IgogICAgICAgICAgICAgICApCgpiYWNrd2FyZF9saXN0X21vZGVsc1t3aGljaC5taW4oYmFja3dhcmRfbGlzdF9tb2RlbHNfcnNzKV0KYGBgCgpDb21vIHZlbW9zIG9idGVuZW1vcyBlbCBtaXNtbyByZXN1bHRhZG8gcXVlIGNvbiBlbCBhbGdvcml0bW8gYW50ZXJpb3Iu