Required libraries

First, we load the required libraries.

library(data.table)
library(Boruta)
## Loading required package: ranger

Data extraction

We read the data from a CSV file, treating strings as categorical variables (R's factor type) via the stringsAsFactors parameter.

datos <- fread("bank.csv", stringsAsFactors = TRUE)
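To see what stringsAsFactors changes, the following minimal sketch reads the same small CSV text twice. The inline text is a made-up stand-in for bank.csv, and the names csv_text, as_chr and as_fct are illustrative only.

```r
library(data.table)

# Toy CSV text standing in for bank.csv (hypothetical rows, not the real data)
csv_text <- "age,job,y\n30,admin.,no\n45,technician,yes\n52,admin.,no\n"

# Default behaviour: string columns stay as character vectors
as_chr <- fread(text = csv_text)

# With stringsAsFactors = TRUE: string columns become factors
as_fct <- fread(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$job)   # character
class(as_fct$job)   # factor
levels(as_fct$y)    # the distinct labels found in the column
```

Factors matter here because the target y has to be categorical for Boruta to treat the problem as classification.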

We also display a summary of the dataset.

summary(datos)
##       age                 job          marital         education   
##  Min.   :19.00   management :969   divorced: 528   primary  : 678  
##  1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306  
##  Median :39.00   technician :768   single  :1196   tertiary :1350  
##  Mean   :41.17   admin.     :478                   unknown  : 187  
##  3rd Qu.:49.00   services   :417                                   
##  Max.   :87.00   retired    :230                                   
##                  (Other)    :713                                   
##  default       balance      housing     loan           contact    
##  no :4445   Min.   :-3313   no :1962   no :3830   cellular :2896  
##  yes:  76   1st Qu.:   69   yes:2559   yes: 691   telephone: 301  
##             Median :  444                         unknown  :1324  
##             Mean   : 1423                                         
##             3rd Qu.: 1480                                         
##             Max.   :71188                                         
##                                                                   
##       day            month         duration       campaign     
##  Min.   : 1.00   may    :1398   Min.   :   4   Min.   : 1.000  
##  1st Qu.: 9.00   jul    : 706   1st Qu.: 104   1st Qu.: 1.000  
##  Median :16.00   aug    : 633   Median : 185   Median : 2.000  
##  Mean   :15.92   jun    : 531   Mean   : 264   Mean   : 2.794  
##  3rd Qu.:21.00   nov    : 389   3rd Qu.: 329   3rd Qu.: 3.000  
##  Max.   :31.00   apr    : 293   Max.   :3025   Max.   :50.000  
##                  (Other): 571                                  
##      pdays           previous          poutcome      y       
##  Min.   : -1.00   Min.   : 0.0000   failure: 490   no :4000  
##  1st Qu.: -1.00   1st Qu.: 0.0000   other  : 197   yes: 521  
##  Median : -1.00   Median : 0.0000   success: 129             
##  Mean   : 39.77   Mean   : 0.5426   unknown:3705             
##  3rd Qu.: -1.00   3rd Qu.: 0.0000                            
##  Max.   :871.00   Max.   :25.0000                            
## 
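The summary shows that the target y is imbalanced (4000 "no" against 521 "yes"). As a quick check, the class shares can be computed from those two counts alone; the variable names below are illustrative.

```r
# Class counts of y, copied from the summary() output above
counts <- c(no = 4000, yes = 521)

# Share of each class among the 4521 rows
share <- counts / sum(counts)
round(share, 3)
```

With the real data loaded, prop.table(table(datos$y)) should give the same shares.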

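Computing attribute importance

Determining attribute importance

We determine attribute importance with the Boruta function, which repeatedly compares each attribute's importance against that of randomly permuted copies of the attributes (shadow attributes). Setting doTrace = 2 reports every importance run and every decision as soon as it is made.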

boruta.model <- Boruta(y~., data = datos, doTrace = 2)
##  1. run of importance source...
##  2. run of importance source...
##  3. run of importance source...
##  4. run of importance source...
##  5. run of importance source...
##  6. run of importance source...
##  7. run of importance source...
##  8. run of importance source...
##  9. run of importance source...
##  10. run of importance source...
##  11. run of importance source...
## After 11 iterations, +34 secs:
##  confirmed 9 attributes: age, contact, day, duration, housing and 4 more;
##  still have 7 attributes left.
##  12. run of importance source...
##  13. run of importance source...
##  14. run of importance source...
##  15. run of importance source...
##  16. run of importance source...
##  17. run of importance source...
##  18. run of importance source...
##  19. run of importance source...
## After 19 iterations, +59 secs:
##  confirmed 2 attributes: education, marital;
##  still have 5 attributes left.
##  20. run of importance source...
##  21. run of importance source...
##  22. run of importance source...
## After 22 iterations, +1.1 mins:
##  rejected 1 attribute: job;
##  still have 4 attributes left.
##  23. run of importance source...
##  24. run of importance source...
##  25. run of importance source...
##  26. run of importance source...
##  27. run of importance source...
##  28. run of importance source...
## After 28 iterations, +1.5 mins:
##  confirmed 1 attribute: campaign;
##  still have 3 attributes left.
##  29. run of importance source...
##  30. run of importance source...
##  31. run of importance source...
## After 31 iterations, +1.6 mins:
##  confirmed 1 attribute: loan;
##  still have 2 attributes left.
##  32. run of importance source...
##  33. run of importance source...
##  34. run of importance source...
##  35. run of importance source...
##  36. run of importance source...
##  37. run of importance source...
##  38. run of importance source...
##  39. run of importance source...
##  40. run of importance source...
##  41. run of importance source...
##  42. run of importance source...
##  43. run of importance source...
##  44. run of importance source...
##  45. run of importance source...
##  46. run of importance source...
##  47. run of importance source...
##  48. run of importance source...
##  49. run of importance source...
##  50. run of importance source...
##  51. run of importance source...
##  52. run of importance source...
##  53. run of importance source...
##  54. run of importance source...
##  55. run of importance source...
##  56. run of importance source...
##  57. run of importance source...
##  58. run of importance source...
##  59. run of importance source...
##  60. run of importance source...
##  61. run of importance source...
##  62. run of importance source...
##  63. run of importance source...
##  64. run of importance source...
##  65. run of importance source...
##  66. run of importance source...
##  67. run of importance source...
##  68. run of importance source...
##  69. run of importance source...
##  70. run of importance source...
##  71. run of importance source...
##  72. run of importance source...
##  73. run of importance source...
##  74. run of importance source...
##  75. run of importance source...
##  76. run of importance source...
##  77. run of importance source...
##  78. run of importance source...
##  79. run of importance source...
##  80. run of importance source...
##  81. run of importance source...
##  82. run of importance source...
##  83. run of importance source...
##  84. run of importance source...
##  85. run of importance source...
##  86. run of importance source...
##  87. run of importance source...
##  88. run of importance source...
##  89. run of importance source...
##  90. run of importance source...
##  91. run of importance source...
##  92. run of importance source...
##  93. run of importance source...
##  94. run of importance source...
##  95. run of importance source...
##  96. run of importance source...
##  97. run of importance source...
##  98. run of importance source...
##  99. run of importance source...
print(boruta.model)
## Boruta performed 99 iterations in 5.259073 mins.
##  13 attributes confirmed important: age, campaign, contact, day,
## duration and 8 more;
##  1 attributes confirmed unimportant: job;
##  2 tentative attributes left: balance, default;
plot(boruta.model)
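Because Boruta is stochastic, repeated runs can give slightly different decisions, and a run that reaches the iteration limit may leave attributes tentative. The sketch below, on synthetic data rather than bank.csv, fixes the random seed and raises maxRuns above its default of 100. The data frame toy, its columns and the object toy.model are hypothetical names used only for illustration.

```r
library(Boruta)

set.seed(42)  # fix the seed so the run is repeatable

# Synthetic data (hypothetical): y depends on x1 and x2, x3 is pure noise
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.3) > 0, "yes", "no"))

# maxRuns = 200 allows more importance runs than the default before giving up
toy.model <- Boruta(y ~ ., data = toy, maxRuns = 200, doTrace = 0)
print(toy.model)
```

Whether a larger maxRuns resolves balance and default on the real data is not something this sketch can show; it only illustrates the two parameters.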

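Refining the model to resolve tentative attributes

The run above stopped after 99 iterations with two attributes still tentative (balance and default). TentativeRoughFix resolves them with a weaker test: a tentative attribute is confirmed if its median importance exceeds the median importance of the best shadow attribute, and rejected otherwise.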

boruta.model2 <- TentativeRoughFix(boruta.model)
print(boruta.model2)
## Boruta performed 99 iterations in 5.259073 mins.
## Tentatives roughfixed over the last 99 iterations.
##  14 attributes confirmed important: age, campaign, contact, day,
## default and 9 more;
##  2 attributes confirmed unimportant: balance, job;
plot(boruta.model2)
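As a small illustration of this step, the sketch below runs Boruta with a deliberately short maxRuns on synthetic data, so that some attributes may remain tentative, and then applies TentativeRoughFix only if needed. The data and the names toy, short.run and fixed are hypothetical.

```r
library(Boruta)

set.seed(1)

# Synthetic data (hypothetical): x1 is strongly related to y, x2 weakly, x3 and x4 not at all
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + 0.3 * toy$x2 + rnorm(n) > 0, "yes", "no"))

# A short run: few iterations, so some decisions may stay tentative
short.run <- Boruta(y ~ ., data = toy, maxRuns = 15)

# Apply the rough fix only when there is something to fix
fixed <- if (any(short.run$finalDecision == "Tentative")) {
  TentativeRoughFix(short.run)
} else {
  short.run
}

# After this step every attribute is either Confirmed or Rejected
table(fixed$finalDecision)
```

The conditional avoids the warning TentativeRoughFix raises when there are no tentative attributes left.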

Viewing the results

Listing the selected attributes

We obtain the names of the attributes confirmed as important.

getSelectedAttributes(boruta.model2, withTentative = FALSE)
##  [1] "age"       "marital"   "education" "default"   "housing"  
##  [6] "loan"      "contact"   "day"       "month"     "duration" 
## [11] "campaign"  "pdays"     "previous"  "poutcome"
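A typical next step is to keep only the selected columns plus the target. The sketch below shows the data.table subsetting idiom on a toy table; toy, selected and reduced are hypothetical names, and selected stands in for the character vector returned by getSelectedAttributes.

```r
library(data.table)

# Toy stand-in for datos, with made-up values and only a few columns
toy <- data.table(
  age      = c(30, 45, 52),
  job      = factor(c("admin.", "technician", "admin.")),
  duration = c(120, 300, 80),
  y        = factor(c("no", "yes", "no"))
)

# Stand-in for the output of getSelectedAttributes()
selected <- c("age", "duration")

# with = FALSE makes data.table treat the vector as column names
reduced <- toy[, c(selected, "y"), with = FALSE]
names(reduced)
```

On the real data the same pattern would read datos[, c(selected, "y"), with = FALSE].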

Building the attribute ranking

Finally, we build the attribute ranking, sorted by decreasing mean importance.

x <- attStats(boruta.model2)
ranking <- data.table(attribute=rownames(x), x)[order(-meanImp)]
print(ranking)
##     attribute    meanImp  medianImp     minImp    maxImp   normHits  decision
##  1:  duration 74.0363054 74.3359891 67.6250154 79.837327 1.00000000 Confirmed
##  2:  poutcome 29.5822569 29.5615634 27.1130699 32.731257 1.00000000 Confirmed
##  3:     month 26.9336883 27.0123499 23.3668577 30.564676 1.00000000 Confirmed
##  4:     pdays 21.8499979 21.7154333 19.8187257 24.093880 1.00000000 Confirmed
##  5:   contact 19.0998746 19.0543470 16.4497486 21.721025 1.00000000 Confirmed
##  6:       age 14.1080634 14.1203027 11.2456871 16.837757 1.00000000 Confirmed
##  7:  previous 14.0033089 13.9808159 12.3333329 15.845053 1.00000000 Confirmed
##  8:       day 13.2493060 13.2311758 10.4594950 15.977183 1.00000000 Confirmed
##  9:   housing 11.6987730 11.6496796  9.3591637 14.896684 1.00000000 Confirmed
## 10:  campaign  3.7145156  3.7017747  0.8651600  5.636992 0.81818182 Confirmed
## 11:      loan  3.4438427  3.3984611  0.2185968  6.398775 0.74747475 Confirmed
## 12:   marital  3.2825237  3.2870398  0.6530748  5.812836 0.78787879 Confirmed
## 13:   default  2.6426690  2.6285467  0.7769855  4.775365 0.62626263 Confirmed
## 14: education  2.5592933  2.6704746 -0.5060725  5.517054 0.56565657 Confirmed
## 15:   balance  2.4911078  2.4279182  0.1524011  5.071215 0.50505051  Rejected
## 16:       job  0.4412679  0.4009689 -1.7843487  2.319369 0.03030303  Rejected
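The normHits column is the number of runs in which an attribute beat the best shadow attribute, divided by the number of importance runs. As a quick check under that reading, multiplying a few normHits values from the ranking by the 99 runs reported above recovers whole hit counts; the variable names are illustrative.

```r
# normHits values copied from the ranking for three attributes
norm_hits <- c(campaign = 0.81818182, loan = 0.74747475, job = 0.03030303)

# Number of importance runs reported by print(boruta.model2)
runs <- 99

# Recover the underlying hit counts
hits <- round(norm_hits * runs)
hits
```

This makes the contrast concrete: job beat the best shadow attribute in only a handful of runs, whereas campaign did so in most of them.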

More information

Further reading is available at https://www.datacamp.com/community/tutorials/feature-selection-R-boruta.