First, we load the required libraries.
library(data.table)
library(Boruta)
## Loading required package: ranger
We read the data from a CSV file, treating strings as categorical variables (R factors) via the stringsAsFactors parameter.
datos <- fread("bank.csv", stringsAsFactors = TRUE)
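To see what stringsAsFactors changes, the following minimal sketch reads a small inline CSV twice. The three rows are hypothetical and only borrow column names from bank.csv; the text argument of fread is used so that the example does not depend on the file.

```r
library(data.table)

# Hypothetical three-row extract; only the column names mirror bank.csv.
csv_text <- "age,job,y\n30,management,no\n45,technician,yes\n52,management,no\n"

# Default behaviour: string columns are read as character vectors.
as_text <- fread(text = csv_text)

# With stringsAsFactors = TRUE, string columns become factors.
as_factors <- fread(text = csv_text, stringsAsFactors = TRUE)

class(as_text$job)     # character
class(as_factors$job)  # factor
```

With factors, summary() reports counts per level instead of only the column length and class.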
We also show a summary of the dataset.
summary(datos)
## age job marital education
## Min. :19.00 management :969 divorced: 528 primary : 678
## 1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306
## Median :39.00 technician :768 single :1196 tertiary :1350
## Mean :41.17 admin. :478 unknown : 187
## 3rd Qu.:49.00 services :417
## Max. :87.00 retired :230
## (Other) :713
## default balance housing loan contact
## no :4445 Min. :-3313 no :1962 no :3830 cellular :2896
## yes: 76 1st Qu.: 69 yes:2559 yes: 691 telephone: 301
## Median : 444 unknown :1324
## Mean : 1423
## 3rd Qu.: 1480
## Max. :71188
##
## day month duration campaign
## Min. : 1.00 may :1398 Min. : 4 Min. : 1.000
## 1st Qu.: 9.00 jul : 706 1st Qu.: 104 1st Qu.: 1.000
## Median :16.00 aug : 633 Median : 185 Median : 2.000
## Mean :15.92 jun : 531 Mean : 264 Mean : 2.794
## 3rd Qu.:21.00 nov : 389 3rd Qu.: 329 3rd Qu.: 3.000
## Max. :31.00 apr : 293 Max. :3025 Max. :50.000
## (Other): 571
## pdays previous poutcome y
## Min. : -1.00 Min. : 0.0000 failure: 490 no :4000
## 1st Qu.: -1.00 1st Qu.: 0.0000 other : 197 yes: 521
## Median : -1.00 Median : 0.0000 success: 129
## Mean : 39.77 Mean : 0.5426 unknown:3705
## 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :871.00 Max. :25.0000
##
We estimate the importance of each attribute with the Boruta function.
boruta.model <- Boruta(y~., data = datos, doTrace = 2)
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## 11. run of importance source...
## After 11 iterations, +34 secs:
## confirmed 9 attributes: age, contact, day, duration, housing and 4 more;
## still have 7 attributes left.
## 12. run of importance source...
## 13. run of importance source...
## 14. run of importance source...
## 15. run of importance source...
## 16. run of importance source...
## 17. run of importance source...
## 18. run of importance source...
## 19. run of importance source...
## After 19 iterations, +59 secs:
## confirmed 2 attributes: education, marital;
## still have 5 attributes left.
## 20. run of importance source...
## 21. run of importance source...
## 22. run of importance source...
## After 22 iterations, +1.1 mins:
## rejected 1 attribute: job;
## still have 4 attributes left.
## 23. run of importance source...
## 24. run of importance source...
## 25. run of importance source...
## 26. run of importance source...
## 27. run of importance source...
## 28. run of importance source...
## After 28 iterations, +1.5 mins:
## confirmed 1 attribute: campaign;
## still have 3 attributes left.
## 29. run of importance source...
## 30. run of importance source...
## 31. run of importance source...
## After 31 iterations, +1.6 mins:
## confirmed 1 attribute: loan;
## still have 2 attributes left.
## 32. run of importance source...
## 33. run of importance source...
## 34. run of importance source...
## 35. run of importance source...
## 36. run of importance source...
## 37. run of importance source...
## 38. run of importance source...
## 39. run of importance source...
## 40. run of importance source...
## 41. run of importance source...
## 42. run of importance source...
## 43. run of importance source...
## 44. run of importance source...
## 45. run of importance source...
## 46. run of importance source...
## 47. run of importance source...
## 48. run of importance source...
## 49. run of importance source...
## 50. run of importance source...
## 51. run of importance source...
## 52. run of importance source...
## 53. run of importance source...
## 54. run of importance source...
## 55. run of importance source...
## 56. run of importance source...
## 57. run of importance source...
## 58. run of importance source...
## 59. run of importance source...
## 60. run of importance source...
## 61. run of importance source...
## 62. run of importance source...
## 63. run of importance source...
## 64. run of importance source...
## 65. run of importance source...
## 66. run of importance source...
## 67. run of importance source...
## 68. run of importance source...
## 69. run of importance source...
## 70. run of importance source...
## 71. run of importance source...
## 72. run of importance source...
## 73. run of importance source...
## 74. run of importance source...
## 75. run of importance source...
## 76. run of importance source...
## 77. run of importance source...
## 78. run of importance source...
## 79. run of importance source...
## 80. run of importance source...
## 81. run of importance source...
## 82. run of importance source...
## 83. run of importance source...
## 84. run of importance source...
## 85. run of importance source...
## 86. run of importance source...
## 87. run of importance source...
## 88. run of importance source...
## 89. run of importance source...
## 90. run of importance source...
## 91. run of importance source...
## 92. run of importance source...
## 93. run of importance source...
## 94. run of importance source...
## 95. run of importance source...
## 96. run of importance source...
## 97. run of importance source...
## 98. run of importance source...
## 99. run of importance source...
print(boruta.model)
## Boruta performed 99 iterations in 5.259073 mins.
## 13 attributes confirmed important: age, campaign, contact, day,
## duration and 8 more;
## 1 attributes confirmed unimportant: job;
## 2 tentative attributes left: balance, default;
plot(boruta.model)
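Boruta judges each attribute by comparing its importance with that of shuffled "shadow" copies of the attributes. The base R sketch below illustrates only how such shadow attributes can be built. It does not call Boruta or fit a random forest, and the toy data and column names are hypothetical.

```r
# Sketch of the shadow-attribute idea only; this is not Boruta's own code.
set.seed(1)  # arbitrary seed, used only to make the shuffles repeatable

# Hypothetical toy data: x1 drives the target, x2 is pure noise.
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- factor(ifelse(toy$x1 > 0, "yes", "no"))

# A shadow attribute is a row-shuffled copy of a real attribute: it keeps the
# same values but loses any relationship with the target.
predictors <- c("x1", "x2")
shadows <- as.data.frame(lapply(toy[predictors], sample))
names(shadows) <- paste0("shadow_", predictors)

# The importance source would be run on real and shadow attributes together.
extended <- cbind(toy[predictors], shadows)

# Same values as the original column, in a different row order.
identical(sort(extended$x1), sort(extended$shadow_x1))
```

An attribute that repeatedly scores higher than the best shadow is confirmed, one that repeatedly fails to do so is rejected, and the rest stay tentative until the run limit is reached.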
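The run ended with two attributes (balance and default) still tentative, so we resolve them with TentativeRoughFix, which applies a simpler test: a tentative attribute is confirmed if its median importance exceeds the median importance of the best shadow attribute, and rejected otherwise.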
boruta.model2 <- TentativeRoughFix(boruta.model)
print(boruta.model2)
## Boruta performed 99 iterations in 5.259073 mins.
## Tentatives roughfixed over the last 99 iterations.
## 14 attributes confirmed important: age, campaign, contact, day,
## default and 9 more;
## 2 attributes confirmed unimportant: balance, job;
plot(boruta.model2)
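The median comparison behind TentativeRoughFix can be sketched in base R. The importance history below is hypothetical (made-up numbers and attribute names); in a real Boruta object the per-run importances are stored in the ImpHistory matrix, which includes a shadowMax column.

```r
# Hypothetical importance history: one row per run, one column per attribute,
# plus the per-run maximum importance among the shadow attributes.
imp_history <- cbind(
  attr_a    = c(2.9, 2.4, 2.7, 2.6, 2.8),
  attr_b    = c(2.1, 2.6, 2.3, 2.2, 2.4),
  shadowMax = c(2.5, 2.4, 2.6, 2.5, 2.3)
)
tentative <- c("attr_a", "attr_b")

# Median importance of each tentative attribute and of the best shadow.
median_imp    <- apply(imp_history[, tentative, drop = FALSE], 2, median)
median_shadow <- median(imp_history[, "shadowMax"])

# Confirm attributes whose median beats the shadow median; reject the rest.
decision <- ifelse(median_imp > median_shadow, "Confirmed", "Rejected")
print(decision)
```

Here attr_a (median 2.7) is confirmed and attr_b (median 2.3) is rejected against a shadow median of 2.5. This is a weaker criterion than the statistical test used during the main run, which is why the function is described as a rough fix.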
We obtain the names of the selected attributes.
getSelectedAttributes(boruta.model2, withTentative = FALSE)
## [1] "age" "marital" "education" "default" "housing"
## [6] "loan" "contact" "day" "month" "duration"
## [11] "campaign" "pdays" "previous" "poutcome"
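One possible next step, shown here only as a sketch, is to turn the selected names into a model formula with base R's reformulate. The vector below is copied from the output above; the name model_formula is our own.

```r
# Attribute names copied from the getSelectedAttributes() output above.
selected <- c("age", "marital", "education", "default", "housing",
              "loan", "contact", "day", "month", "duration",
              "campaign", "pdays", "previous", "poutcome")

# Build y ~ age + marital + ... from the character vector.
model_formula <- reformulate(selected, response = "y")
print(model_formula)

# With the data loaded, the same vector could also subset the table, e.g.
# datos[, c(selected, "y"), with = FALSE]   (data.table syntax)
```

The resulting formula can be passed to any modelling function that accepts the formula interface.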
Finally, we build the attribute ranking, sorted by decreasing mean importance.
x <- attStats(boruta.model2)
ranking <- data.table(attribute=rownames(x), x)[order(-meanImp)]
print(ranking)
## attribute meanImp medianImp minImp maxImp normHits
## 1: duration 74.0363054 74.3359891 67.6250154 79.837327 1.00000000
## 2: poutcome 29.5822569 29.5615634 27.1130699 32.731257 1.00000000
## 3: month 26.9336883 27.0123499 23.3668577 30.564676 1.00000000
## 4: pdays 21.8499979 21.7154333 19.8187257 24.093880 1.00000000
## 5: contact 19.0998746 19.0543470 16.4497486 21.721025 1.00000000
## 6: age 14.1080634 14.1203027 11.2456871 16.837757 1.00000000
## 7: previous 14.0033089 13.9808159 12.3333329 15.845053 1.00000000
## 8: day 13.2493060 13.2311758 10.4594950 15.977183 1.00000000
## 9: housing 11.6987730 11.6496796 9.3591637 14.896684 1.00000000
## 10: campaign 3.7145156 3.7017747 0.8651600 5.636992 0.81818182
## 11: loan 3.4438427 3.3984611 0.2185968 6.398775 0.74747475
## 12: marital 3.2825237 3.2870398 0.6530748 5.812836 0.78787879
## 13: default 2.6426690 2.6285467 0.7769855 4.775365 0.62626263
## 14: education 2.5592933 2.6704746 -0.5060725 5.517054 0.56565657
## 15: balance 2.4911078 2.4279182 0.1524011 5.071215 0.50505051
## 16: job 0.4412679 0.4009689 -1.7843487 2.319369 0.03030303
## decision
## 1: Confirmed
## 2: Confirmed
## 3: Confirmed
## 4: Confirmed
## 5: Confirmed
## 6: Confirmed
## 7: Confirmed
## 8: Confirmed
## 9: Confirmed
## 10: Confirmed
## 11: Confirmed
## 12: Confirmed
## 13: Confirmed
## 14: Confirmed
## 15: Rejected
## 16: Rejected
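The normHits column is the number of hits (runs in which the attribute scored higher than the best shadow attribute) divided by the number of importance runs. As a quick check, the sketch below recovers the raw hit counts from the values printed above, assuming the 99 runs reported by print(boruta.model).

```r
# Number of importance runs reported by print(boruta.model).
runs <- 99

# normHits values copied from the ranking above (attributes below 1.0 only).
norm_hits <- c(campaign = 0.81818182, loan = 0.74747475, marital = 0.78787879,
               default = 0.62626263, education = 0.56565657,
               balance = 0.50505051, job = 0.03030303)

# Multiplying by the number of runs gives back whole hit counts.
hits <- round(norm_hits * runs)
print(hits)
```

For example, campaign beat the best shadow in 81 of the 99 runs, while job did so in only 3.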
Further information is available at https://www.datacamp.com/community/tutorials/feature-selection-R-boruta.