Professor: Leandro Maciel Almeida
Boosting is one of the ensemble learning techniques in machine learning and is widely used in regression and classification problems. The central idea of the method is to improve (boost) weak learners sequentially, increasing the accuracy of the final model by combining them.
There are several boosting algorithms, such as gradient boosting, AdaBoost (Adaptive Boosting), XGBoost, and others. In this work we classify the GLASS data with the gbm (Gradient Boosting Model) method from the gbm (Generalized Boosted Models) package. This package implements J. Friedman's gradient boosting machine and extensions of the AdaBoost algorithm.
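To make the sequential idea concrete, the sketch below (an illustration only, not part of the assignment code) builds a squared-error booster by hand: each small rpart tree is fitted to the residuals of the current ensemble and added with a shrinkage step, which is the core of Friedman's gradient boosting.

library(rpart)

boost_sketch <- function(df, n_trees = 50, shrinkage = 0.1) {
  # df is assumed to have a numeric response column named y
  pred <- rep(mean(df$y), nrow(df))          # start from a constant prediction
  trees <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    df$res <- df$y - pred                    # pseudo-residuals of the squared loss
    trees[[m]] <- rpart(res ~ . - y, data = df, maxdepth = 2)  # weak learner on residuals
    pred <- pred + shrinkage * predict(trees[[m]], df)         # small corrective step
  }
  list(trees = trees, fitted = pred)
}

# Hypothetical usage with a built-in data set:
# fit <- boost_sketch(data.frame(y = mtcars$mpg, mtcars[, -1]))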
library(gbm)
## Loaded gbm 2.1.8
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
glass_csv <- read.csv(file.choose(), header = FALSE, sep = ",")
colnames(glass_csv) <- c("id","ri","na","mg","al","si","k","ca","ba","fe","tipo")
head(glass_csv)
## id ri na mg al si k ca ba fe tipo
## 1 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
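Before splitting the data it can help to look at how the observations are spread over the glass types; the check below is a suggestion and does not appear in the original run.

table(glass_csv$tipo)                        # counts per glass type
round(prop.table(table(glass_csv$tipo)), 3)  # relative frequencies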
index1 <- createDataPartition(glass_csv$tipo, p = 0.7, list = FALSE)
treinogbm <- glass_csv[index1,]
testegbm <- glass_csv[-index1,]
gbmglass <- gbm(tipo ~ .,
                data = treinogbm,
                distribution = "multinomial",  # multiclass target (glass type)
                cv.folds = 5,                  # 5-fold cross-validation
                shrinkage = 0.1,               # learning rate
                n.minobsinnode = 0.01,         # minimum observations per terminal node
                n.trees = 200)                 # number of boosting iterations
## Warning: Setting `distribution = "multinomial"` is ill-advised as it is
## currently broken. It exists only for backwards compatibility. Use at your own
## risk.
print(gbmglass)
## gbm(formula = tipo ~ ., distribution = "multinomial", data = treinogbm,
## n.trees = 200, n.minobsinnode = 0.01, shrinkage = 0.1, cv.folds = 5)
## A gradient boosted model with multinomial loss function.
## 200 iterations were performed.
## The best cross-validation iteration was 158.
## There were 10 predictors of which 10 had non-zero influence.
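The summary above reports iteration 158 as the best cross-validation iteration. As a follow-up (not shown in the original output), gbm.perf() returns that estimate programmatically, and it could be passed to predict() instead of the fixed 200 trees used below.

best_iter <- gbm.perf(gbmglass, method = "cv", plot.it = FALSE)  # CV estimate of the optimal number of trees
print(best_iter)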
predi <- predict.gbm(object = gbmglass,
                     newdata = testegbm,
                     n.trees = 200,
                     type = "response")
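# Note (description of the returned object, not shown in the original output):
# with distribution = "multinomial" and type = "response", predict.gbm() returns
# an array of class probabilities with one column per glass type (and one slice
# per requested n.trees value), so which.max() over each row picks the most
# probable class and colnames() supplies its label.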
labels <- colnames(predi)[apply(predi, 1, which.max)]
result <- data.frame(testegbm$tipo, labels)
print(result)
## testegbm.tipo labels
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 1 1
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 11 1 1
## 12 1 1
## 13 1 1
## 14 1 1
## 15 1 1
## 16 1 1
## 17 1 1
## 18 1 1
## 19 1 1
## 20 1 1
## 21 1 1
## 22 1 2
## 23 2 2
## 24 2 2
## 25 2 2
## 26 2 2
## 27 2 2
## 28 2 2
## 29 2 2
## 30 2 2
## 31 2 2
## 32 2 2
## 33 2 2
## 34 2 2
## 35 2 2
## 36 2 2
## 37 2 2
## 38 2 2
## 39 2 2
## 40 2 2
## 41 2 2
## 42 2 2
## 43 2 3
## 44 3 3
## 45 3 3
## 46 3 3
## 47 3 3
## 48 3 3
## 49 5 5
## 50 5 5
## 51 5 5
## 52 5 5
## 53 6 6
## 54 6 6
## 55 6 6
## 56 6 6
## 57 7 7
## 58 7 7
## 59 7 7
## 60 7 7
## 61 7 7
## 62 7 7
## 63 7 7
testegbm$tipo <- as.factor(testegbm$tipo)
cm <- confusionMatrix(testegbm$tipo, as.factor(labels))
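# Note on argument order (an observation, not a change to the original run):
# caret::confusionMatrix(data, reference) expects the predicted classes first
# and the true classes second; here they are passed the other way around, so
# the rows labelled "Prediction" below actually hold the true types. The
# overall accuracy is unaffected, but per-class statistics such as sensitivity
# and positive predictive value swap roles relative to the usual orientation.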
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 21 1 0 0 0 0
## 2 0 20 1 0 0 0
## 3 0 0 5 0 0 0
## 5 0 0 0 4 0 0
## 6 0 0 0 0 4 0
## 7 0 0 0 0 0 7
##
## Overall Statistics
##
## Accuracy : 0.9683
## 95% CI : (0.89, 0.9961)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9574
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 1.0000 0.9524 0.83333 1.00000 1.00000 1.0000
## Specificity 0.9762 0.9762 1.00000 1.00000 1.00000 1.0000
## Pos Pred Value 0.9545 0.9524 1.00000 1.00000 1.00000 1.0000
## Neg Pred Value 1.0000 0.9762 0.98276 1.00000 1.00000 1.0000
## Prevalence 0.3333 0.3333 0.09524 0.06349 0.06349 0.1111
## Detection Rate 0.3333 0.3175 0.07937 0.06349 0.06349 0.1111
## Detection Prevalence 0.3492 0.3333 0.07937 0.06349 0.06349 0.1111
## Balanced Accuracy 0.9881 0.9643 0.91667 1.00000 1.00000 1.0000
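The same summary figures can also be pulled directly from the caret object, for example (a brief sketch, not part of the original output):

cm$overall["Accuracy"]                                                # overall accuracy shown above
cm$byClass[, c("Sensitivity", "Specificity", "Balanced Accuracy")]    # per-class statistics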
Applying Gradient Boosting Machines, we obtained an accuracy of 96.83% when classifying the Glass data.