Boosting

Gradient Boosting Classification

Boosting is an ensemble learning technique in machine learning, widely used for regression and classification problems. Its core idea is to train learners sequentially, each one improving (boosting) on the previous ones, and to combine them into a single model with higher accuracy.
There are several boosting algorithms, such as gradient boosting, AdaBoost (Adaptive Boosting), XGBoost, and others. In this work, we classify the Glass data with the gbm function from the gbm package (Generalized Boosted Regression Models). This package implements J. Friedman's gradient boosting machines and the AdaBoost algorithm.
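To make the sequential idea concrete, here is a minimal base-R sketch of gradient boosting for squared-error regression with one-split trees (stumps). It is an illustration under simplifying assumptions (one feature, regression instead of multi-class classification), not the algorithm as implemented in gbm; `fit_stump()`, `predict_stump()`, and `boost()` are hypothetical helpers and the data are simulated.

```r
# Minimal gradient boosting sketch for squared-error regression (base R only).
# fit_stump(), predict_stump() and boost() are hypothetical helpers,
# not functions from the gbm package.

# Fit a one-split regression stump to residuals r on a numeric feature x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between sorted values
  best <- list(sse = Inf)
  for (thr in thresholds) {
    left <- x <= thr
    mean_left <- mean(r[left])
    mean_right <- mean(r[!left])
    sse <- sum((r[left] - mean_left)^2) + sum((r[!left] - mean_right)^2)
    if (sse < best$sse) {
      best <- list(sse = sse, thr = thr, left = mean_left, right = mean_right)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$thr, stump$left, stump$right)
}

boost <- function(x, y, n_trees = 50, shrinkage = 0.1) {
  pred <- rep(mean(y), length(y))   # start from a constant model
  mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    resid <- y - pred               # negative gradient of the squared-error loss
    stump <- fit_stump(x, resid)    # each new learner targets the current errors
    pred <- pred + shrinkage * predict_stump(stump, x)
    mse[m] <- mean((y - pred)^2)
  }
  list(pred = pred, mse = mse)
}

set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)   # simulated data
fit <- boost(x, y)
print(round(fit$mse[c(1, 10, 50)], 4))        # training error after 1, 10, 50 stumps
```

The `shrinkage` and `n_trees` arguments play the same roles as the parameters of the same names in the gbm call used later: a smaller step per learner usually needs more learners.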

First, load the libraries we will need

library(gbm)

## Loaded gbm 2.1.8

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

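Preparing the data

# Pick the Glass data file interactively; it has no header row
glass_csv <- read.csv(file.choose(), header = FALSE, sep = ",")
# Name the columns: row id, refractive index, oxide contents, and glass type (tipo)
colnames(glass_csv) <- c("id","ri","na","mg","al","si","k","ca","ba","fe","tipo")
head(glass_csv)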

##   id      ri    na   mg   al    si    k   ca ba   fe tipo
## 1  1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2  2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3  3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4  4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5  5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6  6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
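One thing to check at this stage: the formula `tipo ~ .` used later treats every other column as a predictor, and the fitted model reports 10 predictors, so `id` is among them. If the rows are ordered by glass type (the first six rows all have type 1, but this is an assumption about the rest of the file), the row number alone can reveal the class and inflate the measured accuracy. A minimal sketch of the usual precaution, on a made-up toy data frame rather than the real data:

```r
# Toy stand-in for glass_csv with a subset of its columns (values are made up).
toy <- data.frame(id   = 1:6,
                  ri   = c(1.52, 1.51, 1.52, 1.51, 1.52, 1.51),
                  na   = c(13.6, 13.9, 13.5, 13.2, 14.1, 14.4),
                  tipo = c(1, 1, 2, 2, 7, 7))

# Drop the row identifier so it cannot be used as a predictor.
toy$id <- NULL

# Treat the class code as a categorical label rather than a number.
toy$tipo <- factor(toy$tipo)

str(toy)
```

The same two assignments could be applied to `glass_csv` before splitting; the results shown in this document were produced without them.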

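Splitting into training (70%) and test (30%) sets

# createDataPartition() from caret draws a random sample with 70% of the row indices
index1 <- createDataPartition(glass_csv$tipo, p = 0.7, list = FALSE)
treinogbm <- glass_csv[index1,]   # training set
testegbm <- glass_csv[-index1,]   # test set (the remaining rows)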
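The split above is random, so each run yields different rows and slightly different results unless a seed is set first. Also, `createDataPartition()` samples within classes when the outcome is a factor, whereas for a numeric outcome (as `tipo` is here) it groups values by percentiles. The base-R sketch below shows a seeded, per-class 70/30 split; `stratified_index()` is a hypothetical helper, not a caret function, and the class sizes are illustrative.

```r
# Base-R sketch of a stratified split; stratified_index() is a hypothetical
# helper, not part of caret.
stratified_index <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)          # row indices grouped by class
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- ceiling(p * length(idx))            # at least one row per class
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)   # fix the seed so the split is reproducible
y <- rep(c(1, 2, 3, 5, 6, 7), times = c(70, 76, 17, 13, 9, 29))  # illustrative class sizes
train_idx <- stratified_index(y)

print(table(y[train_idx]))    # training rows per class
print(table(y[-train_idx]))   # test rows per class
```

With caret, calling `set.seed()` immediately before `createDataPartition()` gives the same reproducibility.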

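Classification with GBM

# tipo ~ . uses every other column as a predictor (including id)
gbmglass <- gbm(tipo ~. ,
                data = treinogbm,
                distribution = "multinomial",  # multi-class loss
                cv.folds = 5,                  # 5-fold cross-validation
                shrinkage = 0.1,               # learning rate
                n.minobsinnode = 0.01,         # minimum observations per terminal node
                n.trees = 200)                 # number of boosting iterations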

## Warning: Setting `distribution = "multinomial"` is ill-advised as it is
## currently broken. It exists only for backwards compatibility. Use at your own
## risk.

print(gbmglass)

## gbm(formula = tipo ~ ., distribution = "multinomial", data = treinogbm, 
##     n.trees = 200, n.minobsinnode = 0.01, shrinkage = 0.1, cv.folds = 5)
## A gradient boosted model with multinomial loss function.
## 200 iterations were performed.
## The best cross-validation iteration was 158.
## There were 10 predictors of which 10 had non-zero influence.

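With the model ready, let's predict on the test data

# type = "response" returns the predicted probability of each class
predi <- predict.gbm(object = gbmglass,
                     newdata = testegbm,
                     n.trees = 200,
                     type = "response")
# For each row, pick the class with the highest predicted probability
labels <- colnames(predi)[apply(predi, 1, which.max)]
# Put the true and predicted classes side by side
result <- data.frame(testegbm$tipo, labels)
print(result)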

##    testegbm.tipo labels
## 1              1      1
## 2              1      1
## 3              1      1
## 4              1      1
## 5              1      1
## 6              1      1
## 7              1      1
## 8              1      1
## 9              1      1
## 10             1      1
## 11             1      1
## 12             1      1
## 13             1      1
## 14             1      1
## 15             1      1
## 16             1      1
## 17             1      1
## 18             1      1
## 19             1      1
## 20             1      1
## 21             1      1
## 22             1      2
## 23             2      2
## 24             2      2
## 25             2      2
## 26             2      2
## 27             2      2
## 28             2      2
## 29             2      2
## 30             2      2
## 31             2      2
## 32             2      2
## 33             2      2
## 34             2      2
## 35             2      2
## 36             2      2
## 37             2      2
## 38             2      2
## 39             2      2
## 40             2      2
## 41             2      2
## 42             2      2
## 43             2      3
## 44             3      3
## 45             3      3
## 46             3      3
## 47             3      3
## 48             3      3
## 49             5      5
## 50             5      5
## 51             5      5
## 52             5      5
## 53             6      6
## 54             6      6
## 55             6      6
## 56             6      6
## 57             7      7
## 58             7      7
## 59             7      7
## 60             7      7
## 61             7      7
## 62             7      7
## 63             7      7
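The label extraction used above is an argmax over the class probabilities of each row. The following sketch shows the same step on a small probability matrix; the numbers and class names are made up for illustration.

```r
# Toy class-probability matrix: one row per sample, one column per class.
# The numbers are made up for illustration.
probs <- matrix(c(0.80, 0.15, 0.05,
                  0.10, 0.70, 0.20,
                  0.25, 0.30, 0.45),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("1", "2", "7")))

# For each row, find the column holding the largest probability ...
best_col <- apply(probs, 1, which.max)

# ... and map that column index back to the class name.
toy_labels <- colnames(probs)[best_col]
print(toy_labels)
```

Because the labels come from column names, they are character strings, which is why they are converted with `as.factor()` before being compared with the true classes.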

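Now let's compute the confusion matrix

# Note: confusionMatrix(data, reference) treats its first argument as the predictions.
# Here the true labels are passed first, so the rows labeled "Prediction" in the output
# hold the true classes and the columns hold the model's labels.
# Overall accuracy is the same either way.
testegbm$tipo <- as.factor(testegbm$tipo)
cm <- confusionMatrix(testegbm$tipo, as.factor(labels))
print(cm)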

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1 21  1  0  0  0  0
##          2  0 20  1  0  0  0
##          3  0  0  5  0  0  0
##          5  0  0  0  4  0  0
##          6  0  0  0  0  4  0
##          7  0  0  0  0  0  7
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9683        
##                  95% CI : (0.89, 0.9961)
##     No Information Rate : 0.3333        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.9574        
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity            1.0000   0.9524  0.83333  1.00000  1.00000   1.0000
## Specificity            0.9762   0.9762  1.00000  1.00000  1.00000   1.0000
## Pos Pred Value         0.9545   0.9524  1.00000  1.00000  1.00000   1.0000
## Neg Pred Value         1.0000   0.9762  0.98276  1.00000  1.00000   1.0000
## Prevalence             0.3333   0.3333  0.09524  0.06349  0.06349   0.1111
## Detection Rate         0.3333   0.3175  0.07937  0.06349  0.06349   0.1111
## Detection Prevalence   0.3492   0.3333  0.07937  0.06349  0.06349   0.1111
## Balanced Accuracy      0.9881   0.9643  0.91667  1.00000  1.00000   1.0000
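As a check on the reported figures, the overall accuracy can be recomputed by hand from the counts in the table above: it is the sum of the diagonal divided by the total number of test samples. The sketch below uses only base R and copies the counts from the printed matrix.

```r
# Counts copied from the confusion matrix printed above
# (rows: first argument to confusionMatrix(), columns: second argument).
classes <- c("1", "2", "3", "5", "6", "7")
cm_counts <- matrix(c(21,  1, 0, 0, 0, 0,
                       0, 20, 1, 0, 0, 0,
                       0,  0, 5, 0, 0, 0,
                       0,  0, 0, 4, 0, 0,
                       0,  0, 0, 0, 4, 0,
                       0,  0, 0, 0, 0, 7),
                    nrow = 6, byrow = TRUE,
                    dimnames = list(classes, classes))

# Overall accuracy: correctly classified samples (the diagonal) over all samples.
accuracy <- sum(diag(cm_counts)) / sum(cm_counts)
print(round(accuracy, 4))

# Diagonal over column totals reproduces the "Sensitivity" row of the report.
col_ratio <- diag(cm_counts) / colSums(cm_counts)
print(round(col_ratio, 4))
```

This gives 61 correct out of 63, matching the Accuracy line of the output.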

Conclusion

Applying Gradient Boosting Machines, we achieved an accuracy of 96.83% (61 of 63 test samples) in classifying the glass data.