Gradiente Boosting

class: center, middle, title-slide

.title[
# Gradiente Boosting
]
.subtitle[
## Em classificação
]
.author[
### Jaime Utria
]
.institute[
### Departamento de Estatística - UFF
]

---

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    Macros: {
      vet: ["{\\mathbf #1}",1],
      prodint: ["\\langle #1, #2 \\rangle",2],
      posto: ["{\\mathrm{posto} (#1)}",1],
      tr: ["{\\mathrm{tr} (#1)}",1]
    }
  }
});
</script>

## Gradiente boosting: Algoritmo

1. Iniciar `$\hat{f}_0(\vet x) = \text{argmin}_\gamma\sum_{i=1}^n L(y_i,\gamma)$`.

2. Para `$b = 1, \ldots, B$`: 
  
  (a) Calcular os pseudo-resíduos
  `$r_{ib} = - \left[\frac{\partial L(y_i,\hat{f}(\mathbf{x}_i))}{\partial \hat f(\mathbf{x}_i)}\right]_{\hat f = \hat f_{b-1}}, i =1,\;  \ldots, n$`.
  
  (b) Ajustar uma árvore de regressão aos pseudo-resíduos `$r_{ib}$`, `$i=1,\ldots,n$` com regiões terminais `$R_{jb}$`, `$j=1,\ldots,J_b$`.
  
  (c) Calcular `$\gamma_{jb} =\text{argmin}_\gamma \sum_{\vet x_i \in R_{jb}} L(y_i,\hat f_{b-1}(\mathbf x_i)+\gamma)$`
  
  (d) Atualizar `$\hat f_b(\mathbf{x}) = \hat f_{b-1}(\mathbf{x}) + \eta \sum_{j=1}^{J_b} \gamma_{jb}I(\mathbf{x} \in R_{jb})$`
  
3. Preditor final: `$\hat f_{\text{GB}}(\mathbf{x}) = \hat f_B(\mathbf{x})$`.

---

- Seja `$\mathcal{L} = \{(\mathbf x_i,y_i): \mathbf{x}_i \in \mathbb R^p, y_i \in\{0,1\}: i =1, \ldots,n\}.$` o conjunto de treino.

- Seja `$\hat p$` a probabilidade estimada de `$y$` pertencer à classe 1. Consequentemente, `$1-\hat p$` é a probabilidade estimada de `$y$` pertencer à classe 0.

- A função de perda usada para problemas de classificação é a perda logarítmica (*log-loss*) definida como segue:

`$$\begin{align} L(y,\hat{p}) &= - y \log(\hat p) - (1-y_i)\log(1-\hat p)\\
&=- y \log\left(\frac{\hat p}{1-\hat p}\right) - \log(1-\hat p)\\
&=-y\log(\hat q)+\log(1+\exp(\log(\hat q)))\\
&=L(y, \log(\hat q)).\end{align}$$`

- Ou seja, a *log-loss* é uma função do valor observado `$y$` e de `$\log(\hat q)$`, em que `$\hat q = \frac{\hat p}{1-\hat p}$` é a chance (*odd*) predita.

---

`$$\begin{align}-\frac{\partial L(y,\log(\hat q))}{\partial \log(\hat q)} &=  y - \frac{\exp(\log(\hat q))}{1 + \exp(\log(\hat q))}\\
&=y - \hat p\end{align}$$`

- Lembre que `$\hat{y} = I(\hat p > 0,5)$`, então `$-\frac{\partial L(y,\log(\hat q))}{\partial \log(\hat q)}$` pode ser visto como o (pseudo) resíduo obtido ao prever `$y$` pelo modelo logístico.

- Voltando no passo 1 do algoritmo, vemos que o preditor inicial `$\hat{f}_0(x) = \log(\hat{q})$`, em que `$\hat{q} = \frac{\hat p}{1-\hat p}$`, em que `$\hat p$` é a proporção de amostras rotuladas com 1 na base de treino.

---

## Diabetes (**Toy**)

```
##     Outcome Pregnancies Glucose BloodPressure Age
## 228       0           6     105            70  37
## 80        0           6     111            64  24
## 322       0           0     124            56  21
## 333       0           4     132            86  63
## 336       0           4      85            58  28
## 459       1           9     145            80  40
## 153       1           3     162            52  24
## 255       1          12     140            82  58
## 204       1          14     100            78  46
## 267       1           5     144            82  58
```

---

- Para ilustração, vamos construir o preditor com 2 árvores (tocos) e `$\eta = 1$`.

- A previsão inicial da `$\log(odds)$` é `$\hat f_0(\mathbf x) = \log(1) = 0$`, pois temos 5 repostas normal (0) e 5 respostas diabetes (1).

- Para `$b=1$`, temos que

``` r
# resposta observada
y <- as.integer(diabetes_toy$Outcome)-1
# predições iniciais
p.pred0 <- rep(1/2,10)
# residuos
r1 <- y - p.pred0
diabetes2 <- data.frame(r1,diabetes_toy[,-1])
```

---

- Agora, ajustamos uma árvore de regressão a `r1`, gerando regiões terminais `$R_{11}$` e `$R_{21}$`.

``` r
library(rpart)
rpart1 <- rpart(r1 ~., data = diabetes2, method = "anova",
                 control =rpart.control(minsplit = 10))
rpart1
```

```
## n= 10 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 10 2.5000000  0.0000000  
##   2) Glucose< 136 6 0.8333333 -0.3333333 *
##   3) Glucose>=136 4 0.0000000  0.5000000 *
```

---

`$R_{11} = \{\text{Glucose} < 136\}$` e `$R_{21} = \{\text{Glucose > = 136}\}$`

- O valor predito `$\gamma_{11}$` e `$\gamma_{21}$` em cada região é obtido otimizando em `$\gamma$`, a função:

`$$\gamma_{jb} =\text{argmin}_\gamma \sum_{\vet x_i \in R_{jb}} L(y_i,\hat f_{b-1}(\mathbf x_i)+\gamma)$$`
- Substituindo na função de perda, temos:

`$$L(y,\hat f_{b-1}(\mathbf x)+\gamma) = -y \log(\hat f_{b-1}(\mathbf x) + \gamma) + \log(1+\exp(\log(\hat f_{b-1}(\mathbf x) + \gamma)))$$`
- Usando a expansão de Taylor de segunda ordem:

`$$L(y_i,\hat f_{b-1}(\mathbf x_i)+\color{red}{\gamma}) \approx L(y,\hat f_{b-1}(\mathbf x)) + \frac{d}{d\gamma} L(y,\hat f_{b-1}(\mathbf{x}))\color{red}{\gamma} + \frac{d^2}{d\gamma^2}L(y,\hat f_{b-1}(\mathbf{x}))\color{red}{\gamma^2}.$$`

---

- Derivando com relação a `$\gamma$` e igualando a zero na expressão do lado direito e após MUITAS simplificações, obtemos:

`$$\gamma_{jb} = \frac{\sum_{\mathbf x_i \in R_{jb}}(y_i -\hat p_i^{(b-1)})}{\sum_{\mathbf{x}_i \in R_{jb}}\hat p_{i}^{(b-1)}(1-\hat p_i^{(b-1)})},$$`
em que `$\hat p_i^{(b-1)}$` é a probabilidade prevista na iteração `$b-1$` (probabilidade prévia).

---

``` r
diabetes2_aumentado <- cbind(p.pred0,diabetes2)
R11 <- cbind(diabetes2_aumentado[diabetes2_aumentado$Glucose < 136,])
R21 <- cbind(diabetes2_aumentado[diabetes2_aumentado$Glucose >= 136,]) 
gamma11 <- sum(R11$r1)/sum(R11$p.pred0*(1-R11$p.pred0))
gamma21 <- sum(R21$r1)/sum(R21$p.pred0*(1-R21$p.pred0))
```

---

```
##        gamma11 gamma21
## [1,] -1.333333       2
```

<div class="grViz html-widget html-fill-item" id="htmlwidget-00cbd347f90b02880d8c" style="width:504px;height:504px;"></div>
<script type="application/json" data-for="htmlwidget-00cbd347f90b02880d8c">{"x":{"diagram":"\ndigraph tree {\n  node [shape = ellipse, style = filled, color = lightblue, fontname = Helvetica]\n\n  raiz   [label = \"\"]\n  filho1 [label = \"-1,33\"]\n  filho2 [label = \"2\"]\n\n  raiz -> filho1 [xlabel = \"Glucose < 136\"]\n  raiz -> filho2 [label = \"Glucose >= 136\"]\n}\n","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

---

``` r
# log-odds preditas
predictions1 <- log(1) + ifelse(diabetes2$Glucose < 136, -1.33, 2)
# prob da classe 1 preditas
p.pred1 <- exp(predictions1)/(1+exp(predictions1))
# pseudo-resíduos
r2 <- y - p.pred1; diabetes3 <- data.frame(r2, diabetes2[,-1])
head(diabetes3)
```

```
##             r2 Pregnancies Glucose BloodPressure Age
## 228 -0.2091594           6     105            70  37
## 80  -0.2091594           6     111            64  24
## 322 -0.2091594           0     124            56  21
## 333 -0.2091594           4     132            86  63
## 336 -0.2091594           4      85            58  28
## 459  0.1192029           9     145            80  40
```

---

- Agora, ajustamos uma árvore de regressão a `r2`, gerando regiões terminais `$R_{12}$` e `$R_{22}$`.

``` r
rpart2 <- rpart(r2 ~., data = diabetes3, method = "anova",
                 control =rpart.control(minsplit = 10))
rpart2
```

```
## n= 10 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 10 0.8960825  0.02218555  
##   2) Pregnancies< 7.5 7 0.1540311 -0.11534160 *
##   3) Pregnancies>=7.5 3 0.3007315  0.34308220 *
```

---

`$R_{12} = \{\text{Pregnancies} < 7,5\}$` e `$R_{22} = \{\text{Pregnancies} >= 7,5\}$`.

``` r
# prob previstas b = 2
diabetes3_aumentado <- cbind(p.pred1,diabetes3)
R12 <- cbind(diabetes3_aumentado[diabetes3_aumentado$Pregnancies < 7.5,])
R22 <- cbind(diabetes3_aumentado[diabetes3_aumentado$Pregnancies >= 7.5,]) 
gamma12 <- sum(R12$r2)/sum(R12$p.pred1*(1-R12$p.pred1))
gamma22 <- sum(R22$r2)/sum(R22$p.pred1*(1-R22$p.pred1))
cbind(gamma12,gamma22) #log odds previstas
```

```
##        gamma12  gamma22
## [1,] -0.778549 2.741741
```

---

<div class="grViz html-widget html-fill-item" id="htmlwidget-bf9369760f4f3ddf61f3" style="width:504px;height:504px;"></div>
<script type="application/json" data-for="htmlwidget-bf9369760f4f3ddf61f3">{"x":{"diagram":"\ndigraph tree {\n  node [shape = ellipse, style = filled, color = lightblue, fontname = Helvetica]\n\n  raiz   [label = \"\"]\n  filho1 [label = \"-0.778\"]\n  filho2 [label = \"2.741\"]\n\n  raiz -> filho1 [xlabel = \"Pregnancies < 7,5\"]\n  raiz -> filho2 [label = \"Pregnancies >= 7,5\"]\n}\n","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

---

``` r
# log-odds preditas
predictions2 <- log(1) + 
                ifelse(diabetes3$Glucose < 136,-1.33,2) +
                ifelse(diabetes3$Pregnancies < 7.5, -0.778, 2.741)
# prob da classe 1 preditas
p.pred2 <- exp(predictions2)/(1+exp(predictions2))
# pseudo-resíduos
r3 <- y - p.pred2; 
classe.pred <- ifelse(p.pred2 > 0.5,1,0)
diabetes4 <- data.frame(p.pred2,classe.pred ,diabetes3[,-1])
```

---

``` r
diabetes4
```

```
##       p.pred2 classe.pred Pregnancies Glucose BloodPressure Age
## 228 0.1083217           0           6     105            70  37
## 80  0.1083217           0           6     111            64  24
## 322 0.1083217           0           0     124            56  21
## 333 0.1083217           0           4     132            86  63
## 336 0.1083217           0           4      85            58  28
## 459 0.9913456           1           9     145            80  40
## 153 0.7724153           1           3     162            52  24
## 255 0.9913456           1          12     140            82  58
## 204 0.8039236           1          14     100            78  46
## 267 0.7724153           1           5     144            82  58
```

---

``` r
library(caret)
library(gbm)
set.seed(12345)
index_train <- createDataPartition(y = diabetes_limpo$Outcome, p = 0.8, list = F)
diabetes_limpo$Outcome <- as.numeric(diabetes_limpo$Outcome)-1
training <- diabetes_limpo[index_train,]
testing <- diabetes_limpo[-index_train,]
gbm_diabetes <- gbm(Outcome ~.,
                    data = training,
                    distribution = "bernoulli", # para classificação
                    n.trees = 200,
                    interaction.depth = 8,
                    shrinkage = 0.1,
                    bag.fraction = 1,
                    cv.folds = 10)
```

---

``` r
gbm_diabetes
```

```
## gbm(formula = Outcome ~ ., distribution = "bernoulli", data = training, 
##     n.trees = 200, interaction.depth = 8, shrinkage = 0.1, bag.fraction = 1, 
##     cv.folds = 10)
## A gradient boosted model with bernoulli loss function.
## 200 iterations were performed.
## The best cross-validation iteration was 21.
## There were 7 predictors of which 7 had non-zero influence.
```

---

``` r
summary(gbm_diabetes, plotit = FALSE)
```

```
##                                               var   rel.inf
## Glucose                                   Glucose 39.031083
## DiabetesPedigreeFunction DiabetesPedigreeFunction 16.887903
## Age                                           Age 16.563161
## BMI                                           BMI 12.474932
## Pregnancies                           Pregnancies  5.494548
## BloodPressure                       BloodPressure  4.926033
## SkinThickness                       SkinThickness  4.622341
```

---

``` r
summary(gbm_diabetes)
```

---

## Avaliando o desempenho do ajuste

``` r
predictions <- predict(gbm_diabetes,
newdata = testing, n.trees = 200,type = "response")
predictions <- as.factor(ifelse(predictions > 0.5,1,0))
testing$Outcome <- as.factor(testing$Outcome)
```

---

``` r
confusionMatrix(predictions,testing$Outcome)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 60 15
##          1 11 20
##                                           
##                Accuracy : 0.7547          
##                  95% CI : (0.6616, 0.8331)
##     No Information Rate : 0.6698          
##     P-Value [Acc > NIR] : 0.03714         
##                                           
##                   Kappa : 0.4289          
##                                           
##  Mcnemar's Test P-Value : 0.55630         
##                                           
##             Sensitivity : 0.8451          
##             Specificity : 0.5714          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.6452          
##              Prevalence : 0.6698          
##          Detection Rate : 0.5660          
##    Detection Prevalence : 0.7075          
##       Balanced Accuracy : 0.7082          
##                                           
##        'Positive' Class : 0               
## 
```

---

## Número ideal de árvores

``` r
gbm.perf(gbm_diabetes)
```

```
## [1] 21
```

---

## Avaliando o desempenho do ajuste (com o número de árvores escolhido)

``` r
predictions_cv <- predict(gbm_diabetes,
newdata = testing, n.trees = 21,type = "response")
predictions_cv <- as.factor(ifelse(predictions_cv > 0.5,1,0))
```

---

``` r
confusionMatrix(predictions_cv,testing$Outcome)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 62 16
##          1  9 19
##                                           
##                Accuracy : 0.7642          
##                  95% CI : (0.6718, 0.8412)
##     No Information Rate : 0.6698          
##     P-Value [Acc > NIR] : 0.02258         
##                                           
##                   Kappa : 0.4383          
##                                           
##  Mcnemar's Test P-Value : 0.23014         
##                                           
##             Sensitivity : 0.8732          
##             Specificity : 0.5429          
##          Pos Pred Value : 0.7949          
##          Neg Pred Value : 0.6786          
##              Prevalence : 0.6698          
##          Detection Rate : 0.5849          
##    Detection Prevalence : 0.7358          
##       Balanced Accuracy : 0.7080          
##                                           
##        'Positive' Class : 0               
## 
```