Adaboost

========================================================

This is an R Markdown document created to explain AdaBoost. Based on [1] and [2].

About boosting

Taken from [3]: Boosting is a general method for improving the accuracy of any given learning algorithm. A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a “rule of thumb” for that set of races (such as, “Bet on the horse that has recently won the most races” or “Bet on the horse with the most favored odds”). Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert’s opinion on different collections of races, the gambler is able to extract many rules of thumb. In order to use these rules of thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules of thumb from the expert that will be the most useful? Second, once he has collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule? Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above.

An informal view on boosting

1- Use a classifier

2- Check which cases were misclassified in the previous step

3- Create a new classifier that is less likely to misclassify the cases identified in the previous step

4- Repeat 2 and 3 a number of times, thus creating several classifiers

5- Weight the classifiers and add them up to get a stronger predictor

A formal view on boosting

(adapted from [2])

Given: a training set (X, y). This is to be interpreted as follows: the training set is composed of the vectors \((x_1, y_1)\), …, \((x_m, y_m)\), where for each \(i\), \(x_i \in IR^p\) is a vector corresponding to the values taken by the p features for case \(i\) and \(y_m \in \{-1, 1 \}\) is the correct label of case \(i\)
For \(t=1, ..., T\),

construct distribution \(D_t\) on \(\{1, ..., m\}\), where \(D_t(i)\) is to be interpreted as the weight attributed to case \(i\) on the iteration \(t\)
build up a classifier \(h_t : X \rightarrow \{-1, 1 \}\)
compute the error associated to the classifier: \(\epsilon_t= \sum_{i=1}^m D_t(i) \times \delta_i\), where \(\delta_i\) is 0 if \(h_t(x_i)=y_i\) and 1 otherwise

Compute final classifier \(H_{final}\) as a weighted average of the classifiers \(h_t, t \in \{1, ..., T\}\).

AdaBoost

In the case of AdaBoost:

\(D_1(i)=1/m\)
Given \(D_t=D_t(i)\) and \(h_t=h_t(x_i)\), \(D_{t+1}(i) =D_t(i)*F_i/Z_t\), where

\(F_i=e^{-\alpha_t}\) if \(h_t(x_i)=y_i\) and \(F_i=e^{\alpha_t}\) if \(h_t(x_i) \neq y_i\)
\(\alpha_t=\frac{1}{2} ln (\frac{1-\epsilon_t}{\epsilon_t})\)
\(Z_t\) is a normalisation constant such that \(\sum_{i=1}^m D_t(i)=1\)

To weigh the classifiers and compute the final classifier:

\(H_{final}=sign(\sum_t \alpha_t h_t(x))\)

Example 1:

Let us look at the toy example from [2], where the classifiers are given by rules of thumb (i.e., someone looking at the data decided to classify it this way):

Given:

\(X=(x_1, x_2, ..., x_10)\) are the coordinates of the 10 cases. In particular, \(x_1\) is a vector: \((x_1(1), x_1(2))\) are the coordinates of the first case.
\(y=(y_1, y2, ..., y_10)=(-1, 1, 1, -1, -1, -1, 1, 1, 1, -1)\):

y=c(-1, 1, 1, -1, -1, -1, 1, 1, 1, -1)

1. Let t=1:

\(D_1(i)=1/m\)

D1=rep(1/10, 10)

In this toy example, let the classifier \(h_1\) be a simple rule of thumb: everything to the left of the vertical line in the following image is to be classified as \(1\) and everything to the right as \(-1\)

The classifier is then given by

h1=c(-1, 1, 1, -1, -1, -1, -1, -1, -1, -1)

The error associated to the classifier is \(\epsilon_1= \sum_{i=1}^m D_1(i) \times \delta_i=0.3\)

epsilon_1=sum(D1[!(y==h1)])
epsilon_1

## [1] 0.3

Let t=2:

\(D_2(i)=D_1(i) \times F_i/Z_t\), where \(\alpha_1=\frac{1}{2} ln (\frac{1-\epsilon_1}{\epsilon_1})\) comes into play:

alpha_1=0.5*log((1-epsilon_1)/epsilon_1)
alpha_1

## [1] 0.4236

D2=c()
for (i in seq(1:length(y))){
if (h1[i]==y[i]) {F[i]=exp(-alpha_1)   } else {F[i]=exp(alpha_1)}
D2[i]=D1[i]*F[i]
}
F

##  [1] 0.6547 0.6547 0.6547 0.6547 0.6547 0.6547 1.5275 1.5275 1.5275 0.6547

D2=D2/sum(D2)
D2

##  [1] 0.07143 0.07143 0.07143 0.07143 0.07143 0.07143 0.16667 0.16667
##  [9] 0.16667 0.07143

Schematically, what we are doing is upweigthing the misclassified cases (cases 7, 8, 9). Let’s assume the next classifier, \(h_2\) we build takes that into account:

This classifier is such that:

h2=c(1, 1, 1, 1, 1, -1, 1, 1, 1, -1)

The error associated to the classifier is \(\epsilon_2= \sum_{i=1}^m D_2(i) \times \delta_i\)

epsilon_2=sum(D2[!(y==h2)])
epsilon_2

## [1] 0.2143

Let t=3:

\(D_3(i)=D_1(i) \times F_i/Z_t\), where \(\alpha_2=\frac{1}{2} ln (\frac{1-\epsilon_2}{\epsilon_2})\) comes into play:

alpha_2=0.5*log((1-epsilon_2)/epsilon_2)
alpha_2

## [1] 0.6496

D3=c()
for (i in seq(1:length(y))){
if (h2[i]==y[i]) {F[i]=exp(-alpha_2)   } else {F[i]=exp(alpha_2)}
D3[i]=D2[i]*F[i]
}
F

##  [1] 1.9149 0.5222 0.5222 1.9149 1.9149 0.5222 0.5222 0.5222 0.5222 0.5222

D3=D3/sum(D3)
D3

##  [1] 0.16667 0.04545 0.04545 0.16667 0.16667 0.04545 0.10606 0.10606
##  [9] 0.10606 0.04545

Schematically, what we are doing is upweigthing the misclassified cases (cases 2, 3, 10). Let’s assume the next classifier, \(h_3\) we build takes that into account:

This classifier is such that:

h3=c(-1, -1, -1, -1, -1, -1, 1, 1, 1, 1)

The error associated to the classifier is \(\epsilon_2= \sum_{i=1}^m D_2(i) \times \delta_i\)

epsilon_3=sum(D3[!(y==h3)])
epsilon_3

## [1] 0.1364

The corresponding \(\alpha_3\) is given by:

alpha_3=0.5*log((1-epsilon_3)/epsilon_3)
alpha_3

## [1] 0.9229

Let us build the final classifier, using the formula \(H_{final}=sign(\sum_t \alpha_t h_t(x))\):

h_final=sign(alpha_1*h1+alpha_2*h2+alpha_3*h3)
h_final

##  [1] -1  1  1 -1 -1 -1  1  1  1 -1

Notice how accurate this is:

h_final==y

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Example 2:

Let’s consider the iris data set and try to classify this into “setosa=1” or “otherwise=-1”:

irisdata=iris
labels=as.numeric(irisdata$Species=="setosa")
head(irisdata)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(irisdata)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

Our goal is to use logistic regression to classify the data set.

glmt1 <- glm(labels~irisdata$Sepal.Length, family=binomial)
coefficients(summary(glmt1))

##                       Estimate Std. Error z value  Pr(>|z|)
## (Intercept)             27.829     4.8276   5.765 8.190e-09
## irisdata$Sepal.Length   -5.176     0.8934  -5.793 6.903e-09

fit1 <- fitted(glmt1)
accuracytable=table(fit1>0.5, labels)
accuracy=(accuracytable[1, 1]+accuracytable[2, 2])/length(labels)
accuracy

## [1] 0.8933

This is the accuracy we obtain using logistic regression directly: 0.89.Can we improve on this accuracy using Adaboost? Here we follow the algorithm that was included in the section “A formal view on boosting”

### step 1 of the algorithm
X=irisdata$Sepal.Length # the training set is composed by X and y
y=labels # in this case, y=0/1 and not, as before, y=-1/1

### step 2 of the algorithm
T=10 # number of steps
m=length(y) # number of cases
classifier=matrix(data = NA, nrow = length(y), ncol = T) # where classifiers are to be stored
epsilon=0 # initialise epsilon
alpha=0 # initialise alpha
for (t in seq(1:T)){
   
   # construct distribution: D_{t}(i)=D_{t-1}(i)*F_i/Z
   if (t==1){
      D=rep(1/m, m) # particular case, D_1
   } else
      {
      for (i in seq(1:length(y))){
      if (h[i]==y[i]) {F[i]=exp(-alpha[t-1])   } else {F[i]=exp(alpha[t-1])} # F_i(for t-1)
      D[i]=D[i]*F[i] # D_{t}(i)=D_{t-1}(i)*F_i/Z
      }}
   D=D/sum(D) # normalisation
      
   # build up classifier
   
   ### is this required? Should weights be treated as integers?
   # D=as.integer(D*10000) 
   ### is this required? Should weights be treated as integers?
   h=glm(y~X, family=binomial, weights=D)
   h = as.numeric(fitted(h)>0.5)
   classifier[,t]=h
   ### required if weights converted to integers
   # D=D/sum(D) 
   ### required if weights converted to integers
   
   
   # compute error associated with classifier and alpha
   epsilon[t]=sum(D[!(y==h)])
   alpha[t]=0.5*log((1-epsilon[t])/epsilon[t]) # alpha_{t}

}


### step 3 of the algorithm
finalclassifier=matrix(data = 0, nrow = length(y), ncol = 1)
classifier[classifier==0]=-1
labels[labels==0]=-1
for (t in seq(1:T)){
finalclassifier=finalclassifier+alpha[t]*classifier[, t]
   }

finalclassifier=sign(finalclassifier)

accuracytable=table(finalclassifier, labels)
accuracy2=(accuracytable[1, 1]+accuracytable[2, 2])/length(labels)
accuracy2

## [1] 0.8933

The accuracy actually remained constant and equal to 0.89. Boosting was not useful.

Final remarks:

1- I translated the code in example 2 to Matlab and got the same results. I then used the same code but with SVM’s instead of logistic regressions, and boosting did provide higher accuracy. I will eventually (+hopefully) update this document to reflect this

2- I am unsure whether the weighted logistic regression was well implemented in example 2.

[1] http://videolectures.net/mlss05us_schapire_b/

[2] http://www.cc.gatech.edu/~thad/6601-gradAI-fall2013/boosting.pdf

[3] http://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf