A decision tree (tree model) tends to have lower prediction accuracy than other predictive models.
The main source of this inaccuracy is the large variance caused by overfitting; Breiman devised bagging (1994), which reduces this variance through majority voting.
Breiman (2001) later devised an algorithm called Random Forest, which keeps (or exceeds) the accuracy of bagging while dramatically improving computation speed.
That is, like bagging it trains each tree on a bootstrap sample, but the variables considered at each split are chosen at random: instead of evaluating all variables at every split, only the randomly selected variables are searched for the best split.
Let the number of training cases be N, and the number of variables in the classifier be M.
We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e., take a bootstrap sample). Use the rest of the cases (the out-of-bag cases) to estimate the error of the tree, by predicting their classes.
For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
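A minimal sketch of the bootstrap-plus-majority-voting part of this procedure, using rpart trees. The helper names bag_trees and predict_majority are made up for illustration; unlike a real random forest, this omits the random choice of m variables at each split and uses rpart's default stopping rules rather than fully grown trees.
library(rpart)

# grow B classification trees, each on a bootstrap sample of the data
bag_trees <- function(formula, data, B = 25) {
  lapply(seq_len(B), function(b) {
    idx <- sample(nrow(data), replace = TRUE)   # N draws with replacement
    rpart(formula, data = data[idx, ], method = "class")
  })
}

# combine the trees by majority voting over their predicted classes
predict_majority <- function(trees, newdata) {
  votes <- sapply(trees, function(tr)
    as.character(predict(tr, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

set.seed(1)
trees <- bag_trees(Species ~ ., iris, B = 25)
table(predicted = predict_majority(trees, iris), observed = iris$Species)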
It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.
It runs efficiently on large databases.
It can handle thousands of input variables without variable deletion.
It gives estimates of what variables are important in the classification.
It generates an internal unbiased estimate of the generalization error as the forest building progresses.
It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
It has methods for balancing error in class population unbalanced data sets.
Prototypes are computed that give information about the relation between the variables and the classification.
It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) to give interesting views of the data (see the sketch after this list).
The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
It offers an experimental method for detecting variable interactions.
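A short sketch of the proximity-based capabilities mentioned above (data views and outlier detection); it fits a small forest on iris with proximity = TRUE and uses MDSplot and outlier from the randomForest package.
library(randomForest)
set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 100, proximity = TRUE)
MDSplot(fit, iris$Species, k = 2)                  # 2-D scaling view of the proximities
out <- outlier(fit$proximity, cls = iris$Species)  # within-class outlyingness measure
plot(out, type = "h", col = iris$Species, ylab = "outlyingness")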
Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.
Unlike decision trees, the classifications made by random forests are difficult for humans to interpret.
For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Therefore, the variable importance scores from random forests are not reliable for this type of data. Methods such as partial permutations have been used to address the problem (see the sketch after this list).
If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.
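A sketch of one published alternative for the importance-bias issue noted above, assuming the party package is installed: conditional permutation importance from cforest, which is related to but not identical with the partial-permutation approach.
library(party)
cf <- cforest(Species ~ ., data = iris,
    controls = cforest_unbiased(ntree = 100, mtry = 2))
varimp(cf, conditional = TRUE)   # conditional permutation importance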
# randomly assign each row to set 1 (training, ~70%) or set 2 (test, ~30%)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
ind
## [1] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1
## [36] 1 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 2 1 2 1 1 1 2 1 1 1
## [71] 1 1 1 1 2 2 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 2 1
## [106] 2 1 2 2 2 1 1 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 2 1 1 2 1
## [141] 2 1 2 1 2 2 2 1 1 1
trainData <- iris[ind == 1, ]
testData <- iris[ind == 2, ]
library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
# fit a 100-tree classification forest and keep the proximity matrix
rf <- randomForest(Species ~ ., data = trainData, ntree = 100, proximity = TRUE)
rf
##
## Call:
## randomForest(formula = Species ~ ., data = trainData, ntree = 100, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.66%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 39 0 0 0.00000
## versicolor 0 34 3 0.08108
## virginica 0 3 27 0.10000
# predict(rf) without newdata returns the out-of-bag predictions,
# so this table matches the OOB confusion matrix above
table(predict(rf), trainData$Species)
##
## setosa versicolor virginica
## setosa 39 0 0
## versicolor 0 34 3
## virginica 0 3 27
attributes(rf)
## $names
## [1] "call" "type" "predicted"
## [4] "err.rate" "confusion" "votes"
## [7] "oob.times" "classes" "importance"
## [10] "importanceSD" "localImportance" "proximity"
## [13] "ntree" "mtry" "forest"
## [16] "y" "test" "inbag"
## [19] "terms"
##
## $class
## [1] "randomForest.formula" "randomForest"
plot(rf)  # OOB error (overall and per class) versus number of trees
importance(rf)
## MeanDecreaseGini
## Sepal.Length 7.978
## Sepal.Width 2.587
## Petal.Length 28.832
## Petal.Width 30.221
varImpPlot(rf)
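Only MeanDecreaseGini appears above because the forest was fitted without importance = TRUE; a short sketch (rf2 is a new object introduced here) of how the permutation-based measure could be obtained as well:
rf2 <- randomForest(Species ~ ., data = trainData, ntree = 100, importance = TRUE)
importance(rf2, type = 1)   # permutation importance: mean decrease in accuracy
varImpPlot(rf2)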
# predict the held-out test set
irisPred <- predict(rf, newdata = testData)
table(irisPred, testData$Species)
##
## irisPred setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 13 0
## virginica 0 0 20
# margins are computed from the OOB votes on the training data
plot(margin(rf, trainData$Species))
## Loading required package: RColorBrewer
library(randomForest)
library(MASS)
data(fgl)
set.seed(17)
# do.trace = 100 prints the OOB error (overall and per class) every 100 trees
fgl.rf <- randomForest(type ~ ., data = fgl, mtry = 2, importance = TRUE, do.trace = 100)
## ntree OOB 1 2 3 4 5 6
## 100: 23.83% 14.29% 26.32% 70.59% 23.08% 22.22% 13.79%
## 200: 20.09% 10.00% 19.74% 70.59% 23.08% 22.22% 13.79%
## 300: 20.56% 10.00% 23.68% 64.71% 23.08% 22.22% 10.34%
## 400: 18.69% 10.00% 18.42% 58.82% 23.08% 22.22% 13.79%
## 500: 19.16% 10.00% 19.74% 58.82% 23.08% 22.22% 13.79%
print(fgl.rf)
##
## Call:
## randomForest(formula = type ~ ., data = fgl, mtry = 2, importance = TRUE, do.trace = 100)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.16%
## Confusion matrix:
## WinF WinNF Veh Con Tabl Head class.error
## WinF 63 6 1 0 0 0 0.1000
## WinNF 10 61 1 2 1 1 0.1974
## Veh 8 2 7 0 0 0 0.5882
## Con 0 2 0 10 0 1 0.2308
## Tabl 0 2 0 0 7 0 0.2222
## Head 1 3 0 0 0 25 0.1379
# OOB = Out-Of-Bag: each case is predicted only by trees whose bootstrap
# sample did not include it, giving an error estimate without a separate test set
library(ipred)
## Loading required package: rpart
## Loading required package: survival
## Loading required package: splines
## Loading required package: nnet
## Loading required package: class
## Loading required package: prodlim
## KernSmooth 2.23 loaded Copyright M. P. Wand 1997-2009
set.seed(131)
error.RF <- numeric(10)
# 10 replications of errorest's default 10-fold cross-validation error estimate
for (i in 1:10) error.RF[i] <- errorest(type ~ ., data = fgl, model = randomForest,
    mtry = 2)$error
summary(error.RF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.182 0.194 0.208 0.207 0.215 0.238
library(e1071)
set.seed(563)
error.SVM <- numeric(10)
# the same cross-validation estimate for an SVM with fixed cost and gamma
for (i in 1:10) error.SVM[i] <- errorest(type ~ ., data = fgl, model = svm,
    cost = 10, gamma = 1.5)$error
summary(error.SVM)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.364 0.379 0.381 0.379 0.383 0.388
# columns of fgl.rf$importance: one permutation measure per class, then
# MeanDecreaseAccuracy and MeanDecreaseGini; plot the first four columns
par(mfrow = c(2, 2))
for (i in 1:4) plot(sort(fgl.rf$importance[, i], dec = TRUE), type = "h",
    main = paste("Measure", i))
data(Boston)
set.seed(1341)
# regression forest; mtry defaults to floor(p/3) = 4 variables per split
BH.rf <- randomForest(medv ~ ., Boston)
print(BH.rf)
##
## Call:
## randomForest(formula = medv ~ ., data = Boston)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 9.914
## % Var explained: 88.26
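A possible follow-up: partialPlot from the randomForest package shows the partial dependence of the fitted regression forest on a single predictor, for example the Boston column lstat.
partialPlot(BH.rf, Boston, lstat)   # partial dependence of medv on lstat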