A decision tree (tree model) tends to have lower prediction accuracy than other predictive models.
The main source of this inaccuracy is the large variance caused by overfitting; Breiman devised bagging (1994), which reduces this variance through majority voting.
Breiman (2001) later devised an algorithm called Random Forest, which keeps (or exceeds) the accuracy of bagging while dramatically improving computation speed.
That is, like bagging it trains each tree on a bootstrap sample, but the variables considered at each split are chosen at random: instead of evaluating all variables at every split, only the randomly selected variables are searched for the best split.
Let the number of training cases be N, and the number of variables in the classifier be M.
We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e., take a bootstrap sample). Use the rest of the cases (the out-of-bag cases) to estimate the error of the tree, by predicting their classes.
For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
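A minimal sketch of the bootstrap-plus-majority-voting part of this procedure, using rpart trees. The helper names bag_trees and predict_majority are made up for illustration; unlike a real random forest, this omits the random choice of m variables at each split and uses rpart's default stopping rules rather than fully grown trees.
library(rpart)

# grow B classification trees, each on a bootstrap sample of the data
bag_trees <- function(formula, data, B = 25) {
  lapply(seq_len(B), function(b) {
    idx <- sample(nrow(data), replace = TRUE)   # N draws with replacement
    rpart(formula, data = data[idx, ], method = "class")
  })
}

# combine the trees by majority voting over their predicted classes
predict_majority <- function(trees, newdata) {
  votes <- sapply(trees, function(tr)
    as.character(predict(tr, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

set.seed(1)
trees <- bag_trees(Species ~ ., iris, B = 25)
table(predicted = predict_majority(trees, iris), observed = iris$Species)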
It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.
It runs efficiently on large databases.
It can handle thousands of input variables without variable deletion.
It gives estimates of what variables are important in the classification.
It generates an internal unbiased estimate of the generalization error as the forest building progresses.
It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
It has methods for balancing error in class population unbalanced data sets.
Prototypes are computed that give information about the relation between the variables and the classification.
It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) to give interesting views of the data (see the sketch after this list).
The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
It offers an experimental method for detecting variable interactions.
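A short sketch of the proximity-based capabilities mentioned above (data views and outlier detection); it fits a small forest on iris with proximity = TRUE and uses MDSplot and outlier from the randomForest package.
library(randomForest)
set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 100, proximity = TRUE)
MDSplot(fit, iris$Species, k = 2)                  # 2-D scaling view of the proximities
out <- outlier(fit$proximity, cls = iris$Species)  # within-class outlyingness measure
plot(out, type = "h", col = iris$Species, ylab = "outlyingness")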
Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.
Unlike decision trees, the classifications made by random forests are difficult for humans to interpret.
For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Therefore, the variable importance scores from random forests are not reliable for this type of data. Methods such as partial permutations have been used to address the problem (see the sketch after this list).
If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.
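A sketch of one published alternative for the importance-bias issue noted above, assuming the party package is installed: conditional permutation importance from cforest, which is related to but not identical with the partial-permutation approach.
library(party)
cf <- cforest(Species ~ ., data = iris,
    controls = cforest_unbiased(ntree = 100, mtry = 2))
varimp(cf, conditional = TRUE)   # conditional permutation importance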
# randomly assign each row to set 1 (training, ~70%) or set 2 (test, ~30%)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
ind
## [1] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1
## [36] 1 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 2 1 2 1 1 1 2 1 1 1
## [71] 1 1 1 1 2 2 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 2 1
## [106] 2 1 2 2 2 1 1 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 2 1 1 2 1
## [141] 2 1 2 1 2 2 2 1 1 1
trainData <- iris[ind == 1, ]
testData <- iris[ind == 2, ]
library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
# fit a 100-tree classification forest and keep the proximity matrix
rf <- randomForest(Species ~ ., data = trainData, ntree = 100, proximity = TRUE)
rf
##
## Call:
## randomForest(formula = Species ~ ., data = trainData, ntree = 100, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.66%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 39 0 0 0.00000
## versicolor 0 34 3 0.08108
## virginica 0 3 27 0.10000
# predict(rf) without newdata returns the out-of-bag predictions,
# so this table matches the OOB confusion matrix above
table(predict(rf), trainData$Species)
##
## setosa versicolor virginica
## setosa 39 0 0
## versicolor 0 34 3
## virginica 0 3 27
attributes(rf)
## $names
## [1] "call" "type" "predicted"
## [4] "err.rate" "confusion" "votes"
## [7] "oob.times" "classes" "importance"
## [10] "importanceSD" "localImportance" "proximity"
## [13] "ntree" "mtry" "forest"
## [16] "y" "test" "inbag"
## [19] "terms"
##
## $class
## [1] "randomForest.formula" "randomForest"
plot(rf)  # OOB error (overall and per class) versus number of trees
importance(rf)
## MeanDecreaseGini
## Sepal.Length 7.978
## Sepal.Width 2.587
## Petal.Length 28.832
## Petal.Width 30.221
varImpPlot(rf)
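Only MeanDecreaseGini appears above because the forest was fitted without importance = TRUE; a short sketch (rf2 is a new object introduced here) of how the permutation-based measure could be obtained as well:
rf2 <- randomForest(Species ~ ., data = trainData, ntree = 100, importance = TRUE)
importance(rf2, type = 1)   # permutation importance: mean decrease in accuracy
varImpPlot(rf2)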
# predict the held-out test set
irisPred <- predict(rf, newdata = testData)
table(irisPred, testData$Species)
##
## irisPred setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 13 0
## virginica 0 0 20
# margins are computed from the OOB votes on the training data
plot(margin(rf, trainData$Species))
## Loading required package: RColorBrewer
library(randomForest)
library(MASS)
data(fgl)
set.seed(17)
# do.trace = 100 prints the OOB error (overall and per class) every 100 trees
fgl.rf <- randomForest(type ~ ., data = fgl, mtry = 2, importance = TRUE, do.trace = 100)
## ntree OOB 1 2 3 4 5 6
## 100: 23.83% 14.29% 26.32% 70.59% 23.08% 22.22% 13.79%
## 200: 20.09% 10.00% 19.74% 70.59% 23.08% 22.22% 13.79%
## 300: 20.56% 10.00% 23.68% 64.71% 23.08% 22.22% 10.34%
## 400: 18.69% 10.00% 18.42% 58.82% 23.08% 22.22% 13.79%
## 500: 19.16% 10.00% 19.74% 58.82% 23.08% 22.22% 13.79%
print(fgl.rf)
##
## Call:
## randomForest(formula = type ~ ., data = fgl, mtry = 2, importance = TRUE, do.trace = 100)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.16%
## Confusion matrix:
## WinF WinNF Veh Con Tabl Head class.error
## WinF 63 6 1 0 0 0 0.1000
## WinNF 10 61 1 2 1 1 0.1974
## Veh 8 2 7 0 0 0 0.5882
## Con 0 2 0 10 0 1 0.2308
## Tabl 0 2 0 0 7 0 0.2222
## Head 1 3 0 0 0 25 0.1379
# OOB = Out-Of-Bag: each case is predicted only by trees whose bootstrap
# sample did not include it, giving an error estimate without a separate test set
library(ipred)
## Loading required package: rpart
## Loading required package: survival
## Loading required package: splines
## Loading required package: nnet
## Loading required package: class
## Loading required package: prodlim
## KernSmooth 2.23 loaded Copyright M. P. Wand 1997-2009
set.seed(131)
error.RF <- numeric(10)
# 10 replications of errorest's default 10-fold cross-validation error estimate
for (i in 1:10) error.RF[i] <- errorest(type ~ ., data = fgl, model = randomForest,
    mtry = 2)$error
summary(error.RF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.182 0.194 0.208 0.207 0.215 0.238
library(e1071)
set.seed(563)
error.SVM <- numeric(10)
# the same cross-validation estimate for an SVM with fixed cost and gamma
for (i in 1:10) error.SVM[i] <- errorest(type ~ ., data = fgl, model = svm,
    cost = 10, gamma = 1.5)$error
summary(error.SVM)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.364 0.379 0.381 0.379 0.383 0.388
# columns of fgl.rf$importance: one permutation measure per class, then
# MeanDecreaseAccuracy and MeanDecreaseGini; plot the first four columns
par(mfrow = c(2, 2))
for (i in 1:4) plot(sort(fgl.rf$importance[, i], dec = TRUE), type = "h",
    main = paste("Measure", i))
data(Boston)
set.seed(1341)
# regression forest; mtry defaults to floor(p/3) = 4 variables per split
BH.rf <- randomForest(medv ~ ., Boston)
print(BH.rf)
##
## Call:
## randomForest(formula = medv ~ ., data = Boston)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 9.914
## % Var explained: 88.26
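A possible follow-up: partialPlot from the randomForest package shows the partial dependence of the fitted regression forest on a single predictor, for example the Boston column lstat.
partialPlot(BH.rf, Boston, lstat)   # partial dependence of medv on lstat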