Random Forest (RF) is an ensemble learner. It uses Bootstrap resampling to draw multiple samples from the original data, fits a decision tree to each sample, and then combines the trees: the final regression or classification prediction is obtained by averaging (Mean) the tree outputs for regression or by voting (Vote) for classification.
A large body of theoretical and empirical research has demonstrated the strengths of random forests.
Studying random forests necessarily involves decision trees, because the base learner of a random forest is an unpruned decision tree. Decision trees are described in a separate document.
Random forests were developed to overcome the many weaknesses of a single decision tree in regression and classification. Following the idea of combining individual tree learners into a larger ensemble, many decision trees are grown; no single tree needs to be highly accurate, and the final decision is reached by having all trees vote.
The main steps in building a Random Forest:
(1) Each decision tree is trained on its own training set. To build M decision trees, M training sets must be generated, and producing these M training subsets from the original training set requires statistical sampling. Many sampling techniques exist; they fall into two main kinds according to whether sampling is done with or without replacement.
(2) Bagging and Boosting are both sampling-with-replacement methods, but they differ in important ways:
in Boosting the learners are built in a serial relationship, which is a major challenge for execution, because each round has to wait for the result of the previous one before it can proceed. Bagging has no such dependency, which makes it well suited to parallel processing. (3) When growing the forest, the random forest algorithm mainly uses the bagging method, i.e. Bootstrap sampling.
The random forest algorithm builds one decision tree on each Bootstrap training subset, generating M decision trees that together form the "forest". Each tree is grown to full size without pruning. Two main processes are involved:
(1) Node splitting
(2) Random selection of the candidate feature variables at each split (see the sketch after this list)
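The two steps above can be illustrated with a minimal hand-rolled sketch: bootstrap resampling of the rows, a random subset of mtry features per tree (drawn per tree rather than per split, which is a simplification), and a majority vote over the trees. This is an illustrative sketch only, using the rpart package as the base tree learner; it is not how the randomForest package is implemented.

library(rpart)   # base tree learner, used purely for illustration

set.seed(1)
M    <- 25                                    # number of trees in the "forest"
mtry <- floor(sqrt(ncol(iris) - 1))           # number of candidate features (here 2)

trees <- vector("list", M)

for (m in seq_len(M)) {
  boot  <- sample(nrow(iris), replace = TRUE)            # (a) Bootstrap sample of rows
  feats <- sample(names(iris)[1:4], mtry)                # (b) random feature subset (per tree, a simplification)
  fml   <- reformulate(feats, response = "Species")
  trees[[m]] <- rpart(fml, data = iris[boot, ],
                      control = rpart.control(cp = 0))   # grow fully, no pruning
}

# Combine the trees by majority vote
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)   # resubstitution accuracy of the hand-rolled ensemble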
The randomForest package is loaded with library(randomForest); its key tuning arguments are ntree and mtry. Load randomForest together with the other packages used below:
library(caret)
library(ggplot2)
library(randomForest)
## Implements Breiman's random forest algorithm (based on Breiman and Cutler's original Fortran code) for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.
randomForest(formula, data = NULL,
#x, y = NULL,
#xtest = NULL, ytest = NULL,
#subset,
#na.action = na.fail,
ntree = 500,
mtry = if (!is.null(y) && !is.factor(y))
max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
#replace = TRUE,
#classwt = NULL,
#cutoff,
#strata,
sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
maxnodes = NULL,
importance = FALSE,
localImp = FALSE,
nPerm = 1,
proximity,
oob.prox = proximity,
norm.votes = TRUE,
do.trace = FALSE,
keep.forest = !is.null(y) && is.null(xtest),
corr.bias = FALSE,
keep.inbag = FALSE,
...)
## Print method for 'randomForest' objects
print(rf_Model, ...)
# Plot method for randomForest objects
plot(rf_Model, type = "l", main = "")
# Extract variable importance measure
importance(rf_Model, type = NULL, class = NULL, scale = TRUE)  # type = 1: mean decrease in accuracy; type = 2: mean decrease in node impurity (Gini)
# predict method for random forest objects
predict(rf_Model,
test,
type = "response",
norm.votes = TRUE,
predict.all = FALSE,
proximity = FALSE,
nodes = FALSE,
cutoff)
# Random Forest Cross-Validation for feature selection
rfcv(trainx, trainy,
cv.fold = 5,
scale = "log",
step = 0.5,
mtry = function(p) max(1, floor(sqrt(p))),
recursive = FALSE,
...)
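rfcv is not demonstrated in the worked examples below; a minimal usage sketch on the built-in iris data (the seed and plot styling are illustrative choices) could look like this:

# Cross-validated error for nested subsets of predictors, on the iris data
set.seed(123)
iris.cv <- rfcv(trainx = iris[, 1:4], trainy = iris$Species, cv.fold = 5)
iris.cv$error.cv    # CV error rate for each number of retained variables
with(iris.cv, plot(n.var, error.cv, type = "b", log = "x",
                   xlab = "Number of variables", ylab = "CV error rate"))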
# Size of trees in an ensemble
treesize(x, terminal = TRUE)
# Tune randomForest for the optimal mtry parameter
tuneRF(credit[, -21],
credit[, 21],
mtryStart,
ntreeTry = 50,
stepFactor = 2,
improve = 0.05,
trace = TRUE,
plot = TRUE,
doBest = FALSE, ...)
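The call above refers to the credit data prepared further down. tuneRF starts from mtryStart and repeatedly multiplies or divides mtry by stepFactor, keeping a new value only if the OOB error improves by at least improve. As a self-contained sketch, a minimal run on the iris data (the argument values are illustrative choices, not recommendations):

set.seed(42)
best.rf <- tuneRF(iris[, -5], iris[, 5],
                  mtryStart  = 2,      # starting value of mtry
                  ntreeTry   = 100,    # trees per trial forest
                  stepFactor = 1.5,    # multiply/divide mtry by this factor each step
                  improve    = 0.01,   # required relative OOB-error improvement
                  doBest     = TRUE)   # refit and return the forest with the best mtry
print(best.rf)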
# Variable Importance Plot
varImpPlot(rf_Model,
sort = TRUE,
n.var = min(30, nrow(rf_Model$importance)),
type=NULL,
class=NULL,
scale=TRUE,
main=deparse(substitute(x)))
# Variables used in a random forest
varUsed(rf_Model, by.tree = FALSE, count = TRUE)
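Neither varImpPlot nor varUsed appears in the worked examples below; a minimal sketch using a forest fitted with importance = TRUE (the seed and plot title are illustrative):

set.seed(7)
rf_iris <- randomForest(Species ~ ., data = iris, importance = TRUE)
varImpPlot(rf_iris, sort = TRUE, main = "Variable importance (iris)")
varUsed(rf_iris, by.tree = FALSE, count = TRUE)   # number of times each variable is split on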
## German Credit data
data = read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data",
header = FALSE,
sep = "")
names(data) = c(paste0(rep("x", 20), 1:20), "y")
data$y = as.integer(data$y) - 1
library(magrittr)
data = sapply(data, as.integer) %>%
as.data.frame() %>%
sapply(function(x) x/max(x)) %>%
as.data.frame()
head(data)
## x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12
## 1 0.25 0.08333333 1.0 0.5 0.06344985 1.0 1.0 1.00 0.75 0.3333333 1.00 0.25
## 2 0.50 0.66666667 0.6 0.5 0.32300261 0.2 0.6 0.50 0.50 0.3333333 0.50 0.25
## 3 1.00 0.16666667 1.0 0.8 0.11376465 0.2 0.8 0.50 0.75 0.3333333 0.75 0.25
## 4 0.25 0.58333333 0.6 0.4 0.42781155 0.2 0.8 0.50 0.75 1.0000000 1.00 0.50
## 5 0.25 0.33333333 0.8 0.1 0.26432914 0.2 0.6 0.75 0.75 0.3333333 1.00 1.00
## 6 1.00 0.50000000 0.6 0.8 0.49147851 1.0 0.6 0.50 0.75 0.3333333 1.00 1.00
## x13 x14 x15 x16 x17 x18 x19 x20 y
## 1 0.8933333 1 0.6666667 0.50 0.75 0.5 1.0 0.5 0
## 2 0.2933333 1 0.6666667 0.25 0.75 0.5 0.5 0.5 1
## 3 0.6533333 1 0.6666667 0.25 0.50 1.0 0.5 0.5 0
## 4 0.6000000 1 1.0000000 0.25 0.75 1.0 0.5 0.5 0
## 5 0.7066667 1 1.0000000 0.50 0.75 1.0 0.5 0.5 1
## 6 0.4666667 1 1.0000000 0.25 0.50 1.0 1.0 0.5 0
# rf_Model = randomForest(y ~ ., data = data, ntree = 500)
Method: randomForest Data: credit
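The model fit above is left commented out. A minimal sketch of actually fitting and inspecting it: note that after the rescaling step y is numeric (0/1), so it is converted back to a factor here so that randomForest performs classification rather than regression (this conversion is an addition, not part of the original code):

data$y <- as.factor(data$y)   # assumption: treat y as a class label for classification
rf_Model <- randomForest(y ~ ., data = data, ntree = 500, importance = TRUE)
print(rf_Model)               # OOB error estimate and confusion matrix
varImpPlot(rf_Model)          # which of x1..x20 matter most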
## Classification:
## data(iris)
set.seed(71)
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)
print(iris.rf)
##
## Call:
## randomForest(formula = Species ~ ., data = iris, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.33%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 46 4 0.08
## virginica 0 4 46 0.08
## Look at variable importance:
round(importance(iris.rf), 2)
## setosa versicolor virginica MeanDecreaseAccuracy
## Sepal.Length 6.04 7.85 7.93 11.51
## Sepal.Width 4.40 1.03 5.44 5.40
## Petal.Length 21.76 31.33 29.64 32.94
## Petal.Width 22.84 32.67 31.68 34.50
## MeanDecreaseGini
## Sepal.Length 8.77
## Sepal.Width 2.19
## Petal.Length 42.54
## Petal.Width 45.77
## Do MDS on 1 - proximity:
iris.mds <- cmdscale(1 - iris.rf$proximity, eig=TRUE)
op <- par(pty="s")
pairs(cbind(iris[,1:4], iris.mds$points), cex=0.6, gap=0,
col=c("red", "green", "blue")[as.numeric(iris$Species)],
main="Iris Data: Predictors and MDS of Proximity Based on RandomForest")
par(op)
print(iris.mds$GOF)
## [1] 0.7282700 0.7903363
## The `unsupervised' case:
set.seed(17)
iris.urf <- randomForest(iris[, -5])
MDSplot(iris.urf, iris$Species)
## Loading required package: RColorBrewer
## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
(iris.rf2 <- randomForest(iris[1:4], iris$Species,
sampsize=c(20, 30, 20)))
##
## Call:
## randomForest(x = iris[1:4], y = iris$Species, sampsize = c(20, 30, 20))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.33%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 5 45 0.10
## Regression:
## data(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., data=airquality, mtry=3,
importance=TRUE, na.action=na.omit)
print(ozone.rf)
##
## Call:
## randomForest(formula = Ozone ~ ., data = airquality, mtry = 3, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 303.8304
## % Var explained: 72.31
## Show "importance" of variables: higher value mean more important:
round(importance(ozone.rf), 2)
## %IncMSE IncNodePurity
## Solar.R 11.09 10534.24
## Wind 23.50 43833.13
## Temp 42.03 55218.05
## Month 4.07 2032.65
## Day 2.63 7173.19
## "x" can be a matrix instead of a data frame:
set.seed(17)
x <- matrix(runif(5e2), 100)
y <- gl(2, 50)
(myrf <- randomForest(x, y))
##
## Call:
## randomForest(x = x, y = y)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 45%
## Confusion matrix:
## 1 2 class.error
## 1 30 20 0.4
## 2 25 25 0.5
(predict(myrf, x))
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 91 92 93 94 95 96 97 98 99 100
## 2 2 2 2 2 2 2 2 2 2
## Levels: 1 2
## "complicated" formula:
(swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
data=swiss))
##
## Call:
## randomForest(formula = sqrt(Fertility) ~ . - Catholic + I(Catholic < 50), data = swiss)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.3207372
## % Var explained: 45.54
(predict(swiss.rf, swiss))
## Courtelary Delemont Franches-Mnt Moutier Neuveville
## 8.544219 8.977287 9.118769 8.781467 8.522875
## Porrentruy Broye Glane Gruyere Sarine
## 8.924009 8.888588 9.140677 8.952968 8.875341
## Veveyse Aigle Aubonne Avenches Cossonay
## 9.114092 7.936236 8.328221 8.229574 8.023034
## Echallens Grandson Lausanne La Vallee Lavaux
## 8.274981 8.434141 7.742152 7.598992 8.180066
## Morges Moudon Nyone Orbe Oron
## 7.987572 8.354154 7.860249 7.873372 8.483563
## Payerne Paysd'enhaut Rolle Vevey Yverdon
## 8.551495 8.410619 8.035697 7.846781 8.363983
## Conthey Entremont Herens Martigwy Monthey
## 8.844938 8.657802 8.776533 8.570825 8.892461
## St Maurice Sierre Sion Boudry La Chauxdfnd
## 8.477949 8.919503 8.664517 8.277097 8.033753
## Le Locle Neuchatel Val de Ruz ValdeTravers V. De Geneve
## 8.170280 7.760338 8.553581 8.152936 6.781376
## Rive Droite Rive Gauche
## 7.500274 7.194497
## Test use of a 53-level factor as a predictor:
set.seed(1)
x <- data.frame(x1=gl(53, 10), x2=runif(530), y=rnorm(530))
(rf1 <- randomForest(x[-3], x[[3]], ntree=10))
##
## Call:
## randomForest(x = x[-3], y = x[[3]], ntree = 10)
## Type of random forest: regression
## Number of trees: 10
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 1.49581
## % Var explained: -34.99
## Grow no more than 4 nodes per tree:
(treesize(randomForest(Species ~ ., data=iris, maxnodes=4, ntree=30)))
## [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## test proximity in regression
iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)
str(iris.rrf$proximity)
## num [1:150, 1:150] 1 0.337 0.327 0.356 0.891 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:150] "1" "2" "3" "4" ...
## ..$ : chr [1:150] "1" "2" "3" "4" ...
Method: caret Data: iris
data(iris)
inTrain = createDataPartition(y = iris$Species, p = 0.8, list = FALSE)
training = iris[inTrain, ]
testing = iris[-inTrain, ]
modFit = train(Species ~ ., data = training, method = "rf", prox = TRUE)
modFit
## Random Forest
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 120, 120, 120, 120, 120, 120, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9444936 0.9155944 0.03712069 0.05657357
## 3 0.9470468 0.9194600 0.03640573 0.05556211
## 4 0.9425767 0.9127153 0.03709553 0.05650571
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
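Here train chose mtry by bootstrap resampling over a small default grid. A sketch of making that search explicit is shown below; the trainControl and tuneGrid values are illustrative choices, not the defaults used above:

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold CV instead of the default bootstrap
grid <- expand.grid(mtry = 1:4)                   # candidate mtry values for method = "rf"
modFit_cv <- train(Species ~ ., data = training, method = "rf",
                   trControl = ctrl, tuneGrid = grid, ntree = 500)
modFit_cv$bestTune                                # selected mtry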
getTree(modFit$finalModel, k = 2)
## left daughter right daughter split var split point status prediction
## 1 2 3 4 0.80 1 0
## 2 0 0 0 0.00 -1 1
## 3 4 5 3 4.85 1 0
## 4 6 7 4 1.65 1 0
## 5 8 9 4 1.65 1 0
## 6 0 0 0 0.00 -1 2
## 7 10 11 2 3.10 1 0
## 8 12 13 3 4.95 1 0
## 9 0 0 0 0.00 -1 3
## 10 0 0 0 0.00 -1 3
## 11 0 0 0 0.00 -1 2
## 12 0 0 0 0.00 -1 2
## 13 0 0 0 0.00 -1 3
irisP = classCenter(training[, c(3, 4)], training$Species, modFit$finalModel$prox)
irisP = as.data.frame(irisP)
irisP$Species = rownames(irisP)
p = qplot(Petal.Width, Petal.Length, data = training, col = Species)
p + geom_point(aes(x = Petal.Width, y = Petal.Length, col = Species), size = 5,
shape = 4, data = irisP)
pred = predict(modFit, testing)
testing$predRight = pred == testing$Species
confusionMatrix(pred, testing$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 0
## virginica 0 1 10
##
## Overall Statistics
##
## Accuracy : 0.9667
## 95% CI : (0.8278, 0.9992)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.963e-13
##
## Kappa : 0.95
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 1.0000
## Specificity 1.0000 1.0000 0.9500
## Pos Pred Value 1.0000 1.0000 0.9091
## Neg Pred Value 1.0000 0.9524 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.3333
## Detection Prevalence 0.3333 0.3000 0.3667
## Balanced Accuracy 1.0000 0.9500 0.9750
p1 = ggplot(data = testing, aes(x = Petal.Width, y = Petal.Length,
color = predRight))
p1 +
geom_point() +
ggtitle("newdata Predictions") +
theme_bw()