导言

很多人看到Machine learning会说，我只有几个样本，没有办法做Machine learing。事实真的是这个样子吗？事实不是这样子的。其实一个样本只需要两组数据，比如一个个体的ChIP-seq（或者BS-seq）数据和RNA-seq数据，我就可以来做Machine learning来回答生物学问题。

Really? 是的，你没有看错。想知道怎么做吗，看我娓娓道来。先预告本文不会涉及如何使用一个个体的ChIP-seq（或者BS-seq）数据和RNA-seq数据来做机器学习。在后续的文章我会介绍，本文我会主要用一个R实例来揭开你机器学习的庐山真面目。后面的一系列文章我会介绍如何使用一个个体的ChIP-seq（或者BS-seq）数据和RNA-seq数据进行机器学习来解决生物学问题。如果你对此感兴趣，敬请关注本公众号。

接下来我们会一步步介绍如果用R包caret进行机器学习建模。

R包`caret`简介

R包caret是（Classification And REgression Training）的简称。该包包含一系列的函数，通过这些函数是机器学习中建模过程更简单。

这个包包含的工具有：

数据分割
数据预处理
特征选择
通过重采样方法来调试模型
特征重要性评估

导入需要的R包

本文需要的R包如下：

caret   ## 进行机器学习建模
AppliedPredictiveModeling  ##展示数据
ellipse  ##　AppliedPredictiveModeling依赖此R包

本文需要的数据iris。Iris也称鸢尾花卉数据集，是一类多重变量分析的数据集。通过花萼长度，花萼宽度，花瓣长度，花瓣宽度4个属性预测鸢尾花卉属于（Setosa，Versicolour，Virginica）三个种类中的哪一类。本文主要介绍如何使用机器学习通过花萼长度，花萼宽度，花瓣长度，花瓣宽度4个特征变量来预测花的种类。

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

dim(iris)

## [1] 150   5

通过上面的R代码，我们可以发现该数据集共有150个个体，4个属性（特征变量），三个类别。

数据可视化展示

通常在对数据进行建模前，首先需要对数据中的变量进行初步的了解。对于iris这样的数据我们可以使用AppliedPredictiveModeling来生成散点图，

散点图

#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.3.3

transparentTheme(trans = .4)

## Warning: package 'lattice' was built under R version 3.3.2

library(caret)

## Warning: package 'caret' was built under R version 3.3.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.3.2

featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "pairs",
            ## Add a key at the top
            auto.key = list(columns = 3))

使用椭圆进行分类的散点图（Scatterplot Matrix with Ellipses）

#install.packages("ellipse")
featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "ellipse",
            ## Add a key at the top
            auto.key = list(columns = 3))

数据变量的分布密度图

featurePlot(x = iris[, 1:4], 
            y = iris$Species,
            plot = "density", 
            ## Pass in options to xyplot() to 
            ## make it prettier
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")), 
            adjust = 1.5, 
            pch = "|", 
            layout = c(4, 1), 
            auto.key = list(columns = 3))

箱线图

featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "box", 
            ## Pass in options to bwplot() 
            scales = list(y = list(relation="free"),
                          x = list(rot = 90)),  
            layout = c(4,1 ), 
            auto.key = list(columns = 2))

Random forest 建模

在本文章，我们使用Random forest进行数据建模。Random forest的一个很大优势是可以计算出每个特征变量的重要性（Variabe importance)。这样可以帮助我们进行特征选择。

通过训练数据（Training data）进行建模

set.seed(186)
train_index <- createDataPartition(iris$Species, p = 0.75, , times=1, list = FALSE)

train_set = iris[train_index, ]
test_set  = iris[-train_index, ]
## Resampling: Bootstrapped (25 reps) by defaultcross 
fit_rf <- train(Species ~ ., data=train_set, method='rf')

## Loading required package: randomForest

## Warning: package 'randomForest' was built under R version 3.3.2

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

fit_rf

## Random Forest 
## 
## 114 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 114, 114, 114, 114, 114, 114, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9557937  0.9327467
##   3     0.9568748  0.9343695
##   4     0.9526572  0.9280205
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3.

rfVarImp = varImp(fit_rf)
rfVarImp

## rf variable importance
## 
##               Overall
## Petal.Width  100.0000
## Petal.Length  98.9277
## Sepal.Length   0.5247
## Sepal.Width    0.0000

set.seed(186)
train_index <- createDataPartition(iris$Species, p = 0.75, , times=1, list = FALSE)

train_set = iris[train_index, ]
test_set  = iris[-train_index, ]
fit_rf_cv <- train(Species ~ ., data=train_set, method='rf', metric = "Accuracy",
                   trControl=trainControl(method="cv",number=5)) 
fit_rf_cv

## Random Forest 
## 
## 114 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 92, 91, 90, 92, 91 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9398221  0.9094760
##   3     0.9485178  0.9226563
##   4     0.9485178  0.9226563
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3.

##计算特征变量的重要性
rfVarImpcv = varImp(fit_rf_cv)
rfVarImpcv

## rf variable importance
## 
##               Overall
## Petal.Width  100.0000
## Petal.Length  93.8615
## Sepal.Length   0.5066
## Sepal.Width    0.0000

通过结果我们可以发现在四个变量中最重要的变量是Petal.Width，其次是Petal.Length。

在测试数据上检验模型的优劣

前面讲到我们使用Randm forest在训练数据上进行建模，接着通常我们需要对模型在测试数据上进行测试。

具体代码如下：

## 在模型上对测试数据进行预测
test_set$predict_rf <- predict(fit_rf_cv, test_set, "raw")
## 对预测的结果进行评估
confusionMatrix(test_set$predict_rf, test_set$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         11         1
##   virginica       0          1        11
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.8134, 0.9932)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.728e-14       
##                                           
##                   Kappa : 0.9167          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9167           0.9167
## Specificity                 1.0000            0.9583           0.9583
## Pos Pred Value              1.0000            0.9167           0.9167
## Neg Pred Value              1.0000            0.9583           0.9583
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3056           0.3056
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9375           0.9375

通过结果我们可以看到该模型的准确性是0.94，95%置信区间是 : (0.8134, 0.9932)。

Reference:

caret demostration with kaggle bikeshare data (I)
https://rpubs.com/chengjiun/52658
An example of using Random Forest in Caret with R.
http://bigcomputing.blogspot.com/2014/10/an-example-of-using-random-forest-in.html
https://topepo.github.io/caret/model-training-and-tuning.html
The caret Package https://topepo.github.io/caret/model-training-and-tuning.html
How to Perform a Logistic Regression in R
https://datascienceplus.com/perform-logistic-regression-in-r/

Caret实例-Random Forest

导言

R包caret简介