导言

很多人看到Machine learning会说,我只有几个样本,没有办法做Machine learing。事实真的是这个样子吗?事实不是这样子的。其实一个样本只需要两组数据,比如一个个体的ChIP-seq(或者BS-seq)数据和RNA-seq数据,我就可以来做Machine learning来回答生物学问题。

Really? 是的,你没有看错。想知道怎么做吗,看我娓娓道来。先预告本文不会涉及如何使用一个个体的ChIP-seq(或者BS-seq)数据和RNA-seq数据来做机器学习。在后续的文章我会介绍,本文我会主要用一个R实例来揭开你机器学习的庐山真面目。后面的一系列文章我会介绍如何使用一个个体的ChIP-seq(或者BS-seq)数据和RNA-seq数据进行机器学习来解决生物学问题。如果你对此感兴趣,敬请关注本公众号。

接下来我们会一步步介绍如果用R包caret进行机器学习建模。

R包caret简介

R包caret是(Classification And REgression Training)的简称。该包包含一系列的函数,通过这些函数是机器学习中建模过程更简单。

这个包包含的工具有:

导入需要的R包

本文需要的R包如下:

caret   ## 进行机器学习建模
AppliedPredictiveModeling  ##展示数据
ellipse  ## AppliedPredictiveModeling依赖此R包

本文需要的数据iris。Iris也称鸢尾花卉数据集,是一类多重变量分析的数据集。通过花萼长度,花萼宽度,花瓣长度,花瓣宽度4个属性预测鸢尾花卉属于(Setosa,Versicolour,Virginica)三个种类中的哪一类。本文主要介绍如何使用机器学习通过花萼长度,花萼宽度,花瓣长度,花瓣宽度4个特征变量来预测花的种类。

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
dim(iris)
## [1] 150   5

通过上面的R代码,我们可以发现该数据集共有150个个体,4个属性(特征变量),三个类别。

数据可视化展示

通常在对数据进行建模前,首先需要对数据中的变量进行初步的了解。对于iris这样的数据我们可以使用AppliedPredictiveModeling来生成散点图,

散点图

#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.3.3
transparentTheme(trans = .4)
## Warning: package 'lattice' was built under R version 3.3.2
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.2
featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "pairs",
            ## Add a key at the top
            auto.key = list(columns = 3))

使用椭圆进行分类的散点图(Scatterplot Matrix with Ellipses)

#install.packages("ellipse")
featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "ellipse",
            ## Add a key at the top
            auto.key = list(columns = 3))

数据变量的分布密度图

featurePlot(x = iris[, 1:4], 
            y = iris$Species,
            plot = "density", 
            ## Pass in options to xyplot() to 
            ## make it prettier
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")), 
            adjust = 1.5, 
            pch = "|", 
            layout = c(4, 1), 
            auto.key = list(columns = 3))

箱线图

featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "box", 
            ## Pass in options to bwplot() 
            scales = list(y = list(relation="free"),
                          x = list(rot = 90)),  
            layout = c(4,1 ), 
            auto.key = list(columns = 2))

Random forest 建模

在本文章,我们使用Random forest进行数据建模。Random forest的一个很大优势是可以计算出每个特征变量的重要性(Variabe importance)。这样可以帮助我们进行特征选择。

通过训练数据(Training data)进行建模

set.seed(186)
train_index <- createDataPartition(iris$Species, p = 0.75, , times=1, list = FALSE)

train_set = iris[train_index, ]
test_set  = iris[-train_index, ]
## Resampling: Bootstrapped (25 reps) by defaultcross 
fit_rf <- train(Species ~ ., data=train_set, method='rf') 
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.3.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
fit_rf
## Random Forest 
## 
## 114 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 114, 114, 114, 114, 114, 114, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9557937  0.9327467
##   3     0.9568748  0.9343695
##   4     0.9526572  0.9280205
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3.
rfVarImp = varImp(fit_rf)
rfVarImp
## rf variable importance
## 
##               Overall
## Petal.Width  100.0000
## Petal.Length  98.9277
## Sepal.Length   0.5247
## Sepal.Width    0.0000
set.seed(186)
train_index <- createDataPartition(iris$Species, p = 0.75, , times=1, list = FALSE)

train_set = iris[train_index, ]
test_set  = iris[-train_index, ]
fit_rf_cv <- train(Species ~ ., data=train_set, method='rf', metric = "Accuracy",
                   trControl=trainControl(method="cv",number=5)) 
fit_rf_cv
## Random Forest 
## 
## 114 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 92, 91, 90, 92, 91 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9398221  0.9094760
##   3     0.9485178  0.9226563
##   4     0.9485178  0.9226563
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3.
##计算特征变量的重要性
rfVarImpcv = varImp(fit_rf_cv)
rfVarImpcv
## rf variable importance
## 
##               Overall
## Petal.Width  100.0000
## Petal.Length  93.8615
## Sepal.Length   0.5066
## Sepal.Width    0.0000

通过结果我们可以发现在四个变量中最重要的变量是Petal.Width,其次是Petal.Length。

在测试数据上检验模型的优劣

前面讲到我们使用Randm forest在训练数据上进行建模,接着通常我们需要对模型在测试数据上进行测试。

具体代码如下:

## 在模型上对测试数据进行预测
test_set$predict_rf <- predict(fit_rf_cv, test_set, "raw")
## 对预测的结果进行评估
confusionMatrix(test_set$predict_rf, test_set$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         11         1
##   virginica       0          1        11
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.8134, 0.9932)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.728e-14       
##                                           
##                   Kappa : 0.9167          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9167           0.9167
## Specificity                 1.0000            0.9583           0.9583
## Pos Pred Value              1.0000            0.9167           0.9167
## Neg Pred Value              1.0000            0.9583           0.9583
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3056           0.3056
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9375           0.9375

通过结果我们可以看到该模型的准确性是0.94,95%置信区间是 : (0.8134, 0.9932)。

Reference:

caret demostration with kaggle bikeshare data (I)
https://rpubs.com/chengjiun/52658
An example of using Random Forest in Caret with R.
http://bigcomputing.blogspot.com/2014/10/an-example-of-using-random-forest-in.html
https://topepo.github.io/caret/model-training-and-tuning.html
The caret Package https://topepo.github.io/caret/model-training-and-tuning.html
How to Perform a Logistic Regression in R
https://datascienceplus.com/perform-logistic-regression-in-r/