很多人看到Machine learning会说,我只有几个样本,没有办法做Machine learing。事实真的是这个样子吗?事实不是这样子的。其实一个样本只需要两组数据,比如一个个体的ChIP-seq(或者BS-seq)数据和RNA-seq数据,我就可以来做Machine learning来回答生物学问题。
Really? 是的,你没有看错。想知道怎么做吗,看我娓娓道来。先预告本文不会涉及如何使用一个个体的ChIP-seq(或者BS-seq)数据和RNA-seq数据来做机器学习。在后续的文章我会介绍,本文我会主要用一个R实例来揭开你机器学习的庐山真面目。后面的一系列文章我会介绍如何使用一个个体的ChIP-seq(或者BS-seq)数据和RNA-seq数据进行机器学习来解决生物学问题。如果你对此感兴趣,敬请关注本公众号。
接下来我们会一步步介绍如果用R包caret进行机器学习建模。
caret简介R包caret是(Classification And REgression Training)的简称。该包包含一系列的函数,通过这些函数是机器学习中建模过程更简单。
这个包包含的工具有:
本文需要的R包如下:
caret ## 进行机器学习建模
AppliedPredictiveModeling ##展示数据
ellipse ## AppliedPredictiveModeling依赖此R包
本文需要的数据iris。Iris也称鸢尾花卉数据集,是一类多重变量分析的数据集。通过花萼长度,花萼宽度,花瓣长度,花瓣宽度4个属性预测鸢尾花卉属于(Setosa,Versicolour,Virginica)三个种类中的哪一类。本文主要介绍如何使用机器学习通过花萼长度,花萼宽度,花瓣长度,花瓣宽度4个特征变量来预测花的种类。
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
dim(iris)
## [1] 150 5
通过上面的R代码,我们可以发现该数据集共有150个个体,4个属性(特征变量),三个类别。
通常在对数据进行建模前,首先需要对数据中的变量进行初步的了解。对于iris这样的数据我们可以使用AppliedPredictiveModeling来生成散点图,
#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.3.3
transparentTheme(trans = .4)
## Warning: package 'lattice' was built under R version 3.3.2
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.2
featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "pairs",
## Add a key at the top
auto.key = list(columns = 3))
#install.packages("ellipse")
featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "ellipse",
## Add a key at the top
auto.key = list(columns = 3))
featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "density",
## Pass in options to xyplot() to
## make it prettier
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1.5,
pch = "|",
layout = c(4, 1),
auto.key = list(columns = 3))
featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "box",
## Pass in options to bwplot()
scales = list(y = list(relation="free"),
x = list(rot = 90)),
layout = c(4,1 ),
auto.key = list(columns = 2))
在本文章,我们使用Random forest进行数据建模。Random forest的一个很大优势是可以计算出每个特征变量的重要性(Variabe importance)。这样可以帮助我们进行特征选择。
set.seed(186)
train_index <- createDataPartition(iris$Species, p = 0.75, , times=1, list = FALSE)
train_set = iris[train_index, ]
test_set = iris[-train_index, ]
## Resampling: Bootstrapped (25 reps) by defaultcross
fit_rf <- train(Species ~ ., data=train_set, method='rf')
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.3.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
fit_rf
## Random Forest
##
## 114 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 114, 114, 114, 114, 114, 114, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9557937 0.9327467
## 3 0.9568748 0.9343695
## 4 0.9526572 0.9280205
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
rfVarImp = varImp(fit_rf)
rfVarImp
## rf variable importance
##
## Overall
## Petal.Width 100.0000
## Petal.Length 98.9277
## Sepal.Length 0.5247
## Sepal.Width 0.0000
set.seed(186)
train_index <- createDataPartition(iris$Species, p = 0.75, , times=1, list = FALSE)
train_set = iris[train_index, ]
test_set = iris[-train_index, ]
fit_rf_cv <- train(Species ~ ., data=train_set, method='rf', metric = "Accuracy",
trControl=trainControl(method="cv",number=5))
fit_rf_cv
## Random Forest
##
## 114 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 92, 91, 90, 92, 91
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9398221 0.9094760
## 3 0.9485178 0.9226563
## 4 0.9485178 0.9226563
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
##计算特征变量的重要性
rfVarImpcv = varImp(fit_rf_cv)
rfVarImpcv
## rf variable importance
##
## Overall
## Petal.Width 100.0000
## Petal.Length 93.8615
## Sepal.Length 0.5066
## Sepal.Width 0.0000
通过结果我们可以发现在四个变量中最重要的变量是Petal.Width,其次是Petal.Length。
前面讲到我们使用Randm forest在训练数据上进行建模,接着通常我们需要对模型在测试数据上进行测试。
具体代码如下:
## 在模型上对测试数据进行预测
test_set$predict_rf <- predict(fit_rf_cv, test_set, "raw")
## 对预测的结果进行评估
confusionMatrix(test_set$predict_rf, test_set$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 11 1
## virginica 0 1 11
##
## Overall Statistics
##
## Accuracy : 0.9444
## 95% CI : (0.8134, 0.9932)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.728e-14
##
## Kappa : 0.9167
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9167 0.9167
## Specificity 1.0000 0.9583 0.9583
## Pos Pred Value 1.0000 0.9167 0.9167
## Neg Pred Value 1.0000 0.9583 0.9583
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3056 0.3056
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9375 0.9375
通过结果我们可以看到该模型的准确性是0.94,95%置信区间是 : (0.8134, 0.9932)。
Reference:
caret demostration with kaggle bikeshare data (I)
https://rpubs.com/chengjiun/52658
An example of using Random Forest in Caret with R.
http://bigcomputing.blogspot.com/2014/10/an-example-of-using-random-forest-in.html
https://topepo.github.io/caret/model-training-and-tuning.html
The caret Package https://topepo.github.io/caret/model-training-and-tuning.html
How to Perform a Logistic Regression in R
https://datascienceplus.com/perform-logistic-regression-in-r/