siero dataの解析

以下の発表で紹介したデータについて解析したフローを示します。
発表スライドに詳細を追記し、分析手法も追加して比較しました。

データ: https://gist.github.com/siero5335/c7323407efb2b0b20b55
スライド: http://www.slideshare.net/eguchiakifumi/up-tokyor-48-lt

はじめにdevtools::install_github(“dichika/jaguchi”)で
jaguchiパッケージをインストールしておきます。

パッケージインストール・読み込み

library(jaguchi)
library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

データ読み込み・前処理

目的変数をfactorにしたり、naを削除したりします。
もっとエレガントに書けるぜ！という方がいればご指導ください。

siero <-jaguchi("siero")

## Loading required package: RCurl
## Loading required package: bitops

y <- siero$Food
y1 <- as.factor(y)

x <- data.matrix(siero[,-c(1,2)], rownames.force  = NA)
x1 <- t(x)
x2 <- na.omit(x1)
x3 <- t(x2)

いくつかの手法で解析・比較

PLS, Elastic net, Randomforestで解析してみます。

PLS

set.seed(12)
con<-trainControl(method ="LOOCV",
                  preProc = c("center", "scale"))
plsGrid <- expand.grid(ncomp = seq(1:10))

plsmetabo<- train(x3, y1, method ="pls",
               tuneGrid = plsGrid,
               trControl = con)

## Loading required package: pls
## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:caret':
## 
##     R2
## 
## The following object is masked from 'package:stats':
## 
##     loadings

plsmetabo

## Partial Least Squares 
## 
##  12 samples
## 261 predictors
##   2 classes: 'After', 'Before' 
## 
## No pre-processing
## Resampling: 
## 
## Summary of sample sizes: 11, 11, 11, 11, 11, 11, ... 
## 
## Resampling results across tuning parameters:
## 
##   ncomp  Accuracy  Kappa
##    1     1         1    
##    2     1         1    
##    3     1         1    
##    4     1         1    
##    5     1         1    
##    6     1         1    
##    7     1         1    
##    8     1         1    
##    9     1         1    
##   10     1         1    
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was ncomp = 1.

plot(varImp(plsmetabo), top = 10)

Enet

#elastic net
set.seed(12)
con<-trainControl(method ="LOOCV",
                  preProc = c("center", "scale"))

train_grid = expand.grid(alpha = 1:10 * 0.1, lambda = 10^{1:5 * -1})

enet_fit = train(x3,y1,  
                 method = "glmnet", 
                 tuneGrid = train_grid,
                 trControl=con)

## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-2

enet_fit

## glmnet 
## 
##  12 samples
## 261 predictors
##   2 classes: 'After', 'Before' 
## 
## No pre-processing
## Resampling: 
## 
## Summary of sample sizes: 11, 11, 11, 11, 11, 11, ... 
## 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  Accuracy  Kappa
##   0.1    1e-05   1         1    
##   0.1    1e-04   1         1    
##   0.1    1e-03   1         1    
##   0.1    1e-02   1         1    
##   0.1    1e-01   1         1    
##   0.2    1e-05   1         1    
##   0.2    1e-04   1         1    
##   0.2    1e-03   1         1    
##   0.2    1e-02   1         1    
##   0.2    1e-01   1         1    
##   0.3    1e-05   1         1    
##   0.3    1e-04   1         1    
##   0.3    1e-03   1         1    
##   0.3    1e-02   1         1    
##   0.3    1e-01   1         1    
##   0.4    1e-05   1         1    
##   0.4    1e-04   1         1    
##   0.4    1e-03   1         1    
##   0.4    1e-02   1         1    
##   0.4    1e-01   1         1    
##   0.5    1e-05   1         1    
##   0.5    1e-04   1         1    
##   0.5    1e-03   1         1    
##   0.5    1e-02   1         1    
##   0.5    1e-01   1         1    
##   0.6    1e-05   1         1    
##   0.6    1e-04   1         1    
##   0.6    1e-03   1         1    
##   0.6    1e-02   1         1    
##   0.6    1e-01   1         1    
##   0.7    1e-05   1         1    
##   0.7    1e-04   1         1    
##   0.7    1e-03   1         1    
##   0.7    1e-02   1         1    
##   0.7    1e-01   1         1    
##   0.8    1e-05   1         1    
##   0.8    1e-04   1         1    
##   0.8    1e-03   1         1    
##   0.8    1e-02   1         1    
##   0.8    1e-01   1         1    
##   0.9    1e-05   1         1    
##   0.9    1e-04   1         1    
##   0.9    1e-03   1         1    
##   0.9    1e-02   1         1    
##   0.9    1e-01   1         1    
##   1.0    1e-05   1         1    
##   1.0    1e-04   1         1    
##   1.0    1e-03   1         1    
##   1.0    1e-02   1         1    
##   1.0    1e-01   1         1    
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.1.

plot(varImp(plsmetabo), top = 10)

Randomforest

set.seed(12)
con<-trainControl(method ="LOOCV",
                  preProc = c("center", "scale"))

rfGrid <- expand.grid(mtry = seq(1:10))

rfmetabo <- train(x3,y1, 
                  method ="rf", 
                  tuneGrid = rfGrid ,
                  trControl = con, importance = TRUE)

## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

plot(varImp(rfmetabo), top = 10)

いずれの手法でも食前、食後をしっかり判別することができているようです。

しかしアルゴリズムによって変数重要度に入ってくる要素がバラバラですね…
この手の解析の場合、一般的にどうやって解釈してるんでしょう。

一番予測が上手く行った手法があれば話が早いのかもですが、
今回のようにどの手法でも上手く判別できたり、
どの手法でもほぼ同じ値が得られた場合には困るかも知れません。

中でもRFは乱数次第で結果が大きく変わります。

Randomforest 2

set.seed(1234)
con<-trainControl(method ="LOOCV",
                  preProc = c("center", "scale"))

rfGrid <- expand.grid(mtry = seq(1:10))

rfmetabo <- train(x3,y1, 
                  method ="rf", 
                  tuneGrid = rfGrid ,
                  trControl = con, importance = TRUE)

plot(varImp(rfmetabo), top = 10)

結果の再現はできるとはいえ、生体内の機能を解析したい場合には
VIPが固定で出てくる手法のほうが好まれるかもしれません。

もっと分離が難しかったり、
測定サンプル数がもうちょっと多ければ
話は別なのかもしれませんがどうしたものやら。