NNMF

筋電の分解は、非負値行列分解 (non-negative matrix factorization [NNMF]) がしばしば利用される。こちらは全ての成分を0より大きいまま分解する方法であり、特別な制約を入れない限り、PCAとは異なり互いに直交しない成分が出てくる。

直交することのメリットは、抽出される成分が少ない、解釈が容易なこと。しかし、あくまで直交する良さは数学的な簡便さであり、それが理解に繋がるか否かは別問題である。直交しないことのメリットは、数学的な簡便さよりも、データに内在する時空間成分を抽出できる(であろう)ことにあるといえる。

NNMFは、非負のデータ行列 \(\boldsymbol{X} \in \mathbb{R}_+^{S \times T}\) を、非負のベクトル \(\boldsymbol{s}_r \in \mathbb{R}_+^{S \times 1}\)、\(\boldsymbol{t}_r \in \mathbb{R}_+^{T \times 1}\)を利用して、 \[\begin{equation} \boldsymbol{X} = \sum_{r=1}^R \lambda_r \boldsymbol{s}_r\boldsymbol{t}_r^T \end{equation}\] として分解することである。ここで、\(R\) は、いくつのベクトル要素に分解するかを表す指標である。また、筋電データの場合、多くは\(\boldsymbol{s}_r\)が筋肉のグループを表す空間情報、\(\boldsymbol{t}_r\)がその時間変動を表す時間情報として捉えられる。

実際にNNMFを実行してみよう。

## Loading required package: pkgmaker

## Loading required package: registry

## Loading required package: rngtools

## Loading required package: cluster

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## Warning: namespace 'Biobase' is not available and has been replaced
## by .GlobalEnv when processing object ''

## NMF - BioConductor layer [NO: missing Biobase] | Shared memory capabilities [NO: bigmemory] | Cores 15/16

##   To enable the Bioconductor layer, try: install.extras('
## NMF
## ') [with Bioconductor repository enabled]
##   To enable shared memory capabilities, try: install.extras('
## NMF
## ')

# library(NMF) # この読み込みは必須
X = matrix(runif(100, 0, 1), 5, 20)

# 正規化 (各空間情報の最小値が0、最大値が1になるようにする)
dimX = dim(X)
minX = apply(X, 1, min)
maxX = apply(X, 1, max)
X = (X - minX%*%matrix(1, 1, dimX[2]))/(maxX%*%matrix(1, 1, dimX[2]) - minX%*%matrix(1, 1, dimX[2]))

# Rseq = c(1, 2, 3, 4, 5) # これで複数比較することも可能ではある。
R   = 3
res = nmf(X, rank = R, .options = "t")
s   = basis(res) # sの確認 (unnormalized)
t   = coef(res) # tの確認 (unnormalized)
err = sum((X - fitted(res))^2)/sum(X^2)

par(mfrow=c(2,3))
barplot(s[,1]/sqrt(sum(s[,1]^2)))
barplot(s[,2]/sqrt(sum(s[,2]^2)))
barplot(s[,3]/sqrt(sum(s[,3]^2)))

plot(t[1,]/sqrt(sum(t[1,]^2)), type="l")
plot(t[2,]/sqrt(sum(t[2,]^2)), type="l")
plot(t[3,]/sqrt(sum(t[3,]^2)), type="l")

(結構重要な) 余談: 様々 \(R\) を決める指標は提案されているが、多くは元データを再現する割合を元に議論される。正直、様々提案されているが、これといって \(R\) を決める方法はない。例えばAICを使う方法もあるが、例えば\(\lambda_r = 2\)のとき、\(\boldsymbol{s}_r\)を2倍するのか \(\boldsymbol{t}_r\) を2倍するのかどちらでも問題ない。本来AICが想定する状況は、冗長性のない状況であり、NNMFにAICを適用することは妥当ではない。また、Cross-validationという方法もあるが、基本は\(R\)は大きいほどいい。とはいえ、大きすぎる \(R\) はノイズも拾い、解釈も難しい。どうしようもないのが現実だが、元データを80%再現、90%再現あたりが無難な閾値として用いられる。