背景

科学研究など様々な場面で、あるサンプル（標本）について測定されたた複数の特徴量から、そのサンプルが属するクラスを特定するモデルが必要とされている。このようなモデルは、クラスと測定された特徴の間の関係の記述にも利用でき、また、新しいサンプルのクラスを予測するためにも利用できる。例えば、異なる品種について測定された複数の形質がある場合、品種をクラスとみなし、形質の観測値に基づいてクラスを特定するモデルを構築することにより、品種間の形質変動を記述するために利用できる。あるいは、リモートセンシングのデータを用いて、各地点で栽培されている作物種を特定するような例もある。本講義では、このような場合に有効なクラス同定モデルの構築方法を紹介する。

分類モデルを構築するためには、様々な手法がある。本講義では、線形判別分析、サポートベクターマシン、ランダムフォレストについて紹介する。分類モデルでは、モデルに関わる特徴量がサンプル数よりはるかに多い場合、「large p small n 」問題が発生する場合もある。サポートベクターマシンやランダムフォレストは、こうした問題の下でも利用できる。

線形判別分析(LDA)

　第3回の講義では、Hao et al. (2015, PLoS ONE 10: e0137748) のリモートセンシングデータをもとに、主成分分析を行った。この論文では、Landsatなどの人工衛星から取得したリモートセンシング画像から、対象地点で栽培されている作物を予測する機械学習モデルを構築することを研究目的としている。リモートセンシングは1年間に11回行われ、そこからNDVI（正規化差分植生指数）が算出された。第3回の講義で説明したように、NDVIは以下のように算出される。 \[ NDVI=\frac{\rho(NIR)-\rho(R)}{\rho(NIR)+\rho(R)} \] 植生は近赤外線（NIR）を反射し、赤色（レッド：R）を吸収するため、植生があるところではNDVIが高くなる。ここでは、Haoら（2015）のデータの一部を用いて、11個のNDVIデータセットからコムギを除く4種類の作物（ワタ、ブドウ、トウモロコシ、スイカ）を識別するモデルを構築してみよう。モデル構築には1000サンプルのデータ（訓練データ）を、モデル評価には500サンプルのデータ（検証データ）を使用する。

まず、モデル構築のためのデータをロードする。

ndvi <- read.csv("ndvi_train.csv", row.names = 1, stringsAsFactors = T) # Read data
head(ndvi) # check first 6 lines

##     Class  DOY96 DOY112 DOY130 DOY144 DOY159 DOY192 DOY208 DOY228 DOY256 DOY270
## 21 cotton 0.1278 0.0668 0.0553 0.0972 0.1220 0.7251 0.7916 0.7432 0.6334 0.4983
## 22 cotton 0.1278 0.0665 0.0553 0.0868 0.1218 0.7027 0.7585 0.7323 0.6186 0.4871
## 27 cotton 0.1157 0.0841 0.1234 0.1253 0.1519 0.2402 0.2969 0.3398 0.2512 0.3306
## 47 cotton 0.1402 0.0731 0.0370 0.1084 0.1758 0.7599 0.8256 0.7902 0.6509 0.5038
## 53 cotton 0.1108 0.0537 0.0510 0.0972 0.1455 0.6849 0.8143 0.8284 0.6767 0.5568
## 55 cotton 0.1426 0.0731 0.0379 0.1085 0.1814 0.7408 0.8140 0.8133 0.6240 0.4971
##    DOY289
## 21 0.3435
## 22 0.3374
## 27 0.1993
## 47 0.3319
## 53 0.3317
## 55 0.3590

次に、主成分分析を行って、データのもつ変異を要約してみよう。読み込んだデータから、NDVIデータ（2列目から最後の列まで）を変数Xとして抽出する（Xは行列）。また、作物種を変数clとして抽出する。

X <- as.matrix(ndvi[, 2:ncol(ndvi)]) # Extract wavelength data
cl <- ndvi$Class
res.pca <- prcomp(X) # principal component analysis

次に、主成分分析の結果を描いてみよう。

op <- par(mfrow = c(2,2)) # output 2x2
plot(res.pca$x[,1:2], col = cl) # scatter of 1st and 2nd principal component scores
legend("bottomleft", levels(cl), pch = 1, col = 1:nlevels(cl)) # Draw the legend in the lower left maizeer
plot(res.pca$x[,2:3], col = cl) # scatter of 2nd and 3rd principal component scores
plot(res.pca$x[,c(1,3)], col = cl) # scatter of 1st,3rd principal component scores
par(op)

plotly パッケージの plot_ly 関数を使用して 3 次元散布図を描画します。

plotly::plot_ly(x = res.pca$x[,1], y = res.pca$x[,2], z = res.pca$x[,3], color = cl,
                type = "scatter3d", mode = "markers", size = 0.1)

主成分得点をもとに散布図を描くと、第1、第2主成分のグラフでは、スイカが他の作物種から離れた場所に位置していることがわかる。一方、ブドウ、ワタ、トウモロコシは、第1〜3主成分の空間で互いに重なり合っていることがわかる。

　次に、線形判別分析を用いて分類モデルを構築してみる。線形判別分析には、MASSパッケージに含まれるlda関数を使用する。lda関数は、どの変数で何を判別するかを指定する必要がある。ここでは、“Class ~ .”として指定する。これは、ndviというデータの残りの変数でClass（因子でなければならない）を識別するモデルを作成することを意味する。

res.lda <- MASS::lda(Class ~ . , data = ndvi) # linear discriminant analysis
res.lda # Display results

## Call:
## lda(Class ~ ., data = ndvi)
## 
## Prior probabilities of groups:
##     cotton      grape      maize watermelon 
##      0.586      0.170      0.065      0.179 
## 
## Group means:
##                 DOY96     DOY112     DOY130    DOY144    DOY159    DOY192
## cotton     0.09183481 0.07079317 0.06931945 0.1138387 0.2074708 0.7767010
## grape      0.10465529 0.08929941 0.07177353 0.1158300 0.1834359 0.5488500
## maize      0.10179385 0.07642615 0.07822000 0.1518662 0.3258077 0.8426938
## watermelon 0.09710670 0.09306425 0.07633296 0.1052223 0.1310408 0.6249503
##               DOY208    DOY228    DOY256    DOY270    DOY289
## cotton     0.8182379 0.8005734 0.6696370 0.5292558 0.3142102
## grape      0.6512294 0.6459153 0.5694894 0.5027753 0.2954747
## maize      0.8459585 0.8108923 0.5997385 0.4333092 0.2346062
## watermelon 0.7282106 0.5984654 0.3159866 0.1956598 0.1454872
## 
## Coefficients of linear discriminants:
##               LD1        LD2         LD3
## DOY96    1.239276  8.9295380 -14.2410016
## DOY112  22.052720 -9.6572911  -9.4822330
## DOY130   3.153688  9.3666079  12.9301262
## DOY144  -3.195624 -0.7050797  13.0531728
## DOY159  -1.873597 -2.9388017 -24.1224798
## DOY192  -6.897472  8.9389458   0.9706753
## DOY208  19.383390 -5.0019306   6.9063329
## DOY228  -3.897492 10.8862230  -1.1247100
## DOY256 -11.118489 -2.4111915  -3.3169339
## DOY270  -2.130809 -3.4693096   3.7571186
## DOY289  -4.117050 -4.0018919   6.1395059
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7801 0.1437 0.0763

解析結果は、各サンプルが各クラス（グループ）に属する確率（事前確率）、各クラスのNDVI値の平均値、線形判別関数の係数、各線形判別関数がグループ間分散のどの程度を説明しているかを示している。線形判別関数の係数と、各線形判別関数で説明される群間分散の割合が表示される。

線形判別分析の結果を可視化してみよう。

plot(res.lda, col = as.numeric(cl), abbrev = 1) # Graphical representation of the result

判別スコアを計算し、その3次元散布図を描く。

fit.lda <- predict(res.lda) # calculate discriminant scores
head(fit.lda$x)

##           LD1         LD2         LD3
## 21 -0.7454548  0.01170553  1.54443730
## 22 -0.9495328 -0.03185156  1.14782997
## 27  0.2183818 -3.90433866 -1.60240434
## 47 -0.7081727  0.36198457  0.02702861
## 53 -1.2852901  0.07155552  1.31392312
## 55 -0.6945178  0.49373205 -0.02349397

plotly::plot_ly(x = fit.lda$x[,1], y = fit.lda$x[,2], z = fit.lda$x[,3], color = cl,
                type = "scatter3d", mode = "markers", size = 0.1)

なお、軸（判別関数）はグループをうまく分離するように配置されているので、主成分分析の場合よりも、グループ間の重なりがはっきりしている。例えば、トウモロコシとワタ、ブドウとワタなどは、LD2やLD3を使って何とか分離できそうである。

構築したモデルを用いて、分類精度を確認してみよう。

cl.est <- fit.lda$class # Estimation class for all samples
head(cl.est, 10)

##  [1] cotton cotton grape  cotton cotton cotton cotton cotton cotton cotton
## Levels: cotton grape maize watermelon

head(cl, 10)

##  [1] cotton cotton cotton cotton cotton cotton cotton cotton cotton cotton
## Levels: cotton grape maize watermelon

table(cl, cl.est) # Comparison of actual and estimated classes

##             cl.est
## cl           cotton grape maize watermelon
##   cotton        544    39     2          1
##   grape          53   106     3          8
##   maize          19     0    46          0
##   watermelon      0     3     0        176

tab <- table(cl, cl.est) # assignment to tab
diag(tab) # Extract only the diagonal components

##     cotton      grape      maize watermelon 
##        544        106         46        176

sum(diag(tab)) / sum(tab) # Calculate percentage of correct answers

## [1] 0.872

各サンプルについて同定（推定）されたクラスは，predict 関数を用いて取得することができる．真のクラス（cl）と同定（推定）クラス（cl.est）の分割表を作成し，比較することにより，正しく（あるいは誤って）同定された標本数を確認することができる．表の対角成分は正しく同定されたケースなので、diag関数で対角成分のみを取り出し、それらを合計し、総サンプル数で割ることで、正しく同定された割合を算出します。正答率は87.2%であり、特にブドウとワタを間違えるケースが多い（170サンプル中53サンプル）ことがわかる。

　なお、ここで算出された精度は、あくまで訓練データ（モデル構築に使用したデータ）にモデルを「当てはめた」ものである。このような場合、前回の講義と同様に「過剰適合」（overfit）が起こっている可能性があります。

　次に、各サンプルが各クラス（グループ）に属する確率（事後確率）を見てみる。

p.post <- fit.lda$posterior # posterior probability
head(p.post, 10) # Posterior probability for all samples

##          cotton       grape        maize   watermelon
## 21 0.9884138886 0.011562870 2.318608e-05 5.540017e-08
## 22 0.9845281188 0.015419070 5.279434e-05 1.683514e-08
## 27 0.0002483168 0.999750138 1.522126e-06 2.340179e-08
## 47 0.9805836205 0.016874276 2.542028e-03 7.502016e-08
## 53 0.9915076025 0.008467135 2.526051e-05 2.066750e-09
## 55 0.9836531459 0.012882084 3.464691e-03 7.920424e-08
## 57 0.9894689390 0.010494787 3.627413e-05 2.259079e-10
## 76 0.9617996597 0.037834700 3.652758e-04 3.646052e-07
## 78 0.9956985896 0.004171789 1.296110e-04 1.062132e-08
## 79 0.8963881244 0.101943826 1.665348e-03 2.702040e-06

id <- apply(p.post, 1, which.max) # posterior probability max
table(id, cl.est) # classified into the class with the largest value

##    cl.est
## id  cotton grape maize watermelon
##   1    616     0     0          0
##   2      0   148     0          0
##   3      0     0    51          0
##   4      0     0     0        185

各サンプルが各クラスに属する事後確率もpredict関数で取得することができる。例えば、ID21のサンプルは、綿である確率が0.99、ブドウである確率が0.01、それ以外が0であるとする。この場合、ID 21のサンプルは事後確率が最も高いワタであると同定される。次の行では、検索された事後確率p.postについて、各サンプルに対応する各行でどの要素（4つの作物種）の値が最も大きいかを調べている。分類結果(cl.est)と比較すると、両者は全く同じであることがわかる。

　分類の正誤と事後確率の関係を可視化するために、図を描いてみる。ここでは、ワタに限って確認する。

op <- par(mfrow = c(2,2)) # display in 2x2
hi <- hist(p.post[cl == "cotton", 1], main = "all cotton") # About  cotton data
hist(p.post[cl == "cotton" & cl.est == "cotton", 1], 
     breaks = hi$breaks, main = "correct") # correct data only
hist(p.post[cl == "cotton" & cl.est != "cotton", 1], 
     breaks = hi$breaks, main = "incorrect") # incorrect data only
par(op)

　　上の図から、正しく識別された場合には事後確率が大きく（ほとんどのサンプルで0.5以上）、誤って識別された場合には事後確率が小さいことがわかる。このことから、事後確率は信頼性の指標にもなっていることがわかる。

　前述したように、ここまでは訓練データ（＝モデルを構築するためのデータ）に対する分類精度を評価してきた。構築するモデルが解析されたデータを記述するためのものであれば、訓練データで分類精度を評価しても構わないかもれいない。しかし、多くの場合、そのようなモデルを構築する目的は、新しいデータのクラスを予測することである。そのような場合には、交差検証などの手法により、モデル構築に使用しなかったデータ（以下、「検証データ」と呼ぶ）の分類精度を評価する必要がある。　　Hao et al. (2015)のデータから、モデル評価用の検証データを別途用意した。検証データを読み込んでみよう。

ndvi.test <- read.csv("ndvi_test.csv", row.names = 1, stringsAsFactors = T) # Read data
head(ndvi.test) # check first 6 lines

##     Class  DOY96 DOY112 DOY130 DOY144 DOY159 DOY192 DOY208 DOY228 DOY256 DOY270
## 1  cotton 0.1305 0.0731 0.0637 0.1143 0.2143 0.7696 0.8281 0.7887 0.6716 0.4999
## 3  cotton 0.0860 0.0844 0.0851 0.1421 0.1327 0.3117 0.3341 0.3971 0.2237 0.3605
## 6  cotton 0.1216 0.0678 0.0646 0.1027 0.2037 0.7829 0.8274 0.7668 0.7010 0.5201
## 8  cotton 0.1134 0.0680 0.0528 0.1138 0.1762 0.6998 0.7722 0.7633 0.6218 0.5164
## 14 cotton 0.1352 0.0731 0.0553 0.0975 0.1254 0.7062 0.7467 0.7514 0.6363 0.4876
## 24 cotton 0.1459 0.0668 0.0532 0.0919 0.1286 0.7096 0.7538 0.6604 0.5642 0.4306
##    DOY289
## 1  0.3400
## 3  0.2358
## 6  0.3285
## 8  0.3274
## 14 0.3491
## 24 0.2938

cl.test <- ndvi.test$Class # Extract class (crop species) data

　次に、構築した識別モデルres.ldaを用いて、評価用データの識別を行ってみよう。

pred.lda <- predict(res.lda, ndvi.test)
cl.pred <- pred.lda$class # Predict to unknown data
head(cl.pred, 10)

##  [1] cotton grape  cotton cotton cotton cotton grape  cotton cotton cotton
## Levels: cotton grape maize watermelon

head(cl.test, 10)

##  [1] cotton cotton cotton cotton cotton cotton cotton cotton cotton cotton
## Levels: cotton grape maize watermelon

識別された（予測された）クラスと実際のクラスを分割表を使って比較してみよう。先ほどと同じ手順で分類の正答率も計算してみる。

table(cl.test, cl.pred) # Create binary division table

##             cl.pred
## cl.test      cotton grape maize watermelon
##   cotton        265    24     1          0
##   grape          23    56     3          4
##   maize          15     0    22          0
##   watermelon      0     0     0         87

tab <- table(cl.test, cl.pred) # assignment to tab
sum(diag(tab)) / sum(tab) # Calculate percentage of correct answers

## [1] 0.86

ワタとぶどうの区別、ワタとトウモロコシの区別が難しいことがわかります。正解率は86%で、訓練データで計算した正解率より若干低くなっている。

検証データ内のサンプルについて、判別スコアを計算して、プロットしてみましょう。

op <- par(mfrow = c(2, 2))
plot(pred.lda$x[, 1:2], col = cl.test, pch = as.numeric(cl.pred))
legend("bottomright", levels(cl), pch = 1, col = 1:nlevels(cl)) # Draw the legend in the lower left maizeer
plot(pred.lda$x[, 2:3], col = cl.test, pch = as.numeric(cl.pred))
plot(pred.lda$x[, c(1,3)], col = cl.test, pch = as.numeric(cl.pred))
par(op)

色の違いは真のクラス、記号の違いは予測されたクラスを表しています。誤って分類されたサンプルは、クラス間の境界に位置することが多いことが分かります。

サポートベクターマシン(SVM)

　線形判別分析は、元の変数Xの線形結合を用いてクラス識別を行おうとするものである。このような方法は、クラス間の境界が線形分離可能（直線で分離できる）でなければ、あまり正確ではない。以下のサポートベクターマシンは、クラス間の境界が直線的に分離できない複雑なパターンの場合、カーネル法という方法を用いて適用することができる。

　では、実際に使ってその精度を確認してみよう。解析には、kernlabパッケージに含まれるksvmという関数を使用する。SVMの種類は、サポートベクターマシンを使用する目的によって異なる。クラス識別モデルを構築するのであれば、まず「C-svc」を選択するとよい。SVMは多変量回帰や外れ値検出にも使用でき、その場合はそれぞれ「eps-svr」「one-svc」を選択するとよい。このほかにも、さまざまなタイプを指定することができる。

res.svm <- kernlab::ksvm(Class ~ . , data = ndvi, type = "C-svc") # Execute SVM
res.svm

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.143220071935847 
## 
## Number of Support Vectors : 336 
## 
## Objective Function Value : -112.121 -53.4998 -20.956 -21.5002 -19.0631 -7.9378 
## Training error : 0.038

では、特定（推定）されたクラスと実際のクラスを比較してみよう。

cl.est <- predict(res.svm) # Estimation class for all samples
table(cl, cl.est) # Comparison of actual and estimated classes

##             cl.est
## cl           cotton grape maize watermelon
##   cotton        580     6     0          0
##   grape          17   152     1          0
##   maize          13     0    52          0
##   watermelon      0     1     0        178

tab <- table(cl, cl.est) # assignment to tab
sum(diag(tab)) / sum(tab) # Calculate percentage of correct answers

## [1] 0.962

線形判別分析と比較して、分類の正答率（96.2％）が向上している。また、ブドウとワタを間違える率も低くなっている（170サンプル中17サンプル）。

この訓練データを用いて、10分割交差検証によりモデルの分類精度を評価する。

res.svm <- kernlab::ksvm(Class ~ . , data = ndvi, type = "C-svc", 
                         cross = 10) # Execute SVM with cross-validation
res.svm

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.151582959178556 
## 
## Number of Support Vectors : 339 
## 
## Objective Function Value : -110.4197 -52.8538 -20.9671 -21.2138 -18.9894 -8.0293 
## Training error : 0.038 
## Cross validation error : 0.06

cross(res.svm)

## [1] 0.06

1 - cross(res.svm)

## [1] 0.94

では、テストデータに対して、識別の精度を確認してみましょう。

cl.pred <- predict(res.svm, ndvi.test) # Predict to unknown data
table(cl.test, cl.pred) # Create binary division table

##             cl.pred
## cl.test      cotton grape maize watermelon
##   cotton        283     7     0          0
##   grape          15    70     0          1
##   maize          13     1    23          0
##   watermelon      0     0     0         87

tab <- table(cl.test, cl.pred) # assignment to tab
sum(diag(tab)) / sum(tab) # Calculate percentage of correct answers

## [1] 0.926

また、テストデータによる検証でも、高い分類精度（93％）を示している。

SVMでは、様々なカーネル行列（サンプル間の関係を定義する行列。具体的には、特徴空間におけるサンプル間の内積行列である）を用いて分類モデルを作成することができる。

例えば、線形カーネルは、以下のように適用できる。

res.svm.lin <- kernlab::ksvm(Class ~ . , data = ndvi, type = "C-svc", 
                             kernel = "vanilladot", # linear kernel
                             cross = 10)

##  Setting default kernel parameters

res.svm.lin

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 269 
## 
## Objective Function Value : -175.02 -52.5992 -10.4078 -12.954 -12.7449 -0.7291 
## Training error : 0.091 
## Cross validation error : 0.101

線形カーネルで精度を確認する。

cl.pred.lin <- predict(res.svm.lin, ndvi.test) # Estimation class for all samples
table(cl.test, cl.pred.lin) # Comparison of actual and estimated classes

##             cl.pred.lin
## cl.test      cotton grape maize watermelon
##   cotton        266    23     1          0
##   grape          13    71     1          1
##   maize          13     2    22          0
##   watermelon      0     0     0         87

tab <- table(cl.test, cl.pred.lin) # assignment to tab
sum(diag(tab)) / sum(tab) # Calculate percentage of correct answers

## [1] 0.892

デフォルトのカーネル（放射基底関数カーネル、Radial basis function: RBF、rbfdot）よりも精度が低くなっています。

次に、3次多項式カーネルを使用してみます。

res.svm.pol <- kernlab::ksvm(Class ~ . , data = ndvi, type = "C-svc", 
                             kernel = "polydot", kpar = list(degree = 3), # polynomial kernel
                             cross = 10) 
res.svm.pol

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  3  scale =  1  offset =  1 
## 
## Number of Support Vectors : 189 
## 
## Objective Function Value : -2.5758 -2.1475 -0.0294 -0.0911 -0.0293 -0.0161 
## Training error : 0 
## Cross validation error : 0.076

多項式カーネルで精度を確認する。

cl.pred.pol <- predict(res.svm.pol, ndvi.test) # Estimation class for all samples
table(cl.test, cl.pred.pol) # Comparison of actual and estimated classes

##             cl.pred.pol
## cl.test      cotton grape maize watermelon
##   cotton        269    15     6          0
##   grape           9    77     0          0
##   maize           5     1    31          0
##   watermelon      0     0     0         87

tab <- table(cl.test, cl.pred.pol) # assignment to tab
sum(diag(tab)) / sum(tab) # Calculate percentage of correct answers

## [1] 0.928

精度はデフォルトのカーネル（放射基底関数カーネル rbfdot）と同等ですが、2つのカーネルでは正解・不正解のパターンが異なっています。

table(cl.pred, cl.pred.pol, cl.test)

## , , cl.test = cotton
## 
##             cl.pred.pol
## cl.pred      cotton grape maize watermelon
##   cotton        265    12     6          0
##   grape           4     3     0          0
##   maize           0     0     0          0
##   watermelon      0     0     0          0
## 
## , , cl.test = grape
## 
##             cl.pred.pol
## cl.pred      cotton grape maize watermelon
##   cotton          5    10     0          0
##   grape           4    66     0          0
##   maize           0     0     0          0
##   watermelon      0     1     0          0
## 
## , , cl.test = maize
## 
##             cl.pred.pol
## cl.pred      cotton grape maize watermelon
##   cotton          4     0     9          0
##   grape           0     1     0          0
##   maize           1     0    22          0
##   watermelon      0     0     0          0
## 
## , , cl.test = watermelon
## 
##             cl.pred.pol
## cl.pred      cotton grape maize watermelon
##   cotton          0     0     0          0
##   grape           0     0     0          0
##   maize           0     0     0          0
##   watermelon      0     0     0         87

Random Forst

　機械学習手法ランダムフォレストを使って分類モデルを構築してみよう。ランダムフォレストは、識別に用いる変数とクラスの間に非線形な関係がある場合にも用いることができる。

　ランダムフォレストでは、rangerパッケージに含まれるranger関数を使用する。

res.rfo <- ranger::ranger(Class ~ . , data = ndvi) # Run RF
# Estimated class vs. actual
fit.rfo <- predict(res.rfo, ndvi) # Estimation class for all samples
cl.est <- fit.rfo$predictions
table(cl, cl.est) # Comparison of actual and estimated classes

##             cl.est
## cl           cotton grape maize watermelon
##   cotton        586     0     0          0
##   grape           0   170     0          0
##   maize           0     0    65          0
##   watermelon      0     0     0        179

tab <- table(cl, cl.est) # assignment to tab
sum(diag(tab)) / sum(tab) # rate of correct prediction

## [1] 1

このモデルは学習データに完全に適合している。

モデルの精度は、OOB（out of bag）誤差で評価できる。

accuracy <- 1 - res.rfo$prediction.error
accuracy

## [1] 0.933

検証データにモデルを当てはめ、予測精度を確認する。

pred.rfo <- predict(res.rfo, ndvi.test) # Prediction of the classes for all samples
cl.pred <- pred.rfo$predictions
table(cl.test, cl.pred) # Comparison of actual and estimated classes

##             cl.pred
## cl.test      cotton grape maize watermelon
##   cotton        283     7     0          0
##   grape          15    69     0          2
##   maize          10     1    26          0
##   watermelon      0     0     0         87

tab <- table(cl.test, cl.pred) # assignment to tab
sum(diag(tab)) / sum(tab) # rate of correct prediction

## [1] 0.93

テストデータに対する予測では、SVMと同程度の正答率で分類されていることがわかる。SVMと比較すると、特にトウモロコシの識別がより正確であるように見える。

RFでは、変数の重要度を確認することができる。importance オプションを “none” から “impurity” に設定することで、識別における変数の重要度が算出される。

res.rfo.wim <- ranger::ranger(Class ~ . , data = ndvi, importance = "impurity")
var.imp <- ranger::importance(res.rfo.wim)
var.imp

##     DOY96    DOY112    DOY130    DOY144    DOY159    DOY192    DOY208    DOY228 
##  26.83092  36.01618  20.07646  41.05563  62.09203  55.73582  46.70534  64.63019 
##    DOY256    DOY270    DOY289 
## 106.83699  78.67582  51.56404

barplot(var.imp, las = 2)

夏以降（DOY228以降）のデータが同定に重要であることがわかる。

ハイパーパラメータの最適化

　機械学習の手法や統計的手法には、ハイパーパラメータと呼ばれるモデルの特性を定義する値のセットがあることが多い。例えば、RFでは、モデルに用いる木の本数、木を構成する変数の数などがハイパーパラメータとなる。SVMでは、カーネルの特性を指定するハイパーパラメータがある。

RFでは、RFで用いられる木の本数や木を構成する変数の数を変化させてモデルを作成し、OOBエラーに基づいて精度を評価しようとするものである。

res.rfo <- ranger::ranger(Class ~ ., data = ndvi, 
                            num.trees = 10, mtry = 5)
1 - res.rfo$prediction.error

## [1] 0.8958544

res.rfo <- ranger::ranger(Class ~ ., data = ndvi, 
                            num.trees = 100, mtry = 5)
1 - res.rfo$prediction.error

## [1] 0.928

res.rfo <- ranger::ranger(Class ~ ., data = ndvi, 
                            num.trees = 100, mtry = 2)
1 - res.rfo$prediction.error

## [1] 0.922

この2つのハイパーパラメータの値を変えることで、精度が変化する。

より精度の高いモデルを構築するためには、ハイパーパラメータを慎重に調整する必要がある（このプロセスを「チューニング」と呼ぶ）。ハイパーパラメータのチューニングには、さまざまな方法がある。ここでは例として、ベイズ最適化を用いてチューニングを行う。

まず、独自の関数を作成する。この関数は、ハイパーパラメータの値が与えられたときに、その値に基づいて分類モデルを作成し、その分類精度を返すものです。

rfo.error <- function(ntre10, nvar) {
  res.rfo <- ranger::ranger(Class ~ ., data = ndvi, 
                            num.trees = ntre10 * 10, mtry = nvar)
  accuracy <-  1 - res.rfo$prediction.error
  return(list(Score = accuracy, Pred = 0))
}

この機能により、モデルの分類精度を簡単に確認することができる。

rfo.error(100, 2)

## $Score
## [1] 0.928
## 
## $Pred
## [1] 0

次に、rBayesianOptimizationパッケージのBayesianOptimization関数を使って、2つのRFハイパーパラメータの値のチューニングを試みる。各ハイパーパラメータが取りうる値の範囲を指定する必要があることに注意する。ハイパーパラメータの値が実数でなく整数の場合は、値の後ろに “L”をつける必要がある。まずランダムに 10 個の値の組み合わせを試した後、ベイズ最適化を 20 回繰り返し、より良い値の組み合わせを探索する。

opt.rfo <- rBayesianOptimization::BayesianOptimization(rfo.error,
                                        bounds = list(ntre10 = c(1L, 20L),
                                                      nvar = c(1L, 10L)),
                                        init_points=10, n_iter=20)

## elapsed = 0.012  Round = 1   ntre10 = 3.0000 nvar = 8.0000   Value = 0.9150 
## elapsed = 0.027  Round = 2   ntre10 = 14.0000    nvar = 8.0000   Value = 0.9220 
## elapsed = 0.018  Round = 3   ntre10 = 6.0000 nvar = 8.0000   Value = 0.9260 
## elapsed = 0.029  Round = 4   ntre10 = 14.0000    nvar = 7.0000   Value = 0.9340 
## elapsed = 0.017  Round = 5   ntre10 = 13.0000    nvar = 3.0000   Value = 0.9260 
## elapsed = 0.014  Round = 6   ntre10 = 7.0000 nvar = 5.0000   Value = 0.9240 
## elapsed = 0.013  Round = 7   ntre10 = 6.0000 nvar = 4.0000   Value = 0.9230 
## elapsed = 0.032  Round = 8   ntre10 = 16.0000    nvar = 6.0000   Value = 0.9290 
## elapsed = 0.022  Round = 9   ntre10 = 9.0000 nvar = 8.0000   Value = 0.9240 
## elapsed = 0.027  Round = 10  ntre10 = 17.0000    nvar = 5.0000   Value = 0.9380 
## elapsed = 0.016  Round = 11  ntre10 = 11.0000    nvar = 3.0000   Value = 0.9270 
## elapsed = 0.024  Round = 12  ntre10 = 15.0000    nvar = 5.0000   Value = 0.9250 
## elapsed = 0.011  Round = 13  ntre10 = 5.0000 nvar = 4.0000   Value = 0.9330 
## elapsed = 0.014  Round = 14  ntre10 = 12.0000    nvar = 2.0000   Value = 0.9240 
## elapsed = 0.036  Round = 15  ntre10 = 16.0000    nvar = 9.0000   Value = 0.9300 
## elapsed = 0.015  Round = 16  ntre10 = 11.0000    nvar = 2.0000   Value = 0.9310 
## elapsed = 0.027  Round = 17  ntre10 = 17.0000    nvar = 5.0000   Value = 0.9310 
## elapsed = 0.038  Round = 18  ntre10 = 18.0000    nvar = 10.0000  Value = 0.9250 
## elapsed = 0.026  Round = 19  ntre10 = 19.0000    nvar = 3.0000   Value = 0.9320 
## elapsed = 0.028  Round = 20  ntre10 = 15.0000    nvar = 6.0000   Value = 0.9340 
## elapsed = 0.01   Round = 21  ntre10 = 5.0000 nvar = 3.0000   Value = 0.9180 
## elapsed = 0.009  Round = 22  ntre10 = 4.0000 nvar = 2.0000   Value = 0.9230 
## elapsed = 0.019  Round = 23  ntre10 = 16.0000    nvar = 2.0000   Value = 0.9280 
## elapsed = 0.02   Round = 24  ntre10 = 17.0000    nvar = 2.0000   Value = 0.9310 
## elapsed = 0.042  Round = 25  ntre10 = 18.0000    nvar = 10.0000  Value = 0.9290 
## elapsed = 0.023  Round = 26  ntre10 = 13.0000    nvar = 5.0000   Value = 0.9320 
## elapsed = 0.013  Round = 27  ntre10 = 8.0000 nvar = 3.0000   Value = 0.9250 
## elapsed = 0.02   Round = 28  ntre10 = 12.0000    nvar = 5.0000   Value = 0.9260 
## elapsed = 0.036  Round = 29  ntre10 = 17.0000    nvar = 10.0000  Value = 0.9230 
## elapsed = 0.035  Round = 30  ntre10 = 18.0000    nvar = 8.0000   Value = 0.9320 
## 
##  Best Parameters Found: 
## Round = 10   ntre10 = 17.0000    nvar = 5.0000   Value = 0.9380

得られたハイパーパラメータ値に基づいて分類モデルを構築し、検証データに基づいてその分類精度を評価する。

opt.rfo$Best_Par

## ntre10   nvar 
##     17      5

# Run RF with tuned hyper-parameters
res.rfo <- ranger::ranger(Class ~ . , data = ndvi, 
                          num.trees = opt.rfo$Best_Par[1] * 10, 
                          mtry = opt.rfo$Best_Par[2])
1 - res.rfo$prediction.error

## [1] 0.928

pred.rfo <- predict(res.rfo, ndvi.test) # Prediction of the classes for all samples
cl.pred <- pred.rfo$predictions
table(cl.test, cl.pred) # Comparison of actual and estimated classes

##             cl.pred
## cl.test      cotton grape maize watermelon
##   cotton        279    10     1          0
##   grape          16    68     0          2
##   maize           9     0    28          0
##   watermelon      0     0     0         87

tab <- table(cl.test, cl.pred) # assignment to tab
sum(diag(tab)) / sum(tab) # rate of correct prediction

## [1] 0.924

次に、SVMのハイパーパラメータのチューニングも試してみる。ここでは、多項式カーネルを用いたSVMをチューニングする。多項式カーネルには3つのハイパーパラメータがあり、そのうち次数は3で固定、残りの2つのハイパーパラメータ（「scale」と「offset」）をチューニングしていく。

まず、独自の関数を作成する。この関数は、2つのハイパーパラメータの値が与えられると、それに基づいて分類モデルを構築し、分類精度を返す。

svm.error <- function(vsca,  voff) {
  res.svm <- kernlab::ksvm(Class ~ . , data = ndvi, type = "C-svc", 
                           kernel = "polydot", 
                           kpar = list(degree = 3, scale = vsca, offset = voff), 
                           cross = 10) 
  accuracy <-  1 - cross(res.svm)
  return(list(Score = accuracy, Pred = 0))
}

試してみよう。

svm.error(1, 1)

## $Score
## [1] 0.929
## 
## $Pred
## [1] 0

svm.error(1.5, 0.5)

## $Score
## [1] 0.911
## 
## $Pred
## [1] 0

チューニングは BayesianOptimization 関数を用いて行い、2 つのハイパーパラメータの値の範囲は 0 ～ 2 とする。2 つの RF ハイパーパラメータとは異なり、これらのハイパーパラメータは実数（連続値）である。そのため、値の後に “L”をつける必要はない。

opt.svm <- rBayesianOptimization::BayesianOptimization(svm.error,
                                        bounds = list(vsca = c(0, 2),
                                                      voff = c(0, 2)),
                                        init_points=10, n_iter=20)

## elapsed = 0.18   Round = 1   vsca = 0.5053649    voff = 1.630155 Value = 0.9280 
## elapsed = 0.197  Round = 2   vsca = 1.255317 voff = 0.6439898    Value = 0.9220 
## elapsed = 0.206  Round = 3   vsca = 1.369706 voff = 1.069619 Value = 0.9240 
## elapsed = 0.155  Round = 4   vsca = 0.2135308    voff = 0.8019723    Value = 0.9380 
## elapsed = 0.206  Round = 5   vsca = 1.726767 voff = 0.6759446    Value = 0.9170 
## elapsed = 0.181  Round = 6   vsca = 0.9855211    voff = 1.916092 Value = 0.9370 
## elapsed = 0.182  Round = 7   vsca = 1.049182 voff = 1.827455 Value = 0.9270 
## elapsed = 0.182  Round = 8   vsca = 0.5121594    voff = 0.2823746    Value = 0.9250 
## elapsed = 0.176  Round = 9   vsca = 0.4300577    voff = 1.052161 Value = 0.9310 
## elapsed = 0.145  Round = 10  vsca = 0.03732534   voff = 1.130649 Value = 0.9220 
## elapsed = 0.15   Round = 11  vsca = 0.1252746    voff = 1.2275   Value = 0.9420 
## elapsed = 0.183  Round = 12  vsca = 1.078942 voff = 1.95356  Value = 0.9290 
## elapsed = 0.197  Round = 13  vsca = 1.488215 voff = 1.410327 Value = 0.9210 
## elapsed = 0.197  Round = 14  vsca = 0.8613775    voff = 0.8831598    Value = 0.9140 
## elapsed = 0.192  Round = 15  vsca = 1.573131 voff = 1.176714 Value = 0.9200 
## elapsed = 0.165  Round = 16  vsca = 0.2988366    voff = 1.669297 Value = 0.9430 
## elapsed = 0.183  Round = 17  vsca = 0.7591538    voff = 1.151152 Value = 0.9270 
## elapsed = 0.152  Round = 18  vsca = 0.1692371    voff = 1.554473 Value = 0.9330 
## elapsed = 0.183  Round = 19  vsca = 0.8647934    voff = 1.040555 Value = 0.9200 
## elapsed = 0.167  Round = 20  vsca = 0.294892 voff = 1.963315 Value = 0.9380 
## elapsed = 0.162  Round = 21  vsca = 0.3058102    voff = 1.12413  Value = 0.9430 
## elapsed = 0.162  Round = 22  vsca = 0.2590116    voff = 1.310417 Value = 0.9410 
## elapsed = 0.149  Round = 23  vsca = 0.1365377    voff = 0.7316305    Value = 0.9360 
## elapsed = 0.183  Round = 24  vsca = 0.9095273    voff = 1.8115   Value = 0.9260 
## elapsed = 0.17   Round = 25  vsca = 0.3489737    voff = 1.451167 Value = 0.9400 
## elapsed = 0.16   Round = 26  vsca = 0.2799205    voff = 0.6269431    Value = 0.9320 
## elapsed = 0.154  Round = 27  vsca = 0.1681362    voff = 1.089159 Value = 0.9400 
## elapsed = 0.166  Round = 28  vsca = 0.3540109    voff = 0.9174483    Value = 0.9350 
## elapsed = 0.152  Round = 29  vsca = 0.09514123   voff = 1.727637 Value = 0.9330 
## elapsed = 0.182  Round = 30  vsca = 1.180806 voff = 2.0000   Value = 0.9290 
## 
##  Best Parameters Found: 
## Round = 16   vsca = 0.2988366    voff = 1.669297 Value = 0.9430

調整されたハイパーパラメータ値に基づいて分類モデルを構築し、検証データに基づいて分類精度を評価する。

opt.svm$Best_Par

##      vsca      voff 
## 0.2988366 1.6692967

# Run RF with tuned hyper-parameters
res.svm <- kernlab::ksvm(Class ~ . , data = ndvi, type = "C-svc", 
                         kernel = "polydot", 
                         kpar = list(degree = 3, 
                                     scale = opt.svm$Best_Par[1], 
                                     offset = opt.svm$Best_Par[2]),
                         cross = 10)
1 - cross(res.svm)

## [1] 0.935

cl.pred <- predict(res.svm, ndvi.test) # Prediction of the classes for all samples
table(cl.test, cl.pred) # Comparison of actual and estimated classes

##             cl.pred
## cl.test      cotton grape maize watermelon
##   cotton        276    10     4          0
##   grape          15    71     0          0
##   maize           5     1    31          0
##   watermelon      0     0     0         87

tab <- table(cl.test, cl.pred) # assignment to tab
sum(diag(tab)) / sum(tab) # rate of correct prediction

## [1] 0.93

第5回講義：多変量に基づくクラスの予測

岩田洋佳

2022/08/30

必要なパッケージ

背景

線形判別分析(LDA)

サポートベクターマシン(SVM)

Random Forst

ハイパーパラメータの最適化