1 Libraries.


2 Hierarchical clustering.


2.1 Building the distance matrix with stats::dist

  • Distance measure: one of euclidean [default], maximum, manhattan, canberra, binary, minkowski.
    • binary gives the Jaccard distance (1 - Jaccard coefficient).
  • hclust takes a dist object as its first argument and performs hierarchical clustering.
    • Linkage method: one of ward.D, ward.D2, single, complete [default], average (= UPGMA), mcquitty (= WPGMA), median (= WPGMC), centroid (= UPGMC).
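
As a minimal sketch of the two steps above, the following builds a Euclidean distance matrix from mtcars and clusters it with average linkage. The choice of mtcars follows the call shown later in this document; the choice of method here is only for illustration.

```r
# Rows of the matrix are the observations that get clustered
x <- as.matrix(mtcars)

# Step 1: distance matrix (a "dist" object holding the lower triangle)
d <- stats::dist(x, method = "euclidean")

# Step 2: hierarchical clustering; "average" corresponds to UPGMA
hc <- stats::hclust(d, method = "average")

# The result records both the linkage and the distance measure used
hc$method
hc$dist.method
```

An hclust object on n observations always stores n - 1 merges, so `hc$height` has 31 entries here.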

2.2 Building the distance matrix with amap::Dist.

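  • Distance measure: one of euclidean [default], maximum, manhattan, canberra, binary, pearson, abspearson, correlation, abscorrelation, spearman, kendall.
  • Note that in amap::Dist the method pearson is 1 - cosine similarity (the uncentered Pearson), whereas correlation is 1 - Pearson's product-moment correlation coefficient.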
## List of 7
##  $ merge      : int [1:31, 1:2] -12 -1 -14 -10 -5 -7 -22 -15 -16 -4 ...
##  $ height     : num [1:31] 5.32e-06 5.43e-06 1.78e-05 3.44e-05 5.95e-05 ...
##  $ order      : int [1:32] 28 30 31 19 20 18 26 8 4 6 ...
##  $ labels     : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ method     : chr "average"
##  $ call       : language hclust(d = amap::Dist(as.matrix(mtcars), method = distm), method = clm)
##  $ dist.method: chr "correlation"
##  - attr(*, "class")= chr "hclust"
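
To make the difference between the two amap methods concrete, here is a sketch that computes both quantities in base R. The object names are illustrative; the comparison with amap::Dist only runs if amap is installed, and its result is printed rather than assumed.

```r
x <- as.matrix(mtcars)

# What amap calls "pearson": 1 - cosine similarity between rows (no centering)
norms <- sqrt(rowSums(x^2))
d_cosine <- as.dist(1 - tcrossprod(x) / outer(norms, norms))

# What amap calls "correlation": 1 - Pearson correlation between rows
d_pearson <- as.dist(1 - cor(t(x)))

# Either distance can be passed to hclust as usual
hc <- hclust(d_pearson, method = "average")

# Optional cross-check against amap
if (requireNamespace("amap", quietly = TRUE)) {
  print(all.equal(as.vector(d_cosine),
                  as.vector(amap::Dist(x, method = "pearson"))))
  print(all.equal(as.vector(d_pearson),
                  as.vector(amap::Dist(x, method = "correlation"))))
}
```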

2.3 Building the distance matrix with proxy::dist.

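  • Also works for distance and similarity matrices of categorical data.
  • The available distance and similarity measures are listed by summary(pr_DB).
  • Use proxy::dist for a distance matrix and proxy::simil for a similarity matrix.
  • Categorical data must first be converted to binary (0/1) data.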
## * Similarity measures:
## Braun-Blanquet, Chi-squared, correlation, cosine, Cramer, Dice,
## eDice, eJaccard, Fager, Faith, Gower, Hamman, Jaccard,
## Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai,
## Pearson, Phi, Phi-squared, Russel, simple matching, Simpson,
## Stiles, Tanimoto, Tschuprow, Yule, Yule2
## 
## * Distance measures:
## Bhjattacharyya, Bray, Canberra, Chord, divergence, Euclidean,
## fJaccard, Geodesic, Hellinger, Kullback, Levenshtein, Mahalanobis,
## Manhattan, Minkowski, Podani, Soergel, supremum, Wave, Whittaker
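
A small sketch of the proxy workflow, assuming three of the sample sets from section 2.4.1 already encoded as a 0/1 matrix. The matrix construction is an illustration and not the author's code; the proxy calls are skipped if the package is not installed.

```r
# Encode three sets as 0/1 rows over the categories a..j
lev <- letters[1:10]
sets <- list(v1 = c("a", "b", "c", "d"),
             v3 = c("a", "b", "c", "e"),
             v6 = c("f", "h", "i", "j"))
b <- t(vapply(sets, function(v) as.integer(lev %in% v), integer(length(lev))))
colnames(b) <- lev

if (requireNamespace("proxy", quietly = TRUE)) {
  s <- proxy::simil(b, method = "Jaccard")  # similarity between rows
  d <- proxy::dist(b, method = "Jaccard")   # distance between rows
  print(s)
  print(d)
}
```

The object returned by proxy::dist can be handed to hclust in the same way as one from stats::dist.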

2.4 Clustering categorical data.

  • stats::dist(method="binary") and amap::Dist(method="binary") both produce a distance matrix based on the Jaccard coefficient.
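
The claim can be checked by hand for one pair. The sketch below compares stats::dist(method = "binary") with 1 - |A ∩ B| / |A ∪ B| for the sets v1 and v3 of the sample data; the variable names are illustrative.

```r
v1 <- c("a", "b", "c", "d")
v3 <- c("a", "b", "c", "e")

# 0/1 encoding over the categories a..j
lev <- letters[1:10]
b <- rbind(v1 = as.integer(lev %in% v1),
           v3 = as.integer(lev %in% v3))

# Distance as computed by stats::dist
d_binary <- as.numeric(stats::dist(b, method = "binary"))

# Jaccard distance computed directly from the sets
d_manual <- 1 - length(intersect(v1, v3)) / length(union(v1, v3))

all.equal(d_binary, d_manual)
```

The two sets share 3 of 5 distinct categories, so both values are 1 - 3/5 = 0.4.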

2.4.1 Sample data.

## List of 6
##  $ v1: chr [1:4] "a" "b" "c" "d"
##  $ v2: chr [1:4] "b" "d" "a" "c"
##  $ v3: chr [1:4] "a" "b" "c" "e"
##  $ v4: chr [1:4] "b" "c" "e" "f"
##  $ v5: chr [1:4] "a" "e" "f" "g"
##  $ v6: chr [1:4] "f" "h" "i" "j"

2.4.2 A function that creates dummy variables from categorical variables (list).

  • Also handles the case where each list element is an ordered sequence rather than a set.
Categorical data
v1  v2  v3  v4  v5  v6
a   b   a   b   a   f
b   d   b   c   e   h
c   a   c   e   f   i
d   c   e   f   g   j
Dummy variables_1
id  v1  v2  v3  v4  v5  v6
a    1   1   1   0   1   0
b    1   1   1   1   0   0
c    1   1   1   1   0   0
d    1   1   0   0   0   0
e    0   0   1   1   1   0
f    0   0   0   1   1   1
g    0   0   0   0   1   0
h    0   0   0   0   0   1
i    0   0   0   0   0   1
j    0   0   0   0   0   1
Dummy variables_2 (rows 1 to 12)
vdummy  X1  X2  X3  X4  X5  X6
a_1      1   0   1   0   1   0
b_1      0   1   0   1   0   0
c_1      0   0   0   0   0   0
d_1      0   0   0   0   0   0
e_1      0   0   0   0   0   0
f_1      0   0   0   0   0   1
g_1      0   0   0   0   0   0
h_1      0   0   0   0   0   0
i_1      0   0   0   0   0   0
j_1      0   0   0   0   0   0
a_2      0   0   0   0   0   0
b_2      1   0   1   0   0   0
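
The function that produced these tables is not shown, so the following is only one possible sketch of it. The helper names dummy_by_set and dummy_by_position are hypothetical; the first treats each list element as a set, the second also encodes the position of each category.

```r
lst <- list(
  v1 = c("a", "b", "c", "d"), v2 = c("b", "d", "a", "c"),
  v3 = c("a", "b", "c", "e"), v4 = c("b", "c", "e", "f"),
  v5 = c("a", "e", "f", "g"), v6 = c("f", "h", "i", "j")
)

# Set-based: 1 if the category occurs anywhere in the element
dummy_by_set <- function(lst) {
  lev <- sort(unique(unlist(lst)))
  m <- vapply(lst, function(v) as.integer(lev %in% v), integer(length(lev)))
  rownames(m) <- lev
  m
}

# Position-aware: 1 if the category occurs at exactly that position
dummy_by_position <- function(lst) {
  lev <- sort(unique(unlist(lst)))
  n_pos <- max(lengths(lst))
  keys <- paste(rep(lev, times = n_pos),
                rep(seq_len(n_pos), each = length(lev)), sep = "_")
  m <- vapply(lst,
              function(v) as.integer(keys %in% paste(v, seq_along(v), sep = "_")),
              integer(length(keys)))
  rownames(m) <- keys
  m
}

m1 <- dummy_by_set(lst)
m2 <- dummy_by_position(lst)
head(m2, 12)

# Variables are in columns, so transpose before computing the Jaccard distance
hc <- hclust(dist(t(m1), method = "binary"), method = "average")
```

In this sketch the columns keep the list names v1 to v6, whereas the second table above labels them X1 to X6.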

2.4.3 For large data, use qdapTools::mtabulate.

qdapTools::mtabulate
    a  b  c  d  e  f  g  h  i  j
v1  1  1  1  1  0  0  0  0  0  0
v2  1  1  1  1  0  0  0  0  0  0
v3  1  1  1  0  1  0  0  0  0  0
v4  0  1  1  0  1  1  0  0  0  0
v5  1  0  0  0  1  1  1  0  0  0
v6  0  0  0  0  0  1  0  1  1  1
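
A sketch of how such a table might be produced. The base R cross-tabulation is an alternative added here for comparison and is not from the original; the qdapTools call only runs if the package is installed.

```r
lst <- list(
  v1 = c("a", "b", "c", "d"), v2 = c("b", "d", "a", "c"),
  v3 = c("a", "b", "c", "e"), v4 = c("b", "c", "e", "f"),
  v5 = c("a", "e", "f", "g"), v6 = c("f", "h", "i", "j")
)

# Base R: stack the list into (values, ind) pairs and cross-tabulate
long <- stack(lst)
tab <- table(long$ind, long$values)
tab

# qdapTools: one row per list element, one column per category
if (requireNamespace("qdapTools", quietly = TRUE)) {
  print(qdapTools::mtabulate(lst))
}
```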

3 Cutting the tree into clusters.
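
A minimal sketch of cutting a tree with stats::cutree, either into a fixed number of clusters or at a given height. The values k = 4 and h = 100 are arbitrary choices for illustration.

```r
hc <- hclust(dist(as.matrix(mtcars)), method = "average")

# Cut into a fixed number of clusters
cl_k <- cutree(hc, k = 4)

# Cut at a given height of the dendrogram
cl_h <- cutree(hc, h = 100)

# Cluster sizes; the result is a named integer vector in data order
table(cl_k)
head(cl_k)
```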


4 Dendrogram plotting 1
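
A sketch of plotting with base graphics only, assuming the same mtcars clustering as before. The plot options are illustrative.

```r
hc <- hclust(dist(as.matrix(mtcars)), method = "average")

# plot.hclust: hang = -1 aligns all labels at the bottom
plot(hc, hang = -1, cex = 0.7, main = "mtcars, average linkage")

# Draw boxes around the 4 clusters obtained by cutting the tree
rect.hclust(hc, k = 4, border = 2:5)

# The dendrogram class offers further options, such as a horizontal layout
dend <- as.dendrogram(hc)
plot(dend, horiz = TRUE)
```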


5 Dendrogram plotting 2: ape


## List of 4
##  $ edge       : int [1:62, 1:2] 33 34 38 38 53 58 58 53 59 59 ...
##  $ edge.length: num [1:62] 6.86 3.39 1.5 0.5 1 ...
##  $ tip.label  : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ Nnode      : int 31
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"
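
A phylo object like the one above can be obtained by converting an hclust result. This is a sketch assuming the mtcars clustering; the ape calls are skipped if the package is not installed.

```r
hc <- hclust(dist(as.matrix(mtcars)), method = "average")

if (requireNamespace("ape", quietly = TRUE)) {
  # Convert the hclust tree to ape's "phylo" class
  ph <- ape::as.phylo(hc)
  str(ph, max.level = 1)

  # plot.phylo supports several layouts, for example a circular tree
  plot(ph, type = "fan", cex = 0.7)
}
```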

6 Dendrogram plotting 3: dendextend


6.1 Data.

  • The row names end up in the labels of the hclust object, so set them beforehand.
iris
      Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
set1           5.1          3.5           1.4          0.2
set2           4.9          3.0           1.4          0.2
set3           4.7          3.2           1.3          0.2
set4           4.6          3.1           1.5          0.2
set5           5.0          3.6           1.4          0.2
set6           5.4          3.9           1.7          0.4
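
A sketch of this preparation step. The row-name pattern follows the table above; using all 150 rows of iris is an assumption, since only the first six are shown.

```r
# Keep the four numeric columns and give every row a label
dat <- iris[, 1:4]
rownames(dat) <- paste0("set", seq_len(nrow(dat)))

hc <- hclust(dist(dat), method = "average")

# The row names are carried over as labels of the tree
head(hc$labels)

dend <- as.dendrogram(hc)
```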

6.5 Changing label and branch colors by number of clusters: set(dend, "branches_k_color", value)

  • Build the vector of color codes with dendextend::cutree rather than stats::cutree().
  • Setting order_clusters_as_data = FALSE returns the cluster numbers in the order of the leaf labels.
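
A sketch of the two bullets above, assuming a 30-row subset of iris for legibility and an arbitrary three-color palette. The dendextend calls are skipped if the package is not installed.

```r
dat <- iris[seq(1, 150, by = 5), 1:4]
rownames(dat) <- paste0("set", seq_len(nrow(dat)))
dend <- as.dendrogram(hclust(dist(dat), method = "average"))

if (requireNamespace("dendextend", quietly = TRUE)) {
  k <- 3
  pal <- c("#E41A1C", "#377EB8", "#4DAF4A")

  # Cluster numbers returned in the order of the leaves, not of the data
  cl <- dendextend::cutree(dend, k = k, order_clusters_as_data = FALSE)

  # One color code per leaf, indexed by cluster number
  label_cols <- pal[cl]

  dend_col <- dendextend::set(dend, "branches_k_color", value = pal, k = k)
  dend_col <- dendextend::set(dend_col, "labels_colors", value = label_cols)
  plot(dend_col)
}
```

Whether the cluster numbers line up with the left-to-right branch colors is not verified in this sketch, so it is worth checking the plot before relying on the pairing.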

6.9 Removing specified data, showing leaf nodes only, hiding node labels

  • prune: removes the specified data (leaves).
  • leaves_pch, leaves_col, labels

# labels - set the labels (using labels<-.dendrogram)
# labels_colors - set the labels’ colors (using color_labels)
# labels_cex - set the labels’ size (using assign_values_to_leaves_nodePar)
# labels_to_character - set the labels’ to be characters
# leaves_pch - set the leaves’ point type (using assign_values_to_leaves_nodePar)
# leaves_cex - set the leaves’ point size (using assign_values_to_leaves_nodePar)
# leaves_col - set the leaves’ point color (using assign_values_to_leaves_nodePar)
# nodes_pch - set the nodes’ point type (using assign_values_to_nodes_nodePar)
# nodes_cex - set the nodes’ point size (using assign_values_to_nodes_nodePar)
# nodes_col - set the nodes’ point color (using assign_values_to_nodes_nodePar)
# hang_leaves - hang the leaves (using hang.dendrogram)
# branches_k_color - color the branches (using color_branches)
# branches_col - set the color of branches (using assign_values_to_branches_edgePar)
# branches_lwd - set the line width of branches (using assign_values_to_branches_edgePar)
# branches_lty - set the line type of branches (using assign_values_to_branches_edgePar)
# by_labels_branches_col - set the color of branches with specific labels (using branches_attr_by_labels)
# by_labels_branches_lwd - set the line width of branches with specific labels (using branches_attr_by_labels)
# by_labels_branches_lty - set the line type of branches with specific labels (using branches_attr_by_labels)
# clear_branches - clear branches’ attributes (using remove_branches_edgePar)
# clear_leaves - clear leaves’ attributes (using remove_leaves_nodePar)
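
A sketch combining prune with two of the options listed above. The leaves removed and the point style are arbitrary; hiding the labels is done here with the base plot argument leaflab = "none", which may differ from the author's approach.

```r
dat <- iris[seq(1, 150, by = 5), 1:4]
rownames(dat) <- paste0("set", seq_len(nrow(dat)))
dend <- as.dendrogram(hclust(dist(dat), method = "average"))

if (requireNamespace("dendextend", quietly = TRUE)) {
  # Remove two leaves by label
  dend_small <- dendextend::prune(dend, c("set1", "set2"))

  # Draw a filled point at every leaf
  dend_pts <- dendextend::set(dend_small, "leaves_pch", 19)
  dend_pts <- dendextend::set(dend_pts, "leaves_col", "tomato")

  # Suppress the leaf labels when plotting
  plot(dend_pts, leaflab = "none")
}
```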

6.10 Leaving the label heights unaligned

  • dendextend::hang.dendrogram
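
A sketch of hanging the leaves below their parent nodes, assuming the same iris subset as an example. The value hang = 0.1 is arbitrary, and the call is skipped if dendextend is not installed.

```r
dat <- iris[seq(1, 150, by = 5), 1:4]
rownames(dat) <- paste0("set", seq_len(nrow(dat)))
dend <- as.dendrogram(hclust(dist(dat), method = "average"))

if (requireNamespace("dendextend", quietly = TRUE)) {
  # Leaves end a fraction of the tree height below their parent node
  dend_hang <- dendextend::hang.dendrogram(dend, hang = 0.1)
  plot(dend_hang)
}
```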

6.11 Tanglegram
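
A tanglegram draws two dendrograms of the same items face to face and connects matching labels. The sketch below compares average and complete linkage on an iris subset; both choices are assumptions for illustration, and the dendextend calls are skipped if the package is not installed.

```r
dat <- iris[seq(1, 150, by = 5), 1:4]
rownames(dat) <- paste0("set", seq_len(nrow(dat)))
d <- dist(dat)

dend1 <- as.dendrogram(hclust(d, method = "average"))
dend2 <- as.dendrogram(hclust(d, method = "complete"))

if (requireNamespace("dendextend", quietly = TRUE)) {
  # Side-by-side trees with lines joining identical labels
  dendextend::tanglegram(dend1, dend2)

  # Entanglement lies between 0 and 1; lower means better alignment
  ent <- dendextend::entanglement(dend1, dend2)
  print(ent)
}
```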

7 Environment

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tibble_2.1.3       dendextend_1.12.0  RColorBrewer_1.1-2
## [4] ape_5.3            dplyr_0.8.1        proxy_0.4-23      
## [7] amap_0.8-16       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1        highr_0.8         plyr_1.8.4       
##  [4] pillar_1.4.2      compiler_3.5.2    bitops_1.0-6     
##  [7] qdapTools_1.3.3   viridis_0.5.1     tools_3.5.2      
## [10] digest_0.6.19     viridisLite_0.3.0 evaluate_0.14    
## [13] nlme_3.1-140      gtable_0.3.0      lattice_0.20-38  
## [16] pkgconfig_2.0.2   rlang_0.4.0       rstudioapi_0.10  
## [19] yaml_2.2.0        parallel_3.5.2    xfun_0.8         
## [22] rsko_0.1.0        kableExtra_1.1.0  gridExtra_2.3    
## [25] xml2_1.2.0        httr_1.4.0        stringr_1.4.0    
## [28] knitr_1.23        hms_0.4.2         webshot_0.5.1    
## [31] grid_3.5.2        tidyselect_0.2.5  data.table_1.12.2
## [34] glue_1.3.1        R6_2.4.0          rmarkdown_1.13   
## [37] pacman_0.5.1      readr_1.3.1       purrr_0.3.2      
## [40] ggplot2_3.2.0     magrittr_1.5      scales_1.0.0     
## [43] htmltools_0.3.6   rvest_0.3.4       assertthat_0.2.1 
## [46] colorspace_1.4-1  stringi_1.4.3     RCurl_1.95-4.12  
## [49] lazyeval_0.2.2    munsell_0.5.0     chron_2.3-53     
## [52] crayon_1.3.4