October 7, 2015

Background

The goal of this test is to benchmark the performance and output interpretability of different clustering and matrix decomposition methods on a dataset of let-418, hpl-2, lin-61 and lin-13 replicates.

The test data consist of the following chrI tracks, binned at 100 bp and z-scored:

  • LET418_N2_L3_IL026^IL025
  • HPL2_N2_L3_AA381^AA382^AA162
  • LIN61_N2_L3_AA005^AA156^AA160
  • LIN13_NA_NA_AA314^AA207^PK013
##  num [1:150724, 1:4] -1.17 -1.17 -1.16 -1.15 -1.08 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "LET418" "HPL2" "LIN61" "LIN13"
##         LET418       HPL2      LIN61      LIN13
## [1,] -1.166359 -0.7580633 -1.0294355 -0.8053523
## [2,] -1.166359 -0.7580633 -1.0294355 -0.8053523
## [3,] -1.163778 -0.7464810 -1.0248851 -0.7908842
## [4,] -1.152165 -0.6199542 -0.9705874 -0.7114266
## [5,] -1.081366 -0.6224677 -0.7709940 -0.6868739
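The preprocessing described above (100 bp binning followed by z-scoring of each track) can be sketched roughly as follows. This is a minimal illustration in Python, not the code that produced the output above: the function names, the toy signal, the use of the bin mean as the summary, and the choice of the sample standard deviation are all assumptions.

```python
from statistics import fmean, stdev


def bin_signal(signal, bin_size=100):
    """Average a per-base signal over consecutive, non-overlapping bins.

    Assumption: the bin summary is the mean; the last bin may be shorter.
    """
    return [fmean(signal[i:i + bin_size])
            for i in range(0, len(signal), bin_size)]


def z_score(values):
    """Center a track to mean 0 and scale it to unit standard deviation.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    mu = fmean(values)
    sd = stdev(values, mu)
    if sd == 0:
        raise ValueError("cannot z-score a constant track")
    return [(v - mu) / sd for v in values]


# Hypothetical per-base coverage for a 450 bp region (made-up values).
toy_track = [float(i % 7) for i in range(450)]
binned = bin_signal(toy_track)   # 5 bins, the last one covering 50 bases
z = z_score(binned)
```

Applying the same two steps to each of the four tracks and placing the results side by side would give a matrix shaped like the one shown above, with one row per bin and one column per track.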

Models

  • k-means
  • Hierarchical clustering
  • Correlation-based clustering
  • PCA
  • NMF
  • FA
  • SOM
  • NSFA

We want to compare the other models to NSFA. An NSFA run on the test data (15 iterations) produced 13 factors, so we use the same number of clusters, components or factors for the other methods, unless a method can infer that number from the data.

Hierarchical clustering family

Hierarchical clustering

Hierarchical clustering - horizontal plot

Correlation-based hierarchical clustering

Correlation-based heatmap
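As an illustration of what a correlation-based comparison could look like, the sketch below computes pairwise Pearson correlations between tracks and converts them to distances as 1 - r. It assumes that the correlation is taken between tracks (columns) rather than between bins, which the report does not state. The function names are hypothetical, and the input is only the five rows printed in the data preview, so the resulting values say nothing about the full chromosome.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def correlation_distance(tracks):
    """Return {(name_a, name_b): 1 - r} for every pair of tracks."""
    names = list(tracks)
    return {(a, b): 1.0 - pearson(tracks[a], tracks[b])
            for a in names for b in names}


# First five bins of each track, copied from the preview of the test data.
preview = {
    "LET418": [-1.166359, -1.166359, -1.163778, -1.152165, -1.081366],
    "HPL2": [-0.7580633, -0.7580633, -0.7464810, -0.6199542, -0.6224677],
    "LIN61": [-1.0294355, -1.0294355, -1.0248851, -0.9705874, -0.7709940],
    "LIN13": [-0.8053523, -0.8053523, -0.7908842, -0.7114266, -0.6868739],
}
dist = correlation_distance(preview)
```

A distance table of this kind is the usual input to hierarchical clustering, and the underlying correlation matrix is what a correlation heatmap displays.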

K-means family

K-means with 13 clusters

K-means with 13 clusters - cluster coverage

  • Cluster 1: 27.66%
  • Cluster 2: 3.91%
  • Cluster 3: 11.07%
  • Cluster 4: 1.32%
  • Cluster 5: 0.11%
  • Cluster 6: 3.69%
  • Cluster 7: 9.91%
  • Cluster 8: 32.43%
  • Cluster 9: 0.02%
  • Cluster 10: 1.26%
  • Cluster 11: 0.30%
  • Cluster 12: 0.43%
  • Cluster 13: 7.88%
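Coverage figures like those above can be obtained by counting how many bins fall into each cluster. The sketch below assumes that coverage means the percentage of all bins assigned to a cluster. The function name and the toy labels are hypothetical. Under that assumption, and with 150,724 bins in the test data, a coverage of 0.02% would correspond to only about 30 bins.

```python
from collections import Counter


def cluster_coverage(labels):
    """Return {cluster: percentage of bins assigned to it}, 2 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: round(100 * n / total, 2)
            for cluster, n in sorted(counts.items())}


# Made-up cluster assignments for eight bins.
toy_labels = [1, 1, 2, 3, 1, 2, 1, 1]
coverage = cluster_coverage(toy_labels)
```

A strongly skewed distribution, with a couple of clusters holding most bins and several holding well under 1%, is easy to spot in such a table and is worth keeping in mind when comparing methods.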

Clusters in IGV

Clusters interactive