Objective

We will apply clustering and visualization methods to simulated data with different information-to-noise ratios. A series of experiments will show the properties of each method and provide guidance for future applications.
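One way such simulated data could be generated is sketched below. It assumes NumPy is available, and the function name `simulate_clusters` and all of its parameters are hypothetical. Here the information-to-noise ratio is controlled by two knobs that are our own assumption, not a fixed definition: the number of pure-noise features (`n_noise`) and the spread of the cluster means (`separation`).

```python
import numpy as np

def simulate_clusters(n_per_cluster=50, n_clusters=3, n_signal=5, n_noise=20,
                      separation=3.0, seed=0):
    """Simulate Gaussian clusters whose means differ only in the signal features.

    Hypothetical setup: the first n_signal columns carry the cluster structure,
    the remaining n_noise columns are pure standard-normal noise.
    """
    rng = np.random.default_rng(seed)
    # Cluster means are drawn at random in the signal subspace and scaled by `separation`.
    centers = rng.normal(size=(n_clusters, n_signal)) * separation
    labels = np.repeat(np.arange(n_clusters), n_per_cluster)
    # Each sample is its cluster mean plus unit-variance noise in the signal features.
    signal = centers[labels] + rng.normal(size=(labels.size, n_signal))
    noise = rng.normal(size=(labels.size, n_noise))
    X = np.hstack([signal, noise])
    return X, labels

X, y = simulate_clusters()
print(X.shape, np.bincount(y))
```

Raising `n_noise` or lowering `separation` makes the clusters harder to recover, which gives a simple axis along which the methods can be compared.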

Project Description

Clustering part

We will use k-means clustering (with different kernels), spectral clustering (with different variants), Gaussian mixture models fitted with the EM algorithm, and t-distribution mixture models fitted with the EM algorithm to generate the clustering results. We will then compare and further analyze the results.
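To make the EM-based approach concrete, here is a minimal sketch of EM for a Gaussian mixture, assuming NumPy. It is an illustration under simplifying assumptions rather than the project's implementation: covariances are restricted to be diagonal, the number of iterations is fixed, and the helper `farthest_first` (a deterministic initialisation) is our own choice.

```python
import numpy as np

def farthest_first(X, k):
    """Pick k starting means: X[0], then repeatedly the point farthest from those chosen."""
    idx = [0]
    dist = np.sum((X - X[0]) ** 2, axis=1)
    for _ in range(k - 1):
        idx.append(int(np.argmax(dist)))
        dist = np.minimum(dist, np.sum((X - X[idx[-1]]) ** 2, axis=1))
    return X[idx].copy()

def gmm_em(X, k, n_iter=50):
    """EM for a Gaussian mixture with diagonal covariances (illustrative sketch)."""
    n, p = X.shape
    means = farthest_first(X, k)
    variances = np.tile(X.var(axis=0) + 1e-6, (k, 1))        # shape (k, p)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log(weight * density) for every sample/component pair.
        diff2 = (X[:, None, :] - means[None, :, :]) ** 2      # shape (n, k, p)
        log_pdf = -0.5 * (np.sum(diff2 / variances, axis=2)
                          + np.sum(np.log(2 * np.pi * variances), axis=1))
        log_joint = np.log(weights) + log_pdf                 # shape (n, k)
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_norm)                   # responsibilities
        # M-step: re-estimate weights, means and variances from the responsibilities.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        diff2 = (X[:, None, :] - means[None, :, :]) ** 2
        variances = np.einsum("nk,nkp->kp", resp, diff2) / nk[:, None] + 1e-6
    # Hard assignment: the component with the largest responsibility.
    return resp.argmax(axis=1), means

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
               rng.normal(10.0, 1.0, size=(60, 2))])
labels, means = gmm_em(X, k=2)
print(np.bincount(labels), means.round(1))
```

A t-distribution mixture follows the same E-step/M-step pattern, with an extra per-sample weight that down-weights outliers; that variant is not shown here.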

Visualization part

Although clustering is an unsupervised learning task, many algorithms still require the number of clusters k to be chosen in advance, so visualization is needed to guide that choice.

Visualization can also be regarded as a form of dimension reduction: by reducing the dimension from p to 2 or 3, it gives an intuitive picture of how the data are distributed. We will perform PCA (principal component analysis), MDS (multidimensional scaling), LLE (locally linear embedding) and t-SNE (t-distributed stochastic neighbor embedding). We will also draw on material taught in class, e.g. optimization in t-SNE and sparse matrices in LLE.
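As a small illustration of the two linear methods, the sketch below (assuming NumPy; the function names `pca_embed` and `classical_mds` are hypothetical) reduces data from p dimensions to 2 with PCA via the SVD and with classical (Torgerson) MDS, which is only one of several MDS variants. When the input distances are Euclidean, the two embeddings coincide up to the sign of each axis, which the last line checks.

```python
import numpy as np

def pca_embed(X, d=2):
    """Project the rows of X onto the top-d principal components."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :d] * S[:d]                    # same as Xc @ Vt[:d].T

def classical_mds(D, d=2):
    """Classical (Torgerson) MDS from a matrix of pairwise distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:d]           # indices of the d largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10)) * np.arange(1, 11)   # columns with unequal variance
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
Y_pca = pca_embed(X)
Y_mds = classical_mds(D)
# Compare absolute values because each axis is only defined up to sign.
print(np.allclose(np.abs(Y_pca), np.abs(Y_mds), atol=1e-6))
```

LLE and t-SNE are nonlinear and iterative, so they do not reduce to a single matrix decomposition like this.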

Development

Furthermore, to divide the samples into k clusters, we can first use the above methods to reduce the data to a lower dimension and then apply the clustering methods. This may improve the final result because of “information accumulation” and “noise reduction”.
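The reduce-then-cluster idea can be sketched as follows, assuming NumPy. PCA and k-means stand in for whichever reduction and clustering methods are actually combined; the toy data, the helper names and the `purity` score are all hypothetical choices made for this illustration.

```python
import numpy as np

def pca_embed(X, d):
    """Project the rows of X onto the top-d principal components."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :d] * S[:d]

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first initialisation."""
    centers = [X[0]]
    dist = np.sum((X - X[0]) ** 2, axis=1)
    for _ in range(k - 1):
        centers.append(X[np.argmax(dist)])
        dist = np.minimum(dist, np.sum((X - centers[-1]) ** 2, axis=1))
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every sample to its nearest center, then recompute the centers.
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def purity(true, pred):
    """Fraction of samples that belong to the majority true class of their cluster."""
    return sum(np.bincount(true[pred == j]).max() for j in np.unique(pred)) / true.size

# Toy data: 3 clusters separated in 2 signal features, plus 50 pure-noise features.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 40)
centers = np.array([[0.0, 0.0], [12.0, 0.0], [0.0, 12.0]])
X = np.hstack([centers[y] + rng.normal(size=(120, 2)), rng.normal(size=(120, 50))])

Z = pca_embed(X, d=2)            # step 1: reduce from 52 dimensions to 2
labels = kmeans(Z, k=3)          # step 2: cluster in the reduced space
print(purity(y, labels))
```

Whether the reduction step helps in practice depends on how much of the cluster structure the low-dimensional embedding retains, which is one of the things the experiments are meant to show.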

Input

n × p matrix of gene expression across n samples and p genes.

Output

Clustering results from the different methods, each dividing the genes into k groups.

NMI (normalized mutual information) of the different methods, which is a criterion of clustering quality.

Visualizations of the different methods’ results, and animated GIFs showing the convergence of iterative methods such as t-SNE.
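For reference, the NMI criterion listed above can be computed from two label vectors as in the sketch below, which uses only the Python standard library. Several normalisations of mutual information are in use; this version divides by the arithmetic mean of the two entropies, which is an assumption on our part, and the function name `nmi` is hypothetical.

```python
from collections import Counter
from math import log

def nmi(true_labels, pred_labels):
    """NMI with arithmetic-mean normalisation: 2 * I(U;V) / (H(U) + H(V))."""
    n = len(true_labels)
    count_u = Counter(true_labels)
    count_v = Counter(pred_labels)
    count_uv = Counter(zip(true_labels, pred_labels))
    # Entropies of the two partitions.
    h_u = -sum(c / n * log(c / n) for c in count_u.values())
    h_v = -sum(c / n * log(c / n) for c in count_v.values())
    # Mutual information from the joint counts.
    mi = sum(c / n * log(c * n / (count_u[u] * count_v[v]))
             for (u, v), c in count_uv.items())
    if h_u + h_v == 0:
        return 1.0  # convention used here: both partitions are a single cluster
    return 2 * mi / (h_u + h_v)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))   # same partition with labels swapped
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))   # statistically independent partitions
```

NMI is invariant to relabelling, so the first call scores a perfect match even though the label values differ, while the second call scores zero.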

Example of t-SNE: Four Groups

Example of t-SNE: Two Linked Rings in Three-Dimensional Space

Main Reference Papers

An Introduction to Locally Linear Embedding (Lawrence K. Saul) https://cs.nyu.edu/~roweis/lle/papers/lleintro.pdf

Visualizing Data using t-SNE (Laurens van der Maaten) https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Literature Review on Spectral Clustering (Mengna Chen) https://escholarship.org/uc/item/08q7s99b