class: center, middle, inverse, title-slide

.title[
# Dimension Reduction
]
.author[
### Jun Kang
]
.date[
### 2022.06.15
]

---

##### DNA methylation-based classification of central nervous system tumours

.footer[Nature 555, 469–474 (2018). https://doi.org/10.1038/nature26000]

---

##### Sarcoma classification by DNA methylation profiling

.footer[Nat Commun 12, 498 (2021). https://doi.org/10.1038/s41467-020-20603-4]

---

.footer[J Hum Genet 66, 85–91 (2021). https://doi.org/10.1038/s10038-020-00851-4]

???

The Genome Aggregation Database (gnomAD, left) and Biobank Japan (BBJ, right) visualized using UMAP. UMAP illustrates the ancestral diversity of gnomAD, showing many of the relationships between populations on continental and subcontinental levels. For the relatively more homogeneous BBJ data, it splits the data geographically into the large mainland cluster (consisting of the Hokkaido, Tohoku, Kanto-Koshinetsu, Chubu-Hokuriku, Kinki, and Kyushu regions) and smaller non-mainland clusters. The gnomAD image is reproduced from [10], and the BBJ image is reproduced from [12].

---

# High dimensional data

* Population genetics
* Single cell sequencing
* Spatial transcriptomics

---

### Dimension reduction techniques

* Principal component analysis (PCA)
* t-Distributed Stochastic Neighbor Embedding (t-SNE)
* Uniform Manifold Approximation and Projection (UMAP)

---

### Dimension reduction

* Change and select a basis for clustering
* Visualization in 2-dimensional space

---

# Concepts

* Dimension
* Basis or Latent features
* Graph
* Projection or Embedding
* Linearity

---

.footer[J Hum Genet 66, 85–91 (2021). https://doi.org/10.1038/s10038-020-00851-4]

???

PCA (left) and UMAP (right) projections of the UKB data, coloured by self-identified ethnic background. Unlike PCA, UMAP focuses on preserving local relationships and emphasizes fine-scale patterns in data.
Groups in the UMAP projection are less compressed, showing, for example, the relative size of the British and Irish populations in the UKB, alongside populations of other ancestries, while simultaneously showing the population structure between and within groups.

---

class: center

.footer[J Hum Genet 66, 85–91 (2021). https://doi.org/10.1038/s10038-020-00851-4]

???

UMAP projection of the same genotype data from the 1000GP, comparing parametrization with a small (top) and large (bottom) number of nearest neighbours. Left images are coloured by population; right images are the same points but with the simplicial complex drawn. When adding more neighbours, subclusters become less separated, as with the LWK population, for example. Looking at the connectivity maps, we see new connections between continental groups, such as the Central/South American clusters and East Asian clusters. Darker lines indicate that individuals are closer to each other in genotype space.

---

# Iris

.footer[/ˈpedl/, /ˈsēpəl/]

---

class: center

## Iris data
???

Which two basis features are best for clustering? Or just one petal feature?

---

## Dimension reduction by human selection

<!-- -->

---

## Clustering? Select one dimension

<!-- -->

---

## Principal component analysis (PCA)

---

## Linearity

$$\begin{aligned}
\mathrm{PC1} &= \alpha_1 v_1 + \alpha_2 v_2 + \alpha_3 v_3 + \dots + \alpha_n v_n \\
\mathrm{PC2} &= \beta_1 v_1 + \beta_2 v_2 + \beta_3 v_3 + \dots + \beta_n v_n
\end{aligned}$$

---

class: center

# Projection

---

# Orthogonality

* PCs are orthogonal to each other

---

# Principal component analysis (PCA)
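
A minimal sketch of linearity, projection, and orthogonality in two dimensions. The toy points and the helper name `pca_2d` are illustrative assumptions, not the data or code behind the slides; the closed-form angle only works for the 2x2 covariance case.

```python
import math

# Hypothetical toy data: five 2-D points (v1, v2), not taken from the slides.
points = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2), (-2.0, -1.8), (0.5, 0.4)]


def pca_2d(data):
    """Return (pc1, pc2, scores) for 2-D data.

    pc1 and pc2 are unit vectors; scores are the coordinates of each
    centered point projected onto pc1 and pc2.
    """
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    centered = [(x - mean_x, y - mean_y) for x, y in data]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Direction of maximum variance (leading eigenvector of the covariance).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))  # pc1 rotated by 90 degrees

    # Each score is a linear combination of the original variables.
    scores = [
        (x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1])
        for x, y in centered
    ]
    return pc1, pc2, scores


pc1, pc2, scores = pca_2d(points)
print("PC1 weights:", pc1)
print("PC2 weights:", pc2)
print("PC1 . PC2 =", pc1[0] * pc2[0] + pc1[1] * pc2[1])
```

The PC1 weights play the role of the coefficients in the linearity formula, and the dot product of the two weight vectors is zero up to rounding.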
---

# Latent variables (features)

* Linear
* Non-linear

---

# Non-linear

.footer[http://rasbt.github.io/mlxtend/user_guide/feature_extraction/RBFKernelPCA/]

---

## Non-linear

* Feature expansion: Lower dimension to higher dimension
* Gaussian kernel (radial basis function kernel)

---

The "lifting trick". (a) A binary classification problem that is not linearly separable in `\(\mathbb{R}^2\)`. (b) A lifting of the data into `\(\mathbb{R}^3\)` using a polynomial kernel, `\(\varphi([x_1 \;\; x_2]) = [x_1^2 \;\; x_2^2 \;\; \sqrt{2} x_1 x_2]\)`.

Polynomial kernel: `\((x_1 + x_2)^2 = x_1^2 + x_2^2 + 2x_1x_2\)`

.footer[https://gregorygundersen.com/blog/2019/12/10/kernel-trick/]

---

## Kernel basis functions (Eigenfunctions)

.footer[Electronic Journal of Statistics. 10. 423-463. 10.1214/16-EJS1112.]

---

.footer[https://www.cs.cornell.edu/courses/cs4786/2020sp/lectures/lec08.pdf]

???

The notion of a basis carries over to the concept of a manifold.

---

## t-Distributed Stochastic Neighbor Embedding (t-SNE)

.footer[https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf]

---

## Graph (Neighbor) Embedding

.footer[https://gearons.org/blog/2016/MAGE/]

---

# Probabilistic/Stochastic Neighbor Embedding (SNE)

---

## Similarity scores (probabilistic/stochastic)

.footer[https://towardsdatascience.com/t-sne-machine-learning-algorithm-a-great-tool-for-dimensionality-reduction-in-python-ec01552f1a1e]

---

## Similarity matrix (high dimension)

.footer[https://towardsdatascience.com/t-sne-machine-learning-algorithm-a-great-tool-for-dimensionality-reduction-in-python-ec01552f1a1e]

---

## Similarity matrix (low dimension, initial)

.footer[https://towardsdatascience.com/t-sne-machine-learning-algorithm-a-great-tool-for-dimensionality-reduction-in-python-ec01552f1a1e]

---

## Minimize the Kullback–Leibler divergence (KL divergence) through gradient descent
* Learning rate
* Iteration number

---

## Perplexity

---

## Perplexity

.footer[https://towardsdatascience.com/t-sne-machine-learning-algorithm-a-great-tool-for-dimensionality-reduction-in-python-ec01552f1a1e]

---

.footer[https://distill.pub/2016/misread-tsne/]

---

class: center

### Simplicial complex

---

## Test data set of a noisy sine wave

.footer[https://umap-learn.readthedocs.io/en/latest/how_umap_works.html]

---

# Graph with combined edge weights

---

## Minimize cross entropy

$$\sum_{e\in E} \left[ w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right) + (1 - w_h(e)) \log\left(\frac{1 - w_h(e)}{1 - w_l(e)}\right) \right]$$

---

## Parameters of UMAP

* Number of neighbors
* Minimum distance
* Learning rate
* Number of epochs

---

# Principal component analysis (PCA)
---

## t-Distributed Stochastic Neighbor Embedding (t-SNE)
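
A rough sketch of the three quantities t-SNE works with: Gaussian similarity scores, perplexity, and the KL divergence that gradient descent minimizes. The function names and the squared distances are hypothetical, chosen only for illustration; a real run would use a library implementation.

```python
import math


def conditional_probs(sq_dists, sigma):
    """Similarity scores p(j|i) from squared distances of point i to its neighbours."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probs):
    """2 raised to the Shannon entropy (in bits) of the similarity scores."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy


def kl_divergence(p, q):
    """KL(P || Q): zero when the low-dimensional scores Q match P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical squared distances from one point to four neighbours.
sq_dists = [1.0, 4.0, 9.0, 16.0]
narrow = conditional_probs(sq_dists, sigma=0.5)
wide = conditional_probs(sq_dists, sigma=5.0)

print("perplexity, narrow kernel:", perplexity(narrow))
print("perplexity, wide kernel:  ", perplexity(wide))
print("KL(narrow || wide):       ", kl_divergence(narrow, wide))
```

A wider Gaussian spreads the probability over more neighbours, so the perplexity grows towards the number of neighbours; this is why perplexity is read as an effective neighbourhood size.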
---

## Uniform Manifold Approximation and Projection (UMAP)
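
A small sketch of the cross entropy between high-dimensional edge weights `w_h` and low-dimensional edge weights `w_l`. The function name `fuzzy_cross_entropy`, the clipping constant, and the example weights are assumptions for illustration, not umap-learn's API.

```python
import math


def fuzzy_cross_entropy(w_high, w_low, eps=1e-12):
    """Sum over edges of w_h*log(w_h/w_l) + (1-w_h)*log((1-w_h)/(1-w_l))."""
    total = 0.0
    for wh, wl in zip(w_high, w_low):
        # Clip the low-dimensional weight so the logarithms stay finite.
        wl = min(max(wl, eps), 1.0 - eps)
        if wh > 0:
            total += wh * math.log(wh / wl)  # attractive term
        if wh < 1:
            total += (1 - wh) * math.log((1 - wh) / (1 - wl))  # repulsive term
    return total


# Hypothetical edge weights for three edges of the neighbour graph.
w_high = [0.9, 0.5, 0.1]
good_layout = [0.85, 0.5, 0.15]
poor_layout = [0.2, 0.5, 0.8]

print("good layout:", fuzzy_cross_entropy(w_high, good_layout))
print("poor layout:", fuzzy_cross_entropy(w_high, poor_layout))
```

The first term pulls together points joined by a strong edge, and the second pushes apart points joined by a weak one; a layout whose weights match the high-dimensional graph scores lower.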
.footer[http://mlss2018.net.ar/slides/Pfau-1.pdf]