2025-03-12

Introduction

Principal Component Analysis (PCA) is a statistical technique that reduces the dimensionality of large datasets by projecting them onto a small number of orthogonal axes, the principal components, chosen to capture as much of the variance as possible. In bioinformatics, PCA is often applied to gene expression data to reveal patterns and groupings among samples.

Mathematical Foundations

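PCA is based on the eigendecomposition of the covariance matrix. For a data matrix \(X\) with samples in rows and genes in columns, where each column has been centered to mean zero, the covariance matrix \(\Sigma\) is given by:

\[ \Sigma = \frac{1}{n-1} X^T X \]

where \(n\) is the number of samples. Without the centering step, \(X^T X / (n-1)\) is not the covariance matrix.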
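
As a quick check of the formula, the sketch below centers a small simulated matrix, applies \(X^T X / (n-1)\), and compares the result with R's built-in cov(). The seed, the matrix dimensions, and the variable names are illustrative assumptions, not values taken from this presentation.

```r
# Simulated stand-in for an expression matrix: samples in rows, genes in columns.
# Seed, dimensions, and names are illustrative assumptions.
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20,
            dimnames = list(paste0("Sample", 1:20), paste0("Gene", 1:5)))

n <- nrow(X)

# Center each column (gene) at zero before applying the formula
X_centered <- scale(X, center = TRUE, scale = FALSE)

# Covariance matrix from the formula: t(X_centered) %*% X_centered / (n - 1)
Sigma <- crossprod(X_centered) / (n - 1)

# Cross-check against R's built-in estimator
matches_builtin <- isTRUE(all.equal(Sigma, cov(X), check.attributes = FALSE))
```

Here matches_builtin records whether the two results agree within numerical tolerance.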

PCA Transformation

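Once the eigenvalues and eigenvectors of \(\Sigma\) are computed, the centered data \(X\) is projected onto the eigenvectors to obtain the principal component scores \(Z\):

\[ Z = XW \]

where \(W\) is the matrix whose columns are the eigenvectors of \(\Sigma\), ordered by decreasing eigenvalue. Keeping only the first \(k\) columns of \(W\) yields a \(k\)-dimensional representation of the data.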
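
The transformation can be sketched directly with eigen() and compared against prcomp(), which computes the same quantities through a singular value decomposition. The simulated data and variable names are assumptions for illustration. Eigenvector signs are arbitrary, so the comparison uses absolute values.

```r
# Illustrative data: 20 samples x 5 genes (seed and dimensions are assumptions)
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20)

# Center the columns and form the covariance matrix
X_centered <- scale(X, center = TRUE, scale = FALSE)
Sigma <- cov(X)

# eigen() returns eigenvalues in decreasing order for a symmetric matrix
eig <- eigen(Sigma, symmetric = TRUE)
W <- eig$vectors

# Principal component scores: Z = X W, using the centered data
Z <- X_centered %*% W

# Reference result from prcomp() on the same data, without scaling
pca_ref <- prcomp(X, center = TRUE, scale. = FALSE)

# Scores agree up to the sign of each component
scores_agree <- isTRUE(all.equal(abs(Z), abs(unname(pca_ref$x)),
                                 check.attributes = FALSE))

# Eigenvalues equal the variances of the principal components
variances_agree <- isTRUE(all.equal(eig$values, pca_ref$sdev^2))
```

The variance of each column of Z equals the corresponding eigenvalue, which is why the eigenvalues are used to report explained variance.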

Data Example

Below we simulate gene expression data for demonstration. In this example, each row represents a sample and each column a gene.

##               Gene1      Gene2      Gene3      Gene4       Gene5
## Sample1 -0.56047565 -0.7104066  2.1988103 -0.7152422 -0.07355602
## Sample2 -0.23017749  0.2568837  1.3124130 -0.7526890 -1.16865142
## Sample3  1.55870831 -0.2466919 -0.2651451 -0.9385387 -0.63474826
## Sample4  0.07050839 -0.3475426  0.5431941 -1.0525133 -0.02884155
## Sample5  0.12928774 -0.9516186 -0.4143399 -0.4371595  0.67069597
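
A matrix with this layout could be generated roughly as follows. This is a minimal sketch: the seed, the number of samples, and the use of standard normal values are assumptions, since the presentation does not show its simulation code.

```r
# Assumed settings; the source does not state them
set.seed(123)
n_samples <- 100
n_genes <- 5

# Standard normal values arranged with samples in rows and genes in columns
expr <- matrix(rnorm(n_samples * n_genes),
               nrow = n_samples, ncol = n_genes,
               dimnames = list(paste0("Sample", seq_len(n_samples)),
                               paste0("Gene", seq_len(n_genes))))

# Show the first five samples
head(expr, 5)
```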

2D PCA with ggplot

We perform PCA on the simulated data and use ggplot2 to plot the first two principal components.
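
A minimal sketch of this step is shown below. It assumes the PCA is run with prcomp() on centered, unscaled data, to match the covariance formulation above. The simulated matrix, the object names, and the plot styling are illustrative choices rather than the presentation's original code.

```r
library(ggplot2)

# Illustrative simulated data (seed and dimensions are assumptions)
set.seed(123)
expr <- matrix(rnorm(100 * 5), nrow = 100,
               dimnames = list(paste0("Sample", 1:100), paste0("Gene", 1:5)))

# PCA on centered, unscaled data
pca_result <- prcomp(expr, center = TRUE, scale. = FALSE)

# Scores on the first two principal components
pca_data <- data.frame(Sample = rownames(expr),
                       PC1 = pca_result$x[, 1],
                       PC2 = pca_result$x[, 2])

# Scatter plot of PC1 against PC2
p2d <- ggplot(pca_data, aes(x = PC1, y = PC2)) +
  geom_point(color = "steelblue", size = 2) +
  labs(title = "PCA of simulated expression data", x = "PC1", y = "PC2") +
  theme_minimal()
```

Calling print(p2d) renders the plot. If genes are measured on very different scales, setting scale. = TRUE is a common alternative, which amounts to PCA on the correlation matrix.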

Explained Variance with ggplot

Visualize the percentage of variance explained by each principal component.
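
The percentage of variance explained by each component is its eigenvalue divided by the sum of all eigenvalues. In a prcomp() result the eigenvalues are the squared sdev values. The sketch below computes these percentages and draws a bar chart. The data, object names, and styling are assumptions for illustration.

```r
library(ggplot2)

# Illustrative simulated data (seed and dimensions are assumptions)
set.seed(123)
expr <- matrix(rnorm(100 * 5), nrow = 100,
               dimnames = list(paste0("Sample", 1:100), paste0("Gene", 1:5)))

pca_result <- prcomp(expr, center = TRUE, scale. = FALSE)

# Variance of each component as a percentage of the total variance
eigenvalues <- pca_result$sdev^2
var_explained <- 100 * eigenvalues / sum(eigenvalues)

# Keep components in PC1, PC2, ... order on the x axis
pc_labels <- paste0("PC", seq_along(var_explained))
var_data <- data.frame(PC = factor(pc_labels, levels = pc_labels),
                       Percent = var_explained)

# Bar chart of explained variance per component
p_var <- ggplot(var_data, aes(x = PC, y = Percent)) +
  geom_col(fill = "steelblue") +
  labs(title = "Variance explained by each principal component",
       x = "Principal component", y = "Variance explained (%)") +
  theme_minimal()
```

Calling print(p_var) renders the chart. The percentages sum to 100 and decrease from PC1 onward because prcomp() orders components by variance.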

3D PCA Plot with Plotly

Here we create an interactive 3D Plotly plot of the first three principal components.

R Code Example

Below is the R code used to create the 3D plot. It expects an object named pca_result holding the PCA output; the use of pca_result$x suggests it was produced by prcomp().

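library(plotly)  # provides plot_ly(), layout(), and the %>% pipe

# Scores on the first three principal components.
# pca_result is assumed to be the object returned by prcomp().
pca_data_3d <- data.frame(PC1 = pca_result$x[, 1],
                          PC2 = pca_result$x[, 2],
                          PC3 = pca_result$x[, 3])

# Interactive 3D scatter plot of the scores
p <- plot_ly(pca_data_3d, x = ~PC1, y = ~PC2, z = ~PC3,
             type = "scatter3d", mode = "markers",
             marker = list(size = 3, color = "red")) %>%
  layout(title = "3D PCA Plot Example")
invisible(p)  # returns p without displaying it; evaluate p to render the plot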

Conclusion

PCA is an essential tool in bioinformatics for reducing data complexity and revealing underlying structures in gene expression data. This presentation demonstrated how to perform PCA and visualize its results using both static (ggplot2) and interactive (plotly) plots. Further analysis may involve integrating PCA with clustering or other dimensionality reduction techniques.