DS Labs Assignment - Tissue Gene Expression Dataset - DATA110

Author

Jorge Pineda

Tissue Gene Expression Dataset

This visualization uses the tissue_gene_expression dataset from the dslabs package. The dataset contains gene expression levels for 500 genes measured across 189 tissue samples, categorized into 7 tissue types including cerebellum, hippocampus, liver, colon and others. To visualize this high-dimensional data, we can perform Principal Component Analysis (PCA), a common dimensionality reduction technique that captures the most variance in the data using fewer dimensions.

In this scatterplot, we plot the first two principal components (PC1 and PC2) and color each point by its corresponding tissue type. This helps identify natural groupings or similarities in gene expression patterns across tissues.

Load Packages and Data

library(dslabs)
data("tissue_gene_expression")
library(tidyverse)    # For ggplot and data manipulation

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)     # Extra themes; theme_classic() used below ships with ggplot2
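Before running PCA, it can help to confirm that the data have the shape described in the introduction. The following is a minimal sketch, not part of the original assignment: `describe_expression()` is a hypothetical helper, and the `toy` object is simulated so the block runs without dslabs. It assumes only that the input is a list holding a numeric matrix `x` (samples in rows) and a factor `y`, which is the layout the code in this document relies on.

```r
# Hypothetical helper: report matrix dimensions and the number of samples per label
describe_expression <- function(dat) {
  list(
    n_samples = nrow(dat$x),
    n_genes = ncol(dat$x),
    samples_per_tissue = table(dat$y)
  )
}

# Toy object with the same x / y layout (assumption: values are simulated)
set.seed(42)
toy <- list(
  x = matrix(rnorm(6 * 4), nrow = 6, ncol = 4),
  y = factor(c("a", "a", "b", "b", "c", "c"))
)

info <- describe_expression(toy)
info
```

With the real data loaded, `describe_expression(tissue_gene_expression)` should report the 189 samples, 500 genes, and 7 tissue types mentioned above.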

Generate PCA Data and Summarize Gene Patterns

A gene expression level represents how active a gene is in a given tissue. Higher values mean the gene is more “activated”. PCA identifies new axes (principal components) that capture the largest sources of variation in the data. Plotting PC1 and PC2 makes sense because, of all the components, they explain the largest shares of variance and thus reveal the strongest underlying patterns.

Visualizing samples in this reduced 2D space allows us to see natural groupings or “clusters” by tissue type. Clustering means that samples from the same tissue group together based on their gene expression patterns. It indicates that gene expression signatures differ systematically between tissues, which makes biological sense because gene regulation varies by tissue.
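To make "new axes that capture the largest sources of variation" more concrete, the sketch below reproduces what `prcomp()` does with its default settings: center each column, take a singular value decomposition, and project the centered data onto the right singular vectors. The matrix `toy_x` is simulated and stands in for the expression matrix (an assumption for illustration only). The comparison uses absolute values because the sign of a principal component is arbitrary.

```r
# Small simulated matrix standing in for the expression matrix (illustration only)
set.seed(3)
toy_x <- matrix(rnorm(15 * 4), nrow = 15, ncol = 4)

# Step 1: center each column (prcomp() centers by default and does not rescale)
centered <- scale(toy_x, center = TRUE, scale = FALSE)

# Step 2: singular value decomposition of the centered matrix
sv <- svd(centered)

# Step 3: project the samples onto the new axes to get the PC scores
scores_manual <- centered %*% sv$v

# Compare with prcomp(); component signs may differ, so compare magnitudes
fit <- prcomp(toy_x)
max_score_diff <- max(abs(abs(scores_manual) - abs(fit$x)))

# Standard deviations of the components follow from the singular values
sdev_manual <- sv$d / sqrt(nrow(toy_x) - 1)

max_score_diff
```

A difference at the level of floating-point error indicates that the manual projection and `prcomp()` agree.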

# Perform PCA on the gene expression matrix using the built-in R function prcomp()
pca <- prcomp(tissue_gene_expression$x)

# Create a data frame with first 2 principal components and tissue labels
tissue_pca_df <- data.frame(PC1 = pca$x[,1],
                            PC2 = pca$x[,2],
                            Tissue = tissue_gene_expression$y)
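The claim that PC1 and PC2 explain the most variance can be checked directly from the fitted object, since `prcomp()` stores the standard deviation of each component in `sdev`. The helper `variance_explained()` below is a hypothetical name introduced for this sketch, and the demonstration uses a simulated matrix rather than the tissue data.

```r
# Hypothetical helper: share of total variance captured by each principal component
variance_explained <- function(pca_fit) {
  pca_fit$sdev^2 / sum(pca_fit$sdev^2)
}

# Simulated matrix standing in for tissue_gene_expression$x (illustration only)
set.seed(1)
toy_x <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)
toy_pca <- prcomp(toy_x)

ve <- variance_explained(toy_pca)

# Percent of variance per component, largest first
round(100 * ve, 1)
```

With the objects from this document, `variance_explained(pca)[1:2]` would give the shares for PC1 and PC2, which could also be added to the axis labels of the scatterplot.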

Scatterplot: PC1 vs PC2 Colored by Tissue Type

# Define a palette of seven distinguishable colors, one per tissue type (under the 10-color limit)
my_colors <- c("#E41A1C","#FF7F00","#4DAF4A","#984EA3","#377EB8","#FDB462", "#A65628")

# Create the scatterplot
p <- ggplot(tissue_pca_df, aes(x = PC1, y = PC2, color = Tissue)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(title = "PCA of Gene Expression by Tissue Type",
       x = "Principal Component 1",
       y = "Principal Component 2",
       color = "Tissue Type") +
  scale_color_manual(values = my_colors) +
  theme_classic(base_size = 14)

p

Interpretation

Principal Component Analysis (PCA) compresses the gene expression data, originally 500 genes per sample, into two dimensions that preserve as much of the variation as two axes can. By plotting the first two principal components (PC1 and PC2), we highlight the dominant patterns in the dataset. Each point represents a tissue sample, and its position is determined by how its overall gene expression profile aligns with these principal axes.

Coloring by tissue type reveals that samples from the same tissue tend to cluster together, which indicates that their gene expression patterns are more similar to each other than to those from other tissues. This clustering is not arbitrary: it suggests that gene activity differs systematically by tissue, affirming the biological relevance of the data. In this way, PCA reduces dimensionality while preserving biologically meaningful variation, supporting the idea that gene expression is tissue-specific and that each tissue has its own expression fingerprint.
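The visual impression of clustering can be backed by a simple number. One possible summary, which is an addition here and not a method used in the original analysis, is the fraction of the total PC1/PC2 variation that lies between group centroids as opposed to within groups. The helper `between_group_share()` is a hypothetical name, and the example uses simulated scores with three well-separated groups.

```r
# Hypothetical helper: fraction of total variation in the scores that lies
# between group centroids (1 minus within-group SS over total SS)
between_group_share <- function(scores, groups) {
  scores <- as.matrix(scores)
  # Total sum of squares around the overall mean
  total_ss <- sum(sweep(scores, 2, colMeans(scores))^2)
  # Within-group sum of squares: each sample measured against its own centroid
  within_ss <- sum(sapply(split(as.data.frame(scores), groups), function(g) {
    g <- as.matrix(g)
    sum(sweep(g, 2, colMeans(g))^2)
  }))
  1 - within_ss / total_ss
}

# Simulated scores: three groups separated along the first axis (illustration only)
set.seed(2)
toy_groups <- factor(rep(c("a", "b", "c"), each = 10))
toy_scores <- cbind(
  PC1 = rnorm(30, mean = rep(c(-5, 0, 5), each = 10)),
  PC2 = rnorm(30)
)

share <- between_group_share(toy_scores, toy_groups)
share
```

For the tissue data, the call would be `between_group_share(tissue_pca_df[, c("PC1", "PC2")], tissue_pca_df$Tissue)`. A value close to 1 would be consistent with the tight, well-separated clusters described above, while a value near 0 would mean tissue labels explain little of the spread.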

Additional Perspective

Imagine you’re listening to a choir of 500 genes singing at different volumes in each tissue. PCA helps us pick out the loudest harmonies, the dominant patterns in gene expression, and PC1 and PC2 capture those primary melodies. When samples from the same tissue type cluster in this PCA plot, it’s as if each tissue is singing its own distinct song. This clustering indicates that gene expression is tissue-specific and that PCA has successfully revealed biologically meaningful structure in the data.

Note

I avoided using the default theme and default color scheme. I selected custom colors inspired by the Set1 palette from the package RColorBrewer but manually adjusted them to improve contrast, especially for lighter categories like liver. This ensures all seven tissue types remain distinct while staying within the 10-color limit.
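Because `scale_color_manual()` assigns an unnamed vector of colors to factor levels by position, the intended pairing of color and tissue depends on the level order. A small sketch of one way to make that pairing explicit is shown below. The factor `toy_tissue` and its labels are placeholders invented for illustration; only the color codes come from this document.

```r
# Palette copied from the plotting code in this document
my_colors <- c("#E41A1C", "#FF7F00", "#4DAF4A", "#984EA3",
               "#377EB8", "#FDB462", "#A65628")

# Placeholder factor standing in for tissue_pca_df$Tissue
# (assumption: labels "tissue_1" ... "tissue_7" are made up for illustration)
toy_tissue <- factor(paste0("tissue_", 1:7))

# Attach each color to a level name so the mapping no longer depends on order
named_colors <- setNames(my_colors, levels(toy_tissue))
named_colors
```

In the plot itself, the same idea would read `scale_color_manual(values = setNames(my_colors, levels(tissue_pca_df$Tissue)))`, which keeps each tissue tied to its color even if the factor levels are later reordered or filtered.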