Libraries

library(tidyverse)
library(reshape2)
library(HDclassif)
library(psych)
library(factoextra)
library(caret)
library(scatterplot3d)
library(openxlsx)

Data

# read in data
df <- read.csv("/Users/krishshah/Downloads/GSE124814_HW_expr_matrix.csv")
# read in metadata
df_meta <- read.xlsx("/Users/krishshah/Downloads/GSE124814_sample_descriptions.xlsx")
dim(df)
## [1] 14883  1642
str(df_meta)
## 'data.frame':    1642 obs. of  20 variables:
##  $ SAMPLES: chr  "Sample name" "Sample_1" "Sample_2" "Sample_3" ...
##  $ X2     : chr  "title" "EMTAB292-A195-1-2" "EMTAB292-A195-10-2" "EMTAB292-A195-11-2" ...
##  $ X3     : chr  "CEL file" "Reanalysis of existing record" "Reanalysis of existing record" "Reanalysis of existing record" ...
##  $ X4     : chr  "source name" "Medulloblastoma" "Medulloblastoma" "Medulloblastoma" ...
##  $ X5     : chr  "organism" "Homo sapiens" "Homo sapiens" "Homo sapiens" ...
##  $ X6     : chr  "molecule" "total_RNA" "total_RNA" "total_RNA" ...
##  $ X7     : chr  "label" "biotin" "biotin" "biotin" ...
##  $ X8     : chr  "chip name" "HuEx-1-0-st-v1" "HuEx-1-0-st-v1" "HuEx-1-0-st-v1" ...
##  $ X9     : chr  "characteristics: age" "6.5 years" "5.7 years" "8.1 years" ...
##  $ X10    : chr  "characteristics: gender" "female" "male" "female" ...
##  $ X11    : chr  "characteristics: histology" "anaplastic" "classic" "anaplastic" ...
##  $ X12    : chr  "characteristics: brain region" "cerebellum" "cerebellum" "cerebellum" ...
##  $ X13    : chr  "characteristics: death" NA NA NA ...
##  $ X14    : chr  "characteristics: metastatic stage" NA NA NA ...
##  $ X15    : chr  "characteristics: follow up" NA NA NA ...
##  $ X16    : chr  "characteristics: subgroup supplied original" "Unknown" "Unknown" "Unknown" ...
##  $ X17    : chr  "characteristics: subgroup supplied renamed" "Unknown" "Unknown" "Unknown" ...
##  $ X18    : chr  "characteristics: subgroup relabeled" "G4" "G4" "G4" ...
##  $ X19    : chr  "description" "reanalysis of A195_1_2 (E-MTAB-292)" "reanalysis of A195_10_2 (E-MTAB-292)" "reanalysis of A195_11_2 (E-MTAB-292)" ...
##  $ X20    : chr  "description" "Sample_1" "Sample_2" "Sample_3" ...

Quick Data Wrangling

# set column names to the first row
colnames(df_meta) <- df_meta[1, ]
# remove first row
df_meta <- df_meta[-1, ]
# set row names of expression matrix to gene symbol
rownames(df) <- df[, 1]
# remove first column
df <- df[,-1]
# transpose df so observations are rows
# genes are columns
t_df <- t(df) %>% data.frame()

For this assignment, I am utilizing data from a meta-analysis (https://academic.oup.com/bioinformatics/article/35/18/3357/5305636?login=true) looking at microarray gene expression data of control and medulloblastoma patients. Medulloblastoma is a brain cancer that is typically found in pediatric patients and molecular classification through sequencing can reveal the specific type of molecular profile that the cancer exhibits. For this paper, the scientists collected microarray gene expression data from control and medulloblastoma patients found publicly and went on to normalize and correct the data for batch effects that may arise from different sequencing protocols. This was done in an effort to curate a large dataset of transcription data that can be useful for future research projects looking at expression in medulloblastoma vs controls. After cleaning up some of the columns and rows, the main dataframe (t_df) represents 1641 patients that each have 14883 genes and their associated normalized expression values. All of the columns are numeric with each row representing an observation. The affiliated meta data (df_meta) has information regarding each row/sample. There are 1641 patients with information regarding if they are controls or medulloblastoma patients, their age, their molecular classification, their gender, and other related information.

Research Question

Based on the existing high-dimensional dataset, I am curious to see whether two-three components (for visualization purposes) explain enough variability to cluster samples based on controls or medulloblastoma. In addition, do samples cluster by molecular profile/subtype when looking at a two-dimensional or three-dimensional plot?

Variables of Interest/Feature Selection

To speed up the PCA dimensionality reduction process, I intend to conduct some basic feature selection by only choosing the top 1500 variable genes based on deviation from the mean.

# calculate st dev for each column
gene_variance <- apply(t_df, 2, sd)
# sort names of genes from highest to lowest sd
sorted_t_df <- names(sort(gene_variance, decreasing = T))
# only choose top 1500
top_var_genes <- sorted_t_df[1:1500]
# merge with t_df to only keep top 1500 genes
top_t_df <- t_df[, top_var_genes]

Exploratory Data Analysis

To not make it extremely computationally intense, I am selecting a cutoff of 0.89 for variables that may exhibit multicollinearity.

# correlation matrix for 1500 genes
pca_cor <- cor(top_t_df, use ='pairwise.complete')
# find genes that have high correlations
high_cor <- findCorrelation(pca_cor, cutoff = 0.89)
# remove genes from top df
top_t_df_filtered <- top_t_df[, -high_cor]

Sample plot of correlations using heatmap using a sample of genes

# sample plot to examine heatmap of features
melted_pca_cor <- melt(pca_cor[1:25, 1:25])
# melted data plotted in a heatmap to look at features visually
ggplot(data = melted_pca_cor, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  labs(x = '', y = '', fill = 'Correlation Coefficient', title = 'Heatmap of Genes') +
  scale_fill_gradient2(low = 'red', mid = 'white', high = 'blue', midpoint = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# PCA Modeling

Screeplot

# run PCA and return eigenvalues and loadings with no rotation
pca <- principal(top_t_df_filtered, rotation = 'none')
# create df for scree plot with each component and the associated eigenvalues
scree <- data.frame(PC = 1:length(pca$values), Eigenvalue = pca$values)
# use ggplot to plot eigenvalue vs PC
ggplot(scree, aes(x = PC, y = Eigenvalue)) +
  geom_line() + # add line to connect points
  geom_point() + # add points
  geom_hline(yintercept=1, linetype = 'dashed') + # dashed line at 1
  scale_x_continuous(limits=c(0,100), breaks=seq(0,100,5)) + # many components so just selected first 100
  theme_minimal() +
  labs(x = 'Principal Component', y = "Eigenvalue", title = "Scree Plot")

Based on the dashed line at an eigenvalue of 1, about 50-51 components are needed, however, when looking at the elbow or bend in the screeplot, about 10-11 components would suffice as more components do not produce meaningful returns.

Parallel Analysis

# setting seed for reproducibility
set.seed(2354)
# parallel analysis using filtered df
pa <- fa.parallel(top_t_df_filtered, fa = 'pc')

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  49

Based on the parallel analysis, 49 components are needed. Let’s further visualize the results from the parallel analysis using ggplot.

# creating a pa df to visualize Eigenvalue x PC
# a column of PC and a column of actual vs resampled
# with the associated eigenvalue
pa_plot <- data.frame(
  PC = c(1:length(pa$pc.values), 1:length(pa$pc.simr)),
  type = c(rep('actual', length(pa$pc.values)), rep('resampled', length(pa$pc.simr))),
  Eigenvalue = c(pa$pc.values, pa$pc.simr)
)
# plotting eigenvalue vs PC as done before
ggplot(pa_plot,aes(x = PC, y = Eigenvalue, col = type)) +
  geom_line()+
  geom_point() +
  scale_x_continuous(limits=c(0,100), breaks=seq(0,100,5)) +
  theme_minimal() +
  labs(x = 'Principal Component', y = 'Eigenvalue', title = "Scree Plot with Parallel Analysis Values") +
  scale_color_manual(name = '', values = c('black', 'red'))
## Warning: Removed 2760 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 2760 rows containing missing values or values outside the scale range
## (`geom_point()`).

From this plot, we can see that about 50 components are necessary for the PCA. Let’s conduct a basic PCA and check the cumulative variance to see how many components are needed for over 50% explained variance.

Bartlett’s Test

# scale filtered df values
scale_train <- top_t_df_filtered %>% mutate_at(c(1:1480), ~(scale(.) %>% as.vector))
# run Bartlett's test to test if PCA can be ran and if
# variables are related
cortest.bartlett(scale_train, 1641)
## R was not square, finding R from data
## $chisq
## [1] Inf
## 
## $p.value
## [1] 0
## 
## $df
## [1] 1094460

The p.value is less than 0.05 so the variables are related and PCA can be run. ### Kaiser-Meyer-Olkin Test

# obtain Kaser values for all genes in scaled df
kmo_result <- KMO(scale_train)
# check to see if all MSAs are above 0.5 for all vars
all(kmo_result$MSA > 0.5)
## [1] TRUE

All variables display a Kaser value above the desired threshold of 0.5.

Basic PCA for Cum. Variance

# just run a basic PCA to check for cumulative variance
pca <- prcomp(top_t_df_filtered, center = TRUE, scale. = TRUE)
# calculate variance explained
variance_explained <- (pca$sdev^2) / sum(pca$sdev^2)
# calculate cumulative variance explained
cumulative_variance <- cumsum(variance_explained)
# display first 25 rows
head(cumulative_variance, n=25)
##  [1] 0.1960693 0.3131053 0.3849769 0.4391692 0.4703352 0.4921633 0.5107036
##  [8] 0.5268164 0.5408446 0.5536588 0.5655432 0.5745870 0.5832868 0.5912567
## [15] 0.5986347 0.6053979 0.6119971 0.6183049 0.6239923 0.6296196 0.6346974
## [22] 0.6394931 0.6441583 0.6485778 0.6527265

From the cumulative variance, we can see that about 7 components are necessary for an explained cumulative variance of >50%. Let’s check PCA rotation using about 50 components to see which is best.

Justifying rotation method

# PCA with no rotation
pca_no <- principal(top_t_df_filtered, nfactors = 50, rotate = 'none')
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined.
## Chi square is based upon observed residuals.
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.
#pca_no
# PCA with orthogonal rotation
pca_or <- principal(top_t_df_filtered, nfactors = 50, rotate = 'varimax')
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined.
## Chi square is based upon observed residuals.
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.
#pca_or
# PCA with oblique rotation
pca_ob <- principal(top_t_df_filtered, nfactors = 50, rotate = 'promax')
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined.
## Chi square is based upon observed residuals.
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.
#pca_ob
#(pca_no$loadings)
#pca_ob$loadings
#pca_or$loadings

Based on the cumulative variances of each rotation method and the scope of the research question, the no rotation method will be utilized further for visualizations and loading analyses. The no rotation method has the highest explained cumulative variance for the first 3 principal components.

Visualization

# obtain loadings / code from demo slides
loadings <- pca_no$loadings
loadings <- unclass(loadings)
loadings_df <- as.data.frame(loadings)
loadings_df$features <- rownames(loadings_df)
rownames(loadings_df) <- NULL
# obtain scores for each sample by predicting PCA model on data
pca.df <- as.data.frame(predict(pca_no, data=top_t_df_filtered))
# ensure PCA df dimensions line up with 1641 observations
dim(pca.df)
## [1] 1641   50
# add in the normal vs medulloblastoma label
pca.df$Class = df_meta$`source name`
# add molecular profile label from meta data
pca.df$Type = df_meta$`characteristics: subgroup relabeled`
# obtain percent variance explained by each PC
percentVar <- pca_no$values / sum(pca_no$values)
# ggplot looking at first two PCs, coloring by class
ggplot(pca.df, aes(x = PC1, y = PC2, color = Class)) +
  geom_point() + # plot point
  xlab(paste0("PC1: ",round(percentVar[1] * 100,digits = 2),"% variance")) + # also include variance for axes
  ylab(paste0("PC2: ",round(percentVar[2] * 100,digits = 2),"% variance")) +
  labs(title = "PCA - 2 Features by Class") +
  theme_classic() +
  scale_x_continuous(limits=c(-2,2.5), breaks=seq(-2,2.5,0.5)) +
  scale_y_continuous(limits=c(-4,2.5), breaks=seq(-4,2.5,0.5))

From this plot, we can see (even though the first two components explain about 30% of the variance) that the normal samples and the medulloblastoma samples cluster separately. This suggests that the PCA alongside the feature selection is relatively sufficient even though much of the variance is not explained. If a more effective variance capturing technique is utilized, the clustering may become even more distinct.

# factor the Class var
pca.df$Class <- as.factor(pca.df$Class)
# encode colors for the numeric variables
colors <- c("red", "blue")
pca.df$Class_numeric <- factor(pca.df$Class, levels = c("Normal", "Medulloblastoma"), labels = c(0, 1))
# give a color to each class
colors <- colors[(pca.df$Class_numeric)]
# Plot in 3D
s3d <- scatterplot3d(pca.df[,1:3],
          color = colors, # specify colors
          pch = 16,  # specify shape as circle
          xlab = paste0("PC1: ", round(percentVar[1] * 100, digits = 2), "% variance"), # variance labels
          ylab = paste0("PC2: ", round(percentVar[2] * 100, digits = 2), "% variance"),
          zlab = paste0("PC3: ", round(percentVar[3] * 100, digits = 2), "% variance"),
          main = "PCA - 3 Features by Class") # title 
legend("bottom", legend = levels(pca.df$Class), col = c("blue", "red"), pch = 16) # add legend

Similar to the 2D plot above, we can see that the normal cases cluster on their own while the medulloblastoma cases also cluster separately. I wonder though if the medulloblastoma cases have 2 subclasses within as there seem to be 2 clusters that differ according to the top 3 PCs.

# changing out NAs to Normal
pca.df$Type <- ifelse(is.na(pca.df$Type), "Normal", pca.df$Type)
# same as above but coloring for type
ggplot(pca.df, aes(x = PC1, y = PC2, color = Type)) +
  geom_point() +
  xlab(paste0("PC1: ",round(percentVar[1] * 100,digits = 2),"% variance")) +
  ylab(paste0("PC2: ",round(percentVar[2] * 100,digits = 2),"% variance")) +
  labs(title = "PCA - 2 Features by Type") +
  theme_classic() +
  scale_x_continuous(limits=c(-2,2.5), breaks=seq(-2,2.5,0.5)) +
  scale_y_continuous(limits=c(-4,2.5), breaks=seq(-4,2.5,0.5))

Based on this plot, the above question can be answered. Again, the normal cases generally seem to cluster on their own except for a few cases that may be due to low variability by the two PCs. The medulloblastoma cluster on the left side consists mainly of G4 and G3 cases with a few unknowns sprinkled in there which likely exhibit transcription/expression profiles similar to the G4 and G3 subtypes of medulloblastoma. On the right side, that cluster of medulloblastoma is primarily SHH and WNT subtypes with a few unknowns as well. With this in mind, it seems like normals and medulloblastoma cases cluster separately with subtypes wihtin the medulloblastoma class clustering on their own as well.

ggplot(pca.df, aes(x = PC1, y = PC2, color = Type, shape = Class)) +
  geom_point() +
  xlab(paste0("PC1: ",round(percentVar[1] * 100,digits = 2),"% variance")) +
  ylab(paste0("PC2: ",round(percentVar[2] * 100,digits = 2),"% variance")) +
  labs(title = "PCA - 2 Features by Type and Class") +
  theme_classic() +
  scale_x_continuous(limits=c(-2,2.5), breaks=seq(-2,2.5,0.5)) +
  scale_y_continuous(limits=c(-4,2.5), breaks=seq(-4,2.5,0.5))

By mapping by both type and class, the above hypotheses can be confirmed as accurate based on the data.

Discussion

To see if the clustering patterns are accurate and to ensure a proper discussion, I am including another dimensionality reduction technique known as UMAP, or Uniform Manifold Approximation and Projection.

# code adapted from prior research projects
library(umap)
# load library and set seed
set.seed(5678)
# run umap
umap_df = umap(top_t_df_filtered)
umap.df = as.data.frame(umap_df$layout)
# add labels for type and class
umap.df$Type = pca.df$Type
umap.df$Class = pca.df$Class
# plot the umap data by type and class
ggplot(umap.df, aes(x = V1,y=V2, color = Type, shape = Class))+ 
  geom_point(size=1.5) + 
  theme_classic() + 
  labs(title = "UMAP by Class and Type", x = "UMAP1", y = "UMAP2") +
  scale_x_continuous(limits=c(-9,14), breaks=seq(-9,14,1)) +
  scale_y_continuous(limits=c(-9,12), breaks=seq(-9,12,1))
## Warning: Removed 77 rows containing missing values or values outside the scale range
## (`geom_point()`).

From the PCA and the UMAP above, there are some clear conclusions that can be drawn. Firstly, normal and medulloblastoma samples cluster separately with a little bit of overlap that may be due to ineffective feature selection. For this example, the top 1500 variable genes by standard deviation were chosen to include in the PCA and UMAP rather than genes that may be valuable in differentiating between normal and medulloblastoma cases. In the future, finding the most differentially expressed genes across both normal and medulloblastoma samples would be ideal for feature selection.

Another key conclusion that can be drawn is that medulloblastoma subtypes (G3, G4, WNT, SHH) also cluster separately and exhibit similar microarray transcription profiles, however, the results are variable depending on the dimension reduction technique used. For example, the PCA generated created 2 distinct clusters while the UMAP above has distinct groupings for each subtype suggesting less spatial resolution for PCA in discriminating between subtypes like G3 and G4. This definitely makes a lot of sense because UMAP is equipped to handle complex relationships while the PCA generated only explained 30% of the variance. If the PCA explained more of the variance, perhaps the individual subtypes would also cluster independently.