library(tidyverse)
library(reshape2)
library(HDclassif)
library(psych)
library(factoextra)
library(caret)
library(scatterplot3d)
library(openxlsx)
# read in data
df <- read.csv("/Users/krishshah/Downloads/GSE124814_HW_expr_matrix.csv")
# read in metadata
df_meta <- read.xlsx("/Users/krishshah/Downloads/GSE124814_sample_descriptions.xlsx")
dim(df)
## [1] 14883 1642
str(df_meta)
## 'data.frame': 1642 obs. of 20 variables:
## $ SAMPLES: chr "Sample name" "Sample_1" "Sample_2" "Sample_3" ...
## $ X2 : chr "title" "EMTAB292-A195-1-2" "EMTAB292-A195-10-2" "EMTAB292-A195-11-2" ...
## $ X3 : chr "CEL file" "Reanalysis of existing record" "Reanalysis of existing record" "Reanalysis of existing record" ...
## $ X4 : chr "source name" "Medulloblastoma" "Medulloblastoma" "Medulloblastoma" ...
## $ X5 : chr "organism" "Homo sapiens" "Homo sapiens" "Homo sapiens" ...
## $ X6 : chr "molecule" "total_RNA" "total_RNA" "total_RNA" ...
## $ X7 : chr "label" "biotin" "biotin" "biotin" ...
## $ X8 : chr "chip name" "HuEx-1-0-st-v1" "HuEx-1-0-st-v1" "HuEx-1-0-st-v1" ...
## $ X9 : chr "characteristics: age" "6.5 years" "5.7 years" "8.1 years" ...
## $ X10 : chr "characteristics: gender" "female" "male" "female" ...
## $ X11 : chr "characteristics: histology" "anaplastic" "classic" "anaplastic" ...
## $ X12 : chr "characteristics: brain region" "cerebellum" "cerebellum" "cerebellum" ...
## $ X13 : chr "characteristics: death" NA NA NA ...
## $ X14 : chr "characteristics: metastatic stage" NA NA NA ...
## $ X15 : chr "characteristics: follow up" NA NA NA ...
## $ X16 : chr "characteristics: subgroup supplied original" "Unknown" "Unknown" "Unknown" ...
## $ X17 : chr "characteristics: subgroup supplied renamed" "Unknown" "Unknown" "Unknown" ...
## $ X18 : chr "characteristics: subgroup relabeled" "G4" "G4" "G4" ...
## $ X19 : chr "description" "reanalysis of A195_1_2 (E-MTAB-292)" "reanalysis of A195_10_2 (E-MTAB-292)" "reanalysis of A195_11_2 (E-MTAB-292)" ...
## $ X20 : chr "description" "Sample_1" "Sample_2" "Sample_3" ...
# set column names to the first row
colnames(df_meta) <- df_meta[1, ]
# remove first row
df_meta <- df_meta[-1, ]
# set row names of expression matrix to gene symbol
rownames(df) <- df[, 1]
# remove first column
df <- df[,-1]
# transpose df so observations are rows
# genes are columns
t_df <- t(df) %>% data.frame()
For this assignment, I am utilizing data from a meta-analysis (https://academic.oup.com/bioinformatics/article/35/18/3357/5305636?login=true) looking at microarray gene expression data of control and medulloblastoma patients. Medulloblastoma is a brain cancer that is typically found in pediatric patients and molecular classification through sequencing can reveal the specific type of molecular profile that the cancer exhibits. For this paper, the scientists collected microarray gene expression data from control and medulloblastoma patients found publicly and went on to normalize and correct the data for batch effects that may arise from different sequencing protocols. This was done in an effort to curate a large dataset of transcription data that can be useful for future research projects looking at expression in medulloblastoma vs controls. After cleaning up some of the columns and rows, the main dataframe (t_df) represents 1641 patients that each have 14883 genes and their associated normalized expression values. All of the columns are numeric with each row representing an observation. The affiliated meta data (df_meta) has information regarding each row/sample. There are 1641 patients with information regarding if they are controls or medulloblastoma patients, their age, their molecular classification, their gender, and other related information.
Based on the existing high-dimensional dataset, I am curious to see whether two-three components (for visualization purposes) explain enough variability to cluster samples based on controls or medulloblastoma. In addition, do samples cluster by molecular profile/subtype when looking at a two-dimensional or three-dimensional plot?
To speed up the PCA dimensionality reduction process, I intend to conduct some basic feature selection by only choosing the top 1500 variable genes based on deviation from the mean.
# calculate st dev for each column
gene_variance <- apply(t_df, 2, sd)
# sort names of genes from highest to lowest sd
sorted_t_df <- names(sort(gene_variance, decreasing = T))
# only choose top 1500
top_var_genes <- sorted_t_df[1:1500]
# merge with t_df to only keep top 1500 genes
top_t_df <- t_df[, top_var_genes]
To not make it extremely computationally intense, I am selecting a cutoff of 0.89 for variables that may exhibit multicollinearity.
# correlation matrix for 1500 genes
pca_cor <- cor(top_t_df, use ='pairwise.complete')
# find genes that have high correlations
high_cor <- findCorrelation(pca_cor, cutoff = 0.89)
# remove genes from top df
top_t_df_filtered <- top_t_df[, -high_cor]
Sample plot of correlations using heatmap using a sample of genes
# sample plot to examine heatmap of features
melted_pca_cor <- melt(pca_cor[1:25, 1:25])
# melted data plotted in a heatmap to look at features visually
ggplot(data = melted_pca_cor, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
labs(x = '', y = '', fill = 'Correlation Coefficient', title = 'Heatmap of Genes') +
scale_fill_gradient2(low = 'red', mid = 'white', high = 'blue', midpoint = 0) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# PCA Modeling
# run PCA and return eigenvalues and loadings with no rotation
pca <- principal(top_t_df_filtered, rotation = 'none')
# create df for scree plot with each component and the associated eigenvalues
scree <- data.frame(PC = 1:length(pca$values), Eigenvalue = pca$values)
# use ggplot to plot eigenvalue vs PC
ggplot(scree, aes(x = PC, y = Eigenvalue)) +
geom_line() + # add line to connect points
geom_point() + # add points
geom_hline(yintercept=1, linetype = 'dashed') + # dashed line at 1
scale_x_continuous(limits=c(0,100), breaks=seq(0,100,5)) + # many components so just selected first 100
theme_minimal() +
labs(x = 'Principal Component', y = "Eigenvalue", title = "Scree Plot")
Based on the dashed line at an eigenvalue of 1, about 50-51 components
are needed, however, when looking at the elbow or bend in the screeplot,
about 10-11 components would suffice as more components do not produce
meaningful returns.
# setting seed for reproducibility
set.seed(2354)
# parallel analysis using filtered df
pa <- fa.parallel(top_t_df_filtered, fa = 'pc')
## Parallel analysis suggests that the number of factors = NA and the number of components = 49
Based on the parallel analysis, 49 components are needed. Let’s further visualize the results from the parallel analysis using ggplot.
# creating a pa df to visualize Eigenvalue x PC
# a column of PC and a column of actual vs resampled
# with the associated eigenvalue
pa_plot <- data.frame(
PC = c(1:length(pa$pc.values), 1:length(pa$pc.simr)),
type = c(rep('actual', length(pa$pc.values)), rep('resampled', length(pa$pc.simr))),
Eigenvalue = c(pa$pc.values, pa$pc.simr)
)
# plotting eigenvalue vs PC as done before
ggplot(pa_plot,aes(x = PC, y = Eigenvalue, col = type)) +
geom_line()+
geom_point() +
scale_x_continuous(limits=c(0,100), breaks=seq(0,100,5)) +
theme_minimal() +
labs(x = 'Principal Component', y = 'Eigenvalue', title = "Scree Plot with Parallel Analysis Values") +
scale_color_manual(name = '', values = c('black', 'red'))
## Warning: Removed 2760 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 2760 rows containing missing values or values outside the scale range
## (`geom_point()`).
From this plot, we can see that about 50 components are necessary for
the PCA. Let’s conduct a basic PCA and check the cumulative variance to
see how many components are needed for over 50% explained variance.
# scale filtered df values
scale_train <- top_t_df_filtered %>% mutate_at(c(1:1480), ~(scale(.) %>% as.vector))
# run Bartlett's test to test if PCA can be ran and if
# variables are related
cortest.bartlett(scale_train, 1641)
## R was not square, finding R from data
## $chisq
## [1] Inf
##
## $p.value
## [1] 0
##
## $df
## [1] 1094460
The p.value is less than 0.05 so the variables are related and PCA can be run. ### Kaiser-Meyer-Olkin Test
# obtain Kaser values for all genes in scaled df
kmo_result <- KMO(scale_train)
# check to see if all MSAs are above 0.5 for all vars
all(kmo_result$MSA > 0.5)
## [1] TRUE
All variables display a Kaser value above the desired threshold of 0.5.
# just run a basic PCA to check for cumulative variance
pca <- prcomp(top_t_df_filtered, center = TRUE, scale. = TRUE)
# calculate variance explained
variance_explained <- (pca$sdev^2) / sum(pca$sdev^2)
# calculate cumulative variance explained
cumulative_variance <- cumsum(variance_explained)
# display first 25 rows
head(cumulative_variance, n=25)
## [1] 0.1960693 0.3131053 0.3849769 0.4391692 0.4703352 0.4921633 0.5107036
## [8] 0.5268164 0.5408446 0.5536588 0.5655432 0.5745870 0.5832868 0.5912567
## [15] 0.5986347 0.6053979 0.6119971 0.6183049 0.6239923 0.6296196 0.6346974
## [22] 0.6394931 0.6441583 0.6485778 0.6527265
From the cumulative variance, we can see that about 7 components are necessary for an explained cumulative variance of >50%. Let’s check PCA rotation using about 50 components to see which is best.
# PCA with no rotation
pca_no <- principal(top_t_df_filtered, nfactors = 50, rotate = 'none')
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined.
## Chi square is based upon observed residuals.
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.
#pca_no
# PCA with orthogonal rotation
pca_or <- principal(top_t_df_filtered, nfactors = 50, rotate = 'varimax')
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined.
## Chi square is based upon observed residuals.
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.
#pca_or
# PCA with oblique rotation
pca_ob <- principal(top_t_df_filtered, nfactors = 50, rotate = 'promax')
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined.
## Chi square is based upon observed residuals.
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.
#pca_ob
#(pca_no$loadings)
#pca_ob$loadings
#pca_or$loadings
Based on the cumulative variances of each rotation method and the scope of the research question, the no rotation method will be utilized further for visualizations and loading analyses. The no rotation method has the highest explained cumulative variance for the first 3 principal components.
# obtain loadings / code from demo slides
loadings <- pca_no$loadings
loadings <- unclass(loadings)
loadings_df <- as.data.frame(loadings)
loadings_df$features <- rownames(loadings_df)
rownames(loadings_df) <- NULL
# obtain scores for each sample by predicting PCA model on data
pca.df <- as.data.frame(predict(pca_no, data=top_t_df_filtered))
# ensure PCA df dimensions line up with 1641 observations
dim(pca.df)
## [1] 1641 50
# add in the normal vs medulloblastoma label
pca.df$Class = df_meta$`source name`
# add molecular profile label from meta data
pca.df$Type = df_meta$`characteristics: subgroup relabeled`
# obtain percent variance explained by each PC
percentVar <- pca_no$values / sum(pca_no$values)
# ggplot looking at first two PCs, coloring by class
ggplot(pca.df, aes(x = PC1, y = PC2, color = Class)) +
geom_point() + # plot point
xlab(paste0("PC1: ",round(percentVar[1] * 100,digits = 2),"% variance")) + # also include variance for axes
ylab(paste0("PC2: ",round(percentVar[2] * 100,digits = 2),"% variance")) +
labs(title = "PCA - 2 Features by Class") +
theme_classic() +
scale_x_continuous(limits=c(-2,2.5), breaks=seq(-2,2.5,0.5)) +
scale_y_continuous(limits=c(-4,2.5), breaks=seq(-4,2.5,0.5))
From this plot, we can see (even though the first two components explain
about 30% of the variance) that the normal samples and the
medulloblastoma samples cluster separately. This suggests that the PCA
alongside the feature selection is relatively sufficient even though
much of the variance is not explained. If a more effective variance
capturing technique is utilized, the clustering may become even more
distinct.
# factor the Class var
pca.df$Class <- as.factor(pca.df$Class)
# encode colors for the numeric variables
colors <- c("red", "blue")
pca.df$Class_numeric <- factor(pca.df$Class, levels = c("Normal", "Medulloblastoma"), labels = c(0, 1))
# give a color to each class
colors <- colors[(pca.df$Class_numeric)]
# Plot in 3D
s3d <- scatterplot3d(pca.df[,1:3],
color = colors, # specify colors
pch = 16, # specify shape as circle
xlab = paste0("PC1: ", round(percentVar[1] * 100, digits = 2), "% variance"), # variance labels
ylab = paste0("PC2: ", round(percentVar[2] * 100, digits = 2), "% variance"),
zlab = paste0("PC3: ", round(percentVar[3] * 100, digits = 2), "% variance"),
main = "PCA - 3 Features by Class") # title
legend("bottom", legend = levels(pca.df$Class), col = c("blue", "red"), pch = 16) # add legend
Similar to the 2D plot above, we can see that the normal cases cluster
on their own while the medulloblastoma cases also cluster separately. I
wonder though if the medulloblastoma cases have 2 subclasses within as
there seem to be 2 clusters that differ according to the top 3 PCs.
# changing out NAs to Normal
pca.df$Type <- ifelse(is.na(pca.df$Type), "Normal", pca.df$Type)
# same as above but coloring for type
ggplot(pca.df, aes(x = PC1, y = PC2, color = Type)) +
geom_point() +
xlab(paste0("PC1: ",round(percentVar[1] * 100,digits = 2),"% variance")) +
ylab(paste0("PC2: ",round(percentVar[2] * 100,digits = 2),"% variance")) +
labs(title = "PCA - 2 Features by Type") +
theme_classic() +
scale_x_continuous(limits=c(-2,2.5), breaks=seq(-2,2.5,0.5)) +
scale_y_continuous(limits=c(-4,2.5), breaks=seq(-4,2.5,0.5))
Based on this plot, the above question can be answered. Again, the
normal cases generally seem to cluster on their own except for a few
cases that may be due to low variability by the two PCs. The
medulloblastoma cluster on the left side consists mainly of G4 and G3
cases with a few unknowns sprinkled in there which likely exhibit
transcription/expression profiles similar to the G4 and G3 subtypes of
medulloblastoma. On the right side, that cluster of medulloblastoma is
primarily SHH and WNT subtypes with a few unknowns as well. With this in
mind, it seems like normals and medulloblastoma cases cluster separately
with subtypes wihtin the medulloblastoma class clustering on their own
as well.
ggplot(pca.df, aes(x = PC1, y = PC2, color = Type, shape = Class)) +
geom_point() +
xlab(paste0("PC1: ",round(percentVar[1] * 100,digits = 2),"% variance")) +
ylab(paste0("PC2: ",round(percentVar[2] * 100,digits = 2),"% variance")) +
labs(title = "PCA - 2 Features by Type and Class") +
theme_classic() +
scale_x_continuous(limits=c(-2,2.5), breaks=seq(-2,2.5,0.5)) +
scale_y_continuous(limits=c(-4,2.5), breaks=seq(-4,2.5,0.5))
By mapping by both type and class, the above hypotheses can be confirmed
as accurate based on the data.
To see if the clustering patterns are accurate and to ensure a proper discussion, I am including another dimensionality reduction technique known as UMAP, or Uniform Manifold Approximation and Projection.
# code adapted from prior research projects
library(umap)
# load library and set seed
set.seed(5678)
# run umap
umap_df = umap(top_t_df_filtered)
umap.df = as.data.frame(umap_df$layout)
# add labels for type and class
umap.df$Type = pca.df$Type
umap.df$Class = pca.df$Class
# plot the umap data by type and class
ggplot(umap.df, aes(x = V1,y=V2, color = Type, shape = Class))+
geom_point(size=1.5) +
theme_classic() +
labs(title = "UMAP by Class and Type", x = "UMAP1", y = "UMAP2") +
scale_x_continuous(limits=c(-9,14), breaks=seq(-9,14,1)) +
scale_y_continuous(limits=c(-9,12), breaks=seq(-9,12,1))
## Warning: Removed 77 rows containing missing values or values outside the scale range
## (`geom_point()`).
From the PCA and the UMAP above, there are some clear conclusions that
can be drawn. Firstly, normal and medulloblastoma samples cluster
separately with a little bit of overlap that may be due to ineffective
feature selection. For this example, the top 1500 variable genes by
standard deviation were chosen to include in the PCA and UMAP rather
than genes that may be valuable in differentiating between normal and
medulloblastoma cases. In the future, finding the most differentially
expressed genes across both normal and medulloblastoma samples would be
ideal for feature selection.
Another key conclusion that can be drawn is that medulloblastoma subtypes (G3, G4, WNT, SHH) also cluster separately and exhibit similar microarray transcription profiles, however, the results are variable depending on the dimension reduction technique used. For example, the PCA generated created 2 distinct clusters while the UMAP above has distinct groupings for each subtype suggesting less spatial resolution for PCA in discriminating between subtypes like G3 and G4. This definitely makes a lot of sense because UMAP is equipped to handle complex relationships while the PCA generated only explained 30% of the variance. If the PCA explained more of the variance, perhaps the individual subtypes would also cluster independently.