Marvin Lim, Barry Digby, Dr. Pilib O’Broin, Prof. Stephen Finn
This document provides results from a differential expression analysis performed on LNCaP cell lines (n=3) with varying degrees of enzalutamide resistance: Clone1 = highly resistant, Clone9 = moderately resistant and Control samples. Sequencing was performed by TrinSeq, at the Trinity Genome Sequencing Laboratory located in the Trinity Translational Medicine Institute, Trinity College Dublin using MiSeq technology.
Gene expression counts were normalised using DESeq2 median of ratios method and transformed using logarithm base 2 for visualisation.
library(knitr)
library(DESeq2)
library(RColorBrewer)
library(pheatmap)
library(gplots)
library(biomaRt)
library(PCAtools)
library(DT)
library(IHW)
library(apeglm)
library(EnhancedVolcano)
library(ComplexHeatmap)
library(circlize) #colorRamp2
library(dendextend) #clustering
library(tidyverse)
library(fgsea)
library(STRINGdb)
library(networkD3)
library(visNetwork)
library(igraph)
## Set dir
setwd("/home/barry/Desktop/GSE118959/scripts/")
mtx <- read.table("../data/RNA-Seq/mRNA_counts.txt", header=T, row.names = "gene_symbols")
samples <- c("Clone1_N1", "Clone1_N2", "Clone1_N3",
"Clone9_N1", "Clone9_N2", "Clone9_N3",
"Control_N1", "Control_N2", "Control_N3")
colnames(mtx) <- samples
## Create sample DF
sample_vec <- c()
replicates_vec <- c()
sample_df <- as.data.frame(matrix(ncol=2, nrow=0))
for(i in samples){
split <- strsplit(i, "\\_")
sample <- split[[1]][1]
rep <- split[[1]][2]
sample_df[i,1] <- sample
sample_df[i,2] <- rep
}
colnames(sample_df) <- c("Condition", "Replicate")
sample_df$Condition <- as.factor(sample_df$Condition)
sample_df$Replicate <- as.factor(sample_df$Replicate)
## Create DDS object
dds <- DESeqDataSetFromMatrix(mtx, colData = sample_df, design = ~ Replicate + Condition)
dds$Condition <- relevel(dds$Condition, ref = "Control")
keep <- rowSums(counts(dds)) >= 20
dds <- dds[keep,]
dds <- DESeq(dds)
## Get log2 counts
cts <- counts(dds, normalized=T)
log_counts <- log2(cts + 1)num_conditions <- nlevels(sample_df$Condition)
pal <- colorRampPalette(brewer.pal(num_conditions, "Set1"))(num_conditions)
cond_colors <- pal[as.integer(sample_df$Condition)]
heatmap.2(cor(log_counts), scale="column", dendrogram = "column", labRow="",
ColSideColors=cond_colors, trace='none', margins=c(7,8),
main='Sample correlations (log2-transformed)')Heatmap of sample to sample distances. The heatmap was built using the heatmap.2 package using log2 transformed gene expression data. Samples are clustered by replicate according to their degree of enzalutamide resistance. Heatmap shows there are no outliers in the dataset and Clone1 has the largest distance in latent space.
Scree plot shows principal component 1 (PC1) explains ~70% of the variation present in the dataset.
biplot(p, x="PC1", y="PC2",
colby = 'Condition', colkey = c('Control'='royalblue', 'Clone9'='forestgreen', 'Clone1'='red'),
hline = 0, vline = 0,
#legendPosition = 'right', legendLabSize = 12, legendIconSize = 8.0,
drawConnectors = TRUE,
title = 'PCA bi-plot',
subtitle = 'PC1 versus PC2')Principal component analysis was performed on the log2 transformed counts using the PCAtools package. PCA bi-plot demonstrates clustering of samples and confirms the quality of the experiment design: each replicate clusters according to their resistance to enzalutamide. Furthermore, it is revealed that the variation observed in PC1 in the scree plot is accounted for by Clone1, as it clusters further away from Clone9 and Control in PC1 (69.14%) Separation of Clone9 and Control samples is evident in PC2 (11.99% variation).
Due to the small variation present in Clone9 and Control replicates in the PCA bi-plot, the coefficient “replicates” will be accounted for when fitting the generalized linear model in DESeq2 to remove technical variation from the analysis.
Differential gene tests were conducted using DESeq2 package. DESeq2 fits negative binomial generalized linear models for each gene and uses the Wald test for significance testing, returning a table of differentially genes. Multiple testing correction was performed using the independent hypothesis weighting IHW package, with the significance threshold for genes set at P-value < 0.05. Genes were annotated using the BiomaRt package.
Initial results returned a subset of genes with extremely high/low Log2 Fold Changes in the order of +/- 20. This is due to the high variability within these gene counts. When the read counts are low or highly variable, the maximum likelihood estimates for the Log Fold Change has high variance, leading to large estimates not representative of true differences. Lowly expressed genes are filtered out using the rowSums function, where the threshold is set to 20. Inevitably, highly variable genes will exceed this filtering threshold and introduce outliers to the results. The authors of DESeq2 propose the use of a heavy-tailed Cauchy prior distribution for effect sizes. The proposed method, Approximate Posterior Estimation for GLM, apeglm, has lower bias than previously proposed shrinkage estimators, while still reducing variance for those genes with little information for statistical inference. The rowSums function to filter lowly expressed genes was used in conjunction with the apeglm function to shrink the effect size of genes with high variance, returning a robust set of results for the analysis. Results in the tables have been filtered according to an adjusted P-value of <0.05 and a log2 Fold Change >= 1 / <= -1, returning both biologically and statistically significant genes.
res1 <- results(dds, filterFun=ihw, alpha=0.05, c("Condition", "Clone1", "Control"))
LFC1 <- lfcShrink(dds = dds, res= res1, coef = 4, type = "apeglm")
LFC1_df <- as.data.frame(LFC1)
output_col <- c("Gene", "Chromosome", "Start", "Stop", "Strand", "Description", "Log2FC", "P-value", "Adj P-value")
up_key <- intersect(rownames(LFC1)[which(LFC1$log2FoldChange>=1)],
rownames(LFC1)[which(LFC1$padj<=0.05)])
up_df <- as.data.frame((LFC1)[which(rownames(LFC1) %in% up_key),])
up_df <- tibble::rownames_to_column(up_df, "external_gene_name")
info <- getBM(attributes=c("external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"strand",
"entrezgene_description"),
filters = c("external_gene_name"),
values = up_df$external_gene_name,
mart = mart,
useCache = F)
tmp <- merge(info, up_df, by="external_gene_name")
tmp <- tmp[order(-tmp$log2FoldChange),]
## col out index
index <- c(1, 2, 3, 4, 5, 6, 8, 10, 11)
tmp <- tmp[,index]
colnames(tmp) <- output_col
tmp$Strand <- gsub("-1", "-", tmp$Strand)
tmp$Strand <- gsub("1", "+", tmp$Strand)
# remove dups
tmp <- tmp[!duplicated(tmp$Gene),]
up_regulated_table1 <- tmp
DT::datatable(up_regulated_table1, rownames = FALSE)res1 <- results(dds, filterFun=ihw, alpha=0.05, c("Condition", "Clone1", "Control"))
LFC1 <- lfcShrink(dds = dds, res= res1, coef = 4, type = "apeglm")
output_col <- c("Gene", "Chromosome", "Start", "Stop", "Strand", "Description", "Log2FC", "P-value", "Adj P-value")
down_key <- intersect(rownames(LFC1)[which(LFC1$log2FoldChange<=-1)],
rownames(LFC1)[which(LFC1$padj<=0.05)])
down_df <- as.data.frame((LFC1)[which(rownames(LFC1) %in% down_key),])
down_df <- tibble::rownames_to_column(down_df, "external_gene_name")
info <- getBM(attributes=c("external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"strand",
"entrezgene_description"),
filters = c("external_gene_name"),
values = down_df$external_gene_name,
mart = mart,
useCache = F)
tmp <- merge(info, down_df, by="external_gene_name")
tmp <- tmp[order(tmp$log2FoldChange),]
## col out index
index <- c(1, 2, 3, 4, 5, 6, 8, 10, 11)
tmp <- tmp[,index]
colnames(tmp) <- output_col
tmp$Strand <- gsub("-1", "-", tmp$Strand)
tmp$Strand <- gsub("1", "+", tmp$Strand)
# remove dups
tmp <- tmp[!duplicated(tmp$Gene),]
down_regulated_table1 <- tmp
DT::datatable(down_regulated_table1, rownames=FALSE)EnhancedVolcano(LFC1_df,
lab = rownames(LFC1_df),
x = 'log2FoldChange',
y = 'pvalue',
selectLab = c(""),
FCcutoff = 1.0,
pCutoff = 0.05,
title = "Clone1 vs Control DESeq2 Results",
subtitle = "Differentially Expressed Genes",
caption = paste0('Total Genes = ', nrow(LFC1_df)),
legendPosition = "right",
legend = c("Not Sig", "+/- 1 Log2 FC", "P-val <0.05", "P-val & Log2 FC"),
legendVisible = F,
xlim = c(-10, 10),
ylim = c(0, 100),
transcriptPointSize = 1.5)c1heat <- c(up_regulated_table1$Gene, down_regulated_table1$Gene)
## subset log2 for viz
log_counts <- as.data.frame(log_counts)
# subset log counts
c1heat_df <- subset(log_counts, rownames(log_counts) %in% c1heat)
# remove clone9
c1heat_df <- c1heat_df[,-c(4,5,6)]
## flip it so it reads left to right
c1heat_df <- c1heat_df[,c(4,5,6,1,2,3)]
## remove no SD between cases
c1heat_df <- c1heat_df[apply(c1heat_df, MARGIN = 1, FUN = function(x) sd(x) != 0),]
## calculate zscore for viz
mat1 <- as.matrix(c1heat_df)
mat1 <- t(mat1)
mat1 <- scale(mat1, center=T)
mat1 <- t(mat1)
# Top annotation
df2 = data.frame(Cell_Line = c(rep("Control", 3), rep("Clone1", 3)))
ha = HeatmapAnnotation(df = df2, col = list(Cell_Line = c("Control" = "black", "Clone1" = "gold")), annotation_legend_param = list(title_gp = gpar(fontsize = 12, fontface = "bold"), labels_gp = gpar(fontsize = 12)))
hm1 <- Heatmap(mat1,
col= colorRamp2(c(-2,-1,0,1,2),c("blue","skyblue","white","lightcoral","red")),
heatmap_legend_param=list(at=c(-2,-1,0,1,2),color_bar="continuous",
legend_direction="vertical", legend_width=unit(15,"cm"),
title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
name = "Z-score",
#Row annotation configurations
cluster_rows=TRUE,
show_row_dend=TRUE,
row_title_side="left",
row_title_gp=gpar(fontsize=8),
show_row_names=FALSE,
row_names_side="right",
#Column annotation configuratiions
cluster_columns=TRUE,
show_column_dend=TRUE,
column_title="Differentially Expressed Genes\nClone1 vs Control",
column_title_side="top",
column_title_gp=gpar(fontsize=18, fontface="bold"),
show_column_names = FALSE,
column_names_gp = gpar(fontsize = 20, fontface="bold"),
#Dendrogram configurations: columns
clustering_distance_columns="euclidean",
clustering_method_columns="complete",
column_dend_height=unit(10,"mm"),
#Dendrogram configurations: rows
clustering_distance_rows="euclidean",
clustering_method_rows="complete",
row_dend_width=unit(4,"cm"),
row_dend_side = "left",
row_dend_reorder = TRUE,
#Splits
border=T,
row_km = 1,
column_km = 2,
#plot params
width = unit(5, "inch"),
height = unit(4, "inch"),
#height = unit(0.4, "cm")*nrow(mat),
#Annotations
top_annotation = ha)
draw(hm1, annotation_legend_side = "right", heatmap_legend_side="right")res9 <- results(dds, filterFun=ihw, alpha=0.05, c("Condition", "Clone9", "Control"))
LFC9 <- lfcShrink(dds = dds, res= res9, coef = 5, type = "apeglm")
LFC9_df <- as.data.frame(LFC9)
output_col <- c("Gene", "Chromosome", "Start", "Stop", "Strand", "Description", "Log2FC", "P-value", "Adj P-value")
up_key <- intersect(rownames(LFC9)[which(LFC9$log2FoldChange>=1)],
rownames(LFC9)[which(LFC9$padj<=0.05)])
up_df <- as.data.frame((LFC9)[which(rownames(LFC9) %in% up_key),])
up_df <- tibble::rownames_to_column(up_df, "external_gene_name")
info <- getBM(attributes=c("external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"strand",
"entrezgene_description"),
filters = c("external_gene_name"),
values = up_df$external_gene_name,
mart = mart,
useCache = F)
tmp <- merge(info, up_df, by="external_gene_name")
tmp <- tmp[order(-tmp$log2FoldChange),]
## col out index
index <- c(1, 2, 3, 4, 5, 6, 8, 10, 11)
tmp <- tmp[,index]
colnames(tmp) <- output_col
tmp$Strand <- gsub("-1", "-", tmp$Strand)
tmp$Strand <- gsub("1", "+", tmp$Strand)
# remove dups
tmp <- tmp[!duplicated(tmp$Gene),]
up_regulated_table9 <- tmp
DT::datatable(up_regulated_table9, rownames = FALSE)res9 <- results(dds, filterFun=ihw, alpha=0.05, c("Condition", "Clone9", "Control"))
LFC9 <- lfcShrink(dds = dds, res= res9, coef = 5, type = "apeglm")
output_col <- c("Gene", "Chromosome", "Start", "Stop", "Strand", "Description", "Log2FC", "P-value", "Adj P-value")
down_key <- intersect(rownames(LFC9)[which(LFC9$log2FoldChange<=-1)],
rownames(LFC9)[which(LFC9$padj<=0.05)])
down_df <- as.data.frame((LFC9)[which(rownames(LFC9) %in% down_key),])
down_df <- tibble::rownames_to_column(down_df, "external_gene_name")
info <- getBM(attributes=c("external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"strand",
"entrezgene_description"),
filters = c("external_gene_name"),
values = down_df$external_gene_name,
mart = mart,
useCache=F)
tmp <- merge(info, down_df, by="external_gene_name")
tmp <- tmp[order(tmp$log2FoldChange),]
## col out index
index <- c(1, 2, 3, 4, 5, 6, 8, 10, 11)
tmp <- tmp[,index]
colnames(tmp) <- output_col
tmp$Strand <- gsub("-1", "-", tmp$Strand)
tmp$Strand <- gsub("1", "+", tmp$Strand)
# remove dups
tmp <- tmp[!duplicated(tmp$Gene),]
down_regulated_table9 <- tmp
DT::datatable(down_regulated_table9, rownames=FALSE)EnhancedVolcano(LFC9_df,
lab = rownames(LFC9_df),
x = 'log2FoldChange',
y = 'pvalue',
selectLab = c(""),
FCcutoff = 1.0,
pCutoff = 0.05,
title = "Clone9 vs Control DESeq2 Results",
subtitle = "Differentially Expressed Genes",
caption = paste0('Total Genes = ', nrow(LFC9_df)),
legendPosition = "right",
legend = c("Not Sig", "+/- 1 Log2 FC", "P-val <0.05", "P-val & Log2 FC"),
legendVisible = F,
xlim = c(-8, 8),
ylim = c(0, 75),
transcriptPointSize = 1.5)c9heat <- c(up_regulated_table9$Gene, down_regulated_table9$Gene)
## subset log2 for viz
log_counts <- as.data.frame(log_counts)
# subset log counts
c9heat_df <- subset(log_counts, rownames(log_counts) %in% c9heat)
# remove clone1
c9heat_df <- c9heat_df[,-c(1,2,3)]
## flip it so it reads left to right
c9heat_df <- c9heat_df[,c(4,5,6,1,2,3)]
## remove no SD between cases
c9heat_df <- c9heat_df[apply(c9heat_df, MARGIN = 1, FUN = function(x) sd(x) != 0),]
## calculate zscore for viz
mat9 <- as.matrix(c9heat_df)
mat9 <- t(mat9)
mat9 <- scale(mat9, center=T)
mat9 <- t(mat9)
# Top annotation
df2 = data.frame(Cell_Line = c(rep("Control", 3), rep("Clone9", 3)))
ha = HeatmapAnnotation(df = df2, col = list(Cell_Line = c("Control" = "black", "Clone9" = "gold")), annotation_legend_param = list(title_gp = gpar(fontsize = 12, fontface = "bold"), labels_gp = gpar(fontsize = 12)))
hm9 <- Heatmap(mat9,
col= colorRamp2(c(-2,-1,0,1,2),c("blue","skyblue","white","lightcoral","red")),
heatmap_legend_param=list(at=c(-2,-1,0,1,2),color_bar="continuous",
legend_direction="vertical", legend_width=unit(15,"cm"),
title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
name = "Z-score",
#Row annotation configurations
cluster_rows=TRUE,
show_row_dend=TRUE,
row_title_side="left",
row_title_gp=gpar(fontsize=8),
show_row_names=FALSE,
row_names_side="right",
#Column annotation configuratiions
cluster_columns=TRUE,
show_column_dend=TRUE,
column_title="Differentially Expressed Genes\nClone9 vs Control",
column_title_side="top",
column_title_gp=gpar(fontsize=18, fontface="bold"),
show_column_names = FALSE,
column_names_gp = gpar(fontsize = 20, fontface="bold"),
#Dendrogram configurations: columns
clustering_distance_columns="euclidean",
clustering_method_columns="complete",
column_dend_height=unit(10,"mm"),
#Dendrogram configurations: rows
clustering_distance_rows="euclidean",
clustering_method_rows="complete",
row_dend_width=unit(4,"cm"),
row_dend_side = "left",
row_dend_reorder = TRUE,
#Splits
border=T,
row_km = 1,
column_km = 2,
#plot params
width = unit(5, "inch"),
height = unit(4, "inch"),
#height = unit(0.4, "cm")*nrow(mat),
#Annotations
top_annotation = ha)
draw(hm9, annotation_legend_side = "right", heatmap_legend_side="right")dds19 <- DESeqDataSetFromMatrix(mtx, colData = sample_df, design = ~ Replicate + Condition)
dds19$Condition <- relevel(dds19$Condition, ref = "Clone9")
keep <- rowSums(counts(dds19)) >= 20
dds19 <- dds19[keep,]
dds19 <- DESeq(dds19)
res19 <- results(dds19, filterFun=ihw, alpha=0.05, c("Condition", "Clone1", "Clone9"))
LFC19 <- lfcShrink(dds = dds19, res= res19, coef = 4, type = "apeglm")
LFC19_df <- as.data.frame(LFC19)
output_col <- c("Gene", "Chromosome", "Start", "Stop", "Strand", "Description", "Log2FC", "P-value", "Adj P-value")
up_key <- intersect(rownames(LFC19)[which(LFC19$log2FoldChange>=1)],
rownames(LFC19)[which(LFC19$padj<=0.05)])
up_df <- as.data.frame((LFC19)[which(rownames(LFC19) %in% up_key),])
up_df <- tibble::rownames_to_column(up_df, "external_gene_name")
info <- getBM(attributes=c("external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"strand",
"entrezgene_description"),
filters = c("external_gene_name"),
values = up_df$external_gene_name,
mart = mart,
useCache = F)
tmp <- merge(info, up_df, by="external_gene_name")
tmp <- tmp[order(-tmp$log2FoldChange),]
## col out index
index <- c(1, 2, 3, 4, 5, 6, 8, 10, 11)
tmp <- tmp[,index]
colnames(tmp) <- output_col
tmp$Strand <- gsub("-1", "-", tmp$Strand)
tmp$Strand <- gsub("1", "+", tmp$Strand)
# remove dups
tmp <- tmp[!duplicated(tmp$Gene),]
up_regulated_table19 <- tmp
DT::datatable(up_regulated_table19, rownames = FALSE)res19 <- results(dds19, filterFun=ihw, alpha=0.05, c("Condition", "Clone1", "Clone9"))
LFC19 <- lfcShrink(dds = dds19, res= res19, coef = 4, type = "apeglm")
output_col <- c("Gene", "Chromosome", "Start", "Stop", "Strand", "Description", "Log2FC", "P-value", "Adj P-value")
down_key <- intersect(rownames(LFC19)[which(LFC19$log2FoldChange<=-1)],
rownames(LFC19)[which(LFC19$padj<=0.05)])
down_df <- as.data.frame((LFC19)[which(rownames(LFC19) %in% down_key),])
down_df <- tibble::rownames_to_column(down_df, "external_gene_name")
info <- getBM(attributes=c("external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"strand",
"entrezgene_description"),
filters = c("external_gene_name"),
values = down_df$external_gene_name,
mart = mart,
useCache = F)
tmp <- merge(info, down_df, by="external_gene_name")
tmp <- tmp[order(tmp$log2FoldChange),]
## col out index
index <- c(1, 2, 3, 4, 5, 6, 8, 10, 11)
tmp <- tmp[,index]
colnames(tmp) <- output_col
tmp$Strand <- gsub("-1", "-", tmp$Strand)
tmp$Strand <- gsub("1", "+", tmp$Strand)
# remove dups
tmp <- tmp[!duplicated(tmp$Gene),]
down_regulated_table19 <- tmp
DT::datatable(down_regulated_table19, rownames=FALSE)EnhancedVolcano(LFC19_df,
lab = rownames(LFC19_df),
x = 'log2FoldChange',
y = 'pvalue',
selectLab = c(""),
FCcutoff = 1.0,
pCutoff = 0.05,
title = "Clone1 vs Clone9 DESeq2 Results",
subtitle = "Differentially Expressed Genes",
caption = paste0('Total Genes = ', nrow(LFC19_df)),
legendPosition = "right",
legend = c("Not Sig", "+/- 1 Log2 FC", "P-val <0.05", "P-val & Log2 FC"),
legendVisible = F,
xlim = c(-10, 10),
ylim = c(0, 100),
transcriptPointSize = 1.5)c19heat <- c(up_regulated_table19$Gene, down_regulated_table19$Gene)
## subset log2 for viz
log_counts <- as.data.frame(log_counts)
# subset log counts
c19heat_df <- subset(log_counts, rownames(log_counts) %in% c19heat)
# remove control
c19heat_df <- c19heat_df[,-c(7,8,9)]
## flip it so it reads left to right
c19heat_df <- c19heat_df[,c(4,5,6,1,2,3)]
## remove no SD between cases
c19heat_df <- c19heat_df[apply(c19heat_df, MARGIN = 1, FUN = function(x) sd(x) != 0),]
## calculate zscore for viz
mat19 <- as.matrix(c19heat_df)
mat19 <- t(mat19)
mat19 <- scale(mat19, center=T)
mat19 <- t(mat19)
# Top annotation
df2 = data.frame(Cell_Line = c(rep("Clone9", 3), rep("Clone1", 3)))
ha = HeatmapAnnotation(df = df2, col = list(Cell_Line = c("Clone9" = "black", "Clone1" = "gold")), annotation_legend_param = list(title_gp = gpar(fontsize = 12, fontface = "bold"), labels_gp = gpar(fontsize = 12)))
hm19 <- Heatmap(mat19,
col= colorRamp2(c(-2,-1,0,1,2),c("blue","skyblue","white","lightcoral","red")),
heatmap_legend_param=list(at=c(-2,-1,0,1,2),color_bar="continuous",
legend_direction="vertical", legend_width=unit(15,"cm"),
title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
name = "Z-score",
#Row annotation configurations
cluster_rows=TRUE,
show_row_dend=TRUE,
row_title_side="left",
row_title_gp=gpar(fontsize=8),
show_row_names=FALSE,
row_names_side="right",
#Column annotation configuratiions
cluster_columns=TRUE,
show_column_dend=TRUE,
column_title="Differentially Expressed Genes\nClone1 vs Clone9",
column_title_side="top",
column_title_gp=gpar(fontsize=18, fontface="bold"),
show_column_names = FALSE,
column_names_gp = gpar(fontsize = 20, fontface="bold"),
#Dendrogram configurations: columns
clustering_distance_columns="euclidean",
clustering_method_columns="complete",
column_dend_height=unit(10,"mm"),
#Dendrogram configurations: rows
clustering_distance_rows="euclidean",
clustering_method_rows="complete",
row_dend_width=unit(4,"cm"),
row_dend_side = "left",
row_dend_reorder = TRUE,
#Splits
border=T,
row_km = 1,
column_km = 2,
#plot params
width = unit(5, "inch"),
height = unit(4, "inch"),
#height = unit(0.4, "cm")*nrow(mat),
#Annotations
top_annotation = ha)
draw(hm19, annotation_legend_side = "right", heatmap_legend_side="right")Gene Set Enrichment Analysis (GSEA) was performed using GSEA software, a joint project of UC San Diego and Broad Institute. The dataset was normalised using DESeq2 median of ratios method and parsed to .gct format using custom bash scripts as input to GSEA. Parameters that deviate from the default setings are described below:
Permutation Type: Gene_Set was chosen for this parameter as the number of samples per phenotype was below 7, thus phenotype shuffling is not feasible for this option.
Metric For Ranking Genes: T-test uses the difference of means scaled by the standard deviation and number of samples, as opposed to the default Signal-to-noise metric which uses the difference of means scaled by the sandard deviation, without taking into account the number of samples.
Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying overlaps between gene sets in other MSigDB collections and retaining genes that display coordinated expression.
hallmarks1 <- rbind(H_clone1_vs_control_clone1, H_clone1_vs_control_control)
hallmarks1$p.value <- hallmarks1$NOM.p.val
hallmarks1 <- hallmarks1[,c(1,2,4)]
pathway_vec <- c()
for(i in hallmarks1$NAME){
list <- strsplit(i, "_")
new <- paste(list[[1]][-1], collapse="_")
pathway_vec <- c(pathway_vec, new)
}
hallmarks1$NAME <- pathway_vec
hallmarks1$Enrichment <- ifelse(hallmarks1$NES < 0, "Control", "Clone1")
ggplot(hallmarks1, aes(reorder(NAME, NES), NES)) +
geom_col(stat="identity",position="identity",aes(fill = Enrichment)) +
scale_fill_manual(values=c(Clone1="skyblue",Control="indianred1")) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title="Hallmark pathways NES from GSEA",
subtitle = "Clone1 vs Control") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
theme_minimal()## Warning: Ignoring unknown parameters: stat
hallmarks9 <- rbind(H_clone9_vs_control_clone9, H_clone9_vs_control_control)
hallmarks9$p.value <- hallmarks9$NOM.p.val
hallmarks9 <- hallmarks9[,c(1,2,4)]
pathway_vec <- c()
for(i in hallmarks9$NAME){
list <- strsplit(i, "_")
new <- paste(list[[1]][-1], collapse="_")
pathway_vec <- c(pathway_vec, new)
}
hallmarks9$NAME <- pathway_vec
hallmarks9$Enrichment <- ifelse(hallmarks9$NES < 0, "Control", "Clone9")
ggplot(hallmarks9, aes(reorder(NAME, NES), NES)) +
geom_col(stat="identity",position="identity",aes(fill = Enrichment)) +
scale_fill_manual(values=c(Clone9="skyblue",Control="indianred1")) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title="Hallmark pathways NES from GSEA",
subtitle = "Clone9 vs Control") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
theme_minimal()## Warning: Ignoring unknown parameters: stat
hallmarks19 <- rbind(H_clone1_vs_clone9_clone1, H_clone1_vs_clone9_clone9)
hallmarks19$p.value <- hallmarks19$NOM.p.val
hallmarks19 <- hallmarks19[,c(1,2,4)]
pathway_vec <- c()
for(i in hallmarks19$NAME){
list <- strsplit(i, "_")
new <- paste(list[[1]][-1], collapse="_")
pathway_vec <- c(pathway_vec, new)
}
hallmarks19$NAME <- pathway_vec
hallmarks19$Enrichment <- ifelse(hallmarks19$NES < 0, "Clone9", "Clone1")
ggplot(hallmarks19, aes(reorder(NAME, NES), NES)) +
geom_col(stat="identity",position="identity",aes(fill = Enrichment)) +
scale_fill_manual(values=c(Clone1="skyblue",Clone9="indianred1")) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title="Hallmark pathways NES from GSEA",
subtitle = "Clone1 vs Clone9") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
theme_minimal()## Warning: Ignoring unknown parameters: stat
The Kyoto Encyclopedia of Genes and Genomes (KEGG) database is a resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
kegg1 <- rbind(K_clone1_vs_control_clone1, K_clone1_vs_control_control)
kegg1$p.value <- kegg1$NOM.p.val
kegg1 <- kegg1[,c(1,2,4)]
pathway_vec <- c()
for(i in kegg1$NAME){
list <- strsplit(i, "_")
new <- paste(list[[1]][-1], collapse="_")
pathway_vec <- c(pathway_vec, new)
}
kegg1$NAME <- pathway_vec
kegg1$Enrichment <- ifelse(kegg1$NES < 0, "Control", "Clone1")
ggplot(kegg1, aes(reorder(NAME, NES), NES)) +
geom_col(stat="identity",position="identity",aes(fill = Enrichment)) +
scale_fill_manual(values=c(Clone1="skyblue",Control="indianred1")) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title="KEGG pathways NES from GSEA",
subtitle = "Clone1 vs Control") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
theme_minimal()## Warning: Ignoring unknown parameters: stat
kegg9 <- rbind(K_clone9_vs_control_clone9, K_clone9_vs_control_control)
kegg9$p.value <- kegg9$NOM.p.val
kegg9 <- kegg9[,c(1,2,4)]
pathway_vec <- c()
for(i in kegg9$NAME){
list <- strsplit(i, "_")
new <- paste(list[[1]][-1], collapse="_")
pathway_vec <- c(pathway_vec, new)
}
kegg9$NAME <- pathway_vec
kegg9$Enrichment <- ifelse(kegg9$NES < 0, "Control", "Clone9")
ggplot(kegg9, aes(reorder(NAME, NES), NES)) +
geom_col(stat="identity",position="identity",aes(fill = Enrichment)) +
scale_fill_manual(values=c(Clone9="skyblue",Control="indianred1")) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title="KEGG pathways NES from GSEA",
subtitle = "Clone9 vs Control") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
theme_minimal()## Warning: Ignoring unknown parameters: stat
kegg19 <- rbind(K_clone1_vs_clone9_clone1, K_clone1_vs_clone9_clone9)
kegg19$p.value <- kegg19$NOM.p.val
kegg19 <- kegg19[,c(1,2,4)]
pathway_vec <- c()
for(i in kegg19$NAME){
list <- strsplit(i, "_")
new <- paste(list[[1]][-1], collapse="_")
pathway_vec <- c(pathway_vec, new)
}
kegg19$NAME <- pathway_vec
kegg19$Enrichment <- ifelse(kegg19$NES < 0, "Clone9", "Clone1")
g <- ggplot(kegg19, aes(reorder(NAME, NES), NES)) +
geom_col(stat="identity",position="identity",aes(fill = Enrichment)) +
scale_fill_manual(values=c(Clone1="skyblue",Clone9="indianred1")) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title="KEGG pathways NES from GSEA",
subtitle = "Clone1 vs Clone9") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
theme_minimal()## Warning: Ignoring unknown parameters: stat
The larger processes, or ‘biological programs’ accomplished by multiple molecular activities. Examples of broad biological process terms are DNA repair or signal transduction. Note that a biological process is not equivalent to a pathway. At present, the GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.
Marvin, due to the large pathway names, enriched pathway plots have been omitted from this section. If you wish to create enriched pathway plots similar to the previous sections contact me with the pathways you would like displayed and I will create them for you.
Molecular-level activities performed by gene products. Molecular function terms describe activities that occur at the molecular level, such as “catalysis” or “transport”. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when, or in what context the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products (i.e. a protein or RNA), but some activities are performed by molecular complexes composed of multiple gene products.
Marvin, due to the large pathway names, enriched pathway plots have been omitted from this section. If you wish to create enriched pathway plots similar to the previous sections contact me with the pathways you would like displayed and I will create them for you.
Per request, the epithelial to mesenchymal pathway has been investigated. All 1184 EMT genes provided by dbEMT (http://www.dbemt.bioinfo-minzhao.org/download.cgi) were downloaded. Next, differentially expressed genes in each cell line comparison that are present in the dbEMT list are selected. This returns a statistically significant set of genes involved in the EMT pathway (adjusted p-value < 0.05). Heatmaps and gene tables are provided for each cell line comparison.
emt <- read.table("~/Desktop/GSE118959/data/dbEMT/EMT_symbols.txt", header=T)
## subset log2 for viz
log_counts <- as.data.frame(log_counts)
## what DE genes are involved (capture both up and down)
emt_merged1 <- rbind(up_regulated_table1, down_regulated_table1)
de_key <- subset(emt, emt$Gene %in% emt_merged1$Gene)
# subset log counts
emt_df1 <- subset(log_counts, rownames(log_counts) %in% de_key$Gene)
# remove clone9
emt_df1 <- emt_df1[,-c(4,5,6)]
## flip it so it reads left to right
emt_df1 <- emt_df1[,c(4,5,6,1,2,3)]
## remove no SD between cases sanity check
emt_df1 <- emt_df1[apply(emt_df1, MARGIN = 1, FUN = function(x) sd(x) != 0),]
emt_mat1 <- as.matrix(emt_df1)
emt_mat1 <- t(emt_mat1)
emt_mat1 <- scale(emt_mat1, center=T)
emt_mat1 <- t(emt_mat1)
# color scheme
#f1 = colorRamp2(c(-2,-1,0,1,2),c("blue","skyblue","white","lightcoral","red"))
# Top annotation
df2 = data.frame(Cell_Line = c(rep("Control", 3), rep("Clone1", 3)))
ha = HeatmapAnnotation(df = df2, col = list(Cell_Line = c("Control" = "black", "Clone1" = "gold")), annotation_legend_param = list(title_gp = gpar(fontsize = 12, fontface = "bold"), labels_gp = gpar(fontsize = 12)))
emt_hm1 <- Heatmap(emt_mat1,
col= colorRamp2(c(-2,-1,0,1,2),c("blue","skyblue","white","lightcoral","red")),
heatmap_legend_param=list(at=c(-2,-1,0,1,2),color_bar="continuous",
legend_direction="vertical", legend_width=unit(15,"cm"),
title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
name = "Z-score",
#Row annotation configurations
cluster_rows=TRUE,
show_row_dend=TRUE,
row_title_side="left",
row_title_gp=gpar(fontsize=8),
show_row_names=FALSE,
row_names_side="right",
#Column annotation configuratiions
cluster_columns=TRUE,
show_column_dend=TRUE,
column_title="Epithelial to Mesenchymal Transition\nClone1 vs Control",
column_title_side="top",
column_title_gp=gpar(fontsize=18, fontface="bold"),
show_column_names = FALSE,
column_names_gp = gpar(fontsize = 20, fontface="bold"),
#Dendrogram configurations: columns
clustering_distance_columns="euclidean",
clustering_method_columns="complete",
column_dend_height=unit(10,"mm"),
#Dendrogram configurations: rows
clustering_distance_rows="euclidean",
clustering_method_rows="complete",
row_dend_width=unit(4,"cm"),
row_dend_side = "left",
row_dend_reorder = TRUE,
#Splits
border=T,
row_km = 1,
column_km = 2,
#plot params
width = unit(5, "inch"),
height = unit(4, "inch"),
#height = unit(0.4, "cm")*nrow(mat),
#Annotations
top_annotation = ha)
draw(emt_hm1, annotation_legend_side = "right", heatmap_legend_side="right")emt <- read.table("~/Desktop/GSE118959/data/dbEMT/EMT_symbols.txt", header=T)
## subset log2 for viz
log_counts <- as.data.frame(log_counts)
## what DE genes are involved
emt_merged9 <- rbind(up_regulated_table9, down_regulated_table9)
de_key <- subset(emt, emt$Gene %in% emt_merged9$Gene)
# subset log counts
emt_df9 <- subset(log_counts, rownames(log_counts) %in% de_key$Gene)
# remove clone1
emt_df9 <- emt_df9[,-c(1,2,3)]
## flip it so it reads left to right
emt_df9 <- emt_df9[,c(4,5,6,1,2,3)]
## remove no SD between cases
emt_df9 <- emt_df9[apply(emt_df9, MARGIN = 1, FUN = function(x) sd(x) != 0),]
## calculate zscore for viz
emt_mat9 <- as.matrix(emt_df9)
emt_mat9 <- t(emt_mat9)
emt_mat9 <- scale(emt_mat9, center=T)
emt_mat9 <- t(emt_mat9)
# color scheme
#f1 = colorRamp2(c(-2,-1,0,1,2),c("blue","skyblue","white","lightcoral","red"))
# Top annotation
df2 = data.frame(Cell_Line = c(rep("Control", 3), rep("Clone9", 3)))
ha = HeatmapAnnotation(df = df2, col = list(Cell_Line = c("Control" = "black", "Clone9" = "gold")), annotation_legend_param = list(title_gp = gpar(fontsize = 12, fontface = "bold"), labels_gp = gpar(fontsize = 12)))
emt_hm9 <- Heatmap(emt_mat9,
col= colorRamp2(c(-2,-1,0,1,2),c("blue","skyblue","white","lightcoral","red")),
heatmap_legend_param=list(at=c(-2,-1,0,1,2),color_bar="continuous",
legend_direction="vertical", legend_width=unit(15,"cm"),
title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
name = "Z-score",
#Row annotation configurations
cluster_rows=TRUE,
show_row_dend=TRUE,
row_title_side="left",
row_title_gp=gpar(fontsize=8),
show_row_names=FALSE,
row_names_side="right",
#Column annotation configuratiions
cluster_columns=TRUE,
show_column_dend=TRUE,
column_title="Epithelial to Mesenchymal Transition\nClone9 vs Control",
column_title_side="top",
column_title_gp=gpar(fontsize=18, fontface="bold"),
show_column_names = FALSE,
column_names_gp = gpar(fontsize = 20, fontface="bold"),
#Dendrogram configurations: columns
clustering_distance_columns="euclidean",
clustering_method_columns="complete",
column_dend_height=unit(10,"mm"),
#Dendrogram configurations: rows
clustering_distance_rows="euclidean",
clustering_method_rows="complete",
row_dend_width=unit(2,"cm"),
row_dend_side = "left",
row_dend_reorder = TRUE,
#Splits
border=T,
row_km = 1,
column_km = 2,
#plot params
width = unit(5, "inch"),
height = unit(4, "inch"),
#height = unit(0.4, "cm")*nrow(emt_df9),
#Annotations
top_annotation = ha)
draw(emt_hm9, annotation_legend_side = "right", heatmap_legend_side="right")emt <- read.table("~/Desktop/GSE118959/data/dbEMT/EMT_symbols.txt", header=T)
## subset log2 for viz
log_counts <- as.data.frame(log_counts)
## what DE genes are involved
emt_merged19 <- rbind(up_regulated_table19, down_regulated_table19)
de_key <- subset(emt, emt$Gene %in% emt_merged19$Gene)
# subset log counts
emt_df19 <- subset(log_counts, rownames(log_counts) %in% de_key$Gene)
# remove control
emt_df19 <- emt_df19[,-c(7,8,9)]
## flip it so it reads left to right
emt_df19 <- emt_df19[,c(4,5,6,1,2,3)]
## remove no SD between cases
emt_df19 <- emt_df19[apply(emt_df19, MARGIN = 1, FUN = function(x) sd(x) != 0),]
## calculate zscore for viz
emt_mat19 <- as.matrix(emt_df19)
emt_mat19 <- t(emt_mat19)
emt_mat19 <- scale(emt_mat19, center=T)
emt_mat19 <- t(emt_mat19)
# color scheme
#f1 = colorRamp2(c(-2,-1,0,1,2),c("blue","skyblue","white","lightcoral","red"))
# Top annotation
df2 = data.frame(Cell_Line = c(rep("Clone9", 3), rep("Clone1", 3)))
ha = HeatmapAnnotation(df = df2, col = list(Cell_Line = c("Clone9" = "black", "Clone1" = "gold")), annotation_legend_param = list(title_gp = gpar(fontsize = 12, fontface = "bold"), labels_gp = gpar(fontsize = 12)))
emt_hm19 <- Heatmap(emt_mat19,
col= colorRamp2(c(-2,-1,0,1,2),c("blue","skyblue","white","lightcoral","red")),
heatmap_legend_param=list(at=c(-2,-1,0,1,2),color_bar="continuous",
legend_direction="vertical", legend_width=unit(15,"cm"),
title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
name = "Z-score",
#Row annotation configurations
cluster_rows=TRUE,
show_row_dend=TRUE,
row_title_side="left",
row_title_gp=gpar(fontsize=8),
show_row_names=FALSE,
row_names_side="right",
#Column annotation configuratiions
cluster_columns=TRUE,
show_column_dend=TRUE,
column_title="Epithelial to Mesenchymal Transition\nClone1 vs Clone9",
column_title_side="top",
column_title_gp=gpar(fontsize=18, fontface="bold"),
show_column_names = FALSE,
column_names_gp = gpar(fontsize = 20, fontface="bold"),
#Dendrogram configurations: columns
clustering_distance_columns="euclidean",
clustering_method_columns="complete",
column_dend_height=unit(10,"mm"),
#Dendrogram configurations: rows
clustering_distance_rows="euclidean",
clustering_method_rows="complete",
row_dend_width=unit(4,"cm"),
row_dend_side = "left",
row_dend_reorder = TRUE,
#Splits
border=T,
row_km = 1,
column_km = 2,
#plot params
width = unit(5, "inch"),
height = unit(4, "inch"),
#height = unit(0.4, "cm")*nrow(mat),
#Annotations
top_annotation = ha)
draw(emt_hm19, annotation_legend_side = "right", heatmap_legend_side="right")emt_down19 <- subset(down_regulated_table19, down_regulated_table19$Gene %in% rownames(emt_df19))
DT::datatable(emt_down19, rownames = F)The most recent update of this html document occurred: Thu Mar 25 11:48:00 2021.
For any queries about the analysis, contact Barry Digby at (b.digby237@gmail.com)