library(dplyr)
library(biomaRt)
library(tximport)
library(rhdf5)
library(gplots)
library(org.Hs.eg.db)
library(DESeq2)
library(DT)
library(apeglm)
library(RColorBrewer)
library(IHW)
library(PCAtools)
library(pheatmap)
library(clusterProfiler)
library(EnhancedVolcano)
library(ComplexHeatmap)
library(circlize)
library(fgsea)
library(tidyverse)
library(ggpubr)This R markdown document walks through an RNA-Seq differential expression analysis using DESeq2 for MA5112. Quality control plots, differential expression results, plots and pathway analysis are covered in the document.
The tutorial assumes you have 9 Kallisto quantification directories under ~/RNA-Seq/quant/ on your laptop.
Read in the file samples.csv which contains the experimental metadata. Please make sure that the rownames of this dataframe match the kallisto quantification directory names at ~/RNA-Seq/quant/.
samples <- read.csv("~/RNA-Seq/samples.csv", header=T, row.names = "sample", stringsAsFactors = T)
samples## condition replicate
## ctrl_1 control 1
## ctrl_2 control 2
## ctrl_3 control 3
## A549_1 lung 1
## A549_2 lung 2
## A549_3 lung 3
## A375_1 melanoma 1
## A375_2 melanoma 2
## A375_3 melanoma 3
When performing differential expression analysis using DESeq2, if a column in samples.csv is numeric and provided to the design formula, DESeq2 will produce the following message:
the design formula contains a numeric variable with integer values, specifying a model with increasing fold change for higher values. did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function
What does this mean? From Mike Love (DESeq2 author):
There is a constant fold change for every unit of change in replicates. So if the estimated fold change is 2, this implies that replicates 2 = 2x replicates 1, replicates 3 = 2x replicates 2, etc. Or in other words, the relationship is linear on the log counts scale.
This is not what we want. We want each replicate to be modelled as a seperate term in the model. To overcome this, we could convert 1 -> A, 2 -> B, 3 -> C or we can use factor() to specify that replicates should be treated as a factor, not a numeric.
The code block below checks the input metadata file, tells the user if any columns are numeric and converts them to a factor.
## check if all columns are factors.
factor_cols <- sapply(samples, is.factor)
if(all(factor_cols) == TRUE){
print("All columns in metadata are factors and suitable for analysis.")
}else{
numeric_cols <- sapply(samples, is.numeric)
names <- colnames(samples)[numeric_cols]
print(paste0("Column(s) ", names, " is numeric. Converting to factor."))
samples[numeric_cols] <- as.data.frame(lapply(samples[numeric_cols], factor))
final_check <- sapply(samples, is.factor)
if(all(final_check) == TRUE){
print("All columns in metadata are factors and suitable for analysis.")
}else{
print("Error in converting to factors. Check input metadata file.")
}
}## [1] "Column(s) replicate is numeric. Converting to factor."
## [1] "All columns in metadata are factors and suitable for analysis."
We need to create a file handle object (a named character list) containing the sample IDs and the paths to the kallisto quantification .h5 files. If the rownames of your metadata object (samples) do not match the quantification directory names, you will get an error during the TXI object step.
dir <- ("~/RNA-Seq/quant")
files <- file.path(dir, rownames(samples), "abundance.h5")
names(files) <- paste0(rownames(samples))
files## ctrl_1 ctrl_2
## "~/RNA-Seq/quant/ctrl_1/abundance.h5" "~/RNA-Seq/quant/ctrl_2/abundance.h5"
## ctrl_3 A549_1
## "~/RNA-Seq/quant/ctrl_3/abundance.h5" "~/RNA-Seq/quant/A549_1/abundance.h5"
## A549_2 A549_3
## "~/RNA-Seq/quant/A549_2/abundance.h5" "~/RNA-Seq/quant/A549_3/abundance.h5"
## A375_1 A375_2
## "~/RNA-Seq/quant/A375_1/abundance.h5" "~/RNA-Seq/quant/A375_2/abundance.h5"
## A375_3
## "~/RNA-Seq/quant/A375_3/abundance.h5"
tximport imports transcript-level abundances from quantification tools (kallisto in our case) and converts them to gene counts for downstream analyses using a differential expression analysis package.
We will use biomaRt to connect to the ENSEMBL databases and map transcript IDs to gene IDs. Firstly, we will create the mart object specifying which database to use:
Recall we used the ENSEMBL reference cDNA file Homo_sapiens.GRCh38.cdna.all.fa.gz for Kallisto quant on LUGH. This means that our quantification files have ENSEMBL transcript ID’s. We can map them to gene symbols running the code below:
tx2gene <- getBM(attributes = c("ensembl_transcript_id_version", "hgnc_symbol"), mart = mart, useCache = FALSE)
head(tx2gene)## ensembl_transcript_id_version hgnc_symbol
## 1 ENST00000387314.1 MT-TF
## 2 ENST00000389680.2 MT-RNR1
## 3 ENST00000387342.1 MT-TV
## 4 ENST00000387347.2 MT-RNR2
## 5 ENST00000386347.1 MT-TL1
## 6 ENST00000361390.2 MT-ND1
Create a txi object summarising kallisto transcipt quantification to the gene-level.
## ctrl_1 ctrl_2 ctrl_3 A549_1 A549_2
## 7.284162e+03 7.953254e+03 7.560597e+03 7.349643e+03 7458.5431526
## A1BG 5.890236e+00 4.897379e+00 5.099503e+00 4.850610e+00 5.4659467
## A1CF 3.782977e-02 3.830744e-03 8.870971e-02 8.845872e-02 0.0651354
## A2M 4.125516e+01 2.949814e+01 3.714409e+01 3.728517e+01 48.3297803
## A2ML1 1.887853e-01 2.739898e-01 7.846744e-02 3.797615e-01 0.4940688
## A2MP1 7.176260e-01 7.217753e-01 5.020828e-01 9.791689e-01 1.3553921
## A549_3 A375_1 A375_2 A375_3
## 7.303448e+03 7.371409e+03 7.801736e+03 7.959860e+03
## A1BG 4.309155e+00 4.563118e+00 5.942350e+00 3.991947e+00
## A1CF 1.938563e-02 1.034428e-02 3.691082e-02 1.865877e-02
## A2M 4.759494e+01 3.585082e+01 3.098624e+01 4.383335e+01
## A2ML1 2.521947e-01 4.775542e-01 3.322863e-01 3.713035e-01
## A2MP1 1.359307e+00 8.023772e-01 7.089108e-01 8.666825e-01
Now we are ready to create the DDS object for DESeq2 analysis that contains metadata (colData), counts further information that can be obtained via dds@.
The model design is one of the most important steps for an RNA-Seq analysis. Here, we are going to specify the design ~ replicate + condition (columns of the metadata file). The factor of interest goes last in the model terms, we want to compare control vs. lung vs. melanoma in the experiment. Replicate has been included as we want to control for the effect of sample replicates.
We are going to create the DDS object using DESeqDataSetFromTximport() as we used tximport to convert transcript abundances to gene-level counts.
By default, R will choose a reference level for factors based on alphabetical order. If you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. Set the reference level using relevel(), and create the (DESeq) DDS object using DESeq().
Check that relevel() worked correctly by running resultsNames(dds) - we want control to be the reference level so we can compare lung vs control and melanoma vs. control.
## [1] "Intercept" "replicate_2_vs_1"
## [3] "replicate_3_vs_1" "condition_lung_vs_control"
## [5] "condition_melanoma_vs_control"
We will use the gene-level counts to perform some quality control checks on the samples in the experiment.
Extract gene-level counts from the DDS object.
Write the counts files to text files in ~/RNA-Seq
A heatmap of sample distances matrix (dist(t(log2))) gives us an overview over similarities and dissimilarities between samples.
sampleDists <- dist(t(log2))
sampleDistMatrix <- as.matrix(sampleDists)
colnames(sampleDistMatrix) <- NULL
pheatmap(mat=sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colorRampPalette( rev(brewer.pal(9, "Blues")) )(255))PCA plots the variance explained by samples in each principal component. Typically PC1 & PC2 explain the most variation in the dataset, you should look at these first and foremost.
In the plot below we can see that the lung and control samples cluster tightly together, indicating the replicates were performed rigorously in the lab.
The melanoma samples overlap with the control samples in PC1/PC2. Make the effort to explore other principal components - we can see that in PC2 vs PC4 there is excellent separation between the melanoma and control samples.
PCA can give you a good idea of how succesful the Differential Expression analysis will be - samples that are very close together in PC feature space will not produce as many DE genes as those that are separated by a large distance.
p <- pca(log2, metadata = samples)
biplot(p,
colby = 'condition',
colkey = c('melanoma'='royalblue', 'control'='red1', 'lung'='forestgreen'),
hline = 0,
vline = 0,
legendPosition = 'right',
legendLabSize = 12,
legendIconSize = 8.0,
lab = TRUE,
drawConnectors = TRUE,
title = 'PCA bi-plot',
subtitle = 'PC1 versus PC2')biplot(p,
x = "PC2",
y = "PC4",
colby = 'condition',
colkey = c('melanoma'='royalblue', 'control'='red1', 'lung'='forestgreen'),
hline = 0,
vline = 0,
legendPosition = 'right',
legendLabSize = 12,
legendIconSize = 8.0,
drawConnectors = TRUE,
title = 'PCA bi-plot',
subtitle = 'PC2 versus PC4')We can extract differentially expressed genes between phenotypes of interest by using results() on the DDS object. Furtehrmore, we will apply the apeglm shrinkage estimator on our results. I will demonstrate the effect of apeglm with volcano plots. In summary apeglm shrinks low confidence (high inter sample variation) differentially expressed genes towards 0, producing a robust set of differentially expressed genes.
The argument coef in lfcShrink refers to the contrast of interest returned by resultsNames(dds). For lung vs. control, it is the 4th character string returned thus coef=4 for lung vs. control.
Set up lung vs control, melanoma vs control:
# make lung vs control object
lung_v_ctrl <- results(dds, filterFun=ihw, alpha=0.05, c("condition", "lung", "control"))
res1 <- lfcShrink(dds=dds, res=lung_v_ctrl, coef=4, type="apeglm")
summary(res1)##
## out of 22759 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 2805, 12%
## LFC < 0 (down) : 3404, 15%
## outliers [1] : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting
# make melanoma vs control object
melanoma_v_ctrl <- results(dds, filterFun=ihw, alpha=0.05, c("condition", "melanoma", "control"))
res2 <- lfcShrink(dds=dds, res=melanoma_v_ctrl, coef=5, type="apeglm")
summary(res2)##
## out of 22759 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 456, 2%
## LFC < 0 (down) : 423, 1.9%
## outliers [1] : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting
Set up lung vs melanoma. We need to reconfigure the DDS object so ‘melanoma’ is the reference level in ‘condition’, and re-run DESeq().
Check resultsNames(dds) to ensure the correct contrasts are set:
# to make lung vs melanoma, relevel the dds object reference level and redo the DESeq call
dds$condition <- relevel(dds$condition, ref = "melanoma")
dds <- DESeq(dds)
# double check it worked
resultsNames(dds)## [1] "Intercept" "replicate_2_vs_1"
## [3] "replicate_3_vs_1" "condition_control_vs_melanoma"
## [5] "condition_lung_vs_melanoma"
Now extract the contrast results and perform apeglm.
# make lung vs melanoma
lung_v_melanoma <- results(dds, filterFun=ihw, alpha=0.05, c("condition", "lung", "melanoma"))
res3 <- lfcShrink(dds=dds, res=lung_v_melanoma, coef=5, type="apeglm")
summary(res3)##
## out of 22759 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 2773, 12%
## LFC < 0 (down) : 3335, 15%
## outliers [1] : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting
Take a look at the summary statistics of the number of DE genes. Do they conform to comments made about the PCA plot?
Below are some handy functions I re-use all the time for extracting differentially expressed genes and performing annotation of DE genes.
annotate_de_genes <- function(df){
df$hgnc_symbol <- rownames(df)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
info <- getBM(attributes=c("hgnc_symbol",
"ensembl_gene_id_version",
"chromosome_name",
"start_position",
"end_position",
"strand",
"entrezgene_description"),
filters = c("hgnc_symbol"),
values = df$hgnc_symbol,
mart = mart,
useCache=FALSE)
tmp <- merge(df, info, by="hgnc_symbol")
tmp$strand <- gsub("-1", "-", tmp$strand)
tmp$strand <- gsub("1", "+", tmp$strand)
tmp$hgnc_symbol <- make.names(tmp$hgnc_symbol, unique = T)
tmp <- tmp[!grepl("CHR", tmp$chromosome_name),]
output_col <- c("Gene", "Ensembl ID", "Chromosome", "Start", "Stop", "Strand", "Description", "Log2FC", "P-value", "Adj P-value")
tmp <- subset(tmp, select=c(hgnc_symbol, ensembl_gene_id_version, chromosome_name, start_position, end_position, strand, entrezgene_description, log2FoldChange, pvalue, padj))
colnames(tmp) <- output_col
if(min(tmp$Log2FC) > 0){
tmp <- tmp[order(-tmp$Log2FC),]
}else{
tmp <- tmp[order(tmp$Log2FC),]
}
return(tmp)
}Use the above functions to extract up/down regulated genes and annotate them using biomaRt, and write to a file. Please note that the get_(up/down)regulated function requires a dataframe as input.
Note: we will focus on lung vs control for the remainder of the practical
de_up <- get_upregulated(as.data.frame(res1))
de_down <- get_downregulated(as.data.frame(res1))
upregulated_genes <- annotate_de_genes(de_up)
downregulated_genes <- annotate_de_genes(de_down)
dir.create("~/RNA-Seq/DESeq_results/")
write.table(upregulated_genes, "~/RNA-Seq/DESeq_results/lung_vs_control_upregulated.txt", sep="\t", row.names=F, quote=F)
write.table(downregulated_genes, "~/RNA-Seq/DESeq_results/lung_vs_control_downregulated.txt", sep="\t", row.names=F, quote=F)Volcano plots are useful to show how many genes are differentially expressed in the experimental contrast of interest. Labels are optional, I have included them so you know how to use them.
N.B Volcano plots use -log10 on the Y-axis.
res1 <- na.omit(res1)
min_width <- min(res1$log2FoldChange)
max_width <- max(res1$log2FoldChange)
max_height <- -log10(min(res1[res1$pvalue>0, 5]))
up <- subset(res1, res1$log2FoldChange > 1 & res1$pvalue <= 0.05)
up <- up[order(-up$log2FoldChange),]
up_list <- head(rownames(up), n=10L)
down <- subset(res1, res1$log2FoldChange < -1 & res1$pvalue <= 0.05)
down <- down[order(down$log2FoldChange),]
down_list <- head(rownames(down), n=10L)
plot_top_20 <- c(up_list, down_list)
EnhancedVolcano(res1,
lab=rownames(res1),
x="log2FoldChange",
y="pvalue",
selectLab=plot_top_20,
drawConnectors=TRUE,
FCcutoff=1.0,
pCutoff=0.05,
title="Volcano Plot",
subtitle="Lung vs. Control",
legendVisible=F,
caption = paste0('Total Genes = ', nrow(res1)),
xlim=c(min_width, max_width),
ylim=c(0, max_height))Make a volcano plot of lung_v_ctrl, the DESeq2 results object that has not been filtered by apeglm.
You can see that there are genes with very large log2 FC values. As a rule of thumb, be skeptical of genes that have LFC values over +/- 10.
res1 <- na.omit(lung_v_ctrl)
min_width <- min(res1$log2FoldChange)
max_width <- max(res1$log2FoldChange)
max_height <- -log10(min(res1[res1$pvalue>0, 5]))
up <- subset(res1, res1$log2FoldChange > 1 & res1$pvalue <= 0.05)
up <- up[order(-up$log2FoldChange),]
up_list <- head(rownames(up), n=10L)
down <- subset(res1, res1$log2FoldChange < -1 & res1$pvalue <= 0.05)
down <- down[order(down$log2FoldChange),]
down_list <- head(rownames(down), n=10L)
plot_top_20 <- c(up_list, down_list)
EnhancedVolcano(res1,
lab=rownames(res1),
x="log2FoldChange",
y="pvalue",
selectLab=plot_top_20,
drawConnectors=TRUE,
FCcutoff=1.0,
pCutoff=0.05,
title="Volcano Plot",
subtitle="Lung vs. Control",
legendVisible=F,
caption = paste0('Total Genes = ', nrow(res1)))Heatmaps are another way to show the differentially expressed genes in the experimental contrast of interest. Complexheatmap is as the name suggest, quite complex compared to other heatmap packages but it is undoubtedly the best in terms of flexibility.
# subset the counts matrix to get the lung and control samples
subset <- counts[, 1:6]
# now select de_up, de_down, i.e DE genes that passed the filtering
up <- rownames(de_up)
down <- rownames(de_down)
# subset them
key <- c(up, down)
subset <- subset[which(rownames(subset) %in% key),]
# Scale and center the counts matrix
# Scale works on columns, transpose the matrix to scale and center genes, transpose back.
mat <- t(subset)
mat <- scale(mat, center=T, scale=T)
mat <- t(mat)
mat <- na.omit(mat)
# set up annotation dataframe
ann <- data.frame(Cell_Type = c(rep("control", 3), c(rep("lung", 3))))
# set up heatmap column annotation
ha_col = HeatmapAnnotation(df = ann,
col = list(Cell_Type = c("control" = "gold",
"lung" = "forestgreen")),
annotation_legend_param = list(title_gp = gpar(fontsize = 12, fontface = "bold"),
labels_gp = gpar(fontsize = 12)),
annotation_name_side = "left")
# set up heatmap object
hm1 <- Heatmap(mat,
col= colorRamp2(c(-2.6,-1,0,1,2.6),c("blue","skyblue","white","lightcoral","red")),
heatmap_legend_param=list(at=c(-2.6,-1,0,1,2.6),color_bar="continuous",
legend_direction="vertical", legend_width=unit(5,"cm"),
title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
name = "Z-score",
#Row annotation configurations
cluster_rows=T,
show_row_dend=T,
row_title_side="right",
row_title_gp=gpar(fontsize=8),
show_row_names=FALSE,
row_names_side="left",
#Column annotation configuratiions
cluster_columns=T,
show_column_dend=T,
column_title="Lung vs. Control DE Genes",
column_title_side="top",
column_title_gp=gpar(fontsize=15, fontface="bold"),
show_column_names = T,
column_names_gp = gpar(fontsize = 12, fontface="bold"),
#Dendrogram configurations: columns
clustering_distance_columns="euclidean",
clustering_method_columns="complete",
column_dend_height=unit(10,"mm"),
#Dendrogram configurations: rows
clustering_distance_rows="euclidean",
clustering_method_rows="complete",
row_dend_width=unit(4,"cm"),
row_dend_side = "left",
row_dend_reorder = TRUE,
#Splits
border=T,
row_km = 1,
column_km = 1,
#plot params
width = unit(5, "inch"),
height = unit(4, "inch"),
#height = unit(0.4, "cm")*nrow(mat),
#Annotations
top_annotation = ha_col)
# plot heatmap
draw(hm1, annotation_legend_side = "right", heatmap_legend_side="right")Instead of plotting all DE genes in a heatmap, we can select the top n differentially expressed genes to plot. I have decided to use the largest LFC values to plot the top 20 genes, however you might see people plotting the top n variable genes in this situation.
# same logic as volcano plot
up <- subset(de_up, de_up$log2FoldChange > 1 & de_up$pvalue <= 0.05)
up <- up[order(-up$log2FoldChange),]
up_list <- head(rownames(up), n=10L)
down <- subset(de_down, de_down$log2FoldChange < 1 & de_down$pvalue <= 0.05)
down <- down[order(down$log2FoldChange),]
down_list <- head(rownames(down), n=10L)
plot_top_20 <- c(up_list, down_list)
# subset the counts matrix from last step
top_genes <- mat[which(rownames(mat) %in% plot_top_20),]
hm1 <- Heatmap(top_genes,
col= colorRamp2(c(-2.6,-1,0,1,2.6),c("blue","skyblue","white","lightcoral","red")),
heatmap_legend_param=list(at=c(-2.6,-1,0,1,2.6),color_bar="continuous",
legend_direction="vertical", legend_width=unit(5,"cm"),
title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
name = "Z-score",
#Row annotation configurations
cluster_rows=T,
show_row_dend=T,
row_title_side="right",
row_title_gp=gpar(fontsize=8),
show_row_names=TRUE,
row_names_side="right",
#Column annotation configuratiions
cluster_columns=T,
show_column_dend=T,
column_title="Lung vs. Control top 20 DE genes",
column_title_side="top",
column_title_gp=gpar(fontsize=15, fontface="bold"),
show_column_names = T,
column_names_gp = gpar(fontsize = 12, fontface="bold"),
#Dendrogram configurations: columns
clustering_distance_columns="euclidean",
clustering_method_columns="complete",
column_dend_height=unit(10,"mm"),
#Dendrogram configurations: rows
clustering_distance_rows="euclidean",
clustering_method_rows="complete",
row_dend_width=unit(4,"cm"),
row_dend_side = "left",
row_dend_reorder = TRUE,
#Splits
border=T,
row_km = 1,
column_km = 1,
#plot params
width = unit(5, "inch"),
height = unit(4, "inch"),
#height = unit(0.4, "cm")*nrow(mat),
#Annotations
top_annotation = ha_col)
# plot heatmap
draw(hm1, annotation_legend_side = "right", heatmap_legend_side="right")Please download the GO Biological processes GMT file from the following link: https://raw.githubusercontent.com/BarryDigby/barrydigby.github.io/master/RNA-Seq/c5.bp.v7.0.symbols.gmt
Convert the DESeq results object to a dataframe, (not the apeglm filtered results object, which has no column named stat) and produce a dataframe of genes with their summary statistic.
## convert result object to dataframe
res <- as.data.frame(lung_v_ctrl)
res$hgnc_symbol <- rownames(res)
# compute summary stat
fgsea_rank <- res %>%
dplyr::select(hgnc_symbol, stat) %>%
na.omit() %>%
distinct() %>%
group_by(hgnc_symbol) %>%
summarize(stat=mean(stat))
fgsea_rank## # A tibble: 22,759 x 2
## hgnc_symbol stat
## <chr> <dbl>
## 1 "" -2.14
## 2 "A1BG" -0.855
## 3 "A1CF" 0.448
## 4 "A2M" 1.13
## 5 "A2ML1" 1.49
## 6 "A2MP1" 2.61
## 7 "A3GALT2" 0.448
## 8 "A4GALT" 1.52
## 9 "A4GNT" -0.104
## 10 "AAAS" -5.07
## # … with 22,749 more rows
## A1BG A1CF A2M A2ML1 A2MP1 A3GALT2
## -2.1441358 -0.8549118 0.4475917 1.1329554 1.4913325 2.6087616 0.4477794
## A4GALT A4GNT AAAS AACS AACSP1 AADAC AADACL3
## 1.5218178 -0.1035146 -5.0736368 -6.0395784 -0.8230446 -0.1258552 0.1489940
## AADACP1 AADAT AAGAB AAK1 AAMDC AAMP
## -0.4319702 -1.3401389 1.4935266 1.0193265 -2.6350318 1.3554245
## $GO_POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION
## [1] "CDK9" "CHD1" "DHX9" "EP300" "SNW1" "RRP1B" "NELFB"
## [8] "GTF2B" "GTF2F1" "GTF2F2" "MDFIC" "HPN" "JUN" "LEF1"
## [15] "ZNF639" "NELFCD" "RSF1" "POLR2A" "POLR2B" "POLR2C" "POLR2D"
## [22] "POLR2E" "POLR2F" "POLR2G" "POLR2H" "POLR2I" "POLR2J" "POLR2K"
## [29] "POLR2L" "NUCKS1" "SMARCA4" "SMARCB1" "SP1" "SUPT4H1" "SUPT5H"
## [36] "TAF11" "TFAP4" "NELFA" "NELFE" "CCNT1" "CTDP1"
Subset the fgsea results object to include results whose normalised enrichment score is > 2.3. You can use any preferred filtering method here, this is just to return a handful of pathways for the enrichment plot for loop.
filtered_pathway <- subset(fgsea, NES > 2.3)
filt_p <- as.vector(filtered_pathway$pathway)
for (i in filt_p){
plt <- plotEnrichment(pathway = pathway[[i]],
gseaParam = 1, ticksSize = 0.5, stats= rank) +
labs(title=i) + theme(plot.title = element_text(hjust = 0.5, face="bold"))
print(plt)
}To use cluster profiler, gene IDs must be converted to ENTREZ IDs. We can achieve this using biomaRt and attaching the ENTREZ IDs to the results table. For the sake of the tutorial, we are only going to supply statistically significant genes to clusterProfiler (padj < 0.01)
df <- as.data.frame(res1)
df$hgnc_symbol <- rownames(df)
info <- getBM(attributes=c("hgnc_symbol",
"entrezgene_id"),
filters = c("hgnc_symbol"),
values = df$hgnc_symbol,
mart = mart,
useCache=FALSE)
tmp <- merge(df, info, by="hgnc_symbol")
# subset the dataframe to include only stat sig genes
tmp <- tmp[tmp$padj < 0.01,]Be careful to specify the ENTREZ IDs (which are numeric names) as characters.
OrgDb <- org.Hs.eg.db
geneList <- as.vector(tmp$log2FoldChange)
names(geneList) <- as.character(tmp$entrezgene_id)
gene <- na.omit(as.character(tmp$entrezgene_id))
# GO over-representation test
ego <- clusterProfiler::enrichGO(gene = gene,
OrgDb = OrgDb,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.001,
qvalueCutoff = 0.01,
readable = TRUE)
head(summary(ego))## ID Description GeneRatio BgRatio
## GO:0006260 GO:0006260 DNA replication 151/4336 274/18670
## GO:0007059 GO:0007059 chromosome segregation 164/4336 321/18670
## GO:0006261 GO:0006261 DNA-dependent DNA replication 98/4336 153/18670
## GO:0000819 GO:0000819 sister chromatid segregation 110/4336 189/18670
## GO:0140014 GO:0140014 mitotic nuclear division 138/4336 264/18670
## GO:0000070 GO:0000070 mitotic sister chromatid segregation 93/4336 151/18670
## pvalue p.adjust qvalue
## GO:0006260 2.000329e-30 1.280211e-26 9.917421e-27
## GO:0007059 7.179680e-28 2.297498e-24 1.779805e-24
## GO:0006261 3.967488e-27 8.463975e-24 6.556797e-24
## GO:0000819 3.622111e-25 5.795377e-22 4.489511e-22
## GO:0140014 6.803353e-25 8.708291e-22 6.746061e-22
## GO:0000070 5.585756e-24 4.988101e-21 3.864137e-21
## geneID
## GO:0006260 ALYREF/ATAD5/ATM/ATRIP/BARD1/BCL6/BLM/BOD1L1/BRCA1/BRCA2/CACYBP/CCNA2/CDC45/CDC6/CDC7/CDK1/CDK2/CDT1/CENPS/CENPX/CHAF1A/CHAF1B/CHEK1/CHEK2/CHTF18/CHTF8/CINP/CLSPN/DBF4/DBF4B/DDX11/DHX9/DNA2/DONSON/DSCC1/DTL/DUT/E2F7/E2F8/EGFR/EME1/EME2/ESCO2/ETAA1/EXO1/FAF1/FANCM/FBXO5/FEN1/GEN1/GINS1/GINS2/GINS3/GINS4/GMNN/GTPBP4/HUS1/ID3/ING4/JUN/KCTD13/LIG1/LIG3/MCM10/MCM2/MCM3/MCM4/MCM5/MCM6/MCM7/MCM8/MCMBP/MGME1/MMS22L/MRE11/MSH3/MSH6/NASP/NFIA/NFIB/NFIX/NUCKS1/ORC1/ORC3/ORC5/ORC6/PCLAF/PCNA/PIF1/POLA2/POLD2/POLD3/POLE/POLE2/POLE3/POLE4/POLQ/PRIM1/PRIM2/PRIMPOL/PTMS/PURA/RAC1/RAD1/RAD51/RAD9B/RBBP7/RBMS1/RECQL/RECQL5/REV1/REV3L/RFC1/RFC2/RFC3/RFC4/RFC5/RFWD3/RHNO1/RMI1/RMI2/RNASEH2A/RPA1/RPA2/RPA3/RRM1/RRM2/RRM2B/SET/SETMAR/SLBP/SLX4/SMC1A/SSBP1/SSRP1/STAG2/SUPT16H/TBRG1/TICRR/TIMELESS/TIPIN/TNFAIP1/TONSL/TOPBP1/TREX1/UPF1/WDHD1/WRN/WRNIP1/ZBTB38/ZNF830
## GO:0007059 ACTR3/AKAP8L/ANAPC1/ANAPC15/ANAPC5/ATM/AURKB/BANF1/BEX4/BIRC5/BRCA1/BUB1/BUB1B/BUB3/CCNB1/CCNB1IP1/CDC20/CDC23/CDC27/CDC42/CDC6/CDCA2/CDCA5/CDCA8/CDK5RAP2/CDT1/CENPE/CENPF/CENPK/CENPN/CENPQ/CENPS/CENPW/CENPX/CEP85/CHAMP1/CHMP1B/CHMP5/CHTF8/CIAO1/CTCF/CTNNB1/DDX11/DLGAP5/DMC1/DSCC1/DSN1/DUSP1/ECT2/EME1/EME2/ESCO2/ESPL1/FAM83D/FANCD2/FANCM/FBXO5/FEN1/FMN2/GEN1/HASPIN/HJURP/HNRNPU/INCENP/KIF14/KIF18A/KIF18B/KIF22/KIF23/KIF2C/KIF4A/KIFC1/KNL1/KNSTRN/KPNB1/LATS1/MACROH2A1/MAD2L1/MAD2L2/MAU2/MCMBP/MIS18A/MKI67/MLH1/MMS19/MRE11/MSH4/MSTO1/NAA10/NAA50/NCAPD2/NCAPG/NCAPH/NCAPH2/NDC1/NDC80/NDE1/NEK2/NR3C1/NSMCE2/NUF2/NUMA1/NUP37/NUP43/NUP62/NUSAP1/OIP5/PCID2/PHF13/PIBF1/PLK1/PMF1/PRC1/PSMG2/PSRC1/PTTG1/RAB11A/RACGAP1/RAD18/RAD51C/RAN/RCC1/RECQL5/RIOK2/RIOK3/RMI1/RMI2/RPS3/RRS1/SEH1L/SETDB2/SFPQ/SGO1/SGO2/SKA1/SKA2/SKA3/SLC25A5/SLF1/SLF2/SLX4/SMC1A/SMC2/SMC4/SMC5/SMC6/SPAG5/SPC25/SPDL1/STAG1/STAG2/STAG3/STAG3L4/TACC3/TLK1/TOP2A/TRIP13/TTK/TTL/TUBG1/UBE2I/XRCC3/ZW10/ZWINT
## GO:0006261 ALYREF/ATAD5/BCL6/BLM/BOD1L1/BRCA2/CDC45/CDC6/CDC7/CDK2/CDT1/CENPS/CENPX/CHEK2/CHTF18/CHTF8/DBF4/DBF4B/DDX11/DNA2/DONSON/DSCC1/E2F7/E2F8/EME1/EME2/ETAA1/FANCM/FBXO5/FEN1/GEN1/GINS1/GINS2/GINS3/GINS4/GMNN/LIG1/LIG3/MCM10/MCM2/MCM3/MCM4/MCM5/MCM6/MCM7/MCMBP/MGME1/MMS22L/MRE11/MSH3/MSH6/NUCKS1/ORC1/ORC3/ORC5/ORC6/PCNA/POLA2/POLD2/POLD3/POLE/POLE2/POLE3/POLE4/POLQ/PRIM1/PRIM2/PRIMPOL/PURA/RAD51/RECQL/RECQL5/REV3L/RFC1/RFC2/RFC3/RFC4/RFC5/RFWD3/RNASEH2A/RPA1/RPA2/RPA3/RRM2B/SETMAR/SLBP/SMC1A/SSBP1/STAG2/TICRR/TIMELESS/TIPIN/TONSL/UPF1/WDHD1/WRN/WRNIP1/ZNF830
## GO:0000819 AKAP8L/ANAPC1/ANAPC15/ANAPC5/ATM/AURKB/BUB1/BUB1B/BUB3/CCNB1/CDC20/CDC23/CDC27/CDC6/CDCA5/CDCA8/CDK5RAP2/CDT1/CENPE/CENPF/CENPK/CHAMP1/CHMP1B/CHMP5/CHTF8/CTCF/CTNNB1/DDX11/DLGAP5/DSCC1/DSN1/DUSP1/ESCO2/ESPL1/FBXO5/FEN1/GEN1/HASPIN/HNRNPU/INCENP/KIF14/KIF18A/KIF18B/KIF22/KIF23/KIF2C/KIF4A/KIFC1/KNSTRN/KPNB1/LATS1/MACROH2A1/MAD2L1/MAD2L2/MAU2/MCMBP/MRE11/MSTO1/NAA10/NAA50/NCAPD2/NCAPG/NCAPH/NCAPH2/NDC80/NEK2/NSMCE2/NUF2/NUMA1/NUP62/NUSAP1/PCID2/PHF13/PIBF1/PLK1/PRC1/PSMG2/PSRC1/PTTG1/RAB11A/RACGAP1/RAD51C/RAN/RIOK2/RMI2/RRS1/SEH1L/SFPQ/SGO1/SGO2/SLF1/SLF2/SMC1A/SMC2/SMC4/SMC5/SPAG5/SPDL1/STAG1/STAG2/STAG3/STAG3L4/TACC3/TOP2A/TRIP13/TTK/TUBG1/XRCC3/ZW10/ZWINT
## GO:0140014 AAAS/AKAP8L/ANAPC1/ANAPC15/ANAPC5/ANLN/ARHGEF10/ATM/AURKA/AURKB/BCCIP/BIRC5/BORA/BUB1/BUB1B/BUB3/CCNB1/CDC20/CDC23/CDC25C/CDC27/CDC6/CDCA5/CDCA8/CDK5RAP2/CDT1/CENPE/CENPF/CENPK/CEP192/CEP85/CHAMP1/CHEK1/CHEK2/CHMP1B/CHMP5/CHTF8/CLASP2/DLGAP5/DRG1/DSCC1/DSN1/DUSP1/EPS8/ESPL1/FBXO5/GEN1/HASPIN/HNRNPU/HSPA1A/HSPA1B/IL1B/INCENP/INSR/KIF11/KIF14/KIF18A/KIF18B/KIF20B/KIF22/KIF23/KIF2C/KIF4A/KIFC1/KNSTRN/KNTC1/KPNB1/MACROH2A1/MAD2L1/MAD2L2/MAP9/MAU2/MKI67/MSTO1/MTBP/MYBL2/MZT1/NAA10/NAA50/NCAPD2/NCAPG/NCAPH/NCAPH2/NDC80/NDE1/NEK2/NFE2L1/NSMCE2/NUF2/NUMA1/NUP62/NUSAP1/PCID2/PHF13/PHIP/PIBF1/PKMYT1/PLK1/PPP1R9B/PRC1/PSMG2/PSRC1/PTTG1/RAB11A/RACGAP1/RAN/RANBP1/RCC1/RHOA/RIOK2/RPL24/RRS1/SEH1L/SGO1/SLF1/SLF2/SMC1A/SMC2/SMC4/SMC5/SPAG5/SPDL1/SPHK1/STAG1/STAG2/TACC3/TOM1L1/TPX2/TRIP13/TTK/TUBG1/UBE2C/UBE2S/UBXN2B/USP16/XRCC3/ZW10/ZWINT
## GO:0000070 AKAP8L/ANAPC1/ANAPC15/ANAPC5/ATM/AURKB/BUB1/BUB1B/BUB3/CCNB1/CDC20/CDC23/CDC27/CDC6/CDCA5/CDCA8/CDK5RAP2/CDT1/CENPE/CENPF/CENPK/CHAMP1/CHMP1B/CHMP5/CHTF8/DLGAP5/DSCC1/DSN1/DUSP1/ESPL1/FBXO5/GEN1/HASPIN/HNRNPU/INCENP/KIF14/KIF18A/KIF18B/KIF22/KIF23/KIF2C/KIF4A/KIFC1/KNSTRN/KPNB1/MACROH2A1/MAD2L1/MAD2L2/MAU2/MSTO1/NAA10/NAA50/NCAPD2/NCAPG/NCAPH/NCAPH2/NDC80/NEK2/NSMCE2/NUF2/NUMA1/NUP62/NUSAP1/PCID2/PHF13/PIBF1/PLK1/PRC1/PSMG2/PSRC1/PTTG1/RAB11A/RACGAP1/RAN/RIOK2/RRS1/SEH1L/SGO1/SLF1/SLF2/SMC1A/SMC2/SMC4/SMC5/SPAG5/SPDL1/TACC3/TRIP13/TTK/TUBG1/XRCC3/ZW10/ZWINT
## Count
## GO:0006260 151
## GO:0007059 164
## GO:0006261 98
## GO:0000819 110
## GO:0140014 138
## GO:0000070 93
Subset the enrichment object, be sure to add asis = T. For the sake of the tutorial, we are selecting pathways with 21 genes in them to reduce the plot size. You may want to filter by pathways or genes of interest here.
clusterProfiler produces a nice gene-concept network plot showing gene fold changes and the pathway to which they belong.
subset <- ego[ego$Count == 21, asis=T]
cnetplot(subset, categorySize="geneNum", foldChange=geneList)You may wish to demonstrate specific gene expression values using boxplots. We will use the package ggpubr here which produces really nice publication ready plots.
In the example below we will use the DESeq2 plotCounts function, which converts the count matrix for gene \(i\) into long format, and appends metadata. This is the standard input format for plotting categorical data in R.
## count condition
## ctrl_1 8431.116 control
## ctrl_2 9376.712 control
## ctrl_3 9692.598 control
## A549_1 10316.675 lung
## A549_2 10445.949 lung
## A549_3 10317.552 lung
## A375_1 7946.369 melanoma
## A375_2 9958.632 melanoma
## A375_3 10465.176 melanoma
We could use the function to produce count plots (returnData=F), however in my opinion the plot is rather lackluster…
If we want to plot lung vs control, we need to remove the melanoma counts from the dataframe:
## count condition
## ctrl_1 8431.116 control
## ctrl_2 9376.712 control
## ctrl_3 9692.598 control
## A549_1 10316.675 lung
## A549_2 10445.949 lung
## A549_3 10317.552 lung
Now plot the data using ggpubr:
ggboxplot(d, x = "condition", y = "count",
fill = "condition", color = "black", palette = "jco",
ylab ="Normalized Counts", title = "PTEN Expression",
xlab="", add = c("dotplot"),
add.params = list(size=0.75, jitter=0.01),
legend = "none", bxp.errorbar = T,
bxp.errorbar.width = 0.2, width=0.3, ggtheme = theme_classic()) +
theme(axis.text.x = element_text( colour = "black", size=14)) +
theme(axis.title.y = element_text( colour = "black", size=14)) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())We can perform a pairwise t-test and add the significance values for the boxplot. First, set up the comparison of interest comp using a list. We will call this variable using stat_compare_means.
I like to bump the significance value label slightly higher than the default. To do this dynamically, simply add 5% to the max value and set this as the y axis value for the significance label (signif_yaxis variable).
In the plot below we can see that PTEN is more highly expressed in the lung samples, however not at a significant level.
comps <- list(c("lung", "control"))
signif_yaxis = max(d$count) + (max(d$count)*0.05)
ggboxplot(d, x = "condition", y = "count",
fill = "condition", color = "black", palette = "jco",
ylab ="Normalized Counts", title = "PTEN Expression",
xlab="", add = c("dotplot"),
add.params = list(size=0.75, jitter=0.01),
legend = "none", bxp.errorbar = T,
bxp.errorbar.width = 0.2, width=0.3, ggtheme = theme_classic()) +
theme(axis.text.x = element_text( colour = "black", size=14)) +
theme(axis.title.y = element_text( colour = "black", size=14)) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
stat_compare_means(comparisons = comps, method = "t.test", label.y = signif_yaxis)Lets plot lung vs control vs melanoma for PNN and run an ANOVA test to see if PNN is significantly over or under expressed in a condition compared to all. Re-run the plotCounts function to obtain the dataframe with all three conditions:
## count condition
## ctrl_1 4668.903 control
## ctrl_2 4538.654 control
## ctrl_3 5027.576 control
## A549_1 3183.788 lung
## A549_2 3087.202 lung
## A549_3 3442.206 lung
## A375_1 4191.339 melanoma
## A375_2 4855.743 melanoma
## A375_3 5046.059 melanoma
To perform pairwise test vs. all (i.e the base-mean), use compare_means and set the ref.group to ".all.". We are going to change the signficance labels for each pairwise comparison to ns, *, ** and *** instead of reporting precise p-values. This can be toggled using label in the stat_compare_means function.
We will be plotting two significance labels (ANOVA and pairwise) thus we will need to make two signif_yaxis variables at 5% and 10% of the max value respectively to produce a tidy plot.
In the plot below, we see that PNN is underexpressed in melanoma compared to control and lung, and the ANOVA test suggests the results are significant.
signif_yaxis = max(d$count) + (max(d$count)*0.10)
signif_yaxis2 = max(d$count) + (max(d$count)*0.05)
compare_means(count ~ condition, data = d,
ref.group = ".all.", method = "t.test")## # A tibble: 3 x 8
## .y. group1 group2 p p.adj p.format p.signif method
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 count .all. melanoma 0.246 0.25 0.2461 ns T-test
## 2 count .all. control 0.117 0.23 0.1166 ns T-test
## 3 count .all. lung 0.00612 0.018 0.0061 ** T-test
ggboxplot(d, x = "condition", y = "count",
fill = "condition", color = "black", palette = "jco",
ylab ="Normalized Counts", title = "PNN Expression",
xlab="", add = c("dotplot"),
add.params = list(size=0.75, jitter=0.01),
legend = "none", bxp.errorbar = T,
bxp.errorbar.width = 0.2, width=0.3, ggtheme = theme_classic()) +
theme(axis.text.x = element_text( colour = "black", size=14)) +
theme(axis.title.y = element_text( colour = "black", size=14)) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
rotate_x_text(angle = 45) +
geom_hline(yintercept = mean(d$count), linetype = 2) +
stat_compare_means(method = "anova", label.y = signif_yaxis) +
stat_compare_means(label = "p.signif", method = "t.test",
ref.group = ".all.", label.y = signif_yaxis2)