library(dplyr)
library(biomaRt)
library(tximport)
library(rhdf5)
library(gplots)
library(org.Hs.eg.db)
library(DESeq2)
library(DT)
library(apeglm)
library(RColorBrewer)
library(IHW)
library(PCAtools)
library(pheatmap)
library(clusterProfiler)
library(EnhancedVolcano)
library(ComplexHeatmap)
library(circlize)
library(fgsea)
library(tidyverse)
library(ggpubr)

Introduction

This R markdown document walks through an RNA-Seq differential expression analysis using DESeq2 for MA5112. Quality control plots, differential expression results, plots and pathway analysis are covered in the document.

The tutorial assumes you have 9 Kallisto quantification directories under ~/RNA-Seq/quant/ on your laptop.

Reading Input Data

Metadata

Read in the file samples.csv which contains the experimental metadata. Please make sure that the rownames of this dataframe match the kallisto quantification directory names at ~/RNA-Seq/quant/.

samples <- read.csv("~/RNA-Seq/samples.csv", header=T, row.names = "sample", stringsAsFactors = T)
samples

##        condition replicate
## ctrl_1   control         1
## ctrl_2   control         2
## ctrl_3   control         3
## A549_1      lung         1
## A549_2      lung         2
## A549_3      lung         3
## A375_1  melanoma         1
## A375_2  melanoma         2
## A375_3  melanoma         3

Convert Numerics to Factors

When performing differential expression analysis using DESeq2, if a column in samples.csv is numeric and provided to the design formula, DESeq2 will produce the following message:

the design formula contains a numeric variable with integer values, specifying a model with increasing fold change for higher values. did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function

What does this mean? From Mike Love (DESeq2 author):

There is a constant fold change for every unit of change in replicates. So if the estimated fold change is 2, this implies that replicates 2 = 2x replicates 1, replicates 3 = 2x replicates 2, etc. Or in other words, the relationship is linear on the log counts scale.

This is not what we want. We want each replicate to be modelled as a seperate term in the model. To overcome this, we could convert 1 -> A, 2 -> B, 3 -> C or we can use factor() to specify that replicates should be treated as a factor, not a numeric.

The code block below checks the input metadata file, tells the user if any columns are numeric and converts them to a factor.

## check if all columns are factors.  
factor_cols <- sapply(samples, is.factor)

if(all(factor_cols) == TRUE){
  print("All columns in metadata are factors and suitable for analysis.")
}else{
  numeric_cols <- sapply(samples, is.numeric)
  names <- colnames(samples)[numeric_cols]
  print(paste0("Column(s) ", names, " is numeric. Converting to factor."))
  samples[numeric_cols] <- as.data.frame(lapply(samples[numeric_cols], factor))
  final_check <- sapply(samples, is.factor)
  if(all(final_check) == TRUE){
    print("All columns in metadata are factors and suitable for analysis.")
  }else{
    print("Error in converting to factors. Check input metadata file.")
  }
}

## [1] "Column(s) replicate is numeric. Converting to factor."
## [1] "All columns in metadata are factors and suitable for analysis."

Stage Kallisto files

We need to create a file handle object (a named character list) containing the sample IDs and the paths to the kallisto quantification .h5 files. If the rownames of your metadata object (samples) do not match the quantification directory names, you will get an error during the TXI object step.

dir <- ("~/RNA-Seq/quant")
files <- file.path(dir, rownames(samples), "abundance.h5")
names(files) <- paste0(rownames(samples))
files

##                                ctrl_1                                ctrl_2 
## "~/RNA-Seq/quant/ctrl_1/abundance.h5" "~/RNA-Seq/quant/ctrl_2/abundance.h5" 
##                                ctrl_3                                A549_1 
## "~/RNA-Seq/quant/ctrl_3/abundance.h5" "~/RNA-Seq/quant/A549_1/abundance.h5" 
##                                A549_2                                A549_3 
## "~/RNA-Seq/quant/A549_2/abundance.h5" "~/RNA-Seq/quant/A549_3/abundance.h5" 
##                                A375_1                                A375_2 
## "~/RNA-Seq/quant/A375_1/abundance.h5" "~/RNA-Seq/quant/A375_2/abundance.h5" 
##                                A375_3 
## "~/RNA-Seq/quant/A375_3/abundance.h5"

Tximport

tximport imports transcript-level abundances from quantification tools (kallisto in our case) and converts them to gene counts for downstream analyses using a differential expression analysis package.

BiomaRt

We will use biomaRt to connect to the ENSEMBL databases and map transcript IDs to gene IDs. Firstly, we will create the mart object specifying which database to use:

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

Transcript to gene (tx2gene)

Recall we used the ENSEMBL reference cDNA file Homo_sapiens.GRCh38.cdna.all.fa.gz for Kallisto quant on LUGH. This means that our quantification files have ENSEMBL transcript ID’s. We can map them to gene symbols running the code below:

tx2gene <- getBM(attributes = c("ensembl_transcript_id_version", "hgnc_symbol"), mart = mart, useCache = FALSE)
head(tx2gene)

##   ensembl_transcript_id_version hgnc_symbol
## 1             ENST00000387314.1       MT-TF
## 2             ENST00000389680.2     MT-RNR1
## 3             ENST00000387342.1       MT-TV
## 4             ENST00000387347.2     MT-RNR2
## 5             ENST00000386347.1      MT-TL1
## 6             ENST00000361390.2      MT-ND1

TXI object

Create a txi object summarising kallisto transcipt quantification to the gene-level.

txi <- tximport(files, type = "kallisto", tx2gene = tx2gene)
head(txi$abundance)

##             ctrl_1       ctrl_2       ctrl_3       A549_1       A549_2
##       7.284162e+03 7.953254e+03 7.560597e+03 7.349643e+03 7458.5431526
## A1BG  5.890236e+00 4.897379e+00 5.099503e+00 4.850610e+00    5.4659467
## A1CF  3.782977e-02 3.830744e-03 8.870971e-02 8.845872e-02    0.0651354
## A2M   4.125516e+01 2.949814e+01 3.714409e+01 3.728517e+01   48.3297803
## A2ML1 1.887853e-01 2.739898e-01 7.846744e-02 3.797615e-01    0.4940688
## A2MP1 7.176260e-01 7.217753e-01 5.020828e-01 9.791689e-01    1.3553921
##             A549_3       A375_1       A375_2       A375_3
##       7.303448e+03 7.371409e+03 7.801736e+03 7.959860e+03
## A1BG  4.309155e+00 4.563118e+00 5.942350e+00 3.991947e+00
## A1CF  1.938563e-02 1.034428e-02 3.691082e-02 1.865877e-02
## A2M   4.759494e+01 3.585082e+01 3.098624e+01 4.383335e+01
## A2ML1 2.521947e-01 4.775542e-01 3.322863e-01 3.713035e-01
## A2MP1 1.359307e+00 8.023772e-01 7.089108e-01 8.666825e-01

DESeq2

Now we are ready to create the DDS object for DESeq2 analysis that contains metadata (colData), counts further information that can be obtained via dds@.

Design

The model design is one of the most important steps for an RNA-Seq analysis. Here, we are going to specify the design ~ replicate + condition (columns of the metadata file). The factor of interest goes last in the model terms, we want to compare control vs. lung vs. melanoma in the experiment. Replicate has been included as we want to control for the effect of sample replicates.

DDS

We are going to create the DDS object using DESeqDataSetFromTximport() as we used tximport to convert transcript abundances to gene-level counts.

dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ replicate + condition )

Factor levels

By default, R will choose a reference level for factors based on alphabetical order. If you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. Set the reference level using relevel(), and create the (DESeq) DDS object using DESeq().

Check that relevel() worked correctly by running resultsNames(dds) - we want control to be the reference level so we can compare lung vs control and melanoma vs. control.

dds$condition <- relevel(dds$condition, ref = "control")
dds <- DESeq(dds)
resultsNames(dds)

## [1] "Intercept"                     "replicate_2_vs_1"             
## [3] "replicate_3_vs_1"              "condition_lung_vs_control"    
## [5] "condition_melanoma_vs_control"

Quality Control

We will use the gene-level counts to perform some quality control checks on the samples in the experiment.

Extract Counts

Extract gene-level counts from the DDS object.

counts <- counts(dds, normalized=TRUE)
log2 <- log2(counts + 1)

Save Counts

Write the counts files to text files in ~/RNA-Seq

dir.create("~/RNA-Seq/counts")
write.table(counts, "~/RNA-Seq/counts/normalised_counts.txt", sep="\t", quote = F)
write.table(log2, "~/RNA-Seq/counts/log2_counts.txt", sep="\t", quote = F)

Sample Heatmap

A heatmap of sample distances matrix (dist(t(log2))) gives us an overview over similarities and dissimilarities between samples.

sampleDists <- dist(t(log2))
sampleDistMatrix <- as.matrix(sampleDists)
colnames(sampleDistMatrix) <- NULL
pheatmap(mat=sampleDistMatrix,
            clustering_distance_rows=sampleDists,
            clustering_distance_cols=sampleDists,
            col=colorRampPalette( rev(brewer.pal(9, "Blues")) )(255))

PCA

PCA plots the variance explained by samples in each principal component. Typically PC1 & PC2 explain the most variation in the dataset, you should look at these first and foremost.

In the plot below we can see that the lung and control samples cluster tightly together, indicating the replicates were performed rigorously in the lab.

The melanoma samples overlap with the control samples in PC1/PC2. Make the effort to explore other principal components - we can see that in PC2 vs PC4 there is excellent separation between the melanoma and control samples.

PCA can give you a good idea of how succesful the Differential Expression analysis will be - samples that are very close together in PC feature space will not produce as many DE genes as those that are separated by a large distance.

p <- pca(log2, metadata = samples)

biplot(p,
       colby = 'condition',
       colkey = c('melanoma'='royalblue', 'control'='red1', 'lung'='forestgreen'),
       hline = 0,
       vline = 0,
       legendPosition = 'right',
       legendLabSize = 12,
       legendIconSize = 8.0,
       lab = TRUE,
       drawConnectors = TRUE,
       title = 'PCA bi-plot',
       subtitle = 'PC1 versus PC2')

biplot(p,
       x = "PC2",
       y = "PC4",
       colby = 'condition',
       colkey = c('melanoma'='royalblue', 'control'='red1', 'lung'='forestgreen'),
       hline = 0,
       vline = 0,
       legendPosition = 'right',
       legendLabSize = 12,
       legendIconSize = 8.0,
       drawConnectors = TRUE,
       title = 'PCA bi-plot',
       subtitle = 'PC2 versus PC4')

Differential Expression Analysis

DESeq results()

We can extract differentially expressed genes between phenotypes of interest by using results() on the DDS object. Furtehrmore, we will apply the apeglm shrinkage estimator on our results. I will demonstrate the effect of apeglm with volcano plots. In summary apeglm shrinks low confidence (high inter sample variation) differentially expressed genes towards 0, producing a robust set of differentially expressed genes.

The argument coef in lfcShrink refers to the contrast of interest returned by resultsNames(dds). For lung vs. control, it is the 4th character string returned thus coef=4 for lung vs. control.

Set up lung vs control, melanoma vs control:

# make lung vs control object
lung_v_ctrl <- results(dds, filterFun=ihw, alpha=0.05, c("condition", "lung", "control"))
res1 <- lfcShrink(dds=dds, res=lung_v_ctrl, coef=4, type="apeglm")
summary(res1)

## 
## out of 22759 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 2805, 12%
## LFC < 0 (down)     : 3404, 15%
## outliers [1]       : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting

# make melanoma vs control object
melanoma_v_ctrl <- results(dds, filterFun=ihw, alpha=0.05, c("condition", "melanoma", "control"))
res2 <- lfcShrink(dds=dds, res=melanoma_v_ctrl, coef=5, type="apeglm")
summary(res2)

## 
## out of 22759 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 456, 2%
## LFC < 0 (down)     : 423, 1.9%
## outliers [1]       : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting

Set up lung vs melanoma. We need to reconfigure the DDS object so ‘melanoma’ is the reference level in ‘condition’, and re-run DESeq().

Check resultsNames(dds) to ensure the correct contrasts are set:

# to make lung vs melanoma, relevel the dds object reference level and redo the DESeq call
dds$condition <- relevel(dds$condition, ref = "melanoma")
dds <- DESeq(dds)

# double check it worked
resultsNames(dds)

## [1] "Intercept"                     "replicate_2_vs_1"             
## [3] "replicate_3_vs_1"              "condition_control_vs_melanoma"
## [5] "condition_lung_vs_melanoma"

Now extract the contrast results and perform apeglm.

# make lung vs melanoma
lung_v_melanoma <- results(dds, filterFun=ihw, alpha=0.05, c("condition", "lung", "melanoma"))
res3 <- lfcShrink(dds=dds, res=lung_v_melanoma, coef=5, type="apeglm")
summary(res3)

## 
## out of 22759 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 2773, 12%
## LFC < 0 (down)     : 3335, 15%
## outliers [1]       : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting

Take a look at the summary statistics of the number of DE genes. Do they conform to comments made about the PCA plot?

Resusable Functions

Below are some handy functions I re-use all the time for extracting differentially expressed genes and performing annotation of DE genes.

Extract Up Regulated Genes

get_upregulated <- function(df){

    key <- intersect(rownames(df)[which(df$log2FoldChange>=1)], rownames(df)[which(df$pvalue<=0.05)])

  results <- as.data.frame((df)[which(rownames(df) %in% key),])
    return(results)
}

Extract Down Regulated genes

get_downregulated <- function(df){

    key <- intersect(rownames(df)[which(df$log2FoldChange<=-1)], rownames(df)[which(df$pvalue<=0.05)])

    results <- as.data.frame((df)[which(rownames(df) %in% key),])
    return(results)
}

Annotate DE genes

annotate_de_genes <- function(df){

    df$hgnc_symbol <- rownames(df)
    mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
    info <- getBM(attributes=c("hgnc_symbol",
                               "ensembl_gene_id_version",
                               "chromosome_name",
                               "start_position",
                               "end_position",
                               "strand",
                               "entrezgene_description"),
                  filters = c("hgnc_symbol"),
                  values = df$hgnc_symbol,
                  mart = mart,
                  useCache=FALSE)

    tmp <- merge(df, info, by="hgnc_symbol")
    tmp$strand <- gsub("-1", "-", tmp$strand)
    tmp$strand <- gsub("1", "+", tmp$strand)
    tmp$hgnc_symbol <- make.names(tmp$hgnc_symbol, unique = T)
    tmp <- tmp[!grepl("CHR", tmp$chromosome_name),]

    output_col <- c("Gene", "Ensembl ID", "Chromosome", "Start", "Stop", "Strand", "Description", "Log2FC", "P-value", "Adj P-value")
    tmp <- subset(tmp, select=c(hgnc_symbol, ensembl_gene_id_version, chromosome_name, start_position, end_position, strand, entrezgene_description, log2FoldChange, pvalue, padj))
    colnames(tmp) <- output_col

    if(min(tmp$Log2FC) > 0){
        tmp <- tmp[order(-tmp$Log2FC),]
    }else{
        tmp <- tmp[order(tmp$Log2FC),]
    }

    return(tmp)

}

Write DE results

Use the above functions to extract up/down regulated genes and annotate them using biomaRt, and write to a file. Please note that the get_(up/down)regulated function requires a dataframe as input.

Note: we will focus on lung vs control for the remainder of the practical

de_up <- get_upregulated(as.data.frame(res1))
de_down <- get_downregulated(as.data.frame(res1))

upregulated_genes <- annotate_de_genes(de_up)
downregulated_genes <- annotate_de_genes(de_down)

dir.create("~/RNA-Seq/DESeq_results/")
write.table(upregulated_genes, "~/RNA-Seq/DESeq_results/lung_vs_control_upregulated.txt", sep="\t", row.names=F, quote=F)
write.table(downregulated_genes, "~/RNA-Seq/DESeq_results/lung_vs_control_downregulated.txt", sep="\t", row.names=F, quote=F)

Results Plots

Volcano Plot

Volcano plots are useful to show how many genes are differentially expressed in the experimental contrast of interest. Labels are optional, I have included them so you know how to use them.

N.B Volcano plots use -log10 on the Y-axis.

res1 <- na.omit(res1)

min_width <- min(res1$log2FoldChange)
max_width <- max(res1$log2FoldChange)
max_height <- -log10(min(res1[res1$pvalue>0, 5]))

up <- subset(res1, res1$log2FoldChange > 1 & res1$pvalue <= 0.05)
up <- up[order(-up$log2FoldChange),]
up_list <- head(rownames(up), n=10L)

down <- subset(res1, res1$log2FoldChange < -1 & res1$pvalue <= 0.05)
down <- down[order(down$log2FoldChange),]
down_list <- head(rownames(down), n=10L)

plot_top_20 <- c(up_list, down_list)
EnhancedVolcano(res1,
                      lab=rownames(res1),
                    x="log2FoldChange",
                    y="pvalue",
                    selectLab=plot_top_20,
                    drawConnectors=TRUE,
                    FCcutoff=1.0,
                    pCutoff=0.05,
                    title="Volcano Plot",
                    subtitle="Lung vs. Control",
                    legendVisible=F,
                    caption = paste0('Total Genes = ', nrow(res1)),
                    xlim=c(min_width, max_width),
                    ylim=c(0, max_height))

Bonus: apeglm visualised

Make a volcano plot of lung_v_ctrl, the DESeq2 results object that has not been filtered by apeglm.

You can see that there are genes with very large log2 FC values. As a rule of thumb, be skeptical of genes that have LFC values over +/- 10.

res1 <- na.omit(lung_v_ctrl)

min_width <- min(res1$log2FoldChange)
max_width <- max(res1$log2FoldChange)
max_height <- -log10(min(res1[res1$pvalue>0, 5]))

up <- subset(res1, res1$log2FoldChange > 1 & res1$pvalue <= 0.05)
up <- up[order(-up$log2FoldChange),]
up_list <- head(rownames(up), n=10L)

down <- subset(res1, res1$log2FoldChange < -1 & res1$pvalue <= 0.05)
down <- down[order(down$log2FoldChange),]
down_list <- head(rownames(down), n=10L)

plot_top_20 <- c(up_list, down_list)
EnhancedVolcano(res1,
                      lab=rownames(res1),
                    x="log2FoldChange",
                    y="pvalue",
                    selectLab=plot_top_20,
                    drawConnectors=TRUE,
                    FCcutoff=1.0,
                    pCutoff=0.05,
                    title="Volcano Plot",
                    subtitle="Lung vs. Control",
                    legendVisible=F,
                    caption = paste0('Total Genes = ', nrow(res1)))

##reset res1 for heatmaps
res1 <- na.omit(res1)

Complex Heatmap

Heatmaps are another way to show the differentially expressed genes in the experimental contrast of interest. Complexheatmap is as the name suggest, quite complex compared to other heatmap packages but it is undoubtedly the best in terms of flexibility.

# subset the counts matrix to get the lung and control samples
subset <- counts[, 1:6]

# now select de_up, de_down, i.e DE genes that passed the filtering
up <- rownames(de_up)
down <- rownames(de_down)

# subset them
key <- c(up, down)
subset <- subset[which(rownames(subset) %in% key),]

# Scale and center the counts matrix
# Scale works on columns, transpose the matrix to scale and center genes, transpose back.
mat <- t(subset)
mat <- scale(mat, center=T, scale=T)
mat <- t(mat)
mat <- na.omit(mat)

# set up annotation dataframe
ann <- data.frame(Cell_Type = c(rep("control", 3), c(rep("lung", 3))))

# set up heatmap column annotation
ha_col = HeatmapAnnotation(df = ann,
                           col = list(Cell_Type = c("control" =  "gold",
                                                    "lung" = "forestgreen")),
                           annotation_legend_param = list(title_gp = gpar(fontsize = 12, fontface = "bold"),
                                                                          labels_gp = gpar(fontsize = 12)),
                          annotation_name_side = "left")

# set up heatmap object
hm1 <- Heatmap(mat,
              col= colorRamp2(c(-2.6,-1,0,1,2.6),c("blue","skyblue","white","lightcoral","red")),
              heatmap_legend_param=list(at=c(-2.6,-1,0,1,2.6),color_bar="continuous",
                                        legend_direction="vertical", legend_width=unit(5,"cm"),
                                        title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
              name = "Z-score",

              #Row annotation configurations
              cluster_rows=T,
              show_row_dend=T,
              row_title_side="right",
              row_title_gp=gpar(fontsize=8),
              show_row_names=FALSE,
              row_names_side="left",

              #Column annotation configuratiions
              cluster_columns=T,
              show_column_dend=T,
              column_title="Lung vs. Control DE Genes",
              column_title_side="top",
              column_title_gp=gpar(fontsize=15, fontface="bold"),
              show_column_names = T,
              column_names_gp = gpar(fontsize = 12, fontface="bold"),

              #Dendrogram configurations: columns
              clustering_distance_columns="euclidean",
              clustering_method_columns="complete",
              column_dend_height=unit(10,"mm"),

              #Dendrogram configurations: rows
              clustering_distance_rows="euclidean",
              clustering_method_rows="complete",
              row_dend_width=unit(4,"cm"),
              row_dend_side = "left",
              row_dend_reorder = TRUE,

              #Splits
              border=T,
              row_km = 1,
              column_km = 1,

              #plot params
              width = unit(5, "inch"),
              height = unit(4, "inch"),
              #height = unit(0.4, "cm")*nrow(mat),

              #Annotations
              top_annotation = ha_col)

# plot heatmap
draw(hm1, annotation_legend_side = "right", heatmap_legend_side="right")

Heatmap (top 20 genes)

Instead of plotting all DE genes in a heatmap, we can select the top n differentially expressed genes to plot. I have decided to use the largest LFC values to plot the top 20 genes, however you might see people plotting the top n variable genes in this situation.

# same logic as volcano plot
up <- subset(de_up, de_up$log2FoldChange > 1 & de_up$pvalue <= 0.05)
up <- up[order(-up$log2FoldChange),]
up_list <- head(rownames(up), n=10L)

down <- subset(de_down, de_down$log2FoldChange < 1 & de_down$pvalue <= 0.05)
down <- down[order(down$log2FoldChange),]
down_list <- head(rownames(down), n=10L)

plot_top_20 <- c(up_list, down_list)

# subset the counts matrix from last step
top_genes <- mat[which(rownames(mat) %in% plot_top_20),]

hm1 <- Heatmap(top_genes,
              col= colorRamp2(c(-2.6,-1,0,1,2.6),c("blue","skyblue","white","lightcoral","red")),
              heatmap_legend_param=list(at=c(-2.6,-1,0,1,2.6),color_bar="continuous",
                                        legend_direction="vertical", legend_width=unit(5,"cm"),
                                        title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")),
              name = "Z-score",

              #Row annotation configurations
              cluster_rows=T,
              show_row_dend=T,
              row_title_side="right",
              row_title_gp=gpar(fontsize=8),
              show_row_names=TRUE,
              row_names_side="right",

              #Column annotation configuratiions
              cluster_columns=T,
              show_column_dend=T,
              column_title="Lung vs. Control top 20 DE genes",
              column_title_side="top",
              column_title_gp=gpar(fontsize=15, fontface="bold"),
              show_column_names = T,
              column_names_gp = gpar(fontsize = 12, fontface="bold"),

              #Dendrogram configurations: columns
              clustering_distance_columns="euclidean",
              clustering_method_columns="complete",
              column_dend_height=unit(10,"mm"),

              #Dendrogram configurations: rows
              clustering_distance_rows="euclidean",
              clustering_method_rows="complete",
              row_dend_width=unit(4,"cm"),
              row_dend_side = "left",
              row_dend_reorder = TRUE,

              #Splits
              border=T,
              row_km = 1,
              column_km = 1,

              #plot params
              width = unit(5, "inch"),
              height = unit(4, "inch"),
              #height = unit(0.4, "cm")*nrow(mat),

              #Annotations
              top_annotation = ha_col)

# plot heatmap
draw(hm1, annotation_legend_side = "right", heatmap_legend_side="right")

Pathway Analysis

fgsea

Please download the GO Biological processes GMT file from the following link: https://raw.githubusercontent.com/BarryDigby/barrydigby.github.io/master/RNA-Seq/c5.bp.v7.0.symbols.gmt

Create ranked gene list

Convert the DESeq results object to a dataframe, (not the apeglm filtered results object, which has no column named stat) and produce a dataframe of genes with their summary statistic.

## convert result object to dataframe
res <- as.data.frame(lung_v_ctrl)
res$hgnc_symbol <- rownames(res)

# compute summary stat
fgsea_rank <- res %>%
              dplyr::select(hgnc_symbol, stat) %>%
              na.omit() %>%
              distinct() %>%
              group_by(hgnc_symbol) %>%
              summarize(stat=mean(stat))

fgsea_rank

## # A tibble: 22,759 x 2
##    hgnc_symbol   stat
##    <chr>        <dbl>
##  1 ""          -2.14 
##  2 "A1BG"      -0.855
##  3 "A1CF"       0.448
##  4 "A2M"        1.13 
##  5 "A2ML1"      1.49 
##  6 "A2MP1"      2.61 
##  7 "A3GALT2"    0.448
##  8 "A4GALT"     1.52 
##  9 "A4GNT"     -0.104
## 10 "AAAS"      -5.07 
## # … with 22,749 more rows

Convert to a named list

# create named list
rank <- deframe(fgsea_rank)
head(rank, 20)

##                  A1BG       A1CF        A2M      A2ML1      A2MP1    A3GALT2 
## -2.1441358 -0.8549118  0.4475917  1.1329554  1.4913325  2.6087616  0.4477794 
##     A4GALT      A4GNT       AAAS       AACS     AACSP1      AADAC    AADACL3 
##  1.5218178 -0.1035146 -5.0736368 -6.0395784 -0.8230446 -0.1258552  0.1489940 
##    AADACP1      AADAT      AAGAB       AAK1      AAMDC       AAMP 
## -0.4319702 -1.3401389  1.4935266  1.0193265 -2.6350318  1.3554245

Read GMT file

# read in gmt file
pathway <- gmtPathways("~/RNA-Seq/c5.bp.v7.0.symbols.gmt")
head(pathway, 1)

## $GO_POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION
##  [1] "CDK9"    "CHD1"    "DHX9"    "EP300"   "SNW1"    "RRP1B"   "NELFB"  
##  [8] "GTF2B"   "GTF2F1"  "GTF2F2"  "MDFIC"   "HPN"     "JUN"     "LEF1"   
## [15] "ZNF639"  "NELFCD"  "RSF1"    "POLR2A"  "POLR2B"  "POLR2C"  "POLR2D" 
## [22] "POLR2E"  "POLR2F"  "POLR2G"  "POLR2H"  "POLR2I"  "POLR2J"  "POLR2K" 
## [29] "POLR2L"  "NUCKS1"  "SMARCA4" "SMARCB1" "SP1"     "SUPT4H1" "SUPT5H" 
## [36] "TAF11"   "TFAP4"   "NELFA"   "NELFE"   "CCNT1"   "CTDP1"

Run fgsea

# run fgsea
fgsea <- fgsea(pathways=pathway, stats=rank, nperm=1000)

fgseaResTidy <- fgsea %>%
  as_tibble() %>%
  arrange(desc(NES))

# Show in a nice table:
fgseaResTidy %>%
  dplyr::select(-leadingEdge, -ES, -nMoreExtreme) %>%
  arrange(padj) %>%
  DT::datatable()

Enrichment plots

Subset the fgsea results object to include results whose normalised enrichment score is > 2.3. You can use any preferred filtering method here, this is just to return a handful of pathways for the enrichment plot for loop.

filtered_pathway <- subset(fgsea, NES > 2.3)

filt_p <- as.vector(filtered_pathway$pathway)

for (i in filt_p){
    plt <- plotEnrichment(pathway = pathway[[i]],
    gseaParam = 1, ticksSize = 0.5, stats= rank) +
    labs(title=i) + theme(plot.title = element_text(hjust = 0.5, face="bold"))
    print(plt)
}

clusterProfiler

To use cluster profiler, gene IDs must be converted to ENTREZ IDs. We can achieve this using biomaRt and attaching the ENTREZ IDs to the results table. For the sake of the tutorial, we are only going to supply statistically significant genes to clusterProfiler (padj < 0.01)

df <- as.data.frame(res1)
df$hgnc_symbol <- rownames(df)
info <- getBM(attributes=c("hgnc_symbol",
                           "entrezgene_id"),
                  filters = c("hgnc_symbol"),
                  values = df$hgnc_symbol,
                  mart = mart,
                  useCache=FALSE)
tmp <- merge(df, info, by="hgnc_symbol")

# subset the dataframe to include only stat sig genes
tmp <- tmp[tmp$padj < 0.01,]

Run clusterProfiler

Be careful to specify the ENTREZ IDs (which are numeric names) as characters.

OrgDb <- org.Hs.eg.db

geneList <- as.vector(tmp$log2FoldChange)
names(geneList) <- as.character(tmp$entrezgene_id)
gene <- na.omit(as.character(tmp$entrezgene_id))


# GO over-representation test
ego <- clusterProfiler::enrichGO(gene          = gene,
                                 OrgDb         = OrgDb,
                                 ont           = "BP",
                                 pAdjustMethod = "BH",
                                 pvalueCutoff  = 0.001,
                                 qvalueCutoff  = 0.01,
                                 readable      = TRUE)
head(summary(ego))

##                    ID                          Description GeneRatio   BgRatio
## GO:0006260 GO:0006260                      DNA replication  151/4336 274/18670
## GO:0007059 GO:0007059               chromosome segregation  164/4336 321/18670
## GO:0006261 GO:0006261        DNA-dependent DNA replication   98/4336 153/18670
## GO:0000819 GO:0000819         sister chromatid segregation  110/4336 189/18670
## GO:0140014 GO:0140014             mitotic nuclear division  138/4336 264/18670
## GO:0000070 GO:0000070 mitotic sister chromatid segregation   93/4336 151/18670
##                  pvalue     p.adjust       qvalue
## GO:0006260 2.000329e-30 1.280211e-26 9.917421e-27
## GO:0007059 7.179680e-28 2.297498e-24 1.779805e-24
## GO:0006261 3.967488e-27 8.463975e-24 6.556797e-24
## GO:0000819 3.622111e-25 5.795377e-22 4.489511e-22
## GO:0140014 6.803353e-25 8.708291e-22 6.746061e-22
## GO:0000070 5.585756e-24 4.988101e-21 3.864137e-21
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    geneID
## GO:0006260                                                                                                                          ALYREF/ATAD5/ATM/ATRIP/BARD1/BCL6/BLM/BOD1L1/BRCA1/BRCA2/CACYBP/CCNA2/CDC45/CDC6/CDC7/CDK1/CDK2/CDT1/CENPS/CENPX/CHAF1A/CHAF1B/CHEK1/CHEK2/CHTF18/CHTF8/CINP/CLSPN/DBF4/DBF4B/DDX11/DHX9/DNA2/DONSON/DSCC1/DTL/DUT/E2F7/E2F8/EGFR/EME1/EME2/ESCO2/ETAA1/EXO1/FAF1/FANCM/FBXO5/FEN1/GEN1/GINS1/GINS2/GINS3/GINS4/GMNN/GTPBP4/HUS1/ID3/ING4/JUN/KCTD13/LIG1/LIG3/MCM10/MCM2/MCM3/MCM4/MCM5/MCM6/MCM7/MCM8/MCMBP/MGME1/MMS22L/MRE11/MSH3/MSH6/NASP/NFIA/NFIB/NFIX/NUCKS1/ORC1/ORC3/ORC5/ORC6/PCLAF/PCNA/PIF1/POLA2/POLD2/POLD3/POLE/POLE2/POLE3/POLE4/POLQ/PRIM1/PRIM2/PRIMPOL/PTMS/PURA/RAC1/RAD1/RAD51/RAD9B/RBBP7/RBMS1/RECQL/RECQL5/REV1/REV3L/RFC1/RFC2/RFC3/RFC4/RFC5/RFWD3/RHNO1/RMI1/RMI2/RNASEH2A/RPA1/RPA2/RPA3/RRM1/RRM2/RRM2B/SET/SETMAR/SLBP/SLX4/SMC1A/SSBP1/SSRP1/STAG2/SUPT16H/TBRG1/TICRR/TIMELESS/TIPIN/TNFAIP1/TONSL/TOPBP1/TREX1/UPF1/WDHD1/WRN/WRNIP1/ZBTB38/ZNF830
## GO:0007059 ACTR3/AKAP8L/ANAPC1/ANAPC15/ANAPC5/ATM/AURKB/BANF1/BEX4/BIRC5/BRCA1/BUB1/BUB1B/BUB3/CCNB1/CCNB1IP1/CDC20/CDC23/CDC27/CDC42/CDC6/CDCA2/CDCA5/CDCA8/CDK5RAP2/CDT1/CENPE/CENPF/CENPK/CENPN/CENPQ/CENPS/CENPW/CENPX/CEP85/CHAMP1/CHMP1B/CHMP5/CHTF8/CIAO1/CTCF/CTNNB1/DDX11/DLGAP5/DMC1/DSCC1/DSN1/DUSP1/ECT2/EME1/EME2/ESCO2/ESPL1/FAM83D/FANCD2/FANCM/FBXO5/FEN1/FMN2/GEN1/HASPIN/HJURP/HNRNPU/INCENP/KIF14/KIF18A/KIF18B/KIF22/KIF23/KIF2C/KIF4A/KIFC1/KNL1/KNSTRN/KPNB1/LATS1/MACROH2A1/MAD2L1/MAD2L2/MAU2/MCMBP/MIS18A/MKI67/MLH1/MMS19/MRE11/MSH4/MSTO1/NAA10/NAA50/NCAPD2/NCAPG/NCAPH/NCAPH2/NDC1/NDC80/NDE1/NEK2/NR3C1/NSMCE2/NUF2/NUMA1/NUP37/NUP43/NUP62/NUSAP1/OIP5/PCID2/PHF13/PIBF1/PLK1/PMF1/PRC1/PSMG2/PSRC1/PTTG1/RAB11A/RACGAP1/RAD18/RAD51C/RAN/RCC1/RECQL5/RIOK2/RIOK3/RMI1/RMI2/RPS3/RRS1/SEH1L/SETDB2/SFPQ/SGO1/SGO2/SKA1/SKA2/SKA3/SLC25A5/SLF1/SLF2/SLX4/SMC1A/SMC2/SMC4/SMC5/SMC6/SPAG5/SPC25/SPDL1/STAG1/STAG2/STAG3/STAG3L4/TACC3/TLK1/TOP2A/TRIP13/TTK/TTL/TUBG1/UBE2I/XRCC3/ZW10/ZWINT
## GO:0006261                                                                                                                                                                                                                                                                                                                                                                                                                                ALYREF/ATAD5/BCL6/BLM/BOD1L1/BRCA2/CDC45/CDC6/CDC7/CDK2/CDT1/CENPS/CENPX/CHEK2/CHTF18/CHTF8/DBF4/DBF4B/DDX11/DNA2/DONSON/DSCC1/E2F7/E2F8/EME1/EME2/ETAA1/FANCM/FBXO5/FEN1/GEN1/GINS1/GINS2/GINS3/GINS4/GMNN/LIG1/LIG3/MCM10/MCM2/MCM3/MCM4/MCM5/MCM6/MCM7/MCMBP/MGME1/MMS22L/MRE11/MSH3/MSH6/NUCKS1/ORC1/ORC3/ORC5/ORC6/PCNA/POLA2/POLD2/POLD3/POLE/POLE2/POLE3/POLE4/POLQ/PRIM1/PRIM2/PRIMPOL/PURA/RAD51/RECQL/RECQL5/REV3L/RFC1/RFC2/RFC3/RFC4/RFC5/RFWD3/RNASEH2A/RPA1/RPA2/RPA3/RRM2B/SETMAR/SLBP/SMC1A/SSBP1/STAG2/TICRR/TIMELESS/TIPIN/TONSL/UPF1/WDHD1/WRN/WRNIP1/ZNF830
## GO:0000819                                                                                                                                                                                                                                                                                                                       AKAP8L/ANAPC1/ANAPC15/ANAPC5/ATM/AURKB/BUB1/BUB1B/BUB3/CCNB1/CDC20/CDC23/CDC27/CDC6/CDCA5/CDCA8/CDK5RAP2/CDT1/CENPE/CENPF/CENPK/CHAMP1/CHMP1B/CHMP5/CHTF8/CTCF/CTNNB1/DDX11/DLGAP5/DSCC1/DSN1/DUSP1/ESCO2/ESPL1/FBXO5/FEN1/GEN1/HASPIN/HNRNPU/INCENP/KIF14/KIF18A/KIF18B/KIF22/KIF23/KIF2C/KIF4A/KIFC1/KNSTRN/KPNB1/LATS1/MACROH2A1/MAD2L1/MAD2L2/MAU2/MCMBP/MRE11/MSTO1/NAA10/NAA50/NCAPD2/NCAPG/NCAPH/NCAPH2/NDC80/NEK2/NSMCE2/NUF2/NUMA1/NUP62/NUSAP1/PCID2/PHF13/PIBF1/PLK1/PRC1/PSMG2/PSRC1/PTTG1/RAB11A/RACGAP1/RAD51C/RAN/RIOK2/RMI2/RRS1/SEH1L/SFPQ/SGO1/SGO2/SLF1/SLF2/SMC1A/SMC2/SMC4/SMC5/SPAG5/SPDL1/STAG1/STAG2/STAG3/STAG3L4/TACC3/TOP2A/TRIP13/TTK/TUBG1/XRCC3/ZW10/ZWINT
## GO:0140014                                                                                                                                             AAAS/AKAP8L/ANAPC1/ANAPC15/ANAPC5/ANLN/ARHGEF10/ATM/AURKA/AURKB/BCCIP/BIRC5/BORA/BUB1/BUB1B/BUB3/CCNB1/CDC20/CDC23/CDC25C/CDC27/CDC6/CDCA5/CDCA8/CDK5RAP2/CDT1/CENPE/CENPF/CENPK/CEP192/CEP85/CHAMP1/CHEK1/CHEK2/CHMP1B/CHMP5/CHTF8/CLASP2/DLGAP5/DRG1/DSCC1/DSN1/DUSP1/EPS8/ESPL1/FBXO5/GEN1/HASPIN/HNRNPU/HSPA1A/HSPA1B/IL1B/INCENP/INSR/KIF11/KIF14/KIF18A/KIF18B/KIF20B/KIF22/KIF23/KIF2C/KIF4A/KIFC1/KNSTRN/KNTC1/KPNB1/MACROH2A1/MAD2L1/MAD2L2/MAP9/MAU2/MKI67/MSTO1/MTBP/MYBL2/MZT1/NAA10/NAA50/NCAPD2/NCAPG/NCAPH/NCAPH2/NDC80/NDE1/NEK2/NFE2L1/NSMCE2/NUF2/NUMA1/NUP62/NUSAP1/PCID2/PHF13/PHIP/PIBF1/PKMYT1/PLK1/PPP1R9B/PRC1/PSMG2/PSRC1/PTTG1/RAB11A/RACGAP1/RAN/RANBP1/RCC1/RHOA/RIOK2/RPL24/RRS1/SEH1L/SGO1/SLF1/SLF2/SMC1A/SMC2/SMC4/SMC5/SPAG5/SPDL1/SPHK1/STAG1/STAG2/TACC3/TOM1L1/TPX2/TRIP13/TTK/TUBG1/UBE2C/UBE2S/UBXN2B/USP16/XRCC3/ZW10/ZWINT
## GO:0000070                                                                                                                                                                                                                                                                                                                                                                                                                            AKAP8L/ANAPC1/ANAPC15/ANAPC5/ATM/AURKB/BUB1/BUB1B/BUB3/CCNB1/CDC20/CDC23/CDC27/CDC6/CDCA5/CDCA8/CDK5RAP2/CDT1/CENPE/CENPF/CENPK/CHAMP1/CHMP1B/CHMP5/CHTF8/DLGAP5/DSCC1/DSN1/DUSP1/ESPL1/FBXO5/GEN1/HASPIN/HNRNPU/INCENP/KIF14/KIF18A/KIF18B/KIF22/KIF23/KIF2C/KIF4A/KIFC1/KNSTRN/KPNB1/MACROH2A1/MAD2L1/MAD2L2/MAU2/MSTO1/NAA10/NAA50/NCAPD2/NCAPG/NCAPH/NCAPH2/NDC80/NEK2/NSMCE2/NUF2/NUMA1/NUP62/NUSAP1/PCID2/PHF13/PIBF1/PLK1/PRC1/PSMG2/PSRC1/PTTG1/RAB11A/RACGAP1/RAN/RIOK2/RRS1/SEH1L/SGO1/SLF1/SLF2/SMC1A/SMC2/SMC4/SMC5/SPAG5/SPDL1/TACC3/TRIP13/TTK/TUBG1/XRCC3/ZW10/ZWINT
##            Count
## GO:0006260   151
## GO:0007059   164
## GO:0006261    98
## GO:0000819   110
## GO:0140014   138
## GO:0000070    93

CNET plot

Subset the enrichment object, be sure to add asis = T. For the sake of the tutorial, we are selecting pathways with 21 genes in them to reduce the plot size. You may want to filter by pathways or genes of interest here.

clusterProfiler produces a nice gene-concept network plot showing gene fold changes and the pathway to which they belong.

subset <- ego[ego$Count == 21, asis=T]
cnetplot(subset, categorySize="geneNum", foldChange=geneList)

Boxplots

You may wish to demonstrate specific gene expression values using boxplots. We will use the package ggpubr here which produces really nice publication ready plots.

plotCounts

In the example below we will use the DESeq2 plotCounts function, which converts the count matrix for gene \(i\) into long format, and appends metadata. This is the standard input format for plotting categorical data in R.

d <- plotCounts(dds, gene="PTEN", intgroup = "condition", returnData = T)
d

##            count condition
## ctrl_1  8431.116   control
## ctrl_2  9376.712   control
## ctrl_3  9692.598   control
## A549_1 10316.675      lung
## A549_2 10445.949      lung
## A549_3 10317.552      lung
## A375_1  7946.369  melanoma
## A375_2  9958.632  melanoma
## A375_3 10465.176  melanoma

We could use the function to produce count plots (returnData=F), however in my opinion the plot is rather lackluster…

plotCounts(dds, gene="PTEN", intgroup="condition", returnData=F)

Pairwise Plot

If we want to plot lung vs control, we need to remove the melanoma counts from the dataframe:

d <- subset(d, d$condition != "melanoma")
d

##            count condition
## ctrl_1  8431.116   control
## ctrl_2  9376.712   control
## ctrl_3  9692.598   control
## A549_1 10316.675      lung
## A549_2 10445.949      lung
## A549_3 10317.552      lung

Now plot the data using ggpubr:

ggboxplot(d, x = "condition", y = "count",
                 fill = "condition", color = "black", palette = "jco",
                 ylab ="Normalized Counts", title = "PTEN Expression",
                 xlab="", add = c("dotplot"),
                 add.params = list(size=0.75, jitter=0.01),
                 legend = "none", bxp.errorbar = T,
                 bxp.errorbar.width = 0.2, width=0.3, ggtheme = theme_classic()) +
                 theme(axis.text.x = element_text( colour = "black", size=14)) +
                 theme(axis.title.y = element_text( colour = "black", size=14)) +
                 theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

Add significance values

We can perform a pairwise t-test and add the significance values for the boxplot. First, set up the comparison of interest comp using a list. We will call this variable using stat_compare_means.

I like to bump the significance value label slightly higher than the default. To do this dynamically, simply add 5% to the max value and set this as the y axis value for the significance label (signif_yaxis variable).

In the plot below we can see that PTEN is more highly expressed in the lung samples, however not at a significant level.

comps <- list(c("lung", "control"))

signif_yaxis = max(d$count) + (max(d$count)*0.05)

ggboxplot(d, x = "condition", y = "count",
                 fill = "condition", color = "black", palette = "jco",
                 ylab ="Normalized Counts", title = "PTEN Expression",
                 xlab="", add = c("dotplot"),
                 add.params = list(size=0.75, jitter=0.01),
                 legend = "none", bxp.errorbar = T,
                 bxp.errorbar.width = 0.2, width=0.3, ggtheme = theme_classic()) +
                 theme(axis.text.x = element_text( colour = "black", size=14)) +
                 theme(axis.title.y = element_text( colour = "black", size=14)) +
                 theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
                 stat_compare_means(comparisons = comps, method = "t.test", label.y = signif_yaxis)

Anova

Lets plot lung vs control vs melanoma for PNN and run an ANOVA test to see if PNN is significantly over or under expressed in a condition compared to all. Re-run the plotCounts function to obtain the dataframe with all three conditions:

d <- plotCounts(dds, gene="PNN", intgroup = "condition", returnData = T)
d

##           count condition
## ctrl_1 4668.903   control
## ctrl_2 4538.654   control
## ctrl_3 5027.576   control
## A549_1 3183.788      lung
## A549_2 3087.202      lung
## A549_3 3442.206      lung
## A375_1 4191.339  melanoma
## A375_2 4855.743  melanoma
## A375_3 5046.059  melanoma

To perform pairwise test vs. all (i.e the base-mean), use compare_means and set the ref.group to ".all.". We are going to change the signficance labels for each pairwise comparison to ns, *, ** and *** instead of reporting precise p-values. This can be toggled using label in the stat_compare_means function.

We will be plotting two significance labels (ANOVA and pairwise) thus we will need to make two signif_yaxis variables at 5% and 10% of the max value respectively to produce a tidy plot.

In the plot below, we see that PNN is underexpressed in melanoma compared to control and lung, and the ANOVA test suggests the results are significant.

signif_yaxis = max(d$count) + (max(d$count)*0.10)
signif_yaxis2 = max(d$count) + (max(d$count)*0.05)

compare_means(count ~ condition,  data = d,
              ref.group = ".all.", method = "t.test")

## # A tibble: 3 x 8
##   .y.   group1 group2         p p.adj p.format p.signif method
##   <chr> <chr>  <chr>      <dbl> <dbl> <chr>    <chr>    <chr> 
## 1 count .all.  melanoma 0.246   0.25  0.2461   ns       T-test
## 2 count .all.  control  0.117   0.23  0.1166   ns       T-test
## 3 count .all.  lung     0.00612 0.018 0.0061   **       T-test

ggboxplot(d, x = "condition", y = "count",
                 fill = "condition", color = "black", palette = "jco",
                 ylab ="Normalized Counts", title = "PNN Expression",
                 xlab="", add = c("dotplot"),
                 add.params = list(size=0.75, jitter=0.01),
                 legend = "none", bxp.errorbar = T,
                 bxp.errorbar.width = 0.2, width=0.3, ggtheme = theme_classic()) +
                 theme(axis.text.x = element_text( colour = "black", size=14)) +
                 theme(axis.title.y = element_text( colour = "black", size=14)) +
                 theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
                 rotate_x_text(angle = 45) +
                 geom_hline(yintercept = mean(d$count), linetype = 2) +
                 stat_compare_means(method = "anova", label.y = signif_yaxis) +        
                 stat_compare_means(label = "p.signif", method = "t.test",
                                    ref.group = ".all.", label.y = signif_yaxis2)