output: html_document: title: ‘MA5111 RNA-Seq’ toc: true toc_float: true code_folding: hide highlight: zenburn theme: flatly —r::opts_chunk$set(echo = TRUE)



```r
library(dplyr)
library(biomaRt)
library(tximport)
library(rhdf5)
library(gplots)
library(org.Hs.eg.db)
library(DESeq2)
library(DT)
library(apeglm)
library(RColorBrewer)
library(IHW)
library(PCAtools)
library(pheatmap)
library(clusterProfiler)
library(EnhancedVolcano)
library(ComplexHeatmap)
library(circlize)
library(fgsea)
library(tidyverse)
library(ggpubr)

Introduction

This R markdown document walks through an RNA-Seq differential expression analysis using DESeq2 for MA5111. Quality control plots, differential expression results, plots and pathway analysis are covered in the document.

The tutorial assumes you have 9 Kallisto quantification directories and a samples.csv file under /home/rnaseq_data/quant/ as well as a .gmt file that we will use for the gene set enrichment analysis.

quant_dir <- "/home/rstudio/rnaseq_data/quant"
list.files(quant_dir)

##  [1] "A375_1"                 "A375_2"                 "A375_3"                
##  [4] "A549_1"                 "A549_2"                 "A549_3"                
##  [7] "c5.bp.v7.0.symbols.gmt" "ctrl_1"                 "ctrl_2"                
## [10] "ctrl_3"                 "samples.csv"

Reading Input Data

Metadata

Read in the file samples.csv which contains the experimental metadata. Please make sure that the rownames of this dataframe match the kallisto quantification directory names at /home/rnaseq/quant/.

samples <- read.csv(paste0(quant_dir, "/samples.csv"), header=T, row.names = "sample", stringsAsFactors = T)
samples

##        condition replicate
## ctrl_1   control         1
## ctrl_2   control         2
## ctrl_3   control         3
## A549_1      lung         1
## A549_2      lung         2
## A549_3      lung         3
## A375_1  melanoma         1
## A375_2  melanoma         2
## A375_3  melanoma         3

Convert Numerics to Factors

When performing differential expression analysis using DESeq2, if a column in samples.csv is numeric and provided to the design formula, DESeq2 will produce the following message:

the design formula contains a numeric variable with integer values, specifying a model with increasing fold change for higher values. did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function

What does this mean? From Mike Love (DESeq2 author):

There is a constant fold change for every unit of change in replicates. So if the estimated fold change is 2, this implies that replicates 2 = 2x replicates 1, replicates 3 = 2x replicates 2, etc. Or in other words, the relationship is linear on the log counts scale.

This is not what we want! Be really careful with your metadata file as it defines our statistical model design. To fix this, change the replicate column to a categorical factor using factor().

The code block below checks the input metadata file, tells the user if any columns are numeric and converts them to a factor.

samples$replicate <- factor(samples$replicate)

# check its ok:
factor_cols <- sapply(samples, is.factor)
factor_cols

## condition replicate 
##      TRUE      TRUE

Stage Kallisto files

We need to create a file handle object (a named character list) containing the sample IDs and the paths to the kallisto quantification .h5 files. If the rownames of your metadata object (samples) do not match the quantification directory names, you will get an error during the TXI object step.

files <- file.path(quant_dir, rownames(samples), "abundance.h5")
names(files) <- paste0(rownames(samples))
files

##                                                ctrl_1 
## "/home/rstudio/rnaseq_data/quant/ctrl_1/abundance.h5" 
##                                                ctrl_2 
## "/home/rstudio/rnaseq_data/quant/ctrl_2/abundance.h5" 
##                                                ctrl_3 
## "/home/rstudio/rnaseq_data/quant/ctrl_3/abundance.h5" 
##                                                A549_1 
## "/home/rstudio/rnaseq_data/quant/A549_1/abundance.h5" 
##                                                A549_2 
## "/home/rstudio/rnaseq_data/quant/A549_2/abundance.h5" 
##                                                A549_3 
## "/home/rstudio/rnaseq_data/quant/A549_3/abundance.h5" 
##                                                A375_1 
## "/home/rstudio/rnaseq_data/quant/A375_1/abundance.h5" 
##                                                A375_2 
## "/home/rstudio/rnaseq_data/quant/A375_2/abundance.h5" 
##                                                A375_3 
## "/home/rstudio/rnaseq_data/quant/A375_3/abundance.h5"

Tximport

tximport imports transcript-level abundances from quantification tools (kallisto in our case) and converts them to gene counts for downstream analyses using a differential expression analysis package.

BiomaRt

We will use biomaRt to connect to the ENSEMBL databases and map transcript IDs to gene IDs. Firstly, we will create the mart object specifying which database to use:

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

Transcript to gene (tx2gene)

Recall we used the ENSEMBL reference cDNA file Homo_sapiens.GRCh38.cdna.all.fa.gz for Kallisto quant on LUGH. This means that our quantification files have ENSEMBL transcript ID’s. We can map them to gene symbols running the code below:

tx2gene <- getBM(attributes = c("ensembl_transcript_id_version", "hgnc_symbol"), mart = mart, useCache = FALSE)
head(tx2gene)

##   ensembl_transcript_id_version hgnc_symbol
## 1             ENST00000387314.1       MT-TF
## 2             ENST00000389680.2     MT-RNR1
## 3             ENST00000387342.1       MT-TV
## 4             ENST00000387347.2     MT-RNR2
## 5             ENST00000386347.1      MT-TL1
## 6             ENST00000361390.2      MT-ND1

TXI object

Create a txi object summarising kallisto transcipt quantification to the gene-level.

txi <- tximport(files, type = "kallisto", tx2gene = tx2gene)
head(txi$abundance)

##             ctrl_1       ctrl_2       ctrl_3       A549_1       A549_2
##       7.565669e+03 8.212640e+03 7.877847e+03 7745.3681011 7.850105e+03
## A1BG  5.901603e+00 4.889110e+00 5.094564e+00    4.8513322 5.471458e+00
## A1CF  3.783157e-02 3.832046e-03 8.864742e-02    0.0453066 6.511605e-02
## A2M   4.130882e+01 2.953310e+01 3.708851e+01   37.1903231 4.827051e+01
## A2ML1 1.886978e-01 2.740595e-01 7.846124e-02    0.3793876 4.937987e-01
## A2MP1 6.811755e-01 6.825399e-01 5.020745e-01    0.9794183 1.189639e+00
##             A549_3       A375_1       A375_2       A375_3
##       7.888903e+03 7.705010e+03 8.004009e+03 8.018222e+03
## A1BG  4.309155e+00 4.564884e+00 5.934494e+00 3.989345e+00
## A1CF  1.938563e-02 1.034112e-02 3.690397e-02 1.865727e-02
## A2M   4.759494e+01 3.576941e+01 3.101334e+01 4.386300e+01
## A2ML1 2.521947e-01 4.777695e-01 3.322747e-01 3.713090e-01
## A2MP1 1.041572e+00 8.024568e-01 6.518761e-01 8.664537e-01

DESeq2

Now we are ready to create the DDS object for DESeq2 analysis that contains metadata (colData), counts further information that can be obtained via dds@.

Design

The model design is one of the most important steps for an RNA-Seq analysis. Here, we are going to specify the design ~ replicate + condition (columns of the metadata file). The factor of interest goes last in the model terms, we want to compare control vs. lung vs. melanoma in the experiment. Replicate has been included as we want to control for the effect of sample replicates.

DDS

We are going to create the DDS object using DESeqDataSetFromTximport() as we used tximport to convert transcript abundances to gene-level counts.

dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ replicate + condition )

Factor levels

By default, R will choose a reference level for factors based on alphabetical order. We will explicitly set control as the reference level so you can see how it’s done using relevel().

N.B this step must be done prior to creating the dds object using DESeq()! If you want to change the reference level, you must re-run DESeq().

Check that relevel() worked correctly by running resultsNames(dds) - we want control to be the reference level so we can compare lung vs control and melanoma vs. control.

dds$condition <- relevel(dds$condition, ref = "control")
dds <- DESeq(dds)
resultsNames(dds)

## [1] "Intercept"                     "replicate_2_vs_1"             
## [3] "replicate_3_vs_1"              "condition_lung_vs_control"    
## [5] "condition_melanoma_vs_control"

Quality Control

We will use the gene-level counts to perform some quality control checks on the samples in the experiment. We will choose a suitable transformation for the gene counts to reduce impact of lowly-expressed genes on our analysis.

Extract Counts

Extract gene-level counts from the DDS object.

counts <- counts(dds, normalized=TRUE)

###Transform Counts We’ll look at two transformations: log2(), and rlog().

log2_counts <- assay(normTransform(dds))
rld_counts <- assay(rlog(dds))

library(vsn)
library(hexbin)

## x-axis is the transformed mean not the raw mean..

log2_plt <- meanSdPlot(log2_counts, ranks=FALSE, plot=FALSE)
log2_plt$gg + ggtitle("Log2 + PC Transformation") + xlim(0,20)

rld_plt <- meanSdPlot(rld_counts, ranks=FALSE, plot=FALSE)
rld_plt$gg + ggtitle("Rlog Transformation") + xlim(0,20)

Save Counts

Write the counts files

#dir.create("/home/rstudio/rnaseq_data/counts")
write.table(counts, "normalised_counts.txt", sep="\t", quote = F)
write.table(log2_counts, "log2_counts.txt", sep="\t", quote = F)

Sample Heatmap

A heatmap of sample distances matrix (dist(t(log2))) gives us an overview over similarities and dissimilarities between samples.

## Calculate distance between samples
sampleDists <- dist(t(rld_counts))

## Place distances in matrix
sampleDistMatrix <- as.matrix(sampleDists)

## Optional, remove colnames
colnames(sampleDistMatrix) <- NULL

## create annotation dataframe
ann <- data.frame(Condition = samples$condition)

col <- c("blue", "forestgreen", "red1")
names(col) <- c("melanoma", "lung", "control")
ann_col <- list(Condition = col)

## match annotation rownames to distance mat
rownames(ann) <- rownames(sampleDistMatrix)

pheatmap(mat=sampleDistMatrix,
         ## pass distance metric calculated to heatmap
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         ## pass annotation dataframe 
         annotation_col = ann,
         ## add colors
         annotation_colors = ann_col,
         ## heatmap colours
         col=hcl.colors(100,"GnBu",rev=T))

PCA

PCA plots the variance explained by samples in each principal component. Typically PC1 & PC2 explain the most variation in the dataset, you should look at these first and foremost.

In the plot below we can see that samples from the lung and control groups cluster tightly together and that the melanoma samples overlap with the control samples in PC1/PC2. Make the effort to explore other principal components yourdelf - for example, in PC2 vs PC4 there is excellent separation between the melanoma and control samples.

PCA can give you a good idea of how succesful the Differential Expression analysis will be - samples that are very close together in PC feature space will not produce as many DE genes as those that are separated by a large distance.

p <- pca(rld_counts, metadata = samples)

biplot(p,
       colby = 'condition',
       colkey = c('melanoma'='royalblue', 'control'='red1', 'lung'='forestgreen'),
       hline = 0,
       vline = 0,
       legendPosition = 'right',
       legendLabSize = 12,
       legendIconSize = 8.0,
       title = 'PCA bi-plot',
       subtitle = 'PC1 versus PC2')

Differential Expression Analysis

DESeq results()

We can extract differentially expressed genes between phenotypes of interest by using results() on the DDS object. We will apply the apeglm shrinkage estimator on our results. apeglm shrinks low confidence (high inter-sample variation) differentially expressed genes towards 0, producing a robust set of differentially expressed genes.

The argument coef in lfcShrink refers to the contrast of interest returned by resultsNames(dds). For lung vs. control, it is the 4th character string returned thus coef=4 for lung vs. control.

Set up lung vs control, melanoma vs control:

# make lung vs control object
lung_v_ctrl <- results(dds, filterFun=ihw, alpha=0.05, c("condition", "lung", "control"))
res1 <- lfcShrink(dds=dds, res=lung_v_ctrl, coef=4, type="apeglm")
summary(res1)

## 
## out of 22664 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 2661, 12%
## LFC < 0 (down)     : 3253, 14%
## outliers [1]       : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting

# make melanoma vs control object
melanoma_v_ctrl <- results(dds, filterFun=ihw, alpha=0.05, c("condition", "melanoma", "control"))
res2 <- lfcShrink(dds=dds, res=melanoma_v_ctrl, coef=5, type="apeglm")
summary(res2)

## 
## out of 22664 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 439, 1.9%
## LFC < 0 (down)     : 394, 1.7%
## outliers [1]       : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting

Set up lung vs melanoma. We need to reconfigure the DDS object so ‘melanoma’ is the reference level in ‘condition’, and re-run DESeq().

Check resultsNames(dds) to ensure the correct contrasts are set:

# to make lung vs melanoma, relevel the dds object reference level and redo the DESeq call
dds$condition <- relevel(dds$condition, ref = "melanoma")
dds <- DESeq(dds)

# double check it worked
resultsNames(dds)

## [1] "Intercept"                     "replicate_2_vs_1"             
## [3] "replicate_3_vs_1"              "condition_control_vs_melanoma"
## [5] "condition_lung_vs_melanoma"

Now extract the contrast results and perform apeglm.

# make lung vs melanoma
lung_v_melanoma <- results(dds, filterFun=ihw, alpha=0.05, c("condition", "lung", "melanoma"))
res3 <- lfcShrink(dds=dds, res=lung_v_melanoma, coef=5, type="apeglm")
summary(res3)

## 
## out of 22664 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 2646, 12%
## LFC < 0 (down)     : 3192, 14%
## outliers [1]       : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting

Resusable Functions

Below are some handy functions for extracting differentially expressed genes and performing annotation of DE genes.

Extract Up Regulated Genes

get_upregulated <- function(df){
    key <- intersect(rownames(df)[which(df$log2FoldChange>=1)], rownames(df)[which(df$pvalue<=0.05)])
  results <- as.data.frame((df)[which(rownames(df) %in% key),])
    return(results)
}

Extract Down Regulated genes

get_downregulated <- function(df){
    key <- intersect(rownames(df)[which(df$log2FoldChange<=-1)], rownames(df)[which(df$pvalue<=0.05)])
    results <- as.data.frame((df)[which(rownames(df) %in% key),])
    return(results)
}

Annotate DE genes

annotate_de_genes <- function(df){

    df$hgnc_symbol <- rownames(df)
    mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
    info <- getBM(attributes=c("hgnc_symbol",
                               "ensembl_gene_id_version",
                               "chromosome_name",
                               "start_position",
                               "end_position",
                               "strand",
                               "entrezgene_description"),
                  filters = c("hgnc_symbol"),
                  values = df$hgnc_symbol,
                  mart = mart,
                  useCache=FALSE)

    tmp <- merge(df, info, by="hgnc_symbol")
    tmp$strand <- gsub("-1", "-", tmp$strand)
    tmp$strand <- gsub("1", "+", tmp$strand)
    tmp$hgnc_symbol <- make.names(tmp$hgnc_symbol, unique = T)
    tmp <- tmp[!grepl("CHR", tmp$chromosome_name),]

    output_col <- c("Gene", "Ensembl ID", "Chromosome", "Start", "Stop", "Strand", "Description", "Log2FC", "P-value", "Adj P-value")
    tmp <- subset(tmp, select=c(hgnc_symbol, ensembl_gene_id_version, chromosome_name, start_position, end_position, strand, entrezgene_description, log2FoldChange, pvalue, padj))
    colnames(tmp) <- output_col

    if(min(tmp$Log2FC) > 0){
        tmp <- tmp[order(-tmp$Log2FC),]
    }else{
        tmp <- tmp[order(tmp$Log2FC),]
    }

    return(tmp)

}

Write DE results

Use the above functions to extract up/down regulated genes and annotate them using biomaRt, and write to a file. Please note that the get_(up/down)regulated function requires a dataframe as input.

Note: we will focus on lung vs control for the remainder of the practical

de_up <- get_upregulated(as.data.frame(res1))
de_down <- get_downregulated(as.data.frame(res1))
upregulated_genes <- annotate_de_genes(de_up)
downregulated_genes <- annotate_de_genes(de_down)

#confirm these worked
head(upregulated_genes)

##          Gene         Ensembl ID   Chromosome     Start      Stop Strand
## 467     INSRR  ENSG00000027644.5            1 156840063 156859117      -
## 282 FAM187B2P  ENSG00000262497.1           19  35232291  35233003      -
## 290     FDCSP  ENSG00000181617.6            4  70226124  70235252      +
## 781      SAA1  ENSG00000288411.1 HG2111_PATCH     15086     18803      +
## 782    SAA1.1 ENSG00000173432.13           11  18266260  18269977      +
## 601       NDP ENSG00000124479.11            X  43948776  43973395      -
##                                    Description   Log2FC       P-value
## 467          insulin receptor related receptor 8.728400  5.039956e-09
## 282                                            7.402744  7.541358e-06
## 290 follicular dendritic cell secreted protein 7.393437 6.828074e-193
## 781                           serum amyloid A1 7.088209  6.108702e-18
## 782                           serum amyloid A1 7.088209  6.108702e-18
## 601      norrin cystine knot growth factor NDP 6.907855  3.721412e-05
##       Adj P-value
## 467  1.448311e-07
## 282  1.322405e-04
## 290 5.489574e-190
## 781  1.045581e-16
## 782  1.045581e-16
## 601  3.637315e-02

#dir.create("/home/rstudio/rnaseq_data/DESeq_results/")
write.table(upregulated_genes, "lung_vs_control_upregulated.txt", sep="\t", row.names=F, quote=F)
write.table(downregulated_genes, "lung_vs_control_downregulated.txt", sep="\t", row.names=F, quote=F)

Results Plots

Volcano Plot

Volcano plots are useful to show how many genes are differentially expressed in the experimental contrast of interest. Labels are optional, I have included them so you know how to use them.

N.B Volcano plots use -log10 on the Y-axis.

## remove NA values from results
res1 <- na.omit(res1)

## calculate min/max axis values for plot (optional)
min_width <- min(res1$log2FoldChange)
max_width <- max(res1$log2FoldChange)
max_height <- -log10(min(res1[res1$pvalue>0, 5]))

## Grab top 10 up-reg genes for plot
up <- subset(res1, res1$log2FoldChange > 1 & res1$pvalue <= 0.05)
up <- up[order(-up$log2FoldChange),]
up_list <- head(rownames(up), n=10L)

## Grab top 10 down-reg genes for plot
down <- subset(res1, res1$log2FoldChange < -1 & res1$pvalue <= 0.05)
down <- down[order(down$log2FoldChange),]
down_list <- head(rownames(down), n=10L)

## place top 20 DE genes in vector (optinal...)
plot_top_20 <- c(up_list, down_list)

EnhancedVolcano(res1,
                lab=rownames(res1),
                x="log2FoldChange",
                y="pvalue",
                selectLab=plot_top_20,
                drawConnectors=TRUE,
                legendPosition = "none",
                FCcutoff=1.0,
                pCutoff=0.05,
                title="Volcano Plot",
                subtitle="Lung vs. Control",
                caption = paste0('Total Genes = ', nrow(res1)),
                xlim=c(min_width, max_width),
                ylim=c(0, max_height))

Bonus: apeglm visualised

Make a volcano plot of lung_v_ctrl, the DESeq2 results object that has not been filtered by apeglm.

You can see that there are genes with very large log2 FC values. As a rule of thumb, be skeptical of genes that have LFC values over +/- 10.

res1 <- na.omit(lung_v_ctrl)

min_width <- min(res1$log2FoldChange)
max_width <- max(res1$log2FoldChange)
max_height <- -log10(min(res1[res1$pvalue>0, 5]))

up <- subset(res1, res1$log2FoldChange > 1 & res1$pvalue <= 0.05)
up <- up[order(-up$log2FoldChange),]
up_list <- head(rownames(up), n=10L)

down <- subset(res1, res1$log2FoldChange < -1 & res1$pvalue <= 0.05)
down <- down[order(down$log2FoldChange),]
down_list <- head(rownames(down), n=10L)

plot_top_20 <- c(up_list, down_list)
EnhancedVolcano(res1,
                      lab=rownames(res1),
                    x="log2FoldChange",
                    y="pvalue",
                    selectLab=plot_top_20,
                    drawConnectors=TRUE,
                    FCcutoff=1.0,
                    pCutoff=0.05,
                    title="Volcano Plot",
                    subtitle="Lung vs. Control",
                    legendLabSize=8,
                    caption = paste0('Total Genes = ', nrow(res1)))

##reset res1 for heatmaps
res1 <- na.omit(res1)

Heatmap

Heatmaps are another way to show the differentially expressed genes in the experimental contrast of interest.

# subset the counts matrix to get the lung and control samples
subset <- rld_counts[, 1:6]

# now select de_up, de_down, i.e DE genes that passed the filtering our function produced
up <- rownames(de_up)
down <- rownames(de_down)

# subset matrix to include only DE genes
key <- c(up, down)
subset <- subset[which(rownames(subset) %in% key),]

# scale and center the values
mat <- as.matrix(scale(t(subset), center = T))

# basic plot to check we're plotting something sensible
#pheatmap(t(mat))

# spruce it up a bit..
ann <- data.frame(Condition = c(rep("Control", 3), rep("Lung", 3)))
rownames(ann) <- rownames(mat)
col <- c("blue", "forestgreen")
names(col) <- c("Control", "Lung")
ann_col <- list(Condition = col)

pheatmap(t(mat), 
         show_rownames = FALSE,
         annotation_col = ann,
         annotation_colors = ann_col,
         color = hcl.colors(100, "PRGn",rev=F))

Pathway Analysis

fgsea

We will use the GO Biological processes GMT file c5.bp.v7.0.symbols.gmt which should be in the quant directory.

Create ranked gene list

Extract the gene names and associated log2FoldChanges from our lung vs control study to generate a ranked gene list.

## convert result object to dataframe
res <- as.data.frame(res1) # lung vs control
res$hgnc_symbol <- rownames(res)

# compute summary stat
fgsea_rank <- res %>%
              dplyr::select(hgnc_symbol, log2FoldChange) %>%
              na.omit() %>%
              distinct() %>%
              group_by(hgnc_symbol) %>%
              summarize(stat=mean(log2FoldChange))

fgsea_rank

## # A tibble: 22,664 × 2
##    hgnc_symbol    stat
##    <chr>         <dbl>
##  1 ""          -0.0969
##  2 "A1BG"      -0.215 
##  3 "A1CF"      -0.0345
##  4 "A2M"        0.215 
##  5 "A2ML1"      1.03  
##  6 "A2MP1"      0.692 
##  7 "A3GALT2"    1.72  
##  8 "A4GALT"     1.71  
##  9 "A4GNT"     -0.119 
## 10 "AAAS"      -0.444 
## # ℹ 22,654 more rows

Convert to a named list

# create named list
rank <- deframe(fgsea_rank)
head(rank, 20)

##                    A1BG        A1CF         A2M       A2ML1       A2MP1 
## -0.09692611 -0.21512443 -0.03447851  0.21531794  1.03183917  0.69165261 
##     A3GALT2      A4GALT       A4GNT        AAAS        AACS      AACSP1 
##  1.72434705  1.71202195 -0.11870411 -0.44423038 -0.68357989 -1.66781509 
##       AADAC     AADACL3     AADACP1       AADAT       AAGAB        AAK1 
## -0.50864719  0.60883057  1.20853625 -0.18175691  0.12732625  0.16300913 
##       AAMDC        AAMP 
## -0.59608215  0.20323132

Read GMT file

# read in gmt file
pathway <- gmtPathways("/home/rstudio/rnaseq_data/quant/c5.bp.v7.0.symbols.gmt")
head(pathway, 1)

## $GO_POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION
##  [1] "CDK9"    "CHD1"    "DHX9"    "EP300"   "SNW1"    "RRP1B"   "NELFB"  
##  [8] "GTF2B"   "GTF2F1"  "GTF2F2"  "MDFIC"   "HPN"     "JUN"     "LEF1"   
## [15] "ZNF639"  "NELFCD"  "RSF1"    "POLR2A"  "POLR2B"  "POLR2C"  "POLR2D" 
## [22] "POLR2E"  "POLR2F"  "POLR2G"  "POLR2H"  "POLR2I"  "POLR2J"  "POLR2K" 
## [29] "POLR2L"  "NUCKS1"  "SMARCA4" "SMARCB1" "SP1"     "SUPT4H1" "SUPT5H" 
## [36] "TAF11"   "TFAP4"   "NELFA"   "NELFE"   "CCNT1"   "CTDP1"

Run fgsea

# run fgsea
fgsea <- fgsea(pathways=pathway, stats=rank, nperm=1000)

fgseaResTidy <- fgsea %>%
  as_tibble() %>%
  arrange(desc(NES))

# Show in a nice table:
fgseaResTidy %>%
  dplyr::select(-leadingEdge, -ES, -nMoreExtreme) %>%
  arrange(padj) %>%
  DT::datatable()

Enrichment plots

Here, you can see examples of pathways enriched in our lung samples, and pathways enriched in Control (i.e negative NES score)

filtered_pathway <- subset(fgsea, NES > 2.0)
filtered_pathway

##                                           pathway        pval       padj
## 1: GO_POSITIVE_REGULATION_OF_LYMPHOCYTE_MIGRATION 0.001360544 0.03671469
## 2:                     GO_HUMORAL_IMMUNE_RESPONSE 0.001084599 0.03632369
## 3:                       GO_GRANULOCYTE_MIGRATION 0.001150748 0.03632369
## 4:                        GO_NEUTROPHIL_MIGRATION 0.001175088 0.03632369
## 5:                       GO_RESPONSE_TO_CHEMOKINE 0.001200480 0.03632369
##           ES      NES nMoreExtreme size
## 1: 0.7766969 2.025782            0   32
## 2: 0.6240522 2.010755            0  194
## 3: 0.6800119 2.098605            0  118
## 4: 0.6867646 2.082366            0   98
## 5: 0.6974278 2.072428            0   84
##                                  leadingEdge
## 1: CCL20,CCL7,CXCL10,CXCL12,TNFSF14,CCL4,...
## 2:           PI3,BPI,LCN2,C3,CXCL2,CXCL3,...
## 3:      SAA1,CCL20,CCL8,IL36G,CCL7,CXCL2,...
## 4:      SAA1,CCL20,CCL8,IL36G,CCL7,CXCL2,...
## 5:     CCL20,CCL8,CXCR5,CCL7,CCRL2,CXCL2,...

filt_up <- as.vector(filtered_pathway$pathway)

for (i in filt_up){
    plt <- plotEnrichment(pathway = pathway[[i]],
    gseaParam = 1, ticksSize = 0.5, stats= rank) +
    labs(title=i) + theme(plot.title = element_text(hjust = 0.5, face="bold"))
    print(plt)
}

filtered_pathway <- subset(fgsea, NES < -2.6)
filtered_pathway

##                                    pathway        pval       padj        ES
## 1: GO_MITOTIC_SISTER_CHROMATID_SEGREGATION 0.008474576 0.09297966 -0.683861
##          NES nMoreExtreme size                            leadingEdge
## 1: -2.601437            0  148 TRIP13,NUSAP1,AURKB,TTK,NUF2,BUB1B,...

filt_down <- as.vector(filtered_pathway$pathway)

for (i in filt_down){
    plt <- plotEnrichment(pathway = pathway[[i]],
    gseaParam = 1, ticksSize = 0.5, stats= rank) +
    labs(title=i) + theme(plot.title = element_text(hjust = 0.5, face="bold"))
    print(plt)
}