Test and Visualise Gene Enrichment with GSEA over MSigDB

# dataset
library(airway)

# tidyverse core packages
library(tibble)
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(ggplot2)
library(purrr)

# tidyverse-friendly packages
library(plotly)
library(ggrepel)
library(GGally)
library(tidybulk)
library(tidySummarizedExperiment) # we'll load this below to show what it can do
library(enrichplot)
library(patchwork)

tidySummarizedExperiment provides a bridge between Bioconductor SummarizedExperiment [@morgan2020summarized] and the tidyverse [@wickham2019welcome]. It enables viewing the Bioconductor SummarizedExperiment object as a tidyverse tibble, and provides SummarizedExperiment-compatible dplyr, tidyr, ggplot and plotly functions. This allows users to get the best of both Bioconductor and tidyverse worlds.

Here we will demonstrate performing a bulk RNA sequencing analysis using tidySummarizedExperiment and tidybulk. We will use data from the airway package, which comes from the paper by [@himes2014rna]. It includes 8 samples from human airway smooth muscle cells, from 4 cell lines. For each cell line treated (with dexamethasone) and untreated (negative control) a sample has undergone RNA sequencing and gene counts have been generated.

# load airway RNA sequencing data
data(airway)
# take a look
airway

## # A SummarizedExperiment-tibble abstraction: 512,816 x 13
## [90m# Transcripts=64102 | Samples=8 | Assays=counts[39m
##    feature sample counts SampleName cell  dex   albut Run   avgLength Experiment
##    <chr>   <chr>   <int> <fct>      <fct> <fct> <fct> <fct>     <int> <fct>     
##  1 ENSG00… SRR10…    679 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  2 ENSG00… SRR10…      0 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  3 ENSG00… SRR10…    467 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  4 ENSG00… SRR10…    260 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  5 ENSG00… SRR10…     60 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  6 ENSG00… SRR10…      0 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  7 ENSG00… SRR10…   3251 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  8 ENSG00… SRR10…   1433 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  9 ENSG00… SRR10…    519 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
## 10 ENSG00… SRR10…    394 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
## # … with 40 more rows, and 3 more variables: Sample <fct>, BioSample <fct>,
## #   coordinate <list>

Full pipeline

In one modular step it is possible to go from raw counts to enriched pathways

airway %>%
  
  # Annotate
  mutate(entrez = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                                        keys = feature,
                                        keytype = "ENSEMBL",
                                        column = "ENTREZID",
                                        multiVals = "first"
  )) %>% 

  # Filter
  keep_abundant(factor_of_interest = dex) %>%
  
  # Test differential gene transcript abundance
  test_differential_abundance(
    ~ dex + cell,
    method="edger_robust_likelihood_ratio", 
    test_above_log2_fold_change = 1,
  ) %>% 
  
  # Test gene enrichment
  filter(PValue   %>% is.na %>% `!`) %>%
  test_gene_rank(
    .entrez = entrez,
    .arrange_desc = logFC ,
    species="Homo sapiens",
    gene_sets = c("H", "C2", "C5")
  )

##

## 'select()' returned 1:many mapping between keys and columns

## =====================================
## tidybulk says: All testing methods use raw counts, irrespective of if scale_abundance
## or adjust_abundance have been calculated. Therefore, it is essential to add covariates
## such as batch effects (if applicable) in the formula.
## =====================================

## tidybulk says: The design column names are "(Intercept), dexuntrt, cellN061011, cellN080611, cellN61311"

## tidybulk says: to access the raw results (fitted GLM) do `attr(..., "internals")$edger_robust_likelihood_ratio`

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.

## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.

## leading edge analysis...

## done...

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.

## leading edge analysis...

## done...

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.

## leading edge analysis...

## done...

Step-by-step

We’ll set up the airway data for our RNA sequencing analysis. We’ll create a column with shorter sample names and a column with entrez ID. We can get the entrez ID for these Ensembl gene ids using the Bioconductor annotation package for human, org.Hs.eg.db.

# setup data workflow
counts <-
  airway %>%
  mutate(entrez = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                                        keys = feature,
                                        keytype = "ENSEMBL",
                                        column = "ENTREZID",
                                        multiVals = "first"
  ))

## 'select()' returned 1:many mapping between keys and columns

# take a look
counts

## # A SummarizedExperiment-tibble abstraction: 512,816 x 14
## [90m# Transcripts=64102 | Samples=8 | Assays=counts[39m
##    feature sample counts SampleName cell  dex   albut Run   avgLength Experiment
##    <chr>   <chr>   <int> <fct>      <fct> <fct> <fct> <fct>     <int> <fct>     
##  1 ENSG00… SRR10…    679 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  2 ENSG00… SRR10…      0 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  3 ENSG00… SRR10…    467 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  4 ENSG00… SRR10…    260 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  5 ENSG00… SRR10…     60 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  6 ENSG00… SRR10…      0 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  7 ENSG00… SRR10…   3251 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  8 ENSG00… SRR10…   1433 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  9 ENSG00… SRR10…    519 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
## 10 ENSG00… SRR10…    394 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
## # … with 40 more rows, and 4 more variables: Sample <fct>, BioSample <fct>,
## #   entrez <chr>, coordinate <list>

We filter out lowly expressed genes using tidybulk keep_abundant or identify_abundant. These functions can use the edgeR filterByExpr function described in [@law2016rna] to automatically identify the genes with adequate abundance for differential expression testing.

# Filtering counts
counts_abundant <- counts %>% 
      keep_abundant(factor_of_interest = dex) 

# take a look
counts_abundant

## # A SummarizedExperiment-tibble abstraction: 127,408 x 15
## [90m# Transcripts=15926 | Samples=8 | Assays=counts[39m
##    feature sample counts SampleName cell  dex   albut Run   avgLength Experiment
##    <chr>   <chr>   <int> <fct>      <fct> <fct> <fct> <fct>     <int> <fct>     
##  1 ENSG00… SRR10…    679 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  2 ENSG00… SRR10…    467 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  3 ENSG00… SRR10…    260 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  4 ENSG00… SRR10…     60 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  5 ENSG00… SRR10…   3251 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  6 ENSG00… SRR10…   1433 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  7 ENSG00… SRR10…    519 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  8 ENSG00… SRR10…    394 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  9 ENSG00… SRR10…    172 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
## 10 ENSG00… SRR10…   2112 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
## # … with 40 more rows, and 5 more variables: Sample <fct>, BioSample <fct>,
## #   entrez <chr>, .abundant <lgl>, coordinate <list>

tidybulk integrates several popular methods for differential transcript abundance testing: the edgeR quasi-likelihood [@chen2016reads] (tidybulk default method), edgeR likelihood ratio [@mccarthy2012differential], limma-voom [@law2014voom] and DESeq2 [@love2014moderated]. A common question researchers have is which method to choose. With tidybulk, we can easily run multiple methods and compare.

We give test_differential_abundance our tidybulk counts object and a formula, specifying the column that contains our groups to be compared. If all our samples were from the same cell line, and there were no additional factors contributing variance, such as batch differences, we could use the formula ~ dex. However, each treated and untreated sample is from a different cell line, so we add the cell line as an additional factor ~ dex + cell.

de_all <-
  counts_abundant %>%
  test_differential_abundance(
    ~ dex + cell,
    method="edger_robust_likelihood_ratio", 
    test_above_log2_fold_change = 1,
  )

Execute GSEA using MSigDB, clusterProfiler and enrichplot

de_all_gene_rank = 
  de_all %>%
  filter(PValue   %>% is.na %>% `!`) %>%
  test_gene_rank(
    .entrez = entrez,
    .arrange_desc = logFC ,
    species="Homo sapiens",
    gene_sets = c("H", "C2", "C5")
  )

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.

## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.

## leading edge analysis...

## done...

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.

## leading edge analysis...

## done...

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.

## leading edge analysis...

## done...

Examine significantly enriched gene sets

de_all_gene_rank %>%
  filter(gs_cat  == "C2" ) %>%
  dplyr::select(-fit) %>%
  unnest(test) %>% 
  filter(p.adjust < 0.05)

Visualise enrichment

de_all_gene_rank_plots = 
  de_all_gene_rank  %>% 
    unnest(test) %>% 
  
  # Select top 10
    slice(1:10) %>% 
    mutate(plot = pmap(
       list(fit, ID, idx_for_plotting, p.adjust), 
       ~ enrichplot::gseaplot2(
           ..1, 
           geneSetID = ..3, 
           title = sprintf("%s \nadj pvalue %s", ..2, round(..4, 2)),
           base_size = 6, rel_heights = c(1.5, 0.5), subplots = c(1, 2)
       )  
  ))

# Visualise the first plot
de_all_gene_rank_plots %>% 
  pull(plot) %>% 
  wrap_plots()

List all citations used in this analysis

c(
  get_bibliography(de_all_gene_rank),
  get_bibliography(de_all)
)

##  @Article{tidybulk,
##   title = {tidybulk: an R tidy framework for modular transcriptomic data analysis},
##   author = {Stefano Mangiola and Ramyar Molania and Ruining Dong and Maria A. Doyle & Anthony T. Papenfuss},
##   journal = {Genome Biology},
##   year = {2021},
##   volume = {22},
##   number = {42},
##   url = {https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7},
##   }
## @article{wickham2019welcome,
##   title={Welcome to the Tidyverse},
##   author={Wickham, Hadley and Averick, Mara and Bryan, Jennifer and Chang, Winston and McGowan, Lucy D'Agostino and Francois, Romain and Grolemund, Garrett and Hayes, Alex and Henry, Lionel and Hester, Jim and others},
##   journal={Journal of Open Source Software},
##   volume={4},
##   number={43},
##   pages={1686},
##   year={2019}
##  }
## @article{robinson2010edger,
##   title={edgeR: a Bioconductor package for differential expression analysis of digital gene expression data},
##   author={Robinson, Mark D and McCarthy, Davis J and Smyth, Gordon K},
##   journal={Bioinformatics},
##   volume={26},
##   number={1},
##   pages={139--140},
##   year={2010},
##   publisher={Oxford University Press}
##  }
## @article{zhou2014robustly,
##   title={Robustly detecting differential expression in RNA sequencing data using observation weights},
##   author={Zhou, Xiaobei and Lindsay, Helen and Robinson, Mark D},
##   journal={Nucleic acids research},
##   volume={42},
##   number={11},
##   pages={e91--e91},
##   year={2014},
##   publisher={Oxford University Press}
##  }
## @article{McCarthy2009,
##   doi = {10.1093/bioinformatics/btp053},
##   url = {https://doi.org/10.1093/bioinformatics/btp053},
##   year = {2009},
##   month = jan,
##   publisher = {Oxford University Press ({OUP})},
##   volume = {25},
##   number = {6},
##   pages = {765--771},
##   author = {D. J. McCarthy and G. K. Smyth},
##   title = {Testing significance relative to a fold-change threshold is a {TREAT}},
##   journal = {Bioinformatics}
## }
## @Article{,
##     title = {clusterProfiler: an R package for comparing biological themes among gene clusters},
##     author = {Guangchuang Yu and Li-Gen Wang and Yanyan Han and Qing-Yu He},
##     journal = {OMICS: A Journal of Integrative Biology},
##     year = {2012},
##     volume = {16},
##     number = {5},
##     pages = {284-287},
##     pmid = {22455463},
##     doi = {10.1089/omi.2011.0118},
##   }
##   @Manual{,
##     title = {enrichplot: Visualization of Functional Enrichment Result},
##     author = {Guangchuang Yu},
##     year = {2021},
##     note = {R package version 1.11.2.994},
##     url = {https://yulab-smu.top/biomedical-knowledge-mining-book/},
##   }
## @Manual{,
##     title = {msigdbr: MSigDB Gene Sets for Multiple Organisms in a Tidy Data Format},
##     author = {Igor Dolgalev},
##     year = {2020},
##     note = {R package version 7.1.1},
##     url = {https://CRAN.R-project.org/package=msigdbr},
##   }
##  @Article{tidybulk,
##   title = {tidybulk: an R tidy framework for modular transcriptomic data analysis},
##   author = {Stefano Mangiola and Ramyar Molania and Ruining Dong and Maria A. Doyle & Anthony T. Papenfuss},
##   journal = {Genome Biology},
##   year = {2021},
##   volume = {22},
##   number = {42},
##   url = {https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7},
##   }
## @article{wickham2019welcome,
##   title={Welcome to the Tidyverse},
##   author={Wickham, Hadley and Averick, Mara and Bryan, Jennifer and Chang, Winston and McGowan, Lucy D'Agostino and Francois, Romain and Grolemund, Garrett and Hayes, Alex and Henry, Lionel and Hester, Jim and others},
##   journal={Journal of Open Source Software},
##   volume={4},
##   number={43},
##   pages={1686},
##   year={2019}
##  }
## @article{robinson2010edger,
##   title={edgeR: a Bioconductor package for differential expression analysis of digital gene expression data},
##   author={Robinson, Mark D and McCarthy, Davis J and Smyth, Gordon K},
##   journal={Bioinformatics},
##   volume={26},
##   number={1},
##   pages={139--140},
##   year={2010},
##   publisher={Oxford University Press}
##  }
## @article{zhou2014robustly,
##   title={Robustly detecting differential expression in RNA sequencing data using observation weights},
##   author={Zhou, Xiaobei and Lindsay, Helen and Robinson, Mark D},
##   journal={Nucleic acids research},
##   volume={42},
##   number={11},
##   pages={e91--e91},
##   year={2014},
##   publisher={Oxford University Press}
##  }
## @article{McCarthy2009,
##   doi = {10.1093/bioinformatics/btp053},
##   url = {https://doi.org/10.1093/bioinformatics/btp053},
##   year = {2009},
##   month = jan,
##   publisher = {Oxford University Press ({OUP})},
##   volume = {25},
##   number = {6},
##   pages = {765--771},
##   author = {D. J. McCarthy and G. K. Smyth},
##   title = {Testing significance relative to a fold-change threshold is a {TREAT}},
##   journal = {Bioinformatics}
## }

## NULL

Test and Visualise Gene Enrichment with GSEA over MSigDB

Stefano Mangiola

2021-07-12

Full pipeline

Step-by-step