Enrichment analysis

Here we perform enrichment analysis for the differentially expressed genes. The input required for this analysis is: 1. Gene symbols 2. log2 fold change values

Here we read the DESeq2 output file from the previous analysis, you can use any file which has gene symbols and their respective log2 fold change values.

#install and load the required packages
#BiocManager::install("clusterProfiler", version = "3.8")
#BiocManager::install("pathview")
#BiocManager::install("enrichplot")

library(clusterProfiler)

##

## clusterProfiler v4.0.0  For help: https://guangchuangyu.github.io/software/clusterProfiler
## 
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

## 
## Attaching package: 'clusterProfiler'

## The following object is masked from 'package:stats':
## 
##     filter

library(enrichplot)
# we use ggplot2 to add x axis labels (ex: ridgeplot)
library(ggplot2)
library("org.Hs.eg.db")

## Loading required package: AnnotationDbi

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: IRanges

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:clusterProfiler':
## 
##     rename

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:clusterProfiler':
## 
##     slice

## 
## Attaching package: 'AnnotationDbi'

## The following object is masked from 'package:clusterProfiler':
## 
##     select

##

res = read.csv("DESeqResults.csv")
head(res)

##                 X GeneSymbol    baseMean log2FoldChange     lfcSE       stat
## 1 ENSG00000000003     TSPAN6 1059.836945      0.2251021 0.1770706  1.2712567
## 2 ENSG00000000005       TNMD    6.098357     -0.3158054 0.4292711 -0.7356782
## 3 ENSG00000000419       DPM1  413.638732      0.7281105 0.1315831  5.5334650
## 4 ENSG00000000457      SCYL3  172.270121     -0.2422349 0.1094476 -2.2132508
## 5 ENSG00000000460   C1orf112   67.484755      0.9520331 0.2115252  4.5008025
## 6 ENSG00000000938        FGR   95.419268      0.2971990 0.2385383  1.2459175
##         pvalue         padj
## 1 2.036373e-01 3.402823e-01
## 2 4.619266e-01 6.131908e-01
## 3 3.139656e-08 1.811794e-06
## 4 2.688036e-02 7.163057e-02
## 5 6.769740e-06 1.046832e-04
## 6 2.127947e-01 3.514572e-01

#remove any rows where GeneSymbol is not present
num = which(is.na(res$GeneSymbol))
res = res[-num, ]

# we want the log2 fold change 
original_gene_list = res$log2FoldChange

# name the vector
names(original_gene_list) <- res$GeneSymbol

# omit any NA values 
gene_list<-na.omit(original_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)

gse <- gseGO(geneList=gene_list, 
             ont ="ALL", 
             keyType = "SYMBOL",
             minGSSize = 3, 
             maxGSSize = 800, 
             pvalueCutoff = 0.05, 
             verbose = TRUE, 
             OrgDb = org.Hs.eg.db, 
             pAdjustMethod = "none")

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.02% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.

## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.

## leading edge analysis...

## done...

The parameters used in the gseGO command are:

keyType: This is the source of the annotation (gene ids). The options vary for each annotation. In the example of org.Hs.eg.db, the options are:

“ACCNUM” “ALIAS” “ENSEMBL” “ENSEMBLPROT” “ENSEMBLTRANS” “ENTREZID” “ENZYME” “EVIDENCE” “EVIDENCEALL” “FLYBASE” “FLYBASECG” “FLYBASEPROT” “GENENAME” “GO” “GOALL” “MAP” “ONTOLOGY” “ONTOLOGYALL” “PATH” “PMID” “REFSEQ” “SYMBOL” “UNIGENE” “UNIPROT”

Check which options are available with the keytypes command, for example keytypes(org.Hs.eg.db).

ont: one of “BP”, “MF”, “CC” or “ALL”

minGSSize: minimum number of genes in set (gene sets with lower than this many genes in your dataset will be ignored).

maxGSSize: maximum number of genes in set (gene sets with greater than this many genes in your dataset will be ignored).

pvalueCutoff: pvalue Cutoff.

pAdjustMethod: one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”

require(DOSE)

## Loading required package: DOSE

## DOSE v3.18.0  For help: https://guangchuangyu.github.io/software/DOSE
## 
## If you use DOSE in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609

dotplot(gse, showCategory=3, split=".sign") + facet_grid(.~.sign)

The dotplot shows the ontology terms which are either activated (up regulated) or suppresed(down regulated) in the dataset. The number of ontology terms per feature (BP, MF, CC) to be displayed is determined by the showCateory function. The color of the dots indicate the pvalues for each of the terms and the size of the dots depicts the number of genes overlapping in each feature.

x2 = pairwise_termsim(gse)
emapplot(x2, showCategory = 20)

Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional modules.

# Use the `Gene Set` param for the index in the title, and as the value for geneSetId
gseaplot(gse, by = "all", title = gse$Description[1], geneSetID = 1)

Plot of the Running Enrichment Score (green line) for a gene set as the analysis walks down the ranked gene list, including the location of the maximum enrichment score (the red line). The black lines in the Running Enrichment Score show where the members of the gene set appear in the ranked list of genes, indicating the leading edge subset.

The Ranked list metric shows the value of the ranking metric (log2 fold change) as you move down the list of ranked genes. The ranking metric measures a gene’s correlation with a phenotype.

Params:

Gene Set Integer: Corresponds to gene set in the gse object. The first gene set is 1, second gene set is 2, etc.

For KEGG pathway enrichment using the gseKEGG() function, we need to convert id types. We can use the bitr function for this (included in clusterProfiler). It is normal for this call to produce some messages / warnings.

In the bitr function, the param fromType should be the same as keyType from the gseGO function above (the annotation source). This param is used again in the next two steps: creating dedup_ids and df2.

toType: in the bitr function has to be one of the available options from keyTypes(org.Dm.eg.db) and must map to one of ‘kegg’, ‘ncbi-geneid’, ‘ncib-proteinid’ or ‘uniprot’ because gseKEGG() only accepts one of these 4 options as it’s keytype parameter.

As our intial input, we use original_gene_list which we created above.

# Convert gene IDs for gseKEGG function
# We will lose some genes here because not all IDs will be converted
ids = bitr(names(original_gene_list), fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)

## 'select()' returned 1:many mapping between keys and columns

# remove duplicate IDS (here we use "SYMBOL", but it should be whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids[c("SYMBOL")]),]

# Create a new dataframe df2 which has the respective entrez IDs for the gene symbols.
colnames(dedup_ids) = c("GeneSymbol", "EntrezID")
df2 = merge(res, dedup_ids, by = "GeneSymbol")

# Create a vector of the gene universe
kegg_gene_list = df2$log2FoldChange

# Name vector with ENTREZ ids
names(kegg_gene_list) = df2$EntrezID

# omit any NA values 
kegg_gene_list = na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)

kegg_organism = "hsa"
kk2 = gseKEGG(geneList     = kegg_gene_list,
               organism     = kegg_organism,
               minGSSize    = 3,
               maxGSSize    = 800,
               pvalueCutoff = 0.05,
               pAdjustMethod = "none",
               keyType       = "ncbi-geneid")

## Reading KEGG annotation online:

## Reading KEGG annotation online:
## 
## Reading KEGG annotation online:

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.02% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.

## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.

## leading edge analysis...

## done...

organism: KEGG Organism Code: The full list is here: https://www.genome.jp/kegg/catalog/org_list.html (need the 3 letter code). Here we define this as kegg_organism first, because it is used again below when making the pathview plots.

nPerm: the higher the number of permutations you set, the more accurate your result will, but the longer the analysis will take.

minGSSize: minimum number of genes in set (gene sets with lower than this many genes in your dataset will be ignored).

maxGSSize: maximum number of genes in set (gene sets with greater than this many genes in your dataset will be ignored).

pvalueCutoff: pvalue Cutoff.

pAdjustMethod: one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”.

keyType: one of ‘kegg’, ‘ncbi-geneid’, ‘ncib-proteinid’ or ‘uniprot’.

dotplot(kk2, showCategory = 5, title = "Enriched Pathways" , split=".sign") + facet_grid(.~.sign)

You can view the pathway enrichment results by extracting the results matrix from the kk2 object. The enriched pathways are stored in the ID and Description columns.

out = as.matrix(kk2@result)
out = out[, 1:10]
head(out)

##          ID         Description                               setSize
## hsa04657 "hsa04657" "IL-17 signaling pathway"                 " 91"  
## hsa04110 "hsa04110" "Cell cycle"                              "124"  
## hsa03013 "hsa03013" "RNA transport"                           "167"  
## hsa04080 "hsa04080" "Neuroactive ligand-receptor interaction" "297"  
## hsa03008 "hsa03008" "Ribosome biogenesis in eukaryotes"       " 77"  
## hsa03010 "hsa03010" "Ribosome"                                "135"  
##          enrichmentScore NES         pvalue         p.adjust      
## hsa04657 " 0.6678361"    " 2.621631" "1.000000e-10" "1.000000e-10"
## hsa04110 " 0.5477487"    " 2.239610" "1.316918e-09" "1.316918e-09"
## hsa03013 " 0.4899345"    " 2.064973" "6.361275e-09" "6.361275e-09"
## hsa04080 "-0.4684143"    "-1.831906" "3.275353e-08" "3.275353e-08"
## hsa03008 " 0.5784146"    " 2.212195" "3.192017e-07" "3.192017e-07"
## hsa03010 " 0.4711147"    " 1.962188" "6.768215e-07" "6.768215e-07"
##          qvalues        rank   leading_edge                    
## hsa04657 "2.357895e-08" "1300" "tags=30%, list=6%, signal=29%" 
## hsa04110 "1.552578e-07" "4404" "tags=48%, list=19%, signal=39%"
## hsa03013 "4.999739e-07" "6500" "tags=57%, list=28%, signal=41%"
## hsa04080 "1.930734e-06" "3500" "tags=40%, list=15%, signal=34%"
## hsa03008 "1.505288e-05" "6628" "tags=68%, list=29%, signal=48%"
## hsa03010 "2.659790e-05" "7287" "tags=59%, list=32%, signal=40%"

#Geneset plot

# Use the `Gene Set` param for the index in the title, and as the value for geneSetId
gseaplot(kk2, by = "all", title = kk2$Description[1], geneSetID = 1)

Traditional method for visualizing GSEA result.

Params:

geneSetID: Corresponds to gene set in the gse object. The first gene set is 1, second gene set is 2, etc. Default: 1

library(pathview)

## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################

r1 = pathview(gene.data = kegg_gene_list, pathway.id = "hsa04657", species = kegg_organism)

## Info: Downloading xml files for hsa04657, 1/1 pathways..

## Info: Downloading png files for hsa04657, 1/1 pathways..

## 'select()' returned 1:1 mapping between keys and columns

## Info: Working in directory /home/pranali

## Info: Writing image file hsa04657.pathview.png

This will create a PNG and different PDF of the enriched KEGG pathway.

Params:

gene.data: This is kegg_gene_list created above

pathway.id: The user needs to enter this. Enriched pathways + the pathway ID are provided in the gseKEGG output table (above).

species: Same as organism above in gseKEGG, which we defined as kegg_organism

Enrichment analysis

Pranali

12/07/2021