This workshop demonstrates how to perform ELMER analysis using matched RNA-seq and DNA methylation data.
You can find a detailed information about ELMER at http://bioconductor.org/packages/3.10/bioc/vignettes/ELMER/inst/doc/index.html.
Articles about ELMER:
Students will have a chance to run ELMER analysis on a provided MultiAssayExperiment
object created from TCGA data from GDC data portal.
The RNA-seq data, DNA methylation data and patients metadata (i.e age, gender) used in this workshop is structured as a MultiAssayExperiment. You can read more about a MultiAssayExperiment object at http://bioconductor.org/packages/MultiAssayExperiment/ and https://bioconductor.github.io/BiocWorkshops/workflow-for-multi-omics-analysis-with-multiassayexperiment.html.
Through the next section we will:
The data is available in this google drive.
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] DNA methylation: RangedSummarizedExperiment with 148951 rows and 171 columns
## [2] Gene expression: RangedSummarizedExperiment with 56830 rows and 171 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## ExperimentList class object of length 2:
## [1] DNA methylation: RangedSummarizedExperiment with 148951 rows and 171 columns
## [2] Gene expression: RangedSummarizedExperiment with 56830 rows and 171 columns
## [1] 56830
## [1] 171
## GRanges object with 56830 ranges and 3 metadata columns:
## seqnames ranges strand | ensembl_gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000000003 chrX 100627109-100639991 - | ENSG00000000003
## ENSG00000000005 chrX 100584802-100599885 + | ENSG00000000005
## ENSG00000000419 chr20 50934867-50958555 - | ENSG00000000419
## ENSG00000000457 chr1 169849631-169894267 - | ENSG00000000457
## ENSG00000000460 chr1 169662007-169854080 + | ENSG00000000460
## ... ... ... ... . ...
## ENSG00000281904 chr2 90365737-90367699 + | ENSG00000281904
## ENSG00000281909 chr15 22480439-22484840 - | ENSG00000281909
## ENSG00000281910 chr16 58559796-58559931 - | ENSG00000281910
## ENSG00000281912 chr1 45303910-45305619 + | ENSG00000281912
## ENSG00000281920 chr2 65623272-65628424 + | ENSG00000281920
## external_gene_name original_ensembl_gene_id
## <character> <character>
## ENSG00000000003 TSPAN6 ENSG00000000003.13
## ENSG00000000005 TNMD ENSG00000000005.5
## ENSG00000000419 DPM1 ENSG00000000419.11
## ENSG00000000457 SCYL3 ENSG00000000457.12
## ENSG00000000460 C1orf112 ENSG00000000460.15
## ... ... ...
## ENSG00000281904 AC233263.6 ENSG00000281904.1
## ENSG00000281909 HERC2P7 ENSG00000281909.1
## ENSG00000281910 SNORA50A ENSG00000281910.1
## ENSG00000281912 LINC01144 ENSG00000281912.1
## ENSG00000281920 AC007389.5 ENSG00000281920.1
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
## TCGA-2H-A9GF-01A-11R-A37I-31 TCGA-2H-A9GG-01A-11R-A37I-31
## ENSG00000000003 17.63165 16.758267
## ENSG00000000005 0.00000 7.721927
## ENSG00000000419 19.33125 19.188030
## ENSG00000000457 15.87565 16.251205
## TCGA-2H-A9GH-01A-11R-A37I-31 TCGA-2H-A9GI-01A-11R-A37I-31
## ENSG00000000003 17.624508 17.89604
## ENSG00000000005 9.236257 0.00000
## ENSG00000000419 19.765411 20.90188
## ENSG00000000457 15.745243 15.92405
## ExperimentList class object of length 2:
## [1] DNA methylation: RangedSummarizedExperiment with 148951 rows and 171 columns
## [2] Gene expression: RangedSummarizedExperiment with 56830 rows and 171 columns
## [1] 148951
## [1] 171
## GRanges object with 148951 ranges and 4 metadata columns:
## seqnames ranges strand | address_A address_B
## <Rle> <IRanges> <Rle> | <integer> <integer>
## cg08258224 chr1 864703-864704 - | 23712478 <NA>
## cg13938959 chr1 898803-898804 + | 38601418 <NA>
## cg12445832 chr1 898915-898916 - | 70726486 <NA>
## cg23999112 chr1 898976-898977 - | 49747457 <NA>
## cg11527153 chr1 902156-902157 + | 47662324 <NA>
## ... ... ... ... . ... ...
## cg16428758 chrX 155539968-155539969 - | 28623392 <NA>
## cg05682970 chrX 155609878-155609879 - | 34723345 <NA>
## cg07211220 chrX 155616254-155616255 + | 33714463 <NA>
## cg25059696 chrY 7563149-7563150 + | 61613414 <NA>
## cg13851368 chrY 11954167-11954168 + | 36601306 32634316
## channel designType
## <character> <character>
## cg08258224 Both II
## cg13938959 Both II
## cg12445832 Both II
## cg23999112 Both II
## cg11527153 Both II
## ... ... ...
## cg16428758 Both II
## cg05682970 Both II
## cg07211220 Both II
## cg25059696 Both II
## cg13851368 Grn I
## -------
## seqinfo: 26 sequences from an unspecified genome; no seqlengths
## TCGA-2H-A9GF-01A-11D-A37D-05 TCGA-2H-A9GG-01A-11D-A37D-05
## cg08258224 0.9068315 0.8888116
## cg13938959 0.2791042 0.2178040
## cg12445832 0.1900486 0.1202698
## cg23999112 0.3198171 0.1969021
## TCGA-2H-A9GH-01A-11D-A37D-05 TCGA-2H-A9GI-01A-11D-A37D-05
## cg08258224 0.8247016 0.8174693
## cg13938959 0.3241953 0.3408468
## cg12445832 0.2290422 0.1741125
## cg23999112 0.4268968 0.4687823
# metadata can be accessed using the function colData
# we will check the first 4 samples and their 4 columns
colData(mae)[1:4,1:4]
## DataFrame with 4 rows and 4 columns
## sample patient shortLetterCode
## <character> <character> <character>
## TCGA-2H-A9GF-01A TCGA-2H-A9GF-01A TCGA-2H-A9GF TP
## TCGA-2H-A9GG-01A TCGA-2H-A9GG-01A TCGA-2H-A9GG TP
## TCGA-2H-A9GH-01A TCGA-2H-A9GH-01A TCGA-2H-A9GH TP
## TCGA-2H-A9GI-01A TCGA-2H-A9GI-01A TCGA-2H-A9GI TP
## definition
## <character>
## TCGA-2H-A9GF-01A Primary solid Tumor
## TCGA-2H-A9GG-01A Primary solid Tumor
## TCGA-2H-A9GH-01A Primary solid Tumor
## TCGA-2H-A9GI-01A Primary solid Tumor
## [1] "Adenocarcinoma, NOS" "Adenocarcinoma, NOS" "Adenocarcinoma, NOS"
## [4] "Adenocarcinoma, NOS"
##
## Metastatic Primary solid Tumor Solid Tissue Normal
## 1 161 9
# or summarize the numbers of samples using two features
table(mae$primary_diagnosis,mae$definition)
##
## Metastatic
## Adenocarcinoma, NOS 0
## Basaloid squamous cell carcinoma 0
## Mucinous adenocarcinoma 0
## Squamous cell carcinoma, keratinizing, NOS 0
## Squamous cell carcinoma, NOS 1
## Tubular adenocarcinoma 0
##
## Primary solid Tumor
## Adenocarcinoma, NOS 78
## Basaloid squamous cell carcinoma 1
## Mucinous adenocarcinoma 1
## Squamous cell carcinoma, keratinizing, NOS 4
## Squamous cell carcinoma, NOS 76
## Tubular adenocarcinoma 1
##
## Solid Tissue Normal
## Adenocarcinoma, NOS 7
## Basaloid squamous cell carcinoma 0
## Mucinous adenocarcinoma 1
## Squamous cell carcinoma, keratinizing, NOS 0
## Squamous cell carcinoma, NOS 1
## Tubular adenocarcinoma 0
Since we will use only Primary solid Tumor samples we will remove the Metastatic and Solid Tissue Normal samples from our object
We will first define some parameters used in the majority of ELMER function
# which is the column of the samples metadata defining our groups
group.col <- "primary_diagnosis"
# which are the grops withing the group.col that we want to compare
group1 <- "Adenocarcinoma, NOS"
group2 <- "Squamous cell carcinoma, NOS"
# Identify probes hypo in group1 or hypo in group2 ?
direction <- "hypo" # hypo in group 1
cores <- 2
mode <- "supervised"
# Where do we want to save the results ?
dir.out <- paste0("analysis/",group.col,"-",
gsub("[[:punct:]]| ", "_", group1),
"_vs_",
gsub("[[:punct:]]| ", "_", group2),
"_", mode,
"/", direction)
dir.create(dir.out, showWarnings = FALSE,recursive = T)
# time expected with one core: 4 minuntes
diff.probes <- get.diff.meth(data = mae,
group.col = group.col,
group1 = group1,
group2 = group2,
diff.dir = direction, # Get probes hypometh. in group 1
cores = cores,
mode = mode,
pvalue = 0.01,
sig.dif = 0.3,
dir.out = dir.out,
save = TRUE)
## Supervised mode will use all samples from boths groups. Setting argument minSubgroupFrac to 1
## ELMER will search for probes hypomethylated in group Adenocarcinoma, NOS (n:78) compared to Squamous cell carcinoma, NOS (n:76)
## ooo Arguments ooo
## o Number of probes: 148951
## o Beta value difference cut-off: 0.3
## o FDR cut-off: 0.01
## o Mode: supervised
## o % of samples per group in each comparison: 1
## o Min number of samples per group in each comparison: 5
## o Nb of samples group1 in each comparison: 78
## o Nb of samples group2 in each comparison: 76
## Output direction: analysis/primary_diagnosis-Adenocarcinoma__NOS_vs_Squamous_cell_carcinoma__NOS_supervised/hypo
## ooooooooooooooooo
## Progress disabled when using parallel plyr
## Saving results
## Number of relevant probes found: 1524
Now that we have our hypomethylated probes we will look for the 20 nearest genes
to each probes to identify if any of these genes could being affected by the hypomethylation in the distal regulatory region.
# time expected less than one minute
nearGenes <- GetNearGenes(data = mae,
probes = diff.probes$probe,
numFlankingGenes = 20)
## Searching for the 20 near genes
## Identifying gene position for each probe
Now that we have our hypomethylated probes and the 20 nearest genes we want to identify if any of these genes could being affected by the hypomethylatio in the distal regulatory region.
# time expected one minute
pair <- get.pair(data = mae,
nearGenes = nearGenes,
group.col = group.col,
group1 = group1,
group2 = group2,
mode = mode,
diff.dir = direction,
raw.pvalue = 0.001,
Pe = 0.001,
filter.probes = TRUE,
filter.percentage = 0.05,
filter.portion = 0.3,
dir.out = dir.out,
cores = cores,
label = direction)
## Using pre-defined groups. U (unmethylated): Adenocarcinoma, NOS, M (methylated): Squamous cell carcinoma, NOS
## -------------------
## * Filtering probes
## -------------------
## For more information see function preAssociationProbeFiltering
## Making sure we have at least 5% of beta values lesser than 0.3 and 5% of beta values greater 0.3.
## Removing 100701 probes out of 148951
## Calculating Pp (probe - gene) for all nearby genes
## Progress disabled when using parallel plyr
## File created: analysis/primary_diagnosis-Adenocarcinoma__NOS_vs_Squamous_cell_carcinoma__NOS_supervised/hypo/getPair.hypo.all.pairs.statistic.csv
We can plot the correlation heatmap of DNA methylation and gene expression with the heatmapPairs
function. We will plot only the first 100 pairs.
heatmapPairs(data = mae,
group.col = group.col,
group1 = group1,
group2 = group2,
subset = TRUE,
pairs = pair[1:100,],
filename = NULL)
## Subsetting
## Ordering groups
We have hypomethylated regions that has a distal gene upregulated. We want to identify which is the MR TF binding to those regions and regulating the distal gene. First we identify which are the enriched motifs for those hypomethylated regions using the function get.enriched.motif
. This will output the enriched motifs and the respective hypomethylated probes with the motif signature around it (since a DNA methylation probes is 1bp, we extended the search window to \(\pm\) 250 bp).
# time expected less than one minute
enriched.motif <- get.enriched.motif(data = mae,
probes = unique(pair$Probe),
dir.out = dir.out,
label = direction,
min.incidence = 10,
lower.OR = 1.5)
## [1] "GATA4_HUMAN.H11MO.0.A" "GATA6_HUMAN.H11MO.0.A" "HNF4A_HUMAN.H11MO.0.A"
## [4] "GATA2_HUMAN.H11MO.1.A"
# and the paired hypomethylated probes with the motif signature
enriched.motif$GATA4_HUMAN.H11MO.0.A
## [1] "cg22050733" "cg04275506" "cg19860160" "cg22512779" "cg19626253"
## [6] "cg03601886" "cg13784017" "cg11959316" "cg07191152" "cg25098793"
## [11] "cg05625834" "cg17277896" "cg26787863" "cg00075975" "cg07005353"
## [16] "cg00577847" "cg07117596" "cg14520947" "cg23486701" "cg22107147"
## [21] "cg20043649" "cg08818866" "cg08847775" "cg12077754" "cg27239243"
## [26] "cg10560561" "cg05977696" "cg23367358" "cg10018933" "cg21926118"
## [31] "cg07613752" "cg01450522" "cg24590723" "cg03865925" "cg18342703"
## [36] "cg21425296" "cg00584840" "cg23174148" "cg22539744" "cg01377807"
## [41] "cg04319611" "cg11578055" "cg15634747" "cg23971987" "cg01923775"
## [46] "cg05626226" "cg03464224" "cg00796302" "cg26728709" "cg08988420"
## [51] "cg08830157" "cg18430895" "cg08609445" "cg16016281" "cg11821200"
## [56] "cg24606701" "cg27351783" "cg21566177" "cg03172688" "cg10505610"
## [61] "cg11584098" "cg19692192" "cg05165087" "cg07186962" "cg08015382"
## [66] "cg27565517" "cg15094236" "cg12663302" "cg23381884" "cg14493742"
## [71] "cg09305161" "cg10846969" "cg23226631" "cg10784258" "cg22443940"
## [76] "cg19929194" "cg12964881" "cg24645679" "cg00084338" "cg17287122"
## [81] "cg27246129" "cg09022230" "cg00772192" "cg02498579" "cg08828787"
## [86] "cg14033170" "cg18100008" "cg09422239" "cg15059474" "cg24300607"
## [91] "cg19935531" "cg14199261" "cg05921889" "cg05744395" "cg18991321"
## [96] "cg00987534" "cg02901002" "cg12018521" "cg01824603" "cg10087556"
## [101] "cg14533298" "cg17291136" "cg04237288" "cg18336854" "cg12726213"
## [106] "cg24027477" "cg15235922" "cg07273304" "cg26895804" "cg10131026"
## [111] "cg14325112" "cg03489427" "cg10063407" "cg01364755" "cg14509153"
## [116] "cg10476206" "cg17568611" "cg23681599" "cg10494981" "cg25227803"
## [121] "cg27065896" "cg11688410" "cg17425818" "cg04337594" "cg18203466"
## [126] "cg10955669" "cg19754622" "cg13608733" "cg26725502" "cg16311302"
## [131] "cg15722533" "cg20482760" "cg23091777" "cg03078767" "cg02856190"
## [136] "cg25015613" "cg15110243" "cg18246298" "cg02159731" "cg06978117"
## [141] "cg23327896" "cg16182447" "cg26540315" "cg18248586" "cg03554573"
## [146] "cg21127079" "cg21672401" "cg21738443" "cg25757470" "cg17046586"
## [151] "cg19091779" "cg01579458" "cg10621809" "cg17066349" "cg27614751"
## [156] "cg26545245" "cg04874286" "cg11658060" "cg13712197" "cg17843665"
## [161] "cg20360395" "cg16378261" "cg12194594" "cg16536610" "cg12128696"
## [166] "cg13995599" "cg02909176" "cg16217297" "cg07800658" "cg15935770"
## [171] "cg23639055" "cg14850604" "cg21207694" "cg19708418" "cg13594138"
## [176] "cg21523688" "cg04094640" "cg09113939" "cg07092029" "cg07098502"
## [181] "cg10020890" "cg00599850" "cg13688786" "cg04726374" "cg04136369"
## [186] "cg08659204" "cg03828329" "cg04770088" "cg04603419" "cg17415265"
## [191] "cg00777445" "cg19078037" "cg00423871" "cg09926389" "cg18788725"
## [196] "cg07604565" "cg26162564" "cg27403406" "cg21211039" "cg01327767"
## [201] "cg13758960" "cg05321038" "cg00389552"
motif.result.file <- file.path(dir.out,"getMotif.hypo.motif.enrichment.csv")
motif.enrichment.plot(motif.enrichment = motif.result.file,
significant = list(lowerOR = 2.0),
label = "hypo",
summary = TRUE,
save = FALSE)
## TableGrob (2 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
Now that we have our enriched motifs, we have the following question: - Which is the candidate master regulator Transcription factor binding to those regions ?
It is important to highlight that since TF within the same family and subfamily have very close motifs, they will likely be indentified to bind the same regions. You can check that by verifying the intersection of probes from our enrichment analysis results.
## [1] 203
## [1] 189
## [1] 147
So more than 70% of probes are enriched for both GATA4 and GATA6. How could we infer which of those MR might be the one with higher impact ?
For that ELMER correlates the TF expression vs the average mean methylation of the hypomethylated probes with a motif signature and rank them. We expect that a TF with higher expression and lower methylation in the binding regions will be a better MR TF candidate than TF that are for example equally expressed on both groups.
# time expected ~5 minutes
TF <- get.TFs(data = mae,
group.col = group.col,
mode = mode,
group1 = group1,
group2 = group2,
diff.dir = direction,
enriched.motif = enriched.motif,
dir.out = dir.out,
label = direction,
cores = cores)
## Using pre-defined groups. U (unmethylated): Adenocarcinoma, NOS, M (methylated): Squamous cell carcinoma, NOS
## -------------------------------------------------------------------------------------------------------------------
## ** Downloading TF list from Lambert, Samuel A., et al. The human transcription factors. Cell 172.4 (2018): 650-665.
## -------------------------------------------------------------------------------------------------------------------
## Accessing TF families from TFClass database to indentify known potential TF
## Retrieving TFClass family classification from ELMER.data.
## Retrieving TFClass subfamily classification from ELMER.data.
## Calculating the average methylation at all motif-adjacent probes
## Progress disabled when using parallel plyr
## Performing Mann-Whitney U test
## Progress disabled when using parallel plyr
## Finding potential TF and known potential TF
## Progress disabled when using parallel plyr
ELMER provides an easy way to visualize the rank of all human TF evaluated with the function TF.rank.plot
.
When you perform several ELMER analysis with different arguments, you might want to summarize and visualize those several results. For that, we have the ELMER:::get.tabs
function , which accepts as input the directories of the results and if you want to select the MR TF within the family of subfamily classification.
In the Data/analysis_april_2019
folder we provided 4 ELMER analysis. The first step is to retrieve the directory names with the results. Also, it is a golden rule when running several ELMER runs to add a understable name to the dir.out
directories, such that you easily know which analysis were ran, with which arguments. In the example below, you can identify the tumor type, type of ELMER run (supervised/unsupervised), groups compared and direction of the analysis (hypo or hyper in group 1).
In the directory analysis_april_2019
we provide 4 results from different ELMER runs.
analsysis.dir <- unique(dirname(dir("Data/analysis_april_2019/",recursive = T,full.names = T)))
analsysis.dir
## [1] "Data/analysis_april_2019//COAD-READ/unsupervised/definition-Primary.solid.Tumor_vs_Solid.Tissue.Normal/hypo"
## [2] "Data/analysis_april_2019//ESCA/unsupervised/group-Adenocarcinoma.Primary.solid.Tumor_vs_Adenocarcinoma.Solid.Tissue.Normal/hypo"
## [3] "Data/analysis_april_2019//ESCA/unsupervised/group-Adenocarcinoma.Primary.solid.Tumor_vs_Squamous.cell.carcinoma.Primary.solid.Tumor/hypo"
## [4] "Data/analysis_april_2019//PAAD/unsupervised/definition-Primary.solid.Tumor_vs_Solid.Tissue.Normal/hypo"
With those directory names we will use the function ELMER:::get.tabs
to summarize the results.
## o Creating TF binary matrix
## Joining, by = "TF"
## Joining, by = "TF"
## Joining, by = "TF"
## o Creating TF FDR matrix
## Joining, by = "motif"
## Joining, by = "motif"
## Joining, by = "motif"
## Joining, by = "motif"
## o Creating TF OR matrix
## Joining, by = "motif"
## Joining, by = "motif"
## Joining, by = "motif"
## Joining, by = "motif"
## Joining, by = "motif"
## Joining, by = "motif"
## Joining, by = "motif"
## Joining, by = "motif"
## [1] "tab" "tab.pval" "tab.or" "tf.or.table"
# To be more readable we will change the path names to the analysis name
analysis.names <- c( "COAD-READ (vs Normal)",
"EAC (vs Normal)",
"EAC (vs ESCC)",
"PAAD (vs Normal)")
for (i in 1:3) colnames(tabs[[i]]) <- analysis.names
This function creates 4 matrices using some of the files in each analysis directory: 1. a binary matrix saying if the TF was found or not in the analysis: this function parses each getTF.hypo.significant.TFs.with.motif.summary.csv
(create after you run the get.TFs
function) file that has the MR TF for each enriched motif.
# 1. a binary matrix saying if the TF was found or not in the analysis
tab.binary <- tabs$tab
head(tab.binary)
getTF.hypo.significant.TFs.with.motif.summary.csv
to identify the MR TF and getTF.hypo.TFs.with.motif.pvalue.rda
, which is a matrix with TF as columns and enriched motifs as rows and the contains the p-value of the wilcoxon test of the TF expression for the Unmethylated vs Methylated groups.## COAD-READ (vs Normal) EAC (vs Normal) EAC (vs ESCC)
## ARID3A 8.397450e-08 NA NA
## FOXQ1 3.774568e-08 NA NA
## HNF1A 5.845199e-10 0.007700548 1.689635e-09
## PDX1 3.243946e-09 NA NA
## POU5F1B 1.259413e-10 NA NA
## SOX9 8.410768e-08 NA NA
## PAAD (vs Normal)
## ARID3A NA
## FOXQ1 NA
## HNF1A 1.777426e-07
## PDX1 NA
## POU5F1B NA
## SOX9 2.017003e-07
getTF.hypo.significant.TFs.with.motif.summary.csv
to get the MR TF and then uses getMotif.hypo.motif.enrichment.csv
to get the motifs OR to which the MR TF is known to bind and returns the highest OR.# 3. a matrix with the highest motif OR for the MR TF found in the analysis
tab.or <- tabs$tab.or
head(tab.or)
## COAD-READ (vs Normal) EAC (vs Normal) EAC (vs ESCC)
## ARID3A 1.348007 NA NA
## FOXQ1 1.362852 NA NA
## HNF1A 1.471907 1.501869 2.190133
## PDX1 1.730954 NA NA
## POU5F1B 1.786492 NA NA
## SOX9 1.689930 NA NA
## PAAD (vs Normal)
## ARID3A NA
## FOXQ1 NA
## HNF1A 1.960243
## PDX1 NA
## POU5F1B NA
## SOX9 1.684580
# 4. a matrix with 4 collums: MR TF, motif, FDR and analysis it was identified.
tf.or.table <- tabs$tf.or.table
# Rename analysis
tf.or.table$analysis <- gsub("ESCA","EAC",paste0(basename(dirname(dirname(dirname(tf.or.table$analysis)))),
" (vs ",
ifelse(grepl("Normal",tf.or.table$analysis),"Normal","ESCC"),")"))
head(tf.or.table)
With those summarized results we can create two summary plots:
The code below creates the heatmap using complexHeatmap package for the binary MR TF matrix and the p-value one.
library(ComplexHeatmap)
library(vegan)
col <- ifelse(classification == "family","top.potential.TF.family", "top.potential.TF.subfamily")
analysis.colors <- c(
"COAD-READ (vs Normal)" = '#ff964f',
"EAC (vs Normal)" = '#94e894',
"EAC (vs ESCC)" = '#eaadb9',
"PAAD (vs Normal)" = '#9ab9f9'
)
# get top TFs in each anaylsis
labels <- ELMER:::get.top.tf.by.pval(tab.pval,top = nrow(tab.pval))
hb = HeatmapAnnotation(df = data.frame("Mode" = rep("unsupervised",4),
"Analysis" = analysis.names),
col = list("Analysis" = analysis.colors,
"Mode" = c("unsupervised" = "#ebd540",
"supervised" = "#718da5")),
show_annotation_name = FALSE,
show_legend = T,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 12))
# Binary heatmap
ht_list_binary <- Heatmap(as.matrix(tab.binary),
name = "Binary heatmap" ,
col = c("1" = "black", "0" = "white"),
column_names_gp = gpar(fontsize = 8),
show_column_names = FALSE,
use_raster = TRUE,
na_col = "white",
raster_device = c("png"),
raster_quality = 2,
show_row_names = F,
show_heatmap_legend = TRUE,
cluster_columns = FALSE,
cluster_rows = TRUE,
show_column_dend = FALSE,
show_row_dend = FALSE,
clustering_distance_rows = function(x) {
vegan::vegdist(tab.binary,method = "jaccard",binary = TRUE)
},
clustering_method_rows = "average",
top_annotation = hb,
width = unit(5, "cm"),
row_names_gp = gpar(fontsize = 1),
column_title = paste0("Binrary MR TF Heatmap"),
column_title_gp = gpar(fontsize = 10),
row_title_gp = gpar(fontsize = 16))
ht_list_binary <- ht_list_binary +
rowAnnotation(link = anno_mark(at = match(labels,rownames(tab.pval)),
labels = labels,labels_gp = gpar(fontsize = 10)),
width = unit(1, "cm") + max_text_width(labels)
)
draw(ht_list_binary,
newpage = FALSE,
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "left",
show_heatmap_legend = TRUE,
annotation_legend_side = "right")
# P-value heatmap
ht_list <- Heatmap(as.matrix(-log10(tab.pval)),
name = "-log10 (FDR)" ,
col = colorRampPalette(c('#3361A5', '#248AF3', '#14B3FF',
'#88CEEF', '#C1D5DC', '#EAD397',
'#FDB31A','#E42A2A', '#A31D1D'))(100),
column_names_gp = gpar(fontsize = 8),
show_column_names = FALSE,
use_raster = TRUE,
na_col = "white",
raster_device = c("png"),
raster_quality = 2,
show_row_names = F,
show_heatmap_legend = TRUE,
cluster_columns = FALSE,
cluster_rows = FALSE,
row_order = row_order(ht_list_binary),
top_annotation = hb,
width = unit(5, "cm"),
row_names_gp = gpar(fontsize = 1),
column_title = paste0("MR TF Heatmap"),
column_title_gp = gpar(fontsize = 10),
row_title_gp = gpar(fontsize = 16))
ht_list <- ht_list +
rowAnnotation(link = anno_mark(at = match(labels,rownames(tab.pval)),
labels = labels,labels_gp = gpar(fontsize = 10)),
width = unit(1, "cm") + max_text_width(labels)
)
draw(ht_list,
newpage = TRUE,
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "left",
show_heatmap_legend = TRUE,
annotation_legend_side = "right")
The main difference from the last plot is the requirement of the MAE object with the DNA methylation and gene expression data. For a given MR TF, we need to indentify the top enriched motif it binds and then for each enriched motifs its distal paired differently methylated probes using the getMotif.hypo.enriched.motifs.rda
object within the results folder.
Also, since we only provide one object used for two analysis (“EAC (vs Normal)” and “EAC (vs ESCC)” ) we will only plot the MR TF expression vs DNA methylation at binding motifs for them.
analsysis.dir.esca <- unique(dirname(dir("Data/analysis_april_2019/ESCA/",recursive = T,full.names = T)))
mae <- readRDS("Data/TCGA_ESCA_MAE_distal_regions.rds")
tab.pval.esca <- tab.pval[,2:3]
suppressMessages({
library(ELMER)
library(grid)
library(ggplot2)
library(ggpubr)
library(dplyr)
})
analysis.colors <- c(
"COAD-READ (vs Normal)" = '#ff964f',
"EAC (vs Normal)" = '#94e894',
"EAC (vs ESCC)" = '#eaadb9',
"PAAD (vs Normal)" = '#9ab9f9'
)
# Top has the number of MRTF to plot
for (top in c(5)) {
scatter.list <- list()
scatter.grid <- list()
scatter.labels <- c()
for (path in analsysis.dir.esca) {
# Load of object with enriched motifs and the probes mapped to it
load(dir(path = path,
pattern = "getMotif.hypo.enriched.motifs.rda",
recursive = T,full.names = T)
)
i <- which(path == analsysis.dir.esca)
tumor <- basename(dirname(dirname(dirname(path))))
mae.plot <- mae
# Change the data to plot based on the analysis performed
if (grepl("Squamous",path)) {
category <- "primary_diagnosis"
# keep only samples used in the analysis
mae.plot <- mae.plot[,mae.plot$primary_diagnosis %in% c("Adenocarcinoma, NOS",
"Squamous cell carcinoma, NOS") &
mae.plot$definition %in% c("Primary solid Tumor")]
} else {
category <- "definition"
# keep only samples used in the analysis
mae.plot <- mae.plot[,mae.plot$primary_diagnosis %in% c("Adenocarcinoma, NOS") &
mae.plot$definition %in% c("Primary solid Tumor","Solid Tissue Normal")]
}
# get top Tfs
topTFs <- rownames(tab.pval)[head(sort(DelayedArray::rowMins(tab.pval.esca[,i,drop = F],na.rm = T),
index.return = TRUE, decreasing = F)$ix, n = top)]
topTFs <- topTFs[1:top]
for (j in topTFs) {
TF <- j
scatter.labels <- c(scatter.labels,colnames(tab.pval.esca)[i])
motif <- tf.or.table[tf.or.table$TF == j &
tf.or.table$analysis == colnames(tab.pval.esca)[i],"motif"][1]
# Change the dots color based on the analysis performed
if (grepl("Squamous",path)) color.value <- c("#FF0000","#0000FF")
if (grepl("Adenocarcinoma.Solid.Tissue.Normal",path)) color.value <- c("#FF0000","#176d55")
if (is.na(motif)) next
# Plot TF expression vs avg DNA methylation of the TFBS
suppressMessages({
scatter <- scatter.plot(data = mae.plot,
correlation = TRUE,
byTF = list(TF = TF,
probe = enriched.motif[[motif]]),
category = category,
ylim = c(0,25),
dots.size = 0.8,
color.value = color.value,
save = FALSE,
lm_line = TRUE) +
ylab(bquote(.(TF) ~ log[2](RSEM + 1))) +
xlab(paste0("Avg. DNA met. at \n", length(enriched.motif[[motif]]),
" CpGs w/ \n enriched motif ",
sub("_HUMAN.H11MO.[0-1].[A-D]", "", motif))) +
theme(legend.position = "none",
plot.title = element_blank(),
axis.title.y = element_text(size = 8, face = "bold"),
axis.title.x = element_text(size = 8, face = "bold"),
strip.background = element_rect(fill = analysis.colors[i + 1]),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
# Remove panel background
panel.background = element_blank()
)
})
# If this is the first plot of the analysis we will add a labels (rectangle + text)
if(j == topTFs[1]){
gt <- gtable::gtable(widths = unit(0.5,"cm"), heights = unit(3, "in")) %>%
gtable::gtable_add_grob(grid::rectGrob(gp=gpar(fill= analysis.colors[i + 1])), 1, 1, name = 1) %>%
gtable::gtable_add_grob(grid::textGrob(colnames(tab.pval.esca)[i],
rot = 90,
gp = grid::gpar(fontsize = 8)), 1, 1, name = 2)
scatter <- scatter %>% ggpubr::annotate_figure(left = gt)
}
scatter.list[[paste0(i,j)]] <- scatter
}
scatter.grid[[i]] <- ggpubr::ggarrange(plotlist = scatter.list[grep(paste0("^",i),names(scatter.list))],
ncol = top,
# the first column should be wider since it has the label
widths = c(1.15,rep(1,top - 1)),
nrow = 1,
common.legend = T,
legend = "bottom")
}
}
options(repr.plot.width=20, repr.plot.height=8)
# we use ggarrange to arrange the plots together
ggpubr::ggarrange(plotlist = scatter.grid,
ncol = 1,
# the first column should be wider since it has the label
widths = c(1.15,rep(1,top - 1)),
nrow = 2, # Number of analysis - rows of the plot
heights = c(rep(1.5,length(scatter.list) - 1), 2),
common.legend = F,
legend = "bottom")
The function getTCGA
uses TCGAbiolinks to retrieve TCGA data in ELMER accept format.
genome <- "hg38"
# Get RNA-seq and DNA methylation from GDC using TCGAbiolinks
# Data will be saved in Data folder by default
getTCGA(disease = "ESCA", genome = genome, Meth = TRUE,RNA = TRUE)
# We will keep only the distal probes
distal.probes <- get.feature.probe(feature = NULL,
genome = genome,
met.platform = "450K")
mae <- createMAE(exp = paste0("Data/",tumor,"/",tumor,"_RNA_hg38_no_filter.rda"),
met = paste0("Data/",tumor,"/",tumor,"_Meth_hg38_no_filter.rda"),
met.platform = "450K",
genome = "hg38",
linearize.exp = TRUE,
filter.probes = distal.probes,
save = FALSE,
TCGA = TRUE)
mae$group <- paste0(mae$name,"-",mae$definition)
mae <- mae[,!mae$is_ffpe] # remove FFPE samples from analysis
saveRDS(object = mae, file = "TCGA_ESCA_MAE_distal_regions.rds")
## R version 3.6.1 Patched (2019-08-29 r77095)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] dplyr_0.8.3 ggpubr_0.2.3
## [3] magrittr_1.5 ggplot2_3.2.1
## [5] vegan_2.5-6 lattice_0.20-38
## [7] permute_0.9-5 ComplexHeatmap_2.0.0
## [9] MultiAssayExperiment_1.10.4 SummarizedExperiment_1.14.1
## [11] DelayedArray_0.10.0 BiocParallel_1.18.1
## [13] matrixStats_0.55.0 Biobase_2.44.0
## [15] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
## [17] IRanges_2.18.2 S4Vectors_0.22.0
## [19] BiocGenerics_0.30.0 ELMER_2.9.2
## [21] ELMER.data_2.9.3
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.4 circlize_0.4.8
## [3] Hmisc_4.2-0 aroma.light_3.14.0
## [5] BiocFileCache_1.8.0 plyr_1.8.4
## [7] selectr_0.4-1 ConsensusClusterPlus_1.48.0
## [9] lazyeval_0.2.2 splines_3.6.1
## [11] sva_3.32.1 digest_0.6.20
## [13] foreach_1.4.7 ensembldb_2.8.0
## [15] htmltools_0.3.6 checkmate_1.9.4
## [17] memoise_1.1.0 BSgenome_1.52.0
## [19] cluster_2.1.0 doParallel_1.0.15
## [21] limma_3.40.6 Biostrings_2.52.0
## [23] readr_1.3.1 annotate_1.62.0
## [25] R.utils_2.9.0 askpass_1.1
## [27] prettyunits_1.0.2 colorspace_1.4-1
## [29] blob_1.2.0 rvest_0.3.4
## [31] rappdirs_0.3.1 ggrepel_0.8.1
## [33] xfun_0.9 crayon_1.3.4
## [35] RCurl_1.95-4.12 jsonlite_1.6
## [37] genefilter_1.66.0 zeallot_0.1.0
## [39] zoo_1.8-6 survival_2.44-1.1
## [41] VariantAnnotation_1.30.1 iterators_1.0.12
## [43] glue_1.3.1 survminer_0.4.6
## [45] gtable_0.3.0 zlibbioc_1.30.0
## [47] XVector_0.24.0 GetoptLong_0.1.7
## [49] shape_1.4.4 scales_1.0.0
## [51] DESeq_1.36.0 DBI_1.0.0
## [53] edgeR_3.26.8 ggthemes_4.2.0
## [55] Rcpp_1.0.2 viridisLite_0.3.0
## [57] xtable_1.8-4 progress_1.2.2
## [59] htmlTable_1.13.1 clue_0.3-57
## [61] foreign_0.8-72 bit_1.1-14
## [63] matlab_1.0.2 km.ci_0.5-2
## [65] Formula_1.2-3 htmlwidgets_1.3
## [67] httr_1.4.1 RColorBrewer_1.1-2
## [69] acepack_1.4.1 reshape_0.8.8
## [71] pkgconfig_2.0.2 XML_3.98-1.20
## [73] R.methodsS3_1.7.1 Gviz_1.28.1
## [75] nnet_7.3-12 dbplyr_1.4.2
## [77] locfit_1.5-9.1 labeling_0.3
## [79] tidyselect_0.2.5 rlang_0.4.0
## [81] AnnotationDbi_1.46.1 munsell_0.5.0
## [83] tools_3.6.1 downloader_0.4
## [85] generics_0.0.2 RSQLite_2.1.2
## [87] broom_0.5.2 evaluate_0.14
## [89] stringr_1.4.0 yaml_2.2.0
## [91] knitr_1.24 bit64_0.9-7
## [93] survMisc_0.5.5 purrr_0.3.2
## [95] AnnotationFilter_1.8.0 TCGAbiolinks_2.13.5
## [97] nlme_3.1-141 EDASeq_2.18.0
## [99] R.oo_1.22.0 xml2_1.2.2
## [101] biomaRt_2.41.8 compiler_3.6.1
## [103] rstudioapi_0.10 plotly_4.9.0
## [105] curl_4.0 png_0.1-7
## [107] ggsignif_0.6.0 tibble_2.1.3
## [109] geneplotter_1.62.0 stringi_1.4.3
## [111] GenomicFeatures_1.36.4 ProtGenerics_1.16.0
## [113] Matrix_1.2-17 KMsurv_0.1-5
## [115] vctrs_0.2.0 pillar_1.4.2
## [117] GlobalOptions_0.1.0 cowplot_1.0.0
## [119] data.table_1.12.2 bitops_1.0-6
## [121] rtracklayer_1.44.3 R6_2.4.0
## [123] latticeExtra_0.6-28 hwriter_1.3.2
## [125] ShortRead_1.42.0 gridExtra_2.3
## [127] codetools_0.2-16 dichromat_2.0-0
## [129] MASS_7.3-51.4 assertthat_0.2.1
## [131] openssl_1.4.1 rjson_0.2.20
## [133] withr_2.1.2 GenomicAlignments_1.20.1
## [135] Rsamtools_2.0.0 GenomeInfoDbData_1.2.1
## [137] mgcv_1.8-28 hms_0.5.1
## [139] rpart_4.1-15 tidyr_0.8.3
## [141] rmarkdown_1.15 biovizBase_1.32.0
## [143] base64enc_0.1-3
We have a set of recorded videos, explaining some of the workshops.