Methylation, the addition of a methyl group to a CG dinucleotide of the DNA, is the most extensively studied epigenetic mark. Although it can be measured in many ways, the most popular platform for measuring methylation has been the Illumina HumanMethylation450 (450k) array. In this workshop, we will introduce the features of the 450k array and how to analyse the methylation array data using R. We will cover the steps involved in the standard 450k analysis pipeline including: quality control, filtering, normalization, data exploration and probe-wise differential methylation array analysis. We will also cover other approaches such as differential methylation analysis of regions, differential variability analysis and gene ontology analysis. Finally, we will provide some examples of how to visualise methylation array data
To participate in this workshop you will need the following software installed on your computer:
You should then start RStudio and install Bioconductor (3.2) and several required packages by running the following commands:
## install required Bioconductor packages
source("http://www.bioconductor.org/biocLite.R")
biocLite()
biocLite(c("minfi","limma","IlluminaHumanMethylation450kanno.ilmn12.hg19","minfiData",
"IlluminaHumanMethylation450kmanifest","RColorBrewer","missMethyl",
"DMRcate","stringi","plyr","Gviz","matrixStats","GO.db","BiasedUrn"))
To demonstrate the various aspects of analysing methylation data, we will be using a small methylation dataset that has been previously analysed and published. The data is of the methylation profiles of various sorted T-cell types. More information on the samples and the dataset can be found in the original paper. The data can be downloaded from the MCRI Owncloud (password: biosummer2015) until Dec 12th. Please download all of the data files into the same directory.
Illumina HumanMethylation450 (450k) arrays are a cost-effective alternative to whole genome bisulfite sequencing, and as such have been widely used to profile DNA methylation, particularly for studies with large numbers of samples. Illumina have recently release a new platform, the Illumina MethylationEPIC array, which increases the number of CpG probes from ~450,000 to ~850,000. Although the number of CpGs covered is increased on the new array, the same probe chemistry is being used and thus the principles for analysing the 450k array should be largely applicable to the new 850k array.
This workshop will cover various aspects of a basic differential methylation analysis of 450k arrays. There are many R Bioconductor packages available for analysing 450k array methylation data, including minfi, missMethyl, wateRmelon and methylumi. We will begin with an example of a probe-wise methylation analysis using minfi and limma. By probe-wise analysis we mean that we will obtain a moderated t-statistic and p-value for each individual CpG probe, which we can use to determine which individual CpGs are differentially methylated at some significance level for our comparisons of interest.
We will begin by loading all the package libraries that will be required for the analysis.
## load packages required for analysis
library(limma)
library(minfi)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:limma':
##
## plotMA
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Loading required package: lattice
## Loading required package: GenomicRanges
## Loading required package: S4Vectors
## Loading required package: stats4
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biostrings
## Loading required package: XVector
## Loading required package: bumphunter
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: locfit
## locfit 1.5-9.1 2013-03-22
## Setting options('download.file.method.GEOquery'='auto')
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(IlluminaHumanMethylation450kmanifest)
library(RColorBrewer)
library(missMethyl)
## Loading required package: DBI
library(matrixStats)
## matrixStats v0.15.0 (2015-10-26) successfully loaded. See ?matrixStats for help.
##
## Attaching package: 'matrixStats'
##
## The following object is masked from 'package:GenomicRanges':
##
## rowRanges
##
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
library(minfiData)
library(Gviz)
## Loading required package: grid
library(DMRcate)
## Loading required package: DMRcatedata
Load the Illumina manifest which contains all of the annotation information for each of the CpG probes on the 450k array. This will help us determine where our differentially methylated probes are located in a genomic context.
## get the 450k annotation data
ann450k = getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
head(ann450k)
## DataFrame with 6 rows and 33 columns
## chr pos strand Name AddressA
## <character> <integer> <character> <character> <character>
## cg00050873 chrY 9363356 - cg00050873 32735311
## cg00212031 chrY 21239348 - cg00212031 29674443
## cg00213748 chrY 8148233 - cg00213748 30703409
## cg00214611 chrY 15815688 - cg00214611 69792329
## cg00455876 chrY 9385539 - cg00455876 27653438
## cg01707559 chrY 6778695 + cg01707559 45652402
## AddressB ProbeSeqA
## <character> <character>
## cg00050873 31717405 ACAAAAAAACAACACACAACTATAATAATTTTTAAAATAAATAAACCCCA
## cg00212031 38703326 CCCAATTAACCACAAAAACTAAACAAATTATACAATCAAAAAAACATACA
## cg00213748 36767301 TTTTAACACCTAACACCATTTTAACAATAAAAATTCTACAAAAAAAAACA
## cg00214611 46723459 CTAACTTCCAAACCACACTTTATATACTAAACTACAATATAACACAAACA
## cg00455876 69732350 AACTCTAAACTACCCAACACAAACTCCAAAAACTTCTCAAAAAAAACTCA
## cg01707559 64689504 ACAAATTAAAAACACTAAAACAAACACAACAACTACAACAACAAAAAACA
## ProbeSeqB Type
## <character> <character>
## cg00050873 ACGAAAAAACAACGCACAACTATAATAATTTTTAAAATAAATAAACCCCG I
## cg00212031 CCCAATTAACCGCAAAAACTAAACAAATTATACGATCGAAAAAACGTACG I
## cg00213748 TTTTAACGCCTAACACCGTTTTAACGATAAAAATTCTACAAAAAAAAACG I
## cg00214611 CTAACTTCCGAACCGCGCTTTATATACTAAACTACAATATAACGCGAACG I
## cg00455876 AACTCTAAACTACCCGACACAAACTCCAAAAACTTCTCGAAAAAAACTCG I
## cg01707559 GCGAATTAAAAACACTAAAACGAACGCGACGACTACAACGACAAAAAACG I
## NextBase Color Probe_rs Probe_maf CpG_rs
## <character> <character> <character> <numeric> <character>
## cg00050873 A Red NA NA NA
## cg00212031 T Red NA NA NA
## cg00213748 A Red NA NA NA
## cg00214611 A Red NA NA NA
## cg00455876 A Red NA NA NA
## cg01707559 A Red NA NA NA
## CpG_maf SBE_rs SBE_maf Islands_Name
## <numeric> <character> <numeric> <character>
## cg00050873 NA NA NA chrY:9363680-9363943
## cg00212031 NA NA NA chrY:21238448-21240005
## cg00213748 NA NA NA chrY:8147877-8148210
## cg00214611 NA NA NA chrY:15815488-15815779
## cg00455876 NA NA NA chrY:9385471-9385777
## cg01707559 NA NA NA chrY:6778574-6780028
## Relation_to_Island
## <character>
## cg00050873 N_Shore
## cg00212031 Island
## cg00213748 S_Shore
## cg00214611 Island
## cg00455876 Island
## cg01707559 Island
## Forward_Sequence
## <character>
## cg00050873 TATCTCTGTCTGGCGAGGAGGCAACGCACAACTGTGGTGGTTTTTGGAGTGGGTGGACCC[CG]GCCAAGACGGCCTGGGCTGACCAGAGACGGGAGGCAGAAAAAGTGGGCAGGTGGTTGCAG
## cg00212031 CCATTGGCCCGCCCCAGTTGGCCGCAGGGACTGAGCAAGTTATGCGGTCGGGAAGACGTG[CG]TTAAAGGGCTGAAGGGGAGGGACGGAACTGACAGTCTCTGTGACAGCTCTGAGGTGGGAG
## cg00213748 TCTGTGGGACCATTTTAACGCCTGGCACCGTTTTAACGATGGAGGTTCTGCAGGAGGGGG[CG]ACCTGGGGTAGGAGGCGTGCTAGTGGTGGATGACATTGTGGCAGAGATGGAGGTGGTGGC
## cg00214611 GCGCCGGCAGGACTAGCTTCCGGGCCGCGCTTTGTGTGCTGGGCTGCAGTGTGGCGCGGG[CG]AGGAAGCTGGTAGGGCGGTTGTCGCAAGCTCCAGCTGCAGCCTCCGCCTACGTGAGAAGA
## cg00455876 CGCGTGTGCCTGGACTCTGAGCTACCCGGCACAAGCTCCAAGGGCTTCTCGGAGGAGGCT[CG]GGGACGGAAGGCGTGGGGTGAGTGGGCTGGAGATGCAGGCGCGCCCGTGGCTGTGCAGCC
## cg01707559 AGCGGCCGCTCCCAGTGGTGGTCACCGCCAGTGCCAATCCCTTGCGCCGCCGTGCAGTCC[CG]CCCTCTGTCGCTGCAGCCGCCGCGCCCGCTCCAGTGCCCCCAATTCGCGCTCGGGAGTGA
## SourceSeq Random_Loci
## <character> <character>
## cg00050873 CGGGGTCCACCCACTCCAAAAACCACCACAGTTGTGCGTTGCCTCCTCGC
## cg00212031 CGCACGTCTTCCCGACCGCATAACTTGCTCAGTCCCTGCGGCCAACTGGG
## cg00213748 CGCCCCCTCCTGCAGAACCTCCATCGTTAAAACGGTGCCAGGCGTTAAAA
## cg00214611 CGCCCGCGCCACACTGCAGCCCAGCACACAAAGCGCGGCCCGGAAGCTAG
## cg00455876 GACTCTGAGCTACCCGGCACAAGCTCCAAGGGCTTCTCGGAGGAGGCTCG
## cg01707559 CGCCCTCTGTCGCTGCAGCCGCCGCGCCCGCTCCAGTGCCCCCAATTCGC
## Methyl27_Loci UCSC_RefGene_Name UCSC_RefGene_Accession
## <character> <character> <character>
## cg00050873 TSPY4;FAM197Y2 NM_001164471;NR_001553
## cg00212031 TTTY14 NR_001543
## cg00213748
## cg00214611 TMSB4Y;TMSB4Y NM_004202;NM_004202
## cg00455876
## cg01707559 TBL1Y;TBL1Y;TBL1Y NM_134259;NM_033284;NM_134258
## UCSC_RefGene_Group Phantom DMR Enhancer
## <character> <character> <character> <character>
## cg00050873 Body;TSS1500
## cg00212031 TSS200
## cg00213748
## cg00214611 1stExon;5'UTR
## cg00455876
## cg01707559 TSS200;TSS200;TSS200
## HMM_Island Regulatory_Feature_Name
## <character> <character>
## cg00050873 Y:9973136-9976273
## cg00212031 Y:19697854-19699393
## cg00213748 Y:8207555-8208234
## cg00214611 Y:14324883-14325218 Y:15815422-15815706
## cg00455876 Y:9993394-9995882
## cg01707559 Y:6838022-6839951
## Regulatory_Feature_Group DHS
## <character> <character>
## cg00050873
## cg00212031
## cg00213748
## cg00214611 Promoter_Associated_Cell_type_specific
## cg00455876
## cg01707559
Next, we will load in the data. The sample sheet file is essentially a CSV (comma-separated) file containing one line per sample, with a number of columns describing each sample. The format expected by the read.450k.sheet function is based on the sample sheet file Illumina provides. It is also very similar to the targets file made popular by the limma package (see the extensive Limma User’s Guide). Raeding in the sample sheet essentially creates a data.frame. If they exist, a column named Sentrix_Position is renamed to Array and Sentrix_ID is renamed to Slide. The function also attempts to derive the path to the IDAT files using the baseDir and the Slide and Array information. This will be stored in a column named Basename.
## set up a path to the directory you donloaded the data to
dataDirectory = "D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data"
## read in the sample sheet for the experiment
targets = read.450k.sheet(dataDirectory, pattern="SampleSheet.csv")
## [read.450k.sheet] Found the following CSV files:
## [1] "D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/SampleSheet.csv"
targets
## Sample_Name Sample_Well Sample_Source Sample_Group Sample_Label
## 1 1 A1 M28 naive naive
## 2 2 B1 M28 rTreg rTreg
## 3 3 C1 M28 act_naive act_naive
## 4 4 D1 M29 naive naive
## 5 5 E1 M29 act_naive act_naive
## 6 6 F1 M29 act_rTreg act_rTreg
## 7 7 G1 M30 naive naive
## 8 8 H1 M30 rTreg rTreg
## 9 9 A2 M30 act_naive act_naive
## 10 10 B2 M30 act_rTreg act_rTreg
## 11 11 H06 VICS-72098-18-B birth birth
## Pool_ID Array Slide
## 1 NA R01C01 6264509100
## 2 NA R02C01 6264509100
## 3 NA R03C01 6264509100
## 4 NA R04C01 6264509100
## 5 NA R05C01 6264509100
## 6 NA R06C01 6264509100
## 7 NA R01C02 6264509100
## 8 NA R02C02 6264509100
## 9 NA R03C02 6264509100
## 10 NA R04C02 6264509100
## 11 NA R06C02 5975827018
## Basename
## 1 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R01C01
## 2 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R02C01
## 3 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R03C01
## 4 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R04C01
## 5 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R05C01
## 6 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R06C01
## 7 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R01C02
## 8 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R02C02
## 9 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R03C02
## 10 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R04C02
## 11 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/5975827018/5975827018_R06C02
Now that we have imported the information about what and where all the data is, we can load the raw methylation data into R from the IDAT files. We can then rename the samples with more descriptive names and calculate detection p-values, which will allow us to identify poor quality probes.
## read in the raw data from the IDAT files
rgSet = read.450k.exp(targets=targets)
rgSet
## RGChannelSet (storageMode: lockedEnvironment)
## assayData: 622399 features, 11 samples
## element names: Green, Red
## An object of class 'AnnotatedDataFrame'
## sampleNames: 6264509100_R01C01 6264509100_R02C01 ...
## 5975827018_R06C02 (11 total)
## varLabels: Sample_Name Sample_Well ... filenames (10 total)
## varMetadata: labelDescription
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
## give the samples meaningful names
targets$ID = paste(targets$Sample_Group,targets$Sample_Name,sep=".")
sampleNames(rgSet) = targets$ID
rgSet
## RGChannelSet (storageMode: lockedEnvironment)
## assayData: 622399 features, 11 samples
## element names: Green, Red
## An object of class 'AnnotatedDataFrame'
## sampleNames: naive.1 rTreg.2 ... birth.11 (11 total)
## varLabels: Sample_Name Sample_Well ... filenames (10 total)
## varMetadata: labelDescription
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
## calculate the detection P-values
detP = detectionP(rgSet)
head(detP)
## naive.1 rTreg.2 act_naive.3 naive.4 act_naive.5 act_rTreg.6
## cg00050873 0 0 0.000000e+00 0 0.00000e+00 0
## cg00212031 0 0 0.000000e+00 0 0.00000e+00 0
## cg00213748 0 0 1.181832e-12 0 8.21565e-15 0
## cg00214611 0 0 0.000000e+00 0 0.00000e+00 0
## cg00455876 0 0 0.000000e+00 0 0.00000e+00 0
## cg01707559 0 0 0.000000e+00 0 0.00000e+00 0
## naive.7 rTreg.8 act_naive.9 act_rTreg.10 birth.11
## cg00050873 0 0.000000e+00 0 0.000000e+00 0.0000000
## cg00212031 0 0.000000e+00 0 0.000000e+00 0.0000000
## cg00213748 0 1.469801e-05 0 1.365951e-08 0.6735224
## cg00214611 0 0.000000e+00 0 0.000000e+00 0.7344451
## cg00455876 0 0.000000e+00 0 0.000000e+00 0.0000000
## cg01707559 0 0.000000e+00 0 0.000000e+00 0.0000000
Plotting the mean detection p-value for each sample will allow us to gauge whether any samples have many failed probes - this will be indicated by a large mean detection p-value. Samples with mean detection p-values exceeding a cutoff such as o.05 can be excluded from further analysis.
pal = brewer.pal(8,"Dark2")
## look at the mean detection P-values across all samples to identify any failed samples
par(mfrow=c(1,2))
barplot(colMeans(detP),col=pal[factor(targets$Sample_Group)],las=2,cex.names=0.8,
main="Mean detection p-values")
abline(h=0.05,col="red")
legend("topleft",legend=levels(factor(targets$Sample_Group)),fill=pal,bg="white")
barplot(colMeans(detP),col=pal[factor(targets$Sample_Group)],las=2,cex.names=0.8,ylim=c(0,0.002),
main="Mean detection p-values")
abline(h=0.05,col="red")
legend("topleft",legend=levels(factor(targets$Sample_Group)),fill=pal,bg="white")
## remove failed samples
keep = colMeans(detP) < 0.05
rgSet = rgSet[,keep]
targets = targets[keep,]
detP = detP[,keep]
rgSet
## RGChannelSet (storageMode: lockedEnvironment)
## assayData: 622399 features, 10 samples
## element names: Green, Red
## An object of class 'AnnotatedDataFrame'
## sampleNames: naive.1 rTreg.2 ... act_rTreg.10 (10 total)
## varLabels: Sample_Name Sample_Well ... filenames (10 total)
## varMetadata: labelDescription
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
targets
## Sample_Name Sample_Well Sample_Source Sample_Group Sample_Label Pool_ID
## 1 1 A1 M28 naive naive NA
## 2 2 B1 M28 rTreg rTreg NA
## 3 3 C1 M28 act_naive act_naive NA
## 4 4 D1 M29 naive naive NA
## 5 5 E1 M29 act_naive act_naive NA
## 6 6 F1 M29 act_rTreg act_rTreg NA
## 7 7 G1 M30 naive naive NA
## 8 8 H1 M30 rTreg rTreg NA
## 9 9 A2 M30 act_naive act_naive NA
## 10 10 B2 M30 act_rTreg act_rTreg NA
## Array Slide
## 1 R01C01 6264509100
## 2 R02C01 6264509100
## 3 R03C01 6264509100
## 4 R04C01 6264509100
## 5 R05C01 6264509100
## 6 R06C01 6264509100
## 7 R01C02 6264509100
## 8 R02C02 6264509100
## 9 R03C02 6264509100
## 10 R04C02 6264509100
## Basename
## 1 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R01C01
## 2 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R02C01
## 3 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R03C01
## 4 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R04C01
## 5 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R05C01
## 6 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R06C01
## 7 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R01C02
## 8 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R02C02
## 9 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R03C02
## 10 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R04C02
## ID
## 1 naive.1
## 2 rTreg.2
## 3 act_naive.3
## 4 naive.4
## 5 act_naive.5
## 6 act_rTreg.6
## 7 naive.7
## 8 rTreg.8
## 9 act_naive.9
## 10 act_rTreg.10
You can also generate other quality control plots using the minfi qcReport function. The minfi vignette describes the various plots in this report and how they should be interpreted. Generally, samples that look poor based on mean detection p-value will also look poor using other metrics and it is usually advisable to exclude them from further analysis.
## produce a general QC report for the data
qcReport(rgSet,sampNames=targets$ID,sampGroups=targets$Sample_Group,pdf="qcReport.pdf")
To minimise the unwanted variation within and between samples, various data normalizations can be applied. As 450k arrays use 2 different probe types, Infinium I and II, on the same array, studies suggest that it is advisable to perform within-array normalization between the 2 probe types. Normalization can also be performed between arrays. Here we provide examples of 2 different normalization types: SWAN (within-array only) and SQN (within and between array).
## create a MethylSet object from the raw data
mSetRaw = preprocessRaw(rgSet)
mSetRaw
## MethylSet (storageMode: lockedEnvironment)
## assayData: 485512 features, 10 samples
## element names: Meth, Unmeth
## An object of class 'AnnotatedDataFrame'
## sampleNames: naive.1 rTreg.2 ... act_rTreg.10 (10 total)
## varLabels: Sample_Name Sample_Well ... filenames (10 total)
## varMetadata: labelDescription
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
## Preprocessing
## Method: Raw (no normalization or bg correction)
## minfi version: 1.14.0
## Manifest version: 0.4.0
## normalize the data
mSetSw = preprocessSWAN(rgSet = rgSet, mSet = mSetRaw, verbose=TRUE) ## SWAN method
## [preprocessSwan] Normalizing array 1 of 10
## [preprocessSwan] Normalizing array 2 of 10
## [preprocessSwan] Normalizing array 3 of 10
## [preprocessSwan] Normalizing array 4 of 10
## [preprocessSwan] Normalizing array 5 of 10
## [preprocessSwan] Normalizing array 6 of 10
## [preprocessSwan] Normalizing array 7 of 10
## [preprocessSwan] Normalizing array 8 of 10
## [preprocessSwan] Normalizing array 9 of 10
## [preprocessSwan] Normalizing array 10 of 10
mSetSw = mapToGenome(mSetSw)
mSetSq = preprocessQuantile(rgSet) ## SQN method
## [preprocessQuantile] Mapping to genome.
## [preprocessQuantile] Fixing outliers.
## Warning in .getSex(CN = CN, xIndex = xIndex, yIndex = yIndex, cutoff
## = cutoff): An inconsistency was encountered while determining sex. One
## possibility is that only one sex is present. We recommend further checks,
## for example with the plotSex function.
## [preprocessQuantile] Quantile normalizing.
par(mfrow=c(1,3))
densityPlot(rgSet, sampGroups = targets$Sample_Group,main="Raw")
densityPlot(getBeta(mSetSw), sampGroups = targets$Sample_Group,main="SWAN")
densityPlot(getBeta(mSetSq), sampGroups = targets$Sample_Group,main="SQN")
Multi dimensional scaling (MDS) plots are excellent for visualising your data, and are usually one of the first kinds of plots you should make when exploring your data. It is based on principle componenents analysis and is an unsupervised method for looking at the similarities and differences between the samples of the data. Samples that are more similar to each other should cluster together, and samples that are very different should be further apart on the plot. Dimension 1 (or principle component 1) captures the greatest source of variation in the data, dimension 2 captures the second greatest source of variation in the data and so on. Colouring in the data points or labels by known factors of interest can often pull out exactly what the greatest sources of variation are in the data. It’s also possible to use MDSplots for figuring out sample mix-ups.
## MDS plots to look at largets sources of variation
par(mfrow=c(1,2))
plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Group)])
legend("topright",legend=levels(factor(targets$Sample_Group)),text.col=pal)
plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Source)])
legend("topright",legend=levels(factor(targets$Sample_Source)),text.col=pal)
par(mfrow=c(1,1))
## Examine higher dimensions to look at other sources of variation
plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Group)], dim=c(1,3))
legend("topright",legend=levels(factor(targets$Sample_Group)),text.col=pal)
plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Group)], dim=c(2,3))
legend("topright",legend=levels(factor(targets$Sample_Group)),text.col=pal)
plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Group)], dim=c(3,4))
legend("topright",legend=levels(factor(targets$Sample_Group)),text.col=pal)
Poor performing probes are generally filtered out prior to differential methylation analysis. As the signal from these probes is unreliable, by removing them we perform fewer statistical tests and thus incur a reduced multiple testing adjustment. We filter out probes that have failed in one or more samples based on detection p-value. The detection p-values are automatically calculated using the signal from the negative control probes on the array; probes that are determined to not have a signal significantly different from the background level have high detection p-values.
## remove any probes that have failed in one or more samples
detP = detP[match(featureNames(mSetSq),rownames(detP)),] # ensure probes are in the same order in the mSetSq and detP objects
keep = rowSums(detP < 0.01) == ncol(mSetSq)
table(keep)
## keep
## FALSE TRUE
## 977 484535
mSetSqFlt = mSetSq[keep,]
mSetSqFlt
## class: GenomicRatioSet
## dim: 484535 10
## exptData(0):
## assays(2): M CN
## rownames(484535): cg13869341 cg14008030 ... cg08265308 cg14273923
## rowRanges metadata column names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
## predictedSex
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
## Preprocessing
## Method: Raw (no normalization or bg correction)
## minfi version: 1.14.0
## Manifest version: 0.4.0
Depending on the nature of your samples and your biological question you may also choose to filter out the probes from the X and Y chromosomes or probes that are known to have common SNPs at the CpG site. This can be done using the code below, however, we will not be filtering out these probes as part of the analysis in this workshop.
## if your data infludes males and females, remove the sex chromosomes
keep = !(featureNames(mSetSqFlt) %in% ann450k$Name[ann450k$chr %in% c("chrX","chrY")])
table(keep)
mSetSqFlt = mSetSqFlt[keep,]
## remove probes with SNPs at CpG or SBE site
mSetSqFlt = dropLociWithSnps(mSetSqFlt)
We will also filter out probes that have shown to be cross-reactive i.e. map to multiple places in the genome. This list was oroginally published by Chen et al. in 2013 and can be downloaded directly from the author’s website. The Excel spreadsheet containing the cross-reactive probes is also available as part of the data download for this workshop.
## exclude cross reactive probes
xReactiveProbes = read.csv(file=paste(dataDirectory,"48639-non-specific-probes-Illumina450k.csv",sep="/"), stringsAsFactors=FALSE)
keep = !(featureNames(mSetSqFlt) %in% xReactiveProbes$TargetID)
table(keep)
## keep
## FALSE TRUE
## 29171 455364
mSetSqFlt = mSetSqFlt[keep,]
mSetSqFlt
## class: GenomicRatioSet
## dim: 455364 10
## exptData(0):
## assays(2): M CN
## rownames(455364): cg13869341 cg24669183 ... cg08265308 cg14273923
## rowRanges metadata column names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
## predictedSex
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
## Preprocessing
## Method: Raw (no normalization or bg correction)
## minfi version: 1.14.0
## Manifest version: 0.4.0
After the data has been normalised and filtered, we can calculate M and \(\beta\) values. \[M=log2(methylated/unmethylated)\] \[\beta=methylated/(methylated+unmethylated)\] M-values have nicer statistical properties and are thus better for use in statistical analysis of methylation data whilst \(\beta\) values are easy to interpret and are thus better for displaying data. A detailed comparison of M and \(\beta\) values was published by Du et al. in 2010.
## calculate M-values for statistical analysis
mVals = getM(mSetSqFlt)
head(mVals)
## naive.1 rTreg.2 act_naive.3 naive.4 act_naive.5
## cg13869341 2.421276 2.515948 2.165745 2.286314 2.109441
## cg24669183 2.169414 2.235964 2.280734 1.632309 2.184435
## cg15560884 1.761176 1.577578 1.597503 1.777486 1.764999
## cg01014490 -3.504268 -3.825119 -5.384735 -4.537864 -4.296526
## cg17505339 3.082191 3.924931 4.163206 3.255373 3.654134
## cg11954957 1.546401 1.912204 1.727910 2.441267 1.618331
## act_rTreg.6 naive.7 rTreg.8 act_naive.9 act_rTreg.10
## cg13869341 2.083313 3.004332 2.949430 3.323265 2.920691
## cg24669183 2.175771 1.914738 2.537203 2.085423 2.090007
## cg15560884 1.760819 1.709728 1.655782 1.834341 1.874285
## cg01014490 -3.859792 -5.313593 -5.100400 -3.575205 -4.360973
## cg17505339 3.701096 3.000690 2.768876 3.185991 3.848438
## cg11954957 2.107829 1.853192 2.494570 2.470994 2.323683
bVals = getBeta(mSetSqFlt)
head(bVals)
## naive.1 rTreg.2 act_naive.3 naive.4 act_naive.5
## cg13869341 0.84267937 0.85118462 0.8177504 0.82987650 0.81186174
## cg24669183 0.81812908 0.82489238 0.8293297 0.75610281 0.81967323
## cg15560884 0.77219626 0.74903910 0.7516263 0.77417882 0.77266205
## cg01014490 0.08098986 0.06590459 0.0233755 0.04127262 0.04842397
## cg17505339 0.89439216 0.93822870 0.9471357 0.90520570 0.92641305
## cg11954957 0.74495496 0.79008516 0.7681146 0.84450764 0.75431167
## act_rTreg.6 naive.7 rTreg.8 act_naive.9 act_rTreg.10
## cg13869341 0.8090798 0.8891851 0.88537940 0.90916748 0.88334231
## cg24669183 0.8187838 0.7903763 0.85304116 0.80930568 0.80979554
## cg15560884 0.7721528 0.7658623 0.75909061 0.78099397 0.78569274
## cg01014490 0.0644404 0.0245281 0.02832358 0.07740468 0.04640659
## cg17505339 0.9286016 0.8889361 0.87205348 0.90099782 0.93508348
## cg11954957 0.8116911 0.7832207 0.84929777 0.84719430 0.83350220
par(mfrow=c(1,2))
densityPlot(bVals,sampGroups = targets$Sample_Group,main="Beta values")
densityPlot(mVals,sampGroups = targets$Sample_Group,main="M values")
The primary biological goal for this dataset is to discover differentially methylated probes between the different cell types. However, there is another factor that we need to take into account when we perform the modelling. In the targets file, there is a column called Sample_Source, and this refers to the individuals from whom the samples came. In this dataset, any one particular individual contributes more than one cell type. For example, individual M28 contributes naive, rTreg and actnaive samples. From the MDS plots we know that the samples cluster according to individual first, and then by cell type, implying that individual is the greatest source of variation in methylation. In order to get the cell type differences, we need to take into account individual variation. Hence when we specify our design matrix, we need to include two factors: individual and cell type. This style of analysis is called a paired analysis; differences between cell types are calculated within each individual, and then these differences are summed across individuals to determine whether overall there is a significant difference in the mean methylation level for each CpG site.
cellType = factor(targets$Sample_Group) ## this is the factor of interest
individual = factor(targets$Sample_Source) ## this is the individual effect that we need to account for
## use the above to create a design matrix
design = model.matrix(~0+cellType+individual, data=targets)
colnames(design) = c(levels(cellType),levels(individual)[-1])
fit = lmFit(mVals, design) ## fit the linear model
## create a contrast matrix for specific comparisons
contMatrix = makeContrasts(naive-rTreg,
naive-act_naive,
rTreg-act_rTreg,
act_naive-act_rTreg,
levels=design)
contMatrix
## Contrasts
## Levels naive - rTreg naive - act_naive rTreg - act_rTreg
## act_naive 0 -1 0
## act_rTreg 0 0 -1
## naive 1 1 0
## rTreg -1 0 1
## M29 0 0 0
## M30 0 0 0
## Contrasts
## Levels act_naive - act_rTreg
## act_naive 1
## act_rTreg -1
## naive 0
## rTreg 0
## M29 0
## M30 0
## fit the contrasts
fit2 = contrasts.fit(fit, contMatrix)
fit2 = eBayes(fit2)
## look at the numbers of DM CpGs at FDR < 0.05
summary(decideTests(fit2))
## naive - rTreg naive - act_naive rTreg - act_rTreg act_naive - act_rTreg
## -1 1671 409 0 579
## 0 452255 454729 455364 453838
## 1 1438 226 0 947
## get the table of results for the first contrast (naive - rTreg)
ann450kSub = ann450k[match(rownames(mVals),ann450k$Name),c(1:4,12:19,24:ncol(ann450k))]
DMPs = topTable(fit2, num=Inf, coef=1, genelist = ann450kSub)
head(DMPs)
## chr pos strand Name Probe_rs Probe_maf CpG_rs
## cg07499259 chr1 12188502 + cg07499259 <NA> NA <NA>
## cg26992245 chr8 29848579 - cg26992245 <NA> NA <NA>
## cg09747445 chr15 70387268 - cg09747445 <NA> NA <NA>
## cg18808929 chr8 61825469 - cg18808929 <NA> NA <NA>
## cg25015733 chr2 99342986 - cg25015733 <NA> NA <NA>
## cg21179654 chr3 114057297 + cg21179654 <NA> NA <NA>
## CpG_maf SBE_rs SBE_maf Islands_Name
## cg07499259 NA <NA> NA
## cg26992245 NA <NA> NA
## cg09747445 NA <NA> NA chr15:70387929-70393206
## cg18808929 NA <NA> NA chr8:61822358-61823028
## cg25015733 NA <NA> NA chr2:99346882-99348177
## cg21179654 NA <NA> NA
## Relation_to_Island
## cg07499259 OpenSea
## cg26992245 OpenSea
## cg09747445 N_Shore
## cg18808929 S_Shelf
## cg25015733 N_Shelf
## cg21179654 OpenSea
## UCSC_RefGene_Name
## cg07499259 TNFRSF8;TNFRSF8
## cg26992245
## cg09747445 TLE3;TLE3;TLE3
## cg18808929
## cg25015733 MGAT4A
## cg21179654 ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20
## UCSC_RefGene_Accession
## cg07499259 NM_152942;NM_001243
## cg26992245
## cg09747445 NM_001105192;NM_020908;NM_005078
## cg18808929
## cg25015733 NM_012214
## cg21179654 NM_001164343;NM_001164346;NM_001164345;NM_001164342;NM_001164344;NM_001164347;NM_015642
## UCSC_RefGene_Group Phantom DMR Enhancer
## cg07499259 5'UTR;Body
## cg26992245 TRUE
## cg09747445 Body;Body;Body
## cg18808929 TRUE
## cg25015733 5'UTR
## cg21179654 3'UTR;3'UTR;3'UTR;3'UTR;3'UTR;3'UTR;3'UTR
## HMM_Island Regulatory_Feature_Name
## cg07499259 1:12111023-12111225
## cg26992245
## cg09747445
## cg18808929
## cg25015733
## cg21179654 3:114057192-114057775
## Regulatory_Feature_Group DHS logFC AveExpr
## cg07499259 3.654104 2.46652171
## cg26992245 4.450696 -0.09180715
## cg09747445 -3.337299 -0.25201484
## cg18808929 -2.990263 0.77522878
## cg25015733 -3.054336 0.83280190
## cg21179654 Unclassified_Cell_type_specific 2.859016 1.32460816
## t P.Value adj.P.Val B
## cg07499259 18.70862 7.257675e-08 0.005243551 7.449952
## cg26992245 18.31538 8.565762e-08 0.005243551 7.358404
## cg09747445 -18.21811 8.928886e-08 0.005243551 7.335167
## cg18808929 -17.86928 1.038065e-07 0.005243551 7.249834
## cg25015733 -17.29868 1.336369e-07 0.005243551 7.103198
## cg21179654 17.24589 1.368512e-07 0.005243551 7.089171
The results of the analysis for the first comparison, naive vs rTreg, are stored in the DMPs object. We can write out this file as a comma separated file, which can be opened in excel.
write.table(DMPs, file="DMPs.csv", sep=",", row.names=FALSE)
It is always useful to plot the top differentially methylated CpG sites - it serves as a sanity check to make sure the results make sense, and if they don’t it usually means there is an error in the code somewhere, or perhaps the design matrix was not set up correctly. It is easier to interpret methylation on the beta value scale, so although we always perform analysis on the Mvalue scale, we usually visualise data on the beta value scale.
par(mfrow=c(2,2))
sapply(rownames(DMPs)[1:4], function(cpg){
plotCpg(bVals, cpg=cpg, pheno = targets$Sample_Group)
})
## $cg07499259
## NULL
##
## $cg26992245
## NULL
##
## $cg09747445
## NULL
##
## $cg18808929
## NULL
Once you have performed a differential methylation analysis, there may be a very long list of significant CpG sites to interpret. One question a researcher may have is which gene pathways may be targeted by the significant CpGs. Sometimes it is easy to link the top differentially methylated CpGs to genes that make sense in terms of the cell types/samples being studied, but there can be many more thousands of CpGs significantly differentially methylated. In order to get an idea of the biological processes that the significant CpGs may be involved in, we can perform gene ontology testing with the gometh function in the missMethyl package.
Let us consider the first comparison, naive vs rTreg, with the results of the analysis in the DMPs table. The gometh function takes as input a character vector of the names (e.g. cg20832020) of the significant CpG sites, and optionally, a character vector of all CpGs tested. This is recommended if you have performed quite a lot of filtering of the CpGs before analysis. In the DMPs table, the Name column corresponds to the CpG name. We will select all CpG sites that have adjusted p-value of less than 0.05.
## Get the significant CpG sites at less than 5% FDR
sigCpGs = DMPs$Name[DMPs$adj.P.Val<0.05]
## First 10 significant CpGs
sigCpGs[1:10]
## [1] "cg07499259" "cg26992245" "cg09747445" "cg18808929" "cg25015733"
## [6] "cg21179654" "cg26280976" "cg16943019" "cg10898310" "cg25130381"
## Total number of significant CpGs at 5% FDR
length(sigCpGs)
## [1] 3109
## Get all the CpG sites used in the analysis to form the background
all = DMPs$Name
## Total number of CpG sites tested
length(all)
## [1] 455364
What makes gometh work particularly well with methylation data is that it can take into account the varying numbers of CpGs associated with genes. For the 450K array, the numbers of CpGs mapping to genes can vary from as few as 1 to as many as 1200. The genes that have more CpGs associated with them will have a higher probability of being identified as differentially methylated compared to genes with fewer CpGs. We can look at the bias in the data by specifying plot=TRUE in the call to gometh.
par(mfrow=c(1,1))
gst <- gometh(sig.cpg=sigCpGs, all.cpg=all, plot.bias=TRUE)
## Warning in alias2SymbolTable(flat$symbol): Multiple symbols ignored for one
## or more aliases
The gst object is a dataframe with each row corresponding to the GO category being tested. The top 20 gene ontology categories can be displayed using the topGO function:
## Top 20 GO categories
topGO(gst)
## Term Ont N DE
## GO:0002376 immune system process BP 2271 347
## GO:0001775 cell activation BP 874 165
## GO:0045321 leukocyte activation BP 643 127
## GO:0002682 regulation of immune system process BP 1156 190
## GO:0002684 positive regulation of immune system process BP 764 140
## GO:0046649 lymphocyte activation BP 540 112
## GO:0006955 immune response BP 1363 202
## GO:0007159 leukocyte cell-cell adhesion BP 430 94
## GO:0048583 regulation of response to stimulus BP 3259 440
## GO:0042110 T cell activation BP 390 87
## GO:0070489 T cell aggregation BP 390 87
## GO:0071593 lymphocyte aggregation BP 392 87
## GO:0070486 leukocyte aggregation BP 398 88
## GO:0006952 defense response BP 1455 202
## GO:0050778 positive regulation of immune response BP 525 100
## GO:0002520 immune system development BP 763 139
## GO:0034109 homotypic cell-cell adhesion BP 458 95
## GO:0050776 regulation of immune response BP 752 128
## GO:0002253 activation of immune response BP 402 83
## GO:0007166 cell surface receptor signaling pathway BP 3269 399
## P.DE FDR
## GO:0002376 1.086240e-28 2.118603e-24
## GO:0001775 2.218695e-20 2.163671e-16
## GO:0045321 5.486652e-18 3.567055e-14
## GO:0002682 8.340629e-18 3.685697e-14
## GO:0002684 9.448567e-18 3.685697e-14
## GO:0046649 2.448200e-17 7.958281e-14
## GO:0006955 3.221465e-17 8.975923e-14
## GO:0007159 3.901551e-16 9.511982e-13
## GO:0048583 4.523733e-16 9.803432e-13
## GO:0042110 2.898667e-15 5.139601e-12
## GO:0070489 2.898667e-15 5.139601e-12
## GO:0071593 4.318704e-15 6.809324e-12
## GO:0070486 4.538618e-15 6.809324e-12
## GO:0006952 2.464757e-14 3.433759e-11
## GO:0050778 2.723173e-14 3.540851e-11
## GO:0002520 4.112259e-14 5.012844e-11
## GO:0034109 7.733471e-14 8.872565e-11
## GO:0050776 8.507945e-14 9.218831e-11
## GO:0002253 2.216633e-13 2.275432e-10
## GO:0007166 3.338006e-13 3.243997e-10
Here we can see lots of GO categories corresponding to immune system and T cell processes, which makes sense as the cell types being studied form part of the immune system. The total number of significant GO categories, adjusted for multiple testing, is:
## Total number of significant GO categories at 5% FDR
sum(gst$FDR<0.05)
## [1] 447
Rather than testing for differences in mean methylation, we may be interested in testing for differences between group variances. For example, it has been hypothesised that highly variable CpGs in cancer are important for tumour progression. Hence we may be interested in CpG sites that are consistently methylated in one group, but variably methylated in another group. In general we recommend at least 10 samples in each group for testing differential variability in order to get a good estimate of the variance. It is more tricky to accurately estimate variances as opposed to means, so you generally need a larger sample size for testing differences in group variances as opposing to testing differences in group means. For the purpose of this workshop, we won’t worry about this sample size issue.
The way we specify which groups we are interested in testing is a little bit different to the way you specify a model in limma for differential methylation, particularly if you fit an intercept model. For our data, we have set up a design matrix without an intercept term, and we will specify which groups we want to test for differential variability in a contrasts statement. We can use the same design matrix that we set up for testing differential methylation.
## Fit the model for differential variability, specifying that the four cell types are the groups of interest
fitvar <- varFit(mVals, design = design, coef = c(1,2,3,4))
## Specify the groups we are interested in testing for differential variability
contr <- makeContrasts(naive-rTreg,
naive-act_naive,
rTreg-act_rTreg,
act_naive-act_rTreg,
levels=design)
fitvar <- contrasts.varFit(fitvar,contrasts=contr)
## Summary of differential variability
summary(decideTests(fitvar))
## naive - rTreg naive - act_naive rTreg - act_rTreg act_naive - act_rTreg
## -1 46940 27942 45614 47639
## 0 373371 398381 365505 375064
## 1 35053 29041 44245 32661
topDV <- topVar(fitvar, coef=1)
## Top 10 differentially variable CpGs between naive and rTreg
topDV
## SampleVar LogVarRatio DiffLevene t P.Value
## cg17800497 3.8641666 -5.477617 -4.138581 -10.962447 5.802882e-28
## cg10962762 4.1689981 -5.519295 -4.559108 -10.832188 2.427623e-27
## cg00143364 1.9321804 -5.431527 -3.011704 -9.763057 1.623980e-22
## cg07665222 0.6970484 -10.665516 -1.658147 -9.647707 5.032801e-22
## cg24338920 0.2121397 -9.069818 -1.257323 -9.606404 7.521706e-22
## cg20560169 6.2982954 7.100821 4.514489 9.552649 1.265693e-21
## cg00731661 1.5733049 -5.156593 -3.247191 -9.307495 1.310201e-20
## cg03281139 0.2439198 -9.684861 -1.257629 -9.024908 1.800136e-19
## cg05257260 1.2329180 -5.816190 -2.235748 -8.946056 3.687428e-19
## cg27594095 1.9241378 -5.588855 -2.478976 -8.942318 3.814327e-19
## Adj.P.Value
## cg17800497 2.642424e-22
## cg10962762 5.527261e-22
## cg00143364 2.465006e-17
## cg07665222 5.729391e-17
## cg24338920 6.850229e-17
## cg20560169 9.605850e-17
## cg00731661 8.523119e-16
## cg03281139 1.024647e-14
## cg05257260 1.736907e-14
## cg27594095 1.736907e-14
Just as we had a look at the beta values for the significant differentially methylated CpGs, it is useful to plot differentially variable CpGs too.
par(mfrow=c(2,2))
sapply(rownames(topDV)[1:4], function(cpg){
plotCpg(bVals, cpg=cpg, pheno = targets$Sample_Group)
})
## $cg17800497
## NULL
##
## $cg10962762
## NULL
##
## $cg00143364
## NULL
##
## $cg07665222
## NULL
An example of testing for differential variability when you have a design matrix with an intercept term is detailed in the missMethyl vignette.
Although performing a probe-wise analysis is useful and informative, sometimes we are interested in knowing whether several proximal CpGs are concordantly differentially methylated i.e. we want to identify differentially methylated regions. There are several Bioconductor packages that can identify differentially methylated regions from 450k data. Some of the most popular are DMRcate, minfi plus bumphunter and charm. They are each based on different statistical methods. In this workshop, we are using DMRcate because it is faster than the other methods and is based on limma, so we can directly use the design and contMatrix we previously defined.
Firstly, our matrix of M-values is annotated with relevant information about the probes such as their genomic position, gene annotation, etc. By default, this is done using the ilmn12.hg19 annotation, but this can be substituted for any argument compatible with the interface provided by the minfi package. The backbone of the limma pipeline is then used for differential array analysis to calculate t-statistics and, optionally, filter probes by their FDR-corrected p-value.
myAnnotation = cpg.annotate(mVals, analysis.type="differential", design=design, contrasts = TRUE,
cont.matrix = contMatrix, coef="naive - rTreg")
## Your contrast returned 3109 individually significant probes. We recommend the default setting of pcutoff in dmrcate().
str(myAnnotation)
## List of 8
## $ ID : Factor w/ 455364 levels "cg00000029","cg00000108",..: 240653 405745 269558 20085 300229 209908 392600 349036 288638 3548 ...
## $ weights: num [1:455364] 0.0488 -2.0753 0.7697 -0.0304 -0.7638 ...
## $ CHR : Factor w/ 24 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pos : int [1:455364] 15865 534242 710097 714177 720865 758829 763119 765028 779995 790667 ...
## $ gene : Factor w/ 41377 levels "","A1BG","A1CF;A1CF;A1CF",..: 39230 1 1 1 1 1 23874 20098 20098 1 ...
## $ group : Factor w/ 4077 levels "","1stExon","1stExon;1stExon",..: 2001 1 1 1 1 1 3197 2001 2001 1 ...
## $ betafc : num [1:455364] 0.00039 -0.04534 0.01594 0.00251 -0.00869 ...
## $ indfdr : num [1:455364] 0.994 0.567 0.872 0.997 0.873 ...
## - attr(*, "row.names")= int [1:455364] 440613 57689 242178 446183 243529 192153 275582 271 7655 80075 ...
## - attr(*, "class")= chr "annot"
We can then use the dmrcate function to look for differentially methylated regions. The main output table DMRs$results contains all of the regions found, along with their genomic annotations and p-values.
DMRs = dmrcate(myAnnotation, lambda=1000, C=2)
## Fitting chr1...
## Fitting chr10...
## Fitting chr11...
## Fitting chr12...
## Fitting chr13...
## Fitting chr14...
## Fitting chr15...
## Fitting chr16...
## Fitting chr17...
## Fitting chr18...
## Fitting chr19...
## Fitting chr2...
## Fitting chr20...
## Fitting chr21...
## Fitting chr22...
## Fitting chr3...
## Fitting chr4...
## Fitting chr5...
## Fitting chr6...
## Fitting chr7...
## Fitting chr8...
## Fitting chr9...
## Fitting chrX...
## Fitting chrY...
## Demarcating regions...
## Done!
head(DMRs$results)
## gene_assoc group hg19coord
## 863 PHF1,CUTA Body,3'UTR,1stExon chr6:33383412-33385365
## 724 TIGIT TSS1500,TSS200,5'UTR,1stExon chr3:114012316-114012912
## 20 TNFRSF8 5'UTR,Body chr1:12188502-12188637
## 343 APBA2 TSS1500,TSS200,5'UTR,1stExon chr15:29212889-29213860
## 903 ZC3H12D 5'UTR,1stExon,TSS200,TSS1500 chr6:149805995-149806732
## 42 SLC9A1 Body chr1:27440463-27440721
## no.probes minpval meanpval maxbetafc
## 863 19 1.126484e-198 6.213227e-21 0.5469172
## 724 5 4.754071e-181 4.451614e-146 0.5434277
## 20 3 2.002732e-131 3.316704e-131 0.3744725
## 343 8 2.106110e-125 4.567477e-59 -0.4453199
## 903 11 1.599959e-119 1.156680e-64 0.4879785
## 42 3 4.314861e-116 1.121256e-111 -0.4526273
The regions can be quickly examined using the DMR.plot function provided in the DMRcate package.
par(mfrow=c(1,1))
DMR.plot(dmrcoutput=DMRs, dmr=1, betas=bVals, phen.col=pal[factor(targets$Sample_Group)],
pch=16, toscale=TRUE, plotmedians=TRUE)
legend("bottomleft",legend=levels(factor(targets$Sample_Group)),col=pal,lty=1)
The Gviz package offers some powerful functionality for plotting genomic data in its genomic context. Their extensive vignette covers the various types of plots that can be produced using their framework. We will re-plot the top differentially methylated region from the previous analysis to demonstrate the richness of the visualisations that can be created.
Firstly, we will set up the genomic region we would like to plot by extracting the genomic coordinates of the top differentially methylated region.
gen = "hg19"
coords = strsplit2(DMRs$results$hg19coord[1],":")
chrom = coords[1]
minbase = as.numeric(strsplit2(coords[2],"-")[1]) - 2000
maxbase = as.numeric(strsplit2(coords[2],"-")[2]) + 2000
Next we will load up some genomic annotations of interest such as the locations of CpG islands and DNAseI hypersensitive sites.
## CpG islands
islandHMM = read.csv(paste(dataDirectory,"model-based-cpg-islands-hg19.txt",sep="/"),
sep="\t", stringsAsFactors =FALSE, header=TRUE)
head(islandHMM)
## chr start end length CpGcount GCcontent pctGC obsExp
## 1 chr10 93098 93818 721 32 403 0.559 0.572
## 2 chr10 94002 94165 164 12 97 0.591 0.841
## 3 chr10 94527 95302 776 65 538 0.693 0.702
## 4 chr10 119652 120193 542 53 369 0.681 0.866
## 5 chr10 122133 122621 489 51 339 0.693 0.880
## 6 chr10 180265 180720 456 32 256 0.561 0.893
islandData = GRanges(seqnames = Rle(islandHMM$chr),
ranges = IRanges(islandHMM$start, end = islandHMM$end),
strand = Rle(strand(rep("*",nrow(islandHMM)))))
islandData = islandData[seqnames(islandData) == chrom &
(start(islandData) >= minbase & end(islandData) <= maxbase)] # this is to reduce the amount of data in memory
islandData
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr6 [33385344, 33386207] *
## -------
## seqinfo: 81 sequences from an unspecified genome; no seqlengths
## DNAseI hypersensitive sites
dnase = read.csv(paste(dataDirectory,"wgEncodeRegDnaseClusteredV3.bed",sep="/"),
sep="\t",stringsAsFactors=FALSE,header=FALSE)
head(dnase)
## V1 V2 V3 V4 V5 V6
## 1 chr1 10100 10330 38 261 38
## 2 chr1 10345 10590 4 310 4
## 3 chr1 16100 16315 5 158 5
## 4 chr1 65905 66055 1 157 1
## 5 chr1 91405 91615 4 278 4
## 6 chr1 115600 115790 3 545 3
## V7
## 1 3,12,13,15,21,22,32,37,36,38,39,40,50,56,57,58,59,60,53,54,62,70,76,85,93,95,103,111,117,120,1,31,77,79,73,75,87,116,
## 2 10,85,95,31,
## 3 19,13,15,46,111,
## 4 19,
## 5 9,26,72,12,
## 6 15,103,86,
## V8
## 1 50,247,129,38,52,89,138,61,54,65,35,108,198,34,68,31,48,26,59,42,109,34,105,253,56,204,99,261,101,97,19,59,88,53,72,49,46,140,
## 2 37,142,124,310,
## 3 143,158,102,33,80,
## 4 157,
## 5 172,278,223,62,
## 6 324,57,545,
dnaseData = GRanges(seqnames = dnase[,1],
ranges = IRanges(dnase[,2], end = dnase[,3]),
strand = Rle(rep("*",nrow(dnase))),
data = dnase[,5])
dnaseData = dnaseData[seqnames(dnaseData) == chrom &
(start(dnaseData) >= minbase & end(dnaseData) <= maxbase)] # this is to reduce the amount of data in memory
dnaseData
## GRanges object with 14 ranges and 1 metadata column:
## seqnames ranges strand | data
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr6 [33381425, 33381575] * | 223
## [2] chr6 [33382000, 33382210] * | 386
## [3] chr6 [33382260, 33382410] * | 101
## [4] chr6 [33382460, 33382715] * | 1000
## [5] chr6 [33382765, 33382995] * | 290
## ... ... ... ... ... ...
## [10] chr6 [33384905, 33385555] * | 1000
## [11] chr6 [33385625, 33386275] * | 1000
## [12] chr6 [33386345, 33386495] * | 362
## [13] chr6 [33386500, 33386515] * | 362
## [14] chr6 [33386620, 33387055] * | 777
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Now, set up the ideogram, genome and RefSeq tracks that will provide context for our methylation data.
iTrack = IdeogramTrack(genome = gen, chromosome = chrom, name="")
gTrack = GenomeAxisTrack(col="black", cex=1, name="", fontcolor="black")
rTrack = UcscTrack(genome = gen, chromosome = chrom, track = "refGene", from = minbase,
to = maxbase, trackType = "GeneRegionTrack", rstarts = "exonStarts",
rends = "exonEnds", gene = "name", symbol = "name2", transcript = "name",
strand = "strand", fill = "darkblue",stacking = "squish", name = "RefSeq",
showId=TRUE, geneSymbol=TRUE)
Ensure that the methylation data is ordered by chromosome and base position.
ann450kOrd = ann450kSub[order(ann450kSub$chr,ann450kSub$pos),]
head(ann450kOrd)
## DataFrame with 6 rows and 22 columns
## chr pos strand Name Probe_rs
## <character> <integer> <character> <character> <character>
## cg13869341 chr1 15865 + cg13869341 NA
## cg24669183 chr1 534242 - cg24669183 rs6680725
## cg15560884 chr1 710097 + cg15560884 NA
## cg01014490 chr1 714177 - cg01014490 NA
## cg17505339 chr1 720865 - cg17505339 NA
## cg11954957 chr1 758829 + cg11954957 rs115498424
## Probe_maf CpG_rs CpG_maf SBE_rs SBE_maf
## <numeric> <character> <numeric> <character> <numeric>
## cg13869341 NA NA NA NA NA
## cg24669183 0.108100 NA NA NA NA
## cg15560884 NA NA NA NA NA
## cg01014490 NA NA NA NA NA
## cg17505339 NA NA NA NA NA
## cg11954957 0.029514 NA NA NA NA
## Islands_Name Relation_to_Island UCSC_RefGene_Name
## <character> <character> <character>
## cg13869341 OpenSea WASH5P
## cg24669183 chr1:533219-534114 S_Shore
## cg15560884 chr1:713984-714547 N_Shelf
## cg01014490 chr1:713984-714547 Island
## cg17505339 OpenSea
## cg11954957 chr1:762416-763445 N_Shelf
## UCSC_RefGene_Accession UCSC_RefGene_Group Phantom
## <character> <character> <character>
## cg13869341 NR_024540 Body
## cg24669183
## cg15560884
## cg01014490
## cg17505339
## cg11954957
## DMR Enhancer HMM_Island Regulatory_Feature_Name
## <character> <character> <character> <character>
## cg13869341
## cg24669183 1:523025-524193
## cg15560884
## cg01014490 1:703784-704410 1:713802-715219
## cg17505339
## cg11954957
## Regulatory_Feature_Group DHS
## <character> <character>
## cg13869341
## cg24669183
## cg15560884
## cg01014490 Promoter_Associated
## cg17505339
## cg11954957
bValsOrd = bVals[match(ann450kOrd$Name,rownames(bVals)),]
head(bValsOrd)
## naive.1 rTreg.2 act_naive.3 naive.4 act_naive.5
## cg13869341 0.84267937 0.85118462 0.8177504 0.82987650 0.81186174
## cg24669183 0.81812908 0.82489238 0.8293297 0.75610281 0.81967323
## cg15560884 0.77219626 0.74903910 0.7516263 0.77417882 0.77266205
## cg01014490 0.08098986 0.06590459 0.0233755 0.04127262 0.04842397
## cg17505339 0.89439216 0.93822870 0.9471357 0.90520570 0.92641305
## cg11954957 0.74495496 0.79008516 0.7681146 0.84450764 0.75431167
## act_rTreg.6 naive.7 rTreg.8 act_naive.9 act_rTreg.10
## cg13869341 0.8090798 0.8891851 0.88537940 0.90916748 0.88334231
## cg24669183 0.8187838 0.7903763 0.85304116 0.80930568 0.80979554
## cg15560884 0.7721528 0.7658623 0.75909061 0.78099397 0.78569274
## cg01014490 0.0644404 0.0245281 0.02832358 0.07740468 0.04640659
## cg17505339 0.9286016 0.8889361 0.87205348 0.90099782 0.93508348
## cg11954957 0.8116911 0.7832207 0.84929777 0.84719430 0.83350220
Create the data tracks using the appropriate track type for each data type.
methTrack = DataTrack(start=ann450kOrd$pos, width=1,
data=t(bValsOrd),
groups=targets$Sample_Group,genome = gen,chromosome = chrom, ylim=c(-0.05,1.05),
col=pal,type=c("a","p"), name="DNA Meth.\n(beta value)", background.panel="white",legend=TRUE,
cex.title=0.4,cex.axis=0.5,cex.legend=0.8)
islandTrack = AnnotationTrack(range=islandData, genome = gen, name = "CpG Is.", fill="grey40",
chromosome = chrom)
dnaseTrack = DataTrack(range=dnaseData, genome = gen, name = "DNAseI", type="gradient",
chromosome = chrom)
Set up the track list and indicate the relative sizes of the different tracks. Finally, make the plot using the plotTracks function.
tracks = list(iTrack,gTrack,methTrack,islandTrack,dnaseTrack,rTrack)
sizes = c(3,3,5,3,3,5)
plotTracks(tracks, from = minbase, to = maxbase, showTitle = TRUE, add53 = TRUE, add35 = TRUE,
grid=TRUE, lty.grid=3, sizes=sizes, length(tracks))
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.1252
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] GO.db_3.1.2
## [2] org.Hs.eg.db_3.1.2
## [3] AnnotationDbi_1.30.1
## [4] DMRcate_1.4.2
## [5] DMRcatedata_1.4.0
## [6] Gviz_1.12.1
## [7] minfiData_0.10.0
## [8] matrixStats_0.15.0
## [9] missMethyl_1.2.0
## [10] RSQLite_1.0.0
## [11] DBI_0.3.1
## [12] RColorBrewer_1.1-2
## [13] IlluminaHumanMethylation450kmanifest_0.4.0
## [14] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
## [15] minfi_1.14.0
## [16] bumphunter_1.8.0
## [17] locfit_1.5-9.1
## [18] iterators_1.0.8
## [19] foreach_1.4.3
## [20] Biostrings_2.36.4
## [21] XVector_0.8.0
## [22] GenomicRanges_1.20.8
## [23] GenomeInfoDb_1.4.3
## [24] IRanges_2.2.9
## [25] S4Vectors_0.6.6
## [26] lattice_0.20-33
## [27] Biobase_2.28.0
## [28] BiocGenerics_0.14.0
## [29] limma_3.24.15
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-122 bitops_1.0-6
## [3] tools_3.2.2 doRNG_1.6
## [5] nor1mix_1.2-1 rpart_4.1-10
## [7] Hmisc_3.17-0 colorspace_1.2-6
## [9] nnet_7.3-11 methylumi_2.14.0
## [11] gridExtra_2.0.0 base64_1.1
## [13] preprocessCore_1.30.0 formatR_1.2.1
## [15] BiasedUrn_1.06.1 pkgmaker_0.22
## [17] rtracklayer_1.28.10 scales_0.3.0
## [19] genefilter_1.50.0 quadprog_1.5-5
## [21] stringr_1.0.0 digest_0.6.8
## [23] Rsamtools_1.20.5 foreign_0.8-66
## [25] illuminaio_0.10.0 rmarkdown_0.8.1
## [27] siggenes_1.42.0 GEOquery_2.34.0
## [29] dichromat_2.0-0 htmltools_0.2.6
## [31] BSgenome_1.36.3 ruv_0.9.6
## [33] mclust_5.1 BiocParallel_1.2.22
## [35] acepack_1.3-3.3 VariantAnnotation_1.14.13
## [37] RCurl_1.95-4.7 magrittr_1.5
## [39] Formula_1.2-1 futile.logger_1.4.1
## [41] Rcpp_0.12.2 munsell_0.4.2
## [43] proto_0.3-10 stringi_1.0-1
## [45] yaml_2.1.13 MASS_7.3-45
## [47] zlibbioc_1.14.0 plyr_1.8.3
## [49] splines_3.2.2 multtest_2.24.0
## [51] GenomicFeatures_1.20.6 annotate_1.46.1
## [53] knitr_1.11 beanplot_1.2
## [55] rngtools_1.2.4 reshape2_1.4.1
## [57] codetools_0.2-14 biomaRt_2.24.1
## [59] futile.options_1.0.0 XML_3.98-1.3
## [61] evaluate_0.8 biovizBase_1.16.0
## [63] latticeExtra_0.6-26 lambda.r_1.1.7
## [65] gtable_0.1.2 reshape_0.8.5
## [67] ggplot2_1.0.1 xtable_1.8-0
## [69] survival_2.38-3 GenomicAlignments_1.4.2
## [71] registry_0.3 cluster_2.0.3
## [73] statmod_1.4.22