1 Introduction

Methylation, the addition of a methyl group to a CG dinucleotide of the DNA, is the most extensively studied epigenetic mark. Although it can be measured in many ways, the most popular platform for measuring methylation has been the Illumina HumanMethylation450 (450k) array. In this workshop, we will introduce the features of the 450k array and how to analyse the methylation array data using R. We will cover the steps involved in the standard 450k analysis pipeline including: quality control, filtering, normalization, data exploration and probe-wise differential methylation array analysis. We will also cover other approaches such as differential methylation analysis of regions, differential variability analysis and gene ontology analysis. Finally, we will provide some examples of how to visualise methylation array data

1.1 Preparing for the workshop

1.1.1 Software

To participate in this workshop you will need the following software installed on your computer:

You should then start RStudio and install Bioconductor (3.2) and several required packages by running the following commands:

## install required Bioconductor packages
source("http://www.bioconductor.org/biocLite.R")
biocLite()
biocLite(c("minfi","limma","IlluminaHumanMethylation450kanno.ilmn12.hg19","minfiData",
          "IlluminaHumanMethylation450kmanifest","RColorBrewer","missMethyl",
          "DMRcate","stringi","plyr","Gviz","matrixStats","GO.db","BiasedUrn"))

1.1.2 Data

To demonstrate the various aspects of analysing methylation data, we will be using a small methylation dataset that has been previously analysed and published. The data is of the methylation profiles of various sorted T-cell types. More information on the samples and the dataset can be found in the original paper. The data can be downloaded from the MCRI Owncloud (password: biosummer2015) until Dec 12th. Please download all of the data files into the same directory.

1.2 Illumina HumanMethylation450 BeadChips

Illumina HumanMethylation450 (450k) arrays are a cost-effective alternative to whole genome bisulfite sequencing, and as such have been widely used to profile DNA methylation, particularly for studies with large numbers of samples. Illumina have recently release a new platform, the Illumina MethylationEPIC array, which increases the number of CpG probes from ~450,000 to ~850,000. Although the number of CpGs covered is increased on the new array, the same probe chemistry is being used and thus the principles for analysing the 450k array should be largely applicable to the new 850k array.

2 Standard analysis

This workshop will cover various aspects of a basic differential methylation analysis of 450k arrays. There are many R Bioconductor packages available for analysing 450k array methylation data, including minfi, missMethyl, wateRmelon and methylumi. We will begin with an example of a probe-wise methylation analysis using minfi and limma. By probe-wise analysis we mean that we will obtain a moderated t-statistic and p-value for each individual CpG probe, which we can use to determine which individual CpGs are differentially methylated at some significance level for our comparisons of interest.

2.0.1 Loading the data

We will begin by loading all the package libraries that will be required for the analysis.

## load packages required for analysis
library(limma)
library(minfi)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:limma':
## 
##     plotMA
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Loading required package: lattice
## Loading required package: GenomicRanges
## Loading required package: S4Vectors
## Loading required package: stats4
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biostrings
## Loading required package: XVector
## Loading required package: bumphunter
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: locfit
## locfit 1.5-9.1    2013-03-22
## Setting options('download.file.method.GEOquery'='auto')
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(IlluminaHumanMethylation450kmanifest)
library(RColorBrewer)
library(missMethyl)
## Loading required package: DBI
library(matrixStats)
## matrixStats v0.15.0 (2015-10-26) successfully loaded. See ?matrixStats for help.
## 
## Attaching package: 'matrixStats'
## 
## The following object is masked from 'package:GenomicRanges':
## 
##     rowRanges
## 
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
library(minfiData)
library(Gviz)
## Loading required package: grid
library(DMRcate)
## Loading required package: DMRcatedata

Load the Illumina manifest which contains all of the annotation information for each of the CpG probes on the 450k array. This will help us determine where our differentially methylated probes are located in a genomic context.

## get the 450k annotation data
ann450k = getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
head(ann450k)
## DataFrame with 6 rows and 33 columns
##                    chr       pos      strand        Name    AddressA
##            <character> <integer> <character> <character> <character>
## cg00050873        chrY   9363356           -  cg00050873    32735311
## cg00212031        chrY  21239348           -  cg00212031    29674443
## cg00213748        chrY   8148233           -  cg00213748    30703409
## cg00214611        chrY  15815688           -  cg00214611    69792329
## cg00455876        chrY   9385539           -  cg00455876    27653438
## cg01707559        chrY   6778695           +  cg01707559    45652402
##               AddressB                                          ProbeSeqA
##            <character>                                        <character>
## cg00050873    31717405 ACAAAAAAACAACACACAACTATAATAATTTTTAAAATAAATAAACCCCA
## cg00212031    38703326 CCCAATTAACCACAAAAACTAAACAAATTATACAATCAAAAAAACATACA
## cg00213748    36767301 TTTTAACACCTAACACCATTTTAACAATAAAAATTCTACAAAAAAAAACA
## cg00214611    46723459 CTAACTTCCAAACCACACTTTATATACTAAACTACAATATAACACAAACA
## cg00455876    69732350 AACTCTAAACTACCCAACACAAACTCCAAAAACTTCTCAAAAAAAACTCA
## cg01707559    64689504 ACAAATTAAAAACACTAAAACAAACACAACAACTACAACAACAAAAAACA
##                                                     ProbeSeqB        Type
##                                                   <character> <character>
## cg00050873 ACGAAAAAACAACGCACAACTATAATAATTTTTAAAATAAATAAACCCCG           I
## cg00212031 CCCAATTAACCGCAAAAACTAAACAAATTATACGATCGAAAAAACGTACG           I
## cg00213748 TTTTAACGCCTAACACCGTTTTAACGATAAAAATTCTACAAAAAAAAACG           I
## cg00214611 CTAACTTCCGAACCGCGCTTTATATACTAAACTACAATATAACGCGAACG           I
## cg00455876 AACTCTAAACTACCCGACACAAACTCCAAAAACTTCTCGAAAAAAACTCG           I
## cg01707559 GCGAATTAAAAACACTAAAACGAACGCGACGACTACAACGACAAAAAACG           I
##               NextBase       Color    Probe_rs Probe_maf      CpG_rs
##            <character> <character> <character> <numeric> <character>
## cg00050873           A         Red          NA        NA          NA
## cg00212031           T         Red          NA        NA          NA
## cg00213748           A         Red          NA        NA          NA
## cg00214611           A         Red          NA        NA          NA
## cg00455876           A         Red          NA        NA          NA
## cg01707559           A         Red          NA        NA          NA
##              CpG_maf      SBE_rs   SBE_maf           Islands_Name
##            <numeric> <character> <numeric>            <character>
## cg00050873        NA          NA        NA   chrY:9363680-9363943
## cg00212031        NA          NA        NA chrY:21238448-21240005
## cg00213748        NA          NA        NA   chrY:8147877-8148210
## cg00214611        NA          NA        NA chrY:15815488-15815779
## cg00455876        NA          NA        NA   chrY:9385471-9385777
## cg01707559        NA          NA        NA   chrY:6778574-6780028
##            Relation_to_Island
##                   <character>
## cg00050873            N_Shore
## cg00212031             Island
## cg00213748            S_Shore
## cg00214611             Island
## cg00455876             Island
## cg01707559             Island
##                                                                                                                        Forward_Sequence
##                                                                                                                             <character>
## cg00050873 TATCTCTGTCTGGCGAGGAGGCAACGCACAACTGTGGTGGTTTTTGGAGTGGGTGGACCC[CG]GCCAAGACGGCCTGGGCTGACCAGAGACGGGAGGCAGAAAAAGTGGGCAGGTGGTTGCAG
## cg00212031 CCATTGGCCCGCCCCAGTTGGCCGCAGGGACTGAGCAAGTTATGCGGTCGGGAAGACGTG[CG]TTAAAGGGCTGAAGGGGAGGGACGGAACTGACAGTCTCTGTGACAGCTCTGAGGTGGGAG
## cg00213748 TCTGTGGGACCATTTTAACGCCTGGCACCGTTTTAACGATGGAGGTTCTGCAGGAGGGGG[CG]ACCTGGGGTAGGAGGCGTGCTAGTGGTGGATGACATTGTGGCAGAGATGGAGGTGGTGGC
## cg00214611 GCGCCGGCAGGACTAGCTTCCGGGCCGCGCTTTGTGTGCTGGGCTGCAGTGTGGCGCGGG[CG]AGGAAGCTGGTAGGGCGGTTGTCGCAAGCTCCAGCTGCAGCCTCCGCCTACGTGAGAAGA
## cg00455876 CGCGTGTGCCTGGACTCTGAGCTACCCGGCACAAGCTCCAAGGGCTTCTCGGAGGAGGCT[CG]GGGACGGAAGGCGTGGGGTGAGTGGGCTGGAGATGCAGGCGCGCCCGTGGCTGTGCAGCC
## cg01707559 AGCGGCCGCTCCCAGTGGTGGTCACCGCCAGTGCCAATCCCTTGCGCCGCCGTGCAGTCC[CG]CCCTCTGTCGCTGCAGCCGCCGCGCCCGCTCCAGTGCCCCCAATTCGCGCTCGGGAGTGA
##                                                     SourceSeq Random_Loci
##                                                   <character> <character>
## cg00050873 CGGGGTCCACCCACTCCAAAAACCACCACAGTTGTGCGTTGCCTCCTCGC            
## cg00212031 CGCACGTCTTCCCGACCGCATAACTTGCTCAGTCCCTGCGGCCAACTGGG            
## cg00213748 CGCCCCCTCCTGCAGAACCTCCATCGTTAAAACGGTGCCAGGCGTTAAAA            
## cg00214611 CGCCCGCGCCACACTGCAGCCCAGCACACAAAGCGCGGCCCGGAAGCTAG            
## cg00455876 GACTCTGAGCTACCCGGCACAAGCTCCAAGGGCTTCTCGGAGGAGGCTCG            
## cg01707559 CGCCCTCTGTCGCTGCAGCCGCCGCGCCCGCTCCAGTGCCCCCAATTCGC            
##            Methyl27_Loci UCSC_RefGene_Name        UCSC_RefGene_Accession
##              <character>       <character>                   <character>
## cg00050873                  TSPY4;FAM197Y2        NM_001164471;NR_001553
## cg00212031                          TTTY14                     NR_001543
## cg00213748                                                              
## cg00214611                   TMSB4Y;TMSB4Y           NM_004202;NM_004202
## cg00455876                                                              
## cg01707559               TBL1Y;TBL1Y;TBL1Y NM_134259;NM_033284;NM_134258
##              UCSC_RefGene_Group     Phantom         DMR    Enhancer
##                     <character> <character> <character> <character>
## cg00050873         Body;TSS1500                                    
## cg00212031               TSS200                                    
## cg00213748                                                         
## cg00214611        1stExon;5'UTR                                    
## cg00455876                                                         
## cg01707559 TSS200;TSS200;TSS200                                    
##                     HMM_Island Regulatory_Feature_Name
##                    <character>             <character>
## cg00050873   Y:9973136-9976273                        
## cg00212031 Y:19697854-19699393                        
## cg00213748   Y:8207555-8208234                        
## cg00214611 Y:14324883-14325218     Y:15815422-15815706
## cg00455876   Y:9993394-9995882                        
## cg01707559   Y:6838022-6839951                        
##                          Regulatory_Feature_Group         DHS
##                                       <character> <character>
## cg00050873                                                   
## cg00212031                                                   
## cg00213748                                                   
## cg00214611 Promoter_Associated_Cell_type_specific            
## cg00455876                                                   
## cg01707559

Next, we will load in the data. The sample sheet file is essentially a CSV (comma-separated) file containing one line per sample, with a number of columns describing each sample. The format expected by the read.450k.sheet function is based on the sample sheet file Illumina provides. It is also very similar to the targets file made popular by the limma package (see the extensive Limma User’s Guide). Raeding in the sample sheet essentially creates a data.frame. If they exist, a column named Sentrix_Position is renamed to Array and Sentrix_ID is renamed to Slide. The function also attempts to derive the path to the IDAT files using the baseDir and the Slide and Array information. This will be stored in a column named Basename.

## set up a path to the directory you donloaded the data to
dataDirectory = "D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data"

## read in the sample sheet for the experiment
targets = read.450k.sheet(dataDirectory, pattern="SampleSheet.csv")
## [read.450k.sheet] Found the following CSV files:
## [1] "D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/SampleSheet.csv"
targets
##    Sample_Name Sample_Well   Sample_Source Sample_Group Sample_Label
## 1            1          A1             M28        naive        naive
## 2            2          B1             M28        rTreg        rTreg
## 3            3          C1             M28    act_naive    act_naive
## 4            4          D1             M29        naive        naive
## 5            5          E1             M29    act_naive    act_naive
## 6            6          F1             M29    act_rTreg    act_rTreg
## 7            7          G1             M30        naive        naive
## 8            8          H1             M30        rTreg        rTreg
## 9            9          A2             M30    act_naive    act_naive
## 10          10          B2             M30    act_rTreg    act_rTreg
## 11          11         H06 VICS-72098-18-B        birth        birth
##    Pool_ID  Array      Slide
## 1       NA R01C01 6264509100
## 2       NA R02C01 6264509100
## 3       NA R03C01 6264509100
## 4       NA R04C01 6264509100
## 5       NA R05C01 6264509100
## 6       NA R06C01 6264509100
## 7       NA R01C02 6264509100
## 8       NA R02C02 6264509100
## 9       NA R03C02 6264509100
## 10      NA R04C02 6264509100
## 11      NA R06C02 5975827018
##                                                                            Basename
## 1  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R01C01
## 2  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R02C01
## 3  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R03C01
## 4  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R04C01
## 5  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R05C01
## 6  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R06C01
## 7  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R01C02
## 8  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R02C02
## 9  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R03C02
## 10 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R04C02
## 11 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/5975827018/5975827018_R06C02

Now that we have imported the information about what and where all the data is, we can load the raw methylation data into R from the IDAT files. We can then rename the samples with more descriptive names and calculate detection p-values, which will allow us to identify poor quality probes.

## read in the raw data from the IDAT files
rgSet = read.450k.exp(targets=targets)
rgSet
## RGChannelSet (storageMode: lockedEnvironment)
## assayData: 622399 features, 11 samples 
##   element names: Green, Red 
## An object of class 'AnnotatedDataFrame'
##   sampleNames: 6264509100_R01C01 6264509100_R02C01 ...
##     5975827018_R06C02 (11 total)
##   varLabels: Sample_Name Sample_Well ... filenames (10 total)
##   varMetadata: labelDescription
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## give the samples meaningful names
targets$ID = paste(targets$Sample_Group,targets$Sample_Name,sep=".")
sampleNames(rgSet) = targets$ID
rgSet
## RGChannelSet (storageMode: lockedEnvironment)
## assayData: 622399 features, 11 samples 
##   element names: Green, Red 
## An object of class 'AnnotatedDataFrame'
##   sampleNames: naive.1 rTreg.2 ... birth.11 (11 total)
##   varLabels: Sample_Name Sample_Well ... filenames (10 total)
##   varMetadata: labelDescription
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## calculate the detection P-values
detP = detectionP(rgSet)
head(detP)
##            naive.1 rTreg.2  act_naive.3 naive.4 act_naive.5 act_rTreg.6
## cg00050873       0       0 0.000000e+00       0 0.00000e+00           0
## cg00212031       0       0 0.000000e+00       0 0.00000e+00           0
## cg00213748       0       0 1.181832e-12       0 8.21565e-15           0
## cg00214611       0       0 0.000000e+00       0 0.00000e+00           0
## cg00455876       0       0 0.000000e+00       0 0.00000e+00           0
## cg01707559       0       0 0.000000e+00       0 0.00000e+00           0
##            naive.7      rTreg.8 act_naive.9 act_rTreg.10  birth.11
## cg00050873       0 0.000000e+00           0 0.000000e+00 0.0000000
## cg00212031       0 0.000000e+00           0 0.000000e+00 0.0000000
## cg00213748       0 1.469801e-05           0 1.365951e-08 0.6735224
## cg00214611       0 0.000000e+00           0 0.000000e+00 0.7344451
## cg00455876       0 0.000000e+00           0 0.000000e+00 0.0000000
## cg01707559       0 0.000000e+00           0 0.000000e+00 0.0000000

2.0.2 Quality control

Plotting the mean detection p-value for each sample will allow us to gauge whether any samples have many failed probes - this will be indicated by a large mean detection p-value. Samples with mean detection p-values exceeding a cutoff such as o.05 can be excluded from further analysis.

pal = brewer.pal(8,"Dark2")
## look at the mean detection P-values across all samples to identify any failed samples
par(mfrow=c(1,2))
barplot(colMeans(detP),col=pal[factor(targets$Sample_Group)],las=2,cex.names=0.8,
        main="Mean detection p-values")
abline(h=0.05,col="red")
legend("topleft",legend=levels(factor(targets$Sample_Group)),fill=pal,bg="white")

barplot(colMeans(detP),col=pal[factor(targets$Sample_Group)],las=2,cex.names=0.8,ylim=c(0,0.002),
        main="Mean detection p-values")
abline(h=0.05,col="red")
legend("topleft",legend=levels(factor(targets$Sample_Group)),fill=pal,bg="white")
## remove failed samples
keep = colMeans(detP) < 0.05
rgSet = rgSet[,keep]
targets = targets[keep,]
detP = detP[,keep]

rgSet
## RGChannelSet (storageMode: lockedEnvironment)
## assayData: 622399 features, 10 samples 
##   element names: Green, Red 
## An object of class 'AnnotatedDataFrame'
##   sampleNames: naive.1 rTreg.2 ... act_rTreg.10 (10 total)
##   varLabels: Sample_Name Sample_Well ... filenames (10 total)
##   varMetadata: labelDescription
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
targets
##    Sample_Name Sample_Well Sample_Source Sample_Group Sample_Label Pool_ID
## 1            1          A1           M28        naive        naive      NA
## 2            2          B1           M28        rTreg        rTreg      NA
## 3            3          C1           M28    act_naive    act_naive      NA
## 4            4          D1           M29        naive        naive      NA
## 5            5          E1           M29    act_naive    act_naive      NA
## 6            6          F1           M29    act_rTreg    act_rTreg      NA
## 7            7          G1           M30        naive        naive      NA
## 8            8          H1           M30        rTreg        rTreg      NA
## 9            9          A2           M30    act_naive    act_naive      NA
## 10          10          B2           M30    act_rTreg    act_rTreg      NA
##     Array      Slide
## 1  R01C01 6264509100
## 2  R02C01 6264509100
## 3  R03C01 6264509100
## 4  R04C01 6264509100
## 5  R05C01 6264509100
## 6  R06C01 6264509100
## 7  R01C02 6264509100
## 8  R02C02 6264509100
## 9  R03C02 6264509100
## 10 R04C02 6264509100
##                                                                            Basename
## 1  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R01C01
## 2  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R02C01
## 3  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R03C01
## 4  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R04C01
## 5  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R05C01
## 6  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R06C01
## 7  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R01C02
## 8  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R02C02
## 9  D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R03C02
## 10 D:/Work/BioinfoSummer2015/450kAnalysisWorkshop/data/6264509100/6264509100_R04C02
##              ID
## 1       naive.1
## 2       rTreg.2
## 3   act_naive.3
## 4       naive.4
## 5   act_naive.5
## 6   act_rTreg.6
## 7       naive.7
## 8       rTreg.8
## 9   act_naive.9
## 10 act_rTreg.10

You can also generate other quality control plots using the minfi qcReport function. The minfi vignette describes the various plots in this report and how they should be interpreted. Generally, samples that look poor based on mean detection p-value will also look poor using other metrics and it is usually advisable to exclude them from further analysis.

## produce a general QC report for the data
qcReport(rgSet,sampNames=targets$ID,sampGroups=targets$Sample_Group,pdf="qcReport.pdf")

2.0.3 Normalisation

To minimise the unwanted variation within and between samples, various data normalizations can be applied. As 450k arrays use 2 different probe types, Infinium I and II, on the same array, studies suggest that it is advisable to perform within-array normalization between the 2 probe types. Normalization can also be performed between arrays. Here we provide examples of 2 different normalization types: SWAN (within-array only) and SQN (within and between array).

## create a MethylSet object from the raw data
mSetRaw = preprocessRaw(rgSet)
mSetRaw
## MethylSet (storageMode: lockedEnvironment)
## assayData: 485512 features, 10 samples 
##   element names: Meth, Unmeth 
## An object of class 'AnnotatedDataFrame'
##   sampleNames: naive.1 rTreg.2 ... act_rTreg.10 (10 total)
##   varLabels: Sample_Name Sample_Well ... filenames (10 total)
##   varMetadata: labelDescription
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.14.0
##   Manifest version: 0.4.0
## normalize the data
mSetSw = preprocessSWAN(rgSet = rgSet, mSet = mSetRaw, verbose=TRUE) ## SWAN method
## [preprocessSwan] Normalizing array 1 of 10
## [preprocessSwan] Normalizing array 2 of 10
## [preprocessSwan] Normalizing array 3 of 10
## [preprocessSwan] Normalizing array 4 of 10
## [preprocessSwan] Normalizing array 5 of 10
## [preprocessSwan] Normalizing array 6 of 10
## [preprocessSwan] Normalizing array 7 of 10
## [preprocessSwan] Normalizing array 8 of 10
## [preprocessSwan] Normalizing array 9 of 10
## [preprocessSwan] Normalizing array 10 of 10
mSetSw = mapToGenome(mSetSw)

mSetSq = preprocessQuantile(rgSet) ## SQN method
## [preprocessQuantile] Mapping to genome.
## [preprocessQuantile] Fixing outliers.
## Warning in .getSex(CN = CN, xIndex = xIndex, yIndex = yIndex, cutoff
## = cutoff): An inconsistency was encountered while determining sex. One
## possibility is that only one sex is present. We recommend further checks,
## for example with the plotSex function.
## [preprocessQuantile] Quantile normalizing.
par(mfrow=c(1,3))
densityPlot(rgSet, sampGroups = targets$Sample_Group,main="Raw")
densityPlot(getBeta(mSetSw), sampGroups = targets$Sample_Group,main="SWAN")
densityPlot(getBeta(mSetSq), sampGroups = targets$Sample_Group,main="SQN")

2.0.4 Data exploration

Multi dimensional scaling (MDS) plots are excellent for visualising your data, and are usually one of the first kinds of plots you should make when exploring your data. It is based on principle componenents analysis and is an unsupervised method for looking at the similarities and differences between the samples of the data. Samples that are more similar to each other should cluster together, and samples that are very different should be further apart on the plot. Dimension 1 (or principle component 1) captures the greatest source of variation in the data, dimension 2 captures the second greatest source of variation in the data and so on. Colouring in the data points or labels by known factors of interest can often pull out exactly what the greatest sources of variation are in the data. It’s also possible to use MDSplots for figuring out sample mix-ups.

## MDS plots to look at largets sources of variation
par(mfrow=c(1,2))
plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Group)])
legend("topright",legend=levels(factor(targets$Sample_Group)),text.col=pal)

plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Source)])
legend("topright",legend=levels(factor(targets$Sample_Source)),text.col=pal)
par(mfrow=c(1,1))
## Examine higher dimensions to look at other sources of variation
plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Group)], dim=c(1,3))
legend("topright",legend=levels(factor(targets$Sample_Group)),text.col=pal)
plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Group)], dim=c(2,3))
legend("topright",legend=levels(factor(targets$Sample_Group)),text.col=pal)
plotMDS(getM(mSetSq), top=1000, gene.selection = "common", col=pal[factor(targets$Sample_Group)], dim=c(3,4))
legend("topright",legend=levels(factor(targets$Sample_Group)),text.col=pal)

2.0.5 Filtering

Poor performing probes are generally filtered out prior to differential methylation analysis. As the signal from these probes is unreliable, by removing them we perform fewer statistical tests and thus incur a reduced multiple testing adjustment. We filter out probes that have failed in one or more samples based on detection p-value. The detection p-values are automatically calculated using the signal from the negative control probes on the array; probes that are determined to not have a signal significantly different from the background level have high detection p-values.

## remove any probes that have failed in one or more samples
detP = detP[match(featureNames(mSetSq),rownames(detP)),] # ensure probes are in the same order in the mSetSq and detP objects
keep = rowSums(detP < 0.01) == ncol(mSetSq) 
table(keep)
## keep
##  FALSE   TRUE 
##    977 484535
mSetSqFlt = mSetSq[keep,]
mSetSqFlt
## class: GenomicRatioSet 
## dim: 484535 10 
## exptData(0):
## assays(2): M CN
## rownames(484535): cg13869341 cg14008030 ... cg08265308 cg14273923
## rowRanges metadata column names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
##   predictedSex
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.14.0
##   Manifest version: 0.4.0

Depending on the nature of your samples and your biological question you may also choose to filter out the probes from the X and Y chromosomes or probes that are known to have common SNPs at the CpG site. This can be done using the code below, however, we will not be filtering out these probes as part of the analysis in this workshop.

## if your data infludes males and females, remove the sex chromosomes
keep = !(featureNames(mSetSqFlt) %in% ann450k$Name[ann450k$chr %in% c("chrX","chrY")])
table(keep)

mSetSqFlt = mSetSqFlt[keep,]

## remove probes with SNPs at CpG or SBE site
mSetSqFlt = dropLociWithSnps(mSetSqFlt)

We will also filter out probes that have shown to be cross-reactive i.e. map to multiple places in the genome. This list was oroginally published by Chen et al. in 2013 and can be downloaded directly from the author’s website. The Excel spreadsheet containing the cross-reactive probes is also available as part of the data download for this workshop.

## exclude cross reactive probes 
xReactiveProbes = read.csv(file=paste(dataDirectory,"48639-non-specific-probes-Illumina450k.csv",sep="/"), stringsAsFactors=FALSE)
keep = !(featureNames(mSetSqFlt) %in% xReactiveProbes$TargetID)
table(keep)
## keep
##  FALSE   TRUE 
##  29171 455364
mSetSqFlt = mSetSqFlt[keep,] 
mSetSqFlt
## class: GenomicRatioSet 
## dim: 455364 10 
## exptData(0):
## assays(2): M CN
## rownames(455364): cg13869341 cg24669183 ... cg08265308 cg14273923
## rowRanges metadata column names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
##   predictedSex
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.14.0
##   Manifest version: 0.4.0

After the data has been normalised and filtered, we can calculate M and \(\beta\) values. \[M=log2(methylated/unmethylated)\] \[\beta=methylated/(methylated+unmethylated)\] M-values have nicer statistical properties and are thus better for use in statistical analysis of methylation data whilst \(\beta\) values are easy to interpret and are thus better for displaying data. A detailed comparison of M and \(\beta\) values was published by Du et al. in 2010.

## calculate M-values for statistical analysis
mVals = getM(mSetSqFlt)
head(mVals)
##              naive.1   rTreg.2 act_naive.3   naive.4 act_naive.5
## cg13869341  2.421276  2.515948    2.165745  2.286314    2.109441
## cg24669183  2.169414  2.235964    2.280734  1.632309    2.184435
## cg15560884  1.761176  1.577578    1.597503  1.777486    1.764999
## cg01014490 -3.504268 -3.825119   -5.384735 -4.537864   -4.296526
## cg17505339  3.082191  3.924931    4.163206  3.255373    3.654134
## cg11954957  1.546401  1.912204    1.727910  2.441267    1.618331
##            act_rTreg.6   naive.7   rTreg.8 act_naive.9 act_rTreg.10
## cg13869341    2.083313  3.004332  2.949430    3.323265     2.920691
## cg24669183    2.175771  1.914738  2.537203    2.085423     2.090007
## cg15560884    1.760819  1.709728  1.655782    1.834341     1.874285
## cg01014490   -3.859792 -5.313593 -5.100400   -3.575205    -4.360973
## cg17505339    3.701096  3.000690  2.768876    3.185991     3.848438
## cg11954957    2.107829  1.853192  2.494570    2.470994     2.323683
bVals = getBeta(mSetSqFlt)
head(bVals)
##               naive.1    rTreg.2 act_naive.3    naive.4 act_naive.5
## cg13869341 0.84267937 0.85118462   0.8177504 0.82987650  0.81186174
## cg24669183 0.81812908 0.82489238   0.8293297 0.75610281  0.81967323
## cg15560884 0.77219626 0.74903910   0.7516263 0.77417882  0.77266205
## cg01014490 0.08098986 0.06590459   0.0233755 0.04127262  0.04842397
## cg17505339 0.89439216 0.93822870   0.9471357 0.90520570  0.92641305
## cg11954957 0.74495496 0.79008516   0.7681146 0.84450764  0.75431167
##            act_rTreg.6   naive.7    rTreg.8 act_naive.9 act_rTreg.10
## cg13869341   0.8090798 0.8891851 0.88537940  0.90916748   0.88334231
## cg24669183   0.8187838 0.7903763 0.85304116  0.80930568   0.80979554
## cg15560884   0.7721528 0.7658623 0.75909061  0.78099397   0.78569274
## cg01014490   0.0644404 0.0245281 0.02832358  0.07740468   0.04640659
## cg17505339   0.9286016 0.8889361 0.87205348  0.90099782   0.93508348
## cg11954957   0.8116911 0.7832207 0.84929777  0.84719430   0.83350220
par(mfrow=c(1,2))
densityPlot(bVals,sampGroups = targets$Sample_Group,main="Beta values")
densityPlot(mVals,sampGroups = targets$Sample_Group,main="M values")

2.0.6 Differential methylation analysis (probes)

The primary biological goal for this dataset is to discover differentially methylated probes between the different cell types. However, there is another factor that we need to take into account when we perform the modelling. In the targets file, there is a column called Sample_Source, and this refers to the individuals from whom the samples came. In this dataset, any one particular individual contributes more than one cell type. For example, individual M28 contributes naive, rTreg and actnaive samples. From the MDS plots we know that the samples cluster according to individual first, and then by cell type, implying that individual is the greatest source of variation in methylation. In order to get the cell type differences, we need to take into account individual variation. Hence when we specify our design matrix, we need to include two factors: individual and cell type. This style of analysis is called a paired analysis; differences between cell types are calculated within each individual, and then these differences are summed across individuals to determine whether overall there is a significant difference in the mean methylation level for each CpG site.

cellType = factor(targets$Sample_Group) ## this is the factor of interest
individual = factor(targets$Sample_Source) ## this is the individual effect that we need to account for

## use the above to create a design matrix
design = model.matrix(~0+cellType+individual, data=targets)
colnames(design) = c(levels(cellType),levels(individual)[-1])
  
fit = lmFit(mVals, design) ## fit the linear model
## create a contrast matrix for specific comparisons
contMatrix = makeContrasts(naive-rTreg,
                           naive-act_naive,
                           rTreg-act_rTreg,
                           act_naive-act_rTreg,
                           levels=design)
contMatrix
##            Contrasts
## Levels      naive - rTreg naive - act_naive rTreg - act_rTreg
##   act_naive             0                -1                 0
##   act_rTreg             0                 0                -1
##   naive                 1                 1                 0
##   rTreg                -1                 0                 1
##   M29                   0                 0                 0
##   M30                   0                 0                 0
##            Contrasts
## Levels      act_naive - act_rTreg
##   act_naive                     1
##   act_rTreg                    -1
##   naive                         0
##   rTreg                         0
##   M29                           0
##   M30                           0
## fit the contrasts
fit2 = contrasts.fit(fit, contMatrix)
fit2 = eBayes(fit2)

## look at the numbers of DM CpGs at FDR < 0.05
summary(decideTests(fit2))
##    naive - rTreg naive - act_naive rTreg - act_rTreg act_naive - act_rTreg
## -1          1671               409                 0                   579
## 0         452255            454729            455364                453838
## 1           1438               226                 0                   947
## get the table of results for the first contrast (naive - rTreg)
ann450kSub = ann450k[match(rownames(mVals),ann450k$Name),c(1:4,12:19,24:ncol(ann450k))]
DMPs = topTable(fit2, num=Inf, coef=1, genelist = ann450kSub)
head(DMPs)
##              chr       pos strand       Name Probe_rs Probe_maf CpG_rs
## cg07499259  chr1  12188502      + cg07499259     <NA>        NA   <NA>
## cg26992245  chr8  29848579      - cg26992245     <NA>        NA   <NA>
## cg09747445 chr15  70387268      - cg09747445     <NA>        NA   <NA>
## cg18808929  chr8  61825469      - cg18808929     <NA>        NA   <NA>
## cg25015733  chr2  99342986      - cg25015733     <NA>        NA   <NA>
## cg21179654  chr3 114057297      + cg21179654     <NA>        NA   <NA>
##            CpG_maf SBE_rs SBE_maf            Islands_Name
## cg07499259      NA   <NA>      NA                        
## cg26992245      NA   <NA>      NA                        
## cg09747445      NA   <NA>      NA chr15:70387929-70393206
## cg18808929      NA   <NA>      NA  chr8:61822358-61823028
## cg25015733      NA   <NA>      NA  chr2:99346882-99348177
## cg21179654      NA   <NA>      NA                        
##            Relation_to_Island
## cg07499259            OpenSea
## cg26992245            OpenSea
## cg09747445            N_Shore
## cg18808929            S_Shelf
## cg25015733            N_Shelf
## cg21179654            OpenSea
##                                           UCSC_RefGene_Name
## cg07499259                                  TNFRSF8;TNFRSF8
## cg26992245                                                 
## cg09747445                                   TLE3;TLE3;TLE3
## cg18808929                                                 
## cg25015733                                           MGAT4A
## cg21179654 ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20
##                                                                             UCSC_RefGene_Accession
## cg07499259                                                                     NM_152942;NM_001243
## cg26992245                                                                                        
## cg09747445                                                        NM_001105192;NM_020908;NM_005078
## cg18808929                                                                                        
## cg25015733                                                                               NM_012214
## cg21179654 NM_001164343;NM_001164346;NM_001164345;NM_001164342;NM_001164344;NM_001164347;NM_015642
##                                   UCSC_RefGene_Group Phantom DMR Enhancer
## cg07499259                                5'UTR;Body                     
## cg26992245                                                           TRUE
## cg09747445                            Body;Body;Body                     
## cg18808929                                                           TRUE
## cg25015733                                     5'UTR                     
## cg21179654 3'UTR;3'UTR;3'UTR;3'UTR;3'UTR;3'UTR;3'UTR                     
##                     HMM_Island Regulatory_Feature_Name
## cg07499259 1:12111023-12111225                        
## cg26992245                                            
## cg09747445                                            
## cg18808929                                            
## cg25015733                                            
## cg21179654                       3:114057192-114057775
##                   Regulatory_Feature_Group DHS     logFC     AveExpr
## cg07499259                                      3.654104  2.46652171
## cg26992245                                      4.450696 -0.09180715
## cg09747445                                     -3.337299 -0.25201484
## cg18808929                                     -2.990263  0.77522878
## cg25015733                                     -3.054336  0.83280190
## cg21179654 Unclassified_Cell_type_specific      2.859016  1.32460816
##                    t      P.Value   adj.P.Val        B
## cg07499259  18.70862 7.257675e-08 0.005243551 7.449952
## cg26992245  18.31538 8.565762e-08 0.005243551 7.358404
## cg09747445 -18.21811 8.928886e-08 0.005243551 7.335167
## cg18808929 -17.86928 1.038065e-07 0.005243551 7.249834
## cg25015733 -17.29868 1.336369e-07 0.005243551 7.103198
## cg21179654  17.24589 1.368512e-07 0.005243551 7.089171

The results of the analysis for the first comparison, naive vs rTreg, are stored in the DMPs object. We can write out this file as a comma separated file, which can be opened in excel.

write.table(DMPs, file="DMPs.csv", sep=",", row.names=FALSE)

It is always useful to plot the top differentially methylated CpG sites - it serves as a sanity check to make sure the results make sense, and if they don’t it usually means there is an error in the code somewhere, or perhaps the design matrix was not set up correctly. It is easier to interpret methylation on the beta value scale, so although we always perform analysis on the Mvalue scale, we usually visualise data on the beta value scale.

par(mfrow=c(2,2))
sapply(rownames(DMPs)[1:4], function(cpg){
  plotCpg(bVals, cpg=cpg, pheno = targets$Sample_Group)
})
## $cg07499259
## NULL
## 
## $cg26992245
## NULL
## 
## $cg09747445
## NULL
## 
## $cg18808929
## NULL

3 Additional analyses

3.1 Gene ontology testing

Once you have performed a differential methylation analysis, there may be a very long list of significant CpG sites to interpret. One question a researcher may have is which gene pathways may be targeted by the significant CpGs. Sometimes it is easy to link the top differentially methylated CpGs to genes that make sense in terms of the cell types/samples being studied, but there can be many more thousands of CpGs significantly differentially methylated. In order to get an idea of the biological processes that the significant CpGs may be involved in, we can perform gene ontology testing with the gometh function in the missMethyl package.

Let us consider the first comparison, naive vs rTreg, with the results of the analysis in the DMPs table. The gometh function takes as input a character vector of the names (e.g. cg20832020) of the significant CpG sites, and optionally, a character vector of all CpGs tested. This is recommended if you have performed quite a lot of filtering of the CpGs before analysis. In the DMPs table, the Name column corresponds to the CpG name. We will select all CpG sites that have adjusted p-value of less than 0.05.

## Get the significant CpG sites at less than 5% FDR
sigCpGs = DMPs$Name[DMPs$adj.P.Val<0.05]
## First 10 significant CpGs
sigCpGs[1:10]
##  [1] "cg07499259" "cg26992245" "cg09747445" "cg18808929" "cg25015733"
##  [6] "cg21179654" "cg26280976" "cg16943019" "cg10898310" "cg25130381"
## Total number of significant CpGs at 5% FDR
length(sigCpGs)
## [1] 3109
## Get all the CpG sites used in the analysis to form the background
all = DMPs$Name
## Total number of CpG sites tested
length(all)
## [1] 455364

What makes gometh work particularly well with methylation data is that it can take into account the varying numbers of CpGs associated with genes. For the 450K array, the numbers of CpGs mapping to genes can vary from as few as 1 to as many as 1200. The genes that have more CpGs associated with them will have a higher probability of being identified as differentially methylated compared to genes with fewer CpGs. We can look at the bias in the data by specifying plot=TRUE in the call to gometh.

par(mfrow=c(1,1))
gst <- gometh(sig.cpg=sigCpGs, all.cpg=all, plot.bias=TRUE)
## Warning in alias2SymbolTable(flat$symbol): Multiple symbols ignored for one
## or more aliases

The gst object is a dataframe with each row corresponding to the GO category being tested. The top 20 gene ontology categories can be displayed using the topGO function:

## Top 20 GO categories
topGO(gst)
##                                                    Term Ont    N  DE
## GO:0002376                        immune system process  BP 2271 347
## GO:0001775                              cell activation  BP  874 165
## GO:0045321                         leukocyte activation  BP  643 127
## GO:0002682          regulation of immune system process  BP 1156 190
## GO:0002684 positive regulation of immune system process  BP  764 140
## GO:0046649                        lymphocyte activation  BP  540 112
## GO:0006955                              immune response  BP 1363 202
## GO:0007159                 leukocyte cell-cell adhesion  BP  430  94
## GO:0048583           regulation of response to stimulus  BP 3259 440
## GO:0042110                            T cell activation  BP  390  87
## GO:0070489                           T cell aggregation  BP  390  87
## GO:0071593                       lymphocyte aggregation  BP  392  87
## GO:0070486                        leukocyte aggregation  BP  398  88
## GO:0006952                             defense response  BP 1455 202
## GO:0050778       positive regulation of immune response  BP  525 100
## GO:0002520                    immune system development  BP  763 139
## GO:0034109                 homotypic cell-cell adhesion  BP  458  95
## GO:0050776                regulation of immune response  BP  752 128
## GO:0002253                activation of immune response  BP  402  83
## GO:0007166      cell surface receptor signaling pathway  BP 3269 399
##                    P.DE          FDR
## GO:0002376 1.086240e-28 2.118603e-24
## GO:0001775 2.218695e-20 2.163671e-16
## GO:0045321 5.486652e-18 3.567055e-14
## GO:0002682 8.340629e-18 3.685697e-14
## GO:0002684 9.448567e-18 3.685697e-14
## GO:0046649 2.448200e-17 7.958281e-14
## GO:0006955 3.221465e-17 8.975923e-14
## GO:0007159 3.901551e-16 9.511982e-13
## GO:0048583 4.523733e-16 9.803432e-13
## GO:0042110 2.898667e-15 5.139601e-12
## GO:0070489 2.898667e-15 5.139601e-12
## GO:0071593 4.318704e-15 6.809324e-12
## GO:0070486 4.538618e-15 6.809324e-12
## GO:0006952 2.464757e-14 3.433759e-11
## GO:0050778 2.723173e-14 3.540851e-11
## GO:0002520 4.112259e-14 5.012844e-11
## GO:0034109 7.733471e-14 8.872565e-11
## GO:0050776 8.507945e-14 9.218831e-11
## GO:0002253 2.216633e-13 2.275432e-10
## GO:0007166 3.338006e-13 3.243997e-10

Here we can see lots of GO categories corresponding to immune system and T cell processes, which makes sense as the cell types being studied form part of the immune system. The total number of significant GO categories, adjusted for multiple testing, is:

## Total number of significant GO categories at 5% FDR
sum(gst$FDR<0.05)
## [1] 447

3.2 Differential variability

Rather than testing for differences in mean methylation, we may be interested in testing for differences between group variances. For example, it has been hypothesised that highly variable CpGs in cancer are important for tumour progression. Hence we may be interested in CpG sites that are consistently methylated in one group, but variably methylated in another group. In general we recommend at least 10 samples in each group for testing differential variability in order to get a good estimate of the variance. It is more tricky to accurately estimate variances as opposed to means, so you generally need a larger sample size for testing differences in group variances as opposing to testing differences in group means. For the purpose of this workshop, we won’t worry about this sample size issue.

The way we specify which groups we are interested in testing is a little bit different to the way you specify a model in limma for differential methylation, particularly if you fit an intercept model. For our data, we have set up a design matrix without an intercept term, and we will specify which groups we want to test for differential variability in a contrasts statement. We can use the same design matrix that we set up for testing differential methylation.

## Fit the model for differential variability, specifying that the four cell types are the groups of interest
fitvar <- varFit(mVals, design = design, coef = c(1,2,3,4))
## Specify the groups we are interested in testing for differential variability
contr <- makeContrasts(naive-rTreg,
                       naive-act_naive,
                       rTreg-act_rTreg,
                       act_naive-act_rTreg,
                       levels=design)
fitvar <- contrasts.varFit(fitvar,contrasts=contr)
## Summary of differential variability
summary(decideTests(fitvar))
##    naive - rTreg naive - act_naive rTreg - act_rTreg act_naive - act_rTreg
## -1         46940             27942             45614                 47639
## 0         373371            398381            365505                375064
## 1          35053             29041             44245                 32661
topDV <- topVar(fitvar, coef=1)
## Top 10 differentially variable CpGs between naive and rTreg
topDV
##            SampleVar LogVarRatio DiffLevene          t      P.Value
## cg17800497 3.8641666   -5.477617  -4.138581 -10.962447 5.802882e-28
## cg10962762 4.1689981   -5.519295  -4.559108 -10.832188 2.427623e-27
## cg00143364 1.9321804   -5.431527  -3.011704  -9.763057 1.623980e-22
## cg07665222 0.6970484  -10.665516  -1.658147  -9.647707 5.032801e-22
## cg24338920 0.2121397   -9.069818  -1.257323  -9.606404 7.521706e-22
## cg20560169 6.2982954    7.100821   4.514489   9.552649 1.265693e-21
## cg00731661 1.5733049   -5.156593  -3.247191  -9.307495 1.310201e-20
## cg03281139 0.2439198   -9.684861  -1.257629  -9.024908 1.800136e-19
## cg05257260 1.2329180   -5.816190  -2.235748  -8.946056 3.687428e-19
## cg27594095 1.9241378   -5.588855  -2.478976  -8.942318 3.814327e-19
##             Adj.P.Value
## cg17800497 2.642424e-22
## cg10962762 5.527261e-22
## cg00143364 2.465006e-17
## cg07665222 5.729391e-17
## cg24338920 6.850229e-17
## cg20560169 9.605850e-17
## cg00731661 8.523119e-16
## cg03281139 1.024647e-14
## cg05257260 1.736907e-14
## cg27594095 1.736907e-14

Just as we had a look at the beta values for the significant differentially methylated CpGs, it is useful to plot differentially variable CpGs too.

par(mfrow=c(2,2))
sapply(rownames(topDV)[1:4], function(cpg){
  plotCpg(bVals, cpg=cpg, pheno = targets$Sample_Group)
})
## $cg17800497
## NULL
## 
## $cg10962762
## NULL
## 
## $cg00143364
## NULL
## 
## $cg07665222
## NULL

An example of testing for differential variability when you have a design matrix with an intercept term is detailed in the missMethyl vignette.

3.3 Differential methylation analysis (regions)

Although performing a probe-wise analysis is useful and informative, sometimes we are interested in knowing whether several proximal CpGs are concordantly differentially methylated i.e. we want to identify differentially methylated regions. There are several Bioconductor packages that can identify differentially methylated regions from 450k data. Some of the most popular are DMRcate, minfi plus bumphunter and charm. They are each based on different statistical methods. In this workshop, we are using DMRcate because it is faster than the other methods and is based on limma, so we can directly use the design and contMatrix we previously defined.

Firstly, our matrix of M-values is annotated with relevant information about the probes such as their genomic position, gene annotation, etc. By default, this is done using the ilmn12.hg19 annotation, but this can be substituted for any argument compatible with the interface provided by the minfi package. The backbone of the limma pipeline is then used for differential array analysis to calculate t-statistics and, optionally, filter probes by their FDR-corrected p-value.

myAnnotation = cpg.annotate(mVals, analysis.type="differential", design=design, contrasts = TRUE,
                             cont.matrix = contMatrix, coef="naive - rTreg")
## Your contrast returned 3109 individually significant probes. We recommend the default setting of pcutoff in dmrcate().
str(myAnnotation)
## List of 8
##  $ ID     : Factor w/ 455364 levels "cg00000029","cg00000108",..: 240653 405745 269558 20085 300229 209908 392600 349036 288638 3548 ...
##  $ weights: num [1:455364] 0.0488 -2.0753 0.7697 -0.0304 -0.7638 ...
##  $ CHR    : Factor w/ 24 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pos    : int [1:455364] 15865 534242 710097 714177 720865 758829 763119 765028 779995 790667 ...
##  $ gene   : Factor w/ 41377 levels "","A1BG","A1CF;A1CF;A1CF",..: 39230 1 1 1 1 1 23874 20098 20098 1 ...
##  $ group  : Factor w/ 4077 levels "","1stExon","1stExon;1stExon",..: 2001 1 1 1 1 1 3197 2001 2001 1 ...
##  $ betafc : num [1:455364] 0.00039 -0.04534 0.01594 0.00251 -0.00869 ...
##  $ indfdr : num [1:455364] 0.994 0.567 0.872 0.997 0.873 ...
##  - attr(*, "row.names")= int [1:455364] 440613 57689 242178 446183 243529 192153 275582 271 7655 80075 ...
##  - attr(*, "class")= chr "annot"

We can then use the dmrcate function to look for differentially methylated regions. The main output table DMRs$results contains all of the regions found, along with their genomic annotations and p-values.

DMRs = dmrcate(myAnnotation, lambda=1000, C=2)
## Fitting chr1...
## Fitting chr10...
## Fitting chr11...
## Fitting chr12...
## Fitting chr13...
## Fitting chr14...
## Fitting chr15...
## Fitting chr16...
## Fitting chr17...
## Fitting chr18...
## Fitting chr19...
## Fitting chr2...
## Fitting chr20...
## Fitting chr21...
## Fitting chr22...
## Fitting chr3...
## Fitting chr4...
## Fitting chr5...
## Fitting chr6...
## Fitting chr7...
## Fitting chr8...
## Fitting chr9...
## Fitting chrX...
## Fitting chrY...
## Demarcating regions...
## Done!
head(DMRs$results)
##     gene_assoc                        group                hg19coord
## 863  PHF1,CUTA           Body,3'UTR,1stExon   chr6:33383412-33385365
## 724      TIGIT TSS1500,TSS200,5'UTR,1stExon chr3:114012316-114012912
## 20     TNFRSF8                   5'UTR,Body   chr1:12188502-12188637
## 343      APBA2 TSS1500,TSS200,5'UTR,1stExon  chr15:29212889-29213860
## 903    ZC3H12D 5'UTR,1stExon,TSS200,TSS1500 chr6:149805995-149806732
## 42      SLC9A1                         Body   chr1:27440463-27440721
##     no.probes       minpval      meanpval  maxbetafc
## 863        19 1.126484e-198  6.213227e-21  0.5469172
## 724         5 4.754071e-181 4.451614e-146  0.5434277
## 20          3 2.002732e-131 3.316704e-131  0.3744725
## 343         8 2.106110e-125  4.567477e-59 -0.4453199
## 903        11 1.599959e-119  1.156680e-64  0.4879785
## 42          3 4.314861e-116 1.121256e-111 -0.4526273

The regions can be quickly examined using the DMR.plot function provided in the DMRcate package.

par(mfrow=c(1,1))
DMR.plot(dmrcoutput=DMRs, dmr=1, betas=bVals, phen.col=pal[factor(targets$Sample_Group)],
         pch=16, toscale=TRUE, plotmedians=TRUE)
legend("bottomleft",legend=levels(factor(targets$Sample_Group)),col=pal,lty=1)

4 Visualisations

The Gviz package offers some powerful functionality for plotting genomic data in its genomic context. Their extensive vignette covers the various types of plots that can be produced using their framework. We will re-plot the top differentially methylated region from the previous analysis to demonstrate the richness of the visualisations that can be created.

Firstly, we will set up the genomic region we would like to plot by extracting the genomic coordinates of the top differentially methylated region.

gen = "hg19"
coords = strsplit2(DMRs$results$hg19coord[1],":")
chrom = coords[1]
minbase = as.numeric(strsplit2(coords[2],"-")[1]) - 2000
maxbase = as.numeric(strsplit2(coords[2],"-")[2]) + 2000

Next we will load up some genomic annotations of interest such as the locations of CpG islands and DNAseI hypersensitive sites.

## CpG islands
islandHMM = read.csv(paste(dataDirectory,"model-based-cpg-islands-hg19.txt",sep="/"), 
                     sep="\t", stringsAsFactors =FALSE, header=TRUE)
head(islandHMM)
##     chr  start    end length CpGcount GCcontent pctGC obsExp
## 1 chr10  93098  93818    721       32       403 0.559  0.572
## 2 chr10  94002  94165    164       12        97 0.591  0.841
## 3 chr10  94527  95302    776       65       538 0.693  0.702
## 4 chr10 119652 120193    542       53       369 0.681  0.866
## 5 chr10 122133 122621    489       51       339 0.693  0.880
## 6 chr10 180265 180720    456       32       256 0.561  0.893
islandData = GRanges(seqnames = Rle(islandHMM$chr), 
                     ranges = IRanges(islandHMM$start, end = islandHMM$end),
                    strand = Rle(strand(rep("*",nrow(islandHMM)))))
islandData = islandData[seqnames(islandData) == chrom & 
                         (start(islandData) >= minbase & end(islandData) <= maxbase)] # this is to reduce the amount of data in memory
islandData
## GRanges object with 1 range and 0 metadata columns:
##       seqnames               ranges strand
##          <Rle>            <IRanges>  <Rle>
##   [1]     chr6 [33385344, 33386207]      *
##   -------
##   seqinfo: 81 sequences from an unspecified genome; no seqlengths
## DNAseI hypersensitive sites
dnase = read.csv(paste(dataDirectory,"wgEncodeRegDnaseClusteredV3.bed",sep="/"),
                 sep="\t",stringsAsFactors=FALSE,header=FALSE)
head(dnase)
##     V1     V2     V3 V4  V5 V6
## 1 chr1  10100  10330 38 261 38
## 2 chr1  10345  10590  4 310  4
## 3 chr1  16100  16315  5 158  5
## 4 chr1  65905  66055  1 157  1
## 5 chr1  91405  91615  4 278  4
## 6 chr1 115600 115790  3 545  3
##                                                                                                                      V7
## 1 3,12,13,15,21,22,32,37,36,38,39,40,50,56,57,58,59,60,53,54,62,70,76,85,93,95,103,111,117,120,1,31,77,79,73,75,87,116,
## 2                                                                                                          10,85,95,31,
## 3                                                                                                      19,13,15,46,111,
## 4                                                                                                                   19,
## 5                                                                                                           9,26,72,12,
## 6                                                                                                            15,103,86,
##                                                                                                                               V8
## 1 50,247,129,38,52,89,138,61,54,65,35,108,198,34,68,31,48,26,59,42,109,34,105,253,56,204,99,261,101,97,19,59,88,53,72,49,46,140,
## 2                                                                                                                37,142,124,310,
## 3                                                                                                             143,158,102,33,80,
## 4                                                                                                                           157,
## 5                                                                                                                172,278,223,62,
## 6                                                                                                                    324,57,545,
dnaseData = GRanges(seqnames = dnase[,1],
                    ranges = IRanges(dnase[,2], end = dnase[,3]),
                    strand = Rle(rep("*",nrow(dnase))),
                    data = dnase[,5])
dnaseData = dnaseData[seqnames(dnaseData) == chrom & 
                          (start(dnaseData) >= minbase & end(dnaseData) <= maxbase)] # this is to reduce the amount of data in memory
dnaseData
## GRanges object with 14 ranges and 1 metadata column:
##        seqnames               ranges strand   |      data
##           <Rle>            <IRanges>  <Rle>   | <integer>
##    [1]     chr6 [33381425, 33381575]      *   |       223
##    [2]     chr6 [33382000, 33382210]      *   |       386
##    [3]     chr6 [33382260, 33382410]      *   |       101
##    [4]     chr6 [33382460, 33382715]      *   |      1000
##    [5]     chr6 [33382765, 33382995]      *   |       290
##    ...      ...                  ...    ... ...       ...
##   [10]     chr6 [33384905, 33385555]      *   |      1000
##   [11]     chr6 [33385625, 33386275]      *   |      1000
##   [12]     chr6 [33386345, 33386495]      *   |       362
##   [13]     chr6 [33386500, 33386515]      *   |       362
##   [14]     chr6 [33386620, 33387055]      *   |       777
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Now, set up the ideogram, genome and RefSeq tracks that will provide context for our methylation data.

iTrack = IdeogramTrack(genome = gen, chromosome = chrom, name="")
gTrack = GenomeAxisTrack(col="black", cex=1, name="", fontcolor="black")
rTrack = UcscTrack(genome = gen, chromosome = chrom, track = "refGene", from = minbase, 
                            to = maxbase, trackType = "GeneRegionTrack", rstarts = "exonStarts", 
                            rends = "exonEnds", gene = "name", symbol = "name2", transcript = "name", 
                            strand = "strand", fill = "darkblue",stacking = "squish", name = "RefSeq", 
                            showId=TRUE, geneSymbol=TRUE)

Ensure that the methylation data is ordered by chromosome and base position.

ann450kOrd = ann450kSub[order(ann450kSub$chr,ann450kSub$pos),]
head(ann450kOrd)
## DataFrame with 6 rows and 22 columns
##                    chr       pos      strand        Name    Probe_rs
##            <character> <integer> <character> <character> <character>
## cg13869341        chr1     15865           +  cg13869341          NA
## cg24669183        chr1    534242           -  cg24669183   rs6680725
## cg15560884        chr1    710097           +  cg15560884          NA
## cg01014490        chr1    714177           -  cg01014490          NA
## cg17505339        chr1    720865           -  cg17505339          NA
## cg11954957        chr1    758829           +  cg11954957 rs115498424
##            Probe_maf      CpG_rs   CpG_maf      SBE_rs   SBE_maf
##            <numeric> <character> <numeric> <character> <numeric>
## cg13869341        NA          NA        NA          NA        NA
## cg24669183  0.108100          NA        NA          NA        NA
## cg15560884        NA          NA        NA          NA        NA
## cg01014490        NA          NA        NA          NA        NA
## cg17505339        NA          NA        NA          NA        NA
## cg11954957  0.029514          NA        NA          NA        NA
##                  Islands_Name Relation_to_Island UCSC_RefGene_Name
##                   <character>        <character>       <character>
## cg13869341                               OpenSea            WASH5P
## cg24669183 chr1:533219-534114            S_Shore                  
## cg15560884 chr1:713984-714547            N_Shelf                  
## cg01014490 chr1:713984-714547             Island                  
## cg17505339                               OpenSea                  
## cg11954957 chr1:762416-763445            N_Shelf                  
##            UCSC_RefGene_Accession UCSC_RefGene_Group     Phantom
##                       <character>        <character> <character>
## cg13869341              NR_024540               Body            
## cg24669183                                                      
## cg15560884                                                      
## cg01014490                                                      
## cg17505339                                                      
## cg11954957                                                      
##                    DMR    Enhancer      HMM_Island Regulatory_Feature_Name
##            <character> <character>     <character>             <character>
## cg13869341                                                                
## cg24669183                         1:523025-524193                        
## cg15560884                                                                
## cg01014490                         1:703784-704410         1:713802-715219
## cg17505339                                                                
## cg11954957                                                                
##            Regulatory_Feature_Group         DHS
##                         <character> <character>
## cg13869341                                     
## cg24669183                                     
## cg15560884                                     
## cg01014490      Promoter_Associated            
## cg17505339                                     
## cg11954957
bValsOrd = bVals[match(ann450kOrd$Name,rownames(bVals)),]
head(bValsOrd)
##               naive.1    rTreg.2 act_naive.3    naive.4 act_naive.5
## cg13869341 0.84267937 0.85118462   0.8177504 0.82987650  0.81186174
## cg24669183 0.81812908 0.82489238   0.8293297 0.75610281  0.81967323
## cg15560884 0.77219626 0.74903910   0.7516263 0.77417882  0.77266205
## cg01014490 0.08098986 0.06590459   0.0233755 0.04127262  0.04842397
## cg17505339 0.89439216 0.93822870   0.9471357 0.90520570  0.92641305
## cg11954957 0.74495496 0.79008516   0.7681146 0.84450764  0.75431167
##            act_rTreg.6   naive.7    rTreg.8 act_naive.9 act_rTreg.10
## cg13869341   0.8090798 0.8891851 0.88537940  0.90916748   0.88334231
## cg24669183   0.8187838 0.7903763 0.85304116  0.80930568   0.80979554
## cg15560884   0.7721528 0.7658623 0.75909061  0.78099397   0.78569274
## cg01014490   0.0644404 0.0245281 0.02832358  0.07740468   0.04640659
## cg17505339   0.9286016 0.8889361 0.87205348  0.90099782   0.93508348
## cg11954957   0.8116911 0.7832207 0.84929777  0.84719430   0.83350220

Create the data tracks using the appropriate track type for each data type.

methTrack = DataTrack(start=ann450kOrd$pos, width=1, 
                      data=t(bValsOrd), 
                     groups=targets$Sample_Group,genome = gen,chromosome = chrom, ylim=c(-0.05,1.05),
                     col=pal,type=c("a","p"), name="DNA Meth.\n(beta value)", background.panel="white",legend=TRUE,
                     cex.title=0.4,cex.axis=0.5,cex.legend=0.8)
islandTrack = AnnotationTrack(range=islandData, genome = gen, name = "CpG Is.", fill="grey40",
                              chromosome = chrom)
dnaseTrack = DataTrack(range=dnaseData, genome = gen, name = "DNAseI", type="gradient",
                        chromosome = chrom)

Set up the track list and indicate the relative sizes of the different tracks. Finally, make the plot using the plotTracks function.

tracks = list(iTrack,gTrack,methTrack,islandTrack,dnaseTrack,rTrack)
sizes = c(3,3,5,3,3,5)

plotTracks(tracks, from = minbase, to = maxbase, showTitle = TRUE, add53 = TRUE, add35 = TRUE, 
             grid=TRUE, lty.grid=3, sizes=sizes, length(tracks))

5 Software versions

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] GO.db_3.1.2                                       
##  [2] org.Hs.eg.db_3.1.2                                
##  [3] AnnotationDbi_1.30.1                              
##  [4] DMRcate_1.4.2                                     
##  [5] DMRcatedata_1.4.0                                 
##  [6] Gviz_1.12.1                                       
##  [7] minfiData_0.10.0                                  
##  [8] matrixStats_0.15.0                                
##  [9] missMethyl_1.2.0                                  
## [10] RSQLite_1.0.0                                     
## [11] DBI_0.3.1                                         
## [12] RColorBrewer_1.1-2                                
## [13] IlluminaHumanMethylation450kmanifest_0.4.0        
## [14] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
## [15] minfi_1.14.0                                      
## [16] bumphunter_1.8.0                                  
## [17] locfit_1.5-9.1                                    
## [18] iterators_1.0.8                                   
## [19] foreach_1.4.3                                     
## [20] Biostrings_2.36.4                                 
## [21] XVector_0.8.0                                     
## [22] GenomicRanges_1.20.8                              
## [23] GenomeInfoDb_1.4.3                                
## [24] IRanges_2.2.9                                     
## [25] S4Vectors_0.6.6                                   
## [26] lattice_0.20-33                                   
## [27] Biobase_2.28.0                                    
## [28] BiocGenerics_0.14.0                               
## [29] limma_3.24.15                                     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-122              bitops_1.0-6             
##  [3] tools_3.2.2               doRNG_1.6                
##  [5] nor1mix_1.2-1             rpart_4.1-10             
##  [7] Hmisc_3.17-0              colorspace_1.2-6         
##  [9] nnet_7.3-11               methylumi_2.14.0         
## [11] gridExtra_2.0.0           base64_1.1               
## [13] preprocessCore_1.30.0     formatR_1.2.1            
## [15] BiasedUrn_1.06.1          pkgmaker_0.22            
## [17] rtracklayer_1.28.10       scales_0.3.0             
## [19] genefilter_1.50.0         quadprog_1.5-5           
## [21] stringr_1.0.0             digest_0.6.8             
## [23] Rsamtools_1.20.5          foreign_0.8-66           
## [25] illuminaio_0.10.0         rmarkdown_0.8.1          
## [27] siggenes_1.42.0           GEOquery_2.34.0          
## [29] dichromat_2.0-0           htmltools_0.2.6          
## [31] BSgenome_1.36.3           ruv_0.9.6                
## [33] mclust_5.1                BiocParallel_1.2.22      
## [35] acepack_1.3-3.3           VariantAnnotation_1.14.13
## [37] RCurl_1.95-4.7            magrittr_1.5             
## [39] Formula_1.2-1             futile.logger_1.4.1      
## [41] Rcpp_0.12.2               munsell_0.4.2            
## [43] proto_0.3-10              stringi_1.0-1            
## [45] yaml_2.1.13               MASS_7.3-45              
## [47] zlibbioc_1.14.0           plyr_1.8.3               
## [49] splines_3.2.2             multtest_2.24.0          
## [51] GenomicFeatures_1.20.6    annotate_1.46.1          
## [53] knitr_1.11                beanplot_1.2             
## [55] rngtools_1.2.4            reshape2_1.4.1           
## [57] codetools_0.2-14          biomaRt_2.24.1           
## [59] futile.options_1.0.0      XML_3.98-1.3             
## [61] evaluate_0.8              biovizBase_1.16.0        
## [63] latticeExtra_0.6-26       lambda.r_1.1.7           
## [65] gtable_0.1.2              reshape_0.8.5            
## [67] ggplot2_1.0.1             xtable_1.8-0             
## [69] survival_2.38-3           GenomicAlignments_1.4.2  
## [71] registry_0.3              cluster_2.0.3            
## [73] statmod_1.4.22