Introduction

DNA methylation is an epigenetic mechanism used by cells to control gene expression. A methylated gene has a methyl group added to a CG dinucleotide of DNA, resulting in that gene being turned to the “off” position. Here we will use the Bioconductor workflow “methylationArrayAnalysis”" to analyze methylation data produced by the Illumina Infinium HumanMethylation450 array. This platform measures methylation at approximately 450,000 CpG sites and is the most common way to interrogate methylation across the human genome. Samples are treated with bisulfite conversion, which changes unmethylated cytosine to uracil. Then methylated cytosines can be distinguished from unmethylated cytosines by the base added downstream, either “G” for methylated CpG sites, or “A” for unmethylated sites. Each locus is detected in two colors (red and green), and methylation status is determined by comparing the two colors from the one position. For each CpG, there are two measurements: a methylated intensity (denoted by M) and an unmethylated intensity (denoted by U). These intensity values can be used to determine the proportion of methylation at each CpG locus. Methylation levels are commonly reported as beta values: (beta = M/(M+U)) or M-values (Mvalue=log2(M/U)). Each individual CpG probe can be tested for differential methylation for the comparisons of interest and p-values and moderated t-statistics can be generated for each CpG probe.

We will be using a small, publicly available 450k methylation dataset (Zhang et al., Blood 2013; GEO accession number GSE49667). The dataset contains 10 samples in total: there are 4 different sorted T-cell types (naive, rTreg, act_naive, act_rTreg, collected from 3 different individuals (M28, M29, M30).

Loading the data

The data required for this workflow has been bundled with the R package that we will install first:

source("http://bioconductor.org/workflows.R")
workflowInstall("methylationArrayAnalysis")

## package 'methylationArrayAnalysis' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Chrit\AppData\Local\Temp\RtmpuQIUI1\downloaded_packages

We need to set up a path to the data directory.

dataDirectory <- system.file("extdata", package = "methylationArrayAnalysis")
list.files(dataDirectory, recursive = TRUE)

##  [1] "48639-non-specific-probes-Illumina450k.csv"
##  [2] "5975827018/5975827018_R06C02_Grn.idat"     
##  [3] "5975827018/5975827018_R06C02_Red.idat"     
##  [4] "6264509100/6264509100_R01C01_Grn.idat"     
##  [5] "6264509100/6264509100_R01C01_Red.idat"     
##  [6] "6264509100/6264509100_R01C02_Grn.idat"     
##  [7] "6264509100/6264509100_R01C02_Red.idat"     
##  [8] "6264509100/6264509100_R02C01_Grn.idat"     
##  [9] "6264509100/6264509100_R02C01_Red.idat"     
## [10] "6264509100/6264509100_R02C02_Grn.idat"     
## [11] "6264509100/6264509100_R02C02_Red.idat"     
## [12] "6264509100/6264509100_R03C01_Grn.idat"     
## [13] "6264509100/6264509100_R03C01_Red.idat"     
## [14] "6264509100/6264509100_R03C02_Grn.idat"     
## [15] "6264509100/6264509100_R03C02_Red.idat"     
## [16] "6264509100/6264509100_R04C01_Grn.idat"     
## [17] "6264509100/6264509100_R04C01_Red.idat"     
## [18] "6264509100/6264509100_R04C02_Grn.idat"     
## [19] "6264509100/6264509100_R04C02_Red.idat"     
## [20] "6264509100/6264509100_R05C01_Grn.idat"     
## [21] "6264509100/6264509100_R05C01_Red.idat"     
## [22] "6264509100/6264509100_R05C02_Grn.idat"     
## [23] "6264509100/6264509100_R05C02_Red.idat"     
## [24] "6264509100/6264509100_R06C01_Grn.idat"     
## [25] "6264509100/6264509100_R06C01_Red.idat"     
## [26] "6264509100/6264509100_R06C02_Grn.idat"     
## [27] "6264509100/6264509100_R06C02_Red.idat"     
## [28] "ageData.RData"                             
## [29] "human_c2_v5.rdata"                         
## [30] "model-based-cpg-islands-hg19-chr17.txt"    
## [31] "SampleSheet.csv"                           
## [32] "wgEncodeRegDnaseClusteredV3chr17.bed"

Next we will load all of the required packages for this analysis.

library(limma)
library(minfi)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(IlluminaHumanMethylation450kmanifest)
library(RColorBrewer)
library(missMethyl)
library(matrixStats)
library(minfiData)
library(Gviz)
library(DMRcate)
library(stringr)

The IlluminaHumanMethylation450kmanifest package provides the Illumina manifest as an R object which can easily be loaded into the environment. The manifest contains all of the annotation information for each of the CpG probes on the 450k array. This is useful for determining where any differentially methylated probes are located in a genomic context.

ann450k = getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
head(ann450k)

## DataFrame with 6 rows and 33 columns
##                    chr       pos      strand        Name    AddressA
##            <character> <integer> <character> <character> <character>
## cg00050873        chrY   9363356           -  cg00050873    32735311
## cg00212031        chrY  21239348           -  cg00212031    29674443
## cg00213748        chrY   8148233           -  cg00213748    30703409
## cg00214611        chrY  15815688           -  cg00214611    69792329
## cg00455876        chrY   9385539           -  cg00455876    27653438
## cg01707559        chrY   6778695           +  cg01707559    45652402
##               AddressB                                          ProbeSeqA
##            <character>                                        <character>
## cg00050873    31717405 ACAAAAAAACAACACACAACTATAATAATTTTTAAAATAAATAAACCCCA
## cg00212031    38703326 CCCAATTAACCACAAAAACTAAACAAATTATACAATCAAAAAAACATACA
## cg00213748    36767301 TTTTAACACCTAACACCATTTTAACAATAAAAATTCTACAAAAAAAAACA
## cg00214611    46723459 CTAACTTCCAAACCACACTTTATATACTAAACTACAATATAACACAAACA
## cg00455876    69732350 AACTCTAAACTACCCAACACAAACTCCAAAAACTTCTCAAAAAAAACTCA
## cg01707559    64689504 ACAAATTAAAAACACTAAAACAAACACAACAACTACAACAACAAAAAACA
##                                                     ProbeSeqB        Type
##                                                   <character> <character>
## cg00050873 ACGAAAAAACAACGCACAACTATAATAATTTTTAAAATAAATAAACCCCG           I
## cg00212031 CCCAATTAACCGCAAAAACTAAACAAATTATACGATCGAAAAAACGTACG           I
## cg00213748 TTTTAACGCCTAACACCGTTTTAACGATAAAAATTCTACAAAAAAAAACG           I
## cg00214611 CTAACTTCCGAACCGCGCTTTATATACTAAACTACAATATAACGCGAACG           I
## cg00455876 AACTCTAAACTACCCGACACAAACTCCAAAAACTTCTCGAAAAAAACTCG           I
## cg01707559 GCGAATTAAAAACACTAAAACGAACGCGACGACTACAACGACAAAAAACG           I
##               NextBase       Color    Probe_rs Probe_maf      CpG_rs
##            <character> <character> <character> <numeric> <character>
## cg00050873           A         Red          NA        NA          NA
## cg00212031           T         Red          NA        NA          NA
## cg00213748           A         Red          NA        NA          NA
## cg00214611           A         Red          NA        NA          NA
## cg00455876           A         Red          NA        NA          NA
## cg01707559           A         Red          NA        NA          NA
##              CpG_maf      SBE_rs   SBE_maf           Islands_Name
##            <numeric> <character> <numeric>            <character>
## cg00050873        NA          NA        NA   chrY:9363680-9363943
## cg00212031        NA          NA        NA chrY:21238448-21240005
## cg00213748        NA          NA        NA   chrY:8147877-8148210
## cg00214611        NA          NA        NA chrY:15815488-15815779
## cg00455876        NA          NA        NA   chrY:9385471-9385777
## cg01707559        NA          NA        NA   chrY:6778574-6780028
##            Relation_to_Island
##                   <character>
## cg00050873            N_Shore
## cg00212031             Island
## cg00213748            S_Shore
## cg00214611             Island
## cg00455876             Island
## cg01707559             Island
##                                                                                                                        Forward_Sequence
##                                                                                                                             <character>
## cg00050873 TATCTCTGTCTGGCGAGGAGGCAACGCACAACTGTGGTGGTTTTTGGAGTGGGTGGACCC[CG]GCCAAGACGGCCTGGGCTGACCAGAGACGGGAGGCAGAAAAAGTGGGCAGGTGGTTGCAG
## cg00212031 CCATTGGCCCGCCCCAGTTGGCCGCAGGGACTGAGCAAGTTATGCGGTCGGGAAGACGTG[CG]TTAAAGGGCTGAAGGGGAGGGACGGAACTGACAGTCTCTGTGACAGCTCTGAGGTGGGAG
## cg00213748 TCTGTGGGACCATTTTAACGCCTGGCACCGTTTTAACGATGGAGGTTCTGCAGGAGGGGG[CG]ACCTGGGGTAGGAGGCGTGCTAGTGGTGGATGACATTGTGGCAGAGATGGAGGTGGTGGC
## cg00214611 GCGCCGGCAGGACTAGCTTCCGGGCCGCGCTTTGTGTGCTGGGCTGCAGTGTGGCGCGGG[CG]AGGAAGCTGGTAGGGCGGTTGTCGCAAGCTCCAGCTGCAGCCTCCGCCTACGTGAGAAGA
## cg00455876 CGCGTGTGCCTGGACTCTGAGCTACCCGGCACAAGCTCCAAGGGCTTCTCGGAGGAGGCT[CG]GGGACGGAAGGCGTGGGGTGAGTGGGCTGGAGATGCAGGCGCGCCCGTGGCTGTGCAGCC
## cg01707559 AGCGGCCGCTCCCAGTGGTGGTCACCGCCAGTGCCAATCCCTTGCGCCGCCGTGCAGTCC[CG]CCCTCTGTCGCTGCAGCCGCCGCGCCCGCTCCAGTGCCCCCAATTCGCGCTCGGGAGTGA
##                                                     SourceSeq Random_Loci
##                                                   <character> <character>
## cg00050873 CGGGGTCCACCCACTCCAAAAACCACCACAGTTGTGCGTTGCCTCCTCGC            
## cg00212031 CGCACGTCTTCCCGACCGCATAACTTGCTCAGTCCCTGCGGCCAACTGGG            
## cg00213748 CGCCCCCTCCTGCAGAACCTCCATCGTTAAAACGGTGCCAGGCGTTAAAA            
## cg00214611 CGCCCGCGCCACACTGCAGCCCAGCACACAAAGCGCGGCCCGGAAGCTAG            
## cg00455876 GACTCTGAGCTACCCGGCACAAGCTCCAAGGGCTTCTCGGAGGAGGCTCG            
## cg01707559 CGCCCTCTGTCGCTGCAGCCGCCGCGCCCGCTCCAGTGCCCCCAATTCGC            
##            Methyl27_Loci UCSC_RefGene_Name        UCSC_RefGene_Accession
##              <character>       <character>                   <character>
## cg00050873                  TSPY4;FAM197Y2        NM_001164471;NR_001553
## cg00212031                          TTTY14                     NR_001543
## cg00213748                                                              
## cg00214611                   TMSB4Y;TMSB4Y           NM_004202;NM_004202
## cg00455876                                                              
## cg01707559               TBL1Y;TBL1Y;TBL1Y NM_134259;NM_033284;NM_134258
##              UCSC_RefGene_Group     Phantom         DMR    Enhancer
##                     <character> <character> <character> <character>
## cg00050873         Body;TSS1500                                    
## cg00212031               TSS200                                    
## cg00213748                                                         
## cg00214611        1stExon;5'UTR                                    
## cg00455876                                                         
## cg01707559 TSS200;TSS200;TSS200                                    
##                     HMM_Island Regulatory_Feature_Name
##                    <character>             <character>
## cg00050873   Y:9973136-9976273                        
## cg00212031 Y:19697854-19699393                        
## cg00213748   Y:8207555-8208234                        
## cg00214611 Y:14324883-14325218     Y:15815422-15815706
## cg00455876   Y:9993394-9995882                        
## cg01707559   Y:6838022-6839951                        
##                          Regulatory_Feature_Group         DHS
##                                       <character> <character>
## cg00050873                                                   
## cg00212031                                                   
## cg00213748                                                   
## cg00214611 Promoter_Associated_Cell_type_specific            
## cg00455876                                                   
## cg01707559

Illumina methylation data is usually obtained in the form of Intensity Data (IDAT) Files. We will import the raw methylation data into R is using the minfi function read.metharray.sheet, along with the path to the IDAT files and a sample sheet. The sample sheet is a CSV (comma-separated) file containing one line per sample, with a number of columns describing each sample. This function will also create a column called Basename which specifies the location of each individual IDAT file in the experiment.

targets <- read.metharray.sheet(dataDirectory, pattern="SampleSheet.csv")

## [1] "C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/SampleSheet.csv"

targets

##    Sample_Name Sample_Well   Sample_Source Sample_Group Sample_Label
## 1            1          A1             M28        naive        naive
## 2            2          B1             M28        rTreg        rTreg
## 3            3          C1             M28    act_naive    act_naive
## 4            4          D1             M29        naive        naive
## 5            5          E1             M29    act_naive    act_naive
## 6            6          F1             M29    act_rTreg    act_rTreg
## 7            7          G1             M30        naive        naive
## 8            8          H1             M30        rTreg        rTreg
## 9            9          A2             M30    act_naive    act_naive
## 10          10          B2             M30    act_rTreg    act_rTreg
## 11          11         H06 VICS-72098-18-B        birth        birth
##    Pool_ID  Array      Slide
## 1     <NA> R01C01 6264509100
## 2     <NA> R02C01 6264509100
## 3     <NA> R03C01 6264509100
## 4     <NA> R04C01 6264509100
## 5     <NA> R05C01 6264509100
## 6     <NA> R06C01 6264509100
## 7     <NA> R01C02 6264509100
## 8     <NA> R02C02 6264509100
## 9     <NA> R03C02 6264509100
## 10    <NA> R04C02 6264509100
## 11    <NA> R06C02 5975827018
##                                                                                                    Basename
## 1  C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R01C01
## 2  C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R02C01
## 3  C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R03C01
## 4  C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R04C01
## 5  C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R05C01
## 6  C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R06C01
## 7  C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R01C02
## 8  C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R02C02
## 9  C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R03C02
## 10 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R04C02
## 11 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/5975827018/5975827018_R06C02

Next we will read the raw intensity signals into R from the IDAT files using the read.metharray.exp function. This creates an RGChannelSet object that contains all the raw intensity data, from both the red and green color channels, for each of the samples.

rgSet <- read.metharray.exp(targets=targets)
rgSet

## class: RGChannelSet 
## dim: 622399 11 
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(11): 6264509100_R01C01 6264509100_R02C01 ...
##   6264509100_R04C02 5975827018_R06C02
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19

We will also rename the samples with more descriptive names.

targets$ID <- paste(targets$Sample_Group,targets$Sample_Name,sep=".")
sampleNames(rgSet) <- targets$ID
rgSet

## class: RGChannelSet 
## dim: 622399 11 
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(11): naive.1 rTreg.2 ... act_rTreg.10 birth.11
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19

Quality Control

We can generate a detection p-value for every CpG in every sample, which is indicative of the quality of the signal. The method used by minfi to calculate detection p-values compares the total signal (M+U) for each probe to the background signal level, which is estimated from the negative control probes. Very small p-values are indicative of a reliable signal, while large p-values, for example >0.01, generally indicate a poor quality signal.

detP <- detectionP(rgSet)
head(detP)

##            naive.1 rTreg.2  act_naive.3 naive.4 act_naive.5 act_rTreg.6
## cg00050873       0       0 0.000000e+00       0 0.00000e+00           0
## cg00212031       0       0 0.000000e+00       0 0.00000e+00           0
## cg00213748       0       0 1.181832e-12       0 8.21565e-15           0
## cg00214611       0       0 0.000000e+00       0 0.00000e+00           0
## cg00455876       0       0 0.000000e+00       0 0.00000e+00           0
## cg01707559       0       0 0.000000e+00       0 0.00000e+00           0
##            naive.7      rTreg.8 act_naive.9 act_rTreg.10  birth.11
## cg00050873       0 0.000000e+00           0 0.000000e+00 0.0000000
## cg00212031       0 0.000000e+00           0 0.000000e+00 0.0000000
## cg00213748       0 1.469801e-05           0 1.365951e-08 0.6735224
## cg00214611       0 0.000000e+00           0 0.000000e+00 0.7344451
## cg00455876       0 0.000000e+00           0 0.000000e+00 0.0000000
## cg01707559       0 0.000000e+00           0 0.000000e+00 0.0000000

Next we need to examine mean detection p-values across all samples to identify any failed samples, and plot the mean detection p-value for each sample.

pal <- brewer.pal(8,"Dark2")
par(mfrow=c(1,2))
barplot(colMeans(detP), col=pal[factor(targets$Sample_Group)], las=2, 
        cex.names=0.8, ylim=c(0,0.002), ylab="Mean detection p-values")
abline(h=0.05,col="red")
legend("topleft", legend=levels(factor(targets$Sample_Group)), fill=pal, 
       bg="white")

This workflow introduced a birth sample from another study to demonstrate what a low quality sample would look like. We will use a detection p-value cutoff of >0.05, which removes the birth sample from further analysis.

#remove poor quality samples from rgSet
keep <- colMeans(detP) < 0.05
rgSet <- rgSet[,keep]
rgSet

## class: RGChannelSet 
## dim: 622399 10 
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19

#remove poor quality samples from targets
targets <- targets[keep,]
targets[,1:5]

##    Sample_Name Sample_Well Sample_Source Sample_Group Sample_Label
## 1            1          A1           M28        naive        naive
## 2            2          B1           M28        rTreg        rTreg
## 3            3          C1           M28    act_naive    act_naive
## 4            4          D1           M29        naive        naive
## 5            5          E1           M29    act_naive    act_naive
## 6            6          F1           M29    act_rTreg    act_rTreg
## 7            7          G1           M30        naive        naive
## 8            8          H1           M30        rTreg        rTreg
## 9            9          A2           M30    act_naive    act_naive
## 10          10          B2           M30    act_rTreg    act_rTreg

#remove poor quality samples from detection p-values
detP <- detP[,keep]
dim(detP)

## [1] 485512     10

Normalization

The purpose of data normalization is to minimize the unwanted variation within and between samples. We will apply the preprocessQuantile method to our data. This function implements a stratified quantile normalisation procedure which is applied to the methylated and unmethylated signal intensities separately, and takes into account the different probe types. Note that after normalization, the data is housed in a GenomicRatioSet object. This is a much more compact representation of the data as the colour channel information has been discarded and the M and U intensity information has been converted to M-values and beta values, together with associated genomic coordinates. We can visualize what the data look like before and after normalization.

mSetSq <- preprocessQuantile(rgSet) 
mSetRaw <- preprocessRaw(rgSet)
par(mfrow=c(1,2))
densityPlot(rgSet, sampGroups=targets$Sample_Group,main="Raw", legend=FALSE)
legend("top", legend = levels(factor(targets$Sample_Group)), 
       text.col=brewer.pal(8,"Dark2"))
densityPlot(getBeta(mSetSq), sampGroups=targets$Sample_Group,
            main="Normalized", legend=FALSE)
legend("top", legend = levels(factor(targets$Sample_Group)), 
       text.col=brewer.pal(8,"Dark2"))

Data Exploration

We will use multi-dimensional scaling (MDS) plots for visualising the data. MDS plots are based on principal components analysis and are an unsupervised method for looking at the similarities and differences between the various samples. Samples that are more similar to each other should cluster together, and samples that are very different should be further apart on the plot. In this dataset, the largest source of variation is between individuals. Therefore, the statistical model will have to account for individual to individual variation.

par(mfrow=c(1,2))
plotMDS(getM(mSetSq), top=1000, gene.selection="common", 
        col=pal[factor(targets$Sample_Group)])
legend("top", legend=levels(factor(targets$Sample_Group)), text.col=pal,
       bg="white", cex=0.7)

plotMDS(getM(mSetSq), top=1000, gene.selection="common",  
        col=pal[factor(targets$Sample_Source)])
legend("top", legend=levels(factor(targets$Sample_Source)), text.col=pal,
       bg="white", cex=0.7)

# Examine higher dimensions to look at other sources of variation
par(mfrow=c(1,3))
plotMDS(getM(mSetSq), top=1000, gene.selection="common", 
        col=pal[factor(targets$Sample_Group)], dim=c(1,3))
legend("top", legend=levels(factor(targets$Sample_Group)), text.col=pal, 
       cex=0.7, bg="white")

plotMDS(getM(mSetSq), top=1000, gene.selection="common", 
        col=pal[factor(targets$Sample_Group)], dim=c(2,3))
legend("topleft", legend=levels(factor(targets$Sample_Group)), text.col=pal,
       cex=0.7, bg="white")

plotMDS(getM(mSetSq), top=1000, gene.selection="common", 
        col=pal[factor(targets$Sample_Group)], dim=c(3,4))
legend("topright", legend=levels(factor(targets$Sample_Group)), text.col=pal,
       cex=0.7, bg="white")

Filtering

Next we will filter out probes that have failed in one or more samples based on detection p-value.

detP <- detP[match(featureNames(mSetSq),rownames(detP)),] 
keep <- rowSums(detP < 0.01) == ncol(mSetSq) 
table(keep)

## keep
##  FALSE   TRUE 
##    977 484535

mSetSqFlt <- mSetSq[keep,]
mSetSqFlt

## class: GenomicRatioSet 
## dim: 484535 10 
## metadata(0):
## assays(2): M CN
## rownames(484535): cg13869341 cg14008030 ... cg08265308 cg14273923
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
##   predictedSex
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.22.0
##   Manifest version: 0.4.0

We will also remove probes where common SNPs may affect the CpG.

mSetSqFlt <- dropLociWithSnps(mSetSqFlt)
mSetSqFlt

## class: GenomicRatioSet 
## dim: 467351 10 
## metadata(0):
## assays(2): M CN
## rownames(467351): cg13869341 cg14008030 ... cg08265308 cg14273923
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
##   predictedSex
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.22.0
##   Manifest version: 0.4.0

We also need to filter out probes that have shown to be cross-reactive, that is, probes that have been demonstrated to map to multiple places in the genome.

xReactiveProbes <- read.csv(file=paste(dataDirectory,
                                       "48639-non-specific-probes-Illumina450k.csv",
                                       sep="/"), stringsAsFactors=FALSE)
keep <- !(featureNames(mSetSqFlt) %in% xReactiveProbes$TargetID)
table(keep)

## keep
##  FALSE   TRUE 
##  27433 439918

mSetSqFlt <- mSetSqFlt[keep,] 
mSetSqFlt

## class: GenomicRatioSet 
## dim: 439918 10 
## metadata(0):
## assays(2): M CN
## rownames(439918): cg13869341 cg24669183 ... cg08265308 cg14273923
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
##   predictedSex
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.22.0
##   Manifest version: 0.4.0

Now that the data has been filtered and normalised, we will re-examine the MDS plots to see if the relationship between the samples has changed. It is apparent from the new MDS plots that much of the inter-individual variation has been removed as this is no longer the first principal component, likely due to the removal of the SNP-affected CpG probes.

par(mfrow=c(1,2))
plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common", 
        col=pal[factor(targets$Sample_Group)], cex=0.8)
legend("right", legend=levels(factor(targets$Sample_Group)), text.col=pal,
       cex=0.65, bg="white")

plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common", 
        col=pal[factor(targets$Sample_Source)])
legend("right", legend=levels(factor(targets$Sample_Source)), text.col=pal,
       cex=0.7, bg="white")

# Examine higher dimensions to look at other sources of variation
par(mfrow=c(1,3))
plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common", 
        col=pal[factor(targets$Sample_Source)], dim=c(1,3))
legend("right", legend=levels(factor(targets$Sample_Source)), text.col=pal,
       cex=0.7, bg="white")

plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common", 
        col=pal[factor(targets$Sample_Source)], dim=c(2,3))
legend("topright", legend=levels(factor(targets$Sample_Source)), text.col=pal,
       cex=0.7, bg="white")

plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common", 
        col=pal[factor(targets$Sample_Source)], dim=c(3,4))
legend("right", legend=levels(factor(targets$Sample_Source)), text.col=pal,
       cex=0.7, bg="white")

The next step is to calculate M-values and beta values and compare their distributions on density plots.

mVals <- getM(mSetSqFlt)
head(mVals[,1:5])

##              naive.1   rTreg.2 act_naive.3   naive.4 act_naive.5
## cg13869341  2.421276  2.515948    2.165745  2.286314    2.109441
## cg24669183  2.169414  2.235964    2.280734  1.632309    2.184435
## cg15560884  1.761176  1.577578    1.597503  1.777486    1.764999
## cg01014490 -3.504268 -3.825119   -5.384735 -4.537864   -4.296526
## cg17505339  3.082191  3.924931    4.163206  3.255373    3.654134
## cg11954957  1.546401  1.912204    1.727910  2.441267    1.618331

bVals <- getBeta(mSetSqFlt)
head(bVals[,1:5])

##               naive.1    rTreg.2 act_naive.3    naive.4 act_naive.5
## cg13869341 0.84267937 0.85118462   0.8177504 0.82987650  0.81186174
## cg24669183 0.81812908 0.82489238   0.8293297 0.75610281  0.81967323
## cg15560884 0.77219626 0.74903910   0.7516263 0.77417882  0.77266205
## cg01014490 0.08098986 0.06590459   0.0233755 0.04127262  0.04842397
## cg17505339 0.89439216 0.93822870   0.9471357 0.90520570  0.92641305
## cg11954957 0.74495496 0.79008516   0.7681146 0.84450764  0.75431167

par(mfrow=c(1,2))
densityPlot(bVals, sampGroups=targets$Sample_Group, main="Beta values", 
            legend=FALSE, xlab="Beta values")
legend("top", legend = levels(factor(targets$Sample_Group)), 
       text.col=brewer.pal(8,"Dark2"))
densityPlot(mVals, sampGroups=targets$Sample_Group, main="M-values", 
            legend=FALSE, xlab="M values")
legend("topleft", legend = levels(factor(targets$Sample_Group)), 
       text.col=brewer.pal(8,"Dark2"))

We are interested in pairwise comparisons between the four cell types, taking into account individual to individual variation. We perform this analysis on the matrix of M-values in limma, obtaining moderated t-statistics and associated p-values for each CpG site. Since we are performing hundreds of thousands of hypothesis tests, we need to adjust the p-values for multiple testing. We will assign a false discovery rate of 5%. This will allow us to find significantly differentially methylated CpGs between cell types.

# this is the factor of interest
cellType <- factor(targets$Sample_Group)
# this is the individual effect that we need to account for
individual <- factor(targets$Sample_Source) 

# use the above to create a design matrix
design <- model.matrix(~0+cellType+individual, data=targets)
colnames(design) <- c(levels(cellType),levels(individual)[-1])
 
# fit the linear model 
fit <- lmFit(mVals, design)
# create a contrast matrix for specific comparisons
contMatrix <- makeContrasts(naive-rTreg,
                           naive-act_naive,
                           rTreg-act_rTreg,
                           act_naive-act_rTreg,
                           levels=design)
contMatrix

##            Contrasts
## Levels      naive - rTreg naive - act_naive rTreg - act_rTreg
##   act_naive             0                -1                 0
##   act_rTreg             0                 0                -1
##   naive                 1                 1                 0
##   rTreg                -1                 0                 1
##   M29                   0                 0                 0
##   M30                   0                 0                 0
##            Contrasts
## Levels      act_naive - act_rTreg
##   act_naive                     1
##   act_rTreg                    -1
##   naive                         0
##   rTreg                         0
##   M29                           0
##   M30                           0

# fit the contrasts
fit2 <- contrasts.fit(fit, contMatrix)
fit2 <- eBayes(fit2)

# look at the numbers of DM CpGs at FDR < 0.05
summary(decideTests(fit2))

##    naive - rTreg naive - act_naive rTreg - act_rTreg act_naive - act_rTreg
## -1          1618               400                 0                   559
## 0         436897            439291            439918                438440
## 1           1403               227                 0                   919

Next we will extract the tables of differentially expressed CpGs for each comparison, ordered by B-statistic by default, using the topTable function in limma. The results of the analysis for the first comparison, naive vs. rTreg, can be saved as a data.frame by setting coef=1. The coef parameter explicitly refers to the column in the contrasts matrix which corresponds to the comparison of interest. The resulting data.frame can easily be written to a CSV file.

ann450kSub <- ann450k[match(rownames(mVals),ann450k$Name),
                      c(1:4,12:19,24:ncol(ann450k))]
DMPs <- topTable(fit2, num=Inf, coef=1, genelist=ann450kSub)
head(DMPs)

##              chr       pos strand       Name Probe_rs Probe_maf CpG_rs
## cg07499259  chr1  12188502      + cg07499259     <NA>        NA   <NA>
## cg26992245  chr8  29848579      - cg26992245     <NA>        NA   <NA>
## cg09747445 chr15  70387268      - cg09747445     <NA>        NA   <NA>
## cg18808929  chr8  61825469      - cg18808929     <NA>        NA   <NA>
## cg25015733  chr2  99342986      - cg25015733     <NA>        NA   <NA>
## cg21179654  chr3 114057297      + cg21179654     <NA>        NA   <NA>
##            CpG_maf SBE_rs SBE_maf            Islands_Name
## cg07499259      NA   <NA>      NA                        
## cg26992245      NA   <NA>      NA                        
## cg09747445      NA   <NA>      NA chr15:70387929-70393206
## cg18808929      NA   <NA>      NA  chr8:61822358-61823028
## cg25015733      NA   <NA>      NA  chr2:99346882-99348177
## cg21179654      NA   <NA>      NA                        
##            Relation_to_Island
## cg07499259            OpenSea
## cg26992245            OpenSea
## cg09747445            N_Shore
## cg18808929            S_Shelf
## cg25015733            N_Shelf
## cg21179654            OpenSea
##                                           UCSC_RefGene_Name
## cg07499259                                  TNFRSF8;TNFRSF8
## cg26992245                                                 
## cg09747445                                   TLE3;TLE3;TLE3
## cg18808929                                                 
## cg25015733                                           MGAT4A
## cg21179654 ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20
##                                                                             UCSC_RefGene_Accession
## cg07499259                                                                     NM_152942;NM_001243
## cg26992245                                                                                        
## cg09747445                                                        NM_001105192;NM_020908;NM_005078
## cg18808929                                                                                        
## cg25015733                                                                               NM_012214
## cg21179654 NM_001164343;NM_001164346;NM_001164345;NM_001164342;NM_001164344;NM_001164347;NM_015642
##                                   UCSC_RefGene_Group Phantom DMR Enhancer
## cg07499259                                5'UTR;Body                     
## cg26992245                                                           TRUE
## cg09747445                            Body;Body;Body                     
## cg18808929                                                           TRUE
## cg25015733                                     5'UTR                     
## cg21179654 3'UTR;3'UTR;3'UTR;3'UTR;3'UTR;3'UTR;3'UTR                     
##                     HMM_Island Regulatory_Feature_Name
## cg07499259 1:12111023-12111225                        
## cg26992245                                            
## cg09747445                                            
## cg18808929                                            
## cg25015733                                            
## cg21179654                       3:114057192-114057775
##                   Regulatory_Feature_Group DHS     logFC     AveExpr
## cg07499259                                      3.654104  2.46652171
## cg26992245                                      4.450696 -0.09180715
## cg09747445                                     -3.337299 -0.25201484
## cg18808929                                     -2.990263  0.77522878
## cg25015733                                     -3.054336  0.83280190
## cg21179654 Unclassified_Cell_type_specific      2.859016  1.32460816
##                    t      P.Value   adj.P.Val        B
## cg07499259  18.73131 7.267204e-08 0.005067836 7.453206
## cg26992245  18.32674 8.615461e-08 0.005067836 7.359096
## cg09747445 -18.24438 8.923101e-08 0.005067836 7.339443
## cg18808929 -17.90181 1.034217e-07 0.005067836 7.255825
## cg25015733 -17.32615 1.333546e-07 0.005067836 7.108231
## cg21179654  17.27804 1.362674e-07 0.005067836 7.095476

write.table(DMPs, file="DMPs.csv", sep=",", row.names=FALSE)

Next we will plot sample-wise methylation levels for the top differentially methylated CpG site using the plotCpg function in minfi, which is a convenient way to plot the sample-wise beta values stratified by the grouping variable.

par(mfrow=c(2,2))
sapply(rownames(DMPs)[1:4], function(cpg){
  plotCpg(bVals, cpg=cpg, pheno=targets$Sample_Group, ylab = "Beta values")
})

## $cg07499259
## NULL
## 
## $cg26992245
## NULL
## 
## $cg09747445
## NULL
## 
## $cg18808929
## NULL

Differential methylation analysis of regions

We are interested in knowing whether several proximal CpGs are concordantly differentially methylated, that is, we want to identify differentially methylated regions. First, our matrix of M-values must be annotated with the relevant information about the probes such as their genomic position, gene annotation, etc. The limma pipeline is then used for differential methylation analysis to calculate moderated t-statistics.

myAnnotation <- cpg.annotate(object = mVals, datatype = "array", what = "M", 
                             analysis.type = "differential", design = design, 
                             contrasts = TRUE, cont.matrix = contMatrix, 
                             coef = "naive - rTreg", arraytype = "450K")
str(myAnnotation)

## List of 7
##  $ ID    : Factor w/ 439918 levels "cg00000029","cg00000108",..: 232388 391918 260351 19418 289954 202723 379224 278730 105471 57735 ...
##  $ stat  : num [1:439918] 0.0489 -2.0773 0.7711 -0.0304 -0.764 ...
##  $ CHR   : Factor w/ 24 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pos   : int [1:439918] 15865 534242 710097 714177 720865 758829 763119 779995 805102 805338 ...
##  $ betafc: num [1:439918] 0.00039 -0.04534 0.01594 0.00251 -0.00869 ...
##  $ indfdr: num [1:439918] 0.994 0.565 0.872 0.997 0.873 ...
##  $ is.sig: logi [1:439918] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  - attr(*, "row.names")= int [1:439918] 1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "class")= chr "annot"

Next we will use the dmrcate function to combine them to identify differentially methylated regions. The main output table DMRs$results contains all of the regions found, along with their genomic annotations and p-values.

DMRs <- dmrcate(myAnnotation, lambda=1000, C=2)
head(DMRs$results)

##                          coord no.cpgs        minfdr     Stouffer
## 457    chr17:57915665-57918682      12  4.957890e-91 6.639928e-10
## 733   chr3:114012316-114012912       5 1.622885e-180 1.515378e-07
## 469    chr17:74639731-74640078       6  9.516873e-90 1.527961e-07
## 1069    chrX:49121205-49122718       6  6.753751e-84 2.936984e-07
## 492    chr18:21452730-21453131       7 5.702319e-115 7.674943e-07
## 186  chr10:135202522-135203200       6  1.465070e-65 7.918224e-07
##       maxbetafc meanbetafc
## 457   0.3982862  0.3131611
## 733   0.5434277  0.4251622
## 469  -0.2528645 -0.1951904
## 1069  0.4529088  0.3006242
## 492  -0.3867474 -0.2546089
## 186   0.2803157  0.2293419

The regions can easily be viewed using the DMR.plot function provided in the DMRcate package. The DMRcate “DMR.plot” function allows you to quickly visualise DMRs in their genomic context. By default, the plot shows the location of the DMR in the genome, the position of any genes that are nearby, the base pair positions of the CpG probes, the methylation levels of the individual samples as a heatmap and the mean methylation levels for the various sample groups in the experiment. This plot shows the top ranked DMR identified by the DMRcate analysis.

# convert the regions to annotated genomic ranges
data(dmrcatedata)
results.ranges <- extractRanges(DMRs, genome = "hg19")

# set up the grouping variables and colors
groups <- pal[1:length(unique(targets$Sample_Group))]
names(groups) <- levels(factor(targets$Sample_Group))
cols <- groups[as.character(factor(targets$Sample_Group))]
samps <- 1:nrow(targets)

# draw the plot for the top DMR
par(mfrow=c(1,1))
DMR.plot(ranges=results.ranges, dmr=1, CpGs=bVals, phen.col=cols, what = "Beta",
         arraytype = "450K", pch=16, toscale=TRUE, plotmedians=TRUE, 
         genome="hg19", samps=samps)

Final Project