DNA methylation is an epigenetic mechanism used by cells to control gene expression. A methylated gene has a methyl group added to a CG dinucleotide of DNA, resulting in that gene being turned to the “off” position. Here we will use the Bioconductor workflow “methylationArrayAnalysis”" to analyze methylation data produced by the Illumina Infinium HumanMethylation450 array. This platform measures methylation at approximately 450,000 CpG sites and is the most common way to interrogate methylation across the human genome. Samples are treated with bisulfite conversion, which changes unmethylated cytosine to uracil. Then methylated cytosines can be distinguished from unmethylated cytosines by the base added downstream, either “G” for methylated CpG sites, or “A” for unmethylated sites. Each locus is detected in two colors (red and green), and methylation status is determined by comparing the two colors from the one position. For each CpG, there are two measurements: a methylated intensity (denoted by M) and an unmethylated intensity (denoted by U). These intensity values can be used to determine the proportion of methylation at each CpG locus. Methylation levels are commonly reported as beta values: (beta = M/(M+U)) or M-values (Mvalue=log2(M/U)). Each individual CpG probe can be tested for differential methylation for the comparisons of interest and p-values and moderated t-statistics can be generated for each CpG probe.
We will be using a small, publicly available 450k methylation dataset (Zhang et al., Blood 2013; GEO accession number GSE49667). The dataset contains 10 samples in total: there are 4 different sorted T-cell types (naive, rTreg, act_naive, act_rTreg, collected from 3 different individuals (M28, M29, M30).
The data required for this workflow has been bundled with the R package that we will install first:
source("http://bioconductor.org/workflows.R")
workflowInstall("methylationArrayAnalysis")
## package 'methylationArrayAnalysis' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Chrit\AppData\Local\Temp\RtmpuQIUI1\downloaded_packages
We need to set up a path to the data directory.
dataDirectory <- system.file("extdata", package = "methylationArrayAnalysis")
list.files(dataDirectory, recursive = TRUE)
## [1] "48639-non-specific-probes-Illumina450k.csv"
## [2] "5975827018/5975827018_R06C02_Grn.idat"
## [3] "5975827018/5975827018_R06C02_Red.idat"
## [4] "6264509100/6264509100_R01C01_Grn.idat"
## [5] "6264509100/6264509100_R01C01_Red.idat"
## [6] "6264509100/6264509100_R01C02_Grn.idat"
## [7] "6264509100/6264509100_R01C02_Red.idat"
## [8] "6264509100/6264509100_R02C01_Grn.idat"
## [9] "6264509100/6264509100_R02C01_Red.idat"
## [10] "6264509100/6264509100_R02C02_Grn.idat"
## [11] "6264509100/6264509100_R02C02_Red.idat"
## [12] "6264509100/6264509100_R03C01_Grn.idat"
## [13] "6264509100/6264509100_R03C01_Red.idat"
## [14] "6264509100/6264509100_R03C02_Grn.idat"
## [15] "6264509100/6264509100_R03C02_Red.idat"
## [16] "6264509100/6264509100_R04C01_Grn.idat"
## [17] "6264509100/6264509100_R04C01_Red.idat"
## [18] "6264509100/6264509100_R04C02_Grn.idat"
## [19] "6264509100/6264509100_R04C02_Red.idat"
## [20] "6264509100/6264509100_R05C01_Grn.idat"
## [21] "6264509100/6264509100_R05C01_Red.idat"
## [22] "6264509100/6264509100_R05C02_Grn.idat"
## [23] "6264509100/6264509100_R05C02_Red.idat"
## [24] "6264509100/6264509100_R06C01_Grn.idat"
## [25] "6264509100/6264509100_R06C01_Red.idat"
## [26] "6264509100/6264509100_R06C02_Grn.idat"
## [27] "6264509100/6264509100_R06C02_Red.idat"
## [28] "ageData.RData"
## [29] "human_c2_v5.rdata"
## [30] "model-based-cpg-islands-hg19-chr17.txt"
## [31] "SampleSheet.csv"
## [32] "wgEncodeRegDnaseClusteredV3chr17.bed"
Next we will load all of the required packages for this analysis.
library(limma)
library(minfi)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(IlluminaHumanMethylation450kmanifest)
library(RColorBrewer)
library(missMethyl)
library(matrixStats)
library(minfiData)
library(Gviz)
library(DMRcate)
library(stringr)
The IlluminaHumanMethylation450kmanifest package provides the Illumina manifest as an R object which can easily be loaded into the environment. The manifest contains all of the annotation information for each of the CpG probes on the 450k array. This is useful for determining where any differentially methylated probes are located in a genomic context.
ann450k = getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
head(ann450k)
## DataFrame with 6 rows and 33 columns
## chr pos strand Name AddressA
## <character> <integer> <character> <character> <character>
## cg00050873 chrY 9363356 - cg00050873 32735311
## cg00212031 chrY 21239348 - cg00212031 29674443
## cg00213748 chrY 8148233 - cg00213748 30703409
## cg00214611 chrY 15815688 - cg00214611 69792329
## cg00455876 chrY 9385539 - cg00455876 27653438
## cg01707559 chrY 6778695 + cg01707559 45652402
## AddressB ProbeSeqA
## <character> <character>
## cg00050873 31717405 ACAAAAAAACAACACACAACTATAATAATTTTTAAAATAAATAAACCCCA
## cg00212031 38703326 CCCAATTAACCACAAAAACTAAACAAATTATACAATCAAAAAAACATACA
## cg00213748 36767301 TTTTAACACCTAACACCATTTTAACAATAAAAATTCTACAAAAAAAAACA
## cg00214611 46723459 CTAACTTCCAAACCACACTTTATATACTAAACTACAATATAACACAAACA
## cg00455876 69732350 AACTCTAAACTACCCAACACAAACTCCAAAAACTTCTCAAAAAAAACTCA
## cg01707559 64689504 ACAAATTAAAAACACTAAAACAAACACAACAACTACAACAACAAAAAACA
## ProbeSeqB Type
## <character> <character>
## cg00050873 ACGAAAAAACAACGCACAACTATAATAATTTTTAAAATAAATAAACCCCG I
## cg00212031 CCCAATTAACCGCAAAAACTAAACAAATTATACGATCGAAAAAACGTACG I
## cg00213748 TTTTAACGCCTAACACCGTTTTAACGATAAAAATTCTACAAAAAAAAACG I
## cg00214611 CTAACTTCCGAACCGCGCTTTATATACTAAACTACAATATAACGCGAACG I
## cg00455876 AACTCTAAACTACCCGACACAAACTCCAAAAACTTCTCGAAAAAAACTCG I
## cg01707559 GCGAATTAAAAACACTAAAACGAACGCGACGACTACAACGACAAAAAACG I
## NextBase Color Probe_rs Probe_maf CpG_rs
## <character> <character> <character> <numeric> <character>
## cg00050873 A Red NA NA NA
## cg00212031 T Red NA NA NA
## cg00213748 A Red NA NA NA
## cg00214611 A Red NA NA NA
## cg00455876 A Red NA NA NA
## cg01707559 A Red NA NA NA
## CpG_maf SBE_rs SBE_maf Islands_Name
## <numeric> <character> <numeric> <character>
## cg00050873 NA NA NA chrY:9363680-9363943
## cg00212031 NA NA NA chrY:21238448-21240005
## cg00213748 NA NA NA chrY:8147877-8148210
## cg00214611 NA NA NA chrY:15815488-15815779
## cg00455876 NA NA NA chrY:9385471-9385777
## cg01707559 NA NA NA chrY:6778574-6780028
## Relation_to_Island
## <character>
## cg00050873 N_Shore
## cg00212031 Island
## cg00213748 S_Shore
## cg00214611 Island
## cg00455876 Island
## cg01707559 Island
## Forward_Sequence
## <character>
## cg00050873 TATCTCTGTCTGGCGAGGAGGCAACGCACAACTGTGGTGGTTTTTGGAGTGGGTGGACCC[CG]GCCAAGACGGCCTGGGCTGACCAGAGACGGGAGGCAGAAAAAGTGGGCAGGTGGTTGCAG
## cg00212031 CCATTGGCCCGCCCCAGTTGGCCGCAGGGACTGAGCAAGTTATGCGGTCGGGAAGACGTG[CG]TTAAAGGGCTGAAGGGGAGGGACGGAACTGACAGTCTCTGTGACAGCTCTGAGGTGGGAG
## cg00213748 TCTGTGGGACCATTTTAACGCCTGGCACCGTTTTAACGATGGAGGTTCTGCAGGAGGGGG[CG]ACCTGGGGTAGGAGGCGTGCTAGTGGTGGATGACATTGTGGCAGAGATGGAGGTGGTGGC
## cg00214611 GCGCCGGCAGGACTAGCTTCCGGGCCGCGCTTTGTGTGCTGGGCTGCAGTGTGGCGCGGG[CG]AGGAAGCTGGTAGGGCGGTTGTCGCAAGCTCCAGCTGCAGCCTCCGCCTACGTGAGAAGA
## cg00455876 CGCGTGTGCCTGGACTCTGAGCTACCCGGCACAAGCTCCAAGGGCTTCTCGGAGGAGGCT[CG]GGGACGGAAGGCGTGGGGTGAGTGGGCTGGAGATGCAGGCGCGCCCGTGGCTGTGCAGCC
## cg01707559 AGCGGCCGCTCCCAGTGGTGGTCACCGCCAGTGCCAATCCCTTGCGCCGCCGTGCAGTCC[CG]CCCTCTGTCGCTGCAGCCGCCGCGCCCGCTCCAGTGCCCCCAATTCGCGCTCGGGAGTGA
## SourceSeq Random_Loci
## <character> <character>
## cg00050873 CGGGGTCCACCCACTCCAAAAACCACCACAGTTGTGCGTTGCCTCCTCGC
## cg00212031 CGCACGTCTTCCCGACCGCATAACTTGCTCAGTCCCTGCGGCCAACTGGG
## cg00213748 CGCCCCCTCCTGCAGAACCTCCATCGTTAAAACGGTGCCAGGCGTTAAAA
## cg00214611 CGCCCGCGCCACACTGCAGCCCAGCACACAAAGCGCGGCCCGGAAGCTAG
## cg00455876 GACTCTGAGCTACCCGGCACAAGCTCCAAGGGCTTCTCGGAGGAGGCTCG
## cg01707559 CGCCCTCTGTCGCTGCAGCCGCCGCGCCCGCTCCAGTGCCCCCAATTCGC
## Methyl27_Loci UCSC_RefGene_Name UCSC_RefGene_Accession
## <character> <character> <character>
## cg00050873 TSPY4;FAM197Y2 NM_001164471;NR_001553
## cg00212031 TTTY14 NR_001543
## cg00213748
## cg00214611 TMSB4Y;TMSB4Y NM_004202;NM_004202
## cg00455876
## cg01707559 TBL1Y;TBL1Y;TBL1Y NM_134259;NM_033284;NM_134258
## UCSC_RefGene_Group Phantom DMR Enhancer
## <character> <character> <character> <character>
## cg00050873 Body;TSS1500
## cg00212031 TSS200
## cg00213748
## cg00214611 1stExon;5'UTR
## cg00455876
## cg01707559 TSS200;TSS200;TSS200
## HMM_Island Regulatory_Feature_Name
## <character> <character>
## cg00050873 Y:9973136-9976273
## cg00212031 Y:19697854-19699393
## cg00213748 Y:8207555-8208234
## cg00214611 Y:14324883-14325218 Y:15815422-15815706
## cg00455876 Y:9993394-9995882
## cg01707559 Y:6838022-6839951
## Regulatory_Feature_Group DHS
## <character> <character>
## cg00050873
## cg00212031
## cg00213748
## cg00214611 Promoter_Associated_Cell_type_specific
## cg00455876
## cg01707559
Illumina methylation data is usually obtained in the form of Intensity Data (IDAT) Files. We will import the raw methylation data into R is using the minfi function read.metharray.sheet, along with the path to the IDAT files and a sample sheet. The sample sheet is a CSV (comma-separated) file containing one line per sample, with a number of columns describing each sample. This function will also create a column called Basename which specifies the location of each individual IDAT file in the experiment.
targets <- read.metharray.sheet(dataDirectory, pattern="SampleSheet.csv")
## [1] "C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/SampleSheet.csv"
targets
## Sample_Name Sample_Well Sample_Source Sample_Group Sample_Label
## 1 1 A1 M28 naive naive
## 2 2 B1 M28 rTreg rTreg
## 3 3 C1 M28 act_naive act_naive
## 4 4 D1 M29 naive naive
## 5 5 E1 M29 act_naive act_naive
## 6 6 F1 M29 act_rTreg act_rTreg
## 7 7 G1 M30 naive naive
## 8 8 H1 M30 rTreg rTreg
## 9 9 A2 M30 act_naive act_naive
## 10 10 B2 M30 act_rTreg act_rTreg
## 11 11 H06 VICS-72098-18-B birth birth
## Pool_ID Array Slide
## 1 <NA> R01C01 6264509100
## 2 <NA> R02C01 6264509100
## 3 <NA> R03C01 6264509100
## 4 <NA> R04C01 6264509100
## 5 <NA> R05C01 6264509100
## 6 <NA> R06C01 6264509100
## 7 <NA> R01C02 6264509100
## 8 <NA> R02C02 6264509100
## 9 <NA> R03C02 6264509100
## 10 <NA> R04C02 6264509100
## 11 <NA> R06C02 5975827018
## Basename
## 1 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R01C01
## 2 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R02C01
## 3 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R03C01
## 4 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R04C01
## 5 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R05C01
## 6 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R06C01
## 7 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R01C02
## 8 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R02C02
## 9 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R03C02
## 10 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/6264509100/6264509100_R04C02
## 11 C:/Users/Chrit/Documents/R/win-library/3.4/methylationArrayAnalysis/extdata/5975827018/5975827018_R06C02
Next we will read the raw intensity signals into R from the IDAT files using the read.metharray.exp function. This creates an RGChannelSet object that contains all the raw intensity data, from both the red and green color channels, for each of the samples.
rgSet <- read.metharray.exp(targets=targets)
rgSet
## class: RGChannelSet
## dim: 622399 11
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(11): 6264509100_R01C01 6264509100_R02C01 ...
## 6264509100_R04C02 5975827018_R06C02
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
We will also rename the samples with more descriptive names.
targets$ID <- paste(targets$Sample_Group,targets$Sample_Name,sep=".")
sampleNames(rgSet) <- targets$ID
rgSet
## class: RGChannelSet
## dim: 622399 11
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(11): naive.1 rTreg.2 ... act_rTreg.10 birth.11
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
We can generate a detection p-value for every CpG in every sample, which is indicative of the quality of the signal. The method used by minfi to calculate detection p-values compares the total signal (M+U) for each probe to the background signal level, which is estimated from the negative control probes. Very small p-values are indicative of a reliable signal, while large p-values, for example >0.01, generally indicate a poor quality signal.
detP <- detectionP(rgSet)
head(detP)
## naive.1 rTreg.2 act_naive.3 naive.4 act_naive.5 act_rTreg.6
## cg00050873 0 0 0.000000e+00 0 0.00000e+00 0
## cg00212031 0 0 0.000000e+00 0 0.00000e+00 0
## cg00213748 0 0 1.181832e-12 0 8.21565e-15 0
## cg00214611 0 0 0.000000e+00 0 0.00000e+00 0
## cg00455876 0 0 0.000000e+00 0 0.00000e+00 0
## cg01707559 0 0 0.000000e+00 0 0.00000e+00 0
## naive.7 rTreg.8 act_naive.9 act_rTreg.10 birth.11
## cg00050873 0 0.000000e+00 0 0.000000e+00 0.0000000
## cg00212031 0 0.000000e+00 0 0.000000e+00 0.0000000
## cg00213748 0 1.469801e-05 0 1.365951e-08 0.6735224
## cg00214611 0 0.000000e+00 0 0.000000e+00 0.7344451
## cg00455876 0 0.000000e+00 0 0.000000e+00 0.0000000
## cg01707559 0 0.000000e+00 0 0.000000e+00 0.0000000
Next we need to examine mean detection p-values across all samples to identify any failed samples, and plot the mean detection p-value for each sample.
pal <- brewer.pal(8,"Dark2")
par(mfrow=c(1,2))
barplot(colMeans(detP), col=pal[factor(targets$Sample_Group)], las=2,
cex.names=0.8, ylim=c(0,0.002), ylab="Mean detection p-values")
abline(h=0.05,col="red")
legend("topleft", legend=levels(factor(targets$Sample_Group)), fill=pal,
bg="white")
This workflow introduced a birth sample from another study to demonstrate what a low quality sample would look like. We will use a detection p-value cutoff of >0.05, which removes the birth sample from further analysis.
#remove poor quality samples from rgSet
keep <- colMeans(detP) < 0.05
rgSet <- rgSet[,keep]
rgSet
## class: RGChannelSet
## dim: 622399 10
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
#remove poor quality samples from targets
targets <- targets[keep,]
targets[,1:5]
## Sample_Name Sample_Well Sample_Source Sample_Group Sample_Label
## 1 1 A1 M28 naive naive
## 2 2 B1 M28 rTreg rTreg
## 3 3 C1 M28 act_naive act_naive
## 4 4 D1 M29 naive naive
## 5 5 E1 M29 act_naive act_naive
## 6 6 F1 M29 act_rTreg act_rTreg
## 7 7 G1 M30 naive naive
## 8 8 H1 M30 rTreg rTreg
## 9 9 A2 M30 act_naive act_naive
## 10 10 B2 M30 act_rTreg act_rTreg
#remove poor quality samples from detection p-values
detP <- detP[,keep]
dim(detP)
## [1] 485512 10
The purpose of data normalization is to minimize the unwanted variation within and between samples. We will apply the preprocessQuantile method to our data. This function implements a stratified quantile normalisation procedure which is applied to the methylated and unmethylated signal intensities separately, and takes into account the different probe types. Note that after normalization, the data is housed in a GenomicRatioSet object. This is a much more compact representation of the data as the colour channel information has been discarded and the M and U intensity information has been converted to M-values and beta values, together with associated genomic coordinates. We can visualize what the data look like before and after normalization.
mSetSq <- preprocessQuantile(rgSet)
mSetRaw <- preprocessRaw(rgSet)
par(mfrow=c(1,2))
densityPlot(rgSet, sampGroups=targets$Sample_Group,main="Raw", legend=FALSE)
legend("top", legend = levels(factor(targets$Sample_Group)),
text.col=brewer.pal(8,"Dark2"))
densityPlot(getBeta(mSetSq), sampGroups=targets$Sample_Group,
main="Normalized", legend=FALSE)
legend("top", legend = levels(factor(targets$Sample_Group)),
text.col=brewer.pal(8,"Dark2"))
We will use multi-dimensional scaling (MDS) plots for visualising the data. MDS plots are based on principal components analysis and are an unsupervised method for looking at the similarities and differences between the various samples. Samples that are more similar to each other should cluster together, and samples that are very different should be further apart on the plot. In this dataset, the largest source of variation is between individuals. Therefore, the statistical model will have to account for individual to individual variation.
par(mfrow=c(1,2))
plotMDS(getM(mSetSq), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Group)])
legend("top", legend=levels(factor(targets$Sample_Group)), text.col=pal,
bg="white", cex=0.7)
plotMDS(getM(mSetSq), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Source)])
legend("top", legend=levels(factor(targets$Sample_Source)), text.col=pal,
bg="white", cex=0.7)
# Examine higher dimensions to look at other sources of variation
par(mfrow=c(1,3))
plotMDS(getM(mSetSq), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Group)], dim=c(1,3))
legend("top", legend=levels(factor(targets$Sample_Group)), text.col=pal,
cex=0.7, bg="white")
plotMDS(getM(mSetSq), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Group)], dim=c(2,3))
legend("topleft", legend=levels(factor(targets$Sample_Group)), text.col=pal,
cex=0.7, bg="white")
plotMDS(getM(mSetSq), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Group)], dim=c(3,4))
legend("topright", legend=levels(factor(targets$Sample_Group)), text.col=pal,
cex=0.7, bg="white")
Next we will filter out probes that have failed in one or more samples based on detection p-value.
detP <- detP[match(featureNames(mSetSq),rownames(detP)),]
keep <- rowSums(detP < 0.01) == ncol(mSetSq)
table(keep)
## keep
## FALSE TRUE
## 977 484535
mSetSqFlt <- mSetSq[keep,]
mSetSqFlt
## class: GenomicRatioSet
## dim: 484535 10
## metadata(0):
## assays(2): M CN
## rownames(484535): cg13869341 cg14008030 ... cg08265308 cg14273923
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
## predictedSex
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
## Preprocessing
## Method: Raw (no normalization or bg correction)
## minfi version: 1.22.0
## Manifest version: 0.4.0
We will also remove probes where common SNPs may affect the CpG.
mSetSqFlt <- dropLociWithSnps(mSetSqFlt)
mSetSqFlt
## class: GenomicRatioSet
## dim: 467351 10
## metadata(0):
## assays(2): M CN
## rownames(467351): cg13869341 cg14008030 ... cg08265308 cg14273923
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
## predictedSex
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
## Preprocessing
## Method: Raw (no normalization or bg correction)
## minfi version: 1.22.0
## Manifest version: 0.4.0
We also need to filter out probes that have shown to be cross-reactive, that is, probes that have been demonstrated to map to multiple places in the genome.
xReactiveProbes <- read.csv(file=paste(dataDirectory,
"48639-non-specific-probes-Illumina450k.csv",
sep="/"), stringsAsFactors=FALSE)
keep <- !(featureNames(mSetSqFlt) %in% xReactiveProbes$TargetID)
table(keep)
## keep
## FALSE TRUE
## 27433 439918
mSetSqFlt <- mSetSqFlt[keep,]
mSetSqFlt
## class: GenomicRatioSet
## dim: 439918 10
## metadata(0):
## assays(2): M CN
## rownames(439918): cg13869341 cg24669183 ... cg08265308 cg14273923
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(11): Sample_Name Sample_Well ... filenames
## predictedSex
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
## Preprocessing
## Method: Raw (no normalization or bg correction)
## minfi version: 1.22.0
## Manifest version: 0.4.0
Now that the data has been filtered and normalised, we will re-examine the MDS plots to see if the relationship between the samples has changed. It is apparent from the new MDS plots that much of the inter-individual variation has been removed as this is no longer the first principal component, likely due to the removal of the SNP-affected CpG probes.
par(mfrow=c(1,2))
plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Group)], cex=0.8)
legend("right", legend=levels(factor(targets$Sample_Group)), text.col=pal,
cex=0.65, bg="white")
plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Source)])
legend("right", legend=levels(factor(targets$Sample_Source)), text.col=pal,
cex=0.7, bg="white")
# Examine higher dimensions to look at other sources of variation
par(mfrow=c(1,3))
plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Source)], dim=c(1,3))
legend("right", legend=levels(factor(targets$Sample_Source)), text.col=pal,
cex=0.7, bg="white")
plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Source)], dim=c(2,3))
legend("topright", legend=levels(factor(targets$Sample_Source)), text.col=pal,
cex=0.7, bg="white")
plotMDS(getM(mSetSqFlt), top=1000, gene.selection="common",
col=pal[factor(targets$Sample_Source)], dim=c(3,4))
legend("right", legend=levels(factor(targets$Sample_Source)), text.col=pal,
cex=0.7, bg="white")
The next step is to calculate M-values and beta values and compare their distributions on density plots.
mVals <- getM(mSetSqFlt)
head(mVals[,1:5])
## naive.1 rTreg.2 act_naive.3 naive.4 act_naive.5
## cg13869341 2.421276 2.515948 2.165745 2.286314 2.109441
## cg24669183 2.169414 2.235964 2.280734 1.632309 2.184435
## cg15560884 1.761176 1.577578 1.597503 1.777486 1.764999
## cg01014490 -3.504268 -3.825119 -5.384735 -4.537864 -4.296526
## cg17505339 3.082191 3.924931 4.163206 3.255373 3.654134
## cg11954957 1.546401 1.912204 1.727910 2.441267 1.618331
bVals <- getBeta(mSetSqFlt)
head(bVals[,1:5])
## naive.1 rTreg.2 act_naive.3 naive.4 act_naive.5
## cg13869341 0.84267937 0.85118462 0.8177504 0.82987650 0.81186174
## cg24669183 0.81812908 0.82489238 0.8293297 0.75610281 0.81967323
## cg15560884 0.77219626 0.74903910 0.7516263 0.77417882 0.77266205
## cg01014490 0.08098986 0.06590459 0.0233755 0.04127262 0.04842397
## cg17505339 0.89439216 0.93822870 0.9471357 0.90520570 0.92641305
## cg11954957 0.74495496 0.79008516 0.7681146 0.84450764 0.75431167
par(mfrow=c(1,2))
densityPlot(bVals, sampGroups=targets$Sample_Group, main="Beta values",
legend=FALSE, xlab="Beta values")
legend("top", legend = levels(factor(targets$Sample_Group)),
text.col=brewer.pal(8,"Dark2"))
densityPlot(mVals, sampGroups=targets$Sample_Group, main="M-values",
legend=FALSE, xlab="M values")
legend("topleft", legend = levels(factor(targets$Sample_Group)),
text.col=brewer.pal(8,"Dark2"))
We are interested in pairwise comparisons between the four cell types, taking into account individual to individual variation. We perform this analysis on the matrix of M-values in limma, obtaining moderated t-statistics and associated p-values for each CpG site. Since we are performing hundreds of thousands of hypothesis tests, we need to adjust the p-values for multiple testing. We will assign a false discovery rate of 5%. This will allow us to find significantly differentially methylated CpGs between cell types.
# this is the factor of interest
cellType <- factor(targets$Sample_Group)
# this is the individual effect that we need to account for
individual <- factor(targets$Sample_Source)
# use the above to create a design matrix
design <- model.matrix(~0+cellType+individual, data=targets)
colnames(design) <- c(levels(cellType),levels(individual)[-1])
# fit the linear model
fit <- lmFit(mVals, design)
# create a contrast matrix for specific comparisons
contMatrix <- makeContrasts(naive-rTreg,
naive-act_naive,
rTreg-act_rTreg,
act_naive-act_rTreg,
levels=design)
contMatrix
## Contrasts
## Levels naive - rTreg naive - act_naive rTreg - act_rTreg
## act_naive 0 -1 0
## act_rTreg 0 0 -1
## naive 1 1 0
## rTreg -1 0 1
## M29 0 0 0
## M30 0 0 0
## Contrasts
## Levels act_naive - act_rTreg
## act_naive 1
## act_rTreg -1
## naive 0
## rTreg 0
## M29 0
## M30 0
# fit the contrasts
fit2 <- contrasts.fit(fit, contMatrix)
fit2 <- eBayes(fit2)
# look at the numbers of DM CpGs at FDR < 0.05
summary(decideTests(fit2))
## naive - rTreg naive - act_naive rTreg - act_rTreg act_naive - act_rTreg
## -1 1618 400 0 559
## 0 436897 439291 439918 438440
## 1 1403 227 0 919
Next we will extract the tables of differentially expressed CpGs for each comparison, ordered by B-statistic by default, using the topTable function in limma. The results of the analysis for the first comparison, naive vs. rTreg, can be saved as a data.frame by setting coef=1. The coef parameter explicitly refers to the column in the contrasts matrix which corresponds to the comparison of interest. The resulting data.frame can easily be written to a CSV file.
ann450kSub <- ann450k[match(rownames(mVals),ann450k$Name),
c(1:4,12:19,24:ncol(ann450k))]
DMPs <- topTable(fit2, num=Inf, coef=1, genelist=ann450kSub)
head(DMPs)
## chr pos strand Name Probe_rs Probe_maf CpG_rs
## cg07499259 chr1 12188502 + cg07499259 <NA> NA <NA>
## cg26992245 chr8 29848579 - cg26992245 <NA> NA <NA>
## cg09747445 chr15 70387268 - cg09747445 <NA> NA <NA>
## cg18808929 chr8 61825469 - cg18808929 <NA> NA <NA>
## cg25015733 chr2 99342986 - cg25015733 <NA> NA <NA>
## cg21179654 chr3 114057297 + cg21179654 <NA> NA <NA>
## CpG_maf SBE_rs SBE_maf Islands_Name
## cg07499259 NA <NA> NA
## cg26992245 NA <NA> NA
## cg09747445 NA <NA> NA chr15:70387929-70393206
## cg18808929 NA <NA> NA chr8:61822358-61823028
## cg25015733 NA <NA> NA chr2:99346882-99348177
## cg21179654 NA <NA> NA
## Relation_to_Island
## cg07499259 OpenSea
## cg26992245 OpenSea
## cg09747445 N_Shore
## cg18808929 S_Shelf
## cg25015733 N_Shelf
## cg21179654 OpenSea
## UCSC_RefGene_Name
## cg07499259 TNFRSF8;TNFRSF8
## cg26992245
## cg09747445 TLE3;TLE3;TLE3
## cg18808929
## cg25015733 MGAT4A
## cg21179654 ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20;ZBTB20
## UCSC_RefGene_Accession
## cg07499259 NM_152942;NM_001243
## cg26992245
## cg09747445 NM_001105192;NM_020908;NM_005078
## cg18808929
## cg25015733 NM_012214
## cg21179654 NM_001164343;NM_001164346;NM_001164345;NM_001164342;NM_001164344;NM_001164347;NM_015642
## UCSC_RefGene_Group Phantom DMR Enhancer
## cg07499259 5'UTR;Body
## cg26992245 TRUE
## cg09747445 Body;Body;Body
## cg18808929 TRUE
## cg25015733 5'UTR
## cg21179654 3'UTR;3'UTR;3'UTR;3'UTR;3'UTR;3'UTR;3'UTR
## HMM_Island Regulatory_Feature_Name
## cg07499259 1:12111023-12111225
## cg26992245
## cg09747445
## cg18808929
## cg25015733
## cg21179654 3:114057192-114057775
## Regulatory_Feature_Group DHS logFC AveExpr
## cg07499259 3.654104 2.46652171
## cg26992245 4.450696 -0.09180715
## cg09747445 -3.337299 -0.25201484
## cg18808929 -2.990263 0.77522878
## cg25015733 -3.054336 0.83280190
## cg21179654 Unclassified_Cell_type_specific 2.859016 1.32460816
## t P.Value adj.P.Val B
## cg07499259 18.73131 7.267204e-08 0.005067836 7.453206
## cg26992245 18.32674 8.615461e-08 0.005067836 7.359096
## cg09747445 -18.24438 8.923101e-08 0.005067836 7.339443
## cg18808929 -17.90181 1.034217e-07 0.005067836 7.255825
## cg25015733 -17.32615 1.333546e-07 0.005067836 7.108231
## cg21179654 17.27804 1.362674e-07 0.005067836 7.095476
write.table(DMPs, file="DMPs.csv", sep=",", row.names=FALSE)
Next we will plot sample-wise methylation levels for the top differentially methylated CpG site using the plotCpg function in minfi, which is a convenient way to plot the sample-wise beta values stratified by the grouping variable.
par(mfrow=c(2,2))
sapply(rownames(DMPs)[1:4], function(cpg){
plotCpg(bVals, cpg=cpg, pheno=targets$Sample_Group, ylab = "Beta values")
})
## $cg07499259
## NULL
##
## $cg26992245
## NULL
##
## $cg09747445
## NULL
##
## $cg18808929
## NULL
We are interested in knowing whether several proximal CpGs are concordantly differentially methylated, that is, we want to identify differentially methylated regions. First, our matrix of M-values must be annotated with the relevant information about the probes such as their genomic position, gene annotation, etc. The limma pipeline is then used for differential methylation analysis to calculate moderated t-statistics.
myAnnotation <- cpg.annotate(object = mVals, datatype = "array", what = "M",
analysis.type = "differential", design = design,
contrasts = TRUE, cont.matrix = contMatrix,
coef = "naive - rTreg", arraytype = "450K")
str(myAnnotation)
## List of 7
## $ ID : Factor w/ 439918 levels "cg00000029","cg00000108",..: 232388 391918 260351 19418 289954 202723 379224 278730 105471 57735 ...
## $ stat : num [1:439918] 0.0489 -2.0773 0.7711 -0.0304 -0.764 ...
## $ CHR : Factor w/ 24 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pos : int [1:439918] 15865 534242 710097 714177 720865 758829 763119 779995 805102 805338 ...
## $ betafc: num [1:439918] 0.00039 -0.04534 0.01594 0.00251 -0.00869 ...
## $ indfdr: num [1:439918] 0.994 0.565 0.872 0.997 0.873 ...
## $ is.sig: logi [1:439918] FALSE FALSE FALSE FALSE FALSE FALSE ...
## - attr(*, "row.names")= int [1:439918] 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "class")= chr "annot"
Next we will use the dmrcate function to combine them to identify differentially methylated regions. The main output table DMRs$results contains all of the regions found, along with their genomic annotations and p-values.
DMRs <- dmrcate(myAnnotation, lambda=1000, C=2)
head(DMRs$results)
## coord no.cpgs minfdr Stouffer
## 457 chr17:57915665-57918682 12 4.957890e-91 6.639928e-10
## 733 chr3:114012316-114012912 5 1.622885e-180 1.515378e-07
## 469 chr17:74639731-74640078 6 9.516873e-90 1.527961e-07
## 1069 chrX:49121205-49122718 6 6.753751e-84 2.936984e-07
## 492 chr18:21452730-21453131 7 5.702319e-115 7.674943e-07
## 186 chr10:135202522-135203200 6 1.465070e-65 7.918224e-07
## maxbetafc meanbetafc
## 457 0.3982862 0.3131611
## 733 0.5434277 0.4251622
## 469 -0.2528645 -0.1951904
## 1069 0.4529088 0.3006242
## 492 -0.3867474 -0.2546089
## 186 0.2803157 0.2293419
The regions can easily be viewed using the DMR.plot function provided in the DMRcate package. The DMRcate “DMR.plot” function allows you to quickly visualise DMRs in their genomic context. By default, the plot shows the location of the DMR in the genome, the position of any genes that are nearby, the base pair positions of the CpG probes, the methylation levels of the individual samples as a heatmap and the mean methylation levels for the various sample groups in the experiment. This plot shows the top ranked DMR identified by the DMRcate analysis.
# convert the regions to annotated genomic ranges
data(dmrcatedata)
results.ranges <- extractRanges(DMRs, genome = "hg19")
# set up the grouping variables and colors
groups <- pal[1:length(unique(targets$Sample_Group))]
names(groups) <- levels(factor(targets$Sample_Group))
cols <- groups[as.character(factor(targets$Sample_Group))]
samps <- 1:nrow(targets)
# draw the plot for the top DMR
par(mfrow=c(1,1))
DMR.plot(ranges=results.ranges, dmr=1, CpGs=bVals, phen.col=cols, what = "Beta",
arraytype = "450K", pch=16, toscale=TRUE, plotmedians=TRUE,
genome="hg19", samps=samps)