Rebecca Johnston 05/03/2014
This seminar is inspired by a STAT 540 project in 2013: Analysis of Gene Expression Omnibus Leukaemia Data from the Illumina HumanMethylation450 Array by Alice Zhu, Rachel Edgar, Shaun Jackman and Nick Fishbane. See their project website here.
The Illumina HumanMethylation450 Array is a microarray based high throughput platform for methylation profiling on 450,000 pre-selected probes for CpGs across the human genome. A great number of datasets have been made publicly available through Gene Expression Omnibus (GEO). Each dataset in GEO has a unique GSE ID that we can use to retrieve it from GEO. There are several Bioconductor packages that are very useful for obtaining data from GEO and microarray analysis. If you haven't done so yet, install and load these packages.
source("http://bioconductor.org/biocLite.R")
library(GEOquery)
library(wateRmelon)
library(IlluminaHumanMethylation450k.db)
library(ggplot2)
library(reshape2)
library(limma)
library(gplots)
In this seminar, we are going to perform differential methylation analysis for Acute lymphoblastic leukemia (ALL) cases and healthy B cells as control group. The datasets we are going to use are: GSE39141 which contains 29 ALL bone marrow samples and 4 healthy B cell samples, and to make up for the small sample size for healthy cells, we will also use GSE42865 which contains 9 healthy B cell samples and 7 samples with other conditions.
First, let's retrieve our datasets from GEO with getGEO from GEOquery package. Warning: this may take several minutes! So to avoid re-downloading in the future, save the data once you get it into a good shape.
As you can see, the datasets we got are large lists including Beta values as well as metadata. We can extract this information separately. Then we can isolate a subset of GSE42865 containing only the healthy cells.
if (file.exists("methyl_ALL.Rdata")) {
# if previously downloaded
load("methyl_ALL.Rdata")
} else {
# if downloading for the first time
GSE39141 <- getGEO("GSE39141")
show(GSE39141) ## 33 samples (29 ALL and 4 healthy B cells)
GSE42865 <- getGEO("GSE42865") # took ~2 mins for JB
show(GSE42865) ## 16 samples (9 healthy cells B cells and 7 other cells)
# Extract expression matrices (turn into data frames at once)
ALL.dat <- as.data.frame(exprs(GSE39141[[1]]))
CTRL.dat <- as.data.frame(exprs(GSE42865[[1]]))
# Obtain the meta-data for the samples and rename them perhaps?
ALL.meta <- pData(phenoData(GSE39141[[1]]))
CTRL.meta <- pData(phenoData(GSE42865[[1]]))
# create some labels
ALL.meta$Group <- c(rep("ALL", 29), rep("HBC", 4))
## ALL: Case; HBC: Healthy B Cells
# Subset both meta-data and data for control (healthy) donors
CTRL.meta <- droplevels(subset(CTRL.meta, grepl("Healthy donor", characteristics_ch1.1)))
CTRL.dat <- subset(CTRL.dat, select = as.character(CTRL.meta$geo_accession))
# Rename variables
names(ALL.dat) <- paste(ALL.meta$Group, gsub("GSM", "", names(ALL.dat)),
sep = "_")
names(CTRL.dat) <- paste("HBC", gsub("GSM", "", names(CTRL.dat)), sep = "_")
# save the data to avoid future re-downloading
save(ALL.dat, CTRL.dat, ALL.meta, CTRL.meta, file = "methyl_ALL.Rdata")
}
Now, we can do some exploratory analysis of the data, for example, looking at distribution of Beta values in each sample or each probe. Here is a density plot of average Beta values for probes in the two datasets. See if you get similar results.
Hint: you can use
rowMeansfunction to calculate the average Beta value for each probe. Don't forget to usena.rm = Tto exclude NAs.
str(ALL.dat, max.level = 0)
## 'data.frame': 485577 obs. of 33 variables:
str(ALL.meta, max.level = 0)
## 'data.frame': 33 obs. of 39 variables:
str(CTRL.dat, max.level = 0)
## 'data.frame': 485577 obs. of 9 variables:
str(CTRL.meta, max.level = 0)
## 'data.frame': 9 obs. of 34 variables:
dat.probeMeans <- c(rowMeans(ALL.dat, na.rm = TRUE), rowMeans(CTRL.dat, na.rm = TRUE))
head(dat.probeMeans)
## cg00000029 cg00000108 cg00000109 cg00000165 cg00000236 cg00000289
## 0.5689 0.9170 0.8152 0.2300 0.8023 0.5863
plotDat <- data.frame(Beta = dat.probeMeans, Dataset = rep(c("ALL", "CTRL"),
each = nrow(ALL.dat)))
head(plotDat)
## Beta Dataset
## 1 0.5689 ALL
## 2 0.9170 ALL
## 3 0.8152 ALL
## 4 0.2300 ALL
## 5 0.8023 ALL
## 6 0.5863 ALL
tail(plotDat)
## Beta Dataset
## 971149 0.5965 CTRL
## 971150 0.6144 CTRL
## 971151 0.6151 CTRL
## 971152 0.4495 CTRL
## 971153 0.2907 CTRL
## 971154 0.7265 CTRL
nrow(plotDat)
## [1] 971154
(probeAvg <- ggplot(data = plotDat, aes(x = Beta, col = Dataset)) + geom_density() +
ggtitle("Average Beta value density of two experiments") + xlab("Beta") +
ylab("Density"))
Remember the “ALL” dataset includes 4 control samples. Nevertheless the distribution of beta values is different between the “ALL” and “CTRL”“ experiments.
You can see the distribution of Beta values are not the same between different experiments. That is why it is important to normalize the data before we proceed. The Bioconductor package wateRmelon offers 15 methods for 450K methylation array normalization. If you are interested in comparing these methods, check out this paper. For the sake of convenience, we will just use the simplest method, betaqn, which performs quantile normalization on the beta values.
# combine data from two experiments into one matrix, each column represents
# beta values of one sample
beta.matrix <- as.matrix(cbind(ALL.dat, CTRL.dat))
str(beta.matrix, max.level = 0)
## num [1:485577, 1:42] 0.512 0.911 0.857 0.149 0.729 ...
## - attr(*, "dimnames")=List of 2
# quantile normalization
system.time(beta.norm <- betaqn(beta.matrix))
## user system elapsed
## 54.01 3.70 57.92
As you can see, after normalization, beta values from the different experiments have more similar distributions. However, this distribution is on the interval [0, 1], with two modes near the two extremes. This type of distribution is not very compatible with the typical assumptions of linear models. A common solution is to apply a logit transformation on the data to convert it to a continuous variable that spans \((-\infty, \infty)\), i.e. compute the M value. The beta2m() function in the wateRmelon package does this conversion.
\[M = log({Beta\over 1-Beta})\]
M.norm <- beta2m(beta.norm)
Now we can go ahead with differential methylation analysis – cases vs. controls or leukemia vs. healthy. You can perform this analysis on each probe, but in this seminar, we will aggregate the probes into CpG Islands (CGIs), i.e. the CpG dense regions of the genome, and then detect differentially methylated CGIs. The biological function of CGIs is better studied, so the interpretation of our results will be easier if we focus on CGIs. Conveniently, the Bioconductor package IlluminaHumanMethylation450k.db provides all sorts of annotation information for the 450K methylation array including the association of probes to CGIs. We will extract this information and use the mean of M values for all probes in a specific CGI to represent its methylation level. Check if you get the same boxplot for CGI M values.
# Extracting probe ID to CpG islands association
cginame <- as.data.frame(IlluminaHumanMethylation450kCPGINAME)
names(cginame) <- c("Probe_ID", "cginame")
rownames(cginame) <- cginame$Probe_ID
length(levels(factor(cginame$cginame))) # No. of CGIs
## [1] 27176
# restrict probes to those within CGIs
beta.inCGI <- beta.norm[cginame$Probe_ID, ]
M.inCGI <- M.norm[cginame$Probe_ID, ]
nrow(M.inCGI) # No. of probes within CGIs
## [1] 309465
# aggregate probes to CGIs
beta.CGI <- aggregate(beta.inCGI, by = list(cginame$cginame), mean, na.rm = TRUE)
rownames(beta.CGI) <- beta.CGI[, "Group.1"]
beta.CGI <- subset(beta.CGI, select = -Group.1)
str(beta.CGI, max.level = 0)
## 'data.frame': 27176 obs. of 42 variables:
M.CGI <- aggregate(M.inCGI, by = list(cginame$cginame), mean, na.rm = TRUE)
rownames(M.CGI) <- M.CGI[, "Group.1"]
M.CGI <- subset(M.CGI, select = -Group.1)
str(M.CGI, max.level = 0)
## 'data.frame': 27176 obs. of 42 variables:
## Sample CGI M
## 1 ALL_956761 chr1:10003165-10003585 -3.386
## 2 ALL_956762 chr1:10003165-10003585 -3.233
## 3 ALL_956763 chr1:10003165-10003585 -3.351
## 4 ALL_956764 chr1:10003165-10003585 -3.151
## 5 ALL_956765 chr1:10003165-10003585 -3.136
## 6 ALL_956766 chr1:10003165-10003585 -3.262
## Sample CGI M Group
## 1 ALL_956761 chr1:10003165-10003585 -3.386 ALL
## 2 ALL_956762 chr1:10003165-10003585 -3.233 ALL
## 3 ALL_956763 chr1:10003165-10003585 -3.351 ALL
## 4 ALL_956764 chr1:10003165-10003585 -3.151 ALL
## 5 ALL_956765 chr1:10003165-10003585 -3.136 ALL
## 6 ALL_956766 chr1:10003165-10003585 -3.262 ALL
Next, we can use a linear model to identify differentially methylated CGIs with limma:
design <- data.frame(Group = relevel(factor(gsub("_[0-9]+", "", colnames(M.CGI))),
ref = "HBC"), row.names = colnames(M.CGI))
str(design)
## 'data.frame': 42 obs. of 1 variable:
## $ Group: Factor w/ 2 levels "HBC","ALL": 2 2 2 2 2 2 2 2 2 2 ...
(DesMat <- model.matrix(~Group, design))
## (Intercept) GroupALL
## ALL_956761 1 1
## ALL_956762 1 1
## ALL_956763 1 1
## ALL_956764 1 1
## ALL_956765 1 1
## ALL_956766 1 1
## ALL_956767 1 1
## ALL_956768 1 1
## ALL_956769 1 1
## ALL_956770 1 1
## ALL_956771 1 1
## ALL_956772 1 1
## ALL_956773 1 1
## ALL_956774 1 1
## ALL_956775 1 1
## ALL_956776 1 1
## ALL_956777 1 1
## ALL_956778 1 1
## ALL_956779 1 1
## ALL_956780 1 1
## ALL_956781 1 1
## ALL_956782 1 1
## ALL_956783 1 1
## ALL_956784 1 1
## ALL_956785 1 1
## ALL_956786 1 1
## ALL_956787 1 1
## ALL_956788 1 1
## ALL_956789 1 1
## HBC_956790 1 0
## HBC_956791 1 0
## HBC_956792 1 0
## HBC_956793 1 0
## HBC_1052420 1 0
## HBC_1052421 1 0
## HBC_1052422 1 0
## HBC_1052423 1 0
## HBC_1052424 1 0
## HBC_1052425 1 0
## HBC_1052426 1 0
## HBC_1052427 1 0
## HBC_1052428 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$Group
## [1] "contr.treatment"
DMRfit <- lmFit(M.CGI, DesMat)
DMRfitEb <- eBayes(DMRfit)
cutoff <- 0.01
DMR <- topTable(DMRfitEb, coef = "GroupALL", number = Inf, p.value = cutoff)
head(DMR) # top hits
## logFC AveExpr t P.Value adj.P.Val B
## chr19:49828412-49828668 1.3084 2.283 12.05 2.798e-15 7.604e-11 24.31
## chr4:156680095-156681386 -1.1115 -2.521 -10.95 6.245e-14 8.485e-10 21.39
## chr20:11871374-11872207 -1.2639 -2.284 -10.49 2.370e-13 1.803e-09 20.13
## chr19:33726654-33726946 0.9429 1.887 10.39 3.172e-13 1.803e-09 19.85
## chr18:77166704-77167043 0.8104 3.199 10.36 3.429e-13 1.803e-09 19.78
## chr18:46447718-46448083 -0.8990 2.034 -10.31 3.980e-13 1.803e-09 19.64
nrow(DMR)
## [1] 4115
So using a cutoff of FDR = 0.01, we identified 4115 CGIs that are differentially methylated between ALL and control group. Now we can make some plots to check these hits.
First, heatmap of beta values of top 100 hits:
DMR100 <- topTable(DMRfitEb, coef = "GroupALL", number = 100)
DMR.CGI <- t(as.matrix(subset(beta.CGI, rownames(beta.CGI) %in% rownames(DMR100))))
str(DMR.CGI, max.level = 0)
## num [1:42, 1:100] 0.707 0.688 0.69 0.728 0.692 ...
## - attr(*, "dimnames")=List of 2
col <- c(rep("darkgoldenrod1", times = nrow(DMR.CGI)))
col[grepl("HBC", rownames(DMR.CGI))] <- "forestgreen"
op <- par(mai = rep(0.5, 4))
heatmap.2(DMR.CGI, col = redblue(256), RowSideColors = col, density.info = "none",
trace = "none", Rowv = TRUE, Colv = TRUE, labCol = FALSE, labRow = FALSE,
dendrogram = "row")
legend("topright", c("ALL", "HBC"), col = c("darkgoldenrod1", "forestgreen"),
pch = 15)
par(op)
The differences between the top differentially methylated CGIs for ALL and CTRL/HBC are not that striking?
Next, stripplot of beta values of probes within top 5 CGI hits:
DMR5 <- topTable(DMRfitEb, coef = "GroupALL", number = 5)
beta.DMR5probe <- beta.inCGI[cginame[rownames(beta.inCGI), ]$cginame %in% rownames(DMR5),
]
beta.DMR5probe.tall <- melt(beta.DMR5probe, value.name = "M", varnames = c("Probe_ID",
"Sample"))
beta.DMR5probe.tall$Group <- factor(gsub("_[0-9]+", "", beta.DMR5probe.tall$Sample))
beta.DMR5probe.tall$CGI <- factor(cginame[as.character(beta.DMR5probe.tall$Probe_ID),
]$cginame)
(beta.DMR5.stripplot <- ggplot(data = beta.DMR5probe.tall, aes(x = Group, y = M,
color = Group)) + geom_point(position = position_jitter(width = 0.05), na.rm = TRUE) +
stat_summary(fun.y = mean, aes(group = 1), geom = "line", color = "black") +
facet_grid(. ~ CGI) + ggtitle("Probe beta values within top 5 DM CGIs") +
xlab("Group") + ylab("beta") + facet_wrap(~CGI))
Again, these results for top differentially methylated probes between ALL and CTRL/HBC don't look too convincing?
Finally, plot location of differentially methylated probes along each chromosome:
# get the length of chromosome 1-22 and X
chrlen <- unlist(as.list(IlluminaHumanMethylation450kCHRLENGTHS)[c(as.character(1:22),
"X")])
chrlen <- data.frame(chr = factor(names(chrlen)), length = chrlen)
chr <- IlluminaHumanMethylation450kCHR # get the chromosome of each probe
# get the probe identifiers that are mapped to chromosome
chr <- unlist(as.list(chr[mappedkeys(chr)]))
# get chromosome coordinate of each probe
coord <- IlluminaHumanMethylation450kCPGCOORDINATE
# get the probe identifiers that are mapped to coordinate
coord <- unlist(as.list(coord[mappedkeys(coord)]))
coord <- data.frame(chr = chr[intersect(names(chr), names(coord))], coord = coord[intersect(names(chr),
names(coord))])
# coordinates of probes in DM CGIs
coordDMRprobe <- droplevels(na.omit(coord[cginame[cginame$cginame %in% rownames(DMR),
]$Probe_ID, ]))
(coord.plot <- ggplot(data = coordDMRprobe) + geom_linerange(aes(factor(chr,
levels = c("X", as.character(22:1))), ymin = 0, ymax = length), data = chrlen,
alpha = 0.5) + geom_point(aes(x = factor(chr, levels = c("X", as.character(22:1))),
y = coord), position = position_jitter(width = 0.03), na.rm = T) + ggtitle("DMR positions on chromosomes") +
ylab("Position of DMRs") + xlab("chr") + coord_flip())
The next step in the analysis pipeline would be to interpret our result by associating these differentially methylated regions (DMRs) with biological features. DNA methylation in different genomic regions, i.e. promoter regions, enhancers, gene body etc., has different impact on gene transcription. A common practice in DMR analysis is to separate DMRs associated with different types of genomic regions and study them separately. Our key interest is to see how these DMRs affect gene expression level and consequently biological function. For that purpose, after we separate DMRs into different genomic regions, we will associate them with genes and then biological functions. There are many ways to do this. The IlluminaHumanMethylation450k.db package itself contains some annotation information, including associating probes with different types of genomic regions and gene IDs. You can obtain this information similarly to how we extracted chromosome coordinates above. Here is a full list of all objects available in this package. Their names are self-explanatory, but if want more detailed description, check the package manual document.
ls("package:IlluminaHumanMethylation450k.db")
## [1] "IlluminaHumanMethylation450k"
## [2] "IlluminaHumanMethylation450k_dbconn"
## [3] "IlluminaHumanMethylation450k_dbfile"
## [4] "IlluminaHumanMethylation450k_dbInfo"
## [5] "IlluminaHumanMethylation450k_dbschema"
## [6] "IlluminaHumanMethylation450k_get27k"
## [7] "IlluminaHumanMethylation450k_getControls"
## [8] "IlluminaHumanMethylation450k_getProbeOrdering"
## [9] "IlluminaHumanMethylation450k_getProbes"
## [10] "IlluminaHumanMethylation450kACCNUM"
## [11] "IlluminaHumanMethylation450kALIAS2PROBE"
## [12] "IlluminaHumanMethylation450kBLAME"
## [13] "IlluminaHumanMethylation450kBUILD"
## [14] "IlluminaHumanMethylation450kCHR"
## [15] "IlluminaHumanMethylation450kCHR36"
## [16] "IlluminaHumanMethylation450kCHR37"
## [17] "IlluminaHumanMethylation450kCHRLENGTHS"
## [18] "IlluminaHumanMethylation450kCHRLOC"
## [19] "IlluminaHumanMethylation450kCHRLOCEND"
## [20] "IlluminaHumanMethylation450kCOLORCHANNEL"
## [21] "IlluminaHumanMethylation450kCPG36"
## [22] "IlluminaHumanMethylation450kCPG37"
## [23] "IlluminaHumanMethylation450kCPGCOORDINATE"
## [24] "IlluminaHumanMethylation450kCPGILOCATION"
## [25] "IlluminaHumanMethylation450kCPGINAME"
## [26] "IlluminaHumanMethylation450kCPGIRELATION"
## [27] "IlluminaHumanMethylation450kCPGS"
## [28] "IlluminaHumanMethylation450kDESIGN"
## [29] "IlluminaHumanMethylation450kDHS"
## [30] "IlluminaHumanMethylation450kDMR"
## [31] "IlluminaHumanMethylation450kENHANCER"
## [32] "IlluminaHumanMethylation450kENSEMBL"
## [33] "IlluminaHumanMethylation450kENSEMBL2PROBE"
## [34] "IlluminaHumanMethylation450kENTREZID"
## [35] "IlluminaHumanMethylation450kENZYME"
## [36] "IlluminaHumanMethylation450kENZYME2PROBE"
## [37] "IlluminaHumanMethylation450kFANTOM"
## [38] "IlluminaHumanMethylation450kGENEBODY"
## [39] "IlluminaHumanMethylation450kGENENAME"
## [40] "IlluminaHumanMethylation450kGO"
## [41] "IlluminaHumanMethylation450kGO2ALLPROBES"
## [42] "IlluminaHumanMethylation450kGO2PROBE"
## [43] "IlluminaHumanMethylation450kISCPGISLAND"
## [44] "IlluminaHumanMethylation450kMAP"
## [45] "IlluminaHumanMethylation450kMAPCOUNTS"
## [46] "IlluminaHumanMethylation450kMETHYL27"
## [47] "IlluminaHumanMethylation450kNUID"
## [48] "IlluminaHumanMethylation450kOMIM"
## [49] "IlluminaHumanMethylation450kORGANISM"
## [50] "IlluminaHumanMethylation450kORGPKG"
## [51] "IlluminaHumanMethylation450kPATH"
## [52] "IlluminaHumanMethylation450kPATH2PROBE"
## [53] "IlluminaHumanMethylation450kPFAM"
## [54] "IlluminaHumanMethylation450kPMID"
## [55] "IlluminaHumanMethylation450kPMID2PROBE"
## [56] "IlluminaHumanMethylation450kPROBELOCATION"
## [57] "IlluminaHumanMethylation450kPROSITE"
## [58] "IlluminaHumanMethylation450kRANDOM"
## [59] "IlluminaHumanMethylation450kREFSEQ"
## [60] "IlluminaHumanMethylation450kREGULATORYGROUP"
## [61] "IlluminaHumanMethylation450kREGULATORYLOCATION"
## [62] "IlluminaHumanMethylation450kREGULATORYNAME"
## [63] "IlluminaHumanMethylation450kSVNID"
## [64] "IlluminaHumanMethylation450kSYMBOL"
## [65] "IlluminaHumanMethylation450kUNIGENE"
## [66] "IlluminaHumanMethylation450kUNIPROT"
There are a lot of resources that provide association between genes and biological function. Gene Ontology (GO) is a popular database that provides a controlled vocabulary of terms for describing gene product characteristics and functions including molecular function, biological process and cellular component. Also, there are databases like KEGG and PANTHER that provide association of genes with pathways for pathway analysis, and InterPro and Pfam etc. that provide protein domain analysis. You might have noticed from the list above that IlluminaHumanMethylation450k.db also provides some of these association analyses like GO, but you would have more control and flexibility if you use other tools more specialized for functional enrichment analysis. There are two online tools that I found very helpful, both extremely easy to use. DAVID takes a list of genes and outputs annotation and enrichment analysis from all sorts of databases including all the ones we discussed above. The results are available for downloading in tab delimited text format. Another more recent tool is GREAT. It was originally developed to associate regulatory regions with genes and biological functions and the required input is a list of coordinates. It also performs functional enrichment analysis from different databases and it can make some fancy plots for output.
The interpretation of differential methylation analysis results is a very flexible and most often context-specific process. Functional enrichment analysis is just one popular way to approach this but there is really no gold-standard pipeline we can follow. Although we won't be able to go into it in this seminar, you should understand that identifying DMRs is really the easy part and digging up the biology behind it and make a story of it would be something you need to spend a lot of time considering.