The t-test for CpG islands and volcano plot

In DNA methylation data analysis, the t-test is used to identify differences in DNA methylation at single CpG sites.

Now I’m going to use a moderated t-test from the R package limma to identify differentially methylated CpGs between colon cancer and normal tissue samples.

library(limma)
  1. Load data.
load("dna1.rda")
  1. Baseline model design matrix

The design matrix indicates which arrays are from cancer tissues.

design <- model.matrix(~pd$Status)
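
To see what such a design matrix looks like, here is a minimal sketch with a made-up `status` factor standing in for `pd$Status`. The level names `normal` and `cancer` and the object name `design_demo` are assumptions for illustration, not taken from `dna1.rda`.

```r
# Hypothetical sample labels standing in for pd$Status
status <- factor(c("normal", "normal", "normal", "cancer", "cancer", "cancer"),
                 levels = c("normal", "cancer"))

# One row per array: an intercept column plus a 0/1 indicator for cancer
design_demo <- model.matrix(~ status)
print(design_demo)
```

With `normal` as the first level, the second column marks the cancer arrays, so the second coefficient of a fit on this design would be the cancer-minus-normal difference. In the real data, which group is the reference depends on the level order of `pd$Status`.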
  1. Fit the baseline model

Fit a linear model for each CpG site to estimate the methylation differences and their standard errors.

fit <- lmFit(meth, design)
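
For intuition, the fit for a single site can be sketched in base R on simulated data. By default `lmFit` fits each row by least squares, so with a two-group design the second coefficient for a site is just the difference in group means. The matrix `meth_demo` and the sample labels below are made up for illustration.

```r
set.seed(1)

# Hypothetical two-group design: 4 normal and 4 cancer arrays
status <- factor(rep(c("normal", "cancer"), each = 4),
                 levels = c("normal", "cancer"))
design_demo <- model.matrix(~ status)

# Simulated methylation values: 3 CpG sites (rows) x 8 arrays (columns)
meth_demo <- matrix(runif(24), nrow = 3)

# Least-squares fit for the first site only
site1 <- lm.fit(design_demo, meth_demo[1, ])
slope <- unname(site1$coefficients[2])

# The slope equals mean(cancer) - mean(normal) for that site
diff_means <- mean(meth_demo[1, status == "cancer"]) -
  mean(meth_demo[1, status == "normal"])
print(c(slope = slope, diff_means = diff_means))
```

Running the same fit over every row is, in essence, what produces the second column of `fit$coef` used for the volcano plot.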
  1. Empirical Bayes moderation

Apply empirical Bayes smoothing to the standard errors.

eb <- ebayes(fit)
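
The moderation step can be illustrated with a small base R sketch. Each site's variance is shrunk toward a common prior variance, weighted by degrees of freedom. The numbers below are made up; limma estimates the prior variance and prior degrees of freedom from all sites, whereas here they are simply assumed.

```r
# Per-site residual variances and their degrees of freedom (made-up values)
s2 <- c(0.001, 0.010, 0.100)
df_residual <- 6

# Prior variance and prior degrees of freedom (assumed, not estimated)
s2_prior <- 0.010
df_prior <- 4

# Posterior variance: a weighted average of the site's own variance and the prior
s2_post <- (df_prior * s2_prior + df_residual * s2) / (df_prior + df_residual)
print(s2_post)
```

Very small variances are pulled up and very large ones are pulled down, which keeps sites with accidentally tiny variance from producing huge t-statistics. The moderated t-statistic uses the square root of this posterior variance in place of the site's own standard deviation.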

A volcano plot shows the effect size on the x-axis and the statistical significance on the y-axis, so CpGs with large methylation differences appear far to the left or right, while highly significant ones sit higher on the plot.

library(ggplot2)

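# Effect size: coefficient of the second design column (the Status effect)
fc <- fit$coef[,2]
# Significance on the -log10 scale, from the moderated t-test p-values
sig <- -log10(eb$p.value[,2])
df <- data.frame(fc, sig)
# TRUE when a site has a small effect and p > 0.05; these get one color, the rest the other
df$thre <- as.factor(abs(fc) < 0.4 & sig < -log10(0.05))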

ggplot(data=df, aes(x=fc, y = sig, color=thre)) +
  geom_point(alpha=.6, size=1.2) +
  theme(legend.position="none") +
  xlab("Effect size") +
  ylab("-log10 p value")

Why do these two colors match so well? I like it!