Adjusting for Batch Effect in Microarray Expression Data

November 29, 2016

Background: the central dogma of biology

The full map of our species' \(\approx\) 20K protein-coding genes was determined in 2003 with the completion of the Human Genome Project.
DNA is a double-stranded sequence of base pairs; in protein-coding regions, each codon (triplet of bases) maps to one of 20 amino acids, which chain together to form proteins.
While nearly every cell in your body contains the same genome (complete set of genes), cells differ in gene expression because of epigenetic factors.
The central dogma of biology is: DNA is transcribed into messenger RNA (mRNA), which is translated into proteins.
DNA microarrays let us compare gene expression levels for thousands of genes simultaneously: between different cell types in the same person (liver vs. skin), between the same cell type in different people (female vs. male), or across disease status (cancer vs. non-cancer).

How do microarrays work?

From a given cell we can extract the messenger RNA, convert it to complementary DNA (cDNA) by reverse transcription, and then make millions of copies using the polymerase chain reaction (PCR).
Each microarray is segmented into small squares, each containing thousands of identical oligonucleotides that hybridize (bind) to specific sequences known to map to certain genes.
By tagging the cDNA with fluorescent dyes, we can read the fluorescence intensity at each square as a measure of that gene's expression.

What are batch effects?

Hybridization intensities are routinely normalized within each microarray.
Batch effects are the artifacts (i.e. non-biological sources of variation) that remain even after normalization.
Common reasons for batch effects include:
- Quantity of chemical reagents used on plates
- Time of day when the assay is done
- Temperature in lab
Batch effects lead to either:
- Increased noise (best case scenario)
- Spurious results (worst case scenario, i.e. when the batch is correlated with the biological factor under study)
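
The two outcomes above can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the analysis from this presentation: the function names, the sample sizes, the batch-offset scale, and the |t| > 3 cutoff are all assumptions chosen for illustration. There is no true group difference in the simulated data, so every flagged gene is a false positive.

```python
import numpy as np

def t_statistics(a, b):
    """Welch t-statistic per gene (row) comparing sample columns of a vs b."""
    mean_diff = a.mean(axis=1) - b.mean(axis=1)
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return mean_diff / se

def simulate(confounded, n_genes=2000, n_per_group=20, batch_sd=1.0, seed=0):
    """Fraction of genes with |t| > 3 when there is NO true group difference."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_group
    group = np.repeat([0, 1], n_per_group)        # biological label
    if confounded:
        batch = group.copy()                      # each group run in its own batch
    else:
        batch = np.tile([0, 1], n_per_group)      # batches balanced across groups
    noise = rng.normal(size=(n_genes, n))         # pure noise, no biology
    batch_shift = rng.normal(scale=batch_sd, size=(n_genes, 1))  # per-gene offset
    expr = noise + batch_shift * batch            # only batch-1 samples are shifted
    t = t_statistics(expr[:, group == 1], expr[:, group == 0])
    return float(np.mean(np.abs(t) > 3))

frac_balanced = simulate(confounded=False)
frac_confounded = simulate(confounded=True)
print(f"balanced design:   {frac_balanced:.3f} of genes flagged")
print(f"confounded design: {frac_confounded:.3f} of genes flagged")
```

In the balanced design the batch offset only inflates the within-group variance (more noise). In the confounded design the same offset is indistinguishable from a group difference, so a large share of genes is flagged.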

Section 1: The Spielman et al. slipup

"This phenotype differs significantly between European and Asian populations…"

This article (Spielman et al. 2007) has been cited as a paper that failed to adjust for batch effects (Irizarry and Love 2015).

Modern genomics research data can be found at the NCBI's Gene Expression Omnibus (GEO).

Chart 1: Spurious findings

Section 2: How to spot a batch effect

Monte Carlo simulations: check if random sub-samples of the biological control are normally distributed.
Principal Component (PC) Analysis: see if the primary sources of supposed biological variation are randomly distributed with respect to non-biological factors.
Multidimensional scaling (MDS) and linear classification: compare the classification rules of a Support Vector Machine (SVM) with a linear kernel when the samples are labelled by biological group versus by non-biological group.
Genetic heatmaps: can hierarchical clustering separate biological from non-biological variation?
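
As one illustration of the PCA check in the list above, the sketch below projects samples onto their leading principal components and measures how much of the first component is explained by a batch label versus a biological label. The data are synthetic, and the helper names (`top_pcs`, `r_squared`), the three-batch layout, and the offset scale are assumptions made for this example only.

```python
import numpy as np

def top_pcs(expr, k=2):
    """Project samples onto the top-k principal components (expr: genes x samples)."""
    centered = expr - expr.mean(axis=1, keepdims=True)     # centre each gene
    # Rows of vt are directions in sample space, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = (s[:k, None] * vt[:k]).T                       # samples x k
    var_explained = s[:k] ** 2 / np.sum(s ** 2)
    return scores, var_explained

def r_squared(score, labels):
    """Share of variance in a PC score explained by a categorical label."""
    grand = score.mean()
    between = sum(np.sum(labels == lab) * (score[labels == lab].mean() - grand) ** 2
                  for lab in np.unique(labels))
    total = np.sum((score - grand) ** 2)
    return float(between / total)

# Synthetic example: three batches, with the biological groups balanced within each.
rng = np.random.default_rng(1)
n_genes, n_samples = 1000, 30
batch = np.repeat([0, 1, 2], 10)
group = np.tile([0, 1], 15)
expr = rng.normal(size=(n_genes, n_samples))
batch_offsets = rng.normal(scale=1.0, size=(n_genes, 3))   # gene-specific batch shifts
expr += batch_offsets[:, batch]

scores, var_explained = top_pcs(expr)
r2_batch = r_squared(scores[:, 0], batch)
r2_group = r_squared(scores[:, 0], group)
print(f"PC1 explains {var_explained[0]:.1%} of total variance")
print(f"R^2 of PC1 with batch: {r2_batch:.2f}, with biological group: {r2_group:.2f}")
```

A first component that tracks the batch label far more closely than the biological label is a warning sign that the dominant variation in the data is technical.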

Chart 2: How to spot a batch effect

Section 3: Combating batch effects with ComBat

Uses an Empirical Bayes (EB) approach to adjust for known batch effects (Johnson, Li, and Rabinovic 2006).
Robust to small batch sizes (our batches range from 2 to 23 people).
Pools information across genes with Bayesian hierarchical modelling.
Assumes that the standardized expression \(G_{b,i,g}\sim N(\gamma_{b,g},\sigma^2_{b,g})\) for batch \(b\), person \(i\), and gene \(g\).
Hierarchical priors \(\gamma_{b,g}\sim N(Y_b,\tau^2_b)\) and \(\sigma^2_{b,g}\sim IG(\lambda_b,\theta_b)\), whose hyperparameters are estimated empirically from the data.
EB adjusted data: \(G_{b,i,g}^*=\frac{\hat{\delta}_g}{\hat{\sigma}_{b,g}^*}(G_{b,i,g}-\hat{\gamma}_{b,g}^*)+\hat{\alpha}_g+X\hat{\beta}_g\) where \(\hat{\alpha}_g\) and \(\hat{\delta}_g\) are the gene-wise mean and standard deviation calculated during normalization, \(^*\) denotes the posterior estimate, and \(X\) is the design matrix of known sample covariates.
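
To make the location and scale adjustment concrete, here is a simplified sketch in NumPy. It is not the published ComBat implementation (which is distributed as `ComBat` in the R package `sva`), and it makes several loudly stated assumptions: there are no covariates (the \(X\hat{\beta}_g\) term is dropped), only the batch means are shrunk toward their prior mean in a single non-iterated step, and the batch variances are plain per-batch sample variances with no shrinkage. The function names and the simulated data are hypothetical.

```python
import numpy as np

def simple_combat(expr, batch):
    """Simplified ComBat-style adjustment (expr: genes x samples).

    Assumptions: no covariates, one-step shrinkage of batch means only,
    unshrunk batch variances, and at least 2 samples in every batch.
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)

    # Normalization: gene-wise mean and pooled within-batch standard deviation.
    gene_mean = expr.mean(axis=1, keepdims=True)
    resid = np.empty_like(expr)
    for b in labels:
        cols = batch == b
        resid[:, cols] = expr[:, cols] - expr[:, cols].mean(axis=1, keepdims=True)
    gene_sd = np.sqrt((resid ** 2).mean(axis=1, keepdims=True))
    z = (expr - gene_mean) / gene_sd                  # standardized expression

    adjusted = np.empty_like(expr)
    for b in labels:
        cols = batch == b
        n_b = cols.sum()
        gamma_hat = z[:, cols].mean(axis=1)           # per-gene batch mean
        var_hat = z[:, cols].var(axis=1, ddof=1)      # per-gene batch variance
        prior_mean = gamma_hat.mean()                 # empirical prior mean
        prior_var = gamma_hat.var(ddof=1)             # empirical prior variance
        # Normal-normal posterior mean: shrinks noisy gene-level estimates
        # toward the batch-wide average, more strongly for small batches.
        gamma_star = ((n_b * prior_var * gamma_hat + var_hat * prior_mean)
                      / (n_b * prior_var + var_hat))
        z_adj = (z[:, cols] - gamma_star[:, None]) / np.sqrt(var_hat)[:, None]
        adjusted[:, cols] = gene_sd * z_adj + gene_mean   # back to original scale
    return adjusted

def batch_mean_gap(expr, batch):
    """Average over genes of (largest batch mean - smallest batch mean)."""
    means = np.stack([expr[:, batch == b].mean(axis=1) for b in np.unique(batch)],
                     axis=1)
    return float(np.mean(means.max(axis=1) - means.min(axis=1)))

# Synthetic data: three unequal batches with different shifts and noise scales.
rng = np.random.default_rng(42)
batch = np.repeat([0, 1, 2], [5, 10, 15])
n_genes = 500
base = rng.normal(8.0, 1.0, size=(n_genes, 1))
shift = np.array([0.0, 1.5, -1.0]) + rng.normal(scale=0.5, size=(n_genes, 3))
scale = np.array([1.0, 1.5, 0.7])
expr = base + shift[:, batch] + scale[batch] * rng.normal(size=(n_genes, batch.size))

adjusted = simple_combat(expr, batch)
gap_before = batch_mean_gap(expr, batch)
gap_after = batch_mean_gap(adjusted, batch)
print(f"mean gap between batch means: before {gap_before:.2f}, after {gap_after:.2f}")
```

Because the batch means are shrunk rather than subtracted outright, a small residual gap remains after adjustment, and it is largest for the smallest batch. That is the intended trade-off: gene-level estimates from very small batches are noisy, so they borrow strength from the other genes in the same batch.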

After adjusting our data we see that the statistically significant differences between ethnicities are gone.
Note: while ComBat is considered one of the best adjustment methods (Chen et al. 2011), it will inevitably lose some biological information.

References and Replication

Code to replicate this presentation can be found here.

Chen, Chao, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, and Chunyu Liu. 2011. “Removing Batch Effects in Analysis of Expression Microarray Data: An Evaluation of Six Batch Adjustment Methods.” PLoS ONE 6 (2). doi:10.1371/journal.pone.0017238.

Irizarry, Rafael, and Michael Love. 2015. Data Analysis for the Life Sciences. Leanpub.

Johnson, W. E., C. Li, and A. Rabinovic. 2006. “Adjusting Batch Effects in Microarray Expression Data Using Empirical Bayes Methods.” Biostatistics 8 (1): 118–27. doi:10.1093/biostatistics/kxj037.

Spielman, Richard S, Laurel A Bastone, Joshua T Burdick, Michael Morley, Warren J Ewens, and Vivian G Cheung. 2007. “Common Genetic Variants Account for Differences in Gene Expression Among Ethnic Groups.” Nature Genetics 39 (2): 226–31. doi:10.1038/ng1955.