Rationale

If you’re into transcriptomics, it’s likely you’ll have come across MA plots before. If not:

In papers you’ll often see MA plots used to illustrate differential expression, as the further a gene is away from an “M” of zero, the more different it is between samples or groups.

MA plots also a good way to check for batch effects, or whether your normalisation has worked. Since we only expect a minority of genes to be differentially expressed, MA plots should generally be symmetrical, with M averaging at 0 for each abundance. MA plots also often have a “funnel” shape, the variance decreasing as the abundance gets bigger.

When carrying out quality control, MA plots are used to compare a sample to the whole dataset’s median, or to another sample. The problem: even if you have a dataset with only 10 samples, thats over 100 MA plots. If you have 40 samples? 1600 MA plots.

Another problem is, if we have wonky MA plots with a relationship between M and A, we then want to see if our attempts to fix it has worked. This we might do through:

If we want to see what effect each of these has, each time we’re multiplying the total number of potential MA plots to trawl through. When performing transcriptomic analyses for my PhD project, the total would have gone up to about half a million.

THE MADNESS ENDS HERE.

Here we use Hoeffding’s D statistic as a non-parametric measure of independence between M and A. If a sample’s D statistic is high, this means M and A have some kind of relationship (not good).

The advantage of Hoeffding’s D statistic is that unlike linear tests, it shouldn’t let bad “happy” or “sad” face MA plots, ie non-monotonic relationships, through the net. Five functions are included in this package: MAplot(), Dmedian(), Dplot(), Dall() and Dheatmap().

Set up

library(devtools)
devtools::install_github("ambermoeba/sweetD")
library(sweetD)

Dmedian and Dplot: Hoeffding’s D statistic against the median

We’re busy people or we have a lot of samples, so we’re only going to compare each sample to the median of its expression matrix, rather than calculating D for every sample-sample combination.

Let’s say we want to look at Hoeffding’s D statistic at three stages of our transcriptomics pipeline: the raw data, after batch correction, and after quantile normalisation. We have 20 samples, labelled “S1-20” in our original expression matrices.

S1 S2 S3 S4 S5
Gene1 44.18806 37.23387 45.87413 52.99353 52.27348
Gene2 24.16835 16.58718 21.36511 28.00712 30.13795
Gene3 18.58088 13.40999 21.63976 24.65555 22.77302
Gene4 31.11300 23.55571 30.41126 37.81969 38.93698
Gene5 26.68123 21.03494 26.63327 34.64319 33.07522
Result = Dmedian(expr.raw, expr.batchcorrected, expr.normalised)

This gives us a dataframe which looks like this:

head(Result, 3)
Dataset Sample D
expr.raw S14 0.3503459
expr.raw S13 0.2955352
expr.batchcorrected S14 0.1870632

We can then use Dplot() to visualise how the distribution of Dstatistics changes with normalisation. Each dot represents a sample, the pinker dots with a high D statistics corresponding to samples which may have MA plots to be concerned about.

Dplot(Result)

It looks like D is going down a lot when we normalise, which is what we want! So, does this seem to reflect actual MA plot wonkiness? Let’s check a sample with a high D statistic in the raw data.

MAplot(expr.raw, "S14")

MAplot(expr.normalised, "S14")

Normalisation certainly seems to be doing a big job here!

Dheatmap: Hoeffding’s D statistic against other samples

If we have fewer samples or feel like making the computer work hard, we can also compare each sample to every other sample in our dataset. This is especially good for identifying batch effects or outliers.

Results_all = Dall(expr.raw, expr.batchcorrected, expr.normalised)
Dataset Sample1 Sample2 D
expr.raw S11 S14 0.5311666
expr.raw S14 S11 0.5311666
expr.raw S11 S13 0.4981118

Now let’s plot a heatmap of D statistics to visualise this:

Dheatmap(Results_all)

The more pink means greater the dependence between M and A, and therefore we can take a guess that these MA plots are probably the wonky ones! For example, it looks like Sample 11 and 14 are quite at odds with eachother in the raw data. Let’s take closer look.

MAplot(expr.raw, "S11", "S14")

MAplot(expr.normalised, "S11", "S14")

Once again, quantile normalisation comes to the rescue!

To conclude…

While sweetD may not be able to replace looking carefully through all of your MA plots to make sure they look okay, I hope that this will be a good time saving tool for anyone a large transcriptomics dataset and little patience.