sweetD: Hoeffding’s D statistic as a Summary Statistic for MA plots

Rationale

If you’re into transcriptomics, it’s likely you’ll have come across MA plots before. If not:

Each dot represents a gene (or other feature)
The y axis “M” represents the log fold difference between two samples
The x axis “A” represents the average abundance (intensity or counts) of a gene

In papers you’ll often see MA plots used to illustrate differential expression, as the further a gene is away from an “M” of zero, the more different it is between samples or groups.

MA plots also a good way to check for batch effects, or whether your normalisation has worked. Since we only expect a minority of genes to be differentially expressed, MA plots should generally be symmetrical, with M averaging at 0 for each abundance. MA plots also often have a “funnel” shape, the variance decreasing as the abundance gets bigger.

When carrying out quality control, MA plots are used to compare a sample to the whole dataset’s median, or to another sample. The problem: even if you have a dataset with only 10 samples, thats over 100 MA plots. If you have 40 samples? 1600 MA plots.

Another problem is, if we have wonky MA plots with a relationship between M and A, we then want to see if our attempts to fix it has worked. This we might do through:

Normalisation e.g. quantile normalisation
Batch correction e.g. using ComBAT to integrate datasets
Exclusion of outlying samples

If we want to see what effect each of these has, each time we’re multiplying the total number of potential MA plots to trawl through. When performing transcriptomic analyses for my PhD project, the total would have gone up to about half a million.

THE MADNESS ENDS HERE.

Here we use Hoeffding’s D statistic as a non-parametric measure of independence between M and A. If a sample’s D statistic is high, this means M and A have some kind of relationship (not good).

The advantage of Hoeffding’s D statistic is that unlike linear tests, it shouldn’t let bad “happy” or “sad” face MA plots, ie non-monotonic relationships, through the net. Five functions are included in this package: MAplot(), Dmedian(), Dplot(), Dall() and Dheatmap().

Set up

library(devtools)
devtools::install_github("ambermoeba/sweetD")
library(sweetD)

Dmedian and Dplot: Hoeffding’s D statistic against the median

We’re busy people or we have a lot of samples, so we’re only going to compare each sample to the median of its expression matrix, rather than calculating D for every sample-sample combination.

Let’s say we want to look at Hoeffding’s D statistic at three stages of our transcriptomics pipeline: the raw data, after batch correction, and after quantile normalisation. We have 20 samples, labelled “S1-20” in our original expression matrices.

	S1	S2	S3	S4	S5
Gene1	44.18806	37.23387	45.87413	52.99353	52.27348
Gene2	24.16835	16.58718	21.36511	28.00712	30.13795
Gene3	18.58088	13.40999	21.63976	24.65555	22.77302
Gene4	31.11300	23.55571	30.41126	37.81969	38.93698
Gene5	26.68123	21.03494	26.63327	34.64319	33.07522

Result = Dmedian(expr.raw, expr.batchcorrected, expr.normalised)

This gives us a dataframe which looks like this:

head(Result, 3)

Dataset	Sample	D
expr.raw	S14	0.3503459
expr.raw	S13	0.2955352
expr.batchcorrected	S14	0.1870632

We can then use Dplot() to visualise how the distribution of Dstatistics changes with normalisation. Each dot represents a sample, the pinker dots with a high D statistics corresponding to samples which may have MA plots to be concerned about.

Dplot(Result)

It looks like D is going down a lot when we normalise, which is what we want! So, does this seem to reflect actual MA plot wonkiness? Let’s check a sample with a high D statistic in the raw data.

MAplot(expr.raw, "S14")

MAplot(expr.normalised, "S14")

Normalisation certainly seems to be doing a big job here!

Dheatmap: Hoeffding’s D statistic against other samples

If we have fewer samples or feel like making the computer work hard, we can also compare each sample to every other sample in our dataset. This is especially good for identifying batch effects or outliers.

Results_all = Dall(expr.raw, expr.batchcorrected, expr.normalised)

Dataset	Sample1	Sample2	D
expr.raw	S11	S14	0.5311666
expr.raw	S14	S11	0.5311666
expr.raw	S11	S13	0.4981118

Now let’s plot a heatmap of D statistics to visualise this:

Dheatmap(Results_all)

The more pink means greater the dependence between M and A, and therefore we can take a guess that these MA plots are probably the wonky ones! For example, it looks like Sample 11 and 14 are quite at odds with eachother in the raw data. Let’s take closer look.

MAplot(expr.raw, "S11", "S14")

MAplot(expr.normalised, "S11", "S14")

Once again, quantile normalisation comes to the rescue!

To conclude…

While sweetD may not be able to replace looking carefully through all of your MA plots to make sure they look okay, I hope that this will be a good time saving tool for anyone a large transcriptomics dataset and little patience.