If you’re into transcriptomics, it’s likely you’ll have come across MA plots before. If not:
In papers you’ll often see MA plots used to illustrate differential expression, as the further a gene is away from an “M” of zero, the more different it is between samples or groups.
MA plots also a good way to check for batch effects, or whether your normalisation has worked. Since we only expect a minority of genes to be differentially expressed, MA plots should generally be symmetrical, with M averaging at 0 for each abundance. MA plots also often have a “funnel” shape, the variance decreasing as the abundance gets bigger.
When carrying out quality control, MA plots are used to compare a sample to the whole dataset’s median, or to another sample. The problem: even if you have a dataset with only 10 samples, thats over 100 MA plots. If you have 40 samples? 1600 MA plots.
Another problem is, if we have wonky MA plots with a relationship between M and A, we then want to see if our attempts to fix it has worked. This we might do through:
If we want to see what effect each of these has, each time we’re multiplying the total number of potential MA plots to trawl through. When performing transcriptomic analyses for my PhD project, the total would have gone up to about half a million.
THE MADNESS ENDS HERE.
Here we use Hoeffding’s D statistic as a non-parametric measure of independence between M and A. If a sample’s D statistic is high, this means M and A have some kind of relationship (not good).
The advantage of Hoeffding’s D statistic is that unlike linear tests, it shouldn’t let bad “happy” or “sad” face MA plots, ie non-monotonic relationships, through the net. Five functions are included in this package: MAplot(), Dmedian(), Dplot(), Dall() and Dheatmap().
library(devtools)
devtools::install_github("ambermoeba/sweetD")
library(sweetD)
We’re busy people or we have a lot of samples, so we’re only going to compare each sample to the median of its expression matrix, rather than calculating D for every sample-sample combination.
Let’s say we want to look at Hoeffding’s D statistic at three stages of our transcriptomics pipeline: the raw data, after batch correction, and after quantile normalisation. We have 20 samples, labelled “S1-20” in our original expression matrices.
S1 | S2 | S3 | S4 | S5 | |
---|---|---|---|---|---|
Gene1 | 44.18806 | 37.23387 | 45.87413 | 52.99353 | 52.27348 |
Gene2 | 24.16835 | 16.58718 | 21.36511 | 28.00712 | 30.13795 |
Gene3 | 18.58088 | 13.40999 | 21.63976 | 24.65555 | 22.77302 |
Gene4 | 31.11300 | 23.55571 | 30.41126 | 37.81969 | 38.93698 |
Gene5 | 26.68123 | 21.03494 | 26.63327 | 34.64319 | 33.07522 |
Result = Dmedian(expr.raw, expr.batchcorrected, expr.normalised)
This gives us a dataframe which looks like this:
head(Result, 3)
Dataset | Sample | D |
---|---|---|
expr.raw | S14 | 0.3503459 |
expr.raw | S13 | 0.2955352 |
expr.batchcorrected | S14 | 0.1870632 |
We can then use Dplot() to visualise how the distribution of Dstatistics changes with normalisation. Each dot represents a sample, the pinker dots with a high D statistics corresponding to samples which may have MA plots to be concerned about.
Dplot(Result)
It looks like D is going down a lot when we normalise, which is what we want! So, does this seem to reflect actual MA plot wonkiness? Let’s check a sample with a high D statistic in the raw data.
MAplot(expr.raw, "S14")
MAplot(expr.normalised, "S14")
Normalisation certainly seems to be doing a big job here!
If we have fewer samples or feel like making the computer work hard, we can also compare each sample to every other sample in our dataset. This is especially good for identifying batch effects or outliers.
Results_all = Dall(expr.raw, expr.batchcorrected, expr.normalised)
Dataset | Sample1 | Sample2 | D |
---|---|---|---|
expr.raw | S11 | S14 | 0.5311666 |
expr.raw | S14 | S11 | 0.5311666 |
expr.raw | S11 | S13 | 0.4981118 |
Now let’s plot a heatmap of D statistics to visualise this:
Dheatmap(Results_all)
The more pink means greater the dependence between M and A, and therefore we can take a guess that these MA plots are probably the wonky ones! For example, it looks like Sample 11 and 14 are quite at odds with eachother in the raw data. Let’s take closer look.
MAplot(expr.raw, "S11", "S14")
MAplot(expr.normalised, "S11", "S14")
Once again, quantile normalisation comes to the rescue!
While sweetD may not be able to replace looking carefully through all of your MA plots to make sure they look okay, I hope that this will be a good time saving tool for anyone a large transcriptomics dataset and little patience.