---
title: "proDA model simulation"
output: html_notebook
---

This notebook attempts to demonstrate the statistical concepts behind the "generative" model employed by the proDA package to account for "dropout" zeros/NAs in proteomics data. Following the mathematics in the paper, we will simulate a dataset that exactly matches the distributional assumptions of the proDA model, and then examine how the model actually performs on that simulated data. Along the way, we will gain a better understanding of what proDA is attempting to accomplish, and how it does so ...

```{r}
library(ggplot2)
library(extraDistr)
library(LaplacesDemon)
library(reshape2)
library(plyr)
library(ggbeeswarm)
library(proDA)
library(limma)
```

To fully (?) understand the proDA generative model, we will work backwards from
the various hyperparameters of the parent distributions that define the sharing of
means and variances across peptides, and (eventually) simulate a phony dataset. This, in essence, is exactly what a generative model represents: an attempt to represent real data via a simulation governed by definable probability distributions. Remember: "all models are wrong, but some are useful" (George Box).

The best way to deal with a generative model is to "work backwards". Let us begin:

Starting with Appendix equation (4), the variance (squared s.d.), $\sigma_i^2$,
of each peptide $i$ is sampled from a scaled Inverse Chi-Squared distribution
with hyperparameters $\tt{df}_0$ and $\tau_0^2$; the `rinvchisq` R function expects a parameterization of $\nu$ (a positive shape parameter, identical to $\tt{df}_0$) and $\tau$ (a positive scale; $1/\tau^2$ can be read as a typical _precision_, i.e. inverse variance). Let's see what that looks like for a (really and completely arbitrary) pair of hyperparameters:

```{r}
set.seed(123) # the well-known "Jackson Five" constant.
npep <- 5000 # number of peptides in our simulation
df_0 <- 20
tau_0 <- 4
var_i <- rinvchisq(npep, df_0, tau_0)
names(var_i) <- paste("fragment", 1:npep, sep="-")
hist(var_i, 100, freq=F)
```
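
For reference (my addition; this is the standard textbook form, not taken from the proDA paper), the scaled Inverse Chi-Squared density with shape $\nu$ and scale $\tau^2$ is

$$
p(\sigma^2 \mid \nu, \tau^2) \;=\; \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\,\tau^{\nu}\,(\sigma^2)^{-(\nu/2+1)}\,\exp\!\left(-\frac{\nu\,\tau^2}{2\sigma^2}\right),
$$

with mean $\nu\tau^2/(\nu-2)$ for $\nu > 2$. One caveat: references differ on whether an implementation takes $\tau$ or $\tau^2$ as its `scale` argument, so here is a quick empirical check of my own, comparing the sample mean of `var_i` against the theoretical mean under both readings:

```{r}
# compare the empirical mean of var_i to the theoretical scaled Inv-Chi-Squared
# mean, df * tau^2 / (df - 2), under both readings of rinvchisq's `scale`
c(empirical               = mean(var_i),
  mean_if_scale_is_tau    = df_0 * tau_0^2 / (df_0 - 2),
  mean_if_scale_is_tau_sq = df_0 * tau_0   / (df_0 - 2))
```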
OK, moving further backwards to Appendix equation (3), the per-group peptide means $\mu_{i,j}$ are sampled from the so-called "non-standardized" version of the Student's t-distribution, which has three parameters: the "normality" parameter $\tt{df}_{\tt{loc}}$, the overall "grand mean" log2 peptide abundance $\mu_0$, and the variance $\sigma_0^2$, i.e. the expected _range_ of the individual peptide means sampled from this distribution. In the proDA paper, $\tt{df}_{\tt{loc}}$ is assumed to be pre-specified rather than estimated from the data; the default in the code is 3. Let's see what that looks like for some (again, somewhat arbitrarily selected) set of hyperparameters:
```{r}
groups <- c("Ctrl", "TrtA", "TrtB")
ngrp <- length(groups)
df_loc <- 3
mu_0 <- 15
sd_0 <- 1
# NB: rst()'s `sigma` is a scale (s.d.-like) parameter, not a variance; passing
# sd_0^2 is harmless here only because sd_0 = 1
mu_ij <- matrix(rst(n = npep*ngrp, mu = mu_0, sigma = sd_0^2, nu = df_loc),
                ncol=ngrp)
colnames(mu_ij) <- groups
rownames(mu_ij) <- names(var_i)
names(dimnames(mu_ij)) <- c("peptide", "group")


hist(mu_ij, 100, freq=F, xlim=c(5, 25))
```
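
As a side check (a sketch of my own): the non-standardized t is just a location-scale transform of a standard t, so `rst(n, mu, sigma, nu)` should agree in distribution with `mu + sigma * rt(n, df = nu)`, which base R can generate without any extra packages:

```{r}
# QQ-plot: LaplacesDemon's rst vs. the equivalent base-R location-scale t
qqplot(rst(1e4, mu = mu_0, sigma = sd_0, nu = df_loc),
       mu_0 + sd_0 * rt(1e4, df = df_loc),
       xlab = "rst() draws", ylab = "mu_0 + sd_0 * rt() draws")
abline(0, 1)
```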
OK, so now we have parent or "background" distributions for the mean and variance of each peptide: different means for the same peptide in different experimental groups, but a shared, constant variance enforced ("assumed") across all experimental groups for each peptide, i.e. "homoscedasticity" -- that's a really big assumption, and one that real data very often violates. We can return to that problem later.

So now we're ready to simulate the (latent, unobserved) log2 abundance values, $z_{i,j}$, for every sample replicate in each group (Appendix equation (1), second line). These are the values we would have observed in the MS if there were absolutely no dropout, in which case every log2 abundance would be measured directly, with some biological (and technical) variation $\sigma_i^2$.
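
In other words (paraphrasing the paper's notation), for sample $j$ belonging to group $g(j)$:

$$
z_{i,j} \sim \mathcal{N}\!\left(\mu_{i,\,g(j)},\ \sigma_i^2\right)
$$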

```{r}
nrep <- 3 # number of replicates per group
z_ij <- matrix(data = NA, nrow=npep, ncol=0)
for (idx in 1:ncol(mu_ij)) {
  group <- groups[idx]
  reps <- matrix(rnorm(nrep*npep, mean=mu_ij[,idx], sd=sqrt(var_i)),
                 byrow = F,
                 ncol=nrep)
  colnames(reps) <- paste(group, rep(1:nrep), sep="-")
  z_ij <- cbind(z_ij, reps)
}
rownames(z_ij) <- rownames(mu_ij)
names(dimnames(z_ij)) <- c("peptide", "sampleID")
print(head(z_ij), digits=4)
```

Just to confirm (sanity check) that we did this right, let's plot the intended group means, the $\mu_{i,j}$, with +/- 2 s.d. bars (from $\sigma_i^2$) for each group, and overlay the $z_{i,j}$ values, for the first handful of fragments:
```{r}
zs <- melt(z_ij[1:6,], value.name = "abundance") # first six peptides
zs$group <- gsub("(.*?)\\-\\d+", "\\1", zs$sampleID)
mus <- melt(mu_ij[1:6,], value.name = "abundance")
mus <- merge(mus, melt(var_i[1:6], value.name = "var"), by.x="peptide", by.y=0)
ggplot(zs, aes(x=group, y=abundance, color=sampleID)) + geom_point() + 
  scale_color_brewer(palette="Set1") +
  geom_crossbar(data=mus, mapping = aes(y=abundance,
                                        ymin=abundance-2*sqrt(var),
                                        ymax=abundance+2*sqrt(var)),
                color=I("black")) +
  facet_wrap(~peptide) + theme_bw()
```

OK, so with that confirmed, let's remember that this is a rather idealized dataset where data from every sample comes already "pre-normalized" to all the others. Let's see what I mean:
```{r}
ggplot(melt(z_ij, value.name = "abundance"),
       aes(x=sampleID, y=abundance, color=sampleID)) + geom_violin() + 
  scale_color_brewer(palette="Set1") + theme_bw()
```
In real, "raw" data, the distribution of abundances in each sample would vary a little, reflecting differences in relative sample concentration (especially in a multiplexed experiment). Between-sample normalization techniques (quantile normalization, median normalization, etc.) are used to "align" the distributions, but we don't need such a step here. However, as discussed in the proDA paper, real "raw" data _does_ generally need to be normalized, and that normalization step can (and does) introduce distortions when the "lower end" of the abundance distribution has a higher dropout rate -- especially when the entire distribution for a given sample is shifted lower due to an overall lower relative sample abundance, producing more dropout NAs (or zeroes) than in other samples.

OK, now to simulate the dropouts in each sample. The proDA model uses a sigmoid curve (conveniently, the complement of the Normal CDF; Appendix equation (2)), with parameters $\rho_j$ (inflection point) and $\zeta_j$ (scale), to represent the dropout probability at any given abundance value. Let's look at a few example dropout curves for a couple of different parameter choices:
```{r}
ipnorm <- function(...) {
  1 - pnorm(...)
}
ggplot() +
  stat_function(fun = ipnorm,
                color = "blue",
                args = list(mean = 1, sd = 1)) +
  stat_function(fun = ipnorm,
                color="red",
                args = list(mean=10, sd=5)) +
  geom_vline(xintercept=1, color="blue", linetype=2) +
  geom_vline(xintercept=10, color="red", linetype=2) +
  geom_hline(yintercept=0.5, color="grey", linetype=2) +
  xlim(-10, 30) + xlab("abundance") + ylab("dropout probability") + theme_bw()
```
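
Formally (paraphrasing Appendix equation (2)), the probability that an abundance $z$ drops out in sample $j$ is

$$
P(\text{dropout} \mid z, j) \;=\; 1 - \Phi\!\left(\frac{z - \rho_j}{\zeta_j}\right),
$$

which is exactly what `ipnorm` evaluates via `1 - pnorm(z, mean = rho_j, sd = zeta_j)`.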
From these two example dropout curves, we can see that the inflection point ($\rho_j$, specified as the mean to the `pnorm` Normal CDF function) is where the dropout probability is exactly 0.5 (see dashed lines); and the scale ($\zeta_j$, specified as the s.d. to `pnorm`) defines the slope of the curve, i.e. how "sharp" the demarcation is between high and low dropout probability over a given range of abundances.

OK, with some intuition in hand, we can start to reason about how the dropout curves might differ for each sample. In the proDA model, the sample-specific inflection points and scales don't have any common, shared parent distribution; instead, they are estimated individually for each sample by optimizing that sample's likelihood function -- and since there are so many peptide fragments to inform each estimate, this is quite computationally efficient. However, we've been entirely probabilistic so far in this simulation, with every value of interest derived as a random draw from some parent distribution, so let's stick with that theme. Given that we specified the "grand mean" $\mu_0$ of all peptide abundances at `r mu_0`, with a "range" $\sigma_0^2$ of `r sd_0^2`, let's set the "average" inflection point (where the dropout probability will be 0.5), $\rho_0$, over all samples to $\mu_0 - 4\sigma_0$, i.e. four s.d. below the grand mean; and then let each sample's $\rho_j$ vary around this with variance $\sigma_{\rho_0}^2$. Similarly, let's draw each sample's $\zeta_j$ from the scaled Inverse Chi-Squared distribution (with its own hyperparameters, $\zeta_0$ and $\zeta_{\tau_0}$), as we did earlier for the analogous peptide variances:

```{r}
rho_0 <- mu_0 - 4 * sd_0
sigma_rho_0 <- 1.2
rho_j <- rnorm(ncol(z_ij), rho_0, sigma_rho_0)
names(rho_j) <- colnames(z_ij)

zeta_0 <- 10     # passed as the shape/df argument of rinvchisq below
zeta_tau_0 <- 3  # passed as the scale argument
zeta_j <- rinvchisq(ncol(z_ij), zeta_0, zeta_tau_0)
names(zeta_j) <- colnames(z_ij)

ggplot() +
  lapply(1:length(zeta_j), function (i) {
     stat_function(fun = ipnorm,
                   mapping = aes(color=colnames(z_ij)[i]),
                   args = list(mean=rho_j[i], sd=zeta_j[i]))
    }) +
  scale_color_brewer(name="sampleID", palette="Set1") +
  xlim(-10, 30) + xlab("abundance") + ylab("dropout probability") + theme_bw()

```
OK, great; we've got the underlying dropout curves, one for each sample. Now we can simulate which of the $z_{i,j}$ values will be observed (not a dropout) and which will be "NA" (a dropout), by flipping a biased coin for each $z_{i,j}$, with the dropout probability given by evaluating sample $j$'s dropout curve at that abundance value. This gives the "dropout matrix", $d_{i,j}$.

```{r}
d_ij <- z_ij
d_ij[,] <- 0
for (j in 1:ncol(d_ij)) {
  d_ij[,j] <- rbern(n=nrow(d_ij), prob=ipnorm(z_ij[,j], rho_j[j], zeta_j[j]))
}
qplot(rho_j, 100 * colSums(d_ij)/nrow(d_ij), color=zeta_j, size=I(5)) +
  geom_label(aes(label=names(zeta_j))) + 
  scale_y_log10() +
  ylab("% of dropouts in sample") + theme_bw()
```

From the above, we see that the dropout % in each sample correlates with the $\rho_j$ inflection points -- the higher (more right-shifted) the inflection point, the more abundance values are lost to dropout; and, for a given $\rho_j$, the higher the $\zeta_j$, the shallower the dropout curve, so even more abundances drop out.
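
We can also confirm the coin-flipping numerically: under the Bernoulli model, the expected dropout fraction in sample $j$ is just the dropout curve averaged over that sample's latent abundances (a sanity check of my own):

```{r}
# observed per-sample dropout fraction vs. its expectation under the model
expected <- sapply(seq_len(ncol(z_ij)),
                   function(j) mean(ipnorm(z_ij[, j], rho_j[j], zeta_j[j])))
round(rbind(observed = colMeans(d_ij), expected = expected), 3)
```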

Now we have everything we need from Appendix equation (1) to produce a final, simulated dataset of experimentally observed abundances, $y_{i,j}$.

```{r}
# start by just copying z_ij into y_ij:
y_ij <- z_ij

# now dropout the values according to d_ij:
for (j in 1:ncol(y_ij)) {
  y_ij[d_ij[,j] == 1, j] <- NA
}

print(head(y_ij), digits=4)
```

In the first six fragments above, we can already see a handful of NAs representing simulated dropouts.
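
Overall (a quick tally of my own):

```{r}
# overall fraction of values lost to dropout in the final dataset
mean(is.na(y_ij))
```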

OK, now we can test proDA on our simulated data, for which we know the true value of every parameter of interest, and see how well proDA recovers what we've set up:

```{r}
fit <- proDA(y_ij, design = factor(rep(groups, each=nrep)), verbose=T)
```
So, perhaps not too surprisingly, when we fit proDA to data that exactly matches its modeling assumptions, it seems to do a pretty good job! Let's look at the various hyperparameters first:
```{r}
hps <- fit$hyper_parameters
knitr::kable(
  data.frame(
    name=c("grand mean",
           "grand range",
           "normality parameter",
           "variance scale",
           "variance shape"),
    values=c(mu_0, sd_0^2, df_loc,
             tau_0^2, # NB: the true value is tau_0^2 only if rinvchisq's `scale`
                      # argument is tau; if it is already tau^2, use tau_0 here
             df_0),
    estimate=c(hps$location_prior_mean,
               hps$location_prior_scale,
               hps$location_prior_df, # given as a fixed constant, not estimated
               hps$variance_prior_scale,
               hps$variance_prior_df)), format="pipe")
```

While the parameters (and random seed) of this notebook could change, at least with the current values we see that nearly all of the hyperparameters match very well, with the exception of the "grand range", $\sigma_0^2$, which proDA has over-estimated relative to the true spread of the group-wise mean values. I expect this is an undue influence of the dropout values themselves, which will tend to "inflate" the expected variance of the group means at the bottom/left end, even though the top/right end is accurately and explicitly represented in the data (this amounts to a suggestion for the proDA authors: consider estimating this variance from a left-truncated distribution of group means, to avoid the artifact).

Next, let's check out the dropout curve parameters:
```{r}
ggplot(rbind(data.frame(variable="dropout position",
                        value=rho_j,
                        estimate=hps$dropout_curve_position,
                        sampleID = names(rho_j)),
             data.frame(variable="dropout scale",
                        value=zeta_j,
                        estimate=-hps$dropout_curve_scale,
                        sampleID = names(zeta_j))),
       aes(x=value, y=estimate)) + 
  geom_abline(slope=1, intercept=0) +
  stat_smooth(method="lm", formula = y ~ x) +
  geom_point(aes(color=sampleID)) +
  facet_wrap(~variable, scales="free") +  theme_bw()
```
Interesting! While both the location ($\rho$) and scale ($\zeta$) parameters are very strongly correlated with their true values, the location is very slightly under-estimated (though not statistically significant in this case), while the scale is distinctly and systematically _over-estimated_ in all samples. Off the top of my head, I don't have any rationale for this behavior, so I'm going to leave this as a question for the authors (!?!)

OK, so most importantly, let's look at how accurately proDA estimated the group means, in the face of all these dropouts and (mostly) correct hyperparameter estimates:

```{r}
coef <- fit$coefficients
names(dimnames(coef)) <- c("peptide", "group")
dtall <- melt(d_ij, value.name = "drop")
dtall$group <- gsub("(.*?)\\-\\d+", "\\1", dtall$sampleID)

dropdf <- ddply(dtall, .(peptide, group), function(dx) {
  data.frame(dropnum = sum(dx$drop))
})

# per-peptide standard errors: sqrt of the diagonal of each peptide's
# coefficient variance-covariance matrix (one column per group)
ses <- matrix(unlist(lapply(lapply(fit$coefficient_variance_matrices, diag), sqrt)),
              ncol=ngrp, byrow=T)
colnames(ses) <- colnames(fit$coefficient_variance_matrices[[1]])
ses <- as.data.frame(ses)
ses$peptide <- rownames(fit$coefficients)
# t quantiles for a 95% CI, using proDA's moderated per-peptide degrees of freedom
ses$lowqt <- qt(0.025, fit$feature_parameters$df)
ses$highqt <- qt(0.975, fit$feature_parameters$df)
cis <- melt(ses,
            id=c("peptide", "highqt", "lowqt"),
            variable.name = "group",
            value.name = "ses")
cis$low <- cis$lowqt * cis$ses
cis$high <- cis$highqt * cis$ses

df <- merge(dropdf, merge(melt(coef, value.name = "estimate"), melt(mu_ij, value.name = "value")))
df <- merge(df, cis)

df$estlow <- df$estimate + df$low
df$esthigh <- df$estimate + df$high
df$estrange <- df$esthigh - df$estlow

ggplot(df, aes(x=value, y=estimate, shape=group, color=factor(dropnum))) +
  geom_point() + geom_abline(slope=1, intercept=0) +
  scale_color_brewer(name="# dropouts", palette="Set2") +
  facet_wrap(~dropnum) + theme_bw()
```

Again, interesting! When the number of dropouts in a group is 0, 1, or even 2, proDA does a pretty remarkable job of recovering the true value of the group mean. But when all of the replicates in a group are dropouts, we're left with coefficient estimates that are all ~12, presumably reflecting where the dropout curves and the location prior place the remaining probability mass. The 95% confidence interval widths look like this:

```{r}
ggplot(df, aes(x=estrange, fill=factor(dropnum))) +
  stat_bin(bins=50) + 
  scale_color_brewer(name="# dropouts", palette="Set2") +
  facet_wrap(~dropnum, scales="free_y", ncol=1) + theme_bw() +
  xlab("width of 95% confidence interval")
```

So, as expected: tighter intervals for groups with all observations present; wider intervals with fewer observations; and an essentially fixed (wide) interval when no observations are made at all.
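
To put numbers on both observations (my addition), here are the median coefficient estimate and median CI width by dropout count; the all-dropout rows should show the ~12 estimates and the widest, near-constant intervals:

```{r}
# median estimate and median 95% CI width, stratified by dropouts per group
ddply(df, .(dropnum), function(dx) {
  data.frame(median_estimate = median(dx$estimate),
             median_ci_width = median(dx$estrange))
})
```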

Lastly, let's see how well proDA was able to estimate variances for each peptide:
```{r}
vardf <- data.frame(value = var_i,
                    estimate = fit$feature_parameters$s2,
                    missing = ncol(y_ij) - fit$feature_parameters$n_obs)

ggplot(vardf, aes(x=value, y=estimate)) + stat_binhex() +
  geom_abline(slope=1, intercept=0) + theme_bw() +
  xlab("true variance") + ylab("proDA estimated variance")

```
The majority of variances are estimated correctly, but the handful of peptides with the largest true variances are under-estimated, potentially inflating downstream contrast statistics.
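
A rough way to quantify that shrinkage (again, my addition): the median ratio of estimated to true variance, within deciles of the true variance, should fall below 1 in the top deciles:

```{r}
# median(estimated/true variance) within deciles of the true variance
tapply(vardf$estimate / vardf$value,
       cut(vardf$value, breaks = quantile(vardf$value, probs = 0:10 / 10),
           include.lowest = TRUE),
       median)
```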

# Effects of normalization with dropouts

We observed earlier that the simulated data is already perfectly normalized, at least prior to having random low-abundance values removed by dropout; and each sample has a slightly different dropout curve, so now the observed distributions may not be so perfectly comparable:

```{r}
ggplot(melt(y_ij, value.name = "abundance"),
       aes(x=sampleID, y=abundance, color=sampleID)) + geom_violin() + 
  scale_color_brewer(palette="Set1") + theme_bw()
```

What happens if we quantile normalize the observations, and then compare to their original values (removing/ignoring any difference less than 0.2 units, for graphical clarity)?

```{r}
normy <- normalizeBetweenArrays(y_ij, method="quantile")
# for comparison: proDA's median_normalization shifts each sample by a constant;
# we compute those per-sample shifts here, though only the quantile-normalized
# deltas are plotted below
medy <- median_normalization(y_ij)
deltas <- colMeans(medy - y_ij, na.rm=T)
deltas <- data.frame(delta=deltas, sampleID = names(deltas))
ggplot(subset(melt(normy - y_ij, value.name = "delta"),
              abs(delta) >= 0.2),
       aes(x=sampleID, y=delta, color=sampleID)) + geom_quasirandom() + 
  scale_color_brewer(palette="Set1") + theme_bw()
```
It appears, at least for this simulated dataset that doesn't have gross differences in mean abundances or dropout rates, that quantile normalization has a large effect on only a small number of fragments. If we refit `proDA` to this (re)normalized data, we'd expect to see some differences:
```{r}
normfit <- proDA(normy, design = factor(rep(groups, each=nrep)), verbose=T)
```

```{r}
normcoef <- normfit$coefficients
names(dimnames(normcoef)) <- c("peptide", "group")

coefdiff <- normcoef - coef
names(dimnames(coefdiff)) <- c("peptide", "group")
ggplot(subset(melt(coefdiff, value.name = "diff"),
              abs(diff) >= 0.2),
       aes(x=group, y=diff)) + geom_quasirandom() +
  theme_bw() + ylab("difference in coefficient estimates")

ggplot(merge(melt(coefdiff, value.name="diff"),
             melt(coef, value.name="estimate")),
       aes(x=estimate, y=diff)) + geom_point() + facet_wrap(~group) + theme_bw() +
  ylab("difference in coefficient estimates") + xlab("original estimate")
```
Perhaps as expected, the few coefficients that are grossly affected by the normalization procedure are confined to the (mostly upper) extremes on the abundance scale.
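
To put a number on "few" (my addition):

```{r}
# count of coefficients shifted by at least 0.2 units after renormalization
c(changed = sum(abs(coefdiff) >= 0.2), total = length(coefdiff))
```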

But let's also look at what happens to the dropout curve parameters:
```{r}
normhps <- normfit$hyper_parameters
ggplot(rbind(data.frame(variable="dropout position",
                        value=rho_j,
                        estimate=normhps$dropout_curve_position,
                        sampleID = names(rho_j)),
             data.frame(variable="dropout scale",
                        value=zeta_j,
                        estimate=-normhps$dropout_curve_scale,
                        sampleID = names(zeta_j))),
       aes(x=value, y=estimate)) + 
  geom_abline(slope=1, intercept=0) +
  stat_smooth(method="lm", formula = y ~ x) +
  geom_point(aes(color=sampleID)) +
  facet_wrap(~variable, scales="free") +  theme_bw()
```
Renormalization forces the dropout midpoint locations toward each other: curves with lower true positions are over-estimated, and those with higher true positions are under-estimated (while the scale remains, as before, uniformly over-estimated). These small distortions in the dropout curves could have knock-on effects on the coefficient estimates for groups with multiple dropout samples:

```{r}
# same 95% CI construction as before, now for the refit on normalized data
normses <- matrix(unlist(lapply(lapply(normfit$coefficient_variance_matrices, diag), sqrt)),
                  ncol=ngrp, byrow=T)
colnames(normses) <- colnames(normfit$coefficient_variance_matrices[[1]])
normses <- as.data.frame(normses)
normses$peptide <- rownames(normfit$coefficients)
normses$lowqt <- qt(0.025, normfit$feature_parameters$df)
normses$highqt <- qt(0.975, normfit$feature_parameters$df)
normcis <- melt(normses,
                id=c("peptide", "highqt", "lowqt"),
                variable.name = "group",
                value.name = "ses")
normcis$low <- normcis$lowqt * normcis$ses
normcis$high <- normcis$highqt * normcis$ses

normdf <- merge(dropdf, merge(melt(normcoef, value.name = "estimate"), melt(mu_ij, value.name = "value")))
normdf <- merge(normdf, normcis)

normdf$estlow <- normdf$estimate + normdf$low
normdf$esthigh <- normdf$estimate + normdf$high
normdf$estrange <- normdf$esthigh - normdf$estlow

ggplot(normdf, aes(x=value, y=estimate, shape=group, color=factor(dropnum))) +
  geom_point() + geom_abline(slope=1, intercept=0) +
  scale_color_brewer(name="# dropouts", palette="Set2") +
  facet_wrap(~dropnum) + theme_bw()
```
# Sample QC using distances

We know that differences in dropout patterns between samples can lead to spurious PCA projections during sample-level QC; `proDA` provides a probabilistic distance measure (`dist_approx`) that can avoid this artifact. Let's explore:

```{r}
maty <- y_ij
# naive approach: impute dropouts as zero (log2(1) = 0) before normalizing --
# a common practice, and exactly the kind of distortion dist_approx avoids
maty[is.na(maty)] <- log2(1)
maty <- normalizeBetweenArrays(maty, method="quantile")

# PCA with samples as the columns/variables, so the sample coordinates are
# found in pca$rotation
pca <- prcomp(maty, scale=T)
ggplot(as.data.frame(pca$rotation), aes(x=PC1, y=PC2, color=colnames(maty))) +
  geom_point() + theme_bw()

dm <- dist_approx(fit)
heatmap(as.matrix(dm$mean))
```

Although each sample has a different number of dropouts, the pairwise sample distances in the heatmap appear uniform within and between the experimental groups, as expected.
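
As a final numeric check of that claim (my addition), we can average the probabilistic distances within versus between experimental groups:

```{r}
# mean probabilistic distance within vs. between experimental groups
dmm <- as.matrix(dm$mean)
grp <- rep(groups, each = nrep)
same_grp <- outer(grp, grp, "==") & upper.tri(dmm)
diff_grp <- outer(grp, grp, "!=") & upper.tri(dmm)
c(within = mean(dmm[same_grp]), between = mean(dmm[diff_grp]))
```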