---
title: "proDA model simulation"
output: html_notebook
---

This notebook will (attempt to) demonstrate the statistical concepts behind the “generative” model employed by the proDA package to account for “dropout” zeros/NAs in proteomics data. Following the mathematics in the paper, we will simulate a dataset that exactly matches the statistical, distributional expectations of the proDA model, and then examine how the model actually performs on that simulated data. Along the way, we will gain a better understanding of what proDA is attempting to accomplish, and how it does so …

```{r}
library(ggplot2)
library(extraDistr)
library(LaplacesDemon) # masks several extraDistr functions, e.g. rinvchisq and rbern
library(reshape2)
library(plyr)
library(ggbeeswarm)
library(proDA)
library(limma)
```

To fully (?) understand the proDA generative model, we will work backwards from the various hyperparameters of parent distributions that define the sharing of means and variances across peptides, and (eventually) simulate a phony dataset. This, in essence, is exactly what a generative model represents: an attempt to represent real data via a simulation, which is governed by definable probability distributions. Remember: “all models are wrong; some happen to be useful” (George Box).

The best way to deal with a generative model is to “work backwards”. Let us begin:

Starting with Appendix equation (4), the variance (squared s.d.), \(\sigma_i^2\), of each peptide \(i\) is sampled from a scaled inverse chi-squared distribution with hyperparameters \(\tt{df}_0\) and \(\tau_0^2\); the `rinvchisq` R function expects a parameterization of \(\nu\) (a positive shape parameter, the same as \(\tt{df}_0\)) and \(\tau\) (a positive scale parameter, interpretable here as a precision, i.e. an inverse variance). Let’s see what that looks like for a (really and completely arbitrary) pair of hyperparameters:

```{r}
set.seed(123) # the well-known "Jackson Five" constant.
npep <- 5000 # number of peptides in our simulation
df_0 <- 20
tau_0 <- 4
var_i <- rinvchisq(npep, df_0, tau_0)
names(var_i) <- paste("fragment", 1:npep, sep="-")
hist(var_i, 100, freq=F)
```

OK, moving further backwards to Appendix equation (3): this is the so-called “non-standardized” version of the Student’s t-distribution, with three parameters: the “normality” parameter \(\tt{df}_{\tt{loc}}\), the central overall “grand mean” peptide log2 abundance \(\mu_0\), and the variance \(\sigma_0^2\), i.e. the range expected of the per-group means \(\mu_{i,j}\) sampled from this distribution for each peptide. In the proDA paper, \(\tt{df}_{\tt{loc}}\) is assumed to be pre-specified rather than estimated from the data; the default in the code is 3. Let’s see what that looks like for some (again, somewhat arbitrarily selected) set of hyperparameters:

```{r}
groups <- c("Ctrl", "TrtA", "TrtB")
ngrp <- length(groups)
df_loc <- 3
mu_0 <- 15
sd_0 <- 1
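# note (editorial caveat): LaplacesDemon's rst() takes sigma as a scale
# (s.d.-like) parameter, not a variance; passing sd_0^2 below is harmless
# here only because sd_0 = 1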
mu_ij <- matrix(rst(n = npep*ngrp, mu = mu_0, sigma = sd_0^2, nu = df_loc),
                ncol=ngrp)
colnames(mu_ij) <- groups
rownames(mu_ij) <- names(var_i)
names(dimnames(mu_ij)) <- c("peptide", "group")


hist(mu_ij, 100, freq=F, xlim=c(5, 25))
```

OK, so now we have parent or “background” distributions for the means and variances of each peptide: different means for the same peptide in different experimental groups, but enforced (“assumed”) shared, constant variance across all experimental groups for each peptide, i.e. “homoscedasticity” – that’s a really big assumption, and one that real data very often violates. We can return to that problem later.

So now we’re ready to simulate the (latent, unobserved) log2 abundance values, \(z_{i,j}\), for every sample replicate in each group (Appendix equation (1), second line). These are the values we would have observed in the MS if there were absolutely no dropout, in which case every log2 abundance would be measured directly, with some biological (and technical) variation \(\sigma_i^2\).

```{r}
nrep <- 3 # number of replicates per group
z_ij <- matrix(data = NA, nrow=npep, ncol=0)
for (idx in 1:ncol(mu_ij)) {
  group <- groups[idx]
  reps <- matrix(rnorm(nrep*npep, mean=mu_ij[,idx], sd=sqrt(var_i)),
                 byrow = F,
                 ncol=nrep)
  colnames(reps) <- paste(group, rep(1:nrep), sep="-")
  z_ij <- cbind(z_ij, reps)
}
rownames(z_ij) <- rownames(mu_ij)
names(dimnames(z_ij)) <- c("peptide", "sampleID")
print(head(z_ij), digits=4)
```
```
            sampleID
peptide      Ctrl-1 Ctrl-2 Ctrl-3 TrtA-1 TrtA-2 TrtA-3 TrtB-1 TrtB-2 TrtB-3
  fragment-1  11.25  10.90  11.71  17.63  13.65  15.24  18.70  19.56  25.45
  fragment-2  15.87  14.57  12.74  14.09  13.43  12.87  14.80  15.21  13.99
  fragment-3  16.32  18.00  14.57  13.84  21.55  17.57  13.54  13.07  11.32
  fragment-4  19.14  14.26  17.61  13.69  11.30  14.26  13.50  15.92  12.76
  fragment-5  14.95  11.66  13.66  12.74  16.65  16.51  13.15  15.34  13.47
  fragment-6  14.71  16.44  16.37  14.57  13.61  19.26  16.16  14.12  11.93
```

Just to confirm (sanity check) that we did this right, let’s make a plot of the intended group means \(\mu_{i,j}\) for each group, with crossbars at ±2 s.d. (from \(\sigma_i^2\)), and overlay the \(z_{i,j}\) values for the first handful of fragments:

```{r}
zs <- melt(z_ij[1:6,], value.name = "abundance") # first six peptides
zs$group <- gsub("(.*?)\\-\\d", "\\1", zs$sampleID)
mus <- melt(mu_ij[1:6,], value.name = "abundance")
mus <- merge(mus, melt(var_i[1:6], value.name = "var"), by.x="peptide", by.y=0)
ggplot(zs, aes(x=group, y=abundance, color=sampleID)) + geom_point() + 
  scale_color_brewer(palette="Set1") +
  geom_crossbar(data=mus, mapping = aes(y=abundance,
                                        ymin=abundance-2*sqrt(var),
                                        ymax=abundance+2*sqrt(var)),
                color=I("black")) +
  facet_wrap(~peptide) + theme_bw()
```

OK, so with that confirmed, let’s remember that this is a rather idealized dataset where data from every sample comes already “pre-normalized” to all the others. Let’s see what I mean:

```{r}
ggplot(melt(z_ij, value.name = "abundance"),
       aes(x=sampleID, y=abundance, color=sampleID)) + geom_violin() +
  scale_color_brewer(palette="Set1") + theme_bw()
```

In real, “raw” data, the distribution of abundances in each sample would vary a little, reflecting differences in sample concentration relative to one another (especially in a multiplexed experiment). Between-sample normalization techniques (quantile normalization, median normalization, etc.) are used to “align” the distributions, but we don’t need such a step here. However, as discussed in the proDA paper, real “raw” data does generally need to be normalized, and that normalization step can (and does) introduce distortions when the lower end of the abundance distribution has a higher dropout rate – especially when the entire distribution for a given sample is shifted lower due to an overall lower relative sample abundance, producing more dropout NAs (or zeroes) than in other samples.
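As a tiny illustration (an editorial sketch on the latent `z_ij`; proDA’s own `median_normalization` operates on the observed matrix, so this only shows the underlying idea), the per-sample median shifts that a median normalization would remove are essentially zero here:

```{r}
# per-sample median shift relative to the overall median; on this idealized,
# dropout-free z_ij there is nothing for between-sample normalization to fix
round(apply(z_ij, 2, median) - median(z_ij), 3)
```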

OK, now to simulate the dropouts in each sample. The proDA model uses a sigmoid curve (conveniently, the complement of the Normal CDF; Appendix equation (2)), with parameters \(\rho_j\) (inflection point) and \(\zeta_j\) (scale), to represent the dropout probability at any given abundance value. Let’s look at example dropout curves for a couple of different parameter choices:

```{r}
ipnorm <- function(...) {
  1 - pnorm(...)
}
ggplot() +
  stat_function(fun = ipnorm,
                color = "blue",
                args = list(mean = 1, sd = 1)) +
  stat_function(fun = ipnorm,
                color="red",
                args = list(mean=10, sd=5)) +
  geom_vline(xintercept=1, color="blue", linetype=2) +
  geom_vline(xintercept=10, color="red", linetype=2) +
  geom_hline(yintercept=0.5, color="grey", linetype=2) +
  xlim(-10, 30) + xlab("abundance") + ylab("dropout probability") + theme_bw()
```

From these two example dropout curves, we can see that the inflection point (\(\rho_j\), specified as the mean to the `pnorm` Normal CDF function) is where the probability is exactly 0.5 (see dashed lines); and the scale (\(\zeta_j\), specified as the s.d. to `pnorm`) defines the slope of the curve – how “sharp” the demarcation is between high and low dropout probability over a given range of abundances.
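We can confirm the inflection-point reading numerically (`pnorm`, and hence `ipnorm`, is vectorized over its arguments):

```{r}
# at each curve's own inflection point, the dropout probability is exactly 0.5
ipnorm(c(1, 10), mean = c(1, 10), sd = c(1, 5))
```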

OK, with some intuition in hand, we can start to reason about how the dropout curves might differ for each sample. In the proDA model, the sample-specific inflection points and scales don’t share any common parent distribution; they are estimated individually for each sample by optimizing that sample’s likelihood function, and since there are so many peptide fragments to inform the estimation, this works well in practice. However, we’ve been entirely probabilistic so far in this simulation, with every value of interest derived as a random draw from some parent distribution, so let’s stick with that theme. Given that we specified the “grand mean” \(\mu_0\) of all peptide abundances at 15, with a “range” \(\sigma_0^2\) of 1, let’s set the “average” inflection point (where the dropout probability will be 0.5), \(\rho_0\), over all samples to be \(\mu_0 - 4\sigma_0\) (4 s.d. below the grand mean), and then let each sample’s \(\rho_j\) vary around this with variance \(\sigma_{\rho_0}^2\). Similarly, let the average scale \(\zeta_0\) be a hyperparameter, and draw each sample’s \(\zeta_j\) from the scaled inverse chi-squared distribution, as we did earlier for the analogous peptide variances:

```{r}
rho_0 <- mu_0 - 4 * sd_0
sigma_rho_0 <- 1.2
rho_j <- rnorm(ncol(z_ij), rho_0, sigma_rho_0)
names(rho_j) <- colnames(z_ij)

zeta_0 <- 10
zeta_tau_0 <- 3
zeta_j <- rinvchisq(ncol(z_ij), zeta_0, zeta_tau_0)
names(zeta_j) <- colnames(z_ij)

ggplot() +
  lapply(1:length(zeta_j), function (i) {
     stat_function(fun = ipnorm,
                   mapping = aes(color=colnames(z_ij)[i]),
                   args = list(mean=rho_j[i], sd=zeta_j[i]))
    }) +
  scale_color_brewer(name="sampleID", palette="Set1") +
  xlim(-10, 30) + xlab("abundance") + ylab("dropout probability") + theme_bw()
```

OK, great; we’ve got the underlying dropout curves, one for each sample. Now we can simulate which of the \(z_{i,j}\) values will be observed (not a dropout) and which will be NA (a dropout), by flipping a coin for each \(z_{i,j}\), with the dropout probability given by sample \(j\)’s dropout curve evaluated at that abundance. This is the “dropout matrix”, \(d_{i,j}\).

```{r}
d_ij <- z_ij
d_ij[,] <- 0
for (j in 1:ncol(d_ij)) {
  d_ij[,j] <- rbern(n=nrow(d_ij), prob=ipnorm(z_ij[,j], rho_j[j], zeta_j[j]))
}
qplot(rho_j, 100 * colSums(d_ij)/nrow(d_ij), color=zeta_j, size=I(5)) +
  geom_label(aes(label=names(zeta_j))) + 
  scale_y_log10() +
  ylab("% of dropouts in sample") + theme_bw()

From the above, we see that the dropout % in each sample correlates with the \(\rho_j\) inflection points – the higher (right-shifted) the inflection point, the more abundance values are lost to dropout; and, for a given \(\rho_j\), the higher the \(\zeta_j\) value, the shallower the dropout slope, and so even more abundances drop out.
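A quick numeric check of that visual impression (a sketch using the simulated objects above):

```{r}
# correlation between each sample's inflection point and its realized dropout fraction
cor(rho_j, colSums(d_ij) / nrow(d_ij))
```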

Now we have everything we need from Appendix equation (1) to produce a final, simulated dataset of experimentally observed abundances, \(y_{i,j}\).

```{r}
# start by just copying z_ij into y_ij:
y_ij <- z_ij

# now dropout the values according to d_ij:
for (j in 1:ncol(y_ij)) {
  y_ij[d_ij[,j] == 1, j] <- NA
}

print(head(y_ij), digits=4)
```
```
            sampleID
peptide      Ctrl-1 Ctrl-2 Ctrl-3 TrtA-1 TrtA-2 TrtA-3 TrtB-1 TrtB-2 TrtB-3
  fragment-1  11.25  10.90     NA  17.63     NA  15.24  18.70  19.56  25.45
  fragment-2  15.87  14.57  12.74  14.09  13.43  12.87  14.80  15.21     NA
  fragment-3  16.32     NA     NA  13.84  21.55  17.57  13.54  13.07  11.32
  fragment-4  19.14  14.26  17.61  13.69     NA  14.26  13.50  15.92     NA
  fragment-5     NA  11.66  13.66  12.74  16.65  16.51  13.15  15.34     NA
  fragment-6  14.71  16.44  16.37  14.57  13.61  19.26  16.16     NA     NA
```

In the first six fragments above, we can already see a handful of NAs representing simulated dropouts.

OK, now we can test proDA using our simulated data, for which we know the true value of every parameter of interest, and see how well proDA recovers what we’ve set up:

```{r}
fit <- proDA(y_ij, design = factor(rep(groups, each=nrep)), verbose=T)
```
```
Fitting the hyper-parameters for the probabilistic dropout model.
Starting iter: 1
```

So, perhaps unsurprisingly, when we fit proDA to data that exactly matches its modeling expectations, it seems to do a pretty good job! Let’s look at the various hyperparameters first:

```{r}
hps <- fit$hyper_parameters
knitr::kable(
  data.frame(
    name=c("grand mean",
           "grand range",
           "normality parameter",
           "variance scale",
           "variance shape"),
    values=c(mu_0, sd_0^2, df_loc, tau_0^2, df_0),
    estimate=c(hps$location_prior_mean,
               hps$location_prior_scale,
               hps$location_prior_df, # given as a fixed constant, not estimated
               hps$variance_prior_scale,
               hps$variance_prior_df)), format="pipe")
```
|name                | values|  estimate|
|:-------------------|------:|---------:|
|grand mean          |     15| 15.004408|
|grand range         |      1|  3.581907|
|normality parameter |      3|  3.000000|
|variance scale      |     16|  3.949251|
|variance shape      |     20| 18.304869|

While the parameters (and random seed) of this notebook could change, at least with the current values we see that nearly all of the hyperparameters match very well. The exception is the “grand range”, \(\sigma_0^2\), which proDA has over-estimated relative to the true spread of the group-wise mean values. I expect this is an undue influence of the dropout values themselves, which will tend to “inflate” the expected variance of the group means at the bottom/left end, even though the top/right end is accurately and explicitly represented in the data. (This amounts to a suggestion for the proDA authors: consider estimating this variance from a left-truncated distribution of group means, to avoid the artifact.)
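One hedged, editorial aside on that comparison: some of the gap may be a parameterization mismatch rather than a mis-estimate. A non-standardized t-distribution with \(\nu = 3\) and scale \(\sigma_0 = 1\) has variance \(\nu/(\nu-2)\,\sigma_0^2 = 3\), quite close to the estimate above; if `location_prior_scale` reports the prior’s variance rather than its squared scale parameter, most of the apparent over-estimate would be expected:

```{r}
# variance of a non-standardized t with nu = df_loc and scale sd_0
# (whether location_prior_scale is directly comparable is an assumption)
df_loc / (df_loc - 2) * sd_0^2
```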

Next, let’s check out the dropout curve parameters:

```{r}
ggplot(rbind(data.frame(variable="dropout position",
                        value=rho_j,
                        estimate=hps$dropout_curve_position,
                        sampleID = names(rho_j)),
             data.frame(variable="dropout scale",
                        value=zeta_j,
                        estimate=-hps$dropout_curve_scale,
                        sampleID = names(zeta_j))),
       aes(x=value, y=estimate)) + 
  geom_abline(slope=1, intercept=0) +
  stat_smooth(method="lm", formula = y ~ x) +
  geom_point(aes(color=sampleID)) +
  facet_wrap(~variable, scales="free") + theme_bw()
```

Interesting! While both the location (\(\rho\)) and scale (\(\zeta\)) parameters are very strongly correlated with the truth, the location is very slightly under-estimated (though not statistically significant in this case), while the scale is distinctly and systematically over-estimated in all samples. Off the top of my head, I don’t have any rationale for this behavior, so I’m going to leave it as a question for the authors (!?!)
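To put a rough number on that over-estimation (a quick sketch; as in the plot above, it assumes the negated `hps$dropout_curve_scale` is the quantity comparable to \(\zeta_j\)):

```{r}
# no-intercept regression slope of estimated scale on true scale;
# a slope > 1 indicates systematic over-estimation
coef(lm(-hps$dropout_curve_scale ~ 0 + zeta_j))
```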

OK, so most importantly, let’s look at how accurate proDA’s estimates of the group means ended up, in the face of all these dropouts and (mostly) correct hyperparameter estimates:

```{r}
coef <- fit$coefficients
names(dimnames(coef)) <- c("peptide", "group")
dtall <- melt(d_ij, value.name = "drop")
dtall$group <- gsub("(.*?)\\-\\d+", "\\1", dtall$sampleID)

dropdf <- ddply(dtall, .(peptide, group), function(dx) {
  data.frame(dropnum = sum(dx$drop))
})

ses <- matrix(unlist(lapply(lapply(fit$coefficient_variance_matrices, diag), sqrt)),
              ncol=3, byrow=T)
colnames(ses) <- colnames(fit$coefficient_variance_matrices[[1]])
ses <- as.data.frame(ses)
ses$peptide <- rownames(fit$coefficients)
ses$lowqt <- qt(0.025, fit$feature_parameters$df)
ses$highqt <- qt(0.975, fit$feature_parameters$df)
cis <- melt(ses,
            id=c("peptide", "highqt", "lowqt"),
            variable.name = "group",
            value.name = "ses")
cis$low <- cis$lowqt * cis$ses
cis$high <- cis$highqt * cis$ses

df <- merge(dropdf, merge(melt(coef, value.name = "estimate"), melt(mu_ij, value.name = "value")))
df <- merge(df, cis)

df$estlow <- df$estimate + df$low
df$esthigh <- df$estimate + df$high
df$estrange <- df$esthigh - df$estlow

ggplot(df, aes(x=value, y=estimate, shape=group, color=factor(dropnum))) +
  geom_point() + geom_abline(slope=1, intercept=0) +
  scale_color_brewer(name="# dropouts", palette="Set2") +
  facet_wrap(~dropnum) + theme_bw()
```

Again, interesting! When the number of dropouts in a group is 0, 1, or even 2, proDA does a pretty remarkable job of recovering the true group mean. But when all of the replicates in a group are dropouts, we’re left with coefficient estimates that are all ~12. The 95% confidence interval ranges look like this:

```{r}
ggplot(df, aes(x=estrange, fill=factor(dropnum))) +
  stat_bin(bins=50) +
  scale_fill_brewer(name="# dropouts", palette="Set2") + # fill scale, matching the fill aesthetic
  facet_wrap(~dropnum, scales="free_y", ncol=1) + theme_bw() +
  xlab("width of 95% confidence interval")
```

So, as expected: tighter interval ranges for groups with all observations; wider ranges with fewer observations; and an essentially fixed (wide) interval when no observations are made at all.
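And a quick numeric check of the all-dropout behavior noted above (a sketch reusing the `df` data frame built earlier):

```{r}
# with every replicate dropped out, the estimates collapse toward a common value (~12)
with(subset(df, dropnum == nrep), c(mean = mean(estimate), sd = sd(estimate)))
```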

Lastly, let’s see how well proDA was able to estimate variances for each peptide:

```{r}
vardf <- data.frame(value = var_i,
                    estimate = fit$feature_parameters$s2,
                    missing = ncol(y_ij) - fit$feature_parameters$n_obs)

ggplot(vardf, aes(x=value, y=estimate)) + stat_binhex() +
  geom_abline(slope=1, intercept=0) + theme_bw() +
  xlab("true variance") + ylab("proDA estimated variance")

The majority of variances are estimated correctly, but the handful of peptides with the largest true variances are under-estimated, potentially inflating downstream contrast statistics.
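We can quantify that shrinkage at the top end (a small sketch using `vardf`):

```{r}
# among the 10% most-variable peptides, the fraction whose variance is under-estimated
topv <- vardf$value > quantile(vardf$value, 0.9)
mean(vardf$estimate[topv] < vardf$value[topv])
```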

# effects of normalization with dropouts

We observed earlier that the simulated data is already perfectly normalized – at least prior to having random low-abundance values removed by dropout; and since each sample has a slightly different dropout curve, the observed distributions may no longer be so perfectly comparable:

```{r}
ggplot(melt(y_ij, value.name = "abundance"),
       aes(x=sampleID, y=abundance, color=sampleID)) + geom_violin() +
  scale_color_brewer(palette="Set1") + theme_bw()
```

What happens if we quantile normalize the observations, and then compare to their original values (removing/ignoring any difference less than 0.2 units, for graphical clarity)?

```{r}
normy <- normalizeBetweenArrays(y_ij, method="quantile")
medy <- median_normalization(y_ij)
deltas <- colMeans(medy - y_ij, na.rm=T)
deltas <- data.frame(delta=deltas, sampleID = names(deltas))
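# note (editorial): medy/deltas above compute the analogous median-normalization
# shifts for comparison, but they are not used in the plot below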
ggplot(subset(melt(normy - y_ij, value.name = "delta"),
              abs(delta) >= 0.2),
       aes(x=sampleID, y=delta, color=sampleID)) + geom_quasirandom() + 
  scale_color_brewer(palette="Set1") + theme_bw()
```

It appears – at least for this simulated dataset, which doesn’t have gross differences in mean abundances or dropout rates – that quantile normalization has a large effect on only a small number of fragments. If we refit proDA to this (re)normalized data, we’d expect to see some differences:

```{r}
normfit <- proDA(normy, design = factor(rep(groups, each=nrep)), verbose=T)
```
```
Fitting the hyper-parameters for the probabilistic dropout model.
Starting iter: 1
The inferred parameters are:
location_prior_mean:     14.8
location_prior_scale:    3.99
dropout_curve_position:  10.2, 10.7, 11.2, 9.6, 10.7, 11.5, 9.11, 10, 12.5
dropout_curve_scale:     -4.64, -5.09, -5.11, -4.26, -4.89, -5.46, -3.62, -4.5, -6.11
variance_prior_scale:    4.24
variance_prior_df:       12.8
Error: 0.59

Starting iter: 2
Warning in solve.default(coef_hessian) :
  system is computationally singular: reciprocal condition number = 1.73106e-16
The inferred parameters are:
location_prior_mean:     14.8
location_prior_scale:    3.94
dropout_curve_position:  9.9, 10.4, 10.9, 9.33, 10.4, 11.3, 8.88, 9.75, 12.3
dropout_curve_scale:     -4.94, -5.48, -5.52, -4.52, -5.26, -5.96, -3.79, -4.79, -6.75
variance_prior_scale:    4.01
variance_prior_df:       16.3
Error: 0.13

Starting iter: 3
Warning in solve.default(coef_hessian) :
  system is computationally singular: reciprocal condition number = 1.50063e-16
The inferred parameters are:
location_prior_mean:     14.9
location_prior_scale:    3.79
dropout_curve_position:  9.76, 10.2, 10.8, 9.2, 10.3, 11.1, 8.79, 9.62, 12.2
dropout_curve_scale:     -5.13, -5.74, -5.83, -4.67, -5.52, -6.32, -3.88, -4.97, -7.28
variance_prior_scale:    3.99
variance_prior_df:       16.8
Error: 0.047

Starting iter: 4
Warning in solve.default(coef_hessian) :
  system is computationally singular: reciprocal condition number = 1.3672e-16
The inferred parameters are:
location_prior_mean:     14.9
location_prior_scale:    3.68
dropout_curve_position:  9.7, 10.2, 10.7, 9.17, 10.2, 11, 8.76, 9.57, 12.1
dropout_curve_scale:     -5.24, -5.87, -6, -4.74, -5.66, -6.54, -3.93, -5.07, -7.62
variance_prior_scale:    3.98
variance_prior_df:       17.3
Error: 0.017

Starting iter: 5
Warning in solve.default(coef_hessian) :
  system is computationally singular: reciprocal condition number = 1.29826e-16
The inferred parameters are:
location_prior_mean:     15
location_prior_scale:    3.61
dropout_curve_position:  9.68, 10.2, 10.7, 9.15, 10.2, 11, 8.76, 9.54, 12
dropout_curve_scale:     -5.3, -5.95, -6.11, -4.78, -5.75, -6.67, -3.96, -5.13, -7.85
variance_prior_scale:    3.97
variance_prior_df:       17.7
Error: 0.0066

Starting iter: 6
Warning in solve.default(coef_hessian) :
  system is computationally singular: reciprocal condition number = 1.26042e-16
The inferred parameters are:
location_prior_mean:     15
location_prior_scale:    3.56
dropout_curve_position:  9.67, 10.1, 10.7, 9.14, 10.2, 11, 8.76, 9.53, 12
dropout_curve_scale:     -5.34, -6.01, -6.19, -4.81, -5.8, -6.76, -3.97, -5.16, -8.01
variance_prior_scale:    3.96
variance_prior_df:       17.9
Error: 0.0032

Starting iter: 7
Warning in solve.default(coef_hessian) :
  system is computationally singular: reciprocal condition number = 1.23447e-16
The inferred parameters are:
location_prior_mean:     15
location_prior_scale:    3.52
dropout_curve_position:  9.66, 10.1, 10.6, 9.14, 10.2, 11, 8.76, 9.52, 12
dropout_curve_scale:     -5.37, -6.04, -6.24, -4.82, -5.83, -6.82, -3.98, -5.19, -8.09
variance_prior_scale:    3.96
variance_prior_df:       18.2
Error: 0.0016

Starting iter: 8
Warning in solve.default(coef_hessian) :
  system is computationally singular: reciprocal condition number = 1.21986e-16
Converged!
The inferred parameters are:
location_prior_mean:     15
location_prior_scale:    3.5
dropout_curve_position:  9.66, 10.1, 10.6, 9.14, 10.1, 11, 8.77, 9.52, 12
dropout_curve_scale:     -5.37, -6.07, -6.26, -4.83, -5.86, -6.85, -3.98, -5.19, -8.17
variance_prior_scale:    3.96
variance_prior_df:       18.3
Error: 0.00052
```

```{r}
normcoef <- normfit$coefficients
names(dimnames(normcoef)) <- c("peptide", "group")

coefdiff <- normcoef - coef
names(dimnames(coefdiff)) <- c("peptide", "group")
ggplot(subset(melt(coefdiff, value.name = "diff"),
              abs(diff) >= 0.2),
       aes(x=group, y=diff)) + geom_quasirandom() +
  theme_bw() + ylab("difference in coefficient estimates")


ggplot(merge(melt(coefdiff, value.name="diff"),
             melt(coef, value.name="estimate")),
       aes(x=estimate, y=diff)) + geom_point() + facet_wrap(~group) + theme_bw() +
  ylab("difference in coefficient estimates") + xlab("original estimate")

Perhaps as expected, the few coefficients grossly affected by the normalization procedure are confined to the (mostly upper) extremes of the abundance scale.

But let’s also look at what happens to the dropout curve parameters:

```{r}
normhps <- normfit$hyper_parameters
ggplot(rbind(data.frame(variable="dropout position",
                        value=rho_j,
                        estimate=normhps$dropout_curve_position,
                        sampleID = names(rho_j)),
             data.frame(variable="dropout scale",
                        value=zeta_j,
                        estimate=-normhps$dropout_curve_scale,
                        sampleID = names(zeta_j))),
       aes(x=value, y=estimate)) + 
  geom_abline(slope=1, intercept=0) +
  stat_smooth(method="lm", formula = y ~ x) +
  geom_point(aes(color=sampleID)) +
  facet_wrap(~variable, scales="free") + theme_bw()
```

Renormalization forces the estimated dropout midpoints toward one another: curves with truly lower midpoints are over-estimated, and those with truly higher midpoints are under-estimated (while the scale remains, as before, uniformly over-estimated). These small distortions in the dropout curves could have knock-on effects on the coefficient estimates for groups with multiple dropout samples:

```{r}
normses <- matrix(unlist(lapply(lapply(normfit$coefficient_variance_matrices, diag), sqrt)),
                  ncol=3, byrow=T)
colnames(normses) <- colnames(normfit$coefficient_variance_matrices[[1]])
normses <- as.data.frame(normses)
normses$peptide <- rownames(normfit$coefficients)
normses$lowqt <- qt(0.025, normfit$feature_parameters$df)
normses$highqt <- qt(0.975, normfit$feature_parameters$df)
normcis <- melt(normses,
                id=c("peptide", "highqt", "lowqt"),
                variable.name = "group",
                value.name = "ses")
normcis$low <- normcis$lowqt * normcis$ses
normcis$high <- normcis$highqt * normcis$ses

normdf <- merge(dropdf, merge(melt(normcoef, value.name = "estimate"), melt(mu_ij, value.name = "value")))
normdf <- merge(normdf, normcis)

normdf$estlow <- normdf$estimate + normdf$low
normdf$esthigh <- normdf$estimate + normdf$high
normdf$estrange <- normdf$esthigh - normdf$estlow

ggplot(normdf, aes(x=value, y=estimate, shape=group, color=factor(dropnum))) +
  geom_point() + geom_abline(slope=1, intercept=0) +
  scale_color_brewer(name="# dropouts", palette="Set2") +
  facet_wrap(~dropnum) + theme_bw()
```

# Sample QC using distances

We know that differences in dropout rates (and the zero-imputation they typically require) can lead to spurious PCA projections during sample-level QC; proDA provides a probabilistic distance measure that could avoid this. Let’s explore:

```{r}
maty <- y_ij
maty[is.na(maty)] <- log2(1)
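# log2(1) = 0: dropouts are zero-imputed before normalization and PCA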
maty <- normalizeBetweenArrays(maty, method="quantile")

pca <- prcomp(maty, scale=T)
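# rows of maty are peptides (observations), so the per-sample loadings in
# pca$rotation serve as sample coordinates for this QC plot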
ggplot(as.data.frame(pca$rotation), aes(x=PC1, y=PC2, color=colnames(maty))) +
  geom_point() + theme_bw()


dm <- dist_approx(fit)
heatmap(as.matrix(dm$mean))
```

Although the samples differ in how many dropouts they carry, the pairwise sample distances in the heatmap appear uniform within and between the experimental groups, as expected.
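Finally, `dist_approx` also reports an uncertainty for each pairwise distance (assuming, per the proDA documentation, that the returned list carries an `sd` element alongside `mean`), which we can inspect the same way:

```{r}
# standard deviations of the approximated pairwise distances (first few samples)
round(as.matrix(dm$sd)[1:3, 1:3], 2)
```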
