This is a companion to JAGS Analysis of Copy Number, that investigates the differences between inference when using censored genotype data.

Problem

Alleles are present in a population with \(0\), \(1\) or \(2\) copies per chromosome. We can assume that copy number \(0\) alleles are rare enough that the frequency of copy number \(0\) genotypes is negligible.

We have genotype frequencies for individuals with \(1\), \(2\), and \(3+\) alleles, as for technical reasons it is difficult to distinguish between copy number \(3\) and \(4\).

How much to we lose by not distinguishing copy number \(3\) and \(4\)?

Comparison of Censored and Uncensored Data

Simulate data under Hardy-Weinberg. The function bnelow produces a matrix, with replicates by column, and the number of genotypes with copy number count \(i\) in row \(i\). Genotypes with copy number 0 are dropped.

sim <- function(ndatasets=100, samplesize=1000, p0=0.02, p2=0.3) {
  if (p0+p2>1|p0<0|p2<0)
    stop("p2 and p2 must be positive and sum to below 1")
  p <- c(p0, 1-p0-p2, p2)
  samp <- function() {
    m <- matrix(sample(0:2, samplesize, prob=p, replace=TRUE),ncol=2)  
    tabulate(rowSums(m), nbins=4)     ## removes 0s
  }
 replicate(ndatasets, samp()) 
}
library(pander)
pander(sim(ndatasets=6, samplesize=200, p0=0.1, p2=0.2))

24	15	16	21	16	13
47	52	48	48	58	54
27	30	31	26	22	27
1	3	4	3	4	4

JAGS Code for Analysis

I use JAGS (Plummer and others 2003), within the rjags (Plummer 2016) package for this analysis. I have jags code that analyses both models independently together.

jags_both_models <- "
model {
  ###  The full model under Hardy-Weinberg
  x[1] <- 2*p[1]*p[2]  
  x[2] <- 2*p[1]*p[3] + p[2]*p[2]
  x[3] <- 2*p[2]*p[3]
  x[4] <- p[3]*p[3]
  y ~ dmulti(x, n)
  ### The reduced model under Hardy-Weinberg
  x2[1] <- 2*p2[1]*p2[2] 
  x2[2] <- 2*p2[1]*p2[3] + p2[2]*p2[2]
  x2[3] <- 2*p2[2]*p2[3] + p2[3]*p2[3]
  y2 ~ dmulti(x2, n)
  ### priors for both models
  for (k in 1:3) {
    a[k] ~ dexp(1)
    a2[k] ~ dexp(1)
    p[k] <- a[k]/(a[1] + a[2] + a[3])
    p2[k] <- a2[k]/(a2[1] + a2[2] + a2[3])
  }
}
"

For these simulations I use the parallel package that is built into R (R Core Team 2016). I use the function parLapply which is the parallel version of the lapply command, which performs the same calculation on every item of a list. I convert the simulated data matrix to a data.frame and then analyse every column.

parallel_analysis <- function(data_list, modeltext, st.params) {
  library(parallel)
  library(rjags)
  
  no_cores <- detectCores() - 1                     ## Use available cores -1
  local_cluster <- makeCluster(no_cores)            ## Initiate cluster
  
  clusterExport(local_cluster, "jags.model")        ## Export the rjags functions
  clusterExport(local_cluster, "coda.samples")       

  run_jags_instance <- function(bb, ModelText, init.params) {
    local_data <- list(n=sum(bb), y=bb, y2=c(bb[1:2], bb[3]+bb[4]))     ## create data list for jags
    jags <- jags.model(textConnection(ModelText), data=local_data, n.chains=4, init=init.params)
    update(jags, 2000)
    s <- coda.samples(jags, c("p", "p2"), 2500, thin=5)
    summary(s)$q                        ## return the quantiles
  }
  xx <- simulated_data     
  res <- parLapply(local_cluster, data_list, run_jags_instance, ModelText=modeltext, init.params=st.params)
  stopCluster(local_cluster)
  return(res)
}

simulated_data <- sim(ndatasets = 1000, p0=0.01, p2=0.1)
start.a = list(a=c(1, 10, 3), a2=c(1, 10, 3))
posterior_results <- parallel_analysis(data.frame(simulated_data), modeltext=jags_both_models, st.params = start.a)

The results are a list of quantiles for each cooumn of the simulated data. Compare the medians of the posterior output for the frequency of copy number 2 for both models.

library(ggplot2)
meds2 <- data.frame(t(sapply(posterior_results, function(x) x[c(3, 6), 3])))
colnames(meds2) <- c("Full", "Censored")
ggplot(meds2, aes(Full, Censored)) + geom_point() +ggtitle("Estimated Median Frequency of CN 2 Allele")

By eye seems to be some tendency for the censored model using censored data to underestimate the frequency of the copy number 2 allele relative to the full model.

Copy Number 0

meds0 <- data.frame(t(sapply(posterior_results, function(x) x[c(1, 4), 3])))
colnames(meds0) <- c("Full", "Censored")
ggplot(meds0, aes(Full, Censored)) + geom_point() +ggtitle("Estimated Median Frequency of CN 0 Allele")

However, there is little difference for the frequency of the the copy number 0 allele, which is what we are most interested in. We can see how much more variable the estimates are per replicate by looking at posterior intervals.

Width of the equal tailed 95% posterior probability interval

Mean Absolute Deviation

Look at the mean absolute deviation of the median of each replicate from the true value.

	Full	Censored
mad0	0.003498	0.00353
mad2	0.009767	0.0103

There is little difference in performance.

References

Plummer, Martyn. 2016. Rjags: Bayesian Graphical Models Using Mcmc. https://CRAN.R-project.org/package=rjags.

Plummer, Martyn, and others. 2003. “JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling.” In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 124:125. Vienna.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Estimating Copy Number Frequencies from Censored Genotype Data