This is a companion to JAGS Analysis of Copy Number, that investigates the differences between inference when using censored genotype data.
Alleles are present in a population with \(0\), \(1\) or \(2\) copies per chromosome. We can assume that copy number \(0\) alleles are rare enough that the frequency of copy number \(0\) genotypes is negligible.
We have genotype frequencies for individuals with \(1\), \(2\), and \(3+\) alleles, as for technical reasons it is difficult to distinguish between copy number \(3\) and \(4\).
How much to we lose by not distinguishing copy number \(3\) and \(4\)?
Simulate data under Hardy-Weinberg. The function bnelow produces a matrix, with replicates by column, and the number of genotypes with copy number count \(i\) in row \(i\). Genotypes with copy number 0 are dropped.
sim <- function(ndatasets=100, samplesize=1000, p0=0.02, p2=0.3) {
if (p0+p2>1|p0<0|p2<0)
stop("p2 and p2 must be positive and sum to below 1")
p <- c(p0, 1-p0-p2, p2)
samp <- function() {
m <- matrix(sample(0:2, samplesize, prob=p, replace=TRUE),ncol=2)
tabulate(rowSums(m), nbins=4) ## removes 0s
}
replicate(ndatasets, samp())
}
library(pander)
pander(sim(ndatasets=6, samplesize=200, p0=0.1, p2=0.2))
| 24 | 15 | 16 | 21 | 16 | 13 |
| 47 | 52 | 48 | 48 | 58 | 54 |
| 27 | 30 | 31 | 26 | 22 | 27 |
| 1 | 3 | 4 | 3 | 4 | 4 |
I use JAGS (Plummer and others 2003), within the rjags (Plummer 2016) package for this analysis. I have jags code that analyses both models independently together.
jags_both_models <- "
model {
### The full model under Hardy-Weinberg
x[1] <- 2*p[1]*p[2]
x[2] <- 2*p[1]*p[3] + p[2]*p[2]
x[3] <- 2*p[2]*p[3]
x[4] <- p[3]*p[3]
y ~ dmulti(x, n)
### The reduced model under Hardy-Weinberg
x2[1] <- 2*p2[1]*p2[2]
x2[2] <- 2*p2[1]*p2[3] + p2[2]*p2[2]
x2[3] <- 2*p2[2]*p2[3] + p2[3]*p2[3]
y2 ~ dmulti(x2, n)
### priors for both models
for (k in 1:3) {
a[k] ~ dexp(1)
a2[k] ~ dexp(1)
p[k] <- a[k]/(a[1] + a[2] + a[3])
p2[k] <- a2[k]/(a2[1] + a2[2] + a2[3])
}
}
"
For these simulations I use the parallel package that is built into R (R Core Team 2016). I use the function parLapply which is the parallel version of the lapply command, which performs the same calculation on every item of a list. I convert the simulated data matrix to a data.frame and then analyse every column.
parallel_analysis <- function(data_list, modeltext, st.params) {
library(parallel)
library(rjags)
no_cores <- detectCores() - 1 ## Use available cores -1
local_cluster <- makeCluster(no_cores) ## Initiate cluster
clusterExport(local_cluster, "jags.model") ## Export the rjags functions
clusterExport(local_cluster, "coda.samples")
run_jags_instance <- function(bb, ModelText, init.params) {
local_data <- list(n=sum(bb), y=bb, y2=c(bb[1:2], bb[3]+bb[4])) ## create data list for jags
jags <- jags.model(textConnection(ModelText), data=local_data, n.chains=4, init=init.params)
update(jags, 2000)
s <- coda.samples(jags, c("p", "p2"), 2500, thin=5)
summary(s)$q ## return the quantiles
}
xx <- simulated_data
res <- parLapply(local_cluster, data_list, run_jags_instance, ModelText=modeltext, init.params=st.params)
stopCluster(local_cluster)
return(res)
}
simulated_data <- sim(ndatasets = 1000, p0=0.01, p2=0.1)
start.a = list(a=c(1, 10, 3), a2=c(1, 10, 3))
posterior_results <- parallel_analysis(data.frame(simulated_data), modeltext=jags_both_models, st.params = start.a)
The results are a list of quantiles for each cooumn of the simulated data. Compare the medians of the posterior output for the frequency of copy number 2 for both models.
library(ggplot2)
meds2 <- data.frame(t(sapply(posterior_results, function(x) x[c(3, 6), 3])))
colnames(meds2) <- c("Full", "Censored")
ggplot(meds2, aes(Full, Censored)) + geom_point() +ggtitle("Estimated Median Frequency of CN 2 Allele")
By eye seems to be some tendency for the censored model using censored data to underestimate the frequency of the copy number 2 allele relative to the full model.
meds0 <- data.frame(t(sapply(posterior_results, function(x) x[c(1, 4), 3])))
colnames(meds0) <- c("Full", "Censored")
ggplot(meds0, aes(Full, Censored)) + geom_point() +ggtitle("Estimated Median Frequency of CN 0 Allele")
However, there is little difference for the frequency of the the copy number 0 allele, which is what we are most interested in. We can see how much more variable the estimates are per replicate by looking at posterior intervals.
Look at the mean absolute deviation of the median of each replicate from the true value.
| Full | Censored | |
|---|---|---|
| mad0 | 0.003498 | 0.00353 |
| mad2 | 0.009767 | 0.0103 |
There is little difference in performance.
Plummer, Martyn. 2016. Rjags: Bayesian Graphical Models Using Mcmc. https://CRAN.R-project.org/package=rjags.
Plummer, Martyn, and others. 2003. “JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling.” In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 124:125. Vienna.
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.