Part 1 - vcf filters : p-value or score?

Troubleshooting - suite

I wrote a script to extract the MBQ (median base quality) value for the vcf file.

perl ~/Pando/subclone/scripts/extractMQB.pl pando_subclone_bamlist.vcf

This vcf file is Pando and subclone variants called together and not filtered for crap yet (/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/pando_subclone_variants/pando_subclone_bamlist.vcf).

I did the same for the two other scores that we filter as probabilities.
perl ~/Pando/subclone/scripts/extractBQB.pl pando_subclone_bamlist.vcf

knitr::opts_chunk$set(echo = TRUE)
# upload data
mbq <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/MBQ.txt", sep = "\t")
bqb <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/BQB.txt", sep = "\t")
rpb <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/RPB.txt", sep = "\t")

# visualize MBQ
par(mfrow=c(1,2), family = "Times New Roman")
hist(mbq$V1, 50, main = "MBQ score", xlab = "MBQ score", col = "#ef3b2c")

mbq_scaled <- log10(mbq)
hist(mbq_scaled$V1, 50, main = "MBQ score (log10)", xlab = "MBQ score (log10)", col = "#f16913")

# visualize BQB
par(mfrow=c(1,2), family = "Times New Roman")
hist(bqb$V1, 50, main = "BQB score", xlab = "BQB score", col = "#ef3b2c")

bqb_scaled <- log10(bqb)
hist(bqb_scaled$V1, 50, main = "BQB score (log10)", xlab = "BQB score (log10)", col = "#f16913")

# visualize RPB
par(mfrow=c(1,2), family = "Times New Roman")
hist(rpb$V1, 50, main = "RPB score", xlab = "RPB score", col = "#ef3b2c")

rpb_scaled <- log10(rpb)
hist(rpb_scaled$V1, 50, main = "RPB score (log10)", xlab = "RPB score (log10)", col = "#f16913")

Some values are >1 - are we dealing with a probability or an integer for this score value? I will check the version I am using to call the variants and read the manual for this version.

The command line I used in the original file is:

samtools mpileup -C 50 -d 250 -f $ref -q 30 -Q 20 -g -I -t DP,AD -u -b $bam_list -o $out_dir/pando_subclone_bamlist.bcf

bcftools call -v -m -p 0.1 -P 0.01 -O v -o $out_dir/pando_subclone_bamlist.vcf $out_dir/pando_subclone_bamlist.bcf

If we believe this: “The MBQ value is expressed on the Phred scale, which is a logarithmic scale used to represent the quality scores in genetics. On the Phred scale, a higher score corresponds to a higher probability of the base being called correctly. Typically, quality scores are reported as negative logarithm base 10 of the error probability. So, for example, an MBQ value of 20 corresponds to an error probability of 1 in 100 (10^(-20/10)), while an MBQ value of 30 corresponds to an error probability of 1 in 1,000 (10^(-30/10)).

Therefore, the MBQ value represents the mapping quality of the base with respect to the variant allele, and higher MBQ values indicate higher confidence in the accuracy of the base call.” then it should be a score (but the vealues are pretty low for a Phred score).

What filter value did we usually use?

If the value in the vcf was less than our set threshold of 0.005, then we would filter out the variant. Maybe that is part of the problem? Interesting.

Conclusion: re-filter files with higher threshold?

Part 2 - replicability of the replicate samples between two different sequencing batches

Two datasets here : 2024 (replicates, 12x8) and the same 12 samples than the replicate dataset from the subclone (with 2018 and 2022 called together). Goal: I want to know how many SNPs called in this subclone + Pando dataset were found in the replicate dataset, in the same set of samples. This will tell us how many mutations we might be missing when there is only one replicate per sample.

pnt_est <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/pntest_reps_variants_april25.txt", sep = " ") # 43 x 91

depth <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/replicate_samples_april25_depth.txt", sep = "\t", header = T) # 43 x 91

ids <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/ids_april25.txt", sep = "\t", header = T) # 91 x 3

I will do not any filtering by depth for now.

One thing that I am interested in is: if the mutation is found in the replicated sample (2024 dataset), is it also found in the original single sequenced one (2022 dataset)?

# -----------------------------------------------------------------------------
# BINARIZATION
# -----------------------------------------------------------------------------
pn_bin <- matrix(NaN, dim(pnt_est)[1], dim(pnt_est)[2])
threshold <- 0.5

for (x in 1:dim(pnt_est)[1]) {
  
  for (y in 1:dim(pnt_est)[2]) {
    
    if (pnt_est[x,y]>threshold) {pn_bin[x,y] <- 1 } # if proba of being heteroZ > threshold, label with 1
    
    else {pn_bin[x,y] <- 0 }
    
  }
}


nInd <- 91 # number of individuals in the dataset

hets <- rowSums(pn_bin)/nInd
#length(which(hets == 0)) # 0
#length(which(rowSums(pn_bin)>1)) # 43

nSNP <-  43 # number of SNPs in the dataset

par(family="Times New Roman", cex.lab=1.2, cex.axis=1.2, cex.main=1, cex.sub=1)
hist(hets, xlab = "Heterozygosity", ylab = "counts", 100, main = "hets per SNP in replicated samples (2022 and 2024 datasets)", 
     col = "#2ca25f")

levels(as.factor(ids$group_ID)) # missing the replicate samples for group #8
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
group <- c(1:7,9:12) 

# (1) in how many samples of the replicate dataset do we find the same mutation?
rep_idx <- ids$group=="replicates"
pn_bin_rep <- pn_bin[,rep_idx] # 43 x 79
ids_rep <- ids[rep_idx,] # 79 x 3

pn_bin_sub <- pn_bin[,!rep_idx] # 43 x 12
ids_sub <- ids[!rep_idx,] # 12 x 3

store_prop <- c()

# number of mutations found in every replicate, normalized by the mean number of mutations found in each replicate
for (rep in 1:length(group)) {
  
  inds <- which( ids_rep$group_ID == group[rep] )

  ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation

  mean_n_mut <- mean(colSums(pn_bin[,inds])) # mean number of mutations per sample
  
  #prop <- c(length(which( ind_prop == 0 )) + length(which( ind_prop == 1 )) ) / nSNP
  prop <- c( length(which( ind_prop == 1 )) ) / mean_n_mut
  store_prop <- c(store_prop, prop)
  
}

data_store_prop <- store_prop
par(family = "Times New Roman")
plot(data_store_prop, pch = 16, ylim = c(0,1), xlab = "replicate", ylab = "Prop. of SNP in common per set of reps")
abline(h=mean(store_prop), col = "red") # 0.66

We have a good replication (66%) within replicate groups.

When we have the mutation in 60% of the replicate, do we find it in the single sample? what threshold do we lose it at?

store_replicability <- matrix(NA, 1, 11)

for (rep in 1:length(group)) {
  
  inds <- which( ids_rep$group_ID == group[rep] )
  ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation
  idx_snp <- which(ind_prop >= 0.6)
  store_replicability[rep] <- length(which(pn_bin_sub[idx_snp,group[rep]] == 1) ) / length(idx_snp)
  
}

mean_dt <- mean(store_replicability) # 0.89 

# Compare to random
N <- 1000
store_replicability_rdn <- matrix(NA, 1, N)

for (n in 1:N) {
  
  store_replicability <- matrix(NA, 1, 11)
  
  for (rep in 1:length(group)) {
  
    group_rdn <- sample(ids$group_ID)
    inds <- which( group_rdn == group[rep] )
    #print(inds)
    ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation
    idx_snp <- which(ind_prop >= 0.6)
    
    store_replicability[rep] <- length(which(pn_bin_sub[idx_snp,group[rep]] == 1) ) / length(idx_snp)
  
  }
  
  store_replicability_rdn[n] <- mean(store_replicability)
  
}

mean(store_replicability_rdn) # 0.89
## [1] 0.8980819
# Maybe most mutations are found in all reps?

# Plot 
par(family = "Times New Roman")
hist(store_replicability_rdn, xlab = "Number of times the single sample has the SNP found in the replicates", 100, col = "blue", main = "")
abline(v=mean(mean_dt), col = "red", lwd = 2) # 0.66
legend("topright", legend = c("mean", "sim"), col = c("red", "blue"), pch = 18, bty = "n")

I shuffle the indivivuals and calculate the proportion of SNPs that, when found in the set of randomly chosen individuals, is also found in the single sample from the subclone.
The data is not different from a randomized distribution.
Conclusion from this analysis: what is good is that, if the mutation is found in more than 60% of the samples, it is also found in the single sample. However, we mostly have SNPs that seem to be found in every sample, as compared to unique to every subgroup.

I am curious how this looks like if I release the threshold from 60% to 30%.

store_replicability <- matrix(NA, 1, 11)

for (rep in 1:length(group)) {
  
  inds <- which( ids_rep$group_ID == group[rep] )
  ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation
  idx_snp <- which(ind_prop >= 0.3)
  store_replicability[rep] <- length(which(pn_bin_sub[idx_snp,group[rep]] == 1) ) / length(idx_snp)
  
}

mean_dt <- mean(store_replicability) # 0.89 

# Compare to random
N <- 1000
store_replicability_rdn <- matrix(NA, 1, N)

for (n in 1:N) {
  
  store_replicability <- matrix(NA, 1, 11)
  
  for (rep in 1:length(group)) {
  
    group_rdn <- sample(ids$group_ID)
    inds <- which( group_rdn == group[rep] )
    #print(inds)
    ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation
    idx_snp <- which(ind_prop >= 0.3)
    
    store_replicability[rep] <- length(which(pn_bin_sub[idx_snp,group[rep]] == 1) ) / length(idx_snp)
  
  }
  
  store_replicability_rdn[n] <- mean(store_replicability)
  
}

mean(store_replicability_rdn) # 0.89
## [1] 0.8590871
# Maybe most mutations are found in all reps?

# Plot 
par(family = "Times New Roman")
hist(store_replicability_rdn, xlab = "Number of times the single sample has the SNP found in the replicates", 100, col = "blue", main = "")
abline(v=mean(mean_dt), col = "red", lwd = 2) # 0.66
legend("topright", legend = c("mean", "sim"), col = c("red", "blue"), pch = 18, bty = "n")