Troubleshooting - suite
I wrote a script to extract the MBQ (median base quality) value for
the vcf file.
perl ~/Pando/subclone/scripts/extractMQB.pl
pando_subclone_bamlist.vcf
This vcf file is Pando and subclone variants called together and not
filtered for crap yet
(/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/pando_subclone_variants/pando_subclone_bamlist.vcf).
I did the same for the two other scores that we filter as
probabilities.
perl ~/Pando/subclone/scripts/extractBQB.pl
pando_subclone_bamlist.vcf
knitr::opts_chunk$set(echo = TRUE)
# upload data
mbq <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/MBQ.txt", sep = "\t")
bqb <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/BQB.txt", sep = "\t")
rpb <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/RPB.txt", sep = "\t")
# visualize MBQ
par(mfrow=c(1,2), family = "Times New Roman")
hist(mbq$V1, 50, main = "MBQ score", xlab = "MBQ score", col = "#ef3b2c")
mbq_scaled <- log10(mbq)
hist(mbq_scaled$V1, 50, main = "MBQ score (log10)", xlab = "MBQ score (log10)", col = "#f16913")
# visualize BQB
par(mfrow=c(1,2), family = "Times New Roman")
hist(bqb$V1, 50, main = "BQB score", xlab = "BQB score", col = "#ef3b2c")
bqb_scaled <- log10(bqb)
hist(bqb_scaled$V1, 50, main = "BQB score (log10)", xlab = "BQB score (log10)", col = "#f16913")
# visualize RPB
par(mfrow=c(1,2), family = "Times New Roman")
hist(rpb$V1, 50, main = "RPB score", xlab = "RPB score", col = "#ef3b2c")
rpb_scaled <- log10(rpb)
hist(rpb_scaled$V1, 50, main = "RPB score (log10)", xlab = "RPB score (log10)", col = "#f16913")
Some values are >1 - are we dealing with a probability or an
integer for this score value? I will check the version I am using to
call the variants and read the manual for this version.
The command line I used in the original file is:
samtools mpileup -C 50 -d 250 -f $ref -q 30 -Q 20 -g -I -t DP,AD -u
-b $bam_list -o $out_dir/pando_subclone_bamlist.bcf
bcftools call -v -m -p 0.1 -P 0.01 -O v -o $out_dir/pando_subclone_bamlist.vcf $out_dir/pando_subclone_bamlist.bcf
If we believe this: “The MBQ value is expressed on the Phred scale, which is a logarithmic scale used to represent the quality scores in genetics. On the Phred scale, a higher score corresponds to a higher probability of the base being called correctly. Typically, quality scores are reported as negative logarithm base 10 of the error probability. So, for example, an MBQ value of 20 corresponds to an error probability of 1 in 100 (10^(-20/10)), while an MBQ value of 30 corresponds to an error probability of 1 in 1,000 (10^(-30/10)).
Therefore, the MBQ value represents the mapping quality of the base
with respect to the variant allele, and higher MBQ values indicate
higher confidence in the accuracy of the base call.” then it should be a
score (but the vealues are pretty low for a Phred score).
What filter value did we usually use?
If the value in the vcf was less than our set threshold of
0.005, then we would filter out the variant. Maybe that is part
of the problem? Interesting.
Conclusion: re-filter files with higher threshold?
Two datasets here : 2024 (replicates, 12x8) and the same 12 samples
than the replicate dataset from the subclone (with 2018 and 2022 called
together). Goal: I want to know how many SNPs called in this subclone +
Pando dataset were found in the replicate dataset, in the same set of
samples. This will tell us how many mutations we might be missing when
there is only one replicate per sample.
pnt_est <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/pntest_reps_variants_april25.txt", sep = " ") # 43 x 91
depth <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/replicate_samples_april25_depth.txt", sep = "\t", header = T) # 43 x 91
ids <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/4_troubleshooting/ids_april25.txt", sep = "\t", header = T) # 91 x 3
I will do not any filtering by depth for now.
One thing that I am interested in is: if the mutation is found in the
replicated sample (2024 dataset), is it also found in the original
single sequenced one (2022 dataset)?
# -----------------------------------------------------------------------------
# BINARIZATION
# -----------------------------------------------------------------------------
pn_bin <- matrix(NaN, dim(pnt_est)[1], dim(pnt_est)[2])
threshold <- 0.5
for (x in 1:dim(pnt_est)[1]) {
for (y in 1:dim(pnt_est)[2]) {
if (pnt_est[x,y]>threshold) {pn_bin[x,y] <- 1 } # if proba of being heteroZ > threshold, label with 1
else {pn_bin[x,y] <- 0 }
}
}
nInd <- 91 # number of individuals in the dataset
hets <- rowSums(pn_bin)/nInd
#length(which(hets == 0)) # 0
#length(which(rowSums(pn_bin)>1)) # 43
nSNP <- 43 # number of SNPs in the dataset
par(family="Times New Roman", cex.lab=1.2, cex.axis=1.2, cex.main=1, cex.sub=1)
hist(hets, xlab = "Heterozygosity", ylab = "counts", 100, main = "hets per SNP in replicated samples (2022 and 2024 datasets)",
col = "#2ca25f")
levels(as.factor(ids$group_ID)) # missing the replicate samples for group #8
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
group <- c(1:7,9:12)
# (1) in how many samples of the replicate dataset do we find the same mutation?
rep_idx <- ids$group=="replicates"
pn_bin_rep <- pn_bin[,rep_idx] # 43 x 79
ids_rep <- ids[rep_idx,] # 79 x 3
pn_bin_sub <- pn_bin[,!rep_idx] # 43 x 12
ids_sub <- ids[!rep_idx,] # 12 x 3
store_prop <- c()
# number of mutations found in every replicate, normalized by the mean number of mutations found in each replicate
for (rep in 1:length(group)) {
inds <- which( ids_rep$group_ID == group[rep] )
ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation
mean_n_mut <- mean(colSums(pn_bin[,inds])) # mean number of mutations per sample
#prop <- c(length(which( ind_prop == 0 )) + length(which( ind_prop == 1 )) ) / nSNP
prop <- c( length(which( ind_prop == 1 )) ) / mean_n_mut
store_prop <- c(store_prop, prop)
}
data_store_prop <- store_prop
par(family = "Times New Roman")
plot(data_store_prop, pch = 16, ylim = c(0,1), xlab = "replicate", ylab = "Prop. of SNP in common per set of reps")
abline(h=mean(store_prop), col = "red") # 0.66
We have a good replication (66%) within replicate groups.
When we have the mutation in 60% of the replicate, do we find it in the single sample? what threshold do we lose it at?
store_replicability <- matrix(NA, 1, 11)
for (rep in 1:length(group)) {
inds <- which( ids_rep$group_ID == group[rep] )
ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation
idx_snp <- which(ind_prop >= 0.6)
store_replicability[rep] <- length(which(pn_bin_sub[idx_snp,group[rep]] == 1) ) / length(idx_snp)
}
mean_dt <- mean(store_replicability) # 0.89
# Compare to random
N <- 1000
store_replicability_rdn <- matrix(NA, 1, N)
for (n in 1:N) {
store_replicability <- matrix(NA, 1, 11)
for (rep in 1:length(group)) {
group_rdn <- sample(ids$group_ID)
inds <- which( group_rdn == group[rep] )
#print(inds)
ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation
idx_snp <- which(ind_prop >= 0.6)
store_replicability[rep] <- length(which(pn_bin_sub[idx_snp,group[rep]] == 1) ) / length(idx_snp)
}
store_replicability_rdn[n] <- mean(store_replicability)
}
mean(store_replicability_rdn) # 0.89
## [1] 0.8980819
# Maybe most mutations are found in all reps?
# Plot
par(family = "Times New Roman")
hist(store_replicability_rdn, xlab = "Number of times the single sample has the SNP found in the replicates", 100, col = "blue", main = "")
abline(v=mean(mean_dt), col = "red", lwd = 2) # 0.66
legend("topright", legend = c("mean", "sim"), col = c("red", "blue"), pch = 18, bty = "n")
I shuffle the indivivuals and calculate the proportion of SNPs that,
when found in the set of randomly chosen individuals, is also found in
the single sample from the subclone.
The data is not different
from a randomized distribution.
Conclusion from this analysis:
what is good is that, if the mutation is found in more than 60% of the
samples, it is also found in the single sample. However, we mostly have
SNPs that seem to be found in every sample, as compared to unique to
every subgroup.
I am curious how this looks like if I release the threshold from 60% to 30%.
store_replicability <- matrix(NA, 1, 11)
for (rep in 1:length(group)) {
inds <- which( ids_rep$group_ID == group[rep] )
ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation
idx_snp <- which(ind_prop >= 0.3)
store_replicability[rep] <- length(which(pn_bin_sub[idx_snp,group[rep]] == 1) ) / length(idx_snp)
}
mean_dt <- mean(store_replicability) # 0.89
# Compare to random
N <- 1000
store_replicability_rdn <- matrix(NA, 1, N)
for (n in 1:N) {
store_replicability <- matrix(NA, 1, 11)
for (rep in 1:length(group)) {
group_rdn <- sample(ids$group_ID)
inds <- which( group_rdn == group[rep] )
#print(inds)
ind_prop <- c(rowSums(pn_bin[,inds])/length(inds)) # proportion of individuals with the mutation
idx_snp <- which(ind_prop >= 0.3)
store_replicability[rep] <- length(which(pn_bin_sub[idx_snp,group[rep]] == 1) ) / length(idx_snp)
}
store_replicability_rdn[n] <- mean(store_replicability)
}
mean(store_replicability_rdn) # 0.89
## [1] 0.8590871
# Maybe most mutations are found in all reps?
# Plot
par(family = "Times New Roman")
hist(store_replicability_rdn, xlab = "Number of times the single sample has the SNP found in the replicates", 100, col = "blue", main = "")
abline(v=mean(mean_dt), col = "red", lwd = 2) # 0.66
legend("topright", legend = c("mean", "sim"), col = c("red", "blue"), pch = 18, bty = "n")