Dataset details

The variants were called on the 2022 dataset only (two subparts of the Pando clone with different tissues types: roots, branches, shoots and leaves, 117 samples).

The vcf before any filtering is here: /uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/subpando_variants.vcf

Initial filtering steps: /uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/scripts/vcfFilter.sh and vcfFilter_subclone.pl

Finished filtering subpando_variants.vcf Retained 8806 variable loci

Filter against PON and Friends:

bgzip filtered_final_subpando_variants.vcf
bcftools index filtered_final_subpando_variants.vcf.gz

some vcf files are here: /uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/pando_data_2022/all_bams I will filter the PON vcf file to make sure it is the same filtering parameters (vcfFilter_PON.pl).

pon=/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/PON/filtered_pon_variants.vcf.gz subclone=/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/filtered_final_subpando_variants.vcf.gz

bcftools view -H ponfil_subclone.vcf | wc -l
7451

For Friends, (1) separate Pando from the Friends, then (2) filter the vcf file.

Keeping only the individuals with sufficient mean depth, I start over from this file:
/uufs/chpc.utah.edu/common/home/u6028866/Pando/entire-clone/variants/pando_friends_variants/pando_friends_variants.vcf

  1. separate Pando from the Friends

bcftools view -S friends_IDs_from_F1.csv –force-samples pando_friends_variants.vcf > friends_only_variants.vcf

query -l friends_only_variants.vcf | wc -l
93 # number of individuals in the Friends dataset

  1. filter the vcf file.

Filter the vcf file for crap using the following script:
perl ../../scripts/vcfFilter_friends.pl friends_only_variants.vcf
Finished filtering friends_only_variants.vcf
Retained 61377 variable loci

Filter the vcf file for SNPs found in PON:

bcftools isec -p friends_against_PON filtered_final_friends_only_variants.vcf.gz $pon

bcftools view -H ponfil_friends_only_variants.vcf | wc -l 41376

Compare subclone to Friends now that Friends is ready to go:
friends=/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/friends/friends_against_PON/ponfil_friends_only_variants.vcf.gz

bcftools isec -p subclone_versus_friends ponfil_subclone.vcf.gz $friends

bcftools view -H ponfil_friendsfil_subclone.vcf | wc -l
6239 # number of mutations in subclone after filtering against Friends

–> filters applied so far : SNPs removed when found in PON + Friends

calculate point estimates and depth to further filter the file:
/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/subclone/result/ponfil_friendsfil_subclone.vcf

Number of loci: 6239; number of individuals 117

This calculated the allele frequency. I will filter it to only keep variants with <= 0.7 in allele frequency.

—————————————————————————–

Step 1: filter based on the allele frequencies

—————————————————————————–

reasoning: the allele frequencies are sometimes >0.5. For somatic mutations, we do not want to keep mutations that are widespread (shared by all individuals). The Pando clone is triploid, which reduces our expectation for fixation of a mutation to 0.33. We thus transform the allele frequency file to not take into account any variants whose allele frequency > 0.7, and reduce to 0.5 ane allele frequency comprised between 0.5 and 0.7.

—————————————————————————–

Step 2: Point estimates with modified allele frequencies

—————————————————————————–

I used the modified allele frequency file to calculate point estimates based on genotype likelihoods. nect filters to apply: filter for empty variants, singletons and for read depth>4.

# scp u6028866@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/subclone/result/pntest_ponfil_friendsfil_subclone.txt /Users/rpineau3/Dropbox\ \(GaTech\)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/

# scp u6028866@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/subclone/result/depth_reps_optionB.txt /Users/rpineau3/Dropbox\ \(GaTech\)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/

# upoad pntest data and depth info
pn_est <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/pntest_ponfil_friendsfil_subclone.txt", sep = " " ) # 6239 x 117
depth <- read.csv("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/depth_subclone.csv", sep = ",", header = T) # 6239 x 117
ids <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/ids_subclone.txt", sep = "\t", header = T) # 117 x 5


# Filter based on depth
par(family = "Times New Roman")
plot(colMeans(depth), pch = 16, xlab = "individual", ylab = "depth", ylim = c(0,150)) # Mean depth per sample
abline(h=4, col = "red")

keep <- which (colMeans(depth) >= 4) # 101

pnt_est_fil <- pn_est[,keep] # 6239 x 101
ids_fil <- ids[as.numeric(keep), ] # 101 x 5
depth_fil <- depth[,keep] # 6239 x 101


# -----------------------------------------------------------------------------
# BINARIZATION
# -----------------------------------------------------------------------------
pn_bin <- matrix(NaN, dim(pnt_est_fil)[1], dim(pnt_est_fil)[2])
threshold <- 0.5

for (x in 1:dim(pnt_est_fil)[1]) {
  
  for (y in 1:dim(pnt_est_fil)[2]) {
    
    if (pnt_est_fil[x,y]>threshold) {pn_bin[x,y] <- 1 } # if proba of being heteroZ > threshold, label with 1
    
    else {pn_bin[x,y] <- 0 }
    
  }
}


nInd <- 101 # number of individuals in the dataset

hets <- rowSums(pn_bin)/nInd
#length(which(hets == 0)) # 1241
#length(which(rowSums(pn_bin)>1)) # 2170

# remove singletons, meaning present in 1/nInd individuals AND
# present in zero individuals (left after filtering out some individuals, maybe)

pn_bin_fil <- pn_bin[c(rowSums(pn_bin)>1),] # 2170 x 101
af_mod$V1[c(rowSums(pn_bin)<=1)] <- 0

nSNP <-  2170 # number of SNPs in the dataset
hets <- (rowSums(pn_bin_fil))/nInd

par(family="Times New Roman", cex.lab=1.2, cex.axis=1.2, cex.main=1, cex.sub=1)
hist(hets, xlab = "Heterozygosity", ylab = "counts", 100, main = "heterozygosity per SNP, after removing singletons", 
     col = "#2ca25f")

# I like how it looks

—————————————————————————–

Step 3. Spatial clustering of the samples

—————————————————————————–

How do the sample cluster? (before filtering for somatic mutations)

library(ape)
library(colourvalues)
library(ecodist)

# -----------------------------------------------------------------------------
# PAIRWISE SIMILARY
# -----------------------------------------------------------------------------

# All pairwise comparisons
# 1 - go sample by sample
# 2 - for each pair A and B, calculate the number of mutations shared b/w A and B
# 3 - calculate the number of mutations in each A and B, take the mean
# 4 - divide #2 by #3
# 5 - store this number in a nInd x nInd matrix

#  x first
store_mat <- matrix(NA, nInd, nInd)
n <- 1
N <- 1

for (x in n:nInd) {
  
  for (y in N:nInd) {
    
    
    n_shared <- sum(rowSums(pn_bin_fil[,c(x, y)]) == 2) #  calculate the number of mutations shared b/w A and B
    n_sample_x <- sum(pn_bin_fil[,x] == 1) # calculate the number of mutations in each sample
    n_sample_y <- sum(pn_bin_fil[,y] == 1)
    
    val <- n_shared / mean(n_sample_x,n_sample_y) # calculate similarity
    store_mat[x,y] <- val
    
    
  }
  
  # N <- N + 1
  
}

# what is the range of values
par(family = "Times New Roman")
hist(store_mat[lower.tri(store_mat)], 100, col = "blue", xlim = c(0,1), main = "Range of value for pairwise similarity", xlab = "Pairwise similarity")

# transform to distance object
dist_obj <- as.dist(store_mat)
pcoa_res <- pco(dist_obj, negvals = "zero", dround = 0) #, negvals = "zero", dround = 0, rm= 1) # if negvals = 0 sets all negative eigenvalues to zero; if = "rm" corrects for negative eigenvalues using method 1 of Legendre and Anderson 1999

# color by tree ID
par(mfrow=c(1,2))
col_vec <- colour_values(ids_fil$TreeID, palette = "plasma") 
par(family = "Times New Roman", cex = 1.2)
plot(pcoa_res$vectors[,1], pcoa_res$vectors[,2], xlab = "PCoA1", ylab = "PCoA2",pch = 16,
     axes = TRUE, main = "PCoA colored by Tree ID", , col = col_vec)

plot(pcoa_res$vectors[,3], pcoa_res$vectors[,4], xlab = "PCoA3", ylab = "PCoA4",pch = 16,
     axes = TRUE, main = "PCoA colored by Tree ID", , col = col_vec)

#legend("topright", legend=c("Tree A", "Tree G", "Tree H"), pch = 16, col = c("#440154FF", "#21908CFF", "#FDE725FF"), bty = "n")



# color by tissue type
par(mfrow=c(1,2))
col_vec <- colour_values(ids_fil$Group, palette = "plasma") 
par(family = "Times New Roman", cex = 1.2)
plot(pcoa_res$vectors[,1], pcoa_res$vectors[,2], xlab = "PCoA1", ylab = "PCoA2",pch = 16,
     axes = TRUE, main = "PCoA colored by tissue type", , col = col_vec)

plot(pcoa_res$vectors[,3], pcoa_res$vectors[,4], xlab = "PCoA3", ylab = "PCoA4",pch = 16,
     axes = TRUE, main = "PCoA colored by tissue type", , col = col_vec)

legend("bottomleft", legend=c(levels(as.factor(ids_fil$Group))), pch = 16, col = levels(as.factor(col_vec)), bty = "n")

—————————————————————————–

Step 4: Filtering for somatic mutations

—————————————————————————–

PCoA on similarity matrix with all mutations and with somatic only compare vcfs!!!!

af_mod$V1[hets>0.6] <- 0 # this is my way to get rid of SNPs that need to be deleted 
# setting the allele frequency to 0 means they will be deleted when getting rid of singletons and mutations present in 0 individuals 
length(which(af_mod$V1==0)) # 4628 SNPs allele frequencies set to 0
## [1] 4628
write.csv(af_mod, "/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/afmod2_ponfil_friendsfil_subclone.csv")

# upload the file on the cluster to re-calculate point estimates : 
# scp /Users/rpineau3/Dropbox\ \(GaTech\)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/afmod2_ponfil_friendsfil_subclone.csv u6028866@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/subclone/result/

# download the file from the cluster
# scp u6028866@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/subclone/result/pntest_som_ponfil_friendsfil_subclone.txt /Users/rpineau3/Dropbox\ \(GaTech\)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/

# upoad pntest data and depth info
pn_est_som <- read.table("/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/pntest_som_ponfil_friendsfil_subclone.txt", sep = " " ) # 6239  x 117
af_mod_2 <- af_mod

# Filter based on depth
pnt_est_som_fil <- pn_est_som[,keep] # 6239 x 101

# -----------------------------------------------------------------------------
# BINARIZATION
# -----------------------------------------------------------------------------
pn_som_bin <- matrix(NaN, dim(pnt_est_som_fil)[1], dim(pnt_est_som_fil)[2])
threshold <- 0.5

for (x in 1:dim(pnt_est_som_fil)[1]) {
  
  for (y in 1:dim(pnt_est_som_fil)[2]) {
    
    if (pnt_est_som_fil[x,y]>threshold) {pn_som_bin[x,y] <- 1 } # if proba of being heteroZ > threshold, label with 1
    
    else {pn_som_bin[x,y] <- 0 }
    
  }
}

pn_bin_som_fil <- pn_som_bin[c(rowSums(pn_som_bin)>1),] # 1611 x 101
af_mod_2$V1[c(rowSums(pn_som_bin)<=1)] <- 0

nSNP <-  1611 # number of SNPs in the dataset
hets <- (rowSums(pn_bin_som_fil))/nInd

par(family="Times New Roman", cex.lab=1.2, cex.axis=1.2, cex.main=1, cex.sub=1)
hist(hets, xlab = "Heterozygosity", ylab = "counts", 100, main = "heterozygosity per SNP, after removing singletons", 
     col = "#2ca25f")
abline(v=0.6, col = "red", lwd = 2)

Ok, what I find interesting here, is that I am still left with SNPs with heterozygoty with frequency > 0.6 when I got rid of them previously - so our previous filtering was “counteracted” by the use of this new script.

length(which(hets>0.6)) # remove 432 SNPs
## [1] 432
length(which(hets<=0.6)) # remove 432 SNPs
## [1] 1179
pn_bin_som_fil_2 <- pn_bin_som_fil[hets<=0.6, ] # 1179 x 101

# remove 432 SNPs of 6239 
af_mod_2$V1[hets>0.6] <- 0
length(which(af_mod_2$V1 != 0))
## [1] 1157
# To filter the vcf on the cluster, I need (1) the list of individuals and (2) the list of SNPs 
bool <- matrix(1,length(af_mod_2$V1), 1) # this boolean vector will be used to filter the vcf  6239 x 1
bool[c(af_mod_2$V1 == 0)] <- 0 # 1157 somatic mutations to look at
length(which(bool==1)) # 1157
## [1] 1157
# write.csv(bool, "/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/bool_to_filter_vcf_subclones.csv", row.names = F)
# write.csv(ids_fil, "/Users/rpineau3/Dropbox (GaTech)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/IDs_subclones_101.csv", row.names = F)

# send files on the cluster
# scp -r /Users/rpineau3/Dropbox\ \(GaTech\)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/1_data/final_filtered u6028866@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/subclone/result/

# also send the python script to filter the file based on boolean vector
# scp /Users/rpineau3/Dropbox\ \(GaTech\)/BioSci-Ratcliff/Rozenn/4.PandoProject/11-field-lab-work/5-GBS/2_analysis_of_subclones/0_scripts/filter_vcf_based_on_bool.py u6028866@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/u6028866/Pando/subclone/variants/subclone/scripts/

# working: 
# less ponfil_friendsfil_subclone_noheader_filtered.vcf | wc -l
# 1157

# bcftools view -H ponfil_friendsfil_subclone_prefinal.vcf | wc -l
# 1157

# removing individuals
# bcftools view -S final_filtered/IDs_subclones_101.csv ponfil_friendsfil_subclone_prefinal.vcf > ponfil_friendsfil_subclone_final.vcf

# bcftools query -l ponfil_friendsfil_subclone_final.vcf | wc -l
# 101 # number of samples in the vcf file

# bcftools view -H ponfil_friendsfil_subclone_final.vcf | wc -l
# 1157 # number of SNPs in the vcf file

# This is my final vcf file for the subclone (2022) dataset!

I have my “final” vcf file for the subclone and the replicate dataset. Now I can check how many SNPs I have in common between both, within the same samples but also with the other samples.

Next, I can also re-do the PCoA analysis on the somatic set of mutations as well.

# -----------------------------------------------------------------------------
# PAIRWISE SIMILARY
# -----------------------------------------------------------------------------

# All pairwise comparisons
# 1 - go sample by sample
# 2 - for each pair A and B, calculate the number of mutations shared b/w A and B
# 3 - calculate the number of mutations in each A and B, take the mean
# 4 - divide #2 by #3
# 5 - store this number in a nInd x nInd matrix

#  x first
store_mat <- matrix(NA, nInd, nInd)
n <- 1
N <- 1

for (x in n:nInd) {
  
  for (y in N:nInd) {
    
    
    n_shared <- sum(rowSums(pn_bin_som_fil_2[,c(x, y)]) == 2) #  calculate the number of mutations shared b/w A and B
    n_sample_x <- sum(pn_bin_som_fil_2[,x] == 1) # calculate the number of mutations in each sample
    n_sample_y <- sum(pn_bin_som_fil_2[,y] == 1)
    
    val <- n_shared / mean(n_sample_x,n_sample_y) # calculate similarity
    store_mat[x,y] <- val
    
    
  }
  
  # N <- N + 1
  
}

# what is the range of values
par(family = "Times New Roman")
hist(store_mat[lower.tri(store_mat)], 100, col = "blue", xlim = c(0,1), main = "Range of value for pairwise similarity", xlab = "Pairwise similarity")

# transform to distance object
dist_obj <- as.dist(store_mat)
pcoa_res <- pco(dist_obj, negvals = "zero", dround = 0) #, negvals = "zero", dround = 0, rm= 1) # if negvals = 0 sets all negative eigenvalues to zero; if = "rm" corrects for negative eigenvalues using method 1 of Legendre and Anderson 1999

# color by tree ID
par(mfrow=c(1,2))
col_vec <- colour_values(ids_fil$TreeID, palette = "plasma") 
par(family = "Times New Roman", cex = 1.2)
plot(pcoa_res$vectors[,1], pcoa_res$vectors[,2], xlab = "PCoA1", ylab = "PCoA2",pch = 16,
     axes = TRUE, main = "PCoA colored by Tree ID", , col = col_vec)

plot(pcoa_res$vectors[,3], pcoa_res$vectors[,4], xlab = "PCoA3", ylab = "PCoA4",pch = 16,
     axes = TRUE, main = "PCoA colored by Tree ID", , col = col_vec)

#legend("topright", legend=c("Tree A", "Tree G", "Tree H"), pch = 16, col = c("#440154FF", "#21908CFF", "#FDE725FF"), bty = "n")



# color by tissue type
par(mfrow=c(1,2))
col_vec <- colour_values(ids_fil$Group, palette = "plasma") 
par(family = "Times New Roman", cex = 1.2)
plot(pcoa_res$vectors[,1], pcoa_res$vectors[,2], xlab = "PCoA1", ylab = "PCoA2",pch = 16,
     axes = TRUE, main = "PCoA colored by tissue type", , col = col_vec)

plot(pcoa_res$vectors[,3], pcoa_res$vectors[,4], xlab = "PCoA3", ylab = "PCoA4",pch = 16,
     axes = TRUE, main = "PCoA colored by tissue type", , col = col_vec)

legend("bottomleft", legend=c(levels(as.factor(ids_fil$Group))), pch = 16, col = levels(as.factor(col_vec)), bty = "n")

There seem to be less clustering when focusing on the somatic mutations…