Here, I would like to share some of my results exploring the single nucleotide variation data of EBV genomes we sequenced. All the code and data have been deposited to the repo https://github.com/yasinkaymaz/ViralGenomeAssembly Note: Directories are relative to repo master.
We are loading SNV data of EBV genomes from 58 eBL and 40 Healthy controls, excluding the plasma genomes of paired samples and replicated genomes.
#Load the data
pops <- read.delim(here::here("workspace/data/pop_info.txt"),row.names = 1, header = T)
summary(pops)
## Method EBV_Type Condition
## Direct Sequencing:33 Inter-typic: 3 eBL :58
## GenomiPhi-WGA :13 Type1 :60 HealthyControl:40
## Preamp-sWGA :52 Type2 :35
##
## Label
## eBL_Cell_Line : 3
## eBL_Plasma :14
## eBL_Tumor :41
## HealthyControl:40
I generated a vcf file out of De Novo assembled genomes following the steps:
multiple sequence alignment (scripts/msa_mafft.sh)
Fixed the insertions/deletions (bin/MSA_InsertCleaner.py & bin/MSA_Gap-Singleton_Editor.py)
Determined single nucleotide variant locations (bin/MSA_SNP_finder.py)
Removed repetitive regions (bin/MSA_cleaner.py)
Finally, generated a vcf file from the output using ‘snp-sites’ tool.
Resulting vcf file stores 9488 variant locations from 98 genomes.
Need to convert vcf file into genlight (gl) format. This conversion drops 200 loci with more than two alleles, which are not currently supported. What is left is 8249 loci from 98 genomes.
I then want to remove genomes with too many missing data, in other words, short assemblies due to lack of coverage. Remove individual genomes if only less than 25% of all variants locations are covered. I also only keep loci at which at least 25% of genomes have been covered.
This results to drop 3 healthy control genomes with low coverage; “HC-0031”, “HC-0033”, “HC-0038”.
We are left with 7458 variants from 95 genomes in total now.
Now, I run PCOA analysis on the data generating first 20 PCs:
## Performing a PCoA, individuals as entities, SNP loci as attributes
## Ordination yielded 13 informative dimensions from 94 original dimensions
## PCoA Axis 1 explains 43.9 % of the total variance
## PCoA Axis 1 and 2 combined explain 52 % of the total variance
## PCoA Axis 1-3 combined explain 57.2 % of the total variance
Here is the PCOA of 1st and 2nd axis with all variants.
PCoAxis 2nd and 3rd:
PCoAxis 4nd and 5rd:Then, I plot the PCA loading values for each genomic location
df <- as.data.frame(pc$loadings)
ggplot(df, aes(x=as.numeric(gl@position),y=abs(Axis1), group = 1))+
geom_point(size=.1)+
labs(x="genomic positions")+
ggtitle("PCA Loadings from Axis 1")+
geom_rect(data=rect, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax),
color="transparent", fill="orange", alpha=0.3,
inherit.aes = FALSE, show.legend = T)+
geom_text(data=rect, aes(x=xmin+(xmax-xmin)/2, y=max(abs(pc$loadings[,"Axis1"]))*1.1, label=gene), size=4,angle=45)
df <- as.data.frame(pc$loadings)
ggplot(df, aes(x=as.numeric(gl@position),y=abs(Axis2), group = 1))+
geom_point(size=.1)+
labs(x="genomic positions")+
ggtitle("PCA Loadings from Axis 2")+
geom_rect(data=rect, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax),
color="transparent", fill="orange", alpha=0.3,
inherit.aes = FALSE, show.legend = T)+
geom_text(data=rect, aes(x=xmin+(xmax-xmin)/2, y=max(abs(pc$loadings[,"Axis2"]))*1.1, label=gene), size=4,angle=45)
df <- as.data.frame(pc$loadings)
ggplot(df, aes(x=as.numeric(gl@position),y=abs(Axis3), group = 1))+
geom_point(size=.1)+
labs(x="genomic positions")+
ggtitle("PCA Loadings from Axis 3")+
geom_rect(data=rect, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax),
color="transparent", fill="orange", alpha=0.3,
inherit.aes = FALSE, show.legend = T)+
geom_text(data=rect, aes(x=xmin+(xmax-xmin)/2, y=max(abs(pc$loadings[,"Axis3"]))*1.1, label=gene), size=4,angle=45)
Here, I wanted to calculate locus-by-locus Fst scores to determine accumulations of genomics regions that reflect differentiation between the two comparison groups. Fixation index ranges between 0 and 1, where 0 means complete exchange of genetic material while 1 means no sharing and complete differentiation between the groups.
First, I am plotting the locus-by-locus Fst for Type 1 and Type 2 genomes regardless of disease status:
Then, I look at the Fst between eBL and HC EBVs within Type 1 genomes only:
Then, I look at the Fst between eBL and HC EBVs within Type 2 genomes only:
As we discussed, I am now calculating the pairwise genetic distance of genomes for each gene and then take the average. Mean pairwise distances calculated using Kimura 2 parameter method. Error bars represent standard error of mean.
Kimura 2-Parameter distance = -0.5 log( (1 - 2p -q) * sqrt( 1 - 2q ) )
where:
p = transition frequency
q = transversion frequency
Type 1 and Type 2 genomes separately:
Delta d:
I took the main python functions from https://github.com/a1ultima/hpcleap_dnds/
dS = -(3./4.)*math.log(1.-((4./3.)*pS))
dN = -(3./4.)*math.log(1.-((4./3.)*pN))
where;
pN: The count of the number of nonsynonymous differences (Nd) normalized by the number of all possible nonsynonymous sites (N)
pS: The count of the number of synonymous differences (Sd) normalized by the number of all possible synonymous sites (S)
dN_dS = dN/dS
Here is the source for equations from Mega software
Type 1 and Type 2 genomes separately:
## gene GeneLen GenomesCovered SynPerGenome SynPerGenomePerKb
## LMP2B LMP2B 1137 111 2.74 2.41
## BNRF1 BNRF1 3957 98 12.48 3.15
## BCRF1 BCRF1 513 88 0.97 1.88
## BCRF2 BCRF2 1152 0 0.00 0.00
## EBNA-LP EBNA-LP 1521 0 0.00 0.00
## BWRF1 BWRF1 8064 0 0.00 0.00
## NonsynPerGenome NonsynPerGenomePerKb
## LMP2B 2.54 2.24
## BNRF1 6.66 1.68
## BCRF1 0.06 0.11
## BCRF2 0.00 0.00
## EBNA-LP 0.00 0.00
## BWRF1 0.00 0.00