Exploring population genetics of EBV genomes in BL endemic regions

Here, I would like to share some of my results exploring the single nucleotide variation data of EBV genomes we sequenced. All the code and data have been deposited to the repo https://github.com/yasinkaymaz/ViralGenomeAssembly Note: Directories are relative to repo master.

We are loading SNV data of EBV genomes from 58 eBL and 40 Healthy controls, excluding the plasma genomes of paired samples and replicated genomes.

#Load the data
pops <- read.delim(here::here("workspace/data/pop_info.txt"),row.names = 1, header = T)
summary(pops)
##                Method          EBV_Type           Condition 
##  Direct Sequencing:33   Inter-typic: 3   eBL           :58  
##  GenomiPhi-WGA    :13   Type1      :60   HealthyControl:40  
##  Preamp-sWGA      :52   Type2      :35                      
##                                                             
##             Label   
##  eBL_Cell_Line : 3  
##  eBL_Plasma    :14  
##  eBL_Tumor     :41  
##  HealthyControl:40

I generated a vcf file out of De Novo assembled genomes following the steps:

  1. multiple sequence alignment (scripts/msa_mafft.sh)

  2. Fixed the insertions/deletions (bin/MSA_InsertCleaner.py & bin/MSA_Gap-Singleton_Editor.py)

  3. Determined single nucleotide variant locations (bin/MSA_SNP_finder.py)

  4. Removed repetitive regions (bin/MSA_cleaner.py)

  5. Finally, generated a vcf file from the output using ‘snp-sites’ tool.

Resulting vcf file stores 9488 variant locations from 98 genomes.

Need to convert vcf file into genlight (gl) format. This conversion drops 200 loci with more than two alleles, which are not currently supported. What is left is 8249 loci from 98 genomes.

Filtration of variants and low coverage genomes

I then want to remove genomes with too many missing data, in other words, short assemblies due to lack of coverage. Remove individual genomes if only less than 25% of all variants locations are covered. I also only keep loci at which at least 25% of genomes have been covered.

This results to drop 3 healthy control genomes with low coverage; “HC-0031”, “HC-0033”, “HC-0038”.

We are left with 7458 variants from 95 genomes in total now.

Principle Coordinates Analysis (PCoA)

Now, I run PCOA analysis on the data generating first 20 PCs:

## Performing a PCoA, individuals as entities, SNP loci as attributes
## Ordination yielded 13 informative dimensions from 94 original dimensions
##   PCoA Axis 1 explains 43.9 % of the total variance
##   PCoA Axis 1 and 2 combined explain 52 % of the total variance
##   PCoA Axis 1-3 combined explain 57.2 % of the total variance

Here is the PCOA of 1st and 2nd axis with all variants.

PCoAxis 2nd and 3rd:

PCoAxis 4nd and 5rd:

Then, I plot the PCA loading values for each genomic location

df <- as.data.frame(pc$loadings)
ggplot(df, aes(x=as.numeric(gl@position),y=abs(Axis1), group = 1))+
  geom_point(size=.1)+
  labs(x="genomic positions")+
  ggtitle("PCA Loadings from Axis 1")+
  geom_rect(data=rect, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax),
              color="transparent", fill="orange", alpha=0.3,
              inherit.aes = FALSE, show.legend = T)+
  geom_text(data=rect, aes(x=xmin+(xmax-xmin)/2, y=max(abs(pc$loadings[,"Axis1"]))*1.1, label=gene), size=4,angle=45)

df <- as.data.frame(pc$loadings)
ggplot(df, aes(x=as.numeric(gl@position),y=abs(Axis2), group = 1))+
  geom_point(size=.1)+
  labs(x="genomic positions")+
  ggtitle("PCA Loadings from Axis 2")+
  geom_rect(data=rect, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax),
              color="transparent", fill="orange", alpha=0.3,
              inherit.aes = FALSE, show.legend = T)+
  geom_text(data=rect, aes(x=xmin+(xmax-xmin)/2, y=max(abs(pc$loadings[,"Axis2"]))*1.1, label=gene), size=4,angle=45)

df <- as.data.frame(pc$loadings)
ggplot(df, aes(x=as.numeric(gl@position),y=abs(Axis3), group = 1))+
  geom_point(size=.1)+
  labs(x="genomic positions")+
  ggtitle("PCA Loadings from Axis 3")+
  geom_rect(data=rect, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax),
              color="transparent", fill="orange", alpha=0.3,
              inherit.aes = FALSE, show.legend = T)+
  geom_text(data=rect, aes(x=xmin+(xmax-xmin)/2, y=max(abs(pc$loadings[,"Axis3"]))*1.1, label=gene), size=4,angle=45)

Calculating Fst for variant loci

Here, I wanted to calculate locus-by-locus Fst scores to determine accumulations of genomics regions that reflect differentiation between the two comparison groups. Fixation index ranges between 0 and 1, where 0 means complete exchange of genetic material while 1 means no sharing and complete differentiation between the groups.

First, I am plotting the locus-by-locus Fst for Type 1 and Type 2 genomes regardless of disease status:

Then, I look at the Fst between eBL and HC EBVs within Type 1 genomes only:

Then, I look at the Fst between eBL and HC EBVs within Type 2 genomes only:

Genetic distance

As we discussed, I am now calculating the pairwise genetic distance of genomes for each gene and then take the average. Mean pairwise distances calculated using Kimura 2 parameter method. Error bars represent standard error of mean.

    Kimura 2-Parameter distance = -0.5 log( (1 - 2p -q) * sqrt( 1 - 2q ) )
    where:
    p = transition frequency
    q = transversion frequency

Type 1 and Type 2 genomes separately:

Delta d:

dN/dS over protein coding genes

I took the main python functions from https://github.com/a1ultima/hpcleap_dnds/

            dS  = -(3./4.)*math.log(1.-((4./3.)*pS))
            dN  = -(3./4.)*math.log(1.-((4./3.)*pN))
            
            where; 
                  pN: The count of the number of nonsynonymous differences (Nd) normalized by the number of all possible nonsynonymous sites (N)
                  pS: The count of the number of synonymous differences (Sd) normalized by the number of all possible synonymous sites (S)
                  
            dN_dS = dN/dS

Here is the source for equations from Mega software

Type 1 and Type 2 genomes separately:

##            gene GeneLen GenomesCovered SynPerGenome SynPerGenomePerKb
## LMP2B     LMP2B    1137            111         2.74              2.41
## BNRF1     BNRF1    3957             98        12.48              3.15
## BCRF1     BCRF1     513             88         0.97              1.88
## BCRF2     BCRF2    1152              0         0.00              0.00
## EBNA-LP EBNA-LP    1521              0         0.00              0.00
## BWRF1     BWRF1    8064              0         0.00              0.00
##         NonsynPerGenome NonsynPerGenomePerKb
## LMP2B              2.54                 2.24
## BNRF1              6.66                 1.68
## BCRF1              0.06                 0.11
## BCRF2              0.00                 0.00
## EBNA-LP            0.00                 0.00
## BWRF1              0.00                 0.00