Two column table for the k-mer profile:
# Occurrences kmers
# 1 3 (CAT, GCT, CTG)
# 2 2 (ATG, GCA)
# 3 1 (TGC)
The most common read length for the concatenated fastq for Arabidopsis is 90bp.
#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=Jellyfish
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu
module load jellyfish/2.3.0
jellyfish count -C -m 21 -s 1000000000 -t 10 chi2_concatenated -o Jellyfish_output.jf
jellyfish histo -t 10 Jellyfish_output.jf > Jellyfish_output.histo
kmer length=21 read length=90 max kmer coverage=1000
GenomeScope Results table output for Arabidopsis:
Results
GenomeScope version 1.0
k = 21
property min max
Heterozygosity 0.0964079% 0.102354%
Genome Haploid Length 124,807,825 bp 124,876,733 bp
Genome Repeat Length 20,318,464 bp 20,329,682 bp
Genome Unique Length 104,489,361 bp 104,547,051 bp
Model Fit 97.6172% 99.7011%
Read Error Rate 0.119641% 0.119641%
The two kmer profile plots generated by GenomeScope are downloaded as images and attached to this submission. Please see attachments for the two kmer profile plots generated.
Yes, the kmer-profiling estimates of heterozygosity for Arabidopsis and date palm match this expectation: low rates of heterozygosity estimate for Arabidopsis thaliana due to capability of self-fertilization and high rates of heterozygosity for date palm (outcrossing species). The kmer-profile graphs generated by GenomeScope supports this prediction by population genetic theory. The kmer-profile graph of Arabidopsis showed a Poisson distribution which according to Vurture et al. 2017 reading showed more of homozygous genomes, therefore Arabidopsis has low rates of heterozygosity matching the expectation. On the other hand, date palm kmer profile showed a characteristic bimodal profile, showing heterozygous genomes. Also, the predicted heterozygosity for Arabidopsis is 0.0994% while for date palms is 1.25%.
The left most peak, represented in orange, is the error peak. The middle peak is the heterozygous peak while to the right is the homozygous peak. The heterozygous peak is expected to be half of the homozygous peak in terms of number of occurences or coverage because k-mers that for the heterozygous peak, it contain k-mers that are unique to one haplotype while homozygous peak is for k-mers that are shared between two homologous chromosomes and thus in theory should be twice the occurences or coverage than the heterozygous peak.
According to a published assembly by Hazzouri et al. 2019 for date palm, the assembly has a haploid length of 772 MB when in reality it may be larger as the estimated genome size is between 870-899Mb from flow cytometry. For Arabidopsis thaliana, the estimated haploid genome size estimates of GenomeScope that I just performed was estimated to be 124,876,733bp or about 125 Mb. The GenomeScope estimates might differ from the genome assembly sizes for both species due to loss of some reads due to the specified maximum kmer size of 1000 specified during GenomeScope analysis, thus we are excluding kmers after 1000 causing us to miss some part of the genome.
I would expect differences in the k-mer profile between two fastqs from the same sample with a 2-fold difference in sequencing error rate. The k-mer profile from the fastq with the higher sequencing error rate would have a more distorted profile with low frequency false k-mers as well as increased variances as mentioned in Vurture et al. 2017. The higher sequencing error rate would have a GenomeScope profile containing a greater “errors” peak in terms of coverage/occurences and lower model fit percentage.