Task 1:Understanding k-mer profiles

Report the profile in the form of a two column table for your answer [ 1 point ].

Two column table for the k-mer profile:

# Occurrences    kmers
#     1           3       (CAT, GCT, CTG)
#     2           2       (ATG, GCA)
#     3           1       (TGC)

Task 2: Generate k-mer profile with Jellyfish

Q2.1: GenomeScope requires the read length for the fastq input file. What is the read length for the concatenated fastq for Arabidopsis [ 1 point ]?

The most common read length for the concatenated fastq for Arabidopsis is 90bp.

Q2.2: Copy your slurm script with all commands into your homework document for your answer [ 2 points ]

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=Jellyfish
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu


module load jellyfish/2.3.0
jellyfish count -C -m 21 -s 1000000000 -t 10 chi2_concatenated -o Jellyfish_output.jf

jellyfish histo -t 10 Jellyfish_output.jf > Jellyfish_output.histo

Task 3: Reference-free estimation of heterozygosity, genome size and repeat content

Q3.1: Report the settings (kmer length, read length, and max kmer coverage ) you input on the GenomeScope website [ 1 point ].

kmer length=21 read length=90 max kmer coverage=1000

Q3.2: Report your GenomeScope outputs [ 1 point ] (1) Paste the GenomeScope Results table output for Arabidopsis into your homework document (2) Include the two kmer profile plots generated by GenomeScope. Download images or take a screen shot of the output plots and include in your report.

GenomeScope Results table output for Arabidopsis:

Results

GenomeScope version 1.0
k = 21

property                      min               max               
Heterozygosity                0.0964079%        0.102354%         
Genome Haploid Length         124,807,825 bp    124,876,733 bp    
Genome Repeat Length          20,318,464 bp     20,329,682 bp     
Genome Unique Length          104,489,361 bp    104,547,051 bp    
Model Fit                     97.6172%          99.7011%          
Read Error Rate               0.119641%         0.119641% 

The two kmer profile plots generated by GenomeScope are downloaded as images and attached to this submission. Please see attachments for the two kmer profile plots generated.

Q3.3a: Does the kmer-profiling estimates of heterozygosity (the proportion of heterozygous sites) for Arabidopsis and date palm match this expectation? [ 1 point ].

Yes, the kmer-profiling estimates of heterozygosity for Arabidopsis and date palm match this expectation: low rates of heterozygosity estimate for Arabidopsis thaliana due to capability of self-fertilization and high rates of heterozygosity for date palm (outcrossing species). The kmer-profile graphs generated by GenomeScope supports this prediction by population genetic theory. The kmer-profile graph of Arabidopsis showed a Poisson distribution which according to Vurture et al. 2017 reading showed more of homozygous genomes, therefore Arabidopsis has low rates of heterozygosity matching the expectation. On the other hand, date palm kmer profile showed a characteristic bimodal profile, showing heterozygous genomes. Also, the predicted heterozygosity for Arabidopsis is 0.0994% while for date palms is 1.25%.

3.3b: Q3.3b. From left to right, which of the peaks in the output profile plot for date palm is the homozygous peak, the heterozygous peak, the error peak? Why is the heterozygous peak expected to be half of the homozygous peak (in terms of number of occurences or “Coverage”)? [ 1 points ]

The left most peak, represented in orange, is the error peak. The middle peak is the heterozygous peak while to the right is the homozygous peak. The heterozygous peak is expected to be half of the homozygous peak in terms of number of occurences or coverage because k-mers that for the heterozygous peak, it contain k-mers that are unique to one haplotype while homozygous peak is for k-mers that are shared between two homologous chromosomes and thus in theory should be twice the occurences or coverage than the heterozygous peak.

Q3.4: A published assembly for date palm has a haploid length of 772 Mb, but may be larger (Hazzouri et al. 2019). The Arabidopsis thaliana assembly is much smaller (~135 Mb). Summarize the haploid genome size estimates of GenomeScope for these two species? Then speculate as to why GenomeScope estimates might differ from the genome assembly sizes for both species [ 1 point ].

According to a published assembly by Hazzouri et al. 2019 for date palm, the assembly has a haploid length of 772 MB when in reality it may be larger as the estimated genome size is between 870-899Mb from flow cytometry. For Arabidopsis thaliana, the estimated haploid genome size estimates of GenomeScope that I just performed was estimated to be 124,876,733bp or about 125 Mb. The GenomeScope estimates might differ from the genome assembly sizes for both species due to loss of some reads due to the specified maximum kmer size of 1000 specified during GenomeScope analysis, thus we are excluding kmers after 1000 causing us to miss some part of the genome.

Q3.5: Hypothetically-speaking, how would you expect the k-mer profile to differ between two fastqs from the same sample with a 2-fold difference in sequencing error rate? [ 1 point ]

I would expect differences in the k-mer profile between two fastqs from the same sample with a 2-fold difference in sequencing error rate. The k-mer profile from the fastq with the higher sequencing error rate would have a more distorted profile with low frequency false k-mers as well as increased variances as mentioned in Vurture et al. 2017. The higher sequencing error rate would have a GenomeScope profile containing a greater “errors” peak in terms of coverage/occurences and lower model fit percentage.