Centromere Drive

Sid Bhadra-Lobo, Chris Fiscus, Tyler Kent, Tim Yang
August 27, 2014

Abundance

GWAS

Imputation

Diversity

Abundance

CentC is a 156 bp centromeric satellite repeat.

  • Collected 79 CentC repeats from NCBI.
  • Abundance, estimated by mapping reads from 360 Jiao lines with BWA, gives the relative size of the centromere.

CRM2 is Centromeric Retrotransposon of Maize 2.

  • From UTE (Unique Transposable Element) files, mapped for abundance.
  • Included for comparison against CentC abundance.

Genome Size: Initial genome sizes were found simply by aligning maize reference cDNA to each Jiao line.

  • Proved to be inaccurate, as much of the cDNA is repetitive, leading to false positives in mapping.

Genome Size Fix

Using parsesam.pl (Courtesy of Paul),

  • a per-gene count of reads mapped was found.
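
A minimal sketch of what a per-gene read count from SAM output could look like. This is not Paul's parsesam.pl; the function name `count_reads_per_reference` is hypothetical. It assumes standard SAM columns (FLAG in column 2, reference name in column 3) and skips unmapped, secondary, and supplementary records so that each read is counted once.

```python
from collections import Counter

# SAM FLAG bits (from the SAM specification)
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_reads_per_reference(sam_lines):
    """Count primary mapped reads per reference sequence (here: per gene).

    Hypothetical helper for illustration; sam_lines is any iterable of
    SAM-formatted text lines, e.g. an open file handle.
    """
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Ignore reads that did not map, plus extra alignments of the same read.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        counts[fields[2]] += 1
    return counts
```

With a real file this could be called as `count_reads_per_reference(open(path))`, where `path` points to one line's SAM output.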

Using Jeff's Perl one-liners to:

  • find total number of reads
  • % of reads mapping to each gene
  • flag any gene that accounts for more than 0.00001% of reads (these will be ignored).
  • skip 663 flagged genes from each abundance file.
  • recalculate # of reads mapped.
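
The filtering steps above can be sketched roughly as follows. This is a Python illustration, not Jeff's actual Perl one-liners; the function names and the dict-of-counts layout are assumptions. Only the 0.00001% cutoff comes from the slide.

```python
# Cutoff from the slide, expressed as a percent of all mapped reads.
THRESHOLD_PCT = 0.00001


def flag_genes(counts, threshold_pct=THRESHOLD_PCT):
    """Return the genes whose share of mapped reads exceeds threshold_pct.

    counts is a hypothetical {gene_name: reads_mapped} dict for one line.
    """
    total = sum(counts.values())
    if total == 0:
        return set()
    return {gene for gene, n in counts.items()
            if 100.0 * n / total > threshold_pct}


def filtered_read_total(counts, flagged):
    """Recalculate the number of mapped reads, skipping flagged genes."""
    return sum(n for gene, n in counts.items() if gene not in flagged)
```

In this sketch the flagged set is computed once and then passed to `filtered_read_total` for each abundance file, mirroring the "skip flagged genes from each abundance file" step.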

Abundance Plots

  • Before accounting for relative differences in genome size.

Abundance Plots Adjusted for Genome Size

  • Wider distribution and larger percent abundances.

Acknowledgements

Thanks guys.

  • Kevin Distor, for sending me CRM UTEs
  • Paul, for the parsesam.pl script
  • Jeff, for answering all my questions and the GitHub guidance.

GWAS

  • GWAS was run using Jiao/hapmap2 data set.
  • Estimated CentC/CRM2 Abundance using Sid's latest genome size estimation. K = 12 populations was used for this run.
  • NOTE: I did not have time to mark functional centromeres in these graphs (crunched for time)

Manhattan Plots

  • Highlighted regions in Manhattan plots are SNPs within CentC satellites ±5 kb (thanks to Paul)

  • Numerous hits above 10^-6 for CentC, just 2 hits for CRM2
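
A rough sketch of how SNPs could be picked out for highlighting, assuming CentC satellite coordinates are available as (start, end) intervals per chromosome. The function names and data layout are hypothetical; only the 5 kb flank comes from the slide.

```python
FLANK_BP = 5_000  # flank on each side of a CentC satellite, from the slide


def in_flanked_interval(pos, intervals, flank=FLANK_BP):
    """True if pos falls inside any (start, end) interval widened by flank."""
    return any(start - flank <= pos <= end + flank for start, end in intervals)


def highlight_snps(snps, satellites, flank=FLANK_BP):
    """Select SNPs within CentC satellites or their flanks.

    snps: iterable of (chrom, pos) tuples.
    satellites: dict mapping chrom -> list of (start, end) tuples.
    """
    return [(chrom, pos) for chrom, pos in snps
            if in_flanked_interval(pos, satellites.get(chrom, []), flank)]
```

A linear scan over intervals is fine for a handful of satellites; a sorted-interval lookup would be the natural swap if the lists get long.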

QQ Plots

  • Observed vs. expected p-values: checks whether the model's corrections (population structure, kinship, etc.) fit the data well.
  • CentC deviation is possibly from overcorrecting for population structure; checked by running the MLM analysis without Q (the population structure matrix).
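
For reference, a minimal sketch of how the points of a QQ plot can be computed from a list of GWAS p-values. The expected quantile (rank - 0.5) / n is one common convention and an assumption here, not necessarily what the plotting software used; `qq_points` is a hypothetical name.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs on the -log10 scale.

    Under the null hypothesis p-values are uniform on (0, 1), so the
    i-th smallest of n p-values is expected near (i - 0.5) / n.
    """
    observed = sorted(pvalues)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must be in (0, 1]")
        expected = (rank - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points
```

Points that sit along the diagonal suggest the corrections are adequate; points that fall below it are consistent with overcorrection.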

Next steps

  • Replace the Jiao/hapmap2 data set with the hapmap3 panel, which includes Jiao and other lines. This expands the data set and will hopefully yield more meaningful results.
  • Calculate population structure of the hapmap3 panel using STRUCTURE for use in GWAS. We will use the same pipeline to analyze the new data.

Imputation

Goal

  • Hapmap3 is 826 lines and Ames diversity panel is 2,815 lines
  • Bigger sample size = better
  • Get regions ±10 Mb around each big or small SNP that Chris finds in HMP3, extract them from the Ames lines, and send them to Tim for analysis
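
A small sketch of how the ±10 Mb regions could be built from a list of SNP positions on one chromosome. Merging overlapping windows and clipping to the chromosome length are assumptions added for illustration; `snp_windows` is a hypothetical name.

```python
FLANK_BP = 10_000_000  # ±10 Mb, from the slide


def snp_windows(positions, flank=FLANK_BP, chrom_length=None):
    """Build 1-based (start, end) windows around SNPs on one chromosome.

    Overlapping windows are merged so no region is extracted twice.
    If chrom_length is given, windows are clipped to the chromosome end.
    """
    windows = []
    for pos in sorted(positions):
        start = max(1, pos - flank)
        end = pos + flank
        if chrom_length is not None:
            end = min(chrom_length, end)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows
```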

First plan: Beagle

  • Worked great after help from Sofiane (graph attached)
  • Full hapmap files had an estimated completion time of infinity
  • Scrapped

Beagle IBS Chromosome 1

  • # of comparisons vs position (bp)

Next plan: FILLIN

  • Works better for imputing inbred lines than Beagle
  • New plugin for TASSEL
  • Changed many times/very little documentation
  • After many changes and emails with Kelly Swarts, it seems to be working
  • Have to find the correct values to use for parameters to get good accuracy

Current plan: Projection from Buckler

  • Buckler lab sent us a projections file which has IBS regions of HMP3/Ames
  • Plug and chug in TASSEL to get hapmap imputed Ames
    • Except: there was no command-line (CML) option for projections
  • Terry Casstevens wrote the command-line options for me
  • Can't replicate the output I produced before being distracted by a side project
  • Many weird formatting problems with the HMP3 VCF files: different errors than I was getting before, and no documentation
    • Help from Terry again
  • The new TASSEL jar Terry is having me use needs Java 8, which Bill at CSE is installing for me

Diversity

Methods

  • AMES imputed data -> Variscan statistics profile
  • Graph results, add LOWESS smoothing

Variscan Profile Statistics

  • S = total # of segregating sites
  • Eta = total # of mutations; Eta_E = # of singletons (SNPs that exist in only a single individual)
  • Pi = measure of pairwise differences
  • Theta = population mutation rate (determines diversity)
    • Expectation of Pi

Pi = pairwise differences

  • Calculates nucleotide diversity = average pairwise nucleotide differences/site

    • Compares ind 1 to ind 2, 1 to 3, 1 to 4, and so on, then takes the average
  • Low value of pi is indicative of selection at that SNP

    • Low value at big centromeres should indicate selection
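
To make the pairwise comparison concrete, here is a minimal sketch of nucleotide diversity for a set of aligned, equal-length sequences. It is an illustration, not Variscan's implementation: it compares every pair of individuals and assumes no missing data, and `nucleotide_diversity` is a hypothetical name.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average pairwise nucleotide differences per site (pi).

    sequences: list of aligned, equal-length strings, one per individual.
    """
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    # Count mismatched sites for every pair of individuals.
    pair_diffs = [sum(a != b for a, b in zip(s1, s2))
                  for s1, s2 in combinations(sequences, 2)]
    mean_diff = sum(pair_diffs) / len(pair_diffs)
    return mean_diff / length
```

For three individuals `"AAAA"`, `"AAAT"`, `"AATT"` the pairwise differences are 1, 2, and 1, so the average is 4/3 differences, or 1/3 per site.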

AMES Pi Plot

Tajima’s D = θπ - θw, scaled by its standard deviation

  • θπ = Actual diversity as determined in part by pi
  • θw = Expected diversity given # of segregating sites and other factors
  • D = 0 = no selection
  • D < 0 = less diversity than expected = selection at that SNP (big centromeres)
  • D > 0 = more diversity than expected
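
A sketch of the full statistic following Tajima's (1989) formula, in which the difference between the two diversity estimates is divided by an estimate of its standard deviation. The function name and argument layout are hypothetical. Inputs are per region (not per site): sample size n, number of segregating sites S, and the mean number of pairwise differences.

```python
import math


def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Tajima's D for n sequences with seg_sites segregating sites.

    mean_pairwise_diff is the average number of pairwise differences
    over the region (pi summed over sites, not per site).
    """
    if n < 4:
        raise ValueError("need at least 4 sequences for a usable variance")
    if seg_sites < 1:
        raise ValueError("need at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_w = seg_sites / a1  # Watterson's estimate from S
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (mean_pairwise_diff - theta_w) / math.sqrt(variance)
```

When the pairwise estimate equals S / a1 the function returns a value near 0; smaller pairwise diversity gives D < 0 and larger gives D > 0, matching the bullets above.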

AMES Tajima's D Plot

End