Arabidopsis chromosome fusion analysis - subtelomere enrichment revision

Mutant analysis

~11 million read pairs from a MiSeq run were analysed to find pairs of reads showing evidence of chromosome rearrangements.

Here we consider read pairs where the reads map uniquely to different chromosomes, and in particular where at least one of the reads is mapping in a subtelomeric region of the chromosome (here defined as within 100kb of the end of a chromosome). Because the subtelomeres are highly repetitive, we use the raw (unmasked) sequence of the TAIR10 release of the Arabidopsis chromosome.

First we load positions of read mappings that meet the criterion of mapping to different chromosomes.

data <- '~/Dropbox/ongoing_projects/chromosome_chimeras/mutant/unique_mapping/unmasked_uniquebridges_evidence.csv'
df <- read.csv(data, as.is=T)
library(ggplot2)
df$left <- gsub(df$left, pattern='chr', replacement='')
df$right <- gsub(df$right, pattern='chr', replacement='')
summary(df)

##      left                lpos             right          
##  Length:28478       Min.   :       9   Length:28478      
##  Class :character   1st Qu.: 4520588   Class :character  
##  Mode  :character   Median :11186006   Mode  :character  
##                     Mean   :11314506                     
##                     3rd Qu.:16591262                     
##                     Max.   :30427408                     
##       rpos         
##  Min.   :     952  
##  1st Qu.: 4193922  
##  Median :11137496  
##  Mean   :10963727  
##  3rd Qu.:15568369  
##  Max.   :30426592

Next we load the lengths of the chromosomes in the TAIR10 assembly.

# lengths of chromosomes
# ruby -e "require 'bio'; Bio::FastaFormat.open('TAIR10_chr_all.fas').each{ |rec| puts \"#{rec.entry_id},#{rec.seq.length}\" }"
chromlen <- read.csv('~/Dropbox/ongoing_projects/chromosome_chimeras/reference/ath_chrom_lengths.csv',
                     head=F, as.is=T)
names(chromlen) <- c('chrom', 'length')
print(chromlen)

##          chrom   length
## 1            1 30427671
## 2            2 19698289
## 3            3 23459830
## 4            4 18585056
## 5            5 26975502
## 6 mitochondria   366924
## 7  chloroplast   154478

We’re only interested in nuclear chromosomes.

# subset to real chromosomes
chr <- as.character(1:5)
df <- subset(df, left %in% chr & right %in% chr)

Annotate the aligments by whether left or right reads were within subtelomeres.

# annotate subtelomere reads
subtelo_cutoff <- 1e5
df$l_subtelo <- apply(df, 1, function(x) {
  x <- as.numeric(x[1:2])
  len <- chromlen$length[chromlen$chrom == x[1]]
  pos <- x[2]
  return((pos < subtelo_cutoff) || pos > (len - subtelo_cutoff))
})
df$r_subtelo <- apply(df, 1, function(x) {
  x <- as.numeric(x[3:4])
  len <- chromlen$length[chromlen$chrom == x[1]]
  pos <- x[2]
  return((pos < subtelo_cutoff) || pos > (len - subtelo_cutoff))
})
table(df[,5:6])

##          r_subtelo
## l_subtelo FALSE  TRUE
##     FALSE 26753   953
##     TRUE    762    10

There were 28478 read pairs showing evidence of crossovers. Only 10 read pairs show evidence of crossovers between subtelomeric regions, while 1725 pairs have at least one read aligning to a subtelomere. This is equivalent to 6% of crossovers involving the subtelomeres. To examine whether this is different than the rate expected by chance we can use a hypergeometric test.

# find the proportion of all reads that mapped to subtelomeres  
total_reads <- 8713960
subtelo_reads <- 87083
p_reads_subtelo <- subtelo_reads / total_reads
# test for enrichment in subtelomeres
total_crossovers <- dim(df)[1]
p_reads_crossover <- total_crossovers/total_reads
subtelo_crossovers <- dim(subset(df, l_subtelo==T | r_subtelo==T))[1]
p <- 1.0 - phyper(subtelo_crossovers - 1,       # the number of white balls drawn
                  subtelo_reads,                # the number of white balls in the urn
                  total_reads - subtelo_reads,  # the number of black balls in the urn
                  total_crossovers)             # the number of balls drawn from the urn
enrichment_mutant <- (subtelo_crossovers/total_crossovers)/p_reads_subtelo
### print some statistics
sprintf("%.2f %% of all reads mapped to subtelomeres", p_reads_subtelo * 100)

## [1] "1.00 % of all reads mapped to subtelomeres"

sprintf("%.2f %% of all reads showed evidence of crossovers", p_reads_crossover * 100)

## [1] "0.33 % of all reads showed evidence of crossovers"

sprintf("evidence of crossovers in the subtelomeres was enriched %.2f-fold compared to the proportion of all reads mapping there", enrichment_mutant)

## [1] "evidence of crossovers in the subtelomeres was enriched 6.06-fold compared to the proportion of all reads mapping there"

sprintf("the hypergeometric test returned a p-value of %.4f", p)

## [1] "the hypergeometric test returned a p-value of 0.0000"

Thus, the hypergeometric test shows that detected crossovers are significantly enriched in the subtelomeric regions.

This result relies on some assumptions:

that the sequencing and read mapping aren’t biased towards the subtelomeres
that the meaning of ‘expected by chance’ in this case is a uniform distribution of crossovers along the chromosomes

In reality, we expect to be less able to sequence repetitive regions and less able to uniquely map reads that come from repetitive regions. Because the telomeres are highly repetetive, the enrichment of mapping discovered here could be an underestimate of the true rate of telomeric crossovers.

It’s worth examining the distribution of the alignments across the chromosomes.

# plot the points
ggplot(subset(df, l_subtelo==T | r_subtelo==T), aes(x=lpos, y=rpos)) +
  geom_point(size=0.8, alpha=0.8) +
  facet_grid(right ~ left, scales='free', space='free') +
  xlab("chromosome") +
  xlab("chromosome")

plot of chunk unnamed-chunk-6

The beginning of chromosome two has a notable enrichment compared to the other chromosomes. It might be worth exploring whether this is a sequencing/alignment bias or true biologial phenomenon. If it is a bias, it will also affect the hypergeometric test.

Wild-type analysis

To provide a wild-type control, Illumina Hiseq paired-end genomic DNA reads from Col0 were obtained from Short Read Archive accession SRR519624. These were processed in an idental way to the mutant reads.

wt_data <- '~/Dropbox/ongoing_projects/chromosome_chimeras/reference/unique_mapping/col0_masked_uniquebridges_evidence.csv'
wt_df <- read.csv(wt_data, as.is=T)
library(ggplot2)
wt_df$left <- gsub(wt_df$left, pattern='chr', replacement='')
wt_df$right <- gsub(wt_df$right, pattern='chr', replacement='')
summary(wt_df)

##      left                lpos             right          
##  Length:28105       Min.   :    1629   Length:28105      
##  Class :character   1st Qu.: 5463034   Class :character  
##  Mode  :character   Median :12002485   Mode  :character  
##                     Mean   :12067698                     
##                     3rd Qu.:17329324                     
##                     Max.   :30423084                     
##       rpos         
##  Min.   :       2  
##  1st Qu.: 3034830  
##  Median : 9314801  
##  Mean   : 9654267  
##  3rd Qu.:15295124  
##  Max.   :26971490

We’re only interested in nuclear chromosomes.

# subset to real chromosomes
chr <- as.character(1:5)
wt_df <- subset(wt_df, left %in% chr & right %in% chr)

Annotate the aligments by whether left or right reads were within subtelomeres.

# annotate subtelomere reads
subtelo_cutoff <- 1e5
wt_df$l_subtelo <- apply(wt_df, 1, function(x) {
  x <- as.numeric(x[1:2])
  len <- chromlen$length[chromlen$chrom == x[1]]
  pos <- x[2]
  return((pos < subtelo_cutoff) || pos > (len - subtelo_cutoff))
})
wt_df$r_subtelo <- apply(wt_df, 1, function(x) {
  x <- as.numeric(x[3:4])
  len <- chromlen$length[chromlen$chrom == x[1]]
  pos <- x[2]
  return((pos < subtelo_cutoff) || pos > (len - subtelo_cutoff))
})
table(wt_df[,5:6])

##          r_subtelo
## l_subtelo FALSE  TRUE
##     FALSE 23808   178
##     TRUE    270     1

There were 24257 read pairs showing evidence of crossovers.

# find the proportion of all reads that mapped to subtelomeres  
wt_total_reads <- 9586161
wt_subtelo_reads <- 91860
wt_p_reads_subtelo <- wt_subtelo_reads / wt_total_reads
# test for enrichment in subtelomeres
wt_total_crossovers <- dim(wt_df)[1]
wt_p_reads_crossover <- wt_total_crossovers/wt_total_reads
wt_subtelo_crossovers <- dim(subset(wt_df, l_subtelo==T | r_subtelo==T))[1]
wt_p <- 1.0 - phyper(wt_subtelo_crossovers - 1,       # the number of white balls drawn
                  wt_subtelo_reads,                # the number of white balls in the urn
                  wt_total_reads - wt_subtelo_reads,  # the number of black balls in the urn
                  wt_total_crossovers)             # the number of balls drawn from the urn
enrichment_wt <- (wt_subtelo_crossovers/wt_total_crossovers)/wt_p_reads_subtelo
### print some statistics
sprintf("%.2f%% of all reads mapped to subtelomeres", wt_p_reads_subtelo * 100)

## [1] "0.96% of all reads mapped to subtelomeres"

sprintf("%.2f%% of all reads showed evidence of crossovers", wt_p_reads_crossover * 100)

## [1] "0.25% of all reads showed evidence of crossovers"

sprintf("evidence of crossovers in the subtelomeres was enriched %.2f-fold compared to the proportion of all reads mapping there", enrichment_wt)

## [1] "evidence of crossovers in the subtelomeres was enriched 1.93-fold compared to the proportion of all reads mapping there"

sprintf("the hypergeometric test returned a p-value of %.4f", wt_p)

## [1] "the hypergeometric test returned a p-value of 0.0000"

Comparison of wild-type and mutant

The wild-type and mutant had a similar (~1%) proportion of read pairs mapping to the subtelomeres.

0.25% of wild-type read pairs, and 0.32% of mutant read pairs, showed evidence of crossovers (read pairs uniquely mapping to different chromosomes). Without replication it is impossible to evaluate whether this difference is significant.

Both wild type and mutant strains have crossovers significantly enriched in subtelomeres. However, the fold enrichment in the mutant was ~3x as great as in wild type.

ggplot(data.frame(strain=c('wt', 'mutant'), enrichment=c(enrichment_wt, enrichment_mutant)),
       aes(x=strain, y=enrichment)) +
  geom_bar(position='dodge', stat='identity') +
  ylab("enrichment (observed/expected)") +
  theme_bw() +
  ggtitle("subtelomeric enrichment of crossover evidence in mutant vs wild type")

plot of chunk unnamed-chunk-11