~11 million read pairs from a MiSeq run were analysed to find pairs of reads showing evidence of chromosome rearrangements.
Here we consider read pairs where the reads map uniquely to different chromosomes, and in particular where at least one of the reads is mapping in a subtelomeric region of the chromosome (here defined as within 100kb of the end of a chromosome). Because the subtelomeres are highly repetitive, we use the raw (unmasked) sequence of the TAIR10 release of the Arabidopsis chromosome.
First we load positions of read mappings that meet the criterion of mapping to different chromosomes.
data <- '~/Dropbox/ongoing_projects/chromosome_chimeras/mutant/unique_mapping/unmasked_uniquebridges_evidence.csv'
df <- read.csv(data, as.is=T)
library(ggplot2)
df$left <- gsub(df$left, pattern='chr', replacement='')
df$right <- gsub(df$right, pattern='chr', replacement='')
summary(df)
## left lpos right
## Length:28478 Min. : 9 Length:28478
## Class :character 1st Qu.: 4520588 Class :character
## Mode :character Median :11186006 Mode :character
## Mean :11314506
## 3rd Qu.:16591262
## Max. :30427408
## rpos
## Min. : 952
## 1st Qu.: 4193922
## Median :11137496
## Mean :10963727
## 3rd Qu.:15568369
## Max. :30426592
Next we load the lengths of the chromosomes in the TAIR10 assembly.
# lengths of chromosomes
# ruby -e "require 'bio'; Bio::FastaFormat.open('TAIR10_chr_all.fas').each{ |rec| puts \"#{rec.entry_id},#{rec.seq.length}\" }"
chromlen <- read.csv('~/Dropbox/ongoing_projects/chromosome_chimeras/reference/ath_chrom_lengths.csv',
head=F, as.is=T)
names(chromlen) <- c('chrom', 'length')
print(chromlen)
## chrom length
## 1 1 30427671
## 2 2 19698289
## 3 3 23459830
## 4 4 18585056
## 5 5 26975502
## 6 mitochondria 366924
## 7 chloroplast 154478
We’re only interested in nuclear chromosomes.
# subset to real chromosomes
chr <- as.character(1:5)
df <- subset(df, left %in% chr & right %in% chr)
Annotate the aligments by whether left or right reads were within subtelomeres.
# annotate subtelomere reads
subtelo_cutoff <- 1e5
df$l_subtelo <- apply(df, 1, function(x) {
x <- as.numeric(x[1:2])
len <- chromlen$length[chromlen$chrom == x[1]]
pos <- x[2]
return((pos < subtelo_cutoff) || pos > (len - subtelo_cutoff))
})
df$r_subtelo <- apply(df, 1, function(x) {
x <- as.numeric(x[3:4])
len <- chromlen$length[chromlen$chrom == x[1]]
pos <- x[2]
return((pos < subtelo_cutoff) || pos > (len - subtelo_cutoff))
})
table(df[,5:6])
## r_subtelo
## l_subtelo FALSE TRUE
## FALSE 26753 953
## TRUE 762 10
There were 28478 read pairs showing evidence of crossovers. Only 10 read pairs show evidence of crossovers between subtelomeric regions, while 1725 pairs have at least one read aligning to a subtelomere. This is equivalent to 6% of crossovers involving the subtelomeres. To examine whether this is different than the rate expected by chance we can use a hypergeometric test.
# find the proportion of all reads that mapped to subtelomeres
total_reads <- 8713960
subtelo_reads <- 87083
p_reads_subtelo <- subtelo_reads / total_reads
# test for enrichment in subtelomeres
total_crossovers <- dim(df)[1]
p_reads_crossover <- total_crossovers/total_reads
subtelo_crossovers <- dim(subset(df, l_subtelo==T | r_subtelo==T))[1]
p <- 1.0 - phyper(subtelo_crossovers - 1, # the number of white balls drawn
subtelo_reads, # the number of white balls in the urn
total_reads - subtelo_reads, # the number of black balls in the urn
total_crossovers) # the number of balls drawn from the urn
enrichment_mutant <- (subtelo_crossovers/total_crossovers)/p_reads_subtelo
### print some statistics
sprintf("%.2f %% of all reads mapped to subtelomeres", p_reads_subtelo * 100)
## [1] "1.00 % of all reads mapped to subtelomeres"
sprintf("%.2f %% of all reads showed evidence of crossovers", p_reads_crossover * 100)
## [1] "0.33 % of all reads showed evidence of crossovers"
sprintf("evidence of crossovers in the subtelomeres was enriched %.2f-fold compared to the proportion of all reads mapping there", enrichment_mutant)
## [1] "evidence of crossovers in the subtelomeres was enriched 6.06-fold compared to the proportion of all reads mapping there"
sprintf("the hypergeometric test returned a p-value of %.4f", p)
## [1] "the hypergeometric test returned a p-value of 0.0000"
Thus, the hypergeometric test shows that detected crossovers are significantly enriched in the subtelomeric regions.
This result relies on some assumptions:
In reality, we expect to be less able to sequence repetitive regions and less able to uniquely map reads that come from repetitive regions. Because the telomeres are highly repetetive, the enrichment of mapping discovered here could be an underestimate of the true rate of telomeric crossovers.
It’s worth examining the distribution of the alignments across the chromosomes.
# plot the points
ggplot(subset(df, l_subtelo==T | r_subtelo==T), aes(x=lpos, y=rpos)) +
geom_point(size=0.8, alpha=0.8) +
facet_grid(right ~ left, scales='free', space='free') +
xlab("chromosome") +
xlab("chromosome")
The beginning of chromosome two has a notable enrichment compared to the other chromosomes. It might be worth exploring whether this is a sequencing/alignment bias or true biologial phenomenon. If it is a bias, it will also affect the hypergeometric test.
To provide a wild-type control, Illumina Hiseq paired-end genomic DNA reads from Col0 were obtained from Short Read Archive accession SRR519624. These were processed in an idental way to the mutant reads.
wt_data <- '~/Dropbox/ongoing_projects/chromosome_chimeras/reference/unique_mapping/col0_masked_uniquebridges_evidence.csv'
wt_df <- read.csv(wt_data, as.is=T)
library(ggplot2)
wt_df$left <- gsub(wt_df$left, pattern='chr', replacement='')
wt_df$right <- gsub(wt_df$right, pattern='chr', replacement='')
summary(wt_df)
## left lpos right
## Length:28105 Min. : 1629 Length:28105
## Class :character 1st Qu.: 5463034 Class :character
## Mode :character Median :12002485 Mode :character
## Mean :12067698
## 3rd Qu.:17329324
## Max. :30423084
## rpos
## Min. : 2
## 1st Qu.: 3034830
## Median : 9314801
## Mean : 9654267
## 3rd Qu.:15295124
## Max. :26971490
We’re only interested in nuclear chromosomes.
# subset to real chromosomes
chr <- as.character(1:5)
wt_df <- subset(wt_df, left %in% chr & right %in% chr)
Annotate the aligments by whether left or right reads were within subtelomeres.
# annotate subtelomere reads
subtelo_cutoff <- 1e5
wt_df$l_subtelo <- apply(wt_df, 1, function(x) {
x <- as.numeric(x[1:2])
len <- chromlen$length[chromlen$chrom == x[1]]
pos <- x[2]
return((pos < subtelo_cutoff) || pos > (len - subtelo_cutoff))
})
wt_df$r_subtelo <- apply(wt_df, 1, function(x) {
x <- as.numeric(x[3:4])
len <- chromlen$length[chromlen$chrom == x[1]]
pos <- x[2]
return((pos < subtelo_cutoff) || pos > (len - subtelo_cutoff))
})
table(wt_df[,5:6])
## r_subtelo
## l_subtelo FALSE TRUE
## FALSE 23808 178
## TRUE 270 1
There were 24257 read pairs showing evidence of crossovers.
# find the proportion of all reads that mapped to subtelomeres
wt_total_reads <- 9586161
wt_subtelo_reads <- 91860
wt_p_reads_subtelo <- wt_subtelo_reads / wt_total_reads
# test for enrichment in subtelomeres
wt_total_crossovers <- dim(wt_df)[1]
wt_p_reads_crossover <- wt_total_crossovers/wt_total_reads
wt_subtelo_crossovers <- dim(subset(wt_df, l_subtelo==T | r_subtelo==T))[1]
wt_p <- 1.0 - phyper(wt_subtelo_crossovers - 1, # the number of white balls drawn
wt_subtelo_reads, # the number of white balls in the urn
wt_total_reads - wt_subtelo_reads, # the number of black balls in the urn
wt_total_crossovers) # the number of balls drawn from the urn
enrichment_wt <- (wt_subtelo_crossovers/wt_total_crossovers)/wt_p_reads_subtelo
### print some statistics
sprintf("%.2f%% of all reads mapped to subtelomeres", wt_p_reads_subtelo * 100)
## [1] "0.96% of all reads mapped to subtelomeres"
sprintf("%.2f%% of all reads showed evidence of crossovers", wt_p_reads_crossover * 100)
## [1] "0.25% of all reads showed evidence of crossovers"
sprintf("evidence of crossovers in the subtelomeres was enriched %.2f-fold compared to the proportion of all reads mapping there", enrichment_wt)
## [1] "evidence of crossovers in the subtelomeres was enriched 1.93-fold compared to the proportion of all reads mapping there"
sprintf("the hypergeometric test returned a p-value of %.4f", wt_p)
## [1] "the hypergeometric test returned a p-value of 0.0000"
The wild-type and mutant had a similar (~1%) proportion of read pairs mapping to the subtelomeres.
0.25% of wild-type read pairs, and 0.32% of mutant read pairs, showed evidence of crossovers (read pairs uniquely mapping to different chromosomes). Without replication it is impossible to evaluate whether this difference is significant.
Both wild type and mutant strains have crossovers significantly enriched in subtelomeres. However, the fold enrichment in the mutant was ~3x as great as in wild type.
ggplot(data.frame(strain=c('wt', 'mutant'), enrichment=c(enrichment_wt, enrichment_mutant)),
aes(x=strain, y=enrichment)) +
geom_bar(position='dodge', stat='identity') +
ylab("enrichment (observed/expected)") +
theme_bw() +
ggtitle("subtelomeric enrichment of crossover evidence in mutant vs wild type")