Comparisons of accession-matched Bakker and Rpp8 subsets

Here are some histograms of popgen summary statistics in the Bakker set, with colors to indicate Rpp8 subsets, singleton R-genes, and Rpp13, which is frequently an outlier among ‘singleton’ R-genes.

The plots cover 11 subsets drawn from the 20 Rpp8-homolog LRRs whose accessions are also in the Bakker set. I compare these to the 7 to 11 LRRs, from the same accessions, that were sequenced for each of the 27 singleton R-genes in the Bakker set.

I subset the 20 LRRs to get Rpp8 subsets of 7 to 11 accessions each, in similar proportions to the Bakker subsets. This shows the range of summary statistic values that Rpp8 subsets can fall into, for a fair comparison.
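As a rough illustration of this kind of subsetting (not necessarily how the subsets here were chosen), the sketch below draws 11 random subsets of 7 to 11 accessions from 20 placeholder names. The accession names, the seed, and the uniform draw of subset sizes are all assumptions made for the example.

```r
# Hypothetical sketch: draw 11 subsets of 7-11 accessions from 20 LRR sequences.
# Accession names, seed, and the uniform choice of sizes are placeholders.
set.seed(1)
accessions <- sprintf("acc_%02d", 1:20)
n_subsets <- 11

# One size per subset, each between 7 and 11 (sizes may repeat).
subset_sizes <- sample(7:11, size = n_subsets, replace = TRUE)

# For each size, sample that many accessions without replacement.
rpp8_subsets <- lapply(subset_sizes, function(k) sort(sample(accessions, k)))

# Tabulate how many subsets of each size were drawn.
table(lengths(rpp8_subsets))
```

To match the Bakker proportions more closely, the sizes could be sampled from the observed Bakker subset sizes instead of from `7:11`.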

To color the plots, I made a column, Gene, that records whether each R-gene is Rpp8, a singleton, or Rpp13.

library(XLConnect)
library(tidyverse)
wb2 <- loadWorkbook("04 Comparisons to Bakker R genes/Bakker sequences in Rpp8 set/838bp_Comp/Subset_838bp_comparison_11Rpp8_27Bakker.xlsx")
substats <- readWorksheet(wb2, sheet = getSheets(wb2))
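A category column like the one described above could also be built in R rather than in the spreadsheet. This is a minimal sketch on a toy table: the `Datafile` names are made up (apart from At3g46530, the Rpp13 locus), and matching on file names is an assumption about how the rows can be told apart.

```r
library(dplyr)

# Toy stand-in for the statistics table; these file names are placeholders.
toy <- data.frame(
  Datafile = c("Rpp8_subset01.fas", "At3g46530_9seq.fas", "AtXgXXXXX_8seq.fas"),
  stringsAsFactors = FALSE
)

# Classify each row by file name: Rpp8 subsets, Rpp13, or any other singleton.
toy <- toy %>%
  mutate(Gene = case_when(
    grepl("Rpp8", Datafile) ~ "Rpp8",
    grepl("At3g46530", Datafile) ~ "Rpp13",
    TRUE ~ "Singleton"
  ))

toy
```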

Here are histograms of the summary statistics I am most interested in.

substats %>%
  ggplot(mapping = aes(x = S)) +
  geom_histogram(binwidth = 15, aes(fill = Gene)) +
  theme_bw() +
  labs(x = "S (Segregating Sites)", y = "Count")

substats %>%
  ggplot(mapping = aes(x = S.per.500bp)) +
  geom_histogram(binwidth = 10, aes(fill = Gene)) +
  theme_bw() +
  labs(x = "S per 500 bp", y = "Count")

substats %>%
  ggplot(mapping = aes(x = FracUniqueHaplotypes)) +
  geom_histogram(binwidth = .08, aes(fill = Gene)) +
  theme_bw() +
  labs(x = "Fraction of Haplotypes that are Unique", y = "Count")

substats %>%
  ggplot(mapping = aes(x = Pi)) +
  geom_histogram(binwidth = 0.007, aes(fill = Gene)) +
  theme_bw() +
  labs(x = "Nucleotide Diversity (Pi)", y = "Count")

substats %>%
  ggplot(mapping = aes(x = ThetaWattNuc)) +
  geom_histogram(binwidth = 5.6, aes(fill = Gene)) +
  theme_bw() +
  labs(x = "Watterson's Theta", y = "Count")
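For readers less familiar with these statistics, here is a small sketch of what S, S per 500 bp, the fraction of distinct haplotypes, and Pi measure, computed by hand on a made-up four-sequence alignment. The values plotted above come from the spreadsheet, not from this code, and treating "unique haplotypes" as distinct haplotypes divided by sample size is an assumption.

```r
# Toy alignment: four made-up sequences of equal length (8 bp).
seqs <- c("ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT")

# Character matrix: one row per sequence, one column per site.
aln <- do.call(rbind, strsplit(seqs, ""))
n <- nrow(aln)
L <- ncol(aln)

# S: number of sites with more than one distinct base.
S <- sum(apply(aln, 2, function(site) length(unique(site)) > 1))

# S rescaled to a 500 bp window so loci of different length are comparable.
S_per_500bp <- S / L * 500

# Fraction of distinct haplotypes among the sampled sequences (assumed definition).
frac_distinct_hap <- length(unique(seqs)) / n

# Pi: mean number of pairwise differences, divided by alignment length.
pairs <- combn(n, 2)
diffs <- apply(pairs, 2, function(p) sum(aln[p[1], ] != aln[p[2], ]))
pi_per_site <- mean(diffs) / L

c(S = S, S_per_500bp = S_per_500bp,
  frac_distinct_hap = frac_distinct_hap, Pi = pi_per_site)
```

For this toy alignment, S is 2, the fraction of distinct haplotypes is 0.75, and Pi is 0.125.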

Comparisons to the Full Bakker Set

These look great, but the subsets are not very large. So what if we compare the whole Bakker sets to the 50 Rpp8-homolog LRRs I have sequences for, even though the accessions aren’t identical and the Bakker paper sequenced more LRRs?

Histograms of popgen summary statistics for the Bakker set of singleton R-genes follow, with labels and fill color to show where Rpp8 falls relative to the genes in this set.

Again, this is for the 50 LRRs that I have for Rpp8, compared to the 56 to 92 LRRs sequenced for each of the 27 singleton R-genes in the Bakker set.
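Because the sample sizes differ here, it helps to recall that S tends to grow with the number of sequences, while Watterson's theta divides S by a_n = 1 + 1/2 + ... + 1/(n - 1). The sketch below is a generic textbook calculation, not taken from the spreadsheet; `a_n` and `theta_w` are helper names introduced only for this example.

```r
# Harmonic correction a_n = 1 + 1/2 + ... + 1/(n - 1) for n sampled sequences.
a_n <- function(n) sum(1 / seq_len(n - 1))

# Watterson's estimator: per sequence, or per site when a length L is given.
theta_w <- function(S, n, L = NULL) {
  theta <- S / a_n(n)
  if (is.null(L)) theta else theta / L
}

# Correction factors for the sample sizes mentioned above.
round(sapply(c(50, 56, 92), a_n), 2)
```

The factors come out at roughly 4.48, 4.59, and 5.09, so the same S gives a theta about 12% lower at n = 92 than at n = 50.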

wb <- loadWorkbook("04 Comparisons to Bakker R genes/All Bakker sequences/All_Bakker_sequences_to_50_Rpp8.out.xlsx")
lst = readWorksheet(wb, sheet = getSheets(wb))
divstats <- as_tibble(lst$All_Bakker_sequences_to_50_Rpp8)

Below I make data subsets for the Bakker set and the Rpp8 statistics so I can refer to them separately in ggplot.

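# Bakker singleton R-genes only (both Rpp8 rows removed).
bakker <- divstats %>%
  filter(!(Datafile == "50LRR_Rpp8_1161bp.fas" | Datafile == "50LRR_Rpp8_838bp.fas"))

# Rpp8 statistics from the 838 bp file.
avgsites <- divstats %>%
  filter(Datafile == "50LRR_Rpp8_838bp.fas")

# Rpp8 statistics from the 1161 bp file.
netsites <- divstats %>%
  filter(Datafile == "50LRR_Rpp8_1161bp.fas")

# Bakker genes plus the 1161 bp Rpp8 row; this is the set plotted below.
bakker1 <- divstats %>%
  filter(Datafile != "50LRR_Rpp8_838bp.fas")

# Rpp13 (At3g46530), labeled separately in the plots.
rpp13 <- divstats %>%
  filter(Datafile == "At3g46530_87seq.fas")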
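Since the overlays depend on each highlight subset being a single row, a quick check along these lines may be worth running. This sketch uses a toy table: the three file names are the ones filtered on above, `AtXgXXXXX_60seq.fas` is a placeholder, and the `toy_` object names are hypothetical.

```r
library(dplyr)

# Toy table with the file names used above plus one placeholder singleton.
toy_stats <- data.frame(
  Datafile = c("50LRR_Rpp8_838bp.fas", "50LRR_Rpp8_1161bp.fas",
               "At3g46530_87seq.fas", "AtXgXXXXX_60seq.fas"),
  stringsAsFactors = FALSE
)

toy_netsites <- toy_stats %>% filter(Datafile == "50LRR_Rpp8_1161bp.fas")
toy_rpp13 <- toy_stats %>% filter(Datafile == "At3g46530_87seq.fas")
toy_bakker1 <- toy_stats %>% filter(Datafile != "50LRR_Rpp8_838bp.fas")

# Each highlight subset should be exactly one row, and the background set
# should keep the 1161 bp Rpp8 row while dropping the 838 bp one.
stopifnot(
  nrow(toy_netsites) == 1,
  nrow(toy_rpp13) == 1,
  "50LRR_Rpp8_1161bp.fas" %in% toy_bakker1$Datafile,
  !("50LRR_Rpp8_838bp.fas" %in% toy_bakker1$Datafile)
)
```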

Now I generate labeled histograms of the summary statistics I am most interested in.

bakker1 %>%
  ggplot(mapping = aes(x = S)) +
  geom_histogram(binwidth = 15) +
  geom_histogram(data = netsites, fill = "red", binwidth = 15) +
  geom_label(data = netsites, y = 1.75, label = "Rpp8") +
  geom_label(data = rpp13, y = 1.75, label = "Rpp13") +
  theme_bw() +
  labs(x = "S (Segregating Sites)", y = "Count")

bakker1 %>%
  ggplot(mapping = aes(x = S.per.500bp)) +
  geom_histogram(binwidth = 10) +
  geom_histogram(data = netsites, fill = "red", binwidth = 10) +
  geom_label(data = netsites, y = 1.75, label = "Rpp8") +
  geom_label(data = rpp13, y = 1.75, label = "Rpp13") +
  theme_bw() +
  labs(x = "Segregating Sites per 500 bp", y = "Count")

bakker1 %>%
  ggplot(mapping = aes(x = FracUniqHap)) +
  geom_histogram(binwidth = .05) +
  geom_histogram(data = netsites, fill = "red", binwidth = .05) +
  geom_label(data = netsites, y = 1.75, label = "Rpp8") +
  geom_label(data = rpp13, y = 3.75, label = "Rpp13") +
  theme_bw() +
  labs(x = "Fraction of Haplotypes that are Unique", y = "Count")

bakker1 %>%
  ggplot(mapping = aes(x = Pi)) +
  geom_histogram(binwidth = .006) +
  geom_histogram(data = netsites, fill = "red", binwidth = .006) +
  geom_label(data = netsites, y = 1.75, label = "Rpp8") +
  geom_label(data = rpp13, y = 1.75, label = "Rpp13") +
  theme_bw() +
  labs(x = "Nucleotide Diversity (Pi)", y = "Count")

bakker1 %>%
  ggplot(mapping = aes(x = ThetaWattNuc)) +
  geom_histogram(binwidth = 3) +
  geom_histogram(data = netsites, fill = "red", binwidth = 3) +
  geom_label(data = netsites, y = 1.75, label = "Rpp8") +
  geom_label(data = rpp13, y = 1.75, label = "Rpp13") +
  theme_bw() +
  labs(x = "Watterson's Theta", y = "Count")
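The plots above repeat one pattern with a different statistic, bin width, and label height each time. One way to cut the repetition is a small helper like the sketch below. `labeled_hist` and its arguments are hypothetical names, and the toy values are made up just to show the call.

```r
library(ggplot2)

# Hypothetical helper: grey histogram of `stat` in `background`, with the rows
# of `highlight` overlaid in red and text labels for Rpp8 and Rpp13.
labeled_hist <- function(background, highlight, other, stat, binwidth,
                         x_lab = stat, y_highlight = 1.75, y_other = 1.75) {
  ggplot(background, aes(x = .data[[stat]])) +
    geom_histogram(binwidth = binwidth) +
    geom_histogram(data = highlight, fill = "red", binwidth = binwidth) +
    geom_label(data = highlight, y = y_highlight, label = "Rpp8") +
    geom_label(data = other, y = y_other, label = "Rpp13") +
    theme_bw() +
    labs(x = x_lab, y = "Count")
}

# Toy data with made-up values; row 6 stands in for Rpp8, row 5 for Rpp13.
toy <- data.frame(S = c(10, 25, 40, 55, 150, 210))
p <- labeled_hist(toy, highlight = toy[6, , drop = FALSE],
                  other = toy[5, , drop = FALSE],
                  stat = "S", binwidth = 15, x_lab = "S (Segregating Sites)")
```

With the real data, a call such as `labeled_hist(bakker1, netsites, rpp13, "Pi", 0.006, "Nucleotide Diversity (Pi)")` would stand in for one of the blocks above.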

DnaSP provided some other summary statistics too, so I plot them here as well, even though I think they are less informative.

bakker1 %>%
  ggplot(mapping = aes(x = Eta)) +
  geom_histogram(binwidth = 20) +
  geom_histogram(data = netsites, fill = "red", binwidth = 20) +
  geom_label(data = netsites, y = 1.75, label = "Rpp8") +
  geom_label(data = rpp13, y = 1.75, label = "Rpp13") +
  theme_bw() +
  labs(x = "Eta (Total Number of Mutations)", y = "Count")

bakker1 %>%
  ggplot(mapping = aes(x = Hap)) +
  geom_histogram(binwidth = 5) +
  geom_histogram(data = netsites, fill = "red", binwidth = 5) +
  geom_label(data = netsites, y = 1.75, label = "Rpp8") +
  geom_label(data = rpp13, y = 3.75, label = "Rpp13") +
  theme_bw() +
  labs(x = "Number of Haplotypes", y = "Count")

bakker1 %>%
  ggplot(mapping = aes(x = Hd)) +
  geom_histogram(binwidth = .05) +
  geom_histogram(data = netsites, fill = "red", binwidth = .05) +
  geom_label(data = netsites, y = 1.25, label = "Rpp8") +
  geom_label(data = rpp13, y = 3.25, label = "Rpp13") +
  theme_bw() +
  labs(x = "Haplotype Diversity (Hd)", y = "Count")