Just quickly exploring two questions: How much heterozygousity is there in the bean samples, and how much does heterozygousity vary by bean variety?
I used Tassel to obtain a summary table of heterozygousity for each “taxa” (i.e., individual) in the vcf file. This vcf file was filtered so that each SNP present was genotyped in >33% of individuals. It was also filtered so that SNPs had to have a MAF of at least 1%. I don’t think either of these filters should change the analysis significantly… though these filters do remove about 2.5 million SNPs from the analysis.
This analysis is genome-wide, on 639127 SNPs. Here’s a basic summary of the data file. The proportion of heterozygousity in each individual varies from almost nothing to about 22%:
Taxa Taxa.Name Number.of.Sites Gametes.Missing Proportion.Missing Number.Heterozygous
Min. : 0.0 CDBN_001_1543 : 1 Min. :639127 Min. : 36766 Min. :0.02876 Min. : 4
1st Qu.:150.8 CDBN_002_2242 : 1 1st Qu.:639127 1st Qu.: 207874 1st Qu.:0.16262 1st Qu.: 17351
Median :301.5 CDBN_003_5229 : 1 Median :639127 Median : 308012 Median :0.24096 Median : 31787
Mean :301.5 CDBN_004_5501 : 1 Mean :639127 Mean : 356238 Mean :0.27869 Mean : 38999
3rd Qu.:452.2 CDBN_005_54028: 1 3rd Qu.:639127 3rd Qu.: 426188 3rd Qu.:0.33341 3rd Qu.: 55641
Max. :603.0 CDBN_006_55012: 1 Max. :639127 Max. :1267156 Max. :0.99132 Max. :134793
(Other) :598
Proportion.Heterozygous Inbreeding.Coefficient Inbreeding.Coefficient.Scaled.by.Missing
Min. :0.0002762 Inbreeding Coefficient:604 ICSBM:604
1st Qu.:0.0418125
Median :0.0659900
Mean :0.0759546
3rd Qu.:0.1030775
Max. :0.2262500
Here’s a sorted table for the proportion of heterozygous sites per taxa.
(SNPs.noncom <- SNPs.604 %>%
filter(!grepl('combined', Taxa.Name)) %>% # Selects 523 rows that were not combined data.
select(Taxa.Name, Proportion.Missing, Proportion.Heterozygous) %>%
arrange(desc(Proportion.Heterozygous)) # Arrange the table so that the greatest heterozygousity comes first.
)
The immediate problem is that there is a negative relationship between the proportion of sites missing and the proportion of heterozygous sites.
ggplot(SNPs.noncom, mapping = aes(x = Proportion.Heterozygous, y = Proportion.Missing)) +
geom_point() + geom_smooth()
If we get rid of taxa with different percentages of sites that are missing, it looks like heterozygousity is quite high in these varieties - at least 7% heterozygousity on average, but probably upwards of 10% if we account for missing data or low coverage in some taxa.
(Het.summary <- SNPs.604 %>%
filter(!grepl('combined', Taxa.Name)) %>% # Selects 523 rows that were not combined data.
arrange(desc(Proportion.Heterozygous)) %>% # Arrange the table so that the greatest heterozygousity comes first.
summarise(het.75percent.missing = mean(Proportion.Heterozygous[Proportion.Missing <= 0.75]),
het.50percent.missing = mean(Proportion.Heterozygous[Proportion.Missing <= 0.50]),
het.25percent.missing = mean(Proportion.Heterozygous[Proportion.Missing <= 0.25]),
het.12.5percent.missing = mean(Proportion.Heterozygous[Proportion.Missing <= 0.125]),
het.6.25percent.missing = mean(Proportion.Heterozygous[Proportion.Missing <= 0.0625]),
no.6.25percent.missing = sum(!is.na(Proportion.Missing[Proportion.Missing <= 0.0625])))
)
Ok, what about across positions in the genome? Are there any blocks in the genome that (for all taxa) are particularly heterozygous?
SNPs.sites <- read.table("C:/Users/ahm543/OneDrive/NSF Common Bean/CDBN Genomics/CDBN SNPs/8_Tassel/604_individual_genoSummary3.txt", sep = "\t", header = TRUE)
summary(SNP.sites)
Error in summary(SNP.sites) : object 'SNP.sites' not found
Here’s a summary of the genotypic data, averaged by site. The only really helpful thing here is that the Proportion.Heterozygous has really quite a low third quartile (0.09), which probably indicates there aren’t a huge number of problematic SNPs.
summary(SNPs.sites)
Site.Number Site.Name Chromosome Physical.Position Number.of.Taxa Ref Alt
Min. : 0 S01_10007272: 1 Min. :1.0 Min. : 195 Min. :604 A:130491 A:192829
1st Qu.:159782 S01_10007276: 1 1st Qu.:2.0 1st Qu.:11273856 1st Qu.:604 C:189176 C:126833
Median :319563 S01_10007277: 1 Median :4.0 Median :19925246 Median :604 G:188870 G:127192
Mean :319563 S01_10007281: 1 Mean :3.5 Mean :20747115 Mean :604 T:130590 T:192273
3rd Qu.:479345 S01_10007282: 1 3rd Qu.:5.0 3rd Qu.:28786009 3rd Qu.:604
Max. :639126 S01_10007291: 1 Max. :7.0 Max. :53433183 Max. :604
(Other) :639121
Major.Allele Major.Allele.Gametes Major.Allele.Proportion Major.Allele.Frequency Minor.Allele Minor.Allele.Gametes
A:126368 Min. : 191.0 Min. :0.1581 Min. :0.3031 A:197006 Min. : 4.0
C:193706 1st Qu.: 580.0 1st Qu.:0.4801 1st Qu.:0.7125 C:122293 1st Qu.: 42.0
G:193374 Median : 692.0 Median :0.5728 Median :0.8475 G:122669 Median :132.0
T:125679 Mean : 711.3 Mean :0.5889 Mean :0.8190 T:197159 Mean :158.8
3rd Qu.: 839.0 3rd Qu.:0.6945 3rd Qu.:0.9503 3rd Qu.:254.0
Max. :1190.0 Max. :0.9851 Max. :0.9900 Max. :602.0
Minor.Allele.Proportion Minor.Allele.Frequency Allele.3 Allele.3.Gametes Allele.3.Proportion
Min. :0.00331 Min. :0.01000 A : 6323 Min. : 0.000 Min. :0.000000
1st Qu.:0.03477 1st Qu.:0.04917 C : 3620 1st Qu.: 0.000 1st Qu.:0.000000
Median :0.10927 Median :0.15116 G : 3870 Median : 0.000 Median :0.000000
Mean :0.13143 Mean :0.17963 T : 6435 Mean : 1.216 Mean :0.001007
3rd Qu.:0.21026 3rd Qu.:0.28615 NA's:618879 3rd Qu.: 0.000 3rd Qu.:0.000000
Max. :0.49834 Max. :0.50000 Max. :356.000 Max. :0.294700
Allele.3.Frequency Allele.4 Allele.4.Gametes Allele.4.Proportion Allele.4.Frequency Allele.5
Min. :0.000000 A : 210 Min. : 0.00000 Min. :0.000e+00 Min. :0.000e+00 Mode:logical
1st Qu.:0.000000 C : 180 1st Qu.: 0.00000 1st Qu.:0.000e+00 1st Qu.:0.000e+00 NA's:639127
Median :0.000000 G : 178 Median : 0.00000 Median :0.000e+00 Median :0.000e+00
Mean :0.001313 T : 230 Mean : 0.02111 Mean :1.748e-05 Mean :2.149e-05
3rd Qu.:0.000000 NA's:638329 3rd Qu.: 0.00000 3rd Qu.:0.000e+00 3rd Qu.:0.000e+00
Max. :0.330860 Max. :161.00000 Max. :1.333e-01 Max. :1.952e-01
Allele.5.Gametes Allele.5.Proportion Allele.5.Frequency Allele.6 Allele.6.Gametes Allele.6.Proportion
Min. :0 Min. :0 Min. :0 Mode:logical Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 NA's:639127 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
Allele.6.Frequency Gametes.Missing Proportion.Missing Number.Heterozygous Proportion.Heterozygous
Min. :0 Min. : 0.0 Min. :0.0000 Min. : 0.00 Min. :0.00000
1st Qu.:0 1st Qu.:210.0 1st Qu.:0.1738 1st Qu.: 1.00 1st Qu.:0.00271
Median :0 Median :318.0 Median :0.2632 Median : 11.00 Median :0.02734
Mean :0 Mean :336.7 Mean :0.2787 Mean : 36.86 Mean :0.07548
3rd Qu.:0 3rd Qu.:458.0 3rd Qu.:0.3791 3rd Qu.: 45.00 3rd Qu.:0.09978
Max. :0 Max. :808.0 Max. :0.6689 Max. :586.00 Max. :0.97830
Inbreeding.Coefficient Inbreeding.Coefficient.Scaled.by.Missing
TBD:639127 TBD:639127
If you plot heterozygousity by chromosome, you see that my f***ing SNP calling did not actually finish the way I thought it did! It only got to chromosome 7! Ok, I may have to request a >>48hr job to finish calling these bastards.
If you just look at the SNPs I have, you see there are some spikes in heterozygousity in a few places that we should be concerned about (end of Chr1, 3, and 5 - I bet you anything these are R-gene clusters!).
ggplot(data = SNPs.sites, mapping = aes(x = Physical.Position, y = Proportion.Heterozygous)) +
geom_hex() + geom_smooth(color = "white") +
facet_wrap( ~ Chromosome)
Let’s zoom in on different parts of Chromosome 1 to see if these are really hotspots. They are, pretty convincingly.
Chr1 <- filter(SNPs.sites, Chromosome == 1)
Chr1.pt1 <- filter(Chr1, Physical.Position < 2000000)
Chr1.pt2 <- filter(Chr1, Physical.Position %in% c(2000000:4000000))
Chr1.pt3 <- filter(Chr1, Physical.Position %in% c(4000000:5200000))
ggplot(data = Chr1.pt1, mapping = aes(x = Physical.Position, y = Proportion.Heterozygous)) +
geom_hex(bins = 200)
ggplot(data = Chr1.pt2, mapping = aes(x = Physical.Position, y = Proportion.Heterozygous)) +
geom_hex(bins = 200)
ggplot(data = Chr1.pt3, mapping = aes(x = Physical.Position, y = Proportion.Heterozygous)) +
geom_hex(bins = 120)
Do these ‘heterozygousity hotspots’ map to frequency of major or minor alleles, or the proportion of missing alleles? There seems to perhaps be a relationship with all three. This makes me wonder if this pattern isn’t a heterozygousity problem so much as it is a reduced representation problem. What do you think??
ggplot(data = Chr1.pt1, mapping = aes(x = Physical.Position, y = Proportion.Heterozygous)) +
geom_hex(mapping = aes(x = Physical.Position, y = Major.Allele.Frequency), bins = 120) + geom_line(color = "red")
ggplot(data = Chr1.pt1, mapping = aes(x = Physical.Position, y = Proportion.Heterozygous)) +
geom_hex(mapping = aes(x = Physical.Position, y = Minor.Allele.Frequency), bins = 120) + geom_line(color = "red")
ggplot(data = Chr1.pt1, mapping = aes(x = Physical.Position, y = Proportion.Heterozygous)) +
geom_hex(mapping = aes(x = Physical.Position, y = Proportion.Missing), bins = 120) + geom_line(color = "red")