The goal of this analysis was to assess the genetic affinities of maize accessions labeled as “Guam”, in order to infer their relationship to globally distributed maize lineages.
Specifically, we aimed to:
By examining clustering patterns and nearest-neighbor relationships, we sought to simply determine whether the Guam accessions form a single genetic group or multiple distinct clusters, and which reference lineages they most closely resemble.
Principal Component Analysis (PCA) was used to summarize genome-wide genetic variation and visualize population structure among maize accessions. In this specific context, PCA was applied to the genotype matrix consisting of quality-controlled SNPs from chromosome one, transforming correlated SNP variables into orthogonal axes (a.k.a. principal components) that explain decreasing proportions of genetic variance.
Each sample was assigned scores on the first two principal components (PC1 and PC2), which capture the dominant (biggest) patterns of genetic variation across samples. PCA was used to assess whether the Guam maize accessions cluster together or form multiple genetic subgroups, and to evaluate their relative proximity to reference (“anchor”) lines in multivariate genetic space.
Identity-By-State (IBS) analysis was conducted to quantify pairwise genetic similarity between each Guam maize accession and each anchor line. While PCA provides a global, low-dimensional representation of genetic structure, IBS offers a direct numerical measure of similarity.
IBS was calculated as the proportion of alleles shared between two samples across the set of quality-controlled SNPs, independent of inferred ancestry. Values closer to 1 indicate higher genetic similarity, whereas lower values indicate greater divergence. For each Guam accession, the anchor line with the highest IBS value was identified as its closest genetic reference.
The PCA and IBS analyses reflect genetic similarity, not geographic origin. PCA positions samples based on shared SNP variation, and IBS measures allele-level similarity, but neither method directly incorporates information about country, region, or historical movement.
To contextualize genetic similarities, anchor metadata, including known collection history, breeding background, and associated literature, were examined after identifying each Guam accession’s closest genetic matches. Any inferences regarding the historical introduction or diffusion pathway of Guam maize were therefore based on the combination of genetic clustering patterns and external reference information, and should be interpreted as probabilistic inferences rather than direct measurements of geographic origin.
Together, PCA and IBS provide complementary perspectives on genetic structure, allowing inference of genetic affinities relevant to historical maize introduction pathways.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
setwd("/Users/michaelcajigal/GROW Internship Projects/Corn Project/")
library(vcfR)
library(adegenet)
library(LEA)
library(ggplot2)
library(R.utils)
library(SNPRelate)
library(gdsfmt)
library(dplyr)
Raw data. Only raw paired-end reads in FASTQ
(.fq.gz) format were collected from multiple project
folders, then consolidated and renamed into a single
trimmed_reads/ directory with one R1/R2 file pair per
sample (i.e., TAMA_46_trimmed_R1.fq.gz,
TAMA_46_trimmed_R2.fq.gz).
Sample availability. An expanded search of publicly available sequencing datasets resulted in 208 maize accessions with corresponding paired-end FASTQ read files. These accessions, which include the Guam maize samples analyzed here, were retained for downstream variant calling and comparative analyses. Accessions without raw sequencing data were excluded.
Prior to downstream analyses, variant call format (VCF) files were generated for all maize samples with available sequencing data. All variant calling and VCF processing were performed on macOS using command-line tools.
These were the brief steps taken to generate those VCF files:
mpileup, call, view,
merge) together with htslib utilities (bgzip, tabix) to
compress and index the resulting VCF files.Again, a comparative analyses were conducted based on PCA and identity-by-state (IBS) analyses:
vcf_file <- "shared_sites.vcf.gz"
gds_file <- "shared_sites.gds"
snpgdsVCF2GDS(
vcf_file,
gds_file,
method = "biallelic.only"
)
genofile <- snpgdsOpen(gds_file)
# Summary of GDS file
snpgdsSummary(genofile)
samples <- read.gdsn(index.gdsn(genofile, "sample.id"))
length(samples)
head(samples)
# Missing assessment per sample
samp_miss <- snpgdsSampMissRate(genofile)
summary(samp_miss)
samp_miss
samples <- read.gdsn(index.gdsn(genofile, "sample.id"))
grep("B73", samples, ignore.case=TRUE, value=TRUE)
We build a list of high-quality SNPs by keeping only variants with minor
allele frequency ≥ 0.05 and
missing genotypes in ≤ 10% of samples.
This reduces rare and poorly genotyped sites before running PCA and IBS.
# Building SNP list
snps_qc <- snpgdsSelectSNP(genofile,
maf=0.05, # Drops rare SNPs
missing.rate=0.1) # Drop SNPs missing in >10% samples
# Counting quantity of SNPs
length(snps_qc)
pca <- snpgdsPCA(genofile, num.thread = 1)
pc_df <- data.frame(
sample = pca$sample.id,
PC1 = pca$eigenvect[, 1],
PC2 = pca$eigenvect[, 2],
stringsAsFactors = FALSE
)
# Clean labels (optional)
pc_df$label <- gsub("sorted_bam_all/|sorted_bam/|_sorted\\.bam", "", pc_df$sample)
# Make sure you defined is_guam first
pc_df$is_guam <- grepl("^(GUAM)", pc_df$label)
# % variance explained for axis labels
eigen <- pca$eigenval[!is.na(pca$eigenval)]
pve <- eigen / sum(eigen) * 100
pve[1] # PC1 %
pve[2] # PC2 %
pve[3] # PC3 %
pve[4] # PC4 %
sum(pve[1:2])
sum(pve[1:4])
# Set plot limits to include ALL points (add a small padding)
xlim <- range(pc_df$PC1, na.rm = TRUE)
ylim <- range(pc_df$PC2, na.rm = TRUE)
pad_x <- diff(xlim) * 0.05
pad_y <- diff(ylim) * 0.05
xlim <- xlim + c(-pad_x, pad_x)
ylim <- ylim + c(-pad_y, pad_y)
set.seed(123)
PC1_jit <- jitter(pc_df$PC1, amount = diff(xlim) * 0.005)
PC2_jit <- jitter(pc_df$PC2, amount = diff(ylim) * 0.005)
plot(PC1_jit[!pc_df$is_guam], PC2_jit[!pc_df$is_guam],
col = rgb(0.5, 0.5, 0.5, 0.6),
pch = 19,
xlim = xlim,
ylim = ylim,
xlab = sprintf("PC1 (%.1f%%)", pve[1]),
ylab = sprintf("PC2 (%.1f%%)", pve[2]),
main = "PCA of maize accessions (Guam highlighted)")
# Guam-only limits
x_zoom <- range(pc_df$PC1[pc_df$is_guam], na.rm = TRUE)
y_zoom <- range(pc_df$PC2[pc_df$is_guam], na.rm = TRUE)
# --- robust padding so the rectangle is visible ---
# use overall PC spread as fallback if Guam spread is tiny
dx <- diff(x_zoom)
dy <- diff(y_zoom)
padx <- if (dx == 0) diff(xlim) * 0.03 else dx * 0.25
pady <- if (dy == 0) diff(ylim) * 0.03 else dy * 0.25
x_zoom2 <- x_zoom + c(-padx, padx)
y_zoom2 <- y_zoom + c(-pady, pady)
# draw rectangle ON TOP so it doesn't get hidden
rect(
xleft = x_zoom2[1],
ybottom = y_zoom2[1],
xright = x_zoom2[2],
ytop = y_zoom2[2],
border = "red",
col = rgb(1, 0, 0, 0.12),
lwd = 3
)
points(PC1_jit[pc_df$is_guam], PC2_jit[pc_df$is_guam],
pch = 21, # circle with border
bg = "red", # fill
col = "black", # outline
cex = 1.5,
lwd = 1.5)
text(PC1_jit, PC2_jit,
labels = pc_df$label,
col = ifelse(pc_df$is_guam, "red", rgb(0.4,0.4,0.4,0.2)),
cex = ifelse(pc_df$is_guam, 0.7, 0.4),
pos = 3)
legend("topright",
legend = c("Guam accessions", "Other accessions"),
col = c("red", "gray60"),
pch = 19,
bty = "n")
# centroid-based zoom (REQUIRED when diff = 0)
cx <- mean(pc_df$PC1[pc_df$is_guam])
cy <- mean(pc_df$PC2[pc_df$is_guam])
dx <- 0.006
dy <- 0.004
x_zoom <- cx + c(-dx, dx)
y_zoom <- cy + c(-dy, dy)
plot(PC1_jit, PC2_jit,
col = ifelse(pc_df$is_guam, "red", rgb(0.5,0.5,0.5,0.4)),
pch = 19,
xlim = x_zoom,
ylim = y_zoom,
xlab = sprintf("PC1 (%.1f%%)", pve[1]),
ylab = sprintf("PC2 (%.1f%%)", pve[2]),
main = "Zoomed PCA around Guam cluster")
# label only points actually visible
in_zoom <- PC1_jit >= x_zoom[1] & PC1_jit <= x_zoom[2] &
PC2_jit >= y_zoom[1] & PC2_jit <= y_zoom[2]
text(PC1_jit[in_zoom], PC2_jit[in_zoom],
labels = pc_df$label[in_zoom],
col = ifelse(pc_df$is_guam[in_zoom], "red", rgb(0.4,0.4,0.4,0.7)),
cex = ifelse(pc_df$is_guam[in_zoom], 0.8, 0.5),
pos = 3)
points(PC1_jit[pc_df$is_guam], PC2_jit[pc_df$is_guam],
pch = 21, bg = "red", col = "black", cex = 1.6, lwd = 1.5)
PCA Interpretation. The first two principle components explained 56.3% of the total genetic variation (PC1 = 33%, PC2 = 23.3%), and the first four PCs together explained 66.2%. For visualization, only the first two principal components (PC1 and PC2) were retained and plotted, since they capture the major axis of genetic structure.
In the zoomed PCA view, GUAM1 and GUAM2 clustered tightly together and were embedded within a broader group of American maize landraces. The Guam accessions did not form a distinct cluster, indicating close genetic similarity to multiple Central and South American accessions. The closest neighboring accessions in PCA space included landraces from Guatemala, Honduras, Ecuador, Paraguay, and Uruguay, suggesting that Guam white corn is genetically consistent with New World field corn diversity.
snp <- read.gdsn(index.gdsn(genofile, "snp.id"))
length(snp)
head(snp)
ibs <- snpgdsIBS(genofile,
snp.id = snp,
num.thread = 2,
autosome.only = FALSE,
remove.monosnp = TRUE,
missing.rate = NaN)
ibs_mat <- ibs$ibs
ibs_ids <- ibs$sample.id
dim(ibs_mat)
ibs_label <- gsub("sorted_bam_all/|sorted_bam/|_sorted\\.bam", "", ibs_ids)
guam_idx <- which(ibs_label %in% c("GUAM1", "GUAM2"))
#guam_idx
#ibs_label[guam_idx]
get_top_neighbors <- function(i, top_n = 10) {
v <- ibs_mat[i, ]
v[i] <- NA # remove self
ord <- order(v, decreasing = TRUE, na.last = NA)
data.frame(
focal = ibs_label[i],
neighbor = ibs_label[ord][1:top_n],
ibs = v[ord][1:top_n]
)
}
top_guam1 <- get_top_neighbors(guam_idx[1], top_n = 15)
top_guam2 <- get_top_neighbors(guam_idx[2], top_n = 15)
top_guam1
## focal neighbor ibs
## 1 GUAM1 GUAM2 0.9486480
## 2 GUAM1 GUAT_804 0.9346686
## 3 GUAM1 VENE_909 0.9342561
## 4 GUAM1 GUAT_247 0.9340749
## 5 GUAM1 PARA_330 0.9332922
## 6 GUAM1 ECUA_937 0.9327574
## 7 GUAM1 ECUA_473 0.9317255
## 8 GUAM1 GUER_188 0.9313625
## 9 GUAM1 NICA_17 0.9312818
## 10 GUAM1 NICA_74 0.9311216
## 11 GUAM1 VENE_572 0.9309133
## 12 GUAM1 GUAT_688 0.9308256
## 13 GUAM1 BRAZ_6 0.9305050
## 14 GUAM1 GUAT_118 0.9302802
## 15 GUAM1 BOLI_83 0.9299056
top_guam2
## focal neighbor ibs
## 1 GUAM2 SD-2 0.9486521
## 2 GUAM2 GUAM1 0.9486480
## 3 GUAM2 GUAT_325 0.9427273
## 4 GUAM2 URUG_752 0.9418161
## 5 GUAM2 DR-0 0.9417476
## 6 GUAM2 NICA_17 0.9412328
## 7 GUAM2 SONO_39 0.9404419
## 8 GUAM2 GUAT_208 0.9393992
## 9 GUAM2 GUAT_567 0.9392966
## 10 GUAM2 GUAT_688 0.9384762
## 11 GUAM2 ECUA_473 0.9382571
## 12 GUAM2 GUAT_51 0.9380007
## 13 GUAM2 ECUA_937 0.9377880
## 14 GUAM2 PARA_136 0.9377634
## 15 GUAM2 BRAZ_6 0.9377580
neighbors <- unique(c(top_guam1$neighbor, top_guam2$neighbor, "GUAM1", "GUAM2"))
sub_idx <- which(ibs_label %in% neighbors)
# hierarchical clustering on IBS distance
d <- as.dist(1 - ibs_mat[sub_idx, sub_idx])
hc <- hclust(d, method = "average")
heatmap(ibs_mat[sub_idx, sub_idx],
Rowv = as.dendrogram(hc),
Colv = as.dendrogram(hc),
scale = "none",
labRow = ibs_label[sub_idx],
labCol = ibs_label[sub_idx],
main = "IBS similarity (Guam + nearest neighbors)")
IBS Interpretation. Identity-by-state (IBS) analysis supported the PCA results by quantifying pairwise genetic similarity between Guam accessions and all other samples. GUAM1 and GUAM2 showed very high similarity to each other (IBS = 0.9486), indicating close relatedness. For GUAM1, the closest non-Guam accessions included GUAT_804 (IBS = 0.9347), VENE_909 (0.9343), GUAT_247 (0.9341), PARA_330 (0.9333), and ECUA_937 (0.9328). For GUAM2, the closest accessions included GUAT_325 (0.9427), URUG_752 (0.9418), DR-0 (0.9417), NICA_17 (0.9412), and SONO_39 (0.9404). Overall, the nearest IBS neighbors to the Guam accessions were predominantly Central and South American landraces, consistent with Guam maize clustering within New World field corn diversity.
In the IBS heatmap, GUAM2 clustered closely with accessions from Guatemala, Ecuador, Brazil, Uruguay, and Venezuela, forming a contiguous high-similarity block. The Guam accession did not form a distinct branch but instead nested within a broader American landrace cluster, consistent with the PCA results.
Using genome-wide SNP data, this study examined the genetic relationships between Guam white field corn and a diverse panel of maize landraces. Principal Component Analysis and Identity-by-State analyses consistently showed that Guam accessions cluster within New World maize diversity rather than forming a distinct genetic lineage. The two Guam samples exhibited high genetic similarity to each other, suggesting limited within-Guam divergence, and shared closest affinity with multiple Central and South American landraces, including accessions from Guatemala, Ecuador, Venezuela, Brazil, Uruguay, and Nicaragua. These findings are consistent with historical accounts of maize introduction to Guam from the Americas and suggest that Guam white corn represents a locally maintained derivative of American field corn rather than an independently evolved lineage.