1 Objective

The goal of this analysis was to assess the genetic affinities of maize accessions labeled as “Guam”, in order to infer their relationship to globally distributed maize lineages.

Specifically, we aimed to:

  1. compare the genetic similarity of the Guam maize accessions to the broader set of maize landraces included in this study.

By examining clustering patterns and nearest-neighbor relationships, we sought to simply determine whether the Guam accessions form a single genetic group or multiple distinct clusters, and which reference lineages they most closely resemble.

2 Methods

2.1 Principle Component Analysis (PCA)

Principal Component Analysis (PCA) was used to summarize genome-wide genetic variation and visualize population structure among maize accessions. In this specific context, PCA was applied to the genotype matrix consisting of quality-controlled SNPs from chromosome one, transforming correlated SNP variables into orthogonal axes (a.k.a. principal components) that explain decreasing proportions of genetic variance.

Each sample was assigned scores on the first two principal components (PC1 and PC2), which capture the dominant (biggest) patterns of genetic variation across samples. PCA was used to assess whether the Guam maize accessions cluster together or form multiple genetic subgroups, and to evaluate their relative proximity to reference (“anchor”) lines in multivariate genetic space.

2.2 Identity-By-State (IBS)

Identity-By-State (IBS) analysis was conducted to quantify pairwise genetic similarity between each Guam maize accession and each anchor line. While PCA provides a global, low-dimensional representation of genetic structure, IBS offers a direct numerical measure of similarity.

IBS was calculated as the proportion of alleles shared between two samples across the set of quality-controlled SNPs, independent of inferred ancestry. Values closer to 1 indicate higher genetic similarity, whereas lower values indicate greater divergence. For each Guam accession, the anchor line with the highest IBS value was identified as its closest genetic reference.

2.3 Interpreting PCA and IBS vs. Geographical/Regional Information

The PCA and IBS analyses reflect genetic similarity, not geographic origin. PCA positions samples based on shared SNP variation, and IBS measures allele-level similarity, but neither method directly incorporates information about country, region, or historical movement.

To contextualize genetic similarities, anchor metadata, including known collection history, breeding background, and associated literature, were examined after identifying each Guam accession’s closest genetic matches. Any inferences regarding the historical introduction or diffusion pathway of Guam maize were therefore based on the combination of genetic clustering patterns and external reference information, and should be interpreted as probabilistic inferences rather than direct measurements of geographic origin.

Together, PCA and IBS provide complementary perspectives on genetic structure, allowing inference of genetic affinities relevant to historical maize introduction pathways.

3 Directory & Package Set-Ups

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
setwd("/Users/michaelcajigal/GROW Internship Projects/Corn Project/")
library(vcfR)
library(adegenet)
library(LEA)
library(ggplot2)
library(R.utils)
library(SNPRelate)
library(gdsfmt)
library(dplyr)

4 Pre-Analysis & Data Set-Ups

Raw data. Only raw paired-end reads in FASTQ (.fq.gz) format were collected from multiple project folders, then consolidated and renamed into a single trimmed_reads/ directory with one R1/R2 file pair per sample (i.e., TAMA_46_trimmed_R1.fq.gz, TAMA_46_trimmed_R2.fq.gz).

Sample availability. An expanded search of publicly available sequencing datasets resulted in 208 maize accessions with corresponding paired-end FASTQ read files. These accessions, which include the Guam maize samples analyzed here, were retained for downstream variant calling and comparative analyses. Accessions without raw sequencing data were excluded.

4.1 Generate Variant Calls for Samples

Prior to downstream analyses, variant call format (VCF) files were generated for all maize samples with available sequencing data. All variant calling and VCF processing were performed on macOS using command-line tools.

These were the brief steps taken to generate those VCF files:

  1. Reads were aligned to the Zea mays reference genome sourced from MaizeGDB (https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/) with BWA-MEM, and alignment files were processed with SAMtools.
  2. Variants were then called and filtered using bcftools (including mpileup, call, view, merge) together with htslib utilities (bgzip, tabix) to compress and index the resulting VCF files.
  3. To enable comparative population genetic analyses, individual sample VCFs were then merged using bcftools merge to construct a multi-sample VCF. A shared set of SNP sites were defined by retaining only quality-controlled SNPs present across samples, ensuring that all individuals were compared at homologous genomic positions. The resulting VCF files were compressed and indexed using bgzip and tabix (htslib).
  4. This shared SNP dataset served as the basis for subsequent PCA and identity-by-state (IBS) analyses.

5 Comparative Analysis

Again, a comparative analyses were conducted based on PCA and identity-by-state (IBS) analyses:

  1. Guam accessions compared to the full maize accession set, to assess internal genetic structure, evaluate whether Guam samples cluster together or form multiple subgroups, and examine their placement within the broader diversity of the dataset.

5.1 Genetic Structure of Guam Maize Within the Full Accession Set

5.1.1 Data Reads

vcf_file <- "shared_sites.vcf.gz"
gds_file <- "shared_sites.gds"

5.1.2 Convert Unknown VCF to GDS and Check Missingness

snpgdsVCF2GDS(
  vcf_file,
  gds_file,
  method = "biallelic.only"
)

genofile <- snpgdsOpen(gds_file)

# Summary of GDS file
snpgdsSummary(genofile)

samples <- read.gdsn(index.gdsn(genofile, "sample.id"))
length(samples)
head(samples)

# Missing assessment per sample
samp_miss <- snpgdsSampMissRate(genofile)
summary(samp_miss)
samp_miss

samples <- read.gdsn(index.gdsn(genofile, "sample.id"))
grep("B73", samples, ignore.case=TRUE, value=TRUE)

5.1.3 Quality Controled (QC) Single Nucleotide Polymorphism (SNP) Build

We build a list of high-quality SNPs by keeping only variants with minor

  1. allele frequency ≥ 0.05 and

  2. missing genotypes in ≤ 10% of samples.

This reduces rare and poorly genotyped sites before running PCA and IBS.

# Building SNP list
snps_qc <- snpgdsSelectSNP(genofile,
                           maf=0.05,         # Drops rare SNPs
                           missing.rate=0.1) # Drop SNPs missing in >10% samples

# Counting quantity of SNPs
length(snps_qc)  

5.1.4 Principle Component Analysis (PCA)

pca <- snpgdsPCA(genofile, num.thread = 1)

pc_df <- data.frame(
  sample = pca$sample.id,
  PC1 = pca$eigenvect[, 1],
  PC2 = pca$eigenvect[, 2],
  stringsAsFactors = FALSE
)

# Clean labels (optional)
pc_df$label <- gsub("sorted_bam_all/|sorted_bam/|_sorted\\.bam", "", pc_df$sample)

# Make sure you defined is_guam first
pc_df$is_guam <- grepl("^(GUAM)", pc_df$label)

# % variance explained for axis labels
eigen <- pca$eigenval[!is.na(pca$eigenval)]
pve <- eigen / sum(eigen) * 100

pve[1]  # PC1 %
pve[2]  # PC2 %
pve[3]  # PC3 %
pve[4]  # PC4 %
sum(pve[1:2])
sum(pve[1:4])
# Set plot limits to include ALL points (add a small padding)
xlim <- range(pc_df$PC1, na.rm = TRUE)
ylim <- range(pc_df$PC2, na.rm = TRUE)

pad_x <- diff(xlim) * 0.05
pad_y <- diff(ylim) * 0.05

xlim <- xlim + c(-pad_x, pad_x)
ylim <- ylim + c(-pad_y, pad_y)

set.seed(123)

PC1_jit <- jitter(pc_df$PC1, amount = diff(xlim) * 0.005)
PC2_jit <- jitter(pc_df$PC2, amount = diff(ylim) * 0.005)

plot(PC1_jit[!pc_df$is_guam], PC2_jit[!pc_df$is_guam],
     col = rgb(0.5, 0.5, 0.5, 0.6),
     pch = 19,
     xlim = xlim,
     ylim = ylim,
     xlab = sprintf("PC1 (%.1f%%)", pve[1]),
     ylab = sprintf("PC2 (%.1f%%)", pve[2]),
     main = "PCA of maize accessions (Guam highlighted)")

# Guam-only limits
x_zoom <- range(pc_df$PC1[pc_df$is_guam], na.rm = TRUE)
y_zoom <- range(pc_df$PC2[pc_df$is_guam], na.rm = TRUE)

# --- robust padding so the rectangle is visible ---
# use overall PC spread as fallback if Guam spread is tiny
dx <- diff(x_zoom)
dy <- diff(y_zoom)

padx <- if (dx == 0) diff(xlim) * 0.03 else dx * 0.25
pady <- if (dy == 0) diff(ylim) * 0.03 else dy * 0.25

x_zoom2 <- x_zoom + c(-padx, padx)
y_zoom2 <- y_zoom + c(-pady, pady)

# draw rectangle ON TOP so it doesn't get hidden
rect(
  xleft   = x_zoom2[1],
  ybottom = y_zoom2[1],
  xright  = x_zoom2[2],
  ytop    = y_zoom2[2],
  border  = "red",
  col     = rgb(1, 0, 0, 0.12),
  lwd     = 3
)

points(PC1_jit[pc_df$is_guam], PC2_jit[pc_df$is_guam],
       pch = 21,          # circle with border
       bg  = "red",       # fill
       col = "black",     # outline
       cex = 1.5,
       lwd = 1.5)

text(PC1_jit, PC2_jit,
     labels = pc_df$label,
     col = ifelse(pc_df$is_guam, "red", rgb(0.4,0.4,0.4,0.2)),
     cex = ifelse(pc_df$is_guam, 0.7, 0.4),
     pos = 3)

legend("topright",
       legend = c("Guam accessions", "Other accessions"),
       col = c("red", "gray60"),
       pch = 19,
       bty = "n")

  • Because the PCA of all 208 maize accessions resulted in a highly dense region of overlapping points, a zoomed-in view focusing on the cluster containing the Guam accessions (highlighted by the red rectangle) was generated to improve visualization and facilitate interpretation of local genetic structure.
# centroid-based zoom (REQUIRED when diff = 0)
cx <- mean(pc_df$PC1[pc_df$is_guam])
cy <- mean(pc_df$PC2[pc_df$is_guam])

dx <- 0.006
dy <- 0.004

x_zoom <- cx + c(-dx, dx)
y_zoom <- cy + c(-dy, dy)

plot(PC1_jit, PC2_jit,
     col = ifelse(pc_df$is_guam, "red", rgb(0.5,0.5,0.5,0.4)),
     pch = 19,
     xlim = x_zoom,
     ylim = y_zoom,
     xlab = sprintf("PC1 (%.1f%%)", pve[1]),
     ylab = sprintf("PC2 (%.1f%%)", pve[2]),
     main = "Zoomed PCA around Guam cluster")

# label only points actually visible
in_zoom <- PC1_jit >= x_zoom[1] & PC1_jit <= x_zoom[2] &
           PC2_jit >= y_zoom[1] & PC2_jit <= y_zoom[2]

text(PC1_jit[in_zoom], PC2_jit[in_zoom],
     labels = pc_df$label[in_zoom],
     col = ifelse(pc_df$is_guam[in_zoom], "red", rgb(0.4,0.4,0.4,0.7)),
     cex = ifelse(pc_df$is_guam[in_zoom], 0.8, 0.5),
     pos = 3)

points(PC1_jit[pc_df$is_guam], PC2_jit[pc_df$is_guam],
       pch = 21, bg = "red", col = "black", cex = 1.6, lwd = 1.5)

PCA Interpretation. The first two principle components explained 56.3% of the total genetic variation (PC1 = 33%, PC2 = 23.3%), and the first four PCs together explained 66.2%. For visualization, only the first two principal components (PC1 and PC2) were retained and plotted, since they capture the major axis of genetic structure.

In the zoomed PCA view, GUAM1 and GUAM2 clustered tightly together and were embedded within a broader group of American maize landraces. The Guam accessions did not form a distinct cluster, indicating close genetic similarity to multiple Central and South American accessions. The closest neighboring accessions in PCA space included landraces from Guatemala, Honduras, Ecuador, Paraguay, and Uruguay, suggesting that Guam white corn is genetically consistent with New World field corn diversity.

5.2 Identity-By-State (IBS)

snp <- read.gdsn(index.gdsn(genofile, "snp.id"))
length(snp)
head(snp)

ibs <- snpgdsIBS(genofile,
                 snp.id = snp,
                 num.thread = 2,
                 autosome.only = FALSE, 
                 remove.monosnp = TRUE,
                 missing.rate = NaN)  

ibs_mat <- ibs$ibs
ibs_ids <- ibs$sample.id
dim(ibs_mat)
ibs_label <- gsub("sorted_bam_all/|sorted_bam/|_sorted\\.bam", "", ibs_ids)

guam_idx <- which(ibs_label %in% c("GUAM1", "GUAM2"))
#guam_idx
#ibs_label[guam_idx]

get_top_neighbors <- function(i, top_n = 10) {
  v <- ibs_mat[i, ]
  v[i] <- NA  # remove self
  ord <- order(v, decreasing = TRUE, na.last = NA)
  data.frame(
    focal = ibs_label[i],
    neighbor = ibs_label[ord][1:top_n],
    ibs = v[ord][1:top_n]
  )
}

top_guam1 <- get_top_neighbors(guam_idx[1], top_n = 15)
top_guam2 <- get_top_neighbors(guam_idx[2], top_n = 15)

top_guam1
##    focal neighbor       ibs
## 1  GUAM1    GUAM2 0.9486480
## 2  GUAM1 GUAT_804 0.9346686
## 3  GUAM1 VENE_909 0.9342561
## 4  GUAM1 GUAT_247 0.9340749
## 5  GUAM1 PARA_330 0.9332922
## 6  GUAM1 ECUA_937 0.9327574
## 7  GUAM1 ECUA_473 0.9317255
## 8  GUAM1 GUER_188 0.9313625
## 9  GUAM1  NICA_17 0.9312818
## 10 GUAM1  NICA_74 0.9311216
## 11 GUAM1 VENE_572 0.9309133
## 12 GUAM1 GUAT_688 0.9308256
## 13 GUAM1   BRAZ_6 0.9305050
## 14 GUAM1 GUAT_118 0.9302802
## 15 GUAM1  BOLI_83 0.9299056
top_guam2
##    focal neighbor       ibs
## 1  GUAM2     SD-2 0.9486521
## 2  GUAM2    GUAM1 0.9486480
## 3  GUAM2 GUAT_325 0.9427273
## 4  GUAM2 URUG_752 0.9418161
## 5  GUAM2     DR-0 0.9417476
## 6  GUAM2  NICA_17 0.9412328
## 7  GUAM2  SONO_39 0.9404419
## 8  GUAM2 GUAT_208 0.9393992
## 9  GUAM2 GUAT_567 0.9392966
## 10 GUAM2 GUAT_688 0.9384762
## 11 GUAM2 ECUA_473 0.9382571
## 12 GUAM2  GUAT_51 0.9380007
## 13 GUAM2 ECUA_937 0.9377880
## 14 GUAM2 PARA_136 0.9377634
## 15 GUAM2   BRAZ_6 0.9377580
neighbors <- unique(c(top_guam1$neighbor, top_guam2$neighbor, "GUAM1", "GUAM2"))
sub_idx <- which(ibs_label %in% neighbors)

# hierarchical clustering on IBS distance
d <- as.dist(1 - ibs_mat[sub_idx, sub_idx])
hc <- hclust(d, method = "average")

heatmap(ibs_mat[sub_idx, sub_idx],
        Rowv = as.dendrogram(hc),
        Colv = as.dendrogram(hc),
        scale = "none",
        labRow = ibs_label[sub_idx],
        labCol = ibs_label[sub_idx],
        main = "IBS similarity (Guam + nearest neighbors)")

IBS Interpretation. Identity-by-state (IBS) analysis supported the PCA results by quantifying pairwise genetic similarity between Guam accessions and all other samples. GUAM1 and GUAM2 showed very high similarity to each other (IBS = 0.9486), indicating close relatedness. For GUAM1, the closest non-Guam accessions included GUAT_804 (IBS = 0.9347), VENE_909 (0.9343), GUAT_247 (0.9341), PARA_330 (0.9333), and ECUA_937 (0.9328). For GUAM2, the closest accessions included GUAT_325 (0.9427), URUG_752 (0.9418), DR-0 (0.9417), NICA_17 (0.9412), and SONO_39 (0.9404). Overall, the nearest IBS neighbors to the Guam accessions were predominantly Central and South American landraces, consistent with Guam maize clustering within New World field corn diversity.

In the IBS heatmap, GUAM2 clustered closely with accessions from Guatemala, Ecuador, Brazil, Uruguay, and Venezuela, forming a contiguous high-similarity block. The Guam accession did not form a distinct branch but instead nested within a broader American landrace cluster, consistent with the PCA results.

6 Conclusion

Using genome-wide SNP data, this study examined the genetic relationships between Guam white field corn and a diverse panel of maize landraces. Principal Component Analysis and Identity-by-State analyses consistently showed that Guam accessions cluster within New World maize diversity rather than forming a distinct genetic lineage. The two Guam samples exhibited high genetic similarity to each other, suggesting limited within-Guam divergence, and shared closest affinity with multiple Central and South American landraces, including accessions from Guatemala, Ecuador, Venezuela, Brazil, Uruguay, and Nicaragua. These findings are consistent with historical accounts of maize introduction to Guam from the Americas and suggest that Guam white corn represents a locally maintained derivative of American field corn rather than an independently evolved lineage.