1 Objective

The goal of this analysis was to determine how genetically similar a set of 22 “unknown” maize accessions (with available FASTQ data) are to a panel of 19 reference “anchor” lines with known backgrounds, using SNP data from chromosome 1. Specifically, we want to see which anchor line each unknown most closely resembles and whether the unknowns form a single genetic group or multiple distinct clusters.

2 Methods

2.1 Principle Component Analysis (PCA)

PCA, in this specific context, will take the full genotype matrix (multiple SNPS across samples) and find new axes that explain the largest amount of variation. Each sample gets a score on Principle Component 1 (PC1) and Principle Component 2 (PC2) and these results capture major patterns of the genetic structure and allow us to visually see whether the unknowns cluster together or split into subgroups, and which anchors they cluster closest in the genetic space.

2.2 Identity-By-State (IBS)

Identity-By-State (IBS) was used to quantify how similar each unknown sample is to each anchor line. While PCA shows overall clustering, IBS provides a direct numerical answer to “How strong is that similarity?” IBS measures the proportion of alleles that two samples share at the same SNPs, regardless of their ancestral origin; values close to 1 indicate very similar genotypes across the SNPs considered, whereas lower values indicate greater divergence. In this analysis, IBS was computed across Quality Controlled-filtered SNPs, and for each unknown sample the anchor with the highest IBS value was identified as its closest genetic reference.

2.3 Interpreting PCA and IBS vs. Geographical/Regional Information

The PCA and IBS results in this analysis reflect genetic similarity, not geography/region themselves. PCA places samples close together if they share similar SNP patterns, and IBS quantifies how often their alleles match, but neither directly encodes country or region. To make geographic inferences, the analysis first identified which anchor line each unknown sample was genetically closest to (based on PCA/IBS), and then validated external metadata and any reachable publications to determine where that specific anchor line was collected or bred. Any statements about the likely origin of the unknowns therefore depend on both the genetic clustering and the anchor metadata, and should be interpreted as inferences, and not direct measurements of geographic origin.

3 Directory & Package Set-Ups

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
setwd("/Users/michaelcajigal/GROW Internship Projects/Corn Project/")
library(vcfR)

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.15.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(adegenet)

## Loading required package: ade4

## 
##    /// adegenet 2.1.11 is loaded ////////////
## 
##    > overview: '?adegenet'
##    > tutorials/doc/questions: 'adegenetWeb()' 
##    > bug reports/feature requests: adegenetIssues()

library(LEA)
library(ggplot2)
library(R.utils)

## Loading required package: R.oo

## Loading required package: R.methodsS3

## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.

## R.oo v1.27.1 (2025-05-02 21:00:05 UTC) successfully loaded. See ?R.oo for help.

## 
## Attaching package: 'R.oo'

## The following object is masked from 'package:R.methodsS3':
## 
##     throw

## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods

## The following objects are masked from 'package:base':
## 
##     attach, detach, load, save

## R.utils v2.13.0 (2025-02-24 21:20:02 UTC) successfully loaded. See ?R.utils for help.

## 
## Attaching package: 'R.utils'

## The following object is masked from 'package:utils':
## 
##     timestamp

## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, isOpen, nullfile, parse, use, warnings

library(SNPRelate)

## Loading required package: gdsfmt

library(gdsfmt)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(SNPRelate)

4 Pre-Analysis & Data Set-Ups

Raw data. Only raw paired-end reads in FASTQ (.fq.gz) format were collected from multiple project folders, then consolidated and renamed into a single trimmed_reads/ directory with one R1/R2 file pair per sample (i.e., TAMA_46_trimmed_R1.fq.gz, TAMA_46_trimmed_R2.fq.gz).

Sample availability. Of the 160+ maize accessions in the project file, only 22 had corresponding FASTQ read files available. Therefore, these 22 accessions were the only ones included in the downstream variant calling and comparative analyses.

4.1 Generate Variant Calls for Unknown Samples

Before reading in the data, first we have to generate the Variant Call Formats (VCF) files for our “unknown ancestry” samples. All variant calling and VCF processing were carried out on a macOS using the Terminal.

These were the brief steps taken to generate those VCF files:

Reads were aligned to the Zea mays reference genome sourced from MaizeGDB (https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/) with BWA-MEM, and alignment files were processed with SAMtools.
Variants were then called and filtered using bcftools (including mpileup, call, view, merge) together with htslib utilities (bgzip, tabix) to compress and index the resulting VCF files.

4.2 Anchor Panel Build from SNPVersity

An anchor panel was built using SNPVersity from MaizeGDB (https://wgs.maizegdb.org) then downloaded as a VCF file and used as the reference samples for comparison in order to provide us some ancestry information of the “unknown” samples. 19 anchor accessions from various regions were manually selected. Below is the specific list of anchors chosen:

After the anchor VCF file was generated, it was compressed and indexed in the macOS Terminal using the same **bgzip/tabix and bcftools** workflow as that applied to the unknown samples.
SRA_ID	Name	BioProject Accession	Population Region
CRX444931	td282	PRJCA009749	Canada
CRX444932	te322	PRJCA009749	Canada
CRX444935	te350	PRJCA009749	Canada
CRX445267	cml322	PRJCA009749	Mexico
CRX445268	cml333	PRJCA009749	Mexico
CRX445312	cml61	PRJCA009749	Mexico
CRX445535	cimbl58	PRJCA009749	Mexico
CRX445127	kw4m029	PRJCA009749	Europe
CRX445284	72_125	PRJCA009749	Europe
CRX445460	75_322	PRJCA009749	Europe
CRX445527	lol125	PRJCA009749	Europe
CRX445559	z021e0138	PRJCA009749	USA
CRX445690	seagullsteven	PRJCA009749	USA
CRX445663	phpr5	PRJCA009749	USA
ERR10235203	s68911	PRJEB56265	Poland
SRR17045871	1582	PRJNA783885	China
SRR17045871	848	PRJNA783885	China
SRR17045871	1868	PRJNA783885	China
SRR17045871	6	PRJNA783885	China

4.3 Merge Anchors with Unknown Samples (Site Intersection)

These were the brief steps to merge chosen anchors and unknown samples via site intersection for the comparative analysis:

Create a list of anchor SNP sites.
Subset unknown VCF to anchor sites.
Merge anchors + unknowns (first pass, with --missing-to-ref)

5 Comparative Analysis

Merging strategy. First a “baseline” merge of the anchor panel and unknown samples using bcftools merge with the option --missing-to-ref was passed, replacing missing genotypes at shared sites with homozygous reference calls (a crude imputation method). A second pass was then conducted repeating the merge process but without --missing-to-ref, to assess whether ancestry inferences were robust to how missing data were handled.

5.1 First Pass

Considering --missing-to-ref. The--missing-to-ref option in bcftools merge fills in missing genotypes as homozygous reference (0/0). This creates a complete genotype matrix that can simplify downstream PCA and IBS calculations, but it can also inflate similarity between samples if true missing data are incorrectly assumed to be reference. Because of this trade-off, the --missing-to-ref merge only as an initial pass and then repeated the analyses on a merge without this assumption.

5.1.1 Data Reads

vcf_file <- "merged_vcf/merged_with_anchors.vcf.gz"
gds_file <- "merged_vcf/with_anchors.gds"

5.1.2 Convert Unknown VCF to GDS and Check Missingness

# Converting to GDS
snpgdsVCF2GDS(vcf_file, gds_file, method="biallelic.only")

## Start file conversion from VCF to SNP GDS ...
## Method: extracting biallelic SNPs
## Number of samples: 41
## Parsing "merged_vcf/merged_with_anchors.vcf.gz" ...
##  import 303385 variants.
## + genotype   { Bit2 41x303385, 3.0M } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'merged_vcf/with_anchors.gds' (4.4M)
##     # of fragments: 78
##     save to 'merged_vcf/with_anchors.gds.tmp'
##     rename 'merged_vcf/with_anchors.gds.tmp' (4.4M, reduced: 696B)
##     # of fragments: 20

genofile <- snpgdsOpen(gds_file)

# Summary of GDS file
snpgdsSummary(genofile)

## The file name: /Users/michaelcajigal/GROW Internship Projects/Corn Project/merged_vcf/with_anchors.gds 
## The total number of samples: 41 
## The total number of SNPs: 303385 
## SNP genotypes are stored in SNP-major mode (Sample X SNP).

samples <- read.gdsn(index.gdsn(genofile, "sample.id"))
length(samples)

## [1] 41

head(samples)

## [1] "1868_SRR17045855"    "1582_SRR17045871"    "848_SRR17045945"    
## [4] "6_SRR17045986"       "Z021E0138_CRX445559" "PHPR5_CRX445663"

# Missing assessment per sample
samp_miss <- snpgdsSampMissRate(genofile)
summary(samp_miss)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05919 0.11517 0.24147

samp_miss

##  [1] 0.10671919 0.10405920 0.03284605 0.05866803 0.06505266 0.11516720
##  [7] 0.15051173 0.15688646 0.13568568 0.07999736 0.18724063 0.16259868
## [13] 0.24146876 0.14871533 0.01318457 0.22546929 0.18271174 0.19314073
## [19] 0.06664799 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## [25] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## [31] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## [37] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000

samples <- read.gdsn(index.gdsn(genofile, "sample.id"))
grep("B73", samples, ignore.case=TRUE, value=TRUE)

## character(0)

5.1.3 Define anchor and unknown sample sets in R

# Defining anchors (using SNPVeristy ID)
anchors <- c(
  "1868_SRR17045855","1582_SRR17045871","848_SRR17045945","6_SRR17045986",
  "Z021E0138_CRX445559","PHPR5_CRX445663","SeagullSeventeen_CRX445690",
  "TD282_CRX444931","TE322_CRX444932","TE350_CRX444935",
  "CML322_CRX445267","CML333_CRX445268","CML61_CRX445312","CIMBL58_CRX445535",
  "KW4M029_CRX445127","72_125_CRX445284","75_322_CRX445460",
  "Lo1125_CRX445527","S68911_ERR10235203"
)

# Defining unknowns (using exact IDs)
unknowns <- setdiff(samples, anchors)

# Counting quantity of anchors and unknown samples
length(anchors)

## [1] 19

length(unknowns)

## [1] 22

unknowns

##  [1] "sorted_bam/TAMA_46_sorted.bam"   "sorted_bam/URUG_1054_sorted.bam"
##  [3] "sorted_bam/URUG_17_sorted.bam"   "sorted_bam/URUG_19_sorted.bam"  
##  [5] "sorted_bam/URUG_353_sorted.bam"  "sorted_bam/URUG_365_sorted.bam" 
##  [7] "sorted_bam/URUG_51_sorted.bam"   "sorted_bam/URUG_527_sorted.bam" 
##  [9] "sorted_bam/URUG_537_sorted.bam"  "sorted_bam/URUG_713_sorted.bam" 
## [11] "sorted_bam/URUG_722_sorted.bam"  "sorted_bam/URUG_724_sorted.bam" 
## [13] "sorted_bam/URUG_747_sorted.bam"  "sorted_bam/URUG_752_sorted.bam" 
## [15] "sorted_bam/URUG_97_sorted.bam"   "sorted_bam/VENE_323_sorted.bam" 
## [17] "sorted_bam/VENE_326_sorted.bam"  "sorted_bam/VENE_334_sorted.bam" 
## [19] "sorted_bam/VENE_572_sorted.bam"  "sorted_bam/VENE_760_sorted.bam" 
## [21] "sorted_bam/VENE_909_sorted.bam"  "sorted_bam/YUCA_164_sorted.bam"

5.1.4 Quality Controled (QC) Single Nucleotide Polymorphism (SNP) Build

We build a list of high-quality SNPs by keeping only variants with minor allele frequency ≥ 0.05 and missing genotypes in ≤ 10% of samples. This reduces rare and poorly genotyped sites before running PCA and IBS.

# Building SNP list
snps_qc <- snpgdsSelectSNP(genofile,
                           maf=0.05,         # Drops rare SNPs
                           missing.rate=0.1) # Drop SNPs missing in >10% samples

## Excluding 0 SNP on non-autosomes
## Excluding 251,123 SNPs (monomorphic: TRUE, MAF: 0.05, missing rate: 0.1)

# Counting quantity of SNPs
length(snps_qc)

## [1] 52262

5.1.5 Principle Component Analysis (PCA)

# PCA on QC SNPs
pca <- snpgdsPCA(genofile, snp.id=snps_qc, num.thread=2)

## Principal Component Analysis (PCA) on genotypes:
## Excluding 251,123 SNPs (non-autosomes or non-selection)
## Excluding 37,495 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: 0.01)
##     # of samples: 41
##     # of SNPs: 14,767
##     using 2 threads/cores
##     # of principal components: 32
## PCA:    the sum of all selected genotypes (0,1,2) = 1026575
## CPU capabilities:
## 2025-11-26 15:40:20    (internal increment: 11988)
## [..................................................]  0%, ETC: ---        [==================================================] 100%, completed, 0s
## 2025-11-26 15:40:20    Begin (eigenvalues and eigenvectors)
## 2025-11-26 15:40:20    Done.

# Building PCA Dataframe
pc_df <- data.frame(
  sample = pca$sample.id,
  PC1 = pca$eigenvect[,1],
  PC2 = pca$eigenvect[,2],
  stringsAsFactors = FALSE
)

pc_df$type <- ifelse(pc_df$sample %in% anchors, "anchor", "unknown")
pc_df$label <- ifelse(pc_df$type=="unknown",
                      gsub("sorted_bam/|_sorted.bam","", pc_df$sample),
                      pc_df$sample)

# Checking eigen values
pca$eigenval

##  [1] 11.78650462  4.64106578  4.13949731  2.39072449  2.19033692  2.00628049
##  [7]  1.81045503  1.58878012  1.43749645  1.41743311  1.27743222  1.21626486
## [13]  0.96213305  0.87857056  0.65627138  0.56066761  0.53787673  0.15647500
## [19]  0.06660645  0.03152228  0.02538612  0.02268002  0.02039808  0.01813191
## [25]  0.01767623  0.01698265  0.01542501  0.01386006  0.01246449  0.01230765
## [31]  0.01157453  0.01099847         NaN         NaN         NaN         NaN
## [37]         NaN         NaN         NaN         NaN         NaN

# Computing eigen values w/out "NaN" values
eigen_full <- pca$eigenval           # Defining full eigen value list
eigen_cut <- eigen_full[!is.na(eigen_full)] # Removing "NaN's" from eigen value list

# Checking updated eigen values
eigen_cut

##  [1] 11.78650462  4.64106578  4.13949731  2.39072449  2.19033692  2.00628049
##  [7]  1.81045503  1.58878012  1.43749645  1.41743311  1.27743222  1.21626486
## [13]  0.96213305  0.87857056  0.65627138  0.56066761  0.53787673  0.15647500
## [19]  0.06660645  0.03152228  0.02538612  0.02268002  0.02039808  0.01813191
## [25]  0.01767623  0.01698265  0.01542501  0.01386006  0.01246449  0.01230765
## [31]  0.01157453  0.01099847

# % variance explained
pve <- eigen_cut / sum(eigen_cut) * 100
pve[1:20]

##  [1] 29.50293394 11.61710461 10.36162286  5.98424969  5.48265728  5.02194354
##  [7]  4.53177060  3.97689362  3.59821373  3.54799295  3.19755513  3.04444642
## [13]  2.40832620  2.19915996  1.64272038  1.40341349  1.34636536  0.39167435
## [19]  0.16672336  0.07890378

# Plotting PC1 vs PC2
with(pc_df, plot(PC1, PC2,
                 col=ifelse(type=="anchor","red","black"),
                 pch=19, xlab="PC1", ylab="PC2",
                 main="PCA on SNPs"))
text(pc_df$PC1, pc_df$PC2, labels=pc_df$label, pos=3, cex=0.6)
legend("topright", legend=c("Anchors","Unknowns"), col=c("red","black"), pch=19)

PCA Interpretation. The first two principle components explained 41.1% of the total genetic variation (PC1 = 29.5%, PC2 = 11.6%), and the first four PCs together explained 57.5%. For visualization, only the first two principal components (PC1 and PC2) were retained and plotted, since they capture the major axis of genetic structure.

The PCA on QC-filtered SNPs shows clear separation between reference anchors (which are in red) and the “unknown” samples (in black) along PC1, which explains approximately 29.5% of the genetic variance (PC1 + PC2 ≈ 41.1%). Most anchors are spread across negative PC1 values, forming several distinct groups, whereas all 22 unknown samples form a tight cluster on the far right of PC1. The anchor maize line KW4M029_CRX445127 specifically sits inside this right-hand cluster, indicating that, among the anchors used, the “unknown” samples are genetically most similar to KW4M029 and to each other, and are distinct from the other anchor backgrounds represented on the left side of the plot.

5.2 Identity-By-State (IBS)

# IBS on QC SNPs
ibs <- snpgdsIBS(genofile, snp.id=snps_qc, num.thread=2)

## Identity-By-State (IBS) analysis on genotypes:
## Excluding 251,123 SNPs (non-autosomes or non-selection)
## Excluding 37,495 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: 0.01)
##     # of samples: 41
##     # of SNPs: 14,767
##     using 2 threads/cores
## IBS:    the sum of all selected genotypes (0,1,2) = 1026575
## 2025-11-26 15:40:20    (internal increment: 65536)
## [..................................................]  0%, ETC: ---        [==================================================] 100%, completed, 0s
## 2025-11-26 15:40:20    Done.

ibs_mat <- ibs$ibs
rownames(ibs_mat) <- colnames(ibs_mat) <- ibs$sample.id

ibs_to_anchors <- ibs_mat[unknowns, anchors, drop=FALSE]

best_anchor <- t(apply(ibs_to_anchors, 1, function(x){
  a <- names(which.max(x))
  c(best_anchor=a, best_ibs=max(x))
}))
best_anchor <- as.data.frame(best_anchor, stringsAsFactors=FALSE)
best_anchor$best_ibs <- as.numeric(best_anchor$best_ibs)
best_anchor$unknown <- gsub("sorted_bam/|_sorted.bam","", rownames(best_anchor))
best_anchor <- best_anchor[, c("unknown","best_anchor","best_ibs")]

best_anchor[order(-best_anchor$best_ibs), ]

##                                   unknown       best_anchor  best_ibs
## sorted_bam/URUG_527_sorted.bam   URUG_527 KW4M029_CRX445127 0.9861854
## sorted_bam/URUG_752_sorted.bam   URUG_752 KW4M029_CRX445127 0.9860838
## sorted_bam/URUG_724_sorted.bam   URUG_724 KW4M029_CRX445127 0.9855421
## sorted_bam/VENE_572_sorted.bam   VENE_572 KW4M029_CRX445127 0.9855421
## sorted_bam/URUG_365_sorted.bam   URUG_365 KW4M029_CRX445127 0.9854066
## sorted_bam/URUG_747_sorted.bam   URUG_747 KW4M029_CRX445127 0.9853389
## sorted_bam/URUG_1054_sorted.bam URUG_1054 KW4M029_CRX445127 0.9851696
## sorted_bam/VENE_760_sorted.bam   VENE_760 KW4M029_CRX445127 0.9851358
## sorted_bam/TAMA_46_sorted.bam     TAMA_46 KW4M029_CRX445127 0.9850342
## sorted_bam/URUG_722_sorted.bam   URUG_722 KW4M029_CRX445127 0.9850342
## sorted_bam/YUCA_164_sorted.bam   YUCA_164 KW4M029_CRX445127 0.9849665
## sorted_bam/VENE_323_sorted.bam   VENE_323 KW4M029_CRX445127 0.9849326
## sorted_bam/URUG_713_sorted.bam   URUG_713 KW4M029_CRX445127 0.9848988
## sorted_bam/URUG_51_sorted.bam     URUG_51 KW4M029_CRX445127 0.9848649
## sorted_bam/URUG_537_sorted.bam   URUG_537 KW4M029_CRX445127 0.9847633
## sorted_bam/VENE_909_sorted.bam   VENE_909 KW4M029_CRX445127 0.9846279
## sorted_bam/URUG_17_sorted.bam     URUG_17 KW4M029_CRX445127 0.9845263
## sorted_bam/URUG_353_sorted.bam   URUG_353 KW4M029_CRX445127 0.9844586
## sorted_bam/VENE_334_sorted.bam   VENE_334 KW4M029_CRX445127 0.9840523
## sorted_bam/URUG_97_sorted.bam     URUG_97 KW4M029_CRX445127 0.9837475
## sorted_bam/URUG_19_sorted.bam     URUG_19 KW4M029_CRX445127 0.9835105
## sorted_bam/VENE_326_sorted.bam   VENE_326 KW4M029_CRX445127 0.9826640

IBS Interpretation. Across all 22 unknown accessions, KW4M029_CRX445127 is consistently identified as the closest anchor, with very high IBS values (~0.98–0.99). This indicates that the unknown samples form a group that is genetically very similar to each other and to KW4M029, and more distant from the other anchor backgrounds in the panel.

# Not necessary just wanted to see heatmap relation
#heatmap(
#  ibs_mat,
#  symm=TRUE,
#  main="IBS Heatmap (chr1 pruned SNPs)"
#)

# Not necessary just wanted to see heatmap relation
#library(ape)

#dist_mat <- 1 - ibs_mat
#tree <- nj(as.dist(dist_mat))

#plot(tree, cex=0.6, main="NJ Tree from 1-IBS (chr1)")

5.3 Second Pass

Because the IBS values were extremely high, the PCA seemed to cluster to well, and the initial merge used the --missing-to-ref option (which, as mentioned previously, fills missing genotypes as homozygous reference and can artificially inflate similarity), a second pass of the analysis was performed without --missing-to-ref. This was done to check whether the apparent closeness to KW4M029 was robust to the treatment of missing data rather than being an artifact of this crude imputation step.

vcf_nomiss <- "merged_vcf/merged_no_missing2ref.vcf.gz"
gds_nomiss <- "merged_vcf/with_anchors_no_missing2ref.gds"

snpgdsVCF2GDS(vcf_nomiss, gds_nomiss, method="biallelic.only")

## Start file conversion from VCF to SNP GDS ...
## Method: extracting biallelic SNPs
## Number of samples: 41
## Parsing "merged_vcf/merged_no_missing2ref.vcf.gz" ...
##  import 303385 variants.
## + genotype   { Bit2 41x303385, 3.0M } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'merged_vcf/with_anchors_no_missing2ref.gds' (4.4M)
##     # of fragments: 78
##     save to 'merged_vcf/with_anchors_no_missing2ref.gds.tmp'
##     rename 'merged_vcf/with_anchors_no_missing2ref.gds.tmp' (4.4M, reduced: 696B)
##     # of fragments: 20

genofile2 <- snpgdsOpen(gds_nomiss)

snpgdsSummary(genofile2)

## The file name: /Users/michaelcajigal/GROW Internship Projects/Corn Project/merged_vcf/with_anchors_no_missing2ref.gds 
## The total number of samples: 41 
## The total number of SNPs: 303385 
## SNP genotypes are stored in SNP-major mode (Sample X SNP).

samples2 <- read.gdsn(index.gdsn(genofile2, "sample.id"))
length(samples2)

## [1] 41

head(samples2)

## [1] "1868_SRR17045855"    "1582_SRR17045871"    "848_SRR17045945"    
## [4] "6_SRR17045986"       "Z021E0138_CRX445559" "PHPR5_CRX445663"

anchors_raw <- c(
  "1868_SRR17045855","1582_SRR17045871","848_SRR17045945","6_SRR17045986",
  "Z021E0138_CRX445559","PHPR5_CRX445663","SeagullSeventeen_CRX445690",
  "TD282_CRX444931","TE322_CRX444932","TE350_CRX444935",
  "CML322_CRX445267","CML333_CRX445268","CML61_CRX445312","CIMBL58_CRX445535",
  "KW4M029_CRX445127","72_125_CRX445284","75_322_CRX445460",
  "Lo1125_CRX445527","S68911_ERR10235203"
)

anchors2  <- intersect(samples2, anchors_raw)
unknowns2 <- setdiff(samples2, anchors2)

length(anchors2)

## [1] 19

length(unknowns2)

## [1] 22

snps_qc2 <- snpgdsSelectSNP(genofile2, maf=0.05, missing.rate=0.1)

## Excluding 0 SNP on non-autosomes
## Excluding 302,558 SNPs (monomorphic: TRUE, MAF: 0.05, missing rate: 0.1)

length(snps_qc2)   # should be similar scale as before (tens of thousands)

## [1] 827

pca2 <- snpgdsPCA(genofile2, snp.id=snps_qc2, num.thread=2)

## Principal Component Analysis (PCA) on genotypes:
## Excluding 302,558 SNPs (non-autosomes or non-selection)
## Excluding 517 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: 0.01)
##     # of samples: 41
##     # of SNPs: 310
##     using 2 threads/cores
##     # of principal components: 32
## PCA:    the sum of all selected genotypes (0,1,2) = 19965
## CPU capabilities:
## 2025-11-26 15:40:23    (internal increment: 11988)
## [..................................................]  0%, ETC: ---        [==================================================] 100%, completed, 0s
## 2025-11-26 15:40:23    Begin (eigenvalues and eigenvectors)
## 2025-11-26 15:40:23    Done.

pc_df2 <- data.frame(
  sample = pca2$sample.id,
  PC1 = pca2$eigenvect[,1],
  PC2 = pca2$eigenvect[,2],
  stringsAsFactors = FALSE
)

pc_df2$type <- ifelse(pc_df2$sample %in% anchors2, "anchor", "unknown")
pc_df2$label <- ifelse(
  pc_df2$type=="unknown",
  gsub("sorted_bam/|_sorted.bam","", pc_df2$sample),
  pc_df2$sample
)

# variance explained (optional)
pve2 <- pca2$eigenval / sum(pca2$eigenval) * 100

with(pc_df2, plot(
  PC1, PC2,
  col=ifelse(type=="anchor","red","black"),
  pch=19,
  xlab=paste0("PC1"),
  ylab=paste0("PC2"),
  main="PCA (no missing-to-ref merge)"
))
text(pc_df2$PC1, pc_df2$PC2, labels=pc_df2$label, pos=3, cex=0.6)
legend("topright", legend=c("anchors","unknowns"), col=c("red","black"), pch=19)

PCA Interpretation. This PCA, shows essentially the same structure as the first PCA confirming genetic cluster structures. The anchors remain spread across negative to near-zero PC1 values, forming several distinct genetic groups. The unknown sample still cluster together on the right side of PC1, overlapping the position of KW4M029_CRX445127, although they are slightly more dispersed than in the first PCA. This indicates that even when missing genotypes are not forced to the reference allele, the unknown accessions continue to form a coherent cluster that is genetically closest to KW4M029 and clearly separated from the other anchor backgrounds.

ibs2 <- snpgdsIBS(genofile2, snp.id=snps_qc2, num.thread=2)

## Identity-By-State (IBS) analysis on genotypes:
## Excluding 302,558 SNPs (non-autosomes or non-selection)
## Excluding 517 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: 0.01)
##     # of samples: 41
##     # of SNPs: 310
##     using 2 threads/cores
## IBS:    the sum of all selected genotypes (0,1,2) = 19965
## 2025-11-26 15:40:23    (internal increment: 65536)
## [..................................................]  0%, ETC: ---        [==================================================] 100%, completed, 0s
## 2025-11-26 15:40:23    Done.

ibs_mat2 <- ibs2$ibs
rownames(ibs_mat2) <- colnames(ibs_mat2) <- ibs2$sample.id

ibs_to_anchors2 <- ibs_mat2[unknowns2, anchors2, drop=FALSE]

best_anchor2 <- t(apply(ibs_to_anchors2, 1, function(x){
  a <- names(which.max(x))
  c(best_anchor=a, best_ibs=max(x))
}))

best_anchor2 <- as.data.frame(best_anchor2, stringsAsFactors=FALSE)
best_anchor2$best_ibs <- as.numeric(best_anchor2$best_ibs)
best_anchor2$unknown <- gsub("sorted_bam/|_sorted.bam","", rownames(best_anchor2))
best_anchor2 <- best_anchor2[, c("unknown","best_anchor","best_ibs")]

best_anchor2[order(-best_anchor2$best_ibs), ]

##                                   unknown       best_anchor  best_ibs
## sorted_bam/URUG_527_sorted.bam   URUG_527 KW4M029_CRX445127 0.9435484
## sorted_bam/URUG_752_sorted.bam   URUG_752 KW4M029_CRX445127 0.9387097
## sorted_bam/URUG_724_sorted.bam   URUG_724 KW4M029_CRX445127 0.9129032
## sorted_bam/VENE_572_sorted.bam   VENE_572 KW4M029_CRX445127 0.9129032
## sorted_bam/URUG_365_sorted.bam   URUG_365 KW4M029_CRX445127 0.9080645
## sorted_bam/URUG_747_sorted.bam   URUG_747 KW4M029_CRX445127 0.9064516
## sorted_bam/URUG_1054_sorted.bam URUG_1054 KW4M029_CRX445127 0.8967742
## sorted_bam/VENE_760_sorted.bam   VENE_760 KW4M029_CRX445127 0.8951613
## sorted_bam/TAMA_46_sorted.bam     TAMA_46 KW4M029_CRX445127 0.8887097
## sorted_bam/URUG_722_sorted.bam   URUG_722 KW4M029_CRX445127 0.8887097
## sorted_bam/YUCA_164_sorted.bam   YUCA_164 KW4M029_CRX445127 0.8887097
## sorted_bam/URUG_713_sorted.bam   URUG_713 KW4M029_CRX445127 0.8838710
## sorted_bam/VENE_323_sorted.bam   VENE_323 KW4M029_CRX445127 0.8822581
## sorted_bam/URUG_51_sorted.bam     URUG_51 KW4M029_CRX445127 0.8790323
## sorted_bam/URUG_537_sorted.bam   URUG_537 KW4M029_CRX445127 0.8758065
## sorted_bam/VENE_909_sorted.bam   VENE_909 KW4M029_CRX445127 0.8693548
## sorted_bam/URUG_17_sorted.bam     URUG_17 KW4M029_CRX445127 0.8645161
## sorted_bam/URUG_353_sorted.bam   URUG_353 KW4M029_CRX445127 0.8629032
## sorted_bam/VENE_334_sorted.bam   VENE_334 KW4M029_CRX445127 0.8435484
## sorted_bam/URUG_97_sorted.bam     URUG_97 KW4M029_CRX445127 0.8290323
## sorted_bam/URUG_19_sorted.bam     URUG_19 KW4M029_CRX445127 0.8145161
## sorted_bam/VENE_326_sorted.bam   VENE_326 KW4M029_CRX445127 0.7774194

table(best_anchor2$best_anchor)

## 
## KW4M029_CRX445127 
##                22

sort(colMeans(ibs_to_anchors2), decreasing=TRUE)

##          KW4M029_CRX445127            TE350_CRX444935 
##                  0.8801320                  0.7442082 
##            PHPR5_CRX445663        Z021E0138_CRX445559 
##                  0.7295455                  0.7126833 
##          CIMBL58_CRX445535            TD282_CRX444931 
##                  0.6836510                  0.6509531 
##           72_125_CRX445284           1582_SRR17045871 
##                  0.6276393                  0.6267595 
##           Lo1125_CRX445527            TE322_CRX444932 
##                  0.6219208                  0.6090176 
##         S68911_ERR10235203           1868_SRR17045855 
##                  0.6006598                  0.5969941 
##           CML333_CRX445268            CML61_CRX445312 
##                  0.5918622                  0.5902493 
##              6_SRR17045986           75_322_CRX445460 
##                  0.5874633                  0.5826246 
## SeagullSeventeen_CRX445690            848_SRR17045945 
##                  0.5801320                  0.5682551 
##           CML322_CRX445267 
##                  0.5597507

IBS Interpretation. When IBS was recomputed on the merge without --missing-to-ref, KW4M029_CRX445127 remained the closest anchor for all 22 unknown accessions (IBS range 0.78–0.94; mean ≈ 0.88). The reduction in absolute IBS values relative to the first pass reflects a more conservative treatment of missing genotypes, but the identity of the nearest anchor and the relative ranking of samples were unchanged, supporting the robustness of the inferred relationship to KW4M029.

6 Conclusions

In this brief comparitive analysis, both PCA and IBS show that the 22 unknown maize accessions form a tight genetic cluster that is clearly separated from most anchor backgrounds and lies closest to the anchor line KW4M029_CRX445127. In both the initial analysis (with --missing-to-ref) and the more conservative re-analysis without this option, KW4M029 remained the nearest anchor for every unknown sample, with moderately to highly elevated IBS values. According to panel metadata, KW4M029 is a European inbred line from the KWS breeding program, registered in Turkey, so the unknown accessions are most consistent with a genetic background related to this European/KWS germplasm rather than to the other reference lines included in the panel.

Interestingly, although the unknown accessions were labeled and sourced under names suggesting South American origins (URUG, VENE, YUCA, TAMA), SNP PCA and IBS analyses indicate that they are genetically most similar to the European KWS inbred line KW4M029. This discrepancy between apparent geographic sourcing and genomic affinity suggests that these materials likely trace back to European-derived germplasm, highlighting the complex movement and exchange of maize lines across regions.

Preliminary PCA/IBS Comparison of Unknown Ancestry Maize Accessions to Multi-Regional Anchors

Michael Basaliza Cajigal

2025-11-26