The goal of this analysis was to determine how genetically similar a set of 22 “unknown” maize accessions (with available FASTQ data) are to a panel of 19 reference “anchor” lines with known backgrounds, using SNP data from chromosome 1. Specifically, we want to see which anchor line each unknown most closely resembles and whether the unknowns form a single genetic group or multiple distinct clusters.
PCA, in this specific context, will take the full genotype matrix (multiple SNPS across samples) and find new axes that explain the largest amount of variation. Each sample gets a score on Principle Component 1 (PC1) and Principle Component 2 (PC2) and these results capture major patterns of the genetic structure and allow us to visually see whether the unknowns cluster together or split into subgroups, and which anchors they cluster closest in the genetic space.
Identity-By-State (IBS) was used to quantify how similar each unknown sample is to each anchor line. While PCA shows overall clustering, IBS provides a direct numerical answer to “How strong is that similarity?” IBS measures the proportion of alleles that two samples share at the same SNPs, regardless of their ancestral origin; values close to 1 indicate very similar genotypes across the SNPs considered, whereas lower values indicate greater divergence. In this analysis, IBS was computed across Quality Controlled-filtered SNPs, and for each unknown sample the anchor with the highest IBS value was identified as its closest genetic reference.
The PCA and IBS results in this analysis reflect genetic similarity, not geography/region themselves. PCA places samples close together if they share similar SNP patterns, and IBS quantifies how often their alleles match, but neither directly encodes country or region. To make geographic inferences, the analysis first identified which anchor line each unknown sample was genetically closest to (based on PCA/IBS), and then validated external metadata and any reachable publications to determine where that specific anchor line was collected or bred. Any statements about the likely origin of the unknowns therefore depend on both the genetic clustering and the anchor metadata, and should be interpreted as inferences, and not direct measurements of geographic origin.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
setwd("/Users/michaelcajigal/GROW Internship Projects/Corn Project/")
library(vcfR)
##
## ***** *** vcfR *** *****
## This is vcfR 1.15.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(adegenet)
## Loading required package: ade4
##
## /// adegenet 2.1.11 is loaded ////////////
##
## > overview: '?adegenet'
## > tutorials/doc/questions: 'adegenetWeb()'
## > bug reports/feature requests: adegenetIssues()
library(LEA)
library(ggplot2)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.27.1 (2025-05-02 21:00:05 UTC) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
##
## throw
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, load, save
## R.utils v2.13.0 (2025-02-24 21:20:02 UTC) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, isOpen, nullfile, parse, use, warnings
library(SNPRelate)
## Loading required package: gdsfmt
library(gdsfmt)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(SNPRelate)
Raw data. Only raw paired-end reads in FASTQ
(.fq.gz) format were collected from multiple project
folders, then consolidated and renamed into a single
trimmed_reads/ directory with one R1/R2 file pair per
sample (i.e., TAMA_46_trimmed_R1.fq.gz,
TAMA_46_trimmed_R2.fq.gz).
Sample availability. Of the 160+ maize accessions in the project file, only 22 had corresponding FASTQ read files available. Therefore, these 22 accessions were the only ones included in the downstream variant calling and comparative analyses.
Before reading in the data, first we have to generate the Variant Call Formats (VCF) files for our “unknown ancestry” samples. All variant calling and VCF processing were carried out on a macOS using the Terminal.
These were the brief steps taken to generate those VCF files:
mpileup, call, view,
merge) together with htslib utilities (bgzip, tabix) to
compress and index the resulting VCF files.An anchor panel was built using SNPVersity from MaizeGDB (https://wgs.maizegdb.org) then downloaded as a VCF file and used as the reference samples for comparison in order to provide us some ancestry information of the “unknown” samples. 19 anchor accessions from various regions were manually selected. Below is the specific list of anchors chosen:
| SRA_ID | Name | BioProject Accession | Population Region |
|---|---|---|---|
| CRX444931 | td282 | PRJCA009749 | Canada |
| CRX444932 | te322 | PRJCA009749 | Canada |
| CRX444935 | te350 | PRJCA009749 | Canada |
| CRX445267 | cml322 | PRJCA009749 | Mexico |
| CRX445268 | cml333 | PRJCA009749 | Mexico |
| CRX445312 | cml61 | PRJCA009749 | Mexico |
| CRX445535 | cimbl58 | PRJCA009749 | Mexico |
| CRX445127 | kw4m029 | PRJCA009749 | Europe |
| CRX445284 | 72_125 | PRJCA009749 | Europe |
| CRX445460 | 75_322 | PRJCA009749 | Europe |
| CRX445527 | lol125 | PRJCA009749 | Europe |
| CRX445559 | z021e0138 | PRJCA009749 | USA |
| CRX445690 | seagullsteven | PRJCA009749 | USA |
| CRX445663 | phpr5 | PRJCA009749 | USA |
| ERR10235203 | s68911 | PRJEB56265 | Poland |
| SRR17045871 | 1582 | PRJNA783885 | China |
| SRR17045871 | 848 | PRJNA783885 | China |
| SRR17045871 | 1868 | PRJNA783885 | China |
| SRR17045871 | 6 | PRJNA783885 | China |
These were the brief steps to merge chosen anchors and unknown samples via site intersection for the comparative analysis:
--missing-to-ref)Merging strategy. First a “baseline” merge of the
anchor panel and unknown samples using bcftools merge with
the option --missing-to-ref was passed, replacing missing
genotypes at shared sites with homozygous reference calls (a crude
imputation method). A second pass was then conducted repeating the merge
process but without --missing-to-ref, to
assess whether ancestry inferences were robust to how missing data were
handled.
Considering --missing-to-ref.
The--missing-to-ref option in bcftools merge
fills in missing genotypes as homozygous reference (0/0). This creates a
complete genotype matrix that can simplify downstream PCA and IBS
calculations, but it can also inflate similarity between samples if true
missing data are incorrectly assumed to be reference. Because of this
trade-off, the --missing-to-ref merge only as an initial
pass and then repeated the analyses on a merge without
this assumption.
vcf_file <- "merged_vcf/merged_with_anchors.vcf.gz"
gds_file <- "merged_vcf/with_anchors.gds"
# Converting to GDS
snpgdsVCF2GDS(vcf_file, gds_file, method="biallelic.only")
## Start file conversion from VCF to SNP GDS ...
## Method: extracting biallelic SNPs
## Number of samples: 41
## Parsing "merged_vcf/merged_with_anchors.vcf.gz" ...
## import 303385 variants.
## + genotype { Bit2 41x303385, 3.0M } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
## open the file 'merged_vcf/with_anchors.gds' (4.4M)
## # of fragments: 78
## save to 'merged_vcf/with_anchors.gds.tmp'
## rename 'merged_vcf/with_anchors.gds.tmp' (4.4M, reduced: 696B)
## # of fragments: 20
genofile <- snpgdsOpen(gds_file)
# Summary of GDS file
snpgdsSummary(genofile)
## The file name: /Users/michaelcajigal/GROW Internship Projects/Corn Project/merged_vcf/with_anchors.gds
## The total number of samples: 41
## The total number of SNPs: 303385
## SNP genotypes are stored in SNP-major mode (Sample X SNP).
samples <- read.gdsn(index.gdsn(genofile, "sample.id"))
length(samples)
## [1] 41
head(samples)
## [1] "1868_SRR17045855" "1582_SRR17045871" "848_SRR17045945"
## [4] "6_SRR17045986" "Z021E0138_CRX445559" "PHPR5_CRX445663"
# Missing assessment per sample
samp_miss <- snpgdsSampMissRate(genofile)
summary(samp_miss)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05919 0.11517 0.24147
samp_miss
## [1] 0.10671919 0.10405920 0.03284605 0.05866803 0.06505266 0.11516720
## [7] 0.15051173 0.15688646 0.13568568 0.07999736 0.18724063 0.16259868
## [13] 0.24146876 0.14871533 0.01318457 0.22546929 0.18271174 0.19314073
## [19] 0.06664799 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## [25] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## [31] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## [37] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
samples <- read.gdsn(index.gdsn(genofile, "sample.id"))
grep("B73", samples, ignore.case=TRUE, value=TRUE)
## character(0)
# Defining anchors (using SNPVeristy ID)
anchors <- c(
"1868_SRR17045855","1582_SRR17045871","848_SRR17045945","6_SRR17045986",
"Z021E0138_CRX445559","PHPR5_CRX445663","SeagullSeventeen_CRX445690",
"TD282_CRX444931","TE322_CRX444932","TE350_CRX444935",
"CML322_CRX445267","CML333_CRX445268","CML61_CRX445312","CIMBL58_CRX445535",
"KW4M029_CRX445127","72_125_CRX445284","75_322_CRX445460",
"Lo1125_CRX445527","S68911_ERR10235203"
)
# Defining unknowns (using exact IDs)
unknowns <- setdiff(samples, anchors)
# Counting quantity of anchors and unknown samples
length(anchors)
## [1] 19
length(unknowns)
## [1] 22
unknowns
## [1] "sorted_bam/TAMA_46_sorted.bam" "sorted_bam/URUG_1054_sorted.bam"
## [3] "sorted_bam/URUG_17_sorted.bam" "sorted_bam/URUG_19_sorted.bam"
## [5] "sorted_bam/URUG_353_sorted.bam" "sorted_bam/URUG_365_sorted.bam"
## [7] "sorted_bam/URUG_51_sorted.bam" "sorted_bam/URUG_527_sorted.bam"
## [9] "sorted_bam/URUG_537_sorted.bam" "sorted_bam/URUG_713_sorted.bam"
## [11] "sorted_bam/URUG_722_sorted.bam" "sorted_bam/URUG_724_sorted.bam"
## [13] "sorted_bam/URUG_747_sorted.bam" "sorted_bam/URUG_752_sorted.bam"
## [15] "sorted_bam/URUG_97_sorted.bam" "sorted_bam/VENE_323_sorted.bam"
## [17] "sorted_bam/VENE_326_sorted.bam" "sorted_bam/VENE_334_sorted.bam"
## [19] "sorted_bam/VENE_572_sorted.bam" "sorted_bam/VENE_760_sorted.bam"
## [21] "sorted_bam/VENE_909_sorted.bam" "sorted_bam/YUCA_164_sorted.bam"
We build a list of high-quality SNPs by keeping only variants with minor allele frequency ≥ 0.05 and missing genotypes in ≤ 10% of samples. This reduces rare and poorly genotyped sites before running PCA and IBS.
# Building SNP list
snps_qc <- snpgdsSelectSNP(genofile,
maf=0.05, # Drops rare SNPs
missing.rate=0.1) # Drop SNPs missing in >10% samples
## Excluding 0 SNP on non-autosomes
## Excluding 251,123 SNPs (monomorphic: TRUE, MAF: 0.05, missing rate: 0.1)
# Counting quantity of SNPs
length(snps_qc)
## [1] 52262
# PCA on QC SNPs
pca <- snpgdsPCA(genofile, snp.id=snps_qc, num.thread=2)
## Principal Component Analysis (PCA) on genotypes:
## Excluding 251,123 SNPs (non-autosomes or non-selection)
## Excluding 37,495 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: 0.01)
## # of samples: 41
## # of SNPs: 14,767
## using 2 threads/cores
## # of principal components: 32
## PCA: the sum of all selected genotypes (0,1,2) = 1026575
## CPU capabilities:
## 2025-11-26 15:40:20 (internal increment: 11988)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## 2025-11-26 15:40:20 Begin (eigenvalues and eigenvectors)
## 2025-11-26 15:40:20 Done.
# Building PCA Dataframe
pc_df <- data.frame(
sample = pca$sample.id,
PC1 = pca$eigenvect[,1],
PC2 = pca$eigenvect[,2],
stringsAsFactors = FALSE
)
pc_df$type <- ifelse(pc_df$sample %in% anchors, "anchor", "unknown")
pc_df$label <- ifelse(pc_df$type=="unknown",
gsub("sorted_bam/|_sorted.bam","", pc_df$sample),
pc_df$sample)
# Checking eigen values
pca$eigenval
## [1] 11.78650462 4.64106578 4.13949731 2.39072449 2.19033692 2.00628049
## [7] 1.81045503 1.58878012 1.43749645 1.41743311 1.27743222 1.21626486
## [13] 0.96213305 0.87857056 0.65627138 0.56066761 0.53787673 0.15647500
## [19] 0.06660645 0.03152228 0.02538612 0.02268002 0.02039808 0.01813191
## [25] 0.01767623 0.01698265 0.01542501 0.01386006 0.01246449 0.01230765
## [31] 0.01157453 0.01099847 NaN NaN NaN NaN
## [37] NaN NaN NaN NaN NaN
# Computing eigen values w/out "NaN" values
eigen_full <- pca$eigenval # Defining full eigen value list
eigen_cut <- eigen_full[!is.na(eigen_full)] # Removing "NaN's" from eigen value list
# Checking updated eigen values
eigen_cut
## [1] 11.78650462 4.64106578 4.13949731 2.39072449 2.19033692 2.00628049
## [7] 1.81045503 1.58878012 1.43749645 1.41743311 1.27743222 1.21626486
## [13] 0.96213305 0.87857056 0.65627138 0.56066761 0.53787673 0.15647500
## [19] 0.06660645 0.03152228 0.02538612 0.02268002 0.02039808 0.01813191
## [25] 0.01767623 0.01698265 0.01542501 0.01386006 0.01246449 0.01230765
## [31] 0.01157453 0.01099847
# % variance explained
pve <- eigen_cut / sum(eigen_cut) * 100
pve[1:20]
## [1] 29.50293394 11.61710461 10.36162286 5.98424969 5.48265728 5.02194354
## [7] 4.53177060 3.97689362 3.59821373 3.54799295 3.19755513 3.04444642
## [13] 2.40832620 2.19915996 1.64272038 1.40341349 1.34636536 0.39167435
## [19] 0.16672336 0.07890378
# Plotting PC1 vs PC2
with(pc_df, plot(PC1, PC2,
col=ifelse(type=="anchor","red","black"),
pch=19, xlab="PC1", ylab="PC2",
main="PCA on SNPs"))
text(pc_df$PC1, pc_df$PC2, labels=pc_df$label, pos=3, cex=0.6)
legend("topright", legend=c("Anchors","Unknowns"), col=c("red","black"), pch=19)
PCA Interpretation. The first two principle components explained 41.1% of the total genetic variation (PC1 = 29.5%, PC2 = 11.6%), and the first four PCs together explained 57.5%. For visualization, only the first two principal components (PC1 and PC2) were retained and plotted, since they capture the major axis of genetic structure.
The PCA on QC-filtered SNPs shows clear separation between reference anchors (which are in red) and the “unknown” samples (in black) along PC1, which explains approximately 29.5% of the genetic variance (PC1 + PC2 ≈ 41.1%). Most anchors are spread across negative PC1 values, forming several distinct groups, whereas all 22 unknown samples form a tight cluster on the far right of PC1. The anchor maize line KW4M029_CRX445127 specifically sits inside this right-hand cluster, indicating that, among the anchors used, the “unknown” samples are genetically most similar to KW4M029 and to each other, and are distinct from the other anchor backgrounds represented on the left side of the plot.
# IBS on QC SNPs
ibs <- snpgdsIBS(genofile, snp.id=snps_qc, num.thread=2)
## Identity-By-State (IBS) analysis on genotypes:
## Excluding 251,123 SNPs (non-autosomes or non-selection)
## Excluding 37,495 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: 0.01)
## # of samples: 41
## # of SNPs: 14,767
## using 2 threads/cores
## IBS: the sum of all selected genotypes (0,1,2) = 1026575
## 2025-11-26 15:40:20 (internal increment: 65536)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## 2025-11-26 15:40:20 Done.
ibs_mat <- ibs$ibs
rownames(ibs_mat) <- colnames(ibs_mat) <- ibs$sample.id
ibs_to_anchors <- ibs_mat[unknowns, anchors, drop=FALSE]
best_anchor <- t(apply(ibs_to_anchors, 1, function(x){
a <- names(which.max(x))
c(best_anchor=a, best_ibs=max(x))
}))
best_anchor <- as.data.frame(best_anchor, stringsAsFactors=FALSE)
best_anchor$best_ibs <- as.numeric(best_anchor$best_ibs)
best_anchor$unknown <- gsub("sorted_bam/|_sorted.bam","", rownames(best_anchor))
best_anchor <- best_anchor[, c("unknown","best_anchor","best_ibs")]
best_anchor[order(-best_anchor$best_ibs), ]
## unknown best_anchor best_ibs
## sorted_bam/URUG_527_sorted.bam URUG_527 KW4M029_CRX445127 0.9861854
## sorted_bam/URUG_752_sorted.bam URUG_752 KW4M029_CRX445127 0.9860838
## sorted_bam/URUG_724_sorted.bam URUG_724 KW4M029_CRX445127 0.9855421
## sorted_bam/VENE_572_sorted.bam VENE_572 KW4M029_CRX445127 0.9855421
## sorted_bam/URUG_365_sorted.bam URUG_365 KW4M029_CRX445127 0.9854066
## sorted_bam/URUG_747_sorted.bam URUG_747 KW4M029_CRX445127 0.9853389
## sorted_bam/URUG_1054_sorted.bam URUG_1054 KW4M029_CRX445127 0.9851696
## sorted_bam/VENE_760_sorted.bam VENE_760 KW4M029_CRX445127 0.9851358
## sorted_bam/TAMA_46_sorted.bam TAMA_46 KW4M029_CRX445127 0.9850342
## sorted_bam/URUG_722_sorted.bam URUG_722 KW4M029_CRX445127 0.9850342
## sorted_bam/YUCA_164_sorted.bam YUCA_164 KW4M029_CRX445127 0.9849665
## sorted_bam/VENE_323_sorted.bam VENE_323 KW4M029_CRX445127 0.9849326
## sorted_bam/URUG_713_sorted.bam URUG_713 KW4M029_CRX445127 0.9848988
## sorted_bam/URUG_51_sorted.bam URUG_51 KW4M029_CRX445127 0.9848649
## sorted_bam/URUG_537_sorted.bam URUG_537 KW4M029_CRX445127 0.9847633
## sorted_bam/VENE_909_sorted.bam VENE_909 KW4M029_CRX445127 0.9846279
## sorted_bam/URUG_17_sorted.bam URUG_17 KW4M029_CRX445127 0.9845263
## sorted_bam/URUG_353_sorted.bam URUG_353 KW4M029_CRX445127 0.9844586
## sorted_bam/VENE_334_sorted.bam VENE_334 KW4M029_CRX445127 0.9840523
## sorted_bam/URUG_97_sorted.bam URUG_97 KW4M029_CRX445127 0.9837475
## sorted_bam/URUG_19_sorted.bam URUG_19 KW4M029_CRX445127 0.9835105
## sorted_bam/VENE_326_sorted.bam VENE_326 KW4M029_CRX445127 0.9826640
IBS Interpretation. Across all 22 unknown accessions, KW4M029_CRX445127 is consistently identified as the closest anchor, with very high IBS values (~0.98–0.99). This indicates that the unknown samples form a group that is genetically very similar to each other and to KW4M029, and more distant from the other anchor backgrounds in the panel.
# Not necessary just wanted to see heatmap relation
#heatmap(
# ibs_mat,
# symm=TRUE,
# main="IBS Heatmap (chr1 pruned SNPs)"
#)
# Not necessary just wanted to see heatmap relation
#library(ape)
#dist_mat <- 1 - ibs_mat
#tree <- nj(as.dist(dist_mat))
#plot(tree, cex=0.6, main="NJ Tree from 1-IBS (chr1)")
Because the IBS values were extremely high, the PCA seemed to cluster
to well, and the initial merge used the --missing-to-ref
option (which, as mentioned previously, fills missing genotypes as
homozygous reference and can artificially inflate similarity), a second
pass of the analysis was performed without
--missing-to-ref. This was done to check whether the
apparent closeness to KW4M029 was robust to the treatment of missing
data rather than being an artifact of this crude imputation step.
vcf_nomiss <- "merged_vcf/merged_no_missing2ref.vcf.gz"
gds_nomiss <- "merged_vcf/with_anchors_no_missing2ref.gds"
snpgdsVCF2GDS(vcf_nomiss, gds_nomiss, method="biallelic.only")
## Start file conversion from VCF to SNP GDS ...
## Method: extracting biallelic SNPs
## Number of samples: 41
## Parsing "merged_vcf/merged_no_missing2ref.vcf.gz" ...
## import 303385 variants.
## + genotype { Bit2 41x303385, 3.0M } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
## open the file 'merged_vcf/with_anchors_no_missing2ref.gds' (4.4M)
## # of fragments: 78
## save to 'merged_vcf/with_anchors_no_missing2ref.gds.tmp'
## rename 'merged_vcf/with_anchors_no_missing2ref.gds.tmp' (4.4M, reduced: 696B)
## # of fragments: 20
genofile2 <- snpgdsOpen(gds_nomiss)
snpgdsSummary(genofile2)
## The file name: /Users/michaelcajigal/GROW Internship Projects/Corn Project/merged_vcf/with_anchors_no_missing2ref.gds
## The total number of samples: 41
## The total number of SNPs: 303385
## SNP genotypes are stored in SNP-major mode (Sample X SNP).
samples2 <- read.gdsn(index.gdsn(genofile2, "sample.id"))
length(samples2)
## [1] 41
head(samples2)
## [1] "1868_SRR17045855" "1582_SRR17045871" "848_SRR17045945"
## [4] "6_SRR17045986" "Z021E0138_CRX445559" "PHPR5_CRX445663"
anchors_raw <- c(
"1868_SRR17045855","1582_SRR17045871","848_SRR17045945","6_SRR17045986",
"Z021E0138_CRX445559","PHPR5_CRX445663","SeagullSeventeen_CRX445690",
"TD282_CRX444931","TE322_CRX444932","TE350_CRX444935",
"CML322_CRX445267","CML333_CRX445268","CML61_CRX445312","CIMBL58_CRX445535",
"KW4M029_CRX445127","72_125_CRX445284","75_322_CRX445460",
"Lo1125_CRX445527","S68911_ERR10235203"
)
anchors2 <- intersect(samples2, anchors_raw)
unknowns2 <- setdiff(samples2, anchors2)
length(anchors2)
## [1] 19
length(unknowns2)
## [1] 22
snps_qc2 <- snpgdsSelectSNP(genofile2, maf=0.05, missing.rate=0.1)
## Excluding 0 SNP on non-autosomes
## Excluding 302,558 SNPs (monomorphic: TRUE, MAF: 0.05, missing rate: 0.1)
length(snps_qc2) # should be similar scale as before (tens of thousands)
## [1] 827
pca2 <- snpgdsPCA(genofile2, snp.id=snps_qc2, num.thread=2)
## Principal Component Analysis (PCA) on genotypes:
## Excluding 302,558 SNPs (non-autosomes or non-selection)
## Excluding 517 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: 0.01)
## # of samples: 41
## # of SNPs: 310
## using 2 threads/cores
## # of principal components: 32
## PCA: the sum of all selected genotypes (0,1,2) = 19965
## CPU capabilities:
## 2025-11-26 15:40:23 (internal increment: 11988)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## 2025-11-26 15:40:23 Begin (eigenvalues and eigenvectors)
## 2025-11-26 15:40:23 Done.
pc_df2 <- data.frame(
sample = pca2$sample.id,
PC1 = pca2$eigenvect[,1],
PC2 = pca2$eigenvect[,2],
stringsAsFactors = FALSE
)
pc_df2$type <- ifelse(pc_df2$sample %in% anchors2, "anchor", "unknown")
pc_df2$label <- ifelse(
pc_df2$type=="unknown",
gsub("sorted_bam/|_sorted.bam","", pc_df2$sample),
pc_df2$sample
)
# variance explained (optional)
pve2 <- pca2$eigenval / sum(pca2$eigenval) * 100
with(pc_df2, plot(
PC1, PC2,
col=ifelse(type=="anchor","red","black"),
pch=19,
xlab=paste0("PC1"),
ylab=paste0("PC2"),
main="PCA (no missing-to-ref merge)"
))
text(pc_df2$PC1, pc_df2$PC2, labels=pc_df2$label, pos=3, cex=0.6)
legend("topright", legend=c("anchors","unknowns"), col=c("red","black"), pch=19)
PCA Interpretation. This PCA, shows essentially the same structure as the first PCA confirming genetic cluster structures. The anchors remain spread across negative to near-zero PC1 values, forming several distinct genetic groups. The unknown sample still cluster together on the right side of PC1, overlapping the position of KW4M029_CRX445127, although they are slightly more dispersed than in the first PCA. This indicates that even when missing genotypes are not forced to the reference allele, the unknown accessions continue to form a coherent cluster that is genetically closest to KW4M029 and clearly separated from the other anchor backgrounds.
ibs2 <- snpgdsIBS(genofile2, snp.id=snps_qc2, num.thread=2)
## Identity-By-State (IBS) analysis on genotypes:
## Excluding 302,558 SNPs (non-autosomes or non-selection)
## Excluding 517 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: 0.01)
## # of samples: 41
## # of SNPs: 310
## using 2 threads/cores
## IBS: the sum of all selected genotypes (0,1,2) = 19965
## 2025-11-26 15:40:23 (internal increment: 65536)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## 2025-11-26 15:40:23 Done.
ibs_mat2 <- ibs2$ibs
rownames(ibs_mat2) <- colnames(ibs_mat2) <- ibs2$sample.id
ibs_to_anchors2 <- ibs_mat2[unknowns2, anchors2, drop=FALSE]
best_anchor2 <- t(apply(ibs_to_anchors2, 1, function(x){
a <- names(which.max(x))
c(best_anchor=a, best_ibs=max(x))
}))
best_anchor2 <- as.data.frame(best_anchor2, stringsAsFactors=FALSE)
best_anchor2$best_ibs <- as.numeric(best_anchor2$best_ibs)
best_anchor2$unknown <- gsub("sorted_bam/|_sorted.bam","", rownames(best_anchor2))
best_anchor2 <- best_anchor2[, c("unknown","best_anchor","best_ibs")]
best_anchor2[order(-best_anchor2$best_ibs), ]
## unknown best_anchor best_ibs
## sorted_bam/URUG_527_sorted.bam URUG_527 KW4M029_CRX445127 0.9435484
## sorted_bam/URUG_752_sorted.bam URUG_752 KW4M029_CRX445127 0.9387097
## sorted_bam/URUG_724_sorted.bam URUG_724 KW4M029_CRX445127 0.9129032
## sorted_bam/VENE_572_sorted.bam VENE_572 KW4M029_CRX445127 0.9129032
## sorted_bam/URUG_365_sorted.bam URUG_365 KW4M029_CRX445127 0.9080645
## sorted_bam/URUG_747_sorted.bam URUG_747 KW4M029_CRX445127 0.9064516
## sorted_bam/URUG_1054_sorted.bam URUG_1054 KW4M029_CRX445127 0.8967742
## sorted_bam/VENE_760_sorted.bam VENE_760 KW4M029_CRX445127 0.8951613
## sorted_bam/TAMA_46_sorted.bam TAMA_46 KW4M029_CRX445127 0.8887097
## sorted_bam/URUG_722_sorted.bam URUG_722 KW4M029_CRX445127 0.8887097
## sorted_bam/YUCA_164_sorted.bam YUCA_164 KW4M029_CRX445127 0.8887097
## sorted_bam/URUG_713_sorted.bam URUG_713 KW4M029_CRX445127 0.8838710
## sorted_bam/VENE_323_sorted.bam VENE_323 KW4M029_CRX445127 0.8822581
## sorted_bam/URUG_51_sorted.bam URUG_51 KW4M029_CRX445127 0.8790323
## sorted_bam/URUG_537_sorted.bam URUG_537 KW4M029_CRX445127 0.8758065
## sorted_bam/VENE_909_sorted.bam VENE_909 KW4M029_CRX445127 0.8693548
## sorted_bam/URUG_17_sorted.bam URUG_17 KW4M029_CRX445127 0.8645161
## sorted_bam/URUG_353_sorted.bam URUG_353 KW4M029_CRX445127 0.8629032
## sorted_bam/VENE_334_sorted.bam VENE_334 KW4M029_CRX445127 0.8435484
## sorted_bam/URUG_97_sorted.bam URUG_97 KW4M029_CRX445127 0.8290323
## sorted_bam/URUG_19_sorted.bam URUG_19 KW4M029_CRX445127 0.8145161
## sorted_bam/VENE_326_sorted.bam VENE_326 KW4M029_CRX445127 0.7774194
table(best_anchor2$best_anchor)
##
## KW4M029_CRX445127
## 22
sort(colMeans(ibs_to_anchors2), decreasing=TRUE)
## KW4M029_CRX445127 TE350_CRX444935
## 0.8801320 0.7442082
## PHPR5_CRX445663 Z021E0138_CRX445559
## 0.7295455 0.7126833
## CIMBL58_CRX445535 TD282_CRX444931
## 0.6836510 0.6509531
## 72_125_CRX445284 1582_SRR17045871
## 0.6276393 0.6267595
## Lo1125_CRX445527 TE322_CRX444932
## 0.6219208 0.6090176
## S68911_ERR10235203 1868_SRR17045855
## 0.6006598 0.5969941
## CML333_CRX445268 CML61_CRX445312
## 0.5918622 0.5902493
## 6_SRR17045986 75_322_CRX445460
## 0.5874633 0.5826246
## SeagullSeventeen_CRX445690 848_SRR17045945
## 0.5801320 0.5682551
## CML322_CRX445267
## 0.5597507
IBS Interpretation. When IBS was recomputed on the
merge without --missing-to-ref, KW4M029_CRX445127 remained
the closest anchor for all 22 unknown accessions (IBS range 0.78–0.94;
mean ≈ 0.88). The reduction in absolute IBS values relative to the first
pass reflects a more conservative treatment of missing genotypes, but
the identity of the nearest anchor and the relative ranking of samples
were unchanged, supporting the robustness of the inferred relationship
to KW4M029.
In this brief comparitive analysis, both PCA and IBS show that the 22
unknown maize accessions form a tight genetic cluster that is clearly
separated from most anchor backgrounds and lies closest to the anchor
line KW4M029_CRX445127. In both the initial analysis
(with --missing-to-ref) and the more conservative
re-analysis without this option, KW4M029 remained the nearest anchor for
every unknown sample, with moderately to highly elevated IBS values.
According to panel metadata, KW4M029 is a European inbred line
from the KWS breeding program, registered in Turkey, so the
unknown accessions are most consistent with a genetic background related
to this European/KWS germplasm rather than to the other reference lines
included in the panel.
Interestingly, although the unknown accessions were labeled and sourced under names suggesting South American origins (URUG, VENE, YUCA, TAMA), SNP PCA and IBS analyses indicate that they are genetically most similar to the European KWS inbred line KW4M029. This discrepancy between apparent geographic sourcing and genomic affinity suggests that these materials likely trace back to European-derived germplasm, highlighting the complex movement and exchange of maize lines across regions.