In this document we explore three multivariate methods commonly used in biology:
We will focus on a specific question:
How do these methods behave when some samples are very similar or closely related (e.g. twins, technical replicates, oversampled locations or populations)?
Key intuition:
We will:
LDA is a supervised method: it uses known group labels (here, populations) and finds linear combinations of variables that maximize separation between groups.
Formally, LDA finds directions \(w\) that maximize:
\[ J(w) = \frac{w^T S_B w}{w^T S_W w} \]
where:
In words:
LDA seeks directions that maximize between-group variance while minimizing within-group variance.
Important consequences:
We will run LDA using pop as the class label and SNPs as
predictors.
# LDA with MASS::lda
geno_X <- geno_df[, grep("^SNP_", colnames(geno_df))]
geno_pop <- geno_df$pop
lda_fit <- MASS::lda(geno_X, grouping = geno_pop)
lda_fit
## Call:
## lda(geno_X, grouping = geno_pop)
##
## Prior probabilities of groups:
## PopA PopB PopC
## 0.60 0.25 0.15
##
## Group means:
## SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7
## PopA 0.6166667 0.6833333 0.20000000 0.6833333 0.5333333 0.2833333 0.55
## PopB 0.8800000 0.6400000 0.40000000 0.6800000 0.5200000 0.4800000 0.48
## PopC 0.6666667 0.4000000 0.06666667 0.4000000 0.5333333 0.2000000 0.60
## SNP_8 SNP_9 SNP_10 SNP_11 SNP_12 SNP_13 SNP_14 SNP_15
## PopA 0.11666667 0.3333333 0.5333333 0.4000000 0.50 0.6833333 0.2666667 0.25
## PopB 0.20000000 0.1200000 0.4800000 0.2400000 0.48 0.5600000 0.3600000 0.32
## PopC 0.06666667 0.4666667 0.6666667 0.5333333 0.40 0.6666667 0.1333333 0.40
## SNP_16 SNP_17 SNP_18 SNP_19 SNP_20 SNP_21 SNP_22
## PopA 0.6333333 0.7000000 0.1833333 0.4166667 0.4000000 0.5833333 0.2666667
## PopB 0.7600000 0.7200000 0.1600000 0.6400000 0.8000000 0.6000000 0.2400000
## PopC 0.4000000 0.7333333 0.2666667 0.3333333 0.7333333 0.6666667 0.4000000
## SNP_23 SNP_24 SNP_25 SNP_26 SNP_27 SNP_28 SNP_29
## PopA 0.5833333 0.8333333 0.1500000 0.4833333 0.30 0.5833333 0.3666667
## PopB 0.5200000 0.8000000 0.0400000 0.4000000 0.32 0.6800000 0.2000000
## PopC 0.3333333 0.8000000 0.2666667 0.5333333 0.00 0.9333333 0.4666667
## SNP_30 SNP_31 SNP_32 SNP_33 SNP_34 SNP_35 SNP_36
## PopA 0.6333333 0.8000000 0.5833333 0.3500000 0.5000000 0.10000000 0.6166667
## PopB 0.7200000 0.8000000 0.6000000 0.3200000 0.3600000 0.04000000 0.6800000
## PopC 0.6666667 0.7333333 0.6666667 0.3333333 0.2666667 0.06666667 0.3333333
## SNP_37 SNP_38 SNP_39 SNP_40 SNP_41 SNP_42 SNP_43
## PopA 0.08333333 0.23333333 0.5500000 0.46666667 0.2500000 0.4666667 0.08333333
## PopB 0.08000000 0.16000000 0.6000000 0.48000000 0.4400000 0.4400000 0.24000000
## PopC 0.40000000 0.06666667 0.3333333 0.06666667 0.2666667 0.3333333 0.13333333
## SNP_44 SNP_45 SNP_46 SNP_47 SNP_48 SNP_49 SNP_50
## PopA 0.6833333 0.3500000 0.65 0.6833333 0.5166667 0.8166667 0.5166667
## PopB 0.8800000 0.2800000 0.56 0.6800000 0.2000000 0.4800000 0.6000000
## PopC 0.6000000 0.2666667 0.80 0.8000000 0.3333333 0.8666667 0.4666667
## SNP_51 SNP_52 SNP_53 SNP_54 SNP_55 SNP_56 SNP_57
## PopA 0.33333333 0.3666667 0.3833333 0.8166667 0.08333333 0.5833333 0.4500000
## PopB 0.64000000 0.4000000 0.2000000 0.6400000 0.04000000 0.5600000 0.2800000
## PopC 0.06666667 0.4000000 0.2666667 1.4000000 0.06666667 0.2666667 0.5333333
## SNP_58 SNP_59 SNP_60 SNP_61 SNP_62 SNP_63 SNP_64
## PopA 0.1666667 0.2000000 0.4166667 0.5666667 0.6333333 0.5166667 0.5833333
## PopB 0.1600000 0.2800000 0.3200000 0.4400000 0.7200000 0.4800000 0.3600000
## PopC 0.0000000 0.1333333 0.4666667 0.2000000 0.9333333 0.6666667 0.6000000
## SNP_65 SNP_66 SNP_67 SNP_68 SNP_69 SNP_70 SNP_71
## PopA 0.6166667 0.28333333 0.2 0.550000 0.5333333 0.2666667 0.1000000
## PopB 0.3600000 0.16000000 0.2 0.680000 0.3200000 0.2400000 0.0000000
## PopC 0.8666667 0.06666667 0.2 1.133333 0.6000000 0.0000000 0.2666667
## SNP_72 SNP_73 SNP_74 SNP_75 SNP_76 SNP_77 SNP_78
## PopA 0.16666667 0.2333333 0.3166667 0.3000000 0.65 0.1333333 0.4333333
## PopB 0.08000000 0.2000000 0.2000000 0.1200000 0.24 0.1200000 0.1200000
## PopC 0.06666667 0.1333333 0.0000000 0.5333333 0.60 0.2000000 0.3333333
## SNP_79 SNP_80 SNP_81 SNP_82 SNP_83 SNP_84 SNP_85
## PopA 0.3166667 0.15 0.4666667 0.2500000 0.31666667 0.4666667 0.5833333
## PopB 0.1600000 0.12 0.4000000 0.1600000 0.36000000 0.6000000 0.5600000
## PopC 0.4000000 0.20 0.4000000 0.3333333 0.06666667 0.5333333 0.7333333
## SNP_86 SNP_87 SNP_88 SNP_89 SNP_90 SNP_91 SNP_92 SNP_93
## PopA 0.3666667 0.15 0.2666667 0.1333333 0.20 0.5333333 0.08333333 0.1666667
## PopB 0.2800000 0.12 0.3200000 0.0400000 0.48 0.4400000 0.00000000 0.0800000
## PopC 0.3333333 0.60 0.1333333 0.0000000 0.00 0.4666667 0.13333333 0.4000000
## SNP_94 SNP_95 SNP_96 SNP_97 SNP_98 SNP_99 SNP_100
## PopA 0.8166667 0.6333333 0.5666667 0.3 0.40 0.5833333 0.3000000
## PopB 0.9200000 0.3600000 0.6000000 0.4 0.52 0.8400000 0.4800000
## PopC 0.8666667 0.2666667 0.8000000 0.6 0.20 0.6000000 0.4666667
##
## Coefficients of linear discriminants:
## LD1 LD2
## SNP_1 6.72522051 2.056640467
## SNP_2 0.73352762 -0.282090539
## SNP_3 1.56409074 -1.529274022
## SNP_4 8.64182730 3.377715434
## SNP_5 -2.59866410 -1.418778861
## SNP_6 2.57035563 -3.127040896
## SNP_7 -2.77239075 -1.940340801
## SNP_8 -0.81958410 -0.503951523
## SNP_9 1.43173348 2.650653573
## SNP_10 -1.28678387 0.685621169
## SNP_11 -2.14904545 -0.513483532
## SNP_12 0.60601804 -1.910847505
## SNP_13 -3.03890641 -0.198552988
## SNP_14 4.05423879 1.088000220
## SNP_15 -0.71563481 -1.844822590
## SNP_16 -1.79525360 -0.310938320
## SNP_17 -5.73810745 -1.860803479
## SNP_18 -4.07456141 -0.872751830
## SNP_19 -0.68278709 0.008064209
## SNP_20 2.99106990 2.025965172
## SNP_21 6.25649999 3.092184461
## SNP_22 -1.06261147 0.129021346
## SNP_23 -0.83343780 0.224043610
## SNP_24 1.27867095 0.857132773
## SNP_25 -0.52794269 -1.628297059
## SNP_26 -1.57177713 -0.869285353
## SNP_27 -7.89358333 -2.757255960
## SNP_28 -0.01022789 1.108158948
## SNP_29 -5.54437209 -2.409617350
## SNP_30 4.37916398 1.753254057
## SNP_31 -1.89835126 -0.463497014
## SNP_32 -0.37089857 0.868239084
## SNP_33 -2.65570889 -1.477116682
## SNP_34 3.74439280 0.356508505
## SNP_35 -5.27026251 -7.143854927
## SNP_36 6.27233734 3.698713987
## SNP_37 -1.37302512 6.757998793
## SNP_38 -0.80613685 -3.602642513
## SNP_39 5.94772311 1.759527757
## SNP_40 -3.20046778 -1.315204398
## SNP_41 2.74498016 -0.663455079
## SNP_42 -5.93126192 -5.755174796
## SNP_43 4.90173994 -0.847144867
## SNP_44 4.90116227 0.709592088
## SNP_45 -1.65889782 -0.876347965
## SNP_46 0.19721933 2.284109382
## SNP_47 7.33208316 5.614636632
## SNP_48 0.66428387 1.090802141
## SNP_49 -0.98331123 -0.646650958
## SNP_50 0.46656599 -0.242808749
## SNP_51 -1.84028323 -1.929383116
## SNP_52 -4.72889452 -2.795155512
## SNP_53 -0.81991115 -2.859793923
## SNP_54 1.47931890 2.788318116
## SNP_55 -14.65752251 -3.635199684
## SNP_56 4.06008894 1.831506987
## SNP_57 0.94011326 2.113291180
## SNP_58 -3.83693627 -2.543484497
## SNP_59 -0.59659532 6.335908274
## SNP_60 -6.02922664 -1.078624494
## SNP_61 -2.03822415 -2.139386043
## SNP_62 4.58762571 2.443566741
## SNP_63 0.58614585 -0.829358760
## SNP_64 1.08984222 1.589177941
## SNP_65 -3.70302623 -2.052400508
## SNP_66 -0.49880208 1.749104388
## SNP_67 -5.76593459 -4.496643278
## SNP_68 1.92466623 -0.966950427
## SNP_69 -6.44908351 -1.525358991
## SNP_70 -3.42159580 -0.545552261
## SNP_71 -3.27701874 6.533240454
## SNP_72 -9.53762523 -6.659194239
## SNP_73 -6.74673071 -6.399055302
## SNP_74 0.55535273 -2.190371436
## SNP_75 -0.35470621 -3.474591638
## SNP_76 3.01741087 0.807770683
## SNP_77 -5.64651674 -4.503210955
## SNP_78 3.25646914 3.709958129
## SNP_79 0.50802724 1.033181903
## SNP_80 -1.97130910 1.115081856
## SNP_81 3.50452555 -0.144850867
## SNP_82 -0.73759623 -2.116633117
## SNP_83 3.68685933 1.631935441
## SNP_84 2.35217883 1.039355911
## SNP_85 -1.78314970 1.053475409
## SNP_86 -2.90822182 -3.256819924
## SNP_87 -7.20309985 -2.489681697
## SNP_88 1.85337501 -1.230326339
## SNP_89 4.99503699 2.474716274
## SNP_90 0.92881774 1.702010316
## SNP_91 0.74760430 -1.612354561
## SNP_92 5.01390914 3.848720142
## SNP_93 -5.68797895 -3.404711648
## SNP_94 -2.48611469 0.278309004
## SNP_95 -0.46561882 0.199976254
## SNP_96 3.12428518 1.141932472
## SNP_97 0.47936236 2.472298614
## SNP_98 -1.10682147 -2.298047805
## SNP_99 -1.77623864 0.092872696
## SNP_100 1.55278126 3.586740810
##
## Proportion of trace:
## LD1 LD2
## 0.7775 0.2225
Project samples into LDA space (discriminant functions):
lda_pred <- predict(lda_fit)
lda_scores <- as.data.frame(lda_pred$x)
lda_scores$pop <- geno_df$pop
lda_scores$twin_status <- geno_df$twin_status
ggplot(lda_scores, aes(LD1, LD2, color = pop, shape = twin_status)) +
geom_point(size = 2, alpha = 0.8) +
theme_mva +
labs(title = "LDA on genotype-like data",
subtitle = "Twins are almost overlapping within PopB")
If we had many near-duplicates in PopB and very few in other populations, LDA might:
Thus:
LDA is relatively robust to duplicates within the same class but still affected by heavy imbalance or mislabelled relatives.
Multidimensional Scaling (MDS), also called Principal Coordinates Analysis (PCoA), starts from a distance matrix between samples and finds a configuration of points in a low-dimensional Euclidean space that best preserves the pairwise distances.
Two common forms:
cmdscale) – equivalent
to PCA on a centered distance matrix.isoMDS) – preserves
rank order of distances.Key properties:
So:
MDS is less sensitive to highly related samples than raw PCA, but over-represented clusters still influence the configuration because they contribute many similar distances.
We compute a simple Euclidean distance between individuals:
D <- dist(geno_X, method = "euclidean")
D[1:5]
## [1] 7.681146 8.246211 7.810250 7.681146 8.124038
Run classical MDS:
mds_fit <- cmdscale(D, k = 2, eig = TRUE)
mds_scores <- as.data.frame(mds_fit$points)
colnames(mds_scores) <- c("MDS1", "MDS2")
mds_scores$pop <- geno_df$pop
mds_scores$twin_status <- geno_df$twin_status
ggplot(mds_scores, aes(MDS1, MDS2, color = pop, shape = twin_status)) +
geom_point(size = 2, alpha = 0.8) +
theme_mva +
labs(title = "Classical MDS (PCoA) on genotype distances",
subtitle = "Twins overlap; structure driven by overall distances")
Observe:
We now consider clustering methods:
Both operate on a distance matrix rather than raw
variables.
Duplication of samples tends to create tight little clusters or
zero-length branches, which is intuitive and usually not harmful.
Agglomerative hierarchical clustering:
Relatives or near-duplicates:
hc_complete <- hclust(D, method = "complete")
# Plot dendrogram with population colors (simple version)
plot(hc_complete, labels = FALSE, main = "Hierarchical clustering (complete linkage)")
To visualize groupings more clearly, we can color labels by population:
# Create labels with population info
lab_colors <- as.factor(geno_df$pop)
lab_colors
## [1] PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA
## [16] PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA
## [31] PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA
## [46] PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA
## [61] PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB
## [76] PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopC PopC PopC PopC PopC
## [91] PopC PopC PopC PopC PopC PopC PopC PopC PopC PopC
## Levels: PopA PopB PopC
plot(hc_complete, labels = FALSE, main = "Hierarchical clustering with population colors")
tip_order <- hc_complete$order
points(seq_along(tip_order), rep(par("usr")[3], length(tip_order)),
pch = 16, col = lab_colors[tip_order], xpd = TRUE)
Twins will appear as adjacent leaves with almost zero height, meaning they are much closer to each other than to others.
Neighbor Joining is a distance-based tree reconstruction method, widely used in phylogenetics. It:
Key points regarding relatives:
In real phylogenetics, NJ is applied to genetic distances (e.g., p-distance, K2P, etc.), but here we can demonstrate NJ on the Euclidean distances between genotype vectors for teaching.
# Use ape::nj on the dist object
nj_tree <- nj(D)
# Assign tip labels to population information
nj_tree$tip.label <- paste0(geno_df$sample_id, "_", geno_df$pop)
plot(nj_tree, cex = 0.5, main = "Neighbor Joining tree from genotype distances")
Twins will appear as neighboring tips with very short branch lengths.
Note: For real phylogenetic analysis, one should use DNA sequences or proper evolutionary distances, but the logic with related samples is very similar.
geno_df, re-run LDA after removing
Twin2.method = "single",
"average", and "complete".Instead of Euclidean distance on SNP genotypes, compute a Manhattan distance:
D_manhattan <- dist(geno_X, method = "manhattan")
tree_manhattan <- nj(D_manhattan)
plot(tree_manhattan, cex = 0.5)Compare the tree to the Euclidean-based tree; how stable are the main splits?
This R Markdown provides code, theory, and examples to illustrate how LDA, MDS, and clustering methods behave in the presence of highly related samples. You can adapt the simulations or swap in real datasets for teaching or analysis.