1. Overview and Motivation

In this document we explore three multivariate methods commonly used in biology:

  1. Linear Discriminant Analysis (LDA)
  2. Multidimensional Scaling (MDS) / Principal Coordinates Analysis (PCoA)
  3. Clustering
      1. Hierarchical clustering
      2. Neighbor Joining (NJ)

We will focus on a specific question:

How do these methods behave when some samples are very similar or closely related (e.g. twins, technical replicates, oversampled locations or populations)?

Key intuition:

  • PCA (not covered here in detail) and LDA work directly with covariances; they can be sensitive to clusters of near-duplicate samples.
  • Distance-based methods like MDS, hierarchical clustering, and NJ work from a pairwise distance matrix; their global configuration is generally less distorted by duplicates, but over-representation still affects the analysis.

We will:

  • Simulate genotype-like data with related samples.
  • Apply LDA, MDS, hierarchical clustering, and NJ.
  • Compare their behavior in the presence of highly similar samples.
  • Summarize which methods are more robust and when.

3. Linear Discriminant Analysis (LDA)

3.1 Conceptual theory

LDA is a supervised method: it uses known group labels (here, populations) and finds linear combinations of variables that maximize separation between groups.

Formally, LDA finds directions \(w\) that maximize:

\[ J(w) = \frac{w^T S_B w}{w^T S_W w} \]

where:

  • \(S_B\) = between-class scatter (variance of class means),
  • \(S_W\) = within-class scatter (variance inside each class).

In words:

LDA seeks directions that maximize between-group variance while minimizing within-group variance.
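
To make this concrete, here is a minimal by-hand sketch of the criterion on a small simulated two-class dataset (all object names here are ours, not from the dataset used later): the direction maximizing \(J(w)\) is the leading eigenvector of \(S_W^{-1} S_B\).

# Minimal by-hand illustration of the LDA criterion (toy data; names ours)
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 2), ncol = 2))
g <- rep(c("A", "B"), each = 20)

mu  <- colMeans(X)                       # grand mean
S_W <- S_B <- matrix(0, 2, 2)
for (k in unique(g)) {
  Xk  <- X[g == k, , drop = FALSE]
  muk <- colMeans(Xk)
  S_W <- S_W + crossprod(sweep(Xk, 2, muk))     # within-class scatter
  S_B <- S_B + nrow(Xk) * tcrossprod(muk - mu)  # between-class scatter
}

# The direction maximizing J(w) is the leading eigenvector of S_W^{-1} S_B
w <- Re(eigen(solve(S_W) %*% S_B)$vectors[, 1])
w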

Important consequences:

  • Duplicates / very similar samples within the same class have limited effect (they slightly shrink within-class variance).
  • Relatives split across classes can be harmful, because they shrink between-class distances.
  • LDA does not care about overall variance, only about separation according to labels.

3.2 LDA on genotype-like data

We will run LDA using pop as the class label and SNPs as predictors.

# LDA with MASS::lda
geno_X <- geno_df[, grep("^SNP_", colnames(geno_df))]
geno_pop <- geno_df$pop

lda_fit <- MASS::lda(geno_X, grouping = geno_pop)
lda_fit
## Call:
## lda(geno_X, grouping = geno_pop)
## 
## Prior probabilities of groups:
## PopA PopB PopC 
## 0.60 0.25 0.15 
## 
## Group means:
##          SNP_1     SNP_2      SNP_3     SNP_4     SNP_5     SNP_6 SNP_7
## PopA 0.6166667 0.6833333 0.20000000 0.6833333 0.5333333 0.2833333  0.55
## PopB 0.8800000 0.6400000 0.40000000 0.6800000 0.5200000 0.4800000  0.48
## PopC 0.6666667 0.4000000 0.06666667 0.4000000 0.5333333 0.2000000  0.60
##           SNP_8     SNP_9    SNP_10    SNP_11 SNP_12    SNP_13    SNP_14 SNP_15
## PopA 0.11666667 0.3333333 0.5333333 0.4000000   0.50 0.6833333 0.2666667   0.25
## PopB 0.20000000 0.1200000 0.4800000 0.2400000   0.48 0.5600000 0.3600000   0.32
## PopC 0.06666667 0.4666667 0.6666667 0.5333333   0.40 0.6666667 0.1333333   0.40
##         SNP_16    SNP_17    SNP_18    SNP_19    SNP_20    SNP_21    SNP_22
## PopA 0.6333333 0.7000000 0.1833333 0.4166667 0.4000000 0.5833333 0.2666667
## PopB 0.7600000 0.7200000 0.1600000 0.6400000 0.8000000 0.6000000 0.2400000
## PopC 0.4000000 0.7333333 0.2666667 0.3333333 0.7333333 0.6666667 0.4000000
##         SNP_23    SNP_24    SNP_25    SNP_26 SNP_27    SNP_28    SNP_29
## PopA 0.5833333 0.8333333 0.1500000 0.4833333   0.30 0.5833333 0.3666667
## PopB 0.5200000 0.8000000 0.0400000 0.4000000   0.32 0.6800000 0.2000000
## PopC 0.3333333 0.8000000 0.2666667 0.5333333   0.00 0.9333333 0.4666667
##         SNP_30    SNP_31    SNP_32    SNP_33    SNP_34     SNP_35    SNP_36
## PopA 0.6333333 0.8000000 0.5833333 0.3500000 0.5000000 0.10000000 0.6166667
## PopB 0.7200000 0.8000000 0.6000000 0.3200000 0.3600000 0.04000000 0.6800000
## PopC 0.6666667 0.7333333 0.6666667 0.3333333 0.2666667 0.06666667 0.3333333
##          SNP_37     SNP_38    SNP_39     SNP_40    SNP_41    SNP_42     SNP_43
## PopA 0.08333333 0.23333333 0.5500000 0.46666667 0.2500000 0.4666667 0.08333333
## PopB 0.08000000 0.16000000 0.6000000 0.48000000 0.4400000 0.4400000 0.24000000
## PopC 0.40000000 0.06666667 0.3333333 0.06666667 0.2666667 0.3333333 0.13333333
##         SNP_44    SNP_45 SNP_46    SNP_47    SNP_48    SNP_49    SNP_50
## PopA 0.6833333 0.3500000   0.65 0.6833333 0.5166667 0.8166667 0.5166667
## PopB 0.8800000 0.2800000   0.56 0.6800000 0.2000000 0.4800000 0.6000000
## PopC 0.6000000 0.2666667   0.80 0.8000000 0.3333333 0.8666667 0.4666667
##          SNP_51    SNP_52    SNP_53    SNP_54     SNP_55    SNP_56    SNP_57
## PopA 0.33333333 0.3666667 0.3833333 0.8166667 0.08333333 0.5833333 0.4500000
## PopB 0.64000000 0.4000000 0.2000000 0.6400000 0.04000000 0.5600000 0.2800000
## PopC 0.06666667 0.4000000 0.2666667 1.4000000 0.06666667 0.2666667 0.5333333
##         SNP_58    SNP_59    SNP_60    SNP_61    SNP_62    SNP_63    SNP_64
## PopA 0.1666667 0.2000000 0.4166667 0.5666667 0.6333333 0.5166667 0.5833333
## PopB 0.1600000 0.2800000 0.3200000 0.4400000 0.7200000 0.4800000 0.3600000
## PopC 0.0000000 0.1333333 0.4666667 0.2000000 0.9333333 0.6666667 0.6000000
##         SNP_65     SNP_66 SNP_67   SNP_68    SNP_69    SNP_70    SNP_71
## PopA 0.6166667 0.28333333    0.2 0.550000 0.5333333 0.2666667 0.1000000
## PopB 0.3600000 0.16000000    0.2 0.680000 0.3200000 0.2400000 0.0000000
## PopC 0.8666667 0.06666667    0.2 1.133333 0.6000000 0.0000000 0.2666667
##          SNP_72    SNP_73    SNP_74    SNP_75 SNP_76    SNP_77    SNP_78
## PopA 0.16666667 0.2333333 0.3166667 0.3000000   0.65 0.1333333 0.4333333
## PopB 0.08000000 0.2000000 0.2000000 0.1200000   0.24 0.1200000 0.1200000
## PopC 0.06666667 0.1333333 0.0000000 0.5333333   0.60 0.2000000 0.3333333
##         SNP_79 SNP_80    SNP_81    SNP_82     SNP_83    SNP_84    SNP_85
## PopA 0.3166667   0.15 0.4666667 0.2500000 0.31666667 0.4666667 0.5833333
## PopB 0.1600000   0.12 0.4000000 0.1600000 0.36000000 0.6000000 0.5600000
## PopC 0.4000000   0.20 0.4000000 0.3333333 0.06666667 0.5333333 0.7333333
##         SNP_86 SNP_87    SNP_88    SNP_89 SNP_90    SNP_91     SNP_92    SNP_93
## PopA 0.3666667   0.15 0.2666667 0.1333333   0.20 0.5333333 0.08333333 0.1666667
## PopB 0.2800000   0.12 0.3200000 0.0400000   0.48 0.4400000 0.00000000 0.0800000
## PopC 0.3333333   0.60 0.1333333 0.0000000   0.00 0.4666667 0.13333333 0.4000000
##         SNP_94    SNP_95    SNP_96 SNP_97 SNP_98    SNP_99   SNP_100
## PopA 0.8166667 0.6333333 0.5666667    0.3   0.40 0.5833333 0.3000000
## PopB 0.9200000 0.3600000 0.6000000    0.4   0.52 0.8400000 0.4800000
## PopC 0.8666667 0.2666667 0.8000000    0.6   0.20 0.6000000 0.4666667
## 
## Coefficients of linear discriminants:
##                  LD1          LD2
## SNP_1     6.72522051  2.056640467
## SNP_2     0.73352762 -0.282090539
## SNP_3     1.56409074 -1.529274022
## SNP_4     8.64182730  3.377715434
## SNP_5    -2.59866410 -1.418778861
## SNP_6     2.57035563 -3.127040896
## SNP_7    -2.77239075 -1.940340801
## SNP_8    -0.81958410 -0.503951523
## SNP_9     1.43173348  2.650653573
## SNP_10   -1.28678387  0.685621169
## SNP_11   -2.14904545 -0.513483532
## SNP_12    0.60601804 -1.910847505
## SNP_13   -3.03890641 -0.198552988
## SNP_14    4.05423879  1.088000220
## SNP_15   -0.71563481 -1.844822590
## SNP_16   -1.79525360 -0.310938320
## SNP_17   -5.73810745 -1.860803479
## SNP_18   -4.07456141 -0.872751830
## SNP_19   -0.68278709  0.008064209
## SNP_20    2.99106990  2.025965172
## SNP_21    6.25649999  3.092184461
## SNP_22   -1.06261147  0.129021346
## SNP_23   -0.83343780  0.224043610
## SNP_24    1.27867095  0.857132773
## SNP_25   -0.52794269 -1.628297059
## SNP_26   -1.57177713 -0.869285353
## SNP_27   -7.89358333 -2.757255960
## SNP_28   -0.01022789  1.108158948
## SNP_29   -5.54437209 -2.409617350
## SNP_30    4.37916398  1.753254057
## SNP_31   -1.89835126 -0.463497014
## SNP_32   -0.37089857  0.868239084
## SNP_33   -2.65570889 -1.477116682
## SNP_34    3.74439280  0.356508505
## SNP_35   -5.27026251 -7.143854927
## SNP_36    6.27233734  3.698713987
## SNP_37   -1.37302512  6.757998793
## SNP_38   -0.80613685 -3.602642513
## SNP_39    5.94772311  1.759527757
## SNP_40   -3.20046778 -1.315204398
## SNP_41    2.74498016 -0.663455079
## SNP_42   -5.93126192 -5.755174796
## SNP_43    4.90173994 -0.847144867
## SNP_44    4.90116227  0.709592088
## SNP_45   -1.65889782 -0.876347965
## SNP_46    0.19721933  2.284109382
## SNP_47    7.33208316  5.614636632
## SNP_48    0.66428387  1.090802141
## SNP_49   -0.98331123 -0.646650958
## SNP_50    0.46656599 -0.242808749
## SNP_51   -1.84028323 -1.929383116
## SNP_52   -4.72889452 -2.795155512
## SNP_53   -0.81991115 -2.859793923
## SNP_54    1.47931890  2.788318116
## SNP_55  -14.65752251 -3.635199684
## SNP_56    4.06008894  1.831506987
## SNP_57    0.94011326  2.113291180
## SNP_58   -3.83693627 -2.543484497
## SNP_59   -0.59659532  6.335908274
## SNP_60   -6.02922664 -1.078624494
## SNP_61   -2.03822415 -2.139386043
## SNP_62    4.58762571  2.443566741
## SNP_63    0.58614585 -0.829358760
## SNP_64    1.08984222  1.589177941
## SNP_65   -3.70302623 -2.052400508
## SNP_66   -0.49880208  1.749104388
## SNP_67   -5.76593459 -4.496643278
## SNP_68    1.92466623 -0.966950427
## SNP_69   -6.44908351 -1.525358991
## SNP_70   -3.42159580 -0.545552261
## SNP_71   -3.27701874  6.533240454
## SNP_72   -9.53762523 -6.659194239
## SNP_73   -6.74673071 -6.399055302
## SNP_74    0.55535273 -2.190371436
## SNP_75   -0.35470621 -3.474591638
## SNP_76    3.01741087  0.807770683
## SNP_77   -5.64651674 -4.503210955
## SNP_78    3.25646914  3.709958129
## SNP_79    0.50802724  1.033181903
## SNP_80   -1.97130910  1.115081856
## SNP_81    3.50452555 -0.144850867
## SNP_82   -0.73759623 -2.116633117
## SNP_83    3.68685933  1.631935441
## SNP_84    2.35217883  1.039355911
## SNP_85   -1.78314970  1.053475409
## SNP_86   -2.90822182 -3.256819924
## SNP_87   -7.20309985 -2.489681697
## SNP_88    1.85337501 -1.230326339
## SNP_89    4.99503699  2.474716274
## SNP_90    0.92881774  1.702010316
## SNP_91    0.74760430 -1.612354561
## SNP_92    5.01390914  3.848720142
## SNP_93   -5.68797895 -3.404711648
## SNP_94   -2.48611469  0.278309004
## SNP_95   -0.46561882  0.199976254
## SNP_96    3.12428518  1.141932472
## SNP_97    0.47936236  2.472298614
## SNP_98   -1.10682147 -2.298047805
## SNP_99   -1.77623864  0.092872696
## SNP_100   1.55278126  3.586740810
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7775 0.2225

Project samples into LDA space (discriminant functions):

lda_pred <- predict(lda_fit)
lda_scores <- as.data.frame(lda_pred$x)
lda_scores$pop <- geno_df$pop
lda_scores$twin_status <- geno_df$twin_status

ggplot(lda_scores, aes(LD1, LD2, color = pop, shape = twin_status)) +
  geom_point(size = 2, alpha = 0.8) +
  theme_mva +
  labs(title = "LDA on genotype-like data",
       subtitle = "Twins are almost overlapping within PopB")

3.2.1 Interpretation

  • LDA axes are explicitly chosen to separate PopA, PopB, PopC.
  • Twin1 and Twin2 appear almost on top of each other in LDA space.
  • Because they belong to the same class (PopB), their near-identity mostly reduces within-class variance and does not distort global class separation.

If we had many near-duplicates in PopB and very few in other populations, LDA might:

  • Weight PopB structure more heavily,
  • But still optimize for between-population separation.

Thus:

LDA is relatively robust to duplicates within the same class but still affected by heavy imbalance or mislabelled relatives.
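
One way to probe the effect of the unbalanced design (priors 0.60/0.25/0.15 above) is to refit with equal priors and compare the resulting discriminant scores; prior is a standard argument of MASS::lda. A minimal sketch:

# Refit with equal priors to probe sensitivity to class imbalance
lda_equal <- MASS::lda(geno_X, grouping = geno_pop, prior = rep(1/3, 3))
lda_equal$prior
head(predict(lda_equal)$x)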

4. Multidimensional Scaling (MDS / PCoA)

4.1 Conceptual theory

Multidimensional Scaling (MDS), also called Principal Coordinates Analysis (PCoA), starts from a distance matrix between samples and finds a configuration of points in a low-dimensional Euclidean space that best preserves the pairwise distances.

Two common forms:

  • Classical MDS (cmdscale) – an eigendecomposition of the double-centered squared-distance matrix; with Euclidean distances it is equivalent to PCA on the centered data.
  • Non-metric MDS (isoMDS) – preserves only the rank order of distances.

Key properties:

  • MDS works only from distances, not raw variables.
  • Duplicate samples (distance = 0) simply end up at the same coordinate.
  • A cluster of near-duplicates mostly affects local structure rather than the global axes, unlike in covariance-based PCA.

So:

MDS is less sensitive to highly related samples than raw PCA, but over-represented clusters still influence the configuration because they contribute many similar distances.
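
To make the connection between classical MDS and PCA concrete, the sketch below runs classical MDS by hand via Gower double-centering on a small random distance matrix (all object names here are ours); the coordinates match cmdscale() up to sign.

# Classical MDS by hand (Gower double-centering); toy data, names ours
set.seed(1)
d_small <- dist(matrix(rnorm(20), ncol = 2))
D2 <- as.matrix(d_small)^2
n  <- nrow(D2)
J  <- diag(n) - matrix(1/n, n, n)   # centering matrix
B  <- -0.5 * J %*% D2 %*% J         # double-centered Gram matrix
eig    <- eigen(B, symmetric = TRUE)
coords <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))
max(abs(abs(coords) - abs(cmdscale(d_small, k = 2))))  # ~ 0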

4.2 Classical MDS on genotype distances

We compute a simple Euclidean distance between individuals:

D <- dist(geno_X, method = "euclidean")
D[1:5]
## [1] 7.681146 8.246211 7.810250 7.681146 8.124038

Run classical MDS:

mds_fit <- cmdscale(D, k = 2, eig = TRUE)
mds_scores <- as.data.frame(mds_fit$points)
colnames(mds_scores) <- c("MDS1", "MDS2")
mds_scores$pop <- geno_df$pop
mds_scores$twin_status <- geno_df$twin_status

ggplot(mds_scores, aes(MDS1, MDS2, color = pop, shape = twin_status)) +
  geom_point(size = 2, alpha = 0.8) +
  theme_mva +
  labs(title = "Classical MDS (PCoA) on genotype distances",
       subtitle = "Twins overlap; structure driven by overall distances")

Observe:

  • Twins lie almost at the same position (distance ≈ 0).
  • MDS finds major axes of variation in the distance space. Over-represented PopA still influences the configuration, but twin duplicates do not dominate a principal direction as strongly as in covariance-based PCA.
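
Non-metric MDS (mentioned in 4.1) can be run analogously. One caveat: MASS::isoMDS stops on zero distances, so exact duplicates would first need a tiny jitter or removal. A minimal sketch:

# Non-metric MDS; isoMDS requires strictly positive distances,
# so exact duplicates (distance 0) must be jittered or dropped first
if (all(D > 0)) {
  nmds_fit <- MASS::isoMDS(D, k = 2, trace = FALSE)
  head(nmds_fit$points)
}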

5. Clustering

We now consider clustering methods:

  1. Hierarchical clustering (agglomerative)
  2. Neighbor Joining (NJ) (tree-building, often used in phylogenetics)

Both operate on a distance matrix rather than on the raw variables.
Duplicated samples tend to form tight clusters or zero-length branches, which is intuitive and usually harmless.

5.1 Hierarchical clustering

5.1.1 Conceptual theory

Agglomerative hierarchical clustering:

  1. Start with each sample as its own cluster.
  2. Iteratively merge the two closest clusters according to a linkage rule:
    • complete, average, single, Ward, etc.
  3. The result is a dendrogram.

Relatives or near-duplicates:

  • Have very small pairwise distances.
  • Join early, forming small clades or tight subclusters.
  • Do not typically distort large-scale relationships (unless one group is extremely oversampled).

5.1.2 Example on genotype distances

hc_complete <- hclust(D, method = "complete")

# Plot dendrogram with population colors (simple version)
plot(hc_complete, labels = FALSE, main = "Hierarchical clustering (complete linkage)")

To visualize groupings more clearly, we can color labels by population:

# Create labels with population info
lab_colors <- as.factor(geno_df$pop)
lab_colors
##   [1] PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA
##  [16] PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA
##  [31] PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA
##  [46] PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA PopA
##  [61] PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB
##  [76] PopB PopB PopB PopB PopB PopB PopB PopB PopB PopB PopC PopC PopC PopC PopC
##  [91] PopC PopC PopC PopC PopC PopC PopC PopC PopC PopC
## Levels: PopA PopB PopC
plot(hc_complete, labels = FALSE, main = "Hierarchical clustering with population colors")
tip_order <- hc_complete$order  # leaf order along the x-axis
# Mark each leaf with a point colored by its population
points(seq_along(tip_order), rep(par("usr")[3], length(tip_order)),
       pch = 16, col = as.integer(lab_colors)[tip_order], xpd = TRUE)

Twins will appear as adjacent leaves joined at almost zero height, meaning they are much closer to each other than to any other sample.
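
As a quick numeric check, we can cut the dendrogram into three groups and cross-tabulate against the known populations (a sketch; with complete linkage the agreement need not be perfect):

# Cross-tabulate a 3-group cut of the tree against the known populations
table(cluster = cutree(hc_complete, k = 3), pop = geno_df$pop)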

5.2 Neighbor Joining (NJ)

5.2.1 Conceptual theory

Neighbor Joining is a distance-based tree reconstruction method, widely used in phylogenetics. It:

  • Takes a distance matrix as input.
  • Iteratively joins pairs of taxa that minimize a specific criterion (the Q-matrix; the formula is given after this list).
  • Produces an unrooted tree.
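
For reference, with \(n\) taxa and distances \(d(i,j)\), the pair joined at each step is the one minimizing

\[ Q(i,j) = (n - 2)\, d(i,j) - \sum_{k=1}^{n} d(i,k) - \sum_{k=1}^{n} d(j,k), \]

which balances the closeness of \(i\) and \(j\) against their average distance to all other taxa.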

Key points regarding relatives:

  • Highly similar or identical samples form short branches or appear as near neighbors.
  • Over-representation of a clade will add more leaves to that part of the tree, but usually does not change the deep structure if distances are additive.

In real phylogenetics, NJ is applied to genetic distances (e.g., p-distance, K2P, etc.), but here we can demonstrate NJ on the Euclidean distances between genotype vectors for teaching.

5.2.2 NJ on genotype distances

# Build the NJ tree with ape::nj (accepts a dist object)
nj_tree <- ape::nj(D)

# Label tips with sample ID and population
nj_tree$tip.label <- paste0(geno_df$sample_id, "_", geno_df$pop)

plot(nj_tree, cex = 0.5, main = "Neighbor Joining tree from genotype distances")

Twins will appear as neighboring tips with very short branch lengths.

Note: For real phylogenetic analysis, one should use DNA sequences or proper evolutionary distances, but the logic with related samples is very similar.

7. Exercises

7.1 In-class exercises (short, exploratory)

  1. LDA with and without twins
    • Using geno_df, re-run LDA after removing Twin2.
    • Compare the LDA scatterplot (LD1 vs LD2) with and without the second twin.
    • Do the global population separations change?
  2. MDS with and without oversampling
    • Create a balanced subset: randomly sample 15 individuals from PopA and keep all individuals from PopB and PopC (a starter sketch follows this list).
    • Re-run MDS on this balanced dataset.
    • Compare the MDS plot to the original one. How does oversampling PopA influence the configuration?
  3. Hierarchical clustering linkage methods
    • Run hierarchical clustering with method = "single", "average", and "complete".
    • Compare how the dendrogram changes, especially for the twin pair.
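
Starter code for exercise 2 (a sketch; names geno_df and geno_X as defined above):

# Balanced subset: 15 from PopA plus everyone from PopB and PopC
set.seed(1)
idx_A        <- sample(which(geno_df$pop == "PopA"), 15)
idx_balanced <- c(idx_A, which(geno_df$pop != "PopA"))
D_bal   <- dist(geno_X[idx_balanced, ])
mds_bal <- cmdscale(D_bal, k = 2)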

7.2 Homework-style exercises (more involved)

  1. MDS vs LDA
    • On the full dataset, compare LDA (LD1 vs LD2) and MDS (MDS1 vs MDS2).
    • Which method separates PopB from PopA more clearly?
    • Discuss how supervision (LDA uses labels) influences the result.
  2. NJ on a different distance
    • Instead of Euclidean distance on SNP genotypes, compute a Manhattan distance:

      D_manhattan <- dist(geno_X, method = "manhattan")
      tree_manhattan <- ape::nj(D_manhattan)
      plot(tree_manhattan, cex = 0.5)
    • Compare the tree to the Euclidean-based tree; how stable are the main splits?

  3. Adding extra near-duplicates
    • Artificially create 5 additional near-duplicate individuals in PopC by copying 5 real individuals and adding small random noise to their genotypes (a starter sketch follows this list).
    • Re-run MDS, hierarchical clustering, and NJ.
    • How do these extra near-duplicates in PopC affect the results?
  4. Extension to real data (optional)
    • Apply the same pipeline (LDA, MDS, clustering, NJ) to a real dataset (e.g. SNP genotypes or gene expression).
    • Identify any suspected related individuals or technical replicates.
    • Compare results with and without these samples.
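
Starter code for homework exercise 3 (a sketch; the noise level sd = 0.05 is an arbitrary choice):

# Copy 5 PopC individuals and perturb their genotypes slightly
set.seed(42)
popc_idx   <- sample(which(geno_df$pop == "PopC"), 5)
dupes      <- geno_X[popc_idx, ] +
              matrix(rnorm(5 * ncol(geno_X), sd = 0.05), nrow = 5)
geno_X_aug <- rbind(geno_X, dupes)
D_aug      <- dist(geno_X_aug)
# Re-run cmdscale(), hclust(), and ape::nj() on D_aug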

This R Markdown document provides code, theory, and examples illustrating how LDA, MDS, and clustering methods behave in the presence of highly related samples. You can adapt the simulations or swap in real datasets for teaching or analysis.