label x y z
1 A 0 1 1
2 B 1 1 0
3 C 1 0 5
Consider a \(p=3\) dimensional data set of \(n=3\) observations:
label x y z
1 A 0 1 1
2 B 1 1 0
3 C 1 0 5
Questions:
Consider plotting only 2 dimensions of these data at a time:
label x y z
1 A 0 1 1
2 B 1 1 0
3 C 1 0 5
…so, indeed, A and B are much closer to each other than either of them is to point C.
label x y
1 A 0 1
2 B 1 1
3 C 1 0
\(\scriptsize d(A,B)= 1, d(A,C) = \sqrt{2} = 1.41, d(B,C) = 1\)
label x z
1 A 0 1
2 B 1 0
3 C 1 5
\(\scriptsize d(A,B)= \sqrt{2} = 1.41, d(A,C) = \sqrt{17} = 4.12, d(B,C) = 5\)
… so the pairwise distances in the XZ configuration are much more comparable to the full dimensional distances than the distances in XY configuration
Given \(d_{ij}\) and \(\hat d_{ij}\) measuring the dissimilarity between observations \(i\) and \(j\) in full \(p\)-dimensional space and reduced (also called configured or ordination) space, respectively:
\[Stress = \sqrt{\sum_{i\ne j} (d_{ij} - \hat d_{ij})^2}\]
…in other words, you can think of stress as measuring “difference in distances.”
For our \(n=3\) example:
Full 3D distances:
Distances in 2D XY space:
Distances in 2D XZ space:
\[Stress(\mbox{XY configuration}) = \sqrt{(1.414-1)^2 + (4.243-1.414)^2+(5.099-1)^2} = 4.998\]
\[Stress(\mbox{XZ configuration}) = \sqrt{(1.414-1.414)^2 + (4.243-4.123)^2+(5.099-5)^2} = 0.156\]
Moral: the 2D XZ configuration is a more faithful representation of the 3D data than the 2D XY configuration.
metaMDS()
function in the vegan
package performs non-metric MDS (NMDS).\[ \mbox{Sammon squared stress: } \frac{1}{\sum_{i< j}d_{ij}}\sum_{i< j}\frac{(d_{ij} - \hat d_{ij})^2}{d_{ij}}\] \[ \mbox{Kruskal squared original stress: } \frac{\sum_{i< j}(d_{ij} - \hat d_{ij})^2}{\sum_{i< j} d_{ij}^2}\]
\[ \mbox{Kruskal squared alternative stress: } \frac{\sum_{i< j}(d_{ij} - \hat d_{ij})^2}{\sum_{i< j} (d_{ij}-\bar d)^2}; \bar d = \mbox{average of all }d_{ij}\]
Seattle Los.Angeles Dallas Miami Memphis Minneapolis Denver Detroit Boston New.York Atlanta St..Louis Chicago Las.Vegas
Seattle 0 955 1658 2721 1867 1395 1022 1922 2489 2415 2178 1705 1716 867
Los.Angeles 955 0 1232 2338 1615 1533 861 1975 2605 2470 1942 1589 1741 236
Dallas 1658 1232 0 1120 431 853 641 986 1559 1389 730 550 802 1053
Miami 2721 2338 1120 0 860 1503 1708 1148 1260 1092 596 1070 1199 2171
Memphis 1867 1615 431 860 0 701 871 611 1138 963 331 257 492 1413
Minneapolis 1395 1533 853 1503 701 0 679 527 1121 1026 907 449 334 1297
Denver 1022 861 641 1708 871 679 0 1120 1750 1622 1197 768 886 627
Detroit 1922 1975 986 1148 611 527 1120 0 631 508 595 439 234 1745
Boston 2489 2605 1559 1260 1138 1121 1750 631 0 186 945 1044 864 2375
New.York 2415 2470 1389 1092 963 1026 1622 508 186 0 759 890 738 2243
Atlanta 2178 1942 730 596 331 907 1197 595 945 759 0 484 606 1743
St..Louis 1705 1589 550 1070 257 449 768 439 1044 890 484 0 258 1368
Chicago 1716 1741 802 1199 492 334 886 234 864 738 606 258 0 1511
Las.Vegas 867 236 1053 2171 1413 1297 627 1745 2375 2243 1743 1368 1511 0
Using metaMDS()
to do the MDS, note that it knows nothing about location, only distances!
After NMDS, plotting the proposed coordinates:
The 2D coordinates approximate how these cities are arranged on a map!
Oil_ID Region Area Palmitic Palmitoleic Strearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic
1 1 Southern North-Apulia 1075 75 226 7823 672 36 60 29
2 2 Southern North-Apulia 1088 73 224 7709 781 31 61 29
3 3 Southern North-Apulia 911 54 246 8113 549 31 63 29
4 4 Southern North-Apulia 966 57 240 7952 619 50 78 35
5 5 Southern North-Apulia 1051 67 259 7771 672 50 80 46
6 6 Southern North-Apulia 911 49 268 7924 678 51 70 44
metaMDS()
function in the vegan
package will be our NMDS workhorse function.scaled_acids <- scale(acids_only)
library(vegan)
#autotransform = FALSE indicates we've already scaled the data
#trace = 0 so we don't get a bunch of printed stuff
mds_k2 <- metaMDS(scaled_acids, distance = 'euclidean', k=2, autotransform = FALSE, trace = 0)
mds_k3 <- metaMDS(scaled_acids, distance = 'euclidean', k=3, autotransform = FALSE, trace = 0)
Call:
metaMDS(comm = scaled_acids, distance = "euclidean", k = 2, autotransform = FALSE, trace = 0)
global Multidimensional Scaling using monoMDS
Data: scaled_acids
Distance: euclidean
Dimensions: 2
Stress: 0.1253932
Stress type 1, weak ties
Best solution was not repeated after 20 tries
The best solution was from try 0 (metric scaling or null solution)
Scaling: centring, PC rotation
Species: scores missing
Call:
metaMDS(comm = scaled_acids, distance = "euclidean", k = 3, autotransform = FALSE, trace = 0)
global Multidimensional Scaling using monoMDS
Data: scaled_acids
Distance: euclidean
Dimensions: 3
Stress: 0.07713075
Stress type 1, weak ties
Best solution was repeated 6 times in 20 tries
The best solution was from try 8 (random start)
Scaling: centring, PC rotation
Species: scores missing
Stress type 1
refers to the type of stress being used; see ?monoMDS
(the default engine used by metaMDS()
)Let’s start with the 2D ordination:
MDS1 MDS2
1 -1.2108207 -1.503616
2 -0.9585032 -1.098613
3 -2.4902901 -2.273519
4 -1.4279609 -2.987704
5 -0.4839538 -3.261394
6 -1.3429766 -3.617517
These are the coordinates of 2D space that best preserve the ordering of the distances in 8D space!
\[X_j = \beta_0 + \beta_1 MDS_1 + \beta_2 MDS_2 \]
The envfit()
function from vegan
takes care of this regression, and has a plot()
method that will add the directions of steepest ascent for each variable:
The length of each arrow is scaled by \(\sqrt{R^2}\) between that variable and the ordination coordinates, such that longer arrows indicate greater variable importance.
The three plots below are the fitted values of regressing oleic, strearic, and linolenic acids on the ordination coordinates, plotted vs the coordinates. Can you identify which is which?
So what do we learn from this?
ggplot
for more plotting flexibility, including using geom_text_repel
from the ggrepel
package for better arrow labeling.#Adjusts arrow lengths to fill graph
scale_arrow <- 4
# Data frame for arrows, from envfit():
arrowdf <- data.frame(k2_fit$vectors$arrows) %>%
rownames_to_column('oil') %>%
mutate(r = k2_fit$vectors$r) %>%
mutate(across(NMDS1:NMDS2, \(x) x*r*4))
head(arrowdf, 3)
oil NMDS1 NMDS2 r
1 Palmitic 3.0846194 0.5941522 0.7853301
2 Palmitoleic 2.8211315 1.7514107 0.8301439
3 Strearic -0.1412849 -0.3787475 0.1010603
# Plot production
library(ggrepel)
ggplot(data = oils_with_MDS2) +
geom_point(aes(x = MDS1, y = MDS2, color = Area)) +
geom_segment(aes(x = 0, y = 0,
xend = NMDS1, yend = NMDS2),
arrow = arrow(length = unit(.2, 'cm')),
data = arrowdf) +
geom_text_repel(aes(x = NMDS1, y = NMDS2, label = oil),
data = arrowdf) +
theme_classic(base_size = 14)
Faceting may help; since there are 9 regions the colors get difficult to distinguish:
ggplot(data = oils_with_MDS2) +
geom_point(aes(x = MDS1, y = MDS2, color = Area)) +
geom_segment(aes(x = 0, y = 0,
xend = NMDS1, yend = NMDS2),
arrow = arrow(length = unit(.2, 'cm')),
data = arrowdf) +
geom_text_repel(aes(x = NMDS1, y = NMDS2, label = oil),
data = arrowdf) +
theme_classic(base_size = 14) +
facet_wrap(~Area) +
guides(color='none') #Color legend no longer needed with faceting
We could also consider focusing on “most important” oils by filtering the arrowdf
and using that in the plot:
important_arrows <- arrowdf %>% filter(r > 0.7)
ggplot(data = oils_with_MDS2) +
geom_point(aes(x = MDS1, y = MDS2, color = Area)) +
geom_segment(aes(x = 0, y = 0,
xend = NMDS1, yend = NMDS2),
arrow = arrow(length = unit(.2, 'cm')),
data = important_arrows) +
geom_text_repel(aes(x = NMDS1, y = NMDS2, label = oil),
data = important_arrows) +
theme_classic(base_size = 14) +
facet_wrap(~Area) +
guides(color='none') #Color legend no longer needed with faceting
MDS1 MDS2 MDS3
1 -1.4825182 1.479909 -0.1723490
2 -1.2114245 1.074259 -0.1379729
3 -2.7571058 2.090409 -0.4065870
4 -1.6505858 3.044282 0.2517048
5 -0.7969349 3.343739 -0.2793707
6 -1.6521762 3.573486 -0.4553298
oil
data set:choices=1:3
)arrows
data frameplotly
With the help of Claude AI (see chat transcript), I produced this code to create the 3D version of the ordination plot:
# Specify 3D scatterplot
fig <- plot_ly() %>%
add_trace(data = oil_with_MDS3,
x = ~MDS1,
y = ~MDS2,
z = ~MDS3,
color = ~Area,
type = "scatter3d",
mode = "markers",
marker = list(size = 2)
)
# Add arrows starting from origin
for(i in 1:nrow(arrows)) {
fig <- fig %>%
add_trace(
x = c(0, arrows$NMDS1[i]),
y = c(0, arrows$NMDS2[i]),
z = c(0, arrows$NMDS3[i]),
type = "scatter3d",
mode = "lines",
line = list(color = "red", width = 4),
showlegend = FALSE,
name = arrows$oil[i]
)
}
# Add labels at the end of each arrow
fig <- fig %>%
add_trace(
data = arrows,
x = ~NMDS1, y = ~NMDS2, z = ~NMDS3,
type = "scatter3d",
mode = "text",
text = ~oil,
textposition = "top center",
textfont = list(size = 14, color = "red"),
showlegend = FALSE,
name = "Labels"
)
fig
It appears one of the things the 3rd dimension does is to separate oils with high/low Strearic acid, previously an unimportant acid in 2 dimensions.
A link to the JMP script with plots… can you find out anything else about what that third dimension might represent?