2.2 - Multidimensional scaling (MDS)

Motivation

  • A common objective in unsupervised learning is dimension reduction
  • Dimension reduction: reduce the number of columns into a much smaller number that still preserves information as best as possible
  • Multidimensional scaling tries to preserve information by trying to preserve the \(p\)-dimensional distances between points in lower dimensional space as best as possible

Example

Consider a \(p=3\) dimensional data set of \(n=3\) observations:

  label x y z
1     A 0 1 1
2     B 1 1 0
3     C 1 0 5

Questions:

  • Which points are closest? Which point is “off on its own”?
  • Can you rotate this plot so that the points all appear equidistant?
  • Can you rotate this plot to make point B look like the outlier?
  • Which 2D “snapshot” best represents the 3D picture?

2D projections

Consider plotting only 2 dimensions of these data at a time:

  • In the context of MDS, these lower-dimensional reprentations are called configurations or ordinations.
  • Which of these ordinations/configurations best approximates how far apart the points are in 3 dimensions?

The concept of stress

  • Stress measures how different distances between points are in the configuration space compared to full dimensional space
  • Let’s take another look at the full (3D) data, and measure the Euclidean distance between the 3 points:
  label x y z
1     A 0 1 1
2     B 1 1 0
3     C 1 0 5
  • \(d(A,B) = \sqrt{1^2 + 0^2 + 1^2} = 1.414\)
  • \(d(A,C) = \sqrt{1^2 + 1^2 + 4^2} = 4.243\)
  • \(d(B,C) = \sqrt{0^2 + 1^2 + 5^2} = 5.099\)

…so, indeed, A and B are much closer to each other than either of them is to point C.

Distances in 2D

  label x y
1     A 0 1
2     B 1 1
3     C 1 0

\(\scriptsize d(A,B)= 1, d(A,C) = \sqrt{2} = 1.41, d(B,C) = 1\)

  label x z
1     A 0 1
2     B 1 0
3     C 1 5

\(\scriptsize d(A,B)= \sqrt{2} = 1.41, d(A,C) = \sqrt{17} = 4.12, d(B,C) = 5\)

… so the pairwise distances in the XZ configuration are much more comparable to the full dimensional distances than the distances in XY configuration

Stress defined

Given \(d_{ij}\) and \(\hat d_{ij}\) measuring the dissimilarity between observations \(i\) and \(j\) in full \(p\)-dimensional space and reduced (also called configured or ordination) space, respectively:

\[Stress = \sqrt{\sum_{i\ne j} (d_{ij} - \hat d_{ij})^2}\]

…in other words, you can think of stress as measuring “difference in distances.”

Example: finding the stress

For our \(n=3\) example:

Full 3D distances:

  • \(d(A,B) = 1.414\)
  • \(d(A,C) = 4.243\)
  • \(d(B,C) = 5.099\)

Distances in 2D XY space:

  • \(d(A,B) = 1\)
  • \(d(A,C) =1.414\)
  • \(d(B,C) = 1\)

Distances in 2D XZ space:

  • \(d(A,B) = 1.414\)
  • \(d(A,C) =4.123\)
  • \(d(B,C) = 5\)

\[Stress(\mbox{XY configuration}) = \sqrt{(1.414-1)^2 + (4.243-1.414)^2+(5.099-1)^2} = 4.998\]

\[Stress(\mbox{XZ configuration}) = \sqrt{(1.414-1.414)^2 + (4.243-4.123)^2+(5.099-5)^2} = 0.156\]

Moral: the 2D XZ configuration is a more faithful representation of the 3D data than the 2D XY configuration.

Metric vs non-metric MDS

  • If \(d(\cdot)\) is truly a distance metric (e.g., Euclidean, taxicab), then metric MDS is appropriate.
    • Objective: represent the full-dimensional distances as well as possible.
    • Note: metric MDS with Euclidean distances \(\equiv\) PCA (later)
  • If \(d(\cdot)\) is not truly a distance metric, but rather a dissimilarity measure (e.g., Bray-Curtis) then we perform non-metric MDS (NMDS).
    • Objective: represent the rankings of the full-dimensional dissimilarities as well as possible.
    • NMDS may also refer to preserving orderings when a distance metric is used.
  • The metaMDS() function in the vegan package performs non-metric MDS (NMDS).

NMDS stress measures

  • NMDS algorithms slightly modify the definition of stress. Some of these are below; note that the \(i<j\) indicates we sum over the lower triangular portion of a dissimilarity matrix:

\[ \mbox{Sammon squared stress: } \frac{1}{\sum_{i< j}d_{ij}}\sum_{i< j}\frac{(d_{ij} - \hat d_{ij})^2}{d_{ij}}\] \[ \mbox{Kruskal squared original stress: } \frac{\sum_{i< j}(d_{ij} - \hat d_{ij})^2}{\sum_{i< j} d_{ij}^2}\]

\[ \mbox{Kruskal squared alternative stress: } \frac{\sum_{i< j}(d_{ij} - \hat d_{ij})^2}{\sum_{i< j} (d_{ij}-\bar d)^2}; \bar d = \mbox{average of all }d_{ij}\]

The big idea

  • Given a distance matrix \(D\), and specified size of configuration space \(k\), find the coordinates of the points in \(k\) dimensions that minimize stress.

What’s a “good” stress?

  • In general, smaller stress = more faithful lower-dimensional representation of the full dimensional data
  • Rule of thumb: Stress < 0.2 = acceptable; Stress < 0.1 = good; Stress < 0.05 = excellent
  • For fixed \(k\), stress increases both with \(n\) and \(p\) (a low-dimensional representation gets harder with more points to consider, or more variables feeding into the distance matrix)
  • For fixed \(n\) and \(p\), the larger the dimension of the ordination space (i.e., the low-dimensional representation), the lower the stress (our low-dimensional approximation gets less and less “low”)

Airport distance example

  • Consider a matrix of the distances (in miles) between 14 U.S. airports.
            Seattle Los.Angeles Dallas Miami Memphis Minneapolis Denver Detroit Boston New.York Atlanta St..Louis Chicago Las.Vegas
Seattle           0         955   1658  2721    1867        1395   1022    1922   2489     2415    2178      1705    1716       867
Los.Angeles     955           0   1232  2338    1615        1533    861    1975   2605     2470    1942      1589    1741       236
Dallas         1658        1232      0  1120     431         853    641     986   1559     1389     730       550     802      1053
Miami          2721        2338   1120     0     860        1503   1708    1148   1260     1092     596      1070    1199      2171
Memphis        1867        1615    431   860       0         701    871     611   1138      963     331       257     492      1413
Minneapolis    1395        1533    853  1503     701           0    679     527   1121     1026     907       449     334      1297
Denver         1022         861    641  1708     871         679      0    1120   1750     1622    1197       768     886       627
Detroit        1922        1975    986  1148     611         527   1120       0    631      508     595       439     234      1745
Boston         2489        2605   1559  1260    1138        1121   1750     631      0      186     945      1044     864      2375
New.York       2415        2470   1389  1092     963        1026   1622     508    186        0     759       890     738      2243
Atlanta        2178        1942    730   596     331         907   1197     595    945      759       0       484     606      1743
St..Louis      1705        1589    550  1070     257         449    768     439   1044      890     484         0     258      1368
Chicago        1716        1741    802  1199     492         334    886     234    864      738     606       258       0      1511
Las.Vegas       867         236   1053  2171    1413        1297    627    1745   2375     2243    1743      1368    1511         0
  • GOAL: Where should these cities be placed on a 2D grid to preserve these distances?
  • Note: here, \(k=p\): the dimension of our ordination space equals dimension of “location” data (e.g. lat/long).

NMDS on airport data

Using metaMDS() to do the MDS, note that it knows nothing about location, only distances!

library(vegan)
air_mds <- metaMDS(as.matrix(airport_distances), k = 2, trace = 0)

After NMDS, plotting the proposed coordinates:

The 2D coordinates approximate how these cities are arranged on a map!

Olive oil data

  • A more typical MDS application involves finding a \(k\) (potentially much) lower than the dimensions used to produce \(D\).
  • The olive oil data contains measures (in hundredths of percents) of 8 fatty acids from various regions of Italy.
  • Goals:
    1. identify any differences in acidic composition between oils
    2. determine whether regional differences exists in oil compositions
oils <- read.csv('Data/oliveoil.csv')
head(oils)
  Oil_ID   Region         Area Palmitic Palmitoleic Strearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic
1      1 Southern North-Apulia     1075          75      226  7823      672        36         60         29
2      2 Southern North-Apulia     1088          73      224  7709      781        31         61         29
3      3 Southern North-Apulia      911          54      246  8113      549        31         63         29
4      4 Southern North-Apulia      966          57      240  7952      619        50         78         35
5      5 Southern North-Apulia     1051          67      259  7771      672        50         80         46
6      6 Southern North-Apulia      911          49      268  7924      678        51         70         44
acids_only <- oils %>% select(Palmitic:Eicosenoic)

NMDS with \(k=2\) and \(k=3\)

  • We have \(p = 8\)
  • Tasks:
    • Since we have quantitative data, use Euclidean distance
    • Make sure to scale! (don’t want Oleic and Linoleic acid driving the ordination)
    • Choose the dimension of the ordination space
  • The metaMDS() function in the vegan package will be our NMDS workhorse function.
scaled_acids <- scale(acids_only)

library(vegan)
#autotransform = FALSE indicates we've already scaled the data
#trace = 0 so we don't get a bunch of printed stuff
mds_k2 <- metaMDS(scaled_acids, distance = 'euclidean', k=2, autotransform = FALSE, trace = 0)
mds_k3 <- metaMDS(scaled_acids, distance = 'euclidean', k=3, autotransform = FALSE, trace = 0)

Investigating stress of solutions

mds_k2

Call:
metaMDS(comm = scaled_acids, distance = "euclidean", k = 2, autotransform = FALSE,      trace = 0) 

global Multidimensional Scaling using monoMDS

Data:     scaled_acids 
Distance: euclidean 

Dimensions: 2 
Stress:     0.1253932 
Stress type 1, weak ties
Best solution was not repeated after 20 tries
The best solution was from try 0 (metric scaling or null solution)
Scaling: centring, PC rotation 
Species: scores missing
stressplot(mds_k2)
# Called a "Shepard plot"

mds_k3

Call:
metaMDS(comm = scaled_acids, distance = "euclidean", k = 3, autotransform = FALSE,      trace = 0) 

global Multidimensional Scaling using monoMDS

Data:     scaled_acids 
Distance: euclidean 

Dimensions: 3 
Stress:     0.07713075 
Stress type 1, weak ties
Best solution was repeated 6 times in 20 tries
The best solution was from try 8 (random start)
Scaling: centring, PC rotation 
Species: scores missing
stressplot(mds_k3)
# Called a "Shepard plot"

Interpreting output

  • Stress when \(k=2\) is 0.125 (“good/adequate”), stress when \(k=3\) is 0.077 (“very good”) (but remember, higher \(k\) will always yield lower stress)
  • Stress type 1 refers to the type of stress being used; see ?monoMDS (the default engine used by metaMDS())
  • Both Shepard plots show high correlation between full-dimension and ordination space

Exploring the structure

Let’s start with the 2D ordination:

head(mds_k2$points)
        MDS1      MDS2
1 -1.2108207 -1.503616
2 -0.9585032 -1.098613
3 -2.4902901 -2.273519
4 -1.4279609 -2.987704
5 -0.4839538 -3.261394
6 -1.3429766 -3.617517

These are the coordinates of 2D space that best preserve the ordering of the distances in 8D space!

So what do they look like?

plot(mds_k2)

okaaaay….

Regressing variables onto ordination

  • To understand which variables are important for a 2D ordination, and how they impact it, we do the following.
  • Let \(X_j\) be one of the original variables (e.g., Palmitic acid); \(j= 1, 2, ...,p\).
  • Let \(MDS_1\) and \(MDS_2\) be the ordination coordinates.
  • Fit:

\[X_j = \beta_0 + \beta_1 MDS_1 + \beta_2 MDS_2 \]

  • Find the fitted plane \(\hat X_j\) from this model
  • Find the direction in ordination space where this plane has rate of greatest change

Example: Palmitic acid

#Add ordination coordinates to data set:
oils_with_MDS2 <- bind_cols(oils, mds_k2$points)
# Regress Palmitic acid on the ordination coordinates: 
palmitic_fit <- lm(Palmitic ~ MDS1 + MDS2, data = oils_with_MDS2)
oils_with_MDS2$palmitic_Xhat <- fitted(palmitic_fit)
  • Rotate this to get a sense of the direction in which the fitted values have their steepest ascent.
  • Then rotate to look “down” on the MDS space, with MDS1 on horizontal and MDS2 on vertical, making sure the positive directions are up and right!
  • Where is the direction of steepest ascent on the MDS space?

Visual representation

The envfit() function from vegan takes care of this regression, and has a plot() method that will add the directions of steepest ascent for each variable:

k2_fit <- envfit(ord=mds_k2, env = acids_only)
plot(mds_k2)
plot(k2_fit)

The length of each arrow is scaled by \(\sqrt{R^2}\) between that variable and the ordination coordinates, such that longer arrows indicate greater variable importance.

Identify the acid

The three plots below are the fitted values of regressing oleic, strearic, and linolenic acids on the ordination coordinates, plotted vs the coordinates. Can you identify which is which?

Interpretation

So what do we learn from this?

  • Strearic acid is the least important for distinguishing between olive oils
  • Oils in the top-right portion of the graph have:
    • High levels of Linoleic, Palmitoleic, and Palmitic acids
    • Low levels of Oleic acid
  • Linolenic, Eicosanoic, and Eicosenoic acids are:
    • high for oils on the bottom-right
    • low for oils on the top-left
  • Linoleic, palmitoleic, and palmitic acid are strongly positively correlated with each other, and negatively correlated with oleic acid.
  • Linoleic, palmitoleic, and palmitic, an oleic acids have low correlation with eicosanoic, eicosenoic, and linolenic acids (due to near-orthogonality of those arrows).

Incorporating regional information

  • So are there regional patterns in oil composition we can identify?
  • It would be nice to color-code the points by region
  • At this point, let’s put all the ingredients we need into a couple data frames so we can use ggplot for more plotting flexibility, including using geom_text_repel from the ggrepel package for better arrow labeling.
#Adjusts arrow lengths to fill graph
scale_arrow <- 4

# Data frame for arrows, from envfit():
arrowdf <- data.frame(k2_fit$vectors$arrows) %>% 
  rownames_to_column('oil') %>% 
  mutate(r = k2_fit$vectors$r) %>% 
  mutate(across(NMDS1:NMDS2, \(x) x*r*4))

head(arrowdf, 3)
1
This row elongates each arrow according to how correlated it is with the coordinates, as well as an arrow scaler to “fill up” the plot.
          oil      NMDS1      NMDS2         r
1    Palmitic  3.0846194  0.5941522 0.7853301
2 Palmitoleic  2.8211315  1.7514107 0.8301439
3    Strearic -0.1412849 -0.3787475 0.1010603
# Plot production
library(ggrepel)

ggplot(data = oils_with_MDS2) +
  geom_point(aes(x = MDS1, y = MDS2, color = Area)) + 
  geom_segment(aes(x = 0, y = 0, 
                   xend = NMDS1, yend = NMDS2), 
               arrow = arrow(length = unit(.2, 'cm')),
               data = arrowdf) + 
  geom_text_repel(aes(x = NMDS1, y = NMDS2, label = oil), 
            data = arrowdf) + 
  theme_classic(base_size = 14) 

Incorporating regional information

Faceting may help; since there are 9 regions the colors get difficult to distinguish:

ggplot(data = oils_with_MDS2) +
  geom_point(aes(x = MDS1, y = MDS2, color = Area)) + 
  geom_segment(aes(x = 0, y = 0, 
                   xend = NMDS1, yend = NMDS2), 
               arrow = arrow(length = unit(.2, 'cm')),
               data = arrowdf) + 
  geom_text_repel(aes(x = NMDS1, y = NMDS2, label = oil), 
            data = arrowdf) + 
  theme_classic(base_size = 14) +
  facet_wrap(~Area) + 
  guides(color='none') #Color legend no longer needed with faceting

Filtering to important arrows

We could also consider focusing on “most important” oils by filtering the arrowdf and using that in the plot:

important_arrows <- arrowdf %>% filter(r > 0.7)


ggplot(data = oils_with_MDS2) +
  geom_point(aes(x = MDS1, y = MDS2, color = Area)) + 
  geom_segment(aes(x = 0, y = 0, 
                   xend = NMDS1, yend = NMDS2), 
               arrow = arrow(length = unit(.2, 'cm')),
               data = important_arrows) + 
  geom_text_repel(aes(x = NMDS1, y = NMDS2, label = oil), 
            data = important_arrows) + 
  theme_classic(base_size = 14) +
  facet_wrap(~Area) + 
  guides(color='none') #Color legend no longer needed with faceting
  • Which regions produce oils:
    • high in linoleic, palmitic, and palmitoleic acid?
    • low in linolenic acid?
    • high in oleic acid?
    • middling amounts of all acids?

What about the 3D ordination?

  • While the 2D ordination was quite insightful for this problem, sometimes considering a 3rd dimension can yield information hidden in only two dimensions.
head(mds_k3$points)
        MDS1     MDS2       MDS3
1 -1.4825182 1.479909 -0.1723490
2 -1.2114245 1.074259 -0.1379729
3 -2.7571058 2.090409 -0.4065870
4 -1.6505858 3.044282  0.2517048
5 -0.7969349 3.343739 -0.2793707
6 -1.6521762 3.573486 -0.4553298
  • Plotting strategies:
    • Pairwise 2D combinations of the 3 ordinations (simple extension of what we’ve already seen)
    • Interactive 3D scatterplot

Setting up the data structures

  • First step: add new 3D ordination coordinates to the oil data set:
oil_with_MDS3 <- bind_cols(oils, mds_k3$points)
  • Second step: regress the acids on all 3 MDS coordinates (important: choices=1:3)
fit_k3 <- envfit(ord = mds_k3, env = acids_only, choices = 1:3) 
  • Third step: create arrows data frame
arrow_scaler <- 4 #play around to fill up the plot

arrows <- fit_k3$vectors$arrows %>% 
  data.frame() %>%
  rownames_to_column('oil') %>% 
  mutate(r = fit_k3$vectors$r) %>% 
  mutate(across(NMDS1:NMDS3, \(x) x*r*arrow_scaler))

3D visualization via plotly

With the help of Claude AI (see chat transcript), I produced this code to create the 3D version of the ordination plot:

# Specify 3D scatterplot
fig <- plot_ly() %>% 
  add_trace(data = oil_with_MDS3, 
            x = ~MDS1, 
            y = ~MDS2, 
            z = ~MDS3,
            color = ~Area,
        type = "scatter3d",
        mode = "markers",
        marker = list(size = 2)
) 


# Add arrows starting from origin
for(i in 1:nrow(arrows)) {
  fig <- fig %>%
    add_trace(
      x = c(0, arrows$NMDS1[i]),
      y = c(0, arrows$NMDS2[i]),
      z = c(0, arrows$NMDS3[i]),
      type = "scatter3d",
      mode = "lines",
      line = list(color = "red", width = 4),
      showlegend = FALSE,
      name = arrows$oil[i]
    )
}


# Add labels at the end of each arrow
fig <- fig %>%
  add_trace(
    data = arrows,
    x = ~NMDS1, y = ~NMDS2, z = ~NMDS3,
    type = "scatter3d",
    mode = "text",
    text = ~oil,
    textposition = "top center",
    textfont = list(size = 14, color = "red"),
    showlegend = FALSE,
    name = "Labels"
  )

fig

It appears one of the things the 3rd dimension does is to separate oils with high/low Strearic acid, previously an unimportant acid in 2 dimensions.

Moving to JMP

  • This is where interactivity between plots can be very useful, which is much easier in a software like JMP.
  • In JMP, we can create a 3D scatterplot of the 3D ordination, combined with a parallel coordinates plot.
    • Note: Use parallel coordinates plots with caution! As a stand-alone plot, the overplotting makes them useless. Use with interactivity and/or color-coding to emphasize specific points.
  • In this animation, I demonstrate how I used this combination of plots and interactivity to determine that the 3rd dimension is, for at least some oils, representing variation in strearic acid levels.

A link to the JMP script with plots… can you find out anything else about what that third dimension might represent?

Plot creation in JMP