This tutorial will demonstrate the use of various MDS algorithms for:
geographical distances
perceptual space.
The logic closely follows this paper: https://cran.r-project.org/web/packages/smacof/vignettes/smacof.pdf and these vignettes: https://rdrr.io/cran/smacof/f/vignettes/mdsnutshell.Rmd and https://rdrr.io/cran/smacof/f/vignettes/unfoldingnutshell.Rmd.
Let’s pick a few cities and place them in a 2D space so that the plotted distances stay as close as possible to the original ones.
Take Paris, London, Yuzhno-Sakhalinsk, and Singapore, and calculate the pairwise distances between them both as flight distances (in km) and as flight times (in minutes).
For converting time to minutes, I used https://www.calculateme.com/time/hours-minutes-seconds/to-minutes/.
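You can also do this conversion directly in R; a trivial sketch (the helper name is just illustrative):
hours_to_minutes <- function(hours, minutes = 0) hours * 60 + minutes
hours_to_minutes(2, 16) # 136 minutes, the kind of value entered into the flight-time table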
I used a GoogleSheet/Excel file, uploaded it to R, and transformed the table into a lower-triangular matrix of distances (zeros on the diagonal):
distdata <- read.csv(choose.files()) # choose.files() opens a file picker on Windows; use file.choose() on other platforms
geo <- distdata[1:4, 2:5]
row.names(geo) <- distdata[1:4, 1]
geo
## london paris yuzhno.sakhalinsk singapore
## london 0
## paris 343 0
## yuzhno-sakhalinsk 8527 8763 0
## singapore 10888 10727 6302 0
geo2 <- dist(geo, diag = T)
geo2
## london paris yuzhno-sakhalinsk singapore
## london 0.000
## paris 686.000 0.000
## yuzhno-sakhalinsk 17054.000 16956.888 0.000
## singapore 21776.000 21272.779 8095.004 0.000
time <- distdata[8:11, 2:5]
row.names(time) <- distdata[8:11, 1]
time
## london paris yuzhno.sakhalinsk singapore
## london 0
## paris 136 0
## yuzhno-sakhalinsk 1310 1035 0
## singapore 790 777 690 0
time2 <- dist(time, diag = T)
time2
## london paris yuzhno-sakhalinsk singapore
## london 0.000
## paris 272.000 0.000
## yuzhno-sakhalinsk 2620.000 2213.369 0.000
## singapore 1580.000 1436.276 1041.194 0.000
# install.packages("smacof")
library(smacof)
mds_geo <- mds(geo2)
summary(mds_geo) # new coordinates + % distortion, aka Stress Per Point, SPP
##
## Configurations:
## D1 D2
## london 0.5953 0.0058
## paris 0.5757 0.0462
## yuzhno-sakhalinsk -0.4346 -0.2256
## singapore -0.7363 0.1736
##
##
## Stress per point (in %):
## london paris yuzhno-sakhalinsk singapore
## 29.62 29.30 7.45 33.63
mds_geo
##
## Call:
## mds(delta = geo2)
##
## Model: Symmetric SMACOF
## Number of objects: 4
## Stress-1 value: 0.003
## Number of iterations: 5
plot(mds_geo)
mds_time <- mds(time2)
summary(mds_time)
##
## Configurations:
## D1 D2
## london -0.6453 -0.0016
## paris -0.4882 -0.0849
## yuzhno-sakhalinsk 0.8441 -0.0926
## singapore 0.2894 0.1791
##
##
## Stress per point (in %):
## london paris yuzhno-sakhalinsk singapore
## 31.19 21.91 34.24 12.67
mds_time
##
## Call:
## mds(delta = time2)
##
## Model: Symmetric SMACOF
## Number of objects: 4
## Stress-1 value: 0.028
## Number of iterations: 17
plot(mds_time)
Now let’s plot the ‘geo’ version again, this time treating the distances as ordinal. Pay attention to the goodness-of-fit (stress) metric.
mds_geo2 <- mds(geo2, type = 'ordinal')
summary(mds_geo2) # new coordinates + % distortion
##
## Configurations:
## D1 D2
## london 0.5946 -0.0012
## paris 0.5761 0.0532
## yuzhno-sakhalinsk -0.4344 -0.2256
## singapore -0.7362 0.1735
##
##
## Stress per point (in %):
## london paris yuzhno-sakhalinsk singapore
## NaN NaN NaN NaN
mds_geo2
##
## Call:
## mds(delta = geo2, type = "ordinal")
##
## Model: Symmetric SMACOF
## Number of objects: 4
## Stress-1 value: 0
## Number of iterations: 2
plot(mds_geo2)
The stress value for the ordinal transformation is typically at least as good as (i.e., no higher than) the metric one, because the task is only to reproduce the rank order of the distances (here the stress is essentially 0, which is why the per-point stress percentages above show NaN). However, one pays by losing the information about the exact distances.
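To check this quickly, here is a minimal sketch that refits the same geographic dissimilarities with different transformation types and compares the Stress-1 values (more flexible transformations typically yield lower stress):
sapply(c("ratio", "interval", "ordinal"),
       function(tp) mds(geo2, type = tp)$stress)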
Finally, let’s draw the MDS map with the usual ggplot tools:
library(ggplot2)
conf_geo <- as.data.frame(mds_geo$conf)
p <-
ggplot(conf_geo, aes(
x = D1,
y = D2,
label = rownames(conf_geo)
))
p + geom_point(size = 0.9) +
geom_text(size = 3.5, vjust = -0.8) +
coord_fixed(xlim = c(-1, 1), ylim = c(-0.3, 0.3)) +
ggtitle("Geographic Configuration Between Cities To Visit") +
theme_classic()
Now let’s put the two configuration plots next to each other to draw conclusions about socioeconomic (flight-time) vs. geographic (flight-distance) configurations:
library(patchwork)
library(ggrepel)
a <- p + geom_point(size = 0.9) +
geom_text_repel(size = 3.5, vjust = -0.8) +
coord_fixed(xlim = c(-1, 1), ylim = c(-0.3, 0.3)) +
# "It is important to fix the aspect ratio to 1 such that the distances in the plot are Euclidean."
ggtitle("Geographic Configuration Between Cities To Visit") +
xlab("") +
ylab("") +
theme_bw()
conf_time <- as.data.frame(mds_time$conf)
q <-
ggplot(conf_time, aes(
x = D1,
y = D2,
label = rownames(conf_time)
))
b <- q + geom_point(size = 0.9) +
geom_text(size = 3.5, vjust = -0.8) +
coord_fixed(xlim = c(-1, 1), ylim = c(-0.3, 0.3)) +
ggtitle("Temporal Configuration Between Cities To Visit") +
xlab("") +
ylab("") +
theme_bw()
a / b
Important: to flip and reverse the plot in ggplot (for a more effective comparison), see http://www.sthda.com/english/wiki/ggplot2-rotate-a-graph-reverse-and-flip-the-plot
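For instance, a minimal sketch (reusing the q and a objects from above) that mirrors the temporal configuration horizontally with scale_x_reverse() so that its orientation roughly matches the geographic panel:
b_flipped <- q + geom_point(size = 0.9) +
  geom_text_repel(size = 3.5, vjust = -0.8) +
  scale_x_reverse() + # mirror the x-axis
  coord_fixed() + # keep the aspect ratio fixed at 1
  ggtitle("Temporal Configuration (x-axis reversed)") +
  xlab("") +
  ylab("") +
  theme_bw()
a / b_flipped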
Next up:
So far, we have dealt with examples of distance data, using the lower triangle of the distance matrix. This matrix contains distances between the objects; therefore, it always has the same number (n) of rows and columns. This kind of matrix is also called a square matrix.
The following example, by contrast, uses a so-called rectangular matrix: the type of data where n subjects evaluate m objects. The subjects are called ‘ideal points’, and the objects are ‘object points.’
Square matrix -> mds()
Rectangular matrix -> unfolding()
Unfolding is an algorithm close to the MDS family. It produces a configuration plot that shows both row and column points, because it is a ‘dual scaling’ algorithm that combines two plots in one. The distance units on the X and Y axes are the same; therefore, it is possible to compare distances not only between rows or between columns but also from row points to column points.
The input data for unfolding must be non-negative dissimilarities.
There are two common ways to obtain such data:
preferences, e.g., people ranking 15 breakfast items where 1 is the most desirable and 15 the least desirable item (the person is the “ideal point” and the rank numbers serve directly as dissimilarities);
ratings, e.g., n respondents evaluate 26 cars on 10 quality scales (typically 1-5 or 1-6, where 1 is “not at all” and 5/6 is “absolutely so”). The average values for each car on the 10 scales are then computed and converted into dissimilarities (see the sketch after this list). Subjects: cars, objects: quality scales.
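For the ratings case, here is a minimal sketch of the conversion step with made-up data (the car names, scales, and values are hypothetical; smacof::sim2diss(), already loaded above, is shown in more detail further below):
ratings <- matrix(c(5, 4,
                    2, 5,
                    1, 2),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("car_A", "car_B", "car_C"), c("comfort", "speed")))
diss <- sim2diss(ratings, method = "reverse") # reverse: d = min + max - s, here 6 - rating
diss # a high rating becomes a small dissimilarity (the car ends up close to that scale)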
Breakfast preferences example
head(breakfast) # 'breakfast' ships with smacof: 42 respondents rank 15 breakfast items (1 = most preferred)
## toast butoast engmuff jdonut cintoast bluemuff hrolls toastmarm butoastj
## 1 13 12 7 3 5 4 8 11 10
## 2 15 11 6 3 10 5 14 8 9
## 3 15 10 12 14 3 2 9 8 7
## 4 6 14 11 3 7 8 12 10 9
## 5 15 9 6 14 13 2 12 8 7
## 6 9 11 14 4 7 6 15 10 8
## toastmarg cinbun danpastry gdonut cofcake cornmuff
## 1 15 2 1 6 9 14
## 2 12 7 1 4 2 13
## 3 11 1 6 4 5 13
## 4 15 4 1 2 5 13
## 5 10 11 1 4 3 5
## 6 12 5 2 3 1 13
un_breakfast <- unfolding(breakfast)
un_breakfast
##
## Call: unfolding(delta = breakfast)
##
## Model: Rectangular smacof
## Number of subjects: 42
## Number of objects: 15
## Transformation: none
## Conditionality: matrix
##
## Stress-1 value: 0.308625
## Penalized Stress: 3.525172
## Number of iterations: 50
plot(un_breakfast, main = "Configuration Breakfast Data")
Interpretation from the vignette:
Breakfast items close to each other are similarly preferred;
individuals close to each other have similar breakfast preferences;
the closer a breakfast item is to an individual, the higher that individual’s preference for this item.
Of course, the degree to which this interpretation reflects the actual preferences in the data depends on the goodness-of-fit of the solution.
Sometimes it is also possible to interpret the dimensions, but, just as in MDS, this is not as crucial as the distances.
# library(ggplot2)
# library(ggrepel)
conf_items <- as.data.frame(un_breakfast$conf.col)
conf_persons <- as.data.frame(un_breakfast$conf.row)
br <- ggplot(conf_persons, aes(x = D1, y = D2))
br + geom_point(size = 1, colour = "red", alpha = 0.5) +
coord_fixed(xlim = c(-1.5, 1.5), ylim = c(-1.5, 1.5)) +
xlab("") +
ylab("") +
geom_point(aes(x = D1, y = D2), conf_items, colour = "cadetblue") +
geom_text_repel(aes(x = D1, y = D2, label = rownames(conf_items)),
conf_items, colour = "cadetblue", vjust = -0.8) +
ggtitle("Unfolding Configuration Breakfast Items") +
annotate("text", x = 0, y = -1.5, label = "soft bread", color = "navy") +
annotate("text", x = 0, y = 1.5, label = "crunchy bread", color = "navy") +
annotate("text", x = -1.5, y = 0, label = "sweet", angle = 90, color = "navy") +
annotate("text", x = 1.5, y = 0, label = "savoury", angle = 270, color = "navy") +
theme_bw()
“Users should not judge the goodness-of-fit by solely relying on this stress value but rather use several diagnostic tools in combination (see Mair, Borg, and Rusch 2016 for details).” https://www.researchgate.net/publication/309617943_Goodness-of-Fit_Assessment_in_Multidimensional_Scaling_and_Unfolding
Options for model comparison (both MDS and unfolding):
try the metric / ordinal / mspline transformation types (with metric distances and ratings, the ordinal type may improve the quality of the solution);
examine stress per point (SPP) and re-run without the most “noisy” points (alternatively, use the jackknife technique from the package).
Example:
plot(un_breakfast, plot.type = "stressplot")
Re-run the analysis without the ‘noisiest’ points (they are in the middle of the plot): 39, 13, 21. Optionally, call summary(un_breakfast) for the full numbers.
un_breakfast2 <- unfolding(breakfast[c(-39, -13, -21), ])
un_breakfast2$stress
## [1] 0.2949098
Compared to the original value, the fit is a little better, but not radically so:
un_breakfast$stress
## [1] 0.3086254
plot(un_breakfast, plot.type = "Shepard")
un_breakfast_sp <- unfolding(breakfast, type = "mspline")
un_breakfast_sp$stress # worse fit here
## [1] 0.3436307
plot(un_breakfast_sp, plot.type = "Shepard")
un_breakfast_o <- unfolding(breakfast, type = "ordinal")
un_breakfast_o$stress # worse fit here
## [1] 0.3268706
plot(un_breakfast_o, plot.type = "Shepard")
Another possibility is the permutation test (available in smacof for both MDS and unfolding). H0: the stress/configuration is obtained from a random permutation of the dissimilarities.
permtest(un_breakfast, breakfast, nrep = 100, verbose = F)
##
## Call: permtest.smacofR(object = un_breakfast, data = breakfast, nrep = 100,
## verbose = F)
##
## SMACOF Permutation Test
## Number of objects: 42
## Number of replications (permutations): 100
##
## Observed stress value: 0.309
## p-value: <0.001
We reject H0 here: the observed configuration is not random but contains some structure.
For the breakfast data, I would either collect more data or exclude the noisiest data points (or offer those respondents another type of breakfast). That said, the plot already reveals some structure and even interpretable dimensions: D1 separates sugary/fattening from savoury items, and D2 separates soft from crunchy bread.
Since the semantic differential (SD) also features objects and their ratings on scales, unfolding can be applied here as well.
What you need to do first, though, is transform the values into dissimilarities.
Example:
sd <- matrix(c(-1,2,3,
-3,-3,-3,
3,-2,1), byrow = T, nrow = 3)
row.names(sd) <- c("real Self", "anti-Self", "rational Self")
colnames(sd) <- c("evaluation", "strength", "activity")
sd
## evaluation strength activity
## real Self -1 2 3
## anti-Self -3 -3 -3
## rational Self 3 -2 1
sd1 <- sd + 3 # make them non-negative
sd1
## evaluation strength activity
## real Self 2 5 6
## anti-Self 0 0 0
## rational Self 6 1 4
Use the smacof::sim2diss() function to convert any similarities to dissimilarities (several methods of transformation are available):
sd2 <- sim2diss(sd1, method = "reverse")
sd2
## evaluation strength activity
## real Self 4 1 0
## anti-Self 6 6 6
## rational Self 0 5 2
sd_unf <- unfolding(sd2)
plot(sd_unf)
# library(ggplot2)
# library(ggrepel)
conf_items <- as.data.frame(sd_unf$conf.col)
conf_perc <- as.data.frame(sd_unf$conf.row)
p <- ggplot(conf_perc, aes(x = D1, y = D2))
p + geom_point(size = 1, colour = "red", alpha = 0.5) +
coord_fixed(xlim = c(-1.5, 1.5), ylim = c(-1.5, 1.5)) +
xlab("") +
ylab("") +
geom_point(aes(x = D1, y = D2), conf_items, colour = "cadetblue") +
geom_text_repel(aes(x = D1, y = D2, label = rownames(conf_perc)),
conf_perc, colour = "red", vjust = -0.5) +
geom_text_repel(aes(x = D1, y = D2, label = rownames(conf_items)),
conf_items, colour = "cadetblue", vjust = 0.8) +
ggtitle("Unfolding for Semantic Differential") +
theme_bw()
p1 <- p + geom_point(size = 1, colour = "red", alpha = 0.5) +
coord_fixed(xlim = c(-1.5, 1.5), ylim = c(-1.5, 1.5)) +
xlab("") +
ylab("") +
geom_point(aes(x = D1, y = D2), conf_items, colour = "cadetblue") +
geom_text_repel(aes(x = D1, y = D2, label = rownames(conf_perc)),
conf_perc, colour = "red", vjust = -0.5) +
#geom_text_repel(aes(x = D1, y = D2, label = rownames(conf_items)),
# conf_items, colour = "cadetblue", vjust = 0.8) +
ggtitle("Unfolding for Semantic Differential") +
theme_bw()
The configuration plot above shows both the original scales and the objects.
In contrast, the MDS configuration plot only shows the objects, because it is based on a matrix of distances between the objects. Compare:
sd_dist <- dist(sd, method = "euclidean", diag = T)
sd_dist
## real Self anti-Self rational Self
## real Self 0.000000
## anti-Self 8.062258 0.000000
## rational Self 6.000000 7.280110 0.000000
sd_mds <- mds(sd_dist)
plot(sd_mds)
conf_sd <- as.data.frame(sd_mds$conf)
pm <- ggplot(conf_sd, aes(
x = D1,
y = D2,
label = rownames(conf_sd)
))
p2 <- pm + geom_point(size = 0.9) +
geom_text_repel(size = 3.5, vjust = -0.8) +
coord_fixed(xlim = c(-1.5, 1.5), ylim = c(-1.5, 1.5)) +
ggtitle("MDS for Semantic Differential") +
xlab("") +
ylab("") +
theme_bw()
library(patchwork)
p1 | p2
The Schwartz circle (from https://maksimrudnev.com/)
I recommend reading the paper “Measuring the 4 Higher-Order Values in Schwartz’s Theory: A Validation of a 17-Item Inventory” https://psyarxiv.com/xmh5: MDS is one of the 8 steps of analysis there. The goal is to make sure that the shorter version of the questionnaire follows the quasi-circumplex shape of the usual Schwartz circle.
Look at p. 17 to see how elegantly and briefly the MDS solution is discussed.
MDS keeps the distances between objects interpretable, especially when both rows and columns appear in the plot, so you can find and interpret clusters of points. This is not possible in symmetric correspondence analysis biplots.
In MDS, distances come first and dimensions are only occasionally meaningful. Sometimes they do make sense and can be summarized beautifully (e.g., the Schwartz higher-order values). By contrast, dimensions are essential for understanding correspondence analysis biplots and some PCA biplots.
Labelled or not, the dimensions in MDS are all equally important. As a result, configuration plots can be flipped or rotated in whichever direction looks better to you. By contrast, the 1st dimension is the most important one in both PCA and CoA.
Finally, MDS takes distance matrices, or evaluations such as ratings and ranks, as input. However, CoA can deal with categorical variables of equal importance (vs. “ideal points” and “object points” in MDS).
MDS can also handle correlation matrices, like PCA. But their purposes are different. MDS aims to plot the perceptual, low-dimensional space and distances between objects, while PCA captures maximum variance in the lowest suitable number of (not necessarily two) components.
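As a self-contained illustration of the difference, here is a minimal sketch using the built-in mtcars data (the variable selection is arbitrary); the university example below does the same with an external data file:
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
car_cor <- cor(mtcars[, vars])
car_diss <- sim2diss(car_cor, method = "corr", to.dist = TRUE) # "corr": d = sqrt(1 - r)
car_mds <- mds(car_diss)
plot(car_mds) # MDS: inter-variable distances in a low-dimensional space
car_pca <- prcomp(mtcars[, vars], scale. = TRUE)
biplot(car_pca) # PCA: components that capture maximum variance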
Compare the biplots from https://rpubs.com/shirokaner/pca and the following solutions:
data_raw <- read.csv("THE2021.csv")
library(tidyverse)
library(corrplot)
# data_raw[,3:11] %>%
# cor() %>%
# corrplot()
data_raw <- data_raw %>%
relocate(stats_pc_intl_students, .after = scores_international_outlook)
data_cor <- data_raw[ , 3:11] %>% cor()
data_uni <- sim2diss(data_cor, method = "corr", to.dist = T) # "corr" converts correlations to dissimilarities: d = sqrt(1 - r)
uni <- mds(data_uni)
uni # 0.214
##
## Call:
## mds(delta = data_uni)
##
## Model: Symmetric SMACOF
## Number of objects: 9
## Stress-1 value: 0.214
## Number of iterations: 26
#plot(uni)
uni2 <- mds(data_uni, type = "mspline")
uni2
##
## Call:
## mds(delta = data_uni, type = "mspline")
##
## Model: Symmetric SMACOF
## Number of objects: 9
## Stress-1 value: 0.107
## Number of iterations: 18
plot(uni2)
#head(data_raw)
row.names(data_raw) <- data_raw$name
data_uni2 <- sim2diss(data_raw[ , 3:11], method = 100) # a numeric 'method' subtracts the values from that number, here d = 100 - s
#head(data_uni2)
min(data_uni2[ , 7]) # some raw values exceed 100, so subtracting them from 100 produced negatives
## [1] -222002
data_uni2[ ,7] <- data_uni2[ ,7] + 222002 # shift back to non-negative
min(data_uni2[ ,8])
## [1] -9.1
data_uni2[ ,8] <- data_uni2[ ,8] + 9.1 # shift back to non-negative
#summary(data_raw)
#summary(data_uni2)
uni3 <- unfolding(data_uni2)
uni3
##
## Call: unfolding(delta = data_uni2)
##
## Model: Rectangular smacof
## Number of subjects: 1448
## Number of objects: 9
## Transformation: none
## Conditionality: matrix
##
## Stress-1 value: 0.096464
## Penalized Stress: 85.22537
## Number of iterations: 5
#plot(uni3)
# data_uni3 <- data_uni2[ , -7]
# uni4 <- unfolding(data_uni3)
# uni4
# plot(uni4)
To explore/validate the basic values circle on the individual level, see “Example II: Unfolding on personal values”, pp. 784-785, in Mair, Borg, & Rusch (2016). Goodness-of-Fit Assessment in Multidimensional Scaling and Unfolding. Multivariate Behavioral Research, 51(6), 772-789. doi: 10.1080/00273171.2016.1235966. https://www.researchgate.net/profile/Patrick-Mair-2/publication/309617943_Goodness-of-Fit_Assessment_in_Multidimensional_Scaling_and_Unfolding/links/59f4c6f5aca272607e2a8706/Goodness-of-Fit-Assessment-in-Multidimensional-Scaling-and-Unfolding.pdf
(Optionally: compare to the vector model of unfolding - “individuals (rows) represented as vectors in the biplot”): https://cran.r-project.org/web/packages/smacof/vignettes/smacof.pdf
#plot(vmu(sd2))
(Optionally) See also multidimensional preference analysis from the “pmr” package; pmr is a package for all kinds of analyses of ranking data: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-13-65
See also: Finch, Analyzing ranked data (2022) https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1589&context=pare