Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Introduction

This report aims to concisely and incisively explain Multidimensional Scaling (MDS), a class of non-linear projection methods that can be used for dimensionality reduction, and thus for visualizing high-dimensional datasets.

The report is composed of the following sections:

  1. What is MDS?
  2. How does MDS work? MDS in R
  3. Why MDS? MDS vs. PCA
  4. Conclusion
  5. References

If one prefers video lectures to written reports, one can have a look at this video …, which this author prepared based on the current report.

In a subsequent report, I will explain a specific kind of MDS, called Multidimensional Unfolding, which is able to cope with asymmetric data.

1. What is MDS?

Let’s start with an intuitive explanation using a familiar dataset.

Figure 1 is a geographical map of 11 US cities, highlighted by blue markers.

Figure 1 - 11 US cities

It is easy to measure the [scaled] inter-city distances: we only need a ruler and the physical map, or the coordinates of the markers, to calculate the Euclidean distances. The measured distances can then be scaled up to kilometers based on the scale of the original map.
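As a minimal numeric sketch of this measurement step (all coordinates and the scale factor below are made-up values for illustration, not taken from the real map):

boston = c(x = 9.2, y = 4.1)    # hypothetical coordinates on the paper map, in cm
new_york = c(x = 8.6, y = 3.5)

map_dist = sqrt(sum((boston - new_york)^2))   # Euclidean distance on the map, in cm
map_scale = 40                                # hypothetical scale: 1 cm = 40 km
map_dist * map_scale                          # approximate real-world distance, in km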

Now, let’s imagine the reverse process: the inter-city distances are available, and we want to draw a map from them, such that close cities are located close to each other and distant ones far apart (not surprisingly!). To do so, I will use the between-city distances of the 11 US cities highlighted in Figure 1. The dataset is available in the psych package.

Table1 - Approximate pair-wise distances of 11 US cities
ATL BOS ORD DCA DEN LAX MIA JFK SEA SFO MSY
ATL 0 934 585 542 1209 1942 605 751 2181 2139 424
BOS 934 0 853 392 1769 2601 1252 183 2492 2700 1356
ORD 585 853 0 598 918 1748 1187 720 1736 1857 830
DCA 542 392 598 0 1493 2305 922 209 2328 2442 964
DEN 1209 1769 918 1493 0 836 1723 1636 1023 951 1079
LAX 1942 2601 1748 2305 836 0 2345 2461 957 341 1679
MIA 605 1252 1187 922 1723 2345 0 1092 2733 2594 669
JFK 751 183 720 209 1636 2461 1092 0 2412 2577 1173
SEA 2181 2492 1736 2328 1023 957 2733 2412 0 681 2101
SFO 2139 2700 1857 2442 951 341 2594 2577 681 0 1925
MSY 424 1356 830 964 1079 1679 669 1173 2101 1925 0

Although the geographical coordinates of the cities in this example are easily available and could be used to generate the map, let’s assume that we do not have them, and the only data we have is the distance table. Now suppose we want to draw a map of the major cities of Spain using only their between-city distances: without familiar geography to lean on, that is far more difficult than measuring distances on a map!

This is where MDS can help. MDS is able to find a configuration of the observations (here, cities) in a low-dimensional map (here, a 2d map) such that the original pair-wise between-object distances are preserved as much as possible in the map (the low-dimensional projection). Hence, the final map (also called the configuration) is arranged so that similar observations end up close together, and dissimilar ones far apart.

The distances are, in a broader sense, proximity scores. By proximity I mean dis/similarity scores. Thus, when two observations are similar, their distance score should be low, and they should reside close to each other on the map. How the distance is measured is determined by the characteristics of the data as well as the goal of the study. Here, I use the Euclidean distance, as it is the most common distance measure; moreover, a map built on Euclidean distances is easy to read, since we are used to such maps. There are other distances, however, such as the Manhattan distance.
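Both of these distance measures are available in base R through the dist() function; here is a minimal sketch on a toy matrix:

# toy data: 3 observations measured on 2 variables
X = matrix(c(0, 0,
             3, 4,
             6, 0), ncol = 2, byrow = TRUE)

dist(X, method = "euclidean")   # straight-line distances
dist(X, method = "manhattan")   # city-block distances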

More formally, multidimensional scaling is defined as:

“Multidimensional scaling, then, refers to a class of techniques. These techniques use proximities among any kind of objects as input. A proximity is a number which indicates how similar or how different two objects are, or are perceived to be, or any measure of this kind. The chief output is a spatial representation, consisting of a geometric configuration of points, as on a map. Each point in the configuration corresponds to one of the objects. This configuration reflects the ‘hidden structure’ in the data, and often makes the data much easier to comprehend.” (Kruskal and Wish, 1978, p7)

“Multidimensional scaling (MDS) is a technique for the analysis of similarity or dissimilarity data on a set of objects. MDS attempts to model such data as distances among points in a geometric space. The main reason for doing this is that one wants a graphical display of the structure of the data, one that is much easier to understand than an array of numbers and, moreover, one that displays the essential information in the data, smoothing out noise.” (Borg and Groenen, 2005)

I have prepared Figure 2 in order to portray the big picture of the process.

Figure 2 - From raw data to MDS configuration

The Euclidean distance is only one function for obtaining proximities; correlation, for instance, is another option. How to obtain proximities is the topic of Chapter 6 of (Borg and Groenen, 2005).
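For example, a correlation matrix is a similarity, so it must be converted into dissimilarities before being fed to MDS. A minimal sketch, using one common convention among several (the smacof package also offers a sim2diss() helper for such conversions):

# correlations r range over [-1, 1]; sqrt(1 - r) turns them into dissimilarities
r = cor(mtcars[, c("mpg", "disp", "hp", "wt")])
diss = as.dist(sqrt(1 - r))
diss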

2. How does MDS work? MDS in R

Since this report is intended to be practical rather than technical, MDS is treated as a black box as much as possible. That means the concepts necessary to use MDS in practice are explained, while interested readers are referred to the sources cited in this text for an in-depth understanding of the technique.

2.1 MDS MODEL

MDS transforms the proximity values into disparities, using a transformation function such that: \[ \text{disparity}_{ij} = f(\text{proximity}_{ij}) \]

The values of \(f(\delta_{ij})\) [ \(\delta_{ij}\) is equivalent to \(p_{ij}\) in this report ] are often called “fitted distances”, and denoted by \(\hat{d}_{ij}\). They are also sometimes called “disparities”. However, it is important to remember that the fitted distances are not distances, but simply numbers which are fitted to the distances. (Kruskal and Wish, 1978, p29)

The choice of this transformation function defines the MDS model. In the next step, an MDS algorithm tries to find a configuration of the objects in the low-dimensional space, here 2d space, such that the distances on the map are as close as possible to the corresponding disparities. This closeness is the index of goodness of fit, i.e. how well the map represents the disparities. To quantify the fit, the Stress-1 index (Kruskal, 1964) is usually used as the loss function:

\[ \text{Stress-1} = \sigma_{1} = \sqrt{ \frac{\sum{[\hat{d}_{ij}-d_{ij}]^2} }{\sum{d_{ij}^2} } } =\sqrt{ \frac{\sum{[f(p_{ij})-d_{ij}]^2} }{\sum{d_{ij}^2} } } \]

As one can see, the Stress-1 formula measures the deviation of the low-dimensional representation from the ideal representation, normed by the denominator.
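For concreteness, the formula translates directly into a few lines of R. This is a sketch written against the formula above; with a fitted smacof object (see Section 2.2), the disparities are available as $dhat and the configuration distances as $confdiss, so a call would look like stress1(cities_mds$dhat, cities_mds$confdiss).

# Stress-1 of disparities dhat against configuration distances d
stress1 = function(dhat, d) {
  dhat = as.matrix(dhat)
  d = as.matrix(d)
  dhat_ij = dhat[upper.tri(dhat)]   # take each pair only once
  d_ij = d[upper.tri(d)]
  sqrt(sum((dhat_ij - d_ij)^2) / sum(d_ij^2))
}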

In practice, the computerized procedure iteratively changes the positions of the observations in the MDS configuration in order to minimize Stress-1 or a similar index. While the ideal outcome would be an exact representation of the disparities as between-object distances on the map, this usually cannot be achieved, and what we get is an approximation of the disparities by distances: \(f(p_{ij}) \approx d_{ij}\)

Such an approximation, as long as the error is not very high, is even desirable compared to an exact representation:

> Given that the proximities contain some error, such approximate representations make even better representations—more robust, reliable, replicable, and substantively meaningful ones—than those that are formally perfect, because they may smooth out noise. (Borg and Groenen, 2005, p41)

As stated before, the choice of transformation function defines the MDS model. This choice usually means choosing a class of functions rather than an exact function with explicit parameters. Depending on that choice, ratio MDS, interval MDS, or ordinal MDS can be applied to a suitable dataset. While more models are available, here I briefly review these three function types, since they are, to the best of my knowledge, the most prevalent in practice.

In interval MDS, a linear function is used as the transformation function: \[ p_{ij} \longrightarrow a + b \cdot p_{ij} \]

a and b are determined by the computerized procedure, so there is no need to worry about them. If \(a=0\), then we have ratio MDS:

\[ p_{ij} \longrightarrow b \cdot p_{ij} \]

In ratio MDS, however, the proximities are assumed to have a fixed origin, and no such arbitrary additive constant is admissible. (Borg and Groenen, 2005, p34)

These two types of transformation are called metric scaling.

where metric refers to the type of transformation of the dissimilarities and not the space in which a configuration of points is sought. (Cox and Cox, 2001, p6)

In contrast, ordinal MDS is a sort of nonmetric scaling, in which the transformation function only preserves the rank order of the proximities. Thus the non-metric function must satisfy the monotonicity constraint (Cox and Cox, 2001, p7): \[ p_{ij} < p_{kl} \Rightarrow d_{ij} \leq d_{kl} \]

So in a non-metric MDS configuration, the absolute values of the distances are not meaningful, since the algorithm tries to represent only their rank order, i.e. the relative distances.

It is important to re-emphasize that these ratio, interval, and ordinal types of MDS should not be confused with the measurement levels of the same names. These types describe how proximity scores are transformed into disparities. However, when the proximity data itself is ordinal, e.g. a customer’s rank-order preferences over commodities, then the rational choice is to apply the ordinal transformation to the data.
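To make the distinction concrete, the three models can be fitted side by side simply by switching the type argument of smacof’s mds(); a sketch, using the cities distance matrix that is loaded in the next subsection:

fit_ratio    = mds(delta = cities, ndim = 2, type = "ratio")
fit_interval = mds(delta = cities, ndim = 2, type = "interval")
fit_ordinal  = mds(delta = cities, ndim = 2, type = "ordinal")

# a more flexible transformation can only lower (or match) the stress
c(ratio = fit_ratio$stress, interval = fit_interval$stress, ordinal = fit_ordinal$stress)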

2.2 MDS in R

In R, there are several functions and packages for multidimensional scaling, e.g. cmdscale() in the stats package, wcmdscale() in the vegan package, and isoMDS() in the MASS package. However, here I use the smacof package (de Leeuw and Mair, 2009). The smacof package is intuitive and easy to use, as well as comprehensive: it also has a function for multidimensional unfolding, an interesting topic which I intend to cover in the next report.

Now let’s have a look at the MDS configuration of the distance matrix of the 11 US cities presented in Table 1. The configuration is produced with the smacof package, using the mds() function with these arguments:

  + delta: the dist object or dissimilarity matrix. Here, the distances can be considered dissimilarity values.
  + ndim: the number of dimensions of the final configuration. I want a 2d output.
  + type: as explained before, the type of transformation function. I use ratio.

library(smacof)                      # for mds()
data(cities, package = "psych")      # pair-wise distances of 11 US cities
cities_mds = mds(delta = cities, ndim = 2, type = "ratio")
cities_mds
## 
## Call:
## mds(delta = cities, ndim = 2, type = "ratio")
## 
## Model: Symmetric SMACOF 
## Number of objects: 11 
## Stress-1 value: 0.002 
## Number of iterations: 3

As we can see in the results, the Stress-1 value is 0.002 (more precisely, 0.00206447), an almost perfect fit. Still, we should have expected such low stress, i.e. high goodness of fit, because we are re-constructing a 2d map in a 2d space: since there is no dimensionality reduction, there is little or no stress on the map.

We can plot the MDS configuration.

plot(cities_mds, main = "Figure3: US cities MDS configuration")

Something funny has happened! The cities on the map are not in their expected locations; the map appears both mirrored and flipped. This is the proper moment to point out that MDS only tries to preserve the inter-object distances, so there is nothing wrong with the map: operations such as rotation and reflection leave the distances intact. To get a familiar orientation, the map must be rotated 180 degrees, which can be done by multiplying the coordinates of each point by -1.
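We can verify this claim directly: reflecting the whole configuration changes none of the inter-point distances. A small sketch using the $conf slot of the fitted object:

conf = cities_mds$conf    # 11 x 2 matrix of point coordinates
conf_flipped = -conf      # multiply every coordinate by -1

# the pairwise distances are unchanged (the difference is numerically zero)
max(abs(dist(conf) - dist(conf_flipped)))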

The second point is about the dimensions. Here the dimensions can be labeled as geographical directions such as west/east and south/north. In general, however, the dimensions in MDS do not have any inherent meaning, in contrast to the components in Principal Component Analysis (PCA). One may interpret this lack of meaning as a cost of dimensionality reduction.

Let’s rotate the configuration in order to get a familiar map.

library(ggplot2)

cities_conf = as.data.frame(cities_mds$conf)

ggplot() +
  geom_point(data = cities_conf,
             mapping = aes(x = -D1, y = -D2),    # flip both axes
             alpha = 0.5, color = "blue", size = 10) +
  geom_text(data = cities_conf,
            mapping = aes(x = -D1, y = -D2),
            label = rownames(cities)) +
  labs(title = "figure3: MDS representation of pair-wise distances of 11 US cities")

Comparing figure3 and figure1, one can validate the accuracy of the MDS representation.

2.3 MDS on a [relatively] high-dimensional dataset

In the previous example, MDS could not show its abilities, since the original dataset was itself two-dimensional. In this sub-section, a dataset with 6 variables is reduced in dimensionality and subsequently visualized using MDS.

The dataset, presented in Table 2, is from (Aoki et al., 2007) and describes 47 Japanese prefectures with 6 attributes, as follows:

Table2 - Japan prefectures dataset
Prefecture Area Population Household.Income Manufacturing.shipment.Amount Agricultural.Shipment.Amount Commercial.Sales.Amount
Hokkaido 83453 5707654 510910 57137 10551 223000
Aomori 9235 1472633 524671 13479 2648 41027
Iwate 15278 1413099 514243 23058 2849 40455
Miyagi 6861 2368591 466685 37492 2202 125793
Akita 11434 1183380 593805 16201 2058 35325
Yamagata 7394 1240877 596394 27451 2372 32899

Having 6 dimensions, it is not possible to visualize the objects (the prefectures) in two-dimensional space without dimensionality reduction. MDS will do it for us. The first step is to generate a proximity matrix out of the raw data. The proximity values can be calculated using the Euclidean distance function, correlation, or any other function that can be used and justified as a proximity measure. Here, I choose the Euclidean distance. Since the scales of the variables are very different in this dataset, before using the dist() function I apply the scale() function to standardize the data.

japan_perf_scale = scale(japan_perf[,-1])   # drop the name column, then standardize each variable
japan_dist = dist(x = japan_perf_scale)     # Euclidean distances between prefectures
japan_mds = mds(delta = japan_dist, ndim = 2, type = "ratio")
japan_mds
## 
## Call:
## mds(delta = japan_dist, ndim = 2, type = "ratio")
## 
## Model: Symmetric SMACOF 
## Number of objects: 47 
## Stress-1 value: 0.113 
## Number of iterations: 76

The Stress-1 value of 0.113 (more precisely, 0.112966) is acceptable (we will return to what counts as acceptable later), so we can continue and plot the MDS map.

ggplot() +
  geom_point(data = as.data.frame(japan_mds$conf),
             mapping = aes(x = D1, y = D2),
             color = "blue", alpha = 0.5) +
  labs(title = "figure4: MDS configuration of Japan Prefectures")

As one can see, most of the observations are concentrated in one corner of the map, which means there is not much difference among them. The interesting observations, however, are the ones located far from the madding crowd! In particular, there are two observations which clearly lie outside the cloud; they may be outliers!

Let’s add labels to the map. The labels are the first three letters of prefecture names.

perf = substr(x = as.character(japan_perf[,1]), start = 1, stop = 3)

library(ggrepel)   # for geom_text_repel(), which keeps labels from overlapping

ggplot() +
  geom_point(data = as.data.frame(japan_mds$conf),
             mapping = aes(x = D1, y = D2),
             color = "blue", alpha = 0.5) +
  geom_text_repel(data = data.frame(perf, japan_mds$conf),
                  mapping = aes(x = D1, y = D2, label = perf)) +
  labs(title = "figure5: MDS configuration of Japan Prefectures with labels")

According to figure5, “Tokyo” and “Hokkaido”, followed by “Aichi”, “Osaka”, and “Kanagawa”, are relative outliers, if we apply an intuitive distance-based outlier detection by eye-balling the map. However, “Tokyo” and “Hokkaido”, the two most extreme cases, are also far from each other, which means they differ from the crowd in two different ways. But in which ways are they different? This is a question the MDS configuration cannot answer on its own. To resolve this shortcoming, I will suggest two methods in a separate instruction; otherwise this one would become too protracted!
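As a rough, purely heuristic complement to eye-balling (and not one of the two methods promised for the separate instruction), one can rank the prefectures by their distance from the centroid of the configuration:

conf = japan_mds$conf

# Euclidean distance of each prefecture from the center of the map
centroid = colMeans(conf)
dist_from_center = sqrt(rowSums(sweep(conf, 2, centroid)^2))

# the most outlying prefectures come first
head(sort(dist_from_center, decreasing = TRUE))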

2.4 On evaluation of the results

A very natural question is: how much Stress-1 is too much? Unfortunately, Stress-1 cannot be judged against universal thresholds that separate good results from bad ones.

> For the Stress-1 coefficient \(\sigma_1\) using ordinal MDS, Kruskal (1964a), on the basis of his “experience with experimental and synthetic data” (p. 16), suggests the following benchmarks: .20 = poor, .10 = fair, .05 = good, .025 = excellent, and .00 = perfect. Unfortunately, such criteria almost inevitably lead to misuse by suggesting that only solutions whose Stress is less than .20 are acceptable, or that all solutions with a Stress of less than .05 are good in more than just a formal sense. Neither conclusion is correct. An MDS solution may have high Stress simply as a consequence of high error in the data, and finding a precise representation for the data does not imply anything about its scientific value. (Borg and Groenen, 2005, p48)

So what should we do? The same authors offer a suggestion, normalized stress, which has a very straightforward meaning:

The value of \(\sigma_n(X^*)\) is the proportion of variation of the dissimilarities not accounted for by the distances, and \(1 - \sigma_n(X^*)\) is the fitted proportion, a coefficient of determination. (Borg and Groenen, 2005, p249)

However, normalized stress is not available as a ready-to-use output of the mds() function, but it can be computed as follows:

dhat_matrix = as.matrix(japan_mds$dhat)        # disparities (fitted distances)
d_matrix = as.matrix(japan_mds$confdiss)       # distances in the configuration
dhat_ij = dhat_matrix[upper.tri(dhat_matrix)]  # take each pair only once
d_ij = d_matrix[upper.tri(d_matrix)]
numerator = sum((dhat_ij - d_ij)^2)
denominator = sum(dhat_ij^2)
normalized_stress = numerator/denominator
print(paste("normalized stress is", normalized_stress))
print(paste("coefficient of determination", 1 - normalized_stress))
## [1] "normalized stress is 0.0128023015050354"
## [1] "coefficient of determination 0.987197698494965"

Consequently, our configuration is quite successful in representing the proximity values as distances on the map. As a sanity check, note that \(\sqrt{0.0128} \approx 0.113\), the Stress-1 value reported earlier by mds(), consistent with the reported Stress-1 being the square root of the normalized stress.

3. Why MDS? Advantage and disadvantages of MDS

The natural benchmark for evaluating MDS is PCA: not because both are “dimensionality reduction” techniques, or because MDS in its very classical form is similar to PCA, but because of the prevalence of PCA. Everyone knows PCA, and it is a very common topic in data science courses.

PCA has some serious drawbacks that stem from the assumptions of the method. (Melssen et al., 2006) summarize the shortcomings of PCA:

First of all, it is explicitly assumed that the high-dimensional structure (topology) of the input data can be reduced in a linear fashion. Secondly, these techniques perform only well if the objects in the data set do not contain oddities like outliers (which introduce a leverage effect disturbing the quality of the projection). Thirdly, the visualisation power of these transformation techniques deteriorates considerably if the number of relevant dimensions in the multivariate space (i.e., the number of significant principal components or, speaking mathematically, the actual rank of the input matrix) remains high after a PCA analysis.

In short, PCA tries to find hidden linear correlations among the variables. Hence, if the variables are non-linearly related, the technique does not perform well; indeed, it may be misleading, since it does not try to preserve the topology of the data. Even in some datasets where the variables are linearly related, PCA does not perform well, because of its orthogonal transformation. And of course, I am not talking about kernel PCA, but about the commonly used, or misused, PCA.

In contrast, MDS tries to preserve the topology of the data, and it is inherently a non-linear transformation. MDS can thus not only be used to reduce dimensionality in a more reliable way than PCA, but it is also not negatively affected by outliers. Indeed, MDS can be used to detect outliers, as we did in this report.
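For a quick side-by-side impression on the Japan data, PCA can be fitted with prcomp() on the same standardized matrix from Section 2.3 and drawn next to the MDS configuration. This is only a sketch assuming japan_perf_scale and japan_mds from above; with Euclidean distances on the same data the two maps tend to look similar, since classical MDS of Euclidean distances is equivalent to PCA, and the differences matter most for non-Euclidean proximities.

japan_pca = prcomp(japan_perf_scale)   # the data is already centered and scaled
pca_scores = japan_pca$x[, 1:2]        # scores on the first two principal components

# draw the two 2d layouts next to each other
par(mfrow = c(1, 2))
plot(pca_scores, main = "PCA scores")
plot(japan_mds$conf, main = "MDS configuration")
par(mfrow = c(1, 1))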

However, there are some drawbacks in the practical usage of MDS, shortcomings that I have encountered while working with it. One is the sensitivity of the method to the initial configuration. Another is that the method is prone to producing degenerate solutions, especially in the ordinal model. Next is the problem of adding a new observation to the dataset, since the method needs to re-run its whole procedure to find the new configuration. In addition, for large datasets, a compression method seems necessary to reduce the number of data points on the map; otherwise the map would be overcrowded. In a forthcoming article, I will suggest a “compression method” which resolves this problem as well as the previous shortcomings.

4. Conclusion

Multidimensional Scaling (MDS) is a non-linear projection of high-dimensional objects into a low-dimensional space (usually a two-dimensional map), such that the pair-wise distances of the objects are preserved as much as possible. Hence, MDS is a topology-preserving method which can be used for dimensionality reduction, and consequently for data visualization.

This article is a practical introduction to MDS in R. This instruction, and the related forthcoming one, try to fill a gap that this author has seen in the available sources. To the best of my research, the current sources on MDS available on the web are either too brief, to the extent that they may be puzzling or misleading, or they are comprehensive books that are excellent in quality but too lengthy for an introduction. Hence, this instruction tries to reside in between: sufficiently comprehensive, yet sufficiently concise. To keep this report short, I know no better way than removing details, which inevitably makes it less accurate. This practical instruction is written to serve as a first step into MDS, and further readings are provided for adding detail to this big picture and making it more accurate.

To demonstrate MDS practically and provide sample code, the smacof package (de Leeuw and Mair, 2009) has been used to visualize a dataset of 47 Japanese prefectures, projected from a 6-dimensional space onto a two-dimensional map. Strengths of MDS have been pointed out in contrast to PCA as well.

In the next report, some improvements to the MDS map will be suggested. Afterwards, a special case of MDS, called Multidimensional Unfolding, will be discussed.

5. References

Aoki, S., Toyozumi, K. and Tsuji, H. (2007). Visualizing method for data envelopment analysis. In: IEEE International Conference on Systems, Man and Cybernetics (ISIC), 2007, pp. 474-479. IEEE.

Borg, I. and Groenen, P. (2005). Modern multidimensional scaling. New York: Springer.

Cox, T. and Cox, M. (2001). Multidimensional scaling. Boca Raton: Chapman & Hall/CRC.

de Leeuw, J. and Mair, P. (2009). Multidimensional Scaling Using Majorization: SMACOF in R. Journal of Statistical Software, 31(3), 1-30. URL http://www.jstatsoft.org/v31/i03/.

Kruskal, J. and Wish, M. (1978). Multidimensional scaling. Beverly Hills, Calif.: Sage Publications.

Melssen, W., Wehrens, R. and Buydens, L., 2006. Supervised Kohonen networks for classification problems. Chemometrics and Intelligent Laboratory Systems, 83(2), pp.99-113.