Dimension reduction

Dimension reduction with MDS

Dimension reduction in unsupervised learning is a crucial tool for handling high-dimensional data, offering benefits in terms of simplification, computational efficiency, and improved interpretability. It enables researchers and data scientists to extract meaningful insights from complex datasets more effectively. Knowing and understanding the data being analyzed allows the researcher to choose an appropriate distance metric, which in turn determines how interpretable the resulting output is.

Choice of distance metric in MDS - playing with random data

This report presents the impact of the choice of distance metric on the outcome of dimension reduction using Multidimensional Scaling (MDS) applied to random data. The aim is to illustrate the results transparently and present them visually in a clear and enjoyable manner. Random points were generated to form a) a sphere and b) a pyramid for the purpose of creating visual results.

The work is inspired by examples from Prof. Katarzyna Kopczewska and Prof. Jacek Lewkowicz.

Before starting, let’s load the necessary packages.

library(scatterplot3d)
library(rgl)
library(StatMatch)

## Loading required package: proxy

## 
## Attaching package: 'proxy'

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

## Loading required package: survey

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

## Loading required package: lpSolve

## Loading required package: ggplot2

PART 1 - Sphere

First, 1000 random points were generated in spherical coordinates and converted to Cartesian coordinates, so that together they form a sphere. The next two plots visualize this sphere.

# Generate random points.
theta <- runif(1000, 0, 2 * pi)
phi <- runif(1000, 0, pi)
r <- runif(1000, 0, 1)^(1/3)

# Create Cartesian coordinates (x, y, z) based on generated spherical coordinates.
x <- r * sin(phi) * cos(theta)
y <- r * sin(phi) * sin(theta)
z <- r * cos(phi)

# Combine the coordinates into a matrix of points forming a sphere.
v <- cbind(x, y, z)

# Create 3D plot
scatterplot3d(v[, 1], v[, 2], v[, 3], color="black", pch=16, main="Sphere")

# Similar plot of the same data with a different scale
scatterplot3d(v[,1], v[,2],v[,3], ylim=c(-2,2), xlim=c(-2,2),zlim=c(-2,2),main="Sphere")
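A side note on the sampling step: drawing the polar angle `phi` uniformly from 0 to pi places more points near the two poles than a volume-uniform sample would. The sketch below is only an illustration under assumed names (`phi_naive`, `phi_unif`, `near_pole`, and the seed are not part of the original analysis); it compares the share of points lying within 30 degrees of either pole for both ways of drawing the angle.

```r
set.seed(123)  # illustrative seed, not used in the original report
n <- 1000

theta <- runif(n, 0, 2 * pi)
r <- runif(n, 0, 1)^(1/3)

# Polar angle drawn as in the report: uniform on [0, pi].
phi_naive <- runif(n, 0, pi)

# Alternative: draw cos(phi) uniformly on [-1, 1], which spreads
# the points evenly over the volume of the ball.
phi_unif <- acos(runif(n, -1, 1))

# Share of points within 30 degrees of either pole (hypothetical helper).
near_pole <- function(phi) mean(phi < pi / 6 | phi > 5 * pi / 6)

print(c(naive = near_pole(phi_naive), uniform = near_pole(phi_unif)))

# Cartesian coordinates for the volume-uniform variant.
ball <- cbind(x = r * sin(phi_unif) * cos(theta),
              y = r * sin(phi_unif) * sin(theta),
              z = r * cos(phi_unif))
```

Either variant is suitable for the MDS experiments that follow; the difference only affects how densely the poles are populated.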

Then MDS was performed on the sphere points, reducing the dimensions from 3 to 2 (3D to 2D).

# MDS on the data
dist.reg<-dist(v)
mds1<-cmdscale(dist.reg, k=2)
summary(mds1)

##        V1                  V2           
##  Min.   :-0.986789   Min.   :-0.950070  
##  1st Qu.:-0.443153   1st Qu.:-0.264996  
##  Median :-0.002901   Median :-0.003321  
##  Mean   : 0.000000   Mean   : 0.000000  
##  3rd Qu.: 0.459564   3rd Qu.: 0.253918  
##  Max.   : 0.996934   Max.   : 0.921050

plot(mds1[,1], mds1[,2],col="red") # retrieved data
points(v, col="black") # empirical data
title(main="Points retrieved from distance (red) vs. original points (black)")

The plot above presents the results of the first MDS, which reduces the dimensions from 3 to 2. Original points are marked in black and retrieved ones in red.

# MDS on the data
dist.reg<-dist(v)
mds2<-cmdscale(dist.reg, k=3)
summary(mds2)

##        V1                  V2                  V3          
##  Min.   :-0.986789   Min.   :-0.950070   Min.   :-0.95688  
##  1st Qu.:-0.443153   1st Qu.:-0.264996   1st Qu.:-0.25239  
##  Median :-0.002901   Median :-0.003321   Median : 0.01632  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.459564   3rd Qu.: 0.253918   3rd Qu.: 0.24843  
##  Max.   : 0.996934   Max.   : 0.921050   Max.   : 0.95822

points3d(mds2[, 1], mds2[, 2], mds2[, 3], col = "red") # retrieved data
points3d(v[, 1], v[, 2], v[, 3], col = "black") # empirical data
title3d(main="Points retrieved from distance (red) vs. original points (black)")
rglwidget(altText = "Plot 3D presenting points retrieved from distance (red) vs. original points (black)")

rgl::clear3d()

The second MDS was conducted to compare the reduction to 2 dimensions with the 3-dimensional result. The retrieved data and the original points are presented in the plot above.

It can be observed that in both the first and second cases, the MDS result accurately reflects the shape and distribution of the original data. The points do not overlay exactly, because MDS recovers a configuration only up to rotation, reflection, and translation, but they form a very coherent shape/structure.
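The visual impression above can be checked numerically by comparing pairwise distances rather than coordinates, since distances do not depend on how the recovered object is rotated. The following is a minimal sketch, assuming a freshly generated sphere-like cloud with a fixed seed; the names `pts`, `emb2`, `emb3`, `err2`, and `err3` are illustrative and do not appear in the original analysis.

```r
set.seed(42)  # illustrative seed
n <- 300

# Sphere-like cloud built the same way as in the report.
theta <- runif(n, 0, 2 * pi)
phi <- runif(n, 0, pi)
r <- runif(n, 0, 1)^(1/3)
pts <- cbind(r * sin(phi) * cos(theta),
             r * sin(phi) * sin(theta),
             r * cos(phi))

d_orig <- stats::dist(pts)       # Euclidean distances between points
emb3 <- cmdscale(d_orig, k = 3)  # classical MDS keeping 3 dimensions
emb2 <- cmdscale(d_orig, k = 2)  # classical MDS keeping 2 dimensions

# Largest absolute difference between original and recovered distances.
err3 <- max(abs(as.vector(d_orig) - as.vector(stats::dist(emb3))))
err2 <- max(abs(as.vector(d_orig) - as.vector(stats::dist(emb2))))

print(c(k3 = err3, k2 = err2))
```

With three dimensions kept, the distances should be reproduced up to floating-point error, while the 2D solution necessarily shortens some of them.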

As the 2-dimensional circle has already been discussed in class, here I will focus on the results in 3 dimensions.

Different distance metrics and MDS results

In further MDS analysis, the following distance metrics were utilized: Euclidean, Manhattan, Canberra, Maximum, Minkowski, Gower.

Euclidean: The Euclidean distance is the straight-line distance between two points in space. It is the most common distance metric and is calculated as the square root of the sum of squared differences between corresponding coordinates.
Manhattan: Manhattan distance is the sum of absolute differences between corresponding coordinates. It measures the distance one would have to travel along grid lines in a city to get from one point to another.
Canberra: Canberra distance is a measure of the dissimilarity between two points, considering the relative differences in magnitudes of the variables. It is particularly sensitive to small differences when the values are close to zero.
Maximum: Also known as Chebyshev distance, it calculates the maximum absolute difference between corresponding coordinates. It represents the greatest difference along any dimension.
Minkowski: The Minkowski distance is a generalization of both the Euclidean (p=2) and Manhattan (p=1) distances. It raises each absolute coordinate difference to the power p, sums the results, and then takes the p-th root of the sum (that is, raises the sum to the power 1/p). Here p=0.7 is used. For p<1 the triangle inequality no longer holds, so the result is a dissimilarity measure rather than a true metric, and small coordinate differences carry relatively more weight than they do in the Euclidean distance.
Gower: Gower distance is a measure of dissimilarity designed for mixed data types (categorical and numerical). It considers the similarity between pairs of observations based on the attributes’ scales and types.
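The definitions above can be made concrete by computing each distance by hand for a single pair of points and comparing the result with `stats::dist`. This is only a sketch: the two points `a` and `b` and the `ranges` vector are hypothetical values chosen for illustration, and the Gower line covers just the purely numeric case (mean of range-scaled absolute differences).

```r
# Two illustrative points (hypothetical values, all positive).
a <- c(1, 2, 3)
b <- c(4, 6, 3.5)
diffs <- abs(a - b)  # absolute coordinate differences
p <- 0.7             # Minkowski exponent used in the report

by_hand <- c(
  euclidean = sqrt(sum(diffs^2)),
  manhattan = sum(diffs),
  canberra  = sum(diffs / (abs(a) + abs(b))),
  maximum   = max(diffs),
  minkowski = sum(diffs^p)^(1 / p)
)

# The same distances computed by stats::dist on a two-row matrix.
pts <- rbind(a, b)
from_dist <- c(
  euclidean = as.numeric(stats::dist(pts, method = "euclidean")),
  manhattan = as.numeric(stats::dist(pts, method = "manhattan")),
  canberra  = as.numeric(stats::dist(pts, method = "canberra")),
  maximum   = as.numeric(stats::dist(pts, method = "maximum")),
  minkowski = as.numeric(stats::dist(pts, method = "minkowski", p = p))
)

print(cbind(by_hand, from_dist))

# Gower for numeric variables only. The ranges are an assumption here;
# in practice they are taken from the whole dataset.
ranges <- c(5, 5, 5)
gower_by_hand <- mean(diffs / ranges)
print(gower_by_hand)
```

Calling `stats::dist` explicitly avoids any ambiguity when another package masks `dist`.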

# Impact of distance metric on quality of MDS


points3d(v[, 1], v[, 2], v[, 3], col="black")
title3d("empirical points")
rglwidget(altText = "empirical points")

rgl::clear3d()

dist.reg<-dist(v, method="euclidean") 
mds3<-cmdscale(dist.reg, k=3)
points3d(mds3[,1], mds3[,2],mds3[,3], col="red")
title3d("Euclidean")
rglwidget(altText = "Euclidean")

rgl::clear3d()

dist.reg<-dist(v, method="manhattan") 
mds4<-cmdscale(dist.reg, k=3)
points3d(mds4[,1], mds4[,2],mds4[,3], col="blue")
title3d("manhattan")
rglwidget(altText = "manhattan")

rgl::clear3d()

dist.reg<-dist(v, method="canberra") 
mds5<-cmdscale(dist.reg, k=3)
points3d(mds5[,1], mds5[,2],mds5[,3], col="green")
title3d( "canberra")
rglwidget(altText = "canberra")

rgl::clear3d()

dist.reg<-dist(v, method="maximum") 
mds6<-cmdscale(dist.reg, k=3)
points3d(mds6[,1], mds6[,2],mds6[,3], col="yellow")
title3d("maximum")
rglwidget(altText = "maximum")

rgl::clear3d()

dist.reg<-dist(v, method="minkowski", p=0.7) 
mds7<-cmdscale(dist.reg, k=3)
points3d(mds7[,1], mds7[,2],mds7[,3], col="pink")
title3d("Minkowski p=0.7")
rglwidget(altText = "Minkowski p=0.7")

rgl::clear3d()


dist.gower<-gower.dist(v)
mds8<-cmdscale(dist.gower, k=3)
points3d(mds8[,1], mds8[,2],mds8[,3], col="darkmagenta")
title3d("Gower")
rglwidget(altText = "Gower")

rgl::clear3d()

As expected, the results closely resemble those obtained in 2D for the circle. Presented on interactive plots, they constitute a fascinating element of playing with random data.

The original data is most accurately represented by MDS based on Euclidean distance. The other visualizations are also plausible: they are very similar to one another and form structures resembling a cube. The most distinct output comes from MDS using the Canberra distance, where only a few points are visibly present in space.
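The visual comparison above can be complemented with a simple number per metric. One possible choice, used here purely as an illustration, is the correlation between the original Euclidean distances and the distances in the 3D MDS configuration obtained from each metric. The seed, the smaller sample size, and the names `v_demo`, `metrics`, and `dist_cor` are assumptions of this sketch; Gower is left out so that the example needs only base R.

```r
set.seed(1)  # illustrative seed
n <- 200     # smaller sample than in the report, to keep the sketch fast

theta <- runif(n, 0, 2 * pi)
phi <- runif(n, 0, pi)
r <- runif(n, 0, 1)^(1/3)
v_demo <- cbind(x = r * sin(phi) * cos(theta),
                y = r * sin(phi) * sin(theta),
                z = r * cos(phi))

# Reference: Euclidean distances between the original points.
d_ref <- stats::dist(v_demo)

# Arguments passed to stats::dist for each metric.
metrics <- list(
  euclidean = list(method = "euclidean"),
  manhattan = list(method = "manhattan"),
  canberra  = list(method = "canberra"),
  maximum   = list(method = "maximum"),
  minkowski = list(method = "minkowski", p = 0.7)
)

# For each metric: build the distance matrix, run classical MDS (k = 3),
# and correlate the embedded distances with the reference distances.
dist_cor <- sapply(metrics, function(args) {
  d_in <- do.call(stats::dist, c(list(v_demo), args))
  emb <- cmdscale(d_in, k = 3)
  cor(as.vector(d_ref), as.vector(stats::dist(emb)))
})

print(round(dist_cor, 3))
```

A value close to 1 means the configuration preserves the original geometry; lower values indicate stronger distortion introduced by the metric.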

PART 2 - Pyramid

An analogous procedure was carried out for the Pyramid.

Initially, random numbers were generated and used to place points along the three edges of the triangle forming the base of the pyramid. Following that, the three remaining edges, which meet at the apex, were constructed in the same way.

The interactive and the static plots below present the pyramid created in this manner.

# Generate 200 random numbers from 0 to 1
x1<-runif(200, 0,1)

v1<-cbind(0, x1,0)
v2<-cbind(x1,0,0)
v3<-cbind(1-x1,x1,0)
t1<-cbind(0, 0,x1)
t2<-cbind(0,x1,1-x1)
t3<-cbind(x1,0,1-x1)

d<-rbind(v1,v2,v3,t1,t2,t3)

points3d(v1, col = "blue", pch = 16)
points3d(v2, col = "blue", pch = 16)
points3d(v3, col = "blue", pch = 16)
points3d(t1, col = "green", pch = 16)
points3d(t2, col = "green", pch = 16)
points3d(t3, col = "green", pch = 16)
title3d("empirical points")
rglwidget(altText = "empirical points")

rgl::clear3d()


scatterplot3d(d[, 1], d[, 2], d[, 3], color="blue", pch=16, main="Pyramid Empirical Points",xlim = c(-0.5, 1), ylim = c(-0.5, 1), zlim = c(0, 1))

MDS was first run with a reduction to 2 dimensions and then repeated with 3 dimensions retained. The results are presented in successive plots, where black points represent empirical data (randomly generated points), while red points represent the data obtained through MDS.

# MDS on the data
dist.reg<-dist(d) # as a main input we need distance between units
mds9<-cmdscale(dist.reg, k=2)
summary(mds9)

plot(mds9[,1], mds9[,2], col="red",ylim=c(-1,1), xlim=c(-1,1)) # retrieved data
points(d, col="black") # empirical data
title(main="Points retrieved from distance (red) vs. original points (black)")

# MDS on the data
dist.reg<-dist(d) # as a main input we need distance between units
mds10<-cmdscale(dist.reg, k=3)
summary(mds10)

##        V1                 V2                 V3         
##  Min.   :-0.78233   Min.   :-0.57131   Min.   :-0.4371  
##  1st Qu.:-0.29777   1st Qu.:-0.28950   1st Qu.:-0.1392  
##  Median : 0.08263   Median :-0.07785   Median : 0.1342  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.29942   3rd Qu.: 0.28572   3rd Qu.: 0.1401  
##  Max.   : 0.58315   Max.   : 0.79322   Max.   : 0.1465

points3d(mds10[, 1], mds10[, 2], mds10[, 3], col = "red") # retrieved data
points3d(d[, 1], d[, 2], d[, 3], col = "black") # empirical data
title3d(main="Points retrieved from distance (red) vs. original points (black)")
rglwidget(altText = "Points retrieved from distance (red) vs. original points (black)")

rgl::clear3d()

In both cases, the output reasonably well represents the original data, allowing us to infer the input.

Interestingly, when the dimensionality is reduced to 2, the MDS output resembles a drawing of a three-dimensional object transferred to two dimensions (highlighted segments simulating 3D space). This is to be expected: classical MDS on Euclidean distances is equivalent to principal component analysis, so the 2D result is an orthogonal projection of the pyramid onto the plane that preserves the most variance. I believe this indicates a high accuracy of fitting, because the number of dimensions was reduced as intended while as much information as possible from the original data was preserved (a structure reminiscent of 3D).

In the case of 3 dimensions, the obtained result in terms of shape and size looks exactly like the original data, with only the arrangement of the object in space being different.
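The projection-like look of the 2D result can be related to principal component analysis: for Euclidean distances, classical MDS and PCA yield the same configuration up to rotation and reflection. The sketch below illustrates this on a pyramid built as in the report; the seed, the smaller sample size, and the names `pyr`, `mds2d`, `pca2d`, and `gap` are assumptions made for the example. Distances are compared instead of raw coordinates so that sign flips or rotations of the axes do not matter.

```r
set.seed(7)        # illustrative seed
x1 <- runif(100, 0, 1)

# Six edges of the pyramid, constructed as in the report.
pyr <- rbind(cbind(0, x1, 0), cbind(x1, 0, 0), cbind(1 - x1, x1, 0),
             cbind(0, 0, x1), cbind(0, x1, 1 - x1), cbind(x1, 0, 1 - x1))

# Classical MDS on Euclidean distances, keeping 2 dimensions.
mds2d <- cmdscale(stats::dist(pyr), k = 2)

# First two principal component scores (prcomp centers the data by default).
pca2d <- prcomp(pyr)$x[, 1:2]

# Largest difference between pairwise distances in the two configurations.
gap <- max(abs(as.vector(stats::dist(mds2d)) - as.vector(stats::dist(pca2d))))
print(gap)
```

A gap at the level of floating-point error indicates that both methods produce the same 2D picture of the pyramid.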

Different distance metrics and MDS results

In further MDS analysis, the same distance metrics as before were utilized (Euclidean, Manhattan, Canberra, Maximum, Minkowski, Gower).

# Impact of distance metric on quality of MDS

points3d(v1, col = "black", pch = 16)
points3d(v2, col = "black", pch = 16)
points3d(v3, col = "black", pch = 16)
points3d(t1, col = "black", pch = 16)
points3d(t2, col = "black", pch = 16)
points3d(t3, col = "black", pch = 16)
title3d("empirical points")
rglwidget(altText = "empirical points")

rgl::clear3d()

dist.reg<-dist(d, method="euclidean") 
mds11<-cmdscale(dist.reg, k=3)
points3d(mds11[,1], mds11[,2],mds11[,3], col="red")
title3d("Euclidean")
rglwidget(altText = "Euclidean")

rgl::clear3d()

dist.reg<-dist(d, method="manhattan") 
mds12<-cmdscale(dist.reg, k=3)
points3d(mds12[,1], mds12[,2],mds12[,3], col="blue")
title3d("manhattan")
rglwidget(altText = "manhattan")

rgl::clear3d()

dist.reg<-dist(d, method="canberra") 
mds13<-cmdscale(dist.reg, k=3)
points3d(mds13[,1], mds13[,2],mds13[,3], col="green")
title3d( "canberra")
rglwidget(altText = "canberra")

rgl::clear3d()

dist.reg<-dist(d, method="maximum") 
mds14<-cmdscale(dist.reg, k=3)
points3d(mds14[,1], mds14[,2],mds14[,3], col="yellow")
title3d("maximum")
rglwidget(altText = "maximum")

rgl::clear3d()

dist.reg<-dist(d, method="minkowski", p=0.7) 
mds15<-cmdscale(dist.reg, k=3)
points3d(mds15[,1], mds15[,2],mds15[,3], col="pink")
title3d("Minkowski p=0.7")
rglwidget(altText = "Minkowski p=0.7")

rgl::clear3d()

dist.gower<-gower.dist(d)
mds16<-cmdscale(dist.gower, k=3)
points3d(mds16[,1], mds16[,2],mds16[,3], col="darkmagenta")
title3d("Gower")
rglwidget(altText = "Gower")

rgl::clear3d()

In the case of the Pyramid, the main conclusions are similar to those for the Sphere. The best representation of the original data is provided by MDS based on Euclidean distance. The result deviating most from the others was obtained when calculating the distance using the Canberra method.

Compared to the Sphere, the spatial figures obtained here exhibit more unconventional shapes, and in my opinion their appearance has an interesting, artistic character. Although there is some resemblance or reference to the input in most cases, the output surprises with its presentation.

I consider the task I’ve performed as an interesting form of experiment/play with random data and various distance metrics.

Summary

As the discussed example illustrates, when conducting dimension reduction, the choice of an appropriate distance metric is crucial, in addition to the number of dimensions to which the data are reduced. The results can differ significantly depending on the selected method.
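Both choices mentioned above can be explored with the goodness-of-fit value that `cmdscale` returns when called with `eig = TRUE`. The sketch below is only an illustration on an assumed random cloud (`pts`, `gof_for`, and the seed are hypothetical); it reports the first `GOF` component for 1, 2, and 3 retained dimensions under the Euclidean and Manhattan distances.

```r
set.seed(2024)  # illustrative seed
pts <- matrix(runif(300 * 3, -1, 1), ncol = 3)

# Goodness of fit of classical MDS for k = 1, 2, 3 (hypothetical helper).
gof_for <- function(d_in) {
  sapply(1:3, function(k) cmdscale(d_in, k = k, eig = TRUE)$GOF[1])
}

gof_euclid <- gof_for(stats::dist(pts, method = "euclidean"))
gof_manhat <- gof_for(stats::dist(pts, method = "manhattan"))

print(rbind(euclidean = gof_euclid, manhattan = gof_manhat))
```

For Euclidean distances on 3D data the fit should reach 1 once all three dimensions are kept, whereas a non-Euclidean input generally cannot be reproduced perfectly in three dimensions.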

Visualization in the form of interactive data plots, easily presented in 2-3 dimensional space, demonstrates how diverse and immensely interesting the outputs can be. The sphere and pyramid have been presented in various ways, and other shapes or structures could also serve as excellent material for further experimentation.

Katarzyna Mocio