Dimensionality reduction intro

In this notebook I explore how data with more than two dimensions can be plotted on a 2-dimensional plane.

Loading necessary libraries

library(tidyverse)
library(dplyr)
library(glue)
library(scales)
library(ggplot2)
library(GGally)
library(rempsyc)
library(smacof)
library(Rtsne)
library(gridExtra)

Loading the dataset and displaying datapoints of interest.

df=read.csv('dataset.csv')
df=df %>% select('Product','Protein..g.','Sugar..g.','Saturated.Fat..g.','Trans.Fat..g.') # selecting relevant columns
colnames(df)=c('product','proteins','sugars','saturated_fats','trans_fats') # changing column names
num_cols=c('proteins','sugars','saturated_fats','trans_fats') # numerical columns
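df=df[df$trans_fats!=max(df$trans_fats),] # removing a meal with a suspiciously high trans fats value
df=df[!duplicated(df[num_cols]),] # dropping rows with identical nutritional values (Rtsne rejects duplicate rows by default)
df[308:317,] # I will focus on comparing selected McCafe beverages (row positions; the printed row names keep the original numbering)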

##                   product proteins sugars saturated_fats trans_fats
## 311           Iced Coffee     4.36  26.95           3.26       0.15
## 312    Cold Coffee Frappe     4.98  35.57          13.91       0.16
## 313          Mocha Frappe     5.49  47.55          14.00       0.20
## 314 Chocolate Oreo Frappe     6.03  55.14          15.91       0.22
## 315      Strawberry Shake     3.67  37.42           6.68       0.12
## 316       Chocolate Shake     4.16  37.78           6.74       0.14
## 317        Mango Smoothie     3.21  38.87           2.65       0.14
## 318  Mixed Berry Smoothie     3.33  43.00           2.64       0.15
## 319      Raw Mango Cooler     0.14  21.06           0.04       0.04
## 320      Mix Berry Cooler     0.16  21.25           0.04       0.04

Pairplots are a common way of showing the relationships between variables. However, each panel is 2-dimensional, so it shows only two variables at a time. This makes it hard to judge visually whether specific points are similar to each other across all variables at once.

pairplot=ggpairs(df[308:317,num_cols],
        upper=list(continuous='blankDiag'),
        diag=list(continuous='blankDiag'),
        progress=FALSE)
# passing the plot explicitly, because it was assigned and never printed
ggsave('pairplot.png', plot = pairplot, width = 10, height = 10)
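
To make the limitation concrete, here is a small sketch in base R that counts the two-variable panels a pairplot needs for the four columns used in this notebook. It only reuses the column names from above; nothing is read from the dataset.

```r
# Numerical columns used in this notebook
num_cols <- c('proteins', 'sugars', 'saturated_fats', 'trans_fats')

# Every panel of a pairplot shows exactly one pair of variables
pairs <- combn(num_cols, 2)   # one column per unordered pair
n_panels <- ncol(pairs)       # equals choose(4, 2)

print(t(pairs))
print(n_panels)
```

With p variables the number of panels grows as p(p-1)/2, and no single panel shows all variables at once.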

In order to visualize similarities (or dissimilarities) between datapoints, one can perform dimensionality reduction.

First, the data has to be standardized so that each variable carries the same “weight” in MDS and t-SNE. Both methods, as used here, rely on Euclidean distance, which is dominated by the variables with the largest order of magnitude unless all variables are put on a common scale.

df_std=df
for (col in num_cols)
{
  df_std[col]=scale(df[,col])
}
df_std[308:317,]

##                   product   proteins   sugars saturated_fats   trans_fats
## 311           Iced Coffee -0.5097094 1.519896     -0.5702121  0.086374538
## 312    Cold Coffee Frappe -0.4844120 2.225012      1.6348198  0.164611804
## 313          Mocha Frappe -0.4636028 3.204976      1.6534539  0.477560867
## 314 Chocolate Oreo Frappe -0.4415695 3.825838      2.0489103  0.634035398
## 315      Strawberry Shake -0.5378630 2.376342      0.1378827 -0.148337258
## 316       Chocolate Shake -0.5178699 2.405790      0.1503054  0.008137273
## 317        Mango Smoothie -0.5566321 2.494952     -0.6965097  0.008137273
## 318  Mixed Berry Smoothie -0.5517358 2.832786     -0.6985802  0.086374538
## 319      Raw Mango Cooler -0.6818954 1.038094     -1.2368978 -0.774235383
## 320      Mix Berry Cooler -0.6810793 1.053636     -1.2368978 -0.774235383
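
The effect of standardization on Euclidean distance can be seen on a tiny made-up example. This is only a sketch: the `toy` data frame and its column names are invented for illustration and are not part of the dataset.

```r
# Toy data (made up for illustration): two variables on very different scales
toy <- data.frame(grams = c(1, 2, 3),
                  milligrams = c(3000, 1000, 2000))
rownames(toy) <- c('a', 'b', 'c')

# Raw Euclidean distances are driven almost entirely by the 'milligrams' column
d_raw <- dist(toy)

# After scale() each column has mean 0 and standard deviation 1,
# so both variables contribute on an equal footing
toy_std <- scale(toy)
d_std <- dist(toy_std)

print(round(d_raw, 2))
print(round(d_std, 2))
```

In the raw distances the `grams` column is practically invisible, while after scaling both columns influence which points count as close.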

Performing standard Multidimensional Scaling (MDS). The original paper from 1964 explains what the algorithm is and how it works.

mds_res=mds(dist(df_std[num_cols]), # on standardized numerical columns
            ndim=2, # Folding to 2 dimensions
            type='ratio')
print(glue("Stress-1 value = {round(mds_res$stress,3)}"))

## Stress-1 value = 0.121
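
To show what the Stress-1 number measures, here is a rough sketch that computes it by hand. It makes several assumptions: the data is random toy data, the configuration comes from base R's `cmdscale` (classical MDS) rather than from `smacof::mds`, and the formula is Kruskal's Stress-1 with a single least-squares scaling factor standing in for the ratio transformation.

```r
set.seed(1)                              # toy data only, not the notebook's dataset
X <- matrix(rnorm(40), ncol = 4)         # 10 points in 4 dimensions

delta <- as.vector(dist(X))              # dissimilarities in the original space
conf <- cmdscale(dist(X), k = 2)         # classical MDS from base R (a stand-in for smacof::mds)
d <- as.vector(dist(conf))               # distances in the 2-dimensional configuration

# Ratio transformation: disparities are the dissimilarities times one scaling
# factor, chosen here by least squares
b <- sum(delta * d) / sum(delta^2)
d_hat <- b * delta

# Kruskal's Stress-1: normalized root of the summed squared residuals
stress1 <- sqrt(sum((d_hat - d)^2) / sum(d^2))
print(round(stress1, 3))
```

A value of 0 would mean the 2-dimensional distances reproduce the original ones perfectly (up to scale). Since `cmdscale` does not minimize stress directly, a stress-minimizing routine would usually reach a lower value on the same data.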

According to the rule of thumb given by the creator of the algorithm (Kruskal: 0.20 is poor, 0.10 fair, 0.05 good), the achieved goodness of fit, a stress value of 0.12, can be labeled as roughly fair. Let’s see what the chosen products look like in the newly created dimensions.

conf=mds_res$conf # fitted 2-dimensional configuration, one row per product
df_std[,c('mds_dim1','mds_dim2')]=conf
df_std[308:317,c('product','mds_dim1','mds_dim2')]

##                   product   mds_dim1   mds_dim2
## 311           Iced Coffee -0.5823810 -0.2738667
## 312    Cold Coffee Frappe -0.4376651 -0.9329618
## 313          Mocha Frappe -0.6152817 -1.1890822
## 314 Chocolate Oreo Frappe -0.6202616 -1.4724306
## 315      Strawberry Shake -0.7605102 -0.5012184
## 316       Chocolate Shake -0.7368614 -0.5473443
## 317        Mango Smoothie -0.8854028 -0.4203400
## 318  Mixed Berry Smoothie -0.9669862 -0.5070659
## 319      Raw Mango Cooler -0.6924077  0.1350273
## 320      Mix Berry Cooler -0.6962709  0.1322860

Plotting the results. With the number of dimensions reduced, we can now quickly evaluate which products have similar nutritional values. For example, the Chocolate and Strawberry Shakes almost coincide, the two coolers practically overlap, and the Mixed Berry and Mango Smoothies lie close to each other.

mds_scatter=nice_scatter(df_std[308:317,c('product','mds_dim1','mds_dim2')],
             predictor='mds_dim1',
             response='mds_dim2',
             group='product',
             has.line=FALSE,
             xtitle='MDS dimension 1',
             ytitle='MDS dimension 2')
# passing the plot explicitly, because it was assigned and never printed
ggsave('mds.png', plot = mds_scatter, width = 10, height = 6)
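
The visual impression of closeness can also be checked numerically. The sketch below recomputes pairwise distances in the MDS plane for a few of the beverages, with coordinates copied by hand (and rounded) from the printed MDS table; `mds_xy` and `d_mds` are names introduced only for this illustration.

```r
# Coordinates copied from the printed MDS table above (rounded to 4 decimals)
mds_xy <- data.frame(
  mds_dim1 = c(-0.7605, -0.7369, -0.8854, -0.9670, -0.6203),
  mds_dim2 = c(-0.5012, -0.5473, -0.4203, -0.5071, -1.4724),
  row.names = c('Strawberry Shake', 'Chocolate Shake', 'Mango Smoothie',
                'Mixed Berry Smoothie', 'Chocolate Oreo Frappe')
)

# Euclidean distances between the products in the 2-dimensional MDS plane
d_mds <- as.matrix(dist(mds_xy))
print(round(d_mds, 2))
```

Smaller entries mean that two products sit closer together on the plot, so the two shakes should show the smallest off-diagonal value in this subset.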

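Now let’s examine how individual variables contribute to the stress value. To do so, the MDS algorithm is run on the transposed dataset, so that the variables, rather than the products, become the points being mapped; the stress per point then refers to this variable-level configuration.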

mds_res_t=mds(dist(t(df_std[num_cols])), 
              ndim=2, 
              type='ratio')
summary(mds_res_t)

## 
## Configurations:
##                     D1      D2
## proteins        0.4047 -0.4384
## sugars         -0.6693 -0.2910
## saturated_fats  0.4588  0.1625
## trans_fats     -0.1942  0.5669
## 
## 
## Stress per point (in %):
##       proteins         sugars saturated_fats     trans_fats 
##          26.80          17.50          28.81          26.89

Plotting the stress decomposition over the variables. As the data has no extreme outliers, the contributions are fairly balanced (17.5% to 28.8%), and no single variable prevents a better goodness-of-fit statistic on its own.

plot(mds_res_t, 
     plot.type="stressplot",
     xlab='Variable',
     ylab='% contribution to the stress value')
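
As a rough illustration of what a per-point stress decomposition means, the sketch below splits the total squared residual of a 2-dimensional configuration among the points. The assumptions are the same as in any toy example: random data, base R's `cmdscale` instead of `smacof::mds`, and a simple share-of-residuals formula that is not claimed to match smacof's internal computation exactly.

```r
set.seed(2)
X <- matrix(rnorm(32), ncol = 4)                 # 8 toy points, 4 dimensions
rownames(X) <- paste0('p', 1:8)

delta <- as.matrix(dist(X))                      # original dissimilarities
conf <- cmdscale(dist(X), k = 2)                 # classical MDS (stand-in for smacof::mds)
d <- as.matrix(dist(conf))                       # distances in the configuration

b <- sum(delta * d) / sum(delta^2)               # least-squares ratio scaling
res2 <- (b * delta - d)^2                        # squared residual for every pair

# Share of the total squared residual attributable to each point, in percent
spp <- rowSums(res2) / sum(res2) * 100
print(round(spp, 2))
```

The shares add up to 100%, and a point with a share far above the rest would be the one that the low-dimensional map represents worst.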

A different method for mapping multidimensional points into low-dimensional spaces is t-SNE, proposed and described in an extremely influential paper (van der Maaten and Hinton, 2008). The chosen perplexity and theta parameters, which are also the Rtsne defaults, offer a balance between computational complexity and satisfying results.

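# t-SNE is stochastic: call set.seed() beforehand if the coordinates should be reproducible
tsne_res=Rtsne(df_std[num_cols],
               dims=2,
               perplexity=30, # roughly the effective number of neighbours per point
               verbose=FALSE,
               theta=0.5) # speed/accuracy trade-off of the Barnes-Hut approximation (0 = exact t-SNE)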
df_std[,c('tsne_dim1','tsne_dim2')]=tsne_res$Y
df_std[308:317,c('product','tsne_dim1','tsne_dim2')]

##                   product  tsne_dim1 tsne_dim2
## 311           Iced Coffee  1.1944835 -17.63006
## 312    Cold Coffee Frappe -3.4033495 -17.81250
## 313          Mocha Frappe -3.9197519 -18.43582
## 314 Chocolate Oreo Frappe -4.1396163 -18.61343
## 315      Strawberry Shake -1.3968335 -18.61702
## 316       Chocolate Shake -1.4818395 -18.70644
## 317        Mango Smoothie -0.7003729 -19.83934
## 318  Mixed Berry Smoothie -0.9883538 -20.01803
## 319      Raw Mango Cooler  6.9745094 -18.25156
## 320      Mix Berry Cooler  6.9576922 -18.32046
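
The perplexity parameter is easier to interpret with a small computation. t-SNE places a Gaussian around every point and tunes its bandwidth until the resulting neighbour distribution has the requested perplexity. The sketch below computes that perplexity for one point and a few fixed bandwidths; the helper `perplexity_at` and the toy data are made up for illustration and are not part of Rtsne.

```r
# Perplexity of the conditional distribution p_{j|i} around one point,
# for a given Gaussian bandwidth sigma
perplexity_at <- function(sq_dists, sigma) {
  p <- exp(-sq_dists / (2 * sigma^2))
  p <- p / sum(p)                      # normalize to a probability distribution
  p <- p[p > 0]                        # drop zeros so that 0 * log2(0) is never evaluated
  entropy <- -sum(p * log2(p))         # Shannon entropy in bits
  2^entropy
}

set.seed(3)
X <- matrix(rnorm(200), ncol = 4)        # 50 toy points in 4 dimensions
sq_dists <- as.matrix(dist(X))[1, -1]^2  # squared distances from point 1 to all others

for (sigma in c(0.3, 1, 3)) {
  cat('sigma =', sigma, ' perplexity =', round(perplexity_at(sq_dists, sigma), 1), '\n')
}
```

A wider Gaussian spreads the probability over more neighbours, so the perplexity grows with sigma and can be read as an effective neighbour count.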

Plotting the results. Yet again, each product can now be quickly compared to the rest. Unsurprisingly, the observed similarities and dissimilarities are basically the same as in the MDS results. However, t-SNE seems to have pulled similar datapoints closer together while keeping them further from the rest than MDS did.

tsne_scatter=nice_scatter(df_std[308:317,c('product','tsne_dim1','tsne_dim2')],
             predictor='tsne_dim1',
             response='tsne_dim2',
             group='product',
             has.line=FALSE,
             xtitle='t-SNE dimension 1',
             ytitle='t-SNE dimension 2')
# passing the plot explicitly, because it was assigned and never printed
ggsave('tsne.png', plot = tsne_scatter, width=10, height=6)
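
One way to check whether the two methods agree is to compare each product's nearest neighbour in both embeddings. The sketch below does this for the ten beverages, using coordinates copied by hand (and rounded) from the two printed tables; the helper `nearest` is introduced only for this illustration.

```r
products <- c('Iced Coffee', 'Cold Coffee Frappe', 'Mocha Frappe', 'Chocolate Oreo Frappe',
              'Strawberry Shake', 'Chocolate Shake', 'Mango Smoothie',
              'Mixed Berry Smoothie', 'Raw Mango Cooler', 'Mix Berry Cooler')

# Coordinates copied from the printed MDS and t-SNE tables (rounded)
mds_xy <- cbind(c(-0.5824, -0.4377, -0.6153, -0.6203, -0.7605,
                  -0.7369, -0.8854, -0.9670, -0.6924, -0.6963),
                c(-0.2739, -0.9330, -1.1891, -1.4724, -0.5012,
                  -0.5473, -0.4203, -0.5071,  0.1350,  0.1323))
tsne_xy <- cbind(c( 1.1945, -3.4033, -3.9198, -4.1396, -1.3968,
                   -1.4818, -0.7004, -0.9884,  6.9745,  6.9577),
                 c(-17.6301, -17.8125, -18.4358, -18.6134, -18.6170,
                   -18.7064, -19.8393, -20.0180, -18.2516, -18.3205))

# Name of the closest other product for every row of a coordinate matrix
nearest <- function(xy) {
  d <- as.matrix(dist(xy))
  diag(d) <- Inf                       # a point is not its own neighbour
  products[apply(d, 1, which.min)]
}

comparison <- data.frame(product = products,
                         nn_mds = nearest(mds_xy),
                         nn_tsne = nearest(tsne_xy))
print(comparison)

# Share of products whose nearest neighbour is the same in both embeddings
print(mean(comparison$nn_mds == comparison$nn_tsne))
```

This only looks at the ten selected beverages, and because t-SNE is stochastic a rerun would give different coordinates, so the comparison applies to this particular run.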

Here are the individual standardized variables together with the fitted dimensions from MDS and t-SNE. The table confirms that products portrayed as similar in the plots are indeed close in terms of nutritional values.

all_num_cols=c('proteins','sugars','saturated_fats','trans_fats','mds_dim1','mds_dim2','tsne_dim1','tsne_dim2')
df_std_round=df_std['product']
df_std_round[all_num_cols]=round(df_std[all_num_cols],2) # rounding for a compact table
png('table.png',width=800, height=300) # opening the PNG device
grid.table(df_std_round[308:317,])
dev.off() # closing the device so that table.png is actually written

Michał Woźniak

2024-01-27