In this notebook I explore how data with more than two dimensions can be plotted on a two-dimensional plane.
Loading the necessary libraries.
library(tidyverse)
library(dplyr)
library(glue)
library(scales)
library(ggplot2)
library(GGally)
library(rempsyc)
library(smacof)
library(Rtsne)
library(gridExtra)
Loading the dataset and displaying datapoints of interest.
df=read.csv('dataset.csv')
df=df %>% select('Product','Protein..g.','Sugar..g.','Saturated.Fat..g.','Trans.Fat..g.') # selecting relevant columns
colnames(df)=c('product','proteins','sugars','saturated_fats','trans_fats') # changing column names
num_cols=c('proteins','sugars','saturated_fats','trans_fats') # numerical columns
df=df[df['trans_fats']!=max(df['trans_fats']),] # removing a meal with suspiciously high trans fats value
df=df[!duplicated(df[num_cols]),] # dropping rows whose nutritional values duplicate an earlier row
df[308:317,] # I will focus on comparing selected McCafe beverages
## product proteins sugars saturated_fats trans_fats
## 311 Iced Coffee 4.36 26.95 3.26 0.15
## 312 Cold Coffee Frappe 4.98 35.57 13.91 0.16
## 313 Mocha Frappe 5.49 47.55 14.00 0.20
## 314 Chocolate Oreo Frappe 6.03 55.14 15.91 0.22
## 315 Strawberry Shake 3.67 37.42 6.68 0.12
## 316 Chocolate Shake 4.16 37.78 6.74 0.14
## 317 Mango Smoothie 3.21 38.87 2.65 0.14
## 318 Mixed Berry Smoothie 3.33 43.00 2.64 0.15
## 319 Raw Mango Cooler 0.14 21.06 0.04 0.04
## 320 Mix Berry Cooler 0.16 21.25 0.04 0.04
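The row labels in the output above (311 to 320) differ from the positions used for indexing (308:317) because filtering a data frame keeps the original row names. A minimal sketch of this behaviour, using a made-up data frame (the `toy` object and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical toy data frame, not taken from the dataset
toy <- data.frame(product = c("a", "b", "c", "d", "e"),
                  trans_fats = c(0.1, 9.9, 0.2, 0.2, 0.3))

# Drop the row holding the maximum, mirroring the filtering step above
toy <- toy[toy$trans_fats != max(toy$trans_fats), ]

rownames(toy)   # the original labels are kept, so "2" is now missing
toy[2, ]        # position 2 holds the row that is labelled "3"

# Optional: reset the labels so that they match the positions again
toy_reset <- toy
rownames(toy_reset) <- NULL
```

Resetting the row names is a matter of taste; keeping them makes it easier to trace a row back to the original file.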
Pair plots are a common way of showing the relationships between variables. However, each panel is two-dimensional, so only two variables can be compared at a time. This makes it hard to judge visually whether specific points are similar to each other across all variables at once.
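pairplot=ggpairs(df[308:317,num_cols],
  upper=list(continuous='blank'), # leaving the upper triangle empty
  diag=list(continuous='blankDiag'), # leaving the diagonal empty
  progress=FALSE)
ggsave('pairplot.png', plot=pairplot, width = 10, height = 10) # passing the plot explicitly instead of relying on last_plot()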
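To illustrate why pair plots scale poorly, the short sketch below counts the panels needed for a given number of variables. It only uses base R; `pairs_mat`, `n_panels` and `panels_by_p` are names introduced here for illustration.

```r
num_cols <- c("proteins", "sugars", "saturated_fats", "trans_fats")

# Every unordered pair of variables corresponds to one scatter panel
pairs_mat <- combn(num_cols, 2)
n_panels <- ncol(pairs_mat)   # choose(4, 2) panels for four variables

# The number of panels grows quadratically with the number of variables
panels_by_p <- sapply(2:10, function(p) choose(p, 2))
```

With four variables there are already six panels to read side by side, and none of them shows all four variables together.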
In order to visualize similarities (or dissimilarities) between datapoints, one can perform dimensionality reduction.
First, the data has to be standardized so that each variable has the same “weight” when using MDS or t-SNE. Both methods, as used here, rely on Euclidean distance, which is dominated by whichever variables take the largest values. Without standardization, sugars (tens of grams) would outweigh trans fats (fractions of a gram).
df_std=df
for (col in num_cols)
{
df_std[col]=scale(df[,col])
}
df_std[308:317,]
## product proteins sugars saturated_fats trans_fats
## 311 Iced Coffee -0.5097094 1.519896 -0.5702121 0.086374538
## 312 Cold Coffee Frappe -0.4844120 2.225012 1.6348198 0.164611804
## 313 Mocha Frappe -0.4636028 3.204976 1.6534539 0.477560867
## 314 Chocolate Oreo Frappe -0.4415695 3.825838 2.0489103 0.634035398
## 315 Strawberry Shake -0.5378630 2.376342 0.1378827 -0.148337258
## 316 Chocolate Shake -0.5178699 2.405790 0.1503054 0.008137273
## 317 Mango Smoothie -0.5566321 2.494952 -0.6965097 0.008137273
## 318 Mixed Berry Smoothie -0.5517358 2.832786 -0.6985802 0.086374538
## 319 Raw Mango Cooler -0.6818954 1.038094 -1.2368978 -0.774235383
## 320 Mix Berry Cooler -0.6810793 1.053636 -1.2368978 -0.774235383
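The effect of standardization on Euclidean distances can be sketched with three made-up products. The values below are hypothetical and chosen only so that one variable is on a much larger scale than the other.

```r
# Hypothetical values for three products (not from the dataset)
toy <- data.frame(sugars = c(20, 40, 21),
                  trans_fats = c(0.05, 0.06, 0.20),
                  row.names = c("A", "B", "C"))

d_raw <- as.matrix(dist(toy))          # raw distances, dominated by sugars
d_std <- as.matrix(dist(scale(toy)))   # after scaling, both variables contribute

# Distance from A to C relative to the distance from A to B
ratio_raw <- d_raw["A", "C"] / d_raw["A", "B"]
ratio_std <- d_std["A", "C"] / d_std["A", "B"]
```

On the raw scale, C looks almost identical to A because their sugar content is similar, even though C has four times the trans fats. After scaling, the two distances are of comparable size.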
Performing standard Multidimensional Scaling (MDS). The original paper from 1964 explains what the algorithm is and how it works.
mds_res=mds(dist(df_std[num_cols]), # on standardized numerical columns
ndim=2, # Folding to 2 dimensions
type='ratio')
print(glue("Stress-1 value = {round(mds_res$stress,3)}"))
## Stress-1 value = 0.121
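As a rough illustration of what a Stress-1 value measures, the sketch below computes it by hand for a configuration obtained with base R's `cmdscale` (classical MDS, not the SMACOF algorithm used above). The helper `stress1` is written for this example and assumes that the raw dissimilarities are used as targets without any rescaling, so its numbers are not expected to match `mds_res$stress` exactly.

```r
# Stress-1 of a configuration: square root of the squared differences between
# target dissimilarities and configuration distances, divided by the sum of
# squared configuration distances.
# Assumption: raw dissimilarities are used as targets (no optimal rescaling).
stress1 <- function(delta, conf) {
  d <- as.vector(dist(conf))
  sqrt(sum((as.vector(delta) - d)^2) / sum(d^2))
}

set.seed(42)
X3 <- matrix(rnorm(60), ncol = 3)   # 20 hypothetical points in 3 dimensions
delta <- dist(X3)

conf2 <- cmdscale(delta, k = 2)     # squeezing 3 dimensions into 2 loses information
conf3 <- cmdscale(delta, k = 3)     # 3 dimensions are enough for an exact fit

s2 <- stress1(delta, conf2)
s3 <- stress1(delta, conf3)
```

Here `s3` is numerically zero because nothing is lost, while `s2` is positive: the lower the value, the better the low-dimensional distances reproduce the original ones.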
According to the rule of thumb given by the creator of the algorithm (a stress of about 0.20 is poor, 0.10 fair, 0.05 good), the achieved goodness of fit of 0.12 can be labeled as roughly fair. Let’s see what the chosen products look like in the newly created dimensions.
conf=mds_res$conf
df_std[,c('mds_dim1','mds_dim2')]=mds_res$conf
df_std[308:317,c('product','mds_dim1','mds_dim2')]
## product mds_dim1 mds_dim2
## 311 Iced Coffee -0.5823810 -0.2738667
## 312 Cold Coffee Frappe -0.4376651 -0.9329618
## 313 Mocha Frappe -0.6152817 -1.1890822
## 314 Chocolate Oreo Frappe -0.6202616 -1.4724306
## 315 Strawberry Shake -0.7605102 -0.5012184
## 316 Chocolate Shake -0.7368614 -0.5473443
## 317 Mango Smoothie -0.8854028 -0.4203400
## 318 Mixed Berry Smoothie -0.9669862 -0.5070659
## 319 Raw Mango Cooler -0.6924077 0.1350273
## 320 Mix Berry Cooler -0.6962709 0.1322860
Plotting the results. With the number of dimensions reduced, we can now quickly evaluate which products have similar nutritional values. For example, Chocolate Shake and Strawberry Shake are almost identical, while Mixed Berry Smoothie and Mango Smoothie are very close to each other.
mds_scatter=nice_scatter(df_std[308:317,c('product','mds_dim1','mds_dim2')],
predictor='mds_dim1',
response='mds_dim2',
group='product',
has.line=F,
xtitle='MDS dimension 1',
ytitle='MDS dimension 2')
ggsave('mds.png', plot=mds_scatter, width = 10, height = 6)
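The visual impression of closeness can be checked numerically. The sketch below re-enters the MDS coordinates from the table printed above and finds each product's nearest neighbour in the MDS plane; `mds_xy`, `d_mds` and `nearest` are names introduced for this example.

```r
# MDS coordinates copied from the table printed above
mds_xy <- data.frame(
  dim1 = c(-0.5823810, -0.4376651, -0.6152817, -0.6202616, -0.7605102,
           -0.7368614, -0.8854028, -0.9669862, -0.6924077, -0.6962709),
  dim2 = c(-0.2738667, -0.9329618, -1.1890822, -1.4724306, -0.5012184,
           -0.5473443, -0.4203400, -0.5070659, 0.1350273, 0.1322860),
  row.names = c("Iced Coffee", "Cold Coffee Frappe", "Mocha Frappe",
                "Chocolate Oreo Frappe", "Strawberry Shake", "Chocolate Shake",
                "Mango Smoothie", "Mixed Berry Smoothie", "Raw Mango Cooler",
                "Mix Berry Cooler"))

d_mds <- as.matrix(dist(mds_xy))
diag(d_mds) <- Inf   # ignore the zero distance of each product to itself

# Nearest neighbour of every product in the MDS plane
nearest <- rownames(d_mds)[apply(d_mds, 1, which.min)]
names(nearest) <- rownames(d_mds)
```

Printing `nearest` gives a quick textual counterpart to the scatter plot, restricted to these ten products.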
Now let’s examine how each of the variables contributes to the stress value. To do so, the MDS algorithm has to be run on a transposed dataset.
mds_res_t=mds(dist(t(df_std[num_cols])),
ndim=2,
type='ratio')
summary(mds_res_t)
##
## Configurations:
## D1 D2
## proteins 0.4047 -0.4384
## sugars -0.6693 -0.2910
## saturated_fats 0.4588 0.1625
## trans_fats -0.1942 0.5669
##
##
## Stress per point (in %):
## proteins sugars saturated_fats trans_fats
## 26.80 17.50 28.81 26.89
Plotting the stress decomposition over the variables. The contributions are of similar size (between 17.5% and 28.8%), so no single variable is responsible for most of the lack of fit, which is consistent with the data having no extreme outliers.
plot(mds_res_t,
plot.type="stressplot",
xlab='Variable',
ylab='% contribution to the stress value')
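The idea behind a per-point stress decomposition can be sketched in base R: take the squared difference between each target dissimilarity and the fitted distance, sum it per point, and express it as a percentage. This is only an approximation of the idea under stated assumptions: it uses random made-up data, classical MDS via `cmdscale` instead of SMACOF, and raw dissimilarities as targets, so it will not reproduce the figures reported by `summary(mds_res_t)`.

```r
set.seed(1)
# Hypothetical standardized data: 10 observations, 4 variables
X <- scale(matrix(rnorm(40), nrow = 10,
                  dimnames = list(NULL, c("v1", "v2", "v3", "v4"))))

delta <- as.matrix(dist(t(X)))            # 4 x 4 distances between variables
conf  <- cmdscale(as.dist(delta), k = 2)  # classical MDS, not SMACOF
d     <- as.matrix(dist(conf))            # fitted distances in 2 dimensions

resid2 <- (delta - d)^2                   # squared residual for each pair
spp <- rowSums(resid2) / sum(resid2) * 100   # share of each variable, in %
```

Transposing the data turns the variables into the "points" of the MDS, which is why the decomposition is reported per variable.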
A different method for mapping multidimensional points into a low-dimensional space is t-SNE, proposed and described in an extremely influential paper. The chosen perplexity and theta values (the Rtsne defaults) offer a balance between computational cost and the quality of the results. Because t-SNE starts from a random initialization, the coordinates change between runs unless a seed is set.
tsne_res=Rtsne(df_std[num_cols],
dims=2,
perplexity=30,
verbose=F,
theta=0.5)
df_std[,c('tsne_dim1','tsne_dim2')]=tsne_res$Y
df_std[308:317,c('product','tsne_dim1','tsne_dim2')]
## product tsne_dim1 tsne_dim2
## 311 Iced Coffee 1.1944835 -17.63006
## 312 Cold Coffee Frappe -3.4033495 -17.81250
## 313 Mocha Frappe -3.9197519 -18.43582
## 314 Chocolate Oreo Frappe -4.1396163 -18.61343
## 315 Strawberry Shake -1.3968335 -18.61702
## 316 Chocolate Shake -1.4818395 -18.70644
## 317 Mango Smoothie -0.7003729 -19.83934
## 318 Mixed Berry Smoothie -0.9883538 -20.01803
## 319 Raw Mango Cooler 6.9745094 -18.25156
## 320 Mix Berry Cooler 6.9576922 -18.32046
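To give some intuition for the perplexity parameter, the sketch below computes the perplexity of the Gaussian neighbour distribution around a single point for two bandwidths. In t-SNE the bandwidth is searched separately for each point so that this quantity matches the requested perplexity; the helpers `cond_prob` and `perplexity` and the random data are written for this illustration and are not part of Rtsne.

```r
# Conditional probabilities p_{j|i} for one point i, given the squared
# distances to all other points and a Gaussian bandwidth sigma
cond_prob <- function(d2, sigma) {
  # subtracting min(d2) leaves the normalized result unchanged
  # and avoids numerical underflow for small sigma
  w <- exp(-(d2 - min(d2)) / (2 * sigma^2))
  w / sum(w)
}

# Perplexity = 2^(Shannon entropy in bits)
perplexity <- function(p) {
  p <- p[p > 0]
  2^(-sum(p * log2(p)))
}

set.seed(7)
X <- matrix(rnorm(200), ncol = 4)      # 50 hypothetical points in 4 dimensions
d2 <- as.matrix(dist(X))[1, -1]^2      # squared distances from point 1 to the rest

perp_small <- perplexity(cond_prob(d2, sigma = 0.3))   # only a few close neighbours count
perp_large <- perplexity(cond_prob(d2, sigma = 100))   # nearly all 49 neighbours count equally
```

Perplexity can therefore be read as the effective number of neighbours each point takes into account: a larger value makes the embedding reflect broader structure, a smaller one focuses on the closest points.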
Plotting the results. Once again, each product can be quickly compared with the rest. Unsurprisingly, the observed similarities and dissimilarities are essentially the same as in the MDS results. However, t-SNE seems to have placed similar datapoints closer together while keeping them farther from the rest than MDS did.
tsne_scatter=nice_scatter(df_std[308:317,c('product','tsne_dim1','tsne_dim2')],
predictor='tsne_dim1',
response='tsne_dim2',
group='product',
has.line=F,
xtitle='t-SNE dimension 1',
ytitle='t-SNE dimension 2')
ggsave('tsne.png', plot=tsne_scatter, width=10, height=6)
Below are the individual standardized variables together with the fitted dimensions from MDS and t-SNE. The table confirms that products portrayed as similar in the plots are indeed close in terms of nutritional values.
all_num_cols=c('proteins','sugars','saturated_fats','trans_fats','mds_dim1','mds_dim2','tsne_dim1','tsne_dim2')
df_std_round=df_std['product']
df_std_round[all_num_cols]=round(df_std[all_num_cols],2)
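png('table.png',width=800, height=300) # opening a PNG graphics device
grid.table(df_std_round[308:317,]) # drawing the table on the device
dev.off() # closing the device so that the file is actually written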