The aim of this article is to use the PCA and t-SNE methods for dimension reduction of a pizza dataset. PCA is appropriate for data that consists mainly of continuous (quantitative) variables. The goal of PCA is to rotate the coordinate system so as to maximize first the variance of the first coordinate, then the variance of the second coordinate, and so on. The coordinate values obtained after this rotation are called the scores on the generated factors (principal components). t-SNE is a non-linear technique used primarily for data exploration and for visualizing high-dimensional data (I will expand on this later). Compared to PCA (formalized by Hotelling in 1933), t-SNE can be considered a fairly new algorithm (2008).
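To make the "rotation that maximizes variance" idea more concrete, below is a minimal base-R sketch (on simulated data, not on the pizza dataset) showing that the new axes are the eigenvectors of the covariance matrix and that the variances of the rotated coordinates equal the corresponding eigenvalues.
# Minimal illustration of PCA as a variance-maximizing rotation (simulated data)
set.seed(1)
x <- matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.8, 0, 0.6), 2, 2)  # two correlated variables
x_centred <- scale(x, center = TRUE, scale = FALSE)
eig <- eigen(cov(x_centred))           # eigenvectors define the rotated axes
scores <- x_centred %*% eig$vectors    # coordinates in the rotated system (the PC scores)
round(apply(scores, 2, var), 3)        # variance is concentrated in the first coordinate...
round(eig$values, 3)                   # ...and equals the eigenvalues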
# Data Handling
library(tidyverse)
# Visualisation
library(ggplot2)
library(GGally)
library(knitr)        # kable()
library(kableExtra)
# PCA
library(factoextra)
library(FactoMineR)
library(gridExtra)
library(pca3d)
# T-SNE
library(Rtsne)
library(scatterplot3d)
# Loading dataset
pizza_org <- read.csv('Pizza.csv')
kable(head(pizza_org)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| brand | id | mois | prot | fat | ash | sodium | carb | cal |
|---|---|---|---|---|---|---|---|---|
| A | 14069 | 27.82 | 21.43 | 44.87 | 5.11 | 1.77 | 0.77 | 4.93 |
| A | 14053 | 28.49 | 21.26 | 43.89 | 5.34 | 1.79 | 1.02 | 4.84 |
| A | 14025 | 28.35 | 19.99 | 45.78 | 5.08 | 1.63 | 0.80 | 4.95 |
| A | 14016 | 30.55 | 20.15 | 43.13 | 4.79 | 1.61 | 1.38 | 4.74 |
| A | 14005 | 30.49 | 21.28 | 41.65 | 4.82 | 1.64 | 1.76 | 4.67 |
| A | 14075 | 31.14 | 20.23 | 42.31 | 4.92 | 1.65 | 1.40 | 4.67 |
The dataset was downloaded from this link: https://data.world/sdhilip/pizza-datasets and it contains information on the nutrient composition of various pizzas.
The variables in the dataset are: brand (the pizza brand, labelled A-J), id (the sample number), and the composition columns mois (moisture, i.e. the amount of water), prot (protein), fat, ash, sodium and carb (carbohydrates), all given per 100 grams of sample, plus cal (the amount of calories).
Before implementing any Machine Learning algorithm, it is advisable to get to know the data in order to understand it more clearly. In my opinion this step is very helpful, because otherwise it is hard to anticipate what kind of data we will have to deal with later on.
dim(pizza_org)
## [1] 300 9
summary(pizza_org)
## brand id mois prot
## Length:300 Min. :14003 Min. :25.00 Min. : 6.98
## Class :character 1st Qu.:14094 1st Qu.:30.90 1st Qu.: 8.06
## Mode :character Median :24021 Median :43.30 Median :10.44
## Mean :20841 Mean :40.90 Mean :13.37
## 3rd Qu.:24110 3rd Qu.:49.12 3rd Qu.:20.02
## Max. :34045 Max. :57.22 Max. :28.48
## fat ash sodium carb
## Min. : 4.38 Min. :1.170 Min. :0.2500 Min. : 0.510
## 1st Qu.:14.77 1st Qu.:1.450 1st Qu.:0.4500 1st Qu.: 3.467
## Median :17.14 Median :2.225 Median :0.4900 Median :23.245
## Mean :20.23 Mean :2.633 Mean :0.6694 Mean :22.865
## 3rd Qu.:21.43 3rd Qu.:3.592 3rd Qu.:0.7025 3rd Qu.:41.337
## Max. :47.20 Max. :5.430 Max. :1.7900 Max. :48.640
## cal
## Min. :2.180
## 1st Qu.:2.910
## Median :3.215
## Mean :3.271
## 3rd Qu.:3.520
## Max. :5.080
table(pizza_org$brand)
##
## A B C D E F G H I J
## 29 31 27 32 28 30 29 33 29 32
Let's check whether there are any missing (NA) values in the dataset.
sapply(pizza_org, anyNA)
## brand id mois prot fat ash sodium carb cal
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
As we see, there are no missing values in our dataset.
First, let me rename the "id" variable to "sample". It is worth taking a closer look at this variable, because it seems to me that this column should probably be dropped. That is why I will check the uniqueness of its values.
colnames(pizza_org)[which(names(pizza_org) == "id")] <- c("sample")
length(unique(pizza_org$sample))
## [1] 291
It turned out that the number of unique values is not equal to the number of rows, so I have to check whether there are some duplicates.
sample_table <- table(pizza_org$sample) == 2
sample_table[sample_table == TRUE]; sum(sample_table[sample_table == TRUE])
##
## 14029 14044 14045 14126 24035 24043 24049 24089 24110
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1] 9
We can see that there are 9 sample values which might potentially be duplicated - let's check these records in the dataset.
sample_table_frame <- as.data.frame(sample_table[sample_table == TRUE])
# Move the row names (sample numbers) into the first column
sample_table_frame <- cbind(newColName = rownames(sample_table_frame), sample_table_frame)
rownames(sample_table_frame) <- 1:nrow(sample_table_frame)
# Get values of potential duplicates
kable(pizza_org[pizza_org$sample %in% sample_table_frame$newColName,]) %>%
kable_styling()
| | brand | sample | mois | prot | fat | ash | sodium | carb | cal |
|---|---|---|---|---|---|---|---|---|---|
| 53 | B | 24049 | 50.87 | 13.85 | 27.64 | 3.71 | 1.10 | 3.93 | 3.20 |
| 66 | C | 14029 | 49.73 | 25.65 | 19.98 | 2.51 | 0.52 | 2.13 | 2.91 |
| 78 | C | 24043 | 52.19 | 26.00 | 16.64 | 4.17 | 0.61 | 1.00 | 2.58 |
| 119 | D | 24043 | 52.19 | 26.00 | 16.64 | 4.17 | 0.61 | 1.00 | 2.58 |
| 125 | E | 14029 | 39.59 | 8.36 | 4.39 | 1.52 | 0.48 | 46.14 | 2.58 |
| 130 | E | 14126 | 37.78 | 8.30 | 13.05 | 1.64 | 0.49 | 39.23 | 3.08 |
| 134 | E | 24110 | 37.34 | 7.33 | 19.61 | 1.60 | 0.45 | 34.12 | 3.42 |
| 137 | E | 24110 | 37.34 | 7.33 | 19.61 | 1.60 | 0.45 | 34.12 | 3.42 |
| 138 | E | 24035 | 34.31 | 7.98 | 14.54 | 1.46 | 0.49 | 41.71 | 3.30 |
| 146 | E | 14126 | 37.78 | 8.30 | 13.05 | 1.64 | 0.49 | 39.23 | 3.08 |
| 162 | F | 24035 | 27.65 | 7.78 | 17.30 | 1.29 | 0.40 | 46.25 | 3.72 |
| 163 | F | 24049 | 28.33 | 7.82 | 17.96 | 1.41 | 0.45 | 44.48 | 3.71 |
| 227 | H | 24089 | 35.74 | 7.86 | 14.27 | 1.40 | 0.44 | 40.73 | 3.23 |
| 239 | H | 24089 | 35.74 | 7.86 | 14.27 | 1.40 | 0.44 | 40.73 | 3.23 |
| 271 | J | 14044 | 47.60 | 10.43 | 15.18 | 2.32 | 0.56 | 24.47 | 2.76 |
| 272 | J | 14045 | 46.84 | 9.91 | 15.50 | 2.27 | 0.57 | 25.48 | 2.81 |
| 299 | J | 14044 | 47.60 | 10.43 | 15.18 | 2.32 | 0.56 | 24.47 | 2.76 |
| 300 | J | 14045 | 46.84 | 9.91 | 15.50 | 2.27 | 0.57 | 25.48 | 2.81 |
Well, the result is quite tricky. Some records are identical in every column (for example sample 24110); these can be treated as duplicates. For other sample values, however, all the "nutrient" variables are the same, yet the records have been assigned to different brands (for example sample 24043 appears under brands C and D).
I am aware that deleting records is not always the best way to prepare data for further analysis. However, I am not entirely sure how the data was collected, and I am afraid that duplicated rows may distort my subsequent results. Nevertheless, for the purposes of this exercise, I decided to remove the exact duplicates. The records that differ only in the brand column I decided to leave unchanged - I do not think they are incorrect; perhaps some pizzerias simply bake exactly the same type of pizza.
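As a quick side check (a sketch of my own, using the column names shown above), the two situations can also be counted programmatically: rows that duplicate an earlier row in every column versus rows that repeat all the nutrient values regardless of brand and sample number.
# Count exact duplicates vs. rows that repeat all nutrient values (any brand)
nutrient_cols <- c("mois", "prot", "fat", "ash", "sodium", "carb", "cal")
c(exact_duplicates   = sum(duplicated(pizza_org)),
  repeated_nutrients = sum(duplicated(pizza_org[, nutrient_cols])))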
# Make a copy of the original dataset so that the original object stays untouched
pizza_non_scaled <- data.frame(pizza_org)
pizza_non_sc_no_dp <- pizza_non_scaled[!duplicated(pizza_non_scaled), ]
print(paste("The number of removed rows is equal to:", nrow(pizza_org) - nrow(pizza_non_sc_no_dp)))
## [1] "The number of removed rows is equal to: 5"
Moreover, I will also get rid of the sample variable, because it seems to me that it does not contribute any additional relevant information about the pizzas.
pizza <- subset(pizza_non_sc_no_dp, select = -c(sample))
kable(head(pizza)) %>%
kable_styling()
| brand | mois | prot | fat | ash | sodium | carb | cal |
|---|---|---|---|---|---|---|---|
| A | 27.82 | 21.43 | 44.87 | 5.11 | 1.77 | 0.77 | 4.93 |
| A | 28.49 | 21.26 | 43.89 | 5.34 | 1.79 | 1.02 | 4.84 |
| A | 28.35 | 19.99 | 45.78 | 5.08 | 1.63 | 0.80 | 4.95 |
| A | 30.55 | 20.15 | 43.13 | 4.79 | 1.61 | 1.38 | 4.74 |
| A | 30.49 | 21.28 | 41.65 | 4.82 | 1.64 | 1.76 | 4.67 |
| A | 31.14 | 20.23 | 42.31 | 4.92 | 1.65 | 1.40 | 4.67 |
# Looks good now
Now, let's dig into the data and look for some interesting relationships and additional information. This will help me check whether particular companies make their pizzas with similar nutritional ratios or whether there are significant differences from brand to brand.
pizza %>%
  gather("nutrient", "value", 2:8) %>%
  ggplot(aes(x = brand, y = value, fill = brand)) +
  geom_boxplot() +
  ggtitle("Boxplots of nutrients by brand") +
  xlab("Brands") +
  ylab("Values") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) +
  facet_wrap(~ nutrient, scales = "free")
Looking at the boxplots above, we can see that each of the pizzerias has a different nutritional profile. The greatest similarity among the pizzerias can be seen in the plot for the fat variable, while the greatest variation undoubtedly concerns the mois variable, i.e. the amount of water per 100 grams of sample.
cor_pizza <- cor(pizza[, -1])
ggcorr(data = NULL, cor_matrix = cor_pizza, label = TRUE, label_round = 2)
It turns out that most of the correlations between variables are high (|corr| > 0.7), with predominantly positive values of this coefficient.
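To back up this statement, a small helper (a sketch using base R) can list exactly which variable pairs exceed the |0.7| threshold.
# List the variable pairs with |correlation| > 0.7
strong <- which(abs(cor_pizza) > 0.7 & upper.tri(cor_pizza), arr.ind = TRUE)
data.frame(var1 = rownames(cor_pizza)[strong[, 1]],
           var2 = colnames(cor_pizza)[strong[, 2]],
           corr = round(cor_pizza[strong], 2))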
The aim of this chapter is to implement the PCA algorithm in order to reduce the number of dimensions. Currently my dataset contains 7 numeric variables, so I think a satisfactory number of components should be at most 4. Because the variables are not on the same scale, I will standardize them wherever possible.
Out of curiosity, before running PCA, let's first focus on finding the optimal number of clusters. To do that, I will use the function factoextra::fviz_nbclust() and apply it to the K-means and PAM algorithms. Of course, there are other clustering algorithms such as CLARA, but in my case the dataset is too small for it, so the results might be unclear.
# Scale the data
pizza_scaled <- pizza %>%
mutate_if(is.numeric, scale)
# Find the optimal number of clusters for K-means and PAM
opt_kmeans_sill <- fviz_nbclust(pizza_scaled[,-1], FUNcluster = kmeans, method = "silhouette") + theme_classic() +
labs(subtitle = "Silhouette method with K-means")
opt_pam_sill <- fviz_nbclust(pizza_scaled[,-1], FUNcluster = cluster::pam, method = "silhouette") + theme_classic() +
labs(subtitle = "Silhouette method with PAM")
grid.arrange(opt_kmeans_sill, opt_pam_sill, nrow=2)
As we can see, the suggested optimal number of clusters differs: for K-means it is 7 clusters, while for PAM it is 5 clusters. As a reminder, it is worth mentioning that the brand variable has 10 unique values.
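Out of curiosity (a sketch of mine, not part of the original workflow), we can run K-means with the suggested k = 7 and cross-tabulate the clusters against the 10 brands to see how well they line up.
# Cross-tabulate K-means clusters (k = 7, as suggested by the silhouette plot) with brands
set.seed(42)
km7 <- kmeans(scale(pizza[, -1]), centers = 7, nstart = 25)
table(cluster = km7$cluster, brand = pizza$brand)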
First, let's implement the basic PCA algorithm to see how the data behave.
basic_PCA <- prcomp(pizza[,-1], center = TRUE, scale. = TRUE)
summary(basic_PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0414 1.5146 0.64461 0.30793 0.16644 0.01841 0.003096
## Proportion of Variance 0.5954 0.3277 0.05936 0.01355 0.00396 0.00005 0.000000
## Cumulative Proportion 0.5954 0.9231 0.98245 0.99599 0.99995 1.00000 1.000000
fviz_screeplot(basic_PCA, addlabels = TRUE)
As a result, we get three different statistics describing each component. The most important one seems to be the proportion of variance, because it tells us how much of the total variance a particular component explains. Of course, we expect that as the component number increases, the value of this statistic will decrease, which is exactly what the output above shows. In my case, PC1 explains 59.54% of the total variance, while PC2 explains 32.77%. We can also see that the last three components have very little influence on the explained variance, so my intuition suggests that it might be worth trying to reduce the data further.
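For completeness, the "Proportion of Variance" row can be reproduced by hand (a small sketch): the eigenvalues are just the squared standard deviations returned by prcomp(), and each proportion is an eigenvalue divided by their sum.
# Reproduce the explained-variance proportions from the prcomp output
eigenvalues <- basic_PCA$sdev^2
round(rbind(eigenvalue       = eigenvalues,
            prop.of.variance = eigenvalues / sum(eigenvalues),
            cumulative       = cumsum(eigenvalues) / sum(eigenvalues)), 4)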
I am definitely not satisfied yet with the number of principal components obtained, so I will dive deeper into the data and try to drop a few more dimensions.
There are many tests that help with this problem; I will use the Kaiser criterion. This method is based on the eigenvalues: we are interested only in the components whose eigenvalue is greater than 1. With this in mind, let's check whether I am able to improve my "reduction quality".
pizza_eigen <- factoextra::get_eigenvalue(basic_PCA)
pizza_eigen
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.167517e+00 5.953595e+01 59.53595
## Dim.2 2.294095e+00 3.277279e+01 92.30874
## Dim.3 4.155157e-01 5.935939e+00 98.24468
## Dim.4 9.482252e-02 1.354607e+00 99.59929
## Dim.5 2.770128e-02 3.957325e-01 99.99502
## Dim.6 3.390803e-04 4.844004e-03 99.99986
## Dim.7 9.584866e-06 1.369267e-04 100.00000
Strictly speaking, according to the Kaiser criterion only the first two components are significant, as only their eigenvalues exceed 1. The third component, however, still explains almost 6% of the variance and brings the cumulative share above 98%, which is consistent with what the scree plot above showed. With this in mind, I will keep the first three components and omit the remaining ones in the following analysis.
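Keeping only the retained components in practice means working with their scores; a minimal sketch (the object name pizza_pc3 is my own choice):
# Extract the scores of the first three principal components for further use
pizza_pc3 <- as.data.frame(basic_PCA$x[, 1:3])
pizza_pc3$brand <- pizza$brand   # keep the brand label alongside the scores
head(pizza_pc3, 3)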
To check the effectiveness of the PCA algorithm, below I present the charts for the obtained (in my case three) components.
#library(factoextra)
fviz_pca_var(basic_PCA, col.var = "contrib", repel = TRUE, axes = c(1, 2)) +
  labs(title = "Variable loadings for PC1 and PC2", x = "PC1", y = "PC2")
fviz_pca_var(basic_PCA, col.var = "contrib", repel = TRUE, axes = c(2, 3)) +
  labs(title = "Variable loadings for PC2 and PC3", x = "PC2", y = "PC3")
We can also delve a little deeper and check which variables contribute most to the individual dimensions.
contrib_plot_1_2 <- fviz_contrib(basic_PCA, choice = "var", axes = 1:2)
contrib_plot_2_3 <- fviz_contrib(basic_PCA, choice = "var", axes = 2:3)
grid.arrange(contrib_plot_1_2, contrib_plot_2_3)
It is also possible to extend the above graphs by showing the observations in two dimensions on a biplot, coloured by brand.
#library(pca3d)
gr<-factor(pizza$brand)
pca2d(basic_PCA, components = 1:2, group = gr, biplot=TRUE, biplot.vars=3, legend = "topleft")
pca2d(basic_PCA, components = 2:3, group = gr, biplot=TRUE, biplot.vars=3, legend = "topleft")
t-SNE stands for t-Distributed Stochastic Neighbor Embedding. In a nutshell, the t-SNE algorithm computes a similarity measure between pairs of observations in the high-dimensional space and in the low-dimensional space, and then optimizes the low-dimensional embedding so that the two sets of similarities match as closely as possible, using the Kullback-Leibler divergence as the cost function. The algorithm is able to provide a 2D or 3D visual representation of the data. That is why it might be useful for better understanding high-dimensional data, or even as a preprocessing step for classification/regression purposes.
However, with my dataset as the example, it may turn out that I will not be able to show the full capabilities of this algorithm. Nevertheless, for the purpose of this exercise and for my own development, I decided to apply it to my dataset and see how it would work and whether it would indicate any differences from PCA.
#library(Rtsne)
set.seed(42) # to ensure reproducibility
colors = rainbow(length(unique(pizza_non_sc_no_dp$brand)))
names(colors) = unique(pizza_non_sc_no_dp$brand)
pizza_matrix <- as.matrix(pizza_non_sc_no_dp[, -1])
tsne <- Rtsne::Rtsne(pizza_matrix, check_duplicates = FALSE, pca = FALSE, perplexity=30, theta=0.5, dims=3)
plot(tsne$Y, t='n',
main = "Plot of T-SNE",
xlab = "T-SNE 1st dimension",
ylab = "T-SNE 2nd dimension")
text(tsne$Y, labels=pizza_non_sc_no_dp$brand, col=colors[pizza_non_sc_no_dp$brand])
#library(scatterplot3d)
scatterplot3d::scatterplot3d(x=tsne$Y[,1],y=tsne$Y[,2],z=tsne$Y[,3], color = colors[pizza_non_sc_no_dp$brand],
main = "3 dimmensional T-SNE",
xlab = "T-SNE 1st dimension",
ylab = "T-SNE 2nd dimension",
zlab = "T-SNE 3rd dimension")
legend("topright", legend = c("A","B","C","D","E","F","G","H","I","J"),
col = colors, xpd = TRUE, horiz = FALSE, ncol = 2, pch = 16)
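Since t-SNE embeddings on a dataset this small are quite sensitive to the perplexity parameter, it may also be worth comparing a few settings before drawing conclusions (a sketch; the values 5, 15 and 30 are arbitrary choices of mine).
# Compare 2-D t-SNE embeddings for a few perplexity values
par(mfrow = c(1, 3))
for (p in c(5, 15, 30)) {
  set.seed(42)
  emb <- Rtsne::Rtsne(pizza_matrix, dims = 2, perplexity = p,
                      check_duplicates = FALSE, pca = FALSE)
  plot(emb$Y, col = colors[pizza_non_sc_no_dp$brand], pch = 16,
       main = paste("perplexity =", p),
       xlab = "T-SNE 1st dimension", ylab = "T-SNE 2nd dimension")
}
par(mfrow = c(1, 1))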
The main goal of this project was to demonstrate whether pizza types baked by different pizzerias can somehow be described after reducing the dimensionality of the data. Even for such a non-obvious dataset, the PCA algorithm showed that 3 components were sufficient to capture most of the information about how the data is structured. In the case of t-SNE, it is hard to say that the results were satisfactory, because it was not entirely possible to group the data into appropriate clusters. This only confirms, as mentioned before, that for a small dataset this algorithm can create some confusion. Nevertheless, in reality most datasets are much larger and more complex. In such cases, t-SNE may perform significantly better than PCA.