The aim of this article is to use the PCA and t-SNE methods for dimension reduction of a pizza dataset. PCA is appropriate for data that consists mainly of continuous (quantitative) variables. The goal of PCA is to rotate the coordinate system so as to maximize first the variance of the first coordinate, then the variance of the second coordinate, and so on. The coordinate values obtained after this rotation are called the scores on the generated factors (principal components). t-SNE is a non-linear technique used primarily for data exploration and for visualizing high-dimensional data (I will expand on this later). Compared to PCA (formalized by Hotelling in 1933), t-SNE can be considered a fairly new algorithm (2008).
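To make the "rotation that maximizes variance" idea more concrete, below is a minimal base-R sketch (on simulated data, not on the pizza dataset) showing that the new axes are the eigenvectors of the covariance matrix and that the variances of the rotated coordinates equal the corresponding eigenvalues.
# Minimal illustration of PCA as a variance-maximizing rotation (simulated data)
set.seed(1)
x <- matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.8, 0, 0.6), 2, 2)  # two correlated variables
x_centred <- scale(x, center = TRUE, scale = FALSE)
eig <- eigen(cov(x_centred))           # eigenvectors define the rotated axes
scores <- x_centred %*% eig$vectors    # coordinates in the rotated system (the PC scores)
round(apply(scores, 2, var), 3)        # variance is concentrated in the first coordinate...
round(eig$values, 3)                   # ...and equals the eigenvalues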
# Data Handling
library(tidyverse)
# Visualisation
library(ggplot2)
library(GGally)
library(knitr)        # kable()
library(kableExtra)
# PCA
library(factoextra)
library(FactoMineR)
library(gridExtra)
library(pca3d)
# T-SNE
library(Rtsne)
library(scatterplot3d)
# Loading dataset
pizza_org <- read.csv('Pizza.csv')
kable(head(pizza_org)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| brand | id | mois | prot | fat | ash | sodium | carb | cal |
|---|---|---|---|---|---|---|---|---|
| A | 14069 | 27.82 | 21.43 | 44.87 | 5.11 | 1.77 | 0.77 | 4.93 |
| A | 14053 | 28.49 | 21.26 | 43.89 | 5.34 | 1.79 | 1.02 | 4.84 |
| A | 14025 | 28.35 | 19.99 | 45.78 | 5.08 | 1.63 | 0.80 | 4.95 |
| A | 14016 | 30.55 | 20.15 | 43.13 | 4.79 | 1.61 | 1.38 | 4.74 |
| A | 14005 | 30.49 | 21.28 | 41.65 | 4.82 | 1.64 | 1.76 | 4.67 |
| A | 14075 | 31.14 | 20.23 | 42.31 | 4.92 | 1.65 | 1.40 | 4.67 |
The dataset was downloaded from this link: https://data.world/sdhilip/pizza-datasets and it contains information on the nutrient composition of various pizzas.
The variables in the dataset are: brand (the pizza brand, labelled A-J), id (the sample number), and the composition columns mois (moisture, i.e. the amount of water), prot (protein), fat, ash, sodium and carb (carbohydrates), all given per 100 grams of sample, plus cal (the amount of calories).
Before implementing any Machine Learning algorithm, it is advisable to get to know the data in order to understand it more clearly. In my opinion this step is very helpful, because otherwise it is hard to anticipate what kind of data we will have to deal with later on.
dim(pizza_org)
## [1] 300 9
summary(pizza_org)
## brand id mois prot
## Length:300 Min. :14003 Min. :25.00 Min. : 6.98
## Class :character 1st Qu.:14094 1st Qu.:30.90 1st Qu.: 8.06
## Mode :character Median :24021 Median :43.30 Median :10.44
## Mean :20841 Mean :40.90 Mean :13.37
## 3rd Qu.:24110 3rd Qu.:49.12 3rd Qu.:20.02
## Max. :34045 Max. :57.22 Max. :28.48
## fat ash sodium carb
## Min. : 4.38 Min. :1.170 Min. :0.2500 Min. : 0.510
## 1st Qu.:14.77 1st Qu.:1.450 1st Qu.:0.4500 1st Qu.: 3.467
## Median :17.14 Median :2.225 Median :0.4900 Median :23.245
## Mean :20.23 Mean :2.633 Mean :0.6694 Mean :22.865
## 3rd Qu.:21.43 3rd Qu.:3.592 3rd Qu.:0.7025 3rd Qu.:41.337
## Max. :47.20 Max. :5.430 Max. :1.7900 Max. :48.640
## cal
## Min. :2.180
## 1st Qu.:2.910
## Median :3.215
## Mean :3.271
## 3rd Qu.:3.520
## Max. :5.080
table(pizza_org$brand)
##
## A B C D E F G H I J
## 29 31 27 32 28 30 29 33 29 32
Let's check whether there are any missing (NA) values in the dataset.
sapply(pizza_org, anyNA)
## brand id mois prot fat ash sodium carb cal
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
As we see, there are no missing values in our dataset.
First, let me rename the "id" variable to "sample". It is worth taking a closer look at this variable, because it seems to me that this column should probably be dropped. That is why I will check the uniqueness of its values.
colnames(pizza_org)[which(names(pizza_org) == "id")] <- c("sample")
length(unique(pizza_org$sample))
## [1] 291
It turned out that the number of unique values is not equal to the number of rows, so I have to check whether there are some duplicates.
sample_table <- table(pizza_org$sample) == 2
sample_table[sample_table == TRUE]; sum(sample_table[sample_table == TRUE])
##
## 14029 14044 14045 14126 24035 24043 24049 24089 24110
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1] 9
We can see that there are 9 sample values which might potentially be duplicated - let's check these records in the dataset.
sample_table_frame <- as.data.frame(sample_table[sample_table == TRUE])
# Move the row names (sample numbers) into the first column
sample_table_frame <- cbind(newColName = rownames(sample_table_frame), sample_table_frame)
rownames(sample_table_frame) <- 1:nrow(sample_table_frame)
# Get values of potential duplicates
kable(pizza_org[pizza_org$sample %in% sample_table_frame$newColName,]) %>%
kable_styling()
| | brand | sample | mois | prot | fat | ash | sodium | carb | cal |
|---|---|---|---|---|---|---|---|---|---|
| 53 | B | 24049 | 50.87 | 13.85 | 27.64 | 3.71 | 1.10 | 3.93 | 3.20 |
| 66 | C | 14029 | 49.73 | 25.65 | 19.98 | 2.51 | 0.52 | 2.13 | 2.91 |
| 78 | C | 24043 | 52.19 | 26.00 | 16.64 | 4.17 | 0.61 | 1.00 | 2.58 |
| 119 | D | 24043 | 52.19 | 26.00 | 16.64 | 4.17 | 0.61 | 1.00 | 2.58 |
| 125 | E | 14029 | 39.59 | 8.36 | 4.39 | 1.52 | 0.48 | 46.14 | 2.58 |
| 130 | E | 14126 | 37.78 | 8.30 | 13.05 | 1.64 | 0.49 | 39.23 | 3.08 |
| 134 | E | 24110 | 37.34 | 7.33 | 19.61 | 1.60 | 0.45 | 34.12 | 3.42 |
| 137 | E | 24110 | 37.34 | 7.33 | 19.61 | 1.60 | 0.45 | 34.12 | 3.42 |
| 138 | E | 24035 | 34.31 | 7.98 | 14.54 | 1.46 | 0.49 | 41.71 | 3.30 |
| 146 | E | 14126 | 37.78 | 8.30 | 13.05 | 1.64 | 0.49 | 39.23 | 3.08 |
| 162 | F | 24035 | 27.65 | 7.78 | 17.30 | 1.29 | 0.40 | 46.25 | 3.72 |
| 163 | F | 24049 | 28.33 | 7.82 | 17.96 | 1.41 | 0.45 | 44.48 | 3.71 |
| 227 | H | 24089 | 35.74 | 7.86 | 14.27 | 1.40 | 0.44 | 40.73 | 3.23 |
| 239 | H | 24089 | 35.74 | 7.86 | 14.27 | 1.40 | 0.44 | 40.73 | 3.23 |
| 271 | J | 14044 | 47.60 | 10.43 | 15.18 | 2.32 | 0.56 | 24.47 | 2.76 |
| 272 | J | 14045 | 46.84 | 9.91 | 15.50 | 2.27 | 0.57 | 25.48 | 2.81 |
| 299 | J | 14044 | 47.60 | 10.43 | 15.18 | 2.32 | 0.56 | 24.47 | 2.76 |
| 300 | J | 14045 | 46.84 | 9.91 | 15.50 | 2.27 | 0.57 | 25.48 | 2.81 |
Well, the result is quite tricky. Some records are identical in every column (for example sample 24110); these can be treated as duplicates. For other sample values, however, all the "nutrient" variables are the same, yet the records have been assigned to different brands (for example sample 24043 appears under brands C and D).
I am aware that deleting records is not always the best way to prepare data for further analysis. However, I am not entirely sure how the data was collected, and I am afraid that duplicated rows may distort my subsequent results. Nevertheless, for the purposes of this exercise, I decided to remove the exact duplicates. The records that differ only in the brand column I decided to leave unchanged - I do not think they are incorrect; perhaps some pizzerias simply bake exactly the same type of pizza.
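As a quick side check (a sketch of my own, using the column names shown above), the two situations can also be counted programmatically: rows that duplicate an earlier row in every column versus rows that repeat all the nutrient values regardless of brand and sample number.
# Count exact duplicates vs. rows that repeat all nutrient values (any brand)
nutrient_cols <- c("mois", "prot", "fat", "ash", "sodium", "carb", "cal")
c(exact_duplicates   = sum(duplicated(pizza_org)),
  repeated_nutrients = sum(duplicated(pizza_org[, nutrient_cols])))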
# Make a copy of the original dataset so that the original object stays untouched
pizza_non_scaled <- data.frame(pizza_org)
pizza_non_sc_no_dp <- pizza_non_scaled[!duplicated(pizza_non_scaled), ]
print(paste("The number of removed rows is equal to:", nrow(pizza_org) - nrow(pizza_non_sc_no_dp)))
## [1] "The number of removed rows is equal to: 5"
Moreover, I will also get rid of the sample variable, because it seems to me that it does not contribute any additional relevant information about the pizzas.
pizza <- subset(pizza_non_sc_no_dp, select = -c(sample))
kable(head(pizza)) %>%
kable_styling()
| brand | mois | prot | fat | ash | sodium | carb | cal |
|---|---|---|---|---|---|---|---|
| A | 27.82 | 21.43 | 44.87 | 5.11 | 1.77 | 0.77 | 4.93 |
| A | 28.49 | 21.26 | 43.89 | 5.34 | 1.79 | 1.02 | 4.84 |
| A | 28.35 | 19.99 | 45.78 | 5.08 | 1.63 | 0.80 | 4.95 |
| A | 30.55 | 20.15 | 43.13 | 4.79 | 1.61 | 1.38 | 4.74 |
| A | 30.49 | 21.28 | 41.65 | 4.82 | 1.64 | 1.76 | 4.67 |
| A | 31.14 | 20.23 | 42.31 | 4.92 | 1.65 | 1.40 | 4.67 |
# Looks good now
Now, let's dig into the data and look for some interesting relationships and additional information. This will help me check whether particular companies make their pizzas with similar nutritional ratios or whether there are significant differences from brand to brand.
pizza %>%
  gather("nutrient", "value", 2:8) %>%
  ggplot(aes(x = brand, y = value, fill = brand)) +
  geom_boxplot() +
  ggtitle("Boxplots of nutrients by brand") +
  xlab("Brands") +
  ylab("Values") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) +
  facet_wrap(~ nutrient, scales = "free")
Looking at the boxplots above, we can see that each of the pizzerias has a different nutritional profile. The greatest similarity among the pizzerias can be seen in the plot for the fat variable, while the greatest variation undoubtedly concerns the mois variable, i.e. the amount of water per 100 grams of sample.
cor_pizza <- cor(pizza[, -1])
ggcorr(data = NULL, cor_matrix = cor_pizza, label = TRUE, label_round = 2)
It turns out that most of the correlations between variables are high (|corr| > 0.7), with predominantly positive values of this coefficient.
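To back up this statement, a small helper (a sketch using base R) can list exactly which variable pairs exceed the |0.7| threshold.
# List the variable pairs with |correlation| > 0.7
strong <- which(abs(cor_pizza) > 0.7 & upper.tri(cor_pizza), arr.ind = TRUE)
data.frame(var1 = rownames(cor_pizza)[strong[, 1]],
           var2 = colnames(cor_pizza)[strong[, 2]],
           corr = round(cor_pizza[strong], 2))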
The aim of this chapter is to implement the PCA algorithm in order to reduce the number of dimensions. Currently my dataset contains 7 numeric variables, so I think a satisfactory number of components should be at most 4. Because the variables are not on the same scale, I will standardize them wherever possible.
Out of curiosity, before running PCA, let's first focus on finding the optimal number of clusters. To do that, I will use the function factoextra::fviz_nbclust() and apply it to the K-means and PAM algorithms. Of course, there are other clustering algorithms such as CLARA, but in my case the dataset is too small for it, so the results might be unclear.
# Scale the data
pizza_scaled <- pizza %>%
mutate_if(is.numeric, scale)
# Find the optimal number of clusters for K-means and PAM
opt_kmeans_sill <- fviz_nbclust(pizza_scaled[,-1], FUNcluster = kmeans, method = "silhouette") + theme_classic() +
labs(subtitle = "Silhouette method with K-means")
opt_pam_sill <- fviz_nbclust(pizza_scaled[,-1], FUNcluster = cluster::pam, method = "silhouette") + theme_classic() +
labs(subtitle = "Silhouette method with PAM")
grid.arrange(opt_kmeans_sill, opt_pam_sill, nrow=2)
As we can see, the suggested optimal number of clusters differs: for K-means it is 7 clusters, while for PAM it is 5 clusters. As a reminder, it is worth mentioning that the brand variable has 10 unique values.
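Out of curiosity (a sketch of mine, not part of the original workflow), we can run K-means with the suggested k = 7 and cross-tabulate the clusters against the 10 brands to see how well they line up.
# Cross-tabulate K-means clusters (k = 7, as suggested by the silhouette plot) with brands
set.seed(42)
km7 <- kmeans(scale(pizza[, -1]), centers = 7, nstart = 25)
table(cluster = km7$cluster, brand = pizza$brand)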
First, let's implement the basic PCA algorithm to see how the data behave.
basic_PCA <- prcomp(pizza[,-1], center = TRUE, scale. = TRUE)
summary(basic_PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0414 1.5146 0.64461 0.30793 0.16644 0.01841 0.003096
## Proportion of Variance 0.5954 0.3277 0.05936 0.01355 0.00396 0.00005 0.000000
## Cumulative Proportion 0.5954 0.9231 0.98245 0.99599 0.99995 1.00000 1.000000
fviz_screeplot(basic_PCA, addlabels = TRUE)
As a result, we get three different statistics describing each component. The most important one seems to be the proportion of variance, because it tells us how much of the total variance a particular component explains. Of course, we expect that as the component number increases, the value of this statistic will decrease, which is exactly what the output above shows. In my case, PC1 explains 59.54% of the total variance, while PC2 explains 32.77%. We can also see that the last three components have very little influence on the explained variance, so my intuition suggests that it might be worth trying to reduce the data further.
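For completeness, the "Proportion of Variance" row can be reproduced by hand (a small sketch): the eigenvalues are just the squared standard deviations returned by prcomp(), and each proportion is an eigenvalue divided by their sum.
# Reproduce the explained-variance proportions from the prcomp output
eigenvalues <- basic_PCA$sdev^2
round(rbind(eigenvalue       = eigenvalues,
            prop.of.variance = eigenvalues / sum(eigenvalues),
            cumulative       = cumsum(eigenvalues) / sum(eigenvalues)), 4)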
I am definitely not satisfied yet with the number of principal components obtained, so I will dive deeper into the data and try to drop a few more dimensions.
There are many tests that help with this problem; I will use the Kaiser criterion. This method is based on the eigenvalues: we are interested only in the components whose eigenvalue is greater than 1. With this in mind, let's check whether I am able to improve my "reduction quality".
pizza_eigen <- factoextra::get_eigenvalue(basic_PCA)
pizza_eigen
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.167517e+00 5.953595e+01 59.53595
## Dim.2 2.294095e+00 3.277279e+01 92.30874
## Dim.3 4.155157e-01 5.935939e+00 98.24468
## Dim.4 9.482252e-02 1.354607e+00 99.59929
## Dim.5 2.770128e-02 3.957325e-01 99.99502
## Dim.6 3.390803e-04 4.844004e-03 99.99986
## Dim.7 9.584866e-06 1.369267e-04 100.00000
Strictly speaking, according to the Kaiser criterion only the first two components are significant, as only their eigenvalues exceed 1. The third component, however, still explains almost 6% of the variance and brings the cumulative share above 98%, which is consistent with what the scree plot above showed. With this in mind, I will keep the first three components and omit the remaining ones in the following analysis.
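Keeping only the retained components in practice means working with their scores; a minimal sketch (the object name pizza_pc3 is my own choice):
# Extract the scores of the first three principal components for further use
pizza_pc3 <- as.data.frame(basic_PCA$x[, 1:3])
pizza_pc3$brand <- pizza$brand   # keep the brand label alongside the scores
head(pizza_pc3, 3)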
To check the effectiveness of the PCA algorithm, below I present the charts for the obtained (in my case three) components.
#library(factoextra)
fviz_pca_var(basic_PCA, col.var = "contrib", repel = TRUE, axes = c(1, 2)) +
  labs(title = "Variable loadings for PC1 and PC2", x = "PC1", y = "PC2")
fviz_pca_var(basic_PCA, col.var = "contrib", repel = TRUE, axes = c(2, 3)) +
  labs(title = "Variable loadings for PC2 and PC3", x = "PC2", y = "PC3")
We can also delve a little deeper and check which variables contribute most to the individual dimensions.
contrib_plot_1_2 <- fviz_contrib(basic_PCA, choice = "var", axes = 1:2)
contrib_plot_2_3 <- fviz_contrib(basic_PCA, choice = "var", axes = 2:3)
grid.arrange(contrib_plot_1_2, contrib_plot_2_3)
It is also possible to extend the above graphs by showing the observations in two dimensions on a biplot, coloured by brand.
#library(pca3d)
gr<-factor(pizza$brand)
pca2d(basic_PCA, components = 1:2, group = gr, biplot=TRUE, biplot.vars=3, legend = "topleft")
pca2d(basic_PCA, components = 2:3, group = gr, biplot=TRUE, biplot.vars=3, legend = "topleft")
t-SNE stands for t-Distributed Stochastic Neighbor Embedding. In a nutshell, the t-SNE algorithm computes a similarity measure between pairs of observations in the high-dimensional space and in the low-dimensional space, and then optimizes the low-dimensional embedding so that the two sets of similarities match as closely as possible, using the Kullback-Leibler divergence as the cost function. The algorithm is able to provide a 2D or 3D visual representation of the data. That is why it might be useful for better understanding high-dimensional data, or even as a preprocessing step for classification/regression purposes.
However, with my dataset as the example, it may turn out that I will not be able to show the full capabilities of this algorithm. Nevertheless, for the purpose of this exercise and for my own development, I decided to apply it to my dataset and see how it would work and whether it would indicate any differences from PCA.
#library(Rtsne)
set.seed(42) # to ensure reproducibility
colors = rainbow(length(unique(pizza_non_sc_no_dp$brand)))
names(colors) = unique(pizza_non_sc_no_dp$brand)
pizza_matrix <- as.matrix(pizza_non_sc_no_dp[, -1])
tsne <- Rtsne::Rtsne(pizza_matrix, check_duplicates = FALSE, pca = FALSE, perplexity=30, theta=0.5, dims=3)
plot(tsne$Y, t='n',
main = "Plot of T-SNE",
xlab = "T-SNE 1st dimension",
ylab = "T-SNE 2nd dimension")
text(tsne$Y, labels=pizza_non_sc_no_dp$brand, col=colors[pizza_non_sc_no_dp$brand])
#library(scatterplot3d)
scatterplot3d::scatterplot3d(x=tsne$Y[,1],y=tsne$Y[,2],z=tsne$Y[,3], color = colors[pizza_non_sc_no_dp$brand],
main = "3 dimmensional T-SNE",
xlab = "T-SNE 1st dimension",
ylab = "T-SNE 2nd dimension",
zlab = "T-SNE 3rd dimension")
legend("topright", legend = c("A","B","C","D","E","F","G","H","I","J"),
col = colors, xpd = TRUE, horiz = FALSE, ncol = 2, pch = 16)
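Since t-SNE embeddings on a dataset this small are quite sensitive to the perplexity parameter, it may also be worth comparing a few settings before drawing conclusions (a sketch; the values 5, 15 and 30 are arbitrary choices of mine).
# Compare 2-D t-SNE embeddings for a few perplexity values
par(mfrow = c(1, 3))
for (p in c(5, 15, 30)) {
  set.seed(42)
  emb <- Rtsne::Rtsne(pizza_matrix, dims = 2, perplexity = p,
                      check_duplicates = FALSE, pca = FALSE)
  plot(emb$Y, col = colors[pizza_non_sc_no_dp$brand], pch = 16,
       main = paste("perplexity =", p),
       xlab = "T-SNE 1st dimension", ylab = "T-SNE 2nd dimension")
}
par(mfrow = c(1, 1))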
The main goal of this project was to demonstrate whether pizza types baked by different pizzerias can somehow be described after reducing the dimensionality of the data. Even for such a non-obvious dataset, the PCA algorithm showed that 3 components were sufficient to capture most of the information about how the data is structured. In the case of t-SNE, it is hard to say that the results were satisfactory, because it was not entirely possible to group the data into appropriate clusters. This only confirms, as mentioned before, that for a small dataset this algorithm can create some confusion. Nevertheless, in reality most datasets are much larger and more complex. In such cases, t-SNE may perform significantly better than PCA.