Introduction

The paper showcases different clustering algorithms and useful statistical tests used to prepare and analyse data for clustering. Three algorithms are used: k-means, Partitioning Around Medoids (PAM) and fuzzy k-means. The analyses are based on the “nutrition” data set, which can be found here: https://www.kaggle.com/trolukovich/nutritional-values-for-common-foods-and-products?select=nutrition.csv. This data set contains nutrition data for almost 9 thousand different food items.

First of all, necessary libraries are loaded and the database is imported.

library(tidyverse)
library(dplyr)
library(lubridate)
library(ggplot2)
library(datasets)
library(readxl)
library(xlsx)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(grid)
library(gridExtra)
library(lattice)
nutrition <- read.csv2("nutrition.csv", header=TRUE, sep = ",", stringsAsFactors = TRUE)

Next, rows with NA values are omitted, irrelevant columns are deleted, and the names of the food items are stored in a separate variable.

nutrition <- na.omit(nutrition)
namesfood <- nutrition[1:175,2]

nutrition <- nutrition[,-1]
nutrition <- nutrition[,-1]
nutrition <- nutrition[,-1]
nutrition <- nutrition[,-3]
dim(nutrition)
## [1] 8789   73

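Furthermore, all columns are converted to numeric. Because the file was read with stringsAsFactors = TRUE, text columns are stored as factors, and as.numeric() applied to a factor returns its integer level codes rather than the values printed in the file.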

for(i in 1:ncol(nutrition)) {
  nutrition[,i] <- as.numeric(nutrition[,i])
}
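
The conversion step above depends on how each column was read. A minimal sketch with made-up values (the vector `fat` and its "g" suffix are assumptions for illustration, not taken from nutrition.csv) shows the difference between converting a factor directly and converting its labels:

```r
# Hypothetical column that was read as a factor, e.g. amounts with a unit suffix
fat <- factor(c("72g", "0.1g", "15g", "0.1g"))

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(fat)

# Removing the unit from the labels and converting the text gives the amounts
amounts <- as.numeric(sub("g$", "", as.character(fat)))

print(codes)
print(amounts)
```

If the measured amounts are needed rather than the codes, the second form could be applied column by column before scaling.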

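Also, in order to obtain more interpretable results, the whole data set is standardised: each column is centered and scaled to unit standard deviation. An unscaled copy is kept for later interpretation.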

nutrition2<-nutrition 
nutrition <- scale(nutrition)
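
As a small illustration of what scale() does with its default arguments, the sketch below compares it with a manual computation on a made-up matrix (`x` is not part of the nutrition data):

```r
# Toy matrix with two columns on very different scales
x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)

# Default scale(): subtract the column mean, divide by the column standard deviation
z <- scale(x)

# The same computation written out by hand
centered <- sweep(x, 2, colMeans(x))
manual <- sweep(centered, 2, apply(x, 2, sd), "/")

print(round(z, 3))
print(all.equal(as.vector(z), as.vector(manual)))
```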

After the initial data processing we obtain a standardised matrix containing data for 8789 food products, each described by 73 nutritional values. Finally, to keep the data set small, only the first 175 food items and the first 15 nutritional characteristics are selected.

nutritiontrim <- as.matrix(nutrition[1:175, 1:15])
nutrition2<-as.matrix(nutrition2[1:175, 1:15])
names_nutr<- as.matrix(colnames(nutritiontrim))
summary(nutritiontrim)
##     calories          total_fat        cholesterol           sodium       
##  Min.   :-1.26152   Min.   :-1.0124   Min.   :-0.89088   Min.   :-1.9078  
##  1st Qu.:-0.96127   1st Qu.:-0.9158   1st Qu.:-0.89088   1st Qu.:-1.0133  
##  Median :-0.18417   Median :-0.4971   Median :-0.89088   Median :-0.2183  
##  Mean   :-0.01106   Mean   :-0.1256   Mean   :-0.36797   Mean   :-0.2200  
##  3rd Qu.: 0.62236   3rd Qu.: 0.3295   3rd Qu.:-0.06053   3rd Qu.: 0.6415  
##  Max.   : 3.82496   Max.   : 2.7234   Max.   : 1.63663   Max.   : 1.8090  
##     choline             folate           folic_acid          niacin       
##  Min.   :-0.78718   Min.   :-1.17831   Min.   :-0.3548   Min.   :-1.2115  
##  1st Qu.:-0.78718   1st Qu.:-1.09848   1st Qu.:-0.3353   1st Qu.:-1.0396  
##  Median :-0.48857   Median :-0.22032   Median :-0.3353   Median :-0.5094  
##  Mean   : 0.01346   Mean   :-0.03467   Mean   :-0.1089   Mean   :-0.2006  
##  3rd Qu.: 0.74183   3rd Qu.: 0.98858   3rd Qu.:-0.3353   3rd Qu.: 0.5672  
##  Max.   : 2.22285   Max.   : 1.58923   Max.   : 4.6595   Max.   : 1.9358  
##  pantothenic_acid     riboflavin           thiamin           vitamin_a       
##  Min.   :-0.96531   Min.   :-1.061069   Min.   :-0.80429   Min.   :-0.97765  
##  1st Qu.:-0.94700   1st Qu.:-0.854052   1st Qu.:-0.67142   1st Qu.:-0.97535  
##  Median :-0.36370   Median :-0.289718   Median :-0.36460   Median :-0.03169  
##  Mean   : 0.07939   Mean   : 0.003946   Mean   : 0.04274   Mean   : 0.07498  
##  3rd Qu.: 0.85000   3rd Qu.: 0.484469   3rd Qu.: 0.45923   3rd Qu.: 1.03051  
##  Max.   : 2.80656   Max.   : 3.379871   Max.   : 3.52501   Max.   : 2.06049  
##  vitamin_a_rae      carotene_alpha    carotene_beta    
##  Min.   :-0.75618   Min.   :-0.2591   Min.   :-0.4945  
##  1st Qu.:-0.75032   1st Qu.:-0.2591   1st Qu.:-0.4945  
##  Median :-0.74446   Median :-0.2267   Median :-0.4879  
##  Mean   :-0.07657   Mean   : 0.1852   Mean   : 0.2510  
##  3rd Qu.: 0.52687   3rd Qu.:-0.2267   3rd Qu.: 0.8318  
##  Max.   : 2.49539   Max.   : 5.9855   Max.   : 3.2059

In order to identify potential outliers, the interquartile range (IQR) rule was used. The rule flags as a potential outlier any observation that lies more than 1.5 times the IQR below the first quartile or above the third quartile, i.e. far away from the middle 50% of observations.

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in each column (1 = potential outlier)
matr <- matrix(nrow=175, ncol=15)
for(i in 1:ncol(nutritiontrim)){
  max_interval <- quantile(nutritiontrim[,i],0.75,na.rm = TRUE)+(IQR(nutritiontrim[,i], na.rm = TRUE)*1.5)
  min_interval <- quantile(nutritiontrim[,i],0.25,na.rm = TRUE)-(IQR(nutritiontrim[,i], na.rm = TRUE)*1.5)
  matr[,i] <- ifelse(nutritiontrim[,i]>max_interval|nutritiontrim[,i]<min_interval, 1, 0)
}
outliers <- matrix(nrow=15, ncol=2)

for(i in 1:ncol(matr)){
  outliers[i,1]<-names_nutr[i,1]
  outliers[i,2]<-sum(matr[,i][matr[,i]==1])
}
outliers2<-as.matrix(outliers[, -1])
rownames(outliers2)<-outliers[,1]
colnames(outliers2)<-c("No_of_outliers")
library(knitr)
kable(outliers2)
                    No_of_outliers
calories            2
total_fat           12
cholesterol         24
sodium              0
choline             0
folate              0
folic_acid          51
niacin              0
pantothenic_acid    0
riboflavin          9
thiamin             8
vitamin_a           0
vitamin_a_rae       2
carotene_alpha      22
carotene_beta       8

Applying the IQR rule and aggregating the results into one table allowed us to identify the variables with potential outliers. The variables folic_acid, cholesterol and carotene_alpha are the top three candidates.
Even though the IQR rule suggests that there are outliers in the analysed data set, I decided not to exclude any food items: in my opinion the visual examination of the variables did not confirm the results obtained from the IQR rule.
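
The same rule can be written more compactly. The following is only a sketch on made-up data (`x` and the helper `count_outliers` are hypothetical names), not the code used for the table above:

```r
# Toy data: column a contains one extreme value, column b contains none
x <- cbind(a = c(1, 2, 3, 4, 100), b = c(5, 6, 7, 8, 9))

# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in a numeric vector
count_outliers <- function(v) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(v < q[1] - fence | v > q[2] + fence, na.rm = TRUE)
}

# Apply the helper to every column
n_out <- apply(x, 2, count_outliers)
print(n_out)
```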

In the next step the correlation matrix was constructed in order to examine the correlation between analysed variables.

library(corrplot)
## corrplot 0.92 loaded
corrplot.mixed(cor(nutritiontrim), bg="white", upper="pie",lower="number", order="hclust", tl.col="black", tl.pos="lt", diag="l", number.font=0.5, tl.cex=1, number.cex=0.55)


It is not surprising that some variables are strongly correlated whilst others are not: in food items certain nutrients usually occur together, while others do not pair naturally.

Clustering

Clustering might reveal valuable information about the phenomenon that the analysed data represent, but only if the data set is distinguishable from a totally random distribution. The most popular tool for checking whether data have clustering potential or are rather uniformly distributed is the Hopkins statistic. It is the basis of a statistical test whose null hypothesis states that the data are uniformly distributed. R libraries differ in how they report the statistic: depending on the implementation, either values close to 0 or values close to 1 indicate clusterable data, with values around 0.5 corresponding to random data.

hopkins(nutritiontrim, n=nrow(nutritiontrim)-1, header=FALSE) 
## $H
## [1] 0.3186588

The obtained value of the statistic is 0.32. In the implementation used here, values below 0.5 point towards clustering tendency, so the data set appears clusterable rather than uniformly distributed. To check that the result is robust, an alternative implementation from a different library was used.

get_clust_tendency(nutritiontrim, 2, graph=FALSE, gradient=list(low="purple", high="green"), seed = 48)
## $hopkins_stat
## [1] 0.8003733
## 
## $plot
## NULL

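Here the null hypothesis also states that the data are uniformly distributed, but the statistic is reported on the opposite scale: values well above 0.5 indicate clustering tendency. With the statistic equal to 0.8 we should reject the null hypothesis, so the data set is clusterable.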
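
To make the two reporting scales concrete, the sketch below computes a simplified Hopkins-type statistic by hand on made-up data. The helper `hopkins_sketch` is hypothetical and is not the code used by either library; in particular it uses plain nearest-neighbour distances, whereas some definitions raise them to the power of the number of dimensions.

```r
set.seed(1)
# Toy data: two well-separated groups in two dimensions
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))

hopkins_sketch <- function(X, m) {
  idx <- sample(nrow(X), m)
  # m artificial points drawn uniformly within the range of each variable
  U <- sapply(seq_len(ncol(X)), function(j) runif(m, min(X[, j]), max(X[, j])))
  # Euclidean distance from point p to the nearest row of matrix M
  nn <- function(p, M) min(sqrt(rowSums(sweep(M, 2, p)^2)))
  u <- sapply(seq_len(m), function(i) nn(U[i, ], X))               # uniform point -> data
  w <- sapply(idx, function(i) nn(X[i, ], X[-i, , drop = FALSE]))  # data point -> other data
  sum(u) / (sum(u) + sum(w))
}

H <- hopkins_sketch(X, m = 20)
# One convention reports H (near 1 = clustered), the other 1 - H (near 0 = clustered)
print(c(H = H, one_minus_H = 1 - H))
```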

K-means

The first algorithm used for clustering is k-means. Before applying the algorithm we should analyse the number of clusters and choose the most appropriate number for our data. We will do that with the help of three different statistics. The first statistic is the silhouette. The silhouette value tells us how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, and the higher the value the better.
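
For a single object i the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the mean distance to the members of the nearest other cluster. A minimal sketch on made-up data, using silhouette() from the cluster package:

```r
library(cluster)

set.seed(1)
# Toy data: two well-separated groups (not the nutrition data)
X <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2))

km <- kmeans(X, centers = 2, nstart = 10)

# One silhouette value per observation, based on Euclidean distances
sil <- silhouette(km$cluster, dist(X))
avg_width <- mean(sil[, "sil_width"])
print(avg_width)
```

The average of these values over all observations is the quantity plotted against the number of clusters.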

#silhouette
silkmeans <- fviz_nbclust(nutritiontrim, kmeans, linecolor="navy blue", method = "s") 
silkmeans


From the graph one can conclude that the optimal number of clusters is 3, as the average silhouette width is highest for this number of clusters. The second statistic used is the total within-cluster sum of squares, i.e. the sum of squared distances from each object to the center of its cluster.

#total within sum of squares 
wsskmeans <- fviz_nbclust(nutritiontrim, kmeans, linecolor="navy blue", method = "wss") 
wsskmeans


In general, the lower the value the better, as it means that objects within clusters are close to each other and similar. On the other hand, the statistic tends to drop whenever the number of clusters increases, because the clusters get smaller. One therefore looks for an "elbow": a number of clusters up to which the statistic drops sharply and beyond which further increases bring no substantial drop. In my opinion there are two such candidates: 3 and 5.
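
The elbow reasoning can be reproduced directly from kmeans() output. The sketch below uses made-up data with three groups; the object names are illustrative only:

```r
set.seed(1)
# Toy data: three well-separated groups in two dimensions
X <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)

# Reduction gained by each additional cluster; the elbow is where it becomes small
drops <- -diff(wss)
print(round(wss, 1))
print(round(drops, 1))
```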

The third and final statistic used is the gap statistic. It is an interesting and less straightforward method of estimating the number of clusters than the two previous ones. The idea is to compare the within-cluster sum of squares with its expected value under a reference distribution without cluster structure (for more information about the gap statistic see: https://hastie.su.domains/Papers/gap.pdf).

#gap statistics
gapkmeans <- fviz_nbclust(nutritiontrim, kmeans, linecolor="navy blue", barcolor="navy blue", method = "gap_stat") 
gapkmeans


The gap statistic suggests that 7 clusters should be chosen: one looks for the first number of clusters after which the gap statistic stops increasing and drops.
The three statistics produced somewhat heterogeneous results; however, both the silhouette and the total within sum of squares pointed to 3 as a suitable number of clusters.
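
The gap statistic can also be computed without the plotting wrapper. A minimal sketch on made-up data, using clusGap() and maxSE() from the cluster package; the choice of B = 50 reference samples and of the "firstSEmax" rule is an assumption made for this illustration:

```r
library(cluster)

set.seed(1)
# Toy data: three well-separated groups in two dimensions
X <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2))

# Gap statistic for k = 1, ..., 6, based on B simulated reference data sets
gap <- clusGap(X, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 20)
tab <- gap$Tab

# Smallest k whose gap is within one standard error of the first local maximum
k_hat <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
print(round(tab, 3))
print(k_hat)
```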

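The k-means algorithm was applied twice, with the number of clusters set to 3 and 5. The calls below pass hc_metric="manhattan"; according to the factoextra documentation this argument is used only by hierarchical methods, so the k-means fits themselves rely on the usual Euclidean sum-of-squares criterion.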

#K-means for 3 clusters
k1<-eclust(nutritiontrim, "kmeans", hc_metric="manhattan",k=3, graph=F)
k12<-fviz_cluster(k1, main="K-means for Manhattan distance metric", xlab=FALSE, ylab=FALSE, ggtheme=theme_classic(), palette="Set1", graph=F)

#K-means for 5 clusters
k2<-eclust(nutritiontrim, "kmeans", hc_metric="manhattan",k=5, graph=F)
k22<-fviz_cluster(k2, main="K-means for Manhattan distance metric", xlab=FALSE, ylab=FALSE, ggtheme=theme_classic(), palette="Set1", graph=F)

#Silhouette info for both iterations of k-means
sil1<-fviz_silhouette(k1, ggtheme=theme_classic(), palette="Set1", main="Average silhouette width", graph=F)
##   cluster size ave.sil.width
## 1       1   68          0.27
## 2       2   67          0.06
## 3       3   40          0.05
sil2<-fviz_silhouette(k2, ggtheme=theme_classic(), palette="Set1", main="Average silhouette width", graph=F)
##   cluster size ave.sil.width
## 1       1   21          0.03
## 2       2   48          0.07
## 3       3   13          0.27
## 4       4   63          0.24
## 5       5   30          0.13
#create smaller plots
grid.arrange(arrangeGrob(k12, k22,sil1, sil2, nrow=2))

Visual analysis of the results suggests that 3 clusters is the more appropriate number; however, the average silhouette width is equal to 0.14 in both cases. Another quality measure calculated for the k-means results is the Calinski-Harabasz index, which may help to compare the solutions with 3 and 5 clusters. It is the ratio of the between-cluster variance to the within-cluster variance, so higher values indicate better separated clusters.

calinhara(nutritiontrim, k1$cluster)
## [1] 29.18115
calinhara(nutritiontrim, k2$cluster)
## [1] 25.24107

The Calinski-Harabasz index is higher for 3 clusters than for 5 (29.2 versus 25.2); therefore, based on this statistic, dividing the data into three groups is preferable.
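
For n observations and k clusters the index can be written as CH = (B / (k - 1)) / (W / (n - k)), where B is the between-cluster and W the within-cluster sum of squares. The sketch below computes it by hand on made-up data; `ch_index` is a hypothetical helper written for this illustration, not the calinhara() implementation:

```r
# Calinski-Harabasz index from a data matrix and a vector of cluster labels
ch_index <- function(X, cl) {
  X <- as.matrix(X)
  n <- nrow(X)
  k <- length(unique(cl))
  total <- sum(sweep(X, 2, colMeans(X))^2)
  within <- sum(sapply(unique(cl), function(g) {
    Xg <- X[cl == g, , drop = FALSE]
    sum(sweep(Xg, 2, colMeans(Xg))^2)
  }))
  between <- total - within
  (between / (k - 1)) / (within / (n - k))
}

set.seed(1)
# Toy data: two well-separated groups
X <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2))
km <- kmeans(X, centers = 2, nstart = 10)

ch <- ch_index(X, km$cluster)
print(ch)
```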

PAM

The second algorithm used for clustering is Partitioning Around Medoids (PAM). As before, the number of clusters has to be determined first, and the same three statistics are used: silhouette, total within sum of squares and the gap statistic.

#silhouette
silpam <- fviz_nbclust(nutritiontrim, cluster::pam, linecolor="navy blue", method = "s") 
silpam


The silhouette statistic suggests that dividing the data into 5 clusters is optimal.

#total within sum of squares 
wsspam <- fviz_nbclust(nutritiontrim, cluster::pam, linecolor="navy blue", method = "wss") 
wsspam


It is hard to visually assess the optimal number of clusters from the graph of the total within sum of squares, therefore I have decided not to pick any number in this case.

#gap statistics
gappam <- fviz_nbclust(nutritiontrim, cluster::pam, linecolor="navy blue", barcolor="navy blue", method = "gap_stat") 
gappam


The gap statistic clearly indicates that the number of clusters should be set to 2. Taken together, the three statistics suggest that the number of clusters should be set to either 2 or 5.
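
The eclust() calls in this section pass hc_metric, an argument that the factoextra documentation describes as applying to hierarchical methods only. As a hedged alternative, the sketch below requests the Manhattan metric from cluster::pam() directly; the data are made up and the sketch is not the code behind the reported results:

```r
library(cluster)

set.seed(1)
# Toy data: two well-separated groups in two dimensions
X <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2))

# PAM on the raw data, with dissimilarities computed using the Manhattan metric
fit <- pam(X, k = 2, metric = "manhattan")

# Medoids are actual observations, unlike k-means centers
print(fit$medoids)
sizes <- as.vector(table(fit$clustering))
print(sizes)
```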

#PAM for 2 clusters
p1<-eclust(nutritiontrim, "pam", hc_metric="manhattan",k=2, graph=F)
p12<-fviz_cluster(p1, main="PAM for Manhattan distance metric", ggtheme=theme_classic(), xlab=FALSE, ylab=FALSE, palette="Set1", graph=F)

#PAM for 5 clusters
p2<-eclust(nutritiontrim, "pam", hc_metric="manhattan",k=5, graph=F)
p22<-fviz_cluster(p2, main="PAM for Manhattan distance metric", xlab=FALSE, ylab=FALSE, ggtheme=theme_classic(), palette="Set1", graph=F)

#Silhouette info for both iterations of PAM
sil1pam<-fviz_silhouette(p1, ggtheme=theme_classic(), palette="Set1", main="Average silhouette width", graph=F)
##   cluster size ave.sil.width
## 1       1  110          0.05
## 2       2   65          0.10
sil2pam<-fviz_silhouette(p2, ggtheme=theme_classic(), palette="Set1", main="Average silhouette width", graph=F)
##   cluster size ave.sil.width
## 1       1   72          0.12
## 2       2   24          0.11
## 3       3   36          0.13
## 4       4   30          0.04
## 5       5   13          0.28
#create smaller plots
grid.arrange(arrangeGrob(p12, p22, sil1pam, sil2pam, nrow=2))


Visual analysis of the results suggests that 5 clusters is the more appropriate number. This observation is confirmed by the fact that the average silhouette width is equal to 0.07 and 0.12 for 2 and 5 clusters, respectively. As with k-means, we can compute the Calinski-Harabasz index to compare the quality of clustering for 2 and 5 clusters.

calinhara(nutritiontrim, p1$cluster)
## [1] 14.18638
calinhara(nutritiontrim, p2$cluster)
## [1] 21.98479

The calculated index suggests that grouping the data into 5 clusters is indeed preferable (22.0 versus 14.2). Even though both k-means and PAM have been applied, the quality of the clusters obtained is low: the average silhouette widths are all well below 0.25. Therefore a more sophisticated method will now be used to cluster the data set.

Fuzzy k-means

The fuzzy k-means algorithm, also known as fuzzy c-means (FCM), is quite similar to regular k-means. First the number of clusters has to be determined, and then each object is assigned random initial values (Bezdek 1984). Unlike in regular k-means, these values do not assign the object to a single cluster; they form a vector of membership degrees, one per cluster. Each degree represents the similarity between the observation and the particular cluster. The degrees range from 0 to 1 and sum to 1 for each observation, like probabilities (Bezdek 1984). After all objects have been assigned values, the algorithm works iteratively, alternately updating the cluster centers and the membership degrees until they stop changing. The higher the degree for a particular cluster, the closer the observation is to the center of that cluster.
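
One step of this iteration, the membership update for fixed centers, can be sketched as follows. With d_ij the Euclidean distance from observation i to center j and fuzzifier m > 1, the standard update is u_ij = d_ij^(-2/(m-1)) / sum over k of d_ik^(-2/(m-1)). The helper `fcm_memberships` and the toy points are hypothetical and only illustrate the formula; they are not the ppclust implementation.

```r
# Membership degrees of each row of X for the given cluster centers
fcm_memberships <- function(X, centers, m = 1.5) {
  # n x c matrix of Euclidean distances from observations to centers
  D <- sapply(seq_len(nrow(centers)),
              function(j) sqrt(rowSums(sweep(X, 2, centers[j, ])^2)))
  D <- pmax(D, 1e-12)            # guard against division by zero
  W <- D^(-2 / (m - 1))
  W / rowSums(W)                 # each row sums to 1
}

# Toy data: four points on a line and two centers
X <- rbind(c(0, 0), c(1, 0), c(4, 0), c(5, 0))
centers <- rbind(c(0.5, 0), c(4.5, 0))

U <- fcm_memberships(X, centers, m = 1.5)
print(round(U, 4))
```

Larger values of m flatten the weights, so the membership degrees move closer to each other.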

FCM was run for the numbers of clusters identified in the previous sections: 2, 3 and 5.

#fuzzy k-means for 2 clusters
library(ppclust)
library(fclust)
fuzzykm <- fcm(nutritiontrim, centers=2, m=1.5)
fuzzykm2 <- ppclust2(fuzzykm, "kmeans")
fuzzyplot<-fviz_cluster(fuzzykm2, data = nutritiontrim, palette = "Set1", xlab=FALSE, ylab=FALSE, main="Fuzzy for 2 clusters", geom="point", repel = TRUE)

#fuzzy k-means for 3 clusters
fuzzy1km <- fcm(nutritiontrim, centers=3, m=1.5)
fuzzy1km2 <- ppclust2(fuzzy1km, "kmeans")
fuzzyplot2<-fviz_cluster(fuzzy1km2, data = nutritiontrim, palette = "Set1", xlab=FALSE, ylab=FALSE, main="Fuzzy for 3 clusters", geom="point", repel = TRUE)

#fuzzy k-means for 5 clusters
fuzzy2km <- fcm(nutritiontrim, centers=5, m=1.5, nstart=5, numseed=123)
fuzzy2km2 <- ppclust2(fuzzy2km, "kmeans")
fuzzyplot3<-fviz_cluster(fuzzy2km2, data = nutritiontrim, palette = "Set1", xlab=FALSE, ylab=FALSE, main="Fuzzy for 5 clusters", geom="point", repel = TRUE)
grid.arrange(arrangeGrob(fuzzyplot, fuzzyplot2, fuzzyplot3, nrow=2, ncol=2))

#the m parameter controls how 'fuzzy' the clustering is; as m approaches 1, fuzzy k-means behaves like regular k-means


Visual analysis suggests that all three fuzzy k-means runs performed much better than both regular k-means and PAM. One can clearly see distinct clusters. This time, as described at the beginning of this section, every observation was assigned degrees of membership in each cluster instead of a binary 1-0 assignment.

First, the membership degrees assigned to the individual observations can be inspected.

silfkm<-ppclust2(fuzzykm, "fclust")
#display the degrees for each observation
head(silfkm$U)
##   Cluster 1 Cluster 2
## 1 0.3093284 0.6906716
## 2 0.6481839 0.3518161
## 3 0.1560622 0.8439378
## 4 0.5938865 0.4061135
## 5 0.3034676 0.6965324
## 6 0.3209784 0.6790216

For example the first observation was assigned degree 0.31 for cluster 1 and 0.69 for cluster 2.
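
Membership degrees can be turned into a single cluster label by taking the cluster with the highest degree. A minimal sketch using the first three rows of the output shown above (the object names are illustrative):

```r
# First three rows of the membership matrix printed above
U <- matrix(c(0.3093284, 0.6906716,
              0.6481839, 0.3518161,
              0.1560622, 0.8439378),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("Cluster 1", "Cluster 2")))

# Hard assignment: index of the largest membership degree in each row
hard <- apply(U, 1, which.max)

# The largest degree itself shows how clear-cut the assignment is
top_degree <- apply(U, 1, max)

print(hard)
print(top_degree)
```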

In order to formally analyse the quality of performed clustering, the fuzzy version of silhouette should be used.

#value of silhouette for 2 clusters
silfkm<-ppclust2(fuzzykm, "fclust")
silhvalue<-SIL.F(silfkm$Xca, silfkm$U, alpha=1)

#value of silhouette for 3 clusters
silfkm2<-ppclust2(fuzzy1km, "fclust")
silhvalue2<-SIL.F(silfkm2$Xca, silfkm2$U, alpha=1)

#value of silhouette for 5 clusters
silfkm3<-ppclust2(fuzzy2km, "fclust")
silhvalue3<-SIL.F(silfkm3$Xca, silfkm3$U, alpha=1)
#value of silhouette statistics 
c(silhvalue, silhvalue2, silhvalue3)
## [1] 0.3190894 0.3206814 0.4269056

The conclusions regarding the quality of clustering are supported by the silhouette statistic: its value is 0.32, 0.32 and 0.43 for 2, 3 and 5 clusters, respectively. These values are higher than the ones obtained for the previous algorithms, which might suggest that fuzzy k-means fits this data set better. The comparison should be read with some caution, because the fuzzy silhouette weights observations by their membership degrees and is not computed in exactly the same way as the regular one.

Fuzzy k-means with 3 clusters seems to work reasonably well, so its clusters are inspected more thoroughly. First, the food items assigned to a particular cluster can be listed, for example cluster 1.

dtfr<-data.frame(namesfood, silfkm2$clus)
colnames(dtfr)[1:2]<-c("Name", "Cluster")
print(dtfr$Name[dtfr$Cluster==1])
##  [1] Nuts, pecans           Lamb, raw, ground      Cheese, camembert     
##  [4] Vegetarian fillets     Crackers, rusk toast   Quail, raw, meat only 
##  [7] Salami, turkey, cooked Ostrich, raw, top loin Nuts, dried, pine nuts
## [10] Cookies, Marie biscuit Emu, raw, outside drum Nuts, dried, beechnuts
## [13] Gravy, mix, dry, onion KEEBLER, Waffle Cones  KEEBLER, Waffle Bowls 
## [16] Egg custards, dry mix  Peanut flour, low fat  Ground turkey, cooked 
## [19] MURRAY, Vanilla Wafer  Bread, toasted, wheat  Spices, garlic powder 
## [22] Fireweed, raw, leaves  Frankfurter, meatless  Emu, raw, flat fillet 
## [25] Emu, raw, inside drum  Snacks, potato sticks  McDONALD'S, Hamburger 
## [28] Mushrooms, raw, enoki  Bacon and beef sticks  Salami, pork, Italian 
## [31] Crackers, whole-wheat  Peanuts, raw, spanish  Fish, raw, butterfish 
## [34] Ham and cheese spread  Peppers, dried, ancho  Parsley, freeze-dried 
## [37] Nuts, dried, pilinuts  Mushrooms, raw, white  Yeast extract spread  
## [40] Emu, raw, fan fillet   Pasta, enriched, dry   Cookies, gingersnaps  
## [43] MURRAY, Honey Graham   Frankfurter, chicken   Ham, canned, chopped  
## [46] Spices, dried, thyme   Corn, dried (Navajo)   Pate, truffle flavor  
## [49] Salami, beef, cooked   Spices, chili powder   Chives, freeze-dried  
## [52] Crackers, multigrain   Spices, ground, mace   Spices, onion powder  
## [55] Barley flour or meal   Garlic bread, frozen   KFC, Popcorn Chicken  
## [58] Rolls, sweet, dinner   Ostrich, raw, ground   Rolls, wheat, dinner  
## [61] Chicken, raw, ground  
## 8789 Levels: Abiyuch, raw ... Zwieback

This raw output is not very informative, hence one should have a look at the descriptive statistics for each cluster. For easier interpretation, the cluster memberships obtained above are combined with the data before standardisation.

originaldata<-as.data.frame(cbind(nutrition2, silfkm2$clus))
colnames(originaldata)[16]<-"Cluster"

library(psych)
describeBy(originaldata[,1:15], originaldata[,16])
## 
##  Descriptive statistics by group 
## group: 1
##                  vars  n    mean     sd median trimmed    mad min  max range
## calories            1 61  320.20 156.96    313  312.47 163.09  22  719   697
## total_fat           2 61   66.00  50.91     45   61.80  44.48   3  171   168
## cholesterol         3 61  102.62 127.45      1   91.98   0.00   1  307   306
## sodium              4 61  585.70 344.03    629  590.84 351.38   2 1158  1156
## choline             5 61  318.72 375.09      1  275.78   0.00   1 1044  1043
## folate              6 61  190.69 129.27    193  192.86 192.74   1  365   364
## folic_acid          7 61   30.30  63.65      2   13.39   0.00   1  258   257
## niacin              8 61 2154.57 908.26   2247 2193.10 945.90 285 3699  3414
## pantothenic_acid    9 61  729.44 464.72    774  741.20 567.84   1 1443  1442
## riboflavin         10 61  323.54 181.72    287  304.55 169.02  23  784   761
## thiamin            11 61  356.46 215.19    360  342.86 231.29  30  897   867
## vitamin_a          12 61  316.82 428.07      2  255.12   1.48   1 1222  1221
## vitamin_a_rae      13 61   70.33 127.14      2   40.00   1.48   1  546   545
## carotene_alpha     14 61    2.51   6.85      2    1.65   0.00   1   55    54
## carotene_beta      15 61   16.79  51.00      2    1.69   0.00   1  265   264
##                   skew kurtosis     se
## calories          0.38    -0.14  20.10
## total_fat         0.61    -0.97   6.52
## cholesterol       0.52    -1.64  16.32
## sodium           -0.18    -1.15  44.05
## choline           0.60    -1.28  48.03
## folate           -0.13    -1.54  16.55
## folic_acid        2.12     3.39   8.15
## niacin           -0.28    -0.73 116.29
## pantothenic_acid -0.27    -1.26  59.50
## riboflavin        0.83     0.17  23.27
## thiamin           0.45    -0.60  27.55
## vitamin_a         0.86    -0.91  54.81
## vitamin_a_rae     1.93     3.04  16.28
## carotene_alpha    7.37    53.45   0.88
## carotene_beta     3.41    11.06   6.53
## ------------------------------------------------------------ 
## group: 2
##                  vars  n   mean     sd median trimmed    mad min  max range
## calories            1 57 157.74 154.68    102  135.36 114.16  15  763   748
## total_fat           2 57  20.18  27.38     10   14.13  11.86   1  118   117
## cholesterol         3 57  34.77  80.32      1   15.19   0.00   1  289   288
## sodium              4 57 607.98 334.60    629  616.38 398.82   1 1167  1166
## choline             5 57 144.86 263.98      1   90.43   0.00   1  897   896
## folate              6 57 108.49 126.94     28   95.47  40.03   1  362   361
## folic_acid          7 57   3.21   8.45      2    1.66   0.00   1   49    48
## niacin              8 57 543.63 721.52    210  401.98 309.86   1 3035  3034
## pantothenic_acid    9 57 160.19 235.60     12  120.32  16.31   1  831   830
## riboflavin         10 57  82.63 109.01     42   66.15  60.79   1  685   684
## thiamin            11 57  58.49  79.98     31   42.26  44.48   1  374   373
## vitamin_a          12 57 303.14 366.19      6  261.26   7.41   1 1090  1089
## vitamin_a_rae      13 57  55.58 114.10      2   30.68   1.48   1  534   533
## carotene_alpha     14 57   1.40   0.49      1    1.38   0.00   1    2     1
## carotene_beta      15 57  13.98  38.20      1    3.74   0.00   1  197   196
##                   skew kurtosis    se
## calories          1.50     2.59 20.49
## total_fat         2.26     4.64  3.63
## cholesterol       2.11     2.86 10.64
## sodium           -0.17    -1.02 44.32
## choline           1.59     1.11 34.96
## folate            0.72    -1.10 16.81
## folic_acid        4.91    22.66  1.12
## niacin            1.73     2.46 95.57
## pantothenic_acid  1.32     0.38 31.21
## riboflavin        3.10    13.82 14.44
## thiamin           2.31     5.72 10.59
## vitamin_a         0.70    -1.07 48.50
## vitamin_a_rae     2.56     6.85 15.11
## carotene_alpha    0.38    -1.89  0.07
## carotene_beta     3.27    10.61  5.06
## ------------------------------------------------------------ 
## group: 3
##                  vars  n   mean     sd median trimmed    mad min  max range
## calories            1 57 188.56 180.37    147  162.15 154.19  12  876   864
## total_fat           2 57  39.07  52.13     13   29.36  16.31   1  175   174
## cholesterol         3 57  56.65 101.29      1   37.17   0.00   1  313   312
## sodium              4 57 485.93 352.05    432  469.51 378.06   2 1233  1231
## choline             5 57 439.11 355.12    360  419.91 492.22   1 1130  1129
## folate              6 57 152.32 116.67    127  148.04 171.98   1  362   361
## folic_acid          7 57   6.14  20.57      2    1.98   0.00   1  141   140
## niacin              8 57 800.40 809.08    557  688.94 622.69   1 3036  3035
## pantothenic_acid    9 57 288.46 308.46    206  236.30 200.15   1 1374  1373
## riboflavin         10 57 150.70 165.65     92  120.15 105.26   5  714   709
## thiamin            11 57 101.32 108.20     62   83.11  60.79   1  436   435
## vitamin_a          12 57 765.00 408.98    856  787.47 444.78   5 1321  1316
## vitamin_a_rae      13 57 228.37 159.18    229  220.96 163.09   2  556   554
## carotene_alpha     14 57  41.14  65.01      2   30.13   0.00   1  194   193
## carotene_beta      15 57 320.56 164.34    327  329.28 180.88   1  566   565
##                   skew kurtosis     se
## calories          1.56     2.73  23.89
## total_fat         1.52     1.06   6.90
## cholesterol       1.51     0.67  13.42
## sodium            0.34    -0.92  46.63
## choline           0.36    -1.11  47.04
## folate            0.17    -1.36  15.45
## folic_acid        5.38    30.45   2.72
## niacin            1.15     0.22 107.17
## pantothenic_acid  1.66     2.38  40.86
## riboflavin        1.69     2.51  21.94
## thiamin           1.46     1.39  14.33
## vitamin_a        -0.44    -1.13  54.17
## vitamin_a_rae     0.28    -0.83  21.08
## carotene_alpha    1.30     0.02   8.61
## carotene_beta    -0.36    -0.80  21.77

Three variables are particularly interesting: the number of calories, the total amount of fat and the cholesterol content. We can see that while food items in clusters 2 and 3 have a similar average number of calories per item (158 and 189, respectively), products in cluster 1 are much more caloric (320). This is also reflected in the cholesterol and total fat levels: both of these variables are much higher for products in cluster 1. The descriptive statistics suggest that cluster 1 contains fattier, less healthy products.
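
When only the cluster means of a few variables are of interest, a compact table can be easier to read than the full describeBy() output. A minimal sketch with made-up numbers (the data frame `toy` is hypothetical, not the nutrition data):

```r
# Toy data: two variables and a cluster label
toy <- data.frame(calories  = c(300, 340, 100, 120),
                  total_fat = c(60, 70, 10, 20),
                  Cluster   = c(1, 1, 2, 2))

# Mean of each selected variable within each cluster
means <- aggregate(cbind(calories, total_fat) ~ Cluster, data = toy, FUN = mean)
print(means)
```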

Summary

The nutrition.csv data set was analysed. 175 food items were chosen, each described by 15 nutritional values. Potential outliers and correlations between variables were examined. Three clustering algorithms were used; for each of them we tried to find the optimal number of clusters with the use of three different statistics: silhouette, total within sum of squares and the gap statistic. For each clustering performed the silhouette value was calculated. The obtained results suggest that fuzzy k-means clustered the data best, compared with the two other algorithms used.

Bibliography

Bezdek, J. C., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3), 191-203.