The paper aims to showcase different clustering algorithms and useful statistics tests used to prepare and analyse data for clustering. Three algorithms were used: k-means, Partitioning Around Medoids (PAM) and fuzzy k-means. Analyses are based on the data from the database called “nutrition” that can be found here: https://www.kaggle.com/trolukovich/nutritional-values-for-common-foods-and-products?select=nutrition.csv. This data set contains nutrition data for almost 9 thousand different food items.
First of all, necessary libraries are loaded and the database is imported.
library(tidyverse)
library(dplyr)
library(lubridate)
library(ggplot2)
library(datasets)
library(readxl)
library(xlsx)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(tidyverse)
library(grid)
library(gridExtra)
library(lattice)
nutrition <- read.csv2("nutrition.csv", header=TRUE, sep = ",", stringsAsFactors = TRUE)
Second of all, the rows with NA values or omitted, irrelevant columns are deleted and the food items names are stored as an additional variable.
nutrition <- na.omit(nutrition)
namesfood <- nutrition[1:175,2]
nutrition <- nutrition[,-1]
nutrition <- nutrition[,-1]
nutrition <- nutrition[,-1]
nutrition <- nutrition[,-3]
dim(nutrition)
## [1] 8789 73
Further more, all columns containing char variables are converted to numeric.
for(i in 1:ncol(nutrition)) {
nutrition[,i] <- as.numeric(nutrition[,i])
}
Also, in order to obtain more interpretable results whole database is normalised.
nutrition2<-nutrition
nutrition <- scale(nutrition)
After the initial data processing we obtain the normalised matrix containing data for 8789 food products, for which 73 nutritional values has been assigned. Finally, in order not to work with big data, only 175 food items and 15 nutritional characteristics are selected.
nutritiontrim <- as.matrix(nutrition[1:175, 1:15])
nutrition2<-as.matrix(nutrition2[1:175, 1:15])
names_nutr<- as.matrix(colnames(nutritiontrim))
summary(nutritiontrim)
## calories total_fat cholesterol sodium
## Min. :-1.26152 Min. :-1.0124 Min. :-0.89088 Min. :-1.9078
## 1st Qu.:-0.96127 1st Qu.:-0.9158 1st Qu.:-0.89088 1st Qu.:-1.0133
## Median :-0.18417 Median :-0.4971 Median :-0.89088 Median :-0.2183
## Mean :-0.01106 Mean :-0.1256 Mean :-0.36797 Mean :-0.2200
## 3rd Qu.: 0.62236 3rd Qu.: 0.3295 3rd Qu.:-0.06053 3rd Qu.: 0.6415
## Max. : 3.82496 Max. : 2.7234 Max. : 1.63663 Max. : 1.8090
## choline folate folic_acid niacin
## Min. :-0.78718 Min. :-1.17831 Min. :-0.3548 Min. :-1.2115
## 1st Qu.:-0.78718 1st Qu.:-1.09848 1st Qu.:-0.3353 1st Qu.:-1.0396
## Median :-0.48857 Median :-0.22032 Median :-0.3353 Median :-0.5094
## Mean : 0.01346 Mean :-0.03467 Mean :-0.1089 Mean :-0.2006
## 3rd Qu.: 0.74183 3rd Qu.: 0.98858 3rd Qu.:-0.3353 3rd Qu.: 0.5672
## Max. : 2.22285 Max. : 1.58923 Max. : 4.6595 Max. : 1.9358
## pantothenic_acid riboflavin thiamin vitamin_a
## Min. :-0.96531 Min. :-1.061069 Min. :-0.80429 Min. :-0.97765
## 1st Qu.:-0.94700 1st Qu.:-0.854052 1st Qu.:-0.67142 1st Qu.:-0.97535
## Median :-0.36370 Median :-0.289718 Median :-0.36460 Median :-0.03169
## Mean : 0.07939 Mean : 0.003946 Mean : 0.04274 Mean : 0.07498
## 3rd Qu.: 0.85000 3rd Qu.: 0.484469 3rd Qu.: 0.45923 3rd Qu.: 1.03051
## Max. : 2.80656 Max. : 3.379871 Max. : 3.52501 Max. : 2.06049
## vitamin_a_rae carotene_alpha carotene_beta
## Min. :-0.75618 Min. :-0.2591 Min. :-0.4945
## 1st Qu.:-0.75032 1st Qu.:-0.2591 1st Qu.:-0.4945
## Median :-0.74446 Median :-0.2267 Median :-0.4879
## Mean :-0.07657 Mean : 0.1852 Mean : 0.2510
## 3rd Qu.: 0.52687 3rd Qu.:-0.2267 3rd Qu.: 0.8318
## Max. : 2.49539 Max. : 5.9855 Max. : 3.2059
In order to identify potential outliers the interquartile range rule (IRQ) has been used. IRQ rule states that any observations that are far away from the middle 50% of observations are potential outliers.
library(tidyverse)
matr <- matrix(nrow=175, ncol=15)
for(i in 1:ncol(nutritiontrim)){
max_interval <- quantile(nutritiontrim[,i],0.75,na.rm = TRUE)+(IQR(nutritiontrim[,i], na.rm = TRUE)*1.5)
min_interval <- quantile(nutritiontrim[,i],0.25,na.rm = TRUE)-(IQR(nutritiontrim[,i], na.rm = TRUE)*1.5)
matr[,i] <- ifelse(nutritiontrim[,i]>max_interval|nutritiontrim[,i]<min_interval, 1, 0)
}
outliers <- matrix(nrow=15, ncol=2)
for(i in 1:ncol(matr)){
outliers[i,1]<-names_nutr[i,1]
outliers[i,2]<-sum(matr[,i][matr[,i]==1])
}
outliers2<-as.matrix(outliers[, -1])
rownames(outliers2)<-outliers[,1]
colnames(outliers2)<-c("No_of_outliers")
library(knitr)
kable(outliers2)
| No_of_outliers | |
|---|---|
| calories | 2 |
| total_fat | 12 |
| cholesterol | 24 |
| sodium | 0 |
| choline | 0 |
| folate | 0 |
| folic_acid | 51 |
| niacin | 0 |
| pantothenic_acid | 0 |
| riboflavin | 9 |
| thiamin | 8 |
| vitamin_a | 0 |
| vitamin_a_rae | 2 |
| carotene_alpha | 22 |
| carotene_beta | 8 |
Applying IRQ rule and aggregating results into one table allowed us to identify the variables with potential outliers. Variables folic_acid, cholesterol and carotene_alpha are the top three candidates.
Even though the IRQ rule results suggests that there are outliers in the analysed data set, I decided not to exclude any food items from the data set - in my opinion the visual examination of variables did not confirm the results obtained from IRQ.
In the next step the correlation matrix was constructed in order to examine the correlation between analysed variables.
library(corrplot)
## corrplot 0.92 loaded
corrplot.mixed(cor(nutritiontrim), bg="white", upper="pie",lower="number", order="hclust", tl.col="black", tl.pos="lt", diag="l", number.font=0.5, tl.cex=1, number.cex=0.55)
It is not surprising that some variables are strongly correlated, whilst other are not. In food items certain nutrients usually occur together and some do not pair naturally.
Clustering of the data might result in gathering valuable information about the phenomenon the analysed data is representing, but only if data set is distinguishable from the totally random distribution. The most popular tool to check whether data have cluster potential or is rather uniformly distributed is called Hopkins statistic. It is a statistical test, which null hypothesis states that data is not uniformly distributed (the null and alternative hypothesis may be switched around in different libraries in R).
hopkins(nutritiontrim, n=nrow(nutritiontrim)-1, header=FALSE)
## $H
## [1] 0.3186588
Obtained value of the test statistics is equal to 0.32. According to the test interpretation, data should be clusterable - data set is not uniformly distributed. To make sure the results are robust the alternative test from different library was performed.
get_clust_tendency(nutritiontrim, 2, graph=FALSE, gradient=list(low="purple", high="green"), seed = 48)
## $hopkins_stat
## [1] 0.8003733
##
## $plot
## NULL
Here the null hypothesis states that data is uniformly distributed, and with the test statistics equal to 0.8 we should reject the null hypothesis - data set is clusterable.
First algorithm used fro clustering is k-means. Before applying the algorithm we should analyse the number of clusters and choose the most appropriate number for our data. We will do that with the help of three different statistics. First statistics used to analyse the number of clusters is silhouette. Silhouette value tells us how similar objects are to their own cluster compared to other clusters. The silhouette value ranges from -1 to 1, and the higher the value the better.
#silhouette
silkmeans <- fviz_nbclust(nutritiontrim, kmeans, linecolor="navy blue", method = "s")
silkmeans
From the graph one can conclude that the most optimal number of clusters is 3 - the value of the silhouette is highest for this number of clusters. Second statistics used is the total within sum of squares. This method informs about the average sum of squared distances from each object to the center of the cluster.
#total within sum of squares
wsskmeans <- fviz_nbclust(nutritiontrim, kmeans, linecolor="navy blue", method = "wss")
wsskmeans
Performed work suggested two candidates for the optimal number of clusters. In general the lower value the better, which means that objects in clusters are close to each other and similar. On the other hand, with the increasing number of clusters value of the statistics will always drop, as the clusters get smaller. In my opinion there are two numbers that are candidates for the optimal number of clusters: 3 and 5. For both of these numbers with the small increase in the number of clusters the value of the statistics drops sharply, while further increase in the number of clusters does not result in the significant drop.
Third and final statistics used is the gap statistics. The gap statistics is an interesting, and less straightforward method for estimating the number of clusters than two previous ones. The idea is that one should compare the within cluster sum of squares to its expected value under the appropriate distribution (for more information about the gap statistics read: https://hastie.su.domains/Papers/gap.pdf).
#gap statistics
gapkmeans <- fviz_nbclust(nutritiontrim, kmeans, linecolor="navy blue", barcolor="navy blue", method = "gap_stat")
gapkmeans
The gap statistics suggests that 7 clusters should be chosen, as one should look for the first number, where the value of gap statistics drops.
Three statistics used to determine number of clusters produced somewhat heterogeneous results, however both silhouette and the total within sum of squares suggested that 3 is the optimal number of clusters.
K-means algorithm was applied twice using the Manhattan distance for number of clusters set to 3 and 5.
#K-means for 3 clusters
k1<-eclust(nutritiontrim, "kmeans", hc_metric="manhattan",k=3, graph=F)
k12<-fviz_cluster(k1, main="K-means for Manhattan distance metric", xlab=FALSE, ylab=FALSE, ggtheme=theme_classic(), palette="Set1", graph=F)
#K-means for 5 clusters
k2<-eclust(nutritiontrim, "kmeans", hc_metric="manhattan",k=5, graph=F)
k22<-fviz_cluster(k2, main="K-means for Manhattan distance metric", xlab=FALSE, ylab=FALSE, ggtheme=theme_classic(), palette="Set1", graph=F)
#Silhouette info for both iterations of k-means
sil1<-fviz_silhouette(k1, ggtheme=theme_classic(), palette="Set1", main="Average silhouette width", graph=F)
## cluster size ave.sil.width
## 1 1 68 0.27
## 2 2 67 0.06
## 3 3 40 0.05
sil2<-fviz_silhouette(k2, ggtheme=theme_classic(), palette="Set1", main="Average silhouette width", graph=F)
## cluster size ave.sil.width
## 1 1 21 0.03
## 2 2 48 0.07
## 3 3 13 0.27
## 4 4 63 0.24
## 5 5 30 0.13
#create smaller plots
grid.arrange(arrangeGrob(k12, k22,sil1, sil2, nrow=2))
Visual analysis of applied algorithms suggest that 3 clusters is a more appropriate number, however average silhouette is equal to 0.14 in both cases. Another quality measure calculated for the performed k-means is the Calinski-Harabasz index, which might be helpful in comparing two results for number of clusters equal to 3 and 5, respectively. This measure checks the ratio between the variance between clusters to variance within the clusters.
calinhara(nutritiontrim, k1$cluster)
## [1] 29.18115
calinhara(nutritiontrim, k2$cluster)
## [1] 25.24107
Calinski-Harabasz index is higher for 3 clusters than for 5 (29 is bigger than 25), therefore based on this statistics it is more optimal to divide the data into three groups.
Second algorithm used for clustering is Partioning Around Medoids (PAM). As previously, first the number of clusters has to be determined, and the same three statistics were used: silhouette, total within sum of squares and gap statistics.
#silhouette
silpam <- fviz_nbclust(nutritiontrim, cluster::pam, linecolor="navy blue", method = "s")
silpam
Silhouette suggests that dividing data into 5 clusters is the most optimal according to silhouette statistics.
#total within sum of squares
wsspam <- fviz_nbclust(nutritiontrim, cluster::pam, linecolor="navy blue", method = "wss")
wsspam
It is hard to visually asses the optimal number of clusters by looking at the graph of total within sum of square, therefore I have decided not to pick any number in this case.
#gap statistics
gappam <- fviz_nbclust(nutritiontrim, cluster::pam, linecolor="navy blue", barcolor="navy blue", method = "gap_stat")
gappam
Gap statistics clearly indicates that number of cluster should be set to 2. Three statistics calculated suggest, that the number of clusters should be set to either 2 or 5.
#PAM for 2 clusters
p1<-eclust(nutritiontrim, "pam", hc_metric="manhattan",k=2, graph=F)
p12<-fviz_cluster(p1, main="PAM for Manhattan distance metric", ggtheme=theme_classic(), xlab=FALSE, ylab=FALSE, palette="Set1", graph=F)
#PAM for 5 clusters
p2<-eclust(nutritiontrim, "pam", hc_metric="manhattan",k=5, graph=F)
p22<-fviz_cluster(p2, main="PAM for Manhattan distance metric", xlab=FALSE, ylab=FALSE, ggtheme=theme_classic(), palette="Set1", graph=F)
#Silhouette info for both iterations of k-means
sil1pam<-fviz_silhouette(p1, ggtheme=theme_classic(), palette="Set1", main="Average silhouette width", graph=F)
## cluster size ave.sil.width
## 1 1 110 0.05
## 2 2 65 0.10
sil2pam<-fviz_silhouette(p2, ggtheme=theme_classic(), palette="Set1", main="Average silhouette width", graph=F)
## cluster size ave.sil.width
## 1 1 72 0.12
## 2 2 24 0.11
## 3 3 36 0.13
## 4 4 30 0.04
## 5 5 13 0.28
#create smaller plots
grid.arrange(arrangeGrob(p12, p22, sil1pam, sil2pam, nrow=2))
Visual analysis of applied algorithms suggest that 5 clusters is a more appropriate number. This observation is confirmed by the fact, that average silhouette width is equal to 0.07 and 0.12 for 2 and 5 clusters, respectively. Similarly to k-means we can perform the Calinski-Harabasz index, to compare the quality of clustering for 2 and 5 clusters.
calinhara(nutritiontrim, p1$cluster)
## [1] 14.18638
calinhara(nutritiontrim, p2$cluster)
## [1] 21.98479
Calculated index suggests that grouping data into 5 clusters is indeed more optimal (21 is bigger than 14). Even though we have performed both k-means and PAM algorithms, the quality of the clusters obtained is below average. Therefore now more sophisticated method will be used to cluster the data set.
Fuzzy k-means algorithm (FCM) is quite similar to regular k-means. At first the number of clusters needs to be determined, and then each object is assigned a random value (Bezdek 1984). However, unlike the regular k-means, this value is not associated with the membership with one of the clusters, but rather is a vector of probabilities. Each probability represents the degree of similarity between observation and the particular cluster. These values range from 0 to 1, and sum up to 1 for each observation (same as standard probability) (Bezdek 1984). After all objects are assigned values, the algorithm works iteratively, changing the values for each observation until they are optimised. The higher the value assigned for particular cluster, the closer to the center the observation is. The FCM is very similar to k-means.
To calculate the FCM algorithm the number of clusters used was determined in the previous parts: 2, 3 and 5.
#fuzzy k-means for 2 clusters
library(ppclust)
library(fclust)
fuzzykm <- fcm(nutritiontrim, centers=2, m=1.5)
fuzzykm2 <- ppclust2(fuzzykm, "kmeans")
fuzzyplot<-fviz_cluster(fuzzykm2, data = nutritiontrim, palette = "Set1", xlab=FALSE, ylab=FALSE, main="Fuzzy for 2 clusters", geom="point", repel = TRUE)
#fuzzy k-means for 3 clusters
fuzzy1km <- fcm(nutritiontrim, centers=3, m=1.5)
fuzzy1km2 <- ppclust2(fuzzy1km, "kmeans")
fuzzyplot2<-fviz_cluster(fuzzy1km2, data = nutritiontrim, palette = "Set1", xlab=FALSE, ylab=FALSE, main="Fuzzy for 3 clusters", geom="point", repel = TRUE)
#fuzzy k-means for 5 clusters
fuzzy2km <- fcm(nutritiontrim, centers=5, m=1.5, nstart=5, numseed=123)
fuzzy2km2 <- ppclust2(fuzzy2km, "kmeans")
fuzzyplot3<-fviz_cluster(fuzzy2km2, data = nutritiontrim, palette = "Set1", xlab=FALSE, ylab=FALSE, main="Fuzzy for 5 clusters", geom="point", repel = TRUE)
grid.arrange(arrangeGrob(fuzzyplot, fuzzyplot2, fuzzyplot3, nrow=2, ncol=2))
#the m parameter determines how mouch 'fuzzy' should the clustering be, fuzzy k-means for m=1 works like regular k-means
Visual analysis suggests that all three fuzzy k-means algorithms performed much better than both regular k-means and PAM. One can clearly see distinguished clusters. This time, as described at the beginning of this chapter, every observation was assigned degrees of membership in each cluster, instead of binary 1-0 representation.
In order to analyse the quality of performed clustering, the fuzzy version of silhouette should be used.
silfkm<-ppclust2(fuzzykm, "fclust")
#display the degrees for each observation
head(silfkm$U)
## Cluster 1 Cluster 2
## 1 0.3093284 0.6906716
## 2 0.6481839 0.3518161
## 3 0.1560622 0.8439378
## 4 0.5938865 0.4061135
## 5 0.3034676 0.6965324
## 6 0.3209784 0.6790216
For example the first observation was assigned degree 0.31 for cluster 1 and 0.69 for cluster 2.
In order to formally analyse the quality of performed clustering, the fuzzy version of silhouette should be used.
#value of silhouette for 2 clusters
silfkm<-ppclust2(fuzzykm, "fclust")
silhvalue<-SIL.F(silfkm$Xca, silfkm$U, alpha=1)
#value of silhouette for 3 clusters
silfkm2<-ppclust2(fuzzy1km, "fclust")
silhvalue2<-SIL.F(silfkm2$Xca, silfkm2$U, alpha=1)
#value of silhouette for 5 clusters
silfkm3<-ppclust2(fuzzy2km, "fclust")
silhvalue3<-SIL.F(silfkm3$Xca, silfkm3$U, alpha=1)
#value of silhouette statistics
c(silhvalue, silhvalue2, silhvalue3)
## [1] 0.3190894 0.3206814 0.4269056
Conclusions regarding the quality of clustering are confirmed by the silhouette statistics. Value of silhouette statistics is 0.32, 0.32 and 0.43 for number of clusters equal to 2, 3 and 5, respectively. These value are much higher than ones obtained for previous algorithms, which might suggest that fuzzy k-means algorithm performance is more optimal.
Fuzzy k-means algorithm for 3 clusters seems to be working efficiently, thus one shall inspect the calculated clusters more thoroughly. First of all, all food items can be displayed grouped into particular cluster, for example cluster 1.
dtfr<-data.frame(namesfood, silfkm2$clus)
colnames(dtfr)[1:2]<-c("Name", "Cluster")
print(dtfr$Name[dtfr$Cluster==1])
## [1] Nuts, pecans Lamb, raw, ground Cheese, camembert
## [4] Vegetarian fillets Crackers, rusk toast Quail, raw, meat only
## [7] Salami, turkey, cooked Ostrich, raw, top loin Nuts, dried, pine nuts
## [10] Cookies, Marie biscuit Emu, raw, outside drum Nuts, dried, beechnuts
## [13] Gravy, mix, dry, onion KEEBLER, Waffle Cones KEEBLER, Waffle Bowls
## [16] Egg custards, dry mix Peanut flour, low fat Ground turkey, cooked
## [19] MURRAY, Vanilla Wafer Bread, toasted, wheat Spices, garlic powder
## [22] Fireweed, raw, leaves Frankfurter, meatless Emu, raw, flat fillet
## [25] Emu, raw, inside drum Snacks, potato sticks McDONALD'S, Hamburger
## [28] Mushrooms, raw, enoki Bacon and beef sticks Salami, pork, Italian
## [31] Crackers, whole-wheat Peanuts, raw, spanish Fish, raw, butterfish
## [34] Ham and cheese spread Peppers, dried, ancho Parsley, freeze-dried
## [37] Nuts, dried, pilinuts Mushrooms, raw, white Yeast extract spread
## [40] Emu, raw, fan fillet Pasta, enriched, dry Cookies, gingersnaps
## [43] MURRAY, Honey Graham Frankfurter, chicken Ham, canned, chopped
## [46] Spices, dried, thyme Corn, dried (Navajo) Pate, truffle flavor
## [49] Salami, beef, cooked Spices, chili powder Chives, freeze-dried
## [52] Crackers, multigrain Spices, ground, mace Spices, onion powder
## [55] Barley flour or meal Garlic bread, frozen KFC, Popcorn Chicken
## [58] Rolls, sweet, dinner Ostrich, raw, ground Rolls, wheat, dinner
## [61] Chicken, raw, ground
## 8789 Levels: Abiyuch, raw ... Zwieback
This raw output is not very informative, hence one should have a look at the descriptive statistics for each cluster. Now we will use the obtained clustering memberships for each object and not normalised data for convincing interpretation.
originaldata<-as.data.frame(cbind(nutrition2, silfkm2$clus))
colnames(originaldata)[16]<-"Cluster"
library(psych)
describeBy(originaldata[,1:15], originaldata[,16])
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range
## calories 1 61 320.20 156.96 313 312.47 163.09 22 719 697
## total_fat 2 61 66.00 50.91 45 61.80 44.48 3 171 168
## cholesterol 3 61 102.62 127.45 1 91.98 0.00 1 307 306
## sodium 4 61 585.70 344.03 629 590.84 351.38 2 1158 1156
## choline 5 61 318.72 375.09 1 275.78 0.00 1 1044 1043
## folate 6 61 190.69 129.27 193 192.86 192.74 1 365 364
## folic_acid 7 61 30.30 63.65 2 13.39 0.00 1 258 257
## niacin 8 61 2154.57 908.26 2247 2193.10 945.90 285 3699 3414
## pantothenic_acid 9 61 729.44 464.72 774 741.20 567.84 1 1443 1442
## riboflavin 10 61 323.54 181.72 287 304.55 169.02 23 784 761
## thiamin 11 61 356.46 215.19 360 342.86 231.29 30 897 867
## vitamin_a 12 61 316.82 428.07 2 255.12 1.48 1 1222 1221
## vitamin_a_rae 13 61 70.33 127.14 2 40.00 1.48 1 546 545
## carotene_alpha 14 61 2.51 6.85 2 1.65 0.00 1 55 54
## carotene_beta 15 61 16.79 51.00 2 1.69 0.00 1 265 264
## skew kurtosis se
## calories 0.38 -0.14 20.10
## total_fat 0.61 -0.97 6.52
## cholesterol 0.52 -1.64 16.32
## sodium -0.18 -1.15 44.05
## choline 0.60 -1.28 48.03
## folate -0.13 -1.54 16.55
## folic_acid 2.12 3.39 8.15
## niacin -0.28 -0.73 116.29
## pantothenic_acid -0.27 -1.26 59.50
## riboflavin 0.83 0.17 23.27
## thiamin 0.45 -0.60 27.55
## vitamin_a 0.86 -0.91 54.81
## vitamin_a_rae 1.93 3.04 16.28
## carotene_alpha 7.37 53.45 0.88
## carotene_beta 3.41 11.06 6.53
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range
## calories 1 57 157.74 154.68 102 135.36 114.16 15 763 748
## total_fat 2 57 20.18 27.38 10 14.13 11.86 1 118 117
## cholesterol 3 57 34.77 80.32 1 15.19 0.00 1 289 288
## sodium 4 57 607.98 334.60 629 616.38 398.82 1 1167 1166
## choline 5 57 144.86 263.98 1 90.43 0.00 1 897 896
## folate 6 57 108.49 126.94 28 95.47 40.03 1 362 361
## folic_acid 7 57 3.21 8.45 2 1.66 0.00 1 49 48
## niacin 8 57 543.63 721.52 210 401.98 309.86 1 3035 3034
## pantothenic_acid 9 57 160.19 235.60 12 120.32 16.31 1 831 830
## riboflavin 10 57 82.63 109.01 42 66.15 60.79 1 685 684
## thiamin 11 57 58.49 79.98 31 42.26 44.48 1 374 373
## vitamin_a 12 57 303.14 366.19 6 261.26 7.41 1 1090 1089
## vitamin_a_rae 13 57 55.58 114.10 2 30.68 1.48 1 534 533
## carotene_alpha 14 57 1.40 0.49 1 1.38 0.00 1 2 1
## carotene_beta 15 57 13.98 38.20 1 3.74 0.00 1 197 196
## skew kurtosis se
## calories 1.50 2.59 20.49
## total_fat 2.26 4.64 3.63
## cholesterol 2.11 2.86 10.64
## sodium -0.17 -1.02 44.32
## choline 1.59 1.11 34.96
## folate 0.72 -1.10 16.81
## folic_acid 4.91 22.66 1.12
## niacin 1.73 2.46 95.57
## pantothenic_acid 1.32 0.38 31.21
## riboflavin 3.10 13.82 14.44
## thiamin 2.31 5.72 10.59
## vitamin_a 0.70 -1.07 48.50
## vitamin_a_rae 2.56 6.85 15.11
## carotene_alpha 0.38 -1.89 0.07
## carotene_beta 3.27 10.61 5.06
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range
## calories 1 57 188.56 180.37 147 162.15 154.19 12 876 864
## total_fat 2 57 39.07 52.13 13 29.36 16.31 1 175 174
## cholesterol 3 57 56.65 101.29 1 37.17 0.00 1 313 312
## sodium 4 57 485.93 352.05 432 469.51 378.06 2 1233 1231
## choline 5 57 439.11 355.12 360 419.91 492.22 1 1130 1129
## folate 6 57 152.32 116.67 127 148.04 171.98 1 362 361
## folic_acid 7 57 6.14 20.57 2 1.98 0.00 1 141 140
## niacin 8 57 800.40 809.08 557 688.94 622.69 1 3036 3035
## pantothenic_acid 9 57 288.46 308.46 206 236.30 200.15 1 1374 1373
## riboflavin 10 57 150.70 165.65 92 120.15 105.26 5 714 709
## thiamin 11 57 101.32 108.20 62 83.11 60.79 1 436 435
## vitamin_a 12 57 765.00 408.98 856 787.47 444.78 5 1321 1316
## vitamin_a_rae 13 57 228.37 159.18 229 220.96 163.09 2 556 554
## carotene_alpha 14 57 41.14 65.01 2 30.13 0.00 1 194 193
## carotene_beta 15 57 320.56 164.34 327 329.28 180.88 1 566 565
## skew kurtosis se
## calories 1.56 2.73 23.89
## total_fat 1.52 1.06 6.90
## cholesterol 1.51 0.67 13.42
## sodium 0.34 -0.92 46.63
## choline 0.36 -1.11 47.04
## folate 0.17 -1.36 15.45
## folic_acid 5.38 30.45 2.72
## niacin 1.15 0.22 107.17
## pantothenic_acid 1.66 2.38 40.86
## riboflavin 1.69 2.51 21.94
## thiamin 1.46 1.39 14.33
## vitamin_a -0.44 -1.13 54.17
## vitamin_a_rae 0.28 -0.83 21.08
## carotene_alpha 1.30 0.02 8.61
## carotene_beta -0.36 -0.80 21.77
Three variables are particularly interesting: number of calories, total amount of fat and the cholesterol content. We can see that while food items in clusters 2 and 3 have similar average of calories per object (158 and 189, respectively), products in cluster 1 are much more caloric (320). This is also reflected in the cholesterol and total fat levels: both of these variables are much higher for products in the cluster 1. Statistical analysis suggest that cluster 1 contains more fatty, unhealthy products.
The nutrition.csv data set was analysed. 150 food items were chosen with 15 nutritional values. Potential outliers and correlation between variables were analysed. Three clustering algorithms were used, for each of them we tried to find the optimal number of clusters with use of three different statistics: silhouette, total within sum of squares and gap statistics. For each clustering performed silhouette value was calculated. Obtained results suggest that fuzzy k-means clustered data in most optimal way, compared to two other algorithms used.
Bezdek, James & Ehrlich, Robert & Full, William. (1984). FCM—the Fuzzy C-Means clustering-algorithm. Computers & Geosciences.