library(tidyverse)
library(ggplot2)
library(GGally)
library(broom)
library(dplyr)
library(Hmisc)
library(olsrr)
library(psych)
library(gridExtra)
library(cluster)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
# Set a custom color palette with lower alpha (transparency)
custom_palette <- adjustcolor(viridis::viridis(3, direction = -1), alpha.f = 0.8)
custom_palette <- c(custom_palette, adjustcolor("#C9A8E1", alpha.f = 0.8))
data <- read.csv("Cereals_data.csv")
str(data)
## 'data.frame': 77 obs. of 12 variables:
## $ Name : chr "100%_Bran " "100%_Natural_Bran " "All-Bran " "All-Bran_with_Extra_Fiber " ...
## $ Calories: int 70 120 70 50 110 110 110 130 90 90 ...
## $ Protein : int 4 3 4 4 2 2 2 3 2 3 ...
## $ Fat : int 1 5 1 0 2 2 0 2 1 0 ...
## $ Sodium : int 130 15 260 140 200 180 125 210 200 210 ...
## $ Fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
## $ Carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
## $ Sugars : int 6 8 5 0 8 10 14 8 6 5 ...
## $ Potass : int 280 135 320 330 NA 70 30 100 125 190 ...
## $ Vitamins: int 25 0 25 25 25 25 25 25 25 25 ...
## $ Weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
## $ Rating : num 68.4 34 59.4 93.7 34.4 ...
summary(data)
## Name Calories Protein Fat
## Length:77 Min. : 50.0 Min. :1.000 Min. :0.000
## Class :character 1st Qu.:100.0 1st Qu.:2.000 1st Qu.:0.000
## Mode :character Median :110.0 Median :3.000 Median :1.000
## Mean :106.9 Mean :2.545 Mean :1.013
## 3rd Qu.:110.0 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :160.0 Max. :6.000 Max. :5.000
##
## Sodium Fiber Carbo Sugars
## Min. : 0.0 Min. : 0.000 Min. : 5.0 Min. : 0.000
## 1st Qu.:130.0 1st Qu.: 1.000 1st Qu.:12.0 1st Qu.: 3.000
## Median :180.0 Median : 2.000 Median :14.5 Median : 7.000
## Mean :159.7 Mean : 2.152 Mean :14.8 Mean : 7.026
## 3rd Qu.:210.0 3rd Qu.: 3.000 3rd Qu.:17.0 3rd Qu.:11.000
## Max. :320.0 Max. :14.000 Max. :23.0 Max. :15.000
## NA's :1 NA's :1
## Potass Vitamins Weight Rating
## Min. : 15.00 Min. : 0.00 Min. :0.50 Min. :18.04
## 1st Qu.: 42.50 1st Qu.: 25.00 1st Qu.:1.00 1st Qu.:33.17
## Median : 90.00 Median : 25.00 Median :1.00 Median :40.40
## Mean : 98.67 Mean : 28.25 Mean :1.03 Mean :42.67
## 3rd Qu.:120.00 3rd Qu.: 25.00 3rd Qu.:1.00 3rd Qu.:50.83
## Max. :330.00 Max. :100.00 Max. :1.50 Max. :93.70
## NA's :2
This data set consists of 77 rows and 12 variables: Name, Rating, Calories, Protein, Fat, Sodium, Carbo, Sugars, Potass, Weight, Vitamins and Fiber. All variables are numeric for the exception of Name.
# check for missing data
colSums(is.na(data))
## Name Calories Protein Fat Sodium Fiber Carbo Sugars
## 0 0 0 0 0 0 1 1
## Potass Vitamins Weight Rating
## 2 0 0 0
data <- na.omit(data)
# add row names
rownames(data) <- data$Name
data <- data[-which(names(data) == "Rating")]
Three rows with missing data were identified and removed. Descriptive statistics, histograms and a correlation matrix of numeric variables were generated and basic exploratory data analysis provided insights about the shape, relationships and usability of the data for clustering. For this clustering task, Rating was removed from the set.
hist(data$Calories,
col = "#C9A8E1",
border = "black",
xlab = "Calories",
ylab = "Frequency",
main = "Histogram of Calories")
table(data$Calories)
##
## 50 70 80 90 100 110 120 130 140 150 160
## 3 2 1 7 15 28 10 2 3 2 1
Calories has a range of 110, with a minimum of 50 and max of 160. The shape of the histogram appears somewhat normal with the mode (110) lying in the middle of the data. The median is 110 and mean is 106.9.
hist(data$Carbo,
col = "#C9A8E1",
border = "black",
xlab = "Carbo",
ylab = "Frequency",
main = "Histogram of Carbo")
table(data$Carbo)
##
## 5 7 8 9 10 10.5 11 11.5 12 13 13.5 14 15 16 17 18
## 1 1 2 1 2 2 5 1 7 8 1 6 8 7 6 3
## 19 20 21 22 23
## 1 3 6 2 1
Carbo ranges from 5 to 23. The shape of the histogram appears somewhat normal with the highest frequency values (13 and 15) lying in the middle of the data. The median is 14.5 and mean is 14.73.
hist(data$Sugars,
col = "#C9A8E1",
border = "black",
xlab = "Sugars",
ylab = "Frequency",
main = "Histogram of Sugars")
table(data$Sugars)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 6 1 3 13 1 5 7 4 4 4 5 5 7 4 3 2
Sugars has a range of 0-15. There are two peaks in the data, around the 3 mark and 12 mark. The median is 7 and mean is 7.108.
cor(data[,-1], use="complete.obs")
## Calories Protein Fat Sodium Fiber
## Calories 1.00000000 0.03399166 0.5073732397 0.2962474981 -0.29521183
## Protein 0.03399166 1.00000000 0.2023533963 0.0115588913 0.51400610
## Fat 0.50737324 0.20235340 1.0000000000 0.0008219036 0.01403587
## Sodium 0.29624750 0.01155889 0.0008219036 1.0000000000 -0.07073492
## Fiber -0.29521183 0.51400610 0.0140358654 -0.0707349230 1.00000000
## Carbo 0.27060605 -0.03674326 -0.2849336855 0.3284091857 -0.37908370
## Sugars 0.56912054 -0.28658397 0.2871524866 0.0370589612 -0.15094850
## Potass -0.07136125 0.57874284 0.1996367171 -0.0394380876 0.91150392
## Vitamins 0.25984556 0.05479952 -0.0305139099 0.3315759640 -0.03871734
## Weight 0.69645215 0.23067141 0.2217141647 0.3125335701 0.24629218
## Carbo Sugars Potass Vitamins Weight
## Calories 0.27060605 0.569120535 -0.071361247 0.25984556 0.6964521
## Protein -0.03674326 -0.286583967 0.578742837 0.05479952 0.2306714
## Fat -0.28493369 0.287152487 0.199636717 -0.03051391 0.2217142
## Sodium 0.32840919 0.037058961 -0.039438088 0.33157596 0.3125336
## Fiber -0.37908370 -0.150948502 0.911503921 -0.03871734 0.2462922
## Carbo 1.00000000 -0.452069189 -0.365002934 0.25357897 0.1448053
## Sugars -0.45206919 1.000000000 0.001413982 0.07295438 0.4605471
## Potass -0.36500293 0.001413982 1.000000000 -0.00263583 0.4205615
## Vitamins 0.25357897 0.072954382 -0.002635830 1.00000000 0.3204348
## Weight 0.14480528 0.460547135 0.420561534 0.32043480 1.0000000
Fiber and Potass have a Pearson’s correlation coefficient of 0.912 which introduces potential issues of high multicollinearity. Weight and Calories is the next highest correlation at 0.696, but being less than 0.8 the relationship shouldn’t pose a problem in the modeling. Calories, Sugars, and Carbo were chosen in the final model.
# Specify the columns of interest
vars <- c("Calories", "Sugars", "Carbo")
# Customize the appearance of the plot with the custom color palette
ggpairs(data[, vars],
title = "Pairwise Plots: Calories, Sugars, Carbo",
lower = list(continuous = wrap("points", alpha = 0.5, color = custom_palette[1])),
upper = list(continuous = wrap("cor", size = 3)),
diag = list(continuous = wrap("barDiag", binwidth = 1, fill = custom_palette[1]))
) +
theme_minimal(base_size = 12) +
theme(
panel.background = element_rect(fill = "white"),
text = element_text(color = "black"),
axis.text = element_text(color = "black"),
strip.text = element_text(color = "black"),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
legend.position = "none"
)
Sample size: - With 77 rows of data, the small size might create unstable clusters. - A small sample size makes it challenging to implement test and train data sets.
Unbalanced variables: - Using the unbalanced variables could lead to certain features being overemphasized, - Unbalanced variables make it difficult for the algorithm to cluster data points outside of the more prominent values. - kmeans is sensitive to dominant values, smaller values could be seen as outliers and data points with these values might be harder to identify. - Possibly incorporate resampling.
Missing Features: - Would have been helpful to have features commonly found on nutrition lables such as saturated fat, trans fat, added sugar, etc. - Would have been helpful to have breakdown of each kind of Vitamin instead of total counts.
# vars: calories, carbo, sugars
clust_data <- dplyr:: select(data, Name, Calories, Carbo, Sugars)
rownames(clust_data) <- clust_data$Name
The choice to include Calories, Carbo and Sugars in the final model was influenced by the natural distribution and broad value range of the variables, guided by general nutrition knowledge. Protein and Fat were excluded from the model since cereals are mostly carb-heavy, and we kept Sugar for its significance in how people gauge food healthiness. Adding Calories gives a quick snapshot of a product’s nutrition that many consumers find handy.
# standardize data
z_clust_data_3 <- scale(clust_data[,-1]) # z standardize except fpr name colyumn
# silhoutte method
fviz_nbclust(z_clust_data_3, kmeans, method = "silhouette") + geom_vline(xintercept = 2, linetype = 2) +
labs(subtitle = "Silhouette Method")
Calories, Carbo and Sugars were standardized using z-scores and the silhouette method was used to determine the best number of clusters. The average silhouette scores of the first 10 k-means cluster results were plotted to see which value of k produces the largest width size. Results indicate the optimal number of clusters is either k = 2 or k = 4. These values were explored using the kmeans and eclust functions.
# clustering using eclust
set.seed(123)
km.4<- eclust(z_clust_data_3, "kmeans", k=4, graph = FALSE, hc_metric = "euclidean")
#km.4
# Assuming 'km.4' is your kmeans object
library(factoextra)
# Cluster plot with purple theme
fviz_cluster(km.4, geom = "point", ellipse.type = "norm", palette = custom_palette, ggtheme = theme_minimal() +
theme(panel.background = element_rect(fill = "white"),
text = element_text(color = "black"),
axis.text = element_text(color = "black"),
strip.text = element_text(color = "black"),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
legend.position = "none")) +
ggtitle("K-Means Cluster Plot")
# Cluster silhouette plot with purple theme
fviz_silhouette(km.4, palette = custom_palette, ggtheme = theme_minimal() +
theme(panel.background = element_rect(fill = "white"),
text = element_text(color = "black"),
axis.text = element_text(color = "black"),
strip.text = element_text(color = "black"),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5), )
) +
ggtitle("K-Means Silhouette Plot")
## cluster size ave.sil.width
## 1 1 5 0.52
## 2 2 7 0.45
## 3 3 32 0.49
## 4 4 30 0.44
The average silhouette widths for each cluster show an improvement from prior trials with values 0.52, 0.45, 0.49 and 0.44. In the cluster plot, we can see there are clear defined, clearly separated clusters with only slight overlap in comparison to previous trials. The average silhouette width is 0.47, which is on the cusp of being good evidence for meaningful clusters or an indication of a good fit.
set.seed(124)
km4_stan <- kmeans(z_clust_data_3, centers=4)
#km4_stan
# cluster sizes
table(km4_stan$cluster)
##
## 1 2 3 4
## 5 30 7 32
# cluster centers
clust_data$km4_stan <- km4_stan$cluster
km4_stan$centers
## Calories Carbo Sugars
## 1 -2.4706356 -1.5750879 -1.1259424
## 2 -0.2869242 0.7632371 -0.8659505
## 3 1.8775896 0.7301995 0.7617294
## 4 0.2443055 -0.6291584 0.8211288
# between sum of squares
km4_stan$betweenss
## [1] 160.3295
# within sum of squares
km4_stan$withinss
## [1] 5.611843 27.210434 5.503128 20.345098
# total sum of squares
km4_stan$totss
## [1] 219
# cluster assignments
rownames(clust_data)[clust_data$km4_stan == 1]
## [1] "100%_Bran " "All-Bran "
## [3] "All-Bran_with_Extra_Fiber " "Puffed_Rice "
## [5] "Puffed_Wheat "
rownames(clust_data)[clust_data$km4_stan == 2]
## [1] "Bran_Chex " "Bran_Flakes "
## [3] "Cheerios " "Corn_Chex "
## [5] "Corn_Flakes " "Crispix "
## [7] "Double_Chex " "Grape_Nuts_Flakes "
## [9] "Grape-Nuts " "Great_Grains_Pecan "
## [11] "Just_Right_Crunchy__Nuggets " "Kix "
## [13] "Maypo " "Multi-Grain_Cheerios "
## [15] "Nutri-grain_Wheat " "Product_19 "
## [17] "Quaker_Oat_Squares " "Raisin_Squares "
## [19] "Rice_Chex " "Rice_Krispies "
## [21] "Shredded_Wheat " "Shredded_Wheat_'n'Bran "
## [23] "Shredded_Wheat_spoon_size " "Special_K "
## [25] "Strawberry_Fruit_Wheats " "Total_Corn_Flakes "
## [27] "Total_Whole_Grain " "Triples "
## [29] "Wheat_Chex " "Wheaties "
rownames(clust_data)[clust_data$km4_stan == 3]
## [1] "Basic_4 " "Just_Right_Fruit_&_Nut "
## [3] "Muesli_Raisins,_Dates,_&_Almonds " "Muesli_Raisins,_Peaches,_&_Pecans "
## [5] "Mueslix_Crispy_Blend " "Nutri-Grain_Almond-Raisin "
## [7] "Total_Raisin_Bran "
rownames(clust_data)[clust_data$km4_stan == 4]
## [1] "100%_Natural_Bran "
## [2] "Apple_Cinnamon_Cheerios "
## [3] "Apple_Jacks "
## [4] "Cap'n'Crunch "
## [5] "Cinnamon_Toast_Crunch "
## [6] "Clusters "
## [7] "Cocoa_Puffs "
## [8] "Corn_Pops "
## [9] "Count_Chocula "
## [10] "Cracklin'_Oat_Bran "
## [11] "Crispy_Wheat_&_Raisins "
## [12] "Froot_Loops "
## [13] "Frosted_Flakes "
## [14] "Frosted_Mini-Wheats "
## [15] "Fruit_&_Fibre_Dates,_Walnuts,_and_Oats"
## [16] "Fruitful_Bran "
## [17] "Fruity_Pebbles "
## [18] "Golden_Crisp "
## [19] "Golden_Grahams "
## [20] "Honey_Graham_Ohs "
## [21] "Honey_Nut_Cheerios "
## [22] "Honey-comb "
## [23] "Life "
## [24] "Lucky_Charms "
## [25] "Nut&Honey_Crunch "
## [26] "Oatmeal_Raisin_Crisp "
## [27] "Post_Nat._Raisin_Bran "
## [28] "Raisin_Bran "
## [29] "Raisin_Nut_Bran "
## [30] "Smacks "
## [31] "Trix "
## [32] "Wheaties_Honey_Gold "
The total sum of squares cluster variability is 73.2%. The within-cluster sum of squares are all relatively low, at 5.61, 27.21, 5.5 and 20.35, suggesting the points within each cluster are close to the centroids. The between-sum of squares is 160.33, which is large compared to the within-cluster variation, suggesting good separation between clusters.
Despite setting different seeds, the results of eclust and kmeans functions produce the same results, both producing clusters with sizes 5, 7, 32 and 30 and identical cluster means. This is a good sign, indicating a consistent pattern has been identified, regardless of the initial cluster center placement.
To see if normalizing the data oppose to standardizing the data would have an effect on the clustering, the subset of data containing Carbo, Sugars and Calories was normalized and a silhouette plot was produced. The plot indicates using k = 2 or k = 4 clusters produces the largest average silhouette widths, suggesting the best fit.
min_max_norm <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
norm_clust_data <- as.data.frame(lapply(clust_data[2:4], min_max_norm))
summary(norm_clust_data)
## Calories Carbo Sugars
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.4545 1st Qu.:0.3889 1st Qu.:0.2000
## Median :0.5455 Median :0.5278 Median :0.4667
## Mean :0.5184 Mean :0.5405 Mean :0.4739
## 3rd Qu.:0.5455 3rd Qu.:0.6667 3rd Qu.:0.7333
## Max. :1.0000 Max. :1.0000 Max. :1.0000
set.seed(123)
km.4<- eclust(norm_clust_data, "kmeans", k=4, graph = FALSE, hc_metric = "euclidean")
# cluster plot
fviz_cluster(km.4, geom = "point", ellipse.type = "norm", palette = "custom_palette")
# silhoutte plot
fviz_silhouette(km.4, palette = "custom_palette")
## cluster size ave.sil.width
## 1 1 15 0.18
## 2 2 7 0.36
## 3 3 30 0.45
## 4 4 22 0.52
# cluster centers
table(km.4$cluster)
##
## 1 2 3 4
## 15 7 30 22
km.4$centers
## Calories Carbo Sugars
## 1 0.3151515 0.4037037 0.2977778
## 2 0.8571429 0.6984127 0.6952381
## 3 0.5696970 0.4018519 0.7311111
## 4 0.4793388 0.7727273 0.1727273
Normalizing the data seems to have brought the clusters closer to one another. For the most part, the original shape is retained. However, cluster 1 has appeared to have rotated, or is drawn more toward the middle. All four clusters are now overlapping.
set.seed(125)
km4_norm <- kmeans(norm_clust_data, centers=4)
#km4_norm
table(km4_norm$cluster)
##
## 1 2 3 4
## 17 22 5 30
clust_data$km4_norm <- km4_norm$cluster
km4_norm$centers
## Calories Carbo Sugars
## 1 0.48663102 0.5081699 0.4274510
## 2 0.49173554 0.7828283 0.1757576
## 3 0.07272727 0.2000000 0.1466667
## 4 0.63030303 0.4379630 0.7733333
km4_norm$withinss
## [1] 0.5049706 0.8206604 0.3180398 1.4855696
km4_norm$betweenss
## [1] 8.823832
km4_norm$totss
## [1] 11.95307
rownames(clust_data)[clust_data$km4_norm == 1]
## [1] "Basic_4 " "Bran_Chex "
## [3] "Bran_Flakes " "Clusters "
## [5] "Cracklin'_Oat_Bran " "Frosted_Mini-Wheats "
## [7] "Golden_Grahams " "Grape_Nuts_Flakes "
## [9] "Great_Grains_Pecan " "Just_Right_Crunchy__Nuggets "
## [11] "Life " "Multi-Grain_Cheerios "
## [13] "Quaker_Oat_Squares " "Raisin_Nut_Bran "
## [15] "Raisin_Squares " "Strawberry_Fruit_Wheats "
## [17] "Wheaties_Honey_Gold "
rownames(clust_data)[clust_data$km4_norm == 2]
## [1] "Cheerios " "Corn_Chex "
## [3] "Corn_Flakes " "Crispix "
## [5] "Double_Chex " "Grape-Nuts "
## [7] "Kix " "Maypo "
## [9] "Nutri-Grain_Almond-Raisin " "Nutri-grain_Wheat "
## [11] "Product_19 " "Rice_Chex "
## [13] "Rice_Krispies " "Shredded_Wheat "
## [15] "Shredded_Wheat_'n'Bran " "Shredded_Wheat_spoon_size "
## [17] "Special_K " "Total_Corn_Flakes "
## [19] "Total_Whole_Grain " "Triples "
## [21] "Wheat_Chex " "Wheaties "
rownames(clust_data)[clust_data$km4_norm == 3]
## [1] "100%_Bran " "All-Bran "
## [3] "All-Bran_with_Extra_Fiber " "Puffed_Rice "
## [5] "Puffed_Wheat "
rownames(clust_data)[clust_data$km4_norm == 4]
## [1] "100%_Natural_Bran "
## [2] "Apple_Cinnamon_Cheerios "
## [3] "Apple_Jacks "
## [4] "Cap'n'Crunch "
## [5] "Cinnamon_Toast_Crunch "
## [6] "Cocoa_Puffs "
## [7] "Corn_Pops "
## [8] "Count_Chocula "
## [9] "Crispy_Wheat_&_Raisins "
## [10] "Froot_Loops "
## [11] "Frosted_Flakes "
## [12] "Fruit_&_Fibre_Dates,_Walnuts,_and_Oats"
## [13] "Fruitful_Bran "
## [14] "Fruity_Pebbles "
## [15] "Golden_Crisp "
## [16] "Honey_Graham_Ohs "
## [17] "Honey_Nut_Cheerios "
## [18] "Honey-comb "
## [19] "Just_Right_Fruit_&_Nut "
## [20] "Lucky_Charms "
## [21] "Muesli_Raisins,_Dates,_&_Almonds "
## [22] "Muesli_Raisins,_Peaches,_&_Pecans "
## [23] "Mueslix_Crispy_Blend "
## [24] "Nut&Honey_Crunch "
## [25] "Oatmeal_Raisin_Crisp "
## [26] "Post_Nat._Raisin_Bran "
## [27] "Raisin_Bran "
## [28] "Smacks "
## [29] "Total_Raisin_Bran "
## [30] "Trix "
The cluster sizes changed from 5, 7, 30 and 32 in the standardized data to 17, 22, 5 and 30. The total sum of squares is 73.8% capturing slightly more variability than the standardized data. The between-sum of squares is 8.55 and within-cluster sum of squares is 0.6125, 0.679, 1.591 and 0.521, so comparatively, the between sum of squares is far larger. When ran multiple times in a row under different seeds, results end up slightly different each time which shows the pattern may not be as strong or consistent as it is with the standardized data, posing problems for interpretation and generalizability.
For further examination of the impact of normalization on the clustering, the cluster means and cereal names between the standardized and normalized data for variables Calories, Sugars and Carbo were compared:
set.seed(124)
# clustering with kmeans
km4_stan <- kmeans(z_clust_data_3, centers=4)
#km4_stan
# cluster sizes
table(km4_stan$cluster)
##
## 1 2 3 4
## 5 30 7 32
# cluster centers
clust_data$km4_stan <- km4_stan$cluster
km4_stan$centers
## Calories Carbo Sugars
## 1 -2.4706356 -1.5750879 -1.1259424
## 2 -0.2869242 0.7632371 -0.8659505
## 3 1.8775896 0.7301995 0.7617294
## 4 0.2443055 -0.6291584 0.8211288
# between sum of squares
km4_stan$betweenss
## [1] 160.3295
# within sum of squares
km4_stan$withinss
## [1] 5.611843 27.210434 5.503128 20.345098
# total sum of squares
km4_stan$totss
## [1] 219
# cluster assignments
rownames(clust_data)[clust_data$km4_stan == 1]
## [1] "100%_Bran " "All-Bran "
## [3] "All-Bran_with_Extra_Fiber " "Puffed_Rice "
## [5] "Puffed_Wheat "
rownames(clust_data)[clust_data$km4_stan == 2]
## [1] "Bran_Chex " "Bran_Flakes "
## [3] "Cheerios " "Corn_Chex "
## [5] "Corn_Flakes " "Crispix "
## [7] "Double_Chex " "Grape_Nuts_Flakes "
## [9] "Grape-Nuts " "Great_Grains_Pecan "
## [11] "Just_Right_Crunchy__Nuggets " "Kix "
## [13] "Maypo " "Multi-Grain_Cheerios "
## [15] "Nutri-grain_Wheat " "Product_19 "
## [17] "Quaker_Oat_Squares " "Raisin_Squares "
## [19] "Rice_Chex " "Rice_Krispies "
## [21] "Shredded_Wheat " "Shredded_Wheat_'n'Bran "
## [23] "Shredded_Wheat_spoon_size " "Special_K "
## [25] "Strawberry_Fruit_Wheats " "Total_Corn_Flakes "
## [27] "Total_Whole_Grain " "Triples "
## [29] "Wheat_Chex " "Wheaties "
rownames(clust_data)[clust_data$km4_stan == 3]
## [1] "Basic_4 " "Just_Right_Fruit_&_Nut "
## [3] "Muesli_Raisins,_Dates,_&_Almonds " "Muesli_Raisins,_Peaches,_&_Pecans "
## [5] "Mueslix_Crispy_Blend " "Nutri-Grain_Almond-Raisin "
## [7] "Total_Raisin_Bran "
rownames(clust_data)[clust_data$km4_stan == 4]
## [1] "100%_Natural_Bran "
## [2] "Apple_Cinnamon_Cheerios "
## [3] "Apple_Jacks "
## [4] "Cap'n'Crunch "
## [5] "Cinnamon_Toast_Crunch "
## [6] "Clusters "
## [7] "Cocoa_Puffs "
## [8] "Corn_Pops "
## [9] "Count_Chocula "
## [10] "Cracklin'_Oat_Bran "
## [11] "Crispy_Wheat_&_Raisins "
## [12] "Froot_Loops "
## [13] "Frosted_Flakes "
## [14] "Frosted_Mini-Wheats "
## [15] "Fruit_&_Fibre_Dates,_Walnuts,_and_Oats"
## [16] "Fruitful_Bran "
## [17] "Fruity_Pebbles "
## [18] "Golden_Crisp "
## [19] "Golden_Grahams "
## [20] "Honey_Graham_Ohs "
## [21] "Honey_Nut_Cheerios "
## [22] "Honey-comb "
## [23] "Life "
## [24] "Lucky_Charms "
## [25] "Nut&Honey_Crunch "
## [26] "Oatmeal_Raisin_Crisp "
## [27] "Post_Nat._Raisin_Bran "
## [28] "Raisin_Bran "
## [29] "Raisin_Nut_Bran "
## [30] "Smacks "
## [31] "Trix "
## [32] "Wheaties_Honey_Gold "
# clustering with eclust
set.seed(123)
km.4<- eclust(z_clust_data_3, "kmeans", k=4, graph = FALSE, hc_metric = "euclidean")
km.4$withinss
## [1] 5.611843 5.503128 20.345098 27.210434
km.4$betweenss
## [1] 160.3295
km.4$centers
## Calories Carbo Sugars
## 1 -2.4706356 -1.5750879 -1.1259424
## 2 1.8775896 0.7301995 0.7617294
## 3 0.2443055 -0.6291584 0.8211288
## 4 -0.2869242 0.7632371 -0.8659505
# Cluster silhouette plot with purple theme and legend
fviz_silhouette(km.4, palette = custom_palette)
## cluster size ave.sil.width
## 1 1 5 0.52
## 2 2 7 0.45
## 3 3 32 0.49
## 4 4 30 0.44
# Cluster plot with purple theme
fviz_cluster(km.4, geom = "point", ellipse.type = "norm", palette = custom_palette)
Cluster 1: Pure Grain Cereals
Cluster 2: Low-sugar Cereals
Cluster 3: Fruit & Nut cereals
Cluster 4: Sweet Treat Cereals
Small data sets pose a risk for overfitting. Due to the lack of data, it was not possible to assess this with the use of a testing and training set.
Insights must be interpreted cautiously. The average silhouette width for the final model is 0.47, which is only on the cusp of being significant evidence for a good fit. The insights from assessing cluster means may be better as grounds for further analyses and models, rather than definitive evidence of cereals having certain characteristics.