library(tidyverse)
library(ggplot2)
library(GGally)
library(broom)
library(dplyr)
library(Hmisc)
library(olsrr)
library(psych)
library(gridExtra)
library(cluster)
library(factoextra)

## Warning: package 'factoextra' was built under R version 4.3.2

# Set a custom color palette with lower alpha (transparency)
custom_palette <- adjustcolor(viridis::viridis(3, direction = -1), alpha.f = 0.8)
custom_palette <- c(custom_palette, adjustcolor("#C9A8E1", alpha.f = 0.8))

Data

Data prep

data <- read.csv("Cereals_data.csv")

Summary statistics

str(data)

## 'data.frame':    77 obs. of  12 variables:
##  $ Name    : chr  "100%_Bran " "100%_Natural_Bran " "All-Bran " "All-Bran_with_Extra_Fiber " ...
##  $ Calories: int  70 120 70 50 110 110 110 130 90 90 ...
##  $ Protein : int  4 3 4 4 2 2 2 3 2 3 ...
##  $ Fat     : int  1 5 1 0 2 2 0 2 1 0 ...
##  $ Sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
##  $ Fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ Carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ Sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
##  $ Potass  : int  280 135 320 330 NA 70 30 100 125 190 ...
##  $ Vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
##  $ Weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
##  $ Rating  : num  68.4 34 59.4 93.7 34.4 ...

summary(data)

##      Name              Calories        Protein           Fat       
##  Length:77          Min.   : 50.0   Min.   :1.000   Min.   :0.000  
##  Class :character   1st Qu.:100.0   1st Qu.:2.000   1st Qu.:0.000  
##  Mode  :character   Median :110.0   Median :3.000   Median :1.000  
##                     Mean   :106.9   Mean   :2.545   Mean   :1.013  
##                     3rd Qu.:110.0   3rd Qu.:3.000   3rd Qu.:2.000  
##                     Max.   :160.0   Max.   :6.000   Max.   :5.000  
##                                                                    
##      Sodium          Fiber            Carbo          Sugars      
##  Min.   :  0.0   Min.   : 0.000   Min.   : 5.0   Min.   : 0.000  
##  1st Qu.:130.0   1st Qu.: 1.000   1st Qu.:12.0   1st Qu.: 3.000  
##  Median :180.0   Median : 2.000   Median :14.5   Median : 7.000  
##  Mean   :159.7   Mean   : 2.152   Mean   :14.8   Mean   : 7.026  
##  3rd Qu.:210.0   3rd Qu.: 3.000   3rd Qu.:17.0   3rd Qu.:11.000  
##  Max.   :320.0   Max.   :14.000   Max.   :23.0   Max.   :15.000  
##                                   NA's   :1      NA's   :1       
##      Potass          Vitamins          Weight         Rating     
##  Min.   : 15.00   Min.   :  0.00   Min.   :0.50   Min.   :18.04  
##  1st Qu.: 42.50   1st Qu.: 25.00   1st Qu.:1.00   1st Qu.:33.17  
##  Median : 90.00   Median : 25.00   Median :1.00   Median :40.40  
##  Mean   : 98.67   Mean   : 28.25   Mean   :1.03   Mean   :42.67  
##  3rd Qu.:120.00   3rd Qu.: 25.00   3rd Qu.:1.00   3rd Qu.:50.83  
##  Max.   :330.00   Max.   :100.00   Max.   :1.50   Max.   :93.70  
##  NA's   :2

This data set consists of 77 rows and 12 variables: Name, Rating, Calories, Protein, Fat, Sodium, Carbo, Sugars, Potass, Weight, Vitamins and Fiber. All variables are numeric for the exception of Name.

# check for missing data 
colSums(is.na(data))

##     Name Calories  Protein      Fat   Sodium    Fiber    Carbo   Sugars 
##        0        0        0        0        0        0        1        1 
##   Potass Vitamins   Weight   Rating 
##        2        0        0        0

data <- na.omit(data)

# add row names
rownames(data) <- data$Name

data <- data[-which(names(data) == "Rating")]

Three rows with missing data were identified and removed. Descriptive statistics, histograms and a correlation matrix of numeric variables were generated and basic exploratory data analysis provided insights about the shape, relationships and usability of the data for clustering. For this clustering task, Rating was removed from the set.

Histograms

hist(data$Calories, 
     col = "#C9A8E1",            
     border = "black",          
     xlab = "Calories",      
     ylab = "Frequency",
     main = "Histogram of Calories")

table(data$Calories)

## 
##  50  70  80  90 100 110 120 130 140 150 160 
##   3   2   1   7  15  28  10   2   3   2   1

Calories has a range of 110, with a minimum of 50 and max of 160. The shape of the histogram appears somewhat normal with the mode (110) lying in the middle of the data. The median is 110 and mean is 106.9.

hist(data$Carbo, 
     col = "#C9A8E1",            
     border = "black",          
     xlab = "Carbo",      
     ylab = "Frequency",        
     main = "Histogram of Carbo")

table(data$Carbo)

## 
##    5    7    8    9   10 10.5   11 11.5   12   13 13.5   14   15   16   17   18 
##    1    1    2    1    2    2    5    1    7    8    1    6    8    7    6    3 
##   19   20   21   22   23 
##    1    3    6    2    1

Carbo ranges from 5 to 23. The shape of the histogram appears somewhat normal with the highest frequency values (13 and 15) lying in the middle of the data. The median is 14.5 and mean is 14.73.

hist(data$Sugars, 
     col = "#C9A8E1",            
     border = "black",          
     xlab = "Sugars",      
     ylab = "Frequency",        
     main = "Histogram of Sugars")

table(data$Sugars)

## 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 
##  6  1  3 13  1  5  7  4  4  4  5  5  7  4  3  2

Sugars has a range of 0-15. There are two peaks in the data, around the 3 mark and 12 mark. The median is 7 and mean is 7.108.

Correlations Matrix

cor(data[,-1], use="complete.obs")

##             Calories     Protein           Fat        Sodium       Fiber
## Calories  1.00000000  0.03399166  0.5073732397  0.2962474981 -0.29521183
## Protein   0.03399166  1.00000000  0.2023533963  0.0115588913  0.51400610
## Fat       0.50737324  0.20235340  1.0000000000  0.0008219036  0.01403587
## Sodium    0.29624750  0.01155889  0.0008219036  1.0000000000 -0.07073492
## Fiber    -0.29521183  0.51400610  0.0140358654 -0.0707349230  1.00000000
## Carbo     0.27060605 -0.03674326 -0.2849336855  0.3284091857 -0.37908370
## Sugars    0.56912054 -0.28658397  0.2871524866  0.0370589612 -0.15094850
## Potass   -0.07136125  0.57874284  0.1996367171 -0.0394380876  0.91150392
## Vitamins  0.25984556  0.05479952 -0.0305139099  0.3315759640 -0.03871734
## Weight    0.69645215  0.23067141  0.2217141647  0.3125335701  0.24629218
##                Carbo       Sugars       Potass    Vitamins    Weight
## Calories  0.27060605  0.569120535 -0.071361247  0.25984556 0.6964521
## Protein  -0.03674326 -0.286583967  0.578742837  0.05479952 0.2306714
## Fat      -0.28493369  0.287152487  0.199636717 -0.03051391 0.2217142
## Sodium    0.32840919  0.037058961 -0.039438088  0.33157596 0.3125336
## Fiber    -0.37908370 -0.150948502  0.911503921 -0.03871734 0.2462922
## Carbo     1.00000000 -0.452069189 -0.365002934  0.25357897 0.1448053
## Sugars   -0.45206919  1.000000000  0.001413982  0.07295438 0.4605471
## Potass   -0.36500293  0.001413982  1.000000000 -0.00263583 0.4205615
## Vitamins  0.25357897  0.072954382 -0.002635830  1.00000000 0.3204348
## Weight    0.14480528  0.460547135  0.420561534  0.32043480 1.0000000

Fiber and Potass have a Pearson’s correlation coefficient of 0.912 which introduces potential issues of high multicollinearity. Weight and Calories is the next highest correlation at 0.696, but being less than 0.8 the relationship shouldn’t pose a problem in the modeling. Calories, Sugars, and Carbo were chosen in the final model.

# Specify the columns of interest
vars <- c("Calories", "Sugars", "Carbo")


# Customize the appearance of the plot with the custom color palette
ggpairs(data[, vars],
        title = "Pairwise Plots: Calories, Sugars, Carbo",
        lower = list(continuous = wrap("points", alpha = 0.5, color = custom_palette[1])),
        upper = list(continuous = wrap("cor", size = 3)),
        diag = list(continuous = wrap("barDiag", binwidth = 1, fill = custom_palette[1]))
) +
  theme_minimal(base_size = 12) +
  theme(
    panel.background = element_rect(fill = "white"),
    text = element_text(color = "black"),
    axis.text = element_text(color = "black"),
    strip.text = element_text(color = "black"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.position = "none"
  )

Data Limitations

Sample size: - With 77 rows of data, the small size might create unstable clusters. - A small sample size makes it challenging to implement test and train data sets.

Unbalanced variables: - Using the unbalanced variables could lead to certain features being overemphasized, - Unbalanced variables make it difficult for the algorithm to cluster data points outside of the more prominent values. - kmeans is sensitive to dominant values, smaller values could be seen as outliers and data points with these values might be harder to identify. - Possibly incorporate resampling.

Missing Features: - Would have been helpful to have features commonly found on nutrition lables such as saturated fat, trans fat, added sugar, etc. - Would have been helpful to have breakdown of each kind of Vitamin instead of total counts.

K-means clusters

Subsetting data

# vars: calories, carbo, sugars
clust_data <- dplyr:: select(data, Name, Calories, Carbo, Sugars)
rownames(clust_data) <- clust_data$Name

The choice to include Calories, Carbo and Sugars in the final model was influenced by the natural distribution and broad value range of the variables, guided by general nutrition knowledge. Protein and Fat were excluded from the model since cereals are mostly carb-heavy, and we kept Sugar for its significance in how people gauge food healthiness. Adding Calories gives a quick snapshot of a product’s nutrition that many consumers find handy.

Z-standardization

# standardize data
z_clust_data_3 <- scale(clust_data[,-1]) # z standardize except fpr name colyumn


# silhoutte method
fviz_nbclust(z_clust_data_3, kmeans, method = "silhouette") + geom_vline(xintercept = 2, linetype = 2) +
  labs(subtitle = "Silhouette Method")

Calories, Carbo and Sugars were standardized using z-scores and the silhouette method was used to determine the best number of clusters. The average silhouette scores of the first 10 k-means cluster results were plotted to see which value of k produces the largest width size. Results indicate the optimal number of clusters is either k = 2 or k = 4. These values were explored using the kmeans and eclust functions.

eclust function, k = 4

# clustering using eclust
set.seed(123)
km.4<- eclust(z_clust_data_3, "kmeans", k=4, graph = FALSE, hc_metric = "euclidean")
#km.4

# Assuming 'km.4' is your kmeans object

library(factoextra)


# Cluster plot with purple theme
fviz_cluster(km.4, geom = "point", ellipse.type = "norm", palette = custom_palette, ggtheme = theme_minimal() +
               theme(panel.background = element_rect(fill = "white"),
                     text = element_text(color = "black"),
                     axis.text = element_text(color = "black"),
                     strip.text = element_text(color = "black"),
                     plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
                     legend.position = "none")) +
  ggtitle("K-Means Cluster Plot")

# Cluster silhouette plot with purple theme
fviz_silhouette(km.4, palette = custom_palette, ggtheme = theme_minimal() +
                   theme(panel.background = element_rect(fill = "white"),
                         text = element_text(color = "black"),
                         axis.text = element_text(color = "black"),
                         strip.text = element_text(color = "black"),
                         plot.title = element_text(size = 16, face = "bold", hjust = 0.5), )
) +
  ggtitle("K-Means Silhouette Plot")

##   cluster size ave.sil.width
## 1       1    5          0.52
## 2       2    7          0.45
## 3       3   32          0.49
## 4       4   30          0.44

The average silhouette widths for each cluster show an improvement from prior trials with values 0.52, 0.45, 0.49 and 0.44. In the cluster plot, we can see there are clear defined, clearly separated clusters with only slight overlap in comparison to previous trials. The average silhouette width is 0.47, which is on the cusp of being good evidence for meaningful clusters or an indication of a good fit.

k-means function, k = 4

set.seed(124)
km4_stan <- kmeans(z_clust_data_3, centers=4)
#km4_stan 

# cluster sizes
table(km4_stan$cluster)

## 
##  1  2  3  4 
##  5 30  7 32

# cluster centers
clust_data$km4_stan <- km4_stan$cluster
km4_stan$centers

##     Calories      Carbo     Sugars
## 1 -2.4706356 -1.5750879 -1.1259424
## 2 -0.2869242  0.7632371 -0.8659505
## 3  1.8775896  0.7301995  0.7617294
## 4  0.2443055 -0.6291584  0.8211288

# between sum of squares 
km4_stan$betweenss

## [1] 160.3295

# within sum of squares
km4_stan$withinss

## [1]  5.611843 27.210434  5.503128 20.345098

# total sum of squares
km4_stan$totss

## [1] 219

# cluster assignments
rownames(clust_data)[clust_data$km4_stan == 1]

## [1] "100%_Bran "                 "All-Bran "                 
## [3] "All-Bran_with_Extra_Fiber " "Puffed_Rice "              
## [5] "Puffed_Wheat "

rownames(clust_data)[clust_data$km4_stan == 2]

##  [1] "Bran_Chex "                   "Bran_Flakes "                
##  [3] "Cheerios "                    "Corn_Chex "                  
##  [5] "Corn_Flakes "                 "Crispix "                    
##  [7] "Double_Chex "                 "Grape_Nuts_Flakes "          
##  [9] "Grape-Nuts "                  "Great_Grains_Pecan "         
## [11] "Just_Right_Crunchy__Nuggets " "Kix "                        
## [13] "Maypo "                       "Multi-Grain_Cheerios "       
## [15] "Nutri-grain_Wheat "           "Product_19 "                 
## [17] "Quaker_Oat_Squares "          "Raisin_Squares "             
## [19] "Rice_Chex "                   "Rice_Krispies "              
## [21] "Shredded_Wheat "              "Shredded_Wheat_'n'Bran "     
## [23] "Shredded_Wheat_spoon_size "   "Special_K "                  
## [25] "Strawberry_Fruit_Wheats "     "Total_Corn_Flakes "          
## [27] "Total_Whole_Grain "           "Triples "                    
## [29] "Wheat_Chex "                  "Wheaties "

rownames(clust_data)[clust_data$km4_stan == 3]

## [1] "Basic_4 "                           "Just_Right_Fruit_&_Nut "           
## [3] "Muesli_Raisins,_Dates,_&_Almonds "  "Muesli_Raisins,_Peaches,_&_Pecans "
## [5] "Mueslix_Crispy_Blend "              "Nutri-Grain_Almond-Raisin "        
## [7] "Total_Raisin_Bran "

rownames(clust_data)[clust_data$km4_stan == 4]

##  [1] "100%_Natural_Bran "                    
##  [2] "Apple_Cinnamon_Cheerios "              
##  [3] "Apple_Jacks "                          
##  [4] "Cap'n'Crunch "                         
##  [5] "Cinnamon_Toast_Crunch "                
##  [6] "Clusters "                             
##  [7] "Cocoa_Puffs "                          
##  [8] "Corn_Pops "                            
##  [9] "Count_Chocula "                        
## [10] "Cracklin'_Oat_Bran "                   
## [11] "Crispy_Wheat_&_Raisins "               
## [12] "Froot_Loops "                          
## [13] "Frosted_Flakes "                       
## [14] "Frosted_Mini-Wheats "                  
## [15] "Fruit_&_Fibre_Dates,_Walnuts,_and_Oats"
## [16] "Fruitful_Bran "                        
## [17] "Fruity_Pebbles "                       
## [18] "Golden_Crisp "                         
## [19] "Golden_Grahams "                       
## [20] "Honey_Graham_Ohs "                     
## [21] "Honey_Nut_Cheerios "                   
## [22] "Honey-comb "                           
## [23] "Life "                                 
## [24] "Lucky_Charms "                         
## [25] "Nut&Honey_Crunch "                     
## [26] "Oatmeal_Raisin_Crisp "                 
## [27] "Post_Nat._Raisin_Bran "                
## [28] "Raisin_Bran "                          
## [29] "Raisin_Nut_Bran "                      
## [30] "Smacks "                               
## [31] "Trix "                                 
## [32] "Wheaties_Honey_Gold "

The total sum of squares cluster variability is 73.2%. The within-cluster sum of squares are all relatively low, at 5.61, 27.21, 5.5 and 20.35, suggesting the points within each cluster are close to the centroids. The between-sum of squares is 160.33, which is large compared to the within-cluster variation, suggesting good separation between clusters.

Despite setting different seeds, the results of eclust and kmeans functions produce the same results, both producing clusters with sizes 5, 7, 32 and 30 and identical cluster means. This is a good sign, indicating a consistent pattern has been identified, regardless of the initial cluster center placement.

Min-max normalization

To see if normalizing the data oppose to standardizing the data would have an effect on the clustering, the subset of data containing Carbo, Sugars and Calories was normalized and a silhouette plot was produced. The plot indicates using k = 2 or k = 4 clusters produces the largest average silhouette widths, suggesting the best fit.

min_max_norm <- function(x) {
    (x - min(x)) / (max(x) - min(x))
  }

norm_clust_data <- as.data.frame(lapply(clust_data[2:4], min_max_norm))
summary(norm_clust_data)

##     Calories          Carbo            Sugars      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.4545   1st Qu.:0.3889   1st Qu.:0.2000  
##  Median :0.5455   Median :0.5278   Median :0.4667  
##  Mean   :0.5184   Mean   :0.5405   Mean   :0.4739  
##  3rd Qu.:0.5455   3rd Qu.:0.6667   3rd Qu.:0.7333  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

eclust function: k = 4

set.seed(123)
km.4<- eclust(norm_clust_data, "kmeans", k=4, graph = FALSE, hc_metric = "euclidean")

# cluster plot
fviz_cluster(km.4, geom = "point", ellipse.type = "norm", palette = "custom_palette")

# silhoutte plot
fviz_silhouette(km.4, palette = "custom_palette")

##   cluster size ave.sil.width
## 1       1   15          0.18
## 2       2    7          0.36
## 3       3   30          0.45
## 4       4   22          0.52

# cluster centers
table(km.4$cluster)

## 
##  1  2  3  4 
## 15  7 30 22

km.4$centers

##    Calories     Carbo    Sugars
## 1 0.3151515 0.4037037 0.2977778
## 2 0.8571429 0.6984127 0.6952381
## 3 0.5696970 0.4018519 0.7311111
## 4 0.4793388 0.7727273 0.1727273

Normalizing the data seems to have brought the clusters closer to one another. For the most part, the original shape is retained. However, cluster 1 has appeared to have rotated, or is drawn more toward the middle. All four clusters are now overlapping.

k-means function: k = 4

set.seed(125)
km4_norm <- kmeans(norm_clust_data, centers=4)
#km4_norm
table(km4_norm$cluster)

## 
##  1  2  3  4 
## 17 22  5 30

clust_data$km4_norm <- km4_norm$cluster
km4_norm$centers

##     Calories     Carbo    Sugars
## 1 0.48663102 0.5081699 0.4274510
## 2 0.49173554 0.7828283 0.1757576
## 3 0.07272727 0.2000000 0.1466667
## 4 0.63030303 0.4379630 0.7733333

km4_norm$withinss

## [1] 0.5049706 0.8206604 0.3180398 1.4855696

km4_norm$betweenss

## [1] 8.823832

km4_norm$totss

## [1] 11.95307

rownames(clust_data)[clust_data$km4_norm == 1]

##  [1] "Basic_4 "                     "Bran_Chex "                  
##  [3] "Bran_Flakes "                 "Clusters "                   
##  [5] "Cracklin'_Oat_Bran "          "Frosted_Mini-Wheats "        
##  [7] "Golden_Grahams "              "Grape_Nuts_Flakes "          
##  [9] "Great_Grains_Pecan "          "Just_Right_Crunchy__Nuggets "
## [11] "Life "                        "Multi-Grain_Cheerios "       
## [13] "Quaker_Oat_Squares "          "Raisin_Nut_Bran "            
## [15] "Raisin_Squares "              "Strawberry_Fruit_Wheats "    
## [17] "Wheaties_Honey_Gold "

rownames(clust_data)[clust_data$km4_norm == 2]

##  [1] "Cheerios "                  "Corn_Chex "                
##  [3] "Corn_Flakes "               "Crispix "                  
##  [5] "Double_Chex "               "Grape-Nuts "               
##  [7] "Kix "                       "Maypo "                    
##  [9] "Nutri-Grain_Almond-Raisin " "Nutri-grain_Wheat "        
## [11] "Product_19 "                "Rice_Chex "                
## [13] "Rice_Krispies "             "Shredded_Wheat "           
## [15] "Shredded_Wheat_'n'Bran "    "Shredded_Wheat_spoon_size "
## [17] "Special_K "                 "Total_Corn_Flakes "        
## [19] "Total_Whole_Grain "         "Triples "                  
## [21] "Wheat_Chex "                "Wheaties "

rownames(clust_data)[clust_data$km4_norm == 3]

## [1] "100%_Bran "                 "All-Bran "                 
## [3] "All-Bran_with_Extra_Fiber " "Puffed_Rice "              
## [5] "Puffed_Wheat "

rownames(clust_data)[clust_data$km4_norm == 4]

##  [1] "100%_Natural_Bran "                    
##  [2] "Apple_Cinnamon_Cheerios "              
##  [3] "Apple_Jacks "                          
##  [4] "Cap'n'Crunch "                         
##  [5] "Cinnamon_Toast_Crunch "                
##  [6] "Cocoa_Puffs "                          
##  [7] "Corn_Pops "                            
##  [8] "Count_Chocula "                        
##  [9] "Crispy_Wheat_&_Raisins "               
## [10] "Froot_Loops "                          
## [11] "Frosted_Flakes "                       
## [12] "Fruit_&_Fibre_Dates,_Walnuts,_and_Oats"
## [13] "Fruitful_Bran "                        
## [14] "Fruity_Pebbles "                       
## [15] "Golden_Crisp "                         
## [16] "Honey_Graham_Ohs "                     
## [17] "Honey_Nut_Cheerios "                   
## [18] "Honey-comb "                           
## [19] "Just_Right_Fruit_&_Nut "               
## [20] "Lucky_Charms "                         
## [21] "Muesli_Raisins,_Dates,_&_Almonds "     
## [22] "Muesli_Raisins,_Peaches,_&_Pecans "    
## [23] "Mueslix_Crispy_Blend "                 
## [24] "Nut&Honey_Crunch "                     
## [25] "Oatmeal_Raisin_Crisp "                 
## [26] "Post_Nat._Raisin_Bran "                
## [27] "Raisin_Bran "                          
## [28] "Smacks "                               
## [29] "Total_Raisin_Bran "                    
## [30] "Trix "

The cluster sizes changed from 5, 7, 30 and 32 in the standardized data to 17, 22, 5 and 30. The total sum of squares is 73.8% capturing slightly more variability than the standardized data. The between-sum of squares is 8.55 and within-cluster sum of squares is 0.6125, 0.679, 1.591 and 0.521, so comparatively, the between sum of squares is far larger. When ran multiple times in a row under different seeds, results end up slightly different each time which shows the pattern may not be as strong or consistent as it is with the standardized data, posing problems for interpretation and generalizability.

For further examination of the impact of normalization on the clustering, the cluster means and cereal names between the standardized and normalized data for variables Calories, Sugars and Carbo were compared:

Final Model

kmeans function

set.seed(124)

# clustering with kmeans
km4_stan <- kmeans(z_clust_data_3, centers=4)
#km4_stan 

# cluster sizes
table(km4_stan$cluster)

## 
##  1  2  3  4 
##  5 30  7 32

# cluster centers
clust_data$km4_stan <- km4_stan$cluster
km4_stan$centers

##     Calories      Carbo     Sugars
## 1 -2.4706356 -1.5750879 -1.1259424
## 2 -0.2869242  0.7632371 -0.8659505
## 3  1.8775896  0.7301995  0.7617294
## 4  0.2443055 -0.6291584  0.8211288

# between sum of squares 
km4_stan$betweenss

## [1] 160.3295

# within sum of squares
km4_stan$withinss

## [1]  5.611843 27.210434  5.503128 20.345098

# total sum of squares
km4_stan$totss

## [1] 219

# cluster assignments
rownames(clust_data)[clust_data$km4_stan == 1]

## [1] "100%_Bran "                 "All-Bran "                 
## [3] "All-Bran_with_Extra_Fiber " "Puffed_Rice "              
## [5] "Puffed_Wheat "

rownames(clust_data)[clust_data$km4_stan == 2]

##  [1] "Bran_Chex "                   "Bran_Flakes "                
##  [3] "Cheerios "                    "Corn_Chex "                  
##  [5] "Corn_Flakes "                 "Crispix "                    
##  [7] "Double_Chex "                 "Grape_Nuts_Flakes "          
##  [9] "Grape-Nuts "                  "Great_Grains_Pecan "         
## [11] "Just_Right_Crunchy__Nuggets " "Kix "                        
## [13] "Maypo "                       "Multi-Grain_Cheerios "       
## [15] "Nutri-grain_Wheat "           "Product_19 "                 
## [17] "Quaker_Oat_Squares "          "Raisin_Squares "             
## [19] "Rice_Chex "                   "Rice_Krispies "              
## [21] "Shredded_Wheat "              "Shredded_Wheat_'n'Bran "     
## [23] "Shredded_Wheat_spoon_size "   "Special_K "                  
## [25] "Strawberry_Fruit_Wheats "     "Total_Corn_Flakes "          
## [27] "Total_Whole_Grain "           "Triples "                    
## [29] "Wheat_Chex "                  "Wheaties "

rownames(clust_data)[clust_data$km4_stan == 3]

## [1] "Basic_4 "                           "Just_Right_Fruit_&_Nut "           
## [3] "Muesli_Raisins,_Dates,_&_Almonds "  "Muesli_Raisins,_Peaches,_&_Pecans "
## [5] "Mueslix_Crispy_Blend "              "Nutri-Grain_Almond-Raisin "        
## [7] "Total_Raisin_Bran "

rownames(clust_data)[clust_data$km4_stan == 4]

##  [1] "100%_Natural_Bran "                    
##  [2] "Apple_Cinnamon_Cheerios "              
##  [3] "Apple_Jacks "                          
##  [4] "Cap'n'Crunch "                         
##  [5] "Cinnamon_Toast_Crunch "                
##  [6] "Clusters "                             
##  [7] "Cocoa_Puffs "                          
##  [8] "Corn_Pops "                            
##  [9] "Count_Chocula "                        
## [10] "Cracklin'_Oat_Bran "                   
## [11] "Crispy_Wheat_&_Raisins "               
## [12] "Froot_Loops "                          
## [13] "Frosted_Flakes "                       
## [14] "Frosted_Mini-Wheats "                  
## [15] "Fruit_&_Fibre_Dates,_Walnuts,_and_Oats"
## [16] "Fruitful_Bran "                        
## [17] "Fruity_Pebbles "                       
## [18] "Golden_Crisp "                         
## [19] "Golden_Grahams "                       
## [20] "Honey_Graham_Ohs "                     
## [21] "Honey_Nut_Cheerios "                   
## [22] "Honey-comb "                           
## [23] "Life "                                 
## [24] "Lucky_Charms "                         
## [25] "Nut&Honey_Crunch "                     
## [26] "Oatmeal_Raisin_Crisp "                 
## [27] "Post_Nat._Raisin_Bran "                
## [28] "Raisin_Bran "                          
## [29] "Raisin_Nut_Bran "                      
## [30] "Smacks "                               
## [31] "Trix "                                 
## [32] "Wheaties_Honey_Gold "

# clustering with eclust
set.seed(123)
km.4<- eclust(z_clust_data_3, "kmeans", k=4, graph = FALSE, hc_metric = "euclidean")

km.4$withinss

## [1]  5.611843  5.503128 20.345098 27.210434

km.4$betweenss

## [1] 160.3295

km.4$centers

##     Calories      Carbo     Sugars
## 1 -2.4706356 -1.5750879 -1.1259424
## 2  1.8775896  0.7301995  0.7617294
## 3  0.2443055 -0.6291584  0.8211288
## 4 -0.2869242  0.7632371 -0.8659505

# Cluster silhouette plot with purple theme and legend
fviz_silhouette(km.4, palette = custom_palette)

##   cluster size ave.sil.width
## 1       1    5          0.52
## 2       2    7          0.45
## 3       3   32          0.49
## 4       4   30          0.44

# Cluster plot with purple theme
fviz_cluster(km.4, geom = "point", ellipse.type = "norm", palette = custom_palette)

Cluster 1: Pure Grain Cereals

From the cluster plot, we can see cluster 1 is most clearly separated from the other clusters. It has the largest average silhouette width, 0.52 and a low within-sum of squares, 5.612 suggesting sufficient “tightness” within the cluster compared to the between sum of squares, 160.33. Cluster 1 contains cereals with lower than average Calories (-2.47), Sugars (-1.126) and Carbo (-1.57). There are five cereals in this cluster. Cereals containing bran are in this cluster, as well as Puffed Rice and Puffed Wheat. Generally, these cereals are known for being healthy choices to most cereals, with simple ingredients, so it makes sense they would rank low in these three areas and be grouped as such.

Cluster 2: Low-sugar Cereals

Cluster 2 contains 30 cereals with lower than average Calories (-0.287) and Sugars (-0.86) but higher than average Carbo (0.763). Some of these cereals are Corn Chex, Bran Chex, Grape Nuts, Corn Flakes and Cheerios. The within sum of squares is 27.21 which despite being the largest, is still relatively small compared to the between sum of squares. From the cluster plot, we see this cluster does have the more overlap with other clusters than any of the other 3. There is one potential misclassification shown in the silouette plot, which is likely the same point from the cluster plot, sitting between cluster 1 and cluster 4.

Cluster 3: Fruit & Nut cereals

Cluster 3 contains cereals with higher than average Calories (1.877), Sugars (0.73) and Carbo (0.762). The 7 cereals in this cluster all contain some kind of dried fruit or nuts. Cluster 2 has an average silhouette width of 0.45, which means it is almost sufficient evidence for being a well-fitting cluster. The within sum of squares is 5.50, only slightly lower than that for cluster 1 (5.61), which is represented in the slight overlap with clusters 3 and 4 that cluster 1 does not have.

Cluster 4: Sweet Treat Cereals

Cluster 4 contains cereals with slightly above average Calories (0.244), the highest average Sugars (0.821) and lower than average Carbo (-0.629). There are 32 cereals in this cluster. These cereals have the most in common with the low sugar cereals, represented by the overlap between clusters 3 and 4. This is interesting considering the center means are almost the inverse of each other (0.244 and -0.287 for calories, -0.629 and 0.763 for Carbo and 0.821 and -0.867 for Sugars), but perhaps the close-to-center means for Calories in each of these clusters makes it more challenging for the algorithm to extract the differences. The within sum of squares for cluster 3 is 20.345, which is third largest.

Data limitations

Small data sets pose a risk for overfitting. Due to the lack of data, it was not possible to assess this with the use of a testing and training set.
Insights must be interpreted cautiously. The average silhouette width for the final model is 0.47, which is only on the cusp of being significant evidence for a good fit. The insights from assessing cluster means may be better as grounds for further analyses and models, rather than definitive evidence of cereals having certain characteristics.

Cereal Characteristics: Cluster Analysis

2023-12-12