Cluster analysis is an unsupervised learning technique that finds similarities between data points and groups similar objects into clusters. In this study we will cluster food products based on nutritional content with three different clustering methods.

The data is retrieved from openfoodfacts.org.

Getting and Cleaning Data

The products2.csv file contains data on food products sold in the US, retrieved from openfoodfacts.org. The first two columns are character (column 1 is the product ID, column 2 the product name); the remaining 48 columns are numeric and describe nutritional content.

#Load libraries
library(cluster)
library(NbClust)
library(ggplot2)
library(MASS)
library(dendextend)
library(dplyr)
library(kohonen)
library(heatmaply)

ProductData <- read.csv("products2.csv", na.strings = "undefined",
                        colClasses = c("character", "character", rep("numeric", 48)))
# Drop the product ID column; product names will serve as identifiers
ProductData$prodid <- NULL
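
A quick sanity check on the loaded data frame (illustrative, not part of the original script):

# Expect 49 remaining columns: prodname plus 48 nutritional features
dim(ProductData)
head(names(ProductData))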

Row and Column Cleaning

# Row cleaning: keep products with fewer than 35 NA's
ProdDataSubset <- ProductData[rowSums(is.na(ProductData)) < 35, ]

# Column cleaning: keep columns with fewer than 20 NA's
Enames <- sapply(ProdDataSubset, function(x) sum(is.na(x)) < 20)
ProdDataSubset <- ProdDataSubset[, which(Enames)]

# Remove duplicates
CleanProductData <- ProdDataSubset %>% distinct(prodname, .keep_all = TRUE)

# Set NA's to zero
CleanProductData[is.na(CleanProductData)] <- 0

# Set productname as row names
rownames(CleanProductData) <- CleanProductData[,1]
CleanProductData$prodname <- NULL

Scaling Data

# Scale the data (center and standardize each column)
ScaledProdData <- scale(CleanProductData)
# Zero-variance columns become NaN after scaling; set them to zero
ScaledProdData[is.na(ScaledProdData)] <- 0
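
The NA's appear because columns with zero variance scale to NaN; a quick way to see how many columns are affected (illustrative):

# Count columns whose standard deviation is zero before scaling
sum(sapply(CleanProductData, sd) == 0)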

K-Means Partitioning

In the k-means method, the user must specify in advance how many clusters the data should be divided into. NbClust computes a range of validity indices to suggest the optimal number of clusters.

# What is the best number of clusters?
bestK <- NbClust(ScaledProdData, min.nc=2, max.nc=15, method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 5 proposed 9 as the best number of clusters 
## * 3 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

Once the number of clusters is fixed, the algorithm places a centroid for each cluster and assigns every object to its nearest centroid. The centroids are then recomputed as the means of their assigned objects, and the process repeats iteratively until the assignments no longer change.
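
To make the iteration concrete, here is a minimal k-means sketch in base R (illustrative only: it does not handle empty clusters or multiple random restarts, both of which the kmeans call below takes care of).

# Minimal k-means iteration sketch (illustrative)
set.seed(1)
X <- ScaledProdData
k <- 2
centers <- X[sample(nrow(X), k), , drop = FALSE]  # random products as initial centroids
repeat {
  # Squared Euclidean distance from every product to each centroid
  d <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))
  nearest <- max.col(-d)  # index of the nearest centroid per product
  # Recompute each centroid as the mean of its assigned products
  newcenters <- t(sapply(1:k, function(j) colMeans(X[nearest == j, , drop = FALSE])))
  if (all(abs(newcenters - centers) < 1e-8)) break  # stop when centroids settle
  centers <- newcenters
}
table(nearest)  # cluster sizes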

According to the majority rule, the best number of clusters is 2, so we will partition the products into two clusters with the kmeans function.

KMeansClusters <- kmeans(ScaledProdData, 2, nstart = 25)
# Plot the first two scaled features, colored by cluster
plot(ScaledProdData, col = KMeansClusters$cluster)

# clusplot projects the data onto the first two principal components
clusplot(ScaledProdData, KMeansClusters$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 0, cex = 0.5)
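
As an optional quality check (not part of the original output), the average silhouette width can be computed with the already-loaded cluster package; values close to 1 indicate well-separated clusters:

# Average silhouette width of the 2-cluster solution
sil <- silhouette(KMeansClusters$cluster, dist(ScaledProdData))
summary(sil)$avg.width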

# Which products are in which cluster (first 10 products)
rownames(ScaledProdData)[KMeansClusters$cluster == 1][1:10]
##  [1] " Nissin Demae Five Spices Artificial Beef Flavor"
##  [2] " Nissin Demae Artificial Chicken Flavor"         
##  [3] " Shortbread Cookies"                             
##  [4] " Primo Taglio Genoa Salame"                      
##  [5] " Nutella"                                        
##  [6] " Creamy Peanut Butter"                           
##  [7] " Milk Chocolate Peanut Butter Cups"              
##  [8] " Coconut oil"                                    
##  [9] " Whoppers"                                       
## [10] " Macaroni N' Cheese"
rownames(ScaledProdData)[KMeansClusters$cluster == 2][1:10]
##  [1] " undefined"                                         
##  [2] " Almond milk chocolate"                             
##  [3] " Chocolate chip cookies"                            
##  [4] " Quaker Oats 100% Natural Whole Grain OLD FASHIONED"
##  [5] " Rosemary & Olive Oil"                              
##  [6] " Organic Chocolate Almond Non-Dairy Beverage"       
##  [7] " Prosciutto"                                        
##  [8] " Organic lemon tea"                                 
##  [9] " Chocolate Chunk & Chip"                            
## [10] " Jimmy Chips - BBQ Flavored "

Hierarchical Agglomerative Clustering Model with agnes

In the hierarchical clustering approach, the objects with the least dissimilarity are merged step by step, producing a nested hierarchy of partitions. The dendrogram can be cut at any desired level, and each connected group of objects then forms a cluster (see the cutree sketch after the dendrogram plot below).

## Decrease the size of the dataset by keeping only products with fewer than 25 NA's instead of 35
# Row cleaning: keep products with fewer than 25 NA's
ProdDataSubset <- ProductData[rowSums(is.na(ProductData)) < 25, ]
# Column cleaning: keep columns with fewer than 20 NA's
Enames <- sapply(ProdDataSubset, function(x) sum(is.na(x)) < 20)
ProdDataSubset <- ProdDataSubset[, which(Enames)]
# Remove duplicates
CleanProductData <- ProdDataSubset %>% distinct(prodname, .keep_all = TRUE)
# Set NA's to zero
CleanProductData[is.na(CleanProductData)] <- 0
# Set productname as row names
rownames(CleanProductData) <- CleanProductData[,1]
CleanProductData$prodname <- NULL
# Scale the data
ScaledProdData <- scale(CleanProductData)
ScaledProdData[is.na(ScaledProdData)] <- 0
HierClusters <- agnes(ScaledProdData, method = "complete", metric = "euclidean")
# which.plots = 2 selects the dendrogram
plot(HierClusters, which.plots=2, cex = 0.5)
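
To turn the hierarchy into a flat partition, the tree can be cut at any chosen level; here is a minimal sketch with cutree (the choice of k = 2 is just an example, mirroring the k-means result):

# Convert the agnes object to hclust and cut into 2 flat clusters
HierTree <- as.hclust(HierClusters)
TreeClusters <- cutree(HierTree, k = 2)
table(TreeClusters)  # cluster sizes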

Building a Heatmap Using agnes as the Clustering Algorithm

The dynamic heatmap is combined with the dendrogram produced by the agnes algorithm. The y-axis represents the food names and the x-axis holds the different features, such as fiber serving and vitamin A serving. Looking at the color pattern of each column gives us some idea of the general product features. For example, for the data cluster shown in the figure below, the vitamin D column is darker than the fiber serving column, which means this product cluster has better values for fiber serving than for vitamin D.

# heatmaply computes the distance matrix itself (Euclidean by default)
# and passes it to hclustfun
heatmaply(ScaledProdData,
          hclustfun = function(d) as.hclust(agnes(d, method = "ward")),
          margins = c(150, 200))

Rectangular areas with the same colors give us an idea of the patterns in the data. For example, for the chocolate chip and the chocolate chip peanut crunch products, most values have similar colors (except magnesium serving), which is why they are placed in the same cluster. In the first overall graph, the upper part has more light and dark colors compared to the lower part. If we check the dendrogram, we see that these parts are divided into two different clusters. The upper cluster, which includes foods such as chocolate chip and chocolate mix, has larger fat values and smaller vitamin and salt values compared to the lower cluster.

Building a Kohonen Self-Organizing Map
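
A self-organizing map projects the high-dimensional nutritional profiles onto a small two-dimensional grid of nodes, so that products with similar nutritional content are mapped to the same or neighboring nodes.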

# Create a Kohonen SOM on a 5 x 4 hexagonal grid
statesom <- som(ScaledProdData, grid = somgrid(5, 4, "hexagonal"))

# Plot map
plot(statesom, type="mapping", labels = rownames(ScaledProdData), cex=0.5)

# Plot the codebook vectors (the feature profile of each node)
plot(statesom, type = "codes")