This project is an analysis of the McDonald's menu. It focuses on the link between the number of calories in each product and the quantity of sugar contained in it. The aim is to cluster the items of the menu in order to identify which products are potentially better for your health from the point of view of daily sugar intake.

DATA PREPARATION

setwd("C:/Users/Home/Documents/Erasmus Warsaw/projet/projet 1")
data<-read.csv("menu.csv", sep=",", dec=",", header=TRUE) 
# Select Variables
Data <-data[,c(4, 19)] 
Sugar <- as.numeric(as.character(Data[,2]))
Calories <- as.numeric(as.character(Data[,1]))
#create a new dataframe
Data <- data.frame (Calories, Sugar)
#rename the headers 
row.names(Data) <- data[,2]
names(Data) <- c('Calories', 'Sugar')
head(Data)
##                                  Calories Sugar
## Egg McMuffin                          300     3
## Egg White Delight                     250     3
## Sausage McMuffin                      370     2
## Sausage McMuffin with Egg             450     2
## Sausage McMuffin with Egg Whites      400     2
## Steak & Egg McMuffin                  430     3

First visualisation of the Data

Datastand<-scale(Data) 
dm<-dist(Datastand) 
HClustering <-hclust(dm, method="complete") 
plot(HClustering)   

On this dendrogram we can see that the "box of 40 McNuggets" is an outlier; to avoid biasing the analysis, this observation will be removed from the dataset.
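
As a small sketch (reusing the distance matrix dm computed above), the outlier can also be located programmatically rather than read off the dendrogram: the most isolated observation is the one with the largest mean distance to all others, which should point at the row 83 removed below.

# Most isolated observation: largest mean distance to all other points
iso <- rowMeans(as.matrix(dm))
which.max(iso)
rownames(Data)[which.max(iso)]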

# Delete the outlier (row 83, the box of 40 McNuggets)
data <- data[-83,]
Data <- Data[-83,]

Datastand<-scale(Data) 
dm<-dist(Datastand) 
HClustering <-hclust(dm, method="complete") 
plot(HClustering)   

PREDIAGNOSIS

Manual selection of optimal number of clusters

DataCentered <-center_scale(Data) 
# a) variance_explained 
opt<-Optimal_Clusters_KMeans(DataCentered, max_clusters=10, plot_clusters = TRUE)

# b) silhouette 
opt<-Optimal_Clusters_KMeans(DataCentered, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

# c) AIC
opt<-Optimal_Clusters_KMeans(DataCentered, max_clusters=10, plot_clusters=TRUE, criterion="AIC")

Criterion of explained variance: the optimal number of clusters would be either three or six, as the slope of the curve changes significantly at these values. Silhouette criterion: the number of clusters with the highest average silhouette is six. AIC criterion: the optimal number of clusters would once again be either three or six.
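
As a small sketch, the vector returned by Optimal_Clusters_KMeans (one criterion value per candidate k, as documented in ClusterR) can also be used to pick the number of clusters programmatically, here for the silhouette criterion computed last:

# For the silhouette criterion, higher is better
which.max(opt)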

Automatic selection of the optimal number of clusters

#With PAM
BestPamClust <-pamk(Data, krange=2:10,criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(Data, "dist"), critout=FALSE) 
plot(pam(Data, BestPamClust$nc))
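
For reference, the pamk result also exposes the selected number of clusters and the criterion values it compared (field names as documented in the fpc package):

BestPamClust$nc    # number of clusters selected by average silhouette width
BestPamClust$crit  # criterion value for each k in krange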

CLUSTERING

Using k-means

km1<-eclust(Data, "kmeans", hc_metric="euclidean",k=2)

km2<-eclust(Data, "kmeans", hc_metric="manhattan", k=3)

km3<-eclust(Data, "kmeans", hc_metric="euclidean", k=6)

Hopkins’ statistic

# clustertend::hopkins: H close to 0 suggests clusterable data
hopkins(Data, n=nrow(Data)-1)
## $H
## [1] 0.2379939
# NB: the second argument of get_clust_tendency is the number of points
# sampled to compute the statistic, not a number of clusters
get_clust_tendency(Data, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.2578084
## 
## $plot

get_clust_tendency(Data, 3, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.5238726
## 
## $plot

get_clust_tendency(Data, 6, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.1097989
## 
## $plot

The hopkins() function used here is implemented so that the closer the value is to 0, the more clusterable the dataset is (a value around 0.5 would correspond to uniformly distributed data). With a value of 0.24 we can reject the null hypothesis that the data is not clusterable. Note that the three get_clust_tendency() calls above differ in the number of sampled points, not in a number of clusters, which explains why their estimates fluctuate so strongly for such tiny samples.
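
As a sketch, a more stable estimate is obtained by letting get_clust_tendency sample almost the whole dataset instead of 2, 3 or 6 points:

get_clust_tendency(Data, n = nrow(Data) - 1, seed = 123)$hopkins_stat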

Calinski-Harabasz and Duda-Hart measures of clustering quality

Quality of clustering with Calinski-Harabasz index

KM2Clust <-kmeans(Data, 2) 
round(calinhara(Data,KM2Clust$cluster),digits=2)
## [1] 424.19
KM3Clust<-kmeans(Data, 3) 
round(calinhara(Data,KM3Clust$cluster),digits=2)
## [1] 511.52
KM6Clust<-kmeans(Data, 6) 
round(calinhara(Data,KM6Clust$cluster),digits=2)
## [1] 758.22

If we use the Calinski-Harabasz criterion to compare the different clustering solutions, the 6-cluster solution is the best one, as its index value is the highest.
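
A small sketch generalising this comparison: compute the Calinski-Harabasz index for every k from 2 to 10 (kmeans starts from random centroids, so a seed is fixed for reproducibility):

set.seed(123)
ch <- sapply(2:10, function(k) calinhara(Data, kmeans(Data, k)$cluster))
setNames(round(ch, 2), 2:10)   # index value per number of clusters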

Quality of clustering with Duda-Hart measure

# H0: homogeneity of the cluster (the data should not be split)
# H1: heterogeneity of the cluster (the data should be split into two)

KM2Clust <- kmeans(Data,2) 
dudahart2(Data,KM2Clust$cluster)$`p.value`
## [1] 1.058629e-10
KM3Clust <- kmeans(Data,3) 
dudahart2(Data,KM3Clust$cluster)$`p.value`
## [1] 0
KM6Clust <- kmeans(Data,6) 
dudahart2(Data,KM6Clust$cluster)$`p.value`
## [1] 0

The Duda-Hart statistic tests whether the data should be split. The test is significant for every solution considered, indicating heterogeneity of the clusters.
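
The dudahart2 result also contains a convenient boolean (field names as documented in fpc): cluster1 is TRUE when the null hypothesis of one homogeneous cluster cannot be rejected, i.e. when the data should not be split:

res <- dudahart2(Data, KM2Clust$cluster)
res$dh        # Duda-Hart statistic
res$cluster1  # should be FALSE here, consistent with the tiny p-value above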

Shadow statistics

Using 2 clusters
KMClust <-cclust(Data, 2 , dist="euclidean") 
plot(Data, col=predict(KMClust))
points(KMClust@centers, pch="x", cex=2, col=1) 

plot(KMClust)

shadow(KMClust)
##         1         2 
## 0.5257704 0.4205249
plot(shadow(KMClust))

Using 6 clusters
KMClust <-cclust(Data, 6 , dist="euclidean") 
plot(Data, col=predict(KMClust))
points(KMClust@centers, pch="x", cex=2, col=1) 

plot(KMClust)

shadow(KMClust)
##         1         2         3         4         5         6 
## 0.5036452 0.2133176 0.5439308 0.5431137 0.6060189 0.5644044
plot(shadow(KMClust))

The shadow value of a point is defined as s(x) = 2 d(x, c1) / (d(x, c1) + d(x, c2)), where c1 and c2 are its closest and second-closest centroids: values near 0 mean the point lies close to its own centroid, while values near 1 mean it is almost equidistant between two centroids. The average shadow values of around 0.5 obtained for both alternatives are therefore quite high, meaning many points are nearly as close to a neighbouring centroid as to their own, which indicates that the clusters are not strongly separated.

Rand index (ARI), Jaccard index, Fowlkes-Mallows index

d1<-cclust(Data, 2, dist="euclidean")
d2<-cclust(Data, 3 , dist="euclidean")
randIndex(d1, d2)
##      ARI 
## 0.410234
d2<-cclust(Data, 6 , dist="euclidean")
randIndex(d1, d2)
##       ARI 
## 0.3512126

The adjusted Rand index measures the agreement between two partitions: a value of 1 means identical clusterings, while values near 0 mean no more agreement than expected by chance. The moderate values above show that the 2-cluster solution agrees only partially with the 3-cluster and 6-cluster solutions.
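
The section title also mentions the Jaccard and Fowlkes-Mallows indices; a minimal base-R sketch computes both from the pair-counting contingency table of two partitions (pair_indices is a hypothetical helper written here, not a flexclust function):

pair_indices <- function(cl1, cl2) {
  tab <- table(cl1, cl2)
  a  <- sum(choose(tab, 2))            # pairs grouped together in both partitions
  ab <- sum(choose(rowSums(tab), 2))   # pairs grouped together in partition 1
  ac <- sum(choose(colSums(tab), 2))   # pairs grouped together in partition 2
  c(Jaccard = a / (ab + ac - a),
    FowlkesMallows = a / sqrt(ab * ac))
}
pair_indices(clusters(d1), clusters(d2))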

Do the categories of food match the clustering?

This last analysis is useful to understand whether the clustering results are related to the food category of the items.

nbrClusters <- 8
KMblaClust <- kmeans(Data, nbrClusters) 
data$clusters <- KMblaClust$cluster
Groupement <- data.frame ( )

for (clus in c(1:nbrClusters)) {
  dat<-subset(data, clusters == clus)
  NbrObservations <- count (dat)
  MainAliment <- count(dat, Category)
  MainAliment <- MainAliment[MainAliment$n == max(MainAliment$n),]
  ratio <- MainAliment$n / NbrObservations 
  newRow <- data.frame(MainAliment$Category, ratio)  
  names(newRow) <- c("Category", "Ratio")
  Groupement <- rbind (Groupement, newRow)
}

barplot(height = Groupement$Ratio, col = "lightblue",
        names.arg = Groupement$Category, las = 2)

This plot shows that the food category does not explain the clustering results: some food categories are designated several times as the top representative of a cluster, while other categories do not appear at all.
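
As a complementary check (a small sketch using the clusters column added above), the full cluster-by-category contingency table shows how every category spreads over the clusters rather than only the dominant one:

table(Cluster = data$clusters, Category = data$Category)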