This project analyses the McDonald's menu. It focuses on the link between the number of calories in each product and the quantity of sugar it contains. The aim is to cluster the menu items in order to identify which products are potentially better for your health from the point of view of daily sugar intake.
setwd("C:/Users/Home/Documents/Erasmus Warsaw/projet/projet 1")
data<-read.csv("menu.csv", sep=",", dec=",", header=TRUE)
# Select variables: Calories (column 4) and Sugars (column 19)
Data <- data[, c(4, 19)]
Sugar <- as.numeric(as.character(Data[, 2]))
Calories <- as.numeric(as.character(Data[, 1]))
# Create a new data frame
Data <- data.frame(Calories, Sugar)
# Use the item names (column 2) as row labels and rename the headers
row.names(Data) <- data[, 2]
names(Data) <- c("Calories", "Sugar")
head(Data)
## Calories Sugar
## Egg McMuffin 300 3
## Egg White Delight 250 3
## Sausage McMuffin 370 2
## Sausage McMuffin with Egg 450 2
## Sausage McMuffin with Egg Whites 400 2
## Steak & Egg McMuffin 430 3
# Standardise the variables, compute Euclidean distances, and cluster with complete linkage
Datastand <- scale(Data)
dm <- dist(Datastand)
HClustering <- hclust(dm, method = "complete")
plot(HClustering)
On this dendrogram we can see that the 40-piece box of McNuggets is an outlier; to avoid biasing the analysis, this observation is removed from the dataset.
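Rather than hardcoding the row index (as done below), the outlier can be looked up through the row names assigned earlier; a minimal sketch, assuming the item label in menu.csv is exactly "Chicken McNuggets (40 piece)":
outlier <- which(row.names(Data) == "Chicken McNuggets (40 piece)")  # hypothetical label, must match the Item column
data <- data[-outlier, ]
Data <- Data[-outlier, ]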
# Delete the outlier (row 83) and redo the clustering without it
data <- data[-83, ]
Data <- Data[-83, ]
Datastand <- scale(Data)
dm <- dist(Datastand)
HClustering <- hclust(dm, method = "complete")
plot(HClustering)
# Centre and scale the data with ClusterR, then compare three criteria for the number of clusters
DataCentered <- center_scale(Data)
# a) variance explained
opt <- Optimal_Clusters_KMeans(DataCentered, max_clusters = 10, plot_clusters = TRUE)
# b) silhouette
opt <- Optimal_Clusters_KMeans(DataCentered, max_clusters = 10, plot_clusters = TRUE, criterion = "silhouette")
# c) AIC
opt <- Optimal_Clusters_KMeans(DataCentered, max_clusters = 10, plot_clusters = TRUE, criterion = "AIC")
Explained variance criterion: the optimal number of clusters would be either three or six, as the slope of the curve changes noticeably at these values. Silhouette criterion: the number of clusters with the highest value is six. AIC criterion: the optimal number of clusters would once again be either three or six.
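As a cross-check of the explained-variance criterion, the classic elbow plot can be reproduced with base kmeans on the standardised data; a minimal sketch, where the seed and nstart = 25 are arbitrary choices, not part of the original analysis:
set.seed(123)
# Total within-cluster sum of squares for k = 1..10; look for the "elbow"
wss <- sapply(1:10, function(k) kmeans(Datastand, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")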
# With PAM: pamk picks the best k in 2:10 by average silhouette width ("asw")
BestPamClust <- pamk(Data, krange = 2:10, criterion = "asw", usepam = TRUE, scaling = FALSE, alpha = 0.001, diss = inherits(Data, "dist"), critout = FALSE)
plot(pam(Data, BestPamClust$nc))
Using k-means:
km1 <- eclust(Data, "kmeans", hc_metric = "euclidean", k = 2)
km2 <- eclust(Data, "kmeans", hc_metric = "manhattan", k = 3)  # hc_metric only matters for hierarchical methods; k-means itself is Euclidean
km3 <- eclust(Data, "kmeans", hc_metric = "euclidean", k = 6)
# Hopkins statistic for clustering tendency (clustertend implementation)
hopkins(Data, n = nrow(Data) - 1)
## $H
## [1] 0.2379939
get_clust_tendency(Data, 2 , graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.2578084
##
## $plot
get_clust_tendency(Data, 3 , graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.5238726
##
## $plot
get_clust_tendency(Data, 6 , graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.1097989
##
## $plot
The Hopkins statistic here is implemented in the opposite way to the textbook definition: the closer the value is to 0, the more clusterable the dataset, while values around 0.5 indicate no structure. With a value of 0.24 we can reject the null hypothesis that the data is not significantly clusterable. Note that the second argument of get_clust_tendency is the number of sampled points rather than a number of clusters, so the three values above are based on very small samples; the run with n = 6 gives the lowest value, 0.11.
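To see what an unclusterable baseline looks like, the same statistic can be computed on uniform noise spanning the observed range of each variable; a minimal sketch, with an arbitrary seed:
set.seed(123)
# Uniform noise over the range of each column: by construction, no cluster structure
noise <- as.data.frame(apply(Data, 2, function(x) runif(nrow(Data), min(x), max(x))))
hopkins(noise, n = nrow(noise) - 1)  # expected near 0.5 with this implementation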
# Calinski-Harabasz index for k = 2, 3 and 6 (higher is better)
KM2Clust <- kmeans(Data, 2)
round(calinhara(Data,KM2Clust$cluster),digits=2)
## [1] 424.19
KM3Clust<-kmeans(Data, 3)
round(calinhara(Data,KM3Clust$cluster),digits=2)
## [1] 511.52
KM6Clust<-kmeans(Data, 6)
round(calinhara(Data,KM6Clust$cluster),digits=2)
## [1] 758.22
If we use the Calinski-Harabasz criterion to compare the candidate clustering solutions, the 6-cluster solution is the best one, as its statistic is the highest.
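Rather than checking three values of k by hand, the index can be computed over the whole range; a minimal sketch, where the seed and nstart = 25 are arbitrary assumptions:
set.seed(123)
# Calinski-Harabasz index for every k in 2:10
ch <- sapply(2:10, function(k) calinhara(Data, kmeans(Data, k, nstart = 25)$cluster))
names(ch) <- 2:10
round(ch, 2)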
# H0: homogeneity of cluster
# H1: heterogeneity of cluster
KM2Clust <- kmeans(Data,2)
dudahart2(Data,KM2Clust$cluster)$`p.value`
## [1] 1.058629e-10
KM3Clust <- kmeans(Data,3)
dudahart2(Data,KM3Clust$cluster)$p.value
## [1] 0
KM6Clust <- kmeans(Data,6)
dudahart2(Data,KM6Clust$cluster)$p.value
## [1] 0
The Duda-Hart statistic tests whether the data should be split. The test is significant for every candidate solution, supporting the heterogeneity of the clusters. Note, however, that dudahart2 is designed for a split into two clusters, so the p-values for the 3- and 6-cluster partitions should be read with caution.
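For the 2-cluster solution, the full test output can also be inspected; a minimal sketch, with an arbitrary seed:
set.seed(123)
dh <- dudahart2(Data, kmeans(Data, 2)$cluster)
dh$dh        # Duda-Hart statistic
dh$compare   # critical value it is compared against
dh$cluster1  # TRUE would mean the one-cluster (homogeneity) hypothesis is not rejected
dh$p.value   # a small p-value rejects homogeneity, i.e. the data should be split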
KMClust <- cclust(Data, 2, dist = "euclidean")
plot(Data, col=predict(KMClust))
points(KMClust@centers, pch="x", cex=2, col=1)
plot(KMClust)
shadow(KMClust)
## 1 2
## 0.5257704 0.4205249
plot(shadow(KMClust))
KMClust <- cclust(Data, 6, dist = "euclidean")
plot(Data, col=predict(KMClust))
points(KMClust@centers, pch="x", cex=2, col=1)
plot(KMClust)
shadow(KMClust)
## 1 2 3 4 5 6
## 0.5036452 0.2133176 0.5439308 0.5431137 0.6060189 0.5644044
plot(shadow(KMClust))
A shadow value is close to 0 when a point lies near its cluster centroid and close to 1 when it is almost equidistant between the two closest centroids. The values here are mostly around 0.5 for both solutions, so the points are not particularly close to their centroids and the clusters are only moderately separated; cluster 2 of the 6-cluster solution, at 0.21, is the most compact.
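As a complementary separation measure, average silhouette widths can be computed for the same partitions with the cluster package; a minimal sketch reusing the k-means objects fitted above:
# Average silhouette width (closer to 1 = better separated)
sil2 <- silhouette(KM2Clust$cluster, dist(Data))
sil6 <- silhouette(KM6Clust$cluster, dist(Data))
summary(sil2)$avg.width
summary(sil6)$avg.width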
# Compare partitions with the adjusted Rand index (ARI)
d1 <- cclust(Data, 2, dist = "euclidean")
d2 <- cclust(Data, 3, dist = "euclidean")
randIndex(d1, d2)
## ARI
## 0.410234
d2 <- cclust(Data, 6, dist = "euclidean")
randIndex(d1, d2)
## ARI
## 0.3512126
The adjusted Rand index measures the agreement between two partitions (1 means identical partitions, values near 0 mean agreement no better than chance). The 2-cluster solution agrees only moderately with the 3-cluster solution (0.41) and even less with the 6-cluster one (0.35).
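The same comparison can be run on plain kmeans partitions via a contingency table, which also shows how the clusters map onto each other; a minimal sketch with an arbitrary seed:
set.seed(123)
k2 <- kmeans(Data, 2)$cluster
k6 <- kmeans(Data, 6)$cluster
tab <- table(k2, k6)  # how the 2-cluster partition splits across the 6 clusters
tab
randIndex(tab)        # randIndex also accepts a contingency table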
This last analysis is useful to understand whether the clustering results are related to the food category of the items.
nbrClusters <- 8
KMblaClust <- kmeans(Data, nbrClusters)
data$clusters <- KMblaClust$cluster
Groupement <- data.frame()
for (clus in 1:nbrClusters) {
  dat <- subset(data, clusters == clus)
  NbrObservations <- nrow(dat)                # cluster size
  MainAliment <- count(dat, Category)         # items per food category in this cluster
  MainAliment <- MainAliment[MainAliment$n == max(MainAliment$n), ]
  ratio <- MainAliment$n / NbrObservations    # share of the dominant category
  newRow <- data.frame(Category = MainAliment$Category, Ratio = ratio)
  Groupement <- rbind(Groupement, newRow)
}
barplot(height = Groupement$Ratio, col = "lightblue", names.arg = Groupement$Category, las = 2)
This plot suggests that the food category does not explain the clustering results: some categories are designated several times as the top representative of a cluster, while other categories never appear.
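A direct way to check this is to cross-tabulate the food category against the cluster assignment; a minimal sketch:
# Contingency table: food categories vs. cluster labels
table(data$Category, data$clusters)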