In this study, I will attempt to implement the best clustering approach, taking into account the variables used to determine wine quality. The dataset consists of 13 variables for 1599 different wines. The data used in this analysis is the WineQuality dataset from the following website: https://www.kaggle.com/danielpanizzo/wine-quality
Clustering is one of the most important fields in unsupervised learning; it deals with identifying subgroups in a collection of unlabeled data. These subgroups consist of items that are similar to one another within a subgroup but dissimilar to items in other subgroups. The clusters formed in this study will be presented in several visualization plots. This study aims to determine the optimal number of clusters and to study the similarity of the variables within the allocated clusters.
Before the study was conducted, the libraries listed below were loaded and the dataset was imported directly into RStudio from a local file.
#install.packages("cluster")
library(cluster)
library(ggplot2)
## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.
#install.packages("factoextra")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#install.packages("flexclust")
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
#install.packages("gridExtra")
library(gridExtra)
#install.packages("fpc")
library(fpc)
#install.packages("clustertend")
library(clustertend)
## Package `clustertend` is deprecated. Use package `hopkins` instead.
library(corrplot)
## corrplot 0.92 loaded
library(pillar)
Wine <- read.csv("C:\\Users\\User\\Desktop\\wineQualityReds.csv")
summary(Wine)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## type
## Length:1599
## Class :character
## Mode :character
##
##
##
Each column is on a different scale, so it is difficult to compare the columns. It is therefore important that the columns are standardized to a mean of zero and a standard deviation of one. Working in units of standard deviation makes it easier to compare across the different columns. However, before this can be done, the column “type” is set aside because it is already given; we will only consider the characteristics used to determine the type of the wine (A, B or C), thus essentially 3 groups. The “type” column is removed by assigning NULL to it. The first column, X, is only a row index, so it is also dropped before scaling.
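wine.q <- Wine                 # keep a copy that still contains the type column
Wine$type <- NULL              # drop the label column before clustering
View(wine.q)                   # opens the data viewer (interactive sessions only)
wine.stand <- scale(Wine[-1])  # drop the row index X, then standardize each column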
head(wine.stand)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## [1,] -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## [2,] -0.2984541 1.9668271 -1.391037 0.04340257 0.22380518
## [3,] -0.2984541 1.2966596 -1.185699 -0.16937425 0.09632273
## [4,] 1.6543385 -1.3840105 1.483689 -0.45307667 -0.26487754
## [5,] -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## [6,] -0.5281944 0.7381867 -1.391037 -0.52400227 -0.26487754
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## [1,] -0.46604672 -0.3790141 0.55809987 1.2882399 -0.57902538
## [2,] 0.87236532 0.6241680 0.02825193 -0.7197081 0.12891007
## [3,] -0.08364328 0.2289750 0.13422152 -0.3310730 -0.04807379
## [4,] 0.10755844 0.4113718 0.66406945 -0.9787982 -0.46103614
## [5,] -0.46604672 -0.3790141 0.55809987 1.2882399 -0.57902538
## [6,] -0.27484500 -0.1966174 0.55809987 1.2882399 -0.57902538
## alcohol
## [1,] -0.9599458
## [2,] -0.5845942
## [3,] -0.5845942
## [4,] -0.5845942
## [5,] -0.9599458
## [6,] -0.9599458
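As a quick check on what scale() does, the sketch below standardizes a small made-up matrix by hand and compares the result with scale(). The matrix `toy` and its column names are illustrative only and are not part of the wine data.

```r
# Toy matrix (hypothetical values, not the wine data)
toy <- cbind(a = c(1, 2, 3, 4, 10), b = c(100, 150, 90, 300, 210))

# scale() subtracts each column mean and divides by each column's sample sd
scaled <- scale(toy)

# The same transformation written out by hand
manual <- sweep(sweep(toy, 2, colMeans(toy), "-"), 2, apply(toy, 2, sd), "/")

# Every standardized column has mean 0 and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# The two versions agree apart from the attributes that scale() attaches
all.equal(manual, scaled, check.attributes = FALSE)
```

After this step a one-unit difference means "one standard deviation" in every column, which is what makes Euclidean distances between wines meaningful for k-means.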
We use k-means clustering, which requires the number of clusters to be chosen in advance. Since the wines are labeled with three types (A, B and C), we begin with 3 clusters. The algorithm therefore computes 3 means and allocates each row to the closest mean. The means and allocations are adjusted repeatedly until the result can no longer be improved.
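To make the assign-and-update loop concrete, here is a minimal sketch of the basic (Lloyd) iteration on made-up two-dimensional data. It is a simplified illustration only: R's kmeans() uses the Hartigan-Wong variant by default, and the data `x`, the starting centers and the seed are all assumptions chosen for the example.

```r
set.seed(1)  # arbitrary seed so the toy example is repeatable

# Two well-separated toy groups of 20 points each (hypothetical data)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))
k <- 2
centers <- x[c(1, nrow(x)), , drop = FALSE]  # one starting center from each end

for (iter in 1:100) {
  # Squared Euclidean distance from every row to every center (n x k matrix)
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  nearest <- max.col(-d, ties.method = "first")  # index of the closest center
  # Recompute each center as the mean of the rows assigned to it
  new_centers <- t(sapply(seq_len(k), function(j)
    colMeans(x[nearest == j, , drop = FALSE])))
  if (all(abs(new_centers - centers) < 1e-12)) break  # centers stopped moving
  centers <- new_centers
}

table(nearest)  # number of rows allocated to each cluster
centers         # final cluster means
```

The sketch does not handle empty clusters or multiple random starts, both of which kmeans() takes care of.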
results <- kmeans(wine.stand,3)
attributes(results)
## $names
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
##
## $class
## [1] "kmeans"
We can see that the attributes of the clustering result include size, the total sum of squares (totss), withinss, etc., but we are interested in the cluster and centers only. The centers, one row per cluster with one value per variable, can be viewed below:
results$centers
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 0.9884753 -0.68136597 1.0168389 0.03009531 0.269597010
## 2 -0.0958790 0.03722766 0.0971056 0.40100180 -0.004974135
## 3 -0.6495260 0.46255701 -0.7687501 -0.22734608 -0.188033956
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 -0.4611634 -0.4737959 0.4324145 -0.7347242 0.5598374
## 2 1.0766153 1.3342828 0.2786982 -0.1843720 -0.1957198
## 3 -0.2272437 -0.3507257 -0.4489129 0.6141559 -0.2951957
## alcohol
## 1 0.28587974
## 2 -0.52148613
## 3 0.06588483
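The centers above are expressed in standardized units. One possible way to read them in the original measurement units is to undo the scaling with the means and standard deviations that scale() stores as attributes. The sketch below shows this on a made-up data frame; `toy`, its columns and the seed are assumptions for illustration.

```r
# Toy data on very different scales (hypothetical, not the wine data)
toy <- data.frame(acid   = c(7.1, 7.4, 7.8, 11.2, 11.5, 10.9),
                  sulfur = c(30, 34, 28, 60, 67, 54))
toy.stand <- scale(toy)

set.seed(42)  # arbitrary seed so the toy run is repeatable
km <- kmeans(toy.stand, centers = 2, nstart = 10)

# scale() keeps the column means and sds it used as attributes
mu    <- attr(toy.stand, "scaled:center")
sigma <- attr(toy.stand, "scaled:scale")

# Undo the standardization: original = standardized * sd + mean
centers.orig <- sweep(sweep(km$centers, 2, sigma, "*"), 2, mu, "+")
centers.orig
```

Applied to the wine data, the same two sweep() calls on results$centers with the attributes of wine.stand would give the cluster means in the original units.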
Now we can compare the clusters with the original wine types. This is done by cross-tabulating the type column of the initial data against the cluster assignments.
table(wine.q$type,results$cluster)
##
## 1 2 3
## A 123 12 57
## B 141 243 307
## C 245 115 356
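One rough way to summarize how well the clusters line up with the types is to look at column proportions and at cluster purity. The sketch below re-enters the counts from the table above by hand; purity is a measure introduced here for illustration and is not part of the original analysis.

```r
# Counts copied from the table above (rows: type, columns: cluster)
tab <- matrix(c(123,  12,  57,
                141, 243, 307,
                245, 115, 356),
              nrow = 3, byrow = TRUE,
              dimnames = list(type = c("A", "B", "C"),
                              cluster = c("1", "2", "3")))

# Share of each cluster taken up by each wine type (each column sums to 1)
round(prop.table(tab, margin = 2), 2)

# Purity: fraction of wines whose type is the most common type in their cluster
purity <- sum(apply(tab, 2, max)) / sum(tab)
purity
```

Because k-means starts from random centers, a rerun may number the clusters differently, so the columns of a fresh table can appear in a different order.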
From the above k-means clustering, it cannot be concluded whether the predetermined number of clusters is appropriate; thus further supporting methods are needed.
We can use the packages loaded earlier to find a better-suited number of clusters for the dataset.
fviz_nbclust(wine.stand, kmeans,method = "wss")
fviz_nbclust(wine.stand, kmeans,method = "silhouette")
fviz_nbclust(wine.stand, kmeans,method = "gap_stat")
## Warning: did not converge in 10 iterations
With the first method, “wss” (the total within-cluster sum of squares), we can find the optimum number of clusters using the elbow (or knee) rule: we look for the number of clusters after which the sum of squares no longer decreases much. From the line graph, the optimal number of clusters appears to be 4 or 5.
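The elbow curve can also be computed by hand, which makes it clear what is being plotted. The following is a minimal sketch on made-up data standing in for wine.stand; the data `x`, the range of k and the nstart value are assumptions.

```r
set.seed(123)  # arbitrary seed for a repeatable toy example

# Three toy groups of 30 points each in 2-D (hypothetical data)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2),
           cbind(rnorm(30, mean = 0), rnorm(30, mean = 6)))

# Total within-cluster sum of squares for k = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# The elbow is the point where the curve starts to flatten
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For k = 1 the value equals the total sum of squares of the data, and adding clusters can only reduce it, which is why the rule looks for a bend rather than a minimum.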
On the other hand, according to the second method, “silhouette”, the optimum number of clusters would be 2, because the optimum is the number of clusters that maximizes the average silhouette width.
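The average silhouette width can likewise be computed directly with silhouette() from the cluster package. This is a sketch on made-up data; `x`, the range of k and the seed are assumptions.

```r
library(cluster)  # provides silhouette()

set.seed(123)  # arbitrary seed for a repeatable toy example

# Two toy groups of 30 points each in 2-D (hypothetical data)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))
d <- dist(x)  # pairwise Euclidean distances, reused for every k

# Average silhouette width for k = 2..6 (not defined for a single cluster)
ks <- 2:6
avg.sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# Number of clusters with the largest average silhouette width
ks[which.max(avg.sil)]
```

Silhouette widths lie between -1 and 1; values near 1 indicate points that sit well inside their own cluster and far from the neighbouring one.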
Lastly, with the “gap_stat” method, the optimum number of clusters would be 10 as it maximizes the gap statistic.
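The gap statistic can be inspected more closely by calling clusGap() from the cluster package directly. The sketch below runs on made-up data; `x`, K.max, B and the kmeans() settings are assumptions. Raising iter.max above the kmeans() default of 10 is one possible way to avoid the "did not converge in 10 iterations" warning.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(123)  # arbitrary seed for a repeatable toy example

# Two toy groups of 30 points each in 2-D (hypothetical data)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))

# iter.max and nstart are passed through to kmeans()
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50,
               iter.max = 50, nstart = 20, verbose = FALSE)

gap$Tab[, "gap"]             # gap statistic for k = 1..6
which.max(gap$Tab[, "gap"])  # k with the largest gap

# Alternative rule that also takes the simulation standard error into account
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```

The global maximum and the standard-error-based rule do not always agree, so it can be worth comparing both before settling on a number of clusters.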
Taking the above methods into consideration, the “silhouette” result of 2 clusters is chosen as the basis for the cluster plot below, because it gives less overlap between the clusters than the alternatives.
fviz_cluster(kmeans(wine.stand, centers = 2, iter.max = 100, nstart = 100), data = wine.stand)
We can also visualize the relationship between the wine variables and the allocated clusters:
library("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::dim_desc() masks pillar::dim_desc()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
clusters <- kmeans(wine.stand, centers = 2, iter.max = 100, nstart = 100)
wine.c <- wine.q |> mutate(cluster = clusters$cluster)
wine.c |> ggplot(aes(x=fixed.acidity, y=pH, col=as.factor(cluster))) + geom_point()
From the plot above, it is evident that wines with a lower fixed.acidity and higher pH were mostly assigned to cluster 1, while wines with a lower pH and higher fixed.acidity were mostly assigned to cluster 2. The same plot can be drawn for each pair of variables in order to determine the similarity within each cluster.
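As a numeric complement to plotting each variable, the per-cluster mean of every variable can be tabulated in one step. The sketch below uses a small made-up stand-in for wine.c; the data frame `wine.toy` and its values are hypothetical.

```r
# Toy stand-in for wine.c (hypothetical values and cluster labels)
wine.toy <- data.frame(
  fixed.acidity = c(7.0, 7.4, 7.2, 11.0, 11.6, 10.8),
  pH            = c(3.50, 3.45, 3.55, 3.10, 3.05, 3.15),
  cluster       = c(1, 1, 1, 2, 2, 2)
)

# Mean of every remaining variable within each cluster
cluster.means <- aggregate(. ~ cluster, data = wine.toy, FUN = mean)
cluster.means
```

With the real data, non-numeric columns such as type would need to be dropped first, since a mean is not defined for them.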
Through the analysis of the different clustering methods, we were able to determine the optimum number of clusters that made the dataset visually presentable and easy to understand.