Wine Type Clustering

Karen T. Gurupira

Introduction

In this study, I will attempt to implement the best clustering approach, taking into account the variables used to determine wine quality. The dataset consists of 13 variables for 1599 different wines. The data used in this analysis is the WineQuality dataset from the following website: https://www.kaggle.com/danielpanizzo/wine-quality

Clustering is an important field in unsupervised learning that deals with identifying subgroups in a collection of unlabeled data. Each subgroup comprises items that are similar to one another but dissimilar to items in other subgroups. The clusters formed in this study are presented through several visualization plots. The study aims to determine the optimal number of clusters and to examine how similar the variables are within the allocated clusters.

Data Retrieval

Before the analysis, the libraries below were loaded and the dataset was imported directly into RStudio from a local file.

#install.packages("cluster")
library(cluster)
library(ggplot2)

## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.

#install.packages("factoextra")
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#install.packages("flexclust")
library(flexclust)

## Loading required package: grid

## Loading required package: lattice

## Loading required package: modeltools

## Loading required package: stats4

#install.packages("gridExtra")
library(gridExtra)
#install.packages("fpc")
library(fpc)
#install.packages("clustertend")
library(clustertend)

## Package `clustertend` is deprecated.  Use package `hopkins` instead.

library(corrplot)

## corrplot 0.92 loaded

library(pillar)

Wine <- read.csv("C:\\Users\\User\\Desktop\\wineQualityReds.csv")

summary(Wine)

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median : 2.200   Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##      type          
##  Length:1599       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Each column is on a different scale, so the columns are difficult to compare directly. It is therefore essential to standardize each column to a mean of zero and a standard deviation of one, so that values are expressed in standard deviations and can be compared across columns. Before this is done, the “type” column is set aside because it is already given: we only consider the characteristics used to determine the type of the wine (A, B or C), which gives essentially 3 groups. A copy of the full data is kept in wine.q, the “type” column is removed from Wine by assigning NULL to it, and the first column (the row index X) is dropped before scaling.

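wine.q <- Wine                  # keep a copy that still contains the type column
Wine$type <- NULL               # remove the type label before clustering
View(wine.q)

wine.stand <- scale(Wine[-1])   # drop the index column X, then standardize
head(wine.stand)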

##      fixed.acidity volatile.acidity citric.acid residual.sugar   chlorides
## [1,]    -0.5281944        0.9615758   -1.391037    -0.45307667 -0.24363047
## [2,]    -0.2984541        1.9668271   -1.391037     0.04340257  0.22380518
## [3,]    -0.2984541        1.2966596   -1.185699    -0.16937425  0.09632273
## [4,]     1.6543385       -1.3840105    1.483689    -0.45307667 -0.26487754
## [5,]    -0.5281944        0.9615758   -1.391037    -0.45307667 -0.24363047
## [6,]    -0.5281944        0.7381867   -1.391037    -0.52400227 -0.26487754
##      free.sulfur.dioxide total.sulfur.dioxide    density         pH   sulphates
## [1,]         -0.46604672           -0.3790141 0.55809987  1.2882399 -0.57902538
## [2,]          0.87236532            0.6241680 0.02825193 -0.7197081  0.12891007
## [3,]         -0.08364328            0.2289750 0.13422152 -0.3310730 -0.04807379
## [4,]          0.10755844            0.4113718 0.66406945 -0.9787982 -0.46103614
## [5,]         -0.46604672           -0.3790141 0.55809987  1.2882399 -0.57902538
## [6,]         -0.27484500           -0.1966174 0.55809987  1.2882399 -0.57902538
##         alcohol
## [1,] -0.9599458
## [2,] -0.5845942
## [3,] -0.5845942
## [4,] -0.5845942
## [5,] -0.9599458
## [6,] -0.9599458
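To make the standardization step concrete, the sketch below checks on a small made-up data frame that scale() gives the same result as subtracting each column mean and dividing by each column standard deviation. The data frame toy and its values are assumptions for illustration only and are not taken from the wine dataset.

```r
# Toy data frame standing in for the wine measurements (values are made up).
toy <- data.frame(alcohol = c(9.4, 9.8, 10.5, 12.0),
                  pH      = c(3.51, 3.20, 3.26, 3.16))

# Manual z-score: subtract each column mean, then divide by each column sd.
centred <- sweep(as.matrix(toy), 2, colMeans(toy), "-")
manual  <- sweep(centred, 2, apply(toy, 2, sd), "/")

# scale() performs the same centring and scaling by default.
auto <- scale(toy)

print(all.equal(manual, auto, check.attributes = FALSE))
print(round(colMeans(auto), 10))   # column means after scaling
print(apply(auto, 2, sd))          # column standard deviations after scaling
```

After scaling, every column has mean zero and standard deviation one, which is why no single variable dominates the distances used by k-means.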

Clustering Methods

We use k-means clustering, which requires the number of clusters to be chosen in advance. Since the wines are labelled with three types (A, B and C), we begin with 3 clusters. The algorithm then comes up with 3 means and allocates each row to the closest mean. The means and allocations are updated repeatedly until the result can no longer be improved.
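The procedure described above can be sketched in a few lines of R. This is a simplified illustration of the basic iterate-and-reassign idea (Lloyd's algorithm), not the implementation used by kmeans(), which defaults to the Hartigan-Wong method. The function simple_kmeans and the simulated data toy are hypothetical names introduced only for this example.

```r
# Minimal k-means sketch; simple_kmeans is a hypothetical helper, not a package function.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random rows as initial means
  cluster <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every row to every center (n x k matrix).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of the closest center
    if (all(new_cluster == cluster)) break             # allocations stable: stop
    cluster <- new_cluster
    # Recompute each mean from the rows currently allocated to it.
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Two simulated groups of points (made-up data for illustration).
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
print(table(fit$cluster))
print(fit$centers)
```

Because the starting means are random, different runs can give different clusterings, which is why kmeans() offers the nstart argument to try several random starts.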

results <- kmeans(wine.stand,3)
attributes(results)

## $names
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"      
## 
## $class
## [1] "kmeans"

The attributes of the clustering result include the cluster sizes (size), the total sum of squares (totss), the within-cluster sums of squares (withinss) and so on, but we are interested only in cluster and centers. The centers, one value per variable for each cluster, are shown below.

results$centers

##   fixed.acidity volatile.acidity citric.acid residual.sugar    chlorides
## 1     0.9884753      -0.68136597   1.0168389     0.03009531  0.269597010
## 2    -0.0958790       0.03722766   0.0971056     0.40100180 -0.004974135
## 3    -0.6495260       0.46255701  -0.7687501    -0.22734608 -0.188033956
##   free.sulfur.dioxide total.sulfur.dioxide    density         pH  sulphates
## 1          -0.4611634           -0.4737959  0.4324145 -0.7347242  0.5598374
## 2           1.0766153            1.3342828  0.2786982 -0.1843720 -0.1957198
## 3          -0.2272437           -0.3507257 -0.4489129  0.6141559 -0.2951957
##       alcohol
## 1  0.28587974
## 2 -0.52148613
## 3  0.06588483
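The sum-of-squares attributes returned by kmeans() are linked by a simple identity: the total sum of squares splits into a within-cluster part and a between-cluster part. The sketch below checks this on the built-in iris measurements, which are used here only as a stand-in for wine.stand.

```r
# Built-in iris measurements stand in for wine.stand (an assumption for illustration).
set.seed(42)
x <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 10)

print(fit$size)   # number of rows allocated to each cluster

# The per-cluster within sums of squares add up to tot.withinss.
print(all.equal(sum(fit$withinss), fit$tot.withinss))

# Total sum of squares = within-cluster part + between-cluster part.
print(all.equal(fit$totss, fit$tot.withinss + fit$betweenss))

# Share of the total variation explained by the clustering.
print(fit$betweenss / fit$totss)
```

A larger ratio of betweenss to totss means the clusters are tighter relative to the spread of the whole dataset.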

Now we can compare the clusters with the original wine types. This is done by cross-tabulating the type column in the initial data against the cluster assignments.

table(wine.q$type,results$cluster)

##    
##       1   2   3
##   A 123  12  57
##   B 141 243 307
##   C 245 115 356
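One way to summarize a cross-tabulation like the one above in a single number is cluster purity: for each cluster, take the count of its most frequent type, add these up, and divide by the total number of wines. The counts below are copied from the table above; the purity measure itself is an added illustration and was not part of the original analysis.

```r
# Counts copied from the type-by-cluster table above.
tab <- matrix(c(123,  12,  57,
                141, 243, 307,
                245, 115, 356),
              nrow = 3, byrow = TRUE,
              dimnames = list(type = c("A", "B", "C"),
                              cluster = c("1", "2", "3")))

# Share of each type within every cluster (columns sum to 1).
print(round(prop.table(tab, margin = 2), 2))

# Purity: sum of the largest count in each cluster over the grand total.
purity <- sum(apply(tab, 2, max)) / sum(tab)
print(purity)
```

A purity close to 1 would mean each cluster contains almost only one type; a value near the share of the largest type suggests that the clusters do not line up closely with the types.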

Visualizations

From the above k-means clustering, it cannot be concluded whether the predetermined number of clusters is appropriate, so further supporting methods are needed.

We can use the packages loaded earlier to find a better-suited number of clusters for the dataset.

fviz_nbclust(wine.stand, kmeans,method = "wss")

fviz_nbclust(wine.stand, kmeans,method = "silhouette")

fviz_nbclust(wine.stand, kmeans,method = "gap_stat")

## Warning: did not converge in 10 iterations

The first method, “wss”, uses the total within-cluster sum of squares. We find the optimum number of clusters with the elbow (or knee) rule: the point after which adding clusters no longer decreases the sum of squares by much. From the line graph, the optimal number of clusters appears to be 4 or 5.
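The curve behind the elbow rule can also be computed directly with kmeans(). The sketch below is a minimal version of that calculation on the built-in iris measurements, used only as a stand-in for wine.stand; the values of nstart and iter.max are arbitrary choices for the example.

```r
# Built-in iris measurements stand in for wine.stand (an assumption for illustration).
set.seed(123)
x <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1, ..., 10.
ks  <- 1:10
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 1)))

# The elbow is the k after which the curve flattens out.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is read off by eye, so two analysts can reasonably disagree by one cluster, as with the choice between 4 and 5 here.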

On the other hand, according to the second method, “silhouette”, the optimum number of clusters would be 2, because the optimum is the number of clusters that maximizes the average silhouette width.
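The average silhouette width can be computed by hand with silhouette() from the cluster package loaded earlier. The sketch below does this for k = 2 to 10 on the built-in iris measurements, which stand in for wine.stand; the variable names avg_sil and best_k are introduced only for this example.

```r
library(cluster)

# Built-in iris measurements stand in for wine.stand (an assumption for illustration).
set.seed(123)
x <- scale(iris[, 1:4])
d <- dist(x)   # Euclidean distances between all pairs of rows

# Average silhouette width for each candidate number of clusters.
ks <- 2:10
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(x, centers = k, nstart = 25, iter.max = 50)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

print(data.frame(k = ks, avg_sil = round(avg_sil, 3)))

# The suggested number of clusters is the k with the largest average width.
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

The silhouette is undefined for a single cluster, which is why the search starts at k = 2.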

Lastly, with the “gap_stat” method, the optimum number of clusters would be 10 as it maximizes the gap statistic.
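The gap statistic can also be computed directly with clusGap() from the cluster package. The sketch below is a minimal example on the built-in iris measurements, used as a stand-in for wine.stand; the values of B, nstart and iter.max are arbitrary choices. Passing a larger iter.max to kmeans is one possible way to avoid a “did not converge in 10 iterations” warning like the one shown above.

```r
library(cluster)

# Built-in iris measurements stand in for wine.stand (an assumption for illustration).
set.seed(123)
x <- scale(iris[, 1:4])

# Gap statistic for k = 1, ..., 10; the extra arguments are passed on to kmeans().
gap <- clusGap(x, FUNcluster = kmeans, K.max = 10, B = 20,
               nstart = 10, iter.max = 50, verbose = FALSE)

print(round(gap$Tab, 3))   # columns: logW, E.logW, gap, SE.sim

# Number of clusters with the largest gap statistic.
k_gap <- which.max(gap$Tab[, "gap"])
print(k_gap)
```

The cluster package also provides maxSE() for alternative selection rules that take the simulation standard error into account, and these can suggest a smaller number of clusters than the global maximum.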

Taking the above methods into consideration, the “silhouette” result of 2 clusters is chosen as the basis for the cluster plot below, because it produces less overlap between the clusters than the other options.

fviz_cluster(kmeans(wine.stand, centers = 2, iter.max = 100, nstart = 100), data = wine.stand)
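When the data has more than two variables, fviz_cluster() draws the observations on the first two principal components. The sketch below reproduces that idea with base R only, using the built-in iris measurements as a stand-in for wine.stand; the object names are introduced for this example.

```r
# Built-in iris measurements stand in for wine.stand (an assumption for illustration).
set.seed(123)
x <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 2, iter.max = 100, nstart = 100)

# Project the standardized data onto its first two principal components.
pc <- prcomp(x)
scores <- pc$x[, 1:2]

# Share of the total variance captured by the two plotted axes.
var_explained <- sum(pc$sdev[1:2]^2) / sum(pc$sdev^2)
print(round(var_explained, 3))

# Scatter plot of the scores, coloured by cluster.
plot(scores, col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "k-means clusters on the first two principal components")
```

If the two components explain only a modest share of the variance, clusters that look overlapping in the plot may still be separated in the remaining dimensions.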

We can also visualize the relationship between the wine variables and the allocated clusters:

library("tidyverse")

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## v purrr   0.3.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::combine()  masks gridExtra::combine()
## x dplyr::dim_desc() masks pillar::dim_desc()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()

clusters <- kmeans(wine.stand, centers = 2, iter.max = 100, nstart = 100)
wine.c <- wine.q |> mutate(cluster = clusters$cluster)
wine.c |> ggplot(aes(x=fixed.acidity, y=pH, col=as.factor(cluster))) + geom_point()

Conclusion

From the plot above, it is evident that wines with a lower fixed.acidity and higher pH were assigned to cluster 1, while wines with a lower pH and higher fixed.acidity were assigned mostly to cluster 2. The same plot can be produced for each variable in order to determine the similarity within each cluster.
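As a compact alternative to plotting every variable, the mean of each variable can be tabulated per cluster. The sketch below shows this with aggregate() on the built-in iris measurements, which stand in for the wine data; the name profile is introduced only for this example.

```r
# Built-in iris measurements stand in for the wine variables (an assumption for illustration).
set.seed(123)
toy <- iris[, 1:4]
fit <- kmeans(scale(toy), centers = 2, iter.max = 100, nstart = 25)

# Mean of every original (unscaled) variable within each cluster.
profile <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
print(profile)
```

Reading across a row gives the typical profile of one cluster, and reading down a column shows which variables differ most between clusters.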

Through the analysis of the different clustering methods, we were able to determine the optimum number of clusters that made the dataset visually presentable and easy to understand.

References

https://www.kaggle.com/danielpanizzo/wine-quality

https://stackoverflow.com/questions

https://community.rstudio.com/