I am going to use the Wine Quality data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

The objective of this exercise is to show how to group observations when we don’t have an explicit classification criterion, that is, when there are no class labels to learn from. In machine learning this is known as unsupervised learning. The methods that we will study come from the marketing and statistics literature.

I have already downloaded the data and stored it on my computer. For this exercise we will use only the white wine data.

library(pastecs)
## Loading required package: boot
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("/Volumes/Transcend/Dropbox/Work/Teaching/DA 6813/Course Documents/Data/Wine data")
wine <- read.csv("winequality-white.csv")
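
A quick caveat if you download the file straight from the UCI site rather than using a local copy: the raw winequality-white.csv is semicolon-delimited, so read.csv() with its defaults will read every row into a single column. A sketch of the call you would need in that case:

# The raw UCI download uses ";" as the field separator
wine <- read.csv("winequality-white.csv", sep = ";")

The local copy read above has evidently been saved as a regular comma-separated file, since the default call parses all 12 variables.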

K-means Clustering

We will use all 12 variables available to us. We will start with k-means clustering. Make sure that all the variables are numeric and measured on an interval or ratio scale.

str(wine)
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed_acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile_acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric_acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual_sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free_sulfur_dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total_sulfur_dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
options(scipen = 100)
options(digits = 3)

pastecs::stat.desc(wine)
##              fixed_acidity volatile_acidity citric_acid residual_sugar
## nbr.val          4898.0000       4898.00000  4898.00000      4898.0000
## nbr.null            0.0000          0.00000    19.00000         0.0000
## nbr.na              0.0000          0.00000     0.00000         0.0000
## min                 3.8000          0.08000     0.00000         0.6000
## max                14.2000          1.10000     1.66000        65.8000
## range              10.4000          1.02000     1.66000        65.2000
## sum             33574.7500       1362.82500  1636.87000     31305.1500
## median              6.8000          0.26000     0.32000         5.2000
## mean                6.8548          0.27824     0.33419         6.3914
## SE.mean             0.0121          0.00144     0.00173         0.0725
## CI.mean.0.95        0.0236          0.00282     0.00339         0.1421
## var                 0.7121          0.01016     0.01465        25.7258
## std.dev             0.8439          0.10079     0.12102         5.0721
## coef.var            0.1231          0.36226     0.36213         0.7936
##                chlorides free_sulfur_dioxide total_sulfur_dioxide
## nbr.val      4898.000000            4898.000             4898.000
## nbr.null        0.000000               0.000                0.000
## nbr.na          0.000000               0.000                0.000
## min             0.009000               2.000                9.000
## max             0.346000             289.000              440.000
## range           0.337000             287.000              431.000
## sum           224.193000          172939.000           677690.500
## median          0.043000              34.000              134.000
## mean            0.045772              35.308              138.361
## SE.mean         0.000312               0.243                0.607
## CI.mean.0.95    0.000612               0.476                1.190
## var             0.000477             289.243             1806.085
## std.dev         0.021848              17.007               42.498
## coef.var        0.477318               0.482                0.307
##                    density          pH  sulphates    alcohol    quality
## nbr.val      4898.00000000  4898.00000 4898.00000  4898.0000  4898.0000
## nbr.null        0.00000000     0.00000    0.00000     0.0000     0.0000
## nbr.na          0.00000000     0.00000    0.00000     0.0000     0.0000
## min             0.98711000     2.72000    0.22000     8.0000     3.0000
## max             1.03898000     3.82000    1.08000    14.2000     9.0000
## range           0.05187000     1.10000    0.86000     6.2000     6.0000
## sum          4868.74609000 15616.13000 2399.27000 51498.8800 28790.0000
## median          0.99374000     3.18000    0.47000    10.4000     6.0000
## mean            0.99402738     3.18827    0.48985    10.5143     5.8779
## SE.mean         0.00004274     0.00216    0.00163     0.0176     0.0127
## CI.mean.0.95    0.00008378     0.00423    0.00320     0.0345     0.0248
## var             0.00000895     0.02280    0.01302     1.5144     0.7844
## std.dev         0.00299091     0.15100    0.11413     1.2306     0.8856
## coef.var        0.00300888     0.04736    0.23298     0.1170     0.1507
options(scipen = 0)
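
One thing the descriptive statistics make obvious: the variables live on very different scales. For example, total_sulfur_dioxide ranges up to 440 while density barely moves around 1. Since k-means clusters on Euclidean distances, the large-scale variables will dominate the solution. We proceed with the raw values below to keep the exercise simple, but a standardized variant is worth knowing about (a sketch only; it is not used in the rest of this exercise):

# Center each column at mean 0 with standard deviation 1 so that no
# single variable dominates the distance calculations
wine_scaled <- as.data.frame(scale(wine))

Everything that follows would work the same way with wine_scaled in place of wine.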

In order to perform k-means clustering, we need to tell R how many clusters we want. This has to be decided ex ante, yet we may have no idea how many clusters the data actually contain.

Ideally, we want few clusters, each with very low internal variance. These two goals pull in opposite directions: as the number of clusters grows, the variance within each cluster shrinks. In the extreme, if every observation in the data is treated as its own cluster, the within-cluster variance is zero, but with that many clusters we have achieved no data reduction at all. So we need to make a tradeoff between the two. To do this, we will plot the total within-cluster sum of squares for models with different numbers of clusters.

For this example, I will plot a curve for up to 20 clusters.

# Total within-cluster sum of squares for k = 1, ..., 20
ss <- rep(0, 20)
for (i in 1:20) {
  set.seed(20)  # reset the seed so every k starts from comparable draws
  ss[i] <- kmeans(wine, centers = i, iter.max = 100, nstart = 25)$tot.withinss
}

ss2 <- data.frame(SS = ss, Cluster = 1:20)

ggplot(data = ss2,aes(x = Cluster, y=SS)) +
  geom_point(color="blue") + geom_line() 

The curve looks like a hockey stick. We see that after around 5 clusters the slope of the line flattens out. This means that as we increase the number of clusters beyond 5, we don’t get a large reduction in the within-cluster sum of squares. This is necessarily a subjective way of looking at things: someone else might argue the real flattening happens at 16 clusters. We will use 5 clusters for this example and print the centroids.

set.seed(20)
# nstart = 25 tries 25 random starting configurations and keeps the best one
kclusters <- kmeans(wine, centers = 5, nstart = 25)

kclusters$centers
##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1          6.96            0.286       0.354           8.82    0.0511
## 2          6.78            0.270       0.323           4.73    0.0419
## 3          6.84            0.272       0.336           7.01    0.0473
## 4          7.01            0.307       0.356          10.03    0.0523
## 5          6.81            0.280       0.316           3.45    0.0402
##   free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 1                47.0                179.1   0.996 3.18     0.508    9.83
## 2                28.2                113.1   0.993 3.19     0.484   10.96
## 3                37.7                145.3   0.994 3.20     0.486   10.40
## 4                55.3                221.7   0.997 3.18     0.518    9.54
## 5                18.9                 77.5   0.992 3.18     0.469   11.26
##   quality
## 1    5.63
## 2    6.07
## 3    5.94
## 4    5.52
## 5    5.90
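
Besides the centroids, the kmeans object also stores the size of each cluster in $size and the cluster assignment of every observation in $cluster. A short sketch of how you might inspect them:

# Cluster sizes (equivalently: table(kclusters$cluster))
kclusters$size

# First few cluster assignments, one per observation in wine
head(kclusters$cluster)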

Hierarchical Clustering

In hierarchical clustering, we make decisions based on a dendrogram. In general, reading a dendrogram built from a large number of observations is very difficult. Therefore, I am going to randomly select 50 observations from the original dataset and use this subset for clustering.

set.seed(100)
w_ind <- sample(nrow(wine),50,replace = F)
wine2 <- wine[w_ind,]

Next we will perform hierarchical clustering and generate a dendrogram.

hclusters <- hclust(dist(wine2))
plot(hclusters)
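
A note on defaults: dist() computes Euclidean distances and hclust() uses complete linkage unless told otherwise. Different linkage rules can produce quite different trees; Ward’s method, for instance, tends to give compact, similarly sized clusters. A sketch of that variant (not used below):

# Ward's minimum-variance linkage instead of the default complete linkage
hclusters_ward <- hclust(dist(wine2), method = "ward.D2")
plot(hclusters_ward)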

Again, selecting the number of clusters is subjective. Looking at the dendrogram, cutting the tree at around 4 clusters looks reasonable.
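
To see where a four-cluster cut lands on the tree before committing to it, rect.hclust() can draw boxes around the resulting branches (run right after the plot() call above):

plot(hclusters)
rect.hclust(hclusters, k = 4, border = "red")  # outline each of the 4 clusters

Let’s create 4 clusters and see how many observations fall into each one.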

hclustersCut <- cutree(hclusters, 4)
table(hclustersCut)
## hclustersCut
##  1  2  3  4 
##  9 16  9 16

Finally, let’s look at the means of the variables within each cluster.

wine2$Clusters <- hclustersCut                 # attach the cluster labels
wine2 <- dplyr::group_by(wine2, Clusters)      # group by cluster membership
a <- dplyr::summarise_each(wine2, funs(mean))  # mean of every variable per cluster
print.data.frame(a)
##   Clusters fixed_acidity volatile_acidity citric_acid residual_sugar
## 1        1          6.86            0.281       0.433           7.28
## 2        2          6.84            0.281       0.353          11.39
## 3        3          6.63            0.280       0.292           4.32
## 4        4          6.88            0.277       0.315           4.72
##   chlorides free_sulfur_dioxide total_sulfur_dioxide density   pH
## 1    0.0502                45.7                  153   0.995 3.21
## 2    0.0631                47.2                  190   0.997 3.17
## 3    0.0358                23.8                   81   0.992 3.21
## 4    0.0442                34.4                  113   0.992 3.14
##   sulphates alcohol quality
## 1     0.504    9.87    5.78
## 2     0.479    9.56    5.50
## 3     0.463   11.10    5.89
## 4     0.463   11.42    6.31
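
A closing housekeeping note on the dplyr code above: summarise_each() and funs() were deprecated in later dplyr releases. If you are on a recent version (1.0 or later), the equivalent summary is:

wine2 %>%
  group_by(Clusters) %>%
  summarise(across(everything(), mean))

The result is the same table of per-cluster means, just with the newer syntax.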