I am going to use the Wine Quality data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
The objective of this exercise is to show how to classify observations when we don’t have an explicit classification criterion. In machine learning this is known as unsupervised learning. The methods that we will study come from the marketing and statistics literature.
I have already downloaded the data and stored it on my computer. For this exercise we will use only the white wine data.
library(pastecs)
## Loading required package: boot
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
setwd("/Volumes/Transcend/Dropbox/Work/Teaching/DA 6813/Course Documents/Data/Wine data")
wine<- read.csv("winequality-white.csv")
We will use all 12 variables available to us. We will start with k-means clustering. Make sure that all the variables are numeric and measured on an interval or ratio scale. Note that quality is an integer score, which we treat as interval here.
str(wine)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed_acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile_acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric_acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual_sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free_sulfur_dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total_sulfur_dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
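The same check can be done programmatically instead of by reading the `str()` output. The sketch below is only an illustration: `wine_head` is a hypothetical stand-in built from the first three values of a few columns shown above, so that the snippet runs without the CSV file. With the real data you would pass `wine` instead.

```r
# Hypothetical stand-in: first three rows of a few columns, copied from the str() output
wine_head <- data.frame(
  fixed_acidity  = c(7, 6.3, 8.1),
  residual_sugar = c(20.7, 1.6, 6.9),
  alcohol        = c(8.8, 9.5, 10.1),
  quality        = c(6L, 6L, 6L)
)

# TRUE for every column stored as a number (double or integer)
is_num <- vapply(wine_head, is.numeric, logical(1))
print(is_num)

# Stop early if any column is not numeric
stopifnot(all(is_num))
```

A numeric storage type does not by itself guarantee an interval or ratio scale, so this check complements, rather than replaces, a look at what each variable measures.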
options(scipen = 100)
options(digits = 3)
pastecs::stat.desc(wine)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## nbr.val 4898.0000 4898.00000 4898.00000 4898.0000
## nbr.null 0.0000 0.00000 19.00000 0.0000
## nbr.na 0.0000 0.00000 0.00000 0.0000
## min 3.8000 0.08000 0.00000 0.6000
## max 14.2000 1.10000 1.66000 65.8000
## range 10.4000 1.02000 1.66000 65.2000
## sum 33574.7500 1362.82500 1636.87000 31305.1500
## median 6.8000 0.26000 0.32000 5.2000
## mean 6.8548 0.27824 0.33419 6.3914
## SE.mean 0.0121 0.00144 0.00173 0.0725
## CI.mean.0.95 0.0236 0.00282 0.00339 0.1421
## var 0.7121 0.01016 0.01465 25.7258
## std.dev 0.8439 0.10079 0.12102 5.0721
## coef.var 0.1231 0.36226 0.36213 0.7936
## chlorides free_sulfur_dioxide total_sulfur_dioxide
## nbr.val 4898.000000 4898.000 4898.000
## nbr.null 0.000000 0.000 0.000
## nbr.na 0.000000 0.000 0.000
## min 0.009000 2.000 9.000
## max 0.346000 289.000 440.000
## range 0.337000 287.000 431.000
## sum 224.193000 172939.000 677690.500
## median 0.043000 34.000 134.000
## mean 0.045772 35.308 138.361
## SE.mean 0.000312 0.243 0.607
## CI.mean.0.95 0.000612 0.476 1.190
## var 0.000477 289.243 1806.085
## std.dev 0.021848 17.007 42.498
## coef.var 0.477318 0.482 0.307
## density pH sulphates alcohol quality
## nbr.val 4898.00000000 4898.00000 4898.00000 4898.0000 4898.0000
## nbr.null 0.00000000 0.00000 0.00000 0.0000 0.0000
## nbr.na 0.00000000 0.00000 0.00000 0.0000 0.0000
## min 0.98711000 2.72000 0.22000 8.0000 3.0000
## max 1.03898000 3.82000 1.08000 14.2000 9.0000
## range 0.05187000 1.10000 0.86000 6.2000 6.0000
## sum 4868.74609000 15616.13000 2399.27000 51498.8800 28790.0000
## median 0.99374000 3.18000 0.47000 10.4000 6.0000
## mean 0.99402738 3.18827 0.48985 10.5143 5.8779
## SE.mean 0.00004274 0.00216 0.00163 0.0176 0.0127
## CI.mean.0.95 0.00008378 0.00423 0.00320 0.0345 0.0248
## var 0.00000895 0.02280 0.01302 1.5144 0.7844
## std.dev 0.00299091 0.15100 0.11413 1.2306 0.8856
## coef.var 0.00300888 0.04736 0.23298 0.1170 0.1507
options(scipen = 0)
In order to perform k-means clustering, we need to tell R how many clusters we want. This has to be decided ex ante. However, we may have literally no idea about the number of clusters in the data.
Ideally, we want few clusters with very low variance within each cluster. However, these two goals pull against each other: as the number of clusters increases, the variance within each cluster falls. For example, if each observation in the data is treated as its own cluster, then there is zero variance within each cluster, but there are as many clusters as observations and no reduction in the data. So we need to make a tradeoff between the two. For this, we will plot the total within-cluster sum of squares for models with different numbers of clusters.
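To make the two extremes of this tradeoff concrete, here is a minimal sketch on made-up data. The helper `total_withinss()` and the matrix `toy` are hypothetical names introduced only for this illustration; they are not part of the analysis of the wine data.

```r
# Hypothetical helper: total within-cluster sum of squares for a given assignment
total_withinss <- function(x, cluster) {
  x <- as.matrix(x)
  per_cluster <- vapply(split(seq_len(nrow(x)), cluster), function(idx) {
    block <- x[idx, , drop = FALSE]
    # Squared deviations of each row from its cluster's column means
    sum(sweep(block, 2, colMeans(block))^2)
  }, numeric(1))
  sum(per_cluster)
}

set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)  # 20 made-up observations, 2 variables

# One cluster holding everything: largest possible within-cluster sum of squares
one_cluster <- total_withinss(toy, rep(1, nrow(toy)))

# Every observation in its own cluster: zero within-cluster sum of squares
own_cluster <- total_withinss(toy, seq_len(nrow(toy)))

print(c(one_cluster = one_cluster, own_cluster = own_cluster))
```

Any sensible number of clusters falls between these two extremes.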
For this example, I will plot a curve for up to 20 clusters.
ss <- rep(0,20)
for (i in 1:20) {
set.seed(20)
ss[i] <- sum(kmeans(wine,centers=i,iter.max = 100, nstart = 25)$withinss)
}
ss2 <- data.frame(cbind(ss,seq(1,20)))
colnames(ss2) <- c("SS","Cluster")
ggplot(data = ss2,aes(x = Cluster, y=SS)) +
geom_point(color="blue") + geom_line()
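A plot like this can be complemented with a small table of how much each extra cluster reduces the total within-cluster sum of squares. The sketch below assumes made-up data (`toy`, three simulated groups) and a smaller range of cluster counts so that it runs on its own. It uses `tot.withinss`, which `kmeans()` returns as the sum of the `withinss` vector.

```r
set.seed(20)
# Made-up data: three groups of 50 points each, centred at 0, 5 and 10
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

ks <- 1:8
# Total within-cluster sum of squares for each number of clusters
ss <- vapply(ks, function(k) {
  kmeans(toy, centers = k, iter.max = 100, nstart = 25)$tot.withinss
}, numeric(1))

# Relative reduction when moving from k to k + 1 clusters
rel_drop <- -diff(ss) / ss[-length(ss)]
print(data.frame(from = ks[-length(ks)], to = ks[-1], rel_drop = round(rel_drop, 3)))
```

Small relative drops after some number of clusters correspond to the flat part of the curve. Where exactly to stop remains a judgment call.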
The curve will look like a hockey stick. After around 5 clusters the slope of the line flattens. This means that increasing the number of clusters beyond 5 does not give a large reduction in the total within-cluster sum of squares. This is necessarily a subjective way of looking at things. For example, for some people the real flattening may appear to happen at 16 clusters. We will use 5 clusters for this example and print the centroids.
set.seed(20)
kclusters <- kmeans(wine,centers = 5, nstart = 25)
kclusters$centers
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1 6.96 0.286 0.354 8.82 0.0511
## 2 6.78 0.270 0.323 4.73 0.0419
## 3 6.84 0.272 0.336 7.01 0.0473
## 4 7.01 0.307 0.356 10.03 0.0523
## 5 6.81 0.280 0.316 3.45 0.0402
## free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
## 1 47.0 179.1 0.996 3.18 0.508 9.83
## 2 28.2 113.1 0.993 3.19 0.484 10.96
## 3 37.7 145.3 0.994 3.20 0.486 10.40
## 4 55.3 221.7 0.997 3.18 0.518 9.54
## 5 18.9 77.5 0.992 3.18 0.469 11.26
## quality
## 1 5.63
## 2 6.07
## 3 5.94
## 4 5.52
## 5 5.90
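Besides the centroids, it is often useful to know how many observations fall in each cluster and to attach the cluster labels to the data. The sketch below shows this with the `size` and `cluster` components of a `kmeans()` result. It uses the built-in `iris` measurements and 3 clusters purely as stand-ins so that it runs without the wine file.

```r
x <- iris[, 1:4]  # stand-in numeric data, not the wine data

set.seed(20)
km <- kmeans(x, centers = 3, nstart = 25)

# Number of observations in each cluster
print(km$size)

# The same counts, derived from the per-observation labels
print(table(km$cluster))

# Attach the cluster label to each observation
labelled <- cbind(x, cluster = km$cluster)
print(head(labelled))
```

For the wine example the corresponding objects would be `kclusters$size` and `kclusters$cluster`.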
In hierarchical clustering, we make decisions based on a dendrogram. In general, reading a dendrogram with a large number of observations is very difficult. Therefore, I am going to randomly select 50 observations from the original dataset and use them for clustering.
set.seed(100)
w_ind <- sample(nrow(wine),50,replace = F)
wine2 <- wine[w_ind,]
Next we will perform hierarchical clustering and generate a dendrogram.
hclusters <- hclust(dist(wine2))
plot(hclusters)
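Calling `hclust(dist(...))` without further arguments relies on two defaults: Euclidean distance in `dist()` and complete linkage in `hclust()`. The sketch below, on made-up data (`toy` is a hypothetical name), spells these defaults out and shows how a different linkage could be requested. It illustrates the options and is not a recommendation for the wine data.

```r
set.seed(100)
toy <- matrix(rnorm(60), ncol = 3)  # 20 made-up observations, 3 variables

# Defaults, as in a bare hclust(dist(x)) call
h_default <- hclust(dist(toy))

# The same call with the defaults written explicitly
h_explicit <- hclust(dist(toy, method = "euclidean"), method = "complete")

# Both trees merge observations in the same order
print(identical(h_default$merge, h_explicit$merge))

# A different linkage can give a different tree
h_average <- hclust(dist(toy), method = "average")
print(table(complete = cutree(h_default, 4), average = cutree(h_average, 4)))
```

Because the variables are not rescaled here, variables with large ranges dominate the Euclidean distance.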
Again, selecting the clusters is subjective. Looking at the dendrogram, cutting the tree at around 4 clusters looks reasonable. Let’s create 4 clusters and see how many observations each cluster contains.
hclustersCut <- cutree(hclusters, 4)
table(hclustersCut)
## hclustersCut
## 1 2 3 4
## 9 16 9 16
Finally, let’s look at the means of the variables for each cluster.
wine2$Clusters <- hclustersCut
wine2 <- dplyr::group_by(wine2,Clusters)
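# summarise_each() and funs() are deprecated in current dplyr; across() gives the same per-cluster means (requires dplyr >= 1.0.0)
a <- dplyr::summarise(wine2, dplyr::across(dplyr::everything(), mean))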
print.data.frame(a)
## Clusters fixed_acidity volatile_acidity citric_acid residual_sugar
## 1 1 6.86 0.281 0.433 7.28
## 2 2 6.84 0.281 0.353 11.39
## 3 3 6.63 0.280 0.292 4.32
## 4 4 6.88 0.277 0.315 4.72
## chlorides free_sulfur_dioxide total_sulfur_dioxide density pH
## 1 0.0502 45.7 153 0.995 3.21
## 2 0.0631 47.2 190 0.997 3.17
## 3 0.0358 23.8 81 0.992 3.21
## 4 0.0442 34.4 113 0.992 3.14
## sulphates alcohol quality
## 1 0.504 9.87 5.78
## 2 0.479 9.56 5.50
## 3 0.463 11.10 5.89
## 4 0.463 11.42 6.31
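The same kind of per-cluster table can also be produced in base R with `aggregate()`, which avoids any dependence on a particular dplyr version. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) rather than the wine sample, so it runs on its own.

```r
# Made-up data: a cluster label plus two numeric variables
toy <- data.frame(
  Clusters = c(1, 1, 2, 2, 2),
  alcohol  = c(9.5, 10.1, 11.0, 11.4, 12.0),
  pH       = c(3.30, 3.26, 3.22, 3.19, 3.18)
)

# Mean of every other column within each cluster
means <- aggregate(. ~ Clusters, data = toy, FUN = mean)
print(means)
```

With the wine sample, the equivalent call would use the data frame that holds the `Clusters` column in place of `toy`.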