As data scientists at a whiskey shop, we are asked to recommend whiskeys based on each customer’s taste preferences.
Purpose: to group whiskeys into clusters so that each cluster has a distinctive taste profile.
Read the data.
library(dplyr)  # provides glimpse(), select(), and %>%
(whiskey <- read.csv("whiskies.txt"))
glimpse(whiskey)
## Rows: 86
## Columns: 17
## $ RowID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Distillery <chr> "Aberfeldy", "Aberlour", "AnCnoc", "Ardbeg", "Ardmore", "Ar…
## $ Body <int> 2, 3, 1, 4, 2, 2, 0, 2, 2, 2, 4, 3, 4, 2, 3, 2, 1, 2, 2, 1,…
## $ Sweetness <int> 2, 3, 3, 1, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
## $ Smoky <int> 2, 1, 2, 4, 2, 1, 0, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 3, 2,…
## $ Medicinal <int> 0, 0, 0, 4, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2,…
## $ Tobacco <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Honey <int> 2, 4, 2, 0, 1, 1, 1, 2, 1, 0, 2, 3, 2, 2, 3, 2, 0, 1, 2, 2,…
## $ Spicy <int> 1, 3, 0, 2, 1, 1, 1, 1, 0, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2,…
## $ Winey <int> 2, 2, 0, 0, 1, 1, 0, 2, 0, 0, 3, 1, 0, 0, 1, 1, 1, 2, 1, 1,…
## $ Nutty <int> 2, 2, 2, 1, 2, 0, 2, 2, 2, 2, 3, 0, 2, 0, 2, 2, 0, 2, 1, 2,…
## $ Malty <int> 2, 3, 2, 2, 3, 1, 2, 2, 2, 1, 0, 2, 2, 2, 3, 2, 2, 2, 1, 2,…
## $ Fruity <int> 2, 3, 3, 1, 1, 1, 3, 2, 2, 2, 1, 2, 2, 3, 2, 2, 2, 2, 1, 2,…
## $ Floral <int> 2, 2, 2, 0, 1, 2, 3, 1, 2, 1, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2,…
## $ Postcode <chr> " \tPH15 2EB", " \tAB38 9PJ", " \tAB5 5LI", " \tPA42 7EB", …
## $ Latitude <int> 286580, 326340, 352960, 141560, 355350, 194050, 247670, 340…
## $ Longitude <dbl> 749680, 842570, 839320, 646220, 829140, 649950, 672610, 848…
The data are malt whisky flavor profiles from 86
distilleries, obtained from the research of Dr. Wishart (University of
St. Andrews). Each whiskey is scored from 0 to 4 on each of 12 flavor
categories based on organoleptic tests:
- Body: level of strength of taste (light/heavy)
- Sweetness: level of sweetness
- Smoky: level of smoke taste
- Medicinal: level of bitter taste (medicine)
- Tobacco: tobacco taste level
- Honey: level of honey taste
- Spicy: spicy level
- Winey: wine taste level
- Nutty: nutty flavor level
- Malty: malt (malted barley) flavor level
- Fruity: fruit flavor level
- Floral: floral flavor level
Check for missing values.
anyNA(whiskey)
## [1] FALSE
Data Cleansing
# assign value from Distillery column to row name
rownames(whiskey) <- whiskey$Distillery
# discard unused columns
whiskey <- whiskey %>%
select(-c(RowID, Distillery, Postcode, Latitude, Longitude))
head(whiskey)
Check the scale between variables.
summary(whiskey)
## Body Sweetness Smoky Medicinal
## Min. :0.00 Min. :1.000 Min. :0.000 Min. :0.0000
## 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :2.00 Median :2.000 Median :1.000 Median :0.0000
## Mean :2.07 Mean :2.291 Mean :1.535 Mean :0.5465
## 3rd Qu.:2.00 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :4.00 Max. :4.000 Max. :4.000 Max. :4.0000
## Tobacco Honey Spicy Winey
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :1.000 Median :1.000 Median :1.0000
## Mean :0.1163 Mean :1.244 Mean :1.384 Mean :0.9767
## 3rd Qu.:0.0000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :4.000 Max. :3.000 Max. :4.0000
## Nutty Malty Fruity Floral
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :2.000 Median :2.000
## Mean :1.465 Mean :1.802 Mean :1.802 Mean :1.698
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :3.000 Max. :3.000 Max. :4.000
Since every variable is scored on the same 0-4 scale, there is no need to rescale the data.
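One way to make this scale check explicit is to compare the range of every column rather than reading it off the summary. The sketch below uses base R only; `toy_scores` is a small made-up stand-in, not the whiskey data.

```r
# Toy stand-in for the flavor scores (hypothetical values, not the real data)
toy_scores <- data.frame(
  Body      = c(2, 3, 1, 4, 2),
  Sweetness = c(2, 3, 3, 1, 2),
  Smoky     = c(2, 1, 2, 4, 0)
)

# Minimum and maximum of every column, one column per variable
ranges <- sapply(toy_scores, range)
rownames(ranges) <- c("min", "max")
print(ranges)

# TRUE when every variable stays inside the shared 0-4 scoring scale
same_scale <- all(ranges["min", ] >= 0 & ranges["max", ] <= 4)
print(same_scale)
```

If a variable were measured in different units, its range would stand out here and scaling would be worth considering before computing distances.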
Here we explore the distribution of each numeric variable using
density plots and the correlation between variables using
scatterplots, both of which are provided by the ggpairs function
from the GGally package.
library(ggplot2)  # provides theme(), element_text(), element_rect()
library(GGally)   # provides ggpairs()
ggpairs(whiskey[, 1:7], showStrips = FALSE) +
  theme(axis.text = element_text(colour = "black", size = 11),
        strip.background = element_rect(fill = "#d63d2d"),
        strip.text = element_text(colour = "white", size = 12,
                                  face = "bold"))
The plots show a strong correlation between some pairs of variables,
including Body-Smoky and Smoky-Medicinal. This indicates that the
dataset has multicollinearity, so it might not be suitable for
classification algorithms that assume the predictors are not
collinear.
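The same check can be done numerically by listing the variable pairs whose correlation exceeds a cutoff. This is a minimal base R sketch on synthetic data, where `smoky` is deliberately built to correlate with `body`; the variable names and the 0.6 cutoff are assumptions made for illustration.

```r
set.seed(1)

# Synthetic stand-in data: 'smoky' is constructed to correlate with 'body'
n      <- 100
body   <- rnorm(n)
smoky  <- 0.8 * body + rnorm(n, sd = 0.5)
floral <- rnorm(n)
toy    <- data.frame(body, smoky, floral)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the cutoff
cutoff <- 0.6  # arbitrary threshold for this illustration
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(strong_pairs)
```

On the whiskey data, running `cor()` on the twelve flavor columns and applying the same filter would list the collinear pairs explicitly.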
Principal Component Analysis (PCA) can be applied to this data to produce uncorrelated components, while also reducing the dimension of the data and retaining as much information as possible. The result can then be used for classification at a lower computational cost.
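To illustrate why PCA removes multicollinearity, the sketch below compares the correlation between two synthetic variables before and after projection onto the principal components. It uses base R's `prcomp` rather than the FactoMineR function used later in this article, and the data are made up.

```r
set.seed(2)

# Two synthetic variables, the second strongly correlated with the first
n   <- 200
x1  <- rnorm(n)
x2  <- 0.9 * x1 + rnorm(n, sd = 0.3)
toy <- data.frame(x1, x2)

# PCA on centered, unscaled data (covariance-based)
pca <- prcomp(toy, center = TRUE, scale. = FALSE)

cor_before <- cor(toy)[1, 2]    # correlation between the raw variables
cor_after  <- cor(pca$x)[1, 2]  # correlation between the PC scores
print(c(before = cor_before, after = cor_after))
```

The component scores are uncorrelated by construction, so the second value is zero up to floating-point error.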
Finding the optimal K.
library(factoextra)  # provides fviz_nbclust() and fviz_cluster()
fviz_nbclust(x = whiskey,
             FUNcluster = kmeans,
             method = "wss")
From the plot, we can see that 5 is the optimal number of
clusters. Beyond k = 5, increasing K no longer yields a
considerable decrease in the total within sum of squares (a measure of
internal cohesion), nor a considerable increase in the between sum of
squares or the between/total sum of squares ratio (a measure of
external separation).
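The elbow reading can also be reproduced by hand: run k-means for a range of K and look at how much the total within sum of squares drops at each step. The following base R sketch does this on synthetic data with three well-separated groups; the data and the range 1 to 8 are assumptions for illustration only.

```r
set.seed(3)

# Synthetic stand-in: three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 6),  ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# Drop in WSS gained by moving from k - 1 to k clusters
gain <- -diff(wss)
print(round(data.frame(k = ks, wss = wss, gain = c(NA, gain)), 2))
```

The elbow is the K after which `gain` becomes small relative to the earlier steps; for this toy data that happens after three clusters.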
# k-means clustering
set.seed(50)
(whiskey_k <- kmeans(x = whiskey,
                     centers = 5))
## K-means clustering with 5 clusters of sizes 7, 16, 37, 6, 20
##
## Cluster means:
## Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey
## 1 3.571429 2.285714 1.857143 0.1428571 0.00000000 1.7142857 1.714286 2.8571429
## 2 1.875000 2.000000 2.000000 1.0000000 0.18750000 1.1250000 1.437500 0.9375000
## 3 1.432432 2.486486 1.054054 0.2432432 0.05405405 0.9729730 1.108108 0.4594595
## 4 3.666667 1.500000 3.666667 3.3333333 0.66666667 0.1666667 1.666667 0.5000000
## 5 2.400000 2.400000 1.300000 0.0500000 0.05000000 2.0000000 1.650000 1.4500000
## Nutty Malty Fruity Floral
## 1 2.000000 1.571429 2.285714 1.1428571
## 2 1.500000 1.812500 1.125000 1.0625000
## 3 1.162162 1.675676 1.972973 2.1081081
## 4 1.166667 1.333333 1.166667 0.1666667
## 5 1.900000 2.250000 2.050000 2.1000000
##
## Clustering vector:
## Aberfeldy Aberlour AnCnoc Ardbeg
## 5 5 3 4
## Ardmore ArranIsleOf Auchentoshan Auchroisk
## 2 3 3 5
## Aultmore Balblair Balmenach Belvenie
## 3 2 1 5
## BenNevis Benriach Benrinnes Benromach
## 5 3 5 5
## Bladnoch BlairAthol Bowmore Bruichladdich
## 3 5 2 2
## Bunnahabhain Caol Ila Cardhu Clynelish
## 3 4 3 4
## Craigallechie Craigganmore Dailuaine Dalmore
## 5 3 1 1
## Dalwhinnie Deanston Dufftown Edradour
## 3 5 3 5
## GlenDeveronMacduff GlenElgin GlenGarioch GlenGrant
## 2 3 2 3
## GlenKeith GlenMoray GlenOrd GlenScotia
## 3 3 5 2
## GlenSpey Glenallachie Glendronach Glendullan
## 3 3 1 5
## Glenfarclas Glenfiddich Glengoyne Glenkinchie
## 5 3 3 3
## Glenlivet Glenlossie Glenmorangie Glenrothes
## 5 3 3 2
## Glenturret Highland Park Inchgower Isle of Jura
## 5 2 3 2
## Knochando Lagavulin Laphroig Linkwood
## 5 4 4 3
## Loch Lomond Longmorn Macallan Mannochmore
## 3 5 1 3
## Miltonduff Mortlach Oban OldFettercairn
## 3 1 2 2
## OldPulteney RoyalBrackla RoyalLochnagar Scapa
## 2 3 1 5
## Speyburn Speyside Springbank Strathisla
## 3 3 2 5
## Strathmill Talisker Tamdhu Tamnavulin
## 3 4 3 3
## Teaninich Tobermory Tomatin Tomintoul
## 3 3 2 3
## Tormore Tullibardine
## 2 3
##
## Within cluster sum of squares by cluster:
## [1] 26.57143 85.93750 162.32432 24.33333 77.50000
## (between_SS / total_SS = 43.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
whiskey_k$iter
## [1] 5
The goodness of the clustering result can be seen from 3 values:
- Within Sum of Squares ($withinss): the sum of the squared distances from each observation to the centroid of its cluster.
- Between Sum of Squares ($betweenss): the sum of the weighted squared distances from each centroid to the global average. The weight is the number of observations in the cluster.
- Total Sum of Squares ($totss): the sum of the squared distances from each observation to the global average.
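These three quantities are tied together by the identity totss = tot.withinss + betweenss. For the whiskey result this gives 376.6666 + 289.1706 = 665.8372, and 289.1706 / 665.8372 is the 43.4 % ratio reported by kmeans. The sketch below recomputes each quantity by hand on synthetic data to show how the definitions map to code; the data are made up and only base R is used.

```r
set.seed(4)

toy <- matrix(rnorm(200), ncol = 4)  # synthetic stand-in data (50 x 4)
km  <- kmeans(toy, centers = 3, nstart = 10)

global_mean <- colMeans(toy)

# Total SS: squared distance of every row to the global mean
totss_manual <- sum(sweep(toy, 2, global_mean)^2)

# Between SS: cluster size times squared centroid-to-global-mean distance
betweenss_manual <- sum(
  km$size * rowSums(sweep(km$centers, 2, global_mean)^2)
)

# Within SS: squared distance of every row to its own cluster centroid
withinss_manual <- sum((toy - km$centers[km$cluster, ])^2)

print(all.equal(totss_manual, km$totss))
print(all.equal(betweenss_manual, km$betweenss))
print(all.equal(withinss_manual, km$tot.withinss))
print(all.equal(km$totss, km$tot.withinss + km$betweenss))
```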
Check the WSS values.
whiskey_k$withinss
## [1] 26.57143 85.93750 162.32432 24.33333 77.50000
whiskey_k$tot.withinss
## [1] 376.6666
whiskey_k$betweenss
## [1] 289.1706
whiskey_k$totss
## [1] 665.8372
The between/total sum of squares ratio of 43.4% indicates that the separation between clusters is only moderate. Nevertheless, new clusters can be formed from this dataset, and each cluster has its own characteristics. Visualizing and profiling the clusters gives us additional information about each cluster, which can be useful from a business perspective.
To visualize the result of K-means clustering, we can use various functions from the factoextra package, or combine it with PCA. This time I will use the factoextra package (I will combine the result with PCA in a later section).
# data preparation for visualization & profiling
whiskey$cluster <- as.factor(whiskey_k$cluster)
whiskey
# clustering visualization
fviz_cluster(object = whiskey_k,
             data = whiskey %>% select(-cluster))
# cluster profiling
(whiskey_centroid <- whiskey %>%
  group_by(cluster) %>%
  summarise_all(mean))
Cluster Profiling:
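Cluster 1: full-bodied (Body 3.57) and the most winey (Winey 2.86), with the highest Nutty (2.00) and Fruity (2.29) scores and almost no Medicinal or Tobacco notes.
Cluster 2: medium-bodied and moderately smoky (Smoky 2.00) with a light medicinal note (Medicinal 1.00); comparatively low Fruity (1.13) and Floral (1.06) scores.
Cluster 3: the lightest body (1.43) and the highest Sweetness (2.49), with Floral (2.11) and Fruity (1.97) notes and little Smoky or Winey character. This is the largest cluster (37 distilleries).
Cluster 4: the heaviest body (3.67) and by far the highest Smoky (3.67), Medicinal (3.33), and Tobacco (0.67) scores, with the lowest Sweetness, Honey, and Floral scores. It contains 6 distilleries, including Ardbeg, Lagavulin, and Talisker.
Cluster 5: medium-bodied with the highest Honey (2.00) and Malty (2.25) scores, plus Fruity (2.05) and Floral (2.10) notes and almost no Medicinal character.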
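The cluster centroids can be turned into a simple recommendation rule: score a customer's preferences on the same twelve flavors, find the nearest centroid, and suggest whiskeys from that cluster. The sketch below is one possible way to do this, not a method taken from the analysis itself. The centroid values are the cluster means from the kmeans output, rounded to two decimals, and the `customer` vector is a hypothetical example.

```r
flavors <- c("Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
             "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral")

# Cluster means copied from the kmeans output, rounded to 2 decimals
centroids <- matrix(c(
  3.57, 2.29, 1.86, 0.14, 0.00, 1.71, 1.71, 2.86, 2.00, 1.57, 2.29, 1.14,
  1.88, 2.00, 2.00, 1.00, 0.19, 1.13, 1.44, 0.94, 1.50, 1.81, 1.13, 1.06,
  1.43, 2.49, 1.05, 0.24, 0.05, 0.97, 1.11, 0.46, 1.16, 1.68, 1.97, 2.11,
  3.67, 1.50, 3.67, 3.33, 0.67, 0.17, 1.67, 0.50, 1.17, 1.33, 1.17, 0.17,
  2.40, 2.40, 1.30, 0.05, 0.05, 2.00, 1.65, 1.45, 1.90, 2.25, 2.05, 2.10
), nrow = 5, byrow = TRUE,
   dimnames = list(as.character(1:5), flavors))

# Hypothetical customer who likes heavy, smoky, medicinal whiskeys
customer <- c(Body = 4, Sweetness = 1, Smoky = 4, Medicinal = 3, Tobacco = 1,
              Honey = 0, Spicy = 2, Winey = 0, Nutty = 1, Malty = 1,
              Fruity = 1, Floral = 0)

# Euclidean distance from the customer to every cluster centroid
dists <- sqrt(rowSums(sweep(centroids, 2, customer[flavors])^2))
best_cluster <- as.integer(names(which.min(dists)))

print(round(dists, 2))
print(best_cluster)
```

For this customer the nearest centroid is cluster 4, so the shop could suggest whiskeys assigned to that cluster in the clustering vector, such as Ardbeg, Lagavulin, or Talisker.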
# Additional profiling
library(ggiraphExtra)  # provides ggRadar()
ggRadar(data = whiskey,
        aes(colour = cluster),
        interactive = TRUE)
PCA using FactoMineR
library(FactoMineR)  # provides PCA() and plot.PCA()
# numeric (quantitative) column names
quanti <- whiskey %>%
  select_if(is.numeric) %>%
  colnames()
# numeric column index
quantivar <- which(colnames(whiskey) %in% quanti)
# categorical (qualitative) column names
quali <- whiskey %>%
  select_if(is.factor) %>%
  colnames()
# categorical column index
qualivar <- which(colnames(whiskey) %in% quali)
(whiskey_pca <- PCA(X = whiskey, # data
                    scale.unit = FALSE,
                    quali.sup = qualivar,
                    ncp = 13,
                    graph = FALSE))
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 86 individuals, described by 13 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$quali.sup" "results for the supplementary categorical variables"
## 12 "$quali.sup$coord" "coord. for the supplementary categories"
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"
## 14 "$call" "summary statistics"
## 15 "$call$centre" "mean of the variables"
## 16 "$call$ecart.type" "standard error of the variables"
## 17 "$call$row.w" "weights for the individuals"
## 18 "$call$col.w" "weights for the variables"
summary(whiskey_pca)
##
## Call:
## PCA(X = whiskey, scale.unit = FALSE, ncp = 13, quali.sup = qualivar,
## graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 2.331 1.488 0.740 0.639 0.560 0.464 0.395
## % of var. 30.111 19.218 9.560 8.250 7.231 5.992 5.108
## Cumulative % of var. 30.111 49.329 58.889 67.139 74.370 80.363 85.471
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.355 0.271 0.248 0.178 0.073
## % of var. 4.587 3.498 3.198 2.297 0.949
## Cumulative % of var. 90.058 93.556 96.754 99.051 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## Aberfeldy | 1.685 | -0.503 0.126 0.089 | 1.122 0.984 0.443 | -0.161
## Aberlour | 4.058 | -1.479 1.091 0.133 | 3.005 7.056 0.548 | 1.517
## AnCnoc | 2.733 | -1.253 0.783 0.210 | -0.654 0.334 0.057 | -0.285
## Ardbeg | 5.484 | 5.272 13.862 0.924 | -0.510 0.203 0.009 | 0.807
## Ardmore | 1.917 | 0.213 0.023 0.012 | 0.174 0.024 0.008 | -0.677
## ArranIsleOf | 2.179 | 0.075 0.003 0.001 | -0.866 0.586 0.158 | -0.633
## Auchentoshan | 3.414 | -2.472 3.048 0.524 | -1.702 2.265 0.249 | 0.444
## Auchroisk | 1.929 | -0.800 0.320 0.172 | 1.231 1.184 0.407 | -0.987
## Aultmore | 2.018 | -0.744 0.276 0.136 | -0.819 0.525 0.165 | -0.723
## Balblair | 2.298 | 0.956 0.455 0.173 | -0.904 0.639 0.155 | 0.111
## ctr cos2
## Aberfeldy 0.041 0.009 |
## Aberlour 3.616 0.140 |
## AnCnoc 0.127 0.011 |
## Ardbeg 1.022 0.022 |
## Ardmore 0.719 0.125 |
## ArranIsleOf 0.629 0.084 |
## Auchentoshan 0.309 0.017 |
## Auchroisk 1.531 0.262 |
## Aultmore 0.822 0.128 |
## Balblair 0.019 0.002 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Body | 0.551 13.046 0.355 | 0.599 24.138 0.420 | 0.026 0.091
## Sweetness | -0.310 4.120 0.189 | 0.057 0.217 0.006 | -0.227 6.963
## Smoky | 0.730 22.843 0.722 | 0.084 0.473 0.010 | 0.188 4.788
## Medicinal | 0.878 33.094 0.796 | -0.196 2.585 0.040 | 0.037 0.186
## Tobacco | 0.140 0.841 0.191 | -0.024 0.040 0.006 | -0.001 0.000
## Honey | -0.337 4.880 0.158 | 0.510 17.472 0.361 | 0.095 1.215
## Spicy | 0.089 0.338 0.013 | 0.214 3.079 0.075 | 0.602 48.894
## Winey | -0.057 0.140 0.004 | 0.780 40.915 0.708 | -0.201 5.435
## Nutty | -0.073 0.227 0.008 | 0.318 6.779 0.151 | -0.154 3.188
## Malty | -0.195 1.634 0.097 | 0.126 1.060 0.040 | 0.093 1.175
## cos2
## Body 0.001 |
## Sweetness 0.101 |
## Smoky 0.048 |
## Medicinal 0.001 |
## Tobacco 0.000 |
## Honey 0.013 |
## Spicy 0.595 |
## Winey 0.047 |
## Nutty 0.035 |
## Malty 0.022 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3
## cluster_1 | 2.699 | 0.419 0.024 0.753 | 2.532 0.880 5.697 | -0.172
## cluster_2 | 1.197 | 0.887 0.550 2.561 | -0.208 0.030 -0.751 | -0.271
## cluster_3 | 1.222 | -0.778 0.405 -4.083 | -0.904 0.547 -5.938 | -0.041
## cluster_4 | 4.507 | 4.473 0.985 7.396 | -0.275 0.004 -0.569 | 0.254
## cluster_5 | 1.381 | -0.759 0.302 -2.522 | 1.035 0.562 4.307 | 0.276
## cos2 v.test
## cluster_1 0.004 -0.548 |
## cluster_2 0.051 -1.387 |
## cluster_3 0.001 -0.382 |
## cluster_4 0.003 0.745 |
## cluster_5 0.040 1.631 |
whiskey_pca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 2.33128029 30.1109794 30.11098
## comp 2 1.48790511 19.2178865 49.32887
## comp 3 0.74017815 9.5601927 58.88906
## comp 4 0.63876410 8.2503219 67.13938
## comp 5 0.55983472 7.2308645 74.37024
## comp 6 0.46394222 5.9923101 80.36256
## comp 7 0.39548319 5.1080886 85.47064
## comp 8 0.35514396 4.5870642 90.05771
## comp 9 0.27083293 3.4980971 93.55580
## comp 10 0.24757748 3.1977281 96.75353
## comp 11 0.17787006 2.2973821 99.05092
## comp 12 0.07348093 0.9490848 100.00000
head(whiskey_pca$ind$coord)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Aberfeldy -0.50338406 1.1220223 -0.1612002 0.5058255 0.28415007 -0.3329955
## Aberlour -1.47888827 3.0048507 1.5170911 -0.1385370 0.71028940 -0.3806686
## AnCnoc -1.25311288 -0.6537207 -0.2847196 0.9274739 -0.11275869 -0.5467528
## Ardbeg 5.27172367 -0.5100752 0.8066720 0.2040745 0.02469125 -0.5135499
## Ardmore 0.21346596 0.1743390 -0.6766643 0.5265755 0.48622054 -0.5768535
## ArranIsleOf 0.07483212 -0.8659972 -0.6326211 -1.5459138 0.26842170 0.3714962
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Aberfeldy -0.6437431 -0.03710274 0.04374452 -0.01261079 0.6560820
## Aberlour 1.1664548 0.32346567 0.28577886 0.78827475 -0.3189369
## AnCnoc 0.6234907 1.01866744 0.37702014 0.42175779 1.5664268
## Ardbeg 0.4131476 0.09757668 0.52324695 -0.04655983 -0.4122918
## Ardmore 0.3747291 -0.72475267 0.55363138 -1.11922415 0.1933856
## ArranIsleOf -0.5189825 0.49826940 -0.10268105 0.46911993 -0.4434524
## Dim.12
## Aberfeldy -0.07072755
## Aberlour 0.10285343
## AnCnoc 0.12767015
## Ardbeg -0.66147722
## Ardmore -0.18375609
## ArranIsleOf -0.21337095
Through PCA, we can retain the most informative principal components (those with high cumulative variance) from the whiskey dataset to perform dimensionality reduction. This reduces the dimension of the dataset while retaining as much information as possible.
In this study, I want to retain at least 90% of the information in the data. From the PCA summary (whiskey_pca$eig), I picked PC1-PC8 out of a total of 12 PCs, which together hold 90.06% of the variance. This reduces the dimension of the original data while retaining 90% of its information.
We can extract the values of PC1-PC8 for all observations and put them into a new data frame. This data frame can later be analyzed with a supervised classification technique or used for other purposes.
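The choice of eight components can be derived directly from the eigenvalues instead of being read off the table. The sketch below copies the eigenvalues printed by whiskey_pca$eig and finds the smallest number of components whose cumulative share reaches a target; the names `target` and `n_keep` are illustrative.

```r
# Eigenvalues copied from the whiskey_pca$eig output
eig <- c(2.33128029, 1.48790511, 0.74017815, 0.63876410, 0.55983472,
         0.46394222, 0.39548319, 0.35514396, 0.27083293, 0.24757748,
         0.17787006, 0.07348093)

# Share of variance per component and its running total, in percent
pct     <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct)

# Smallest number of components whose cumulative share reaches the target
target <- 90
n_keep <- which(cum_pct >= target)[1]

print(round(cum_pct, 2))
print(n_keep)
```

With a 90% target this returns 8, matching the PC1-PC8 selection; changing `target` shows how many components a stricter or looser threshold would need.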
# making a new data frame from PCA result
(whiskey_var90 <- as.data.frame(whiskey_pca$ind$coord[ , 1:8]))
As mentioned in the previous section, PCA can be combined with clustering to obtain a better visualization of the clustering result, or simply to understand the pattern in the dataset. This can be done with a biplot, a common plot in PCA that visualizes high-dimensional data using PC1 and PC2 as the axes.
We can use plot.PCA to visualize a
PCA object with added arguments for customization.
plot.PCA(
  x = whiskey_pca,
  choix = "ind",
  habillage = qualivar, # color individuals by the supplementary variable (cluster)
  label = "quali",
  title = "Colored by Cluster"
)
The plot above is the individual factor map
of a biplot. Each point represents an observation (a distillery), colored by
the cluster assigned by K-means. Dim1 and Dim2 are PC1 and PC2 respectively,
which hold 30.11% and 19.22% of the total information of the
dataset.
The supplementary categories in the PCA summary show where each cluster sits on this map. Cluster 4 lies far along Dim 1 (coordinate 4.473), the axis driven mainly by Medicinal and Smoky (contributions of 33.09% and 22.84%). Cluster 1 lies high on Dim 2 (2.532), the axis driven mainly by Winey and Body (40.92% and 24.14%). Clusters 3 and 5 have almost the same position on Dim 1 (-0.778 and -0.759) and are separated along Dim 2 (-0.904 versus 1.035).
Cluster 2 sits closest to the origin (0.887 on Dim 1, -0.208 on Dim 2), so its separation from the other clusters on this two-dimensional map may be less clear-cut than for clusters 1 and 4. PC1 and PC2 together hold only 49.33% of the information, so clusters that look close on the map can still differ on the remaining components.
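The cluster positions reported under "Supplementary categories" in the PCA summary are the average PC coordinates of the observations in each cluster. The sketch below reproduces that idea with base R on synthetic data, using `prcomp` and `kmeans` in place of the FactoMineR objects; the data and object names are assumptions for illustration.

```r
set.seed(5)

# Synthetic stand-in: two groups of 20 observations in four dimensions
toy <- rbind(
  matrix(rnorm(80, mean = 0), ncol = 4),
  matrix(rnorm(80, mean = 4), ncol = 4)
)

km  <- kmeans(toy, centers = 2, nstart = 10)
pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Coordinates of every observation on the first two components
scores <- pca$x[, 1:2]

# Mean PC1/PC2 position of each cluster (its location on the factor map)
cluster_pos <- aggregate(scores, by = list(cluster = km$cluster), FUN = mean)
print(cluster_pos)
```

Clusters whose mean positions are far apart on PC1 or PC2 appear well separated on the individual factor map, while clusters with similar positions may only differ on later components.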
From the unsupervised learning analysis above, we can summarize that:
- K-means clustering with k = 5 splits the 86 whiskeys into groups with distinct flavor profiles, although the between/total sum of squares ratio of 43.4% shows that the separation is only moderate. The cluster profiles can serve as a basis for recommending whiskeys that match a customer's taste.
- Dimensionality reduction can be performed on this dataset. We can pick PCs out of the 12 available according to how much information we want to retain. In this article, I used 8 PCs to reduce the dimension of the original data while retaining 90% of its information.
- The improved dataset obtained from unsupervised learning (e.g., PCA) can be used further for supervised learning (classification) or for better visualization of high-dimensional data.