1. DATA INTRODUCTION

As data scientists at a whiskey shop, we are asked to recommend whiskeys based on each customer’s taste preferences.

Purpose: to group the whiskeys into clusters, each with a distinctive taste profile.

1.1. Data Preparation

Read the data.

library(dplyr)  # provides %>%, select(), glimpse(), group_by()

(whiskey <- read.csv("whiskies.txt"))

glimpse(whiskey)

## Rows: 86
## Columns: 17
## $ RowID      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Distillery <chr> "Aberfeldy", "Aberlour", "AnCnoc", "Ardbeg", "Ardmore", "Ar…
## $ Body       <int> 2, 3, 1, 4, 2, 2, 0, 2, 2, 2, 4, 3, 4, 2, 3, 2, 1, 2, 2, 1,…
## $ Sweetness  <int> 2, 3, 3, 1, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
## $ Smoky      <int> 2, 1, 2, 4, 2, 1, 0, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 3, 2,…
## $ Medicinal  <int> 0, 0, 0, 4, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2,…
## $ Tobacco    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Honey      <int> 2, 4, 2, 0, 1, 1, 1, 2, 1, 0, 2, 3, 2, 2, 3, 2, 0, 1, 2, 2,…
## $ Spicy      <int> 1, 3, 0, 2, 1, 1, 1, 1, 0, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2,…
## $ Winey      <int> 2, 2, 0, 0, 1, 1, 0, 2, 0, 0, 3, 1, 0, 0, 1, 1, 1, 2, 1, 1,…
## $ Nutty      <int> 2, 2, 2, 1, 2, 0, 2, 2, 2, 2, 3, 0, 2, 0, 2, 2, 0, 2, 1, 2,…
## $ Malty      <int> 2, 3, 2, 2, 3, 1, 2, 2, 2, 1, 0, 2, 2, 2, 3, 2, 2, 2, 1, 2,…
## $ Fruity     <int> 2, 3, 3, 1, 1, 1, 3, 2, 2, 2, 1, 2, 2, 3, 2, 2, 2, 2, 1, 2,…
## $ Floral     <int> 2, 2, 2, 0, 1, 2, 3, 1, 2, 1, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2,…
## $ Postcode   <chr> " \tPH15 2EB", " \tAB38 9PJ", " \tAB5 5LI", " \tPA42 7EB", …
## $ Latitude   <int> 286580, 326340, 352960, 141560, 355350, 194050, 247670, 340…
## $ Longitude  <dbl> 749680, 842570, 839320, 646220, 829140, 649950, 672610, 848…

The data are malt whiskey flavor scores from 86 distilleries, taken from the research of Dr. Wishart (University of St. Andrews). Each whiskey is scored from 0 to 4 on each of 12 flavor categories, based on organoleptic (sensory) tests:
- Body: level of strength of taste (light/heavy)
- Sweetness: level of sweetness
- Smoky: level of smoke taste
- Medicinal: level of medicinal (bitter) taste
- Tobacco: tobacco taste level
- Honey: level of honey taste
- Spicy: spicy level
- Winey: wine taste level
- Nutty: nutty flavor level
- Malty: malt (barley) flavor level
- Fruity: fruit flavor level
- Floral: floral flavor level

1.2. Data Preprocessing

Check for missing values.

anyNA(whiskey)

## [1] FALSE

Data Cleansing

# assign value from Distillery column to row name
rownames(whiskey) <- whiskey$Distillery

# discard unused columns
whiskey <- whiskey %>% 
  select(-c(RowID, Distillery, Postcode, Latitude, Longitude))

head(whiskey)

2. EXPLORATORY DATA ANALYSIS

Check the scale between variables

summary(whiskey)

##       Body        Sweetness         Smoky         Medicinal     
##  Min.   :0.00   Min.   :1.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2.00   Median :2.000   Median :1.000   Median :0.0000  
##  Mean   :2.07   Mean   :2.291   Mean   :1.535   Mean   :0.5465  
##  3rd Qu.:2.00   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :4.00   Max.   :4.000   Max.   :4.000   Max.   :4.0000  
##     Tobacco           Honey           Spicy           Winey       
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.000   Median :1.000   Median :1.0000  
##  Mean   :0.1163   Mean   :1.244   Mean   :1.384   Mean   :0.9767  
##  3rd Qu.:0.0000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :4.000   Max.   :3.000   Max.   :4.0000  
##      Nutty           Malty           Fruity          Floral     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :2.000   Median :2.000   Median :2.000  
##  Mean   :1.465   Mean   :1.802   Mean   :1.802   Mean   :1.698  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :4.000   Max.   :3.000   Max.   :3.000   Max.   :4.000

Since every variable is scored on the same 0-4 scale, there is no need to scale the data.

Here we explore the distribution of each numeric variable with density plots and the relationships between variables with scatterplots, both provided by the ggpairs function from the GGally package.
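The scale check can also be done programmatically. The sketch below is illustrative only: it uses the first 20 values of three columns copied from the glimpse() output as stand-in data (the name `flavors` is hypothetical), confirms that the ranges stay within 0-4, and shows how base R's scale() would standardize the columns if the units did differ.

```r
# Stand-in data: first 20 values of three columns, copied from the glimpse() output
flavors <- data.frame(
  Body      = c(2, 3, 1, 4, 2, 2, 0, 2, 2, 2, 4, 3, 4, 2, 3, 2, 1, 2, 2, 1),
  Smoky     = c(2, 1, 2, 4, 2, 1, 0, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 3, 2),
  Medicinal = c(0, 0, 0, 4, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2)
)

# Minimum and maximum of every column: all fall inside the 0-4 scoring scale
col_ranges <- sapply(flavors, range)
print(col_ranges)

# If the units differed, scale() would bring each column to mean 0 and sd 1
flavors_scaled <- scale(flavors)
print(round(colMeans(flavors_scaled), 10))
print(apply(flavors_scaled, 2, sd))
```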

library(GGally)  # provides ggpairs(); also attaches ggplot2 for theme()
ggpairs(whiskey[, 1:7], showStrips = FALSE) + 
  theme(axis.text = element_text(colour = "black", size = 11),
        strip.background = element_rect(fill = "#d63d2d"),
        strip.text = element_text(colour = "white", size = 12,
                                  face = "bold"))

The plot shows strong correlations between some variables, including Body-Smoky and Smoky-Medicinal. This indicates multicollinearity in the dataset, which may make it unsuitable for classification algorithms that assume no multicollinearity.

Principal Component Analysis can be applied to this data to produce uncorrelated components, while also reducing the dimension of the data and retaining as much information as possible. The result can then be used for classification at a lower computational cost.
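The correlations read off the plot can also be computed directly. This is a minimal sketch on stand-in data (the first 20 values of Body, Smoky and Medicinal from the glimpse() output), so the numbers will differ from those of the full 86-row dataset. The 0.5 cutoff and the names `flavors`, `threshold` and `high` are assumptions made for illustration.

```r
# Stand-in data: first 20 values of three columns, copied from the glimpse() output
flavors <- data.frame(
  Body      = c(2, 3, 1, 4, 2, 2, 0, 2, 2, 2, 4, 3, 4, 2, 3, 2, 1, 2, 2, 1),
  Smoky     = c(2, 1, 2, 4, 2, 1, 0, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 3, 2),
  Medicinal = c(0, 0, 0, 4, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2)
)

# Pearson correlation matrix of the flavor scores
cor_mat <- cor(flavors)
print(round(cor_mat, 2))

# List each variable pair once (upper triangle) whose |r| exceeds the cutoff
threshold <- 0.5  # hypothetical cutoff, chosen only for illustration
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
print(data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
))
```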
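To illustrate why principal components remove multicollinearity, the sketch below runs base R's prcomp() on stand-in data (the first 20 values of three columns from the glimpse() output) and checks that the resulting component scores are uncorrelated. prcomp() is used here only as a dependency-free stand-in; the analysis in this document uses FactoMineR's PCA().

```r
# Stand-in data: first 20 values of three columns, copied from the glimpse() output
flavors <- data.frame(
  Body      = c(2, 3, 1, 4, 2, 2, 0, 2, 2, 2, 4, 3, 4, 2, 3, 2, 1, 2, 2, 1),
  Smoky     = c(2, 1, 2, 4, 2, 1, 0, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 3, 2),
  Medicinal = c(0, 0, 0, 4, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2)
)

# Unscaled PCA (the variables already share one scale)
pca <- prcomp(flavors, center = TRUE, scale. = FALSE)

# Correlations between the component scores: off-diagonals are zero
# up to floating-point error
score_cor <- cor(pca$x)
print(round(score_cor, 10))

# Share of the total variance carried by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_share, 3))
```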

3. CLUSTERING

Finding the optimal K.

library(factoextra)  # provides fviz_nbclust() and fviz_cluster()
fviz_nbclust(x = whiskey,
             FUNcluster = kmeans,
             method = "wss")

From the plot, 5 appears to be the optimal number of clusters. Beyond k = 5, adding more clusters no longer gives a considerable decrease in the total within-cluster sum of squares (the measure of internal cohesion). Because the total sum of squares is fixed, this also means the between sum of squares and the between/total ratio (the measure of external separation) no longer increase considerably.
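The curve drawn by the "wss" method can be reproduced by hand: run k-means for several values of k and record the total within-cluster sum of squares each time. The sketch below does this on stand-in data (the first 20 values of three columns from the glimpse() output). The range 1:6 and `nstart = 10` are illustrative choices and are not the settings used in this document.

```r
# Stand-in data: first 20 values of three columns, copied from the glimpse() output
flavors <- data.frame(
  Body      = c(2, 3, 1, 4, 2, 2, 0, 2, 2, 2, 4, 3, 4, 2, 3, 2, 1, 2, 2, 1),
  Smoky     = c(2, 1, 2, 4, 2, 1, 0, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 3, 2),
  Medicinal = c(0, 0, 0, 4, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2)
)

set.seed(50)
k_values <- 1:6

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(flavors, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the drop in WSS flattens out
print(data.frame(k = k_values, wss = round(wss, 2)))
```

With k = 1 the within sum of squares equals the total sum of squares, which gives a quick sanity check on the first value.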

# k-means clustering
set.seed(50)
(whiskey_k <- kmeans(x = whiskey,
                    centers = 5))

## K-means clustering with 5 clusters of sizes 7, 16, 37, 6, 20
## 
## Cluster means:
##       Body Sweetness    Smoky Medicinal    Tobacco     Honey    Spicy     Winey
## 1 3.571429  2.285714 1.857143 0.1428571 0.00000000 1.7142857 1.714286 2.8571429
## 2 1.875000  2.000000 2.000000 1.0000000 0.18750000 1.1250000 1.437500 0.9375000
## 3 1.432432  2.486486 1.054054 0.2432432 0.05405405 0.9729730 1.108108 0.4594595
## 4 3.666667  1.500000 3.666667 3.3333333 0.66666667 0.1666667 1.666667 0.5000000
## 5 2.400000  2.400000 1.300000 0.0500000 0.05000000 2.0000000 1.650000 1.4500000
##      Nutty    Malty   Fruity    Floral
## 1 2.000000 1.571429 2.285714 1.1428571
## 2 1.500000 1.812500 1.125000 1.0625000
## 3 1.162162 1.675676 1.972973 2.1081081
## 4 1.166667 1.333333 1.166667 0.1666667
## 5 1.900000 2.250000 2.050000 2.1000000
## 
## Clustering vector:
##          Aberfeldy           Aberlour             AnCnoc             Ardbeg 
##                  5                  5                  3                  4 
##            Ardmore        ArranIsleOf       Auchentoshan          Auchroisk 
##                  2                  3                  3                  5 
##           Aultmore           Balblair          Balmenach           Belvenie 
##                  3                  2                  1                  5 
##           BenNevis           Benriach          Benrinnes          Benromach 
##                  5                  3                  5                  5 
##           Bladnoch         BlairAthol            Bowmore      Bruichladdich 
##                  3                  5                  2                  2 
##       Bunnahabhain           Caol Ila             Cardhu          Clynelish 
##                  3                  4                  3                  4 
##      Craigallechie       Craigganmore          Dailuaine            Dalmore 
##                  5                  3                  1                  1 
##         Dalwhinnie           Deanston           Dufftown           Edradour 
##                  3                  5                  3                  5 
## GlenDeveronMacduff          GlenElgin        GlenGarioch          GlenGrant 
##                  2                  3                  2                  3 
##          GlenKeith          GlenMoray            GlenOrd         GlenScotia 
##                  3                  3                  5                  2 
##           GlenSpey       Glenallachie        Glendronach         Glendullan 
##                  3                  3                  1                  5 
##        Glenfarclas        Glenfiddich          Glengoyne        Glenkinchie 
##                  5                  3                  3                  3 
##          Glenlivet         Glenlossie       Glenmorangie         Glenrothes 
##                  5                  3                  3                  2 
##         Glenturret      Highland Park          Inchgower       Isle of Jura 
##                  5                  2                  3                  2 
##          Knochando          Lagavulin           Laphroig           Linkwood 
##                  5                  4                  4                  3 
##        Loch Lomond           Longmorn           Macallan        Mannochmore 
##                  3                  5                  1                  3 
##         Miltonduff           Mortlach               Oban     OldFettercairn 
##                  3                  1                  2                  2 
##        OldPulteney       RoyalBrackla     RoyalLochnagar              Scapa 
##                  2                  3                  1                  5 
##           Speyburn           Speyside         Springbank         Strathisla 
##                  3                  3                  2                  5 
##         Strathmill           Talisker             Tamdhu         Tamnavulin 
##                  3                  4                  3                  3 
##          Teaninich          Tobermory            Tomatin          Tomintoul 
##                  3                  3                  2                  3 
##            Tormore       Tullibardine 
##                  2                  3 
## 
## Within cluster sum of squares by cluster:
## [1]  26.57143  85.93750 162.32432  24.33333  77.50000
##  (between_SS / total_SS =  43.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The number of iterations the k-means algorithm needed to reach a stable clustering:

whiskey_k$iter

## [1] 5

The goodness of the clustering result can be judged from three values:

- Within Sum of Squares ($withinss): the sum of the squared distances from each observation to the centroid of its own cluster.
- Between Sum of Squares ($betweenss): the weighted sum of the squared distances from each centroid to the global average. The weight is the number of observations in the cluster.
- Total Sum of Squares ($totss): the sum of the squared distances from each observation to the global average.

Check the WSS values.

whiskey_k$withinss

## [1]  26.57143  85.93750 162.32432  24.33333  77.50000

whiskey_k$tot.withinss

## [1] 376.6666

Check the BSS/TSS ratio.

whiskey_k$betweenss

## [1] 289.1706

whiskey_k$totss

## [1] 665.8372
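The three quantities are linked by the identity TSS = WSS + BSS, which the values above satisfy: 376.6666 + 289.1706 = 665.8372. The sketch below recomputes each quantity from its definition and compares it with what kmeans() reports. It runs on stand-in data (the first 20 values of three columns from the glimpse() output) with an arbitrary k = 3, so the numbers differ from those of the full analysis.

```r
# Stand-in data: first 20 values of three columns, copied from the glimpse() output
flavors <- cbind(
  Body      = c(2, 3, 1, 4, 2, 2, 0, 2, 2, 2, 4, 3, 4, 2, 3, 2, 1, 2, 2, 1),
  Smoky     = c(2, 1, 2, 4, 2, 1, 0, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 3, 2),
  Medicinal = c(0, 0, 0, 4, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2)
)

set.seed(50)
km <- kmeans(flavors, centers = 3, nstart = 10)  # k = 3 is arbitrary here

grand_mean <- colMeans(flavors)

# TSS: squared distances from each observation to the global average
tss <- sum(sweep(flavors, 2, grand_mean)^2)

# WSS: squared distances from each observation to its own cluster centroid
wss <- sum((flavors - km$centers[km$cluster, ])^2)

# BSS: squared distances from each centroid to the global average,
# weighted by cluster size
bss <- sum(km$size * rowSums(sweep(km$centers, 2, grand_mean)^2))

print(c(tss = tss, wss = wss, bss = bss, ratio = bss / tss))
```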

The BSS/TSS ratio is 289.1706 / 665.8372 ≈ 0.434, so the clustering accounts for about 43.4% of the total variation. Nevertheless, the clusters formed from this dataset each have their own characteristics. Visualizing and profiling the clusters gives additional information about each one, which can be useful from a business perspective.

To visualize the result of k-means clustering, we can use functions from the factoextra package or combine the result with PCA. Here I use factoextra (I combine the result with PCA in a later section).

# data preparation for visualization & profiling
whiskey$cluster <- as.factor(whiskey_k$cluster)
whiskey

# clustering visualization
fviz_cluster(object = whiskey_k, 
             data = whiskey %>% select(-cluster))

# cluster profiling
(whiskey_centroid <- whiskey %>% 
  group_by(cluster) %>% 
  summarise_all(mean))

Cluster Profiling (cluster numbers follow the k-means output printed above):

Cluster 1:
- Highest Spicy, Winey, Nutty, Fruity
- Lowest Tobacco
- Label: mixed-flavor ("nano-nano"), juicy whiskey
Cluster 2:
- No flavor is highest / none dominates
- Lowest Fruity
- Label: mediocre-flavored whiskey
Cluster 3:
- Highest Sweetness, Floral
- Lowest Body, Smoky, Spicy, Winey, Nutty
- Label: flower-flavored whiskey / flower garden
Cluster 4:
- Highest Body, Smoky, Medicinal, Tobacco
- Lowest Sweetness, Honey, Malty, Floral
- Label: "bad boys" whiskey -> bitter whiskey
Cluster 5:
- Highest Honey, Malty
- Lowest Medicinal
- Label: sweet, honeyed whiskey
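A profile like this can be cross-checked mechanically by asking, for every flavor, which cluster has the highest and the lowest mean. The sketch below does so on the cluster means copied (rounded) from the k-means output printed earlier. The names `centers`, `highest` and `lowest` are illustrative.

```r
# Cluster means copied from the printed k-means output (rounded)
centers <- rbind(
  c(3.571, 2.286, 1.857, 0.143, 0.0000, 1.714, 1.7143, 2.8571, 2.000, 1.5714, 2.286, 1.1429),
  c(1.875, 2.000, 2.000, 1.000, 0.1875, 1.125, 1.4375, 0.9375, 1.500, 1.8125, 1.125, 1.0625),
  c(1.432, 2.486, 1.054, 0.243, 0.0541, 0.973, 1.1081, 0.4595, 1.162, 1.6757, 1.973, 2.1081),
  c(3.667, 1.500, 3.667, 3.333, 0.6667, 0.167, 1.6667, 0.5000, 1.167, 1.3333, 1.167, 0.1667),
  c(2.400, 2.400, 1.300, 0.050, 0.0500, 2.000, 1.6500, 1.4500, 1.900, 2.2500, 2.050, 2.1000)
)
dimnames(centers) <- list(
  1:5,
  c("Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral")
)

# For each flavor, the cluster with the highest and the lowest mean
highest <- apply(centers, 2, which.max)
lowest  <- apply(centers, 2, which.min)

# Group the flavors by the cluster that leads (or trails) on them
print(split(names(highest), factor(highest, levels = 1:5)))
print(split(names(lowest),  factor(lowest,  levels = 1:5)))
```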

# Additional profiling
library(ggiraphExtra)  # provides ggRadar()
ggRadar(data=whiskey, 
        aes(colour=cluster), 
        interactive=TRUE)

4. PRINCIPAL COMPONENT ANALYSIS

PCA using FactoMineR

library(FactoMineR)  # provides PCA() and plot.PCA()

# numeric column names (quantitative)
quanti <- whiskey %>% 
  select_if(is.numeric) %>% 
  colnames()

# numeric column index
quantivar <- which(colnames(whiskey) %in% quanti)

# categorical column names (qualitative)
quali <- whiskey %>% 
  select_if(is.factor) %>% 
  colnames()

# categorical column index
qualivar <- which(colnames(whiskey) %in% quali)

(whiskey_pca <- PCA(X = whiskey, #data
                  scale.unit = FALSE,
                  quali.sup = qualivar,
                  ncp = 13,
                  graph = FALSE))

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 86 individuals, described by 13 variables
## *The results are available in the following objects:
## 
##    name                description                                          
## 1  "$eig"              "eigenvalues"                                        
## 2  "$var"              "results for the variables"                          
## 3  "$var$coord"        "coord. for the variables"                           
## 4  "$var$cor"          "correlations variables - dimensions"                
## 5  "$var$cos2"         "cos2 for the variables"                             
## 6  "$var$contrib"      "contributions of the variables"                     
## 7  "$ind"              "results for the individuals"                        
## 8  "$ind$coord"        "coord. for the individuals"                         
## 9  "$ind$cos2"         "cos2 for the individuals"                           
## 10 "$ind$contrib"      "contributions of the individuals"                   
## 11 "$quali.sup"        "results for the supplementary categorical variables"
## 12 "$quali.sup$coord"  "coord. for the supplementary categories"            
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"             
## 14 "$call"             "summary statistics"                                 
## 15 "$call$centre"      "mean of the variables"                              
## 16 "$call$ecart.type"  "standard error of the variables"                    
## 17 "$call$row.w"       "weights for the individuals"                        
## 18 "$call$col.w"       "weights for the variables"

summary(whiskey_pca)

## 
## Call:
## PCA(X = whiskey, scale.unit = FALSE, ncp = 13, quali.sup = qualivar,  
##      graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               2.331   1.488   0.740   0.639   0.560   0.464   0.395
## % of var.             30.111  19.218   9.560   8.250   7.231   5.992   5.108
## Cumulative % of var.  30.111  49.329  58.889  67.139  74.370  80.363  85.471
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.355   0.271   0.248   0.178   0.073
## % of var.              4.587   3.498   3.198   2.297   0.949
## Cumulative % of var.  90.058  93.556  96.754  99.051 100.000
## 
## Individuals (the 10 first)
##                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## Aberfeldy    |  1.685 | -0.503  0.126  0.089 |  1.122  0.984  0.443 | -0.161
## Aberlour     |  4.058 | -1.479  1.091  0.133 |  3.005  7.056  0.548 |  1.517
## AnCnoc       |  2.733 | -1.253  0.783  0.210 | -0.654  0.334  0.057 | -0.285
## Ardbeg       |  5.484 |  5.272 13.862  0.924 | -0.510  0.203  0.009 |  0.807
## Ardmore      |  1.917 |  0.213  0.023  0.012 |  0.174  0.024  0.008 | -0.677
## ArranIsleOf  |  2.179 |  0.075  0.003  0.001 | -0.866  0.586  0.158 | -0.633
## Auchentoshan |  3.414 | -2.472  3.048  0.524 | -1.702  2.265  0.249 |  0.444
## Auchroisk    |  1.929 | -0.800  0.320  0.172 |  1.231  1.184  0.407 | -0.987
## Aultmore     |  2.018 | -0.744  0.276  0.136 | -0.819  0.525  0.165 | -0.723
## Balblair     |  2.298 |  0.956  0.455  0.173 | -0.904  0.639  0.155 |  0.111
##                 ctr   cos2  
## Aberfeldy     0.041  0.009 |
## Aberlour      3.616  0.140 |
## AnCnoc        0.127  0.011 |
## Ardbeg        1.022  0.022 |
## Ardmore       0.719  0.125 |
## ArranIsleOf   0.629  0.084 |
## Auchentoshan  0.309  0.017 |
## Auchroisk     1.531  0.262 |
## Aultmore      0.822  0.128 |
## Balblair      0.019  0.002 |
## 
## Variables (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Body         |  0.551 13.046  0.355 |  0.599 24.138  0.420 |  0.026  0.091
## Sweetness    | -0.310  4.120  0.189 |  0.057  0.217  0.006 | -0.227  6.963
## Smoky        |  0.730 22.843  0.722 |  0.084  0.473  0.010 |  0.188  4.788
## Medicinal    |  0.878 33.094  0.796 | -0.196  2.585  0.040 |  0.037  0.186
## Tobacco      |  0.140  0.841  0.191 | -0.024  0.040  0.006 | -0.001  0.000
## Honey        | -0.337  4.880  0.158 |  0.510 17.472  0.361 |  0.095  1.215
## Spicy        |  0.089  0.338  0.013 |  0.214  3.079  0.075 |  0.602 48.894
## Winey        | -0.057  0.140  0.004 |  0.780 40.915  0.708 | -0.201  5.435
## Nutty        | -0.073  0.227  0.008 |  0.318  6.779  0.151 | -0.154  3.188
## Malty        | -0.195  1.634  0.097 |  0.126  1.060  0.040 |  0.093  1.175
##                cos2  
## Body          0.001 |
## Sweetness     0.101 |
## Smoky         0.048 |
## Medicinal     0.001 |
## Tobacco       0.000 |
## Honey         0.013 |
## Spicy         0.595 |
## Winey         0.047 |
## Nutty         0.035 |
## Malty         0.022 |
## 
## Supplementary categories
##                  Dist    Dim.1   cos2 v.test    Dim.2   cos2 v.test    Dim.3
## cluster_1    |  2.699 |  0.419  0.024  0.753 |  2.532  0.880  5.697 | -0.172
## cluster_2    |  1.197 |  0.887  0.550  2.561 | -0.208  0.030 -0.751 | -0.271
## cluster_3    |  1.222 | -0.778  0.405 -4.083 | -0.904  0.547 -5.938 | -0.041
## cluster_4    |  4.507 |  4.473  0.985  7.396 | -0.275  0.004 -0.569 |  0.254
## cluster_5    |  1.381 | -0.759  0.302 -2.522 |  1.035  0.562  4.307 |  0.276
##                cos2 v.test  
## cluster_1     0.004 -0.548 |
## cluster_2     0.051 -1.387 |
## cluster_3     0.001 -0.382 |
## cluster_4     0.003  0.745 |
## cluster_5     0.040  1.631 |

whiskey_pca$eig

##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1  2.33128029             30.1109794                          30.11098
## comp 2  1.48790511             19.2178865                          49.32887
## comp 3  0.74017815              9.5601927                          58.88906
## comp 4  0.63876410              8.2503219                          67.13938
## comp 5  0.55983472              7.2308645                          74.37024
## comp 6  0.46394222              5.9923101                          80.36256
## comp 7  0.39548319              5.1080886                          85.47064
## comp 8  0.35514396              4.5870642                          90.05771
## comp 9  0.27083293              3.4980971                          93.55580
## comp 10 0.24757748              3.1977281                          96.75353
## comp 11 0.17787006              2.2973821                          99.05092
## comp 12 0.07348093              0.9490848                         100.00000

head(whiskey_pca$ind$coord)

##                   Dim.1      Dim.2      Dim.3      Dim.4       Dim.5      Dim.6
## Aberfeldy   -0.50338406  1.1220223 -0.1612002  0.5058255  0.28415007 -0.3329955
## Aberlour    -1.47888827  3.0048507  1.5170911 -0.1385370  0.71028940 -0.3806686
## AnCnoc      -1.25311288 -0.6537207 -0.2847196  0.9274739 -0.11275869 -0.5467528
## Ardbeg       5.27172367 -0.5100752  0.8066720  0.2040745  0.02469125 -0.5135499
## Ardmore      0.21346596  0.1743390 -0.6766643  0.5265755  0.48622054 -0.5768535
## ArranIsleOf  0.07483212 -0.8659972 -0.6326211 -1.5459138  0.26842170  0.3714962
##                  Dim.7       Dim.8       Dim.9      Dim.10     Dim.11
## Aberfeldy   -0.6437431 -0.03710274  0.04374452 -0.01261079  0.6560820
## Aberlour     1.1664548  0.32346567  0.28577886  0.78827475 -0.3189369
## AnCnoc       0.6234907  1.01866744  0.37702014  0.42175779  1.5664268
## Ardbeg       0.4131476  0.09757668  0.52324695 -0.04655983 -0.4122918
## Ardmore      0.3747291 -0.72475267  0.55363138 -1.11922415  0.1933856
## ArranIsleOf -0.5189825  0.49826940 -0.10268105  0.46911993 -0.4434524
##                  Dim.12
## Aberfeldy   -0.07072755
## Aberlour     0.10285343
## AnCnoc       0.12767015
## Ardbeg      -0.66147722
## Ardmore     -0.18375609
## ArranIsleOf -0.21337095

Through PCA, we can retain the most informative principal components (those with high cumulative variance) from the whiskey dataset to perform dimensionality reduction. This reduces the dimension of the dataset while retaining as much information as possible.

In this study, I want to retain at least 90% of the information in the data. From the PCA summary (whiskey_pca$eig), PC1-PC8 together carry 90.06% of the variance, so I picked these 8 of the 12 PCs. This reduces the dimension of the original data while retaining 90% of its information.

We can extract the values of PC1-PC8 for all observations and put them into a new data frame. This data frame can later be used for supervised learning (classification) or other purposes.
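The number of components to keep can be derived from the eigenvalues rather than read off the table. The sketch below copies the 12 eigenvalues from the whiskey_pca$eig output and finds the first component at which the cumulative share of variance reaches the 90% target. The names `eig`, `target` and `n_keep` are illustrative.

```r
# Eigenvalues copied from the whiskey_pca$eig output
eig <- c(2.33128029, 1.48790511, 0.74017815, 0.63876410,
         0.55983472, 0.46394222, 0.39548319, 0.35514396,
         0.27083293, 0.24757748, 0.17787006, 0.07348093)

# Percentage of variance per component and its running total
pct     <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct)
print(round(cum_pct, 2))

# Smallest number of components whose cumulative share reaches the target
target <- 90
n_keep <- which(cum_pct >= target)[1]
print(n_keep)  # 8
```

With the real PCA object, the same index could be used to subset whiskey_pca$ind$coord, as done in the next step.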

# making a new data frame from PCA result
(whiskey_var90 <- as.data.frame(whiskey_pca$ind$coord[ , 1:8]))

5. COMBINING CLUSTERING AND PCA

As noted in the clustering section, PCA can be combined with clustering to obtain a better visualization of the clustering result, or simply to understand the pattern in the dataset. This can be done with a biplot, a common PCA plot that visualizes high-dimensional data using PC1 and PC2 as the axes.

We can use plot.PCA to visualize a PCA object with added arguments for customization.

# whiskey has no Type column, so the observations are colored by cluster only
plot.PCA(
  x = whiskey_pca,
  choix = "ind",
  habillage = qualivar,  # column index of the supplementary factor cluster
  label = "quali",
  title = "Colored by Cluster"
)

The plot above is the individual factor map of a biplot. Each point is an observation (a distillery), colored by the cluster assigned by k-means. Dim 1 and Dim 2 are PC1 and PC2, which carry 30.11% and 19.22% of the total information in the dataset respectively.

In the Colored by Cluster plot, we can see that the clusters separate nicely, without overlapping. The supplementary category coordinates in the PCA summary show where each cluster centroid sits: cluster 4 lies far along Dim 1 (4.473) and cluster 1 far along Dim 2 (2.532), while clusters 3 and 5 have almost the same Dim 1 position (-0.778 and -0.759) and are separated by Dim 2 (-0.904 versus 1.035).

This agrees with the variable contributions in the same summary: Dim 1 is driven mainly by Medicinal, Smoky and Body, and Dim 2 by Winey, Body and Honey.

6. SUMMARY

From the unsupervised learning analysis above, we can summarize that:

- K-means clustering can be done using this dataset. With k = 5, the 86 whiskeys form clusters with distinct flavor profiles, and the between/total sum of squares ratio is 43.4%.
- Dimensionality reduction can be performed using this dataset. To do so, we pick PCs from the total of 12 PCs according to how much information we want to retain. In this article, I used 8 PCs to reduce the dimension of the original data while retaining 90% of its information.
- The improved dataset obtained from unsupervised learning (e.g., PCA) can be used further for supervised learning (classification) or for better visualization of high-dimensional data.

WHISKEY - Unsupervised Learning Analysis

Tubagus Fathul Arifin

2022-10-21