Clustering can be considered the most important unsupervised learning problem; like every problem of this kind, it deals with finding structure in a collection of unlabeled data. A cluster is therefore a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
In this project I implement clustering methods that group the records of a chosen data set as well as possible. My data set contains information about different beers. I am going to use clustering algorithms to group the beers and compare the resulting clusters with the groupings already present in the data set.
This data set contains tasting profiles and consumer reviews for 3197 unique beers from 934 different breweries. It is provided on the Kaggle website: https://www.kaggle.com/datasets/ruthgn/beer-profile-and-ratings-data-set.
| Id | Name | Description |
|---|---|---|
| 1 | ABV | Alcohol content of beer |
| 2 | Min IBU | The minimum IBU value beer can possess |
| 3 | Max IBU | The maximum IBU value beer can possess |
| 4 | Astringency | Mouthfeel; this and the following columns describe the beer's tasting profile |
| 5 | Body | Mouthfeel |
| 6 | Alcohol | Mouthfeel |
| 7 | Bitter | Taste |
| 8 | Sweet | Taste |
| 9 | Sour | Taste |
| 10 | Salty | Taste |
| 11 | Fruits | Flavor and aroma |
| 12 | Hoppy | Flavor and aroma |
| 13 | Spices | Flavor and aroma |
| 14 | Malty | Flavor and aroma |
#load required libraries
library(readxl)     #read_excel()
library(corrplot)   #correlation plot
library(hopkins)    #hopkins() clusterability statistic
library(ClusterR)   #Optimal_Clusters_KMeans()
library(cluster)    #pam(), clara()
library(factoextra) #eclust(), fviz_nbclust(), fviz_silhouette(), fviz_cluster()
library(gridExtra)  #grid.arrange()
#load data
beer <- read_excel("C:/Users/skarbowiak/Desktop/Studia II UW/Semestr 1 DSaBA/Unsupervised learning 2022Z/Clustering projects/datasets/beer_data.xlsx")
#check dataset first rows
head(beer)
## # A tibble: 6 × 22
## ID Name Style Brewery Beer …¹ Descr…² ABV Min I…³ Max I…⁴ Astri…⁵ Body
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Lambi… Lamb… Brouwe… Brouwe… "Notes… 5 0 10 18 10
## 2 2 Sang … Wild… Cascad… Cascad… "Notes… 9.3 5 30 16 30
## 3 3 Helio… Happ… Helios… Helios… "Notes… 5 0 0 6 7
## 4 4 De Tr… Lamb… Brouwe… Brouwe… "Notes… 5.5 0 10 51 23
## 5 5 Framb… Lamb… Brouwe… Brouwe… "Notes… 5 15 21 26 20
## 6 6 Rogge… Rye … Beaver… Beaver… "Notes… 4.6 10 20 8 23
## # … with 11 more variables: Alcohol <dbl>, Bitter <dbl>, Sweet <dbl>,
## # Sour <dbl>, Salty <dbl>, Fruits <dbl>, Hoppy <dbl>, Spices <dbl>,
## # Malty <dbl>, review <chr>, score <dbl>, and abbreviated variable names
## # ¹`Beer Name (Full)`, ²Description, ³`Min IBU`, ⁴`Max IBU`, ⁵Astringency
#check details about dataset
str(beer)
## tibble [3,197 × 22] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:3197] 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr [1:3197] "Lambik (2 Year Old Unblended)" "Sang Noir" "Helios Goya Dry" "De Troch Oude Gueuze" ...
## $ Style : chr [1:3197] "Lambic - Traditional" "Wild Ale" "Happoshu" "Lambic - Gueuze" ...
## $ Brewery : chr [1:3197] "Brouwerij Girardin" "Cascade Brewing / Raccoon Lodge & Brewpub" "Helios Distillery Co., Ltd." "Brouwerij De Troch" ...
## $ Beer Name (Full): chr [1:3197] "Brouwerij Girardin Lambik (2 Year Old Unblended)" "Cascade Brewing / Raccoon Lodge & Brewpub Cascade Sang Noir" "Helios Distillery Co., Ltd. Helios Goya Dry" "Brouwerij De Troch De Troch Gueuze" ...
## $ Description : chr [1:3197] "Notes:" "Notes:Sang Noir is a blend of imperial spiced red ales aged in bourbon and wine barrels with Bing cherries for "| __truncated__ "Notes:" "Notes:" ...
## $ ABV : chr [1:3197] "5" "9.3" "5" "5.5" ...
## $ Min IBU : num [1:3197] 0 5 0 0 15 10 20 15 50 25 ...
## $ Max IBU : num [1:3197] 10 30 0 10 21 20 40 21 70 50 ...
## $ Astringency : num [1:3197] 18 16 6 51 26 8 35 30 16 5 ...
## $ Body : num [1:3197] 10 30 7 23 20 23 39 25 42 44 ...
## $ Alcohol : num [1:3197] 2 38 3 5 4 7 5 6 18 49 ...
## $ Bitter : num [1:3197] 12 1 13 14 4 2 53 5 91 14 ...
## $ Sweet : num [1:3197] 21 47 8 49 97 37 37 99 40 86 ...
## $ Sour : num [1:3197] 48 98 6 200 95 39 38 119 55 21 ...
## $ Salty : num [1:3197] 0 0 0 7 0 0 1 0 0 0 ...
## $ Fruits : num [1:3197] 24 72 6 96 94 37 38 113 76 53 ...
## $ Hoppy : num [1:3197] 15 3 14 8 8 4 77 12 117 5 ...
## $ Spices : num [1:3197] 2 19 5 14 0 16 20 0 2 12 ...
## $ Malty : num [1:3197] 9 47 7 30 8 51 63 8 30 57 ...
## $ review : chr [1:3197] "5" "4.807692" "4.75" "4.75" ...
## $ score : num [1:3197] 3 3 3 3 3 3 3 3 3 3 ...
summary(beer)
## ID Name Style Brewery
## Min. : 1 Length:3197 Length:3197 Length:3197
## 1st Qu.: 800 Class :character Class :character Class :character
## Median :1599 Mode :character Mode :character Mode :character
## Mean :1599
## 3rd Qu.:2398
## Max. :3197
## Beer Name (Full) Description ABV Min IBU
## Length:3197 Length:3197 Length:3197 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.:15.00
## Mode :character Mode :character Mode :character Median :20.00
## Mean :21.18
## 3rd Qu.:25.00
## Max. :65.00
## Max IBU Astringency Body Alcohol
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 25.00 1st Qu.: 9.00 1st Qu.: 29.00 1st Qu.: 6.00
## Median : 35.00 Median :14.00 Median : 40.00 Median : 11.00
## Mean : 38.99 Mean :16.52 Mean : 46.13 Mean : 17.06
## 3rd Qu.: 45.00 3rd Qu.:21.00 3rd Qu.: 58.00 3rd Qu.: 22.00
## Max. :100.00 Max. :81.00 Max. :175.00 Max. :139.00
## Bitter Sweet Sour Salty
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 17.00 1st Qu.: 33.00 1st Qu.: 11.00 1st Qu.: 0.000
## Median : 31.00 Median : 54.00 Median : 22.00 Median : 0.000
## Mean : 36.36 Mean : 58.27 Mean : 33.15 Mean : 1.017
## 3rd Qu.: 52.00 3rd Qu.: 77.00 3rd Qu.: 42.00 3rd Qu.: 1.000
## Max. :150.00 Max. :263.00 Max. :284.00 Max. :48.000
## Fruits Hoppy Spices Malty
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 12.00 1st Qu.: 18.00 1st Qu.: 4.00 1st Qu.: 45.00
## Median : 29.00 Median : 33.00 Median : 10.00 Median : 73.00
## Mean : 38.53 Mean : 40.92 Mean : 18.35 Mean : 75.33
## 3rd Qu.: 60.00 3rd Qu.: 56.00 3rd Qu.: 23.00 3rd Qu.:103.00
## Max. :175.00 Max. :172.00 Max. :184.00 Max. :239.00
## review score
## Length:3197 Min. :1.000
## Class :character 1st Qu.:1.000
## Mode :character Median :2.000
## Mean :2.033
## 3rd Qu.:3.000
## Max. :3.000
The data set contains some variables that are irrelevant for clustering (ID, Name, etc.), and the ABV variable has the wrong type (character instead of numeric); I will deal with both issues below. The data set contains no NAs, so I do not need to worry about imputation or any other way of handling missing values. Moreover, the beer set has 3197 rows, so it can be classified as a fairly large data set with potential for the CLARA algorithm. Looking at the summary statistics (maxima far above the third quartiles), I expect a lot of outliers.
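The absence of missing values can be verified with a quick check (given the statement above, the expected result is 0):
sum(is.na(beer)) #total number of missing values in the whole data set; 0 means none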
beer_proc_score <- beer[, -1:-6] #drop the first six columns (ID, Name, Style, Brewery, full name, description)
beer_proc_score$review <- floor(as.numeric(beer_proc_score$review)) #review: character -> rounded-down numeric
beer_proc_score$ABV <- as.numeric(beer_proc_score$ABV) #ABV: character -> numeric
beer_proc <- beer_proc_score[, !names(beer_proc_score) %in% c("review", "score")] #tasting-profile features only
#visualisation of outliers on boxplot
boxplot_beer_data <- boxplot(beer_proc,las=2)
As I supposed, every variable contains outliers. I will decide how to deal with them later.
correlationEstimates <- cor(beer_proc)
corrplot(correlationEstimates, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
correlationEstimates
## ABV Min IBU Max IBU Astringency Body
## ABV 1.00000000 0.43200478 0.50103728 -0.16952052 0.24167312
## Min IBU 0.43200478 1.00000000 0.85424806 -0.07150127 0.32533790
## Max IBU 0.50103728 0.85424806 1.00000000 -0.12027328 0.31061666
## Astringency -0.16952052 -0.07150127 -0.12027328 1.00000000 -0.05953971
## Body 0.24167312 0.32533790 0.31061666 -0.05953971 1.00000000
## Alcohol 0.65490813 0.32369408 0.39281140 -0.17198688 0.26888501
## Bitter 0.06738792 0.53945216 0.47808036 0.11468598 0.54223642
## Sweet 0.46348750 0.22713878 0.27729218 -0.02145640 0.45884180
## Sour 0.10079488 -0.07309796 -0.04327508 0.57102991 -0.12673331
## Salty -0.12008883 -0.05751177 -0.08321352 0.34715504 -0.09927735
## Fruits 0.29100109 0.06633532 0.17292851 0.34523213 -0.04815457
## Hoppy -0.05259620 0.40747532 0.34516752 0.33095085 0.07013823
## Spices 0.19146831 -0.04615179 0.04453346 -0.08379502 0.18512299
## Malty 0.16206018 0.30004067 0.28821916 -0.08208537 0.75422818
## Alcohol Bitter Sweet Sour Salty
## ABV 0.654908132 0.067387915 0.46348750 0.100794883 -0.120088829
## Min IBU 0.323694076 0.539452155 0.22713878 -0.073097956 -0.057511768
## Max IBU 0.392811402 0.478080363 0.27729218 -0.043275080 -0.083213518
## Astringency -0.171986878 0.114685977 -0.02145640 0.571029913 0.347155038
## Body 0.268885007 0.542236421 0.45884180 -0.126733314 -0.099277352
## Alcohol 1.000000000 0.009087782 0.52703889 0.048767388 -0.094329293
## Bitter 0.009087782 1.000000000 0.09170547 -0.136913688 0.004692825
## Sweet 0.527038889 0.091705467 1.00000000 0.257912561 -0.131917834
## Sour 0.048767388 -0.136913688 0.25791256 1.000000000 0.098172842
## Salty -0.094329293 0.004692825 -0.13191783 0.098172842 1.000000000
## Fruits 0.254299063 -0.093449864 0.48202994 0.785882542 0.026919585
## Hoppy -0.079949288 0.712886753 -0.03432745 0.068894607 0.172606178
## Spices 0.252875793 -0.084048103 0.10754762 0.001831036 -0.023078823
## Malty 0.270105608 0.565570029 0.47103197 -0.303266373 -0.028241289
## Fruits Hoppy Spices Malty
## ABV 0.29100109 -0.05259620 0.191468311 0.16206018
## Min IBU 0.06633532 0.40747532 -0.046151786 0.30004067
## Max IBU 0.17292851 0.34516752 0.044533456 0.28821916
## Astringency 0.34523213 0.33095085 -0.083795019 -0.08208537
## Body -0.04815457 0.07013823 0.185122992 0.75422818
## Alcohol 0.25429906 -0.07994929 0.252875793 0.27010561
## Bitter -0.09344986 0.71288675 -0.084048103 0.56557003
## Sweet 0.48202994 -0.03432745 0.107547623 0.47103197
## Sour 0.78588254 0.06889461 0.001831036 -0.30326637
## Salty 0.02691958 0.17260618 -0.023078823 -0.02824129
## Fruits 1.00000000 0.11040734 0.148281264 -0.19688969
## Hoppy 0.11040734 1.00000000 -0.131963707 0.19576698
## Spices 0.14828126 -0.13196371 1.000000000 0.06139856
## Malty -0.19688969 0.19576698 0.061398556 1.00000000
Some of the variables are highly positively correlated (Sour ~ Fruits, Min IBU ~ Max IBU), but no correlation is close to 1 (the highest is 0.85 for the two IBU variables), so I will not delete any variables from the data set.
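As a programmatic cross-check (a small sketch; the 0.7 cutoff is arbitrary, chosen only for illustration), the strongest pairs can be pulled straight out of the correlation matrix:
cors <- correlationEstimates
cors[lower.tri(cors, diag = TRUE)] <- NA #keep each pair only once
high <- which(abs(cors) > 0.7, arr.ind = TRUE) #pairs above the illustrative 0.7 threshold
data.frame(var1 = rownames(cors)[high[, "row"]],
           var2 = colnames(cors)[high[, "col"]],
           cor = cors[high])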
The summaries above showed that the variables are on different scales, so I will standardize them with the scale() function to obtain results that can be interpreted properly.
stand_beer_proc <- as.data.frame(lapply(beer_proc, scale))
boxplot_beer_data_stand <- boxplot(stand_beer_proc,las=2)
Now I am going to analyze the distributions using ventiles and percentiles, to decide how to deal with the outliers in the ABV and Salty variables visible on the boxplots.
#ventiles (every 5%) - to locate outlier cutoffs suggested by the boxplots
quantile(stand_beer_proc$ABV, probs = seq(0, 1, 1/20))
## 0% 5% 10% 15% 20% 25%
## -2.56250301 -0.99202610 -0.79571649 -0.67793072 -0.59940687 -0.59940687
## 30% 35% 40% 45% 50% 55%
## -0.56014495 -0.48162111 -0.40309726 -0.36383534 -0.20678765 -0.16752572
## 60% 65% 70% 75% 80% 85%
## -0.01047803 0.14656966 0.26435543 0.42140312 0.57845081 0.93180811
## 90% 95% 100%
## 1.28516542 1.75630849 20.01310253
quantile(stand_beer_proc$Salty, probs = seq(0, 1, 1/20))
## 0% 5% 10% 15% 20% 25%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 30% 35% 40% 45% 50% 55%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 60% 65% 70% 75% 80% 85%
## -0.008066779 -0.008066779 -0.008066779 -0.008066779 0.460833104 0.460833104
## 90% 95% 100%
## 0.929732987 1.398632870 22.030227727
quantile(stand_beer_proc$ABV, probs = seq(0, 1, 1/100)) # Percentiles
## 0% 1% 2% 3% 4%
## -2.5625030075 -2.3661933941 -1.7772645539 -1.1490737910 -0.9920261003
## 5% 6% 7% 8% 9%
## -0.9920261003 -0.9135022549 -0.9135022549 -0.8349784096 -0.7957164869
## 10% 11% 12% 13% 14%
## -0.7957164869 -0.7957164869 -0.7564545642 -0.7171926415 -0.7171926415
## 15% 16% 17% 18% 19%
## -0.6779307189 -0.6779307189 -0.6779307189 -0.6386687962 -0.6386687962
## 20% 21% 22% 23% 24%
## -0.5994068735 -0.5994068735 -0.5994068735 -0.5994068735 -0.5994068735
## 25% 26% 27% 28% 29%
## -0.5994068735 -0.5994068735 -0.5994068735 -0.5994068735 -0.5601449508
## 30% 31% 32% 33% 34%
## -0.5601449508 -0.5208830281 -0.5208830281 -0.5208830281 -0.4816211055
## 35% 36% 37% 38% 39%
## -0.4816211055 -0.4816211055 -0.4423591828 -0.4423591828 -0.4030972601
## 40% 41% 42% 43% 44%
## -0.4030972601 -0.4030972601 -0.4030972601 -0.4030972601 -0.3638353374
## 45% 46% 47% 48% 49%
## -0.3638353374 -0.3245734147 -0.2853114921 -0.2853114921 -0.2460495694
## 50% 51% 52% 53% 54%
## -0.2067876467 -0.2067876467 -0.2067876467 -0.2067876467 -0.2067876467
## 55% 56% 57% 58% 59%
## -0.1675257240 -0.1282638013 -0.0890018787 -0.0497399560 -0.0104780333
## 60% 61% 62% 63% 64%
## -0.0104780333 -0.0104780333 -0.0002699334 0.0680458121 0.1073077348
## 65% 66% 67% 68% 69%
## 0.1465696574 0.1858315801 0.1858315801 0.1858315801 0.1858315801
## 70% 71% 72% 73% 74%
## 0.2643554255 0.2839863868 0.3821411935 0.3821411935 0.3821411935
## 75% 76% 77% 78% 79%
## 0.4214031162 0.4999269616 0.5784508069 0.5784508069 0.5784508069
## 80% 81% 82% 83% 84%
## 0.5784508069 0.6177127296 0.7245051593 0.7747604203 0.7747604203
## 85% 86% 87% 88% 89%
## 0.9318081110 0.9710700337 0.9710700337 1.0888558018 1.1673796471
## 90% 91% 92% 93% 94%
## 1.2851654152 1.3636892605 1.3636892605 1.4422131059 1.5599988739
## 95% 96% 97% 98% 99%
## 1.7563084873 1.9526181007 2.1489277141 2.1552096218 2.7001651086
## 100%
## 20.0131025337
quantile(stand_beer_proc$Salty, probs = seq(0, 1, 1/100)) # Percentiles
## 0% 1% 2% 3% 4% 5%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 6% 7% 8% 9% 10% 11%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 12% 13% 14% 15% 16% 17%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 18% 19% 20% 21% 22% 23%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 24% 25% 26% 27% 28% 29%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 30% 31% 32% 33% 34% 35%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 36% 37% 38% 39% 40% 41%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 42% 43% 44% 45% 46% 47%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 48% 49% 50% 51% 52% 53%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 54% 55% 56% 57% 58% 59%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 60% 61% 62% 63% 64% 65%
## -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779
## 66% 67% 68% 69% 70% 71%
## -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779
## 72% 73% 74% 75% 76% 77%
## -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779
## 78% 79% 80% 81% 82% 83%
## 0.460833104 0.460833104 0.460833104 0.460833104 0.460833104 0.460833104
## 84% 85% 86% 87% 88% 89%
## 0.460833104 0.460833104 0.460833104 0.929732987 0.929732987 0.929732987
## 90% 91% 92% 93% 94% 95%
## 0.929732987 0.929732987 1.079780949 1.398632870 1.398632870 1.398632870
## 96% 97% 98% 99% 100%
## 1.867532753 1.867532753 2.336432636 3.274232402 22.030227727
quantile(stand_beer_proc$Alcohol, probs = seq(0, 1, 1/100)) # Percentiles of a variable with far milder outliers, for comparison
## 0% 1% 2% 3% 4% 5%
## -0.984112918 -0.984112918 -0.926413956 -0.926413956 -0.868714994 -0.868714994
## 6% 7% 8% 9% 10% 11%
## -0.868714994 -0.868714994 -0.811016032 -0.811016032 -0.811016032 -0.753317070
## 12% 13% 14% 15% 16% 17%
## -0.753317070 -0.753317070 -0.753317070 -0.753317070 -0.695618108 -0.695618108
## 18% 19% 20% 21% 22% 23%
## -0.695618108 -0.695618108 -0.695618108 -0.695618108 -0.637919146 -0.637919146
## 24% 25% 26% 27% 28% 29%
## -0.637919146 -0.637919146 -0.637919146 -0.637919146 -0.580220184 -0.580220184
## 30% 31% 32% 33% 34% 35%
## -0.580220184 -0.580220184 -0.580220184 -0.580220184 -0.522521222 -0.522521222
## 36% 37% 38% 39% 40% 41%
## -0.522521222 -0.522521222 -0.522521222 -0.522521222 -0.464822260 -0.464822260
## 42% 43% 44% 45% 46% 47%
## -0.464822260 -0.464822260 -0.464822260 -0.407123298 -0.407123298 -0.407123298
## 48% 49% 50% 51% 52% 53%
## -0.407123298 -0.349424336 -0.349424336 -0.349424336 -0.349424336 -0.291725374
## 54% 55% 56% 57% 58% 59%
## -0.291725374 -0.291725374 -0.234026412 -0.234026412 -0.234026412 -0.176327450
## 60% 61% 62% 63% 64% 65%
## -0.176327450 -0.176327450 -0.118628488 -0.118628488 -0.060929526 -0.060929526
## 66% 67% 68% 69% 70% 71%
## -0.060929526 -0.003230564 0.054468398 0.054468398 0.112167360 0.112167360
## 72% 73% 74% 75% 76% 77%
## 0.169866322 0.227565284 0.227565284 0.285264245 0.342963207 0.342963207
## 78% 79% 80% 81% 82% 83%
## 0.400662169 0.458361131 0.504520301 0.573759055 0.573759055 0.631458017
## 84% 85% 86% 87% 88% 89%
## 0.746855941 0.804554903 0.862253865 0.977651789 1.093049713 1.208447637
## 90% 91% 92% 93% 94% 95%
## 1.266146599 1.381544523 1.554641409 1.727738295 1.843136218 2.073932066
## 96% 97% 98% 99% 100%
## 2.371658710 2.650921686 3.174828261 3.864907846 7.036042795
Based on the jump between the 99th and 100th percentiles, I have decided to cut the most extreme Salty and ABV outliers.
beer_proc_score_clean <- beer_proc_score[beer_proc_score$ABV < 24 & beer_proc_score$Salty < 24, ] #cut extreme ABV/Salty values on the raw scale
stand_beer_proc_clean <- stand_beer_proc[stand_beer_proc$ABV < 8 & stand_beer_proc$Salty < 8, ] #the same cut on the standardized scale; the threshold 8 lies between the 99th and 100th percentiles shown above
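A quick check of how aggressive this cut is:
nrow(stand_beer_proc) - nrow(stand_beer_proc_clean) #number of beers dropped as extreme outliers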
In this part of the project I check the clusterability of the data set using the Hopkins statistic, and then try to determine the optimal number of clusters.
hop_rez <- hopkins(stand_beer_proc_clean)
hop_rez
## [1] 1
The result is 1, which indicates highly clusterable data (for this statistic, values near 0.5 would correspond to structureless, uniformly distributed data).
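As a sanity check (a minimal sketch, assuming the same hopkins package that produced the result above), the statistic computed on uniformly random data of the same shape should land near 0.5:
set.seed(123) #reproducible reference sample
unif_ref <- as.data.frame(matrix(runif(prod(dim(stand_beer_proc_clean))),
                                 nrow = nrow(stand_beer_proc_clean)))
hopkins(unif_ref) #structureless data is expected to score close to 0.5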
Now I am going to find the optimal number of clusters for k-means using three criteria: variance explained, silhouette width and the Akaike information criterion.
optK<-Optimal_Clusters_KMeans(stand_beer_proc_clean, max_clusters=10, plot_clusters = TRUE)
optK_s<-Optimal_Clusters_KMeans(stand_beer_proc_clean, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
optK_a<-Optimal_Clusters_KMeans(stand_beer_proc_clean, max_clusters=10, plot_clusters=TRUE, criterion="AIC")
As the variance-explained chart shows, the "elbow" method does not make the optimal number of clusters obvious: it could be 2 or 3. The silhouette statistic is not clear-cut either; its largest value occurs at 8 clusters, but the differences between the values are small, so 2 or 3 clusters remain defensible choices. The AIC criterion points to results similar to the elbow method.
Since CLARA is an extension of PAM, the code below finds the optimal number of clusters for both the PAM and CLARA methods, using the total within-cluster sum of squares and the silhouette criterion.
pam <- fviz_nbclust(stand_beer_proc_clean, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()
clara <- fviz_nbclust(stand_beer_proc_clean, FUNcluster = cluster::clara, method = "silhouette") + theme_classic()
pam_wss <- fviz_nbclust(stand_beer_proc_clean, FUNcluster = cluster::pam, method = "wss") + theme_classic()
clara_wss <- fviz_nbclust(stand_beer_proc_clean, FUNcluster = cluster::clara, method = "wss") + theme_classic()
grid.arrange(pam,clara,pam_wss,clara_wss, ncol=2)
According to the silhouette statistic, the optimal number of clusters for PAM could be 6, but 4 scores almost as well (the difference is not significant). The total within-cluster sum of squares for PAM also points to about 4. For CLARA, the analogous criteria suggest 8 (though 4-5 are almost as good) by silhouette and 3-5 by the within-cluster sum of squares.
K-means is a classical method for clustering and vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to the cluster with the nearest center.
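To make the center/assignment idea concrete, here is a minimal sketch using base R's kmeans() (the k = 3 simply mirrors the eclust call below; nstart = 25 is an illustrative choice of random restarts):
km_sketch <- kmeans(stand_beer_proc_clean, centers = 3, nstart = 25) #25 random initializations, best kept
head(km_sketch$cluster) #nearest-center assignment for the first beers
km_sketch$centers[, 1:4] #first few coordinates of the three cluster centers (prototypes)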
kmeans_method <- eclust(stand_beer_proc_clean, k=3, FUNcluster="kmeans", hc_metric="minkowski", graph=TRUE)
x_chart <- fviz_silhouette(kmeans_method)
## cluster size ave.sil.width
## 1 1 637 0.10
## 2 2 1334 0.16
## 3 3 1219 0.19
y_chart <- fviz_cluster(kmeans_method, data = stand_beer_proc_clean, ellipse.type = "convex", graph = TRUE)
grid.arrange(x_chart, y_chart, ncol=2)
I have chosen 3 clusters for the k-means method. The results are not the best: the clusters overlap and the average silhouette width is poor. The most objects are assigned to cluster 2, but the highest average silhouette width belongs to cluster 3. I have experimented with different distance measures ("minkowski", "euclidean", "canberra"), but they do not change anything significantly.
I will now compare the cluster assignments with the beers' score from the original data set.
table(kmeans_method$cluster,beer_proc_score_clean$score)
##
## 1 2 3
## 1 86 289 262
## 2 565 458 311
## 3 220 598 401
As we can see, the agreement is poor: every cluster mixes beers from all three score levels. The beers evidently do not cluster in terms of ratings.
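The lack of agreement can also be quantified with the adjusted Rand index (a sketch; mclust is an extra dependency used only for this check):
library(mclust) #provides adjustedRandIndex()
adjustedRandIndex(kmeans_method$cluster, beer_proc_score_clean$score) #values near 0 mean no agreement beyond chance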
Then I will check k-means with 2 clusters.
kmeans_method <- eclust(stand_beer_proc_clean, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=TRUE)
x_chart <- fviz_silhouette(kmeans_method)
## cluster size ave.sil.width
## 1 1 1255 0.11
## 2 2 1935 0.24
y_chart <- fviz_cluster(kmeans_method, data = stand_beer_proc_clean, ellipse.type = "convex", graph = TRUE)
grid.arrange(x_chart, y_chart, ncol=2)
The average silhouette width is 0.19 and, unlike before, large parts of the two clusters do not overlap. Maybe the beers are simply split into light and dark?
The PAM algorithm searches for k representative objects in a data set (k medoids) and then assigns each object to the closest medoid in order to create clusters. Its aim is to minimize the sum of dissimilarities between the objects in a cluster and that cluster's center (the medoid). It is known as a robust version of k-means, since medoids are less sensitive to outliers.
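The medoid idea can be seen directly with cluster::pam() (a minimal sketch; k = 6 matches the choice below). Unlike k-means centers, medoids are actual beers from the data set:
pam_sketch <- cluster::pam(stand_beer_proc_clean, k = 6)
pam_sketch$id.med #row indices of the six medoid beers
round(stand_beer_proc_clean[pam_sketch$id.med, 1:4], 2) #their standardized profiles (first few features)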
pam_method <- eclust(stand_beer_proc_clean, k=6, FUNcluster="pam", hc_metric="pearson", graph=TRUE)
x_chart <- fviz_silhouette(pam_method)
## cluster size ave.sil.width
## 1 1 805 0.26
## 2 2 719 0.07
## 3 3 187 0.30
## 4 4 514 0.13
## 5 5 409 0.09
## 6 6 556 0.26
y_chart <- fviz_cluster(pam_method, data = stand_beer_proc_clean, ellipse.type = "convex")
grid.arrange(x_chart, y_chart, ncol=2)
The data clustered with the PAM algorithm also overlap. I decided to use 6 clusters, since this gives better results than 4. Judging by the per-cluster average silhouette widths, the second and fifth clusters are the worst. Cluster 3 is the best, but it also contains fewer objects than the others.
The CLARA algorithm is designed specifically for clustering large data sets. It repeatedly draws samples from the data, applies PAM (a k-medoids solver) to each sample, and then assigns the remaining observations to the medoids obtained from the best sample. My data set contains more than 3000 observations, so CLARA is worth trying.
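The sampling mechanism is visible directly in cluster::clara() (a sketch; the samples and sampsize values are illustrative, not tuned):
clara_sketch <- cluster::clara(stand_beer_proc_clean, k = 6,
                               samples = 50, sampsize = 200) #PAM run on 50 subsamples of 200 beers each
head(clara_sketch$sample) #row indices from the best-scoring subsample
clara_sketch$i.med #the medoids found on that subsample (row indices)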
clara_method <- eclust(stand_beer_proc_clean, k=6, FUNcluster="clara", hc_metric="pearson", graph=TRUE)
x_chart <- fviz_silhouette(clara_method)
## cluster size ave.sil.width
## 1 1 605 0.33
## 2 2 624 0.06
## 3 3 422 0.08
## 4 4 347 0.18
## 5 5 368 0.09
## 6 6 824 0.21
y_chart <- fviz_cluster(clara_method, data = stand_beer_proc_clean, ellipse.type = "convex")
grid.arrange(x_chart, y_chart, ncol=2)
Here, too, I decided to go with 6 clusters. The average silhouette width is 0.17, but it differs considerably across clusters. The first and last clusters look better: they are more compact. The objects in the remaining clusters overlap very strongly.
The beer data set turns out not to be easy to cluster, and the Hopkins statistic was not particularly helpful either. The best algorithm in terms of average silhouette width was k-means with only 2 clusters; perhaps dividing the beers into light and dark really is the most natural split. The second best was the PAM algorithm. Manipulating the distance metrics did not change the outputs of the methods at all. Even the CLARA algorithm was not the best option, which may suggest that the data set is still small enough for PAM to perform better.