Clustering can be considered the most important unsupervised learning problem; like every problem of this kind, it deals with finding structure in a collection of unlabeled data. A cluster is therefore a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
In this project I implement clustering methods that group the records of a chosen data set as well as possible. My data set contains information about different beers. I am going to use clustering algorithms to group the beers and compare the resulting clusters with the groupings already present in the data set.
This data set contains tasting profiles and consumer reviews for 3197 unique beers from 934 different breweries. It is provided on the Kaggle website: https://www.kaggle.com/datasets/ruthgn/beer-profile-and-ratings-data-set.
| Id | Name | Description |
|---|---|---|
| 1 | ABV | Alcohol content of beer |
| 2 | Min IBU | The minimum IBU value beer can possess |
| 3 | Max IBU | The maximum IBU value beer can possess |
| 4 | Astringency | Mouthfeel; this and the following columns describe the beer's tasting profile |
| 5 | Body | Mouthfeel |
| 6 | Alcohol | Mouthfeel |
| 7 | Bitter | Taste |
| 8 | Sweet | Taste |
| 9 | Sour | Taste |
| 10 | Salty | Taste |
| 11 | Fruits | Flavor and aroma |
| 12 | Hoppy | Flavor and aroma |
| 13 | Spices | Flavor and aroma |
| 14 | Malty | Flavor and aroma |
#load required libraries
library(readxl)     #read_excel()
library(corrplot)   #correlation plot
library(hopkins)    #hopkins() clusterability statistic
library(ClusterR)   #Optimal_Clusters_KMeans()
library(cluster)    #pam(), clara()
library(factoextra) #eclust(), fviz_nbclust(), fviz_silhouette(), fviz_cluster()
library(gridExtra)  #grid.arrange()
#load data
beer <- read_excel("C:/Users/skarbowiak/Desktop/Studia II UW/Semestr 1 DSaBA/Unsupervised learning 2022Z/Clustering projects/datasets/beer_data.xlsx")
#check dataset first rows
head(beer)
## # A tibble: 6 × 22
## ID Name Style Brewery Beer …¹ Descr…² ABV Min I…³ Max I…⁴ Astri…⁵ Body
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Lambi… Lamb… Brouwe… Brouwe… "Notes… 5 0 10 18 10
## 2 2 Sang … Wild… Cascad… Cascad… "Notes… 9.3 5 30 16 30
## 3 3 Helio… Happ… Helios… Helios… "Notes… 5 0 0 6 7
## 4 4 De Tr… Lamb… Brouwe… Brouwe… "Notes… 5.5 0 10 51 23
## 5 5 Framb… Lamb… Brouwe… Brouwe… "Notes… 5 15 21 26 20
## 6 6 Rogge… Rye … Beaver… Beaver… "Notes… 4.6 10 20 8 23
## # … with 11 more variables: Alcohol <dbl>, Bitter <dbl>, Sweet <dbl>,
## # Sour <dbl>, Salty <dbl>, Fruits <dbl>, Hoppy <dbl>, Spices <dbl>,
## # Malty <dbl>, review <chr>, score <dbl>, and abbreviated variable names
## # ¹`Beer Name (Full)`, ²Description, ³`Min IBU`, ⁴`Max IBU`, ⁵Astringency
#check details about dataset
str(beer)
## tibble [3,197 × 22] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:3197] 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr [1:3197] "Lambik (2 Year Old Unblended)" "Sang Noir" "Helios Goya Dry" "De Troch Oude Gueuze" ...
## $ Style : chr [1:3197] "Lambic - Traditional" "Wild Ale" "Happoshu" "Lambic - Gueuze" ...
## $ Brewery : chr [1:3197] "Brouwerij Girardin" "Cascade Brewing / Raccoon Lodge & Brewpub" "Helios Distillery Co., Ltd." "Brouwerij De Troch" ...
## $ Beer Name (Full): chr [1:3197] "Brouwerij Girardin Lambik (2 Year Old Unblended)" "Cascade Brewing / Raccoon Lodge & Brewpub Cascade Sang Noir" "Helios Distillery Co., Ltd. Helios Goya Dry" "Brouwerij De Troch De Troch Gueuze" ...
## $ Description : chr [1:3197] "Notes:" "Notes:Sang Noir is a blend of imperial spiced red ales aged in bourbon and wine barrels with Bing cherries for "| __truncated__ "Notes:" "Notes:" ...
## $ ABV : chr [1:3197] "5" "9.3" "5" "5.5" ...
## $ Min IBU : num [1:3197] 0 5 0 0 15 10 20 15 50 25 ...
## $ Max IBU : num [1:3197] 10 30 0 10 21 20 40 21 70 50 ...
## $ Astringency : num [1:3197] 18 16 6 51 26 8 35 30 16 5 ...
## $ Body : num [1:3197] 10 30 7 23 20 23 39 25 42 44 ...
## $ Alcohol : num [1:3197] 2 38 3 5 4 7 5 6 18 49 ...
## $ Bitter : num [1:3197] 12 1 13 14 4 2 53 5 91 14 ...
## $ Sweet : num [1:3197] 21 47 8 49 97 37 37 99 40 86 ...
## $ Sour : num [1:3197] 48 98 6 200 95 39 38 119 55 21 ...
## $ Salty : num [1:3197] 0 0 0 7 0 0 1 0 0 0 ...
## $ Fruits : num [1:3197] 24 72 6 96 94 37 38 113 76 53 ...
## $ Hoppy : num [1:3197] 15 3 14 8 8 4 77 12 117 5 ...
## $ Spices : num [1:3197] 2 19 5 14 0 16 20 0 2 12 ...
## $ Malty : num [1:3197] 9 47 7 30 8 51 63 8 30 57 ...
## $ review : chr [1:3197] "5" "4.807692" "4.75" "4.75" ...
## $ score : num [1:3197] 3 3 3 3 3 3 3 3 3 3 ...
summary(beer)
## ID Name Style Brewery
## Min. : 1 Length:3197 Length:3197 Length:3197
## 1st Qu.: 800 Class :character Class :character Class :character
## Median :1599 Mode :character Mode :character Mode :character
## Mean :1599
## 3rd Qu.:2398
## Max. :3197
## Beer Name (Full) Description ABV Min IBU
## Length:3197 Length:3197 Length:3197 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.:15.00
## Mode :character Mode :character Mode :character Median :20.00
## Mean :21.18
## 3rd Qu.:25.00
## Max. :65.00
## Max IBU Astringency Body Alcohol
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 25.00 1st Qu.: 9.00 1st Qu.: 29.00 1st Qu.: 6.00
## Median : 35.00 Median :14.00 Median : 40.00 Median : 11.00
## Mean : 38.99 Mean :16.52 Mean : 46.13 Mean : 17.06
## 3rd Qu.: 45.00 3rd Qu.:21.00 3rd Qu.: 58.00 3rd Qu.: 22.00
## Max. :100.00 Max. :81.00 Max. :175.00 Max. :139.00
## Bitter Sweet Sour Salty
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 17.00 1st Qu.: 33.00 1st Qu.: 11.00 1st Qu.: 0.000
## Median : 31.00 Median : 54.00 Median : 22.00 Median : 0.000
## Mean : 36.36 Mean : 58.27 Mean : 33.15 Mean : 1.017
## 3rd Qu.: 52.00 3rd Qu.: 77.00 3rd Qu.: 42.00 3rd Qu.: 1.000
## Max. :150.00 Max. :263.00 Max. :284.00 Max. :48.000
## Fruits Hoppy Spices Malty
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 12.00 1st Qu.: 18.00 1st Qu.: 4.00 1st Qu.: 45.00
## Median : 29.00 Median : 33.00 Median : 10.00 Median : 73.00
## Mean : 38.53 Mean : 40.92 Mean : 18.35 Mean : 75.33
## 3rd Qu.: 60.00 3rd Qu.: 56.00 3rd Qu.: 23.00 3rd Qu.:103.00
## Max. :175.00 Max. :172.00 Max. :184.00 Max. :239.00
## review score
## Length:3197 Min. :1.000
## Class :character 1st Qu.:1.000
## Mode :character Median :2.000
## Mean :2.033
## 3rd Qu.:3.000
## Max. :3.000
The data set contains some variables that are irrelevant for clustering (ID, Name, etc.), and the ABV variable has the wrong type (character instead of numeric); I will deal with both issues below. The data set contains no NAs, so I do not need to worry about imputation or any other way of handling missing values. Moreover, the beer set has 3197 rows, so it can be classified as a fairly large data set with potential for the CLARA algorithm. Looking at the summary statistics (maxima far above the third quartiles), I expect a lot of outliers.
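The absence of missing values can be verified with a quick check (given the statement above, the expected result is 0):
sum(is.na(beer)) #total number of missing values in the whole data set; 0 means none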
beer_proc_score <- beer[, -1:-6] #drop the first six columns (ID, Name, Style, Brewery, full name, description)
beer_proc_score$review <- floor(as.numeric(beer_proc_score$review)) #review: character -> rounded-down numeric
beer_proc_score$ABV <- as.numeric(beer_proc_score$ABV) #ABV: character -> numeric
beer_proc <- beer_proc_score[, !names(beer_proc_score) %in% c("review", "score")] #tasting-profile features only
#visualisation of outliers on boxplot
boxplot_beer_data <- boxplot(beer_proc,las=2)
As I supposed, every variable contains outliers. I will decide how to deal with them later.
correlationEstimates <- cor(beer_proc)
corrplot(correlationEstimates, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
correlationEstimates
## ABV Min IBU Max IBU Astringency Body
## ABV 1.00000000 0.43200478 0.50103728 -0.16952052 0.24167312
## Min IBU 0.43200478 1.00000000 0.85424806 -0.07150127 0.32533790
## Max IBU 0.50103728 0.85424806 1.00000000 -0.12027328 0.31061666
## Astringency -0.16952052 -0.07150127 -0.12027328 1.00000000 -0.05953971
## Body 0.24167312 0.32533790 0.31061666 -0.05953971 1.00000000
## Alcohol 0.65490813 0.32369408 0.39281140 -0.17198688 0.26888501
## Bitter 0.06738792 0.53945216 0.47808036 0.11468598 0.54223642
## Sweet 0.46348750 0.22713878 0.27729218 -0.02145640 0.45884180
## Sour 0.10079488 -0.07309796 -0.04327508 0.57102991 -0.12673331
## Salty -0.12008883 -0.05751177 -0.08321352 0.34715504 -0.09927735
## Fruits 0.29100109 0.06633532 0.17292851 0.34523213 -0.04815457
## Hoppy -0.05259620 0.40747532 0.34516752 0.33095085 0.07013823
## Spices 0.19146831 -0.04615179 0.04453346 -0.08379502 0.18512299
## Malty 0.16206018 0.30004067 0.28821916 -0.08208537 0.75422818
## Alcohol Bitter Sweet Sour Salty
## ABV 0.654908132 0.067387915 0.46348750 0.100794883 -0.120088829
## Min IBU 0.323694076 0.539452155 0.22713878 -0.073097956 -0.057511768
## Max IBU 0.392811402 0.478080363 0.27729218 -0.043275080 -0.083213518
## Astringency -0.171986878 0.114685977 -0.02145640 0.571029913 0.347155038
## Body 0.268885007 0.542236421 0.45884180 -0.126733314 -0.099277352
## Alcohol 1.000000000 0.009087782 0.52703889 0.048767388 -0.094329293
## Bitter 0.009087782 1.000000000 0.09170547 -0.136913688 0.004692825
## Sweet 0.527038889 0.091705467 1.00000000 0.257912561 -0.131917834
## Sour 0.048767388 -0.136913688 0.25791256 1.000000000 0.098172842
## Salty -0.094329293 0.004692825 -0.13191783 0.098172842 1.000000000
## Fruits 0.254299063 -0.093449864 0.48202994 0.785882542 0.026919585
## Hoppy -0.079949288 0.712886753 -0.03432745 0.068894607 0.172606178
## Spices 0.252875793 -0.084048103 0.10754762 0.001831036 -0.023078823
## Malty 0.270105608 0.565570029 0.47103197 -0.303266373 -0.028241289
## Fruits Hoppy Spices Malty
## ABV 0.29100109 -0.05259620 0.191468311 0.16206018
## Min IBU 0.06633532 0.40747532 -0.046151786 0.30004067
## Max IBU 0.17292851 0.34516752 0.044533456 0.28821916
## Astringency 0.34523213 0.33095085 -0.083795019 -0.08208537
## Body -0.04815457 0.07013823 0.185122992 0.75422818
## Alcohol 0.25429906 -0.07994929 0.252875793 0.27010561
## Bitter -0.09344986 0.71288675 -0.084048103 0.56557003
## Sweet 0.48202994 -0.03432745 0.107547623 0.47103197
## Sour 0.78588254 0.06889461 0.001831036 -0.30326637
## Salty 0.02691958 0.17260618 -0.023078823 -0.02824129
## Fruits 1.00000000 0.11040734 0.148281264 -0.19688969
## Hoppy 0.11040734 1.00000000 -0.131963707 0.19576698
## Spices 0.14828126 -0.13196371 1.000000000 0.06139856
## Malty -0.19688969 0.19576698 0.061398556 1.00000000
Some of the variables are highly positively correlated (Sour ~ Fruits, Min IBU ~ Max IBU), but no correlation is close to 1 (the highest is 0.85 for the two IBU variables), so I will not delete any variables from the data set.
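As a programmatic cross-check (a small sketch; the 0.7 cutoff is arbitrary, chosen only for illustration), the strongest pairs can be pulled straight out of the correlation matrix:
cors <- correlationEstimates
cors[lower.tri(cors, diag = TRUE)] <- NA #keep each pair only once
high <- which(abs(cors) > 0.7, arr.ind = TRUE) #pairs above the illustrative 0.7 threshold
data.frame(var1 = rownames(cors)[high[, "row"]],
           var2 = colnames(cors)[high[, "col"]],
           cor = cors[high])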
The summaries above showed that the variables are on different scales, so I will standardize them with the scale() function to obtain results that can be interpreted properly.
stand_beer_proc <- as.data.frame(lapply(beer_proc, scale))
boxplot_beer_data_stand <- boxplot(stand_beer_proc,las=2)
Now I am going to analyze the distributions using ventiles and percentiles, to decide how to deal with the outliers in the ABV and Salty variables visible on the boxplots.
#ventiles (every 5%) - to locate outlier cutoffs suggested by the boxplots
quantile(stand_beer_proc$ABV, probs = seq(0, 1, 1/20))
## 0% 5% 10% 15% 20% 25%
## -2.56250301 -0.99202610 -0.79571649 -0.67793072 -0.59940687 -0.59940687
## 30% 35% 40% 45% 50% 55%
## -0.56014495 -0.48162111 -0.40309726 -0.36383534 -0.20678765 -0.16752572
## 60% 65% 70% 75% 80% 85%
## -0.01047803 0.14656966 0.26435543 0.42140312 0.57845081 0.93180811
## 90% 95% 100%
## 1.28516542 1.75630849 20.01310253
quantile(stand_beer_proc$Salty, probs = seq(0, 1, 1/20))
## 0% 5% 10% 15% 20% 25%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 30% 35% 40% 45% 50% 55%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 60% 65% 70% 75% 80% 85%
## -0.008066779 -0.008066779 -0.008066779 -0.008066779 0.460833104 0.460833104
## 90% 95% 100%
## 0.929732987 1.398632870 22.030227727
quantile(stand_beer_proc$ABV, probs = seq(0, 1, 1/100)) # Percentiles
## 0% 1% 2% 3% 4%
## -2.5625030075 -2.3661933941 -1.7772645539 -1.1490737910 -0.9920261003
## 5% 6% 7% 8% 9%
## -0.9920261003 -0.9135022549 -0.9135022549 -0.8349784096 -0.7957164869
## 10% 11% 12% 13% 14%
## -0.7957164869 -0.7957164869 -0.7564545642 -0.7171926415 -0.7171926415
## 15% 16% 17% 18% 19%
## -0.6779307189 -0.6779307189 -0.6779307189 -0.6386687962 -0.6386687962
## 20% 21% 22% 23% 24%
## -0.5994068735 -0.5994068735 -0.5994068735 -0.5994068735 -0.5994068735
## 25% 26% 27% 28% 29%
## -0.5994068735 -0.5994068735 -0.5994068735 -0.5994068735 -0.5601449508
## 30% 31% 32% 33% 34%
## -0.5601449508 -0.5208830281 -0.5208830281 -0.5208830281 -0.4816211055
## 35% 36% 37% 38% 39%
## -0.4816211055 -0.4816211055 -0.4423591828 -0.4423591828 -0.4030972601
## 40% 41% 42% 43% 44%
## -0.4030972601 -0.4030972601 -0.4030972601 -0.4030972601 -0.3638353374
## 45% 46% 47% 48% 49%
## -0.3638353374 -0.3245734147 -0.2853114921 -0.2853114921 -0.2460495694
## 50% 51% 52% 53% 54%
## -0.2067876467 -0.2067876467 -0.2067876467 -0.2067876467 -0.2067876467
## 55% 56% 57% 58% 59%
## -0.1675257240 -0.1282638013 -0.0890018787 -0.0497399560 -0.0104780333
## 60% 61% 62% 63% 64%
## -0.0104780333 -0.0104780333 -0.0002699334 0.0680458121 0.1073077348
## 65% 66% 67% 68% 69%
## 0.1465696574 0.1858315801 0.1858315801 0.1858315801 0.1858315801
## 70% 71% 72% 73% 74%
## 0.2643554255 0.2839863868 0.3821411935 0.3821411935 0.3821411935
## 75% 76% 77% 78% 79%
## 0.4214031162 0.4999269616 0.5784508069 0.5784508069 0.5784508069
## 80% 81% 82% 83% 84%
## 0.5784508069 0.6177127296 0.7245051593 0.7747604203 0.7747604203
## 85% 86% 87% 88% 89%
## 0.9318081110 0.9710700337 0.9710700337 1.0888558018 1.1673796471
## 90% 91% 92% 93% 94%
## 1.2851654152 1.3636892605 1.3636892605 1.4422131059 1.5599988739
## 95% 96% 97% 98% 99%
## 1.7563084873 1.9526181007 2.1489277141 2.1552096218 2.7001651086
## 100%
## 20.0131025337
quantile(stand_beer_proc$Salty, probs = seq(0, 1, 1/100)) # Percentiles
## 0% 1% 2% 3% 4% 5%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 6% 7% 8% 9% 10% 11%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 12% 13% 14% 15% 16% 17%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 18% 19% 20% 21% 22% 23%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 24% 25% 26% 27% 28% 29%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 30% 31% 32% 33% 34% 35%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 36% 37% 38% 39% 40% 41%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 42% 43% 44% 45% 46% 47%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 48% 49% 50% 51% 52% 53%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 54% 55% 56% 57% 58% 59%
## -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662 -0.476966662
## 60% 61% 62% 63% 64% 65%
## -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779
## 66% 67% 68% 69% 70% 71%
## -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779
## 72% 73% 74% 75% 76% 77%
## -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779 -0.008066779
## 78% 79% 80% 81% 82% 83%
## 0.460833104 0.460833104 0.460833104 0.460833104 0.460833104 0.460833104
## 84% 85% 86% 87% 88% 89%
## 0.460833104 0.460833104 0.460833104 0.929732987 0.929732987 0.929732987
## 90% 91% 92% 93% 94% 95%
## 0.929732987 0.929732987 1.079780949 1.398632870 1.398632870 1.398632870
## 96% 97% 98% 99% 100%
## 1.867532753 1.867532753 2.336432636 3.274232402 22.030227727
quantile(stand_beer_proc$Alcohol, probs = seq(0, 1, 1/100)) # Percentiles of a variable with far milder outliers, for comparison
## 0% 1% 2% 3% 4% 5%
## -0.984112918 -0.984112918 -0.926413956 -0.926413956 -0.868714994 -0.868714994
## 6% 7% 8% 9% 10% 11%
## -0.868714994 -0.868714994 -0.811016032 -0.811016032 -0.811016032 -0.753317070
## 12% 13% 14% 15% 16% 17%
## -0.753317070 -0.753317070 -0.753317070 -0.753317070 -0.695618108 -0.695618108
## 18% 19% 20% 21% 22% 23%
## -0.695618108 -0.695618108 -0.695618108 -0.695618108 -0.637919146 -0.637919146
## 24% 25% 26% 27% 28% 29%
## -0.637919146 -0.637919146 -0.637919146 -0.637919146 -0.580220184 -0.580220184
## 30% 31% 32% 33% 34% 35%
## -0.580220184 -0.580220184 -0.580220184 -0.580220184 -0.522521222 -0.522521222
## 36% 37% 38% 39% 40% 41%
## -0.522521222 -0.522521222 -0.522521222 -0.522521222 -0.464822260 -0.464822260
## 42% 43% 44% 45% 46% 47%
## -0.464822260 -0.464822260 -0.464822260 -0.407123298 -0.407123298 -0.407123298
## 48% 49% 50% 51% 52% 53%
## -0.407123298 -0.349424336 -0.349424336 -0.349424336 -0.349424336 -0.291725374
## 54% 55% 56% 57% 58% 59%
## -0.291725374 -0.291725374 -0.234026412 -0.234026412 -0.234026412 -0.176327450
## 60% 61% 62% 63% 64% 65%
## -0.176327450 -0.176327450 -0.118628488 -0.118628488 -0.060929526 -0.060929526
## 66% 67% 68% 69% 70% 71%
## -0.060929526 -0.003230564 0.054468398 0.054468398 0.112167360 0.112167360
## 72% 73% 74% 75% 76% 77%
## 0.169866322 0.227565284 0.227565284 0.285264245 0.342963207 0.342963207
## 78% 79% 80% 81% 82% 83%
## 0.400662169 0.458361131 0.504520301 0.573759055 0.573759055 0.631458017
## 84% 85% 86% 87% 88% 89%
## 0.746855941 0.804554903 0.862253865 0.977651789 1.093049713 1.208447637
## 90% 91% 92% 93% 94% 95%
## 1.266146599 1.381544523 1.554641409 1.727738295 1.843136218 2.073932066
## 96% 97% 98% 99% 100%
## 2.371658710 2.650921686 3.174828261 3.864907846 7.036042795
Based on the jump between the 99th and 100th percentiles, I have decided to cut the most extreme Salty and ABV outliers.
beer_proc_score_clean <- beer_proc_score[beer_proc_score$ABV < 24 & beer_proc_score$Salty < 24, ] #cut extreme ABV/Salty values on the raw scale
stand_beer_proc_clean <- stand_beer_proc[stand_beer_proc$ABV < 8 & stand_beer_proc$Salty < 8, ] #the same cut on the standardized scale; the threshold 8 lies between the 99th and 100th percentiles shown above
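A quick check of how aggressive this cut is:
nrow(stand_beer_proc) - nrow(stand_beer_proc_clean) #number of beers dropped as extreme outliers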
In this part of the project I check the clusterability of the data set using the Hopkins statistic, and then try to determine the optimal number of clusters.
hop_rez <- hopkins(stand_beer_proc_clean)
hop_rez
## [1] 1
The result is 1, which indicates highly clusterable data (for this statistic, values near 0.5 would correspond to structureless, uniformly distributed data).
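As a sanity check (a minimal sketch, assuming the same hopkins package that produced the result above), the statistic computed on uniformly random data of the same shape should land near 0.5:
set.seed(123) #reproducible reference sample
unif_ref <- as.data.frame(matrix(runif(prod(dim(stand_beer_proc_clean))),
                                 nrow = nrow(stand_beer_proc_clean)))
hopkins(unif_ref) #structureless data is expected to score close to 0.5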
Now I am going to find the optimal number of clusters for k-means using three criteria: variance explained, silhouette width and the Akaike information criterion.
optK<-Optimal_Clusters_KMeans(stand_beer_proc_clean, max_clusters=10, plot_clusters = TRUE)
optK_s<-Optimal_Clusters_KMeans(stand_beer_proc_clean, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
optK_a<-Optimal_Clusters_KMeans(stand_beer_proc_clean, max_clusters=10, plot_clusters=TRUE, criterion="AIC")
As the variance-explained chart shows, the "elbow" method does not make the optimal number of clusters obvious: it could be 2 or 3. The silhouette statistic is not clear-cut either; its largest value occurs at 8 clusters, but the differences between the values are small, so 2 or 3 clusters remain defensible choices. The AIC criterion points to results similar to the elbow method.
Since CLARA is an extension of PAM, the code below finds the optimal number of clusters for both the PAM and CLARA methods, using the total within-cluster sum of squares and the silhouette criterion.
pam <- fviz_nbclust(stand_beer_proc_clean, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()
clara <- fviz_nbclust(stand_beer_proc_clean, FUNcluster = cluster::clara, method = "silhouette") + theme_classic()
pam_wss <- fviz_nbclust(stand_beer_proc_clean, FUNcluster = cluster::pam, method = "wss") + theme_classic()
clara_wss <- fviz_nbclust(stand_beer_proc_clean, FUNcluster = cluster::clara, method = "wss") + theme_classic()
grid.arrange(pam,clara,pam_wss,clara_wss, ncol=2)
According to the silhouette statistic, the optimal number of clusters for PAM could be 6, but 4 scores almost as well (the difference is not significant). The total within-cluster sum of squares for PAM also points to about 4. For CLARA, the analogous criteria suggest 8 (though 4-5 are almost as good) by silhouette and 3-5 by the within-cluster sum of squares.
K-means is a classical method for clustering and vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to the cluster with the nearest center.
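To make the center/assignment idea concrete, here is a minimal sketch using base R's kmeans() (the k = 3 simply mirrors the eclust call below; nstart = 25 is an illustrative choice of random restarts):
km_sketch <- kmeans(stand_beer_proc_clean, centers = 3, nstart = 25) #25 random initializations, best kept
head(km_sketch$cluster) #nearest-center assignment for the first beers
km_sketch$centers[, 1:4] #first few coordinates of the three cluster centers (prototypes)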
kmeans_method <- eclust(stand_beer_proc_clean, k=3, FUNcluster="kmeans", hc_metric="minkowski", graph=TRUE)
x_chart <- fviz_silhouette(kmeans_method)
## cluster size ave.sil.width
## 1 1 637 0.10
## 2 2 1334 0.16
## 3 3 1219 0.19
y_chart <- fviz_cluster(kmeans_method, data = stand_beer_proc_clean, ellipse.type = "convex", graph = TRUE)
grid.arrange(x_chart, y_chart, ncol=2)
I have chosen 3 clusters for the k-means method. The results are not the best: the clusters overlap and the average silhouette width is poor. The most objects are assigned to cluster 2, but the highest average silhouette width belongs to cluster 3. I have experimented with different distance measures ("minkowski", "euclidean", "canberra"), but they do not change anything significantly.
I will now compare the cluster assignments with the beers' score from the original data set.
table(kmeans_method$cluster,beer_proc_score_clean$score)
##
## 1 2 3
## 1 86 289 262
## 2 565 458 311
## 3 220 598 401
As we can see, the agreement is poor: every cluster mixes beers from all three score levels. The beers evidently do not cluster in terms of ratings.
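The lack of agreement can also be quantified with the adjusted Rand index (a sketch; mclust is an extra dependency used only for this check):
library(mclust) #provides adjustedRandIndex()
adjustedRandIndex(kmeans_method$cluster, beer_proc_score_clean$score) #values near 0 mean no agreement beyond chance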
Then I will check k-means with 2 clusters.
kmeans_method <- eclust(stand_beer_proc_clean, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=TRUE)
x_chart <- fviz_silhouette(kmeans_method)
## cluster size ave.sil.width
## 1 1 1255 0.11
## 2 2 1935 0.24
y_chart <- fviz_cluster(kmeans_method, data = stand_beer_proc_clean, ellipse.type = "convex", graph = TRUE)
grid.arrange(x_chart, y_chart, ncol=2)
The average silhouette width is 0.19 and, unlike before, large parts of the two clusters do not overlap. Maybe the beers are simply split into light and dark?
The PAM algorithm searches for k representative objects in a data set (k medoids) and then assigns each object to the closest medoid in order to create clusters. Its aim is to minimize the sum of dissimilarities between the objects in a cluster and that cluster's center (the medoid). It is known as a robust version of k-means, since medoids are less sensitive to outliers.
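The medoid idea can be seen directly with cluster::pam() (a minimal sketch; k = 6 matches the choice below). Unlike k-means centers, medoids are actual beers from the data set:
pam_sketch <- cluster::pam(stand_beer_proc_clean, k = 6)
pam_sketch$id.med #row indices of the six medoid beers
round(stand_beer_proc_clean[pam_sketch$id.med, 1:4], 2) #their standardized profiles (first few features)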
pam_method <- eclust(stand_beer_proc_clean, k=6, FUNcluster="pam", hc_metric="pearson", graph=TRUE)
x_chart <- fviz_silhouette(pam_method)
## cluster size ave.sil.width
## 1 1 805 0.26
## 2 2 719 0.07
## 3 3 187 0.30
## 4 4 514 0.13
## 5 5 409 0.09
## 6 6 556 0.26
y_chart <- fviz_cluster(pam_method, data = stand_beer_proc_clean, ellipse.type = "convex")
grid.arrange(x_chart, y_chart, ncol=2)
The data clustered with the PAM algorithm also overlap. I decided to use 6 clusters, since this gives better results than 4. Judging by the per-cluster average silhouette widths, the second and fifth clusters are the worst. Cluster 3 is the best, but it also contains fewer objects than the others.
The CLARA algorithm is designed specifically for clustering large data sets. It repeatedly draws samples from the data, applies PAM (a k-medoids solver) to each sample, and then assigns the remaining observations to the medoids obtained from the best sample. My data set contains more than 3000 observations, so CLARA is worth trying.
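The sampling mechanism is visible directly in cluster::clara() (a sketch; the samples and sampsize values are illustrative, not tuned):
clara_sketch <- cluster::clara(stand_beer_proc_clean, k = 6,
                               samples = 50, sampsize = 200) #PAM run on 50 subsamples of 200 beers each
head(clara_sketch$sample) #row indices from the best-scoring subsample
clara_sketch$i.med #the medoids found on that subsample (row indices)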
clara_method <- eclust(stand_beer_proc_clean, k=6, FUNcluster="clara", hc_metric="pearson", graph=TRUE)
x_chart <- fviz_silhouette(clara_method)
## cluster size ave.sil.width
## 1 1 605 0.33
## 2 2 624 0.06
## 3 3 422 0.08
## 4 4 347 0.18
## 5 5 368 0.09
## 6 6 824 0.21
y_chart <- fviz_cluster(clara_method, data = stand_beer_proc_clean, ellipse.type = "convex")
grid.arrange(x_chart, y_chart, ncol=2)
Here, too, I decided to go with 6 clusters. The average silhouette width is 0.17, but it differs considerably across clusters. The first and last clusters look better: they are more compact. The objects in the remaining clusters overlap very strongly.
The beer data set turns out not to be easy to cluster, and the Hopkins statistic was not particularly helpful either. The best algorithm in terms of average silhouette width was k-means with only 2 clusters; perhaps dividing the beers into light and dark really is the most natural split. The second best was the PAM algorithm. Manipulating the distance metrics did not change the outputs of the methods at all. Even the CLARA algorithm was not the best option, which may suggest that the data set is still small enough for PAM to perform better.