Beer to me was always a filler drink, used as a basis for socialization, an appetizer for a good meal or a beverage to go along with sporting events. Never was it an experience in itself, which can be attributed to the fact that I could never distinguish between beers, with all of them falling into the ‘good’ category. This paper aims to change that using Multi-Dimensional Scaling, with an addition of PAM clustering, to find the most unique beers there are. The data used was obtained from Kaggle link. It is a dataset of 1.5 million beer reviews from the site BeerAdvocate link.
library(smacof)
library(ggplot2)
library(factoextra)
library(knitr)
We start by importing and inspecting the data.
beer_data <- read.csv('beer_reviews.csv')
head(beer_data, 3)
## brewery_id brewery_name review_time review_overall review_aroma
## 1 10325 Vecchio Birraio 1234817823 1.5 2.0
## 2 10325 Vecchio Birraio 1235915097 3.0 2.5
## 3 10325 Vecchio Birraio 1235916604 3.0 2.5
## review_appearance review_profilename beer_style review_palate
## 1 2.5 stcules Hefeweizen 1.5
## 2 3.0 stcules English Strong Ale 3.0
## 3 3.0 stcules Foreign / Export Stout 3.0
## review_taste beer_name beer_abv beer_beerid
## 1 1.5 Sausa Weizen 5.0 47986
## 2 3.0 Red Moon 6.2 48213
## 3 3.0 Black Horse Black Beer 6.5 48215
The columns we are interested in are beer_beerid and the different scores given by reviewers. This means that overall scores will not be taken into account. But first, due to the datas immense size, 100 random beers will be sampled. The seed of the random sampling will reiterated a few times in order to find the most interesting MDS result. For now, a representative model will be established.
set.seed(123) #seed subject to change
data <- beer_data[sample(nrow(beer_data), 100), ]
head(data, 5)
## brewery_id brewery_name review_time review_overall
## 1237518 604 Brasserie Dubuisson Frères sprl 1229481996 4.0
## 134058 5318 Port Brewing Company / Pizza Port 1219260102 4.0
## 1172598 10099 Dogfish Head Brewery 1303689900 4.0
## 685285 16533 Hub City Brewing Company 1220218501 2.5
## 1274894 24 Traquair House Brewery Lld 1244252798 4.5
## review_aroma review_appearance review_profilename
## 1237518 4.0 4.5 HeyItsChili
## 134058 4.0 4.5 PatronWizard
## 1172598 4.0 5.0 JayQue
## 685285 3.5 4.0 Deuane
## 1274894 4.0 3.0 Seru1
## beer_style review_palate review_taste
## 1237518 Belgian Strong Dark Ale 3.5 4.0
## 134058 American Double / Imperial IPA 3.5 3.5
## 1172598 American Double / Imperial Pilsner 4.0 4.0
## 685285 Oatmeal Stout 3.0 3.0
## 1274894 Scotch Ale / Wee Heavy 3.5 4.5
## beer_name beer_abv beer_beerid
## 1237518 Bush De Noël 12.0 2232
## 134058 Hop 15 10.0 31323
## 1172598 My Antonia 7.5 60078
## 685285 Hub City Oatmeal Stout 5.6 40316
## 1274894 Traquair Jacobite 8.0 652
Minding duplicates in sampling (sampling 2 different reviews of the same beer), the scores are transformed into mean scores.
data <- na.omit(data)
average_scores <- aggregate(cbind(review_aroma, review_taste, review_palate, review_appearance) ~ beer_beerid, data = data, FUN = mean)
rownames(average_scores) <- average_scores$beer_beerid
head(average_scores)
## beer_beerid review_aroma review_taste review_palate review_appearance
## 30 30 3.5 3.5 3.5 4.0
## 236 236 4.0 3.5 3.5 4.5
## 276 276 5.0 5.0 4.5 4.5
## 297 297 4.0 4.0 3.5 4.0
## 361 361 4.0 4.5 4.0 4.0
## 646 646 5.0 5.0 4.5 4.5
The resulting table presents itself as seen above. It is a subset of 5 columns, with 1 column being the beer id and the rest being the different scores regarding the aspects of the beer. Rows with missing data have also been removed. One can note that in the case of this sample 3 reviews have been merged (a mean has been derived from their scores) and one row has been deleted due to the missing values. This gives us a sample of 96.
This data is now ready for further processing.
In order to perform MDS, one must extract the scores as a matrix and then derive the distances between different points in that matrix. This is done below:
X <- as.matrix(average_scores[, c('review_aroma', 'review_taste', 'review_palate', 'review_appearance')])
distances <- dist(X)
The result is a similarity matrix, which can now be subject to MDS. The package used for MDS is chosen to be smacof. Two dimensions are chosen.
fit.data<-mds(distances, ndim=2, type="ordinal")
With MDS conducted, we can proceed to the analysis of the quality of the scaling based on the first sample provided.
The main measure of the quality of fit of the model is its overall SPP (stress per point) value. Lower values indicate a better fit, with a suggestion that models with a value below 0.15 are considered good fits 1. For this, a SMACOF permutation test will be used.
set.seed(123)
permFit<-permtest(fit.data, nrep = 100, verbose=FALSE)
permFit
##
## Call: permtest.smacof(object = fit.data, nrep = 100, verbose = FALSE)
##
## SMACOF Permutation Test
## Number of objects: 96
## Number of replications (permutations): 100
##
## Observed stress value: 0.082
## p-value: <0.001
The resulting observed stress value indicates a good fit of the model, with the p-value < 0.05 indicating that the model is in fact not a random fit. A further step can be conducted, examining singular stress per point values for each observation.
head(fit.data$spp, 20)
## 30 236 276 297 361 646 652 653
## 0.1632531 1.3723348 0.1907011 0.5080244 0.2613608 0.1907011 0.8210687 0.5258820
## 694 754 778 779 1093 1094 1212 1344
## 0.6231601 0.5592561 2.0237329 0.2654387 0.2613608 1.7148729 0.6231601 0.4669895
## 1346 1385 1575 1577
## 0.2822144 0.6927095 0.2654387 0.4289081
In a graphical form of a stress decomposition chart:
plot(fit.data, plot.type = "stressplot")
One can note the low stress proportion, peaking out at around 8%. This indicates high similarity, with the observations with the highest score carrying the least fitting information in the terms of this model. Still, as the SPP is lower than 10% for all variables, there are no deep outliers.
mds_df <- data.frame(MDS1 = fit.data$conf[, 1], MDS2 = fit.data$conf[, 2])
mds_df$labels <- average_scores$beer_beerid
ggplot(mds_df, aes(x = MDS1, y = MDS2)) +
geom_point() +
geom_text(aes(label=labels), vjust=1, hjust=1) +
ggtitle("MDS Plot of Beer Reviews") +
xlab("MDS Dimension 1") +
ylab("MDS Dimension 2")
Using the ggplot2 package, we can now visualize the result of MDS on a 2 dimensional plane. Outliers are visible, being the most dissimilar observations. Let us create 5 such models with different samples, compiling the outliers from each MDS iteration into a table giving us most unique beers from the dataset. The function for this can be inspected below. The measure used for establishing the outliers was Mahalanobis distance, with the threshold being the 97.5% quantile of the Chi-squared distribution with 2 degrees of freedom. Distances from the centroid of each MDS result distribution were established, and then filtered with the aforementioned criteria in mind. The result is 27 of the most unique beers.
set.seed(407)
perform_mds5 <- function(data) {
average_scores <- aggregate(cbind(review_aroma, review_taste, review_palate, review_appearance) ~ beer_beerid, data = data, FUN = mean)
rownames(average_scores) <- average_scores$beer_beerid
average_scores <- na.omit(average_scores)
X <- as.matrix(average_scores[, c('review_aroma', 'review_taste', 'review_palate', 'review_appearance')])
distances <- dist(X)
fit.data <- mds(distances, ndim=2, type="ordinal")
coordinates <- fit.data$conf
coordinates$md <- mahalanobis(coordinates, colMeans(coordinates), cov(coordinates))
outliers <- coordinates$md > qchisq(0.975, df = 2)
outlier_df <- average_scores[outliers==TRUE, ]
data <- data[!duplicated(data$beer_beerid), ]
outlier_df <- merge(outlier_df, data[, c('beer_beerid', 'beer_name', 'brewery_name')], by = 'beer_beerid', all.x = TRUE)
return(outlier_df)
}
final_data <- data.frame()
for(i in 1:5) {
sample_data <- beer_data[sample(nrow(beer_data), 100), ]
inter_data <- perform_mds5(sample_data)
sample_data <- sample_data[!duplicated(sample_data$beer_beerid), ]
final_data <- rbind(final_data, inter_data)
}
kable(final_data[, c('beer_name', 'brewery_name')])
| beer_name | brewery_name |
|---|---|
| Kokanee | Labatt Brewing Company Ltd. |
| Hitachino Nest Red Rice Ale | Kiuchi Brewery |
| 60 Minute IPA | Dogfish Head Brewery |
| Oat (Imperial Oatmeal Stout) | Southern Tier Brewing Company |
| Heineken Premium Light Lager | Heineken Nederland B.V. |
| Coors | Coors Brewing Company |
| Icehouse | Miller Brewing Co. |
| Coney Island Albino Python | Shmaltz Brewing Company |
| Oktoberfest | Santa Fe Brewing Company |
| Miller High Life | Miller Brewing Co. |
| Tsingtao | Tsingtao Brewery Co., Ltd. |
| Pabst Blue Ribbon (PBR) | Pabst Brewing Company |
| Bavaria 8.6 | Bavaria Brouwerij N.V. |
| Midas Touch Golden Elixir | Dogfish Head Brewery |
| Bell’s Special Double Cream Stout | Bell’s Brewery, Inc. |
| Pearl River Lager Beer | Guangzhou Zhujiang Brewery Co. Ltd. |
| Le Coq Imperial Extra Double Stout | Harvey & Son Ltd. |
| Red Stripe Jamaican Lager | Desnoes & Geddes Limited |
| Leinenkugel’s Honey Weiss | Jacob Leinenkugel Brewing Company |
| Carlsberg Master Brew | Carlsberg Danmark A/S |
| Michelob Porter | Anheuser-Busch |
| Biscotti | Lift Bridge Brewery |
| Labatt Blue Light Lime | Labatt Brewing Company Ltd. |
| Dos Equis Amber Lager | Cervecería Cuauhtémoc Moctezuma, S.A. de C.V. |
| Eumundi Lager | Eumundi Brewing Group LTD. |
| Gale’s Prize Old Ale | George Gale & Company Ltd |
| Erebuni | Pivzavod Abovian |
The table above presents the outliers meeting the threshold established in the function. Those are the most dissimilar observations in relation to the whole sample, meaning they are the most unique. Those observations often obtained a lower score in one or more fields of the review, which might reinforce the view that most beers fall into the ‘good’ category.
As many of the 27 presented beers are not available in Poland, additional PAM clustering will be conducted in order to establish similar observations, a contrary practice to the ones used to this point. This is done to get a taste of each group without the need to import 27 different beers. The medoid observation will be chosen as the representative beer of the group.
First, the gap statistic graph is rendered in order to establish the optimal number of clusters. The graph indicates the optimal number of clusters to be 1, but this is not useful in the context of the desired result. Therefore 5 clusters will be chosen as the gap statistic for 6 is lower than its value for 5 clusters (8 clusters were also inspected but ultimately abandoned after visual examination).
clustering_data <- final_data[, c('review_aroma', 'review_taste', 'review_palate', 'review_appearance')]
rownames(clustering_data) <- final_data$beer_beerid
fviz_nbclust(clustering_data, FUNcluster=cluster::pam, method="gap_stat")+ theme_classic() +ggtitle('PAM')
pa1 <- eclust(clustering_data, "pam", hc_metric="spearman",k=5)
result_med <- as.data.frame(pa1$medoids)
result_med$beer_beerid <- rownames(result_med)
beers_to_try <- merge(result_med, final_data[, c('beer_beerid', 'beer_name', 'brewery_name')], by = 'beer_beerid', all.x = TRUE)
kable(beers_to_try)
| beer_beerid | review_aroma | review_taste | review_palate | review_appearance | beer_name | brewery_name |
|---|---|---|---|---|---|---|
| 47231 | 3.0 | 1.5 | 1.5 | 3.5 | Biscotti | Lift Bridge Brewery |
| 57405 | 2.0 | 2.5 | 2.5 | 2.0 | Labatt Blue Light Lime | Labatt Brewing Company Ltd. |
| 62353 | 2.0 | 4.0 | 5.0 | 3.5 | Oktoberfest | Santa Fe Brewing Company |
| 767 | 1.5 | 1.5 | 2.0 | 2.0 | Tsingtao | Tsingtao Brewery Co., Ltd. |
| 8768 | 2.0 | 2.0 | 3.5 | 4.0 | Erebuni | Pivzavod Abovian |
As we can see, the medoids are given in the dataframe above. If one wants to overview the output from the previous paragraph while not having to import 27 beers, he would be best off trying those five listed above.
MDS has proven to be a useful tool, theoretically delivering the desired result. Through the analysis of dissimilarities and then similarities, it has singled out the most unique beers from 5 samples, which were then analyzed for similarities to give us a representative observation of each cluster. One could ask themselves if analyzing the data based on reviews is the right approach, which is a valid concern. Approaches based on text mining from more generously described reviews could prove to be more effective. Nevertheless, the quality of the result obtained in thus paper could be subject to further empirical analysis.
Dugard P, Todman J and Staines H (2010) p.275 Approaching multivariate analysis. A practical introduction. Second Edition. Routledge:New York.↩︎