Finding new beer-related experiences

Finding new beer-related experiences

Beer to me was always a filler drink, used as a basis for socialization, an appetizer for a good meal or a beverage to go along with sporting events. Never was it an experience in itself, which can be attributed to the fact that I could never distinguish between beers, with all of them falling into the ‘good’ category. This paper aims to change that using Multi-Dimensional Scaling, with an addition of PAM clustering, to find the most unique beers there are. The data used was obtained from Kaggle link. It is a dataset of 1.5 million beer reviews from the site BeerAdvocate link.

Loading the necessary packages

library(smacof)
library(ggplot2)
library(factoextra)
library(knitr)

Inspecting and transforming the data

We start by importing and inspecting the data.

beer_data <- read.csv('beer_reviews.csv')
head(beer_data, 3)

##   brewery_id    brewery_name review_time review_overall review_aroma
## 1      10325 Vecchio Birraio  1234817823            1.5          2.0
## 2      10325 Vecchio Birraio  1235915097            3.0          2.5
## 3      10325 Vecchio Birraio  1235916604            3.0          2.5
##   review_appearance review_profilename             beer_style review_palate
## 1               2.5            stcules             Hefeweizen           1.5
## 2               3.0            stcules     English Strong Ale           3.0
## 3               3.0            stcules Foreign / Export Stout           3.0
##   review_taste              beer_name beer_abv beer_beerid
## 1          1.5           Sausa Weizen      5.0       47986
## 2          3.0               Red Moon      6.2       48213
## 3          3.0 Black Horse Black Beer      6.5       48215

The columns we are interested in are beer_beerid and the different scores given by reviewers. This means that overall scores will not be taken into account. But first, due to the datas immense size, 100 random beers will be sampled. The seed of the random sampling will reiterated a few times in order to find the most interesting MDS result. For now, a representative model will be established.

set.seed(123) #seed subject to change
data <- beer_data[sample(nrow(beer_data), 100), ]
head(data, 5)

##         brewery_id                      brewery_name review_time review_overall
## 1237518        604   Brasserie Dubuisson Frères sprl  1229481996            4.0
## 134058        5318 Port Brewing Company / Pizza Port  1219260102            4.0
## 1172598      10099              Dogfish Head Brewery  1303689900            4.0
## 685285       16533          Hub City Brewing Company  1220218501            2.5
## 1274894         24        Traquair House Brewery Lld  1244252798            4.5
##         review_aroma review_appearance review_profilename
## 1237518          4.0               4.5        HeyItsChili
## 134058           4.0               4.5       PatronWizard
## 1172598          4.0               5.0             JayQue
## 685285           3.5               4.0             Deuane
## 1274894          4.0               3.0              Seru1
##                                 beer_style review_palate review_taste
## 1237518            Belgian Strong Dark Ale           3.5          4.0
## 134058      American Double / Imperial IPA           3.5          3.5
## 1172598 American Double / Imperial Pilsner           4.0          4.0
## 685285                       Oatmeal Stout           3.0          3.0
## 1274894             Scotch Ale / Wee Heavy           3.5          4.5
##                      beer_name beer_abv beer_beerid
## 1237518           Bush De Noël     12.0        2232
## 134058                  Hop 15     10.0       31323
## 1172598             My Antonia      7.5       60078
## 685285  Hub City Oatmeal Stout      5.6       40316
## 1274894      Traquair Jacobite      8.0         652

Minding duplicates in sampling (sampling 2 different reviews of the same beer), the scores are transformed into mean scores.

data <- na.omit(data)
average_scores <- aggregate(cbind(review_aroma, review_taste, review_palate, review_appearance) ~ beer_beerid, data = data, FUN = mean)
rownames(average_scores) <- average_scores$beer_beerid
head(average_scores)

##     beer_beerid review_aroma review_taste review_palate review_appearance
## 30           30          3.5          3.5           3.5               4.0
## 236         236          4.0          3.5           3.5               4.5
## 276         276          5.0          5.0           4.5               4.5
## 297         297          4.0          4.0           3.5               4.0
## 361         361          4.0          4.5           4.0               4.0
## 646         646          5.0          5.0           4.5               4.5

The resulting table presents itself as seen above. It is a subset of 5 columns, with 1 column being the beer id and the rest being the different scores regarding the aspects of the beer. Rows with missing data have also been removed. One can note that in the case of this sample 3 reviews have been merged (a mean has been derived from their scores) and one row has been deleted due to the missing values. This gives us a sample of 96.

This data is now ready for further processing.

Deriving distances and performing MDS

In order to perform MDS, one must extract the scores as a matrix and then derive the distances between different points in that matrix. This is done below:

X <- as.matrix(average_scores[, c('review_aroma', 'review_taste', 'review_palate', 'review_appearance')])
distances <- dist(X)

The result is a similarity matrix, which can now be subject to MDS. The package used for MDS is chosen to be smacof. Two dimensions are chosen.

fit.data<-mds(distances, ndim=2,  type="ordinal")

With MDS conducted, we can proceed to the analysis of the quality of the scaling based on the first sample provided.

Quality of fit

The main measure of the quality of fit of the model is its overall SPP (stress per point) value. Lower values indicate a better fit, with a suggestion that models with a value below 0.15 are considered good fits ¹. For this, a SMACOF permutation test will be used.

set.seed(123) 
permFit<-permtest(fit.data, nrep = 100, verbose=FALSE) 
permFit

## 
## Call: permtest.smacof(object = fit.data, nrep = 100, verbose = FALSE)
## 
## SMACOF Permutation Test
## Number of objects: 96 
## Number of replications (permutations): 100 
## 
## Observed stress value: 0.082 
## p-value: <0.001

The resulting observed stress value indicates a good fit of the model, with the p-value < 0.05 indicating that the model is in fact not a random fit. A further step can be conducted, examining singular stress per point values for each observation.

head(fit.data$spp, 20)

##        30       236       276       297       361       646       652       653 
## 0.1632531 1.3723348 0.1907011 0.5080244 0.2613608 0.1907011 0.8210687 0.5258820 
##       694       754       778       779      1093      1094      1212      1344 
## 0.6231601 0.5592561 2.0237329 0.2654387 0.2613608 1.7148729 0.6231601 0.4669895 
##      1346      1385      1575      1577 
## 0.2822144 0.6927095 0.2654387 0.4289081

In a graphical form of a stress decomposition chart:

plot(fit.data, plot.type = "stressplot")

One can note the low stress proportion, peaking out at around 8%. This indicates high similarity, with the observations with the highest score carrying the least fitting information in the terms of this model. Still, as the SPP is lower than 10% for all variables, there are no deep outliers.

Analysis and further steps

mds_df <- data.frame(MDS1 = fit.data$conf[, 1], MDS2 = fit.data$conf[, 2])
mds_df$labels <- average_scores$beer_beerid
ggplot(mds_df, aes(x = MDS1, y = MDS2)) +
  geom_point() +
  geom_text(aes(label=labels), vjust=1, hjust=1) +
  ggtitle("MDS Plot of Beer Reviews") +
  xlab("MDS Dimension 1") +
  ylab("MDS Dimension 2")

Using the ggplot2 package, we can now visualize the result of MDS on a 2 dimensional plane. Outliers are visible, being the most dissimilar observations. Let us create 5 such models with different samples, compiling the outliers from each MDS iteration into a table giving us most unique beers from the dataset. The function for this can be inspected below. The measure used for establishing the outliers was Mahalanobis distance, with the threshold being the 97.5% quantile of the Chi-squared distribution with 2 degrees of freedom. Distances from the centroid of each MDS result distribution were established, and then filtered with the aforementioned criteria in mind. The result is 27 of the most unique beers.

set.seed(407)
perform_mds5 <- function(data) {
  average_scores <- aggregate(cbind(review_aroma, review_taste, review_palate, review_appearance) ~ beer_beerid, data = data, FUN = mean)
  rownames(average_scores) <- average_scores$beer_beerid
  average_scores <- na.omit(average_scores)
  X <- as.matrix(average_scores[, c('review_aroma', 'review_taste', 'review_palate', 'review_appearance')])
  distances <- dist(X)
  fit.data <- mds(distances, ndim=2,  type="ordinal")
  coordinates <- fit.data$conf
  coordinates$md <- mahalanobis(coordinates, colMeans(coordinates), cov(coordinates))
  outliers <- coordinates$md > qchisq(0.975, df = 2)
  outlier_df <- average_scores[outliers==TRUE, ]
  data <- data[!duplicated(data$beer_beerid), ]
  outlier_df <- merge(outlier_df, data[, c('beer_beerid', 'beer_name', 'brewery_name')], by = 'beer_beerid', all.x = TRUE)
  return(outlier_df)
}

final_data <- data.frame()

for(i in 1:5) {
  sample_data <- beer_data[sample(nrow(beer_data), 100), ]
  inter_data <- perform_mds5(sample_data)
  sample_data <- sample_data[!duplicated(sample_data$beer_beerid), ]
  final_data <- rbind(final_data, inter_data)

}

kable(final_data[, c('beer_name', 'brewery_name')])

beer_name	brewery_name
Kokanee	Labatt Brewing Company Ltd.
Hitachino Nest Red Rice Ale	Kiuchi Brewery
60 Minute IPA	Dogfish Head Brewery
Oat (Imperial Oatmeal Stout)	Southern Tier Brewing Company
Heineken Premium Light Lager	Heineken Nederland B.V.
Coors	Coors Brewing Company
Icehouse	Miller Brewing Co.
Coney Island Albino Python	Shmaltz Brewing Company
Oktoberfest	Santa Fe Brewing Company
Miller High Life	Miller Brewing Co.
Tsingtao	Tsingtao Brewery Co., Ltd.
Pabst Blue Ribbon (PBR)	Pabst Brewing Company
Bavaria 8.6	Bavaria Brouwerij N.V.
Midas Touch Golden Elixir	Dogfish Head Brewery
Bell’s Special Double Cream Stout	Bell’s Brewery, Inc.
Pearl River Lager Beer	Guangzhou Zhujiang Brewery Co. Ltd.
Le Coq Imperial Extra Double Stout	Harvey & Son Ltd.
Red Stripe Jamaican Lager	Desnoes & Geddes Limited
Leinenkugel’s Honey Weiss	Jacob Leinenkugel Brewing Company
Carlsberg Master Brew	Carlsberg Danmark A/S
Michelob Porter	Anheuser-Busch
Biscotti	Lift Bridge Brewery
Labatt Blue Light Lime	Labatt Brewing Company Ltd.
Dos Equis Amber Lager	Cervecería Cuauhtémoc Moctezuma, S.A. de C.V.
Eumundi Lager	Eumundi Brewing Group LTD.
Gale’s Prize Old Ale	George Gale & Company Ltd
Erebuni	Pivzavod Abovian

The table above presents the outliers meeting the threshold established in the function. Those are the most dissimilar observations in relation to the whole sample, meaning they are the most unique. Those observations often obtained a lower score in one or more fields of the review, which might reinforce the view that most beers fall into the ‘good’ category.

Additional clustering

As many of the 27 presented beers are not available in Poland, additional PAM clustering will be conducted in order to establish similar observations, a contrary practice to the ones used to this point. This is done to get a taste of each group without the need to import 27 different beers. The medoid observation will be chosen as the representative beer of the group.

First, the gap statistic graph is rendered in order to establish the optimal number of clusters. The graph indicates the optimal number of clusters to be 1, but this is not useful in the context of the desired result. Therefore 5 clusters will be chosen as the gap statistic for 6 is lower than its value for 5 clusters (8 clusters were also inspected but ultimately abandoned after visual examination).

clustering_data <- final_data[, c('review_aroma', 'review_taste', 'review_palate', 'review_appearance')]
rownames(clustering_data) <- final_data$beer_beerid
fviz_nbclust(clustering_data, FUNcluster=cluster::pam, method="gap_stat")+ theme_classic() +ggtitle('PAM')

pa1 <- eclust(clustering_data, "pam", hc_metric="spearman",k=5)

result_med <- as.data.frame(pa1$medoids)
result_med$beer_beerid <- rownames(result_med)
beers_to_try <- merge(result_med, final_data[, c('beer_beerid', 'beer_name', 'brewery_name')], by = 'beer_beerid', all.x = TRUE)
kable(beers_to_try)

beer_beerid	review_aroma	review_taste	review_palate	review_appearance	beer_name	brewery_name
47231	3.0	1.5	1.5	3.5	Biscotti	Lift Bridge Brewery
57405	2.0	2.5	2.5	2.0	Labatt Blue Light Lime	Labatt Brewing Company Ltd.
62353	2.0	4.0	5.0	3.5	Oktoberfest	Santa Fe Brewing Company
767	1.5	1.5	2.0	2.0	Tsingtao	Tsingtao Brewery Co., Ltd.
8768	2.0	2.0	3.5	4.0	Erebuni	Pivzavod Abovian

As we can see, the medoids are given in the dataframe above. If one wants to overview the output from the previous paragraph while not having to import 27 beers, he would be best off trying those five listed above.

Conclusions

MDS has proven to be a useful tool, theoretically delivering the desired result. Through the analysis of dissimilarities and then similarities, it has singled out the most unique beers from 5 samples, which were then analyzed for similarities to give us a representative observation of each cluster. One could ask themselves if analyzing the data based on reviews is the right approach, which is a valid concern. Approaches based on text mining from more generously described reviews could prove to be more effective. Nevertheless, the quality of the result obtained in thus paper could be subject to further empirical analysis.

Bibliography

Dugard P, Todman J and Staines H (2010) p.275 Approaching multivariate analysis. A practical introduction. Second Edition. Routledge:New York.↩︎

Finding new beer-related experiences through MDS

Jan Szczepanek

2024-02-22