BeerAdvocate Data Analysis

Francisco Javier Carela Ferrer

September 2023

To carry out this task, the free software R and the R Markdown editing tool will be used.

In this task, a dataset of 1.5 million beer reviews will be analyzed. The data set is available at the following link https://data.world/socialmediadata/beeradvocate. The data span a period of more than 10 years, including all ~1.5 million reviews up to November 2011. Each review includes ratings in terms of five “aspects”: appearance, aroma, palate, taste, and overall impression. Reviews include product and user information, followed by each of these five ratings, and a plaintext review. We also have reviews from ratebeer.

Preliminary analysis

First we load the require libraries:

library(data.table) # Handle large databases quickly
library(readr) # Load csv
library(DT) # Presentation of tables in RMarkdown
library(zoo) # Manipulation of dates, time series, ... etc.
library(Amelia) # Useful to work with NA data
library(ggplot2) # Graphical representation
library(psych) # Obtaining correlograms

In this first phase, the data is loaded for correct processing.

beer_reviews <- read_csv("beer_reviews.csv")
beer_reviews<-data.table(beer_reviews)

We will use the datatatable library, which is useful when working with large databases in R, as in our case.

The dataset has 13 columns, which indicate:

  • brewery_id: Identifier of the brewery manufacturer.

  • brewery_name: Name of the beer manufacturer.

  • review_time: Date of the review, expressed in seconds since January 1, 1970.

  • review_overall: Overall rating of the beer.

  • review_aroma: Aroma rating.

  • review_appearance: Appearance rating.

  • review_profilename: Name of the user who made the review.

  • beer_style: Beer style.

  • review_palate: Palate rating.

  • review_taste: Taste/flavor evaluation.

  • beer_name: Beer name.

  • beer_abv: % alcohol of the beer.

  • beer_beerid: Beer identifier.

To obtain this information you can simply access a generic beerAdvocate page such as https://www.beeradvocate.com/beer/profile/23222/78820/.

Exploration of the dataset

Before starting, a brief preliminary analysis of our set and a preprocessing of the data is going to be done.

missmap(beer_reviews,col=c("blue","red"),legend=TRUE)

With this graph we observe that the variable beer_abv has some missing data (NA), but the rest of the variables are complete. As we have too many variables to visualize, let’s make a summary:

summary(beer_reviews)
##    brewery_id    brewery_name        review_time        review_overall 
##  Min.   :    1   Length:1586614     Min.   :8.407e+08   Min.   :0.000  
##  1st Qu.:  143   Class :character   1st Qu.:1.173e+09   1st Qu.:3.500  
##  Median :  429   Mode  :character   Median :1.239e+09   Median :4.000  
##  Mean   : 3130                      Mean   :1.224e+09   Mean   :3.816  
##  3rd Qu.: 2372                      3rd Qu.:1.289e+09   3rd Qu.:4.500  
##  Max.   :28003                      Max.   :1.326e+09   Max.   :5.000  
##                                                                        
##   review_aroma   review_appearance review_profilename  beer_style       
##  Min.   :1.000   Min.   :0.000     Length:1586614     Length:1586614    
##  1st Qu.:3.500   1st Qu.:3.500     Class :character   Class :character  
##  Median :4.000   Median :4.000     Mode  :character   Mode  :character  
##  Mean   :3.736   Mean   :3.842                                          
##  3rd Qu.:4.000   3rd Qu.:4.000                                          
##  Max.   :5.000   Max.   :5.000                                          
##                                                                         
##  review_palate    review_taste    beer_name            beer_abv    
##  Min.   :1.000   Min.   :1.000   Length:1586614     Min.   : 0.01  
##  1st Qu.:3.500   1st Qu.:3.500   Class :character   1st Qu.: 5.20  
##  Median :4.000   Median :4.000   Mode  :character   Median : 6.50  
##  Mean   :3.744   Mean   :3.793                      Mean   : 7.04  
##  3rd Qu.:4.000   3rd Qu.:4.500                      3rd Qu.: 8.50  
##  Max.   :5.000   Max.   :5.000                      Max.   :57.70  
##                                                     NA's   :67785  
##   beer_beerid   
##  Min.   :    3  
##  1st Qu.: 1717  
##  Median :13906  
##  Mean   :21713  
##  3rd Qu.:39441  
##  Max.   :77317  
## 

A priori, it does not seem that we have errors derived from counting, or radically different data within the set. We note that, the variables that give us information about the users’ opinion about the beers are : review_overall ,review_aroma, review_appearance, review_palate, review_taste. All are continuous variables that take values in the interval [0,5].

Since these are the variables that provide us with the most information, we can perform a correlation analysis between them:

beer_analysis <- beer_reviews[,c("review_overall" ,"review_aroma", "review_appearance", "review_palate", "review_taste")]
corPlot(beer_analysis, cex = 1.2, tl.cex = 0.00001,addCoef.col = 1,cl.cex = 0.5)

As we can see, they all have a positive correlation. In addition the review_overall variable is strongly correlated with the others, which makes sense, as it is calculated based on the others.

Finally we observe if there are duplicate rows

summary(duplicated(beer_reviews))
##    Mode   FALSE 
## logical 1586614

Debugging the dataset

Since we are not finding duplicate data or values that are NA completely, we are not going to delete any observations. We are going to change the review_time variable to date format:

mydates<-beer_reviews$review_time
class(mydates) = c('POSIXt','POSIXct')
beer_reviews$review_time<-mydates
summary(beer_reviews$review_time)
##                  Min.               1st Qu.                Median 
## "1996-08-22 02:00:01" "2007-03-07 00:36:28" "2009-04-08 17:01:21" 
##                  Mean               3rd Qu.                  Max. 
## "2008-10-15 18:48:00" "2010-11-01 00:40:05" "2012-01-11 13:35:48"

Which beer to choose and based on what criteria?

To choose the 3 best beers, I would take into account 2 criteria:

  • Criterion 1: Sufficient number of reviews. For example >= 15 reviews. For this we will work with the variable beer_beerid.
  • Criterion 2:The highest average overall rating (review_overall).

To establish the first criterion we can perform the following filtering:

beer_reviews$beer_beerid<-factor(beer_reviews$beer_beerid) # We transform it into a factor to work better

group_mean<-beer_reviews[ , .(means = mean(review_overall)), by = beer_beerid]
total<-beer_reviews[ , .N, by = beer_beerid]
beers_comparison<-cbind(group_mean,total[,c("N")])
#setnames(beers_comparison, old, new)

beers_comparison<-beers_comparison[N>=15,] # Criterion 1

We can see the frequency distribution of the resulting variable by number of observations:

pl <- ggplot(beers_comparison, aes(x=means))
pl +  geom_histogram(binwidth = 0.2, col='black', fill='blue', alpha=0.4) + ggtitle('Frequency distribution of the variable review_overall')

setorder(beers_comparison, -means) # Criterion 2
datatable(beers_comparison)

The means object contains the average number of reviews per beer, while the total object contains the number of reviews per beer. beers_comparison collects both information in the same data frame.

We observe that the beers have a representative number of reviews to be able to conclude that they are the best on the basis of this criterion.

ids_comparison<-apply(beers_comparison[1:3,c("beer_beerid")],1, as.numeric) # Ids we are going to search in the main dataset

unique(beer_reviews[beer_beerid==ids_comparison[1],c("beer_name")])
##      beer_name
## 1: Rare D.O.S.
unique(beer_reviews[beer_beerid==ids_comparison[2],c("beer_name")])
##      beer_name
## 1: Dirty Horse
unique(beer_reviews[beer_beerid==ids_comparison[3],c("beer_name")])
##                      beer_name
## 1: Southampton Berliner Weisse

Therefore, the best beers based on this criterion are Rare D.O.S. , Dirty Horse and Southampton Berliner Weisse.

Recommendation based on aroma and appearance

We will perform a similar procedure, but changing Criterion 2 (review_overall) to review_arome and review_appearance.

beer_reviews$beer_beerid<-factor(beer_reviews$beer_beerid) # We transform it into a factor to work better

aroma_mean<-beer_reviews[ , .(aroma_mean = mean(review_aroma)), by = beer_beerid] # Average per aroma of each beer
app_mean<-beer_reviews[ , .(app_mean = mean(review_appearance)), by = beer_beerid] # Average beer per appearance
total<-beer_reviews[ , .N, by = beer_beerid]
beers_comparison<-cbind(aroma_mean,app_mean[,c("app_mean")],total[,c("N")])
#setnames(beers_comparison, old, new)
beers_comparison<-beers_comparison[N>=15,] # Criterion 1

In this case, as we would have to establish a “weighting” or importance to each of the two averages, what we will do is simply add both averages by review_arome and review_appearance, and then order them as in the previous point:

beers_comparison$means <- beers_comparison[,c("aroma_mean")] + beers_comparison[,c("app_mean")]

We can see a representation of the resulting distribution

pl <- ggplot(beers_comparison, aes(x=means))
pl +  geom_histogram(binwidth = 0.4, col='black', fill='blue', alpha=0.4) + ggtitle('Distribución de frecuencias de al suma de arome y appearance')

setorder(beers_comparison, -means) # Criterion 2
datatable(head(beers_comparison))

We search for the names as in the previous section, to answer the question:

ids_comparison<-apply(beers_comparison[1,c("beer_beerid")],1, as.numeric) # Ids we are going to search in the main dataset

datatable(unique(beer_reviews[beer_beerid==ids_comparison[1],c("beer_name")]))

Therefore, and according to the criteria applied, if what you like most is the aroma and appearance, I would recommend the M Belgian-Style Barleywine beer.

Classification and classification characteristics

As classification ideas we will provide the following:

  • Classification by manufacturer:
    • Popularity
    • Rating
    • Trend
  • Classification by style:
    • Popularity
    • Rating
    • Trend
  • Ranking by % of alcohol.

The variables we are going to take as reference for the classification are review_overall, review_time , brewery_id and beer_style and beer_abv.

At the trend* point we will choose the data from the last year of the dataset, and see which one has the most reviews, which will not indicate the trend or the beer that is the most fashionable at the moment. In our case we will choose the beers that were the most fashionable during 2011, up to November, which is as far as we have data available.

reviews_2012 <- beer_reviews[review_time>'2011-01-01',]
reviews_2012 <- reviews_2012[review_time<'2012-01-01',]

Classification by manufacturer

# We create the variables necessary for the classification
beer_reviews[ , ratings_brewerly := .N, by = brewery_id]
beer_reviews[ , mean_brewery := mean(review_overall), by = brewery_id]
brewery_info<-unique(beer_reviews[,c("brewery_id","brewery_name","ratings_brewerly","mean_brewery")])

Popularity

datatable(setorder(brewery_info, -ratings_brewerly)) 

Rating

brewery_info2<-brewery_info[ratings_brewerly>=15,] # We choose the ones with enough comments
datatable(setorder(brewery_info2, -mean_brewery)) 

Trend

reviews_2012[ , ratings_brewerly2012 := .N, by = brewery_id]
tendencia2012<-unique(reviews_2012[,c("brewery_id","brewery_name","ratings_brewerly2012")])
datatable(setorder(tendencia2012, -ratings_brewerly2012)) # Brewery trend during 2012

Classification by style

beer_reviews[ , rating_style := .N, by = beer_style]
beer_reviews[ , mean_style := mean(review_overall), by = beer_style]
style_info<-unique(beer_reviews[,c("beer_style","rating_style","mean_style")])

Popularity

datatable(setorder(style_info, -rating_style))

Rating

style_info2<-style_info[rating_style>=15,] # We choose the ones with enough comments
datatable(setorder(style_info2, -mean_style))

Trend

reviews_2012[ , ratings_style2012 := .N, by = beer_style]
tendencia2012<-unique(reviews_2012[,c("beer_style","ratings_style2012")])
datatable(setorder(tendencia2012, -ratings_style2012)) # Brewery trend during 2012

Classification by % alcohol

To make this classification correctly, we would need the opinion of an expert brewer. If we make it our own, we can base it on its frequency distribution:

beers_info<-unique(beer_reviews[,"beer_beerid","beer_abv"])
pl <- ggplot(beers_info, aes(x=beer_abv))
pl +  geom_histogram(binwidth = 0.8, col='black', fill='blue', alpha=0.4) + ggtitle('Frequency distribution of alcohol by beer')

In this case, we will apply a simple criterion of non-alcoholic beers, mild (0-3), normal (3-6) and strong (6-10) and very strong (>10). It would also be interesting to apply unsupervised learning methods such as K-means to make groupings of this variable.

beers_info_freq <- cut(beers_info$beer_abv, breaks = c(0,0.1,3,6,10, Inf), 
             labels = c('non', 'mild', 'normal','strong','very strong'), right = FALSE)

summary(beers_info_freq)
##         non        mild      normal      strong very strong        NA's 
##           3          32         187         175         133           1

This would be the classification of the different beers in our dataset. We will store this information in the dataset.

beer_reviews$beer_abv <- cut(beer_reviews$beer_abv, breaks = c(0,0.1,3,6,10, Inf), 
             labels = c('non', 'mild', 'normal','strong','very strong'), right = FALSE)

summary(beer_reviews$beer_abv)
##         non        mild      normal      strong very strong        NA's 
##          23        3391      612256      708801      194358       67785

Influence of the factors (arome, taste, appearance, palette) in determining the overall quality.

In this section, we are going to plan a GLM model to see the influence of the different ratings on the overall rating.

GLM hypothesis testing

We have continuous response variables, so we already know that the continuity hypothesis of the response variables is satisfied.

modelo1 = glm(formula = review_overall~ +review_overall+review_aroma+review_appearance+review_palate+review_taste, data = beer_reviews)

plot(modelo1, col = "blue")

- In the Residuals vs fitted plot we see that we do not have strong influence points. - We have no apparent homoscedasticity problems. - The hypothesis of normality of the residuals is fulfilled. - Where we might find a possible problem is in the nonlinearity of the residuals,

Model fit and conclusions

Based on the above, we are able to propose a GLM, taking as independent variable the review_overall variable:

summary(modelo1)
## 
## Call:
## glm(formula = review_overall ~ +review_overall + review_aroma + 
##     review_appearance + review_palate + review_taste, data = beer_reviews)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8722  -0.2566  -0.0041   0.2441   3.6667  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.4395778  0.0023021  190.95   <2e-16 ***
## review_aroma      0.0478092  0.0007234   66.09   <2e-16 ***
## review_appearance 0.0358910  0.0006998   51.29   <2e-16 ***
## review_palate     0.2584742  0.0007599  340.14   <2e-16 ***
## review_taste      0.5515296  0.0007770  709.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1775317)
## 
##     Null deviance: 823922  on 1586613  degrees of freedom
## Residual deviance: 281673  on 1586609  degrees of freedom
## AIC: 1759992
## 
## Number of Fisher Scoring iterations: 2

Based on this model, we can conclude:

  • All ratings positively affect the overall rating.

  • All variables are significant in explaining the overall rating.

  • The variable review_taste, and therefore the taste of a beer, has the greatest influence on the overall rating.