To carry out this task, the free software R and the R Markdown editing tool will be used.
In this task, a dataset of 1.5 million beer reviews will be analyzed. The data set is available at the following link https://data.world/socialmediadata/beeradvocate. The data span a period of more than 10 years, including all ~1.5 million reviews up to November 2011. Each review includes ratings in terms of five “aspects”: appearance, aroma, palate, taste, and overall impression. Reviews include product and user information, followed by each of these five ratings, and a plaintext review. We also have reviews from ratebeer.
Preliminary analysis
First we load the require libraries:
library(data.table) # Handle large databases quickly
library(readr) # Load csv
library(DT) # Presentation of tables in RMarkdown
library(zoo) # Manipulation of dates, time series, ... etc.
library(Amelia) # Useful to work with NA data
library(ggplot2) # Graphical representation
library(psych) # Obtaining correlogramsIn this first phase, the data is loaded for correct processing.
We will use the datatatable library, which is useful when working with large databases in R, as in our case.
The dataset has 13 columns, which indicate:
brewery_id: Identifier of the brewery manufacturer.
brewery_name: Name of the beer manufacturer.
review_time: Date of the review, expressed in seconds since January 1, 1970.
review_overall: Overall rating of the beer.
review_aroma: Aroma rating.
review_appearance: Appearance rating.
review_profilename: Name of the user who made the review.
beer_style: Beer style.
review_palate: Palate rating.
review_taste: Taste/flavor evaluation.
beer_name: Beer name.
beer_abv: % alcohol of the beer.
beer_beerid: Beer identifier.
To obtain this information you can simply access a generic beerAdvocate page such as https://www.beeradvocate.com/beer/profile/23222/78820/.
Exploration of the dataset
Before starting, a brief preliminary analysis of our set and a preprocessing of the data is going to be done.
With this graph we observe that the variable beer_abv has some missing data (NA), but the rest of the variables are complete. As we have too many variables to visualize, let’s make a summary:
## brewery_id brewery_name review_time review_overall
## Min. : 1 Length:1586614 Min. :8.407e+08 Min. :0.000
## 1st Qu.: 143 Class :character 1st Qu.:1.173e+09 1st Qu.:3.500
## Median : 429 Mode :character Median :1.239e+09 Median :4.000
## Mean : 3130 Mean :1.224e+09 Mean :3.816
## 3rd Qu.: 2372 3rd Qu.:1.289e+09 3rd Qu.:4.500
## Max. :28003 Max. :1.326e+09 Max. :5.000
##
## review_aroma review_appearance review_profilename beer_style
## Min. :1.000 Min. :0.000 Length:1586614 Length:1586614
## 1st Qu.:3.500 1st Qu.:3.500 Class :character Class :character
## Median :4.000 Median :4.000 Mode :character Mode :character
## Mean :3.736 Mean :3.842
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000
##
## review_palate review_taste beer_name beer_abv
## Min. :1.000 Min. :1.000 Length:1586614 Min. : 0.01
## 1st Qu.:3.500 1st Qu.:3.500 Class :character 1st Qu.: 5.20
## Median :4.000 Median :4.000 Mode :character Median : 6.50
## Mean :3.744 Mean :3.793 Mean : 7.04
## 3rd Qu.:4.000 3rd Qu.:4.500 3rd Qu.: 8.50
## Max. :5.000 Max. :5.000 Max. :57.70
## NA's :67785
## beer_beerid
## Min. : 3
## 1st Qu.: 1717
## Median :13906
## Mean :21713
## 3rd Qu.:39441
## Max. :77317
##
A priori, it does not seem that we have errors derived from counting, or radically different data within the set. We note that, the variables that give us information about the users’ opinion about the beers are : review_overall ,review_aroma, review_appearance, review_palate, review_taste. All are continuous variables that take values in the interval [0,5].
Since these are the variables that provide us with the most information, we can perform a correlation analysis between them:
beer_analysis <- beer_reviews[,c("review_overall" ,"review_aroma", "review_appearance", "review_palate", "review_taste")]
corPlot(beer_analysis, cex = 1.2, tl.cex = 0.00001,addCoef.col = 1,cl.cex = 0.5)As we can see, they all have a positive correlation. In addition the review_overall variable is strongly correlated with the others, which makes sense, as it is calculated based on the others.
Finally we observe if there are duplicate rows
## Mode FALSE
## logical 1586614
Debugging the dataset
Since we are not finding duplicate data or values that are NA completely, we are not going to delete any observations. We are going to change the review_time variable to date format:
mydates<-beer_reviews$review_time
class(mydates) = c('POSIXt','POSIXct')
beer_reviews$review_time<-mydates
summary(beer_reviews$review_time)## Min. 1st Qu. Median
## "1996-08-22 02:00:01" "2007-03-07 00:36:28" "2009-04-08 17:01:21"
## Mean 3rd Qu. Max.
## "2008-10-15 18:48:00" "2010-11-01 00:40:05" "2012-01-11 13:35:48"
Which beer to choose and based on what criteria?
To choose the 3 best beers, I would take into account 2 criteria:
- Criterion 1: Sufficient number of reviews. For example >= 15 reviews. For this we will work with the variable beer_beerid.
- Criterion 2:The highest average overall rating (review_overall).
To establish the first criterion we can perform the following filtering:
beer_reviews$beer_beerid<-factor(beer_reviews$beer_beerid) # We transform it into a factor to work better
group_mean<-beer_reviews[ , .(means = mean(review_overall)), by = beer_beerid]
total<-beer_reviews[ , .N, by = beer_beerid]
beers_comparison<-cbind(group_mean,total[,c("N")])
#setnames(beers_comparison, old, new)
beers_comparison<-beers_comparison[N>=15,] # Criterion 1We can see the frequency distribution of the resulting variable by number of observations:
pl <- ggplot(beers_comparison, aes(x=means))
pl + geom_histogram(binwidth = 0.2, col='black', fill='blue', alpha=0.4) + ggtitle('Frequency distribution of the variable review_overall')The means object contains the average number of reviews per beer, while the total object contains the number of reviews per beer. beers_comparison collects both information in the same data frame.
We observe that the beers have a representative number of reviews to be able to conclude that they are the best on the basis of this criterion.
ids_comparison<-apply(beers_comparison[1:3,c("beer_beerid")],1, as.numeric) # Ids we are going to search in the main dataset
unique(beer_reviews[beer_beerid==ids_comparison[1],c("beer_name")])## beer_name
## 1: Rare D.O.S.
## beer_name
## 1: Dirty Horse
## beer_name
## 1: Southampton Berliner Weisse
Therefore, the best beers based on this criterion are Rare D.O.S. , Dirty Horse and Southampton Berliner Weisse.
Recommendation based on aroma and appearance
We will perform a similar procedure, but changing Criterion 2 (review_overall) to review_arome and review_appearance.
beer_reviews$beer_beerid<-factor(beer_reviews$beer_beerid) # We transform it into a factor to work better
aroma_mean<-beer_reviews[ , .(aroma_mean = mean(review_aroma)), by = beer_beerid] # Average per aroma of each beer
app_mean<-beer_reviews[ , .(app_mean = mean(review_appearance)), by = beer_beerid] # Average beer per appearance
total<-beer_reviews[ , .N, by = beer_beerid]
beers_comparison<-cbind(aroma_mean,app_mean[,c("app_mean")],total[,c("N")])
#setnames(beers_comparison, old, new)
beers_comparison<-beers_comparison[N>=15,] # Criterion 1In this case, as we would have to establish a “weighting” or importance to each of the two averages, what we will do is simply add both averages by review_arome and review_appearance, and then order them as in the previous point:
We can see a representation of the resulting distribution
pl <- ggplot(beers_comparison, aes(x=means))
pl + geom_histogram(binwidth = 0.4, col='black', fill='blue', alpha=0.4) + ggtitle('Distribución de frecuencias de al suma de arome y appearance')We search for the names as in the previous section, to answer the question:
ids_comparison<-apply(beers_comparison[1,c("beer_beerid")],1, as.numeric) # Ids we are going to search in the main dataset
datatable(unique(beer_reviews[beer_beerid==ids_comparison[1],c("beer_name")]))Therefore, and according to the criteria applied, if what you like most is the aroma and appearance, I would recommend the M Belgian-Style Barleywine beer.
Classification and classification characteristics
As classification ideas we will provide the following:
- Classification by manufacturer:
- Popularity
- Rating
- Trend
- Classification by style:
- Popularity
- Rating
- Trend
- Ranking by % of alcohol.
The variables we are going to take as reference for the classification are review_overall, review_time , brewery_id and beer_style and beer_abv.
At the trend* point we will choose the data from the last year of the dataset, and see which one has the most reviews, which will not indicate the trend or the beer that is the most fashionable at the moment. In our case we will choose the beers that were the most fashionable during 2011, up to November, which is as far as we have data available.
reviews_2012 <- beer_reviews[review_time>'2011-01-01',]
reviews_2012 <- reviews_2012[review_time<'2012-01-01',]Classification by manufacturer
# We create the variables necessary for the classification
beer_reviews[ , ratings_brewerly := .N, by = brewery_id]
beer_reviews[ , mean_brewery := mean(review_overall), by = brewery_id]
brewery_info<-unique(beer_reviews[,c("brewery_id","brewery_name","ratings_brewerly","mean_brewery")])Rating
Classification by style
beer_reviews[ , rating_style := .N, by = beer_style]
beer_reviews[ , mean_style := mean(review_overall), by = beer_style]
style_info<-unique(beer_reviews[,c("beer_style","rating_style","mean_style")])Rating
Classification by % alcohol
To make this classification correctly, we would need the opinion of an expert brewer. If we make it our own, we can base it on its frequency distribution:
beers_info<-unique(beer_reviews[,"beer_beerid","beer_abv"])
pl <- ggplot(beers_info, aes(x=beer_abv))
pl + geom_histogram(binwidth = 0.8, col='black', fill='blue', alpha=0.4) + ggtitle('Frequency distribution of alcohol by beer')In this case, we will apply a simple criterion of non-alcoholic beers, mild (0-3), normal (3-6) and strong (6-10) and very strong (>10). It would also be interesting to apply unsupervised learning methods such as K-means to make groupings of this variable.
beers_info_freq <- cut(beers_info$beer_abv, breaks = c(0,0.1,3,6,10, Inf),
labels = c('non', 'mild', 'normal','strong','very strong'), right = FALSE)
summary(beers_info_freq)## non mild normal strong very strong NA's
## 3 32 187 175 133 1
This would be the classification of the different beers in our dataset. We will store this information in the dataset.
beer_reviews$beer_abv <- cut(beer_reviews$beer_abv, breaks = c(0,0.1,3,6,10, Inf),
labels = c('non', 'mild', 'normal','strong','very strong'), right = FALSE)
summary(beer_reviews$beer_abv)## non mild normal strong very strong NA's
## 23 3391 612256 708801 194358 67785
Influence of the factors (arome, taste, appearance, palette) in determining the overall quality.
In this section, we are going to plan a GLM model to see the influence of the different ratings on the overall rating.
GLM hypothesis testing
We have continuous response variables, so we already know that the continuity hypothesis of the response variables is satisfied.
modelo1 = glm(formula = review_overall~ +review_overall+review_aroma+review_appearance+review_palate+review_taste, data = beer_reviews)
plot(modelo1, col = "blue")
- In the Residuals vs fitted plot we see that we do not have strong
influence points. - We have no apparent homoscedasticity problems. - The
hypothesis of normality of the residuals is fulfilled. - Where we might
find a possible problem is in the nonlinearity of the residuals,
Model fit and conclusions
Based on the above, we are able to propose a GLM, taking as independent variable the review_overall variable:
##
## Call:
## glm(formula = review_overall ~ +review_overall + review_aroma +
## review_appearance + review_palate + review_taste, data = beer_reviews)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8722 -0.2566 -0.0041 0.2441 3.6667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4395778 0.0023021 190.95 <2e-16 ***
## review_aroma 0.0478092 0.0007234 66.09 <2e-16 ***
## review_appearance 0.0358910 0.0006998 51.29 <2e-16 ***
## review_palate 0.2584742 0.0007599 340.14 <2e-16 ***
## review_taste 0.5515296 0.0007770 709.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1775317)
##
## Null deviance: 823922 on 1586613 degrees of freedom
## Residual deviance: 281673 on 1586609 degrees of freedom
## AIC: 1759992
##
## Number of Fisher Scoring iterations: 2
Based on this model, we can conclude:
All ratings positively affect the overall rating.
All variables are significant in explaining the overall rating.
The variable review_taste, and therefore the taste of a beer, has the greatest influence on the overall rating.