Background

Drinkers like their beer cold. However, as is the case with all alcoholic drinks, beer is as controversial as loved. According to some estimates, global beer consumption stands at 200 million kilolitres and growing. Without a doubt, brewing beer is a significant economic activity. However, detractors argue that alcohol addiction wrecks individuals and families, leading to death. But people continue to drink regardless.

In this article, I use data from beer advocate on the ratings of beers (by the beer lovers, of course). I supplement this data with beer consumption data from Wikipedia.

wiki_adress <- "https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita"

wiki_data <- read_html(wiki_adress) %>% 
  
  html_nodes("table") %>% 
  
  html_table() %>% 
  
  .[[1]] %>% 
  
  janitor::clean_names() %>% 
  
  mutate(country = str_remove_all(country, "\\*")) %>% 
  
  mutate(total_nationalconsumption_a_million_litresper_year = 
           
           parse_number(total_nationalconsumption_a_million_litresper_year)) %>% 
  
  select(-sources)

Objectives

This article seeks answers to the following questions:

  1. Which countries have the highest per capita beer consumption?
  2. Which countries have the highest beer consumption?
  3. What is the most popular beer and beer type?
  4. Which brewing company brews the most popular beers in the data set.
  5. Which beer categories have the highest alcohol content.
  6. Which beers have the weirdest names?
  7. What factors drive beer ratings.
###############################################################
## Scrapping function -----
###############################################################

my_comprehensive_scrapper <- function(url, n_rows, category){
  
## get the Beers, number of votes and average rating ----
  
  beer_data_main <- read_html(url) %>% 
    
    html_nodes(".hr_bottom_light b") %>% 
    
    html_text() %>% 
    
    tibble() %>% 
    
    mutate(key = rep(1:3, n_rows))
  
  beer <- data.frame(beer = filter(beer_data_main, key == 1))
  
  votes <- data.frame(votes = filter(beer_data_main, key == 2))
  
  avg <- data.frame(avg = filter(beer_data_main, key == 3))
  
  long_beers <- bind_cols(beer, votes, avg) %>% 
    
    select(-ends_with("key")) %>% 
    
    set_names(c("beer", "votes", "rating_average"))
  
## Get the beer category- like stout ----
  
  beer_category <- read_html(url) %>% 
    
    html_nodes("#ba-content a~ a") %>% 
    
    html_text() %>% 
    
    tibble() %>% 
    
    set_names("type")
  
## Get the full table  ----
  
  full_table <- read_html(url) %>% 
    
    html_nodes("table") %>% 
    
    html_table() %>% 
    
    .[[1]] %>% 
    
    filter(!is.na(X1)) %>% 
    
    select(-X1, -X3, -X4, -X5)
  
## Combine the three tables to form one table ----
  
long_beers %>% bind_cols(full_table) %>% 
    
    bind_cols(beer_category) %>% 
    
    mutate(X2 = str_remove_all(X2, beer)) %>% 
    
    mutate(X2 = str_remove_all(X2, type)) %>% 
    
    mutate(alcohol_perc = str_extract(X2, "\\d{1,2}\\.\\d{2}%$"), 
           
           X2 = str_remove_all(X2, "\\|\\s?\\d{1,2}\\.\\d{2}%$"),
           
           alcohol_perc = parse_number(alcohol_perc),
           
           category = category) %>% 
    
    rename(brewer = X2)
  
}
###################################################################################

top_250_beers <- my_comprehensive_scrapper(url = "https://www.beeradvocate.com/beer/top-rated/", 
                                           
                                           n_rows = 250, category = "top_250")


trending_beers <- my_comprehensive_scrapper(url = "https://www.beeradvocate.com/beer/trending/", 
                                            
                                            n_rows = 100, category = "trending")

top_new <- my_comprehensive_scrapper(url = "https://www.beeradvocate.com/beer/top-new/", 
                                            
                                            n_rows = 250, category = "new")

fame_beer <- my_comprehensive_scrapper(url = "https://www.beeradvocate.com/beer/fame/", 
                                     
                                     n_rows = 250, category = "fame")

popular_beer <- my_comprehensive_scrapper(url = "https://www.beeradvocate.com/beer/popular/", 
                                       
                                       n_rows = 250, category = "popular")

#########################################################################################

Summary of Results

  1. The Czech Republic has the highest per capita beer consumption.

  2. China and the United States are the most extensive beer markets.

  3. The most popular beers in the data set are Kentucky Brunch Brand Stout and Assassin - Double Barrel-Aged by Toppling Goliath Brewing Company.

  4. The most popular beer category by median ratings is IPA - New England followed closely by Sour - Fruited Kettle Sour and Wild Ale.

  5. Tree House Brewing Company has the most beer brands featured in the dataset.

  6. Beers made by Anchorage Brewing Company have the highest median rating.

  7. There is no shortage when it comes to beers with weird names. Sample these:

  1. The alcohol content and brewer are significant drivers of beer ratings at a 1% significance level.
## Combine the three datasets ----

full_beer_data <- top_250_beers %>% 
  
  bind_rows(trending_beers) %>% 
  
  bind_rows(top_new) %>% 
  
  bind_rows(fame_beer) %>% 
  
  bind_rows(popular_beer) %>% 
  
## Add alcohol percentage for beers with the missing data

  mutate(
    
    alcohol_perc = case_when(
      
      beer == "Rare Scooop" ~ 6.16,
      
      beer == "Madness & Civilization #14" ~ 3.81,
      
      beer == "Thumbprint Lots O' Peach 21" ~ 7.55,
      
      beer == "Persevere" ~ 6.12,
      
      beer == "Sankt" ~ 7.21,
      
      beer == "Nonconformist 02" ~ 7.94,
      
      beer == "Civil Disobedience #31" ~ 6,
      
      beer == "Anna Pear" ~ 6.5,
      
      beer == "Foster" ~ 4,
      
      beer == "Leaves Of Grass: October 3, 2019" ~ 6.5,
      
      beer == "Medianoche Reserve (2021)" ~ 4,
      
      beer == "Steadfast" ~ 6.8,
      
      beer == "Tree Of Bliss" ~ 8,
      
      beer == "Tree Of Light" ~ 12,
      
      beer == "We Are Made Of The Same Stardust" ~ 9.3,
      
      TRUE ~ alcohol_perc
      
    )) %>% 

###################################################################################
## Add beer category light, medium, strong
  mutate(strength_class = case_when(
    
    alcohol_perc <= 5 ~ "lite",
    
    alcohol_perc > 5 & alcohol_perc <= 10 ~ "medium",
    
    alcohol_perc > 10 & alcohol_perc <= 15 ~ "strong",
    
    alcohol_perc > 15 & alcohol_perc <= 20 ~ "super",
    
    TRUE ~ "ultra"
    
  )) %>% 
  
## Convert rating average and votes to numeric
  
mutate(votes = parse_number(votes),
       
       rating_average = parse_number(rating_average)) %>% 
  
relocate(category, votes, rating_average, 
         
         .after = strength_class) %>% 
  
  select(-category) %>% 
  
  group_by(beer, brewer, type, alcohol_perc, strength_class) %>% 
  
  summarise(votes = sum(votes),
            
            rating_average = mean(rating_average)) %>% 
  
  ungroup() %>% 
  
  distinct() %>% 
  
  mutate(brewer = str_trim(brewer),
         
         beer = str_trim(beer),
         
         type = str_trim(type))

##########################################################################

The Data

As noted before, there are two sets of data. The data from Wikipedia has 61 rows and 5 columns of the following variables.

wiki_data %>% 
  
  select(where(is.numeric)) %>% 
  
  skim_without_charts() %>% 
  
  select(-skim_type) %>% 
  
  rename(Variable = skim_variable,
         
         Missing = n_missing, Mean = numeric.mean,
         
         SD = numeric.sd, Min = numeric.p0,
         
         Q1 = numeric.p25, Median = numeric.p50,
         
         Q3 = numeric.p75, Max = numeric.p100) %>% 
  
  kbl(., booktabs = TRUE, 
      
      caption = "Summary of Numeric Variables") %>% 
  
  kable_classic(full_width = TRUE, font_size = 10) %>% 
  
  footnote(number = "Data source: Wikipedia")
Summary of Numeric Variables
Variable Missing complete_rate Mean SD Min Q1 Median Q3 Max
consumptionper_capita_1_litres_per_year 0 1.0000000 54.9541 34.574054 0.7 27.00 58.4 76.1 188.6
x2018change_litres_per_year 21 0.6557377 0.4975 3.235499 -3.2 -1.45 -0.3 1.5 14.2
total_nationalconsumption_a_million_litresper_year 18 0.7049180 3430.2326 7605.899636 50.0 297.50 565.0 2734.5 43266.0
year 5 0.9180328 2018.0893 1.697879 2013.0 2018.00 2019.0 2019.0 2020.0
1 Data source: Wikipedia
################################################################################
wiki_data %>% 
  
  select(country) %>% 
  
  skimr::skim() %>% 
  
  select(-skim_type) %>% 
  
  rename(Variable = skim_variable) %>% 
  
  kbl(., booktabs = TRUE,
      
      caption = "Summary of Non-numeric Variables") %>% 
  
  kable_classic(full_width = TRUE, font_size = 10) %>% 
  
  footnote(number = "Data source: Wikipedia")
Summary of Non-numeric Variables
Variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace
country 0 1 4 23 0 61 0
1 Data source: Wikipedia

The second data set comes from beeradvocate.com. The data has the following variables.

I start by visualizing the numeric variables. The distribution of votes cast is bimodal. Again, there appears to be some positive non-linear relationship between alcohol percentage in a beer and its average rating. However, the average rating of beers skews to the left. The alcohol percentage in beers skews to the right.

## Distribution of alcohol content in beers

(
full_beer_data %>% 
  
  ggplot(mapping = aes(x = alcohol_perc)) + 
  
  geom_histogram(col = "gray", 
               
               fill = "purple") + 
  
  labs(x = "Alcohol Percentage",
       
       y = "Count",
       
       title = "Distribution of Alcohol Percentage") +


## Distribution of rating average 
full_beer_data %>% 
  
  ggplot(mapping = aes(x = rating_average)) + 
  
  geom_density(col = "gray", 
               
               fill = "purple") +
  
  labs(x = "Rating Average",
       
       y = "Count",
       
       title = "Distribution of Beer Rating")) /

## Distribition of votes received

(full_beer_data %>% 
  
  ggplot(mapping = aes(x = votes, y = ..count..)) + 
  
  geom_density(col = "purple", 
               
               fill = "purple") +
  
  scale_x_log10() +
  
  labs(x = "Votes",
       
       y = "Count",
       
       title = "Distribution of Votes Cast") +


## Average rating versus alcohol percentage

full_beer_data %>% 
  
  ggplot(mapping = aes(x = rating_average, 
                       
                       y = alcohol_perc,
                       
                       col = type,
                       
                       size = votes)) +
  
  geom_point(show.legend = FALSE) +
  
  labs(x = "Average Rating",  y = "Alcohol Percentage",
       
       title = "Alcohol Percentage Versus Average Rating of Beers",
       
       caption = "Size shows number of votes. Colors show beer categories
       
       John Karuitha, 2022: Data Source: BeerAdvocate.com")
)

Next, I summarize the quantitative variables.

full_beer_data %>% 
  
  select(where(is.numeric)) %>% 
  
  skim_without_charts() %>% 
  
  select(-skim_type) %>% 
  
  rename(Variable = skim_variable,
         
         Missing = n_missing, 
         
         Mean = numeric.mean,
         
         SD = numeric.sd, 
         
         Min = numeric.p0,
         
         Q1 = numeric.p25, 
         
         Median = numeric.p50,
         
         Q3 = numeric.p75, 
         
         Max = numeric.p100) %>% 
  
  kbl(., booktabs = TRUE, 
      
      caption = "Summary of Numeric Variables") %>% 
  
  kable_classic(full_width = TRUE, font_size = 10) %>% 
  
  footnote(number = "Data source: BeerAdvocate.com")
Summary of Numeric Variables
Variable Missing complete_rate Mean SD Min Q1 Median Q3 Max
alcohol_perc 0 1 9.282901 3.2384802 3.50 7.00 8.50 11.30 28.00
votes 0 1 3180.127590 5728.4047146 5.00 33.00 845.00 3974.00 53472.00
rating_average 0 1 4.293746 0.3295495 1.87 4.16 4.34 4.51 4.83
1 Data source: BeerAdvocate.com
##########################################################################
full_beer_data %>% 
  
  select(-alcohol_perc, -votes, -rating_average) %>% 
  
  skimr::skim() %>% 
  
  select(-skim_type) %>% 
  
  rename(Variable = skim_variable) %>% 
  
  kbl(., booktabs = TRUE,
      
      caption = "Summary of Non-numeric Variables") %>% 
  
  kable_classic(full_width = TRUE, font_size = 10) %>% 
  
  footnote(number = "Data source: BeerAdvocate.com")
Summary of Non-numeric Variables
Variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace
beer 0 1 2 69 0 905 0
brewer 0 1 0 104 1 295 0
type 0 1 6 36 0 81 0
strength_class 0 1 4 6 0 5 0
1 Data source: BeerAdvocate.com
##########################################################################

Exploring the data

The data shows that the Czech Republic has the highest per capita beer consumption. It is notable that except for Namibia, an African Country, the rest of the countries with the highest per capita beer consumption are from Europe.

wiki_data %>% 
  
  arrange(desc(consumptionper_capita_1_litres_per_year)) %>% 
  
  head(10) %>% 
  
  ggplot(mapping = aes(x = fct_reorder(country, consumptionper_capita_1_litres_per_year, max),
                       
                       y = consumptionper_capita_1_litres_per_year)) + 
  
  geom_col(show.legend = FALSE, color = "black") + 
  
  coord_flip() +
  
  labs(x = "Country", 
       
       y = "Beer Consumption, Litres per Year",
       
       caption = "Data Source: Wikipedia",
       
       title = "Per Capita Beer Consumption in the World")

The absolute amounts of alcohol taken could be a function of the country’s population and level of income. In this case, China and the United States lead in total beer consumption.

wiki_data %>% 
  
  arrange(desc(total_nationalconsumption_a_million_litresper_year)) %>% 
  
  head(10) %>% 
  
  ggplot(mapping = aes(x = fct_reorder(country, total_nationalconsumption_a_million_litresper_year, max), 
                       
                       y = total_nationalconsumption_a_million_litresper_year)) +
  
  geom_col(color = "black", show.legend = FALSE) +
  
  labs(x = "Country", y = "Alcohol Consumed, Millions of Litres",
       
       title = "Alcohol Consumption in the World") + 
  
  coord_flip()

Notably, Germany, Spain and Poland appear in both lists. These three countries present prime markets for beer.