Analysis of Beers and Breweries Around the World

Previous Goals

There were my previous goals that I hoped to deliver by the end of this project:

  1. Create an interactive geovisualization of the location of different breweries in the United States and the beers that they offer, with filters for regions, beer category, beer style, age of the breweries, and kinds of breweries.
  2. Create an interactive geovisualization for the location of different breweries and the ratings of the beers and be able to filter by location and ratings. The goal of these visualizations is to explore the data and see if there are any patterns that we can use to our advantage.
  3. Based on what information the visualizations offer, create a prediction algorithm for beer ratings. Given location, beer category, beer style, alcohol content (ABV, IBU), presence of a social media account, and brewery type, can I create an accurate recommendation algorithm? For this, I will use regression.

Current Goals

I found that the Python wrapper I wanted to use before was rendered useless from RateBeer’s recent updates to the site. In the past few weeks, RateBeer has changed the infrastructure of their website and changed much of how the data on their website is encoded. Since the wrapper has not been updated recently to take into account the changes of RateBeer’s website, I could not use the wrapper without taking a serious digression from this project and fix the Python script. So I have changed the goals of this project.

  1. Create an interactive geovisualization of the location breweries across every country, with the ability to select which country we’d like to look at.
  2. Provide layers to this geovisualization from the going from country to state to city, “zooming” on a factor up to three, such that a user can see where a brewery or specific beer is located.
  3. Allow the user to filter visualization on a variety of characteristics, such as style, abv, and ibu.

The focus of my project has shifted from using machine machine learning to gauge beer likability to exploring the trove of data provided by BreweryDB, emphasizing user interaction and data visualization. My vision of this project is to take a user through the story told by the data, what are the top styles made in the US, what characteristics do those styles have, where are those located throughout the country, a state, or a city. That is only an example of what my visualization can do, but I hope to make exploration of the data provided by BreweryDB much more accessible via a user interface through interactive visualizations, something similar to a visual essay done by Russell Goldenberg from The Pudding (maybe as not as cool or as polished, but a publication like that is the dream standard for this project).

But first, I want to provide some background information into beers and the process of brewing beer.

How Beer is Made

Beer is mainly made out of four ingredients: water, yeast, a grain, such as barley, and hops.

  1. Malting: The grain is prepared for boiling by being steeped in water and allowed to partially germinate, softening the kernel. This isolates the natural enzymes that will later on break down the starch into sugar. This process is stopped by heating, drying out and cracking the barley, turning it into malted barley.
  2. Mashing: The malted barley is soaked in hot water, allowing the natural enzymes in the grain to turn starch into sugar for the yeast to consume later during fermentation.
  3. Sparging: After the starch has been turned to sugar, the liquid becomes full of sugar and is separated from the grains in a process called lautering and the liquid that is leftover is now called wort.
  4. Boiling and Cooling: The wort is placed in a boil kettle and is boiled to kill any micro-organisms left in the liquid and is also when hops and spices are added over the period of about an hour. Hops provide bitterness in beer, balancing out the sugar. After, the wort is then quickly cooled down, strained, and filtered so that yeast can be added to it without killing it from the heat of the liquid.
  5. Fermentation: The waiting period for the yeast to consume the sugar and turn it into alcohol, which is typically a few weeks.
  6. Carbonation or Aging: After the fermentation period, the beer is still uncarbonated. It can either be artificially carbonated or it can be “bottle conditioned” and allowed to age with the CO2 produced by the yeast.

SRM & Original and Final Gravity

Most of us know what abv and ibu is, since these statistics are often displayed on the beer itself. But what is SRM, and original and final gravity and what does that have to do with beer?

SRM is short for the Standard Reference Method, the color system used by breweries for finished beer and malts.

SRM IDs and Examples of Beer Styles
ranges examples
1.0 - 2.0 Pale lager
2.0 - 3.0 Pilsener
3.0 - 4.0 Blonde Ale
4.0 - 6.0 Weissbeer
6.0 - 8.0 India Pale Ale
8.0 - 10.0 Saison
10.0 - 13.0 English bitter
13.0 - 17.0 Double IPA
17.0 - 20.0 Amber Ale
20.0 - 24.0 Brown Ale
24.0 - 29.0 Porter
29.0 -35.0 Stout
35.0 - 40.0 Foreign Stout
40.0+ Imperial Stout

Gravity, in the context of brewing alcohol, is the density of the wort or must compared to water. Original gravity refers to the gravity of the liquid before fermentation, and final gravity is its gravity after fermentation.

Data Importing

To access BreweryDB data on breweries and their beers, as well as their locations, I used the BreweryDB API. This API used an API key to access the information in the API, and I built 4 main data dictionaries about 66K+ beers and the breweries that make them.

In the BreweryDB framework, beers have associated brewery, style, and category data. The BreweryDB website is continually updated with a staff constantly checking authenticity of beers and their breweries. However, some breweries are more visible than others, especially those with websites versus those based in foreign countries, and accordingly there can be missing information about some of the beers and breweries, and that will be discussed in detail later on. The beer styles and categories based off of the Brewers Association Style Guidelines.

A beer can have several breweries that make it, and a brewery can be tied to several locations, such as a main brewery and its microbreweries (check if this is a thing). However, the relationship of location to physical coordinates is a one-to-one, with every unique location given a unique id, denoted locationId. The beer to breweries relationship is one-to-many and the breweries to locations relationship is a one-to-many relationship as well.

BreweryDB has over 66K+ unique beers, with other 1300+ pages of JSON data about beers and their breweries. On my laptop it took over an hour to pull all the data about the beers. I have written code that updates all the dictionaries every time that the script is run, rewriting the dictionaries kept on disk if need be, All four separate data dictionaries are stored locally, simply because the time required to build it from scratch is too much to wait for every time I want to access the data. The merged copies are built from scratch because it can be done nearly instantaneously. The beer data dictionary has two foreign keys, breweryId from the breweries data dictionary and styleId from the beer styles dictionary. The beer data is merged with both the beer styles dictionary and the breweries data dictionary.

Getting beer data out of the BreweryDB API was much more complicated than previously anticipated. I had planned to use the tidyjson package, but found that there was a bug that had arisen recently that no one had a quick fix for, specifically when attempting to access nested JSON lists and a strange issue with the dplyr package. The beer data straight out of the API is ordered in the following manner:

'{
  "status" : "success",
  "numberOfPages" : 225,
  "data" : [
    {
      "servingTemperatureDisplay" : "",
      "labels" : {
        "medium" : "http://s3.amazonaws.com/",
        "large" : "http://s3.amazonaws.com/",
        "icon" : "http://s3.amazonaws.com/"
      },
      "style" : {
        "id" : 15,
        "category" : {
          "updateDate" : "",
          "id" : 5,
          "description" : "",
          "createDate" : "2012-01-02 11:50:42",
          "name" : "Bock"
        },
        "description" : "",
        "ibuMax" : "27",
        "srmMin" : "14",
        "srmMax" : "22",
        "ibuMin" : "20",
        "ogMax" : "1.072",
        "fgMin" : "1.013",
        "fgMax" : "1.019",
        "createDate" : "2012-01-02 11:50:42",
        "updateDate" : "",
        "abvMax" : "7.2",
        "ogMin" : "1.064",
        "abvMin" : "6.3",
        "name" : "Traditional Bock",
        "categoryId" : 5
      },
      "status" : "verified",
      "srmId" : "",
      "beerVariationId" : "",
      "statusDisplay" : "Verified",
      "foodPairings" : "",
      "breweries":  [{
        "id" : "KlSsWY",
        "description" : "",
        "name" : "Hofbrouwerijke",
        "createDate" : "2012-01-02 11:50:52",
        "mailingListUrl" : "",
        "updateDate" : "",
        "images" : {
          "medium" : "",
          "large" : "",
          "icon" : ""
        },
        "established" : "",
        "isOrganic" : "N",
        "website" : "http://www.thofbrouwerijke.be/",
        "status" : "verified",
        "statusDisplay" : "Verified"
      }],
      "srm" : [],
      "updateDate" : "",
      "servingTemperature" : "",
      "availableId" : 1,
      "beerVariation" : [],
      "abv" : "6",
      "year" : "",
      "name" : "\"My\" Bock",
      "id" : "HXKxpc",
      "originalGravity" : "",
      "styleId" : 15,
      "ibu" : "",
      "glasswareId" : 5,
      "isOrganic" : "N",
      "createDate" : "2012-01-02 11:51:13",
      "available" : {
        "description" : "Available year round as a staple beer.",
        "name" : "Year Round"
      },
      "glass" : {
        "updateDate" : "",
        "id" : 5,
        "description" : "",
        "createDate" : "2012-01-02 11:50:42",
        "name" : "Pint"
      },
      "description" : "Amber, malty and not too heavy, all around favorite even for the drinkers of the yellow fizzy stuff"
    },
    ...
  ],
  "currentPage" : 1
}'

The data frame created directly from the JSON data has breweries defined as a list of lists, key-value pairs, encoded as a string, for each beer item. The key to making the data frame tidy was to extract information from the breweries and add it as proper columns/variables in the beers data drame, and removing extraneous information; as seen above, the breweryDB API returns a lot of data, a lot of which we aren’t interested in. While trying to do this, I quickly ran into issues stemming from the dplyr and the tidyjson packages documented here and here, receiving this error message:

library(tidyjson)
library(tidyverse)

beers %>% 
  gather_array %>% 
  spread_values(name = jstring("name"))
Error in eval(assertion, env) : 
  argument "json.column" is missing, with no default

Downgrading the dplyr package to version 0.5.0 and even downgrading the tidyjson package to version 0.2.1 did not resolve the issue, so I had to devise my own way of accessing the information and making the data frame tidy, using R’s apply functions, also known as group of mapping functions, explained beautifully in this Stack Overflow post. To extract any data located in a list in a column, I used the following code:

beers$breweryId <- lapply(beers$breweries, FUN = function(x) { paste(x$id, collapse = " ") })

turning a list of brewery ids located in the list of breweries into a string of brewery ids separated by a space, for easy separation of a beer id, 1 observation, into several observations of that beer into a beer and its breweries in the main data dictionary later.

The final beer data dictionary has the following variables:

Beers Data Dictionary
variables descriptions
beerId the id of the beer
beerName the name of the beer
beerDescription the official description of the beer
abv the alcohol by volume of the beer (expressed as a percentage
ibu the IBU (international bittering unit) value of the beer, a measure of how bitter a beer is
styleId the style id of the beer
categoryId the category id of the style id
breweryId the id of the brewery that makes the beer

with a beer id and a brewery id acting as primary keys of the beers data frame, meaning that the two together uniquely identify one observation in the data frame.

The brewery data dictionary was assembled in a similar manner to the beers data dictionary, with locations being the list nested in the list of data items in the JSON, and locationId being the list of ids associated with each brewery id. The final brewery data dictionary has the following variables, with a brewery id and a location id as primary keys of the data frame:

Breweries Data Dictionary
variables descriptions
breweryId the id of the brewery
breweryName the name of the brewery
breweryDescription the description of the brewery
locationId the location id associated with a brewery id (a brewery can have several locations

Locations are in a separate data dictionary of their own, partially because the BreweryDB API had the locations as their own dictionaries and because there’s so much information associated with a location id. The variables in the final locations data dictionary are as follows, with locationId being the primary key of the data frame:

Locations Data Dictionary
variables descriptions
locationId the id of a particular location (geophysical location)
locationName the name of a location, usually street name
streetAddress the address and number of a location
locality the city of the location
region the ztate of the region
postalCode the postal code of the location
latitude the latitude coordinates of the location
longitude the longitude coordinates of the location
locationTypeDisplay the kind of location it is: restuarant vs microbrewery for example
isPrimary whether that particular location is the primary location for a particular brewery
countryIsoCode the two character country code of a location
breweryId the brewery id of the brewery associated with this particular location

Finally, I created a styles to categories data dictionary of all the different styles and categories and their mappings, associating styles and style information like the range of alcohol per beer volume content for that particular style, with styleId being the primary key for the data frame. The variables in this dictionary are:

Beer Styles and Categories Data Dictionary
variables descriptions
styleId the style id
categoryId the id of the category that style belonged to
name the name of the style
shortName the name of the style, shortened
description the description of that style
ibuMin the minimum international bitterness value of the style
ibuMax the maximum international bitterness value of the style
abvMin the minimum alcohol per beer volume content of the style
abvMax the maximum alcohol per beer volume content of the style
srmMin the minimum in the typical SRM range for this style
srmMax the maximum in the typical SRM range for this style
ogMin the minimum in the typical original gravity range for this style
ogMax the maximum in the typical original gravity range for this style
fgMin the minimum in the typical final gravity range for this style
fgMax the maximum in the typical final gravity range for this style
categoryName the name of the category the style belongs to

The main foreign keys among the different dictionaries are locationId, breweryId, styleId, and on a lesser scale categoryId when making a data dictionary with both style and category information included.

Data Cleaning

Now that we have all of our data, we might want to take a look at the distribution of the most distinguishable beer characteristics, abv (alcohol per beer volume, expressed as a percentage out of 100) and ibu (international bitterness unit value, which is a measure of how bitter the beer is).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.010   5.100   6.000   6.545   7.500 308.000   11270

The summary statistics of the abv of all the beers reveal that the maximum abv is 308. Since a percent is out of 100, everything above 100 doesn’t make sense and we can remove all the beers whose abv is above 100 since the credibility of that beer is now questionable. Thankfully, there is only 1 beer whose abv is above 100, and we dispose of that observation.

Let’s take a look at the IBU distributions of all the beers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   22.00   34.00   40.88   55.00 1000.00   41878

As seen in this chart, ibu doesn’t normally go above 120, with the units being parts per million, and this many claiming that the human tongue can’t distinguish past 110 IBUs.

There are 153 beers above 120 IBUs, and googling of the first few beers reveals that these are authentic beers, so no observations will be removed for wrong IBU range, but these observations will be left out of exploratory data analysis visualizations to avoid skewing the scale of data.

The next variable we want to look at and see if cleaning is necessary is the SRM range of the beer. We know that anything significantly bigger than 40 or anything that is negative is a clear error, and we might want to toss that observation out.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    6.00   12.00   17.43   29.00   41.00   60138

Although there are a lot of NA’s, it seems that the ranges of SRM for the beers that have that information is valid, and I won’t mess around with the NA’s for the sake of time in this project.

Since there isn’t anything to clean in breweries; we can’t tell what is a good description or a bad description of a brewery, and when the data dictionaries are merged with each other on the foreign keys of the dictionaries, beers that don’t have a brewery are not included, breweries with no locations are not included, and locations without an associated brewery isn’t included, since when we merge, we are using an inner_join.

So we move on to locations. Since most of the variables in location are strings, we must check the address and latitudes and longitudes to make sure they make sense in the context of the observation. Physical street addresses and postal codes are difficult to verify without using more datasets, and errors in these fields will be obvious in geovisualizations and then hopefully we can single out the errors and fix them. However, we can check if state and country are encoded properly in the dataset.

Assuming that US state data is much more present than foreign countries, we focus on the US for the cleaning. In the data pulled from the BreweryDB API, there are 99 states in the United States according to the dataset, which is erroneous. We expect to see 52 states, accounting for the 50 states in the US, the District Capital, and NA’s. Further exploration reveals that there are 13 locations in the US without a region, which I changed by hand since changing 16 locations by hand is doable. Fixing all other states involved string splitting on the addresses and turning state abbreviations and postal codes to actual state names, so that we have 51 states, including the District capital abbreviated “D.C”. Now that our data has been gone through initial cleaning, we can begin to visualize the distribution of a few variables and produce a few tables.

Exploratory Data Analysis

Keeping the things we learned in cleaning, we will start visualizing a few variables. Let’s visualize the distribution of alcohol per beer volume to get a better idea of the characteristics of the abv.

It seems that the majority of beer abv is between 0 and 20, so let’s visualize observations within that range.

It definitely seems that most beers do not have an abv above 10%, which is makes sense. Now we take a look at the distribution of beer bitterness measured in IBUs (International Bitterness Units), focusing on beers with IBUs below 120, for reasons stated previously, mainly because most beers do not have IBUs greater than 120.

The distribution is definitely right skewed, with the majority of beers preferring a slight bitterness, but not pushing it to 100 or even 120. If we limited the distribution to only beers whose abv is at most 20, the histogram still looks about the same.

Now we want to get an idea of what color beers usually are, and we can view this in a histogram.

It seems that beers are either usually fairly light colored or very dark colored, so either some sort of lager/ale or a strong stout are popular among brewers.

We can add a few statistics about the states with the most breweries in the US, the top styles in the US, and the top 10 cities in the US with the most breweries.

Top 5 States with the Most Breweries
region frequency
California 11386
Michigan 5244
Colorado 5019
Oregon 4126
Pennsylvania 3800
Top 5 Beer Styles in the US
name frequency
American-Style India Pale Ale 8718
American-Style Pale Ale 4863
Imperial or Double India Pale Ale 3607
French & Belgian-Style Saison 2762
American-Style Amber/Red Ale 2629
Top 10 US Cities with the Most Breweries
locality frequency
Portland 1976
San Diego 1863
Denver 1386
Chicago 1252
Indianapolis 955
Seattle 941
Columbus 820
Cincinnati 803
Asheville 756
Tampa 652