These were the goals I originally hoped to deliver by the end of this project:
I found that the Python wrapper I had planned to use was rendered useless by RateBeer’s recent updates to its site. In the past few weeks, RateBeer has changed the infrastructure of its website and much of how the data on it is encoded. Since the wrapper has not been updated to account for these changes, I could not use it without taking a serious digression from this project to fix the Python script. So I have changed the goals of this project.
The focus of my project has shifted from using machine learning to gauge beer likability to exploring the trove of data provided by BreweryDB, with an emphasis on user interaction and data visualization. My vision is to take a user through the story told by the data: what are the top styles made in the US, what characteristics do those styles have, and where are they located throughout the country, a state, or a city? That is only one example of what my visualization can do. I hope to make exploring BreweryDB’s data much more accessible through interactive visualizations, something similar to a visual essay by Russell Goldenberg from The Pudding (maybe not as cool or as polished, but a publication like that is the dream standard for this project).
But first, I want to provide some background information on beer and the process of brewing it.
Beer is mainly made out of four ingredients: water, yeast, a grain, such as barley, and hops.
Most of us know what ABV and IBU are, since these statistics are often displayed on the beer itself. But what are SRM and original and final gravity, and what do they have to do with beer?
SRM is short for the Standard Reference Method, the color system used by breweries for finished beer and malts.
SRM range | example style |
---|---|
1.0 - 2.0 | Pale lager |
2.0 - 3.0 | Pilsener |
3.0 - 4.0 | Blonde Ale |
4.0 - 6.0 | Weissbier |
6.0 - 8.0 | India Pale Ale |
8.0 - 10.0 | Saison |
10.0 - 13.0 | English bitter |
13.0 - 17.0 | Double IPA |
17.0 - 20.0 | Amber Ale |
20.0 - 24.0 | Brown Ale |
24.0 - 29.0 | Porter |
29.0 - 35.0 | Stout |
35.0 - 40.0 | Foreign Stout |
40.0+ | Imperial Stout |
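The table above can be turned into a small lookup. This is a minimal sketch in base R, assuming the bands are treated as half-open intervals (lower bound included, upper bound excluded) so that a boundary value falls in exactly one band. The name `srm_to_example` is hypothetical and not part of the project's code.

```r
# Lower bounds of each SRM band from the table; the last band is open-ended.
srm_breaks <- c(1, 2, 3, 4, 6, 8, 10, 13, 17, 20, 24, 29, 35, 40, Inf)
srm_labels <- c("Pale lager", "Pilsener", "Blonde Ale", "Weissbier",
                "India Pale Ale", "Saison", "English bitter", "Double IPA",
                "Amber Ale", "Brown Ale", "Porter", "Stout",
                "Foreign Stout", "Imperial Stout")

# Hypothetical helper: map SRM values to the example style of their band.
# Assumption: bands are half-open, [lower, upper), so 2.0 counts as Pilsener.
# Values below 1 or missing values come back as NA.
srm_to_example <- function(srm) {
  as.character(cut(srm, breaks = srm_breaks, labels = srm_labels,
                   right = FALSE))
}

srm_to_example(c(1.5, 7, 33, 45, NA))
```

The example styles are only typical representatives of each color band, so the lookup gives a rough idea of color rather than a classification of the beer.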
Gravity, in the context of brewing alcohol, is the density of the wort or must compared to water. Original gravity refers to the gravity of the liquid before fermentation, and final gravity is its gravity after fermentation.
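One reason gravity matters is that the drop from original to final gravity reflects how much sugar was fermented, which gives a rough estimate of alcohol content. The sketch below uses a common homebrewing rule of thumb, ABV ≈ (OG - FG) × 131.25. That formula is an outside approximation and not something provided by BreweryDB, and the function name `estimate_abv` is hypothetical.

```r
# Hypothetical helper: rough ABV estimate (in percent) from gravity readings.
# Assumption: the common homebrewing rule of thumb (OG - FG) * 131.25,
# which is only an approximation and gets less accurate for very strong beers.
estimate_abv <- function(og, fg) {
  # Gravity should not rise during fermentation.
  stopifnot(all(og >= fg, na.rm = TRUE))
  (og - fg) * 131.25
}

# Example: a wort at 1.050 that ferments down to 1.010 gives about 5.25
estimate_abv(1.050, 1.010)
```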
To access BreweryDB data on breweries, their beers, and their locations, I used the BreweryDB API, which requires an API key to access its information. From it I built 4 main data dictionaries covering 66K+ beers and the breweries that make them.
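To give an idea of what a keyed, paged request looks like, here is a minimal sketch that only assembles the request URL. The base URL, the `beers` endpoint, and the `key`, `p`, and `withBreweries` parameters are assumptions about the BreweryDB v2 API and should be checked against the official documentation. `build_brewerydb_url` is a hypothetical helper and `"YOUR_API_KEY"` is a placeholder.

```r
# Hypothetical helper: assemble a request URL for one page of results.
# Assumptions (verify against the BreweryDB docs): the v2 base URL, the
# endpoint name, and the query parameters `key`, `p`, and `withBreweries`.
build_brewerydb_url <- function(endpoint, api_key, page = 1, extra = list()) {
  base_url <- "https://api.brewerydb.com/v2"
  params <- c(list(key = api_key, p = page), extra)
  # Percent-encode every value so special characters survive in the query string.
  values <- vapply(params,
                   function(v) utils::URLencode(as.character(v), reserved = TRUE),
                   character(1))
  query <- paste0(names(params), "=", values, collapse = "&")
  paste0(base_url, "/", endpoint, "?", query)
}

build_brewerydb_url("beers", "YOUR_API_KEY", page = 2,
                    extra = list(withBreweries = "Y"))
```

Looping `page` from 1 to the `numberOfPages` value returned by the API would then walk through the full data set, one page per request.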
In the BreweryDB framework, beers have associated brewery, style, and category data. The BreweryDB website is continually updated, with staff constantly checking the authenticity of beers and their breweries. However, some breweries are more visible than others, especially those with websites versus those based in foreign countries, so information about some beers and breweries is missing; this will be discussed in detail later on. The beer styles and categories are based on the Brewers Association Style Guidelines.
A beer can have several breweries that make it, and a brewery can be tied to several locations, such as a main brewery and its microbreweries. However, the relationship of location to physical coordinates is one-to-one, with every unique location given a unique id, denoted locationId. The beer to breweries relationship is one-to-many, and the breweries to locations relationship is one-to-many as well.
BreweryDB has 66K+ unique beers, with over 1300 pages of JSON data about beers and their breweries. On my laptop it took over an hour to pull all the data about the beers. I have written code that updates all the dictionaries every time the script is run, rewriting the dictionaries kept on disk if need be. All four separate data dictionaries are stored locally, simply because building them from scratch takes too long to wait for every time I want to access the data. The merged copies are built from scratch because that can be done nearly instantaneously. The beer data dictionary has two foreign keys, breweryId from the breweries data dictionary and styleId from the beer styles dictionary, and the beer data is merged with both of those dictionaries.
Getting beer data out of the BreweryDB API was much more complicated than I had anticipated. I had planned to use the tidyjson package, but a recently introduced bug that no one had a quick fix for got in the way, specifically when accessing nested JSON lists, along with a strange issue involving the dplyr package. The beer data straight out of the API is structured as follows:
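The pattern of keeping the slow-to-build dictionaries on disk and rebuilding only when needed might look like the following sketch. The helper name `load_or_build`, the `refresh` flag, and the file path are hypothetical, and the hour-long API pull is represented by a stand-in function.

```r
# Hypothetical helper: read a cached object from disk, or build and cache it.
# `build_fn` stands in for the slow step (here, pulling every page from the API).
load_or_build <- function(path, build_fn, refresh = FALSE) {
  if (!refresh && file.exists(path)) {
    return(readRDS(path))   # fast path: reuse the copy kept on disk
  }
  result <- build_fn()      # slow path: rebuild from scratch
  saveRDS(result, path)     # overwrite the cached copy
  result
}

# Usage with a stand-in builder and a temporary file
cache_file <- tempfile(fileext = ".rds")
beers_cached <- load_or_build(cache_file,
                              function() data.frame(beerId = c("a", "b")))
```

Calling `load_or_build()` again with the same path returns the stored copy without running the builder, while `refresh = TRUE` forces a rebuild.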
'{
"status" : "success",
"numberOfPages" : 225,
"data" : [
{
"servingTemperatureDisplay" : "",
"labels" : {
"medium" : "http://s3.amazonaws.com/",
"large" : "http://s3.amazonaws.com/",
"icon" : "http://s3.amazonaws.com/"
},
"style" : {
"id" : 15,
"category" : {
"updateDate" : "",
"id" : 5,
"description" : "",
"createDate" : "2012-01-02 11:50:42",
"name" : "Bock"
},
"description" : "",
"ibuMax" : "27",
"srmMin" : "14",
"srmMax" : "22",
"ibuMin" : "20",
"ogMax" : "1.072",
"fgMin" : "1.013",
"fgMax" : "1.019",
"createDate" : "2012-01-02 11:50:42",
"updateDate" : "",
"abvMax" : "7.2",
"ogMin" : "1.064",
"abvMin" : "6.3",
"name" : "Traditional Bock",
"categoryId" : 5
},
"status" : "verified",
"srmId" : "",
"beerVariationId" : "",
"statusDisplay" : "Verified",
"foodPairings" : "",
"breweries": [{
"id" : "KlSsWY",
"description" : "",
"name" : "Hofbrouwerijke",
"createDate" : "2012-01-02 11:50:52",
"mailingListUrl" : "",
"updateDate" : "",
"images" : {
"medium" : "",
"large" : "",
"icon" : ""
},
"established" : "",
"isOrganic" : "N",
"website" : "http://www.thofbrouwerijke.be/",
"status" : "verified",
"statusDisplay" : "Verified"
}],
"srm" : [],
"updateDate" : "",
"servingTemperature" : "",
"availableId" : 1,
"beerVariation" : [],
"abv" : "6",
"year" : "",
"name" : "\"My\" Bock",
"id" : "HXKxpc",
"originalGravity" : "",
"styleId" : 15,
"ibu" : "",
"glasswareId" : 5,
"isOrganic" : "N",
"createDate" : "2012-01-02 11:51:13",
"available" : {
"description" : "Available year round as a staple beer.",
"name" : "Year Round"
},
"glass" : {
"updateDate" : "",
"id" : 5,
"description" : "",
"createDate" : "2012-01-02 11:50:42",
"name" : "Pint"
},
"description" : "Amber, malty and not too heavy, all around favorite even for the drinkers of the yellow fizzy stuff"
},
...
],
"currentPage" : 1
}'
The data frame created directly from the JSON data has breweries defined as a list of lists of key-value pairs, encoded as a string, for each beer item. The key to making the data frame tidy was to extract information from the breweries, add it as proper columns/variables in the beers data frame, and remove extraneous information; as seen above, the BreweryDB API returns a lot of data, much of which we aren’t interested in. While trying to do this, I quickly ran into issues stemming from the dplyr and tidyjson packages, documented here and here, and received this error message:
library(tidyjson)
library(tidyverse)
beers %>%
gather_array %>%
spread_values(name = jstring("name"))
Error in eval(assertion, env) :
argument "json.column" is missing, with no default
Downgrading the dplyr package to version 0.5.0, and even downgrading the tidyjson package to version 0.2.1, did not resolve the issue. I had to devise my own way of accessing the information and making the data frame tidy, using R’s apply functions, also known as a group of mapping functions, explained beautifully in this Stack Overflow post. To extract any data located in a list in a column, I used the following code:
beers$breweryId <- lapply(beers$breweries, FUN = function(x) { paste(x$id, collapse = " ") })
This turns the list of brewery ids nested in each beer’s list of breweries into a single string of ids separated by spaces. That makes it easy, later on, to split one observation of a beer into several observations, one per brewery that makes it, in the main data dictionary.
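To make the collapse-then-separate idea concrete, here is a small base R sketch on toy data. The toy `beers` data frame and the ids in it are made up for illustration. `vapply` is used in place of `lapply` only so that the new column is a plain character vector rather than a list, and `strsplit` is one possible way to do the later separation, not necessarily the one used in the project.

```r
# Toy stand-in for the parsed API data: each beer carries a nested
# data frame of breweries (ids here are invented for illustration).
beers <- data.frame(beerId = c("b1", "b2"), stringsAsFactors = FALSE)
beers$breweries <- list(
  data.frame(id = "x1", stringsAsFactors = FALSE),
  data.frame(id = c("x2", "x3"), stringsAsFactors = FALSE)
)

# Step 1: collapse each beer's brewery ids into one space-separated string.
beers$breweryId <- vapply(beers$breweries,
                          function(x) paste(x$id, collapse = " "),
                          character(1))

# Step 2: split the strings again so each beer-brewery pair gets its own row.
ids <- strsplit(beers$breweryId, " ", fixed = TRUE)
beer_brewery <- data.frame(
  beerId    = rep(beers$beerId, lengths(ids)),
  breweryId = unlist(ids),
  stringsAsFactors = FALSE
)
beer_brewery
```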
The final beer data dictionary has the following variables:
variables | descriptions |
---|---|
beerId | the id of the beer |
beerName | the name of the beer |
beerDescription | the official description of the beer |
abv | the alcohol by volume of the beer (expressed as a percentage) |
ibu | the IBU (international bittering unit) value of the beer, a measure of how bitter a beer is |
styleId | the style id of the beer |
categoryId | the category id of the style id |
breweryId | the id of the brewery that makes the beer |
A beer id and a brewery id together act as the composite primary key of the beers data frame, meaning that the pair uniquely identifies one observation in the data frame.
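Whether a set of columns really identifies each row uniquely is easy to check. This is a minimal sketch on toy data; `is_primary_key` is a hypothetical helper and the ids are invented.

```r
# Hypothetical helper: TRUE when the given columns uniquely identify every row.
is_primary_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

# Toy rows: beer "b2" is made by two breweries, so beerId alone is not unique.
toy <- data.frame(beerId    = c("b1", "b2", "b2"),
                  breweryId = c("x1", "x2", "x3"),
                  stringsAsFactors = FALSE)

is_primary_key(toy, "beerId")                  # FALSE
is_primary_key(toy, c("beerId", "breweryId"))  # TRUE
```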
The brewery data dictionary was assembled in a similar manner to the beers data dictionary, with locations being the list nested in the list of data items in the JSON, and locationId being the list of ids associated with each brewery id. The final brewery data dictionary has the following variables, with a brewery id and a location id as primary keys of the data frame:
variables | descriptions |
---|---|
breweryId | the id of the brewery |
breweryName | the name of the brewery |
breweryDescription | the description of the brewery |
locationId | the location id associated with a brewery id (a brewery can have several locations) |
Locations are in a separate data dictionary of their own, partly because the BreweryDB API had the locations as their own dictionaries and partly because there’s so much information associated with a location id. The variables in the final locations data dictionary are as follows, with locationId being the primary key of the data frame:
variables | descriptions |
---|---|
locationId | the id of a particular location (geophysical location) |
locationName | the name of a location, usually street name |
streetAddress | the address and number of a location |
locality | the city of the location |
region | the state of the location |
postalCode | the postal code of the location |
latitude | the latitude coordinates of the location |
longitude | the longitude coordinates of the location |
locationTypeDisplay | the kind of location it is: restaurant vs microbrewery for example |
isPrimary | whether that particular location is the primary location for a particular brewery |
countryIsoCode | the two character country code of a location |
breweryId | the brewery id of the brewery associated with this particular location |
Finally, I created a styles-to-categories data dictionary of all the different styles and categories and their mappings, associating each style with style information such as the range of alcohol by volume for that particular style, with styleId being the primary key for the data frame. The variables in this dictionary are:
variables | descriptions |
---|---|
styleId | the style id |
categoryId | the id of the category that style belonged to |
name | the name of the style |
shortName | the name of the style, shortened |
description | the description of that style |
ibuMin | the minimum international bitterness value of the style |
ibuMax | the maximum international bitterness value of the style |
abvMin | the minimum alcohol by volume of the style |
abvMax | the maximum alcohol by volume of the style |
srmMin | the minimum in the typical SRM range for this style |
srmMax | the maximum in the typical SRM range for this style |
ogMin | the minimum in the typical original gravity range for this style |
ogMax | the maximum in the typical original gravity range for this style |
fgMin | the minimum in the typical final gravity range for this style |
fgMax | the maximum in the typical final gravity range for this style |
categoryName | the name of the category the style belongs to |
The main foreign keys among the different dictionaries are locationId, breweryId, styleId, and on a lesser scale categoryId when making a data dictionary with both style and category information included.
Now that we have all of our data, we might want to take a look at the distribution of the most distinguishable beer characteristics: abv (alcohol by volume, expressed as a percentage) and ibu (International Bitterness Units, a measure of how bitter the beer is).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.010 5.100 6.000 6.545 7.500 308.000 11270
The summary statistics for abv reveal a maximum of 308. Since abv is a percentage, any value above 100 doesn’t make sense, and the credibility of a beer with such a value is questionable, so we can remove it. Thankfully, there is only 1 beer whose abv is above 100, and we dispose of that observation.
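Dropping the impossible value while keeping the many beers with a missing abv takes a little care in R. This is a minimal sketch on toy values, assuming that NA rows should be kept, which is a choice made for this illustration.

```r
# Toy abv values (percent); NA marks beers with no abv on record.
beers_abv <- data.frame(beerId = c("b1", "b2", "b3", "b4"),
                        abv    = c(5.1, 308, NA, 7.5))

# Keep rows whose abv is missing or at most 100. A plain `abv <= 100`
# condition inside `[` would turn the NA row into an all-NA row instead.
keep <- is.na(beers_abv$abv) | beers_abv$abv <= 100
beers_clean <- beers_abv[keep, ]
summary(beers_clean$abv)
```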
Let’s take a look at the IBU distributions of all the beers.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 22.00 34.00 40.88 55.00 1000.00 41878
As seen in this chart, ibu doesn’t normally go above 120 (the units are parts per million), and many claim that the human tongue can’t distinguish bitterness past 110 IBUs.
There are 153 beers above 120 IBUs, and googling the first few reveals that they are authentic beers. No observations will be removed for being outside the usual IBU range, but these observations will be left out of the exploratory visualizations to avoid skewing the scale.
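Leaving the high-IBU beers in the data but out of the plots can be done by flagging rather than deleting. This is a short sketch on toy values; the column name `highIbu` is hypothetical.

```r
# Toy IBU values; NA marks beers with no IBU on record.
beers_ibu <- data.frame(beerId = c("b1", "b2", "b3", "b4", "b5"),
                        ibu    = c(22, 55, 150, 1000, NA))

# Flag beers above the cutoff instead of removing them from the data set.
beers_ibu$highIbu <- !is.na(beers_ibu$ibu) & beers_ibu$ibu > 120
sum(beers_ibu$highIbu)   # number of beers above 120 IBUs

# Subset used for plotting only: known IBUs that are not flagged.
ibu_for_plots <- beers_ibu$ibu[!is.na(beers_ibu$ibu) & !beers_ibu$highIbu]
```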
The next variable we want to look at and see if cleaning is necessary is the SRM range of the beer. We know that anything significantly bigger than 40 or anything that is negative is a clear error, and we might want to toss that observation out.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 6.00 12.00 17.43 29.00 41.00 60138
Although there are a lot of NA’s, the SRM values for the beers that have that information appear to be valid, and for the sake of time I won’t do anything about the NA’s in this project.
There isn’t anything to clean in breweries, since we can’t tell what is a good or a bad description of a brewery. Also, because the data dictionaries are merged on their foreign keys with an inner_join, beers that don’t have a brewery, breweries with no locations, and locations without an associated brewery are not included in the merged data.
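The effect of an inner join on unmatched rows can be seen on a toy example. Base R `merge()` is used here as a stand-in for dplyr’s `inner_join`, since with its default arguments it also keeps only keys present in both tables. The tables and ids are invented.

```r
# Toy tables; ids and names are invented for illustration.
beers_toy <- data.frame(beerId    = c("b1", "b2", "b3"),
                        breweryId = c("x1", "x2", NA),   # b3 has no brewery
                        stringsAsFactors = FALSE)
breweries_toy <- data.frame(breweryId   = c("x1", "x2", "x9"),  # x9 has no beers here
                            breweryName = c("Alpha", "Beta", "Gamma"),
                            stringsAsFactors = FALSE)

# Only breweryId values found in both tables survive the merge.
merged <- merge(beers_toy, breweries_toy, by = "breweryId")
merged
```

Both the beer without a brewery and the brewery without a beer disappear from `merged`, which is why no separate cleaning step is needed for them.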
So we move on to locations. Since most of the variables in location are strings, we must check the address and latitudes and longitudes to make sure they make sense in the context of the observation. Physical street addresses and postal codes are difficult to verify without using more datasets, and errors in these fields will be obvious in geovisualizations and then hopefully we can single out the errors and fix them. However, we can check if state and country are encoded properly in the dataset.
Assuming that US data is much more complete than data for foreign countries, we focus on the US for the cleaning. In the data pulled from the BreweryDB API, there are 99 distinct states in the United States, which is clearly erroneous. We expect to see 52 values: the 50 states in the US, the District of Columbia, and NA. Further exploration reveals that there are 13 locations in the US without a region, which I fixed by hand, since correcting that few locations by hand is doable. Fixing all other states involved string splitting on the addresses and turning state abbreviations and postal codes into actual state names, so that we have 51 states, including the District of Columbia, abbreviated “D.C.”. Now that our data has gone through initial cleaning, we can begin to visualize the distribution of a few variables and produce a few tables.
Keeping in mind what we learned during cleaning, we will start visualizing a few variables. Let’s visualize the distribution of alcohol by volume to get a better idea of its characteristics.
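Part of the state cleanup, turning abbreviations into full names, can be sketched with the `state.abb` and `state.name` vectors that ship with base R. This covers only the abbreviation step, not the address splitting or postal code lookup, and `normalize_state` is a hypothetical helper name.

```r
# Hypothetical helper: turn two-letter abbreviations into full state names,
# leaving values that are already full names (or unrecognized) untouched.
# `state.abb` and `state.name` cover the 50 states only, so D.C. is added by hand.
normalize_state <- function(region) {
  lookup <- c(setNames(state.name, state.abb), DC = "D.C.")
  key <- toupper(trimws(region))
  ifelse(key %in% names(lookup), unname(lookup[key]), trimws(region))
}

normalize_state(c("CA", "or", "Michigan", " DC ", NA))
```

After normalizing, `length(unique(...))` on the region column is a quick way to confirm that the count has dropped to the expected number of distinct values.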
It seems that the majority of beer abv is between 0 and 20, so let’s visualize observations within that range.
It definitely seems that most beers do not have an abv above 10%, which makes sense. Now we take a look at the distribution of beer bitterness measured in IBUs (International Bitterness Units), focusing on beers with IBUs at or below 120 for the reasons stated previously, mainly that most beers do not have IBUs greater than 120.
The distribution is definitely right skewed, with the majority of beers preferring a slight bitterness, but not pushing it to 100 or even 120. If we limited the distribution to only beers whose abv is at most 20, the histogram still looks about the same.
Now we want to get an idea of what color beers usually are, and we can view this in a histogram.
It seems that beers are usually either fairly light colored or very dark colored, so lighter lagers and ales on one end and strong stouts on the other appear to be popular among brewers.
We can add a few statistics about the states with the most breweries in the US, the top styles in the US, and the top 10 cities in the US with the most breweries.
region | frequency |
---|---|
California | 11386 |
Michigan | 5244 |
Colorado | 5019 |
Oregon | 4126 |
Pennsylvania | 3800 |
name | frequency |
---|---|
American-Style India Pale Ale | 8718 |
American-Style Pale Ale | 4863 |
Imperial or Double India Pale Ale | 3607 |
French & Belgian-Style Saison | 2762 |
American-Style Amber/Red Ale | 2629 |
locality | frequency |
---|---|
Portland | 1976 |
San Diego | 1863 |
Denver | 1386 |
Chicago | 1252 |
Indianapolis | 955 |
Seattle | 941 |
Columbus | 820 |
Cincinnati | 803 |
Asheville | 756 |
Tampa | 652 |
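Frequency tables like the ones above can be produced with `table()` and `sort()`. This is a minimal sketch on toy input; `top_n_values` is a hypothetical helper, and the toy vector stands in for a column such as region, name, or locality in the merged data.

```r
# Hypothetical helper: the n most frequent values of a vector as a data frame.
top_n_values <- function(x, n = 5) {
  counts <- sort(table(x), decreasing = TRUE)
  counts <- counts[seq_len(min(n, length(counts)))]
  data.frame(value = names(counts), frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input: one entry per row of the merged data (values invented).
regions <- c("California", "Oregon", "California", "Michigan",
             "California", "Oregon")
top_n_values(regions, n = 2)
```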