About This Project

Row

Who cares?

North Carolina is arguably the one of the states with the most breweries in the country. Beer is one of the three kinds of liquor you can buy (wine, liquor, beer) and the advertising for it is everywhere. Down a highway or on television, there is almost certainly some kind of advertising for some sort of beer, perhaps one you’ve never heard of. And if you’ve ever seen the SuperBowl, you’ve definitely seen the halftime commercials for beer. If that isn’t enough to convince anyone, Anheuser-Busch InBev SA/NV is the largest brewery in the world and is one of largest and most profitable fast-moving consumer goods companies in the world, with $45.5 billion in revenue last year and a projected revenue of $55 billion this year. Craft beer is growing in popularity, so it’s safe to say quite a few people care about beer and the money it brings in.

Packages Required for this Project

These are the required packages:

  • httr: for using GET requests to access data from the API
  • jsonlite: for getting data out of nest JSON arrays
  • tidyverse: for help cleaning and transforming data into the tidy data frames we can call data dictionaries
  • tidyr: for help with turning JSON strings into tidy data frames
  • flexdashboard: for creating the flexdashboard that this is contained in
  • plotly: for help creating interactive visualizations
  • outliers: for computing the outliers
  • stringr: for string wrapping on some of the visualizations
  • moments: for calculating the skewness and kurtosis
  • knitr: for creating R markdown files
  • DT: for creating data tables
  • highcharter: for help in making tree maps interactive
  • treemap: for making tree maps
  • viridisLite: for creating the color palette for the tree maps

Row

Project Summary and Deliverables

I found that the Python wrapper I wanted to use before was rendered useless from RateBeer’s recent updates to the site. In the past few weeks, RateBeer has changed the infrastructure of their website and changed much of how the data on their website is encoded. Since the wrapper has not been updated recently to take into account the changes of RateBeer’s website, I could not use the wrapper without taking a serious digression from this project and fix the Python script. So I have changed the goals of this project.

The focus of my project has shifted from using machine machine learning to gauge beer likability to exploring the trove of data provided by BreweryDB, emphasizing user interaction and data visualization. My vision of this project is to take a user through the story told by the data, what are the top styles made in the US, what characteristics do those styles have, where are those located throughout the country, a state, or a city. That is only an example of what my visualization can do, but I hoped to make exploration of the data provided by BreweryDB much more accessible via a user interface through interactive visualizations, something similar to a visual essay done by Russell Goldenberg from The Pudding (maybe as not as cool or as polished, but a publication like that is the dream standard for this project). In doing this final project, I believe I have a great starting point, considering the time spent on data wrangling, to do something like that. What I have done is opened up the datasets kept on BreweryDB to gain some initial insights into the craft beer industry by focusing on visualizing and exposing parts of the dataset. So this project was really about exploring my dataset through visualizations and seeing what we can find out about the domain through this dataset, leveraging outside information about US states (state abbreviations), US regions (South, Midwest, West, and Northeast), and country codes to make some pretty neat visualizations.

I quickly ran into issues trying to visualize specific states, finding that I either needed separate shape files for particular states or needed to further transform my data to match locations to states and cities. Since most of my data is crowd-sourced, a vast majority of this project was spent data wrangling and cleaning and manipulating my data into tidy data frames. Trying to visualize breweries by states and cities requires a lot more cleaning on the city side so I decided to focus on craft beer in the United States, looking at beer styles and categories in the United States, the number of breweries in states by years and looking at beer characteristics like alcohol per beer volume (ABV) and international bitterness units (IBU) (which measures bitterness).

Data Importing from BreweryDB

To access BreweryDB data on breweries and their beers, as well as their locations, I used the BreweryDB API. This API used an API key to access the information in the API, and I built 4 main data dictionaries about 66K+ beers and the breweries that make them.

In the BreweryDB framework, beers have associated brewery, style, and category data. The BreweryDB website is continually updated with a staff constantly checking authenticity of beers and their breweries. However, some breweries are more visible than others, especially those with websites versus those based in foreign countries, and accordingly there can be missing information about some of the beers and breweries, and that will be discussed in detail later on. The beer styles and categories based off of the Brewers Association Style Guidelines.

A beer can have several breweries that make it, and a brewery can be tied to several location. However, the relationship of location to physical coordinates is a one-to-one, with every unique location given a unique id, denoted locationId. The beer to breweries relationship is one-to-many and the breweries to locations relationship is a one-to-many relationship as well.

BreweryDB has over 66K+ unique beers, with other 1300+ pages of JSON data about beers and their breweries. On my laptop it took over an hour to pull all the data about the beers. I have written code that updates all the dictionaries every time that the script is run, rewriting the dictionaries kept on disk if need be, All four separate data dictionaries are stored locally, simply because the time required to build it from scratch is too much to wait for every time I want to access the data. The merged copies are built from scratch because it can be done nearly instantaneously. The beer data dictionary has two foreign keys, breweryId from the breweries data dictionary and styleId from the beer styles dictionary. The beer data is merged with both the beer styles dictionary and the breweries data dictionary.

About Beer

Column

How Beer is Made

Beer is mainly made out of four ingredients: water, yeast, a grain, such as barley, and hops.

  1. Malting: The grain is prepared for boiling by being steeped in water and allowed to partially germinate, softening the kernel. This isolates the natural enzymes that will later on break down the starch into sugar. This process is stopped by heating, drying out and cracking the barley, turning it into malted barley.
  2. Mashing: The malted barley is soaked in hot water, allowing the natural enzymes in the grain to turn starch into sugar for the yeast to consume later during fermentation.
  3. Sparging: After the starch has been turned to sugar, the liquid becomes full of sugar and is separated from the grains in a process called lautering and the liquid that is leftover is now called wort.
  4. Boiling and Cooling: The wort is placed in a boil kettle and is boiled to kill any micro-organisms left in the liquid and is also when hops and spices are added over the period of about an hour. Hops provide bitterness in beer, balancing out the sugar. After, the wort is then quickly cooled down, strained, and filtered so that yeast can be added to it without killing it from the heat of the liquid.
  5. Fermentation: The waiting period for the yeast to consume the sugar and turn it into alcohol, which is typically a few weeks.
  6. Carbonation or Aging: After the fermentation period, the beer is still uncarbonated. It can either be artificially carbonated or it can be “bottle conditioned” and allowed to age with the CO2 produced by the yeast.

Column

SRM & Original and Final Gravity

Most of us know what ABV and IBU is, since these statistics are often displayed on the beer itself. But what is SRM, and original and final gravity and what does that have to do with beer?

SRM is short for the Standard Reference Method, the color system used by breweries for finished beer and malts.

SRM IDs and Examples of Beer Styles
ranges examples
1.0 - 2.0 Pale lager
2.0 - 3.0 Pilsener
3.0 - 4.0 Blonde Ale
4.0 - 6.0 Weissbeer
6.0 - 8.0 India Pale Ale
8.0 - 10.0 Saison
10.0 - 13.0 English bitter
13.0 - 17.0 Double IPA
17.0 - 20.0 Amber Ale
20.0 - 24.0 Brown Ale
24.0 - 29.0 Porter
29.0 -35.0 Stout
35.0 - 40.0 Foreign Stout
40.0+ Imperial Stout

Gravity, in the context of brewing alcohol, is the density of the wort or must compared to water. Original gravity refers to the gravity of the liquid before fermentation, and final gravity is its gravity after fermentation.

Importing Project Data

Column

BreweryDB JSON Example

Getting beer data out of the BreweryDB API was much more complicated than previously anticipated. I had planned to use the tidyjson package, but found that there was a bug that had arisen recently that no one had a quick fix for, specifically when attempting to access nested JSON lists and a strange issue with the dplyr package. The beer data straight out of the API is ordered in the following manner:

'{
  "status" : "success",
  "numberOfPages" : 225,
  "data" : [
    {
      "servingTemperatureDisplay" : "",
      "labels" : {
        "medium" : "http://s3.amazonaws.com/",
        "large" : "http://s3.amazonaws.com/",
        "icon" : "http://s3.amazonaws.com/"
      },
      "style" : {
        "id" : 15,
        "category" : {
          "updateDate" : "",
          "id" : 5,
          "description" : "",
          "createDate" : "2012-01-02 11:50:42",
          "name" : "Bock"
        },
        "description" : "",
        "ibuMax" : "27",
        "srmMin" : "14",
        "srmMax" : "22",
        "ibuMin" : "20",
        "ogMax" : "1.072",
        "fgMin" : "1.013",
        "fgMax" : "1.019",
        "createDate" : "2012-01-02 11:50:42",
        "updateDate" : "",
        "abvMax" : "7.2",
        "ogMin" : "1.064",
        "abvMin" : "6.3",
        "name" : "Traditional Bock",
        "categoryId" : 5
      },
      "status" : "verified",
      "srmId" : "",
      "beerVariationId" : "",
      "statusDisplay" : "Verified",
      "foodPairings" : "",
      "breweries":  [{
        "id" : "KlSsWY",
        "description" : "",
        "name" : "Hofbrouwerijke",
        "createDate" : "2012-01-02 11:50:52",
        "mailingListUrl" : "",
        "updateDate" : "",
        "images" : {
          "medium" : "",
          "large" : "",
          "icon" : ""
        },
        "established" : "",
        "isOrganic" : "N",
        "website" : "http://www.thofbrouwerijke.be/",
        "status" : "verified",
        "statusDisplay" : "Verified"
      }],
      "srm" : [],
      "updateDate" : "",
      "servingTemperature" : "",
      "availableId" : 1,
      "beerVariation" : [],
      "abv" : "6",
      "year" : "",
      "name" : "\"My\" Bock",
      "id" : "HXKxpc",
      "originalGravity" : "",
      "styleId" : 15,
      "ibu" : "",
      "glasswareId" : 5,
      "isOrganic" : "N",
      "createDate" : "2012-01-02 11:51:13",
      "available" : {
        "description" : "Available year round as a staple beer.",
        "name" : "Year Round"
      },
      "glass" : {
        "updateDate" : "",
        "id" : 5,
        "description" : "",
        "createDate" : "2012-01-02 11:50:42",
        "name" : "Pint"
      },
      "description" : "Amber, malty and not too heavy, all around favorite even for the drinkers of the yellow fizzy stuff"
    },
    ...
  ],
  "currentPage" : 1
}'

Accessing data from the BreweryDB API

I used a general function that would access the BreweryDB API at whichever endpoint specified, provided that it is an endpoint that the API provides, allowing the user to specify the options to be fed to the endpoint in the form of a list. This function is called by the functions that clean the beer, brewery, location, and category and style JSONS and convert them to tidy data frames. This function is written in a file called BreweryDBRWrapper to distinguish it as the file that contains the function that actually makes calls to the API.

suppressPackageStartupMessages(library(httr))

# the base BreweryDB URL that queries will be appended to
breweryDBBaseURL <- "http://api.brewerydb.com/v2/"

# this function returns the beers endpoint of the API, using parameter
# it needs name, abv, ibu, glasswareId, srmId, availableId, styleId or ids
# without premium to get beers
BreweryDB_endpoint <- function(api_key, endpoint, options = NULL) {
  if (is.null(options)) {
    GET(paste(breweryDBBaseURL, endpoint, sep = ""), query = list(key = api_key))
  } else {
    paramsRequest <- c(options, key = api_key)
    GET(paste(breweryDBBaseURL, endpoint, sep = ""), query = paramsRequest)
  }
}

Wrestling with JSON data

The data frame created directly from the JSON data has breweries defined as a list of lists, key-value pairs, encoded as a string, for each beer item. The key to making the data frame tidy was to extract information from the breweries and add it as proper columns/variables in the beers data drame, and removing extraneous information; as seen above, the breweryDB API returns a lot of data, a lot of which we aren’t interested in. While trying to do this, I quickly ran into issues stemming from the dplyr and the tidyjson packages documented here and here, receiving this error message:

library(tidyjson)
library(tidyverse)

beers %>% 
  gather_array %>% 
  spread_values(name = jstring("name"))

Downgrading the dplyr package to version 0.5.0 and even downgrading the tidyjson package to version 0.2.1 did not resolve the issue, so I had to devise my own way of accessing the information and making the data frame tidy, using R’s apply functions, also known as group of mapping functions, explained beautifully in this Stack Overflow post. To extract any data located in a list in a column, I used the following code:

beers$breweryId <- lapply(beers$breweries, FUN = function(x) { paste(x$id, collapse = " ") })

turning a list of brewery ids located in the list of breweries into a string of brewery ids separated by a space, for easy separation of a beer id, 1 observation, into several observations of that beer into a beer and its breweries in the main data dictionary later

Column

Data Dictionaries

The final beer data dictionary has the following variables:

Beers Data Dictionary
variables descriptions
beerId the id of the beer
beerName the name of the beer
beerDescription the official description of the beer
abv the alcohol by volume of the beer (expressed as a percentage
ibu the IBU (international bittering unit) value of the beer, a measure of how bitter a beer is
styleId the style id of the beer
categoryId the category id of the style id
breweryId the id of the brewery that makes the beer

with a beer id and a brewery id acting as primary keys of the beers data frame, meaning that the two together uniquely identify one observation in the data frame.

The brewery data dictionary was assembled in a similar manner to the beers data dictionary, with locations being the list nested in the list of data items in the JSON, and locationId being the list of ids associated with each brewery id. The final brewery data dictionary has the following variables, with a brewery id and a location id as primary keys of the data frame:

Breweries Data Dictionary
variables descriptions
breweryId the id of the brewery
breweryName the name of the brewery
breweryDescription the description of the brewery
established the year the brewery was established
isOrganic whether or not the brewery is organic
locationId the location id associated with a brewery id (a brewery can have several locations

Locations are in a separate data dictionary of their own, partially because the BreweryDB API had the locations as their own dictionaries and because there’s so much information associated with a location id. The variables in the final locations data dictionary are as follows, with locationId being the primary key of the data frame:

Locations Data Dictionary
variables descriptions
locationId the id of a particular location (geophysical location)
locationName the name of a location, usually street name
streetAddress the address and number of a location
locality the city of the location
region the ztate of the region
postalCode the postal code of the location
latitude the latitude coordinates of the location
longitude the longitude coordinates of the location
locationTypeDisplay the kind of location it is: restuarant vs microbrewery for example
isPrimary whether that particular location is the primary location for a particular brewery
countryIsoCode the two character country code of a location
inPlanning whether the location is in planning or not
isClosed whether the location is closed or not
openToPublic whether the location is open to the public
yearOpened the year the location opened
breweryId the brewery id of the brewery associated with this particular location

Finally, I created a styles to categories data dictionary of all the different styles and categories and their mappings, associating styles and style information like the range of alcohol per beer volume content for that particular style, with styleId being the primary key for the data frame. The variables in this dictionary are:

Beer Styles and Categories Data Dictionary
variables descriptions
styleId the style id
categoryId the id of the category that style belonged to
name the name of the style
shortName the name of the style, shortened
description the description of that style
ibuMin the minimum international bitterness value of the style
ibuMax the maximum international bitterness value of the style
abvMin the minimum alcohol per beer volume content of the style
abvMax the maximum alcohol per beer volume content of the style
srmMin the minimum in the typical SRM range for this style
srmMax the maximum in the typical SRM range for this style
ogMin the minimum in the typical original gravity range for this style
ogMax the maximum in the typical original gravity range for this style
fgMin the minimum in the typical final gravity range for this style
fgMax the maximum in the typical final gravity range for this style
categoryName the name of the category the style belongs to

The main foreign keys among the different dictionaries are locationId, breweryId, styleId, and on a lesser scale categoryId when making a data dictionary with both style and category information included.

The code to produce these data dictionaries is shown in detail in the other pages on importing and cleaning the data from BreweryDB.

The Code Behind the Data Dictionaries

The functions contained in this code chunk is also kept in a separate file than the code done for exploratory data analysis. This is contained in a file named “BreweryDBDataRetriever”, a file that holds all the functions to save specific data frames to files, such as the beer data frame, or the brewery data frame. The functions below are then called before data cleaning is to be done, so that we can load the original data frames from whichever files they were saved to, clean the data frames and prepare them for analysis, and store these frames on disk once more as the clean versions that we then use for visualizations and analysis. I chose to store the data dictionaries, and then load them whenever I wanted to use them because pulling all of the data from the BreweryDB API takes over an hour to do, and waiting over an hour to load a web page is ridiculous.

suppressPackageStartupMessages(library(jsonlite)) # for working with JSON data
suppressPackageStartupMessages(library(tidyverse)) # to transform and clean data
suppressPackageStartupMessages(library(tidyr)) # for help with turning JSONs into tidy frames

source("BreweryDBRWrapper.R")

# in order to make more dynamic, could keep track of how many observations we've seen and 
# do the math to see if we should update (not a priority)

# this function retrieves all the beer data from the API and stores it in the
# path stored in the beersFile variable
retrieveBeerData <- function(APIKey, saveFile) {
  # takes about an hour to fill
  beers <- NULL
  
  beersRequestData <- BreweryDB_endpoint(APIKey, "beers", 
                                         options = list(p = as.character(1), 
                                                        withBreweries = "Y")) %>%
    content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
  beerNumPages <- beersRequestData$numberOfPages
  
  for (i in 1:beerNumPages) {
    print(i)
    beersRequestData <- BreweryDB_endpoint(APIKey, "beers", 
                                           options = list(p = as.character(i), 
                                                          withBreweries = "Y")) %>%
      content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
    unfilteredBeerData <- beersRequestData$data
    
    unfilteredBeerData$categoryId <- unfilteredBeerData$style$categoryId
    
    unfilteredBeerData$breweryId <- lapply(unfilteredBeerData$breweries, 
                                           FUN = function(x) { paste(x$id, 
                                                                     collapse = " ") })
    
    headerstoAdd <- setdiff(c("id", "name", "description", "abv", "ibu", 
                              "styleId", "srmId"), 
                            names(unfilteredBeerData))
    if (!is_empty(headerstoAdd)) {
      for (colName in headerstoAdd) {
        unfilteredBeerData[colName] <- NA
      }
    }
    if (is.null(beers)) {
      beers <- unfilteredBeerData %>%
        select(id, name, description, abv, ibu, styleId, categoryId, breweryId, 
               srmId) %>%
        as_tibble()
    } else {
      beers <- rbind(beers, unfilteredBeerData %>% 
                       select(id, name, description, abv, ibu, styleId, 
                              categoryId, breweryId, srmId) %>%
                       as_tibble())
    }
  }
  rm(beersRequestData, unfilteredBeerData, headerstoAdd)
  write_rds(beers, saveFile)
}

# this function retrives all of the brewery information and stores it in the path
# in the breweriesFile variable
retriveBreweryData <- function(APIKey, saveFile) {
  # takes a few minutes
  breweries <- NULL
  
  breweriesRequestData <- BreweryDB_endpoint(APIKey, "breweries", 
                                             options = list(p = as.character(1))) %>%
    content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
  breweryNumPages <- breweriesRequestData$numberOfPages
  
  for (i in 1:breweryNumPages) {
    breweriesRequestData <- BreweryDB_endpoint(APIKey, "breweries", 
                                               options = list(p = as.character(i), 
                                                              withLocations = "Y")) %>%
      content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
    unfilteredBreweriesData <- breweriesRequestData$data
    
    unfilteredBreweriesData$locationId <- lapply(unfilteredBreweriesData$locations, 
                                                 FUN = function(x) { paste(x$id, 
                                                                           collapse = " ") })
    
    # the headers left to add if there is a column that is missing before we add them
    headerstoAdd <- setdiff(c("id", "name", "description", "website", "established", 
                              "isOrganic"), 
                            names(unfilteredBreweriesData))
    
    if (!is_empty(headerstoAdd)) {
      for (colName in headerstoAdd) {
        # add missing column if it exists
        unfilteredBreweriesData[colName] <- NA
      }
    }
    if (is.null(breweries)) {
      breweries <- unfilteredBreweriesData %>%
        select(id, name, description, website, established, isOrganic, locationId) %>%
        as_tibble()
    } else {
      breweries <- rbind(breweries, unfilteredBreweriesData %>% 
                           select(id, name, description, website, established, 
                                  isOrganic, locationId) %>%
                           as_tibble())
    }
    
  }
  rm(breweriesRequestData, unfilteredBreweriesData, headerstoAdd)
  write_rds(breweries, saveFile) 
}

# this function retrives all of the information about all of the possible locations
# and stores it in the path stored in the locationsFile data
retrieveLocationsData <- function(APIKey, saveFile) {
  locations <- NULL
  
  locationsRequestData <- BreweryDB_endpoint(APIKey, "locations", 
                                             options = list(p = as.character(1))) %>%
    content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
  locationNumPages <- locationsRequestData$numberOfPages
  
  # takes about 4 minutes
  for (i in 1:locationNumPages) {
    locationsRequestData <- BreweryDB_endpoint(APIKey, "locations", 
                                               options = list(p = as.character(i))) %>%
      content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
    unfilteredLocationsData <- locationsRequestData$data
    
    # the headers left to add if there is a column that is missing before we add them
    headerstoAdd <- setdiff(c("id", "name", "streetAddress", "locality", "region", 
                              "postalCode", "latitude", "longitude", 
                              "locationTypeDisplay", "isPrimary", "countryIsoCode", 
                              "inPlanning", "isClosed", "openToPublic", "yearOpened", 
                              "website", "breweryId"), 
                            names(unfilteredLocationsData))
    
    if (!is_empty(headerstoAdd)) {
      for (colName in headerstoAdd) {
        # add missing column if it exists
        unfilteredLocationsData[colName] <- NA
      }
    }
    if (is.null(locations)) {
      locations <- unfilteredLocationsData %>%
        select(id, name, streetAddress, locality, region, postalCode, latitude, 
               longitude, locationTypeDisplay, isPrimary, countryIsoCode, inPlanning,
               isClosed, openToPublic, yearOpened, website, breweryId) %>% 
        as_tibble() 
    } else {
      locations <- rbind(locations, unfilteredLocationsData %>% 
                           select(id, name, streetAddress, locality, region, 
                                  postalCode, latitude, longitude, locationTypeDisplay, 
                                  isPrimary, countryIsoCode, inPlanning, isClosed,
                                  openToPublic, yearOpened, website, breweryId) %>%
                           as_tibble())
    }
  }
  rm(locationsRequestData, unfilteredLocationsData, headerstoAdd)
  write_rds(locations, saveFile) 
}

Cleaning up the Messy Data

Column

Code Behind Cleaning Beer Dictionary

Before we do any cleaning, we must load all of the necessary data dictionaries: beers, breweries, locations, and beer categories and styles. We need an API key to access the BreweryDB API and since pulling data on beers, breweries, and locations all together from the API takes over an hour, we have already stored this data in files. If we wanted to update these data frames, we can change the “updateData” flag to “TRUE” and access the API. The code for this is shown below:

source("BreweryDBDataRetriever.R")

# my BreweryDB API key
breweryDBKey <- "11b192faea1a549172fe2423db077bc5"

#the file in which the beers R data frame is stored
beersFile <- "data/beers.rds"

#the file in which the breweries R data frame is stored
breweriesFile <- "data/breweries.rds"

#the file in which the locations R data frame is stored
locationsFile <- "data/locations.rds"

updateData <- FALSE

# getting beer style data since it is instantaneous
beerStylesRequestData <- BreweryDB_endpoint(breweryDBKey, "styles") %>%
  content(as = "text", encoding = "UTF-8")
beerStyles <- fromJSON(beerStylesRequestData, simplifyDataFrame = TRUE)$data %>%
  select(id, categoryId, name, shortName, description, ibuMin, ibuMax, abvMin, 
         abvMax, srmMin, srmMax, ogMin, ogMax, fgMin, fgMax) %>%
  as_tibble()
rm(beerStylesRequestData)

# getting beer category data since it is instantaneous
beerCategoriesRequestData <- BreweryDB_endpoint(breweryDBKey, "categories") %>%
  content(as = "text", encoding = "UTF-8")
beerCategories <- fromJSON(beerCategoriesRequestData, simplifyDataFrame = TRUE)$data %>%
  select(id, name) %>%
  as_tibble()
beerCategories <- head(beerCategories, -1) # we have a null row at the end
rm(beerCategoriesRequestData)

if (updateData) {
  retrieveBeerData(breweryDBKey, beersFile)
  retriveBreweryData(breweryDBKey, breweriesFile)
  retrieveLocationsData(breweryDBKey, locationsFile)
}

beers <- read_rds(beersFile)
breweries <- read_rds(breweriesFile)
locations <- read_rds(locationsFile)

A quick glance at the data frames reveals columns that are encoded as characters when they should be numeric, such as established, which is the year the brewery was established and should be numeric, is actually a character. We also must change all the “NULL”’s in the dataset to actual NAs so that we can automatically omit them with the R functions at our disposal. This is all done below.

for (attr in names(beers)) {
  beers[ attr == "NULL" ] <- NA
}

for (attr in names(breweries)) {
  breweries[ attr == "NULL" ] <- NA
}

for (attr in names(locations)) {
  locations[ attr == "NULL"] <- NA
}

beers$abv <- as.numeric(beers$abv)
beers$ibu <- as.numeric(beers$ibu)

breweries$established <- as.numeric(breweries$established) # year is numeric
locations$yearOpened <- as.numeric(locations$yearOpened) # year is numeric

beerStyles$ibuMin <- as.numeric(beerStyles$ibuMin)
beerStyles$ibuMax <- as.numeric(beerStyles$ibuMax)
beerStyles$abvMin <- as.numeric(beerStyles$abvMin)
beerStyles$abvMax <- as.numeric(beerStyles$abvMax)
beerStyles$srmMin <- as.numeric(beerStyles$srmMin)
beerStyles$srmMax <- as.numeric(beerStyles$srmMax)
beerStyles$ogMin <- as.numeric(beerStyles$ogMin)
beerStyles$ogMax <- as.numeric(beerStyles$ogMax)
beerStyles$fgMin <- as.numeric(beerStyles$fgMin)
beerStyles$fgMax <- as.numeric(beerStyles$fgMax)

Cleaning up the Beer Table

Now that we have all of our data, we might want to take a look at the distribution of the most distinguishable beer characteristics, ABV (alcohol per beer volume, expressed as a percentage out of 100) and IBU (international bitterness unit value, which is a measure of how bitter the beer is).

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.01    5.10    6.00    6.54    7.50   81.00   11272 

The summary statistics of the ABV of all the beers reveal that the maximum ABV is 308. Since a percent is out of 100, everything above 100 doesn’t make sense and we can remove all the beers whose ABV is above 100 since the credibility of that beer is now questionable. Thankfully, there is only 1 beer whose ABV is above 100, and we dispose of that observation.

disposeOf <- beers %>% filter(abv > 100) %>% select(id)

# there is only 1 beer whose abv is above 100, and we dispose of that observation
beers <- beers %>% filter(id != "EHPIi4")
Let’s take a look at the IBU distributions of all the beers.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00   22.00   34.00   40.88   55.00 1000.00   41936 

As seen in this chart, IBU doesn’t normally go above 120, with the units being parts per million, and this many claiming that the human tongue can’t distinguish past 110 IBUs.

There are 153 beers above 120 IBUs, and googling of the first few beers reveals that these are authentic beers, so no observations will be removed for wrong IBU range, but these observations will be left out of exploratory data analysis visualizations to avoid skewing the scale of data.

The next variable we want to look at and see if cleaning is necessary is the SRM range of the beer. We know that anything significantly bigger than 40 or anything that is negative is a clear error, and we might want to toss that observation out.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00    6.00   12.00   17.43   29.00   41.00   60237 

Although there are a lot of NAs, it seems that the ranges of SRM for the beers that have that information is valid, and I won’t mess around with the NAs for the sake of time in this project.

Cleaning Brewery Data

There isn’t much to clean in breweries; we can’t tell what is a good description or a bad description of a brewery, and when the data dictionaries are merged with each other on the foreign keys of the dictionaries, beers that don’t have a brewery are not included, breweries with no locations are not included, and locations without an associated brewery isn’t included, since when we merge, we are using an inner_join. The only thing that we can clean is are the years associated with each brewery (established, which is the year the brewery was established), and check whether isOrganic only has “Y”’s and “N”’s.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1040    1996    2010    1987    2013    2017    4475 

There seem to be breweries that can be categorized as extremely old; the United States signed the Declaration of Independence in 1776, and there seems to be 143 breweries older than the United States. A closer examination of these breweries reveal that they are indeed this old, with all of them being European breweries.

We take a look at the isOrganic column in the breweries data frame and we can see that none of the breweries have that column value missing and the column can either be “Y” for “Yes” or “N” for “No”, and no cleaning is required here.
    N     Y 
10422    61 

Column

Cleaning Location Data

So we move on to locations. Since most of the variables in location are strings, we must check the address and latitudes and longitudes to make sure they make sense in the context of the observation. Physical street addresses and postal codes are difficult to verify without using more datasets, and errors in these fields will be obvious in geovisualizations and then hopefully we can single out the errors and fix them. However, we can check if state and country are encoded properly in the dataset.

Assuming that US state data is much more present than foreign countries, we focus on the US for the cleaning. In the data pulled from the BreweryDB API, there are 102 states in the United States according to the dataset, which is erroneous. We expect to see 52 states, accounting for the 50 states in the US, the District Capital, and NAs. Further exploration reveals that there are 12 locations in the US without a region, which I changed by hand since changing 16 locations by hand is doable. Fixing all other states involved looking more in depth into the addresses and turning state abbreviations and postal codes to actual state names, so that we have 54 states, including the District capital abbreviated “D.C”, NAs, and two U.S. territories, Puerto Rico and the Virgin Islands. Now that our data has been gone through initial cleaning, we can begin to visualize the distribution of a few variables and produce a few tables.

The Code behind Cleaning Locations

This is the code behind cleaning the locations data dictionary, complete with hard coded fixing of the miscoded states and the NA states, and the checking of latitude, longitude, and country codes, which are used later in visualizations.

numStates <- locations %>% filter(countryIsoCode == "US") %>% count(region) %>% nrow

# there are 103 states in the United States according to the dataset, which is 
# erroneous. We expect to see 52 "different" states, the 50 states, the District 
# Capital and then NA for observations without a state 

allstates <- locations %>% filter(countryIsoCode == "US") %>% count(region) %>% arrange(desc(n))

locations$region[ locations$region == "53217"] <- "Wisconsin"
locations$region[ locations$region == "Calfornia"] <- "California"
locations$region[ locations$region == "CALIFORNIA"] <- "California"
locations$region[ locations$region == "CO"] <- "Colorado"
locations$region[ locations$region == "CT"] <- "Connecticut"
locations$region[ locations$region == "Deleware"] <- "Delaware"
locations$region[ locations$region == "FL"] <- "Florida"
locations$region[ locations$region == "Flordia"] <- "Florida"
locations$region[ locations$region == "IL"] <- "Illinois"
locations$region[ locations$region == "Illinios"] <- "Illinois"
locations$region[ locations$region == "illinois"] <- "Illinois"
locations$region[ locations$region == "IN"] <- "Indiana"
locations$region[ locations$region == "MA"] <- "Massachusetts"
locations$region[ locations$region == "Massachusettes"] <- "Massachusetts"
locations$region[ locations$region == "MARYLAND"] <- "Maryland"
locations$region[ locations$region == "MI"] <- "Michigan"
locations$region[ locations$region == "MINNESOTA"] <- "Minnesota"
locations$region[ locations$region == "WV"] <- "West Virginia"
locations$region[ locations$region == "WASHINGTON"] <- "Washington"
locations$region[ locations$region == "WA"] <- "Washington"
locations$region[ locations$region == "WA - Washington"] <- "Washington"
locations$region[ locations$region == "TX"] <- "Texas"
locations$region[ locations$region == "TN"] <- "Tennessee"
locations$region[ locations$region == "Tennesssee"] <- "Tennessee"
locations$locality[ locations$region == "Tallahassee"] <- "Tallahassee"
locations$region[ locations$region == "Tallahassee"] <- "Florida"
locations$locality[ locations$region == "Los Angeles"] <- "Los Angeles"
locations$region[ locations$region == "Los Angeles"] <- "California"
locations$postalCode[ locations$region == "Florida 33316"] <- 33316
locations$region[ locations$region == "Florida 33316"] <- "Florida"

locations$postalCode[ locations$region == "N7 8XH"] <- "N7 8XH"
locations$countryIsoCode[ locations$region == "N7 8XH"] <- "GB"

locations$region[ locations$region == "PA"] <- "Pennsylvania"
locations$region[ locations$region == "OHio"] <- "Ohio"
locations$region[ locations$region == "OHIO"] <- "Ohio"
locations$region[ locations$region == "New Yotk"] <- "New York"
locations$region[ locations$region == "NewYork"] <- "New York"
locations$region[ locations$region == "SC"] <- "South Carolina"
locations$region[ locations$region == "SD"] <- "South Dakota"
locations$region[ locations$region == "NC"] <- "North Carolina"
locations$region[ locations$region == "MS."] <- "Mississippi"
locations$region[ locations$region == "MN"] <- "Minnesota"
locations$locality[ locations$region == "Marion"] <- "Marion"
locations$region[ locations$region == "Marion"] <- "Ohio"


locations$countryIsoCode[ locations$region == "British Columbia"] <- "CA"
locations$locality[ locations$region == "Portland" ] <- "Portland"
locations$region[ locations$region == "Portland"] <- "Oregon"

locations$locality[ locations$region == "London"] <- "London"
locations$countryIsoCode[ locations$region == "London"] <- "GB"
locations$countryIsoCode[ locations$region == "England"] <- "GB"
locations$locality[ locations$region == "Shetland Islands"] <- "Shetland Islands"
locations$countryIsoCode[ locations$region == "Shetland Islands"] <- "GB"

locations$locality[ locations$region == "West Sussex"] <- "West Sussex"
locations$countryIsoCode[ locations$region == "West Sussex"] <- "GB"

locations$locality[ locations$region == "Worcestershire"] <- "Worcestershire"
locations$countryIsoCode[ locations$region == "Worcestershire"] <- "GB"

locations$locality[ locations$region == "Nairn"] <- "Nairn"
locations$countryIsoCode[ locations$region == "Nairn"] <- "GB"

locations$locality[ locations$region == "Merseyside"] <- "Merseyside"
locations$countryIsoCode[ locations$region == "Merseyside"] <- "GB"

locations$locality[ locations$region == "Grand Bahama Island"] <- "Grand Bahama Island"
locations$countryIsoCode[ locations$region == "Grand Bahama Island"] <- "BS"

locations %>% filter(countryIsoCode == "US" & is.na(region)) %>% nrow

# there are 16 observations without a state

locations %>% filter(countryIsoCode == "US" & is.na(region))

changeLocationState <- function(locationdataFrame, locationId, regionValue) {
  print(locationdataFrame)
  for (i in 1:nrow(locationdataFrame[,1])) {
    if (locationdataFrame$id[i] == locationId) {
      locationdataFrame$region[i] <- regionValue
      print(nrow(locationdataFrame))
      return(locationdataFrame)
    }
  }
}


# using google to find the states since it's only 16

locations <- changeLocationState(locations, "2sfE3h", "Texas")
locations <- changeLocationState(locations, "6fAJtO", "California")
locations <- changeLocationState(locations, "0QiT0E", "California")
locations <- changeLocationState(locations, "GTwDJI", "California")
locations <- changeLocationState(locations, "5yDXi6", "California")
locations <- changeLocationState(locations, "4nSc9t", "New York")
locations <- changeLocationState(locations, "LMfC4N", "New York")
locations <- changeLocationState(locations, "hbF6tu", "New Jersey")
locations <- changeLocationState(locations, "GM3x67", "North Carolina")
locations <- changeLocationState(locations, "PHV3BL", "District of Columbia")
locations <- changeLocationState(locations, "YKKcj2", "District of Columbia")
locations <- changeLocationState(locations, "1KlIcf", "California")

# remember that washington, dc is not a state
# difficult to see whether addresses are valid or what postal codes are valid, 
# location data cleaning will be a lot more obvious once it is visualized

# have to turn state abbreviations into full state names

# the ranges of the latitude and longitude seem alright, import dates so we can look at things by year?

# we can verify country isocodes, locationTypeDisplay, and check the ranges of latitude and longitude

summary(locations$latitude)
summary(locations$longitude)


# all seem good, going to take a look at country isocodes

countryCodes <- locations %>% count(countryIsoCode) %>% arrange(desc(n))

The Birth of the Data Dictionaries

The code below produces the data dictionaries that I use to visualize things, including the beers, breweries, and location dictionaries and all of the different combinations of merging the different dictionaries.

########################### DATA DICTIONARY ###################################

beerStyles <- arrange(beerStyles, categoryId)

# adds category information to the diferent styles so we can search styles by category
beerCategoriesStyles <- beerStyles %>% inner_join(beerCategories %>% rename(categoryId = id, categoryName = name), 
                                                  by = "categoryId")
# splits observations into one breweryId and locationId per row instead of one id 
# per several location ids
breweries <- breweries %>% 
  mutate(locationId = strsplit(as.character(locationId), " ")) %>% 
  unnest(locationId)

# splits observations into one beer id and brewery id per row
beers <- beers %>% 
  mutate(breweryId = strsplit(as.character(breweryId), " ")) %>%
  unnest(breweryId)

# associates beers with their style information 
beersandStyles <- beers %>% inner_join(beerStyles %>% rename(styleId = id, 
                                                             styleName = name, 
                                                             styleshortName = shortName, 
                                                             stylesDescription = description), 
                                       by = c("styleId", "categoryId"))

# associates breweries with their location(s)
breweriesandLocations <- breweries %>% rename(breweryId = id, breweryName = name, 
                                              breweryDescription = description, 
                                              breweryWebsite = website) %>% 
  inner_join(locations %>% rename(locationId = id, locationName = name, 
                                  locationWebsite = website), 
             by = c("locationId", "breweryId"))

# associates beers with the breweries that brew them and the locations of those
# brewreries
beersBreweriesLocations <- beers %>% rename(beerId = id, beerName = name, 
                                            beerDescription = description) %>% 
  inner_join(breweriesandLocations, by = "breweryId")


write_rds(beers, "data/beersClean.rds")
write_rds(breweries, "data/breweriesClean.rds")
write_rds(locations, "data/locationsClean.rds")
write_rds(breweriesandLocations, "data/breweriesLocations.rds")
write_rds(beersBreweriesLocations, "data/maindictionary.rds")
write_rds(beerCategoriesStyles, "data/categoriesStyles.rds")
################################ END ###########################################

About Beer Categories and Styles

Column

Beer Category and Style Discussion

To the right are three different visualizations of the ranges of ABV (alcohol per beer volume), IBU (international bitterness units), and SRM (Standard Reference Method) IDs, which describe colors from light to dark, smallest to greatest. Some styles and categories allow for a lot of variance in these characteristics, such as the Hybrid/mixed Beer category, but this makes sense because this functions as an “other” category. Other styles, such as North American Lager, don’t allow for much variance in their ABV or IBU. IBU ranges seem to be larger than that of ABVs, and a lot of the styles have have the most variance in the color category than the ABV or IBU range, but it is interesting to visualize the color, ABV, and IBU of beer styles and see the differences among the different styles within categories and differences between the different style categories. Hover over the bars to see which style it is.

Column

ABV Ranges

IBU ranges

SRM Color Ranges

Breweries Across the World

Column

Breweries Across the World

What’s Brewing in the USA

Column

Breweries in the USA

Cities

Column

Beer Styles and Categories Across the US

Cities with the Most Breweries

North Carolina and California, Beertopia?

Column

Brewery Growth over the Years

Column

Top Cities in California

Top Cities in North Carolina

Diving into Breweries

Column

Location Types for Breweries

Number of Breweries Established Each Year

But What About the Beer?!

Column

SRM Ranges

ABV Ranges

IBU Ranges

---
title: "Craft Beer & the U.S.A."
output:
  flexdashboard::flex_dashboard:
    social: menu
    source_code: embed
    theme: journal
    logo: "http://www.brewerydb.com/img/logo_small.png"
    css: "2017_BAN6003_gomezsm_finalproject_style.css"
---

```{r setup, include=FALSE, warning=FALSE}
suppressPackageStartupMessages(library(flexdashboard))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(DT))
suppressPackageStartupMessages(library(outliers)) # for calculating outliers
suppressPackageStartupMessages(library(moments)) # for calculating skewness and kurtosis
suppressPackageStartupMessages(library(stringr)) # string manipulation str_wrap
suppressPackageStartupMessages(library(highcharter)) # for visualization of time series data
suppressPackageStartupMessages(library(treemap)) # for treemap visualization
suppressPackageStartupMessages(library(viridisLite)) # for visualization
suppressPackageStartupMessages(library(tidyverse)) 
beers <- read_rds("data/2017_BAN6003_gomezsm_finalproject_beersClean.rds")
breweries <- read_rds("data/2017_BAN6003_gomezsm_finalproject_breweriesClean.rds")
locations <- read_rds("data/2017_BAN6003_gomezsm_finalproject_locationsClean.rds")
breweriesandLocations <- read_rds("data/2017_BAN6003_gomezsm_finalproject_breweriesLocations.rds")
beersBreweriesLocations <- read_rds("data/2017_BAN6003_gomezsm_finalproject_maindictionary.rds")
beerCategoriesStyles <- read_rds("data/2017_BAN6003_gomezsm_finalproject_categoriesStyles.rds")
stateRegions <- list()
stateRegions[[  "Northeast"]] <- c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania")
stateRegions[[ "Midwest" ]] <- c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin", "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota","South Dakota")
stateRegions[[ "South" ]] <- c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "DC", "West Virginia", "Alabama", "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas")
stateRegions[[ "West" ]] <- c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming", "Alaska", "California", "Hawaii", "Oregon", "Washington")
state <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "DC", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming")
  
abbreviations <- c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")

stateAbbrv <- data.frame(state, abbreviations)
countryCodes <- read.csv("data/2017_BAN6003_gomezsm_finalproject_countrycodes.csv") %>% select(name, alpha.2, alpha.3)
```


About This Project {data-navmenu="About" data-orientation=rows}
===============================================================================

Row {data-height=250}
-------------------------------------------------------------------------------

### Who cares?

North Carolina is arguably the one of the states with the most breweries in the country. Beer is one of the three kinds of liquor you can buy (wine, liquor, beer) and the advertising for it is everywhere. Down a highway or on television, there is almost certainly some kind of advertising for some sort of beer, perhaps one you've never heard of. And if you've ever seen the SuperBowl, you've definitely seen the halftime commercials for beer. If that isn't enough to convince anyone, Anheuser-Busch InBev SA/NV is the largest brewery in the world and is one of largest and most profitable fast-moving consumer goods companies in the world, with \$45.5 billion in revenue last year and a projected revenue of $55 billion this year. Craft beer is growing in popularity, so it's safe to say quite a few people care about beer and the money it brings in. 

### Packages Required for this Project

These are the required packages:

  - **httr**: for using GET requests to access data from the API
  - **jsonlite**: for getting data out of nest JSON arrays
  - **tidyverse**: for help cleaning and transforming data into the tidy data frames we can call data dictionaries
  - **tidyr**: for help with turning JSON strings into tidy data frames
  - **flexdashboard**: for creating the flexdashboard that this is contained in
  - **plotly**: for help creating interactive visualizations
  - **outliers**: for computing the outliers
  - **stringr**: for string wrapping on some of the visualizations
  - **moments**: for calculating the skewness and kurtosis
  - **knitr**: for creating R markdown files
  - **DT**: for creating data tables
  - **highcharter**: for help in making tree maps interactive
  - **treemap**: for making tree maps
  - **viridisLite**: for creating the color palette for the tree maps

Row
--------------------------------------------------------------------------------

### Project Summary and Deliverables
I found that the [Python wrapper](https://github.com/alilja/ratebeer) I wanted to use before was rendered useless from [RateBeer](https://www.ratebeer.com/)'s recent updates to the site. In the past few weeks, RateBeer has changed the infrastructure of their website and changed much of how the data on their website is encoded. Since the wrapper has not been updated recently to take into account the changes of RateBeer's website, I could not use the wrapper without taking a serious digression from this project and fix the Python script. So I have changed the goals of this project.

The focus of my project has shifted from using machine machine learning to gauge beer likability to exploring the trove of data provided by [BreweryDB](http://www.brewerydb.com/), emphasizing user interaction and data visualization. My vision of this project is to take a user through the story told by the data, what are the top styles made in the US, what characteristics do those styles have, where are those located throughout the country, a state, or a city. That is only an example of what my visualization can do, but I hoped to make exploration of the data provided by BreweryDB much more accessible via a user interface through interactive visualizations, something similar to a [visual essay](https://pudding.cool/2017/04/beer/) done by Russell Goldenberg from [The Pudding](https://pudding.cool/) (maybe as not as cool or as polished, but a publication like that is the dream standard for this project). In doing this final project, I believe I have a great starting point, considering the time spent on data wrangling, to do something like that. What I have done is opened up the datasets kept on BreweryDB to gain some initial insights into the craft beer industry by focusing on visualizing and exposing parts of the dataset. So this project was really about exploring my dataset through visualizations and seeing what we can find out about the domain through this dataset, leveraging outside information about US states (state abbreviations), US regions (South, Midwest, West, and Northeast), and country codes to make some pretty neat visualizations.

I quickly ran into issues trying to visualize specific states, finding that I either needed separate shape files for particular states or needed to further transform my data to match locations to states and cities. Since most of my data is crowd-sourced, a vast majority of this project was spent data wrangling and cleaning and manipulating my data into tidy data frames. Trying to visualize breweries by states and cities requires a lot more cleaning on the city side so I decided to focus on craft beer in the United States, looking at beer styles and categories in the United States, the number of breweries in states by years and looking at beer characteristics like alcohol per beer volume (ABV) and international bitterness units (IBU) (which measures bitterness).


### Data Importing from BreweryDB
To access BreweryDB data on breweries and their beers, as well as their locations, I used the BreweryDB API. This API used an API key to access the information in the API, and I built 4 main data dictionaries about 66K+ beers and the breweries that make them. 

In the BreweryDB framework, beers have associated brewery, style, and category data. The BreweryDB website is continually updated with a staff constantly checking authenticity of beers and their breweries. However, some breweries are more visible than others, especially those with websites versus those based in foreign countries, and accordingly there can be missing information about some of the beers and breweries, and that will be discussed in detail later on. The beer styles and categories based off of the [Brewers Association Style Guidelines](https://www.brewersassociation.org/resources/brewers-association-beer-style-guidelines/).

A beer can have several breweries that make it, and a brewery can be tied to several location. However, the relationship of location to physical coordinates is a one-to-one, with every unique location given a unique id, denoted *locationId*. The beer to breweries relationship is one-to-many and the breweries to locations relationship is a one-to-many relationship as well.
  
BreweryDB has over 66K+ unique beers, with other 1300+ pages of JSON data about beers and their breweries. On my laptop it took over an hour to pull all the data about the beers. I have written code that updates all the dictionaries every time that the script is run, rewriting the dictionaries kept on disk if need be, All four separate data dictionaries are stored locally, simply because the time required to build it from scratch is too much to wait for every time I want to access the data. The merged copies are built from scratch because it can be done nearly instantaneously. The beer data dictionary has two __foreign__ keys, ***breweryId*** from the breweries data dictionary and ***styleId*** from the beer styles dictionary. The beer data is merged with both the beer styles dictionary and the breweries data dictionary.

About Beer {data-navmenu="About"}
===============================================================================

Column
--------------------------------------------------------------------------------

### How Beer is Made



Beer is mainly made out of four ingredients: water, yeast, a grain, such as barley, and hops. 

  1. **Malting**: The grain is prepared for boiling by being steeped in water and allowed to partially germinate, softening the kernel. This isolates the natural enzymes that will later on break down the starch into sugar. This process is stopped by heating, drying out and cracking the barley, turning it into malted barley.
  2. **Mashing**: The malted barley is soaked in hot water, allowing the natural enzymes in the grain to turn starch into sugar for the yeast to consume later during fermentation.
  3. **Sparging**: After the starch has been turned to sugar, the liquid becomes full of sugar and is separated from the grains in a process called lautering and the liquid that is leftover is now called *wort*.
  4. **Boiling and Cooling**: The wort is placed in a boil kettle and is boiled to kill any micro-organisms left in the liquid and is also when hops and spices are added over the period of about an hour. Hops provide bitterness in beer, balancing out the sugar. After, the wort is then quickly cooled down, strained, and filtered so that yeast can be added to it without killing it from the heat of the liquid. 
  5. **Fermentation**: The waiting period for the yeast to consume the sugar and turn it into alcohol, which is typically a few weeks. 
  6. **Carbonation or Aging**: After the fermentation period, the beer is still uncarbonated. It can either be artificially carbonated or it can be "bottle conditioned" and allowed to age with the CO2 produced by the yeast. 
  
Column 
--------------------------------------------------------------------------------

### SRM & Original and Final Gravity

Most of us know what ABV and IBU is, since these statistics are often displayed on the beer itself. But what is SRM, and original and final gravity and what does that have to do with beer?

SRM is short for the [Standard Reference Method](https://en.wikipedia.org/wiki/Standard_Reference_Method), the color system used by breweries for finished beer and malts. 

```{r kable, echo=FALSE, warning=FALSE}
ranges <- c("1.0 - 2.0", "2.0 - 3.0", "3.0 - 4.0", "4.0 - 6.0", "6.0 - 8.0", "8.0 - 10.0", "10.0 - 13.0", "13.0 - 17.0", "17.0 - 20.0", "20.0 - 24.0", "24.0 - 29.0", "29.0 -35.0", "35.0 - 40.0", "40.0+")
examples <- c("Pale lager", "Pilsener", "Blonde Ale", "Weissbeer", "India Pale Ale", "Saison", "English bitter", "Double IPA", "Amber Ale", "Brown Ale", "Porter", "Stout", "Foreign Stout", "Imperial Stout")
srmtable <- data.frame(ranges, examples)
kable(srmtable, digits=3, caption = "SRM IDs and Examples of Beer Styles")
```

Gravity, in the context of brewing alcohol, is the density of the [wort](https://en.wikipedia.org/wiki/Wort) or [must](https://en.wikipedia.org/wiki/Must) compared to water. Original gravity refers to the gravity of the liquid before fermentation, and final gravity is its gravity after fermentation. 

Importing Project Data {data-navmenu="About"}
================================================================================

Column {.tabset .tabset-fade}
--------------------------------------------------------------------------------

### BreweryDB JSON Example

Getting beer data out of the BreweryDB API was much more complicated than previously anticipated. I had planned to use the *tidyjson* package, but found that there was a bug that had arisen recently that no one had a quick fix for, specifically when attempting to access nested JSON lists and a strange issue with the *dplyr* package. The beer data straight out of the API is ordered in the following manner:

```{r, eval=FALSE, echo=TRUE}
'{
  "status" : "success",
  "numberOfPages" : 225,
  "data" : [
    {
      "servingTemperatureDisplay" : "",
      "labels" : {
        "medium" : "http://s3.amazonaws.com/",
        "large" : "http://s3.amazonaws.com/",
        "icon" : "http://s3.amazonaws.com/"
      },
      "style" : {
        "id" : 15,
        "category" : {
          "updateDate" : "",
          "id" : 5,
          "description" : "",
          "createDate" : "2012-01-02 11:50:42",
          "name" : "Bock"
        },
        "description" : "",
        "ibuMax" : "27",
        "srmMin" : "14",
        "srmMax" : "22",
        "ibuMin" : "20",
        "ogMax" : "1.072",
        "fgMin" : "1.013",
        "fgMax" : "1.019",
        "createDate" : "2012-01-02 11:50:42",
        "updateDate" : "",
        "abvMax" : "7.2",
        "ogMin" : "1.064",
        "abvMin" : "6.3",
        "name" : "Traditional Bock",
        "categoryId" : 5
      },
      "status" : "verified",
      "srmId" : "",
      "beerVariationId" : "",
      "statusDisplay" : "Verified",
      "foodPairings" : "",
      "breweries":  [{
        "id" : "KlSsWY",
        "description" : "",
        "name" : "Hofbrouwerijke",
        "createDate" : "2012-01-02 11:50:52",
        "mailingListUrl" : "",
        "updateDate" : "",
        "images" : {
          "medium" : "",
          "large" : "",
          "icon" : ""
        },
        "established" : "",
        "isOrganic" : "N",
        "website" : "http://www.thofbrouwerijke.be/",
        "status" : "verified",
        "statusDisplay" : "Verified"
      }],
      "srm" : [],
      "updateDate" : "",
      "servingTemperature" : "",
      "availableId" : 1,
      "beerVariation" : [],
      "abv" : "6",
      "year" : "",
      "name" : "\"My\" Bock",
      "id" : "HXKxpc",
      "originalGravity" : "",
      "styleId" : 15,
      "ibu" : "",
      "glasswareId" : 5,
      "isOrganic" : "N",
      "createDate" : "2012-01-02 11:51:13",
      "available" : {
        "description" : "Available year round as a staple beer.",
        "name" : "Year Round"
      },
      "glass" : {
        "updateDate" : "",
        "id" : 5,
        "description" : "",
        "createDate" : "2012-01-02 11:50:42",
        "name" : "Pint"
      },
      "description" : "Amber, malty and not too heavy, all around favorite even for the drinkers of the yellow fizzy stuff"
    },
    ...
  ],
  "currentPage" : 1
}'
```

### Accessing data from the BreweryDB API

I used a general function that would access the BreweryDB API at whichever endpoint specified, provided that it is an endpoint that the API provides, allowing the user to specify the options to be fed to the endpoint in the form of a list. This function is called by the functions that clean the beer, brewery, location, and category and style JSONS and convert them to tidy data frames. This function is written in a file called **BreweryDBRWrapper** to distinguish it as the file that contains the function that actually makes calls to the API.

```{r eval=FALSE, echo=TRUE}
suppressPackageStartupMessages(library(httr))

# the base BreweryDB URL that queries will be appended to
breweryDBBaseURL <- "http://api.brewerydb.com/v2/"

# this function returns the beers endpoint of the API, using parameter
# it needs name, abv, ibu, glasswareId, srmId, availableId, styleId or ids
# without premium to get beers
BreweryDB_endpoint <- function(api_key, endpoint, options = NULL) {
  if (is.null(options)) {
    GET(paste(breweryDBBaseURL, endpoint, sep = ""), query = list(key = api_key))
  } else {
    paramsRequest <- c(options, key = api_key)
    GET(paste(breweryDBBaseURL, endpoint, sep = ""), query = paramsRequest)
  }
}

```


### Wrestling with JSON data

The data frame created directly from the JSON data has ***breweries*** defined as a list of lists, key-value pairs, encoded as a string, for each beer item. The key to making the data frame tidy was to extract information from the breweries and add it as proper columns/variables in the beers data drame, and removing extraneous information; as seen above, the breweryDB API returns a lot of data, a lot of which we aren't interested in. While trying to do this, I quickly ran into issues stemming from the *dplyr* and the *tidyjson* packages documented [here](https://github.com/tidyverse/dplyr/blob/master/revdep/problems.md) and [here](https://github.com/MarkEdmondson1234/googleAnalyticsR/issues/88), receiving this error message:


```{r, echo=TRUE, eval=FALSE}
library(tidyjson)
library(tidyverse)

beers %>% 
  gather_array %>% 
  spread_values(name = jstring("name"))
```
```{r, eval=FALSE}
Error in eval(assertion, env) : 
  argument "json.column" is missing, with no default
```



Downgrading the *dplyr* package to version 0.5.0 and even downgrading the *tidyjson* package to version 0.2.1 did not resolve the issue, so I had to devise my own way of accessing the information and making the data frame tidy, using R's apply functions, also known as group of mapping functions, explained beautifully in this [Stack Overflow post](https://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega). To extract any data located in a list in a column, I used the following code:

```{r, eval=FALSE, echo=TRUE}
beers$breweryId <- lapply(beers$breweries, FUN = function(x) { paste(x$id, collapse = " ") })
```

turning a list of brewery ids located in the list of breweries into a string of brewery ids separated by a space, for easy separation of a beer id, 1 observation, into several observations of that beer into a beer and its breweries in the main data dictionary later


Column {.tabset .tabset-fade}
--------------------------------------------------------------------------------

### Data Dictionaries

The final beer data dictionary has the following variables:

```{r, echo=FALSE, warning=FALSE}
variables <- c("beerId", "beerName", "beerDescription", "abv", "ibu", "styleId", "categoryId", "breweryId")
descriptions <- c("the id of the beer", "the name of the beer", "the official description of the beer", "the alcohol by volume of the beer (expressed as a percentage", "the IBU (international bittering unit) value of the beer, a measure of how bitter a beer is", "the style id of the beer", "the category id of the style id", "the id of the brewery that makes the beer")
beerstable <- data.frame(variables, descriptions)
kable(beerstable, caption = "Beers Data Dictionary")
```

with a beer id and a brewery id acting as __primary__ keys of the beers data frame, meaning that the two together uniquely identify one observation in the data frame. 


The brewery data dictionary was assembled in a similar manner to the beers data dictionary, with *locations* being the list nested in the list of data items in the JSON, and *locationId* being the list of ids associated with each brewery id. The final brewery data dictionary has the following variables, with a brewery id and a location id as __primary__ keys of the data frame:

```{r, echo=FALSE, warning=FALSE}
variables <- c("breweryId", "breweryName", "breweryDescription", "established", "isOrganic", "locationId")
descriptions <- c("the id of the brewery", "the name of the brewery", "the description of the brewery", "the year the brewery was established", "whether or not the brewery is organic", "the location id associated with a brewery id (a brewery can have several locations")
breweriestable <- data.frame(variables, descriptions)
kable(breweriestable, caption = "Breweries Data Dictionary")
```
  
Locations are in a separate data dictionary of their own, partially because the BreweryDB API had the locations as their own dictionaries and because there's so much information associated with a location id. The variables in the final locations data dictionary are as follows, with *locationId* being the __primary__ key of the data frame:

```{r, echo=FALSE, warning=FALSE}
variables <- c("locationId", "locationName", "streetAddress", "locality", "region", "postalCode", "latitude", "longitude", "locationTypeDisplay", "isPrimary", "countryIsoCode", "inPlanning", "isClosed", "openToPublic", "yearOpened", "breweryId")
descriptions <- c("the id of a particular location (geophysical location)", "the name of a location, usually street name", "the address and number of a location", "the city of the location", "the ztate of the region", "the postal code of the location", "the latitude coordinates of the location", "the longitude coordinates of the location", "the kind of location it is: restuarant vs microbrewery for example", "whether that particular location is the primary location for a particular brewery", "the two character country code of a location", "whether the location is in planning or not", "whether the location is closed or not", "whether the location is open to the public", "the year the location opened", "the brewery id of the brewery associated with this particular location")
locationstable <- data.frame(variables, descriptions)
kable(locationstable, caption = "Locations Data Dictionary")
```
  
  Finally, I created a styles to categories data dictionary of all the different styles and categories and their mappings, associating styles and style information like the range of alcohol per beer volume content for that particular style, with *styleId* being the __primary__ key for the data frame. The variables in this dictionary are:

```{r, echo=FALSE, warning=FALSE}
variables <- c("styleId", "categoryId", "name", "shortName", "description", "ibuMin", "ibuMax", "abvMin", "abvMax", "srmMin", "srmMax", "ogMin", "ogMax", "fgMin", "fgMax", "categoryName")
descriptions <- c("the style id", "the id of the category that style belonged to", "the name of the style", "the name of the style, shortened", "the description of that style", "the minimum international bitterness value of the style", "the maximum international bitterness value of the style", "the minimum alcohol per beer volume content of the style", "the maximum alcohol per beer volume content of the style", "the minimum in the typical SRM range for this style", "the maximum in the typical SRM range for this style", "the minimum in the typical original gravity range for this style", "the maximum in the typical original gravity range for this style", "the minimum in the typical final gravity range for this style", "the maximum in the typical final gravity range for this style", "the name of the category the style belongs to")
stylestable <- data.frame(variables, descriptions)
kable(stylestable, caption = "Beer Styles and Categories Data Dictionary")
```

The main foreign keys among the different dictionaries are ***locationId***, ***breweryId***, ***styleId***, and on a lesser scale ***categoryId*** when making a data dictionary with both style and category information included. 

The code to produce these data dictionaries is shown in detail in the other pages on importing and cleaning the data from BreweryDB.

### The Code Behind the Data Dictionaries
The functions contained in this code chunk is also kept in a separate file than the code done for exploratory data analysis. This is contained in a file named "BreweryDBDataRetriever", a file that holds all the functions to save specific data frames to files, such as the beer data frame, or the brewery data frame. The functions below are then called before data cleaning is to be done, so that we can load the original data frames from whichever files they were saved to, clean the data frames and prepare them for analysis, and store these frames on disk once more as the clean versions that we then use for visualizations and analysis. I chose to store the data dictionaries, and then load them whenever I wanted to use them because pulling all of the data from the BreweryDB API takes over an hour to do, and waiting over an hour to load a web page is ridiculous.  

```{r echo=TRUE, eval=FALSE}
suppressPackageStartupMessages(library(jsonlite)) # for working with JSON data
suppressPackageStartupMessages(library(tidyverse)) # to transform and clean data
suppressPackageStartupMessages(library(tidyr)) # for help with turning JSONs into tidy frames

source("BreweryDBRWrapper.R")

# in order to make more dynamic, could keep track of how many observations we've seen and 
# do the math to see if we should update (not a priority)

# this function retrieves all the beer data from the API and stores it in the
# path stored in the beersFile variable
retrieveBeerData <- function(APIKey, saveFile) {
  # takes about an hour to fill
  beers <- NULL
  
  beersRequestData <- BreweryDB_endpoint(APIKey, "beers", 
                                         options = list(p = as.character(1), 
                                                        withBreweries = "Y")) %>%
    content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
  beerNumPages <- beersRequestData$numberOfPages
  
  for (i in 1:beerNumPages) {
    print(i)
    beersRequestData <- BreweryDB_endpoint(APIKey, "beers", 
                                           options = list(p = as.character(i), 
                                                          withBreweries = "Y")) %>%
      content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
    unfilteredBeerData <- beersRequestData$data
    
    unfilteredBeerData$categoryId <- unfilteredBeerData$style$categoryId
    
    unfilteredBeerData$breweryId <- lapply(unfilteredBeerData$breweries, 
                                           FUN = function(x) { paste(x$id, 
                                                                     collapse = " ") })
    
    headerstoAdd <- setdiff(c("id", "name", "description", "abv", "ibu", 
                              "styleId", "srmId"), 
                            names(unfilteredBeerData))
    if (!is_empty(headerstoAdd)) {
      for (colName in headerstoAdd) {
        unfilteredBeerData[colName] <- NA
      }
    }
    if (is.null(beers)) {
      beers <- unfilteredBeerData %>%
        select(id, name, description, abv, ibu, styleId, categoryId, breweryId, 
               srmId) %>%
        as_tibble()
    } else {
      beers <- rbind(beers, unfilteredBeerData %>% 
                       select(id, name, description, abv, ibu, styleId, 
                              categoryId, breweryId, srmId) %>%
                       as_tibble())
    }
  }
  rm(beersRequestData, unfilteredBeerData, headerstoAdd)
  write_rds(beers, saveFile)
}

# this function retrives all of the brewery information and stores it in the path
# in the breweriesFile variable
retriveBreweryData <- function(APIKey, saveFile) {
  # takes a few minutes
  breweries <- NULL
  
  breweriesRequestData <- BreweryDB_endpoint(APIKey, "breweries", 
                                             options = list(p = as.character(1))) %>%
    content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
  breweryNumPages <- breweriesRequestData$numberOfPages
  
  for (i in 1:breweryNumPages) {
    breweriesRequestData <- BreweryDB_endpoint(APIKey, "breweries", 
                                               options = list(p = as.character(i), 
                                                              withLocations = "Y")) %>%
      content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
    unfilteredBreweriesData <- breweriesRequestData$data
    
    unfilteredBreweriesData$locationId <- lapply(unfilteredBreweriesData$locations, 
                                                 FUN = function(x) { paste(x$id, 
                                                                           collapse = " ") })
    
    # the headers left to add if there is a column that is missing before we add them
    headerstoAdd <- setdiff(c("id", "name", "description", "website", "established", 
                              "isOrganic"), 
                            names(unfilteredBreweriesData))
    
    if (!is_empty(headerstoAdd)) {
      for (colName in headerstoAdd) {
        # add missing column if it exists
        unfilteredBreweriesData[colName] <- NA
      }
    }
    if (is.null(breweries)) {
      breweries <- unfilteredBreweriesData %>%
        select(id, name, description, website, established, isOrganic, locationId) %>%
        as_tibble()
    } else {
      breweries <- rbind(breweries, unfilteredBreweriesData %>% 
                           select(id, name, description, website, established, 
                                  isOrganic, locationId) %>%
                           as_tibble())
    }
    
  }
  rm(breweriesRequestData, unfilteredBreweriesData, headerstoAdd)
  write_rds(breweries, saveFile) 
}

# this function retrives all of the information about all of the possible locations
# and stores it in the path stored in the locationsFile data
retrieveLocationsData <- function(APIKey, saveFile) {
  locations <- NULL
  
  locationsRequestData <- BreweryDB_endpoint(APIKey, "locations", 
                                             options = list(p = as.character(1))) %>%
    content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
  locationNumPages <- locationsRequestData$numberOfPages
  
  # takes about 4 minutes
  for (i in 1:locationNumPages) {
    locationsRequestData <- BreweryDB_endpoint(APIKey, "locations", 
                                               options = list(p = as.character(i))) %>%
      content(as = "text", encoding = "UTF-8") %>% fromJSON(simplifyDataFrame = TRUE)
    unfilteredLocationsData <- locationsRequestData$data
    
    # the headers left to add if there is a column that is missing before we add them
    headerstoAdd <- setdiff(c("id", "name", "streetAddress", "locality", "region", 
                              "postalCode", "latitude", "longitude", 
                              "locationTypeDisplay", "isPrimary", "countryIsoCode", 
                              "inPlanning", "isClosed", "openToPublic", "yearOpened", 
                              "website", "breweryId"), 
                            names(unfilteredLocationsData))
    
    if (!is_empty(headerstoAdd)) {
      for (colName in headerstoAdd) {
        # add missing column if it exists
        unfilteredLocationsData[colName] <- NA
      }
    }
    if (is.null(locations)) {
      locations <- unfilteredLocationsData %>%
        select(id, name, streetAddress, locality, region, postalCode, latitude, 
               longitude, locationTypeDisplay, isPrimary, countryIsoCode, inPlanning,
               isClosed, openToPublic, yearOpened, website, breweryId) %>% 
        as_tibble() 
    } else {
      locations <- rbind(locations, unfilteredLocationsData %>% 
                           select(id, name, streetAddress, locality, region, 
                                  postalCode, latitude, longitude, locationTypeDisplay, 
                                  isPrimary, countryIsoCode, inPlanning, isClosed,
                                  openToPublic, yearOpened, website, breweryId) %>%
                           as_tibble())
    }
  }
  rm(locationsRequestData, unfilteredLocationsData, headerstoAdd)
  write_rds(locations, saveFile) 
}

```


Cleaning up the Messy Data {data-navmenu="About"}
==============================================================================

Column {.tabset .tabset-fade}
------------------------------------------------------------------------------

### Code Behind Cleaning Beer Dictionary
Before we do any cleaning, we must load all of the necessary data dictionaries: beers, breweries, locations, and beer categories and styles. We need an API key to access the BreweryDB API and since pulling data on beers, breweries, and locations all together from the API takes over an hour, we have already stored this data in files. If we wanted to update these data frames, we can change the "updateData" flag to "TRUE" and access the API. The code for this is shown below:

```{r eval=FALSE, echo=TRUE}
source("BreweryDBDataRetriever.R")

# my BreweryDB API key
breweryDBKey <- "11b192faea1a549172fe2423db077bc5"

#the file in which the beers R data frame is stored
beersFile <- "data/beers.rds"

#the file in which the breweries R data frame is stored
breweriesFile <- "data/breweries.rds"

#the file in which the locations R data frame is stored
locationsFile <- "data/locations.rds"

updateData <- FALSE

# getting beer style data since it is instantaneous
beerStylesRequestData <- BreweryDB_endpoint(breweryDBKey, "styles") %>%
  content(as = "text", encoding = "UTF-8")
beerStyles <- fromJSON(beerStylesRequestData, simplifyDataFrame = TRUE)$data %>%
  select(id, categoryId, name, shortName, description, ibuMin, ibuMax, abvMin, 
         abvMax, srmMin, srmMax, ogMin, ogMax, fgMin, fgMax) %>%
  as_tibble()
rm(beerStylesRequestData)

# getting beer category data since it is instantaneous
beerCategoriesRequestData <- BreweryDB_endpoint(breweryDBKey, "categories") %>%
  content(as = "text", encoding = "UTF-8")
beerCategories <- fromJSON(beerCategoriesRequestData, simplifyDataFrame = TRUE)$data %>%
  select(id, name) %>%
  as_tibble()
beerCategories <- head(beerCategories, -1) # we have a null row at the end
rm(beerCategoriesRequestData)

if (updateData) {
  retrieveBeerData(breweryDBKey, beersFile)
  retriveBreweryData(breweryDBKey, breweriesFile)
  retrieveLocationsData(breweryDBKey, locationsFile)
}

beers <- read_rds(beersFile)
breweries <- read_rds(breweriesFile)
locations <- read_rds(locationsFile)
```

A quick glance at the data frames reveals columns that are encoded as characters when they should be numeric, such as *established*, which is the year the brewery was established and should be numeric, is actually a character. We also must change all the "NULL"'s in the dataset to actual NAs so that we can automatically omit them with the R functions at our disposal. This is all done below.

```{r eval=FALSE, echo=TRUE}
for (attr in names(beers)) {
  beers[ attr == "NULL" ] <- NA
}

for (attr in names(breweries)) {
  breweries[ attr == "NULL" ] <- NA
}

for (attr in names(locations)) {
  locations[ attr == "NULL"] <- NA
}

beers$abv <- as.numeric(beers$abv)
beers$ibu <- as.numeric(beers$ibu)

breweries$established <- as.numeric(breweries$established) # year is numeric
locations$yearOpened <- as.numeric(locations$yearOpened) # year is numeric

beerStyles$ibuMin <- as.numeric(beerStyles$ibuMin)
beerStyles$ibuMax <- as.numeric(beerStyles$ibuMax)
beerStyles$abvMin <- as.numeric(beerStyles$abvMin)
beerStyles$abvMax <- as.numeric(beerStyles$abvMax)
beerStyles$srmMin <- as.numeric(beerStyles$srmMin)
beerStyles$srmMax <- as.numeric(beerStyles$srmMax)
beerStyles$ogMin <- as.numeric(beerStyles$ogMin)
beerStyles$ogMax <- as.numeric(beerStyles$ogMax)
beerStyles$fgMin <- as.numeric(beerStyles$fgMin)
beerStyles$fgMax <- as.numeric(beerStyles$fgMax)
```


### Cleaning up the Beer Table

Now that we have all of our data, we might want to take a look at the distribution of the most distinguishable beer characteristics, ABV (alcohol per beer volume, expressed as a percentage out of 100) and IBU (international bitterness unit value, which is a measure of how bitter the beer is).

```{r, echo=FALSE, warning=FALSE}
summary(beers$abv)
```

The summary statistics of the ABV of all the beers reveal that the maximum ABV is 308. Since a percent
is out of 100, everything above 100 doesn't make sense and we can remove all the beers whose ABV is above 100 since the credibility of that beer is now questionable. Thankfully, there is only 1 beer whose ABV is above 100, and we dispose of that observation.

```{r echo=TRUE, eval=FALSE}
disposeOf <- beers %>% filter(abv > 100) %>% select(id)

# there is only 1 beer whose abv is above 100, and we dispose of that observation
beers <- beers %>% filter(id != "EHPIi4")
```



Let's take a look at the IBU distributions of all the beers.
```{r, echo=FALSE, warning=FALSE}
summary(beers$ibu)
```
As seen in this [chart](https://www.brewersfriend.com/2009/01/24/beer-styles-ibu-chart-graph-bitterness-range/), IBU doesn't normally go above 120, with the units being parts per million, and this [many](https://beerconnoisseur.com/articles/whats-meaning-ibu) claiming that the human tongue can't distinguish past 110 IBUs. 

```{r, echo=FALSE, warning=FALSE}
beersHighIBUs <- beers %>% filter(ibu > 120) %>% nrow
```

There are `r beersHighIBUs` beers above 120 IBUs, and googling of the first few beers reveals that these are authentic beers, so no observations will be removed for wrong IBU range, but these observations will be left out of exploratory data analysis visualizations to avoid skewing the scale of data.

The next variable we want to look at and see if cleaning is necessary is the SRM range of the beer. We know that anything significantly bigger than 40 or anything that is negative is a clear error, and we might want to toss that observation out.

```{r, echo=FALSE, warning=FALSE}
summary(beers$srmId)
```

Although there are a lot of NAs, it seems that the ranges of SRM for the beers that have that information is valid, and I won't mess around with the NAs for the sake of time in this project.


### Cleaning Brewery Data

There isn't much to clean in breweries; we can't tell what is a good description or a bad description of a brewery, and when the data dictionaries are merged with each other on the foreign keys of the dictionaries, beers that don't have a brewery are not included, breweries with no locations are not included, and locations without an associated brewery isn't included, since when we merge, we are using an *inner_join*. The only thing that we can clean is are the years associated with each brewery (*established*, which is the year the brewery was established), and check whether *isOrganic* only has "Y"'s and "N"'s. 

```{r echo=FALSE, warning=FALSE}
summary(breweries$established)
numbeforeUS <- nrow(breweries %>% filter(established < 1776))
```

There seem to be breweries that can be categorized as extremely old; the United States signed the Declaration of Independence in 1776, and there seems to be `r numbeforeUS` breweries older than the United States. A closer examination of these breweries reveal that they are indeed this old, with all of them being European breweries. 

We take a look at the *isOrganic* column in the breweries data frame and we can see that none of the breweries have that column value missing and the column can either be "Y" for "Yes" or "N" for "No", and no cleaning is required here.
```{r, warning=FALSE}
summary(as.factor(breweries$isOrganic))
```



```{r, warning=FALSE}
breweries %>% filter(established < 1776) %>% 
  select(name, description, website, established, isOrganic)  %>% 
  datatable(options = list(pagelength = 5))
```

Column {.tabset .tabset-fade}
----------------------------------------------------------------------------------------
### Cleaning Location Data
So we move on to locations. Since most of the variables in location are strings, we must check the address and latitudes and longitudes to make sure they make sense in the context of the observation. Physical street addresses and postal codes are difficult to verify without using more datasets, and errors in these fields will be obvious in geovisualizations and then hopefully we can single out the errors and fix them. However, we can check if state and country are encoded properly in the dataset. 

```{r echo=FALSE, warning=FALSE}
dirtyLocations <- read_rds("data/2017_BAN6003_gomezsm_finalproject_locations.rds")
numStates <- dirtyLocations %>% filter(countryIsoCode == "US") %>% count(region) %>% nrow

# there are 99 states in the United States according to the dataset, which is erroneous. We expect to see 51 
# "different" states, the 50 states and then NA for observations without a state 

numStatesNA <- dirtyLocations %>% filter(countryIsoCode == "US" & is.na(region)) %>% nrow
```
Assuming that US state data is much more present than foreign countries, we focus on the US for the cleaning. In the data pulled from the BreweryDB API, there are `r numStates` states in the United States according to the dataset, which is erroneous. We expect to see 52 states, accounting for the 50 states in the US, the District Capital, and NAs. Further exploration reveals that there are `r numStatesNA` locations in the US without a region, which I changed by hand since changing 16 locations by hand is doable. Fixing all other states involved looking more in depth into the addresses and turning state abbreviations and postal codes to actual state names, so that we have 54 states, including the District capital abbreviated "D.C", NAs, and two U.S. territories, Puerto Rico and the Virgin Islands. Now that our data has been gone through initial cleaning, we can begin to visualize the distribution of a few variables and produce a few tables.

### The Code behind Cleaning Locations
This is the code behind cleaning the locations data dictionary, complete with hard coded fixing of the miscoded states and the NA states, and the checking of latitude, longitude, and country codes, which are used later in visualizations. 

```{r, echo=TRUE, eval=FALSE}

numStates <- locations %>% filter(countryIsoCode == "US") %>% count(region) %>% nrow

# there are 103 states in the United States according to the dataset, which is 
# erroneous. We expect to see 52 "different" states, the 50 states, the District 
# Capital and then NA for observations without a state 

allstates <- locations %>% filter(countryIsoCode == "US") %>% count(region) %>% arrange(desc(n))

locations$region[ locations$region == "53217"] <- "Wisconsin"
locations$region[ locations$region == "Calfornia"] <- "California"
locations$region[ locations$region == "CALIFORNIA"] <- "California"
locations$region[ locations$region == "CO"] <- "Colorado"
locations$region[ locations$region == "CT"] <- "Connecticut"
locations$region[ locations$region == "Deleware"] <- "Delaware"
locations$region[ locations$region == "FL"] <- "Florida"
locations$region[ locations$region == "Flordia"] <- "Florida"
locations$region[ locations$region == "IL"] <- "Illinois"
locations$region[ locations$region == "Illinios"] <- "Illinois"
locations$region[ locations$region == "illinois"] <- "Illinois"
locations$region[ locations$region == "IN"] <- "Indiana"
locations$region[ locations$region == "MA"] <- "Massachusetts"
locations$region[ locations$region == "Massachusettes"] <- "Massachusetts"
locations$region[ locations$region == "MARYLAND"] <- "Maryland"
locations$region[ locations$region == "MI"] <- "Michigan"
locations$region[ locations$region == "MINNESOTA"] <- "Minnesota"
locations$region[ locations$region == "WV"] <- "West Virginia"
locations$region[ locations$region == "WASHINGTON"] <- "Washington"
locations$region[ locations$region == "WA"] <- "Washington"
locations$region[ locations$region == "WA - Washington"] <- "Washington"
locations$region[ locations$region == "TX"] <- "Texas"
locations$region[ locations$region == "TN"] <- "Tennessee"
locations$region[ locations$region == "Tennesssee"] <- "Tennessee"
locations$locality[ locations$region == "Tallahassee"] <- "Tallahassee"
locations$region[ locations$region == "Tallahassee"] <- "Florida"
locations$locality[ locations$region == "Los Angeles"] <- "Los Angeles"
locations$region[ locations$region == "Los Angeles"] <- "California"
locations$postalCode[ locations$region == "Florida 33316"] <- 33316
locations$region[ locations$region == "Florida 33316"] <- "Florida"

locations$postalCode[ locations$region == "N7 8XH"] <- "N7 8XH"
locations$countryIsoCode[ locations$region == "N7 8XH"] <- "GB"

locations$region[ locations$region == "PA"] <- "Pennsylvania"
locations$region[ locations$region == "OHio"] <- "Ohio"
locations$region[ locations$region == "OHIO"] <- "Ohio"
locations$region[ locations$region == "New Yotk"] <- "New York"
locations$region[ locations$region == "NewYork"] <- "New York"
locations$region[ locations$region == "SC"] <- "South Carolina"
locations$region[ locations$region == "SD"] <- "South Dakota"
locations$region[ locations$region == "NC"] <- "North Carolina"
locations$region[ locations$region == "MS."] <- "Mississippi"
locations$region[ locations$region == "MN"] <- "Minnesota"
locations$locality[ locations$region == "Marion"] <- "Marion"
locations$region[ locations$region == "Marion"] <- "Ohio"


locations$countryIsoCode[ locations$region == "British Columbia"] <- "CA"
locations$locality[ locations$region == "Portland" ] <- "Portland"
locations$region[ locations$region == "Portland"] <- "Oregon"

locations$locality[ locations$region == "London"] <- "London"
locations$countryIsoCode[ locations$region == "London"] <- "GB"
locations$countryIsoCode[ locations$region == "England"] <- "GB"
locations$locality[ locations$region == "Shetland Islands"] <- "Shetland Islands"
locations$countryIsoCode[ locations$region == "Shetland Islands"] <- "GB"

locations$locality[ locations$region == "West Sussex"] <- "West Sussex"
locations$countryIsoCode[ locations$region == "West Sussex"] <- "GB"

locations$locality[ locations$region == "Worcestershire"] <- "Worcestershire"
locations$countryIsoCode[ locations$region == "Worcestershire"] <- "GB"

locations$locality[ locations$region == "Nairn"] <- "Nairn"
locations$countryIsoCode[ locations$region == "Nairn"] <- "GB"

locations$locality[ locations$region == "Merseyside"] <- "Merseyside"
locations$countryIsoCode[ locations$region == "Merseyside"] <- "GB"

locations$locality[ locations$region == "Grand Bahama Island"] <- "Grand Bahama Island"
locations$countryIsoCode[ locations$region == "Grand Bahama Island"] <- "BS"

locations %>% filter(countryIsoCode == "US" & is.na(region)) %>% nrow

# there are 16 observations without a state

locations %>% filter(countryIsoCode == "US" & is.na(region))

changeLocationState <- function(locationdataFrame, locationId, regionValue) {
  print(locationdataFrame)
  for (i in 1:nrow(locationdataFrame[,1])) {
    if (locationdataFrame$id[i] == locationId) {
      locationdataFrame$region[i] <- regionValue
      print(nrow(locationdataFrame))
      return(locationdataFrame)
    }
  }
}


# using google to find the states since it's only 16

locations <- changeLocationState(locations, "2sfE3h", "Texas")
locations <- changeLocationState(locations, "6fAJtO", "California")
locations <- changeLocationState(locations, "0QiT0E", "California")
locations <- changeLocationState(locations, "GTwDJI", "California")
locations <- changeLocationState(locations, "5yDXi6", "California")
locations <- changeLocationState(locations, "4nSc9t", "New York")
locations <- changeLocationState(locations, "LMfC4N", "New York")
locations <- changeLocationState(locations, "hbF6tu", "New Jersey")
locations <- changeLocationState(locations, "GM3x67", "North Carolina")
locations <- changeLocationState(locations, "PHV3BL", "District of Columbia")
locations <- changeLocationState(locations, "YKKcj2", "District of Columbia")
locations <- changeLocationState(locations, "1KlIcf", "California")

# remember that washington, dc is not a state
# difficult to see whether addresses are valid or what postal codes are valid, 
# location data cleaning will be a lot more obvious once it is visualized

# have to turn state abbreviations into full state names

# the ranges of the latitude and longitude seem alright, import dates so we can look at things by year?

# we can verify country isocodes, locationTypeDisplay, and check the ranges of latitude and longitude

summary(locations$latitude)
summary(locations$longitude)


# all seem good, going to take a look at country isocodes

countryCodes <- locations %>% count(countryIsoCode) %>% arrange(desc(n))
```

### The Birth of the Data Dictionaries
The code below produces the data dictionaries that I use to visualize things, including the beers, breweries, and location dictionaries and all of the different combinations of merging the different dictionaries.

```{r echo=TRUE, eval=FALSE}
########################### DATA DICTIONARY ###################################

beerStyles <- arrange(beerStyles, categoryId)

# adds category information to the diferent styles so we can search styles by category
beerCategoriesStyles <- beerStyles %>% inner_join(beerCategories %>% rename(categoryId = id, categoryName = name), 
                                                  by = "categoryId")
# splits observations into one breweryId and locationId per row instead of one id 
# per several location ids
breweries <- breweries %>% 
  mutate(locationId = strsplit(as.character(locationId), " ")) %>% 
  unnest(locationId)

# splits observations into one beer id and brewery id per row
beers <- beers %>% 
  mutate(breweryId = strsplit(as.character(breweryId), " ")) %>%
  unnest(breweryId)

# associates beers with their style information 
beersandStyles <- beers %>% inner_join(beerStyles %>% rename(styleId = id, 
                                                             styleName = name, 
                                                             styleshortName = shortName, 
                                                             stylesDescription = description), 
                                       by = c("styleId", "categoryId"))

# associates breweries with their location(s)
breweriesandLocations <- breweries %>% rename(breweryId = id, breweryName = name, 
                                              breweryDescription = description, 
                                              breweryWebsite = website) %>% 
  inner_join(locations %>% rename(locationId = id, locationName = name, 
                                  locationWebsite = website), 
             by = c("locationId", "breweryId"))

# associates beers with the breweries that brew them and the locations of those
# brewreries
beersBreweriesLocations <- beers %>% rename(beerId = id, beerName = name, 
                                            beerDescription = description) %>% 
  inner_join(breweriesandLocations, by = "breweryId")


write_rds(beers, "data/beersClean.rds")
write_rds(breweries, "data/breweriesClean.rds")
write_rds(locations, "data/locationsClean.rds")
write_rds(breweriesandLocations, "data/breweriesLocations.rds")
write_rds(beersBreweriesLocations, "data/maindictionary.rds")
write_rds(beerCategoriesStyles, "data/categoriesStyles.rds")
################################ END ###########################################
```

About Beer Categories and Styles {data-navmenu="About"} 
=======================================================================

Column {.tabset .tabset-fade data-width=200}
-----------------------------------------------------------------------

### Beer Category and Style Discussion

To the right are three different visualizations of the ranges of ABV (alcohol per beer volume), IBU (international bitterness units), and SRM (Standard Reference Method) IDs, which describe colors from light to dark, smallest to greatest. Some styles and categories allow for a lot of variance in these characteristics, such as the Hybrid/mixed Beer category, but this makes sense because this functions as an "other" category. Other styles, such as North American Lager, don't allow for much variance in their ABV or IBU. IBU ranges seem to be larger than that of ABVs, and a lot of the styles have have the most variance in the color category than the ABV or IBU range, but it is interesting to visualize the color, ABV, and IBU of beer styles and see the differences among the different styles within categories and differences between the different style categories. Hover over the bars to see which style it is. 

Column {.tabset .tabset-fade data-width=750}
-----------------------------------------------------------------------

### ABV Ranges
```{r warning=FALSE}
abvranges <- beerCategoriesStyles %>%
  ggplot(aes(x = id, y = (abvMin + abvMax) / 2)) + 
  geom_linerange(aes(ymin=abvMin, ymax=abvMax, group = name, color = categoryName), na.rm = TRUE) + 
  labs(color = "Category Name", x = "Beer Style Ids", y = "ABV Ranges") + 
  ylim(0, 27) + 
  ggtitle("ABV Ranges by Beer Style and Category") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(abvranges + coord_flip(), tooltip = c("group"))

```

### IBU ranges
```{r warning=FALSE}
iburanges <- beerCategoriesStyles %>% 
  ggplot(aes(x = id, y = ibuMin)) + 
  geom_linerange(aes(ymin=ibuMin, ymax=ibuMax, group = name, 
                     color = categoryName), na.rm = TRUE) + 
  labs(color = "Category Name", x = "Beer Style Ids", y = "IBU Ranges") +
  ggtitle("IBU Ranges by Beer Style and Category") +
  theme(plot.title = element_text(hjust = 0.5)) 

ggplotly(iburanges + coord_flip(), tooltip = c("group"))
```

### SRM Color Ranges
```{r warning=FALSE}
srmranges <- beerCategoriesStyles %>% filter(srmMax <= 50) %>%
  ggplot(aes(x = id, y = srmMin)) + 
  geom_linerange(aes(ymin=srmMin, ymax=srmMax, 
                     group = name, color = categoryName), na.rm = TRUE) + 
  labs(color = "Category Name", x = "Beer Style Ids", y = "SRM Color Ranges") +
  ggtitle("SRM Color Ranges by Beer Style and Category") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(srmranges + coord_flip(), tooltip = c("group"))
```

Breweries Across the World {data-navmenu="Explore" data-orientation=columns}
==============================================================================

Column {data-width=750}
-------------------------------------------------------------------------------

### Breweries Across the World

```{r, fig.height=7, warning=FALSE}
breweriesCountries <- breweriesandLocations %>% count(countryIsoCode) %>% 
  arrange(desc(n)) %>% 
  rename(frequency = n) %>% right_join(countryCodes %>% 
                                         rename(countryIsoCode = alpha.2, 
                                                country3Char = alpha.3,
                                                Country = name), 
                                       by = c("countryIsoCode")) %>%
  select(country3Char, Country, frequency)

breweriesCountries$frequency[is.na(breweriesCountries$frequency)] <- 0
#plotly choropleth map 
# light grey boundaries
l <- list(color = toRGB("grey"), width = 0.5)

# specify map projection/options
g <- list(
  showframe = FALSE,
  showcoastlines = FALSE,
  projection = list(type = 'Mercator')
)

plot_geo(breweriesCountries) %>%
  add_trace(
    z = ~frequency, color = ~frequency, colors = 'Blues',
    text = ~Country, locations = ~country3Char, marker = list(line = l)
  ) %>%
  colorbar(title = 'Number of Breweries') %>% #, tickprefix = '$') %>%
  layout(
    title = 'Number of Breweries by Country',
    geo = g
  )

```

Column {.sidebar}
--------------------------------------------------------------------------------

The chart on the right shows the spread of breweries across the world. United States seems to be the main brewer in the world, with the U.K., Canada, Germany, and Australia also contributed a bit to the dataset, BreweryDB but no where near as much as the United States. Of course, this could be due to this dataset coming from the United States. 
  
Hover over each country to see the number of breweries in that country that is accessible in the United States.
  
Seeing as how United States is the country with the largest number of breweries so far, let's focus on the United States.
  

What's Brewing in the USA {data-navmenu="Explore"}
================================================================================

Column {.tabset .tabset-fade}
--------------------------------------------------------------------------------

### Breweries in the USA
```{r warning=FALSE}
topBreweryStates <- beersBreweriesLocations %>% filter(countryIsoCode == "US") %>% count(region) %>% 
  arrange(desc(n)) %>% rename(frequency = n) %>%
  rename(state = region) %>% 
  left_join(stateAbbrv, by = c("state"))
#plotly choropleth map
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

plot_geo(topBreweryStates, locationmode = 'USA-states') %>%
  add_trace(
    z = ~frequency, text = ~state, locations = ~abbreviations,
    color = ~frequency, colors = 'Purples'
  ) %>%
  colorbar(title = "Number of Breweries") %>%
  layout(
    title = 'Number of Breweries Across the Country
(Hover for breakdown)', geo = g ) ``` ### Cities ```{r warning=FALSE} breweriesandLocations %>% filter(countryIsoCode == "US") %>% distinct(breweryId, .keep_all = TRUE) %>% count(locality, region) %>% arrange(region, desc(n)) %>% plyr::ddply("region", function(x) head(x[order(x$n, decreasing = TRUE) , ], 3)) %>% datatable(options = list(pageLength = 5)) ``` Column {.tabset .tabset-fade data-width=400} --------------------------------------------------------------------------------------------- ### Beer Styles and Categories Across the US ```{r warning=FALSE} topBreweryStates <- beersBreweriesLocations %>% filter(countryIsoCode == "US") %>% count(categoryId, styleId) %>% arrange(desc(n)) %>% rename(frequency = n) %>% inner_join(beerCategoriesStyles %>% rename(styleId = id), by = c("styleId", "categoryId")) %>% select(styleId, categoryId, name, categoryName, frequency) %>% rename(styleName = name) categoriesTreemap <- treemap(topBreweryStates, index = c("categoryName", "styleName"), vSize = "frequency", vColor = "frequency", type = "value", palette = rev(viridis(6))) thm <- hc_theme( colors = c("#1a6ecc", "#434348", "#90ed7d"), chart = list( backgroundColor = "transparent", style = list(fontFamily = "Source Sans Pro") ), xAxis = list( gridLineWidth = 1 ) ) highchart() %>% hc_add_series_treemap(categoriesTreemap, allowDrillToNode = TRUE, layoutAlgorithm = "squarified") %>% hc_add_theme(thm) ``` ### Cities with the Most Breweries ```{r warning=FALSE} # have to do this on unique brewery ids top10cities <- breweriesandLocations %>% filter(countryIsoCode == "US") %>% distinct(breweryId, .keep_all = TRUE) %>% count(locality) %>% arrange(desc(n)) %>% rename(Frequency = n, City = locality) %>% head(10) %>% ggplot(aes(x = City, y = Frequency)) + geom_bar(stat = "identity") + coord_flip() ggplotly(top10cities, tooltip = c("Frequency")) ``` Column {.sidebar} ----------------------------------------------------------------------------------------------- California is clearly the hub for breweries, with Colorado, Oregon, Washington, Michigan, Texas, North Carolina, and most of the Northeast having quite a few breweries as well. Hover over the different countries to see the number of breweries in each country. The data table below the map shows the top 5 cities with the most breweries per state, allowing one to dig into brewery distribution in states a bit. The tree map on the right shows the number of different beers per style in the United States as well as showing the top 10 cities with the most breweries in the United States. We can see the proportion of styles and style categories across the United States, and see that American-style beers are popular as are different style IPAs and French & Belgian style beers. California has three cities in the top 10 with cities in the Midwest also placing in the top 10. Hover over all of the visualizations to see more details. Click on the tree map to descend into categories. Let's focus on California and North Carolina, since Wake Forest University is in North Carolina and California has the most breweries per state in the entire U.S.A. North Carolina and California, Beertopia? {data-navmenu="Explore"} ============================================================================== Column -------------------------------------------------------------------------------- ### Brewery Growth over the Years ```{r warning=FALSE} historybyState <- breweriesandLocations %>% filter(countryIsoCode == "US" & !is.na(established) & established >= 1900 & (region == "California" | region == "North Carolina")) %>% count(region, established) %>% arrange(established, region) %>% rename(State = region, Year = established, Frequency = n) historybyState <- rbind(breweriesandLocations %>% filter(countryIsoCode == "US" & !is.na(established) & established >= 1900 & (region == "California")) %>% count(region, established) %>% arrange(established, region) %>% rename(State = region, Year = established, Frequency = n) %>% mutate(CumFrequency = cumsum(Frequency)), breweriesandLocations %>% filter(countryIsoCode == "US" & !is.na(established) & established >= 1900 & (region == "North Carolina")) %>% count(region, established) %>% arrange(established, region) %>% rename(State = region, Year = established, Frequency = n) %>% mutate(CumFrequency = cumsum(Frequency))) historybyState <- within(historybyState, acc_count <- cumsum(Frequency)) states <- ggplot(historybyState) + geom_line(aes(x = Year, y = CumFrequency, color = State), alpha = 0.4) + geom_line(aes(x = Year, y = Frequency, color = State), alpha = 0.4) + ggtitle("Breweries Founded Each Year") + labs(x = "Year", y = "Frequency") + theme(plot.title = element_text(hjust = 0.5)) ggplotly(states, tooltip = c("Year", "State", "y")) ``` Column {.tabset .tabset-fade} --------------------------------------------------------------------------------------------- ### Top Cities in California ```{r warning=FALSE} topCities <- breweriesandLocations %>% filter(countryIsoCode == "US" & region == "California") %>% count(locality) %>% arrange(desc(n)) %>% rename(City = locality, Frequency = n) categoriesTreemap <- treemap(topCities, index = c("City"), vSize = "Frequency", vColor = "Frequency", type = "value", palette = rev(viridis(6))) thm <- hc_theme( colors = c("#1a6ecc", "#434348", "#90ed7d"), chart = list( backgroundColor = "transparent", style = list(fontFamily = "Source Sans Pro") ), xAxis = list( gridLineWidth = 1 ) ) highchart() %>% hc_add_series_treemap(categoriesTreemap, allowDrillToNode = TRUE, layoutAlgorithm = "squarified") %>% hc_add_theme(thm) ``` ### Top Cities in North Carolina ```{r warning=FALSE} topCities <- breweriesandLocations %>% filter(countryIsoCode == "US" & region == "North Carolina") %>% count(locality) %>% arrange(desc(n)) %>% rename(City = locality, Frequency = n) categoriesTreemap <- treemap(topCities, index = c("City"), vSize = "Frequency", vColor = "Frequency", type = "value", palette = rev(viridis(6))) thm <- hc_theme( colors = c("#1a6ecc", "#434348", "#90ed7d"), chart = list( backgroundColor = "transparent", style = list(fontFamily = "Source Sans Pro") ), xAxis = list( gridLineWidth = 1 ) ) highchart() %>% hc_add_series_treemap(categoriesTreemap, allowDrillToNode = TRUE, layoutAlgorithm = "squarified") %>% hc_add_theme(thm) ``` Column {.sidebar} ----------------------------------------------------------------------------------------------- Here we have specific data about breweries in California and North Carolina, showing the number of breweries established each year as well as the total number of breweries in the two states over the years. We can also see how much the different cities in each of the states contribute to the overall number of breweries in the state. Charlotte, Raleigh, and Asheville each contribute about the same to the total number of breweries in North Carolina while San Diego makes a much bigger contribution to the total number of breweries in California than other California cities. Brewery growth in the two states follow similar patterns with California having more distinguishable spikes than North Carolina, but both states seem to support brewery growth (positive growth over the last few years). The line graph shows both the number of breweries established each year and the cumulative frequency of breweries for each year, so we can analyze both individual years and total frequency. Since we are in the middle of 2017, the 2017 number of established breweries is smaller than previous years and is the most likely culprit for the negative slope nearing 2017 in the visualization of the number of breweries established each year for both states. Since both states are seeing the number of breweries established each year grow, one can extrapolate and claim that these states are good places to found breweries. Diving into Breweries {data-navmenu="Explore"} ======================================================================= Column {.tabset .tabset-fade} -------------------------------------------------------------------------------- ### Location Types for Breweries ```{r warning=FALSE} # the top kinds of locations in the BreweryDB API dataset, which includes all # of the kinds of locations breweryLocations <- locations %>% inner_join(countryCodes %>% rename(countryIsoCode = alpha.2, countryName = name), by = c("countryIsoCode")) %>% ggplot(aes(x = locationTypeDisplay, fill = countryName)) + geom_bar() + scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) + ggtitle("The Frequency of the Different Kinds of Breweries") + labs(x = "Location Type", y = "Frequency") + theme(plot.title = element_text(hjust = 0.5)) ggplotly(breweryLocations, tooltip = c("fill", "count")) ``` ### Number of Breweries Established Each Year ```{r warning=FALSE} # the 3 top kinds of locations by states in the US historybyState <- breweriesandLocations %>% filter(countryIsoCode == "US" & !is.na(established) & established >= 1900) %>% count(region, established) %>% arrange(established, region) %>% ggplot(aes(x = established, y = n, color = region)) + geom_line(alpha = 0.4) + ggtitle("The Number of Breweries Founded Each Year per State") + labs(x = "Year", y = "Frequency") + theme(plot.title = element_text(hjust = 0.5)) ggplotly(historybyState, tooltip = c("region", "n")) ``` Column {.sidebar} ------------------------------------------------------------------------------- These are all the different kinds of locations a brewery can have. We already saw how breweries and their styles and the categories of those styles are distributed in the United States, and now we take a look at the different kinds of brewery locations and the number of breweries established each year per state. We can see that microbreweries are all the rage in the United States and other countries. Hover over the stacked bar chart to see the frequency of location types among the different countries present in the data pulled from the BreweryDB API. Microbreweries are, of course, the top location type, which makes sense because my dataset consists of craft beers, most of which are assuredly made in microbreweries. The frequency of breweries established each year is shown here, with all of the states and territories represented in the United States on a single graph, allowing for comparison between different states and territories by clicking on regions in the legend. Hover over the line graphs for detail about each year. California, Texas, and Michegan immediately pop out as the states with the most breweries established in recent years, with Pennsylvania having quite a few breweries established before and during the '50s. But What About the Beer?! {data-navmenu="Explore"} ======================================================================= Column {.tabset .tabset-fade} -------------------------------------------------------------------------------- ### SRM Ranges ```{r warning=FALSE} srmRanges <- beersBreweriesLocations %>% filter(!is.na(srmId) & countryIsoCode == "US") %>% mutate(USregion = ifelse(region %in% stateRegions[["Northeast"]], "Northeast", ifelse(region %in% stateRegions[["South"]], "South", ifelse(region %in% stateRegions[["West"]], "West", ifelse(region %in% stateRegions[["Midwest"]], "Midwest", region))))) %>% ggplot(aes(x = srmId, fill = USregion)) + geom_histogram(color = "BLACK", binwidth = 1) + labs(x = "SRM ID", y = "Frequency") + ggtitle("The Distribution of the SRM Color Scale") + theme(plot.title = element_text(hjust = 0.5)) ggplotly(srmRanges) ``` ### ABV Ranges ```{r warning=FALSE} ABVranges <- beersBreweriesLocations %>% filter(!is.na(abv) & countryIsoCode == "US") %>% group_by(region) %>% summarise(Average = mean(abv, na.rm = TRUE), Variance = (sd(abv, na.rm = TRUE))^2 + (sd(abv, na.rm = TRUE))^2) ABVranges$state <- factor(ABVranges$region, levels=unique(ABVranges$region)) p <- plot_ly(ABVranges, y = ~state, x = ~Average, text = ~Variance, type = 'scatter', mode = 'markers', marker = list(size = ~Variance, opacity = 0.5)) %>% layout(title = 'Average ABV by State with Double Variance as Size') p ``` ### IBU Ranges ```{r warning=FALSE} IBUranges <- beersBreweriesLocations %>% filter(!is.na(ibu) & countryIsoCode == "US") %>% group_by(region) %>% summarise(Average = mean(ibu, na.rm = TRUE), Variance = sd(abv, na.rm = TRUE)^2 + sd(abv, na.rm = TRUE)^2) IBUranges$state <- factor(IBUranges$region, levels=unique(IBUranges$region)) p <- plot_ly(IBUranges, y = ~state, x = ~Average, text = ~Variance, type = 'scatter', mode = 'markers', marker = list(size = ~Variance, opacity = 0.5)) %>% layout(title = 'Average IBU by State with Variance as Size') p ``` Column {.sidebar} -------------------------------------------------------------------------------- The general SRM color distribution is shown here, sliced by the region a particular brewery belongs to, using the United States regions used in the census [here](https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States#Census_Bureau-designated_regions_and_divisions). We can see that a large percentage of beers made in the United States are dark beers, with extremely light beers being the second range of colors popular, with no particular region specializing in a specific color range of beer. Hover over the bars to see exact breakdowns. Here the general alcohol per beer volume distribution is displayed in the form of a bubble chart by state. The location of the bubbles is determined by the average ABV in each state, and size is represented by the variance (doubled) of the ABV in each state so large bubble means high variance. We can see the distributions of ABV per state and make comparisons that way. The characteristics of this bubble chart are the same as the ABV bubble chart, except now we are looking at the IBU (International Bitterness Units), which is a measure of a beer's bitterness. Once more, the location of a bubble represents the average IBU and its size is the variance (doubled) of IBUs, all separated by state and territories in the United States. Oklahoma immediately pops up in the bubble charts as a state whose ABV and IBU distributions have quite a bit of variance, which indicates that Oklahoma might perhaps have an outlier. Hover over the bubbles to see more detail, with the number on the second row being the variance.