Getting information on Breweries can be a difficult process with data from multiple sources or sources that have limited access to developers. Import.io instantly turns webpages into data with minimal or no set-up involved. This is going to work great for extracting meta information on breweries and I’m specifically extracting data on Connecticut-based breweries.
Go to BeerAdvocate.com and navigate to the search for state, then onto the breweries link.
We need to deducing the URL structure due to the pagination, which is fairly simple by clicking on each of the results links (ie. 1-20, 21-40, 41-60). And here are the necessary links:
http://www.beeradvocate.com/place/list/?start=0&c_id=US&s_id=CT&brewery=Y&sort=name, http://www.beeradvocate.com/place/list/?start=20&c_id=US&s_id=CT&brewery=Y&sort=name, http://www.beeradvocate.com/place/list/?start=40&c_id=US&s_id=CT&brewery=Y&sort=name.
After initially setting up an account on import.io which can be done through linking your github, you can navigate to your my data/mine page and input the previous links into the bulk extractor for your Magic API and run the queries.
The new output page will be an awesome tabular view of all of the extracted link data ready for export in multiple formats (ie. Spreadsheet for CSV)!
Read the data into R:
beer_advocate <- read.csv("beeradvocate.com API 14th Dec 10-37.csv")
beer_head <- head(beer_advocate)
# remove duplicate columns with unimportant info
beer_advocate$bottomdark_links._text <- NULL
beer_advocate$bottomdark_links._source <- NULL
beer_advocate$pageUrl <- NULL
beer_advocate$image <- NULL
beer_advocate$bottomdark_content_numbers._source <- NULL
beer_advocate$bottomlight_number <- NULL
beer_advocate$link._source <- NULL
# change the column names
colnames(beer_advocate) <- c("address", "brewery_website", "beer_advocate_website", "number_of_reviews","beer_avg", "number_of_beers", "place_type", "brewery_rating", "brewery_name", "phone_number")
# get our missing values as NA
beer_advocate[beer_advocate == "-"] = NA
# clean up some of the columns and remove extra data
beer_advocate$address <- as.character(beer_advocate$address) # change into character data type
beer_advocate$address <- gsub(" official website", "", beer_advocate$address)
# extract the phone number from the address column
# http://stackoverflow.com/questions/21007607/extract-phone-number-regex
library(stringr)
beer_advocate$phone_number <- str_extract_all(beer_advocate$address, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")
beer_advocate$phone_number[beer_advocate$phone_number == "character(0)"] = NA
# remove the phone number tag from the address field for geocoding
beer_advocate$address <- gsub(" phone: \\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}", "", beer_advocate$address)
# change the factor for average beer rating into a numeric
# http://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information
beer_advocate$beer_avg <- as.numeric(levels(beer_advocate$beer_avg))[beer_advocate$beer_avg]
## Warning: NAs introduced by coercion
library(htmlwidgets)
library(leaflet)
library(ggmap)
geo <- geocode(location = beer_advocate$address, output="latlon")
beer_advocate$lon <- geo$lon
beer_advocate$lat <- geo$lat
# create the pop-up info
beer_advocate$pop_up_content <- paste(sep = "<br/>",
"<b><a href='", beer_advocate$beer_advocate_website, "'>", beer_advocate$brewery_name, "</a></b>", beer_advocate$address,
beer_advocate$phone_number, paste("Average beer rating: ", beer_advocate$beer_avg,"/5.0"))
# add colors corresponding to the average beer ratings
pal <- colorNumeric("YlOrRd", beer_advocate$beer_avg, n = 6)
m <- leaflet(data = beer_advocate) %>%
addTiles() %>%
setView(lng =-72.70190, lat=41.75798, zoom = 8) %>%
addCircleMarkers(~lon, ~lat, popup = ~as.character(pop_up_content), color = ~pal(beer_avg)) %>%
addLegend("topright", pal = pal, values = ~beer_avg,
title = "Average Beer Ratings at Brewery",
opacity = 1
)
m