For the final project, I wanted to do a variety of things, pushing myself to get the most I can out of R and to put my computer science skills to use.
North Carolina is arguably one of the states with the most breweries in the country. Beer is one of the three kinds of alcohol you can buy (wine, liquor, beer), and the advertising for it is everywhere. Whether you are driving down a highway or watching television, you will almost certainly see advertising for some sort of beer, perhaps one you’ve never heard of. And if you’ve ever watched the Super Bowl, you’ve definitely seen the commercials for beer. If that isn’t enough to convince anyone, Anheuser-Busch InBev SA/NV is the largest brewery in the world and one of the largest and most profitable fast-moving consumer goods companies in the world, with $45.5 billion in revenue last year and a projected revenue of $55 billion this year. Craft beer is also growing in popularity, so it’s safe to say that quite a few people care about beer and the money it brings in.
For this final project, I want to focus on beer and breweries in the United States. There will be three large aspects to this project: data wrangling, data visualization, and prediction, the Big Three of data science. Prediction, rather than classification, is the side of machine learning I want to focus on. I want to explore the spread of breweries in the United States, and I’d like to see whether there is a viable prediction algorithm for beers: given a beer, its brewery, and their characteristics, can we predict how well-liked the beer will be? This has huge implications for brewery owners. If ratings can be predicted, then a brewery can aim to create a beer that will be well liked. And if certain locations are more open to certain kinds of beer, then marketing for those kinds of beer should be increased in those areas. These are just examples of what the results of this project could mean in the real world.
I will be using data from two websites, BreweryDB and RateBeer. BreweryDB will be my main source of data, with RateBeer providing ratings of the beers based on community reviews. Both websites are well known, are often used in “beer analytics,” and have reputable communities behind them, with revisions being made every day. BreweryDB maintains a staff that checks the authenticity of beers and breweries and the accuracy of their statistics. RateBeer will provide both the ratings and the number of reviewers, so we can decide whether to trust an average rating when, for example, there have been only 2 reviewers. Getting data out of these websites is the data wrangling portion of this project.
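One simple way to act on the reviewer count is to keep only beers whose average rating is backed by enough reviews. The snippet below is a minimal sketch of that idea in Python, not a settled design: the `BeerRating` record, its field names, the sample values, and the `MIN_REVIEWS` threshold of 10 are all assumptions made for illustration, and the real cutoff would need to be chosen after looking at the data.

```python
from dataclasses import dataclass


@dataclass
class BeerRating:
    """Hypothetical record for one beer's community rating."""
    name: str
    mean_rating: float
    num_reviews: int


# Assumed cutoff for illustration only; not taken from either website.
MIN_REVIEWS = 10


def trusted_ratings(ratings, min_reviews=MIN_REVIEWS):
    """Keep only the ratings backed by at least `min_reviews` reviews."""
    return [r for r in ratings if r.num_reviews >= min_reviews]


# Made-up sample data, not real ratings.
sample = [
    BeerRating("Example Pale Ale", 4.8, 2),     # too few reviews to trust
    BeerRating("Example Stout", 3.9, 150),
    BeerRating("Example Lager", 3.2, 10),
]

for rating in trusted_ratings(sample):
    print(rating.name, rating.mean_rating, rating.num_reviews)
```

A hard cutoff throws data away, so a gentler alternative would be to keep every beer and weight it by its number of reviews when fitting the prediction model.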
BreweryDB provides just about all the information you could want about a brewery and the different beers it produces. The information I will be using from BreweryDB is:
RateBeer will be used only to attach the ratings it provides to the beers found in BreweryDB.
I will be using the BreweryDB API to access its database, and I will be calling it from my R program. RateBeer is much trickier, since its API has been down for a number of years now. However, thanks to GitHub and open source, there is a kind soul who created a Python wrapper that scrapes data off of RateBeer and returns it in the same manner that an API would. So I will write a script in Python to get the ratings data and call that script from R. I foresee that the main issue will be differences in beer names between the two websites, and I’m sure that some beers will exist in one dataset but not in the other, but that’s what data wrangling is all about!
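To give a feel for the name-matching problem, here is a minimal sketch of one possible approach: normalize both names, then fall back to fuzzy matching with Python's standard `difflib` module. The function names `normalize_name` and `match_beers`, the 0.85 similarity cutoff, and the sample beer names are all assumptions for illustration. The actual matching may well end up in R, and the cutoff would need tuning against real data.

```python
import difflib
import re


def normalize_name(name):
    """Lowercase a beer name, drop punctuation, and collapse whitespace."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return re.sub(r"\s+", " ", name).strip()


def match_beers(brewerydb_names, ratebeer_names, cutoff=0.85):
    """Map each BreweryDB name to its closest RateBeer name, or None.

    `cutoff` is an assumed similarity threshold between 0 and 1.
    """
    # Index the RateBeer names by their normalized form.
    normalized = {normalize_name(n): n for n in ratebeer_names}
    candidates = list(normalized)

    matches = {}
    for name in brewerydb_names:
        close = difflib.get_close_matches(
            normalize_name(name), candidates, n=1, cutoff=cutoff
        )
        matches[name] = normalized[close[0]] if close else None
    return matches


# Made-up names standing in for the two datasets.
brewerydb = ["Hop Drop 'N Roll", "Golden Oak Lager", "Imaginary Stout"]
ratebeer = ["Hop, Drop n Roll", "Golden Oak Lagr", "Something Else Porter"]

for source_name, matched_name in match_beers(brewerydb, ratebeer).items():
    print(source_name, "->", matched_name)
```

Beers that come back as `None` are the ones that exist in only one dataset, or whose names differ too much to match automatically, and they would need to be dropped or checked by hand. Matching on the beer name alone can also confuse beers with the same name from different breweries, so including the brewery name in the comparison is probably worth trying.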
There are three main results I hope to deliver by the end of this project:
Because this project is very involved, I plan to start on it this week. My plan of action is as follows: