We begin by loading in the dataset from this package.
data(untappd, package = "izzyuntappd")
# I've also included the dataset as a CSV file and you can read it in by using
# untappd <- read_csv(file = "chester_beer_feb15-june16.csv")
One great feature of RStudio is the ability to view dataframes like untappd in table form:
View(untappd)
We can compute the mean, median, and standard deviation of the abv values in this dataset:
summary_abv <- untappd %>%
  summarize(mean_abv = mean(abv),
            median_abv = median(abv),
            sd_abv = sd(abv))
summary_abv
## # A tibble: 1 x 3
## mean_abv median_abv sd_abv
## <dbl> <dbl> <dbl>
## 1 6.453058 6.2 1.830056
kable(summary_abv)
| mean_abv | median_abv | sd_abv |
|---|---|---|
| 6.453058 | 6.2 | 1.830056 |
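The summary above returns real numbers, which tells us no abv value is missing (a single NA would turn the mean into NA). As a minimal sketch of what changes when values are missing, here is a toy data frame; the name `toy_beers` and its values are made up for illustration and are not part of untappd:

```r
library(dplyr)

# Hypothetical toy data, not part of the untappd dataset
toy_beers <- data.frame(abv = c(5.0, 6.5, NA, 8.0))

# Without na.rm, one missing value makes the summary NA
toy_beers %>% summarize(mean_abv = mean(abv))

# With na.rm = TRUE, missing values are dropped before summarizing
toy_summary <- toy_beers %>%
  summarize(mean_abv = mean(abv, na.rm = TRUE),
            median_abv = median(abv, na.rm = TRUE),
            sd_abv = sd(abv, na.rm = TRUE))
toy_summary
```

The same `na.rm = TRUE` argument shows up later in this walkthrough when we take the maximum rating.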
We can also plot the distribution of abv as a histogram:
abv_plot <- ggplot(data = untappd, aes(x = abv)) +
  geom_histogram(bins = 20, color = "white")
abv_plot
To make an interactive plot using a ggplot2 graphic, we can use the ggplotly function in the plotly package:
ggplotly(abv_plot)
If we’d like to count how many beers of each macro_style I’ve tried, in a table we can sort:
style_count <- untappd %>% count(macro_style)
datatable(style_count)
The datatable function in the DT package provides a nice interface for searching and sorting datasets.
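If you would rather have the counts already sorted before they reach the table, `count()` in dplyr accepts a `sort` argument. A small sketch on made-up data (the `toy_beers` data frame is hypothetical and only stands in for untappd):

```r
library(dplyr)

# Hypothetical toy data standing in for untappd
toy_beers <- data.frame(
  macro_style = c("IPA", "Stout", "IPA", "Porter", "IPA", "Stout")
)

# sort = TRUE orders the counts from largest to smallest
toy_style_count <- toy_beers %>% count(macro_style, sort = TRUE)
toy_style_count
```

The most common macro_style then sits in the first row without any clicking.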
What is going on here!? Do I actually like my top macro_style as much as these numbers show?
dplyr verbs
Let’s focus on only the macro_style corresponding to IPA. We will create a new dataframe called ipas:
ipas <- untappd %>% filter(macro_style == "IPA")
Look through the dataset again by entering View(ipas) into the R console.
Let’s simplify our dataset a bit to view it more easily.
ipas_small <- ipas %>% select(beer_name, style, abv, ibu, rating)
We might be curious to see if ibu has a relationship with rating:
ggplot(data = ipas_small, aes(x = ibu, y = rating))
What type of plot should we make here?
ibu_vs_rating <- ggplot(data = ipas_small, aes(x = ibu, y = rating)) +
  geom_point()
ibu_vs_rating
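A scatterplot shows the relationship visually. One possible way to attach a number to it is the Pearson correlation from base R's `cor()`. This is only a sketch on made-up values (`toy_ipas` is hypothetical), and it assumes ibu may be missing for some beers, which is why `use = "complete.obs"` is passed:

```r
# Hypothetical toy data, not taken from ipas_small
toy_ipas <- data.frame(
  ibu = c(40, 55, NA, 70, 90),
  rating = c(3.5, 3.75, 4.0, 4.0, 4.25)
)

# use = "complete.obs" drops rows where either value is missing
ibu_rating_cor <- cor(toy_ipas$ibu, toy_ipas$rating, use = "complete.obs")
ibu_rating_cor
```

A value near 0 would suggest little linear relationship, while values near 1 or -1 would suggest a strong one.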
A plot often tells us more when it shows more than two variables at once. Another common feature that beer drinkers look for is abv. How does abv relate to ibu and rating for me?
ibu_abv_rating <- ggplot(data = ipas_small, aes(x = ibu, y = rating)) +
  geom_point(aes(color = abv, alpha = abv))
ibu_abv_rating
ggplotly(ibu_abv_rating)
There are many different styles of beers in the macro_style of IPA. How could we use what we know already to determine which style of IPA I rated highest, on average?
summary_ipas <- ipas_small %>%
  group_by(style) %>%
  summarize(mean_rating = mean(rating),
            median_rating = median(rating),
            count = n())
datatable(summary_ipas)
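To answer the "rated highest, on average" question directly, one option is to add `arrange(desc(mean_rating))` to the end of the chain. Here is a sketch on made-up data; `toy_ipas` and its style names are hypothetical:

```r
library(dplyr)

# Hypothetical toy data; style names and ratings are made up
toy_ipas <- data.frame(
  style = c("IPA - American", "IPA - American", "IPA - Imperial",
            "IPA - Imperial", "IPA - Session"),
  rating = c(3.5, 4.0, 4.25, 4.5, 3.0)
)

# Summarize per style, then put the highest mean rating first
toy_summary_ipas <- toy_ipas %>%
  group_by(style) %>%
  summarize(mean_rating = mean(rating),
            median_rating = median(rating),
            count = n()) %>%
  arrange(desc(mean_rating))
toy_summary_ipas
```

Keeping the count column around is handy: a style I only tried once can top the list without telling us much.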
Now let’s go back to the original untappd dataset.
brew_state_counts <- untappd %>%
  filter(brewery_country == "United States") %>%
  count(brewery_state)
datatable(brew_state_counts)
We see that there are 25 states listed here.
Now how do we identify the state whose beers have the smallest maximum rating? Chain together multiple commands to get a final answer.
max_state_ratings <- untappd %>%
  filter(brewery_country == "United States") %>%
  group_by(brewery_state) %>%
  summarize(max_rating = max(rating, na.rm = TRUE),
            count = n()) %>%
  arrange(max_rating)
datatable(max_state_ratings)
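One thing to watch for with `max(rating, na.rm = TRUE)`: if every rating in a group is missing, `max()` returns `-Inf` with a warning, and that group sorts to the top of an ascending `arrange()`. The sketch below uses made-up data (`toy_beers` and the `n_rated` column are hypothetical additions) to show the effect and one way to guard against it:

```r
library(dplyr)

# Hypothetical toy data: every rating for "OR" is missing
toy_beers <- data.frame(
  brewery_state = c("MN", "MN", "OR", "OR"),
  rating = c(4, NA, NA, NA)
)

toy_max <- toy_beers %>%
  group_by(brewery_state) %>%
  # suppressWarnings() hides the "no non-missing arguments" warning
  summarize(max_rating = suppressWarnings(max(rating, na.rm = TRUE)),
            n_rated = sum(!is.na(rating))) %>%
  arrange(max_rating)
toy_max

# Keep only the states with at least one actual rating
toy_max_rated <- toy_max %>% filter(n_rated > 0)
toy_max_rated
```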
Let’s conclude by showing how we can use the dplyr functions to summarize/manipulate data and then feed that data into ggplot2 functions to plot them.
People like to ask me whether I like stouts and/or porters better in the winter or in the summer. Let’s use my ratings to address this question.
We will work with the date column in the untappd dataframe and keep only the "Porter"s and "Stout"s in the macro_style variable:
stouts_porters <- untappd %>% filter(grepl("Porter|Stout", macro_style))
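Plotting over time needs the date column to be a real Date rather than text. Whether untappd already stores it that way is not shown here, so treat the conversion below as an assumption. A sketch on made-up data (`toy_beers` is hypothetical) that converts the column and then applies the same `grepl()` filter:

```r
library(dplyr)

# Hypothetical toy data with dates stored as text
toy_beers <- data.frame(
  date = c("2016-01-15", "2016-03-02", "2016-06-20", "2016-07-04"),
  macro_style = c("Stout", "IPA", "Porter", "Lager"),
  rating = c(4.0, 3.5, 3.75, 3.0)
)

toy_dark <- toy_beers %>%
  # as.Date() parses "YYYY-MM-DD" text into the Date class
  mutate(date = as.Date(date)) %>%
  # "Porter|Stout" matches any macro_style containing either word
  filter(grepl("Porter|Stout", macro_style))
toy_dark
```

Because `grepl()` matches anywhere in the string, a macro_style such as "Imperial Stout" would also be kept.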
dark_by_day <- ggplot(stouts_porters, aes(x = date, y = rating)) +
  geom_point(alpha = 0.3)
ggplotly(dark_by_day)
This pretty much addresses our question. Except for a few bad ones in Spring 2016, it doesn’t look like it matters much what time of the year it is. But let’s dig further. Did I like stouts better or porters better over this time frame?
ggplot(stouts_porters, aes(x = date, y = rating)) +
  geom_point(aes(color = macro_style))
This is still a little tricky to see. Let’s focus on the median rating for each day for both porters and stouts. First we need to compute the median ratings:
sp_median <- stouts_porters %>%
  group_by(macro_style, date) %>%
  summarize(median_rating = median(rating))
Now we will create a line-graph over the time frame and color by macro_style:
ggplot(sp_median, aes(x = date, y = median_rating, color = macro_style)) +
  geom_line() +
  scale_color_manual(values = c("goldenrod", "darkblue"))
It does appear that I prefer porters to stouts in the summer months, stouts to porters in the fall, and it is anybody’s guess for the remainder of the year.
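Reading seasons off a line graph is a bit of an eyeball exercise. One possible way to check it is to label each check-in with a season and compare medians. The sketch below uses made-up data (`toy_dark` is hypothetical), and the month-to-season cutoffs are my assumption rather than anything defined in untappd:

```r
library(dplyr)

# Hypothetical toy data, not taken from stouts_porters
toy_dark <- data.frame(
  date = as.Date(c("2015-07-10", "2015-07-25", "2015-10-05",
                   "2015-10-20", "2016-01-12", "2016-01-30")),
  macro_style = c("Porter", "Stout", "Porter", "Stout", "Porter", "Stout"),
  rating = c(4.25, 3.5, 3.5, 4.25, 4.0, 4.0)
)

toy_by_season <- toy_dark %>%
  # Assumed cutoffs: Dec-Feb winter, Mar-May spring, Jun-Aug summer
  mutate(month = as.integer(format(date, "%m")),
         season = case_when(
           month %in% c(12, 1, 2) ~ "Winter",
           month %in% 3:5 ~ "Spring",
           month %in% 6:8 ~ "Summer",
           TRUE ~ "Fall"
         )) %>%
  group_by(season, macro_style) %>%
  # .groups = "drop" needs dplyr 1.0.0 or later
  summarize(median_rating = median(rating),
            count = n(),
            .groups = "drop")
toy_by_season
```

With the count column alongside each median, it is easier to tell a real seasonal preference from a season with only a couple of check-ins.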
Play around with the data more to see which relationships stand out to you!