Sweet, Sweet Coffee

Coffee.

It’s the best.

A few mornings ago, as I was sipping my morning coffee, I checked the new Tidy Tuesday post to see what new dataset was available. This week, it was coffee ratings and the metrics used to determine coffee quality.

I jumped in and started exploring.

First, some cleaning…

#Load libraries

library(tidyverse)
library(tidytuesdayR)

#Pull the coffee data using the tidytuesday package

tuesdata <- tidytuesdayR::tt_load('2020-07-07')
## 
##  Downloading file 1 of 1: `coffee_ratings.csv`
coffee_ratings <- tuesdata$coffee_ratings

#I selected the variables that I wanted in my dataset using select()

c2 <- coffee_ratings %>% select(total_cup_points, species, country_of_origin,
                                aroma, flavor, aftertaste, 
                                acidity, body, balance, uniformity, clean_cup, 
                                cupper_points, moisture, category_one_defects, 
                                category_two_defects, quakers, 
                                altitude_low_meters, altitude_high_meters, 
                                altitude_mean_meters)

# Change country_of_origin to factor

c2$country_of_origin <- as.factor(c2$country_of_origin)

Where is the best coffee from?

Right away, I wanted to see where the best coffee was from. I was simply going to pull the highest rating from each country, but that seemed a bit unfair (at least initially). I wanted something more descriptive.

Enter ggridges

# I dropped one variable (I was looking at some different models),
# removed rows with NAs, kept only countries with more than 2 ratings,
# and made a variable called instances to hold the count per country
# (note: I would not typically make so many DFs, but I was working on other models)

c3 <- c2 %>% dplyr::select(-altitude_mean_meters) %>%  drop_na()

c4 <- c3 %>% group_by(country_of_origin) %>%  filter(n() > 2) %>% mutate(instances = n())
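To see what that grouped filter and count are doing, here is a minimal sketch on made-up toy data (the `toy` data frame and its values are my own invention for illustration, not part of the coffee dataset). Inside a grouped pipeline, `n()` is the number of rows in the current group, so `filter(n() > 2)` keeps or drops whole countries, and `mutate(instances = n())` repeats each country's count on every one of its rows.

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  points  = c(82, 84, 86, 90, 91, 80, 81, 79, 83)
)

toy_kept <- toy %>%
  group_by(country) %>%
  filter(n() > 2) %>%          # drops country B, which has only 2 ratings
  mutate(instances = n()) %>%  # per-country count, repeated on each row
  ungroup()

# One row per remaining country with its count
distinct(toy_kept, country, instances)
```

Country B disappears entirely because it has only two rows, while A and C keep all of their rows plus the new `instances` column.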

#load ggridges and viridis package for color blind friendly palettes 

library(ggridges)
library(viridis)

# And Plot

p2 <- c4 %>% ggplot(aes(x = total_cup_points, y = fct_reorder(country_of_origin, total_cup_points))) + 
  geom_density_ridges(aes(fill=country_of_origin)) +
  scale_fill_viridis(discrete = TRUE, option = "D") +
  labs(title = "Density plot of Coffee Ratings score", subtitle = "By Country of Origin", x = "Total Cup Points", y = NULL) +
  xlim(55, 100) +
  theme_light()+
  guides(fill = "none")
p2

Neat

From this plot, it looks like the U.S., Ethiopia, and Kenya have some of the highest-rated coffees.

But, the country with the best coffee, on average, is not the U.S.

Let’s take another look.

This time with an interactive plot using plotly.

#load in library
library(plotly)

#create new DF with mean scores across countries
c_means <- c4 %>% group_by(country_of_origin, instances) %>% summarise(mean_points = mean(total_cup_points))

#Plot using the ggplotly function and change the hovertip text (challenging)
#(note: I made size = instances so you could see how many scores were used in the mean calculation)

p4 <- c_means %>% ggplot(aes(fct_reorder(country_of_origin, mean_points), mean_points, text = paste("With <b>", instances, "</b> cups tasted <b>", country_of_origin, "</b> had an average coffee quality of <b>", mean_points,"</b>"))) + 
  geom_point(aes(color = country_of_origin, size = instances)) +
  scale_color_viridis(discrete = TRUE, option = "D") +
  labs(title = "Mean Cup Points by Country of Origin", x = NULL, y = "Mean Points") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p4 <- ggplotly(p4, tooltip = "text") %>% hide_guides() %>% style(hoverlabel = list(bgcolor = "white"), hoveron = "fill")
p4
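One thing I might tweak in the hover text: the mean is pasted in at full precision, so the tooltip can show a long run of decimals. Below is a small sketch of a helper that rounds the mean before building the string. `make_tooltip()` and its arguments are hypothetical names of my own, not something from plotly, and the values in the example call are made up.

```r
# Hypothetical helper (not part of plotly or the original code):
# builds the HTML hover text with the mean rounded to `digits` places.
make_tooltip <- function(country, n_cups, mean_points, digits = 2) {
  paste0(
    "With <b>", n_cups, "</b> cups tasted, <b>", country,
    "</b> had an average coffee quality of <b>",
    format(round(mean_points, digits), nsmall = digits), "</b>"
  )
}

# Example call with made-up values
tip <- make_tooltip("Country A", 12, 83.456789)
tip
```

A string built this way could be passed to the `text` aesthetic in place of the inline `paste()` call, and `ggplotly(..., tooltip = "text")` would pick it up the same way.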

The end…

Not quite.

This gives us the mean scores of countries.

Let’s find out which country or region (I see you Hawaii) we should visit to find the best coffee.

#create new DF making certain that each country has a score and get a count
c5 <- c3 %>% group_by(country_of_origin) %>% filter(n() > 0) %>% mutate(instances = n())

#Group by country of origin and instances and summarise for high score per country
c5 <- c5 %>% group_by(country_of_origin, instances) %>% summarise(high_score = max(total_cup_points))

#Plot it with ggplot and plotly
p5 <- c5 %>% ggplot(aes(fct_reorder(country_of_origin, high_score), high_score, 
                        text = paste("The highest rated coffee from <b>", country_of_origin, "</b> received <b>", high_score, "</b> points"))) +
  geom_point(aes(color = country_of_origin, size = instances)) +
  scale_color_viridis(discrete = TRUE, option = "D") +
  labs(title = "Highest Score by Country of Origin", x = NULL, y = "Highest Score") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p5 <- ggplotly(p5, tooltip = "text") %>% hide_guides() %>% style(hoverlabel = list(bgcolor = "white"), hoveron = "fill")
p5
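The mean and the maximum can rank countries differently, which is why the two plots tell different stories. Here is a minimal sketch on made-up toy data (again, `toy` and its values are invented for illustration) that computes both in a single `summarise()` call. It assumes dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  points  = c(80, 81, 95, 88, 89, 90)
)

toy_summary <- toy %>%
  group_by(country) %>%
  summarise(
    instances   = n(),            # how many ratings went into each summary
    mean_points = mean(points),   # typical quality
    high_score  = max(points),    # single best cup
    .groups = "drop"
  )

toy_summary
```

In this toy case, country A has the single best cup while country B has the better average, the same kind of split as between the highest-score plot and the mean plot.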

What else???

There’s a lot to get from this data.

What I’ve learned is that I need to travel to Ethiopia, Guatemala, and Hawaii for the best coffees in the world.

However, if I want the best chance of getting a good cup of coffee, Ethiopia, the U.S., and Kenya are good places to check out.

As an avid coffee drinker, I found this exploration helpful for deciding the origin of my next whole-bean purchase.

As someone who enjoys data exploration, I found customizing the tooltip in plotly::ggplotly to be a fun challenge.