A snapshot of the history of coffee

COFFEE - Mark Pendergrast wrote Uncommon Grounds book, is one of most valuable agricultural commodities of the world. And coffee is the third most consumed beverage, besides water and tea - stated Anshool Deshmukh. Nowadays, over 70 countries worldwide produce this spectacular plant/beverage, some of the biggest names in coffee production are: Brazil, Vietnam, Colombia, Indonesia, Ethiopia - to name a few. They all belong to the “Bean Belt” (located between the Tropic of Cancer and the Tropic of Capricorn.).

Image: Visual Capitalist

So where did coffee come from? According to William Harrison Ukers, it’s commonly accepted that coffee originated from Abissinya (presently known as Ethiopia), and Arabia. Then the coffee plant were transported to different parts of the world starting in the 16th century. The rest, is history.

Why I chose this dataset

As a coffee addict that hardly go a day without consuming at least one cup, I know too little about coffee. That’s the subjective reason why I chose this dataset, to learn a little more about such a powerful beverage.This dataset was web scraped from https://www.coffeereview.com/. It comes with multiple very useful variables including:

name: Name of the coffee
roaster: Name of the roaster
roast: Type of roast (i.e.: Light, Medium-Light, Dark)
loc_country: Country where the roaster is located
origin: Country where the coffee beans come from
100_USD: Price of coffee per 100g in US dollars
rating: Rating of the coffee
review_date: Date of the review
review: Review notes of the coffee

Just by looking at the variables, there are many routes of data exploration to be had, for me, I wanted to explore the top 20 rated coffee beans, where they come from; then take a little deeper dive into those origins. Therefore, I needed to clean things up:

First, I must sort the data by ratings in descending order so I can isolate the top 20 rated beans
Second, from there, I can determine the origins of the top 20 beans, and filter the dataset down to just those origins
Third, I summarize the ratings and prices of the beans for each region to make the data more manageable
Fourth, in my attempt to make a Word Cloud, I remove common/general words in the Review to show which words are most used to describe coffee.

Let’s dig in!

### Load the necessary libraries

library(tidyverse)

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

# Set working directory

setwd("C:/Users/myngu/OneDrive/Montgomery College/Spring 2023/DATA 110/Data Sets")
coffee <- read_csv("simplified_coffee.csv")

## Rows: 1246 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, roaster, roast, loc_country, origin, review_date, review
## dbl (2): 100g_USD, rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Looking at the data form a broad point of view

head(coffee)

## # A tibble: 6 × 9
##   name                roaster roast loc_c…¹ origin 100g_…² rating revie…³ review
##   <chr>               <chr>   <chr> <chr>   <chr>    <dbl>  <dbl> <chr>   <chr> 
## 1 Ethiopia Shakiso M… Revel … Medi… United… Ethio…    4.7      92 Novemb… Crisp…
## 2 Ethiopia Suke Quto  Roast … Medi… United… Ethio…    4.19     92 Novemb… Delic…
## 3 Ethiopia Gedeb Hal… Big Cr… Medi… United… Ethio…    4.85     94 Novemb… Deepl…
## 4 Ethiopia Kayon Mou… Red Ro… Light United… Ethio…    5.14     93 Novemb… Delic…
## 5 Ethiopia Gelgelu N… Willou… Medi… United… Ethio…    3.97     93 Novemb… High-…
## 6 Ethiopia Hambela A… Black … Medi… United… Ethio…    5.14     93 Novemb… Very …
## # … with abbreviated variable names ¹loc_country, ²`100g_USD`, ³review_date

Rename 100g_USD to price for ease of pulling data

coffee <- coffee %>% 
  rename(price = "100g_USD")

head(coffee)

## # A tibble: 6 × 9
##   name                  roaster roast loc_c…¹ origin price rating revie…² review
##   <chr>                 <chr>   <chr> <chr>   <chr>  <dbl>  <dbl> <chr>   <chr> 
## 1 Ethiopia Shakiso Mor… Revel … Medi… United… Ethio…  4.7      92 Novemb… Crisp…
## 2 Ethiopia Suke Quto    Roast … Medi… United… Ethio…  4.19     92 Novemb… Delic…
## 3 Ethiopia Gedeb Halo … Big Cr… Medi… United… Ethio…  4.85     94 Novemb… Deepl…
## 4 Ethiopia Kayon Mount… Red Ro… Light United… Ethio…  5.14     93 Novemb… Delic…
## 5 Ethiopia Gelgelu Nat… Willou… Medi… United… Ethio…  3.97     93 Novemb… High-…
## 6 Ethiopia Hambela Ala… Black … Medi… United… Ethio…  5.14     93 Novemb… Very …
## # … with abbreviated variable names ¹loc_country, ²review_date

You can see that 100g_USD is now price

Coffee data summary

summary(coffee)

##      name             roaster             roast           loc_country       
##  Length:1246        Length:1246        Length:1246        Length:1246       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     origin              price            rating      review_date       
##  Length:1246        Min.   :  0.17   Min.   :84.00   Length:1246       
##  Class :character   1st Qu.:  5.26   1st Qu.:93.00   Class :character  
##  Mode  :character   Median :  6.17   Median :93.00   Mode  :character  
##                     Mean   : 10.48   Mean   :93.31                     
##                     3rd Qu.:  9.60   3rd Qu.:94.00                     
##                     Max.   :132.28   Max.   :97.00                     
##     review         
##  Length:1246       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Notice under “price”, how the max is $132.28 compares to the average 10.48! What a huge difference. Let’s see if we can find out what that particlar coffee brand is after creating a few different graphs!

Let’s look at the top 20 rated coffee brands

First, we must isolate the top 20 rated coffee beans

# Reorder the coffee dataset base on rating
coffee <- coffee[order(-coffee$rating),]

# Isolate the top 20 highest rated brands
top20 <- head(coffee, 20)

Make a bar graph to display the top 20 rated coffee beans, where they’re from and their ratings

graph1 <- ggplot(top20, aes(x=rating, y=name, fill=origin)) +
  geom_bar(stat = "identity", width = 0.5) +
  scale_fill_brewer(palette="BrBG") +
  ggtitle("Top 20 Coffees by Rating") +
  xlab("Rating") +
  ylab("Coffee Name") +
  theme(axis.text.y = element_text(size = 8),
        axis.text.x = element_text(size = 8),
        axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12)) +
  theme(legend.position = "right") +
  geom_text(aes(label = origin), size = 3, hjust = -0.5, vjust = 0.5) +
  coord_cartesian(xlim = c(90, 100)) + #The coord_cartesian() function is used to set the x-axis limits, and the argument xlim = c(90, 100) sets the starting point of the x-axis to 90, and the end point to 100. 
  scale_x_continuous(breaks = seq(90, 100, by = 1)) #The breaks = seq(90, 100, by = 1) means that the x axis will start at 90 and count to 100 by 1.
graph1

So the top 20 coffee brands come from just 5 countries: Colombia, Ethiopia, Hawai’i, Kenya, and Panama. I want to look at the average ratings of these 5 regions and where they stand.

Filter out the data to show Colombia, Ethiopia, Hawai’i, Kenya, and Panama

chepk <- coffee %>%
  filter(origin == "Hawai'I" | origin == "Colombia" | origin == "Ethiopia" | origin == "Kenya" | origin == "Panama")

Average coffee rating of the five regions Colombia, Ethiopia, Hawai’i, Kenya, and Panama

coffee_by_origin <- chepk %>%
  group_by(origin) %>%
  summarize(average_rating = mean(rating, na.rm = TRUE)) %>%
  arrange(desc(average_rating))

Graph 2: Average ratings in descending order

graph2 <- ggplot(coffee_by_origin, aes(x = reorder(origin, average_rating), y = average_rating, fill = origin)) +
  geom_bar(stat = "identity", width = 0.5) +
  ggtitle("Top Origins by Rating") +
  xlab("Origin") +
  ylab("Average Rating") +
  theme_minimal() +
  scale_fill_brewer(palette="BrBG") +
  theme(axis.text.y = element_text(size = 8),
        axis.text.x = element_text(size = 8),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10))  +
  coord_flip() +
  theme(axis.text.x = element_text(angle = 0, hjust = 1), legend.position = "none") +
  geom_text(aes(label = round(average_rating, 1)), size = 2, hjust = -0.5, vjust = 0.5)
graph2

As the graph shows, the average rating of the coffee beans from these 5 origins are neck and neck, showing differences in decimal points. However, Panama is the leader with the average rating of 94.5 points. Impressive! So is there any correlation between ratings and prices? My assumption is, yes. The higher the rating, the more costly the beans. But let’s check it out.

Average coffee prices of the five regions Colombia, Ethiopia, Hawai’i, Kenya, and Panama

coffee_by_price <- chepk %>%
  group_by(origin) %>%
  summarize(average_price = mean(price, na.rm = TRUE)) %>%
  arrange(desc(average_price))

Graph 3: Colombia, Ethiopia, Hawai’i, Kenya, and Panama coffee prices

graph3 <- ggplot(coffee_by_price, aes(x = origin, y = average_price, fill = origin)) +
  geom_bar(stat = "identity", width = 0.5) +
  ggtitle("Top Origins by Price") +
  xlab("Origin") +
  ylab("Average Price") +
  theme_minimal() +
  scale_fill_brewer(palette="BrBG") +
  theme(axis.text.y = element_text(size = 8),
        axis.text.x = element_text(size = 8),
        axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 10))  +
  theme(axis.text.x = element_text(angle = 0, hjust = 1), legend.position = "none") +
  coord_flip() +
  geom_text(aes(label = round(average_price, 1)), size = 2, hjust = -0.5, vjust = 0.5)
graph3

Contrary to my initial assumption, besides Panama, coffee beans from Kenya, Hawai’i, Ethiopia, and Colombia, though highly rated, are not overly priced. To see the relations between ratings and prices, I put the 2 graphs next to each other.

Side by side comparison

library(cowplot)

## Warning: package 'cowplot' was built under R version 4.2.3

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:lubridate':
## 
##     stamp

plot_grid(graph2, graph3)

As presented, even though coffee from these 5 countries have very similar ratings, their average prices are different, with Panama being significantly more expensive! This is telling me that if I buy coffee from Kenya, Hawai’i, Ethiopia, and Colombia, I will get the most bang for my buck!

However, I do want to dig a little deeper and learn more about the coffee beans from these countries - which can be achieved by creating an interactive chart. Remember, I still have to find out which coffee bean costs $132.28 per 100g and from where!

Highcharter graph with detailed tooltips

library(highcharter)

## Warning: package 'highcharter' was built under R version 4.2.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(RColorBrewer)

cols <- brewer.pal(5, "BrBG")

highchart() %>%
  hc_add_series(chepk, type = "scatter",
                hcaes(x = price, y = rating, group = origin)) %>%
  hc_colors(cols) %>%
  hc_title(text = "Coffee beans from") %>%
  hc_yAxis(title = list(text = "Rating")) %>%
  hc_xAxis(title = list(text = "Price per 100g")) %>%
  hc_chart(style = list(fontFamily = "Georgia")) %>%
  hc_legend(align = "center", 
            verticalAlign = "top") %>%
  hc_add_theme(hc_theme_darkunica()) %>%
  hc_tooltip(shared = TRUE,
                        pointFormat = "Coffee Name: {point.name}<br> Roast: {point.roast}<br> Price (per 100g): {point.price}<br> Rating: {point.rating}<br> Review: {point.review} Roaster: {point.roaster}<br> Roasted in: {point.loc_country}")

Unsurprisingly, Panama takes home the most expensive coffee bean: Mama Cata Mokkita with the high rating of 97 points! When I look at the graph more closely, I noticed that most of the expensive coffee beans come from Panama, and most of them have “Geisha” in their name. How intriguing!

After some research, I learned that coffee beans from Panama, specifically the Geisha species, are priced quite high due to:

Specific cultivation conditions (ideally 1,450 to 1,700 meters above sea level - that’s about 4,757 to 5,577 feet)
Their ower photosynthesis efficiency (low yield)
Must be hand-picked at the right time
The beans must be dried for 8 days (Toby, www.coffeewithconscience.org)

A little silly fact about Geisha coffee bean, its correct spelling should be GESHA (for the Gesha forest in Ethiopia where the coffee originates from); unfortunately, it was misunderstood to Geisha - which means Japanese Entertainer.

But enough about Geisha, let’s talk Mama Cata Mokkita - a non-Geisha variety that took the crown for most expensive. Mama Cata Mokkita is a one-of-a-kind Mokka type variety grown only at Mama Cata estate. It was discovered by Jose David Garrido (owner of Mama Cata estate) when he was inspecting his farm for unique plants that could potentially produce an even more amazing cup of coffee - when he found one plant that had smaller than usual cherries. After a process of reviewing, he determined that coffee from this plant was special and decided to use the seed to plant more. In the first year of planting, the plants did not take well - which promted Jose David Garrido to move the planting location to a drier area in higher altitude with soil similar to that of the varietal’s origin - Yemen. In addition to the difficult growing condition, low yielding (producing at half the rate of Geisha’s), Mama Cata Mokkita beans must be dried slowly over the course of a month inside a greenhouse. These are some of the reasons why they are that expensive. (https://paradiseroasters.com/products/panama-mokkita-natural-mama-cata-estate?variant=39667399884905)

Take away

I learned more about coffee with this project than ever, yet I feel like I only just scratched a tiny bit at the surface. I am taken aback by just how many varieties of coffee beans there are, the different requirements for coffee cultivation, the climate, soil, elevation… The world of coffee is just as complex as wine; and the reviews/descriptions of coffee beans read like those of wine as well.

There is one thing that surprises me, Vietnam did not make the list, despite being one of the top coffee exporters and consumers in the world. I think that it’s probably because this dataset focuses on quality more than quantity. Not to say that Vietnamese coffee is bad, but perhaps there hasn’t been a varietal that makes the cut yet. After all, the lowest rating for a coffee bean in this dataset is 84. Maybe one day! Until then, I will continue to display my biased love for Vietnamese coffee.

Works Cited

Deshmukh, Anshool. “Which Country Produced the Most Coffee in 2020?” World Economic Forum, 5 Oct. 2021, https://www.weforum.org/agenda/2021/10/which-country-produced-the-most-coffee-in-2020/.

“Panama Mokkita Natural -Mama Cata Estate.” Paradise Coffee Roasters, https://paradiseroasters.com/products/panama-mokkita-natural-mama-cata-estate?variant=39667399884905.

Pendergrast, Mark. Uncommon Grounds: The History of Coffee and How It Transformed Our World. Basic Books, 2019.

Toby. “Why Are Panama Geisha Coffee Beans so Expensive?” Coffee With Conscience, 26 Mar. 2023, https://www.coffeewithconscience.org/why-are-panama-geisha-coffee-beans-so-expensive/.

Ukers, William H. All about Coffee. Tea and Coffee Trade Journal Co., 1922.

BONUS: Making a Word Cloud from the reviews

Load the necessary libraries

library(RColorBrewer)
library(wordcloud)

## Warning: package 'wordcloud' was built under R version 4.2.3

library(wordcloud2)

## Warning: package 'wordcloud2' was built under R version 4.2.3

library(tm)

## Warning: package 'tm' was built under R version 4.2.3

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(SnowballC)

Calculate Corpus

coffeecloud <-Corpus(VectorSource(chepk$review))

Data Cleaning and Wrangling

coffeecloud <-coffeecloud %>%
  tm_map(tolower) %>%
  tm_map(removeWords,c("mouthfeel","cup","and","note","that","with","will","also","aroma", "finish", "structure", "structur", "the")) %>%
  tm_map(removePunctuation) %>%
  tm_map(stemDocument)

## Warning in tm_map.SimpleCorpus(., tolower): transformation drops documents

## Warning in tm_map.SimpleCorpus(., removeWords, c("mouthfeel", "cup", "and", :
## transformation drops documents

## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(., stemDocument): transformation drops documents

Make a cloud with words

wordcloud(words = coffeecloud, min.freq = 20,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(11, "BrBG"))

Coffee Coffee Coffee

My Nguyen

2023-04-19