COFFEE - Mark Pendergrast wrote Uncommon Grounds book, is one of most valuable agricultural commodities of the world. And coffee is the third most consumed beverage, besides water and tea - stated Anshool Deshmukh. Nowadays, over 70 countries worldwide produce this spectacular plant/beverage, some of the biggest names in coffee production are: Brazil, Vietnam, Colombia, Indonesia, Ethiopia - to name a few. They all belong to the “Bean Belt” (located between the Tropic of Cancer and the Tropic of Capricorn.).
Image:
Visual Capitalist
So where did coffee come from? According to William Harrison Ukers, it’s commonly accepted that coffee originated from Abissinya (presently known as Ethiopia), and Arabia. Then the coffee plant were transported to different parts of the world starting in the 16th century. The rest, is history.
As a coffee addict that hardly go a day without consuming at least one cup, I know too little about coffee. That’s the subjective reason why I chose this dataset, to learn a little more about such a powerful beverage.This dataset was web scraped from https://www.coffeereview.com/. It comes with multiple very useful variables including:
Just by looking at the variables, there are many routes of data exploration to be had, for me, I wanted to explore the top 20 rated coffee beans, where they come from; then take a little deeper dive into those origins. Therefore, I needed to clean things up:
### Load the necessary libraries
library(tidyverse)
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
# Set working directory
setwd("C:/Users/myngu/OneDrive/Montgomery College/Spring 2023/DATA 110/Data Sets")
coffee <- read_csv("simplified_coffee.csv")
## Rows: 1246 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, roaster, roast, loc_country, origin, review_date, review
## dbl (2): 100g_USD, rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(coffee)
## # A tibble: 6 × 9
## name roaster roast loc_c…¹ origin 100g_…² rating revie…³ review
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 Ethiopia Shakiso M… Revel … Medi… United… Ethio… 4.7 92 Novemb… Crisp…
## 2 Ethiopia Suke Quto Roast … Medi… United… Ethio… 4.19 92 Novemb… Delic…
## 3 Ethiopia Gedeb Hal… Big Cr… Medi… United… Ethio… 4.85 94 Novemb… Deepl…
## 4 Ethiopia Kayon Mou… Red Ro… Light United… Ethio… 5.14 93 Novemb… Delic…
## 5 Ethiopia Gelgelu N… Willou… Medi… United… Ethio… 3.97 93 Novemb… High-…
## 6 Ethiopia Hambela A… Black … Medi… United… Ethio… 5.14 93 Novemb… Very …
## # … with abbreviated variable names ¹loc_country, ²`100g_USD`, ³review_date
coffee <- coffee %>%
rename(price = "100g_USD")
head(coffee)
## # A tibble: 6 × 9
## name roaster roast loc_c…¹ origin price rating revie…² review
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 Ethiopia Shakiso Mor… Revel … Medi… United… Ethio… 4.7 92 Novemb… Crisp…
## 2 Ethiopia Suke Quto Roast … Medi… United… Ethio… 4.19 92 Novemb… Delic…
## 3 Ethiopia Gedeb Halo … Big Cr… Medi… United… Ethio… 4.85 94 Novemb… Deepl…
## 4 Ethiopia Kayon Mount… Red Ro… Light United… Ethio… 5.14 93 Novemb… Delic…
## 5 Ethiopia Gelgelu Nat… Willou… Medi… United… Ethio… 3.97 93 Novemb… High-…
## 6 Ethiopia Hambela Ala… Black … Medi… United… Ethio… 5.14 93 Novemb… Very …
## # … with abbreviated variable names ¹loc_country, ²review_date
You can see that 100g_USD is now price
summary(coffee)
## name roaster roast loc_country
## Length:1246 Length:1246 Length:1246 Length:1246
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## origin price rating review_date
## Length:1246 Min. : 0.17 Min. :84.00 Length:1246
## Class :character 1st Qu.: 5.26 1st Qu.:93.00 Class :character
## Mode :character Median : 6.17 Median :93.00 Mode :character
## Mean : 10.48 Mean :93.31
## 3rd Qu.: 9.60 3rd Qu.:94.00
## Max. :132.28 Max. :97.00
## review
## Length:1246
## Class :character
## Mode :character
##
##
##
Notice under “price”, how the max is $132.28 compares to the average 10.48! What a huge difference. Let’s see if we can find out what that particlar coffee brand is after creating a few different graphs!
# Reorder the coffee dataset base on rating
coffee <- coffee[order(-coffee$rating),]
# Isolate the top 20 highest rated brands
top20 <- head(coffee, 20)
graph1 <- ggplot(top20, aes(x=rating, y=name, fill=origin)) +
geom_bar(stat = "identity", width = 0.5) +
scale_fill_brewer(palette="BrBG") +
ggtitle("Top 20 Coffees by Rating") +
xlab("Rating") +
ylab("Coffee Name") +
theme(axis.text.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.title.x = element_text(size=12),
axis.title.y = element_text(size=12)) +
theme(legend.position = "right") +
geom_text(aes(label = origin), size = 3, hjust = -0.5, vjust = 0.5) +
coord_cartesian(xlim = c(90, 100)) + #The coord_cartesian() function is used to set the x-axis limits, and the argument xlim = c(90, 100) sets the starting point of the x-axis to 90, and the end point to 100.
scale_x_continuous(breaks = seq(90, 100, by = 1)) #The breaks = seq(90, 100, by = 1) means that the x axis will start at 90 and count to 100 by 1.
graph1
So the top 20 coffee brands come from just 5 countries: Colombia, Ethiopia, Hawai’i, Kenya, and Panama. I want to look at the average ratings of these 5 regions and where they stand.
chepk <- coffee %>%
filter(origin == "Hawai'I" | origin == "Colombia" | origin == "Ethiopia" | origin == "Kenya" | origin == "Panama")
coffee_by_origin <- chepk %>%
group_by(origin) %>%
summarize(average_rating = mean(rating, na.rm = TRUE)) %>%
arrange(desc(average_rating))
graph2 <- ggplot(coffee_by_origin, aes(x = reorder(origin, average_rating), y = average_rating, fill = origin)) +
geom_bar(stat = "identity", width = 0.5) +
ggtitle("Top Origins by Rating") +
xlab("Origin") +
ylab("Average Rating") +
theme_minimal() +
scale_fill_brewer(palette="BrBG") +
theme(axis.text.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.title.x = element_text(size=10),
axis.title.y = element_text(size=10)) +
coord_flip() +
theme(axis.text.x = element_text(angle = 0, hjust = 1), legend.position = "none") +
geom_text(aes(label = round(average_rating, 1)), size = 2, hjust = -0.5, vjust = 0.5)
graph2
As the graph shows, the average rating of the coffee beans from these 5 origins are neck and neck, showing differences in decimal points. However, Panama is the leader with the average rating of 94.5 points. Impressive! So is there any correlation between ratings and prices? My assumption is, yes. The higher the rating, the more costly the beans. But let’s check it out.
coffee_by_price <- chepk %>%
group_by(origin) %>%
summarize(average_price = mean(price, na.rm = TRUE)) %>%
arrange(desc(average_price))
graph3 <- ggplot(coffee_by_price, aes(x = origin, y = average_price, fill = origin)) +
geom_bar(stat = "identity", width = 0.5) +
ggtitle("Top Origins by Price") +
xlab("Origin") +
ylab("Average Price") +
theme_minimal() +
scale_fill_brewer(palette="BrBG") +
theme(axis.text.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.title.x = element_text(size = 10),
axis.title.y = element_text(size = 10)) +
theme(axis.text.x = element_text(angle = 0, hjust = 1), legend.position = "none") +
coord_flip() +
geom_text(aes(label = round(average_price, 1)), size = 2, hjust = -0.5, vjust = 0.5)
graph3
Contrary to my initial assumption, besides Panama, coffee beans from Kenya, Hawai’i, Ethiopia, and Colombia, though highly rated, are not overly priced. To see the relations between ratings and prices, I put the 2 graphs next to each other.
library(cowplot)
## Warning: package 'cowplot' was built under R version 4.2.3
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:lubridate':
##
## stamp
plot_grid(graph2, graph3)
As presented, even though coffee from these 5 countries have very similar ratings, their average prices are different, with Panama being significantly more expensive! This is telling me that if I buy coffee from Kenya, Hawai’i, Ethiopia, and Colombia, I will get the most bang for my buck!
However, I do want to dig a little deeper and learn more about the coffee beans from these countries - which can be achieved by creating an interactive chart. Remember, I still have to find out which coffee bean costs $132.28 per 100g and from where!
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.2.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(RColorBrewer)
cols <- brewer.pal(5, "BrBG")
highchart() %>%
hc_add_series(chepk, type = "scatter",
hcaes(x = price, y = rating, group = origin)) %>%
hc_colors(cols) %>%
hc_title(text = "Coffee beans from") %>%
hc_yAxis(title = list(text = "Rating")) %>%
hc_xAxis(title = list(text = "Price per 100g")) %>%
hc_chart(style = list(fontFamily = "Georgia")) %>%
hc_legend(align = "center",
verticalAlign = "top") %>%
hc_add_theme(hc_theme_darkunica()) %>%
hc_tooltip(shared = TRUE,
pointFormat = "Coffee Name: {point.name}<br> Roast: {point.roast}<br> Price (per 100g): {point.price}<br> Rating: {point.rating}<br> Review: {point.review} Roaster: {point.roaster}<br> Roasted in: {point.loc_country}")
Unsurprisingly, Panama takes home the most expensive coffee bean: Mama Cata Mokkita with the high rating of 97 points! When I look at the graph more closely, I noticed that most of the expensive coffee beans come from Panama, and most of them have “Geisha” in their name. How intriguing!
After some research, I learned that coffee beans from Panama, specifically the Geisha species, are priced quite high due to:
A little silly fact about Geisha coffee bean, its correct spelling should be GESHA (for the Gesha forest in Ethiopia where the coffee originates from); unfortunately, it was misunderstood to Geisha - which means Japanese Entertainer.
But enough about Geisha, let’s talk Mama Cata Mokkita - a non-Geisha variety that took the crown for most expensive. Mama Cata Mokkita is a one-of-a-kind Mokka type variety grown only at Mama Cata estate. It was discovered by Jose David Garrido (owner of Mama Cata estate) when he was inspecting his farm for unique plants that could potentially produce an even more amazing cup of coffee - when he found one plant that had smaller than usual cherries. After a process of reviewing, he determined that coffee from this plant was special and decided to use the seed to plant more. In the first year of planting, the plants did not take well - which promted Jose David Garrido to move the planting location to a drier area in higher altitude with soil similar to that of the varietal’s origin - Yemen. In addition to the difficult growing condition, low yielding (producing at half the rate of Geisha’s), Mama Cata Mokkita beans must be dried slowly over the course of a month inside a greenhouse. These are some of the reasons why they are that expensive. (https://paradiseroasters.com/products/panama-mokkita-natural-mama-cata-estate?variant=39667399884905)
I learned more about coffee with this project than ever, yet I feel like I only just scratched a tiny bit at the surface. I am taken aback by just how many varieties of coffee beans there are, the different requirements for coffee cultivation, the climate, soil, elevation… The world of coffee is just as complex as wine; and the reviews/descriptions of coffee beans read like those of wine as well.
There is one thing that surprises me, Vietnam did not make the list, despite being one of the top coffee exporters and consumers in the world. I think that it’s probably because this dataset focuses on quality more than quantity. Not to say that Vietnamese coffee is bad, but perhaps there hasn’t been a varietal that makes the cut yet. After all, the lowest rating for a coffee bean in this dataset is 84. Maybe one day! Until then, I will continue to display my biased love for Vietnamese coffee.
Deshmukh, Anshool. “Which Country Produced the Most Coffee in 2020?” World Economic Forum, 5 Oct. 2021, https://www.weforum.org/agenda/2021/10/which-country-produced-the-most-coffee-in-2020/.
“Panama Mokkita Natural -Mama Cata Estate.” Paradise Coffee Roasters, https://paradiseroasters.com/products/panama-mokkita-natural-mama-cata-estate?variant=39667399884905.
Pendergrast, Mark. Uncommon Grounds: The History of Coffee and How It Transformed Our World. Basic Books, 2019.
Toby. “Why Are Panama Geisha Coffee Beans so Expensive?” Coffee With Conscience, 26 Mar. 2023, https://www.coffeewithconscience.org/why-are-panama-geisha-coffee-beans-so-expensive/.
Ukers, William H. All about Coffee. Tea and Coffee Trade Journal Co., 1922.
library(RColorBrewer)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.2.3
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.2.3
library(tm)
## Warning: package 'tm' was built under R version 4.2.3
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(SnowballC)
coffeecloud <-Corpus(VectorSource(chepk$review))
coffeecloud <-coffeecloud %>%
tm_map(tolower) %>%
tm_map(removeWords,c("mouthfeel","cup","and","note","that","with","will","also","aroma", "finish", "structure", "structur", "the")) %>%
tm_map(removePunctuation) %>%
tm_map(stemDocument)
## Warning in tm_map.SimpleCorpus(., tolower): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removeWords, c("mouthfeel", "cup", "and", :
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stemDocument): transformation drops documents
wordcloud(words = coffeecloud, min.freq = 20,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(11, "BrBG"))