The report is also available on RPubs.
Introduction
In this report I am going to analyze the Coffee Ratings dataset which was released on the 28th week of 2020 on Tidy Tuesday’ GitHub page. The data set contains information about both Arabica and Robusta beans across many countries and professionally rated on a 0-100 scale. Many sorts of scoring are available for each producer such as based on the acidity, sweetness, fragrance and the balance of the coffee which can provide interesting insights o the differences between coffee producing countries. In this analysis I will focus on country level analysis of the coffee ratings and the different dimensions of assessment of the coffee quality as well as its relationship with the growing altitude. With the help of descriptive analytics and data visualization I will try to find answer for the following question: where should I travel if I want to drink the best coffee in the world?
The dataset
We can import the coffee ratings data set using the tidytuesdayR package. The dataset was converted to a data.table object in order to make the subsequent data cleaning and feature engineering more efficient.
# Clear environment
rm(list=ls())
# Import libraries for data management
library(tidytuesdayR)
library(data.table)
library(ggplot2)
library(ggrepel)
library(sf)
library(tidyverse)
library(kableExtra)
# Import dataset
df <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')
# Convert to data table
df <- data.table(df)Let’s take a quick look at the imported dataset.
str(df)## Classes 'data.table' and 'data.frame': 1339 obs. of 43 variables:
## $ total_cup_points : num 90.6 89.9 89.8 89 88.8 ...
## $ species : chr "Arabica" "Arabica" "Arabica" "Arabica" ...
## $ owner : chr "metad plc" "metad plc" "grounds for health admin" "yidnekachew dabessa" ...
## $ country_of_origin : chr "Ethiopia" "Ethiopia" "Guatemala" "Ethiopia" ...
## $ farm_name : chr "metad plc" "metad plc" "san marcos barrancas \"san cristobal cuch" "yidnekachew dabessa coffee plantation" ...
## $ lot_number : chr NA NA NA NA ...
## $ mill : chr "metad plc" "metad plc" NA "wolensu" ...
## $ ico_number : chr "2014/2015" "2014/2015" NA NA ...
## $ company : chr "metad agricultural developmet plc" "metad agricultural developmet plc" NA "yidnekachew debessa coffee plantation" ...
## $ altitude : chr "1950-2200" "1950-2200" "1600 - 1800 m" "1800-2200" ...
## $ region : chr "guji-hambela" "guji-hambela" NA "oromia" ...
## $ producer : chr "METAD PLC" "METAD PLC" NA "Yidnekachew Dabessa Coffee Plantation" ...
## $ number_of_bags : num 300 300 5 320 300 100 100 300 300 50 ...
## $ bag_weight : chr "60 kg" "60 kg" "1" "60 kg" ...
## $ in_country_partner : chr "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
## $ harvest_year : chr "2014" "2014" NA "2014" ...
## $ grading_date : chr "April 4th, 2015" "April 4th, 2015" "May 31st, 2010" "March 26th, 2015" ...
## $ owner_1 : chr "metad plc" "metad plc" "Grounds for Health Admin" "Yidnekachew Dabessa" ...
## $ variety : chr NA "Other" "Bourbon" NA ...
## $ processing_method : chr "Washed / Wet" "Washed / Wet" NA "Natural / Dry" ...
## $ aroma : num 8.67 8.75 8.42 8.17 8.25 8.58 8.42 8.25 8.67 8.08 ...
## $ flavor : num 8.83 8.67 8.5 8.58 8.5 8.42 8.5 8.33 8.67 8.58 ...
## $ aftertaste : num 8.67 8.5 8.42 8.42 8.25 8.42 8.33 8.5 8.58 8.5 ...
## $ acidity : num 8.75 8.58 8.42 8.42 8.5 8.5 8.5 8.42 8.42 8.5 ...
## $ body : num 8.5 8.42 8.33 8.5 8.42 8.25 8.25 8.33 8.33 7.67 ...
## $ balance : num 8.42 8.42 8.42 8.25 8.33 8.33 8.25 8.5 8.42 8.42 ...
## $ uniformity : num 10 10 10 10 10 10 10 10 9.33 10 ...
## $ clean_cup : num 10 10 10 10 10 10 10 10 10 10 ...
## $ sweetness : num 10 10 10 10 10 10 10 9.33 9.33 10 ...
## $ cupper_points : num 8.75 8.58 9.25 8.67 8.58 8.33 8.5 9 8.67 8.5 ...
## $ moisture : num 0.12 0.12 0 0.11 0.12 0.11 0.11 0.03 0.03 0.1 ...
## $ category_one_defects : num 0 0 0 0 0 0 0 0 0 0 ...
## $ quakers : num 0 0 0 0 0 0 0 0 0 0 ...
## $ color : chr "Green" "Green" NA "Green" ...
## $ category_two_defects : num 0 1 0 2 2 1 0 0 0 4 ...
## $ expiration : chr "April 3rd, 2016" "April 3rd, 2016" "May 31st, 2011" "March 25th, 2016" ...
## $ certification_body : chr "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
## $ certification_address: chr "309fcf77415a3661ae83e027f7e5f05dad786e44" "309fcf77415a3661ae83e027f7e5f05dad786e44" "36d0d00a3724338ba7937c52a378d085f2172daa" "309fcf77415a3661ae83e027f7e5f05dad786e44" ...
## $ certification_contact: chr "19fef5a731de2db57d16da10287413f5f99bc2dd" "19fef5a731de2db57d16da10287413f5f99bc2dd" "0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660" "19fef5a731de2db57d16da10287413f5f99bc2dd" ...
## $ unit_of_measurement : chr "m" "m" "m" "m" ...
## $ altitude_low_meters : num 1950 1950 1600 1800 1950 ...
## $ altitude_high_meters : num 2200 2200 1800 2200 2200 NA NA 1700 1700 1850 ...
## $ altitude_mean_meters : num 2075 2075 1700 2000 2075 ...
## - attr(*, ".internal.selfref")=<externalptr>
As the dataset was collected with web scraping there are obviously lots of fields that are out of our interest. Let’s get rid of them first.
# Drop unnecessary fields
drops <- c("lot_number", "ico_number", "certification_body", "certification_address", "certification_contact", "altitude", "mill", "farm_name", "company_name", "bag_weight", "altitude_low_meters", "altitude_high_meters", "in_country_partner", "unit_of_measurement", "owner_1", "number_of_bags", "category_one_defects", "quakers", "expiration", "category_two_defects", "harvest_year", "company", "producer")
df <- df[, (drops) := NULL]
rm(drops)Identifiers and different certificate numbers can be dropped. As the measure of the altitude we are going to use the altitude_mean_meters which is the mean of the higher and lower altitude limits and already converted to meters. We are not going to drill down to producer or farm level so these fields can be also removed. Also the number of coffee bags used for the testing is not relevant for us so we will remove it.
Technical preparation
In order to unify the outlook of our graps I created a custom theme in advance that we are going to add to each graph.
source("theme_viki.R")Also for the explanatory data analysis custom graph functions were created in advance.
# Histogram
histograms <- function( data, x_var , x_lab, bin ){
# n = nrow(df)
ggplot( data , aes(x = x_var)) +
geom_histogram( aes(y = (..count..)/sum(..count..)), binwidth = bin, fill="#238b45", color = 'gray50', alpha = 0.8, na.rm = T) +
# stat_function(fun = function(x) dnorm(x, mean = mean(x_var, na.rm = T), sd = sd(x_var, na.rm = T)) * n * bin, color = "darkred", size = 1, na.rm = T) +
labs(y = 'Percent',x = x_lab ) +
scale_y_continuous(labels = scales::percent_format(1)) +
theme_viki()
}
# Boxplot
boxplots <- function( data, x_var , y_var, x_lab , y_lab ){
ggplot(data, aes(x = factor(x_var), y = y_var)) +
geom_boxplot( fill="#238b45", alpha = 0.8) +
stat_boxplot(geom = "errorbar", width = 0.8, size = 0.3, na.rm=T) +
labs(y = y_lab,x = x_lab ) +
theme_viki()
}Data cleaning
Now we can start to explore and clean our dataset. First let’s take a look at the numeric variables.
# Import library
library(modelsummary)
# Summary table
datasummary_skim(df, 'numeric') | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max | ||
|---|---|---|---|---|---|---|---|---|
| total_cup_points | 180 | 0 | 82.1 | 3.5 | 0.0 | 82.5 | 90.6 | |
| aroma | 33 | 0 | 7.6 | 0.4 | 0.0 | 7.6 | 8.8 | |
| flavor | 35 | 0 | 7.5 | 0.4 | 0.0 | 7.6 | 8.8 | |
| aftertaste | 35 | 0 | 7.4 | 0.4 | 0.0 | 7.4 | 8.7 | |
| acidity | 31 | 0 | 7.5 | 0.4 | 0.0 | 7.6 | 8.8 | |
| body | 33 | 0 | 7.5 | 0.4 | 0.0 | 7.5 | 8.6 | |
| balance | 33 | 0 | 7.5 | 0.4 | 0.0 | 7.5 | 8.8 | |
| uniformity | 10 | 0 | 9.8 | 0.6 | 0.0 | 10.0 | 10.0 | |
| clean_cup | 11 | 0 | 9.8 | 0.8 | 0.0 | 10.0 | 10.0 | |
| sweetness | 17 | 0 | 9.9 | 0.6 | 0.0 | 10.0 | 10.0 | |
| cupper_points | 42 | 0 | 7.5 | 0.5 | 0.0 | 7.5 | 10.0 | |
| moisture | 23 | 0 | 0.1 | 0.0 | 0.0 | 0.1 | 0.3 | |
| altitude_mean_meters | 212 | 17 | 1775.0 | 8668.6 | 1.0 | 1310.6 | 190164.0 |
Our main field of interest is the total_cup_points which is the total rating of the coffee and is measured in a 0 to 100 scale. Other 11 individual coffee quality grade measures are also available which are the following: aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean_cup, sweetness, cupper_points, moisture. There are no missings and the range seems fine, so no additional cleaning step is needed for these fields. With the altitude, on the other hand, it seems that there are some data issues. First of all it has extreme values above 190 thousand meters which is higher than the peak of the Mount Everest (8848 meters). Even if we remove these observations from the edge of the distribution, there seems to remain some observations with higher than 6000 meters of growing altitude. If we check some external statistics about coffee producing, it seems that growing altitude above 3000 meters is not likely, hence we I will exclude observations with mean altitude above 3000 meters from the analysis.
The next step is to take a look at the categorical fields.
# Summary table
datasummary_skim(df, 'categorical')| N | % | ||
|---|---|---|---|
| species | Arabica | 1311 | 97.9 |
| Robusta | 28 | 2.1 | |
| country_of_origin | Brazil | 132 | 9.9 |
| Burundi | 2 | 0.1 | |
| China | 16 | 1.2 | |
| Colombia | 183 | 13.7 | |
| Costa Rica | 51 | 3.8 | |
| Cote d?Ivoire | 1 | 0.1 | |
| Ecuador | 3 | 0.2 | |
| El Salvador | 21 | 1.6 | |
| Ethiopia | 44 | 3.3 | |
| Guatemala | 181 | 13.5 | |
| Haiti | 6 | 0.4 | |
| Honduras | 53 | 4.0 | |
| India | 14 | 1.0 | |
| Indonesia | 20 | 1.5 | |
| Japan | 1 | 0.1 | |
| Kenya | 25 | 1.9 | |
| Laos | 3 | 0.2 | |
| Malawi | 11 | 0.8 | |
| Mauritius | 1 | 0.1 | |
| Mexico | 236 | 17.6 | |
| Myanmar | 8 | 0.6 | |
| Nicaragua | 26 | 1.9 | |
| Panama | 4 | 0.3 | |
| Papua New Guinea | 1 | 0.1 | |
| Peru | 10 | 0.7 | |
| Philippines | 5 | 0.4 | |
| Rwanda | 1 | 0.1 | |
| Taiwan | 75 | 5.6 | |
| Tanzania, United Republic Of | 40 | 3.0 | |
| Thailand | 32 | 2.4 | |
| Uganda | 36 | 2.7 | |
| United States | 10 | 0.7 | |
| United States (Hawaii) | 73 | 5.5 | |
| United States (Puerto Rico) | 4 | 0.3 | |
| Vietnam | 8 | 0.6 | |
| Zambia | 1 | 0.1 | |
| NA | 1 | 0.1 | |
| variety | Arusha | 6 | 0.4 |
| Blue Mountain | 2 | 0.1 | |
| Bourbon | 226 | 16.9 | |
| Catimor | 20 | 1.5 | |
| Catuai | 74 | 5.5 | |
| Caturra | 256 | 19.1 | |
| Ethiopian Heirlooms | 1 | 0.1 | |
| Ethiopian Yirgacheffe | 2 | 0.1 | |
| Gesha | 12 | 0.9 | |
| Hawaiian Kona | 44 | 3.3 | |
| Java | 2 | 0.1 | |
| Mandheling | 3 | 0.2 | |
| Marigojipe | 1 | 0.1 | |
| Moka Peaberry | 1 | 0.1 | |
| Mundo Novo | 33 | 2.5 | |
| Other | 110 | 8.2 | |
| Pacamara | 8 | 0.6 | |
| Pacas | 13 | 1.0 | |
| Pache Comun | 1 | 0.1 | |
| Peaberry | 5 | 0.4 | |
| Ruiru 11 | 2 | 0.1 | |
| SL14 | 17 | 1.3 | |
| SL28 | 15 | 1.1 | |
| SL34 | 8 | 0.6 | |
| Sulawesi | 1 | 0.1 | |
| Sumatra | 3 | 0.2 | |
| Sumatra Lintong | 1 | 0.1 | |
| Typica | 211 | 15.8 | |
| Yellow Bourbon | 35 | 2.6 | |
| NA | 226 | 16.9 | |
| processing_method | Natural / Dry | 258 | 19.3 |
| Other | 26 | 1.9 | |
| Pulped natural / honey | 14 | 1.0 | |
| Semi-washed / Semi-pulped | 56 | 4.2 | |
| Washed / Wet | 815 | 60.9 | |
| NA | 170 | 12.7 | |
| color | Blue-Green | 85 | 6.3 |
| Bluish-Green | 114 | 8.5 | |
| Green | 870 | 65.0 | |
| None | 52 | 3.9 | |
| NA | 218 | 16.3 |
97.9% of the coffee sample tested was Arabica. Regarding the country of origin, it is important to convert them to standard names that we can later use for the geocoding. I will also introduce a new categorization for the continent. There is one record with missing country which I will remove from the dataset. In case of the variety of the beans, there are a lot of small categories with only a few observations which will be grouped together to a separate ‘Other’ category. Finally, the grading date is in date format. For the later visualizations I will extract the year using the lubridate package.
# Import library
library(lubridate)
# Create unique ID
df$ID <- seq_along(df[[1]])
# Correct country names
df[, country_of_origin := ifelse(country_of_origin == 'Cote d?Ivoire', "Ivory Coast", country_of_origin)]
df[, country_of_origin := ifelse(country_of_origin == 'Tanzania, United Republic Of', "Tanzania", country_of_origin)]
df[, country_of_origin := ifelse(country_of_origin == 'United States (Puerto Rico)', "Puerto Rico", country_of_origin)]
df[, country_of_origin := ifelse(country_of_origin == 'United States (Hawaii)', "Hawaii", country_of_origin)]
df[, country_of_origin := ifelse(country_of_origin == 'United States', "USA", country_of_origin)]
# Add continent
africa <- c("Ethiopia","Guatemala","Uganda","Kenya","Tanzania","Burundi","Rwanda","Malawi","Zambia","Mauritius","Ivory Coast")
north_america <- c("USA","Hawaii","Puerto Rico")
south_and_central_america <- c("Costa Rica","Mexico","Brazil","Honduras","Colombia","Panama","El Salvador","Nicaragua","Ecuador","Haiti","Peru")
asia <- c("Indonesia","China","Taiwan","Thailand","Papua New Guinea","Japan","Vietnam","Philippines","Laos","Myanmar","India")
df[, continent_of_origin := fcase(
country_of_origin %in% africa, 'Africa',
country_of_origin %in% north_america, 'North America',
country_of_origin %in% south_and_central_america, 'South and Central America',
country_of_origin %in% asia, 'Asia')]
# Put small varierty categories (with below 10 observations) to 'Other'
variety_group <- df[!is.na(variety), .(count = .N), by = .(variety)][count < 10]
df[, variety_grouped := ifelse(variety %in% variety_group$variety, 'Other', variety)]
# Drop observations with missing data
df <- df[!is.na(country_of_origin)] # 1 obs.
df <- df[total_cup_points != 0] # 1 obs.
# Extract year from grading date
df[, grading_date_year := year(mdy(grading_date))]
# Exclude observations with 3000+ meters altitude
df <- df[altitude_mean_meters < 3000]
# Metric list
rating_metrics <- c("aroma", "flavor", "aftertaste","acidity", "body", "balance", "uniformity", "clean_cup", "sweetness", "cupper_points", "moisture")
rm(variety_group, africa, north_america, south_and_central_america, asia)Explanatory data analysis
Before we start to create analyze the coffee ratings, let’s take a look at the distributions of our variables of interest.
# Histograms
histograms(df, df$total_cup_points, "Total cup pionts", 1) +
annotate("text", x = 61, y = 0.015, label = "Guatemala")histograms(df, df$altitude_mean_meters, "Mean altitude (meters)", 100)In case of the total coffee ratings, there are 7 observations on the left edge of the distributions which have below 70 score. Regarding the altitude there seems to be some coffee farms where the growing altitude is below 100 meters above the sea level. These vast majority of these farms are located in Brazil so we cannot exclude that there are farms at such low altitude, hence I kept these observations in the dataset.
Let’s also compare the distributions of the different coffee rating metrics. For this, I will use the ggridges package which enables us to create nice ridgeline plots.
# Import library
library(ggridges)
# Create long dataset
df_long <- melt(df[, .SD, .SDcols = c("ID", rating_metrics)], id.vars = "ID",
measure.vars = rating_metrics)
ggplot( df_long, aes(y = variable, x = value, fill = variable, color = "gray10") ) +
geom_density_ridges(scale = 1.5, alpha = 0.8, show.legend = F) +
labs(title = 'Distributions of coffee rating metrics') +
scale_fill_viridis_d() +
scale_color_viridis_d() +
theme_viki() +
theme(
axis.title.y = element_blank(),
axis.title.x = element_blank())rm(df_long)The distributions of the coffee quality metrics can be separated into three groups: those centered around 7.5, those with very high ratings around 10 and the moisture of the coffee which has a distribution with the lowest values.
If we check the correlation heatmap we can see that as expected all the rating metrics have strong positive correlation with the total rating except for the moisture. According to this, it is likely that higher moisture content results in lower coffee quality.
# Import library
library(GGally)
# Correlations
ggcorr(df[, .SD, .SDcols = c( "total_cup_points", rating_metrics)], hjust = .85, size = 3,layout.exp=2)We can also take a look at the boxplots of the total rating by the categorical fields.
# Boxplots
boxplots(df, df$species, df$total_cup_points, "Species", "Total cup points")boxplots(df, df$country_of_origin, df$total_cup_points, "Country of origin", "Total cup points") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))boxplots(df, df$variety_grouped, df$total_cup_points, "Variety", "Total cup points") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))boxplots(df, df$color, df$total_cup_points, "Color", "Total cup points")As we saw earlier, the vast majority of the sample contains Arabica species hence we cannot really draw conclusions from comparing their boxplots. Regarding the split by countries farms that are located in Guatemala, Nicaragua, Mexico and Honduras have coffee ratings below 70. On the other hand, the one with the highest rating is also located in Guatemala.
Now that we are finished with the primary exploration of our coffee dataset we can start to draw further insights from our data.
Dataviz
Country of origination
First, let’s take a look at the main coffee producer countries by the number of ratings. Latin American countries hold the highest number of ratings where Mexico is leading with 250+ ratings. We can see that just a very few countries produce Robusta species which includes Uganda, India, Ecuador and the USA. Also there are 9 countries that have only one coffee rating (Ivory Coast, Mauritius, Zambia, Rwanda, Japan, Papua New Guinea).
df_country <- df[, .(count = .N) , by = .(country_of_origin, species)][order(-count)]
df_country_rank <- df[, .(count_tot = .N) , by = .(country_of_origin)][order(-count_tot)][, rank := as.numeric(row.names(.SD))]
df_country <- left_join(df_country, df_country_rank, by="country_of_origin")
ggplot( df_country , aes(x = reorder(country_of_origin, -rank), y = count, fill = species) ) +
geom_bar(stat = "identity" ) +
labs(title = 'Number of ratings by country and species', y = "Number of observations") +
scale_fill_manual("legend", values = c( "#238b45", "#2c7fb8")) +
coord_flip() +
theme_viki() +
theme(
axis.title.y = element_blank())rm(df_country)Arabica - Robusta ratio
Let’s check the proportion of ratings of Robusta and Arabica species in the four countries that produce Robusta.
robusta_countries <- df[species == "Robusta", .(count = .N), by =.(country_of_origin)][order(-count)]
df_country <- df[country_of_origin %in% c(robusta_countries$country_of_origin), .(count = .N) , by = .(country_of_origin, species)][, count_robusta := ifelse(species == "Robusta", count, 0)]
robusta_countries <- df_country[, count_country := sum(count), by = country_of_origin][, prop_robusta := count_robusta / count_country][order(country_of_origin)]
ggplot( df_country , aes(x = reorder(country_of_origin, -prop_robusta), y = count, fill = species) ) +
geom_bar(stat = "identity", position = "fill") +
labs(title = 'Number of ratings by country and species') +
scale_fill_manual("legend", values = c( "#238b45", "#2c7fb8")) +
scale_y_continuous(labels = scales::percent) +
theme_viki() +
theme(
axis.title.y = element_blank(),
axis.title.x = element_blank())rm(robusta_countries, df_country)India has ratings for only Robusta species, while in the USA their proportion is below 15%.
Country ratings
Now we can turn to the total ratings and rank the examined countries based on the distribution of their ratings. The next plot shows the top 10 countries with the highest number of ratings and highlights the owner company of the highest rated coffee.
# Create filter for countries with the most rating
df[, count_country_ratings := sum(ifelse(!is.na(total_cup_points), 1, 0) ), by = country_of_origin][, .(country_of_origin, count_country_ratings, total_cup_points)]## country_of_origin count_country_ratings total_cup_points
## 1: Ethiopia 30 90.58
## 2: Ethiopia 30 89.92
## 3: Guatemala 151 89.75
## 4: Ethiopia 30 89.00
## 5: Ethiopia 30 88.83
## ---
## 1086: Uganda 34 80.50
## 1087: India 9 80.17
## 1088: India 9 80.17
## 1089: Ecuador 2 78.08
## 1090: USA 9 77.17
top_10_count_ratings <- df[, .(count = .N), by = .(country_of_origin, count_country_ratings)][order(-count_country_ratings)][, head(.SD, 15)]
ggplot( df[country_of_origin %in% top_10_count_ratings$country_of_origin] ,
aes(y = reorder(country_of_origin, total_cup_points), x = total_cup_points, label = owner, fill = stat(x)) ) +
geom_density_ridges_gradient(scale= 0.9, show.legend = F, alpha = 0.5, point_alpha = 0.5, jittered_points = TRUE) +
labs(title = 'Top 10 countries based on the number of coffee ratings') +
scale_fill_viridis_c(alpha = 0.6) +
theme_viki() +
theme(
axis.title.y = element_blank(),
axis.title.x = element_blank(),
legend.position="none") +
geom_label_repel(
data = subset(df, total_cup_points > 89),
force = 10,
xlim = c(85, NA)) It seems that Ethiopia has the rating distribution closest to the highest scores. The 3 biggest producers, Mexico, Guatemala and Brazil are also among the top 10.
Highest ratings by year
Let’s check that which county achieved the highest coffee rating in each year.
df_yearly_max_rating <- df[!is.na(grading_date_year), .(max_total_cup_points = max(total_cup_points)), by = .(country_of_origin, grading_date_year)][order(grading_date_year, -max_total_cup_points)]
df_yearly_max_rating <- df_yearly_max_rating[, first_by_year := rank( -max_total_cup_points), by = grading_date_year][first_by_year == 1]
ggplot( df_yearly_max_rating , aes(y = max_total_cup_points,
x = grading_date_year,
fill = country_of_origin,
color = country_of_origin,
label=paste0(grading_date_year,"\n",country_of_origin,"\n", max_total_cup_points))) +
geom_point() +
geom_text(position = position_dodge(width = 0.9),
hjust =-0.2,
size =3.5) +
ylim(c(80,95)) +
expand_limits(x= c(2009.5, 2018.5 )) +
scale_x_continuous(breaks = c(2010:2018)) +
labs(title = 'Countries with the highest ratings in each year', color = NULL, fill = NULL, label = NULL) +
ylab("Highest coffee rating") +
theme_viki() +
theme(
axis.title.x = element_blank(),
legend.position="",
panel.grid.minor = element_blank()
)It seems that there is a lot of variation in the leading countries.
Coffee growing altitude by countries
We can also investigate the relationship between the coffee ratings and the growing altitude. For this first let’s plot the distribution of the growing altitude by each producing countries.
ggplot( df, aes(x=reorder(country_of_origin, -altitude_mean_meters), y=altitude_mean_meters, fill=country_of_origin)) +
geom_boxplot() +
geom_jitter(color="grey", alpha=0.3, size=0.9) +
scale_fill_viridis_d() +
xlab("") +
ylab("Growing altitude (meters)") +
labs(title = 'Countries ranked by growing altitude', color = NULL, fill = NULL, label = NULL) +
coord_flip() +
theme_viki() +
theme(
legend.position="none") Note that altitude information was missing for 17% of the observations which puts a limitation to our findings. Among the countries with the highest growing altitude we can find Ethiopia, Papua New Guinea, Guatemala and Kenya implied by the external coffee statistics.
Altitude and total cup points
The next scatter plot shows the relationship between the growing altitude and the coffee ratings separately for each continent.
library(gganimate)
ggplot( df , aes(x = altitude_mean_meters,
y = total_cup_points,
color = continent_of_origin)) +
geom_point( alpha = 0.8, size = 1) +
geom_smooth(method = "lm", se = F) +
transition_states(continent_of_origin, transition_length = 10, state_length = 10) +
scale_color_viridis_d(begin = 0, end = 0.9)+
ylab("Coffee rating") +
xlab("Coffee growing altitude (meters)") +
labs(title = 'Relationship between altitude and coffee ratings',
subtitle = '{closest_state}') +
theme_viki() +
theme(
legend.position="none")It seems that farms located at higher altitudes produce higher rated coffee. This positive pattern holds for each continent.
Country ranking by different coffee rating measures
As the final score is calculated by the 11 individual rating categories we can expect that some countries perform better in one rating dimension while others achieve higher ratings in another categories. The next animation shows the countries ranked by their mean ratings in each rating category together with their score ranges.
library(stringr)
# Create long dataset
df_long <- melt(df[, .SD, .SDcols = c("ID", "country_of_origin", rating_metrics)], id.vars = c("ID", "country_of_origin"),
measure.vars = rating_metrics)
# Create aggregates - Country level
county_aggregates <- df_long[, .(max_value = max(value, na.rm = T),
min_value = min(value, na.rm = T),
mean_value = mean(value, na.rm = T),
rating_number = .N), by = .(country_of_origin, variable)][order(variable, -mean_value, - rating_number, country_of_origin)][, rank := as.numeric(row.names(.SD)), by = variable]
# Create aggregates - Total
total_aggregates <- df_long[, .(max_value_total = max(value, na.rm = T), min_value_total = min(value, na.rm = T), mean_value_total = mean(value, na.rm = T)), by = .(variable)]
# Join datasets
county_aggregates <- left_join(x = county_aggregates, y = total_aggregates, by =c("variable"))
p <- ggplot(county_aggregates, aes(y= -rank)) +
transition_states(variable, transition_length = 10, state_length = 10) +
geom_text(aes(x = -3, label = country_of_origin), hjust = -0.01, size = 6) +
geom_point(aes(x = mean_value, y = -rank, color=country_of_origin), alpha = 0.8 , size = 3) +
geom_segment(aes(x = 0, y = -rank, xend = 10, yend = -rank),alpha = 0.4, size = 0.5) +
geom_segment(aes(x = min_value, y = -rank, xend = max_value, yend = -rank, color=country_of_origin),alpha = 0.7, size = 1.5) +
geom_vline(aes(xintercept = mean_value_total), color = "red", size = 1) +
scale_x_continuous(breaks = c(0:10)) +
scale_y_continuous(seq(from = 0, to = 36, by = 1)) +
scale_color_viridis_d() +
xlab("Rating")+
expand_limits(x= c(-3, 10)) +
labs(
title = 'Countries ranked by different coffee rating metrics',
subtitle = '{str_to_title(closest_state)}') +
theme_viki() +
theme(
axis.title.y = element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.text.x=element_text(size=18),
axis.title.x = element_text(size=18),
legend.position="none",
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
plot.title = element_text(size=22),
plot.subtitle = element_text(size=14)
)
animate(p, nframes = 100, fps=5, height = 1000, width =800)anim_save("p.gif")We can see that for example with the aroma and sweetness ratings Tanzania is the leader, while in case of the flavor dimension Papua New Guinea ranks first.
Average ratings
The next graph colors the coffee producer countries by their average rating.
library(rworldmap)
library(RColorBrewer)
df_country_species <- df[, .(mean_rating = mean(total_cup_points)), by = .(country_of_origin, species)]
# map_world$filter <- ifelse(map_world$region %in% df_country_species$country_of_origin, map_world$region, NA)
# Create a map-shaped window
mapDevice('x11')
# Join dataset
spdf <- joinCountryData2Map(df_country_species, joinCode="NAME", nameJoinColumn="country_of_origin")## 38 codes from your data successfully matched countries in the map
## 1 codes from your data failed to match with a country code in the map
## 208 codes from the map weren't represented in your data
mapCountryData(spdf,
nameColumnToPlot="mean_rating",
catMethod="fixedWidth",
mapTitle = "Mean coffee ratings",
missingCountryCol = 'grey')Mexico is surely the leader on the American continent while in Africa Ethiopia has the highest average coffee ranking. As we saw earlier Ethiopia is the leader not just in its continent but it produces the highest rated coffee among the producers.
If we go back to our initial question that where should we travel if we would like to drink the world’s best coffee then the answer is definitely Ethiopia. The next chart shows the top 3 regions in Ethiopia with the highest average coffee ratings using a downloaded custom map of the country.
# Filter top regions
df_aggr <- df[country_of_origin == "Ethiopia", .(max_total = max(total_cup_points)), by = .(region)][order(-max_total)]
# Remote file information
u_remote <- "https://biogeo.ucdavis.edu/"
p_remote <- "data/gadm3.6/Rsf/"
f_name <- "gadm36_ETH_3_sf.rds"
# Local file location to save to
ethiopia_rds <- file.path(tempdir(), "gadm36_ETH_3_sf.rds")
if (toupper(Sys.info()["sysname"]) == "WINDOWS") {
download.file(
url = paste0(u_remote, p_remote, f_name),
destfile = ethiopia_rds,
method = "wininet",
mode = "wb"
)
} else {
download.file(
url = paste0(u_remote, p_remote, f_name),
destfile = ethiopia_rds,
method = "auto"
)
}
# GADM distributes native R files in .rds format that we then import.
ethiopia_sf <- readRDS(ethiopia_rds)
# Using geom_sf_text() we can add labels to the map.
ethiopia_regions_sf <-
ethiopia_sf %>%
mutate(NAME_1 = gsub("SNNP", "Oromia",
NAME_1)) %>%
group_by(NAME_1) %>%
summarise() %>%
ungroup() %>%
st_as_sf()
# First layer, country outline
ethiopia_simple_sf <-
ethiopia_sf %>%
group_by(NAME_0) %>%
summarise() %>%
ungroup() %>%
st_as_sf()
# Second layer
ethiopia_zones_sf <-
ethiopia_sf %>%
filter(
NAME_2 %in% c(
"Guji",
"Keffa",
"Oromia"
)
)
# Final map
final_map <- ggplot() +
geom_sf(data = ethiopia_simple_sf,
col = NA,
fill = "#D5C1AB") + # use a coffee-ish colour background for country
geom_sf(data = ethiopia_zones_sf, # add zones and fill by name
aes(fill = NAME_2),
colour = NA)
final_map <-
final_map +
labs(
x = "Longitude",
y = "Latitude",
title = "Ethiopian top coffee regions",
caption = "Data from GADM, https://gadm.org",
fill = "Zone",
linetype = "Region"
) + theme_bw() +
theme(panel.grid.major = element_line(colour = "transparent"),
axis.title.y = element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.text.x=element_blank(),
axis.title.x = element_blank())
final_mapSummary
We have seen that there is a lot of heterogeneity in the rating of the coffee products between producer countries . Ethiopia was ranked first not just in the overall ratings but it was the leader in the aftertaste rating category as well. The dominant coffee variety is Arabica only India is specialized in the cultivation of Robusta species. External studies implied that we can expect variation in the coffee quality by the growing altitude. When we examined the relationship between the growing altitude and the positive association was clearly visible even wen we examined the relationship by continents. Finally, if one would like to drink a good cup of coffee my recommendation is the Ethiopian variant.