Data Analysis Final Project - World’s Coffee Rating

Zhuo Ding

2025-11-27

Issue Description

Coffee is a critical export crop for many developing nations, and its market value depends strongly on its overall rating. The total cup score varies based on geographic, environmental, and post-harvest factors such as altitude, species/variety, and processing method. Understanding which factors strongly influence quality can support sustainable farming and global competitiveness.

Questions

  1. Which factors most strongly predict total cup points (overall cupping score)?
  2. Do coffees from certain countries consistently score higher than others?

Data Source

Dataset: TidyTuesday 2020-07-07 Coffee Ratings

Source: Coffee Quality Institute (CQI) and the Specialty Coffee Association

URL: https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-07/coffee_ratings.csv

Imported using standard readr::read_csv() in R

Documentation

The dataset is documented in the TidyTuesday repository with full descriptions of variables such as altitude_mean_meters, aroma, flavor, processing_method, species, and total_cup_points. A data dictionary is provided in the repository.

Description of the Data

The original dataset, as imported, is described below using str() and summary().

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
coffee <- readr::read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-07/coffee_ratings.csv"
)
## Rows: 1339 Columns: 43
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): species, owner, country_of_origin, farm_name, lot_number, mill, ic...
## dbl (19): total_cup_points, number_of_bags, aroma, flavor, aftertaste, acidi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(coffee)
## spc_tbl_ [1,339 × 43] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ total_cup_points     : num [1:1339] 90.6 89.9 89.8 89 88.8 ...
##  $ species              : chr [1:1339] "Arabica" "Arabica" "Arabica" "Arabica" ...
##  $ owner                : chr [1:1339] "metad plc" "metad plc" "grounds for health admin" "yidnekachew dabessa" ...
##  $ country_of_origin    : chr [1:1339] "Ethiopia" "Ethiopia" "Guatemala" "Ethiopia" ...
##  $ farm_name            : chr [1:1339] "metad plc" "metad plc" "san marcos barrancas \"san cristobal cuch" "yidnekachew dabessa coffee plantation" ...
##  $ lot_number           : chr [1:1339] NA NA NA NA ...
##  $ mill                 : chr [1:1339] "metad plc" "metad plc" NA "wolensu" ...
##  $ ico_number           : chr [1:1339] "2014/2015" "2014/2015" NA NA ...
##  $ company              : chr [1:1339] "metad agricultural developmet plc" "metad agricultural developmet plc" NA "yidnekachew debessa coffee plantation" ...
##  $ altitude             : chr [1:1339] "1950-2200" "1950-2200" "1600 - 1800 m" "1800-2200" ...
##  $ region               : chr [1:1339] "guji-hambela" "guji-hambela" NA "oromia" ...
##  $ producer             : chr [1:1339] "METAD PLC" "METAD PLC" NA "Yidnekachew Dabessa Coffee Plantation" ...
##  $ number_of_bags       : num [1:1339] 300 300 5 320 300 100 100 300 300 50 ...
##  $ bag_weight           : chr [1:1339] "60 kg" "60 kg" "1" "60 kg" ...
##  $ in_country_partner   : chr [1:1339] "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
##  $ harvest_year         : chr [1:1339] "2014" "2014" NA "2014" ...
##  $ grading_date         : chr [1:1339] "April 4th, 2015" "April 4th, 2015" "May 31st, 2010" "March 26th, 2015" ...
##  $ owner_1              : chr [1:1339] "metad plc" "metad plc" "Grounds for Health Admin" "Yidnekachew Dabessa" ...
##  $ variety              : chr [1:1339] NA "Other" "Bourbon" NA ...
##  $ processing_method    : chr [1:1339] "Washed / Wet" "Washed / Wet" NA "Natural / Dry" ...
##  $ aroma                : num [1:1339] 8.67 8.75 8.42 8.17 8.25 8.58 8.42 8.25 8.67 8.08 ...
##  $ flavor               : num [1:1339] 8.83 8.67 8.5 8.58 8.5 8.42 8.5 8.33 8.67 8.58 ...
##  $ aftertaste           : num [1:1339] 8.67 8.5 8.42 8.42 8.25 8.42 8.33 8.5 8.58 8.5 ...
##  $ acidity              : num [1:1339] 8.75 8.58 8.42 8.42 8.5 8.5 8.5 8.42 8.42 8.5 ...
##  $ body                 : num [1:1339] 8.5 8.42 8.33 8.5 8.42 8.25 8.25 8.33 8.33 7.67 ...
##  $ balance              : num [1:1339] 8.42 8.42 8.42 8.25 8.33 8.33 8.25 8.5 8.42 8.42 ...
##  $ uniformity           : num [1:1339] 10 10 10 10 10 10 10 10 9.33 10 ...
##  $ clean_cup            : num [1:1339] 10 10 10 10 10 10 10 10 10 10 ...
##  $ sweetness            : num [1:1339] 10 10 10 10 10 10 10 9.33 9.33 10 ...
##  $ cupper_points        : num [1:1339] 8.75 8.58 9.25 8.67 8.58 8.33 8.5 9 8.67 8.5 ...
##  $ moisture             : num [1:1339] 0.12 0.12 0 0.11 0.12 0.11 0.11 0.03 0.03 0.1 ...
##  $ category_one_defects : num [1:1339] 0 0 0 0 0 0 0 0 0 0 ...
##  $ quakers              : num [1:1339] 0 0 0 0 0 0 0 0 0 0 ...
##  $ color                : chr [1:1339] "Green" "Green" NA "Green" ...
##  $ category_two_defects : num [1:1339] 0 1 0 2 2 1 0 0 0 4 ...
##  $ expiration           : chr [1:1339] "April 3rd, 2016" "April 3rd, 2016" "May 31st, 2011" "March 25th, 2016" ...
##  $ certification_body   : chr [1:1339] "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
##  $ certification_address: chr [1:1339] "309fcf77415a3661ae83e027f7e5f05dad786e44" "309fcf77415a3661ae83e027f7e5f05dad786e44" "36d0d00a3724338ba7937c52a378d085f2172daa" "309fcf77415a3661ae83e027f7e5f05dad786e44" ...
##  $ certification_contact: chr [1:1339] "19fef5a731de2db57d16da10287413f5f99bc2dd" "19fef5a731de2db57d16da10287413f5f99bc2dd" "0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660" "19fef5a731de2db57d16da10287413f5f99bc2dd" ...
##  $ unit_of_measurement  : chr [1:1339] "m" "m" "m" "m" ...
##  $ altitude_low_meters  : num [1:1339] 1950 1950 1600 1800 1950 ...
##  $ altitude_high_meters : num [1:1339] 2200 2200 1800 2200 2200 NA NA 1700 1700 1850 ...
##  $ altitude_mean_meters : num [1:1339] 2075 2075 1700 2000 2075 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   total_cup_points = col_double(),
##   ..   species = col_character(),
##   ..   owner = col_character(),
##   ..   country_of_origin = col_character(),
##   ..   farm_name = col_character(),
##   ..   lot_number = col_character(),
##   ..   mill = col_character(),
##   ..   ico_number = col_character(),
##   ..   company = col_character(),
##   ..   altitude = col_character(),
##   ..   region = col_character(),
##   ..   producer = col_character(),
##   ..   number_of_bags = col_double(),
##   ..   bag_weight = col_character(),
##   ..   in_country_partner = col_character(),
##   ..   harvest_year = col_character(),
##   ..   grading_date = col_character(),
##   ..   owner_1 = col_character(),
##   ..   variety = col_character(),
##   ..   processing_method = col_character(),
##   ..   aroma = col_double(),
##   ..   flavor = col_double(),
##   ..   aftertaste = col_double(),
##   ..   acidity = col_double(),
##   ..   body = col_double(),
##   ..   balance = col_double(),
##   ..   uniformity = col_double(),
##   ..   clean_cup = col_double(),
##   ..   sweetness = col_double(),
##   ..   cupper_points = col_double(),
##   ..   moisture = col_double(),
##   ..   category_one_defects = col_double(),
##   ..   quakers = col_double(),
##   ..   color = col_character(),
##   ..   category_two_defects = col_double(),
##   ..   expiration = col_character(),
##   ..   certification_body = col_character(),
##   ..   certification_address = col_character(),
##   ..   certification_contact = col_character(),
##   ..   unit_of_measurement = col_character(),
##   ..   altitude_low_meters = col_double(),
##   ..   altitude_high_meters = col_double(),
##   ..   altitude_mean_meters = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(coffee)
##  total_cup_points   species             owner           country_of_origin 
##  Min.   : 0.00    Length:1339        Length:1339        Length:1339       
##  1st Qu.:81.08    Class :character   Class :character   Class :character  
##  Median :82.50    Mode  :character   Mode  :character   Mode  :character  
##  Mean   :82.09                                                            
##  3rd Qu.:83.67                                                            
##  Max.   :90.58                                                            
##                                                                           
##   farm_name          lot_number            mill            ico_number       
##  Length:1339        Length:1339        Length:1339        Length:1339       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    company            altitude            region            producer        
##  Length:1339        Length:1339        Length:1339        Length:1339       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  number_of_bags    bag_weight        in_country_partner harvest_year      
##  Min.   :   0.0   Length:1339        Length:1339        Length:1339       
##  1st Qu.:  14.0   Class :character   Class :character   Class :character  
##  Median : 175.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 154.2                                                           
##  3rd Qu.: 275.0                                                           
##  Max.   :1062.0                                                           
##                                                                           
##  grading_date         owner_1            variety          processing_method 
##  Length:1339        Length:1339        Length:1339        Length:1339       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      aroma           flavor       aftertaste       acidity           body      
##  Min.   :0.000   Min.   :0.00   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:7.420   1st Qu.:7.33   1st Qu.:7.250   1st Qu.:7.330   1st Qu.:7.330  
##  Median :7.580   Median :7.58   Median :7.420   Median :7.580   Median :7.500  
##  Mean   :7.567   Mean   :7.52   Mean   :7.401   Mean   :7.536   Mean   :7.517  
##  3rd Qu.:7.750   3rd Qu.:7.75   3rd Qu.:7.580   3rd Qu.:7.750   3rd Qu.:7.670  
##  Max.   :8.750   Max.   :8.83   Max.   :8.670   Max.   :8.750   Max.   :8.580  
##                                                                                
##     balance        uniformity       clean_cup        sweetness     
##  Min.   :0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:7.330   1st Qu.:10.000   1st Qu.:10.000   1st Qu.:10.000  
##  Median :7.500   Median :10.000   Median :10.000   Median :10.000  
##  Mean   :7.518   Mean   : 9.835   Mean   : 9.835   Mean   : 9.857  
##  3rd Qu.:7.750   3rd Qu.:10.000   3rd Qu.:10.000   3rd Qu.:10.000  
##  Max.   :8.750   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                                                                    
##  cupper_points       moisture       category_one_defects    quakers       
##  Min.   : 0.000   Min.   :0.00000   Min.   : 0.0000      Min.   : 0.0000  
##  1st Qu.: 7.250   1st Qu.:0.09000   1st Qu.: 0.0000      1st Qu.: 0.0000  
##  Median : 7.500   Median :0.11000   Median : 0.0000      Median : 0.0000  
##  Mean   : 7.503   Mean   :0.08838   Mean   : 0.4795      Mean   : 0.1734  
##  3rd Qu.: 7.750   3rd Qu.:0.12000   3rd Qu.: 0.0000      3rd Qu.: 0.0000  
##  Max.   :10.000   Max.   :0.28000   Max.   :63.0000      Max.   :11.0000  
##                                                          NA's   :1        
##     color           category_two_defects  expiration        certification_body
##  Length:1339        Min.   : 0.000       Length:1339        Length:1339       
##  Class :character   1st Qu.: 0.000       Class :character   Class :character  
##  Mode  :character   Median : 2.000       Mode  :character   Mode  :character  
##                     Mean   : 3.556                                            
##                     3rd Qu.: 4.000                                            
##                     Max.   :55.000                                            
##                                                                               
##  certification_address certification_contact unit_of_measurement
##  Length:1339           Length:1339           Length:1339        
##  Class :character      Class :character      Class :character   
##  Mode  :character      Mode  :character      Mode  :character   
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  altitude_low_meters altitude_high_meters altitude_mean_meters
##  Min.   :     1      Min.   :     1       Min.   :     1      
##  1st Qu.:  1100      1st Qu.:  1100       1st Qu.:  1100      
##  Median :  1311      Median :  1350       Median :  1311      
##  Mean   :  1751      Mean   :  1799       Mean   :  1775      
##  3rd Qu.:  1600      3rd Qu.:  1650       3rd Qu.:  1600      
##  Max.   :190164      Max.   :190164       Max.   :190164      
##  NA's   :230         NA's   :230          NA's   :230

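The dataset contains 1,339 rows and 43 columns describing coffee samples evaluated worldwide. Key variables include total_cup_points, country_of_origin, variety, processing_method, altitude_mean_meters, the sensory scores (aroma, flavor, acidity), and moisture.

The summary() output also reveals values that look like data-entry errors: the minimum total_cup_points is 0, the maximum altitude_mean_meters is 190,164, and the altitude columns each have 230 missing values.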
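The summary() output above shows 230 missing altitude values, a minimum total_cup_points of 0, and a maximum altitude of 190,164 m, so a quick check for missing and implausible entries is worthwhile before cleaning. The sketch below shows one way to do this in base R on a small made-up data frame. The values in toy and the 9,000 m cutoff are assumptions for illustration, not part of the original analysis.

```r
# Toy stand-in for the coffee data (hypothetical values, real column names).
toy <- data.frame(
  total_cup_points     = c(90.58, 82.50, 0, 83.67),
  altitude_mean_meters = c(2075, NA, 1311, 190164),
  processing_method    = c("Washed / Wet", NA, "Natural / Dry", "Washed / Wet")
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

# Altitudes above 9,000 m cannot be real farm elevations (cutoff is an assumption).
too_high <- !is.na(toy$altitude_mean_meters) & toy$altitude_mean_meters > 9000

# A row is suspicious if its cup score is 0 or its altitude is implausible.
suspicious <- toy$total_cup_points == 0 | too_high

print(na_counts)
print(toy[suspicious, ])
```

On the real data, the same two checks could be run on coffee in place of toy.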

Cleaning and Preparation

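coffee_clean <- coffee %>%
  # Keep only the variables used in the analysis
  select(country_of_origin, species,
         altitude_mean_meters, processing_method,
         aroma, flavor, aftertaste, acidity, body, balance,
         total_cup_points) %>%
  # Remove rows with a missing value in any selected column
  drop_na() %>%
  # Average of the six sensory scores
  mutate(avg_sensory = (aroma + flavor + aftertaste + acidity + body + balance) / 6)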

glimpse(coffee_clean)
## Rows: 1,013
## Columns: 12
## $ country_of_origin    <chr> "Ethiopia", "Ethiopia", "Ethiopia", "Ethiopia", "…
## $ species              <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Arab…
## $ altitude_mean_meters <dbl> 2075.0, 2075.0, 2000.0, 2075.0, 1822.5, 1905.0, 1…
## $ processing_method    <chr> "Washed / Wet", "Washed / Wet", "Natural / Dry", …
## $ aroma                <dbl> 8.67, 8.75, 8.17, 8.25, 8.08, 8.17, 8.25, 8.08, 8…
## $ flavor               <dbl> 8.83, 8.67, 8.58, 8.50, 8.58, 8.67, 8.42, 8.67, 8…
## $ aftertaste           <dbl> 8.67, 8.50, 8.42, 8.25, 8.50, 8.25, 8.17, 8.33, 8…
## $ acidity              <dbl> 8.75, 8.58, 8.42, 8.50, 8.50, 8.50, 8.33, 8.42, 8…
## $ body                 <dbl> 8.50, 8.42, 8.50, 8.42, 7.67, 7.75, 8.08, 8.00, 8…
## $ balance              <dbl> 8.42, 8.42, 8.25, 8.33, 8.42, 8.17, 8.17, 8.08, 8…
## $ total_cup_points     <dbl> 90.58, 89.92, 89.00, 88.83, 88.25, 88.08, 87.92, …
## $ avg_sensory          <dbl> 8.640000, 8.556667, 8.390000, 8.375000, 8.291667,…

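Cleaning steps:

  1. Selected the 11 variables used in the analysis: country of origin, species, mean altitude, processing method, the six sensory scores, and total_cup_points.
  2. Dropped every row with a missing value in any selected column using drop_na(), which reduced the data from 1,339 to 1,013 rows.
  3. Added avg_sensory, the mean of aroma, flavor, aftertaste, acidity, body, and balance.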

Final Results

Q1 — Which factors most strongly predict cup quality?

A. Average Sensory Score vs Total Cup Points:

ggplot(coffee_clean, aes(x = avg_sensory, y = total_cup_points)) +
  geom_point(alpha = 0.6) +
  geom_smooth() +
  labs(title = "Average Sensory Quality and Coffee Rating",
       x = "Average Sensory Score",
       y = "Total Cup Points")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

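The scatterplot shows a positive relationship: the higher the average sensory score, the higher the total cup points. This is partly expected, because the six sensory scores are themselves components of total_cup_points.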
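The strength of the relationship seen in the scatterplot can be summarized with a correlation coefficient. The sketch below uses a small made-up data frame (the values in toy are hypothetical, chosen only to illustrate the calculation) and base R's cor().

```r
# Toy data (hypothetical values) standing in for coffee_clean.
toy <- data.frame(
  avg_sensory      = c(7.2, 7.4, 7.5, 7.7, 8.0, 8.4),
  total_cup_points = c(80.1, 81.5, 82.3, 83.4, 85.6, 88.9)
)

# Pearson correlation measures the strength of the linear relationship.
r <- cor(toy$avg_sensory, toy$total_cup_points, method = "pearson")

# Spearman correlation uses ranks, so it is less sensitive to outliers.
rho <- cor(toy$avg_sensory, toy$total_cup_points, method = "spearman")

print(c(pearson = r, spearman = rho))
```

On the real data, the same call could be applied to coffee_clean$avg_sensory and coffee_clean$total_cup_points.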

B. Species vs Total Cup Points:

ggplot(coffee_clean, aes(x = species, y = total_cup_points, fill = species)) +
  geom_boxplot() +
  labs(title = "Coffee Rating by Coffee Species")

Arabica generally earns higher scores than Robusta. In each boxplot, the box encloses the middle 50% of the data.
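The numbers behind a boxplot comparison are the median and the interquartile range (IQR) of each group. As a minimal sketch, assuming a small made-up data frame (the scores in toy are hypothetical), they can be computed per species in base R like this:

```r
# Toy data (hypothetical scores) with five samples per species.
toy <- data.frame(
  species          = c(rep("Arabica", 5), rep("Robusta", 5)),
  total_cup_points = c(80, 82, 83, 84, 86, 78, 79, 80, 81, 83)
)

# Median per species: the line drawn inside each box.
medians <- tapply(toy$total_cup_points, toy$species, median)

# IQR per species: the height of each box (third quartile minus first quartile).
iqrs <- tapply(toy$total_cup_points, toy$species, IQR)

print(rbind(median = medians, iqr = iqrs))
```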

C. Altitude vs Total Cup Points

coffee_alt <- coffee_clean %>%
  filter(
    altitude_mean_meters > 400,
    altitude_mean_meters < 3000
  )



ggplot(coffee_alt, aes(x = altitude_mean_meters, y = total_cup_points, color = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Altitude and Coffee Quality",
       x = "Altitude (m)",
       y = "Total Cup Points")
## `geom_smooth()` using formula = 'y ~ x'

Coffees grown at higher altitudes, particularly Arabica, tend to receive higher quality scores. In contrast, Robusta coffees cluster at lower altitudes and have lower average scores.
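The altitude trend can also be expressed as a number by fitting the same kind of straight line that geom_smooth(method = "lm") draws. The sketch below is only an illustration: the values in toy are made up, and it fits a single line across all samples rather than one per species.

```r
# Toy data (hypothetical values), including two implausible altitudes.
toy <- data.frame(
  altitude_mean_meters = c(1, 900, 1200, 1500, 1800, 2100, 190164),
  total_cup_points     = c(82.0, 80.4, 81.6, 82.5, 83.4, 84.6, 81.0)
)

# Same plausibility window as the filter above: 400 m < altitude < 3000 m.
in_range <- toy$altitude_mean_meters > 400 & toy$altitude_mean_meters < 3000
toy_alt  <- toy[in_range, ]

# Ordinary least squares fit of score on altitude.
fit <- lm(total_cup_points ~ altitude_mean_meters, data = toy_alt)

# Slope rescaled to cup points gained per 100 m of altitude.
slope_per_100m <- unname(coef(fit)["altitude_mean_meters"]) * 100
print(slope_per_100m)
```

Reporting the slope per 100 m makes the effect easier to read than a per-meter coefficient.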

D. Processing Method Influence

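coffee_clean %>%
  ggplot(aes(x = processing_method, y = total_cup_points)) +
  # A boxplot shows the spread of scores per method; geom_col() would stack
  # every sample's score, so bar length would mostly reflect sample count.
  geom_boxplot() +
  coord_flip() +
  labs(title = "Scores by Processing Method",
       x = "Processing Method",
       y = "Total Cup Points")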

Wet/Washed methods yield more consistent high scores.
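Whether a method scores "consistently" high is a statement about both level and spread. A summed score per method mostly reflects how many samples the method has, so the sketch below compares the sum, mean, and standard deviation on a small made-up data frame (all values in toy are hypothetical).

```r
# Toy data (hypothetical): six washed samples, three natural samples.
toy <- data.frame(
  processing_method = c(rep("Washed / Wet", 6), rep("Natural / Dry", 3)),
  total_cup_points  = c(82, 83, 83, 84, 82, 84, 80, 84, 88)
)

# Sum per method: grows with the number of samples, not with quality.
score_sum <- tapply(toy$total_cup_points, toy$processing_method, sum)

# Mean per method: the typical score level.
score_mean <- tapply(toy$total_cup_points, toy$processing_method, mean)

# Standard deviation per method: smaller values mean more consistent scores.
score_sd <- tapply(toy$total_cup_points, toy$processing_method, sd)

print(rbind(sum = score_sum, mean = score_mean, sd = score_sd))
```

In this toy example the washed group has the larger sum only because it has more samples, while its smaller standard deviation is what indicates consistency.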

Q2. Do coffees from certain countries consistently score higher than others?

coffee_clean %>%
  group_by(country_of_origin) %>%
  # mean_score: average rating per country; n: number of samples per country
  summarize(mean_score = mean(total_cup_points),
            n = n()) %>%
  # Keep countries with at least 5 samples; means from fewer samples can mislead
  filter(n >= 5) %>%
  slice_max(mean_score, n = 15) %>%
  ggplot(aes(x = reorder(country_of_origin, mean_score),
             y = mean_score)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 15 Countries by Mean Total Cup Points (Cleaned Data)",
       x = "Country of Origin",
       y = "Mean Cup Score") +
  theme_minimal() +
  theme(legend.position = "none")

• Bar chart of the top 15 countries by rating

Using the cleaned coffee dataset, we computed the mean cupping score for each country and kept the 15 highest-scoring countries among those with at least 5 valid evaluations. The bar chart shows that Ethiopia ranks highest in mean coffee quality, followed closely by the United States, Kenya, Uganda, and Colombia. These results highlight the strength of East African and Latin American origins in the specialty market. Guatemala, Costa Rica, and El Salvador also score consistently well, in line with the reputation of high-altitude Arabica-producing regions. The United States appears near the top not because it is a major coffee producer, but because the dataset includes specialty coffee grown in Hawaii, which is known for premium, high-scoring Arabica beans.
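The minimum-sample rule matters because a country with a single high-scoring sample can otherwise top the ranking. The base R sketch below illustrates this on made-up data; the scores and the country "Example Island" are hypothetical.

```r
# Toy data (hypothetical): two well-sampled countries and one single sample.
toy <- data.frame(
  country_of_origin = c(rep("Ethiopia", 5), rep("Colombia", 6), "Example Island"),
  total_cup_points  = c(85, 86, 84, 87, 88, 83, 84, 82, 83, 85, 81, 89)
)

# Mean score and sample count per country (both results share the same row order).
means  <- aggregate(total_cup_points ~ country_of_origin, data = toy, FUN = mean)
counts <- aggregate(total_cup_points ~ country_of_origin, data = toy, FUN = length)

stats <- data.frame(
  country_of_origin = means$country_of_origin,
  mean_score        = means$total_cup_points,
  n                 = counts$total_cup_points
)

# Ranking without and with the n >= 5 rule.
ranked_all  <- stats[order(-stats$mean_score), ]
ranked_min5 <- ranked_all[ranked_all$n >= 5, ]

print(ranked_all)
print(ranked_min5)
```

Without the filter the single-sample country ranks first; with it, only countries with enough evaluations are compared.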

Conclusion

Altitude, sensory quality, species, and processing method are all related to total cup points.

Arabica beans with high sensory quality that are processed by the Washed / Wet method perform best.

Certain regions (Latin America and East Africa) consistently lead in ratings.

Future work:

Incorporate climate variables

Predict scores using regression models

Examine sustainability metrics
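As a starting point for the regression idea listed above, a multiple linear regression could estimate the effect of each factor while holding the others fixed. The sketch below is only an illustration under stated assumptions: the data in toy are made up, and the scores are generated from a known formula (coefficients 8, 0.001, and -1.5 are arbitrary) so that the fitted model can be checked against it. On the real data, note that the sensory scores are components of total_cup_points, so they will dominate any such model.

```r
# Toy predictors (hypothetical values, real column names).
toy <- data.frame(
  avg_sensory          = c(7.2, 7.5, 7.8, 8.1, 7.0, 7.3, 7.6, 7.4),
  altitude_mean_meters = c(1200, 1500, 1800, 2000, 600, 800, 900, 1600),
  species              = c("Arabica", "Arabica", "Arabica", "Arabica",
                           "Robusta", "Robusta", "Robusta", "Arabica")
)

# Scores generated from a known formula (an assumption for illustration only).
toy$total_cup_points <- 20 + 8 * toy$avg_sensory +
  0.001 * toy$altitude_mean_meters - 1.5 * (toy$species == "Robusta")

# Multiple linear regression with two numeric predictors and one categorical.
fit <- lm(total_cup_points ~ avg_sensory + altitude_mean_meters + species,
          data = toy)
print(coef(fit))

# Predicted score for a new, hypothetical Arabica sample.
new_sample <- data.frame(avg_sensory = 7.9,
                         altitude_mean_meters = 1700,
                         species = "Arabica")
predicted <- predict(fit, newdata = new_sample)
print(predicted)
```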