Coffee Production and Quality Analysis

Author

Leah Marshall

Introduction

For this project, I am analyzing a dataset on coffee production and quality from the CORGIS dataset repository. The dataset includes information on coffee-producing countries, altitude, production, and coffee quality scores across several categories such as aroma, flavor, and aftertaste.

Variables included:
- country: The country where coffee is produced (categorical)
- year: Year of production (quantitative)
- altitude: Average altitude of coffee farms in meters (quantitative)
- aroma, flavor, aftertaste: Sensory quality scores (quantitative)
- production: Number of bags of coffee produced (quantitative)

The purpose of this analysis is to explore relationships between coffee production, altitude, and quality scores, and to visualize differences in sensory scores among the top five coffee-producing countries. Additionally, a multiple linear regression is performed to examine how altitude and production together predict aroma scores.

Dataset Source: Coffee Dataset - CORGIS

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.5.1

Warning: package 'tibble' was built under R version 4.5.1

Warning: package 'purrr' was built under R version 4.5.1

Warning: package 'stringr' was built under R version 4.5.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load dataset

coffee <- read_csv("https://corgis-edu.github.io/corgis/datasets/csv/coffee/coffee.csv")

Rows: 989 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): Location.Country, Location.Region, Data.Owner, Data.Type.Species, ...
dbl (16): Location.Altitude.Min, Location.Altitude.Max, Location.Altitude.Av...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Simplify dataset: select relevant variables

coffee_clean <- coffee %>%
  select(country = Location.Country,
         year = Year,
         altitude = Location.Altitude.Average,
         aroma = Data.Scores.Aroma,
         flavor = Data.Scores.Flavor,
         aftertaste = Data.Scores.Aftertaste,
         production = `Data.Production.Number of bags`) %>%
  filter(!is.na(altitude), !is.na(aroma), !is.na(flavor), !is.na(aftertaste), !is.na(production))
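
A quick way to see what a filter like this removes is to count missing values per column before applying it. The sketch below uses a small made-up tibble, `toy`, as a stand-in for the selected columns; its values are illustrative only and are not taken from the coffee data. In this report the filter appears to drop nothing: 989 rows were read, and the regression output further down reports 986 residual degrees of freedom for 3 coefficients, which implies all 989 rows were used.

```r
library(dplyr)

# Made-up stand-in for the selected columns (illustrative values only)
toy <- tibble(
  country  = c("A", "B", "C", "D"),
  altitude = c(1200, NA, 900, 1500),
  aroma    = c(7.5, 7.8, NA, 7.2)
)

# Count missing values in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Number of rows that survive the same style of filter
rows_kept <- toy %>%
  filter(!is.na(altitude), !is.na(aroma)) %>%
  nrow()

print(na_counts)
print(rows_kept)
```

Running the same `summarise()` call on `coffee_clean` before the filter would show whether any of the selected columns contain missing values at all.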

# Pivot scores to long format for visualization

coffee_long <- coffee_clean %>%
  pivot_longer(cols = c(aroma, flavor, aftertaste),
               names_to = "score_type",
               values_to = "score")
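
To make the reshaping step concrete, here is a minimal sketch of the same `pivot_longer()` call on a two-row made-up tibble (`toy_wide` and its values are hypothetical). Each input row becomes three output rows, one per score type, while `country` is repeated.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per coffee, one column per score
toy_wide <- tibble(
  country    = c("A", "B"),
  aroma      = c(7.5, 7.8),
  flavor     = c(7.4, 7.9),
  aftertaste = c(7.3, 7.6)
)

# Same reshaping as above: score columns stacked into score_type / score
toy_long <- toy_wide %>%
  pivot_longer(cols = c(aroma, flavor, aftertaste),
               names_to = "score_type",
               values_to = "score")

print(toy_long)
```

The long layout is what lets a single `aes(x = score_type, y = score)` mapping cover all three scores in one plot.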

# Linear regression: aroma ~ altitude + production

lm_aroma <- lm(aroma ~ altitude + production, data = coffee_clean)
summary(lm_aroma)


Call:
lm(formula = aroma ~ altitude + production, data = coffee_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5763 -0.1557  0.0070  0.1798  1.1735 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.570e+00  1.998e-02 378.836   <2e-16 ***
altitude    -7.169e-07  1.375e-06  -0.521    0.602    
production   2.649e-05  1.006e-04   0.263    0.792    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3971 on 986 degrees of freedom
Multiple R-squared:  0.000353,  Adjusted R-squared:  -0.001675 
F-statistic: 0.1741 on 2 and 986 DF,  p-value: 0.8403
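
The minimum residual of -7.5763, set against an intercept of about 7.57, suggests that at least one record has an aroma score at or near zero. Whether that is a placeholder or a genuine score is an assumption that would need checking against the data. The sketch below shows one way to pull out the coefficient table and locate the observation with the most negative residual. It runs on a made-up data frame, `toy`, with one deliberately implausible aroma value, not on the coffee data.

```r
# Made-up data; row 7 has a deliberately implausible aroma score of 0
toy <- data.frame(
  altitude   = c(1000, 1200, 1400, 1600, 1800, 2000, 1500, 1300),
  production = c(10, 30, 20, 5, 25, 15, 18, 12),
  aroma      = c(7.5, 7.6, 7.4, 7.5, 7.7, 7.5, 0, 7.6)
)

fit <- lm(aroma ~ altitude + production, data = toy)

# Coefficient table as a matrix: estimate, std. error, t value, p-value
coef_table <- coef(summary(fit))

# Observation with the most negative residual
worst <- toy[which.min(residuals(fit)), ]

print(coef_table)
print(worst)
```

Applied to `lm_aroma` and `coffee_clean`, the same two calls would show which record drives the large negative residual.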

# Filter for the top 5 countries by production

top5_countries <- coffee_clean %>%
  group_by(country) %>%
  summarise(total_production = sum(production, na.rm = TRUE)) %>%
  arrange(desc(total_production)) %>%
  slice_head(n = 5) %>%
  pull(country)

coffee_top5 <- coffee_long %>%
  filter(country %in% top5_countries)
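
As an aside, the `arrange(desc(...))` plus `slice_head()` pair can also be written with `dplyr::slice_max()`. This is only a sketch on made-up totals (`toy` is hypothetical and uses a top 2 rather than a top 5). Note that `slice_max()` keeps tied rows by default, so `with_ties = FALSE` is set here to return exactly `n` rows, as `slice_head()` does.

```r
library(dplyr)

# Made-up production records (illustrative values only)
toy <- tibble(
  country    = c("A", "A", "B", "C", "C", "D"),
  production = c(10, 5, 30, 8, 8, 1)
)

# Total production per country, then keep the two largest totals
top2 <- toy %>%
  group_by(country) %>%
  summarise(total_production = sum(production, na.rm = TRUE)) %>%
  slice_max(total_production, n = 2, with_ties = FALSE) %>%
  pull(country)

print(top2)
```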

# Create boxplot for only the top 5 countries

colors <- grDevices::rainbow(length(top5_countries))

ggplot(coffee_top5, aes(x = score_type, y = score, fill = country)) +
  geom_boxplot() +
  scale_fill_manual(values = colors) +
  theme_light(base_size = 12) +
  labs(
    title = "Coffee Scores by Top 5 Producing Countries",
    x = "Score Type",
    y = "Score",
    fill = "Country",
    caption = "Source: CORGIS Coffee Dataset"
  )

Essay

How I cleaned the data set

To prepare the coffee data set for analysis, I first selected only the relevant variables using dplyr::select(). I included country, year, altitude, aroma, flavor, aftertaste, and production. Column names were simplified for clarity. I then filtered out any rows with missing values using filter(!is.na(…)) to ensure accurate analysis. Finally, I pivoted the sensory quality scores to long format using tidyr::pivot_longer(), which allowed me to compare aroma, flavor, and aftertaste scores in a single visualization.

What the visualization represents and interesting patterns

The boxplot I created shows the distribution of coffee quality scores for the top five producing countries. Each country is represented by a different color, allowing easy comparison across score types. An interesting observation is that the countries producing the most coffee do not always have the highest quality scores. For example, one country had the highest production but lower aroma and aftertaste scores than other top producers, which shows that higher production does not guarantee better sensory quality. The regression points the same way: neither altitude (p = 0.602) nor production (p = 0.792) is a significant predictor of aroma, and the model explains almost none of the variation in aroma scores (R-squared of about 0.0004).

Additional analyses I considered but could not include

I had hoped to include a regression plot to visualize how altitude or production affects coffee quality scores, and to explore interactions between multiple sensory scores and production levels. However, time constraints and data set limitations prevented me from fully implementing these analyses. Despite this, the cleaned data set and visualization provide a clear overview of coffee production and quality among the top producing countries.
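
For reference, a regression plot of the kind described could be sketched as a scatterplot with a fitted line from `geom_smooth(method = "lm")`. The data frame `toy` below is made up for illustration and is not the coffee data. Swapping in `coffee_clean` and the `altitude` or `production` column is the assumed intended use.

```r
library(ggplot2)

# Made-up stand-in for coffee_clean (illustrative values only)
toy <- data.frame(
  altitude = c(900, 1100, 1250, 1400, 1550, 1700, 1850, 2000),
  aroma    = c(7.3, 7.5, 7.4, 7.6, 7.5, 7.7, 7.6, 7.8)
)

# Scatterplot of aroma against altitude with a simple linear fit and its confidence band
p <- ggplot(toy, aes(x = altitude, y = aroma)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  theme_light(base_size = 12) +
  labs(
    title = "Aroma Score vs. Altitude (illustrative data)",
    x = "Altitude (m)",
    y = "Aroma score"
  )

print(p)
```

The fitted line here is a one-predictor fit, so it would not match the two-predictor model reported earlier exactly. It is meant only as a visual check of each relationship on its own.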