Prepared by:
| Name | Matric Number |
|---|---|
| Aulia Fadlan | 24088143 |
| Muhammad Amir Shafiq bin Zulkipli | U2004711 |
| Muhammad Syafiq bin Salim | 17107373 |
| Nurul Iman Farhanah binti Haizul Azmi | 17202781 |
| Siti Aishah Binti Johan Iskandar | 24235814 |


Introduction

Meridian, a prominent retail giant known for its diverse product portfolio, is launching a strategic initiative to integrate cafés into its flagship stores. This “café-within-a-store” model aims to increase customer dwell time and basket size. The critical challenge is supply chain strategy: Meridian lacks institutional knowledge in coffee sourcing and must make a strategic decision on whether to source coffee beans from established suppliers or invest in direct trade with farmers.

To minimize risk and maximize brand reputation, Meridian must understand what drives coffee quality. Investing in a farm with high altitude but poor processing methods could be a financial disaster. The Chief Information Officer (CIO), in collaboration with the Head of Procurement, requires a data-driven framework to identify high-potential coffee sources and predict quality before signing long-term contracts.

The team will first validate the relationships between sensory quality measures (such as aroma, flavor, and aftertaste) and the final quality score. Once these links are confirmed, the team will analyze how bean metadata (e.g., processing method, species) and farm metadata (e.g., altitude, origin) influence these quality markers.

Finally, the team will develop two models: a classification model to categorize coffee as “Premium” or “Standard” for pricing guides, and a regression model to predict overall quality using only objective farm attributes. This allows Meridian to pre-qualify suppliers based on objective farm attributes, ensuring quality consistency before committing to a single roast.

Objectives

  1. To determine how quality measures (Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Cup Cleanliness, Sweetness, Moisture & Defects) statistically correlate with the final coffee quality.
  2. To analyse how Bean Metadata (processing method, species & color) and Farm Metadata (country of origin & altitude) impact the quality measures.
  3. To build and evaluate classification models to categorize coffee lots into two tiers: “Premium” and “Standard”.
  4. To develop and evaluate regression models to predict final coffee quality using only objective Bean and Farm metadata.

Data Understanding

Import Data

df <- read.csv("coffee_arabica_robusta_dataset.csv", na.strings = c("", "NA"))
glimpse(df)
## Rows: 1,339
## Columns: 44
## $ X                     <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ Species               <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Ara…
## $ Owner                 <chr> "metad plc", "metad plc", "grounds for health ad…
## $ Country.of.Origin     <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia",…
## $ Farm.Name             <chr> "metad plc", "metad plc", "san marcos barrancas …
## $ Lot.Number            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Mill                  <chr> "metad plc", "metad plc", NA, "wolensu", "metad …
## $ ICO.Number            <chr> "2014/2015", "2014/2015", NA, NA, "2014/2015", N…
## $ Company               <chr> "metad agricultural developmet plc", "metad agri…
## $ Altitude              <chr> "1950-2200", "1950-2200", "1600 - 1800 m", "1800…
## $ Region                <chr> "guji-hambela", "guji-hambela", NA, "oromia", "g…
## $ Producer              <chr> "METAD PLC", "METAD PLC", NA, "Yidnekachew Dabes…
## $ Number.of.Bags        <int> 300, 300, 5, 320, 300, 100, 100, 300, 300, 50, 3…
## $ Bag.Weight            <chr> "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg"…
## $ In.Country.Partner    <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ Harvest.Year          <chr> "2014", "2014", NA, "2014", "2014", "2013", "201…
## $ Grading.Date          <chr> "April 4th, 2015", "April 4th, 2015", "May 31st,…
## $ Owner.1               <chr> "metad plc", "metad plc", "Grounds for Health Ad…
## $ Variety               <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other"…
## $ Processing.Method     <chr> "Washed / Wet", "Washed / Wet", NA, "Natural / D…
## $ Aroma                 <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, …
## $ Flavor                <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, …
## $ Aftertaste            <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, …
## $ Acidity               <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, …
## $ Body                  <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, …
## $ Balance               <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, …
## $ Uniformity            <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ Clean.Cup             <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Sweetness             <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ Cupper.Points         <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00, …
## $ Total.Cup.Points      <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75,…
## $ Moisture              <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, …
## $ Category.One.Defects  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Quakers               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Color                 <chr> "Green", "Green", NA, "Green", "Green", "Bluish-…
## $ Category.Two.Defects  <int> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, …
## $ Expiration            <chr> "April 3rd, 2016", "April 3rd, 2016", "May 31st,…
## $ Certification.Body    <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ Certification.Address <chr> "309fcf77415a3661ae83e027f7e5f05dad786e44", "309…
## $ Certification.Contact <chr> "19fef5a731de2db57d16da10287413f5f99bc2dd", "19f…
## $ unit_of_measurement   <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
## $ altitude_low_meters   <dbl> 1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA, …
## $ altitude_high_meters  <dbl> 2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA, …
## $ altitude_mean_meters  <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, …

The Coffee Quality Institute (CQI) dataset is obtained from Kaggle. It contains 1,339 individual coffee samples (rows) and 44 features (columns) describing the coffee’s origin, processing and sensory quality.

The data type int refers to integers/discrete counts, chr to text/categorical data and dbl to continuous numerical values. The dataset contains 5 int, 24 chr and 15 dbl columns.
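To verify these counts, the column classes can be tabulated directly (a small check, not part of the original pipeline):

# chr columns appear as "character", int as "integer" and dbl as "numeric"
table(sapply(df, class))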

Key Features of The Dataset
| Feature | Description | Category |
|---|---|---|
| Aroma | Evaluating the olfactory experience of the coffee | Quality Measures |
| Flavor | Evaluating the taste profile of the coffee | Quality Measures |
| Aftertaste | Evaluating the flavor that remains on the palate after the coffee is swallowed | Quality Measures |
| Acidity | Evaluating the brightness and crispness of the coffee | Quality Measures |
| Body | Evaluating the tactile mouthfeel or weight of the liquid coffee | Quality Measures |
| Balance | Evaluating how well all the sensory components work together | Quality Measures |
| Uniformity | Evaluating how consistent the flavor is across multiple sample cups | Quality Measures |
| Clean.Cup | Evaluating the transparency of the flavor and absence of negative taints | Quality Measures |
| Sweetness | Evaluating the presence of pleasing, natural sweetness in the brew | Quality Measures |
| Moisture | Indicating the water content percentage of the green coffee beans | Quality Measures |
| Category.One.Defects | Count of primary defects (major imperfections) found in the coffee sample | Quality Measures |
| Category.Two.Defects | Count of secondary defects (minor imperfections) found in the coffee sample | Quality Measures |
| Processing.Method | Describing how the bean was removed from the fruit | Bean Metadata |
| Color | Physical color of the raw green beans | Bean Metadata |
| Species | Botanical variety of the coffee plant | Bean Metadata |
| Country.of.Origin | Nation where the coffee was grown and harvested | Farm Metadata |
| altitude_mean_meters | Mean elevation at which the coffee plants were cultivated, in meters above sea level | Farm Metadata |
| Total.Cup.Points | The primary indicator of the coffee’s overall quality | Dependent Variable |

Load Packages

# Dynamic report generation and table formatting
library(knitr)

# Advanced styling for tables
library(kableExtra)

# Data manipulation and wrangling
library(dplyr)

# Data visualization and plotting
library(ggplot2)

# Unified framework for modeling and machine learning
library(tidymodels)

# Collection of data science packages
library(tidyverse)

# Reshaping data structures
library(reshape2)

# Implementation of Random Forest algorithm
library(randomForest)

# Classification and Regression Training framework
library(caret)

# Fast and memory-efficient implementation of Random Forest
library(ranger)

# Variable Importance Plots
library(vip)

Data Cleaning

Data cleaning is the process of identifying and correcting errors, inconsistencies and inaccuracies in raw data. For our dataset, this involves several critical steps: selecting relevant features, identifying missing values, removing duplicate entries that could bias the model, handling outliers, and addressing other issues that may affect the accuracy and reliability of the data.

Drop irrelevant features

Features such as Farm.Name, Mill, Company and Region were dropped due to high variability and redundancy, while Lot.Number and Owner were removed because they act as unique identifiers.

keep_cols <- c("Aroma", "Flavor", "Aftertaste", "Acidity", "Body", "Balance", "Uniformity", "Clean.Cup", "Sweetness", "Species", "Country.of.Origin", "Processing.Method", "Color", "altitude_mean_meters", "Moisture", "Category.One.Defects", "Category.Two.Defects", "Total.Cup.Points"
)

df2a <- df %>%
  select(all_of(keep_cols))

Check duplicate records

Now, any duplicate records found in the dataset will be dropped.

duplicate_record <- duplicated(df2a)

cat("Number of duplicate rows:", sum(duplicate_record), "\n")
## Number of duplicate rows: 0
df2b <- df2a %>% distinct()

cat("Number of rows after removing duplicate:",nrow(df2b))
## Number of rows after removing duplicate: 1339

Check number of rows and columns

The dataset contains 1339 records and 18 features.

num_rows <- nrow(df2b)
cat("Number of rows:", num_rows, "\n")
## Number of rows: 1339
num_cols <- ncol(df2b)
cat("Number of columns:", num_cols, "\n")
## Number of columns: 18

Check missing values

For this part, missing data from each column will be counted.

total_missing <- sum(is.na(df2b))
cat("Total Missing Values:", total_missing, "\n\n")
## Total Missing Values: 619
missing_summary <- colSums(is.na(df2b))
print(missing_summary[missing_summary > 0])
##    Country.of.Origin    Processing.Method                Color 
##                    1                  170                  218 
## altitude_mean_meters 
##                  230

Data Visualization

Data visualization serves as a critical preliminary step in our analysis, allowing for the qualitative assessment of the dataset. Through the use of histograms and boxplots, we investigate the statistical properties of the attributes.

Missing Value Visualization

Missing values in each column will be visualized to identify variables that contain incomplete data.

missing_count <- colSums(is.na(df2b))

missing_count <- missing_count[missing_count > 0]

missing_count <- sort(missing_count, decreasing = TRUE)

par(mar = c(10, 4, 4, 2) + 0.1)

barplot(missing_count,
        main = "Missing Values per Variable",
        ylab = "Number of Missing Values",
        las = 2)

Based on the visualization of missing values, significant gaps were observed in several key variables. Specifically, altitude_mean_meters, Color and Processing.Method exhibit a notable number of missing entries, while Country.of.Origin has a negligible number.

In Country.of.Origin, the single missing value was manually identified as Colombia by cross-referencing the owner, racafe & cia s.c.a. The missing value will then be replaced with Colombia.
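This cross-reference can be reproduced on the raw data (a sketch of the lookup; the original chunk is not shown, and the Owner string is taken from the text above):

# All other rows from this owner in the raw data report Colombia
df %>% filter(Owner == "racafe & cia s.c.a") %>% count(Country.of.Origin)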

Also, we will replace the literal "None" with NA in Color.

df2b <- df2b %>%
  mutate(
    Country.of.Origin = if_else(is.na(Country.of.Origin), "Colombia", Country.of.Origin),
    Color = na_if(Color, "None")
  )

Then, a working copy for visualization is created.

df_vis <- df2b

Distribution of Quality Measures Features

For this part, we will visualize the distribution of all Quality Measures features to understand the data pattern and spread.

quality_cols <- c("Aroma", "Flavor", "Aftertaste", "Acidity", 
                  "Body", "Balance", "Uniformity", "Clean.Cup", 
                  "Sweetness", "Moisture", 
                  "Category.One.Defects", "Category.Two.Defects","Total.Cup.Points")

for(col_name in quality_cols){
  
  p <- ggplot(df_vis, aes(x = .data[[col_name]])) +
    geom_density(fill = "#69b3a2", color = "black", alpha = 0.7) +
    labs(title = paste("Density Distribution of", col_name),
         x = col_name,
         y = "Density") +
    theme_minimal()
  
  print(p)
}

Aroma

Note: Shows a normal distribution centered around 7.5 with approximate range of 6.5 to 8.5.

Flavor

Note: Shows a normal distribution centered around 7.5 with approximate range of 6.5 to 8.5.

Aftertaste

Note: Shows a normal distribution centered around 7.5 with approximate range of 6.5 to 8.5.

Acidity

Note: Shows a normal distribution centered around 7.5 with approximate range of 6.5 to 8.5.

Body

Note: Shows a normal distribution centered around 7.5 with approximate range of 6.5 to 8.5.

Balance

Note: Shows a normal distribution centered around 7.5 with approximate range of 6.5 to 8.5.

Uniformity

Note: Heavily left-skewed with a massive peak at 10.

Clean.Cup

Note: Shows a ceiling effect where most samples score a perfect 10.

Sweetness

Note: Similar to Clean Cup, almost all samples receive full marks.

Moisture

Note: The distribution is centered around 10-12%. Some samples are completely dry based on the moisture value.

Category.One.Defects

Note: Most samples have zero Category 1 defects.

Category.Two.Defects

Note: Category 2 defects are more common and show a right-skewed distribution.

Total.Cup.Points

Note: The target variable has a distribution centered around 82 points.

Distribution of Bean and Farm Metadata Features

For this part, we analyzed the physical and geographical metadata to understand the diversity of the samples.

meta_cols <- c("Species", "Country.of.Origin", "Processing.Method", 
               "Color", "altitude_mean_meters")

for(col_name in meta_cols){
  
  if(is.numeric(df_vis[[col_name]])){
    
    p <- ggplot(df_vis, aes(x = .data[[col_name]])) +
      geom_density(fill = "#40798C", color = "black", alpha = 0.7, adjust = 2) +
      labs(title = paste("Distribution of", col_name),
           x = col_name, y = "Density") +
      theme_minimal()
      
  } else {
    
    plot_data <- df_vis %>%
      filter(!is.na(.data[[col_name]])) %>% 
      count(.data[[col_name]], sort = TRUE) %>%
      slice_max(n, n = 10) 

    if(nrow(plot_data) < 10) {
      plot_title <- paste("Distribution of", col_name)
    } else {
      plot_title <- paste("Top 10 Categories:", col_name)
    }
    
    p <- ggplot(plot_data, aes(x = reorder(.data[[col_name]], n), y = n)) +
      geom_col(fill = "#40798C", color = "black", alpha = 0.7) +
      coord_flip() + 
      labs(title = plot_title,
           x = "", y = "Count") +
      theme_minimal()
  }
  
  print(p)
}

Species

Note: The dataset is dominated by Arabica beans which is expected for specialty coffee datasets.

Country.of.Origin

Note: Latin American countries such as Mexico, Colombia and Guatemala appear most frequently.

Processing.Method

Note: Washed / Wet is the most common processing method followed by Natural / Dry.

Color

Note: Green is the predominant bean color though many values might be missing.

altitude_mean_meters

Note: Most farms are between 1000 m and 1800 m. Altitudes higher than 5000 m are likely outliers.
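A quick tally (a small check, not in the original chunks) confirms how many records carry these implausible altitudes:

# Count records with mean altitude above 5000 m (suspected data entry errors)
sum(df_vis$altitude_mean_meters > 5000, na.rm = TRUE)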

Outlier Detection Using Boxplot

From the visualized data, we will use boxplots to detect the presence of outliers in the suspected numerical variables.

numeric_cols <- df_vis %>%
  select(altitude_mean_meters, Moisture, Category.One.Defects,
         Category.Two.Defects)

for(col_name in names(numeric_cols)){
  
  cat('\n### ', col_name, '\n')
  
  boxplot(numeric_cols[[col_name]],
          main = paste("Boxplot of", col_name),
          horizontal = TRUE,       
          col = "orange",          
          border = "brown",        
          xlab = col_name)
  
  cat('\n')
}

altitude_mean_meters

Moisture

Category.One.Defects

Category.Two.Defects

We utilized boxplots to identify extreme values across the numeric variables. Upon inspection, altitude_mean_meters contained values exceeding 5,000 meters; these were identified as data entry errors and replaced with NA, to be handled during the imputation step.

We will filter Total.Cup.Points to exclude values of 0 as these represent data entry errors or disqualified samples that do not reflect valid coffee grading scores.

df3a <- df2b %>%
  mutate(
    altitude_mean_meters = if_else(altitude_mean_meters > 5000, NA_real_, altitude_mean_meters)
  ) %>%
  filter(Total.Cup.Points > 0)

In contrast, outliers in other variables such as Moisture and Defects were retained. We treat these outliers as informative, representing genuine variation in quality or processing rather than errors.

Data Preprocessing

Data preprocessing is the systematic transformation of raw data into a model-ready format, which encompasses partitioning the dataset into training and testing subsets. Missing values will be imputed to ensure completeness, using statistics learned from the training set only so as to prevent data leakage.

Target Variable Creation

To prepare the data for binary classification, the continuous Total.Cup.Points variable was transformed into a categorical target. A median-based split was applied: samples scoring 82.5 or higher were labeled Premium, while those below this threshold were designated Standard.

df3a <- df3a %>%
  mutate(Quality_Class = if_else(Total.Cup.Points >= 82.5, "Premium", "Standard"))
table(df3a$Quality_Class)
## 
##  Premium Standard 
##      683      655
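Because the split is median-based, the 82.5 cut-off can be sanity-checked against the observed distribution (a quick verification sketch, not part of the original chunk):

# The threshold should sit close to the median of Total.Cup.Points
median(df3a$Total.Cup.Points)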

Convert Features

Here, categorical features were converted into factors which enables R’s machine learning algorithms to correctly interpret categories.

df3b <- df3a %>%
  mutate_if(is.character, as.factor)

glimpse(df3b)
## Rows: 1,338
## Columns: 19
## $ Aroma                <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, 8…
## $ Flavor               <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, 8…
## $ Aftertaste           <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, 8…
## $ Acidity              <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, 8…
## $ Body                 <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, 8…
## $ Balance              <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, 8…
## $ Uniformity           <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, …
## $ Clean.Cup            <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ Sweetness            <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, …
## $ Species              <fct> Arabica, Arabica, Arabica, Arabica, Arabica, Arab…
## $ Country.of.Origin    <fct> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia", …
## $ Processing.Method    <fct> Washed / Wet, Washed / Wet, NA, Natural / Dry, Wa…
## $ Color                <fct> Green, Green, NA, Green, Green, Bluish-Green, Blu…
## $ altitude_mean_meters <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, 1…
## $ Moisture             <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, 0…
## $ Category.One.Defects <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Category.Two.Defects <int> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, 0…
## $ Total.Cup.Points     <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75, …
## $ Quality_Class        <fct> Premium, Premium, Premium, Premium, Premium, Prem…

Train and Test Splitting

An 80-20 split allocates the majority of the dataset to the training phase, ensuring the model has sufficient examples to learn robust patterns, while reserving a statistically meaningful portion for testing to provide an unbiased evaluation of performance on unseen data.

set.seed(123)

data_split <- initial_split(df3b, prop = 0.8)

train_data <- training(data_split)
test_data  <- testing(data_split)

cat("Training Set:", nrow(train_data), "rows\n")
## Training Set: 1070 rows
cat("Testing Set: ", nrow(test_data), "rows\n")
## Testing Set:  268 rows

Handling Missing Values

As discussed before, we opt for imputation to handle missing values. All imputation steps are fitted on the training set before the same transformations are applied to the test set, preventing data leakage.

  • altitude_mean_meters: Impute missing values using the mean altitude stratified by Country.of.Origin
  • Country.of.Origin: To reduce high cardinality, we will group rare countries into Others.
  • Color & Processing.Method: Missing values will be imputed using the Mode.
  • All numerical features will be normalized here as well.
train_imputation <- recipe(Total.Cup.Points ~ ., data = train_data) %>%
  
  # 1. Impute mode for categorical
  step_impute_mode(Processing.Method, Color) %>%
  
  # 2. Group rare countries (below 2% frequency) into Others
  step_other(Country.of.Origin, threshold = 0.02, other = "Others") %>%
  
  # 3. Impute altitude by country-level mean
  step_impute_linear(altitude_mean_meters, impute_with = imp_vars(Country.of.Origin)) %>%

  # 4. Normalize all numerical features
  step_normalize(all_numeric_predictors()) 

# Train the recipe
prep_recipe <- prep(train_imputation, training = train_data)

# Get the final processed dataframes
train_processed <- bake(prep_recipe, new_data = NULL)
test_processed  <- bake(prep_recipe, new_data = test_data)

# Check for missing values
cat("Total NAs remaining:", sum(is.na(train_processed)), "\n")
## Total NAs remaining: 0

Exploratory Data Analysis (EDA)

At this stage, we explore the processed dataset to understand relationships between features and identify which variables are most relevant for explaining and predicting coffee quality, measured by Total.Cup.Points.

Below, we merge the train_processed and test_processed sets to analyze correlations and feature relationships across the entire dataset. This step is safe from data leakage because the merged dataset will not be used for the classification and regression tasks.

eda_data <- bind_rows(train_processed, test_processed)

Correlation Analysis

For this part, we examine linear relationships between numerical features using Pearson correlation.

The analysis focuses on two key aspects:

  1. Predictive Power - Strength of correlation between each feature and Total.Cup.Points.
  2. Multicollinearity - Strong correlations between features themselves, indicating redundant information.

A correlation heatmap is used to visualize these relationships.

num_cols <- eda_data %>%
  select(Aroma, Flavor, Aftertaste, Acidity, Body, Balance,
         Uniformity, Clean.Cup, Sweetness,
         altitude_mean_meters, Moisture,
         Category.One.Defects, Category.Two.Defects,
         Total.Cup.Points)

corr_matrix <- cor(num_cols, use = "complete.obs")

corr_long <- melt(corr_matrix)

ggplot(corr_long, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0,
                       limits = c(-1, 1),
                       breaks = seq(-1, 1, 0.5)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Pearson Correlation Heatmap",
       x = "", y = "", fill = "Correlation")

From the correlation matrix, we observed that Defects and Moisture show negative correlations with the target variable, correctly indicating that quality decreases as bean defects and moisture increase.

Meanwhile, sensory attributes such as Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Clean.Cup and Sweetness show high positive correlations with Total.Cup.Points, indicating that these features possess high predictive power and making them suitable for the classification model.

However, the heatmap also reveals strong inter-correlation among the sensory variables themselves. This confirms that these features must be excluded from the regression task to prevent data leakage and multicollinearity issues, which aligns with our regression objective.
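As a compact complement to the heatmap, the features can also be ranked by the absolute value of their correlation with the target (a quick sketch reusing corr_matrix from above):

# Rank features by |Pearson correlation| with Total.Cup.Points
target_corr <- sort(abs(corr_matrix[, "Total.Cup.Points"]), decreasing = TRUE)
round(target_corr[-1], 3)  # drop the target's self-correlation of 1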

Relationship between Numerical features and Total.Cup.Points

We examine the relationship between selected numerical features and coffee quality using scatter plots with trend lines.

num_vars <- c("Aroma", "Flavor", "Aftertaste", "Acidity", 
              "Body", "Balance", "Uniformity", "Clean.Cup", 
              "Sweetness","altitude_mean_meters", "Moisture", 
              "Category.One.Defects", "Category.Two.Defects")

for(var in num_vars){
  
  p <- ggplot(eda_data, aes(x = .data[[var]], y = Total.Cup.Points)) +
    geom_point(alpha = 0.4, color = "#2C3E50") +
    geom_smooth(method = "lm", color = "#E74C3C", se = FALSE) +
    labs(title = paste("Total.Cup.Points vs", var),
         x = var, y = "Total.Cup.Points") +
    theme_minimal()

  print(p)
}

Aroma

Observation: Strong positive correlation; as aroma quality improves, the total score increases significantly.

Flavor

Observation: Strong positive correlation between Flavor and target variable.

Aftertaste

Observation: Positive correlation as better aftertaste ratings contribute to a higher overall score.

Acidity

Observation: Positive correlation as desirable acidity levels boost the total cup points.

Body

Observation: Positive correlation as well-structured body contributes to higher quality scores.

Balance

Observation: Positive correlation as balance is key to a high-scoring coffee.

Uniformity

Observation: Positive correlation as lack of uniformity drastically penalizes the total score.

Clean.Cup

Observation: Positive correlation as defects in cup cleanliness lower the score immediately.

Sweetness

Observation: Positive correlation as perceptible sweetness is essential for specialty grade coffee.

altitude_mean_meters

Observation: Higher altitudes generally correlate with higher quality scores.

Moisture

Observation: Ideal moisture levels typically range between 10-12%. Deviations often result in lower scores.

Category.One.Defects

Observation: Even a small increase in the defect count causes a drop in quality.

Category.Two.Defects

Observation: Category.Two.Defects negatively impact quality.

Relationship between Categorical features and Total.Cup.Points

For categorical variables, we use boxplots to compare the distribution of Total.Cup.Points across categories.

cat_vars <- c("Species", "Color", "Processing.Method", "Country.of.Origin")

for(var in cat_vars){
  
  # reorder() sorts the categories by median score
  p <- ggplot(eda_data, aes(x = reorder(.data[[var]], Total.Cup.Points, FUN = median), 
                            y = Total.Cup.Points)) +
    geom_boxplot(fill = "#40798C", alpha = 0.7) +
    coord_flip() +
    labs(title = paste("Total.Cup.Points by", var),
         x = var, y = "Total.Cup.Points") +
    theme_minimal()
  
  print(p)
}

Species

Observation: Arabica beans generally achieve significantly higher scores than Robusta.

Color

Observation: Beans with Blue-Green or Green coloration appear to have higher median scores.

Processing.Method

Observation: Washed coffees often show tighter consistency than Natural processed beans.

Country.of.Origin

Observation: Origins such as Ethiopia and Kenya tend to score higher, while Mexico and Nicaragua show lower average quality.

Classification

For classification, we focus on building and evaluating two classification models to predict Quality_Class by fitting Logistic Regression and Random Forest using Quality Measures and Bean & Farm Metadata features.

# Remove the regression target 
if (!exists("class_train")) {
  class_train <- train_processed %>% select(-Total.Cup.Points)
}
if (!exists("class_test")) {
  class_test <- test_processed %>% select(-Total.Cup.Points)
}

# Safety check: ensure target exists and is a factor with consistent ordering
stopifnot("Quality_Class" %in% names(class_train))
stopifnot("Quality_Class" %in% names(class_test))

class_train <- class_train %>%
  mutate(Quality_Class = forcats::fct_relevel(as.factor(Quality_Class), "Standard", "Premium"))

class_test <- class_test %>%
  mutate(Quality_Class = forcats::fct_relevel(as.factor(Quality_Class), "Standard", "Premium"))

glimpse(class_train)
## Rows: 1,070
## Columns: 18
## $ Aroma                <dbl> 1.60729395, 0.04811594, -0.20135254, 0.32876798, …
## $ Flavor               <dbl> 0.88369161, -0.05840406, 0.65530478, 0.42691795, …
## $ Aftertaste           <dbl> 1.18625969, 0.48939464, 1.18625969, 0.48939464, 0…
## $ Acidity              <dbl> 0.6468548, 0.6468548, 1.1756293, -0.1307547, 1.17…
## $ Body                 <dbl> 0.48322290, 0.18311736, 1.31684940, 0.48322290, 0…
## $ Balance              <dbl> 0.15999396, 0.15999396, 0.64488070, -0.29637002, …
## $ Uniformity           <dbl> -1.0340780, 0.3239935, 0.3239935, 0.3239935, 0.32…
## $ Clean.Cup            <dbl> 0.2234559, 0.2234559, 0.2234559, 0.2234559, 0.223…
## $ Sweetness            <dbl> -0.9561335, 0.2372647, 0.2372647, 0.2372647, 0.23…
## $ Species              <fct> Arabica, Arabica, Arabica, Arabica, Arabica, Arab…
## $ Country.of.Origin    <fct> "United States (Hawaii)", "Tanzania, United Repub…
## $ Processing.Method    <fct> Washed / Wet, Washed / Wet, Washed / Wet, Washed …
## $ Color                <fct> Green, Green, Green, Bluish-Green, Green, Green, …
## $ altitude_mean_meters <dbl> -1.78499744, 1.06614121, 1.06414349, 1.01423740, …
## $ Moisture             <dbl> -1.6712894, 0.6527981, 2.3430435, 0.6527981, 0.44…
## $ Category.One.Defects <dbl> -0.1775027, -0.1775027, -0.1775027, -0.1775027, -…
## $ Category.Two.Defects <dbl> -0.66905265, -0.48026406, -0.29147547, -0.6690526…
## $ Quality_Class        <fct> Premium, Premium, Premium, Premium, Premium, Stan…

Classification Preprocessing

We analyzed the class distribution to determine the baseline accuracy, ensuring that the models' performance exceeds that of simply predicting the majority class.
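The proportions below establish that baseline. The original chunk is hidden, but something along these lines reproduces them:

# Class balance in the training and test sets; baseline accuracy is the
# majority class proportion (roughly 51% Premium in both splits)
prop.table(table(class_train$Quality_Class))
prop.table(table(class_test$Quality_Class))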

## 
##  Standard   Premium 
## 0.4897196 0.5102804
## 
## Standard  Premium 
## 0.488806 0.511194

(1) Encoding Categorical Features

Here, categorical variables are converted into dummy variables, which logistic regression requires. Zero-variance predictors are removed, and step_novel() is used so that unseen categories in the test data do not break the recipe.

class_recipe <- recipe(Quality_Class ~ ., data = class_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

(2) Evaluation Helper

We evaluate using:

  1. Confusion matrix

  2. Accuracy

  3. Recall

  4. F1

  5. ROC-AUC.

We treat Premium as the positive class. ROC-AUC is useful because it evaluates ranking quality across thresholds.

eval_model <- function(fit_obj, test_data, model_name) {

  prob_pred  <- predict(fit_obj, test_data, type = "prob")
  class_pred <- predict(fit_obj, test_data, type = "class")
  
  pred_out <- bind_cols(
  tibble(
    truth = factor(test_data$Quality_Class, levels = c("Standard", "Premium"))
    ),
    rename(class_pred, pred_class = .pred_class),
    prob_pred
  )
  
  conf_mat <- yardstick::conf_mat(pred_out, truth = truth, estimate = pred_class)
  
  metrics_tbl <- tibble(
    model    = model_name,
    accuracy = yardstick::accuracy_vec(truth = pred_out$truth, estimate = pred_out$pred_class),
    recall   = yardstick::recall_vec(truth = pred_out$truth, estimate = pred_out$pred_class, event_level = "second"),
    f1       = yardstick::f_meas_vec(truth = pred_out$truth, estimate = pred_out$pred_class, event_level = "second"),
    roc_auc  = yardstick::roc_auc_vec(truth = pred_out$truth, estimate = pred_out$.pred_Premium, event_level = "second")
  )
  
  list(metrics = metrics_tbl, conf_mat = conf_mat, pred = pred_out)
}

Logistic Regression Model

We use Logistic Regression as a strong baseline. It is easy to interpret and gives strong performance in our run.

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_wf <- workflow() %>%
  add_recipe(class_recipe) %>%
  add_model(lr_spec)

lr_fit <- fit(lr_wf, data = class_train)
lr_res <- eval_model(lr_fit, class_test, "Logistic Regression")

lr_res$conf_mat
##           Truth
## Prediction Standard Premium
##   Standard      126       7
##   Premium         5     130
lr_res$metrics
## # A tibble: 1 × 5
##   model               accuracy recall    f1 roc_auc
##   <chr>                  <dbl>  <dbl> <dbl>   <dbl>
## 1 Logistic Regression    0.955  0.949 0.956   0.986

Random Forest Model

Random Forest averages many trees. It often performs well on tabular data.

rf_spec <- rand_forest(trees = 800) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_recipe(class_recipe) %>%
  add_model(rf_spec)

rf_fit <- fit(rf_wf, data = class_train)
rf_res <- eval_model(rf_fit, class_test, "Random Forest")

rf_res$conf_mat
##           Truth
## Prediction Standard Premium
##   Standard      118       8
##   Premium        13     129
rf_res$metrics
## # A tibble: 1 × 5
##   model         accuracy recall    f1 roc_auc
##   <chr>            <dbl>  <dbl> <dbl>   <dbl>
## 1 Random Forest    0.922  0.942 0.925   0.980

Evaluation of Classification Models

We compare models side-by-side to choose the strongest overall performer.

class_metrics <- bind_rows(lr_res$metrics, rf_res$metrics)
class_metrics
## # A tibble: 2 × 5
##   model               accuracy recall    f1 roc_auc
##   <chr>                  <dbl>  <dbl> <dbl>   <dbl>
## 1 Logistic Regression    0.955  0.949 0.956   0.986
## 2 Random Forest          0.922  0.942 0.925   0.980

Logistic Regression achieved the strongest overall performance with accuracy = 0.955, recall = 0.949, F1 = 0.956, and ROC-AUC = 0.986, indicating very strong separation between Premium and Standard and consistently correct predictions.

Random Forest also performed well with accuracy = 0.922, recall = 0.942, F1 = 0.925, and ROC-AUC = 0.980, but it produced more total errors than Logistic Regression which reduced its accuracy and F1.

Overall, we select Logistic Regression as the best model because it provides the best balance of high correctness (accuracy/F1) and strong class separation (ROC-AUC).

Confusion Matrix

plot_cm_heatmap <- function(res_obj, title_txt) {
  
  cm_tbl <- as.data.frame(res_obj$conf_mat$table) %>%
    rename(truth = Truth, pred_class = Prediction, n = Freq)
  
  ggplot(cm_tbl, aes(x = truth, y = pred_class, fill = n)) +
    geom_tile(color = "white") +
    geom_text(aes(label = n)) +
    scale_fill_gradient(low = "white", high = "orange") +
    labs(title = title_txt, x = "Actual", y = "Predicted", fill = "Count") +
    theme_minimal()
}
plot_cm_heatmap(lr_res, "Confusion Matrix (Logistic Regression)")
plot_cm_heatmap(rf_res, "Confusion Matrix (Random Forest)")

Logistic Regression

Random Forest

Logistic Regression proved superior with only 12 total errors, demonstrating balanced and high accuracy. In contrast, Random Forest had nearly double the misclassifications (21), suffering specifically from more false positives (13 Standard beans incorrectly labeled as Premium), which explains why Random Forest has lower accuracy and F1 compared to Logistic Regression.
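As a quick sanity check, these totals can be computed from the confusion matrices by summing the off-diagonal cells (a small sketch, not in the original analysis):

# Misclassifications = all off-diagonal cells of each confusion matrix
sum(lr_res$conf_mat$table) - sum(diag(lr_res$conf_mat$table))  # 12 for LR
sum(rf_res$conf_mat$table) - sum(diag(rf_res$conf_mat$table))  # 21 for RF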

Predicted Probability Distribution

plot_prob_dist <- function(res_obj, title_txt) {
  ggplot(res_obj$pred, aes(x = .pred_Premium)) +
    geom_histogram(bins = 30) +
    facet_wrap(~ truth, ncol = 1) +
    labs(title = title_txt, x = "Predicted P(Premium)", y = "Count")
}

plot_prob_dist(lr_res, "Predicted Probability Distribution (Logistic Regression)")
plot_prob_dist(rf_res, "Predicted Probability Distribution (Random Forest)")

Logistic Regression

Random Forest

Logistic Regression demonstrates sharp class separation, clustering most Standard cases near 0 and Premium cases near 1 with very few borderline predictions. This high confidence distribution directly aligns with the model’s strong accuracy and ROC-AUC metrics.

Random Forest shows broader probability distributions with more overlap in the 0.3–0.8 range, indicating less consistent confidence compared to Logistic Regression. This increased uncertainty on borderline cases correlates directly with its higher misclassification rate and slightly lower accuracy/F1 scores.

Feature Importance

  • Logistic Regression

We use the absolute size of coefficients, then group dummy columns back to their original feature name so the output stays readable.

lr_fit_parsnip <- extract_fit_parsnip(lr_fit)$fit

orig_pred <- setdiff(names(class_train), "Quality_Class")

get_base_feature <- function(term) {
  if (term %in% orig_pred) return(term)
  hits <- orig_pred[startsWith(term, paste0(orig_pred, "_"))]
  if (length(hits) == 0) return(term)
  hits[which.max(nchar(hits))]
}

lr_coef_tbl <- broom::tidy(lr_fit_parsnip) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    abs_coef = abs(estimate),
    base_feature = purrr::map_chr(term, get_base_feature)
  ) %>%
  group_by(base_feature) %>%
  summarise(importance = sum(abs_coef), .groups = "drop") %>%
  arrange(desc(importance))

top_n <- 15

lr_coef_tbl %>%
  slice_max(order_by = importance, n = top_n) %>%
  ggplot(aes(x = reorder(base_feature, importance), y = importance)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste0("Top ", top_n, " Feature Importance (Logistic Regression)"),
    x = "Feature",
    y = "Importance (sum |coef|)"
  )

Based on absolute coefficient values, Logistic Regression identifies Country.of.Origin as the dominant predictor, followed by Clean.Cup, Sweetness, and Species. This indicates that the model distinguishes quality by balancing a bean’s geographical origin with specific, high-impact sensory attributes.
  • Random Forest

We use permutation importance from ranger, then group dummy columns back to their original feature name.

rf_fit_parsnip <- extract_fit_parsnip(rf_fit)$fit
rf_imp <- ranger::importance(rf_fit_parsnip)

rf_imp_tbl <- tibble(
  feature = names(rf_imp),
  importance = as.numeric(rf_imp),
  base_feature = purrr::map_chr(names(rf_imp), get_base_feature)
) %>%
  group_by(base_feature) %>%
  summarise(importance = sum(importance), .groups = "drop") %>%
  arrange(desc(importance))

rf_imp_tbl %>%
  slice_max(order_by = importance, n = top_n) %>%
  ggplot(aes(x = reorder(base_feature, importance), y = importance)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste0("Top ", top_n, " Feature Importance (Random Forest)"),
    x = "Feature",
    y = "Importance"
  )

Using permutation importance, Random Forest overwhelmingly prioritizes sensory attributes, identifying Flavor, Aftertaste, and Balance as the dominant drivers of quality. Unlike Logistic Regression, it treats contextual metadata like Country.of.Origin and physical metrics like altitude as secondary signals, relying primarily on the bean’s direct taste profile to make predictions.

ROC Curve Plot

roc_lr <- yardstick::roc_curve(lr_res$pred, truth, .pred_Premium, event_level = "second") %>%
  mutate(model = "Logistic Regression")

roc_rf <- yardstick::roc_curve(rf_res$pred, truth, .pred_Premium, event_level = "second") %>%
  mutate(model = "Random Forest")

bind_rows(roc_lr, roc_rf) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_path() +
  geom_abline(lty = 2) +
  labs(
    title = "ROC Curves (Premium as Positive Class)",
    x = "False Positive Rate",
    y = "True Positive Rate"
  )

Logistic Regression dominates the ROC plot with the highest AUC, demonstrating superior class separation. Random Forest follows closely but trails slightly, indicating it requires a marginally higher false positive rate to achieve comparable sensitivity.

Regression

In this part, two regression models, Multiple Linear Regression (MLR) and Random Forest, are designed to predict the target variable Total.Cup.Points, which is a continuous (numerical) value.

Multiple Linear Regression Model

The first regression model designed is Multiple Linear Regression.

sensory <- c("Aroma", "Flavor", "Aftertaste", "Acidity", 
              "Body", "Balance", "Uniformity", "Clean.Cup", 
              "Sweetness")

# Remove the classification target and sensory attributes
reg_train <- train_processed %>% 
  select(-Quality_Class,-all_of(sensory))

reg_test <- test_processed %>% 
  select(-Quality_Class,-all_of(sensory))

# Fit model
mlr_model <- lm(Total.Cup.Points ~ ., data = reg_train)
# Predict
mlr_pred <- predict(mlr_model, newdata = reg_test)


A predicted vs actual plot helps visualize the model’s accuracy.

# Plot predicted vs actual
plot(reg_test$Total.Cup.Points, mlr_pred,
     main = "Multiple Linear Regression: Predicted vs Actual",
     xlab = "Actual Total Cup Points",
     ylab = "Predicted Total Cup Points",
     pch = 19, col = "brown")
abline(0, 1, col = "black", lwd = 2)

The scatter plot shows a diffuse cloud, reflecting a relatively low R². While the model captures the central trend, it struggles to predict the variance of very high or low-scoring coffees.

Regression Coefficient

Regression coefficient plot shows how each predictor affects Total.Cup.Points. Positive values increase the score, while negative values decrease it, holding other variables constant.

coef_df <- tidy(mlr_model) %>%
  filter(term != "(Intercept)")

ggplot(coef_df, aes(x = reorder(term, estimate), y = estimate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Regression Coefficients (MLR)",
    x = "Predictors",
    y = "Coefficient Estimate"
  ) +
  theme_minimal()

  • Origins like Ethiopia, Kenya, Uganda and high altitude farms are strong positive drivers of coffee quality.
  • Defects, Robusta species, and origins such as Nicaragua are strong negative drivers.
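A small tabular complement to the plot (reusing coef_df from above) makes these extremes explicit:

# Five most positive and five most negative coefficient estimates
coef_df %>% arrange(desc(estimate)) %>% slice_head(n = 5)
coef_df %>% arrange(estimate) %>% slice_head(n = 5)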

Random Forest Model

The second regression model designed is Random Forest.

rf_model <- randomForest(
  Total.Cup.Points ~ .,
  data = reg_train,
  ntree = 500,
  importance = TRUE
)
rf_model
## 
## Call:
##  randomForest(formula = Total.Cup.Points ~ ., data = reg_train,      ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 5.932104
##                     % Var explained: 20.62
# Predict
rf_pred <- predict(rf_model, newdata = reg_test)


A predicted vs actual plot helps visualize the model’s accuracy.

# Plot predicted vs actual
plot(reg_test$Total.Cup.Points, rf_pred,
     main = "Random Forest: Predicted vs Actual",
     xlab = "Actual Total Cup Points",
     ylab = "Predicted Total Cup Points",
     pch = 19, col = "brown")
abline(0, 1, col = "black", lwd = 2)

Similar to MLR, the scatter plot shows a diffuse cloud, reflecting a relatively low R². While the model captures the central trend, it struggles to predict the variance of very high or low-scoring coffees.

Feature Importance

Feature importance shows which variables the random forest relies on most when making predictions, helping us identify the key drivers of coffee quality.

vip(rf_model, num_features = 10, aesthetics = list(fill = "steelblue")) +
  ggtitle("Random Forest Feature Importance") +
  theme_minimal()

The Random Forest model highlights Altitude and Origin as the strongest predictors of coffee quality, with Defects and Moisture also contributing. Species and Color have minimal impact in this dataset.

Evaluation of Regression Models

Both models were evaluated using RMSE and MAE to measure prediction errors, while R² was used to assess how well the predictors explain the variation in Total.Cup.Points.
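For reference, RMSE and MAE follow the standard formulas; a manual sketch for the MLR predictions should agree with caret's RMSE() and MAE() helpers used below:

# Manual check: root-mean-squared error and mean absolute error
sqrt(mean((mlr_pred - reg_test$Total.Cup.Points)^2))  # RMSE
mean(abs(mlr_pred - reg_test$Total.Cup.Points))       # MAE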

# Evaluate
MLR_RMSE <- RMSE(mlr_pred, reg_test$Total.Cup.Points)
MLR_R2   <- R2(mlr_pred, reg_test$Total.Cup.Points)
MLR_MAE   <- MAE(mlr_pred, reg_test$Total.Cup.Points)
RF_RMSE <- RMSE(rf_pred, reg_test$Total.Cup.Points)
RF_R2   <- R2(rf_pred, reg_test$Total.Cup.Points)
RF_MAE   <- MAE(rf_pred, reg_test$Total.Cup.Points)


comparison <- data.frame(
  Model = c("Multiple Linear Regression", "Random Forest"),
  RMSE = c(MLR_RMSE, RF_RMSE),
  R2 = c(MLR_R2, RF_R2),
  MAE = c(MLR_MAE, RF_MAE)
)
comparison
##                        Model     RMSE       R2      MAE
## 1 Multiple Linear Regression 2.187667 0.232373 1.584461
## 2              Random Forest 2.129165 0.271545 1.528944
  • Based on the evaluation results, the Random Forest model performs better than Multiple Linear Regression.

  • The RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) for Random Forest are slightly lower, meaning its predictions are closer to the actual coffee scores.

  • The R² value is also higher, showing that Random Forest explains more of the variation in Total.Cup.Points compared to the linear model.

Conclusion

From the evaluation of the classification models, Logistic Regression consistently outperformed Random Forest with an accuracy of 0.955, recall of 0.949, F1 of 0.956, and ROC-AUC of 0.986. On top of that, there is a divergence in feature importance between the two models: Logistic Regression identifies Country.of.Origin as the dominant predictor, while Random Forest indicates that Flavor, Aftertaste, and Balance are the definitive sensory markers of quality. This divergence suggests that Logistic Regression looks for broader, linear signals: it identifies that certain countries have a higher “baseline” of quality due to national infrastructure, climate, and sorting standards.

Both regression models show that Country of Origin is the strongest predictor of coffee quality. Higher altitudes and specific origins like Ethiopia and Kenya are statistically safer investments for quality. However, the R² values for both models are low, indicating that coffee quality is complex and cannot be fully predicted by geography alone.

By combining these findings, Meridian may aim to source its coffee beans from origins like Ethiopia, Kenya and Uganda, in high-altitude areas, as an initial step. It may then make the final decision based on the flavor, aftertaste, balance, aroma and acidity of the shortlisted beans.