| Name | Matric Number |
|---|---|
| Aulia Fadlan | 24088143 |
| Muhammad Amir Shafiq bin Zulkipli | U2004711 |
| Muhammad Syafiq bin Salim | 17107373 |
| Nurul Iman Farhanah binti Haizul Azmi | 17202781 |
| Siti Aishah Binti Johan Iskandar | 24235814 |
Meridian, a prominent retail giant known for its diverse product portfolio, is launching a strategic initiative to integrate cafés into its flagship stores. This “café-within-a-store” model aims to increase customer dwell time and basket size. The critical challenge is supply chain strategy. Meridian lacks institutional knowledge in coffee sourcing. They must make a strategic decision on whether to source coffee beans from established suppliers or invest in direct trade from farmers.
To minimize risk and maximize brand reputation, Meridian must understand what drives coffee quality. Investing in a farm with high altitude but poor processing methods could be a financial disaster. The Chief Information Officer (CIO), in collaboration with the Head of Procurement, requires a data-driven framework to identify high-potential coffee sources and predict quality before signing long-term contracts.
The team will first validate the relationships between sensory quality measures such as aroma, flavor, and aftertaste and the final quality score. Once these links are confirmed, the team will analyze how bean metadata (e.g., processing method, species) and farm metadata (e.g., altitude, origin) influence these quality markers.
Finally, the team will develop two models: a classification model to categorize coffee as “Premium” or “Standard” for pricing guides, and a regression model to predict overall quality using only objective farm attributes. This allows Meridian to pre-qualify suppliers based on objective farm attributes, ensuring quality consistency before committing to a single roast.
df <- read.csv("coffee_arabica_robusta_dataset.csv", na.strings = c("", "NA"))
glimpse(df)
## Rows: 1,339
## Columns: 44
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ Species <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Ara…
## $ Owner <chr> "metad plc", "metad plc", "grounds for health ad…
## $ Country.of.Origin <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia",…
## $ Farm.Name <chr> "metad plc", "metad plc", "san marcos barrancas …
## $ Lot.Number <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Mill <chr> "metad plc", "metad plc", NA, "wolensu", "metad …
## $ ICO.Number <chr> "2014/2015", "2014/2015", NA, NA, "2014/2015", N…
## $ Company <chr> "metad agricultural developmet plc", "metad agri…
## $ Altitude <chr> "1950-2200", "1950-2200", "1600 - 1800 m", "1800…
## $ Region <chr> "guji-hambela", "guji-hambela", NA, "oromia", "g…
## $ Producer <chr> "METAD PLC", "METAD PLC", NA, "Yidnekachew Dabes…
## $ Number.of.Bags <int> 300, 300, 5, 320, 300, 100, 100, 300, 300, 50, 3…
## $ Bag.Weight <chr> "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg"…
## $ In.Country.Partner <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ Harvest.Year <chr> "2014", "2014", NA, "2014", "2014", "2013", "201…
## $ Grading.Date <chr> "April 4th, 2015", "April 4th, 2015", "May 31st,…
## $ Owner.1 <chr> "metad plc", "metad plc", "Grounds for Health Ad…
## $ Variety <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other"…
## $ Processing.Method <chr> "Washed / Wet", "Washed / Wet", NA, "Natural / D…
## $ Aroma <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, …
## $ Flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, …
## $ Aftertaste <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, …
## $ Acidity <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, …
## $ Body <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, …
## $ Balance <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, …
## $ Uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ Clean.Cup <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Sweetness <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ Cupper.Points <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00, …
## $ Total.Cup.Points <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75,…
## $ Moisture <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, …
## $ Category.One.Defects <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Quakers <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Color <chr> "Green", "Green", NA, "Green", "Green", "Bluish-…
## $ Category.Two.Defects <int> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, …
## $ Expiration <chr> "April 3rd, 2016", "April 3rd, 2016", "May 31st,…
## $ Certification.Body <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ Certification.Address <chr> "309fcf77415a3661ae83e027f7e5f05dad786e44", "309…
## $ Certification.Contact <chr> "19fef5a731de2db57d16da10287413f5f99bc2dd", "19f…
## $ unit_of_measurement <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
## $ altitude_low_meters <dbl> 1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA, …
## $ altitude_high_meters <dbl> 2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA, …
## $ altitude_mean_meters <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, …
The Coffee Quality Institute (CQI) dataset is obtained from Kaggle. It contains 1,339 individual coffee samples (rows) and 44 features (columns) describing the coffee’s origin, processing and sensory quality.
The data type int refers to integer/discrete counts, chr to text/categorical data, and dbl to continuous numeric values. The dataset contains 5 int, 24 chr and 15 dbl columns.
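These counts can be verified directly; glimpse's int, chr and dbl labels correspond to R's integer, character and numeric classes (a quick check):
# Tally column classes across the raw dataframe
table(sapply(df, class))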
| Features | Description | Category |
|---|---|---|
| Aroma | Evaluating the olfactory experience of the coffee | Quality Measures |
| Flavor | Evaluating the taste profile of coffee | Quality Measures |
| Aftertaste | Evaluating the flavor that remains on the palate after the coffee is swallowed | Quality Measures |
| Acidity | Evaluating the brightness and crispness of the coffee | Quality Measures |
| Body | Evaluating the tactile mouthfeel or weight of the liquid coffee | Quality Measures |
| Balance | Evaluating how well all the sensory components work together | Quality Measures |
| Uniformity | Evaluating how consistent the flavor is across multiple sample cups | Quality Measures |
| Clean.Cup | Evaluating the transparency of the flavor and absence of negative taints | Quality Measures |
| Sweetness | Evaluating the presence of pleasing, natural sweetness in the brew | Quality Measures |
| Moisture | Indicating the water content percentage of the green coffee beans | Quality Measures |
| Category.One.Defects | Number of count of primary defects (major imperfections) found in the coffee sample | Quality Measures |
| Category.Two.Defects | Number of count of secondary defects (minor imperfections) found in the coffee sample | Quality Measures |
| Processing.Method | Describing how the bean was removed from the fruit | Bean Metadata |
| Color | Physical color of the raw green beans | Bean Metadata |
| Species | Botanical species of the coffee plant (Arabica or Robusta) | Bean Metadata |
| Country.of.Origin | Nation where the coffee was grown and harvested | Farm Metadata |
| altitude_mean_meters | Mean elevation at which the coffee plants were cultivated, measured in meters above sea level | Farm Metadata |
| Total.Cup.Points | The primary indicator of the coffee’s overall quality | Dependent Variable |
# Dynamic report generation and table formatting
library(knitr)
# Advanced styling for tables
library(kableExtra)
# Data manipulation and wrangling
library(dplyr)
# Data visualization and plotting
library(ggplot2)
# Unified framework for modeling and machine learning
library(tidymodels)
# Collection of data science packages
library(tidyverse)
# Reshaping data structures
library(reshape2)
# Implementation of Random Forest algorithm
library(randomForest)
# Classification and Regression Training framework
library(caret)
# Fast and memory-efficient implementation of Random Forest
library(ranger)
# Variable Importance Plots
library(vip)
Data cleaning is the process of identifying and correcting errors, inconsistencies and inaccuracies in raw data. In the context of our dataset, this involves several critical steps: selecting relevant features, identifying missing values, removing duplicate entries that could bias the model, and handling outliers and other issues that may affect the accuracy and reliability of the data.
Features such as Farm.Name, Mill, Company and Region were dropped due to high variability and redundancy, while Lot.Number and Owner were removed because they act as unique identifiers.
keep_cols <- c("Aroma", "Flavor", "Aftertaste", "Acidity", "Body", "Balance", "Uniformity", "Clean.Cup", "Sweetness", "Species", "Country.of.Origin", "Processing.Method", "Color", "altitude_mean_meters", "Moisture", "Category.One.Defects", "Category.Two.Defects", "Total.Cup.Points"
)
df2a <- df %>%
select(all_of(keep_cols))
Now, any duplicate records found in the dataset will be dropped.
duplicate_record <- duplicated(df2a)
cat("Number of duplicate rows:", sum(duplicate_record), "\n")
## Number of duplicate rows: 0
df2b <- df2a %>% distinct()
cat("Number of rows after removing duplicate:",nrow(df2b))
## Number of rows after removing duplicate: 1339
The dataset contains 1339 records and 18 features.
num_rows <- nrow(df2b)
cat("Number of rows:", num_rows, "\n")
## Number of rows: 1339
num_cols <- ncol(df2b)
cat("Number of columns:", num_cols, "\n")
## Number of columns: 18
For this part, missing data from each column will be counted.
total_missing <- sum(is.na(df2b))
cat("Total Missing Values:", total_missing, "\n\n")
## Total Missing Values: 619
missing_summary <- colSums(is.na(df2b))
print(missing_summary[missing_summary > 0])
## Country.of.Origin Processing.Method Color
## 1 170 218
## altitude_mean_meters
## 230
Data visualization serves as a critical preliminary step in our analysis, allowing for the qualitative assessment of the dataset. Through the use of histograms and boxplots, we investigate the statistical properties of the attributes.
Missing values will be visualized for each column to identify variables that contain incomplete data.
missing_count <- colSums(is.na(df2b))
missing_count <- missing_count[missing_count > 0]
missing_count <- sort(missing_count, decreasing = TRUE)
par(mar = c(10, 4, 4, 2) + 0.1)
barplot(missing_count,
main = "Missing Values per Variable",
ylab = "Number of Missing Values",
las = 2)
Based on the visualization of missing values, significant gaps were observed in several key variables. Specifically, altitude_mean_meters, Color and Processing.Method exhibit a notable number of missing entries, while Country.of.Origin has negligible missing values.
In Country.of.Origin, the single missing value was manually identified as Colombia by cross-referencing the owner, racafe & cia s.c.a, and will be replaced accordingly. We will also replace None with NA in Color.
df2b <- df2b %>%
mutate(
Country.of.Origin = if_else(is.na(Country.of.Origin), "Colombia", Country.of.Origin),
Color = na_if(Color, "None")
)
Then, a working copy for visualization is created.
df_vis <- df2b
For this part, we will visualize the distribution of all Quality Measures features to understand the data pattern and spread.
quality_cols <- c("Aroma", "Flavor", "Aftertaste", "Acidity",
"Body", "Balance", "Uniformity", "Clean.Cup",
"Sweetness", "Moisture",
"Category.One.Defects", "Category.Two.Defects","Total.Cup.Points")
for(col_name in quality_cols){
p <- ggplot(df_vis, aes(x = .data[[col_name]])) +
geom_density(fill = "#69b3a2", color = "black", alpha = 0.7) +
labs(title = paste("Density Distribution of", col_name),
x = col_name,
y = "Density") +
theme_minimal()
print(p)
}
- Aroma, Flavor, Aftertaste, Acidity, Body, Balance: each shows a roughly normal distribution centered around 7.5 with an approximate range of 6.5 to 8.5.
- Uniformity: heavily left-skewed with a massive peak at 10.
- Clean.Cup: shows a ceiling effect where most samples score a perfect 10.
- Sweetness: similar to Clean.Cup, almost all samples receive full marks.
- Moisture: centered around 10-12%; some samples register zero moisture.
- Category.One.Defects: most samples have zero Category 1 defects.
- Category.Two.Defects: more common and show a right-skewed distribution.
- Total.Cup.Points: the target variable is centered around 82 points.
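The ceiling effects noted for Uniformity, Clean.Cup and Sweetness can be quantified with a quick sketch (the column selection is ours):
# Share of samples scoring a perfect 10 on the bounded attributes
sapply(df_vis[c("Uniformity", "Clean.Cup", "Sweetness")], function(x) mean(x == 10, na.rm = TRUE))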
For this part, we analyzed the physical and geographical metadata to understand the diversity of the samples.
meta_cols <- c("Species", "Country.of.Origin", "Processing.Method",
"Color", "altitude_mean_meters")
for(col_name in meta_cols){
if(is.numeric(df_vis[[col_name]])){
p <- ggplot(df_vis, aes(x = .data[[col_name]])) +
geom_density(fill = "#40798C", color = "black", alpha = 0.7, adjust = 2) +
labs(title = paste("Distribution of", col_name),
x = col_name, y = "Density") +
theme_minimal()
} else {
plot_data <- df_vis %>%
filter(!is.na(.data[[col_name]])) %>%
count(.data[[col_name]], sort = TRUE) %>%
slice_max(n, n = 10)
if(nrow(plot_data) < 10) {
plot_title <- paste("Distribution of", col_name)
} else {
plot_title <- paste("Top 10 Categories:", col_name)
}
p <- ggplot(plot_data, aes(x = reorder(.data[[col_name]], n), y = n)) +
geom_col(fill = "#40798C", color = "black", alpha = 0.7) +
coord_flip() +
labs(title = plot_title,
x = "", y = "Count") +
theme_minimal()
}
print(p)
}
- Species: the dataset is dominated by Arabica beans, which is expected for specialty coffee datasets.
- Country.of.Origin: Latin American countries such as Mexico, Colombia and Guatemala appear most frequently.
- Processing.Method: Washed / Wet is the most common method, followed by Natural / Dry.
- Color: Green is the predominant bean color, though many values are missing.
- altitude_mean_meters: most farms lie between 1,000 m and 1,800 m; recorded altitudes above 5,000 m are likely outliers.
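As a quick check on the altitude observation (illustrative):
# Count farms with implausible recorded altitudes (> 5,000 m)
sum(df_vis$altitude_mean_meters > 5000, na.rm = TRUE)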
From the visualized data, we will use boxplots to detect the presence of outliers in the suspected numerical variables.
numeric_cols <- df_vis %>%
select(altitude_mean_meters, Moisture, Category.One.Defects,
Category.Two.Defects)
for(col_name in names(numeric_cols)){
cat('\n### ', col_name, '\n')
boxplot(numeric_cols[[col_name]],
main = paste("Boxplot of", col_name),
horizontal = TRUE,
col = "orange",
border = "brown",
xlab = col_name)
cat('\n')
}
We utilized boxplots to identify extreme values across the numeric variables. Upon inspection, altitude_mean_meters contained values exceeding 5,000 meters; these were identified as data entry errors and replaced with NA, to be handled during the imputation step.
We will also filter Total.Cup.Points to exclude values of 0, as these represent data entry errors or disqualified samples that do not reflect valid coffee grading scores.
df3a <- df2b %>%
mutate(
altitude_mean_meters = if_else(altitude_mean_meters > 5000, NA_real_, altitude_mean_meters)
) %>%
filter(Total.Cup.Points > 0)
In contrast, outliers in other variables such as Moisture and the defect counts were retained. We assume these outliers are informative, representing genuine variation in quality or processing rather than errors.
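To gauge how many retained points a conventional rule would flag, a small sketch using the 1.5 × IQR criterion (the cutoff is illustrative, not part of our pipeline):
# Count 1.5*IQR-rule outliers in the retained variables
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  sum(x < q[1] - 1.5 * diff(q) | x > q[2] + 1.5 * diff(q), na.rm = TRUE)
}
sapply(df3a[c("Moisture", "Category.One.Defects", "Category.Two.Defects")], iqr_outliers)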
Data processing is the systematic transformation of raw data into a model-ready format, which encompasses partitioning the dataset into training and testing subsets. To prevent data leakage, imputation of missing values will be fitted on the training set only before being applied to the test set.
To prepare the data for binary classification, the continuous Total.Cup.Points variable was transformed into a categorical target. A median-based split was applied: samples scoring 82.5 or higher were labeled as Premium, while those below this threshold were designated as Standard.
df3a <- df3a %>%
mutate(Quality_Class = if_else(Total.Cup.Points >= 82.5, "Premium", "Standard"))
table(df3a$Quality_Class)
##
## Premium Standard
## 683 655
Here, categorical features were converted into factors which enables R’s machine learning algorithms to correctly interpret categories.
df3b <- df3a %>%
mutate_if(is.character, as.factor)
glimpse(df3b)
## Rows: 1,338
## Columns: 19
## $ Aroma <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, 8…
## $ Flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, 8…
## $ Aftertaste <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, 8…
## $ Acidity <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, 8…
## $ Body <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, 8…
## $ Balance <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, 8…
## $ Uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, …
## $ Clean.Cup <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ Sweetness <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, …
## $ Species <fct> Arabica, Arabica, Arabica, Arabica, Arabica, Arab…
## $ Country.of.Origin <fct> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia", …
## $ Processing.Method <fct> Washed / Wet, Washed / Wet, NA, Natural / Dry, Wa…
## $ Color <fct> Green, Green, NA, Green, Green, Bluish-Green, Blu…
## $ altitude_mean_meters <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, 1…
## $ Moisture <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, 0…
## $ Category.One.Defects <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Category.Two.Defects <int> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, 0…
## $ Total.Cup.Points <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75, …
## $ Quality_Class <fct> Premium, Premium, Premium, Premium, Premium, Prem…
An 80-20 split allocates the majority of the dataset to the training phase, ensuring the model has sufficient examples to learn robust patterns, while reserving a sufficiently large portion for testing to provide an unbiased evaluation of its performance on unseen data.
set.seed(123)
data_split <- initial_split(df3b, prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)
cat("Training Set:", nrow(train_data), "rows\n")
## Training Set: 1070 rows
cat("Testing Set: ", nrow(test_data), "rows\n")
## Testing Set: 268 rows
As discussed before, we opt for imputation to handle missing values. All imputation steps are fitted on the training set and then applied to the test set to prevent data leakage.
- altitude_mean_meters: impute missing values using the mean altitude stratified by Country.of.Origin.
- Country.of.Origin: to reduce high cardinality, group rare countries into Others.
- Color & Processing.Method: impute missing values using the mode.
train_imputation <- recipe(Total.Cup.Points ~ ., data = train_data) %>%
# 1. Impute mode for categorical
step_impute_mode(Processing.Method, Color) %>%
# 2. Group infrequent countries (below 2% of samples) into Others
step_other(Country.of.Origin, threshold = 0.02, other = "Others") %>%
# 3. Impute altitude by country-level mean
step_impute_linear(altitude_mean_meters, impute_with = imp_vars(Country.of.Origin)) %>%
# 4. Normalize all numerical features
step_normalize(all_numeric_predictors())
# Train the recipe
prep_recipe <- prep(train_imputation, training = train_data)
# Get the final processed dataframes
train_processed <- bake(prep_recipe, new_data = NULL)
test_processed <- bake(prep_recipe, new_data = test_data)
# Check for missing values
cat("Total NAs remaining:", sum(is.na(train_processed)), "\n")
## Total NAs remaining: 0
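The same check on the test split, baked with the recipe fitted on the training data only (a quick sketch):
cat("Total NAs remaining in test:", sum(is.na(test_processed)), "\n")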
At this stage, we explore the processed dataset to understand relationships between features and identify which variables are most relevant for explaining and predicting coffee quality, measured by Total.Cup.Points.
Below, we merge the train_processed and test_processed sets to analyze correlations and feature relationships across the entire dataset. This step is safe from data leakage because the merged data will not be used for the classification and regression tasks.
eda_data <- bind_rows(train_processed, test_processed)
For this part, we examine linear relationships between numerical features using Pearson correlation. The analysis focuses on two key aspects: inter-correlations among the numerical features themselves, and each feature's correlation with the target Total.Cup.Points. A correlation heatmap is used to visualize these relationships.
num_cols <- eda_data %>%
select(Aroma, Flavor, Aftertaste, Acidity, Body, Balance,
Uniformity, Clean.Cup, Sweetness,
altitude_mean_meters, Moisture,
Category.One.Defects, Category.Two.Defects,
Total.Cup.Points)
corr_matrix <- cor(num_cols, use = "complete.obs")
corr_long <- melt(corr_matrix)
ggplot(corr_long, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0,
limits = c(-1, 1),
breaks = seq(-1, 1, 0.5)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Pearson Correlation Heatmap",
x = "", y = "", fill = "Correlation")
From the correlation matrix, we observed that the defect counts and Moisture show negative correlations with the target variable, correctly indicating that quality decreases as defects and bean moisture increase.
Meanwhile, sensory attributes such as Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Clean.Cup and Sweetness show high positive correlations with Total.Cup.Points, indicating strong predictive power and making them suitable for the classification model.
However, the heatmap also reveals strong inter-correlation among the sensory variables themselves, confirming that these features must be excluded from the regression model to prevent data leakage and multicollinearity, in line with our regression objective.
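These observations can be quantified by ranking each feature's correlation with the target directly from corr_matrix (a small sketch):
# Correlations with Total.Cup.Points, strongest positive first
target_corr <- sort(corr_matrix[, "Total.Cup.Points"], decreasing = TRUE)
round(target_corr, 2)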
We examine the relationship between selected numerical features and coffee quality using scatter plots with trend lines.
num_vars <- c("Aroma", "Flavor", "Aftertaste", "Acidity",
"Body", "Balance", "Uniformity", "Clean.Cup",
"Sweetness","altitude_mean_meters", "Moisture",
"Category.One.Defects", "Category.Two.Defects")
for(var in num_vars){
p <- ggplot(eda_data, aes(x = .data[[var]], y = Total.Cup.Points)) +
geom_point(alpha = 0.4, color = "#2C3E50") +
geom_smooth(method = "lm", color = "#E74C3C", se = FALSE) +
labs(title = paste("Total.Cup.Points vs", var),
x = var, y = "Total.Cup.Points") +
theme_minimal()
print(p)
}
- Aroma: strong positive correlation; as aroma quality improves, the total score increases significantly.
- Flavor: strong positive correlation with the target variable.
- Aftertaste: positive correlation; better aftertaste ratings contribute to a higher overall score.
- Acidity: positive correlation; desirable acidity levels boost the total cup points.
- Body: positive correlation; a well-structured body contributes to higher quality scores.
- Balance: positive correlation; balance is key to a high-scoring coffee.
- Uniformity: positive correlation; a lack of uniformity drastically penalizes the total score.
- Clean.Cup: positive correlation; defects in cup cleanliness immediately lower the score.
- Sweetness: positive correlation; perceptible sweetness is essential for specialty-grade coffee.
- altitude_mean_meters: higher altitudes generally correlate with higher quality scores.
- Moisture: ideal moisture levels typically range between 10-12%; deviations often result in lower scores.
- Category.One.Defects: even a small increase in defect count causes a drop in quality.
- Category.Two.Defects: negatively impact quality.
For categorical variables, we use boxplots to compare the distribution of Total.Cup.Points across categories.
cat_vars <- c("Species", "Color", "Processing.Method", "Country.of.Origin")
for(var in cat_vars){
# reorder() sorts the categories by median score
p <- ggplot(eda_data, aes(x = reorder(.data[[var]], Total.Cup.Points, FUN = median),
y = Total.Cup.Points)) +
geom_boxplot(fill = "#40798C", alpha = 0.7) +
coord_flip() +
labs(title = paste("Total.Cup.Points by", var),
x = var, y = "Total.Cup.Points") +
theme_minimal()
print(p)
}
- Species: Arabica beans generally achieve significantly higher scores than Robusta.
- Color: beans with Bluish-Green or Green coloration appear to have higher median scores.
- Processing.Method: washed coffees often show tighter consistency than naturally processed beans.
- Country.of.Origin: origins such as Ethiopia and Kenya tend to score higher, while Mexico and Nicaragua show lower average quality.
For classification, we focus on building and evaluating two classification models to predict Quality_Class by fitting Logistic Regression and Random Forest using Quality Measures and Bean & Farm Metadata features.
# Remove the regression target
if (!exists("class_train")) {
class_train <- train_processed %>% select(-Total.Cup.Points)
}
if (!exists("class_test")) {
class_test <- test_processed %>% select(-Total.Cup.Points)
}
# Safety check: ensure target exists and is a factor with consistent ordering
stopifnot("Quality_Class" %in% names(class_train))
stopifnot("Quality_Class" %in% names(class_test))
class_train <- class_train %>%
mutate(Quality_Class = forcats::fct_relevel(as.factor(Quality_Class), "Standard", "Premium"))
class_test <- class_test %>%
mutate(Quality_Class = forcats::fct_relevel(as.factor(Quality_Class), "Standard", "Premium"))
glimpse(class_train)
## Rows: 1,070
## Columns: 18
## $ Aroma <dbl> 1.60729395, 0.04811594, -0.20135254, 0.32876798, …
## $ Flavor <dbl> 0.88369161, -0.05840406, 0.65530478, 0.42691795, …
## $ Aftertaste <dbl> 1.18625969, 0.48939464, 1.18625969, 0.48939464, 0…
## $ Acidity <dbl> 0.6468548, 0.6468548, 1.1756293, -0.1307547, 1.17…
## $ Body <dbl> 0.48322290, 0.18311736, 1.31684940, 0.48322290, 0…
## $ Balance <dbl> 0.15999396, 0.15999396, 0.64488070, -0.29637002, …
## $ Uniformity <dbl> -1.0340780, 0.3239935, 0.3239935, 0.3239935, 0.32…
## $ Clean.Cup <dbl> 0.2234559, 0.2234559, 0.2234559, 0.2234559, 0.223…
## $ Sweetness <dbl> -0.9561335, 0.2372647, 0.2372647, 0.2372647, 0.23…
## $ Species <fct> Arabica, Arabica, Arabica, Arabica, Arabica, Arab…
## $ Country.of.Origin <fct> "United States (Hawaii)", "Tanzania, United Repub…
## $ Processing.Method <fct> Washed / Wet, Washed / Wet, Washed / Wet, Washed …
## $ Color <fct> Green, Green, Green, Bluish-Green, Green, Green, …
## $ altitude_mean_meters <dbl> -1.78499744, 1.06614121, 1.06414349, 1.01423740, …
## $ Moisture <dbl> -1.6712894, 0.6527981, 2.3430435, 0.6527981, 0.44…
## $ Category.One.Defects <dbl> -0.1775027, -0.1775027, -0.1775027, -0.1775027, -…
## $ Category.Two.Defects <dbl> -0.66905265, -0.48026406, -0.29147547, -0.6690526…
## $ Quality_Class <fct> Premium, Premium, Premium, Premium, Premium, Stan…
We analyzed the class distribution to establish the baseline accuracy, ensuring that the model's performance exceeds that of simply predicting the majority class.
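The proportions below can be obtained with prop.table:
prop.table(table(class_train$Quality_Class))
prop.table(table(class_test$Quality_Class))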
##
## Standard Premium
## 0.4897196 0.5102804
##
## Standard Premium
## 0.488806 0.511194
(1) Encoding Categorical Features
Here, categorical variables will be converted into dummy variables, which logistic regression requires. Zero-variance predictors will also be removed, and step_novel() is used so that unseen categories in the test data do not break the recipe.
class_recipe <- recipe(Quality_Class ~ ., data = class_train) %>%
step_novel(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
(2) Evaluation Helper
We evaluate using:
- Confusion matrix
- Accuracy
- Recall
- F1
- ROC-AUC
We treat Premium as the positive class. ROC-AUC is useful because it evaluates ranking quality across thresholds.
eval_model <- function(fit_obj, test_data, model_name) {
prob_pred <- predict(fit_obj, test_data, type = "prob")
class_pred <- predict(fit_obj, test_data, type = "class")
pred_out <- bind_cols(
tibble(
truth = factor(test_data$Quality_Class, levels = c("Standard", "Premium"))
),
rename(class_pred, pred_class = .pred_class),
prob_pred
)
conf_mat <- yardstick::conf_mat(pred_out, truth = truth, estimate = pred_class)
metrics_tbl <- tibble(
model = model_name,
accuracy = yardstick::accuracy_vec(truth = pred_out$truth, estimate = pred_out$pred_class),
recall = yardstick::recall_vec(truth = pred_out$truth, estimate = pred_out$pred_class, event_level = "second"),
f1 = yardstick::f_meas_vec(truth = pred_out$truth, estimate = pred_out$pred_class, event_level = "second"),
roc_auc = yardstick::roc_auc_vec(truth = pred_out$truth, estimate = pred_out$.pred_Premium, event_level = "second")
)
list(metrics = metrics_tbl, conf_mat = conf_mat, pred = pred_out)
}
We use Logistic Regression as a strong baseline. It is easy to interpret and gives strong performance in our run.
lr_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
lr_wf <- workflow() %>%
add_recipe(class_recipe) %>%
add_model(lr_spec)
lr_fit <- fit(lr_wf, data = class_train)
lr_res <- eval_model(lr_fit, class_test, "Logistic Regression")
lr_res$conf_mat
## Truth
## Prediction Standard Premium
## Standard 126 7
## Premium 5 130
lr_res$metrics
## # A tibble: 1 × 5
## model accuracy recall f1 roc_auc
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Logistic Regression 0.955 0.949 0.956 0.986
Random Forest averages the predictions of many decision trees and often performs well on tabular data.
rf_spec <- rand_forest(trees = 800) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("classification")
rf_wf <- workflow() %>%
add_recipe(class_recipe) %>%
add_model(rf_spec)
rf_fit <- fit(rf_wf, data = class_train)
rf_res <- eval_model(rf_fit, class_test, "Random Forest")
rf_res$conf_mat
## Truth
## Prediction Standard Premium
## Standard 118 8
## Premium 13 129
rf_res$metrics
## # A tibble: 1 × 5
## model accuracy recall f1 roc_auc
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Random Forest 0.922 0.942 0.925 0.980
We compare models side-by-side to choose the strongest overall performer.
class_metrics <- bind_rows(lr_res$metrics, rf_res$metrics)
class_metrics
## # A tibble: 2 × 5
## model accuracy recall f1 roc_auc
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Logistic Regression 0.955 0.949 0.956 0.986
## 2 Random Forest 0.922 0.942 0.925 0.980
Logistic Regression achieved the strongest overall performance with accuracy = 0.955, recall = 0.949, F1 = 0.956, and ROC-AUC = 0.986, indicating very strong separation between Premium and Standard and consistently correct predictions.
Random Forest also performed well with accuracy = 0.922, recall = 0.942, F1 = 0.925, and ROC-AUC = 0.980, but it produced more total errors than Logistic Regression which reduced its accuracy and F1.
Overall, we select Logistic Regression as the best model because it provides the best balance of high correctness (accuracy/F1) and strong class separation (ROC-AUC).
plot_cm_heatmap <- function(res_obj, title_txt) {
cm_tbl <- as.data.frame(res_obj$conf_mat$table) %>%
rename(truth = Truth, pred_class = Prediction, n = Freq)
ggplot(cm_tbl, aes(x = truth, y = pred_class, fill = n)) +
geom_tile(color = "white") +
geom_text(aes(label = n)) +
scale_fill_gradient(low = "white", high = "orange") +
labs(title = title_txt, x = "Actual", y = "Predicted", fill = "Count") +
theme_minimal()
}
plot_cm_heatmap(lr_res, "Confusion Matrix (Logistic Regression)")
plot_cm_heatmap(rf_res, "Confusion Matrix (Random Forest)")
Logistic Regression proved superior with only 12 total errors, demonstrating balanced and high accuracy. In contrast, Random Forest made nearly double the misclassifications (21), suffering specifically from more false positives (13 Standard beans incorrectly labeled as Premium), which explains its lower accuracy and F1 compared to Logistic Regression.
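The error counts quoted above can be read off the conf_mat objects programmatically (a small sketch):
# Total misclassifications = off-diagonal counts of each confusion matrix
sum(lr_res$conf_mat$table) - sum(diag(lr_res$conf_mat$table))
sum(rf_res$conf_mat$table) - sum(diag(rf_res$conf_mat$table))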
plot_prob_dist <- function(res_obj, title_txt) {
ggplot(res_obj$pred, aes(x = .pred_Premium)) +
geom_histogram(bins = 30) +
facet_wrap(~ truth, ncol = 1) +
labs(title = title_txt, x = "Predicted P(Premium)", y = "Count")
}
plot_prob_dist(lr_res, "Predicted Probability Distribution (Logistic Regression)")
plot_prob_dist(rf_res, "Predicted Probability Distribution (Random Forest)")
Logistic Regression demonstrates sharp class separation, clustering most Standard cases near 0 and Premium cases near 1 with very few borderline predictions. This high confidence distribution directly aligns with the model’s strong accuracy and ROC-AUC metrics.
Random Forest shows broader probability distributions with more overlap in the 0.3–0.8 range, indicating less consistent confidence compared to Logistic Regression. This increased uncertainty on borderline cases correlates directly with its higher misclassification rate and slightly lower accuracy/F1 scores.
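To back this visual impression with a number, one can measure the share of borderline predictions in the 0.3–0.8 band mentioned above (the band is our illustrative choice):
# Fraction of test samples with uncertain predicted P(Premium)
borderline_share <- function(res_obj) {
  mean(res_obj$pred$.pred_Premium > 0.3 & res_obj$pred$.pred_Premium < 0.8)
}
borderline_share(lr_res)
borderline_share(rf_res)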
We use the absolute size of coefficients, then group dummy columns back to their original feature name so the output stays readable.
lr_fit_parsnip <- extract_fit_parsnip(lr_fit)$fit
orig_pred <- setdiff(names(class_train), "Quality_Class")
get_base_feature <- function(term) {
if (term %in% orig_pred) return(term)
hits <- orig_pred[startsWith(term, paste0(orig_pred, "_"))]
if (length(hits) == 0) return(term)
hits[which.max(nchar(hits))]
}
lr_coef_tbl <- broom::tidy(lr_fit_parsnip) %>%
filter(term != "(Intercept)") %>%
mutate(
abs_coef = abs(estimate),
base_feature = purrr::map_chr(term, get_base_feature)
) %>%
group_by(base_feature) %>%
summarise(importance = sum(abs_coef), .groups = "drop") %>%
arrange(desc(importance))
top_n <- 15
lr_coef_tbl %>%
slice_max(order_by = importance, n = top_n) %>%
ggplot(aes(x = reorder(base_feature, importance), y = importance)) +
geom_col() +
coord_flip() +
labs(
title = paste0("Top ", top_n, " Feature Importance (Logistic Regression)"),
x = "Feature",
y = "Importance (sum |coef|)"
)
Based on absolute coefficient values, Logistic Regression identifies Country.of.Origin as the dominant predictor, followed by Clean.Cup, Sweetness, and Species. This indicates that the model distinguishes quality by balancing a bean's geographical origin with specific, high-impact sensory attributes.
- Random Forest
We use permutation importance from ranger, then group dummy columns back to their original feature name.
rf_fit_parsnip <- extract_fit_parsnip(rf_fit)$fit
rf_imp <- ranger::importance(rf_fit_parsnip)
rf_imp_tbl <- tibble(
feature = names(rf_imp),
importance = as.numeric(rf_imp),
base_feature = purrr::map_chr(names(rf_imp), get_base_feature)
) %>%
group_by(base_feature) %>%
summarise(importance = sum(importance), .groups = "drop") %>%
arrange(desc(importance))
rf_imp_tbl %>%
slice_max(order_by = importance, n = top_n) %>%
ggplot(aes(x = reorder(base_feature, importance), y = importance)) +
geom_col() +
coord_flip() +
labs(
title = paste0("Top ", top_n, " Feature Importance (Random Forest)"),
x = "Feature",
y = "Importance"
)
Using permutation importance, Random Forest overwhelmingly prioritizes sensory attributes, identifying Flavor, Aftertaste, and Balance as the dominant drivers of quality. Unlike Logistic Regression, it treats contextual metadata like Country.of.Origin and physical metrics like altitude as secondary signals, relying primarily on the bean's direct taste profile to make predictions.
roc_lr <- yardstick::roc_curve(lr_res$pred, truth, .pred_Premium, event_level = "second") %>%
mutate(model = "Logistic Regression")
roc_rf <- yardstick::roc_curve(rf_res$pred, truth, .pred_Premium, event_level = "second") %>%
mutate(model = "Random Forest")
bind_rows(roc_lr, roc_rf) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
geom_path() +
geom_abline(lty = 2) +
labs(
title = "ROC Curves (Premium as Positive Class)",
x = "False Positive Rate",
y = "True Positive Rate"
)
Logistic Regression dominates the ROC plot with the highest AUC, demonstrating superior class separation. Random Forest follows closely but trails slightly, indicating it requires a marginally higher false positive rate to achieve comparable sensitivity.
In this part, two regression models, Multiple Linear Regression (MLR) and Random Forest, are designed to predict the target variable Total.Cup.Points, which is a continuous (numerical) value.
The first regression model designed is Multiple Linear Regression.
sensory <- c("Aroma", "Flavor", "Aftertaste", "Acidity",
"Body", "Balance", "Uniformity", "Clean.Cup",
"Sweetness")
# Remove the classification target and sensory attributes
reg_train <- train_processed %>%
select(-Quality_Class,-all_of(sensory))
reg_test <- test_processed %>%
select(-Quality_Class,-all_of(sensory))
# Fit model
mlr_model <- lm(Total.Cup.Points ~ ., data = reg_train)
# Predict
mlr_pred <- predict(mlr_model, newdata = reg_test)
A predicted vs actual plot helps visualize the model's accuracy.
# Plot predicted vs actual
plot(reg_test$Total.Cup.Points, mlr_pred,
main = "Multiple Linear Regression: Predicted vs Actual",
xlab = "Actual Total Cup Points",
ylab = "Predicted Total Cup Points",
pch = 19, col = "brown")
abline(0, 1, col = "black", lwd = 2)
The scatter plot shows a diffuse cloud, reflecting a relatively low R². While the model captures the central trend, it struggles to predict the variance of very high or low-scoring coffees.
The regression coefficient plot shows how each predictor affects Total.Cup.Points. Positive values increase the score, while negative values decrease it, holding other variables constant.
coef_df <- tidy(mlr_model) %>%
filter(term != "(Intercept)")
ggplot(coef_df, aes(x = reorder(term, estimate), y = estimate)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Regression Coefficients (MLR)",
x = "Predictors",
y = "Coefficient Estimate"
) +
theme_minimal()
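To separate signal from noise in the coefficient plot, the p-values from the same model can be inspected; the 0.05 cutoff below is the conventional choice (a sketch):
# Predictors with statistically significant coefficients
tidy(mlr_model) %>%
  filter(term != "(Intercept)", p.value < 0.05) %>%
  arrange(p.value)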
The second regression model designed is Random Forest.
rf_model <- randomForest(
Total.Cup.Points ~ .,
data = reg_train,
ntree = 500,
importance = TRUE
)
rf_model
##
## Call:
## randomForest(formula = Total.Cup.Points ~ ., data = reg_train, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 5.932104
## % Var explained: 20.62
# Predict
rf_pred <- predict(rf_model, newdata = reg_test)
A predicted vs actual plot helps visualize the model's accuracy.
# Plot predicted vs actual
plot(reg_test$Total.Cup.Points, rf_pred,
main = "Random Forest: Predicted vs Actual",
xlab = "Actual Total Cup Points",
ylab = "Predicted Total Cup Points",
pch = 19, col = "brown")
abline(0, 1, col = "black", lwd = 2)
As with MLR, the scatter plot shows a diffuse cloud reflecting a relatively low R². While the model captures the central trend, it struggles to predict the variance of very high or low-scoring coffees.
Feature importance shows which variables the random forest relies on most when making predictions, helping us identify the key drivers of coffee quality.
vip(rf_model, num_features = 10, aesthetics = list(fill = "steelblue")) +
ggtitle("Random Forest Feature Importance") +
theme_minimal()
The Random Forest model highlights altitude (altitude_mean_meters) and origin (Country.of.Origin) as the strongest predictors of coffee quality, with the defect counts and Moisture also contributing. Species and Color have minimal impact in this dataset.
Both models were evaluated using RMSE and MAE to measure prediction errors, while R² was used to assess how well the predictors explain the variation in Total.Cup.Points.
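For reference, hand-rolled equivalents of the three metrics used below (caret's R2() defaults to the squared correlation between predictions and observations):
# Manual metric definitions, using the MLR predictions as an example
rmse_manual <- sqrt(mean((mlr_pred - reg_test$Total.Cup.Points)^2))
mae_manual  <- mean(abs(mlr_pred - reg_test$Total.Cup.Points))
r2_manual   <- cor(mlr_pred, reg_test$Total.Cup.Points)^2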
# Evaluate
MLR_RMSE <- RMSE(mlr_pred, reg_test$Total.Cup.Points)
MLR_R2 <- R2(mlr_pred, reg_test$Total.Cup.Points)
MLR_MAE <- MAE(mlr_pred, reg_test$Total.Cup.Points)
RF_RMSE <- RMSE(rf_pred, reg_test$Total.Cup.Points)
RF_R2 <- R2(rf_pred, reg_test$Total.Cup.Points)
RF_MAE <- MAE(rf_pred, reg_test$Total.Cup.Points)
comparison <- data.frame(
Model = c("Multiple Linear Regression", "Random Forest"),
RMSE = c(MLR_RMSE, RF_RMSE),
R2 = c(MLR_R2, RF_R2),
MAE = c(MLR_MAE, RF_MAE)
)
comparison
## Model RMSE R2 MAE
## 1 Multiple Linear Regression 2.187667 0.232373 1.584461
## 2 Random Forest 2.129165 0.271545 1.528944
Based on the evaluation results, the Random Forest model performs better than Multiple Linear Regression.
The RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) for Random Forest are slightly lower, meaning its predictions are closer to the actual coffee scores. The R² value is also higher, showing that Random Forest explains more of the variation in Total.Cup.Points than the linear model.
From the evaluation of the classification models, Logistic Regression consistently outperformed Random Forest with an accuracy of 0.955, recall of 0.949, F1 of 0.956, and ROC-AUC of 0.986. On top of that, there is a divergence in feature importance between the two models. Logistic Regression identifies Country.of.Origin as the dominant predictor, while Random Forest indicates that Flavor, Aftertaste, and Balance are the definitive sensory markers of quality. This divergence suggests that Logistic Regression relies on broader, linear signals: it identifies that certain countries have a higher “baseline” of quality due to national infrastructure, climate, and sorting standards.
Both regression models show that farm geography, origin together with altitude, provides the strongest predictors of coffee quality. Higher altitudes and specific origins like Ethiopia and Kenya are statistically safer investments for quality. However, the R² values for both models are low, indicating that coffee quality is complex and cannot be fully predicted by geography alone.
Combining these findings, Meridian may aim to source coffee beans from origins such as Ethiopia, Kenya and Uganda, in high-altitude areas, as an initial step. They may then make the final decision based on the flavor, aftertaste, balance, aroma and acidity of the shortlisted coffee beans.
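As a minimal sketch of this sourcing rule applied to the cleaned data (the 1,500 m altitude cutoff and the origin list are illustrative assumptions, not calibrated thresholds):
# Hypothetical first-pass supplier shortlist: preferred origins at high altitude
shortlist <- df3b %>%
  filter(Country.of.Origin %in% c("Ethiopia", "Kenya", "Uganda"),
         altitude_mean_meters >= 1500) %>%
  arrange(desc(Total.Cup.Points)) %>%
  select(Country.of.Origin, altitude_mean_meters, Processing.Method, Total.Cup.Points)
head(shortlist, 10)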