| Name | Matric Number |
|---|---|
| Aulia Fadlan | 24088143 |
| Muhammad Amir Shafiq bin Zulkipli | U2004711 |
| Muhammad Syafiq bin Salim | 17107373 |
| Nurul Iman Farhanah binti Haizul Azmi | 17202781 |
| Siti Aishah Binti Johan Iskandar | 24235814 |
Meridian, a prominent retail giant known for its diverse product portfolio, is launching a strategic initiative to integrate cafés into its flagship stores. This “café-within-a-store” model aims to increase customer dwell time and basket size. The critical challenge is supply chain strategy. Meridian lacks institutional knowledge in coffee sourcing. They must make a strategic decision on whether to source coffee beans from established suppliers or invest in direct trade from farmers.
To minimize risk and maximize brand reputation, Meridian must understand what drives coffee quality. Investing in a farm with high altitude but poor processing methods could be a financial disaster. The Chief Information Officer (CIO), in collaboration with the Head of Procurement, requires a data-driven framework to identify high-potential coffee sources and predict quality before signing long-term contracts.
The team will first validate the relationships between sensory quality measures such as aroma, flavor, and aftertaste and the final quality score. Once these links are confirmed, the team will analyze how bean metadata (e.g., processing method, species) and farm metadata (e.g., altitude, origin) influence these quality markers.
Finally, the team will develop two models: a classification model to categorize coffee as “Premium” or “Standard” for pricing guides, and a regression model to predict overall quality using only objective farm attributes. This allows Meridian to pre-qualify suppliers based on objective farm attributes, ensuring quality consistency before committing to a single roast.
df <- read.csv("coffee_arabica_robusta_dataset.csv", na.strings = c("", "NA"))
glimpse(df)
## Rows: 1,339
## Columns: 44
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ Species <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Ara…
## $ Owner <chr> "metad plc", "metad plc", "grounds for health ad…
## $ Country.of.Origin <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia",…
## $ Farm.Name <chr> "metad plc", "metad plc", "san marcos barrancas …
## $ Lot.Number <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Mill <chr> "metad plc", "metad plc", NA, "wolensu", "metad …
## $ ICO.Number <chr> "2014/2015", "2014/2015", NA, NA, "2014/2015", N…
## $ Company <chr> "metad agricultural developmet plc", "metad agri…
## $ Altitude <chr> "1950-2200", "1950-2200", "1600 - 1800 m", "1800…
## $ Region <chr> "guji-hambela", "guji-hambela", NA, "oromia", "g…
## $ Producer <chr> "METAD PLC", "METAD PLC", NA, "Yidnekachew Dabes…
## $ Number.of.Bags <int> 300, 300, 5, 320, 300, 100, 100, 300, 300, 50, 3…
## $ Bag.Weight <chr> "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg"…
## $ In.Country.Partner <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ Harvest.Year <chr> "2014", "2014", NA, "2014", "2014", "2013", "201…
## $ Grading.Date <chr> "April 4th, 2015", "April 4th, 2015", "May 31st,…
## $ Owner.1 <chr> "metad plc", "metad plc", "Grounds for Health Ad…
## $ Variety <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other"…
## $ Processing.Method <chr> "Washed / Wet", "Washed / Wet", NA, "Natural / D…
## $ Aroma <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, …
## $ Flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, …
## $ Aftertaste <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, …
## $ Acidity <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, …
## $ Body <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, …
## $ Balance <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, …
## $ Uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ Clean.Cup <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Sweetness <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ Cupper.Points <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00, …
## $ Total.Cup.Points <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75,…
## $ Moisture <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, …
## $ Category.One.Defects <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Quakers <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Color <chr> "Green", "Green", NA, "Green", "Green", "Bluish-…
## $ Category.Two.Defects <int> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, …
## $ Expiration <chr> "April 3rd, 2016", "April 3rd, 2016", "May 31st,…
## $ Certification.Body <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ Certification.Address <chr> "309fcf77415a3661ae83e027f7e5f05dad786e44", "309…
## $ Certification.Contact <chr> "19fef5a731de2db57d16da10287413f5f99bc2dd", "19f…
## $ unit_of_measurement <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
## $ altitude_low_meters <dbl> 1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA, …
## $ altitude_high_meters <dbl> 2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA, …
## $ altitude_mean_meters <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, …
The Coffee Quality Institute (CQI) dataset is obtained from Kaggle. It contains 1,339 individual coffee samples (rows) and 44 features (columns) describing the coffee’s origin, processing and sensory quality.
The data type int refers to integer/discrete counts, chr to text/categorical data, and dbl to continuous numeric values. The dataset contains 5 int, 24 chr and 15 dbl columns.
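These counts can be verified directly; glimpse's int, chr and dbl labels correspond to R's integer, character and numeric classes (a quick check):
# Tally column classes across the raw dataframe
table(sapply(df, class))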
| Features | Description | Category |
|---|---|---|
| Aroma | Evaluating the olfactory experience of the coffee | Quality Measures |
| Flavor | Evaluating the taste profile of coffee | Quality Measures |
| Aftertaste | Evaluating the flavor that remains on the palate after the coffee is swallowed | Quality Measures |
| Acidity | Evaluating the brightness and crispness of the coffee | Quality Measures |
| Body | Evaluating the tactile mouthfeel or weight of the liquid coffee | Quality Measures |
| Balance | Evaluating how well all the sensory components work together | Quality Measures |
| Uniformity | Evaluating how consistent the flavor is across multiple sample cups | Quality Measures |
| Clean.Cup | Evaluating the transparency of the flavor and absence of negative taints | Quality Measures |
| Sweetness | Evaluating the presence of pleasing, natural sweetness in the brew | Quality Measures |
| Moisture | Indicating the water content percentage of the green coffee beans | Quality Measures |
| Category.One.Defects | Number of count of primary defects (major imperfections) found in the coffee sample | Quality Measures |
| Category.Two.Defects | Number of count of secondary defects (minor imperfections) found in the coffee sample | Quality Measures |
| Processing.Method | Describing how the bean was removed from the fruit | Bean Metadata |
| Color | Physical color of the raw green beans | Bean Metadata |
| Species | Botanical species of the coffee plant (Arabica or Robusta) | Bean Metadata |
| Country.of.Origin | Nation where the coffee was grown and harvested | Farm Metadata |
| altitude_mean_meters | Mean elevation at which the coffee plants were cultivated, measured in meters above sea level | Farm Metadata |
| Total.Cup.Points | The primary indicator of the coffee’s overall quality | Dependent Variable |
# Dynamic report generation and table formatting
library(knitr)
# Advanced styling for tables
library(kableExtra)
# Data manipulation and wrangling
library(dplyr)
# Data visualization and plotting
library(ggplot2)
# Unified framework for modeling and machine learning
library(tidymodels)
# Collection of data science packages
library(tidyverse)
# Reshaping data structures
library(reshape2)
# Implementation of Random Forest algorithm
library(randomForest)
# Classification and Regression Training framework
library(caret)
# Fast and memory-efficient implementation of Random Forest
library(ranger)
# Variable Importance Plots
library(vip)
Data cleaning is the process of identifying and correcting errors, inconsistencies and inaccuracies in raw data. In the context of our dataset, this involves several critical steps: selecting relevant features, identifying missing values, removing duplicate entries that could bias the model, and handling outliers and other issues that may affect the accuracy and reliability of the data.
Features such as Farm.Name, Mill, Company and Region were dropped due to high variability and redundancy, while Lot.Number and Owner were removed because they act as unique identifiers.
keep_cols <- c("Aroma", "Flavor", "Aftertaste", "Acidity", "Body", "Balance", "Uniformity", "Clean.Cup", "Sweetness", "Species", "Country.of.Origin", "Processing.Method", "Color", "altitude_mean_meters", "Moisture", "Category.One.Defects", "Category.Two.Defects", "Total.Cup.Points"
)
df2a <- df %>%
select(all_of(keep_cols))
Now, any duplicate records found in the dataset will be dropped.
duplicate_record <- duplicated(df2a)
cat("Number of duplicate rows:", sum(duplicate_record), "\n")
## Number of duplicate rows: 0
df2b <- df2a %>% distinct()
cat("Number of rows after removing duplicate:",nrow(df2b))
## Number of rows after removing duplicate: 1339
The dataset contains 1339 records and 18 features.
num_rows <- nrow(df2b)
cat("Number of rows:", num_rows, "\n")
## Number of rows: 1339
num_cols <- ncol(df2b)
cat("Number of columns:", num_cols, "\n")
## Number of columns: 18
For this part, missing data from each column will be counted.
total_missing <- sum(is.na(df2b))
cat("Total Missing Values:", total_missing, "\n\n")
## Total Missing Values: 619
missing_summary <- colSums(is.na(df2b))
print(missing_summary[missing_summary > 0])
## Country.of.Origin Processing.Method Color
## 1 170 218
## altitude_mean_meters
## 230
Data visualization serves as a critical preliminary step in our analysis, allowing for the qualitative assessment of the dataset. Through the use of histograms and boxplots, we investigate the statistical properties of the attributes.
Missing values will be visualized for each column to identify variables that contain incomplete data.
missing_count <- colSums(is.na(df2b))
missing_count <- missing_count[missing_count > 0]
missing_count <- sort(missing_count, decreasing = TRUE)
par(mar = c(10, 4, 4, 2) + 0.1)
barplot(missing_count,
main = "Missing Values per Variable",
ylab = "Number of Missing Values",
las = 2)
Based on the visualization of missing values, significant gaps were observed in several key variables. Specifically, altitude_mean_meters, Color and Processing.Method exhibit a notable number of missing entries, while Country.of.Origin has negligible missing values.
In Country.of.Origin, the single missing value was manually identified as Colombia by cross-referencing the owner, racafe & cia s.c.a, and will be replaced accordingly. We will also replace None with NA in Color.
df2b <- df2b %>%
mutate(
Country.of.Origin = if_else(is.na(Country.of.Origin), "Colombia", Country.of.Origin),
Color = na_if(Color, "None")
)
Then, a working copy for visualization is created.
df_vis <- df2b
For this part, we will visualize the distribution of all Quality Measures features to understand the data pattern and spread.
quality_cols <- c("Aroma", "Flavor", "Aftertaste", "Acidity",
"Body", "Balance", "Uniformity", "Clean.Cup",
"Sweetness", "Moisture",
"Category.One.Defects", "Category.Two.Defects","Total.Cup.Points")
for(col_name in quality_cols){
p <- ggplot(df_vis, aes(x = .data[[col_name]])) +
geom_density(fill = "#69b3a2", color = "black", alpha = 0.7) +
labs(title = paste("Density Distribution of", col_name),
x = col_name,
y = "Density") +
theme_minimal()
print(p)
}
- Aroma, Flavor, Aftertaste, Acidity, Body, Balance: each shows a roughly normal distribution centered around 7.5 with an approximate range of 6.5 to 8.5.
- Uniformity: heavily left-skewed with a massive peak at 10.
- Clean.Cup: shows a ceiling effect where most samples score a perfect 10.
- Sweetness: similar to Clean.Cup, almost all samples receive full marks.
- Moisture: centered around 10-12%; some samples register zero moisture.
- Category.One.Defects: most samples have zero Category 1 defects.
- Category.Two.Defects: more common and show a right-skewed distribution.
- Total.Cup.Points: the target variable is centered around 82 points.
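The ceiling effects noted for Uniformity, Clean.Cup and Sweetness can be quantified with a quick sketch (the column selection is ours):
# Share of samples scoring a perfect 10 on the bounded attributes
sapply(df_vis[c("Uniformity", "Clean.Cup", "Sweetness")], function(x) mean(x == 10, na.rm = TRUE))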
For this part, we analyzed the physical and geographical metadata to understand the diversity of the samples.
meta_cols <- c("Species", "Country.of.Origin", "Processing.Method",
"Color", "altitude_mean_meters")
for(col_name in meta_cols){
if(is.numeric(df_vis[[col_name]])){
p <- ggplot(df_vis, aes(x = .data[[col_name]])) +
geom_density(fill = "#40798C", color = "black", alpha = 0.7, adjust = 2) +
labs(title = paste("Distribution of", col_name),
x = col_name, y = "Density") +
theme_minimal()
} else {
plot_data <- df_vis %>%
filter(!is.na(.data[[col_name]])) %>%
count(.data[[col_name]], sort = TRUE) %>%
slice_max(n, n = 10)
if(nrow(plot_data) < 10) {
plot_title <- paste("Distribution of", col_name)
} else {
plot_title <- paste("Top 10 Categories:", col_name)
}
p <- ggplot(plot_data, aes(x = reorder(.data[[col_name]], n), y = n)) +
geom_col(fill = "#40798C", color = "black", alpha = 0.7) +
coord_flip() +
labs(title = plot_title,
x = "", y = "Count") +
theme_minimal()
}
print(p)
}
- Species: the dataset is dominated by Arabica beans, which is expected for specialty coffee datasets.
- Country.of.Origin: Latin American countries such as Mexico, Colombia and Guatemala appear most frequently.
- Processing.Method: Washed / Wet is the most common method, followed by Natural / Dry.
- Color: Green is the predominant bean color, though many values are missing.
- altitude_mean_meters: most farms lie between 1,000 m and 1,800 m; recorded altitudes above 5,000 m are likely outliers.
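As a quick check on the altitude observation (illustrative):
# Count farms with implausible recorded altitudes (> 5,000 m)
sum(df_vis$altitude_mean_meters > 5000, na.rm = TRUE)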
From the visualized data, we will use boxplots to detect the presence of outliers in the suspected numerical variables.
numeric_cols <- df_vis %>%
select(altitude_mean_meters, Moisture, Category.One.Defects,
Category.Two.Defects)
for(col_name in names(numeric_cols)){
cat('\n### ', col_name, '\n')
boxplot(numeric_cols[[col_name]],
main = paste("Boxplot of", col_name),
horizontal = TRUE,
col = "orange",
border = "brown",
xlab = col_name)
cat('\n')
}
We utilized boxplots to identify extreme values across the numeric variables. Upon inspection, altitude_mean_meters contained values exceeding 5,000 meters; these were identified as data entry errors and replaced with NA, to be handled during the imputation step.
We will also filter Total.Cup.Points to exclude values of 0, as these represent data entry errors or disqualified samples that do not reflect valid coffee grading scores.
df3a <- df2b %>%
mutate(
altitude_mean_meters = if_else(altitude_mean_meters > 5000, NA_real_, altitude_mean_meters)
) %>%
filter(Total.Cup.Points > 0)
In contrast, outliers in other variables such as Moisture and the defect counts were retained. We assume these outliers are informative, representing genuine variation in quality or processing rather than errors.
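To gauge how many retained points a conventional rule would flag, a small sketch using the 1.5 × IQR criterion (the cutoff is illustrative, not part of our pipeline):
# Count 1.5*IQR-rule outliers in the retained variables
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  sum(x < q[1] - 1.5 * diff(q) | x > q[2] + 1.5 * diff(q), na.rm = TRUE)
}
sapply(df3a[c("Moisture", "Category.One.Defects", "Category.Two.Defects")], iqr_outliers)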
Data processing is the systematic transformation of raw data into a model-ready format, which encompasses partitioning the dataset into training and testing subsets. To prevent data leakage, imputation of missing values will be fitted on the training set only before being applied to the test set.
To prepare the data for binary classification, the continuous Total.Cup.Points variable was transformed into a categorical target. A median-based split was applied: samples scoring 82.5 or higher were labeled as Premium, while those below this threshold were designated as Standard.
df3a <- df3a %>%
mutate(Quality_Class = if_else(Total.Cup.Points >= 82.5, "Premium", "Standard"))
table(df3a$Quality_Class)
##
## Premium Standard
## 683 655
Here, categorical features were converted into factors which enables R’s machine learning algorithms to correctly interpret categories.
df3b <- df3a %>%
mutate_if(is.character, as.factor)
glimpse(df3b)
## Rows: 1,338
## Columns: 19
## $ Aroma <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, 8…
## $ Flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, 8…
## $ Aftertaste <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, 8…
## $ Acidity <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, 8…
## $ Body <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, 8…
## $ Balance <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, 8…
## $ Uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, …
## $ Clean.Cup <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ Sweetness <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, …
## $ Species <fct> Arabica, Arabica, Arabica, Arabica, Arabica, Arab…
## $ Country.of.Origin <fct> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia", …
## $ Processing.Method <fct> Washed / Wet, Washed / Wet, NA, Natural / Dry, Wa…
## $ Color <fct> Green, Green, NA, Green, Green, Bluish-Green, Blu…
## $ altitude_mean_meters <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, 1…
## $ Moisture <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, 0…
## $ Category.One.Defects <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Category.Two.Defects <int> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, 0…
## $ Total.Cup.Points <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75, …
## $ Quality_Class <fct> Premium, Premium, Premium, Premium, Premium, Prem…
An 80-20 split allocates the majority of the dataset to the training phase, ensuring the model has sufficient examples to learn robust patterns, while reserving a sufficiently large portion for testing to provide an unbiased evaluation of its performance on unseen data.
set.seed(123)
data_split <- initial_split(df3b, prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)
cat("Training Set:", nrow(train_data), "rows\n")
## Training Set: 1070 rows
cat("Testing Set: ", nrow(test_data), "rows\n")
## Testing Set: 268 rows
As discussed before, we opt for imputation to handle missing values. All imputation steps are fitted on the training set and then applied to the test set to prevent data leakage.
- altitude_mean_meters: impute missing values using the mean altitude stratified by Country.of.Origin.
- Country.of.Origin: to reduce high cardinality, group rare countries into Others.
- Color & Processing.Method: impute missing values using the mode.
train_imputation <- recipe(Total.Cup.Points ~ ., data = train_data) %>%
# 1. Impute mode for categorical
step_impute_mode(Processing.Method, Color) %>%
# 2. Group infrequent countries (below 2% of samples) into Others
step_other(Country.of.Origin, threshold = 0.02, other = "Others") %>%
# 3. Impute altitude by country-level mean
step_impute_linear(altitude_mean_meters, impute_with = imp_vars(Country.of.Origin)) %>%
# 4. Normalize all numerical features
step_normalize(all_numeric_predictors())
# Train the recipe
prep_recipe <- prep(train_imputation, training = train_data)
# Get the final processed dataframes
train_processed <- bake(prep_recipe, new_data = NULL)
test_processed <- bake(prep_recipe, new_data = test_data)
# Check for missing values
cat("Total NAs remaining:", sum(is.na(train_processed)), "\n")
## Total NAs remaining: 0
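The same check on the test split, baked with the recipe fitted on the training data only (a quick sketch):
cat("Total NAs remaining in test:", sum(is.na(test_processed)), "\n")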
At this stage, we explore the processed dataset to understand relationships between features and identify which variables are most relevant for explaining and predicting coffee quality, measured by Total.Cup.Points.
Below, we merge the train_processed and test_processed sets to analyze correlations and feature relationships across the entire dataset. This step is safe from data leakage because the merged data will not be used for the classification and regression tasks.
eda_data <- bind_rows(train_processed, test_processed)
For this part, we examine linear relationships between numerical features using Pearson correlation. The analysis focuses on two key aspects: inter-correlations among the numerical features themselves, and each feature's correlation with the target Total.Cup.Points. A correlation heatmap is used to visualize these relationships.
num_cols <- eda_data %>%
select(Aroma, Flavor, Aftertaste, Acidity, Body, Balance,
Uniformity, Clean.Cup, Sweetness,
altitude_mean_meters, Moisture,
Category.One.Defects, Category.Two.Defects,
Total.Cup.Points)
corr_matrix <- cor(num_cols, use = "complete.obs")
corr_long <- melt(corr_matrix)
ggplot(corr_long, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0,
limits = c(-1, 1),
breaks = seq(-1, 1, 0.5)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Pearson Correlation Heatmap",
x = "", y = "", fill = "Correlation")
From the correlation matrix, we observed that the defect counts and Moisture show negative correlations with the target variable, correctly indicating that quality decreases as defects and bean moisture increase.
Meanwhile, sensory attributes such as Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Clean.Cup and Sweetness show high positive correlations with Total.Cup.Points, indicating strong predictive power and making them suitable for the classification model.
However, the heatmap also reveals strong inter-correlation among the sensory variables themselves, confirming that these features must be excluded from the regression model to prevent data leakage and multicollinearity, in line with our regression objective.
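These observations can be quantified by ranking each feature's correlation with the target directly from corr_matrix (a small sketch):
# Correlations with Total.Cup.Points, strongest positive first
target_corr <- sort(corr_matrix[, "Total.Cup.Points"], decreasing = TRUE)
round(target_corr, 2)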
We examine the relationship between selected numerical features and coffee quality using scatter plots with trend lines.
num_vars <- c("Aroma", "Flavor", "Aftertaste", "Acidity",
"Body", "Balance", "Uniformity", "Clean.Cup",
"Sweetness","altitude_mean_meters", "Moisture",
"Category.One.Defects", "Category.Two.Defects")
for(var in num_vars){
p <- ggplot(eda_data, aes(x = .data[[var]], y = Total.Cup.Points)) +
geom_point(alpha = 0.4, color = "#2C3E50") +
geom_smooth(method = "lm", color = "#E74C3C", se = FALSE) +
labs(title = paste("Total.Cup.Points vs", var),
x = var, y = "Total.Cup.Points") +
theme_minimal()
print(p)
}
- Aroma: strong positive correlation; as aroma quality improves, the total score increases significantly.
- Flavor: strong positive correlation with the target variable.
- Aftertaste: positive correlation; better aftertaste ratings contribute to a higher overall score.
- Acidity: positive correlation; desirable acidity levels boost the total cup points.
- Body: positive correlation; a well-structured body contributes to higher quality scores.
- Balance: positive correlation; balance is key to a high-scoring coffee.
- Uniformity: positive correlation; a lack of uniformity drastically penalizes the total score.
- Clean.Cup: positive correlation; defects in cup cleanliness immediately lower the score.
- Sweetness: positive correlation; perceptible sweetness is essential for specialty-grade coffee.
- altitude_mean_meters: higher altitudes generally correlate with higher quality scores.
- Moisture: ideal moisture levels typically range between 10-12%; deviations often result in lower scores.
- Category.One.Defects: even a small increase in defect count causes a drop in quality.
- Category.Two.Defects: negatively impact quality.
For categorical variables, we use boxplots to compare the distribution of Total.Cup.Points across categories.
cat_vars <- c("Species", "Color", "Processing.Method", "Country.of.Origin")
for(var in cat_vars){
# reorder() sorts the categories by median score
p <- ggplot(eda_data, aes(x = reorder(.data[[var]], Total.Cup.Points, FUN = median),
y = Total.Cup.Points)) +
geom_boxplot(fill = "#40798C", alpha = 0.7) +
coord_flip() +
labs(title = paste("Total.Cup.Points by", var),
x = var, y = "Total.Cup.Points") +
theme_minimal()
print(p)
}
- Species: Arabica beans generally achieve significantly higher scores than Robusta.
- Color: beans with Bluish-Green or Green coloration appear to have higher median scores.
- Processing.Method: washed coffees often show tighter consistency than naturally processed beans.
- Country.of.Origin: origins such as Ethiopia and Kenya tend to score higher, while Mexico and Nicaragua show lower average quality.
For classification, we focus on building and evaluating two classification models to predict Quality_Class by fitting Logistic Regression and Random Forest using Quality Measures and Bean & Farm Metadata features.
# Remove the regression target
if (!exists("class_train")) {
class_train <- train_processed %>% select(-Total.Cup.Points)
}
if (!exists("class_test")) {
class_test <- test_processed %>% select(-Total.Cup.Points)
}
# Safety check: ensure target exists and is a factor with consistent ordering
stopifnot("Quality_Class" %in% names(class_train))
stopifnot("Quality_Class" %in% names(class_test))
class_train <- class_train %>%
mutate(Quality_Class = forcats::fct_relevel(as.factor(Quality_Class), "Standard", "Premium"))
class_test <- class_test %>%
mutate(Quality_Class = forcats::fct_relevel(as.factor(Quality_Class), "Standard", "Premium"))
glimpse(class_train)
## Rows: 1,070
## Columns: 18
## $ Aroma <dbl> 1.60729395, 0.04811594, -0.20135254, 0.32876798, …
## $ Flavor <dbl> 0.88369161, -0.05840406, 0.65530478, 0.42691795, …
## $ Aftertaste <dbl> 1.18625969, 0.48939464, 1.18625969, 0.48939464, 0…
## $ Acidity <dbl> 0.6468548, 0.6468548, 1.1756293, -0.1307547, 1.17…
## $ Body <dbl> 0.48322290, 0.18311736, 1.31684940, 0.48322290, 0…
## $ Balance <dbl> 0.15999396, 0.15999396, 0.64488070, -0.29637002, …
## $ Uniformity <dbl> -1.0340780, 0.3239935, 0.3239935, 0.3239935, 0.32…
## $ Clean.Cup <dbl> 0.2234559, 0.2234559, 0.2234559, 0.2234559, 0.223…
## $ Sweetness <dbl> -0.9561335, 0.2372647, 0.2372647, 0.2372647, 0.23…
## $ Species <fct> Arabica, Arabica, Arabica, Arabica, Arabica, Arab…
## $ Country.of.Origin <fct> "United States (Hawaii)", "Tanzania, United Repub…
## $ Processing.Method <fct> Washed / Wet, Washed / Wet, Washed / Wet, Washed …
## $ Color <fct> Green, Green, Green, Bluish-Green, Green, Green, …
## $ altitude_mean_meters <dbl> -1.78499744, 1.06614121, 1.06414349, 1.01423740, …
## $ Moisture <dbl> -1.6712894, 0.6527981, 2.3430435, 0.6527981, 0.44…
## $ Category.One.Defects <dbl> -0.1775027, -0.1775027, -0.1775027, -0.1775027, -…
## $ Category.Two.Defects <dbl> -0.66905265, -0.48026406, -0.29147547, -0.6690526…
## $ Quality_Class <fct> Premium, Premium, Premium, Premium, Premium, Stan…
We analyzed the class distribution to establish the baseline accuracy, ensuring that the model's performance exceeds that of simply predicting the majority class.
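The proportions below can be obtained with prop.table:
prop.table(table(class_train$Quality_Class))
prop.table(table(class_test$Quality_Class))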
##
## Standard Premium
## 0.4897196 0.5102804
##
## Standard Premium
## 0.488806 0.511194
(1) Encoding Categorical Features
Here, categorical variables will be converted into dummy variables, which logistic regression requires. Zero-variance predictors will also be removed, and step_novel() is used so that unseen categories in the test data do not break the recipe.
class_recipe <- recipe(Quality_Class ~ ., data = class_train) %>%
step_novel(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
(2) Evaluation Helper
We evaluate using:
- Confusion matrix
- Accuracy
- Recall
- F1
- ROC-AUC
We treat Premium as the positive class. ROC-AUC is useful because it evaluates ranking quality across thresholds.
eval_model <- function(fit_obj, test_data, model_name) {
prob_pred <- predict(fit_obj, test_data, type = "prob")
class_pred <- predict(fit_obj, test_data, type = "class")
pred_out <- bind_cols(
tibble(
truth = factor(test_data$Quality_Class, levels = c("Standard", "Premium"))
),
rename(class_pred, pred_class = .pred_class),
prob_pred
)
conf_mat <- yardstick::conf_mat(pred_out, truth = truth, estimate = pred_class)
metrics_tbl <- tibble(
model = model_name,
accuracy = yardstick::accuracy_vec(truth = pred_out$truth, estimate = pred_out$pred_class),
recall = yardstick::recall_vec(truth = pred_out$truth, estimate = pred_out$pred_class, event_level = "second"),
f1 = yardstick::f_meas_vec(truth = pred_out$truth, estimate = pred_out$pred_class, event_level = "second"),
roc_auc = yardstick::roc_auc_vec(truth = pred_out$truth, estimate = pred_out$.pred_Premium, event_level = "second")
)
list(metrics = metrics_tbl, conf_mat = conf_mat, pred = pred_out)
}
We use Logistic Regression as a strong baseline. It is easy to interpret and gives strong performance in our run.
lr_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
lr_wf <- workflow() %>%
add_recipe(class_recipe) %>%
add_model(lr_spec)
lr_fit <- fit(lr_wf, data = class_train)
lr_res <- eval_model(lr_fit, class_test, "Logistic Regression")
lr_res$conf_mat
## Truth
## Prediction Standard Premium
## Standard 126 7
## Premium 5 130
lr_res$metrics
## # A tibble: 1 × 5
## model accuracy recall f1 roc_auc
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Logistic Regression 0.955 0.949 0.956 0.986
Random Forest averages the predictions of many decision trees and often performs well on tabular data.
rf_spec <- rand_forest(trees = 800) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("classification")
rf_wf <- workflow() %>%
add_recipe(class_recipe) %>%
add_model(rf_spec)
rf_fit <- fit(rf_wf, data = class_train)
rf_res <- eval_model(rf_fit, class_test, "Random Forest")
rf_res$conf_mat
## Truth
## Prediction Standard Premium
## Standard 118 8
## Premium 13 129
rf_res$metrics
## # A tibble: 1 × 5
## model accuracy recall f1 roc_auc
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Random Forest 0.922 0.942 0.925 0.980
We compare models side-by-side to choose the strongest overall performer.
class_metrics <- bind_rows(lr_res$metrics, rf_res$metrics)
class_metrics
## # A tibble: 2 × 5
## model accuracy recall f1 roc_auc
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Logistic Regression 0.955 0.949 0.956 0.986
## 2 Random Forest 0.922 0.942 0.925 0.980
Logistic Regression achieved the strongest overall performance with accuracy = 0.955, recall = 0.949, F1 = 0.956, and ROC-AUC = 0.986, indicating very strong separation between Premium and Standard and consistently correct predictions.
Random Forest also performed well with accuracy = 0.922, recall = 0.942, F1 = 0.925, and ROC-AUC = 0.980, but it produced more total errors than Logistic Regression which reduced its accuracy and F1.
Overall, we select Logistic Regression as the best model because it provides the best balance of high correctness (accuracy/F1) and strong class separation (ROC-AUC).
plot_cm_heatmap <- function(res_obj, title_txt) {
cm_tbl <- as.data.frame(res_obj$conf_mat$table) %>%
rename(truth = Truth, pred_class = Prediction, n = Freq)
ggplot(cm_tbl, aes(x = truth, y = pred_class, fill = n)) +
geom_tile(color = "white") +
geom_text(aes(label = n)) +
scale_fill_gradient(low = "white", high = "orange") +
labs(title = title_txt, x = "Actual", y = "Predicted", fill = "Count") +
theme_minimal()
}
plot_cm_heatmap(lr_res, "Confusion Matrix (Logistic Regression)")
plot_cm_heatmap(rf_res, "Confusion Matrix (Random Forest)")
Logistic Regression proved superior with only 12 total errors, demonstrating balanced and high accuracy. In contrast, Random Forest made nearly double the misclassifications (21), suffering specifically from more false positives (13 Standard beans incorrectly labeled as Premium), which explains its lower accuracy and F1 compared to Logistic Regression.
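The error counts quoted above can be read off the conf_mat objects programmatically (a small sketch):
# Total misclassifications = off-diagonal counts of each confusion matrix
sum(lr_res$conf_mat$table) - sum(diag(lr_res$conf_mat$table))
sum(rf_res$conf_mat$table) - sum(diag(rf_res$conf_mat$table))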
plot_prob_dist <- function(res_obj, title_txt) {
ggplot(res_obj$pred, aes(x = .pred_Premium)) +
geom_histogram(bins = 30) +
facet_wrap(~ truth, ncol = 1) +
labs(title = title_txt, x = "Predicted P(Premium)", y = "Count")
}
plot_prob_dist(lr_res, "Predicted Probability Distribution (Logistic Regression)")
plot_prob_dist(rf_res, "Predicted Probability Distribution (Random Forest)")
Logistic Regression demonstrates sharp class separation, clustering most Standard cases near 0 and Premium cases near 1 with very few borderline predictions. This high confidence distribution directly aligns with the model’s strong accuracy and ROC-AUC metrics.
Random Forest shows broader probability distributions with more overlap in the 0.3–0.8 range, indicating less consistent confidence compared to Logistic Regression. This increased uncertainty on borderline cases correlates directly with its higher misclassification rate and slightly lower accuracy/F1 scores.
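To back this visual impression with a number, one can measure the share of borderline predictions in the 0.3–0.8 band mentioned above (the band is our illustrative choice):
# Fraction of test samples with uncertain predicted P(Premium)
borderline_share <- function(res_obj) {
  mean(res_obj$pred$.pred_Premium > 0.3 & res_obj$pred$.pred_Premium < 0.8)
}
borderline_share(lr_res)
borderline_share(rf_res)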
We use the absolute size of coefficients, then group dummy columns back to their original feature name so the output stays readable.
lr_fit_parsnip <- extract_fit_parsnip(lr_fit)$fit
orig_pred <- setdiff(names(class_train), "Quality_Class")
get_base_feature <- function(term) {
if (term %in% orig_pred) return(term)
hits <- orig_pred[startsWith(term, paste0(orig_pred, "_"))]
if (length(hits) == 0) return(term)
hits[which.max(nchar(hits))]
}
lr_coef_tbl <- broom::tidy(lr_fit_parsnip) %>%
filter(term != "(Intercept)") %>%
mutate(
abs_coef = abs(estimate),
base_feature = purrr::map_chr(term, get_base_feature)
) %>%
group_by(base_feature) %>%
summarise(importance = sum(abs_coef), .groups = "drop") %>%
arrange(desc(importance))
top_n <- 15
lr_coef_tbl %>%
slice_max(order_by = importance, n = top_n) %>%
ggplot(aes(x = reorder(base_feature, importance), y = importance)) +
geom_col() +
coord_flip() +
labs(
title = paste0("Top ", top_n, " Feature Importance (Logistic Regression)"),
x = "Feature",
y = "Importance (sum |coef|)"
)
Based on absolute coefficient values, Logistic Regression identifies Country.of.Origin as the dominant predictor, followed by Clean.Cup, Sweetness, and Species. This indicates that the model distinguishes quality by balancing a bean's geographical origin with specific, high-impact sensory attributes.
- Random Forest
We use permutation importance from ranger, then group dummy columns back to their original feature name.
rf_fit_parsnip <- extract_fit_parsnip(rf_fit)$fit
rf_imp <- ranger::importance(rf_fit_parsnip)
rf_imp_tbl <- tibble(
feature = names(rf_imp),
importance = as.numeric(rf_imp),
base_feature = purrr::map_chr(names(rf_imp), get_base_feature)
) %>%
group_by(base_feature) %>%
summarise(importance = sum(importance), .groups = "drop") %>%
arrange(desc(importance))
rf_imp_tbl %>%
slice_max(order_by = importance, n = top_n) %>%
ggplot(aes(x = reorder(base_feature, importance), y = importance)) +
geom_col() +
coord_flip() +
labs(
title = paste0("Top ", top_n, " Feature Importance (Random Forest)"),
x = "Feature",
y = "Importance"
)
Using permutation importance, Random Forest overwhelmingly prioritizes sensory attributes, identifying Flavor, Aftertaste, and Balance as the dominant drivers of quality. Unlike Logistic Regression, it treats contextual metadata like Country.of.Origin and physical metrics like altitude as secondary signals, relying primarily on the bean's direct taste profile to make predictions.
roc_lr <- yardstick::roc_curve(lr_res$pred, truth, .pred_Premium, event_level = "second") %>%
mutate(model = "Logistic Regression")
roc_rf <- yardstick::roc_curve(rf_res$pred, truth, .pred_Premium, event_level = "second") %>%
mutate(model = "Random Forest")
bind_rows(roc_lr, roc_rf) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
geom_path() +
geom_abline(lty = 2) +
labs(
title = "ROC Curves (Premium as Positive Class)",
x = "False Positive Rate",
y = "True Positive Rate"
)
Logistic Regression dominates the ROC plot with the highest AUC, demonstrating superior class separation. Random Forest follows closely but trails slightly, indicating it requires a marginally higher false positive rate to achieve comparable sensitivity.
In this part, two regression models, Multiple Linear Regression (MLR) and Random Forest, are designed to predict the target variable Total.Cup.Points, which is a continuous (numerical) value.
The first regression model designed is Multiple Linear Regression.
sensory <- c("Aroma", "Flavor", "Aftertaste", "Acidity",
"Body", "Balance", "Uniformity", "Clean.Cup",
"Sweetness")
# Remove the classification target and sensory attributes
reg_train <- train_processed %>%
select(-Quality_Class,-all_of(sensory))
reg_test <- test_processed %>%
select(-Quality_Class,-all_of(sensory))
# Fit model
mlr_model <- lm(Total.Cup.Points ~ ., data = reg_train)
# Predict
mlr_pred <- predict(mlr_model, newdata = reg_test)
A predicted vs actual plot helps visualize the model's accuracy.
# Plot predicted vs actual
plot(reg_test$Total.Cup.Points, mlr_pred,
main = "Multiple Linear Regression: Predicted vs Actual",
xlab = "Actual Total Cup Points",
ylab = "Predicted Total Cup Points",
pch = 19, col = "brown")
abline(0, 1, col = "black", lwd = 2)
The scatter plot shows a diffuse cloud, reflecting a relatively low R². While the model captures the central trend, it struggles to predict the variance of very high or low-scoring coffees.
The regression coefficient plot shows how each predictor affects Total.Cup.Points. Positive values increase the score, while negative values decrease it, holding other variables constant.
coef_df <- tidy(mlr_model) %>%
filter(term != "(Intercept)")
ggplot(coef_df, aes(x = reorder(term, estimate), y = estimate)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Regression Coefficients (MLR)",
x = "Predictors",
y = "Coefficient Estimate"
) +
theme_minimal()
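To separate signal from noise in the coefficient plot, the p-values from the same model can be inspected; the 0.05 cutoff below is the conventional choice (a sketch):
# Predictors with statistically significant coefficients
tidy(mlr_model) %>%
  filter(term != "(Intercept)", p.value < 0.05) %>%
  arrange(p.value)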
The second regression model designed is Random Forest.
rf_model <- randomForest(
Total.Cup.Points ~ .,
data = reg_train,
ntree = 500,
importance = TRUE
)
rf_model
##
## Call:
## randomForest(formula = Total.Cup.Points ~ ., data = reg_train, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 5.932104
## % Var explained: 20.62
# Predict
rf_pred <- predict(rf_model, newdata = reg_test)
A predicted vs actual plot helps visualize the model's accuracy.
# Plot predicted vs actual
plot(reg_test$Total.Cup.Points, rf_pred,
main = "Random Forest: Predicted vs Actual",
xlab = "Actual Total Cup Points",
ylab = "Predicted Total Cup Points",
pch = 19, col = "brown")
abline(0, 1, col = "black", lwd = 2)
As with MLR, the scatter plot shows a diffuse cloud reflecting a relatively low R². While the model captures the central trend, it struggles to predict the variance of very high or low-scoring coffees.
Feature importance shows which variables the random forest relies on most when making predictions, helping us identify the key drivers of coffee quality.
vip(rf_model, num_features = 10, aesthetics = list(fill = "steelblue")) +
ggtitle("Random Forest Feature Importance") +
theme_minimal()
The Random Forest model highlights altitude (altitude_mean_meters) and origin (Country.of.Origin) as the strongest predictors of coffee quality, with the defect counts and Moisture also contributing. Species and Color have minimal impact in this dataset.
Both models were evaluated using RMSE and MAE to measure prediction errors, while R² was used to assess how well the predictors explain the variation in Total.Cup.Points.
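For reference, hand-rolled equivalents of the three metrics used below (caret's R2() defaults to the squared correlation between predictions and observations):
# Manual metric definitions, using the MLR predictions as an example
rmse_manual <- sqrt(mean((mlr_pred - reg_test$Total.Cup.Points)^2))
mae_manual  <- mean(abs(mlr_pred - reg_test$Total.Cup.Points))
r2_manual   <- cor(mlr_pred, reg_test$Total.Cup.Points)^2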
# Evaluate
MLR_RMSE <- RMSE(mlr_pred, reg_test$Total.Cup.Points)
MLR_R2 <- R2(mlr_pred, reg_test$Total.Cup.Points)
MLR_MAE <- MAE(mlr_pred, reg_test$Total.Cup.Points)
RF_RMSE <- RMSE(rf_pred, reg_test$Total.Cup.Points)
RF_R2 <- R2(rf_pred, reg_test$Total.Cup.Points)
RF_MAE <- MAE(rf_pred, reg_test$Total.Cup.Points)
comparison <- data.frame(
Model = c("Multiple Linear Regression", "Random Forest"),
RMSE = c(MLR_RMSE, RF_RMSE),
R2 = c(MLR_R2, RF_R2),
MAE = c(MLR_MAE, RF_MAE)
)
comparison
## Model RMSE R2 MAE
## 1 Multiple Linear Regression 2.187667 0.232373 1.584461
## 2 Random Forest 2.129165 0.271545 1.528944
Based on the evaluation results, the Random Forest model performs better than Multiple Linear Regression.
The RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) for Random Forest are slightly lower, meaning its predictions are closer to the actual coffee scores. The R² value is also higher, showing that Random Forest explains more of the variation in Total.Cup.Points than the linear model.
From the evaluation of the classification models, Logistic Regression consistently outperformed Random Forest with an accuracy of 0.955, recall of 0.949, F1 of 0.956, and ROC-AUC of 0.986. On top of that, there is a divergence in feature importance between the two models. Logistic Regression identifies Country.of.Origin as the dominant predictor, while Random Forest indicates that Flavor, Aftertaste, and Balance are the definitive sensory markers of quality. This divergence suggests that Logistic Regression relies on broader, linear signals: it identifies that certain countries have a higher “baseline” of quality due to national infrastructure, climate, and sorting standards.
Both regression models show that farm geography, origin together with altitude, provides the strongest predictors of coffee quality. Higher altitudes and specific origins like Ethiopia and Kenya are statistically safer investments for quality. However, the R² values for both models are low, indicating that coffee quality is complex and cannot be fully predicted by geography alone.
Combining these findings, Meridian may aim to source coffee beans from origins such as Ethiopia, Kenya and Uganda, in high-altitude areas, as an initial step. They may then make the final decision based on the flavor, aftertaste, balance, aroma and acidity of the shortlisted coffee beans.
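As a minimal sketch of this sourcing rule applied to the cleaned data (the 1,500 m altitude cutoff and the origin list are illustrative assumptions, not calibrated thresholds):
# Hypothetical first-pass supplier shortlist: preferred origins at high altitude
shortlist <- df3b %>%
  filter(Country.of.Origin %in% c("Ethiopia", "Kenya", "Uganda"),
         altitude_mean_meters >= 1500) %>%
  arrange(desc(Total.Cup.Points)) %>%
  select(Country.of.Origin, altitude_mean_meters, Processing.Method, Total.Cup.Points)
head(shortlist, 10)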