The dataset in use was acquired from Kaggle and is based on a data provided by Coffee Quality Institute (CQI). CQI maintains a web database that serves as a resource for coffee professionals and enthusiasts who are interested in learning about coffee quality and sustainability. The database includes a range of information on coffee production, processing, and sensory evaluation. It also contains data on coffee genetics, soil types, and other factors that can affect coffee quality. The selected dataset represents specifically arabica type and covered the period May 2023.
Description of the columns:
- X: Record id.
- ID: Record id.
- Country of Origin: Refers to country name of coffee origin
-
Farm Name: The name of the coffee farm where the coffee was grown.
- Lot Number: A unique identifier for a specific batch of coffee.
-
Mill: The processing facility where the coffee was milled or processed.
- ICO Number: Refers to the ICO (International Coffee Organization)
number, which identify the coffee’s certification or region.
-
Company: The organization or entity associated with the coffee batch.
- Altitude: The elevation range where the coffee was grown,
measured in meters above sea level.
- Region: The geographical area
or location of the coffee farm.
- Producer: The name of the
individual or entity responsible for producing the coffee.
- Number
of bags: The total count of coffee bags in this lot.
- Bag weight:
The weight of each bag of coffee (kg).
- In-Country Partner: The
organization or partner facilitating coffee trade or logistics within
the producing country.
- Harvest Year: The year during which the
coffee was harvested.
- Grading Date: The date when the coffee
quality grading took place.
- Owner: The entity that owns this
specific coffee batch or lot.
- Variety: The specific type or
varietal of coffee plant used to grow the beans.
- Status: The
current status of the coffee batch, such as whether it is “Completed” or
still in process.
- Processing Method: The method used to process
the coffee beans after harvesting.
- Aroma: Refers to the scent or
fragrance of the coffee.
- Flavor: The flavor of coffee is evaluated
based on the taste, including any sweetness, bitterness, acidity, and
other flavor notes.
- Aftertaste: Refers to the lingering taste
that remains in the mouth after swallowing the coffee.
- Acidity:
Acidity in coffee refers to the brightness or liveliness of the
taste.
- Body: The body of coffee refers to the thickness or
viscosity of the coffee in the mouth.
- Balance: Balance refers to
how well the different flavor components of the coffee work
together.
- Uniformity: Uniformity refers to the consistency of the
coffee from cup to cup.
- Clean Cup: A clean cup refers to a coffee
that is free of any off-flavors or defects, such as sourness, mustiness,
or staleness.
- Sweetness: It can be described as caramel-like,
fruity, or floral, and is a desirable quality in coffee.
- Overall:
The overall score assigned to the coffee based on its quality
evaluation.
- Defects: The number of defects found in the coffee
beans. A defect can include physical or sensory issues. A value of “0”
indicates no defects.
- Total Cup Points: is literally the total of
10 features given above. There were some notebooks trying to predict the
total cup points given these features.
- Moisture Percentage: The
moisture content of the coffee beans, measured as a percentage.
-
Category One Defects: Category One defects are primary defects that can
be perceived through visual inspection of the coffee beans. These
defects include Black beans, sour beans, insect-damaged beans,
fungus-damaged beans, etc.
- Quakers: The number of “quaker” beans
in the coffee. Quakers are underdeveloped or defective beans that fail
to roast properly. A value of “0” indicates no quakers.
- Color: The
color of the coffee beans, which reflects the processing and storage
condition.
- Category Two Defects: Category Two defects are
secondary defects that are more subtle and can only be detected through
tasting. These defects include Over-fermentation, staleness, rancidness,
chemical taste, etc.
- Expiration: The expiration date of the coffee
batch.
- Certification Body: The name of the organization that
certified the coffee batch, confirming its quality or compliance with
specific standards.
- Certification Address: The physical address of
the certification body.
- Certification Contact: The contact person
and details at the certification body.
The objective of this project is to apply a dimensionality reduction technique to a selected dataset. Principal Component Analysis (PCA) has been chosen as the initial method due to its effectiveness in handling linear relationships within the data (for instance, coffee quality). PCA provides a clear approach for analyzing high-dimensional datasets, preserves the most significant information, and useful for visualization in more than two dimensions.
library(ggplot2)
library(corrplot)
library(factoextra)
library(labdsv)
library(psych)
library(ClusterR)
library(readxl)
library(cluster)
library(flexclust)
library(fpc)
library(clustertend)
library(ggthemes)
library(plotly)
library(stringr)
library(missMDA)
library(ade4)
library(smacof)
library(ggplot2)
library(scales)
library(kableExtra)
library(factoextra)
library(psy)
library(pdp)
library(scales)
library(corrplot)
library(tidyr)
library(dplyr)
library(caret)
library(gridExtra)
setwd("C:/Users/ydmar/Documents/UW/UW - 1 semester/UL")
arabica <- read.csv("df_arabica_clean.csv", header = TRUE, dec = ".")
summary(arabica)
## X ID Country.of.Origin Farm.Name
## Min. : 0.0 Min. : 0.0 Length:207 Length:207
## 1st Qu.: 51.5 1st Qu.: 51.5 Class :character Class :character
## Median :103.0 Median :103.0 Mode :character Mode :character
## Mean :103.0 Mean :103.0
## 3rd Qu.:154.5 3rd Qu.:154.5
## Max. :206.0 Max. :206.0
## Lot.Number Mill ICO.Number Company
## Length:207 Length:207 Length:207 Length:207
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Altitude Region Producer Number.of.Bags
## Length:207 Length:207 Length:207 Min. : 1.0
## Class :character Class :character Class :character 1st Qu.: 1.0
## Mode :character Mode :character Mode :character Median : 14.0
## Mean : 155.4
## 3rd Qu.: 275.0
## Max. :2240.0
## Bag.Weight In.Country.Partner Harvest.Year Grading.Date
## Length:207 Length:207 Length:207 Length:207
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Owner Variety Status Processing.Method
## Length:207 Length:207 Length:207 Length:207
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Aroma Flavor Aftertaste Acidity Body
## Min. :6.500 Min. :6.750 Min. :6.67 Min. :6.830 Min. :6.830
## 1st Qu.:7.580 1st Qu.:7.580 1st Qu.:7.42 1st Qu.:7.500 1st Qu.:7.500
## Median :7.670 Median :7.750 Median :7.58 Median :7.670 Median :7.670
## Mean :7.721 Mean :7.745 Mean :7.60 Mean :7.690 Mean :7.641
## 3rd Qu.:7.920 3rd Qu.:7.920 3rd Qu.:7.75 3rd Qu.:7.875 3rd Qu.:7.750
## Max. :8.580 Max. :8.500 Max. :8.42 Max. :8.580 Max. :8.250
## Balance Uniformity Clean.Cup Sweetness Overall
## Min. :6.670 Min. : 8.67 Min. :10 Min. :10 Min. :6.670
## 1st Qu.:7.500 1st Qu.:10.00 1st Qu.:10 1st Qu.:10 1st Qu.:7.500
## Median :7.670 Median :10.00 Median :10 Median :10 Median :7.670
## Mean :7.644 Mean : 9.99 Mean :10 Mean :10 Mean :7.677
## 3rd Qu.:7.790 3rd Qu.:10.00 3rd Qu.:10 3rd Qu.:10 3rd Qu.:7.920
## Max. :8.420 Max. :10.00 Max. :10 Max. :10 Max. :8.580
## Defects Total.Cup.Points Moisture.Percentage Category.One.Defects
## Min. :0 Min. :78.00 Min. : 0.00 Min. :0.0000
## 1st Qu.:0 1st Qu.:82.58 1st Qu.:10.10 1st Qu.:0.0000
## Median :0 Median :83.75 Median :10.80 Median :0.0000
## Mean :0 Mean :83.71 Mean :10.74 Mean :0.1353
## 3rd Qu.:0 3rd Qu.:84.83 3rd Qu.:11.50 3rd Qu.:0.0000
## Max. :0 Max. :89.33 Max. :13.50 Max. :5.0000
## Quakers Color Category.Two.Defects Expiration
## Min. : 0.0000 Length:207 Min. : 0.000 Length:207
## 1st Qu.: 0.0000 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.0000 Mode :character Median : 1.000 Mode :character
## Mean : 0.6908 Mean : 2.251
## 3rd Qu.: 1.0000 3rd Qu.: 3.000
## Max. :12.0000 Max. :16.000
## Certification.Body Certification.Address Certification.Contact
## Length:207 Length:207 Length:207
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
str(arabica)
## 'data.frame': 207 obs. of 41 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ ID : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Country.of.Origin : chr "Colombia" "Taiwan" "Laos" "Costa Rica" ...
## $ Farm.Name : chr "Finca El Paraiso" "Royal Bean Geisha Estate" "OKLAO coffee farms" "La Cumbre" ...
## $ Lot.Number : chr "CQU2022015" "The 2022 Pacific Rim Coffee Summit,T037" "The 2022 Pacific Rim Coffee Summit,LA01" "CQU2022017" ...
## $ Mill : chr "Finca El Paraiso" "Royal Bean Geisha Estate" "oklao coffee processing plant" "La Montana Tarrazu MIll" ...
## $ ICO.Number : chr "" "" "" "" ...
## $ Company : chr "Coffee Quality Union" "Taiwan Coffee Laboratory" "Taiwan Coffee Laboratory" "Coffee Quality Union" ...
## $ Altitude : chr "1700-1930" "1200" "1300" "1900" ...
## $ Region : chr "Piendamo,Cauca" "Chiayi" "Laos Borofen Plateau" "Los Santos,Tarrazu" ...
## $ Producer : chr "Diego Samuel Bermudez" "曾福森" "WU TAO CHI" "Santa Maria de Dota" ...
## $ Number.of.Bags : int 1 1 19 1 2 5 1 1 1 320 ...
## $ Bag.Weight : chr "35 kg" "80 kg" "25 kg" "22 kg" ...
## $ In.Country.Partner : chr "Japan Coffee Exchange" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Japan Coffee Exchange" ...
## $ Harvest.Year : chr "2021 / 2022" "2021 / 2022" "2021 / 2022" "2022" ...
## $ Grading.Date : chr "September 21st, 2022" "November 15th, 2022" "November 15th, 2022" "September 21st, 2022" ...
## $ Owner : chr "Coffee Quality Union" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Coffee Quality Union" ...
## $ Variety : chr "Castillo" "Gesha" "Java" "Gesha" ...
## $ Status : chr "Completed" "Completed" "Completed" "Completed" ...
## $ Processing.Method : chr "Double Anaerobic Washed" "Washed / Wet" "Semi Washed" "Washed / Wet" ...
## $ Aroma : num 8.58 8.5 8.33 8.08 8.33 8.33 8.33 8.25 8.08 8.08 ...
## $ Flavor : num 8.5 8.5 8.42 8.17 8.33 8.33 8.17 8.25 8.08 8.17 ...
## $ Aftertaste : num 8.42 7.92 8.08 8.17 8.08 8.25 8.08 8.17 8.25 8.08 ...
## $ Acidity : num 8.58 8 8.17 8.25 8.25 7.83 8 8 8.08 8.17 ...
## $ Body : num 8.25 7.92 7.92 8.17 7.92 7.83 7.83 7.92 7.92 8 ...
## $ Balance : num 8.42 8.25 8.17 8.08 7.92 8.17 8.25 8.08 8 8 ...
## $ Uniformity : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Clean.Cup : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Sweetness : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Overall : num 8.58 8.5 8.33 8.25 8.25 8.25 8.25 8.08 8.25 8 ...
## $ Defects : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Total.Cup.Points : num 89.3 87.6 87.4 87.2 87.1 ...
## $ Moisture.Percentage : num 11.8 10.5 10.4 11.8 11.6 10.7 9.1 10 10.8 11 ...
## $ Category.One.Defects : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Quakers : int 0 0 0 0 2 0 0 0 0 0 ...
## $ Color : chr "green" "blue-green" "yellowish" "green" ...
## $ Category.Two.Defects : int 3 0 2 0 2 2 0 1 0 0 ...
## $ Expiration : chr "September 21st, 2023" "November 15th, 2023" "November 15th, 2023" "September 21st, 2023" ...
## $ Certification.Body : chr "Japan Coffee Exchange" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Japan Coffee Exchange" ...
## $ Certification.Address: chr "〒413-0002 静岡県熱海市伊豆山1173−58 1173-58 Izusan, Atami, Shizuoka, 413-0002 JAPAN" "QAHWAH CO., LTD 4F, No. 225, Sec. 3, Beixin Rd., Xindian Dist. New Taipei City, Taiwan" "QAHWAH CO., LTD 4F, No. 225, Sec. 3, Beixin Rd., Xindian Dist. New Taipei City, Taiwan" "〒413-0002 静岡県熱海市伊豆山1173−58 1173-58 Izusan, Atami, Shizuoka, 413-0002 JAPAN" ...
## $ Certification.Contact: chr "松澤 宏樹 Koju Matsuzawa - +81(0)9085642901" "Lin, Jen-An Neil 林仁安 - 886-289116612" "Lin, Jen-An Neil 林仁安 - 886-289116612" "松澤 宏樹 Koju Matsuzawa - +81(0)9085642901" ...
Dataset consists of 207 observations and 41 variables.
Convert
the appropriate columns to a numeric data type.
arabica$Bag.Weight <- as.numeric(str_remove(arabica$Bag.Weight, "kg"))
arabica$Category.One.Defects <- as.numeric(arabica$Category.One.Defects)
arabica$Category.Two.Defects <- as.numeric(arabica$Category.Two.Defects)
arabica$Quakers <- as.numeric(arabica$Quakers)
arabica$Moisture.Percentage <- as.numeric(arabica$Moisture.Percentage)
arabica$Number.of.Bags <- as.numeric(arabica$Number.of.Bags)
str(arabica)
Replace empty values with “NA” and identify columns containing missing values.
arabica[arabica == ""] <- NA
missing_positions <- which(is.na(arabica), arr.ind = TRUE)
print(missing_positions)
Determine the number of missing values for each column.
print(arabica[,4]) # <- Farm name
print(sum(is.na(arabica[,4])))
print(arabica[,5]) # <- Lot.Number
print(sum(is.na(arabica[,5])))
print(arabica[,6]) # <- Mill
print(sum(is.na(arabica[,6])))
print(arabica[,7]) # <- ICO.number
print(sum(is.na(arabica[,7])))
print(arabica[,9]) # <- altitude
print(sum(is.na(arabica[,9])))
print(arabica[,10]) # <- region
print(sum(is.na(arabica[,10])))
print(arabica[,11]) # <- producer
print(sum(is.na(arabica[,11])))
print(arabica[,18]) # <- variety
print(sum(is.na(arabica[,18])))
print(arabica[,20]) # <- processing method
print(sum(is.na(arabica[,20])))
Since all columns store character data representing general
information about the coffee, replace missing values in columns with
fewer than 10 “NA” entries using the mode function. Fill in the column
with 132 missing values with the placeholder value “Unknown”.
Verify
that the dataset no longer contains missing values.
get_mode <- function(x) {
unique_x <- unique(x)
unique_x[which.max(tabulate(match(x, unique_x)))]
}
arabica$Farm.Name[is.na(arabica$Farm.Name)] <- get_mode(arabica$Farm.Name[!is.na(arabica$Farm.Name)])
arabica$Lot.Number[is.na(arabica$Lot.Number)] <- get_mode(arabica$Lot.Number[!is.na(arabica$Lot.Number)])
arabica$Mill[is.na(arabica$Mill)] <- get_mode(arabica$Mill[!is.na(arabica$Mill)])
arabica$ICO.Number[is.na(arabica$ICO.Number)] <- "Unknown"
arabica$Altitude[is.na(arabica$Altitude)] <- get_mode(arabica$Altitude[!is.na(arabica$Altitude)])
arabica$Region[is.na(arabica$Region)] <- get_mode(arabica$Region[!is.na(arabica$Region)])
arabica$Producer[is.na(arabica$Producer)] <- get_mode(arabica$Producer[!is.na(arabica$Producer)])
arabica$Variety[is.na(arabica$Variety)] <- get_mode(arabica$Variety[!is.na(arabica$Variety)])
arabica$Processing.Method[is.na(arabica$Processing.Method)] <- get_mode(arabica$Processing.Method[!is.na(arabica$Processing.Method)])
print(sum(is.na(arabica)))
## [1] 0
summary(arabica)
## X ID Country.of.Origin Farm.Name
## Min. : 0.0 Min. : 0.0 Length:207 Length:207
## 1st Qu.: 51.5 1st Qu.: 51.5 Class :character Class :character
## Median :103.0 Median :103.0 Mode :character Mode :character
## Mean :103.0 Mean :103.0
## 3rd Qu.:154.5 3rd Qu.:154.5
## Max. :206.0 Max. :206.0
## Lot.Number Mill ICO.Number Company
## Length:207 Length:207 Length:207 Length:207
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Altitude Region Producer Number.of.Bags
## Length:207 Length:207 Length:207 Min. : 1.0
## Class :character Class :character Class :character 1st Qu.: 1.0
## Mode :character Mode :character Mode :character Median : 14.0
## Mean : 155.4
## 3rd Qu.: 275.0
## Max. :2240.0
## Bag.Weight In.Country.Partner Harvest.Year Grading.Date
## Min. : 1.0 Length:207 Length:207 Length:207
## 1st Qu.: 15.0 Class :character Class :character Class :character
## Median : 30.0 Mode :character Mode :character Mode :character
## Mean : 227.8
## 3rd Qu.: 60.0
## Max. :19200.0
## Owner Variety Status Processing.Method
## Length:207 Length:207 Length:207 Length:207
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Aroma Flavor Aftertaste Acidity Body
## Min. :6.500 Min. :6.750 Min. :6.67 Min. :6.830 Min. :6.830
## 1st Qu.:7.580 1st Qu.:7.580 1st Qu.:7.42 1st Qu.:7.500 1st Qu.:7.500
## Median :7.670 Median :7.750 Median :7.58 Median :7.670 Median :7.670
## Mean :7.721 Mean :7.745 Mean :7.60 Mean :7.690 Mean :7.641
## 3rd Qu.:7.920 3rd Qu.:7.920 3rd Qu.:7.75 3rd Qu.:7.875 3rd Qu.:7.750
## Max. :8.580 Max. :8.500 Max. :8.42 Max. :8.580 Max. :8.250
## Balance Uniformity Clean.Cup Sweetness Overall
## Min. :6.670 Min. : 8.67 Min. :10 Min. :10 Min. :6.670
## 1st Qu.:7.500 1st Qu.:10.00 1st Qu.:10 1st Qu.:10 1st Qu.:7.500
## Median :7.670 Median :10.00 Median :10 Median :10 Median :7.670
## Mean :7.644 Mean : 9.99 Mean :10 Mean :10 Mean :7.677
## 3rd Qu.:7.790 3rd Qu.:10.00 3rd Qu.:10 3rd Qu.:10 3rd Qu.:7.920
## Max. :8.420 Max. :10.00 Max. :10 Max. :10 Max. :8.580
## Defects Total.Cup.Points Moisture.Percentage Category.One.Defects
## Min. :0 Min. :78.00 Min. : 0.00 Min. :0.0000
## 1st Qu.:0 1st Qu.:82.58 1st Qu.:10.10 1st Qu.:0.0000
## Median :0 Median :83.75 Median :10.80 Median :0.0000
## Mean :0 Mean :83.71 Mean :10.74 Mean :0.1353
## 3rd Qu.:0 3rd Qu.:84.83 3rd Qu.:11.50 3rd Qu.:0.0000
## Max. :0 Max. :89.33 Max. :13.50 Max. :5.0000
## Quakers Color Category.Two.Defects Expiration
## Min. : 0.0000 Length:207 Min. : 0.000 Length:207
## 1st Qu.: 0.0000 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.0000 Mode :character Median : 1.000 Mode :character
## Mean : 0.6908 Mean : 2.251
## 3rd Qu.: 1.0000 3rd Qu.: 3.000
## Max. :12.0000 Max. :16.000
## Certification.Body Certification.Address Certification.Contact
## Length:207 Length:207 Length:207
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
Remove the columns Cleaned.cup, Sweetness, and Defects, as they
contain constant values. These columns would otherwise introduce issues
in matrix computations and correlation analyses.
Exclude the ID
columns, as they solely contain record identifiers and don’t contribute
to the analysis.
Exclude the Overall and Total.cup.points columns
since those columns could introduce redundancy, as they don’t provide
new information, but summarize existing features.
These preparations
were made to compute the covariance matrix to see if there is any
relationship between variables.
numeric_arabica <- arabica %>%
select(-Defects, -Sweetness, -Clean.Cup, -X, -ID, - Total.Cup.Points, - Overall) %>%
select_if(is.numeric)
arabica_matrix <- as.matrix(numeric_arabica)
head(arabica_matrix)
## Number.of.Bags Bag.Weight Aroma Flavor Aftertaste Acidity Body Balance
## [1,] 1 35 8.58 8.50 8.42 8.58 8.25 8.42
## [2,] 1 80 8.50 8.50 7.92 8.00 7.92 8.25
## [3,] 19 25 8.33 8.42 8.08 8.17 7.92 8.17
## [4,] 1 22 8.08 8.17 8.17 8.25 8.17 8.08
## [5,] 2 24 8.33 8.33 8.08 8.25 7.92 7.92
## [6,] 5 30 8.33 8.33 8.25 7.83 7.83 8.17
## Uniformity Moisture.Percentage Category.One.Defects Quakers
## [1,] 10 11.8 0 0
## [2,] 10 10.5 0 0
## [3,] 10 10.4 0 0
## [4,] 10 11.8 0 0
## [5,] 10 11.6 0 2
## [6,] 10 10.7 0 0
## Category.Two.Defects
## [1,] 3
## [2,] 0
## [3,] 2
## [4,] 0
## [5,] 2
## [6,] 2
cor_matrix <- cor(arabica_matrix, method = "pearson")
print(cor_matrix)
## Number.of.Bags Bag.Weight Aroma Flavor
## Number.of.Bags 1.00000000 0.068794936 -0.227414231 -0.26469950
## Bag.Weight 0.06879494 1.000000000 -0.016899458 -0.05667924
## Aroma -0.22741423 -0.016899458 1.000000000 0.82277851
## Flavor -0.26469950 -0.056679241 0.822778505 1.00000000
## Aftertaste -0.30469701 0.009330594 0.793397466 0.87681144
## Acidity -0.20441537 0.149742631 0.712919976 0.81093386
## Body -0.04003724 0.135317762 0.633100500 0.73985741
## Balance -0.25901609 0.025045088 0.745648029 0.85178584
## Uniformity 0.05937032 0.008722334 -0.028063067 -0.03976682
## Moisture.Percentage 0.10275334 -0.067663518 -0.002418472 -0.05090234
## Category.One.Defects 0.04367764 -0.022594818 -0.057859622 -0.08129878
## Quakers 0.08227137 0.017300591 -0.342885600 -0.31005432
## Category.Two.Defects 0.23740750 0.125250267 -0.254719142 -0.33034651
## Aftertaste Acidity Body Balance
## Number.of.Bags -0.304697011 -0.20441537 -0.040037237 -0.25901609
## Bag.Weight 0.009330594 0.14974263 0.135317762 0.02504509
## Aroma 0.793397466 0.71291998 0.633100500 0.74564803
## Flavor 0.876811442 0.81093386 0.739857407 0.85178584
## Aftertaste 1.000000000 0.81443868 0.738674332 0.86195074
## Acidity 0.814438682 1.00000000 0.765184556 0.80523566
## Body 0.738674332 0.76518456 1.000000000 0.81609773
## Balance 0.861950737 0.80523566 0.816097729 1.00000000
## Uniformity -0.023925511 -0.06256401 -0.043903922 -0.08906386
## Moisture.Percentage -0.051115362 -0.01675124 0.009770954 -0.07265985
## Category.One.Defects -0.107965594 -0.11020378 0.048256481 0.03955156
## Quakers -0.303247393 -0.20948342 -0.257464654 -0.31550219
## Category.Two.Defects -0.330731327 -0.26227881 -0.208783636 -0.31773347
## Uniformity Moisture.Percentage Category.One.Defects
## Number.of.Bags 0.059370319 0.102753338 0.04367764
## Bag.Weight 0.008722334 -0.067663518 -0.02259482
## Aroma -0.028063067 -0.002418472 -0.05785962
## Flavor -0.039766820 -0.050902336 -0.08129878
## Aftertaste -0.023925511 -0.051115362 -0.10796559
## Acidity -0.062564014 -0.016751242 -0.11020378
## Body -0.043903922 0.009770954 0.04825648
## Balance -0.089063859 -0.072659846 0.03955156
## Uniformity 1.000000000 0.035239870 0.02147091
## Moisture.Percentage 0.035239870 1.000000000 0.09275477
## Category.One.Defects 0.021470908 0.092754773 1.00000000
## Quakers 0.038486395 0.170142394 0.01777426
## Category.Two.Defects 0.071713700 0.130671152 0.16665467
## Quakers Category.Two.Defects
## Number.of.Bags 0.08227137 0.2374075
## Bag.Weight 0.01730059 0.1252503
## Aroma -0.34288560 -0.2547191
## Flavor -0.31005432 -0.3303465
## Aftertaste -0.30324739 -0.3307313
## Acidity -0.20948342 -0.2622788
## Body -0.25746465 -0.2087836
## Balance -0.31550219 -0.3177335
## Uniformity 0.03848639 0.0717137
## Moisture.Percentage 0.17014239 0.1306712
## Category.One.Defects 0.01777426 0.1666547
## Quakers 1.00000000 0.4624226
## Category.Two.Defects 0.46242261 1.0000000
corrplot::corrplot(cor_matrix, method = "circle")
dim(arabica_matrix)
## [1] 207 13
The plot reveals the strong correlation between the following
variables: Aroma, Flavor, Aftertaste, Acidity, Body and Balance.
Positive correlations arise from shared contributions to coffee quality
metrics or shared underlying properties. For example, Aroma and Flavor
are positively correlated because coffee with a strong aroma tend to
have better flavor profiles, reflecting an underlying quality
factor.High acidity in coffee is often associated with higher overall
quality, also contributing to a better overall score. Presented positive
correlations reflect interdependencies between features that
collectively affect coffee quality. Negative correlations result from
detrimental factors that reduce overall scores or specific sensory
features.Another example of negative correlation is Quakers and Aroma.
An increase in quakers,unripe beans, usually reduces the aroma
quality.
The central idea of principal component analysis is to reduce the
dimensionality of a dataset in which there are a large number of
interrelated variables, while retaining as much as possible of the
variation present in the dataset. This reduction is achieved by
transforming to a new set of variables, the principal components, which
are uncorrelated, and which are ordered so that the first few retain
most of the variation present in all of the original variables.
Computation of the principal components reduces to the solution of an
eigenvalue-eigenvector problem for a positive-semidefinite symmetric
matrix.
We start with the dataset normalization before further analysis.
preproc1 <- preProcess(arabica_matrix, method=c("center", "scale"))
arabica_matrix.s <- predict(preproc1, arabica_matrix)
Eigenvalues on the basis of covariance.
arabica_matrix.cov<-cov(arabica_matrix.s)
arabica_matrix.eigen<-eigen(arabica_matrix.cov)
arabica_matrix.eigen$values
## [1] 5.2954382 1.4499720 1.1218288 1.0163110 0.9977321 0.9293184 0.7826027
## [8] 0.5161073 0.3148871 0.1894170 0.1621808 0.1158438 0.1083607
head(arabica_matrix.eigen$vectors)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.13475997 -0.31090811 -0.18956363 -0.40369484 -0.098789782 0.642954507
## [2,] 0.01356599 -0.26979254 -0.75396989 -0.06521995 0.075363222 -0.105882170
## [3,] 0.37328140 -0.05976710 0.08004008 0.01906562 -0.056000736 0.016821517
## [4,] 0.40618268 -0.04226558 0.07782905 0.06118791 -0.045297141 -0.005737637
## [5,] 0.40610479 -0.04359876 0.02529748 0.09242236 -0.060234348 -0.041380794
## [6,] 0.38432100 -0.15967199 -0.10808402 0.11762529 -0.008073615 0.026282950
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0.414745410 -0.14833927 -0.20776575 0.06386324 -0.1435768 0.02985631
## [2,] -0.532860668 -0.01278676 -0.17211389 -0.08241683 -0.1023183 -0.07684239
## [3,] 0.051047830 0.34316751 -0.65614410 -0.28001510 0.4044517 0.22871166
## [4,] 0.115350971 0.03814272 -0.15274605 0.04151546 -0.1939058 -0.85882683
## [5,] 0.039420651 0.02020285 -0.04275869 0.01195993 -0.5837365 0.29502260
## [6,] -0.003339653 -0.12678718 0.06927037 0.80567971 0.3326143 0.13701125
## [,13]
## [1,] 0.01878774
## [2,] 0.02279404
## [3,] 0.04130781
## [4,] 0.07965296
## [5,] -0.62204139
## [6,] 0.03346227
Now we need to select the number of components. Firstly, use the Kaiser’s rule or Kaiser-Guttman criterion. It is a widely used method to evaluate the maximum number of linear combinations to extract from the dataset. According to that rule only those principal components are retained, whose variances exceed 1. The idea behind the Kaiser-Guttman criterion is that any principal with variance less than 1 contains less information than one of the original variables and so is not worth retaining (Jolliffe, 2002). Based on the Kaiser’s rule description, we can retain 4 components, since the eigenvalues are greater than 1 for them. Check it again:
kaiser_rule <- which(arabica_matrix.eigen$values > 1)
print(kaiser_rule)
## [1] 1 2 3 4
xxx<-arabica_matrix.s
xxx.pca1<-prcomp(xxx, center=FALSE, scale.=FALSE)
summary(xxx.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3012 1.2041 1.05916 1.00812 0.99887 0.96401 0.8846
## Proportion of Variance 0.4073 0.1115 0.08629 0.07818 0.07675 0.07149 0.0602
## Cumulative Proportion 0.4073 0.5189 0.60517 0.68335 0.76010 0.83158 0.8918
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.7184 0.56115 0.43522 0.40272 0.34036 0.32918
## Proportion of Variance 0.0397 0.02422 0.01457 0.01248 0.00891 0.00834
## Cumulative Proportion 0.9315 0.95571 0.97028 0.98275 0.99166 1.00000
print(get_eigenvalue(xxx.pca1))
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.2954382 40.7341398 40.73414
## Dim.2 1.4499720 11.1536307 51.88777
## Dim.3 1.1218288 8.6294527 60.51722
## Dim.4 1.0163110 7.8177768 68.33500
## Dim.5 0.9977321 7.6748626 76.00986
## Dim.6 0.9293184 7.1486029 83.15847
## Dim.7 0.7826027 6.0200211 89.17849
## Dim.8 0.5161073 3.9700561 93.14854
## Dim.9 0.3148871 2.4222087 95.57075
## Dim.10 0.1894170 1.4570541 97.02781
## Dim.11 0.1621808 1.2475446 98.27535
## Dim.12 0.1158438 0.8911058 99.16646
## Dim.13 0.1083607 0.8335442 100.00000
Use Scree plot to visualize the eigenvalues of the components in the ascending order.
fviz_eig(xxx.pca1, choice='eigenvalue') +
geom_hline(yintercept = 1, linetype = "dashed", color = "red", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Based on the Eigenvalue and Scree plot result, it’s more obvious
now, that there are several approaches how many components we can
select. First, is to select based on the Kaiser-Guttman criterion, and
the number of components is 4 (Jolliffe, 2002). Second, we can notice
that 3 components explain 60 percent of the variances, capturing quite
significant variance. We can see a “knee” or “elbow” at Dim 3. This
indicates a diminishing return on explained variance after Dim.3. Once
the “elbow” was defined in the graph, is then taken to be the number of
components to be retained (Jolliffe, 2002). Third, we can see that the
cumulative variance explained by the first 5 components is 76.01%, which
is a significant proportion of the total variance. Dimension 5 has an
eigenvalue of 0.9977, which is very close to 1. Including it might be
justified as it contributes meaningful variance (7.67%).
fviz_eig(xxx.pca1)
The cumulative proportion of explained variance shows that 3
components account for approximately 60% of the total variance. To
explain approximately 68% of the variance, 4 components are required,
while selecting 5 components increases the cumulative explained variance
to 76%.
a<-summary(xxx.pca1)
plot(a$importance[3,],type="p")
From the plot above, we also can see that to explain 70-80% of
variance, we need to select from 4 to 5 components. Let’s use another
technique to justify selection of the components.
Using Singular Value Decomposition (SVD) as an alternative to PCA for verifying the number of dimensions could be useful approach. SVD provides a similar decomposition of the dataset but does so without relying on the covariance matrix.
svd_arabica <- svd(scale(arabica_matrix))
singular_values <- svd_arabica$d
explained_variance <- singular_values^2 / sum(singular_values^2)
cumulative_variance <- cumsum(explained_variance)
print(cumulative_variance)
## [1] 0.4073414 0.5188777 0.6051722 0.6833500 0.7600986 0.8315847 0.8917849
## [8] 0.9314854 0.9557075 0.9702781 0.9827535 0.9916646 1.0000000
plot(svd_arabica$d)
The plot displays the singular values from the SVD decomposition.
Each point on the plot represents a singular value, which reflects the
amount of variance captured by the corresponding dimension. The first
singular value is significantly larger than the others, indicating that
the first dimension captures the majority of the variance in the data.
After a certain point (index 4 or 5), the singular values become small
and relatively uniform, indicating that these dimensions contribute
little additional information or may represent noise. We can verify the
result drawing plot for cumulative variance.
plot(1:length(cumulative_variance), cumulative_variance,
xlab = "Dimensions", ylab = "Cumulative Variance")
To explain 70–80% of the variance, we need to select between 4 and
5 components.
Based on the outcomes of the PCA and SVD techniques, we need to determine the optimal number of components to retain. The analysis suggests that the optimal range is between 4 and 5 components. Using the “elbow method,” selecting 3 components could be a choice for simplicity and better interpretation. These 3 components likely focus on the primary coffee quality metrics, as they show strong positive correlations. However, the goal is to maximize the explained variance, and selecting 4 or 5 components could be more suitable. Including 4 or 5 components would allow us to account for additional factors such as defect-related attributes, batch size, or processing methods, which contribute to the remaining variance.
Use 5 principal components.
cos2 <- apply(xxx.pca1$x[, 1:5]^2, 1, sum) / apply(xxx.pca1$x^2, 1, sum) # quality of representation of a data point on the selected principal component
pairs(
as.data.frame(xxx.pca1$x[, 1:5]),
main = "Pairwise scatterplot of 5 principal components",
pch = 21,
bg = "lightblue"
)
plot_ly(
x = ~xxx.pca1$rotation[, 1],
y = ~xxx.pca1$rotation[, 2],
z = ~xxx.pca1$rotation[, 3],
type = 'scatter3d',
mode = 'markers+text',
text = rownames(xxx.pca1$rotation),
marker = list(size = 5, color = 'steelblue')
) %>%
layout(
scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
)
)
plot_ly(
x = ~xxx.pca1$rotation[, 3],
y = ~xxx.pca1$rotation[, 4],
z = ~xxx.pca1$rotation[, 5],
type = 'scatter3d',
mode = 'markers+text',
text = rownames(xxx.pca1$rotation),
marker = list(size = 5, color = 'steelblue')
) %>%
layout(
scene = list(
xaxis = list(title = "PC3"),
yaxis = list(title = "PC4"),
zaxis = list(title = "PC5")
)
)
The variable plots displayed above illustrate relationships
between variables as well as the “quality” of all factors. Positively
correlated variables are positioned close to each other, while
negatively correlated variables appear on opposite sides of the plot.
The “quality” of a variable is reflected by its distance from the
center, with variables further from the center being better represented
by the selected principal components.
The axes correspond to the first three principal components (PC1, PC2, and PC3) in the first plot and the next three principal components (PC3, PC4, and PC5) in the second plot. From these plots, we can observe that Flavor, Aroma, Aftertaste, and Balance are strongly positively correlated and collectively contribute to coffee quality metrics. Body and Acidity can also be added to these coffee quality-related attributes. Variables such as Bag.Weight and Number.of.Bags are positioned farther from the quality-related variables, indicating they contribute to variance unrelated to sensory attributes. Similarly, Category.One.Defects, Category.Two.Defects, and Quakers are positioned separately from the sensory variables, representing variance related to defects.
loading_scores_PC_1<-xxx.pca1$rotation[,1]
fac_scores_PC_1<-abs(loading_scores_PC_1)
fac_scores_PC_1_ranked<-names(sort(fac_scores_PC_1, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_1_ranked, 1]
## Flavor Aftertaste Balance
## -0.40618268 -0.40610479 -0.40378593
## Acidity Aroma Body
## -0.38432100 -0.37328140 -0.36145936
## Category.Two.Defects Quakers Number.of.Bags
## 0.18518358 0.18213234 0.13475997
## Category.One.Defects Moisture.Percentage Uniformity
## 0.03529932 0.03468225 0.03314528
## Bag.Weight
## -0.01356599
As we can see, variables were sorted based on their absolute contributions to PC1. The following contributions are the strongest one: Flavor, Aftertaste, Balance, Acidity, Aroma and Body.
loading_scores_PC_2<-xxx.pca1$rotation[,2]
fac_scores_PC_2<-abs(loading_scores_PC_2)
fac_scores_PC_2_ranked<-names(sort(fac_scores_PC_2, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_2_ranked, 2]
## Category.Two.Defects Quakers Moisture.Percentage
## 0.54469393 0.41697167 0.39232940
## Number.of.Bags Category.One.Defects Bag.Weight
## 0.31090811 0.29174840 0.26979254
## Body Acidity Uniformity
## 0.24647458 0.15967199 0.13972045
## Balance Aroma Aftertaste
## 0.08902692 0.05976710 0.04359876
## Flavor
## 0.04226558
As we can observe from the PC2, the following contributions are the strongest one: Category.Two.Defects, Quakers, Moisture.Percentage, Number.of.Bags.
loading_scores_PC_3<-xxx.pca1$rotation[, 3]
fac_scores_PC_3<-abs(loading_scores_PC_3)
fac_scores_PC_3_ranked<-names(sort(fac_scores_PC_3, decreasing = TRUE))
xxx.pca1$rotation[fac_scores_PC_3_ranked, 3]
## Bag.Weight Moisture.Percentage Category.One.Defects
## -0.75396989 0.48968287 0.33814757
## Number.of.Bags Acidity Aroma
## -0.18956363 -0.10808402 0.08004008
## Flavor Body Category.Two.Defects
## 0.07782905 -0.07713777 -0.06420221
## Quakers Uniformity Balance
## 0.05815442 0.04897627 0.02869291
## Aftertaste
## 0.02529748
Regarding the PC3, we can observe Bag.Weight, Moisture.Percentage and Category.One.Defects as a strong contributions.
loading_scores_PC_4<-xxx.pca1$rotation[,4]
fac_scores_PC_4<-abs(loading_scores_PC_4)
fac_scores_PC_4_ranked<-names(sort(fac_scores_PC_4, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_4_ranked, 4]
## Category.One.Defects Quakers Number.of.Bags
## 0.62096334 -0.54980742 0.40369484
## Uniformity Category.Two.Defects Moisture.Percentage
## 0.23675392 -0.15245563 -0.14024339
## Body Acidity Aftertaste
## 0.13325069 -0.11762529 -0.09242236
## Bag.Weight Flavor Balance
## 0.06521995 -0.06118791 0.04109770
## Aroma
## -0.01906562
Then, Category.One.Defects, Quakers and Number.of.Bags consider as a strong contribution of PC4.
loading_scores_PC_5<-xxx.pca1$rotation[,5]
fac_scores_PC_5<-abs(loading_scores_PC_5)
fac_scores_PC_5_ranked<-names(sort(fac_scores_PC_5, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_5_ranked, 5]
## Uniformity Category.One.Defects Number.of.Bags
## -0.900541967 0.382927989 -0.098789782
## Balance Bag.Weight Moisture.Percentage
## 0.080924636 0.075363222 -0.069840558
## Category.Two.Defects Aftertaste Aroma
## 0.065870129 -0.060234348 -0.056000736
## Flavor Body Quakers
## -0.045297141 0.045093452 0.016468689
## Acidity
## -0.008073615
The last one from the PC5, Uniformity and Category.One.Defects are the strongest one.
ind<-get_pca_ind(xxx.pca1)
print(ind)
## Principal Component Analysis Results for individuals
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the individuals"
## 2 "$cos2" "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
head(ind$coord)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## 1 -6.988786 1.90681089 0.5418321 -0.7429148 -0.3003789 -0.1230459 -0.21930247
## 2 -4.736186 0.09064613 0.3422025 -0.2200747 -0.2674048 0.2088002 -0.32004123
## 3 -4.627493 0.49812488 0.1328734 -0.3990757 -0.2460498 0.3227001 -0.55679435
## 4 -4.532551 0.75217958 0.4842278 -0.3498330 -0.2750309 -0.3626697 0.45170261
## 5 -3.981433 1.29690187 0.6000946 -1.2705898 -0.3537232 0.1127918 -0.08487344
## 6 -4.103541 0.28445111 0.4213595 -0.3902483 -0.2863240 0.3474746 -0.35012146
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13
## 1 0.55793387 -0.0851971 -0.4410661 0.1579104 0.409909268 -0.15133224
## 2 0.28606869 0.9787804 0.6382671 0.1077150 -0.568085644 -1.03297036
## 3 0.46502832 0.4391967 -0.1528116 -0.1244917 -0.294493862 -0.42851387
## 4 -0.42643838 -0.6857634 -0.1448655 0.1697774 0.222792780 0.37702350
## 5 -0.07999881 0.7872899 -0.3773512 0.3627248 -0.265483594 0.25515197
## 6 0.73202859 0.6878320 0.7416791 -1.0291200 0.007186071 -0.08439423
var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90) +
theme(axis.title.x = element_blank(), plot.title = element_blank())
b<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90) +
theme(axis.title.x = element_blank(), plot.title = element_blank())
c<-fviz_contrib(xxx.pca1, "var", axes=3, xtickslab.rt=90) +
theme(axis.title.x = element_blank(), plot.title = element_blank())
d<-fviz_contrib(xxx.pca1, "var", axes=4, xtickslab.rt=90) +
theme(axis.title.x = element_blank(), plot.title = element_blank())
e<-fviz_contrib(xxx.pca1, "var", axes=5, xtickslab.rt=90) +
theme(axis.title.x = element_blank(), plot.title = element_blank())
grid.arrange(a, b, c,d,e, ncol = 5)
Plots represent the same variables contributions as we describe
above.
In Hierarchical clustering method, each object is assigned to its own
cluster; then the algorithm proceeds iteratively, at each stage joining
the two most similar clusters, continuing until there is just a single
cluster. At each stage, distances between clusters are recomputed by a
dissimilarity formula according to the particular clustering method that
is in use. It produces a set of nested clusters organized as a
hierarchical three and may be visualized as a dendrogram.
To validate whether the identified groups or patterns are consistent, let’s use hierarchical clustering to cross-check the results with PCA.
arabica.df <- as.data.frame(lapply(numeric_arabica,scale))
distance.m <- dist(t(arabica.df))
hc <- hclust(distance.m,method = "complete")
plot(hc, cex = 0.6, hang = 1)
rect.hclust(hc,k =5, border = "blue")
distance.m <- dist((arabica.df))
hc<- hclust(distance.m,method = "complete")
sub_grp<-cutree(hc, k=5)
plot_ly(
x = ~xxx.pca1$x[, 1],
y = ~xxx.pca1$x[, 2],
z = ~xxx.pca1$x[, 3],
color = ~as.factor(sub_grp),
colors = c("blue", "steelblue", "gray"),
marker = list(size = 5)
) %>%
add_markers() %>%
layout(
scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
),
title = "3D Cluster plot"
)
The result show that there are no big differences between
clusters generated from hierarchical tree and PCA. According to the
dendrogram and plot results,five clusters are identified, grouped by
different coffee attributes. The first cluster groups coffee
sensory-related variables, which are positively correlated with coffee
quality and are highly interdependent.The second cluster consider the
coffee bag weight, indicating it is not strongly associated with other
variables. The third cluster includes variables related to the number of
bags and defects,that negatively impact coffee quality. The fourth and
the fifth clusters contains uniformity and defect, that are not strongly
associated with other variables.
We can check the optimal number of clusters and apply Silhouette and Elbow methods.
fviz_nbclust(arabica.df, FUN = hcut, method = "silhouette")
fviz_nbclust(arabica.df, FUN = hcut, method = "wss")
As we can see, based on the Elbow Method, the optimal number of
clusters appears to be between 4 and 5, as the reduction in total
within-cluster sum of squares begins to soften after 4 clusters.
Supporting this, the Silhouette Method suggests that the optimal number
of clusters is 6, as it maximizes the average silhouette width. We can
notice that this aligns with the selection of 5 components from PCA,
which explains 76% of the variance, supporting the possibility of
retaining more clusters to capture additional structure in the data.
The first four components have eigenvalues greater than 1, indicating that they explain more variance than an average variable (PC1: 5.30, PC2: 1.45, PC3: 1.12, PC4: 1.02). For this project, we included PC5 as well, since its eigenvalue (0.99) is only slightly below 1. Combined, the first five PCs explain approximately 76% of the total variance, allowing us to reduce the dimensionality to five components. Then hierarchical clustering was applied to the same dataset to assess whether its results aligned with those from PCA. The findings showed that both methods produced similar results, aligning with the characteristics of the coffee quality dataset. Hierarchical clustering identified five clusters, grouped by different coffee attributes.Furthermore, the Elbow and Silhouette methods supported our decision by suggesting that the optimal number of clusters lies between 4 and 6.