Dimensionality Reduction on Coffee Quality Data (Coffee Quality Institute, May - 2023)

Dataset

The dataset in use was acquired from Kaggle and is based on a data provided by Coffee Quality Institute (CQI). CQI maintains a web database that serves as a resource for coffee professionals and enthusiasts who are interested in learning about coffee quality and sustainability. The database includes a range of information on coffee production, processing, and sensory evaluation. It also contains data on coffee genetics, soil types, and other factors that can affect coffee quality. The selected dataset represents specifically arabica type and covered the period May 2023.

Description of the columns:
- X: Record id.
- ID: Record id.
- Country of Origin: Refers to country name of coffee origin
- Farm Name: The name of the coffee farm where the coffee was grown.
- Lot Number: A unique identifier for a specific batch of coffee.
- Mill: The processing facility where the coffee was milled or processed.
- ICO Number: Refers to the ICO (International Coffee Organization) number, which identify the coffee’s certification or region.
- Company: The organization or entity associated with the coffee batch.
- Altitude: The elevation range where the coffee was grown, measured in meters above sea level.
- Region: The geographical area or location of the coffee farm.
- Producer: The name of the individual or entity responsible for producing the coffee.
- Number of bags: The total count of coffee bags in this lot.
- Bag weight: The weight of each bag of coffee (kg).
- In-Country Partner: The organization or partner facilitating coffee trade or logistics within the producing country.
- Harvest Year: The year during which the coffee was harvested.
- Grading Date: The date when the coffee quality grading took place.
- Owner: The entity that owns this specific coffee batch or lot.
- Variety: The specific type or varietal of coffee plant used to grow the beans.
- Status: The current status of the coffee batch, such as whether it is “Completed” or still in process.
- Processing Method: The method used to process the coffee beans after harvesting.
- Aroma: Refers to the scent or fragrance of the coffee.
- Flavor: The flavor of coffee is evaluated based on the taste, including any sweetness, bitterness, acidity, and other flavor notes.
- Aftertaste: Refers to the lingering taste that remains in the mouth after swallowing the coffee.
- Acidity: Acidity in coffee refers to the brightness or liveliness of the taste.
- Body: The body of coffee refers to the thickness or viscosity of the coffee in the mouth.
- Balance: Balance refers to how well the different flavor components of the coffee work together.
- Uniformity: Uniformity refers to the consistency of the coffee from cup to cup.
- Clean Cup: A clean cup refers to a coffee that is free of any off-flavors or defects, such as sourness, mustiness, or staleness.
- Sweetness: It can be described as caramel-like, fruity, or floral, and is a desirable quality in coffee.
- Overall: The overall score assigned to the coffee based on its quality evaluation.
- Defects: The number of defects found in the coffee beans. A defect can include physical or sensory issues. A value of “0” indicates no defects.
- Total Cup Points: is literally the total of 10 features given above. There were some notebooks trying to predict the total cup points given these features.
- Moisture Percentage: The moisture content of the coffee beans, measured as a percentage.
- Category One Defects: Category One defects are primary defects that can be perceived through visual inspection of the coffee beans. These defects include Black beans, sour beans, insect-damaged beans, fungus-damaged beans, etc.
- Quakers: The number of “quaker” beans in the coffee. Quakers are underdeveloped or defective beans that fail to roast properly. A value of “0” indicates no quakers.
- Color: The color of the coffee beans, which reflects the processing and storage condition.
- Category Two Defects: Category Two defects are secondary defects that are more subtle and can only be detected through tasting. These defects include Over-fermentation, staleness, rancidness, chemical taste, etc.
- Expiration: The expiration date of the coffee batch.
- Certification Body: The name of the organization that certified the coffee batch, confirming its quality or compliance with specific standards.
- Certification Address: The physical address of the certification body.
- Certification Contact: The contact person and details at the certification body.

Introduction

The objective of this project is to apply a dimensionality reduction technique to a selected dataset. Principal Component Analysis (PCA) has been chosen as the initial method due to its effectiveness in handling linear relationships within the data (for instance, coffee quality). PCA provides a clear approach for analyzing high-dimensional datasets, preserves the most significant information, and useful for visualization in more than two dimensions.

Libraries

library(ggplot2)
library(corrplot)
library(factoextra)
library(labdsv)
library(psych)
library(ClusterR)
library(readxl)
library(cluster)
library(flexclust)
library(fpc)
library(clustertend)
library(ggthemes)
library(plotly)
library(stringr)
library(missMDA)
library(ade4)
library(smacof)
library(ggplot2)
library(scales)
library(kableExtra)
library(factoextra)
library(psy)
library(pdp)
library(scales)
library(corrplot)
library(tidyr)
library(dplyr)
library(caret)
library(gridExtra)

Data and data cleaning

setwd("C:/Users/ydmar/Documents/UW/UW - 1 semester/UL")
arabica <- read.csv("df_arabica_clean.csv", header = TRUE, dec = ".")
summary(arabica)

##        X               ID        Country.of.Origin   Farm.Name        
##  Min.   :  0.0   Min.   :  0.0   Length:207         Length:207        
##  1st Qu.: 51.5   1st Qu.: 51.5   Class :character   Class :character  
##  Median :103.0   Median :103.0   Mode  :character   Mode  :character  
##  Mean   :103.0   Mean   :103.0                                        
##  3rd Qu.:154.5   3rd Qu.:154.5                                        
##  Max.   :206.0   Max.   :206.0                                        
##   Lot.Number            Mill            ICO.Number          Company         
##  Length:207         Length:207         Length:207         Length:207        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Altitude            Region            Producer         Number.of.Bags  
##  Length:207         Length:207         Length:207         Min.   :   1.0  
##  Class :character   Class :character   Class :character   1st Qu.:   1.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :  14.0  
##                                                           Mean   : 155.4  
##                                                           3rd Qu.: 275.0  
##                                                           Max.   :2240.0  
##   Bag.Weight        In.Country.Partner Harvest.Year       Grading.Date      
##  Length:207         Length:207         Length:207         Length:207        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Owner             Variety             Status          Processing.Method 
##  Length:207         Length:207         Length:207         Length:207        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      Aroma           Flavor        Aftertaste      Acidity           Body      
##  Min.   :6.500   Min.   :6.750   Min.   :6.67   Min.   :6.830   Min.   :6.830  
##  1st Qu.:7.580   1st Qu.:7.580   1st Qu.:7.42   1st Qu.:7.500   1st Qu.:7.500  
##  Median :7.670   Median :7.750   Median :7.58   Median :7.670   Median :7.670  
##  Mean   :7.721   Mean   :7.745   Mean   :7.60   Mean   :7.690   Mean   :7.641  
##  3rd Qu.:7.920   3rd Qu.:7.920   3rd Qu.:7.75   3rd Qu.:7.875   3rd Qu.:7.750  
##  Max.   :8.580   Max.   :8.500   Max.   :8.42   Max.   :8.580   Max.   :8.250  
##     Balance        Uniformity      Clean.Cup    Sweetness     Overall     
##  Min.   :6.670   Min.   : 8.67   Min.   :10   Min.   :10   Min.   :6.670  
##  1st Qu.:7.500   1st Qu.:10.00   1st Qu.:10   1st Qu.:10   1st Qu.:7.500  
##  Median :7.670   Median :10.00   Median :10   Median :10   Median :7.670  
##  Mean   :7.644   Mean   : 9.99   Mean   :10   Mean   :10   Mean   :7.677  
##  3rd Qu.:7.790   3rd Qu.:10.00   3rd Qu.:10   3rd Qu.:10   3rd Qu.:7.920  
##  Max.   :8.420   Max.   :10.00   Max.   :10   Max.   :10   Max.   :8.580  
##     Defects  Total.Cup.Points Moisture.Percentage Category.One.Defects
##  Min.   :0   Min.   :78.00    Min.   : 0.00       Min.   :0.0000      
##  1st Qu.:0   1st Qu.:82.58    1st Qu.:10.10       1st Qu.:0.0000      
##  Median :0   Median :83.75    Median :10.80       Median :0.0000      
##  Mean   :0   Mean   :83.71    Mean   :10.74       Mean   :0.1353      
##  3rd Qu.:0   3rd Qu.:84.83    3rd Qu.:11.50       3rd Qu.:0.0000      
##  Max.   :0   Max.   :89.33    Max.   :13.50       Max.   :5.0000      
##     Quakers           Color           Category.Two.Defects  Expiration       
##  Min.   : 0.0000   Length:207         Min.   : 0.000       Length:207        
##  1st Qu.: 0.0000   Class :character   1st Qu.: 0.000       Class :character  
##  Median : 0.0000   Mode  :character   Median : 1.000       Mode  :character  
##  Mean   : 0.6908                      Mean   : 2.251                         
##  3rd Qu.: 1.0000                      3rd Qu.: 3.000                         
##  Max.   :12.0000                      Max.   :16.000                         
##  Certification.Body Certification.Address Certification.Contact
##  Length:207         Length:207            Length:207           
##  Class :character   Class :character      Class :character     
##  Mode  :character   Mode  :character      Mode  :character     
##                                                                
##                                                                
##

str(arabica)

## 'data.frame':    207 obs. of  41 variables:
##  $ X                    : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ ID                   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Country.of.Origin    : chr  "Colombia" "Taiwan" "Laos" "Costa Rica" ...
##  $ Farm.Name            : chr  "Finca El Paraiso" "Royal Bean Geisha Estate" "OKLAO coffee farms" "La Cumbre" ...
##  $ Lot.Number           : chr  "CQU2022015" "The 2022 Pacific Rim Coffee Summit,T037" "The 2022 Pacific Rim Coffee Summit,LA01" "CQU2022017" ...
##  $ Mill                 : chr  "Finca El Paraiso" "Royal Bean Geisha Estate" "oklao coffee processing plant" "La Montana Tarrazu MIll" ...
##  $ ICO.Number           : chr  "" "" "" "" ...
##  $ Company              : chr  "Coffee Quality Union" "Taiwan Coffee Laboratory" "Taiwan Coffee Laboratory" "Coffee Quality Union" ...
##  $ Altitude             : chr  "1700-1930" "1200" "1300" "1900" ...
##  $ Region               : chr  "Piendamo,Cauca" "Chiayi" "Laos Borofen Plateau" "Los Santos,Tarrazu" ...
##  $ Producer             : chr  "Diego Samuel Bermudez" "曾福森" "WU TAO CHI" "Santa Maria de Dota" ...
##  $ Number.of.Bags       : int  1 1 19 1 2 5 1 1 1 320 ...
##  $ Bag.Weight           : chr  "35 kg" "80 kg" "25 kg" "22 kg" ...
##  $ In.Country.Partner   : chr  "Japan Coffee Exchange" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Japan Coffee Exchange" ...
##  $ Harvest.Year         : chr  "2021 / 2022" "2021 / 2022" "2021 / 2022" "2022" ...
##  $ Grading.Date         : chr  "September 21st, 2022" "November 15th, 2022" "November 15th, 2022" "September 21st, 2022" ...
##  $ Owner                : chr  "Coffee Quality Union" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Coffee Quality Union" ...
##  $ Variety              : chr  "Castillo" "Gesha" "Java" "Gesha" ...
##  $ Status               : chr  "Completed" "Completed" "Completed" "Completed" ...
##  $ Processing.Method    : chr  "Double Anaerobic Washed" "Washed / Wet" "Semi Washed" "Washed / Wet" ...
##  $ Aroma                : num  8.58 8.5 8.33 8.08 8.33 8.33 8.33 8.25 8.08 8.08 ...
##  $ Flavor               : num  8.5 8.5 8.42 8.17 8.33 8.33 8.17 8.25 8.08 8.17 ...
##  $ Aftertaste           : num  8.42 7.92 8.08 8.17 8.08 8.25 8.08 8.17 8.25 8.08 ...
##  $ Acidity              : num  8.58 8 8.17 8.25 8.25 7.83 8 8 8.08 8.17 ...
##  $ Body                 : num  8.25 7.92 7.92 8.17 7.92 7.83 7.83 7.92 7.92 8 ...
##  $ Balance              : num  8.42 8.25 8.17 8.08 7.92 8.17 8.25 8.08 8 8 ...
##  $ Uniformity           : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ Clean.Cup            : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ Sweetness            : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ Overall              : num  8.58 8.5 8.33 8.25 8.25 8.25 8.25 8.08 8.25 8 ...
##  $ Defects              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Total.Cup.Points     : num  89.3 87.6 87.4 87.2 87.1 ...
##  $ Moisture.Percentage  : num  11.8 10.5 10.4 11.8 11.6 10.7 9.1 10 10.8 11 ...
##  $ Category.One.Defects : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Quakers              : int  0 0 0 0 2 0 0 0 0 0 ...
##  $ Color                : chr  "green" "blue-green" "yellowish" "green" ...
##  $ Category.Two.Defects : int  3 0 2 0 2 2 0 1 0 0 ...
##  $ Expiration           : chr  "September 21st, 2023" "November 15th, 2023" "November 15th, 2023" "September 21st, 2023" ...
##  $ Certification.Body   : chr  "Japan Coffee Exchange" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Taiwan Coffee Laboratory 台灣咖啡研究室" "Japan Coffee Exchange" ...
##  $ Certification.Address: chr  "〒413-0002 静岡県熱海市伊豆山１１７３−５８ 1173-58 Izusan, Atami, Shizuoka, 413-0002 JAPAN" "QAHWAH CO., LTD 4F, No. 225, Sec. 3, Beixin Rd., Xindian Dist. New Taipei City, Taiwan" "QAHWAH CO., LTD 4F, No. 225, Sec. 3, Beixin Rd., Xindian Dist. New Taipei City, Taiwan" "〒413-0002 静岡県熱海市伊豆山１１７３−５８ 1173-58 Izusan, Atami, Shizuoka, 413-0002 JAPAN" ...
##  $ Certification.Contact: chr  "松澤　宏樹　Koju Matsuzawa - +81(0)9085642901" "Lin, Jen-An Neil 林仁安 - 886-289116612" "Lin, Jen-An Neil 林仁安 - 886-289116612" "松澤　宏樹　Koju Matsuzawa - +81(0)9085642901" ...

Dataset consists of 207 observations and 41 variables.
Convert the appropriate columns to a numeric data type.

arabica$Bag.Weight <- as.numeric(str_remove(arabica$Bag.Weight, "kg"))
arabica$Category.One.Defects <- as.numeric(arabica$Category.One.Defects)
arabica$Category.Two.Defects <- as.numeric(arabica$Category.Two.Defects)
arabica$Quakers <- as.numeric(arabica$Quakers)
arabica$Moisture.Percentage <- as.numeric(arabica$Moisture.Percentage)
arabica$Number.of.Bags <- as.numeric(arabica$Number.of.Bags)
str(arabica)

Replace empty values with “NA” and identify columns containing missing values.

arabica[arabica == ""] <- NA
missing_positions <- which(is.na(arabica), arr.ind = TRUE)
print(missing_positions)

Determine the number of missing values for each column.

print(arabica[,4]) # <- Farm name 
print(sum(is.na(arabica[,4])))
print(arabica[,5]) # <- Lot.Number
print(sum(is.na(arabica[,5])))
print(arabica[,6]) # <- Mill
print(sum(is.na(arabica[,6])))
print(arabica[,7]) # <- ICO.number
print(sum(is.na(arabica[,7])))
print(arabica[,9]) # <- altitude
print(sum(is.na(arabica[,9])))
print(arabica[,10]) # <- region
print(sum(is.na(arabica[,10])))
print(arabica[,11]) # <- producer
print(sum(is.na(arabica[,11])))
print(arabica[,18]) # <- variety
print(sum(is.na(arabica[,18])))
print(arabica[,20]) # <- processing method
print(sum(is.na(arabica[,20])))

Since all columns store character data representing general information about the coffee, replace missing values in columns with fewer than 10 “NA” entries using the mode function. Fill in the column with 132 missing values with the placeholder value “Unknown”.
Verify that the dataset no longer contains missing values.

get_mode <- function(x) {
  unique_x <- unique(x)
  unique_x[which.max(tabulate(match(x, unique_x)))]
}

arabica$Farm.Name[is.na(arabica$Farm.Name)] <- get_mode(arabica$Farm.Name[!is.na(arabica$Farm.Name)])
arabica$Lot.Number[is.na(arabica$Lot.Number)] <- get_mode(arabica$Lot.Number[!is.na(arabica$Lot.Number)])
arabica$Mill[is.na(arabica$Mill)] <- get_mode(arabica$Mill[!is.na(arabica$Mill)])
arabica$ICO.Number[is.na(arabica$ICO.Number)] <- "Unknown"
arabica$Altitude[is.na(arabica$Altitude)] <- get_mode(arabica$Altitude[!is.na(arabica$Altitude)]) 
arabica$Region[is.na(arabica$Region)] <- get_mode(arabica$Region[!is.na(arabica$Region)])
arabica$Producer[is.na(arabica$Producer)] <- get_mode(arabica$Producer[!is.na(arabica$Producer)])
arabica$Variety[is.na(arabica$Variety)] <- get_mode(arabica$Variety[!is.na(arabica$Variety)])
arabica$Processing.Method[is.na(arabica$Processing.Method)] <- get_mode(arabica$Processing.Method[!is.na(arabica$Processing.Method)])
print(sum(is.na(arabica)))

## [1] 0

summary(arabica)

##        X               ID        Country.of.Origin   Farm.Name        
##  Min.   :  0.0   Min.   :  0.0   Length:207         Length:207        
##  1st Qu.: 51.5   1st Qu.: 51.5   Class :character   Class :character  
##  Median :103.0   Median :103.0   Mode  :character   Mode  :character  
##  Mean   :103.0   Mean   :103.0                                        
##  3rd Qu.:154.5   3rd Qu.:154.5                                        
##  Max.   :206.0   Max.   :206.0                                        
##   Lot.Number            Mill            ICO.Number          Company         
##  Length:207         Length:207         Length:207         Length:207        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Altitude            Region            Producer         Number.of.Bags  
##  Length:207         Length:207         Length:207         Min.   :   1.0  
##  Class :character   Class :character   Class :character   1st Qu.:   1.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :  14.0  
##                                                           Mean   : 155.4  
##                                                           3rd Qu.: 275.0  
##                                                           Max.   :2240.0  
##    Bag.Weight      In.Country.Partner Harvest.Year       Grading.Date      
##  Min.   :    1.0   Length:207         Length:207         Length:207        
##  1st Qu.:   15.0   Class :character   Class :character   Class :character  
##  Median :   30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :  227.8                                                           
##  3rd Qu.:   60.0                                                           
##  Max.   :19200.0                                                           
##     Owner             Variety             Status          Processing.Method 
##  Length:207         Length:207         Length:207         Length:207        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      Aroma           Flavor        Aftertaste      Acidity           Body      
##  Min.   :6.500   Min.   :6.750   Min.   :6.67   Min.   :6.830   Min.   :6.830  
##  1st Qu.:7.580   1st Qu.:7.580   1st Qu.:7.42   1st Qu.:7.500   1st Qu.:7.500  
##  Median :7.670   Median :7.750   Median :7.58   Median :7.670   Median :7.670  
##  Mean   :7.721   Mean   :7.745   Mean   :7.60   Mean   :7.690   Mean   :7.641  
##  3rd Qu.:7.920   3rd Qu.:7.920   3rd Qu.:7.75   3rd Qu.:7.875   3rd Qu.:7.750  
##  Max.   :8.580   Max.   :8.500   Max.   :8.42   Max.   :8.580   Max.   :8.250  
##     Balance        Uniformity      Clean.Cup    Sweetness     Overall     
##  Min.   :6.670   Min.   : 8.67   Min.   :10   Min.   :10   Min.   :6.670  
##  1st Qu.:7.500   1st Qu.:10.00   1st Qu.:10   1st Qu.:10   1st Qu.:7.500  
##  Median :7.670   Median :10.00   Median :10   Median :10   Median :7.670  
##  Mean   :7.644   Mean   : 9.99   Mean   :10   Mean   :10   Mean   :7.677  
##  3rd Qu.:7.790   3rd Qu.:10.00   3rd Qu.:10   3rd Qu.:10   3rd Qu.:7.920  
##  Max.   :8.420   Max.   :10.00   Max.   :10   Max.   :10   Max.   :8.580  
##     Defects  Total.Cup.Points Moisture.Percentage Category.One.Defects
##  Min.   :0   Min.   :78.00    Min.   : 0.00       Min.   :0.0000      
##  1st Qu.:0   1st Qu.:82.58    1st Qu.:10.10       1st Qu.:0.0000      
##  Median :0   Median :83.75    Median :10.80       Median :0.0000      
##  Mean   :0   Mean   :83.71    Mean   :10.74       Mean   :0.1353      
##  3rd Qu.:0   3rd Qu.:84.83    3rd Qu.:11.50       3rd Qu.:0.0000      
##  Max.   :0   Max.   :89.33    Max.   :13.50       Max.   :5.0000      
##     Quakers           Color           Category.Two.Defects  Expiration       
##  Min.   : 0.0000   Length:207         Min.   : 0.000       Length:207        
##  1st Qu.: 0.0000   Class :character   1st Qu.: 0.000       Class :character  
##  Median : 0.0000   Mode  :character   Median : 1.000       Mode  :character  
##  Mean   : 0.6908                      Mean   : 2.251                         
##  3rd Qu.: 1.0000                      3rd Qu.: 3.000                         
##  Max.   :12.0000                      Max.   :16.000                         
##  Certification.Body Certification.Address Certification.Contact
##  Length:207         Length:207            Length:207           
##  Class :character   Class :character      Class :character     
##  Mode  :character   Mode  :character      Mode  :character     
##                                                                
##                                                                
##

Remove the columns Cleaned.cup, Sweetness, and Defects, as they contain constant values. These columns would otherwise introduce issues in matrix computations and correlation analyses.
Exclude the ID columns, as they solely contain record identifiers and don’t contribute to the analysis.
Exclude the Overall and Total.cup.points columns since those columns could introduce redundancy, as they don’t provide new information, but summarize existing features.
These preparations were made to compute the covariance matrix to see if there is any relationship between variables.

numeric_arabica <- arabica %>%
  select(-Defects, -Sweetness, -Clean.Cup, -X, -ID, - Total.Cup.Points, - Overall) %>%
  select_if(is.numeric)

arabica_matrix <- as.matrix(numeric_arabica)
head(arabica_matrix)

##      Number.of.Bags Bag.Weight Aroma Flavor Aftertaste Acidity Body Balance
## [1,]              1         35  8.58   8.50       8.42    8.58 8.25    8.42
## [2,]              1         80  8.50   8.50       7.92    8.00 7.92    8.25
## [3,]             19         25  8.33   8.42       8.08    8.17 7.92    8.17
## [4,]              1         22  8.08   8.17       8.17    8.25 8.17    8.08
## [5,]              2         24  8.33   8.33       8.08    8.25 7.92    7.92
## [6,]              5         30  8.33   8.33       8.25    7.83 7.83    8.17
##      Uniformity Moisture.Percentage Category.One.Defects Quakers
## [1,]         10                11.8                    0       0
## [2,]         10                10.5                    0       0
## [3,]         10                10.4                    0       0
## [4,]         10                11.8                    0       0
## [5,]         10                11.6                    0       2
## [6,]         10                10.7                    0       0
##      Category.Two.Defects
## [1,]                    3
## [2,]                    0
## [3,]                    2
## [4,]                    0
## [5,]                    2
## [6,]                    2

cor_matrix <- cor(arabica_matrix, method = "pearson")
print(cor_matrix)

##                      Number.of.Bags   Bag.Weight        Aroma      Flavor
## Number.of.Bags           1.00000000  0.068794936 -0.227414231 -0.26469950
## Bag.Weight               0.06879494  1.000000000 -0.016899458 -0.05667924
## Aroma                   -0.22741423 -0.016899458  1.000000000  0.82277851
## Flavor                  -0.26469950 -0.056679241  0.822778505  1.00000000
## Aftertaste              -0.30469701  0.009330594  0.793397466  0.87681144
## Acidity                 -0.20441537  0.149742631  0.712919976  0.81093386
## Body                    -0.04003724  0.135317762  0.633100500  0.73985741
## Balance                 -0.25901609  0.025045088  0.745648029  0.85178584
## Uniformity               0.05937032  0.008722334 -0.028063067 -0.03976682
## Moisture.Percentage      0.10275334 -0.067663518 -0.002418472 -0.05090234
## Category.One.Defects     0.04367764 -0.022594818 -0.057859622 -0.08129878
## Quakers                  0.08227137  0.017300591 -0.342885600 -0.31005432
## Category.Two.Defects     0.23740750  0.125250267 -0.254719142 -0.33034651
##                        Aftertaste     Acidity         Body     Balance
## Number.of.Bags       -0.304697011 -0.20441537 -0.040037237 -0.25901609
## Bag.Weight            0.009330594  0.14974263  0.135317762  0.02504509
## Aroma                 0.793397466  0.71291998  0.633100500  0.74564803
## Flavor                0.876811442  0.81093386  0.739857407  0.85178584
## Aftertaste            1.000000000  0.81443868  0.738674332  0.86195074
## Acidity               0.814438682  1.00000000  0.765184556  0.80523566
## Body                  0.738674332  0.76518456  1.000000000  0.81609773
## Balance               0.861950737  0.80523566  0.816097729  1.00000000
## Uniformity           -0.023925511 -0.06256401 -0.043903922 -0.08906386
## Moisture.Percentage  -0.051115362 -0.01675124  0.009770954 -0.07265985
## Category.One.Defects -0.107965594 -0.11020378  0.048256481  0.03955156
## Quakers              -0.303247393 -0.20948342 -0.257464654 -0.31550219
## Category.Two.Defects -0.330731327 -0.26227881 -0.208783636 -0.31773347
##                        Uniformity Moisture.Percentage Category.One.Defects
## Number.of.Bags        0.059370319         0.102753338           0.04367764
## Bag.Weight            0.008722334        -0.067663518          -0.02259482
## Aroma                -0.028063067        -0.002418472          -0.05785962
## Flavor               -0.039766820        -0.050902336          -0.08129878
## Aftertaste           -0.023925511        -0.051115362          -0.10796559
## Acidity              -0.062564014        -0.016751242          -0.11020378
## Body                 -0.043903922         0.009770954           0.04825648
## Balance              -0.089063859        -0.072659846           0.03955156
## Uniformity            1.000000000         0.035239870           0.02147091
## Moisture.Percentage   0.035239870         1.000000000           0.09275477
## Category.One.Defects  0.021470908         0.092754773           1.00000000
## Quakers               0.038486395         0.170142394           0.01777426
## Category.Two.Defects  0.071713700         0.130671152           0.16665467
##                          Quakers Category.Two.Defects
## Number.of.Bags        0.08227137            0.2374075
## Bag.Weight            0.01730059            0.1252503
## Aroma                -0.34288560           -0.2547191
## Flavor               -0.31005432           -0.3303465
## Aftertaste           -0.30324739           -0.3307313
## Acidity              -0.20948342           -0.2622788
## Body                 -0.25746465           -0.2087836
## Balance              -0.31550219           -0.3177335
## Uniformity            0.03848639            0.0717137
## Moisture.Percentage   0.17014239            0.1306712
## Category.One.Defects  0.01777426            0.1666547
## Quakers               1.00000000            0.4624226
## Category.Two.Defects  0.46242261            1.0000000

corrplot::corrplot(cor_matrix, method = "circle")

dim(arabica_matrix)

## [1] 207  13

The plot reveals the strong correlation between the following variables: Aroma, Flavor, Aftertaste, Acidity, Body and Balance. Positive correlations arise from shared contributions to coffee quality metrics or shared underlying properties. For example, Aroma and Flavor are positively correlated because coffee with a strong aroma tend to have better flavor profiles, reflecting an underlying quality factor.High acidity in coffee is often associated with higher overall quality, also contributing to a better overall score. Presented positive correlations reflect interdependencies between features that collectively affect coffee quality. Negative correlations result from detrimental factors that reduce overall scores or specific sensory features.Another example of negative correlation is Quakers and Aroma. An increase in quakers,unripe beans, usually reduces the aroma quality.

Principal Component Analysis (PCA)

The central idea of principal component analysis is to reduce the dimensionality of a dataset in which there are a large number of interrelated variables, while retaining as much as possible of the variation present in the dataset. This reduction is achieved by transforming to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. Computation of the principal components reduces to the solution of an eigenvalue-eigenvector problem for a positive-semidefinite symmetric matrix.

We start with the dataset normalization before further analysis.

preproc1 <- preProcess(arabica_matrix, method=c("center", "scale"))
arabica_matrix.s <- predict(preproc1, arabica_matrix)

Eigenvalues on the basis of covariance.

arabica_matrix.cov<-cov(arabica_matrix.s)
arabica_matrix.eigen<-eigen(arabica_matrix.cov)
arabica_matrix.eigen$values

##  [1] 5.2954382 1.4499720 1.1218288 1.0163110 0.9977321 0.9293184 0.7826027
##  [8] 0.5161073 0.3148871 0.1894170 0.1621808 0.1158438 0.1083607

head(arabica_matrix.eigen$vectors)

##             [,1]        [,2]        [,3]        [,4]         [,5]         [,6]
## [1,] -0.13475997 -0.31090811 -0.18956363 -0.40369484 -0.098789782  0.642954507
## [2,]  0.01356599 -0.26979254 -0.75396989 -0.06521995  0.075363222 -0.105882170
## [3,]  0.37328140 -0.05976710  0.08004008  0.01906562 -0.056000736  0.016821517
## [4,]  0.40618268 -0.04226558  0.07782905  0.06118791 -0.045297141 -0.005737637
## [5,]  0.40610479 -0.04359876  0.02529748  0.09242236 -0.060234348 -0.041380794
## [6,]  0.38432100 -0.15967199 -0.10808402  0.11762529 -0.008073615  0.026282950
##              [,7]        [,8]        [,9]       [,10]      [,11]       [,12]
## [1,]  0.414745410 -0.14833927 -0.20776575  0.06386324 -0.1435768  0.02985631
## [2,] -0.532860668 -0.01278676 -0.17211389 -0.08241683 -0.1023183 -0.07684239
## [3,]  0.051047830  0.34316751 -0.65614410 -0.28001510  0.4044517  0.22871166
## [4,]  0.115350971  0.03814272 -0.15274605  0.04151546 -0.1939058 -0.85882683
## [5,]  0.039420651  0.02020285 -0.04275869  0.01195993 -0.5837365  0.29502260
## [6,] -0.003339653 -0.12678718  0.06927037  0.80567971  0.3326143  0.13701125
##            [,13]
## [1,]  0.01878774
## [2,]  0.02279404
## [3,]  0.04130781
## [4,]  0.07965296
## [5,] -0.62204139
## [6,]  0.03346227

Now we need to select the number of components. Firstly, use the Kaiser’s rule or Kaiser-Guttman criterion. It is a widely used method to evaluate the maximum number of linear combinations to extract from the dataset. According to that rule only those principal components are retained, whose variances exceed 1. The idea behind the Kaiser-Guttman criterion is that any principal with variance less than 1 contains less information than one of the original variables and so is not worth retaining (Jolliffe, 2002). Based on the Kaiser’s rule description, we can retain 4 components, since the eigenvalues are greater than 1 for them. Check it again:

kaiser_rule <- which(arabica_matrix.eigen$values > 1)
print(kaiser_rule)

## [1] 1 2 3 4

xxx<-arabica_matrix.s
xxx.pca1<-prcomp(xxx, center=FALSE, scale.=FALSE)
summary(xxx.pca1)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.3012 1.2041 1.05916 1.00812 0.99887 0.96401 0.8846
## Proportion of Variance 0.4073 0.1115 0.08629 0.07818 0.07675 0.07149 0.0602
## Cumulative Proportion  0.4073 0.5189 0.60517 0.68335 0.76010 0.83158 0.8918
##                           PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.7184 0.56115 0.43522 0.40272 0.34036 0.32918
## Proportion of Variance 0.0397 0.02422 0.01457 0.01248 0.00891 0.00834
## Cumulative Proportion  0.9315 0.95571 0.97028 0.98275 0.99166 1.00000

print(get_eigenvalue(xxx.pca1))

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   5.2954382       40.7341398                    40.73414
## Dim.2   1.4499720       11.1536307                    51.88777
## Dim.3   1.1218288        8.6294527                    60.51722
## Dim.4   1.0163110        7.8177768                    68.33500
## Dim.5   0.9977321        7.6748626                    76.00986
## Dim.6   0.9293184        7.1486029                    83.15847
## Dim.7   0.7826027        6.0200211                    89.17849
## Dim.8   0.5161073        3.9700561                    93.14854
## Dim.9   0.3148871        2.4222087                    95.57075
## Dim.10  0.1894170        1.4570541                    97.02781
## Dim.11  0.1621808        1.2475446                    98.27535
## Dim.12  0.1158438        0.8911058                    99.16646
## Dim.13  0.1083607        0.8335442                   100.00000

Use Scree plot to visualize the eigenvalues of the components in the ascending order.

fviz_eig(xxx.pca1, choice='eigenvalue') + 
  geom_hline(yintercept = 1, linetype = "dashed", color = "red", size = 1)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the Eigenvalue and Scree plot result, it’s more obvious now, that there are several approaches how many components we can select. First, is to select based on the Kaiser-Guttman criterion, and the number of components is 4 (Jolliffe, 2002). Second, we can notice that 3 components explain 60 percent of the variances, capturing quite significant variance. We can see a “knee” or “elbow” at Dim 3. This indicates a diminishing return on explained variance after Dim.3. Once the “elbow” was defined in the graph, is then taken to be the number of components to be retained (Jolliffe, 2002). Third, we can see that the cumulative variance explained by the first 5 components is 76.01%, which is a significant proportion of the total variance. Dimension 5 has an eigenvalue of 0.9977, which is very close to 1. Including it might be justified as it contributes meaningful variance (7.67%).

fviz_eig(xxx.pca1)

The cumulative proportion of explained variance shows that 3 components account for approximately 60% of the total variance. To explain approximately 68% of the variance, 4 components are required, while selecting 5 components increases the cumulative explained variance to 76%.

a<-summary(xxx.pca1)
plot(a$importance[3,],type="p")

From the plot above, we also can see that to explain 70-80% of variance, we need to select from 4 to 5 components. Let’s use another technique to justify selection of the components.

Singular Value Decomposition (SVD)

Using Singular Value Decomposition (SVD) as an alternative to PCA for verifying the number of dimensions could be useful approach. SVD provides a similar decomposition of the dataset but does so without relying on the covariance matrix.

svd_arabica <- svd(scale(arabica_matrix))
singular_values <- svd_arabica$d
explained_variance <- singular_values^2 / sum(singular_values^2)
cumulative_variance <- cumsum(explained_variance)
print(cumulative_variance)

##  [1] 0.4073414 0.5188777 0.6051722 0.6833500 0.7600986 0.8315847 0.8917849
##  [8] 0.9314854 0.9557075 0.9702781 0.9827535 0.9916646 1.0000000

plot(svd_arabica$d)

The plot displays the singular values from the SVD decomposition. Each point on the plot represents a singular value, which reflects the amount of variance captured by the corresponding dimension. The first singular value is significantly larger than the others, indicating that the first dimension captures the majority of the variance in the data. After a certain point (index 4 or 5), the singular values become small and relatively uniform, indicating that these dimensions contribute little additional information or may represent noise. We can verify the result drawing plot for cumulative variance.

plot(1:length(cumulative_variance), cumulative_variance,
     xlab = "Dimensions", ylab = "Cumulative Variance")

To explain 70–80% of the variance, we need to select between 4 and 5 components.

Based on the outcomes of the PCA and SVD techniques, we need to determine the optimal number of components to retain. The analysis suggests that the optimal range is between 4 and 5 components. Using the “elbow method,” selecting 3 components could be a choice for simplicity and better interpretation. These 3 components likely focus on the primary coffee quality metrics, as they show strong positive correlations. However, the goal is to maximize the explained variance, and selecting 4 or 5 components could be more suitable. Including 4 or 5 components would allow us to account for additional factors such as defect-related attributes, batch size, or processing methods, which contribute to the remaining variance.

Use 5 principal components.

cos2 <- apply(xxx.pca1$x[, 1:5]^2, 1, sum) / apply(xxx.pca1$x^2, 1, sum) # quality of representation of a data point on the selected principal component 


pairs(
  as.data.frame(xxx.pca1$x[, 1:5]),
  main = "Pairwise scatterplot of 5 principal components",
  pch = 21, 
  bg = "lightblue"
)

plot_ly(
  x = ~xxx.pca1$rotation[, 1], 
  y = ~xxx.pca1$rotation[, 2],
  z = ~xxx.pca1$rotation[, 3], 
  type = 'scatter3d',
  mode = 'markers+text',
  text = rownames(xxx.pca1$rotation), 
  marker = list(size = 5, color = 'steelblue')
) %>%
  layout(
    scene = list(
      xaxis = list(title = "PC1"),
      yaxis = list(title = "PC2"),
      zaxis = list(title = "PC3")
    )
  )

plot_ly(
  x = ~xxx.pca1$rotation[, 3], 
  y = ~xxx.pca1$rotation[, 4],
  z = ~xxx.pca1$rotation[, 5], 
  type = 'scatter3d',
  mode = 'markers+text',
  text = rownames(xxx.pca1$rotation), 
  marker = list(size = 5, color = 'steelblue')
) %>%
  layout(
    scene = list(
      xaxis = list(title = "PC3"),
      yaxis = list(title = "PC4"),
      zaxis = list(title = "PC5")
    )
  )

The variable plots displayed above illustrate relationships between variables as well as the “quality” of all factors. Positively correlated variables are positioned close to each other, while negatively correlated variables appear on opposite sides of the plot. The “quality” of a variable is reflected by its distance from the center, with variables further from the center being better represented by the selected principal components.

The axes correspond to the first three principal components (PC1, PC2, and PC3) in the first plot and the next three principal components (PC3, PC4, and PC5) in the second plot. From these plots, we can observe that Flavor, Aroma, Aftertaste, and Balance are strongly positively correlated and collectively contribute to coffee quality metrics. Body and Acidity can also be added to these coffee quality-related attributes. Variables such as Bag.Weight and Number.of.Bags are positioned farther from the quality-related variables, indicating they contribute to variance unrelated to sensory attributes. Similarly, Category.One.Defects, Category.Two.Defects, and Quakers are positioned separately from the sensory variables, representing variance related to defects.

loading_scores_PC_1<-xxx.pca1$rotation[,1]
fac_scores_PC_1<-abs(loading_scores_PC_1)
fac_scores_PC_1_ranked<-names(sort(fac_scores_PC_1, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_1_ranked, 1]

##               Flavor           Aftertaste              Balance 
##          -0.40618268          -0.40610479          -0.40378593 
##              Acidity                Aroma                 Body 
##          -0.38432100          -0.37328140          -0.36145936 
## Category.Two.Defects              Quakers       Number.of.Bags 
##           0.18518358           0.18213234           0.13475997 
## Category.One.Defects  Moisture.Percentage           Uniformity 
##           0.03529932           0.03468225           0.03314528 
##           Bag.Weight 
##          -0.01356599

As we can see, variables were sorted based on their absolute contributions to PC1. The following contributions are the strongest one: Flavor, Aftertaste, Balance, Acidity, Aroma and Body.

loading_scores_PC_2<-xxx.pca1$rotation[,2]
fac_scores_PC_2<-abs(loading_scores_PC_2)
fac_scores_PC_2_ranked<-names(sort(fac_scores_PC_2, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_2_ranked, 2]

## Category.Two.Defects              Quakers  Moisture.Percentage 
##           0.54469393           0.41697167           0.39232940 
##       Number.of.Bags Category.One.Defects           Bag.Weight 
##           0.31090811           0.29174840           0.26979254 
##                 Body              Acidity           Uniformity 
##           0.24647458           0.15967199           0.13972045 
##              Balance                Aroma           Aftertaste 
##           0.08902692           0.05976710           0.04359876 
##               Flavor 
##           0.04226558

As we can observe from the PC2, the following contributions are the strongest one: Category.Two.Defects, Quakers, Moisture.Percentage, Number.of.Bags.

loading_scores_PC_3<-xxx.pca1$rotation[, 3]
fac_scores_PC_3<-abs(loading_scores_PC_3)
fac_scores_PC_3_ranked<-names(sort(fac_scores_PC_3, decreasing = TRUE))
xxx.pca1$rotation[fac_scores_PC_3_ranked, 3]

##           Bag.Weight  Moisture.Percentage Category.One.Defects 
##          -0.75396989           0.48968287           0.33814757 
##       Number.of.Bags              Acidity                Aroma 
##          -0.18956363          -0.10808402           0.08004008 
##               Flavor                 Body Category.Two.Defects 
##           0.07782905          -0.07713777          -0.06420221 
##              Quakers           Uniformity              Balance 
##           0.05815442           0.04897627           0.02869291 
##           Aftertaste 
##           0.02529748

Regarding the PC3, we can observe Bag.Weight, Moisture.Percentage and Category.One.Defects as a strong contributions.

loading_scores_PC_4<-xxx.pca1$rotation[,4]
fac_scores_PC_4<-abs(loading_scores_PC_4)
fac_scores_PC_4_ranked<-names(sort(fac_scores_PC_4, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_4_ranked, 4]

## Category.One.Defects              Quakers       Number.of.Bags 
##           0.62096334          -0.54980742           0.40369484 
##           Uniformity Category.Two.Defects  Moisture.Percentage 
##           0.23675392          -0.15245563          -0.14024339 
##                 Body              Acidity           Aftertaste 
##           0.13325069          -0.11762529          -0.09242236 
##           Bag.Weight               Flavor              Balance 
##           0.06521995          -0.06118791           0.04109770 
##                Aroma 
##          -0.01906562

Then, Category.One.Defects, Quakers and Number.of.Bags consider as a strong contribution of PC4.

loading_scores_PC_5<-xxx.pca1$rotation[,5]
fac_scores_PC_5<-abs(loading_scores_PC_5)
fac_scores_PC_5_ranked<-names(sort(fac_scores_PC_5, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_5_ranked, 5]

##           Uniformity Category.One.Defects       Number.of.Bags 
##         -0.900541967          0.382927989         -0.098789782 
##              Balance           Bag.Weight  Moisture.Percentage 
##          0.080924636          0.075363222         -0.069840558 
## Category.Two.Defects           Aftertaste                Aroma 
##          0.065870129         -0.060234348         -0.056000736 
##               Flavor                 Body              Quakers 
##         -0.045297141          0.045093452          0.016468689 
##              Acidity 
##         -0.008073615

The last one from the PC5, Uniformity and Category.One.Defects are the strongest one.

ind<-get_pca_ind(xxx.pca1) 
print(ind)

## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"

head(ind$coord)

##       Dim.1      Dim.2     Dim.3      Dim.4      Dim.5      Dim.6       Dim.7
## 1 -6.988786 1.90681089 0.5418321 -0.7429148 -0.3003789 -0.1230459 -0.21930247
## 2 -4.736186 0.09064613 0.3422025 -0.2200747 -0.2674048  0.2088002 -0.32004123
## 3 -4.627493 0.49812488 0.1328734 -0.3990757 -0.2460498  0.3227001 -0.55679435
## 4 -4.532551 0.75217958 0.4842278 -0.3498330 -0.2750309 -0.3626697  0.45170261
## 5 -3.981433 1.29690187 0.6000946 -1.2705898 -0.3537232  0.1127918 -0.08487344
## 6 -4.103541 0.28445111 0.4213595 -0.3902483 -0.2863240  0.3474746 -0.35012146
##         Dim.8      Dim.9     Dim.10     Dim.11       Dim.12      Dim.13
## 1  0.55793387 -0.0851971 -0.4410661  0.1579104  0.409909268 -0.15133224
## 2  0.28606869  0.9787804  0.6382671  0.1077150 -0.568085644 -1.03297036
## 3  0.46502832  0.4391967 -0.1528116 -0.1244917 -0.294493862 -0.42851387
## 4 -0.42643838 -0.6857634 -0.1448655  0.1697774  0.222792780  0.37702350
## 5 -0.07999881  0.7872899 -0.3773512  0.3627248 -0.265483594  0.25515197
## 6  0.73202859  0.6878320  0.7416791 -1.0291200  0.007186071 -0.08439423

var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90) +
  theme(axis.title.x = element_blank(), plot.title = element_blank())
b<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90) +
  theme(axis.title.x = element_blank(), plot.title = element_blank())
c<-fviz_contrib(xxx.pca1, "var", axes=3, xtickslab.rt=90) +
  theme(axis.title.x = element_blank(), plot.title = element_blank())
d<-fviz_contrib(xxx.pca1, "var", axes=4, xtickslab.rt=90) +
  theme(axis.title.x = element_blank(), plot.title = element_blank())
e<-fviz_contrib(xxx.pca1, "var", axes=5, xtickslab.rt=90) +
  theme(axis.title.x = element_blank(), plot.title = element_blank())
grid.arrange(a, b, c,d,e, ncol = 5)

Plots represent the same variables contributions as we describe above.

Hierarchical clustering

In Hierarchical clustering method, each object is assigned to its own cluster; then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster. At each stage, distances between clusters are recomputed by a dissimilarity formula according to the particular clustering method that is in use. It produces a set of nested clusters organized as a hierarchical three and may be visualized as a dendrogram.

To validate whether the identified groups or patterns are consistent, let’s use hierarchical clustering to cross-check the results with PCA.

arabica.df <- as.data.frame(lapply(numeric_arabica,scale))
distance.m <- dist(t(arabica.df))
hc <- hclust(distance.m,method = "complete")
plot(hc, cex = 0.6, hang = 1)
rect.hclust(hc,k =5, border = "blue")

distance.m <- dist((arabica.df))
hc<- hclust(distance.m,method = "complete")
sub_grp<-cutree(hc, k=5)

plot_ly(
  x = ~xxx.pca1$x[, 1], 
  y = ~xxx.pca1$x[, 2], 
  z = ~xxx.pca1$x[, 3], 
  color = ~as.factor(sub_grp),
  colors = c("blue", "steelblue", "gray"),
  marker = list(size = 5)
) %>%
  add_markers() %>%
  layout(
    scene = list(
      xaxis = list(title = "PC1"),
      yaxis = list(title = "PC2"),
      zaxis = list(title = "PC3")
    ),
    title = "3D Cluster plot"
  )

The result show that there are no big differences between clusters generated from hierarchical tree and PCA. According to the dendrogram and plot results,five clusters are identified, grouped by different coffee attributes. The first cluster groups coffee sensory-related variables, which are positively correlated with coffee quality and are highly interdependent.The second cluster consider the coffee bag weight, indicating it is not strongly associated with other variables. The third cluster includes variables related to the number of bags and defects,that negatively impact coffee quality. The fourth and the fifth clusters contains uniformity and defect, that are not strongly associated with other variables.

We can check the optimal number of clusters and apply Silhouette and Elbow methods.

fviz_nbclust(arabica.df, FUN = hcut, method = "silhouette")

fviz_nbclust(arabica.df, FUN = hcut, method = "wss")

As we can see, based on the Elbow Method, the optimal number of clusters appears to be between 4 and 5, as the reduction in total within-cluster sum of squares begins to soften after 4 clusters. Supporting this, the Silhouette Method suggests that the optimal number of clusters is 6, as it maximizes the average silhouette width. We can notice that this aligns with the selection of 5 components from PCA, which explains 76% of the variance, supporting the possibility of retaining more clusters to capture additional structure in the data.

Summary

The first four components have eigenvalues greater than 1, indicating that they explain more variance than an average variable (PC1: 5.30, PC2: 1.45, PC3: 1.12, PC4: 1.02). For this project, we included PC5 as well, since its eigenvalue (0.99) is only slightly below 1. Combined, the first five PCs explain approximately 76% of the total variance, allowing us to reduce the dimensionality to five components. Then hierarchical clustering was applied to the same dataset to assess whether its results aligned with those from PCA. The findings showed that both methods produced similar results, aligning with the characteristics of the coffee quality dataset. Hierarchical clustering identified five clusters, grouped by different coffee attributes.Furthermore, the Elbow and Silhouette methods supported our decision by suggesting that the optimal number of clusters lies between 4 and 6.

Resources

Unsupervised Lerning course materials by Jacek Lewkowicz.University of Warsaw, Faculty of Economic Science.
I.T.Jolliffe.Principal Component Analysis, Second Edition.Springer. http://cda.psych.uiuc.edu/statistical_learning_course/ .
https://bookdown.org/rdpeng/exdata/dimension-reduction.html#notes-and-further-resources-2.
https://rpubs.com/aaronsc32/singular-value-decomposition-r.

Project 2

Yuliya Martyniuk