knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(factoextra)
library(caret)
library(FactoMineR)
library(tidyverse)
library(plotly)
library(tibble)
library(knitr)
library(scutr)

Introduction

The price of olive oil, among other factors, depends on its origin. Verifying the source of oil can be done precisely with expensive chemical methodologies (e.g., HPLC, GCMS etc). A number of studies have shown that applying Infrared analysis (both Near and Mid) is a faster and cheaper approach than traditional chemistry. It is a quick and often “no-sample-preparation-required” technique. When combined with chemometrics, the time and cost of the analysis can be radically reduced.

To decipher the information hidden in the observations (scans), we need to apply chemometrics with an adequate algorithm. The scans are highly correlated, so a simple regression (OLS) technique would not work. Partial least square (PLS) does not suffer from that drawback of OLS.
PLS has been used for over 20 years both as a tool for regression (PLSR) and classification (PLSDA). As OLS, it is a linear regression method, however it uses Principal Components (rather than variables as uploaded) for that purpose.

Data analysis

The data is related to an experiment described in a linked article 1. The paper investigates whether Fourier transform infrared spectroscopy (FTIR), in combination with multivariate analysis, can distinguish extra virgin olive oils from different producing countries.

Data preparation

Data Download

We download a zipped folder to our working directory, then extract two files (readme.txt and FTIR_Spectra_olive_oils.csv)

## [1] 572 121
url_oil <- "http://csr.quadram.ac.uk/wp-content/uploads/FTIRSpectraOliveOils.zip" # url to our data
download.file(url_oil,"FTIRSpectraOliveOils.zip")  # we download a zipped folder to our working directory
oil <- as.data.frame(t(read.csv(unzip("FTIRSpectraOliveOils.zip","FTIR_Spectra_olive_oils.csv")))) # we read the csv file and transpose it
readme <- read_file(unzip("FTIRSpectraOliveOils.zip","readme.txt")) # we can also read the text file contained in the zipped folder
dim(oil)

Let’s first look at the data; for practical reasons, here we will print out just few first rows / columns:

oil%>% select(1:5) %>% slice_head(n=5)
##   Sample.Number.      X1...2      X1...3      X2...4      X2...5
## 1    Group Code:           1           1           1           1
## 2    Wavenumbers      Greece      Greece      Greece      Greece
## 3        798.892 0.127523009 0.126498181 0.130411785 0.130022227
## 4       800.8215 0.127949615 0.127130974 0.130675401 0.130406662
## 5        802.751 0.129282219 0.128510777  0.13201661 0.132018029

Data wrangling

We need to:

  • transpose the data
  • use the first row as the colnames
  • then remove the first row
  • remove the row names, as in the original file those were duplicated sample numbers (each sample was scanned twice - see downloaded readme.txt)
  • change the colname of the country of origin
  • all the wavenumbers need to be converted to numeric
oil <- as.data.frame(t(oil))
colnames(oil) <- oil[1,] # changing the column names
oil2 <- oil[-1,]# removing the first row
oil2 <- oil2[,-1]# removing the first column
oil2 <- oil2 %>% rename(Origin=Wavenumbers) # renaming a columns
rownames(oil2) <- NULL # removing row names
oil2[,2:ncol(oil2)] <- apply(oil2[,2:ncol(oil2)],2,as.numeric) # changing the class of ATR data
oil2[,-1] <- (oil2[,-1]-min(oil2[,-1]))/(max(oil2[,-1])-min(oil2[,-1])) # scaling min-max
oil2[,1] <- as.factor(oil2[,1]) # setting Origin as factor
oil2 %>% select(1:5) %>% slice_head(n=5)
##   Origin    798.892   800.8215    802.751   804.6805
## 1 Greece 0.08094809 0.08121735 0.08205844 0.08325256
## 2 Greece 0.08030126 0.08070066 0.08157153 0.08272601
## 3 Greece 0.08277138 0.08293776 0.08378428 0.08492507
## 4 Greece 0.08252550 0.08276814 0.08378517 0.08504071
## 5 Greece 0.08162910 0.08174749 0.08252553 0.08378128

Now our data frame contains relevant information: country of origin (first column) stored as a factor, and spectral values (numeric) with column names as relevant wavenumbers.

Finally, we need to check if there are any missing values in the dataset.

if(length(oil2[is.na(oil2)])==0) {
  print("There were no NAs")
  } else{"Please remove NAs!"}
## [1] "There were no NAs"

There were no missing values in the dataset. We are ready to proceed with our analysis.

Exploratory Data Analysis

Data summary

We will print a fraction of the data glimpse and data summary:

oil2 %>% slice(1:5) %>% select(1:5) %>% glimpse()
## Rows: 5
## Columns: 5
## $ Origin     <fct> Greece, Greece, Greece, Greece, Greece
## $ `798.892`  <dbl> 0.08094809, 0.08030126, 0.08277138, 0.08252550, 0.08162910
## $ `800.8215` <dbl> 0.08121735, 0.08070066, 0.08293776, 0.08276814, 0.08174749
## $ `802.751`  <dbl> 0.08205844, 0.08157153, 0.08378428, 0.08378517, 0.08252553
## $ `804.6805` <dbl> 0.08325256, 0.08272601, 0.08492507, 0.08504071, 0.08378128
oil2 %>% select(1:5) %>% summary()
##       Origin      798.892           800.8215          802.751       
##  Greece  :20   Min.   :0.07949   Min.   :0.07974   Min.   :0.08060  
##  Italy   :34   1st Qu.:0.08043   1st Qu.:0.08073   1st Qu.:0.08160  
##  Portugal:16   Median :0.08098   Median :0.08124   Median :0.08207  
##  Spain   :50   Mean   :0.08162   Mean   :0.08190   Mean   :0.08275  
##                3rd Qu.:0.08158   3rd Qu.:0.08184   3rd Qu.:0.08266  
##                Max.   :0.09183   Max.   :0.09221   Max.   :0.09307  
##     804.6805      
##  Min.   :0.08171  
##  1st Qu.:0.08286  
##  Median :0.08329  
##  Mean   :0.08395  
##  3rd Qu.:0.08383  
##  Max.   :0.09422

Data visualization

Let’s visualize the dataset. To do so, we need to:

  • add an identifier to each scan (record)
  • change the format of the file to a long one
oil2 %>% mutate(Scan=1:nrow(oil2)) %>% 
  pivot_longer(cols=-c(Origin,Scan)) %>% mutate(name=as.numeric(name)) %>% 
  ggplot(aes(name,value, group=Scan, color=Origin))+geom_line()+
  theme_gray()+ggtitle("ATR of virgin olive oils",subtitle="After initial transformations")+xlab(expression(paste("Wavenumber, ", cm^-1)))+ylab("Absorbance")+scale_x_reverse()

Data balance

Let’s check the number of identifying factors and their balance.

oil2 %>% group_by(Origin) %>% summarize(Count=n()) %>% mutate('Fraction, %'=Count/sum(Count)*100) %>% mutate_at(3,funs(round(.,1))) %>% print()
## # A tibble: 4 × 3
##   Origin   Count `Fraction, %`
##   <fct>    <int>         <dbl>
## 1 Greece      20          16.7
## 2 Italy       34          28.3
## 3 Portugal    16          13.3
## 4 Spain       50          41.7

The data is not balanced: the number of samples originated from Spain is over 3x higher than the Portuguese samples. We will balance the data with scutr package options when applying PLS algorithm.

Principal Component Analysis

We will check how well the PCA separates Origin groups.

pca.model <- FactoMineR::PCA(as.matrix(oil2[,-1]),  scale = TRUE, graph = FALSE)
factoextra::fviz_screeplot(pca.model, addlabels = TRUE, ylim = c(0, 50))

The scree plot tells us that that first 8 components can explain over 96% of variance in the dataset.

factoextra::fviz_pca_ind(pca.model,
             label = "none", # hide individual labels
             habillage = oil2$Origin, # color by groups
             #palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE # Concentration ellipses
             )

The scree plot of individuals shows first two components are responsible for 65.8% of the variance.

Partial Least Squares Discriminant Analysis

Data preparation

We need to mix data rows to make the data analysis more realistic.

set.seed(123); oil3 <- oil2[order(runif(nrow(oil2))),]

We split the data into training and testing sets in ratio 7/3, respectively, according to the Origin labels:

set.seed(123)
indexOil<-createDataPartition(oil3$Origin, p=0.7, list=FALSE)
trainData <- oil3[indexOil,]
testData <- oil3[-indexOil,]

Let’s check what is the proportion of Origin in the original, the training and the testing sets:

data.frame(
Original_set=oil3 %>% dplyr::group_by(Origin) %>% summarize(count=n()) %>%
  mutate(pc=count*100/sum(count)) %>% pull(pc) %>% round(),
Train_set=trainData %>% dplyr::group_by(Origin) %>% summarize(count=n()) %>%
  mutate(pc=count*100/sum(count)) %>% pull(pc) %>% round(),
Test_set=testData %>% dplyr::group_by(Origin) %>% summarize(count=n()) %>%
  mutate(pc=count*100/sum(count)) %>% pull(pc) %>% round()
           ) %>% add_column(Origin=sort(unique(oil3$Origin)),.before=1) %>% kable(caption="Proportion of different countries of origin") %>% kableExtra::kable_styling()
Proportion of different countries of origin
Origin Original_set Train_set Test_set
Greece 17 16 17
Italy 28 28 29
Portugal 13 14 11
Spain 42 41 43

We will balance the training set by SMOTE (Synthetic Minority Oversampling Technique). In the scutr package description we read: “In a dataframe with n classes and m rows, the resulting dataframe will have m/n rows per class.”

table(trainData$Origin)
## 
##   Greece    Italy Portugal    Spain 
##       14       24       12       35
train_balanced <- SCUT(data=trainData,cls_col="Origin",oversample = oversample_smote)

train_balanced$Origin %>% table()
## .
##   Greece    Italy Portugal    Spain 
##       21       21       21       21

Each of the four countries is now represented equally in the training set.

PLS DA

To have more robust results, we will apply cross-validation via the trControl function. Also, pre-processing (centering and scaling) will be done by the algorithm (see preProcess).

set.seed(123)
control1 <- caret::trainControl(method = "repeatedcv", repeats = 10,number=3,classProbs = TRUE)
# We train the PLS algorithm on the training set
model_asReceived <- caret::train(Origin~.,
                  data=train_balanced,
                  method = "pls",
                  trControl=control1,
                  preProcess =c("center","scale"),
                  tuneLength = 20, 
                  summaryFunction = multiClassSummary)
confusionMatrix(predict(model_asReceived,testData),as.factor(testData$Origin))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Greece Italy Portugal Spain
##   Greece        6     0        0     0
##   Italy         0    10        0     0
##   Portugal      0     0        4     0
##   Spain         0     0        0    15
## 
## Overall Statistics
##                                    
##                Accuracy : 1        
##                  95% CI : (0.9, 1) 
##     No Information Rate : 0.4286   
##     P-Value [Acc > NIR] : 1.321e-13
##                                    
##                   Kappa : 1        
##                                    
##  Mcnemar's Test P-Value : NA       
## 
## Statistics by Class:
## 
##                      Class: Greece Class: Italy Class: Portugal Class: Spain
## Sensitivity                 1.0000       1.0000          1.0000       1.0000
## Specificity                 1.0000       1.0000          1.0000       1.0000
## Pos Pred Value              1.0000       1.0000          1.0000       1.0000
## Neg Pred Value              1.0000       1.0000          1.0000       1.0000
## Prevalence                  0.1714       0.2857          0.1143       0.4286
## Detection Rate              0.1714       0.2857          0.1143       0.4286
## Detection Prevalence        0.1714       0.2857          0.1143       0.4286
## Balanced Accuracy           1.0000       1.0000          1.0000       1.0000

The accuracy results for the test set is 100%.

Let’s visualize tuning. Instead the plot(model) function, we will customize the plot.

model_tune<- data.frame(cbind(1:length(model_asReceived$results$Accuracy),model_asReceived$results$Accuracy)) %>% rename(Components=1,Accuracy=2) 

model_tune %>% ggplot(aes(Components,Accuracy))+geom_line(color="steelblue")+geom_point(aes(Components,Accuracy), color="steelblue", size=2, alpha=0.5)+geom_point(aes(y=max(Accuracy),x=Components[which.max(Accuracy)]), color="red", size=3)+ggtitle("Cross-validation tuning", subtitle = "Best accuracy marked with red")+scale_x_continuous(breaks=seq(1,25,2))

The best results were for 15 latent variables.

Spectral ranges importance

Now we can visualize how much various parts of the spectrum contributed to the separation. Please mind that is an overall contribution of variables, and for a particular class it might vary.

var.imp <- varImp(model_asReceived)$importance %>% as.data.frame() %>% bind_cols(Wavelength=as.numeric(colnames(oil2)[-1])) %>%   mutate(Importance=rowSums(.[1:4])) %>% bind_cols(Spectrum=as.numeric(t(oil2[1,-1]))) %>% mutate(Importance=Importance/max(Importance)*10) %>% na.omit() %>% 
  mutate(Wavelength=as.numeric(as.character(Wavelength)))
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
p <- ggplot(var.imp,aes(x=Wavelength,y=Spectrum,color=Importance))+geom_point(size=1.5)+
scale_color_gradientn(colours = rainbow(5))+
  ggtitle("Class separation cumulative wavenumber importance", subtitle="Wavelength dependent")+xlab("Wavenumber,1/cm")+ylab("AU")+scale_x_reverse()
q <- ggplotly(p)
q <- q  %>% layout(dragmode = "pan")
q

From the visualization above, we can conclude that the main separation contributing bands were:

  • ~ 1745cm-1: C=0 - stretching - esters
  • ~ 1450cm-1: C-H bending - alkane
  • ~ 1190cm-1: C-O stretching - esters

Still, most of the spectrum contributed to the separation of the origin classes.

Final thoughts

Above procedure showed how we can approach Partial Least Squares Discriminant Analysis (PLSDA) with R in case of a multidimentional dataset (FTIR of olive oils). As the class count varied, we balanced it by application of the scutr package. The analysis gave 100% accuracy for the test set.



  1. Tapp H.S., Defernez M., Kemsley E.K., “FTIR Spectroscopy and Multivariate Analysis Can Distinguish the Geographic Origin of Extra Virgin Olive Oils”, Journal of Agricultural and Food Chemistry, 2003 51 (21), 6110-6115, https://doi.org/10.1021/jf030232s↩︎