knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(factoextra)
library(caret)
library(FactoMineR)
library(tidyverse)
library(plotly)
library(tibble)
library(knitr)
library(scutr)
The price of olive oil, among other factors, depends on its origin. Verifying the source of oil can be done precisely with expensive chemical methodologies (e.g., HPLC, GCMS etc). A number of studies have shown that applying Infrared analysis (both Near and Mid) is a faster and cheaper approach than traditional chemistry. It is a quick and often “no-sample-preparation-required” technique. When combined with chemometrics, the time and cost of the analysis can be radically reduced.
To decipher the information hidden in the observations (scans), we
need to apply chemometrics with an adequate algorithm. The scans are
highly correlated, so a simple regression (OLS) technique would not
work. Partial least square (PLS) does not suffer from that drawback of
OLS.
PLS has been used for over 20 years both as a tool for regression (PLSR)
and classification (PLSDA). As OLS, it is a linear regression method,
however it uses Principal Components (rather than variables as
uploaded) for that purpose.
The data is related to an experiment described in a linked article 1. The paper investigates whether Fourier transform infrared spectroscopy (FTIR), in combination with multivariate analysis, can distinguish extra virgin olive oils from different producing countries.
We download a zipped folder to our working directory, then extract
two files (readme.txt and
FTIR_Spectra_olive_oils.csv)
## [1] 572 121
url_oil <- "http://csr.quadram.ac.uk/wp-content/uploads/FTIRSpectraOliveOils.zip" # url to our data
download.file(url_oil,"FTIRSpectraOliveOils.zip") # we download a zipped folder to our working directory
oil <- as.data.frame(t(read.csv(unzip("FTIRSpectraOliveOils.zip","FTIR_Spectra_olive_oils.csv")))) # we read the csv file and transpose it
readme <- read_file(unzip("FTIRSpectraOliveOils.zip","readme.txt")) # we can also read the text file contained in the zipped folder
dim(oil)
Let’s first look at the data; for practical reasons, here we will print out just few first rows / columns:
oil%>% select(1:5) %>% slice_head(n=5)
## Sample.Number. X1...2 X1...3 X2...4 X2...5
## 1 Group Code: 1 1 1 1
## 2 Wavenumbers Greece Greece Greece Greece
## 3 798.892 0.127523009 0.126498181 0.130411785 0.130022227
## 4 800.8215 0.127949615 0.127130974 0.130675401 0.130406662
## 5 802.751 0.129282219 0.128510777 0.13201661 0.132018029
We need to:
colnamesreadme.txt)colname of the country of originoil <- as.data.frame(t(oil))
colnames(oil) <- oil[1,] # changing the column names
oil2 <- oil[-1,]# removing the first row
oil2 <- oil2[,-1]# removing the first column
oil2 <- oil2 %>% rename(Origin=Wavenumbers) # renaming a columns
rownames(oil2) <- NULL # removing row names
oil2[,2:ncol(oil2)] <- apply(oil2[,2:ncol(oil2)],2,as.numeric) # changing the class of ATR data
oil2[,-1] <- (oil2[,-1]-min(oil2[,-1]))/(max(oil2[,-1])-min(oil2[,-1])) # scaling min-max
oil2[,1] <- as.factor(oil2[,1]) # setting Origin as factor
oil2 %>% select(1:5) %>% slice_head(n=5)
## Origin 798.892 800.8215 802.751 804.6805
## 1 Greece 0.08094809 0.08121735 0.08205844 0.08325256
## 2 Greece 0.08030126 0.08070066 0.08157153 0.08272601
## 3 Greece 0.08277138 0.08293776 0.08378428 0.08492507
## 4 Greece 0.08252550 0.08276814 0.08378517 0.08504071
## 5 Greece 0.08162910 0.08174749 0.08252553 0.08378128
Now our data frame contains relevant information: country of origin (first column) stored as a factor, and spectral values (numeric) with column names as relevant wavenumbers.
Finally, we need to check if there are any missing values in the dataset.
if(length(oil2[is.na(oil2)])==0) {
print("There were no NAs")
} else{"Please remove NAs!"}
## [1] "There were no NAs"
There were no missing values in the dataset. We are ready to proceed with our analysis.
We will print a fraction of the data glimpse and data summary:
oil2 %>% slice(1:5) %>% select(1:5) %>% glimpse()
## Rows: 5
## Columns: 5
## $ Origin <fct> Greece, Greece, Greece, Greece, Greece
## $ `798.892` <dbl> 0.08094809, 0.08030126, 0.08277138, 0.08252550, 0.08162910
## $ `800.8215` <dbl> 0.08121735, 0.08070066, 0.08293776, 0.08276814, 0.08174749
## $ `802.751` <dbl> 0.08205844, 0.08157153, 0.08378428, 0.08378517, 0.08252553
## $ `804.6805` <dbl> 0.08325256, 0.08272601, 0.08492507, 0.08504071, 0.08378128
oil2 %>% select(1:5) %>% summary()
## Origin 798.892 800.8215 802.751
## Greece :20 Min. :0.07949 Min. :0.07974 Min. :0.08060
## Italy :34 1st Qu.:0.08043 1st Qu.:0.08073 1st Qu.:0.08160
## Portugal:16 Median :0.08098 Median :0.08124 Median :0.08207
## Spain :50 Mean :0.08162 Mean :0.08190 Mean :0.08275
## 3rd Qu.:0.08158 3rd Qu.:0.08184 3rd Qu.:0.08266
## Max. :0.09183 Max. :0.09221 Max. :0.09307
## 804.6805
## Min. :0.08171
## 1st Qu.:0.08286
## Median :0.08329
## Mean :0.08395
## 3rd Qu.:0.08383
## Max. :0.09422
Let’s visualize the dataset. To do so, we need to:
oil2 %>% mutate(Scan=1:nrow(oil2)) %>%
pivot_longer(cols=-c(Origin,Scan)) %>% mutate(name=as.numeric(name)) %>%
ggplot(aes(name,value, group=Scan, color=Origin))+geom_line()+
theme_gray()+ggtitle("ATR of virgin olive oils",subtitle="After initial transformations")+xlab(expression(paste("Wavenumber, ", cm^-1)))+ylab("Absorbance")+scale_x_reverse()
Let’s check the number of identifying factors and their balance.
oil2 %>% group_by(Origin) %>% summarize(Count=n()) %>% mutate('Fraction, %'=Count/sum(Count)*100) %>% mutate_at(3,funs(round(.,1))) %>% print()
## # A tibble: 4 × 3
## Origin Count `Fraction, %`
## <fct> <int> <dbl>
## 1 Greece 20 16.7
## 2 Italy 34 28.3
## 3 Portugal 16 13.3
## 4 Spain 50 41.7
The data is not balanced: the number of samples originated from Spain
is over 3x higher than the Portuguese samples. We will balance the data
with scutr package options when applying PLS algorithm.
We will check how well the PCA separates Origin groups.
pca.model <- FactoMineR::PCA(as.matrix(oil2[,-1]), scale = TRUE, graph = FALSE)
factoextra::fviz_screeplot(pca.model, addlabels = TRUE, ylim = c(0, 50))
The scree plot tells us that that first 8 components can explain over 96% of variance in the dataset.
factoextra::fviz_pca_ind(pca.model,
label = "none", # hide individual labels
habillage = oil2$Origin, # color by groups
#palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE # Concentration ellipses
)
The scree plot of individuals shows first two components are responsible for 65.8% of the variance.
We need to mix data rows to make the data analysis more realistic.
set.seed(123); oil3 <- oil2[order(runif(nrow(oil2))),]
We split the data into training and testing sets in ratio 7/3,
respectively, according to the Origin labels:
set.seed(123)
indexOil<-createDataPartition(oil3$Origin, p=0.7, list=FALSE)
trainData <- oil3[indexOil,]
testData <- oil3[-indexOil,]
Let’s check what is the proportion of Origin in the
original, the training and the testing sets:
data.frame(
Original_set=oil3 %>% dplyr::group_by(Origin) %>% summarize(count=n()) %>%
mutate(pc=count*100/sum(count)) %>% pull(pc) %>% round(),
Train_set=trainData %>% dplyr::group_by(Origin) %>% summarize(count=n()) %>%
mutate(pc=count*100/sum(count)) %>% pull(pc) %>% round(),
Test_set=testData %>% dplyr::group_by(Origin) %>% summarize(count=n()) %>%
mutate(pc=count*100/sum(count)) %>% pull(pc) %>% round()
) %>% add_column(Origin=sort(unique(oil3$Origin)),.before=1) %>% kable(caption="Proportion of different countries of origin") %>% kableExtra::kable_styling()
| Origin | Original_set | Train_set | Test_set |
|---|---|---|---|
| Greece | 17 | 16 | 17 |
| Italy | 28 | 28 | 29 |
| Portugal | 13 | 14 | 11 |
| Spain | 42 | 41 | 43 |
We will balance the training set by SMOTE (Synthetic
Minority Oversampling Technique). In the scutr package
description we read: “In a dataframe with n classes and m rows, the
resulting dataframe will have m/n rows per class.”
table(trainData$Origin)
##
## Greece Italy Portugal Spain
## 14 24 12 35
train_balanced <- SCUT(data=trainData,cls_col="Origin",oversample = oversample_smote)
train_balanced$Origin %>% table()
## .
## Greece Italy Portugal Spain
## 21 21 21 21
Each of the four countries is now represented equally in the training set.
To have more robust results, we will apply cross-validation via the
trControl function. Also, pre-processing (centering and
scaling) will be done by the algorithm (see
preProcess).
set.seed(123)
control1 <- caret::trainControl(method = "repeatedcv", repeats = 10,number=3,classProbs = TRUE)
# We train the PLS algorithm on the training set
model_asReceived <- caret::train(Origin~.,
data=train_balanced,
method = "pls",
trControl=control1,
preProcess =c("center","scale"),
tuneLength = 20,
summaryFunction = multiClassSummary)
confusionMatrix(predict(model_asReceived,testData),as.factor(testData$Origin))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Greece Italy Portugal Spain
## Greece 6 0 0 0
## Italy 0 10 0 0
## Portugal 0 0 4 0
## Spain 0 0 0 15
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9, 1)
## No Information Rate : 0.4286
## P-Value [Acc > NIR] : 1.321e-13
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Greece Class: Italy Class: Portugal Class: Spain
## Sensitivity 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.1714 0.2857 0.1143 0.4286
## Detection Rate 0.1714 0.2857 0.1143 0.4286
## Detection Prevalence 0.1714 0.2857 0.1143 0.4286
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000
The accuracy results for the test set is 100%.
Let’s visualize tuning. Instead the plot(model)
function, we will customize the plot.
model_tune<- data.frame(cbind(1:length(model_asReceived$results$Accuracy),model_asReceived$results$Accuracy)) %>% rename(Components=1,Accuracy=2)
model_tune %>% ggplot(aes(Components,Accuracy))+geom_line(color="steelblue")+geom_point(aes(Components,Accuracy), color="steelblue", size=2, alpha=0.5)+geom_point(aes(y=max(Accuracy),x=Components[which.max(Accuracy)]), color="red", size=3)+ggtitle("Cross-validation tuning", subtitle = "Best accuracy marked with red")+scale_x_continuous(breaks=seq(1,25,2))
The best results were for 15 latent variables.
Now we can visualize how much various parts of the spectrum contributed to the separation. Please mind that is an overall contribution of variables, and for a particular class it might vary.
var.imp <- varImp(model_asReceived)$importance %>% as.data.frame() %>% bind_cols(Wavelength=as.numeric(colnames(oil2)[-1])) %>% mutate(Importance=rowSums(.[1:4])) %>% bind_cols(Spectrum=as.numeric(t(oil2[1,-1]))) %>% mutate(Importance=Importance/max(Importance)*10) %>% na.omit() %>%
mutate(Wavelength=as.numeric(as.character(Wavelength)))
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
p <- ggplot(var.imp,aes(x=Wavelength,y=Spectrum,color=Importance))+geom_point(size=1.5)+
scale_color_gradientn(colours = rainbow(5))+
ggtitle("Class separation cumulative wavenumber importance", subtitle="Wavelength dependent")+xlab("Wavenumber,1/cm")+ylab("AU")+scale_x_reverse()
q <- ggplotly(p)
q <- q %>% layout(dragmode = "pan")
q
From the visualization above, we can conclude that the main separation contributing bands were:
Still, most of the spectrum contributed to the separation of the origin classes.
Above procedure showed how we can approach Partial Least Squares
Discriminant Analysis (PLSDA) with R in case of a multidimentional
dataset (FTIR of olive oils). As the class count varied, we balanced it
by application of the scutr package. The analysis gave 100%
accuracy for the test set.
Tapp H.S., Defernez M., Kemsley E.K., “FTIR Spectroscopy and Multivariate Analysis Can Distinguish the Geographic Origin of Extra Virgin Olive Oils”, Journal of Agricultural and Food Chemistry, 2003 51 (21), 6110-6115, https://doi.org/10.1021/jf030232s↩︎