library(tidyverse)
library(knitr)
library(caret)
library(readxl)
library(RColorBrewer)
library(e1071)
library(ggthemes)
library(plotly)
library(kableExtra)

1. Introduction

Near Infrared (NIR) spectroscopy is a technique widely used in chemistry and many other fields (e.g., medicine, physics, pharma, food). It is fast, non-destructive, and requires very little sample preparation. If we can sacrifice some of an NIR instrument's accuracy, we can buy a relatively cheap handheld device (the size of a computer mouse) that can work as an Internet of Things (IoT) device.

The NIR spectrum consists of combination bands as well as overtones. We need to apply chemometrics to extract the information hidden in the NIR signal.

In this tutorial we will train two algorithms on a portion of the NIR data (the training set), and then assess their performance on a test set.

2. Dataset

According to the description of the dataset: This dataset contains absorbance spectrum (NIR) in wavelength range from 1000 to 2500 nm for a total of 72 bulk intact cocoa bean samples. Each bulk amounted 50g of intact beans. Inner quality parameters were measured as moisture content (%) and fat content (%). We employ NIRS method as a part of technology adoption for cocoa farmers in Aceh province, Indonesia. The NIRS technology can rapidly and simultaneously predict those both quality parameters of intact cocoa beans.[…].

Both the publication 1 and the data are publicly available under licence CC BY 4.0 here.

3. Data preparation

We start by loading the data:

  dat1 <- read_excel("NIRS spectrum Cocoa Beans.xlsx")

Let’s print a portion of the data:

dat1[1:10,1:10]
## # A tibble: 10 × 10
##    NIRS Spectra…¹     ...2     ...3     ...4     ...5     ...6     ...7     ...8
##    <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 Acquired in w…   NA       NA       NA       NA       NA       NA       NA    
##  2 Bulk cocoa sa…   NA       NA       NA       NA       NA       NA       NA    
##  3 quality param…   NA       NA       NA       NA       NA       NA       NA    
##  4 <NA>             NA       NA       NA       NA       NA       NA       NA    
##  5 <NA>             NA       NA       NA       NA       NA       NA       NA    
##  6 sample Num     1000.    1000.    1001.    1001.    1001.    1002.    1002.   
##  7 1                 0.455    0.455    0.455    0.455    0.455    0.455    0.455
##  8 2                 0.485    0.485    0.485    0.485    0.485    0.485    0.485
##  9 3                 0.504    0.504    0.504    0.504    0.504    0.504    0.504
## 10 4                 0.470    0.470    0.470    0.470    0.470    0.470    0.470
## # … with 2 more variables: ...9 <dbl>, ...10 <dbl>, and abbreviated variable
## #   name ¹​`NIRS Spectral data of intact Cocoa Bean samples`

We will rename the columns using the 6th row, then remove the first six rows.

dat1a <- dat1 %>% setNames(as.character(dat1[6,])) %>% dplyr::select(-1) %>% slice(-c(1:6))

For easier data manipulation, we will rename the response variable (Moisture Content (%)) and remove the Fat Content (%) variable:

data2 <- dat1a %>% rename_at(vars(starts_with('Moisture')),~'Moisture') %>% dplyr::select(-(ncol(dat1a)))

Now, we can relocate our response variable (Moisture) to the beginning of our dataset:

data2 <- data2 %>% relocate(Moisture,.before=1)

Let’s print out the first few variables:

data2 %>% dplyr::select(1:10) %>% glimpse()
## Rows: 72
## Columns: 10
## $ Moisture <chr> "8.17", "7.52", "8.52", "8.65", "7.66", "8.18", "7.86", "8.34…
## $ `999.9`  <dbl> 0.4552450, 0.4852818, 0.5041278, 0.4700770, 0.4953564, 0.4538…
## $ `1000.3` <dbl> 0.4551343, 0.4854074, 0.5037841, 0.4697611, 0.4953353, 0.4531…
## $ `1000.7` <dbl> 0.4549861, 0.4854073, 0.5035499, 0.4697801, 0.4955131, 0.4533…
## $ `1001.1` <dbl> 0.4550090, 0.4854785, 0.5039826, 0.4698518, 0.4955038, 0.4536…
## $ `1001.4` <dbl> 0.4547225, 0.4853990, 0.5039921, 0.4697831, 0.4953027, 0.4537…
## $ `1001.8` <dbl> 0.4546516, 0.4852844, 0.5035763, 0.4696004, 0.4951282, 0.4532…
## $ `1002.2` <dbl> 0.4545161, 0.4851503, 0.5036557, 0.4695199, 0.4951181, 0.4535…
## $ `1002.6` <dbl> 0.4545274, 0.4851739, 0.5036467, 0.4694393, 0.4953395, 0.4537…
## $ `1003`   <dbl> 0.4545811, 0.4854389, 0.5033267, 0.4696478, 0.4951361, 0.4536…

All data should be numeric. Let’s check if that is the case:

data2 %>% select_if(~!is.numeric(.x)) %>% head()
## # A tibble: 6 × 1
##   Moisture
##   <chr>   
## 1 8.17    
## 2 7.52    
## 3 8.52    
## 4 8.65    
## 5 7.66    
## 6 8.18

We need to convert the Moisture column to numeric:

data3 <- data2 %>% mutate_if(is.character, as.numeric)
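
To double check, we can repeat the earlier test; after the conversion it should report zero non-numeric columns:

data3 %>% select_if(~!is.numeric(.x)) %>% ncol()  # expected: 0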

Let’s print the summary of the response variable:

data3 %>% dplyr::select(Moisture) %>% summary()
##     Moisture     
##  Min.   : 6.740  
##  1st Qu.: 8.088  
##  Median : 8.785  
##  Mean   : 9.036  
##  3rd Qu.: 9.562  
##  Max.   :12.080
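
A quick histogram (my own addition, not part of the original analysis) gives a visual feel for the spread summarised above:

# Distribution of the response variable
data3 %>% ggplot(aes(Moisture)) + geom_histogram(bins = 15) +
  theme_igray() + ggtitle("Moisture content distribution") + xlab("Moisture, %")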

Now, we can visualize the dataset, with colour depending on the moisture values:

data3 %>% dplyr::select(-2) %>% mutate(Scan=1:nrow(data3)) %>% 
  pivot_longer(cols=-c(Moisture,Scan)) %>% mutate(name=as.numeric(name)) %>% 
  ggplot(aes(name,value, group=Scan, color=Moisture))+geom_line()+
  scale_color_distiller(palette = "Spectral")+
  theme_igray()+ggtitle("Moisture in cocoa, NIR",subtitle="As received")+xlab("Wavelength, nm ")+ylab("AU")

3-1. Spectral pre-processing

According to many sources (e.g., 2), various types of spectral corrections can help with class separation and regression.

I usually apply:

  • Multiplicative Scatter Correction (MSC)
  • Standard Normal Variate (SNV)
  • First Derivative with Savitzky-Golay filtering (SG1)
  • Second Derivative with Savitzky-Golay filtering (SG2)

My preferred tool for these corrections is the mdatools package (only for the MSC correction did I use the pls package). Unfortunately, mdatools is not compatible with the caret package used here, so I saved the corrected data as csv files and loaded the relevant file in the course of the analysis. Detaching the mdatools library, or not attaching it at all and calling it only via mdatools::, did not do the trick.

Still, you can find the (commented-out) pre-processing code below.

We can separate the independent and response variables:

response <- data3 %>% dplyr::select(1)
expl_vars <- data3 %>% dplyr::select(-1)
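
As a side note, MSC and SNV are simple enough to compute in base R. The sketch below is my own illustration (not the code that produced the csv files used later) and works on the expl_vars matrix just created; its output should closely match the mdatools/pls implementations.

X <- as.matrix(expl_vars)

# SNV: centre and scale each spectrum (row) by its own mean and standard deviation
snv_manual <- t(apply(X, 1, function(x) (x - mean(x)) / sd(x)))

# MSC: regress each spectrum on the mean spectrum, then correct it as (x - intercept)/slope
x_ref <- colMeans(X)
msc_manual <- t(apply(X, 1, function(x) {
  fit <- lm(x ~ x_ref)
  (x - coef(fit)[1]) / coef(fit)[2]
}))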

The MSC correction:

#data_msc1 <- pls::msc(as.matrix(data3[,-c(1,2)]))
#scale_01 <- function(x){(x-min(x))/(max(x)-min(x))}
#data_msc2 <- response %>% bind_cols(t(apply(data_msc1,1,scale_01))) %>% data.frame() %>% setNames(colnames(data3))
#write_csv(data_msc2,"data_msc.csv")

data_msc <- readr::read_csv("data_msc.csv",show_col_types = FALSE)
data_msc %>% mutate(Scan=1:nrow(data_msc)) %>% 
   pivot_longer(-c("Moisture","Scan")) %>% 
  ggplot(aes(as.numeric(name),value, group=Scan, color=Moisture))+geom_line()+
  scale_color_distiller(palette = "Spectral")+
  theme_igray()+ggtitle("NIR of cocoa",subtitle="After MSC correction")+xlab("Wavelength, nm")+ylab("AU")

The SNV correction:

# data_snv <- mdatools::prep.snv(expl_vars)
# scale_01 <- function(x){(x-min(x))/(max(x)-min(x))}
# data_snv2 <- response %>% bind_cols(t(apply(data_snv,1,scale_01))) %>% data.frame() %>% setNames(colnames(data3))
# write_csv(data_snv2,"data_snv.csv")

data_snv <- readr::read_csv("data_snv.csv",show_col_types = FALSE)

data_snv %>% mutate(Scan=1:nrow(data_snv)) %>% 
   pivot_longer(-c("Moisture", "Scan")) %>% 
  ggplot(aes(as.numeric(name),value, group=Scan, color=Moisture))+geom_line()+
  scale_color_distiller(palette = "Spectral")+
  theme_igray()+ggtitle("NIR of cocoa",subtitle="After SNV correction")+xlab("Wavelength, nm")+ylab("AU")

The SG1 correction:

#SG1
# We set the window for the smoothing at 15, and the polynomial order at 3:
# sg.order=15
# data_sg1a <- mdatools::prep.savgol(expl_vars, width = sg.order, porder = 3, dorder = 1)
#data_sg1b <- response %>% bind_cols(t(apply(data_sg1a,1,scale_01))) %>% data.frame() %>% setNames(colnames(data3))
#write_csv(data_sg1b,"data_sg1.csv")

data_sg1 <- readr::read_csv("data_sg1.csv",show_col_types = FALSE)

data_sg1 %>% mutate(Scan=1:nrow(data_sg1)) %>% 
   pivot_longer(-c("Moisture", "Scan")) %>% 
  ggplot(aes(as.numeric(name),value, group=Scan, color=Moisture))+geom_line()+
  scale_color_distiller(palette = "Spectral")+
   theme_igray()+ggtitle("NIR of cocoa",subtitle="After SG1 correction")+xlab("Wavelength, nm")+ylab("AU")

The SG2 correction:

# sg.order=15
# data_sg2a <- mdatools::prep.savgol(expl_vars, width = sg.order, porder = 3, dorder = 2)
# data_sg2b <- response %>% bind_cols(t(apply(data_sg2a,1,scale_01))) %>% data.frame() %>% setNames(colnames(data3))
# write_csv(data_sg2b,"data_sg2.csv")

data_sg2 <- readr::read_csv("data_sg2.csv",show_col_types = FALSE)

data_sg2 %>% mutate(Scan=1:nrow(data_sg2)) %>% 
   pivot_longer(-c("Moisture", "Scan")) %>% 
  ggplot(aes(as.numeric(name),value, group=Scan, color=Moisture))+geom_line()+
  scale_color_distiller(palette = "Spectral")+
  theme_igray()+ggtitle("NIR of cocoa",subtitle="After SG2 correction")+xlab("Wavelength, nm")+ylab("AU")

4. Regression

In NIR data, the predictors show a high degree of multicollinearity. In addition, our dataset has more predictors than observations. To avoid overfitting and other problems (model instability, poor interpretability), we should use suitable regression methodologies, e.g.:

  • Principal Component Regression (PCR)
  • Partial Least Squares Regression (PLSR)
  • Support Vector Machine regression (SVM)
  • Regularized regression (Lasso, Ridge, Elastic Net)

In this tutorial we will focus on the PLS and SVM algorithms, implemented with the caret package.
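
As a quick illustration of the problem (my own check, using column names visible in the glimpse above): the data frame has far more wavelength columns than its 72 rows, and neighbouring wavelengths are almost perfectly correlated.

dim(data3)                          # 72 rows, far more wavelength columns
cor(data3$`999.9`, data3$`1000.3`)  # close to 1 for neighbouring wavelengths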

4-1. Setting up results storing objects

We will keep the results in a dataframe, and the models in a list.

# types of applied pre-processing
data.names <- c("msc","sg1","sg2","snv") 

# types of algorithms 
alg.names <- c("pls","svm") 

# data frame to hold final results: test set R2, train set R2, with information on the pre-processing and algorithm applied
results_df <- data.frame(test_R2=c(NA),train_R2=c(NA),info=c(NA)) 

# list for storing all models
models <- vector(mode='list', length=8) 

# naming the models list elements
names(models) <- sort(levels(interaction(data.names,alg.names,sep='_')))
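
For reference, the sorted interaction above yields element names of the form <pre-processing>_<algorithm>; the loop below relies on this order (pls first, then svm, within each dataset).

names(models)
# "msc_pls" "msc_svm" "sg1_pls" "sg1_svm" "sg2_pls" "sg2_svm" "snv_pls" "snv_svm"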

4-2. Training algorithms

We will loop over all possible combinations: data pre-processed in 4 different ways and 2 different algorithms. For reproducibility, we will set the random seed to 123.

for(i in 1:4){
  # Uploading previously saved data
 dat_all <- read_csv(paste("data_",data.names[i],".csv",sep=""),show_col_types = FALSE)
  
 # Shuffling the rows...
 set.seed(123)
 dat_all <- dat_all[order(rnorm(nrow(dat_all))),]
 # ... before we split them into train (70%) and test (30%) sets 
 train_ind <- 1:floor(nrow(dat_all)*0.7)
 trainSet <- dat_all[train_ind,]
 testSet <- dat_all[-train_ind,]

  # We set trainControl to perform 3 times repeated 10-fold cross-validation:
 set.seed(123)
 fitControl<- trainControl(method = "repeatedcv",number = 10,repeats = 3,verboseIter =TRUE)
 # Each time, we pre-process the data with centring and scaling; we also remove near-zero-variance predictors
 preProc <- c("center", "scale","nzv")

 # Now, we can train the PLS algorithm on our train set
 plsFit <- train(Moisture~., data = trainSet, 
                       method = "pls", 
                       trControl = fitControl, 
                       preProc = preProc,
                       tuneLength = 25,
                       seed=123)
 # Below we save the R2 of the trained algorithm's predictions for the train and test sets,
 # with a third column holding information about the pre-processing/algorithm used
  pls_results <- data.frame(
    test_R2=c(caret::R2(predict(plsFit,testSet),testSet$Moisture)), #test
    train_R2=c(caret::R2(predict(plsFit,trainSet),trainSet$Moisture)) # train
  ) %>% round(3)
  results_df[2*i-1,1:2] <- pls_results
  results_df[2*i-1,3] <- names(models)[2*i-1]
  
  # Here we save the trained model
  models[[2*i-1]]<- plsFit
  
  # Now, we train the SVM linear algorithm
  # We will use `expand.grid` to find an optimal C (cost) value
  grid_svm <- expand.grid(C = 2^seq(-10,2,0.1))
  svmFit <- train(Moisture~., data = trainSet, 
                           method = "svmLinear", 
                           trControl = fitControl, 
                           preProc = preProc,
                           tuneGrid=grid_svm,
                           tuneLength = 25)
  
  # As with PLS, we save the results predicted by the trained SVM model
  svm_results <- data.frame(
    test_R2=c(caret::R2(predict(svmFit,testSet),testSet$Moisture)), #test
    train_R2=c(caret::R2(predict(svmFit,trainSet),trainSet$Moisture)) # train
  ) %>% round(3) 
  results_df[2*i,1:2] <- svm_results
  results_df[2*i,3] <- names(models)[2*i]
  
  # Here we save the svm trained model
  models[[2*i]]<- svmFit
}
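
A side note on the split inside the loop: the rows are shuffled with rnorm() and then cut at 70%. An alternative worth trying (a sketch only, not used here) is a split stratified on the response via caret::createDataPartition; the Summary returns to this point.

# Stratified 70/30 split on the response (sketch, not run in this tutorial)
set.seed(123)
train_ind <- createDataPartition(dat_all$Moisture, p = 0.7, list = FALSE)[, 1]
trainSet <- dat_all[train_ind,]
testSet <- dat_all[-train_ind,]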

We can save results:

# After the calculations are finished, we can save both the data frame with coefficients of determination ...
write.csv(results_df,"results_df.csv")

# ...and all the models.
saveRDS(models,"models.rds")
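
In a later session, the saved objects can be restored without re-running the training:

# Restore the results table and the fitted models
results_df <- read.csv("results_df.csv", row.names = 1)
models <- readRDS("models.rds")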

5. Results

Let’s have a look at the results:

# The info column (e.g. "msc_pls") is first split into pre-processing and algorithm
results_df %>% separate(info, into=c("preproc","algorithm"), sep="_") %>% 
  dplyr::select(preproc,algorithm,test_R2,train_R2) %>% arrange(desc(test_R2),desc(train_R2)) %>% 
  rename('pre-processing'=preproc) %>% 
  mutate_if(is.numeric,round,2) %>% kable() %>% kable_styling(full_width=FALSE)
pre-processing   algorithm   test_R2   train_R2
msc              pls            0.94       0.91
msc              svm            0.90       0.87
sg2              pls            0.90       0.90
sg1              svm            0.90       0.87
sg1              pls            0.90       0.84
sg2              svm            0.88       0.90
snv              svm            0.87       0.78
snv              pls            0.87       0.78

The PLS algorithm gave the best performance when trained on MSC-corrected data.

Let’s now visualize the results:

# Here we can see which model did best
results_df %>% select(test_R2,train_R2,info) %>% 
  pivot_longer(c("test_R2","train_R2")) %>% 
  ggplot(aes(info,value,group=info, color=name))+geom_jitter(size=4,stroke=2, alpha=0.5, width=0.1)+theme_igray()+
  labs(title="Coefficient of determination",
       subtitle = "By pre-processing and algorithm",
       colour = "Set",
       x="Pre-processing / algorithm",
       y="Value")+
  theme(text = element_text(size=15),
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, face = "plain"))

Only the sg2_pls combination gave balanced results. Nearly all trained models showed underfitting, while sg2_svm was overfitted.

5-1. Spectral ranges importance

We can check which wavelengths were actually the most important in the regression analysis:

varImportance <- varImp(models[["sg2_pls"]])$importance %>% 
  as.data.frame() %>% 
  bind_cols(Wavelength=as.numeric(colnames(data_sg2)[-c(1,1557)])) %>% 
  rename(Importance=Overall) %>% 
  bind_cols(Spectrum=as.numeric(t(data_msc[1,-c(1,1557)])))


plot <- ggplot(varImportance,aes(x=Wavelength,y=Spectrum,color=Importance))+geom_point(size=1.5)+
scale_color_gradientn(colours = rainbow(5))+
  ggtitle("Wavelength importance")+xlab("Wavelength, nm")+ylab("AU")+theme_igray()

q <- ggplotly(plot)
q <- q  %>% layout(dragmode = "pan")
q

As expected, the wavelengths around 1450 nm (O-H stretching: first vibrational overtone) and 1940 nm (combination of the H-O-H bend and O-H stretching) contributed most to the regression.

6. Summary

We used the caret package to implement two algorithms:

  • Partial Least Squares
  • Support Vector Machines

for regression of a multidimensional dataset.

We trained both algorithms on data pre-processed in four ways:

  • Multiplicative Scatter Correction
  • Standard Normal Variate
  • Savitzky-Golay smoothing with 1st derivative
  • Savitzky-Golay smoothing with 2nd derivative

The best results were given by the PLS algorithm trained on MSC-corrected data.

Only in one case were the results balanced; most of the time, the trained models were under-fitting. The reason for the under-fitting might be the way the train and test sets were split.


  1. Agussabti, Rahmaddiansyah, Satriyo P., Munawar A. A., “Data analysis on near infrared spectroscopy as a part of technology adoption for cocoa farmer in Aceh Province, Indonesia”, Data in Brief, Volume 29, 2020, 105251, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2020.105251.↩︎

  2. Rinnan Å., van den Berg F., Balling Engelsen S.B., “Review of the most common pre-processing techniques for near-infrared spectra”, TrAC Trends in Analytical Chemistry, Volume 28, Issue 10, 2009, Pages 1201-1222, https://doi.org/10.1016/j.trac.2009.07.007.↩︎