library(tidyverse)
library(knitr)
library(caret)
library(readxl)
library(RColorBrewer)
library(e1071)
library(ggthemes)
library(plotly)
library(kableExtra)
Near-infrared (NIR) spectroscopy is a methodology widely used in chemistry and many other industries (e.g., medicine, physics, pharma, food, etc.). It is fast, non-destructive, and requires very little sample preparation. If we can sacrifice some of the accuracy of a benchtop NIR instrument, we can buy a relatively cheap, handheld device (the size of a computer mouse) that can work as an Internet of Things (IoT) sensor.
The NIR spectrum comprises combination bands as well as overtones, so we need to apply chemometrics to access the information scrambled in it.
In this tutorial we will train two algorithms on a portion of the NIR data (the training set), and then assess their performance on a testing set.
According to the description of the dataset: This dataset contains absorbance spectrum (NIR) in wavelength range from 1000 to 2500 nm for a total of 72 bulk intact cocoa bean samples. Each bulk amounted 50g of intact beans. Inner quality parameters were measured as moisture content (%) and fat content (%). We employ NIRS method as a part of technology adoption for cocoa farmers in Aceh province, Indonesia. The NIRS technology can rapidly and simultaneously predict those both quality parameters of intact cocoa beans.[…].
Both the publication 1 and the data are publicly available under licence CC BY 4.0 here.
We start by loading the data:
dat1 <- read_excel("NIRS spectrum Cocoa Beans.xlsx")
Let’s print a portion of the data:
dat1[1:10,1:10]
## # A tibble: 10 × 10
## NIRS Spectra…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Acquired in w… NA NA NA NA NA NA NA
## 2 Bulk cocoa sa… NA NA NA NA NA NA NA
## 3 quality param… NA NA NA NA NA NA NA
## 4 <NA> NA NA NA NA NA NA NA
## 5 <NA> NA NA NA NA NA NA NA
## 6 sample Num 1000. 1000. 1001. 1001. 1001. 1002. 1002.
## 7 1 0.455 0.455 0.455 0.455 0.455 0.455 0.455
## 8 2 0.485 0.485 0.485 0.485 0.485 0.485 0.485
## 9 3 0.504 0.504 0.504 0.504 0.504 0.504 0.504
## 10 4 0.470 0.470 0.470 0.470 0.470 0.470 0.470
## # … with 2 more variables: ...9 <dbl>, ...10 <dbl>, and abbreviated variable
## # name ¹`NIRS Spectral data of intact Cocoa Bean samples`
We will rename the columns using the 6th row, then remove the first six rows:
dat1a <- dat1 %>% setNames(as.character(dat1[6,])) %>% dplyr::select(-1) %>% slice(-c(1:6))
For easier data manipulation, we will rename the response variable (Moisture Content (%)) and remove the Fat Content (%) variable:
data2 <- dat1a %>% rename_at(vars(starts_with('Moisture')),~'Moisture') %>% dplyr::select(-(ncol(dat1a)))
Now, we can relocate our response variable (Moisture) to
the beginning of our dataset:
data2 <- data2 %>% relocate(Moisture,.before=1)
Let’s print out the first few variables:
data2 %>% dplyr::select(1:10) %>% glimpse()
## Rows: 72
## Columns: 10
## $ Moisture <chr> "8.17", "7.52", "8.52", "8.65", "7.66", "8.18", "7.86", "8.34…
## $ `999.9` <dbl> 0.4552450, 0.4852818, 0.5041278, 0.4700770, 0.4953564, 0.4538…
## $ `1000.3` <dbl> 0.4551343, 0.4854074, 0.5037841, 0.4697611, 0.4953353, 0.4531…
## $ `1000.7` <dbl> 0.4549861, 0.4854073, 0.5035499, 0.4697801, 0.4955131, 0.4533…
## $ `1001.1` <dbl> 0.4550090, 0.4854785, 0.5039826, 0.4698518, 0.4955038, 0.4536…
## $ `1001.4` <dbl> 0.4547225, 0.4853990, 0.5039921, 0.4697831, 0.4953027, 0.4537…
## $ `1001.8` <dbl> 0.4546516, 0.4852844, 0.5035763, 0.4696004, 0.4951282, 0.4532…
## $ `1002.2` <dbl> 0.4545161, 0.4851503, 0.5036557, 0.4695199, 0.4951181, 0.4535…
## $ `1002.6` <dbl> 0.4545274, 0.4851739, 0.5036467, 0.4694393, 0.4953395, 0.4537…
## $ `1003` <dbl> 0.4545811, 0.4854389, 0.5033267, 0.4696478, 0.4951361, 0.4536…
All data should be numeric. Let’s check if that is the case:
data2 %>% select_if(~!is.numeric(.x)) %>% head()
## # A tibble: 6 × 1
## Moisture
## <chr>
## 1 8.17
## 2 7.52
## 3 8.52
## 4 8.65
## 5 7.66
## 6 8.18
We need to convert this character column to numeric:
data3 <- data2 %>% mutate_if(is.character, as.numeric)
Let’s print the summary of the response variable:
data3 %>% dplyr::select(Moisture) %>% summary()
## Moisture
## Min. : 6.740
## 1st Qu.: 8.088
## Median : 8.785
## Mean : 9.036
## 3rd Qu.: 9.562
## Max. :12.080
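A quick histogram (a minimal base-R sketch; the breaks value is an arbitrary choice) gives a feel for the spread of the response:
hist(data3$Moisture, breaks = 15,
     main = "Moisture content of cocoa beans", xlab = "Moisture, %")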
Now, we can visualize the dataset, with colour depending on the moisture values:
data3 %>% dplyr::select(-2) %>% mutate(Scan=1:nrow(data3)) %>%
pivot_longer(cols=-c(Moisture,Scan)) %>% mutate(name=as.numeric(name)) %>%
ggplot(aes(name,value, group=Scan, color=Moisture))+geom_line()+
scale_color_distiller(palette = "Spectral")+
theme_igray()+ggtitle("Moisture in cocoa, NIR",subtitle="As received")+xlab("Wavelength, nm")+ylab("AU")
According to many sources (e.g., 2), various types of spectral corrections can help in class separation/regression.
I usually apply: multiplicative scatter correction (MSC), standard normal variate (SNV), and first- and second-order Savitzky-Golay derivatives (SG1, SG2).
The derivation tool I prefer is the mdatools package (only for the MSC correction did I use the pls package). Unfortunately, mdatools is not compatible with the caret package used here, hence I had to save the corrected data to csv files, which were then reloaded in the course of the analysis. Detaching the mdatools library, or even not attaching it at all (i.e., calling it via mdatools::), did not do the trick.
Still, you can find the (“hushed”, i.e. commented-out) correction code below.
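For intuition, SNV is simply a row-wise standardization of each spectrum; a minimal base-R sketch (equivalent in spirit to mdatools::prep.snv, not the code used below) could look like:
snv_manual <- function(X){
  # centre each spectrum (row) by its own mean and scale it by its own sd
  t(apply(X, 1, function(s) (s - mean(s)) / sd(s)))
}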
We can separate the independent and response variables:
response <- data3 %>% dplyr::select(1)
expl_vars <- data3 %>% dplyr::select(-1)
The MSC correction:
# data_msc1 <- pls::msc(as.matrix(expl_vars)) # note: expl_vars, so the column names below match
# scale_01 <- function(x){(x-min(x))/(max(x)-min(x))}
# data_msc2 <- response %>% bind_cols(t(apply(data_msc1,1,scale_01))) %>% data.frame() %>% setNames(colnames(data3))
# write_csv(data_msc2,"data_msc.csv")
data_msc <- readr::read_csv("data_msc.csv",show_col_types = FALSE)
data_msc %>% mutate(Scan=1:nrow(data_msc)) %>%
pivot_longer(-c("Moisture","Scan")) %>%
ggplot(aes(as.numeric(name),value, group=Scan, color=Moisture))+geom_line()+
scale_color_distiller(palette = "Spectral")+
theme_igray()+ggtitle("NIR of cocoa",subtitle="After MSC correction")+xlab("Wavelength, nm")+ylab("AU")
The SNV correction:
# data_snv <- mdatools::prep.snv(expl_vars)
# scale_01 <- function(x){(x-min(x))/(max(x)-min(x))}
# data_snv2 <- response %>% bind_cols(t(apply(data_snv,1,scale_01))) %>% data.frame() %>% setNames(colnames(data3))
# write_csv(data_snv2,"data_snv.csv")
data_snv <- readr::read_csv("data_snv.csv",show_col_types = FALSE)
data_snv %>% mutate(Scan=1:nrow(data_snv)) %>%
pivot_longer(-c("Moisture", "Scan")) %>%
ggplot(aes(as.numeric(name),value, group=Scan, color=Moisture))+geom_line()+
scale_color_distiller(palette = "Spectral")+
theme_igray()+ggtitle("NIR of cocoa",subtitle="After SNV correction")+xlab("Wavelength, nm")+ylab("AU")
The SG1 correction:
# SG1
# We set the window width for the smoothing at 15, and the polynomial order at 3:
# sg.width <- 15
# data_sg1a <- mdatools::prep.savgol(expl_vars, width = sg.width, porder = 3, dorder = 1)
# data_sg1b <- response %>% bind_cols(t(apply(data_sg1a,1,scale_01))) %>% data.frame() %>% setNames(colnames(data3))
# write_csv(data_sg1b,"data_sg1.csv")
data_sg1 <- readr::read_csv("data_sg1.csv",show_col_types = FALSE)
data_sg1 %>% mutate(Scan=1:nrow(data_sg1)) %>%
pivot_longer(-c("Moisture", "Scan")) %>%
ggplot(aes(as.numeric(name),value, group=Scan, color=Moisture))+geom_line()+
scale_color_distiller(palette = "Spectral")+
theme_igray()+ggtitle("NIR of cocoa",subtitle="After SG1 correction")+xlab("Wavelength, nm")+ylab("AU")
The SG2 correction:
# sg.width <- 15
# data_sg2a <- mdatools::prep.savgol(expl_vars, width = sg.width, porder = 3, dorder = 2)
# data_sg2b <- response %>% bind_cols(t(apply(data_sg2a,1,scale_01))) %>% data.frame() %>% setNames(colnames(data3))
# write_csv(data_sg2b,"data_sg2.csv")
data_sg2 <- readr::read_csv("data_sg2.csv",show_col_types = FALSE)
data_sg2 %>% mutate(Scan=1:nrow(data_sg2)) %>%
pivot_longer(-c("Moisture", "Scan")) %>%
ggplot(aes(as.numeric(name),value, group=Scan, color=Moisture))+geom_line()+
scale_color_distiller(palette = "Spectral")+
theme_igray()+ggtitle("NIR of cocoa",subtitle="After SG2 correction")+xlab("Wavelength, nm")+ylab("AU")
In NIR data, the predictors show a high degree of multicollinearity. In addition, our dataset has more predictors than observations. To avoid overfitting and other problems (model instability, poor interpretability), we should use suitable regression methodologies; a quick check of the collinearity is sketched below.
In this tutorial we will focus on two of them, PLS and SVM, and we will apply the caret package.
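As a quick illustration of the multicollinearity (a sketch using the data3 object created above), neighbouring wavelength channels are almost perfectly correlated, and the wavelengths far outnumber the samples:
# correlation between two neighbouring wavelength columns: essentially 1
cor(data3[[2]], data3[[3]])
# far more predictors (wavelengths) than observations (samples)
dim(data3)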
We will keep the results in a dataframe, and the models
in a list.
# types of applied pre-processing
data.names <- c("msc","sg1","sg2","snv")
# types of algorithms
alg.names <- c("pls","svm")
# data frame to hold final results: test set R2, train set R2, with information on the pre-processing and algorithm applied
results_df <- data.frame(test_R2=c(NA),train_R2=c(NA),info=c(NA))
# vector for storing all models
models <- vector(mode='list', length=8)
# naming the models list elements
names(models) <- sort(levels(interaction(data.names,alg.names,sep='_')))
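The interaction()/sort() trick above produces the names in exactly the order the loop below fills them (pls before svm within each pre-processing type):
names(models)
# i.e., "msc_pls" "msc_svm" "sg1_pls" "sg1_svm" "sg2_pls" "sg2_svm" "snv_pls" "snv_svm"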
We will loop over all possible combinations: data pre-processed in four different ways and two different algorithms. For reproducibility, we will set the seed to 123 before each random step.
for(i in 1:4){
  # Load the previously saved pre-processed data
  dat_all <- read_csv(paste("data_",data.names[i],".csv",sep=""),show_col_types = FALSE)
  # Shuffle the rows...
  set.seed(123)
  dat_all <- dat_all[order(rnorm(nrow(dat_all))),]
  # ... before we split them into train (70%) and test (30%) sets
  train_ind <- 1:floor(nrow(dat_all)*0.7)
  trainSet <- dat_all[train_ind,]
  testSet <- dat_all[-train_ind,]
  # We set trainControl to perform 3 times repeated 10-fold cross-validation
  fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, verboseIter = TRUE)
  # Each time, we pre-process the data with centring and scaling; we also remove near-zero-variance predictors
  preProc <- c("center", "scale", "nzv")
  # Now, we can train the PLS algorithm on our train set
  # (train() has no seed argument, so we call set.seed() right before it)
  set.seed(123)
  plsFit <- train(Moisture~., data = trainSet,
                  method = "pls",
                  trControl = fitControl,
                  preProc = preProc,
                  tuneLength = 25)
  # Below we save the R2 of the trained algorithm on the test and train sets;
  # a third column will hold information about the pre-processing/algorithm used
  pls_results <- data.frame(
    test_R2 = c(caret::R2(predict(plsFit,testSet),testSet$Moisture)),   # test
    train_R2 = c(caret::R2(predict(plsFit,trainSet),trainSet$Moisture)) # train
  ) %>% round(3)
  results_df[2*i-1,1:2] <- pls_results
  results_df[2*i-1,3] <- names(models)[2*i-1]
  # Here we save the trained model
  models[[2*i-1]] <- plsFit
  # Now, we train the linear SVM algorithm
  # We use `expand.grid` to find an optimal C (cost) value;
  # since tuneGrid is supplied, no tuneLength is needed
  grid_svm <- expand.grid(C = 2^seq(-10,2,0.1))
  set.seed(123)
  svmFit <- train(Moisture~., data = trainSet,
                  method = "svmLinear",
                  trControl = fitControl,
                  preProc = preProc,
                  tuneGrid = grid_svm)
  # As with PLS, we save the R2 values of the trained svm model
  svm_results <- data.frame(
    test_R2 = c(caret::R2(predict(svmFit,testSet),testSet$Moisture)),   # test
    train_R2 = c(caret::R2(predict(svmFit,trainSet),trainSet$Moisture)) # train
  ) %>% round(3)
  results_df[2*i,1:2] <- svm_results
  results_df[2*i,3] <- names(models)[2*i]
  # Here we save the trained svm model
  models[[2*i]] <- svmFit
}
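After the loop finishes, each element of models is a full caret train object, so we can, for instance, inspect the selected hyper-parameters (the actual values depend on the run):
models[["msc_pls"]]$bestTune # optimal number of PLS components (ncomp)
models[["msc_svm"]]$bestTune # optimal SVM cost C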
We can save the results:
# After the calculations are finished, we can save both the data frame with coefficients of determination ...
write.csv(results_df,"results_df.csv")
# ...and all the models.
saveRDS(models,"models.rds")
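In a later session, both objects can be restored (note that write.csv stored the row names in the first column):
# results_df <- read.csv("results_df.csv", row.names = 1)
# models <- readRDS("models.rds")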
Let’s have a look at the results:
# We first split the info column into its pre-processing and algorithm parts
results_df %>% separate(info,into=c("preproc","algorithm"),sep="_") %>%
 select(preproc,algorithm,test_R2,train_R2) %>% arrange(desc(test_R2),desc(train_R2)) %>%
 rename('pre-processing'=preproc) %>%
 mutate_if(is.numeric,round,2) %>% kable() %>% kable_styling(full_width=FALSE)
| pre-processing | algorithm | test_R2 | train_R2 |
|---|---|---|---|
| msc | pls | 0.94 | 0.91 |
| msc | svm | 0.90 | 0.87 |
| sg2 | pls | 0.90 | 0.90 |
| sg1 | svm | 0.90 | 0.87 |
| sg1 | pls | 0.90 | 0.84 |
| sg2 | svm | 0.88 | 0.90 |
| snv | svm | 0.87 | 0.78 |
| snv | pls | 0.87 | 0.78 |
The PLS algorithm gave the best performance when trained on MSC-corrected data.
Let’s now visualize the results:
# Here we can see which model did best
results_df %>% select(test_R2,train_R2,info) %>%
pivot_longer(c("test_R2","train_R2")) %>%
ggplot(aes(info,value,group=info, color=name))+geom_jitter(size=4,stroke=2, alpha=0.5, width=0.1)+theme_igray()+
labs(title="Coefficient of determination",
subtitle = "By pre-processing and algorithm",
colour = "Set",
x="Pre-processing / algorithm",
y="Value")+
theme(text = element_text(size=15),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, face = "plain"))
Only the sg2_pls combination gave balanced results. Nearly all trained models showed underfitting (a higher R2 on the test set than on the train set), while sg2_svm was slightly overfitted.
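One simple way to quantify this balance (a quick sketch on the results_df built above) is the gap between the test and train R2:
# models with the smallest |gap| are the most balanced
results_df %>% mutate(gap = test_R2 - train_R2) %>% arrange(abs(gap))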
We can check which wavelengths actually were most important in the regression analysis:
varImportance <- varImp(models[["sg2_pls"]])$importance %>% as.data.frame() %>%
  bind_cols(Wavelength=as.numeric(colnames(data_sg2)[-c(1,1557)])) %>%
  rename(Importance=Overall) %>%
  bind_cols(Spectrum=as.numeric(t(data_msc[1,-c(1,1557)])))
plot <- ggplot(varImportance,aes(x=Wavelength,y=Spectrum,color=Importance))+geom_point(size=1.5)+
scale_color_gradientn(colours = rainbow(5))+
ggtitle("Wavelength importance")+xlab("Wavelength, nm")+ylab("AU")+theme_igray()
q <- ggplotly(plot)
q <- q %>% layout(dragmode = "pan")
q
As expected, the wavelengths around 1450 nm (the first overtone of the O-H stretching vibration) and 1940 nm (a combination of the H-O-H bend and the O-H stretch) contributed most to the regression.
We used the caret package to implement two algorithms, PLS and linear SVM, for regression on a multidimensional dataset.
We trained both algorithms on data pre-processed in four ways: MSC, SNV, SG1, and SG2.
The best results were obtained with the PLS algorithm trained on MSC-corrected data.
Only in one case were the results balanced; most of the trained models were under-fitting. The reason for the under-fitting might be the way the train and test sets were split.
Agussabti, Rahmaddiansyah, Satriyo P., Munawar A. A., “Data analysis on near infrared spectroscopy as a part of technology adoption for cocoa farmer in Aceh Province, Indonesia”, Data in Brief, Volume 29, 2020, 105251, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2020.105251.↩︎
Rinnan Å., van den Berg F., Balling Engelsen S.B., “Review of the most common pre-processing techniques for near-infrared spectra”, TrAC Trends in Analytical Chemistry, Volume 28, Issue 10, 2009, Pages 1201-1222, https://doi.org/10.1016/j.trac.2009.07.007.↩︎