Introduction
In this document we record all data manipulation and analysis related to the infrared spectra from the project: The structure and interaction of silica in plant cell walls.
Loading data into R
First, we create an object called names, a character vector in which we store the names of the files with the .CSV extension contained in this folder.
names <- list.files(pattern = '.CSV')
Then, we use the function lapply to apply the function read.csv to each element of names and save each of the resulting data.frames into R.
spectra.list <- lapply(names,
read.csv,
header = F)
Now, we can save the frequencies of incident light as a vector called wavenumbers. The code spectra.list[[1]][1] selects the first element of the list spectra.list using the index operator for lists [[1]], which is a data.frame. Then, we select the first column of that data.frame using the index operator [1]:
wavenumbers <- unlist(spectra.list[[1]][1])
The object wavenumbers is a numeric vector with 1869 elements, one for each frequency used by the spectrometer in the experiment.
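A minimal check of this vector (the exact values depend on the instrument export):
length(wavenumbers) # 1869 frequencies
head(wavenumbers, 3)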
As before, we can use the function lapply to apply the index operator [ and select only the second column of each data.frame, which contains the absorbance at each frequency.
spectra.list2 <- lapply(spectra.list, '[', 2)
To make the data easier to manipulate, we can transform the list of columns called spectra.list2 into a data.frame as follows:
spectra.df <- as.data.frame(
t(as.data.frame(spectra.list2))
)
The object spectra.df has 318 rows, i.e. 318 spectra from different samples, and 1869 columns, one per infrared frequency.
Since the function t() returns a matrix (which we convert back to a data.frame), we have to re-assign the corresponding column and row names to spectra.df:
rownames(spectra.df) <- names
colnames(spectra.df) <- wavenumbers
# gsub('.{6}$', '', names)
Tidying up the data.frame
Since the names of the samples currently have the following structure:
head(names[1:3])
## [1] "125-1.CSV" "125-2.CSV" "125-3.CSV"
We can get rid of the unnecessary characters in the sample names, such as the extension and symbols such as " or .. For that we can use the function gsub() with regex, or regular expressions, to replace these characters with nothing, i.e. ''.
The regular expression for this replacement is built as follows:
- . matches any single character
- {4} repeats the previous match exactly 4 times
- $ anchors the match at the end of each name
Together, this tells R to replace the last 4 characters of each name with nothing, or ''.
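Before applying this pattern to all the names, we can check it on a single made-up file name (a minimal sanity check):
gsub('.{4}$', '', '125-1.CSV') # drops the trailing ".CSV"
## [1] "125-1"
Applied to all the row names of spectra.df: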
names2 <- gsub('.{4}$',
'',
rownames(spectra.df))
head(names2[1:4])
## [1] "125-1" "125-2" "125-3" "126-1"
There are some samples that have longer names, such as:
head(names2[292:318])
## [1] "P1-1_Thu May 09 07-26-09 2019 (GMT+02-00)"
## [2] "P1-2_Thu May 09 07-26-09 2019 (GMT+02-00)"
## [3] "P1-3_Thu May 09 07-26-09 2019 (GMT+02-00)"
## [4] "P2-1_Thu May 09 07-26-09 2019 (GMT+02-00)"
## [5] "P2-2_Thu May 09 07-26-09 2019 (GMT+02-00)"
## [6] "P2-3_Thu May 09 07-26-09 2019 (GMT+02-00)"
We can also save just the information that is useful for displaying in tables and plots:
names2[292:318] <- gsub('.{41}$',
'',
rownames(
spectra.df
)[292:318])
head(names2[292:318])[1:4]
## [1] "P1-1" "P1-2" "P1-3" "P2-1"
Finally, we assign the curated names to our data frame,
spectra.df
rownames(spectra.df) <- names2
Plotting initial data
Since we have joined several spectra from different samples, we can try to visualize what we are dealing with.
Since we already have the sample names, it is handy to use a factor that helps us differentiate experimental replicates:
options(width = 30)
cols <- factor(gsub('.{2}$', '', names2))
head(cols)
## [1] 125 125 125 126 126 126
## 106 Levels: 125 126 ... P9
If we use the factor cols, we can plot the three replicates of one sample in the same color.
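A quick check that each sample indeed appears three times (a minimal sketch):
head(table(cols), 3) # number of spectra per sample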
Now, we can use a loop to plot each one of the rows of our
data.frame:
win.graph()
for(i in 1:2){
plot(wavenumbers,
spectra.df[i,],
axes = F,
xlab = '',
ylab = '',
xlim = c(4000, 400),
ylim= c(0,0.2),
type = 'l',
col =cols[i]
)
par(new = T)
}
box()
axis(1)
axis(2)
title(main = '',
xlab = expression(
paste('Wave number (cm'^'-1',
')')
),
ylab ='absorbance (a.u.)')
abline(v = 3327)
abline(v = 2917)
abline(v = 2850)
abline(v = 1730)
abline(v = 1623)
raw spectra of triplicated samples
As we can see in Figure \(\ref{fig:figs}\), the range between 2500 and 2000 \(cm^{-1}\) shows variability due to experimental noise and background correction. In order to use multivariate methods and extract the chemical information from the dataset, we can select a region of interest (ROI) that contains characteristic bands of molecules whose concentration may differ between samples, such as silicon oxides or lignin monomers.
range selection
Now, we can check where that ROI is in our data frame. Since the frequencies are arranged along the columns, we can search for the columns whose names lie between 1700 and 400 \(cm^{-1}\):
colnames(spectra.df)[c(1,676)]
## [1] "399.1989" "1700.935"
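As an alternative to hard-coding the column positions, the ROI columns could also be located by value; a minimal sketch, assuming wavenumbers is the numeric vector of frequencies defined earlier:
roi.cols <- which(wavenumbers >= 399 & wavenumbers <= 1701)
length(roi.cols) # should correspond to the 676 columns selected below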
Once we know where the ROI is in the data frame, we can create a
subset called range1 to store the spectra of all samples
within this range, and plot it:
range1 <- spectra.df[,c(1:676)]
wavenumbers1 <- wavenumbers[c(1:676)]
for(i in 1:length(rownames(range1))){
plot(wavenumbers1,
range1[i,],
axes = F,
xlab = '',
ylab = '',
xlim = c(1700, 400),
ylim= c(0,0.2),
type = 'l',
col =cols[i]
)
par(new = T)
}
box()
axis(1)
axis(2)
title(main = 'raw spectra - ROI',
xlab = expression(
paste('Wave number (cm'^'-1',
')'
)),
ylab ='absorbance (a.u.)')
ROI spectra of triplicated samples
mean spectra calculation
Since each sample has three replicated spectra, we can calculate the mean of each three samples as follows:
- First, we create a vector to tell R which samples to look for. Using the function unique() we extract the unique values without their repetitions:
options(width = 30)
search.vector <- unique(unlist(cols))
head(search.vector)
## [1] 125 126 127 128 129 130
## 106 Levels: 125 126 ... P9
Once we have the name of each sample, we can search for the \(\dfrac{318}{3}=106\) triads of spectra and, knowing where they are, calculate the mean of each set of three replicates.
First, we create a list in which we are going to save the position of each three spectra:
index <- vector('list', 106) # pre-allocate one slot per sample
Now we can again use regular expressions to find every spectrum whose name matches a given sample:
As an example:
The regular expression that we will use this time is
(?=.*<search>) where <search> is
replaced with the sample’s name.
This will find any element of the vector rownames(spectra.df) that matches the character string in search.vector. For example, if we search for the first sample, search.vector[1], which is sample 125:
as.character(search.vector[1]) # the name of the first sample
## [1] "125"
The search will tell us the positions in rownames(spectra.df) where we can find the character string '125':
which(
grepl(
paste0('(?=.*',
as.character(search.vector[1]),
')'
),
rownames(spectra.df),
perl=T
)
)
## [1] 1 2 3
Then we can confirm:
rownames(spectra.df)[c(1, 2, 3)]
## [1] "125-1" "125-2" "125-3"
This process can be automated using a for loop in R:
for (i in 1:106){
index[[i]] <- which(
grepl(
paste0('(?=.*',
as.character(search.vector[i])
,')'),
rownames(spectra.df),
perl=T
)
)
}
search.vector[2]
## [1] 126
## 106 Levels: 125 126 ... P9
rownames(spectra.df)[c(index[[2]])]
## [1] "126-1" "126-2" "126-3"
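As a side note, the same triplicate positions could be obtained without regular expressions by splitting the row positions on the cols factor; a compact alternative sketch (not used below, and note that the resulting list is ordered by factor level rather than by order of appearance):
index.alt <- split(seq_len(nrow(spectra.df)), cols) # row positions grouped by sample
rownames(spectra.df)[index.alt[['126']]] # the three replicates of sample 126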
Once we know where each triplicate is, we can proceed to calculate the means:
- First, we create a matrix to save the mean spectra that will result from the calculation.
mean <- matrix(ncol= ncol(range1),
nrow = nrow(range1)/3)
- Then, we assign to this empty matrix, mean, its corresponding column and row names:
# mean <- as.data.frame(mean)
colnames(mean) <- colnames(range1)
rownames(mean) <- search.vector
Then we can loop through the columns, indexing each wave number one after another; each iteration of the loop corresponds to a new value of j. The possible values of j are defined by in 1:ncol(range1), where ncol gives us the number of columns of range1, i.e. the number of wave numbers along the ROI.
At the same time, i takes each integer value between 1 and nrow(mean).
At step 1 of both loops, we save in the position mean[1,1] the value returned by the function mean(), which by default calculates the arithmetic mean (its default argument trim = 0 means a fraction of 0 observations is trimmed from each end of the three observations before the mean is computed).
In order to calculate this mean estimator from a sample of 3 replicates, we index the spectral matrix range1 at the three positions found for each sample when the index list was created.
At this point the first absorbance value for sample 125 at wave number 399.1 is still empty, as are all the other slots of this matrix.
mean[1,1]
## [1] NA
For example, for the first sample and the first wave number, we will save in mean[1,1] the mean of the three replicates of sample 125 at a wave number of 399.1. In order to accomplish this, we select the first position of the list, index[[1]]:
index[[1]]
## [1] 1 2 3
When we print the content of this position of the list, we can see that the samples 125-1, 125-2 and 125-3 lie in positions 1, 2, and 3, respectively, of the data.frame range1.
So the calculation will be applied to range1[index[[1]][1],1], range1[index[[1]][2],1] and range1[index[[1]][3],1], i.e. rows 1, 2, and 3 of the first column.
The process is then repeated through all of the wave numbers (columns) and samples (rows) of the data set.
for(j in 1:ncol(range1)){
for(i in 1:nrow(mean)){
mean[i,j] <- mean(c(range1[index[[i]][1],j],
range1[index[[i]][2],j],
range1[index[[i]][3],j]
) )
}
}
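For reference, the same calculation can be written without the nested loop; a compact sketch, assuming index already holds the three row positions of each sample:
mean.alt <- t(sapply(index, function(rows) colMeans(range1[rows, ]))) # loop-free column means per triplet
dim(mean.alt) # 106 rows (samples) x 676 columns (wave numbers)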
Then, we turn this matrix into a data.frame:
mean <- as.data.frame(mean)
Once we have calculated the mean for each sample, we can plot the resulting mean spectra using the same factor as before, cols, to color each mean spectrum with the same color as its triplicates.
cols.means <- as.factor(search.vector)
win.graph()
for(i in 1:length(rownames(mean))){
plot(wavenumbers1,
mean[i,],
axes = F,
xlab = '',
ylab = '',
xlim = c(4000, 400),
ylim= c(0,0.135309),
type = 'l',
col =cols.means[i]
)
par(new = T)
}
box()
axis(1)
axis(2)
title(main = '',
xlab = expression(
paste('Wave number (cm'^'-1',
')')
),
ylab ='absorbance (a.u.)')
abline(v = 3327)
abline(v = 2917)
abline(v = 2850)
abline(v = 1730)
abline(v = 1650)
Calculated mean spectra
Once we have calculated the mean estimator for each multivariate spectrum, we can proceed to apply pre-treatments in order to clean the data and extract as much chemical information as we can.
baseline correction
In this experiment we have performed an analogue of the baseline correction described by Beleites et al. 2020:
library(hyperSpec)
# hyperSpec package functions and
# data are loaded in this session
We create an object that can be identified and manipulated by the functions of the hyperSpec package.
spc <- new('hyperSpec', # The class of the object
spc= mean, # the spectra matrix
wavelength = wavenumbers1
# independent variable, whether wave number
#or wave length
)
Then, in an object called bend, we can save the result of
wl.eval applied to spc. This function
generates a baseline ‘reference spectra’:
bend <- 0.1 * wl.eval(spc,
function (x)
x^6+x^5+x^4+x^3+x^2,
normalize.wl =
normalize01)
- function (x) x^6+x^5+x^4+x^3+x^2 defines the polynomial used to fit the baseline to the supporting points in the lower region of the spectra contained in spc; with this function the baseline 'reference spectra' is calculated.
- normalize.wl = normalize01 is the function used to transform the wave numbers before evaluating the polynomial; normalize01 maps the wave number range onto the interval \([0,1]\).
- The multiplication by \(0.1\) is the normalization chosen for these spectra.
Now we can use the rubberband method to estimate the baseline. Setting noise to \(1\cdot10^{-4}\) allows us to take advantage of the low noise present in these spectra (a value of noise = 4 would set an interval of 2 standard deviations).
bl <- spc.rubberband(spc+bend,
noise = 1e-4,
df = 20) - bend
The base-line correction method selected for this work is one of many, and tuning the correction parameters can be the subject of a design of experiments, using as a response variable a quality measure of the analysis that is going to be performed after the correction: for example \(Q^2\) for PCA, or \(R_{adj}\) and \(RMSEP_{CV}\) for multiple linear regression models, as suggested by Hovde 2010.
Now, we can visualize the results of these baseline corrections.
First, the spectra before correction and the estimated baseline ‘reference spectra’:
labels (spc, ".wavelength") <-
expression(paste(
'Wave number (cm'^'-1',
')'))
labels (spc, "spc") <-
expression(paste('Absorbance (a.u.)'))
plot(spc, wl.reverse = TRUE)
plot(bl, add=TRUE, col=2,wl.reverse = TRUE)
Mean spectra and rubberband baseline
And the heart of the rubberband method, the bending of the spectra to add support points in the convex part of the spectra:
sum <- spc+bend
plot(sum,wl.reverse = TRUE)
plot(bend, add=TRUE, col=2,wl.reverse = TRUE)
bent mean spectra and bent baseline
Then, the corrected spectra, which are calculated by subtracting the
baseline bl from the spectra spc:
spc3 <- spc - bl
spc3 <- spc3 + (min(spc3)*-1)
# We add the minimum value
#which is negative to have only positive
#values
plot(spc3,wl.reverse = TRUE)
baseline corrected mean spectra
Now, we can extract the corrected spectra from the
hyperSpec object and save it as a data.frame
to handle the information in an easier way later.
corrected1 <- as.data.frame(spc3[1:106])
corrected <- as.data.frame(corrected1[,1])
corrected <- corrected + (min(corrected)*-1)
# shifting upwards to prevent negative values
Up to this point, we have calculated the mean spectra and corrected the base-line of each spectrum.
multiplicative scatter correction (MSC)
Since this is a popular pre-treatment (Rinnan 2009, Wu 2018), we tried it in this study.
library(pls)
correctedMSC <- msc(as.matrix(mean))
for (i in 1:length(rownames(correctedMSC))){
plot(as.numeric(colnames(correctedMSC)),
correctedMSC[i,],
xlab = '',
ylab = '',
axes = F,
type = 'l',
xlim = c(1700,400),
ylim = c(0,0.12),
col = cols[i])
par(new = T)
}
box()
axis(1)
axis(2)
title(main = '',
xlab = expression(paste(
'Wave number (cm'^'-1',
')')),
ylab = 'Absorbance (a.u.)')
MSC correction of mean spectra
Since no noise reduction was identified, this option was dismissed for the moment.
metadata
First, we load the metadata table, created from the data table ‘Datos_para_colombia.xlsx’ available here:
library(readxl)
metadata <- read_excel("metadata.xlsx")
The object called metadata is a data.frame with 88 rows and 8 columns. The rows correspond to samples and the columns to variables. The variables in the columns are:
options(width = 30)
colnames(metadata)
## [1] "sample" "class" "N %"
## [4] "C %" "Si" "Ca"
## [7] "Mg" "P"where:
sample: Contains the alpha numeric ID given to each sample. class=numeric.class: The group of the experiment from where the sample comes from class=character.N %: The nitrogen percent quantified. class=numeric.C %: The nitrogen percent quantified. class=numeric.Si: Silicon quantified by ICP-OES class=numeric.Ca: Calcium quantified by ICP-OES class=character.Mg: magnesium quantified by ICP-OES class=character.P: Phosphorous quantified by ICP-OES class=character.
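The dimensions and column classes described above can be verified directly; a quick check (the exact output depends on the spreadsheet):
dim(metadata) # number of rows and columns
str(metadata) # class of each column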
Since the raw data is not fully cleaned (as is often the case with experimental raw data), we can tidy up this table as far as we can using R.
First we can search if there is any row of the metadata
that has its label or ID missing using the column
sample:
which(is.na(metadata$sample))
## [1] 17
Row 17 has an NA, or blank value, in the column sample. We can proceed to remove this row:
metadata <- metadata[-c(17),]
which(is.na(metadata$sample))
## integer(0)
integer(0) means that once we have deleted row 17, there
are no blank spaces whatsoever.
matching samples that have metadata with samples that have spectra
We can now try to find where in the rows of corrected the samples listed in metadata$sample are. Doing this, we will know which samples that have quantification and classification data also have spectra available, and where.
positions <- vector('list', 87)
# the same size as metadata$sample
for (i in 1:87){
positions[[i]] <- which(
grepl(
paste0('(?=.*',
as.character(
metadata$sample[i]),
')'),
rownames(corrected),
perl=T
)
)
)
In positions, an object of class = list, we save for each sample where in the data frame corrected that sample is. For instance:
metadata$sample[1]
## [1] 125
positions[[1]]
## [1] 1
rownames(corrected)[positions[[1]]]
## [1] "125"
In the last chunk we print the name of the spectrum, rownames(corrected), selected by the indexing list positions[[1]]: it is 125, the same sample as in the first row of metadata.
This seems a little bit obvious, but in the raw data there are several spectra rows that have no matching quantification data and, vice versa, several samples that have quantification information but no spectra available.
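For completeness, the metadata samples without exactly one matching spectrum can be listed from positions; a small sketch, assuming positions has one entry per row of metadata:
metadata$sample[sapply(positions, length) != 1] # IDs with no (or an ambiguous) spectrum match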
Since we now know where in the spectral matrix each sample of the metadata data set is, we can add a new column that records where each sample of metadata is in the corrected matrix:
for(i in 1:length(metadata$sample)) {
if(length(positions[[i]]) == 1){
metadata$spectra[i] <-
rownames(corrected)[positions[[i]]]
}else{
metadata$spectra[i] <- NA
}
}
## Warning: Unknown or
## uninitialised column:
## `spectra`.
(The warning only means that the spectra column did not exist before and is created by the first assignment.)
Now we can check whether the indexing list positions is doing its job. Creating the data.frame called compare, we list side by side the names of the metadata samples and the names of the corrected samples:
compare <- data.frame(metadata =
metadata$sample,
spectra =
metadata$spectra)
head(compare)
## metadata spectra
## 1 125 125
## 2 126 126
## 3 127 127
## 4 128 128
## 5 129 129
## 6 130 130
Since we now know where in corrected the spectrum of each sample from metadata is, we can add this information to the metadata table:
metadata$spectra.index <- rep(0, nrow(metadata))
for(i in 1:length(metadata$sample)) {
if(length(positions[[i]]) == 1){
metadata$spectra.index[i] <- positions[[i]]
}else{
metadata$spectra.index[i] <- NA
}
}
This is analogous to the process used for saving the names of the spectra contained in corrected. The difference is that the column metadata$spectra.index now stores exactly where the sample of the current row is among the rows of the corrected matrix.
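A quick sanity check that the stored indices point at the right spectra (rows without a spectrum are skipped via na.rm = TRUE):
all(rownames(corrected)[metadata$spectra.index] == metadata$spectra,
    na.rm = TRUE)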
Exploratory Data analysis
Hierarchical clustering
Hierarchical clustering can be used to measure multidimensional distances between objects, or samples. To use this analysis, we first have to tidy up the data and prepare it.
First, we can search for missing values in the classification column:
which(is.na(metadata$class))
## [1] 7 26
Now that we know that the samples in rows 7 and 26 have no classification, we can create a new data table without them:
metadata.class <- metadata[
-c( which(
is.na(metadata$class))),]
which(is.na(metadata.class$class))
## integer(0)
Once the samples that have no classification information have been removed, it is useful to check if there is any missing spectrum for the remaining samples:
which(is.na(metadata.class$spectra.index))
## [1] 24 26 29 32 35
Since these samples have no spectra, they can be also removed as follows:
metadata.class2 <-
metadata.class[
-c(which
(is.na(
metadata.class$spectra.index)) ),]
which(is.na(metadata.class2$spectra.index))
## integer(0)
The next thing needed for the analysis is the spectra. It is important that each classification matches its spectrum:
spectra.class <- corrected[
metadata.class2$spectra.index,]
Then we can check that we have the same samples in both tables:
compare.class <- data.frame(
classification = metadata.class2$sample,
spectra= rownames(spectra.class))
head(compare.class)
## classification spectra
## 1 125 125
## 2 126 126
## 3 127 127
## 4 128 128
## 5 129 129
## 6 130 130
df1 <- spectra.class
names.class <-
paste(metadata.class2$class, rownames(
spectra.class))
rownames(df1) <- names.class
logdf <- log10(df1[,1:676] + 1)
rownames(logdf) <- names.class
hClust1 <- hclust(dist(logdf))
plot(hClust1)
library(factoextra)
res.hk <-hkmeans(logdf,
5,
hc.metric = 'minkowski')
fviz_dend(res.hk,
cex = 0.5,
palette = "jco",
rect = TRUE,
rect_border = "jco",
rect_fill = TRUE)
colspca <- vector('character', nrow(df1))
for(i in 1:nrow(df1)){
if( grepl(
paste0('(?=.*',
'L',
')'),
metadata.class2$class[i],perl = T))
{colspca[i] <- 'black'}else{
if(grepl(
paste0('(?=.*',
'R',
')'),
metadata.class2$class[i],perl = T))
{colspca[i] <- 'red'}
else{colspca[i] <- 'green'}
}
}
pcaall <- prcomp(spectra.class)
vp <- (pcaall$sdev)^2
variance <- round(vp/sum(vp)*100,2)
coord <- pcaall$x
plot(coord[,1],
coord[,2],
col=colspca,
xlab="PC1 - 40.74 %",
ylab= "PC2 - 16.46 %",
pch=19
)
abline(v=0, h=0, lty=2)
text(coord[,1],
coord[,2],
rownames(coord),
cex=0.4,
pos=1)
Variable selection with all samples
tidying up the data frame
Since building any model requires that every sample have both a spectrum and a silicon concentration, we can delete the rows that have no silicon content information.
which(is.na(metadata$Si))
## [1] 28 31 34 37 62 63 64 65 66
which(is.na(metadata$spectra))
## [1] 7 25 26 28 31 34 37
missingSiAll <- unique(c(which(is.na(metadata$Si)),
which(is.na(metadata$spectra))))
missingSiAll
## [1] 28 31 34 37 62 63 64 65
## [9] 66 7 25 26
metadata.si <- metadata[-c(missingSiAll),]
Once we have selected the samples that have both spectra and silicon content available, this data set is ready for modelling. We can also check that every remaining sample has its sample ID as follows:
which(is.na(metadata.si$sample))
## integer(0)
integer(0) means that there are no missing values in the sample ID column, metadata.si$sample.
extracting the silicon content and corresponding spectra for each sample
We can create a table that relates the silicon content of each sample to its corresponding ID:
SiTable <- data.frame(sample = metadata.si$sample,
Si = metadata.si$Si)
Then, we can extract the corresponding spectra for each sample:
Sispectra <- corrected[metadata.si$spectra.index,]
SispectraRaw <- mean[metadata.si$spectra.index,]
selection of variables
library(subselect)
Hmat <- lmHmat(Sispectra, metadata.si$Si)
system.time(
gen <- genetic(Hmat$mat,
kmin = 5,
kmax = 16,
H = Hmat$H,
r = 1,
crit = 'CCR12',
force = T
)
)
## user system elapsed
## 78.66 0.07 82.26
# par(mfrow=c(4,3))
# for(j in 1:nrow(gen$bestsets)){
# for (i in 1:length(rownames(Sispectra))){
# plot(as.numeric(colnames(Sispectra)),
# Sispectra[i,],
# xlab = '',
# ylab = '',
# axes = F,
# type = 'l',
# xlim = c(1700,400),
# ylim = c(0,0.06))
# par(new = T)
# }
# box()
# axis(1)
# axis(2)
# title(main = paste(
# as.character(c(5:16)[j]),
# 'variables'),
# xlab = expression(
# paste('Wave number (cm'^'-1',')')),
# ylab = 'Absorbance (a.u.)')
# abline(v = as.numeric(
# colnames(
# Sispectra
# )[gen$bestsets[j,]]),
# col = 2,
# lty = 2)
# }
selectedpca <- prcomp(Sispectra[,gen$bestsets[8,]])
selectedVp <- (selectedpca$sdev)^2
selectedVariance <- round(selectedVp/sum(selectedVp)*100,2)
coord <- selectedpca$x # scores of the PCA on the selected variables
plot(coord[,1],
coord[,2],
col=colspca,
xlab="PC1 - 71.88 %",
ylab= "PC2 - 13.51 %",
pch=19
)
abline(v=0, h=0, lty=2)
text(coord[,1],
coord[,2],
metadata.si$sample,
cex=0.4,
pos=1)
library(mdatools)
##
## Attaching package: 'mdatools'
## The following object is masked from 'package:pls':
##
## crossval
wheatPLS <-
pls(Sispectra[,gen$bestsets[8,]],
SiTable$Si,
10,
cv =1)
plot(wheatPLS)
Sispectra <- as.matrix(Sispectra)
AllPLSSiTable <- data.frame(SiTable.Si = I(SiTable$Si), Sispectra = I(Sispectra)) # name the columns so the formula below can find them
library(pls) # makes sure the library needed for the PLS calculations is loaded
SiAllPLSplsr <- plsr(SiTable.Si~Sispectra[,gen$bestsets[8,]],ncomp=10,data=AllPLSSiTable,validation="LOO",scale=F) # fits the PLS to the previously conditioned data
predictedvalues <- predict(SiAllPLSplsr,ncomp = 2,newdata = Sispectra[,gen$bestsets[8,]])
plot(SiTable$Si, predictedvalues [,1,1],main="") # Si
model <- lm(predictedvalues~SiTable$Si)
abline(a= 9113.6429,b=0.3607,col=2)
Leaves selection
Selection of samples
Prior work (PCA applied to the same spectra with a different pre-treatment) indicates that leaf samples presented more variability; thus, it is worth subsetting a table with these samples.
First, we ask R where we can find the leaf samples in metadata:
Again, regular expressions can be used. This time, the function
grepl will find every element in the column
metadata$class that has the letter L, i.e. every sample
from wheat leaves.
We create, as before, an object that has the information of where the
samples that match the query (leaf samples) are in the
metadata table:
leaves.index <- which(
grepl(
paste0('(?=.*',
'L',
')'),
metadata$class,perl = T))
leaves.index
## [1] 4 5 6 11 12 13 20 21
## [9] 22 30 31 32 42 43 44 54
## [17] 55 56 69 70 71 72 73 74
Since we now know exactly where the leaf samples are, we can subset the metadata table into a new data.frame:
metadata.leaves <- metadata[leaves.index,]
Silicon quantification in leaf samples
Since information about silicon content is needed, those rows that have no silicon content can be deleted.
which(is.na(metadata.leaves$Si))
## [1] 11
which(is.na(metadata.leaves$spectra))
## [1] 11
Position 11 has an NA both in the silicon content and in the spectra index:
metadata.leaves$sample[11]
## [1] 164
Since sample 164 has neither silicon content nor a spectrum, it can be deleted:
metadata.leaves.Si <-
metadata.leaves[-c(11),]
Then, we can extract the sample IDs and silicon contents into an object called LeavesSi:
LeavesSi <-
cbind(
metadata.leaves.Si$sample,
metadata.leaves.Si$Si)
And we can create a table with the corrected mean spectra for these samples:
leavesSiSpectra <- corrected[metadata.leaves.Si$spectra.index,]
leavesSiSpectraRaw <- mean[metadata.leaves.Si$spectra.index,]
par(mfrow = c(1,2))
for (i in 1:length(
rownames(leavesSiSpectraRaw )
)
){
plot(as.numeric(
colnames(leavesSiSpectraRaw )
),
leavesSiSpectraRaw [i,],
xlab = '',
ylab = '',
axes = F,
type = 'l',
xlim = c(1700,400),
ylim = c(0,0.1))
par(new = T)
}
box()
axis(1)
axis(2)
title(main = '',
xlab = expression(
paste(
'Wave number (cm'^'-1',
')'
)
),
ylab = 'Absorbance (a.u.)')
par(new = F)
for (i in 1:length(
rownames(leavesSiSpectra)
)
){
plot(as.numeric(
colnames(leavesSiSpectra)
),
leavesSiSpectra[i,],
xlab = '',
ylab = '',
axes = F,
type = 'l',
xlim = c(1700,400),
ylim = c(0,0.1))
par(new = T)
}
box()
axis(1)
axis(2)
title(main = '',
xlab = expression(
paste(
'Wave number (cm'^'-1',
')'
)
),
ylab = 'Absorbance (a.u.)')
Mean and corrected spectra for leaf samples
Silicon concentration
We have to create a data.frame to use the plsr function available in the pls package.
leavesSiSpectra <-
as.matrix(leavesSiSpectra)
SiTable <-
data.frame(Si = I(metadata.leaves.Si$Si),
spectra = I(leavesSiSpectra) )
newRange <-
leavesSiSpectra[,c(261:676)]
colnames( leavesSiSpectra[,c(261,676)])
## [1] "900.608" "1700.94"
modelling
The package mdatools can be used to perform the pls
modelling with this data.
library(mdatools)
LeavesPLS <- pls(leavesSiSpectra,
SiTable$Si,
10,
cv =1)
Then, we can check the model summary:
summary(LeavesPLS)
##
## PLS model (class pls) summary
## -------------------------------
## Info:
## Number of selected components: 5
## Cross-validation: full (leave one out)
##
## X cumexpvar Y cumexpvar
## Cal 97.86202 74.37723
## Cv NA NA
## R2 RMSE Slope
## Cal 0.744 10063.72 0.744
## Cv 0.273 16956.23 0.487
## Bias RPD
## Cal 0.0000 2.02
## Cv 641.3282 1.20
summary(LeavesPLS$res$cal)
##
## PLS regression results (class plsres) summary
## Info: calibration results
## Number of selected components: 5
##
## Response variable y1:
## X expvar X cumexpvar
## Comp 1 21.702 21.702
## Comp 2 61.508 83.210
## Comp 3 11.791 95.002
## Comp 4 1.932 96.933
## Comp 5 0.929 97.862
## Comp 6 0.503 98.365
## Comp 7 1.076 99.440
## Comp 8 0.096 99.536
## Comp 9 0.081 99.617
## Comp 10 0.062 99.679
## Y expvar Y cumexpvar
## Comp 1 30.723 30.723
## Comp 2 4.827 35.550
## Comp 3 11.719 47.269
## Comp 4 18.265 65.534
## Comp 5 8.844 74.377
## Comp 6 4.250 78.627
## Comp 7 1.280 79.907
## Comp 8 9.286 89.193
## Comp 9 4.363 93.556
## Comp 10 2.196 95.753
## R2 RMSE Slope
## Comp 1 0.307 16547.737 0.307
## Comp 2 0.356 15960.838 0.356
## Comp 3 0.473 14437.054 0.473
## Comp 4 0.655 11671.935 0.655
## Comp 5 0.744 10063.719 0.744
## Comp 6 0.786 9191.378 0.786
## Comp 7 0.799 8911.810 0.799
## Comp 8 0.892 6535.688 0.892
## Comp 9 0.936 5046.742 0.936
## Comp 10 0.958 4097.286 0.958
## Bias RPD
## Comp 1 0 1.23
## Comp 2 0 1.27
## Comp 3 0 1.41
## Comp 4 0 1.74
## Comp 5 0 2.02
## Comp 6 0 2.21
## Comp 7 0 2.28
## Comp 8 0 3.11
## Comp 9 0 4.03
## Comp 10 0 4.96
plot(LeavesPLS)
par(mfrow=c(2,2))
plotXVariance(LeavesPLS)
plotXCumVariance(LeavesPLS)
plotYVariance(LeavesPLS)
plotYCumVariance(LeavesPLS)
Variable selection
library(subselect)
Hmat <- lmHmat(leavesSiSpectra,
metadata.leaves.Si$Si)
gen <- genetic(Hmat$mat,
kmin =5,
kmax = 16,
H= Hmat$H,
r =1,
crit = 'CCR12',
force = T)
par(mfrow=c(3,3))
for(j in 1:nrow(gen$bestsets)){
for (i in 1:length(rownames(leavesSiSpectra))){
plot(as.numeric(colnames(leavesSiSpectra)),
leavesSiSpectra[i,],
xlab = '',
ylab = '',
axes = F,
type = 'l',
xlim = c(1700,400),
ylim = c(0,0.06))
par(new = T)
}
box()
axis(1)
axis(2)
title(main = paste(
as.character(c(4:15)[j]),
'variables'),
xlab = expression(
paste('Wave number (cm'^'-1',')')),
ylab = 'Absorbance (a.u.)')
abline(v = as.numeric(
colnames(
leavesSiSpectra
)[gen$bestsets[j,]]),
col = 2,
lty = 2)
}
library(mdatools)
LeavesPLS <-
pls(leavesSiSpectra[,gen$bestsets[4,]],
SiTable$Si,
10,
cv =1)
plot(LeavesPLS)
summary(LeavesPLS)
##
## PLS model (class pls) summary
## -------------------------------
## Info:
## Number of selected components: 3
## Cross-validation: full (leave one out)
##
## X cumexpvar Y cumexpvar
## Cal 98.46924 55.14682
## Cv NA NA
## R2 RMSE Slope
## Cal 0.551 13315.03 0.551
## Cv 0.289 16765.08 0.394
## Bias RPD
## Cal 0.0000 1.53
## Cv -56.9302 1.21
Multiple OLS with new variables
We can try to predict the silicon content of each sample using each of the models built with the frequencies selected by the genetic algorithm as the X matrix:
First, we create empty lists in which we will save, one per slot, the result of each model we build.
listOfPredictions1 <- vector('list', length = nrow(gen$bestsets))
listOfModels1 <- vector('list', length = nrow(gen$bestsets))
Then, through a for loop, we can build each model, sub-setting just the columns (variables) described by gen$bestsets:
for(i in 1:nrow(gen$bestsets)){
listOfModels1[[i]] <- lm(metadata.leaves.Si$Si ~ leavesSiSpectra[,gen$bestsets[i,]], y = T, x = T)
}
Then, we use the function predict to make predictions
using each model and the absorbance at the selected frequencies as
predictors, to predict the silicon content in each sample:
for(i in 1:nrow(gen$bestsets)){
listOfPredictions1[[i]] <- predict(listOfModels1[[i]], newdata = as.data.frame(leavesSiSpectra[,gen$bestsets[i,]]))
}
ActualSi <- SiTable$Si
win.graph()
for(i in 1:nrow(gen$bestsets)){
plot(ActualSi,
listOfPredictions1[[i]],
xlab="Actual Si (ppm)" ,
ylab="Predicted Si (ppm)",
pch=17,
cex=1.2,
col="darkorchid4",
cex.lab=1
)
abline(a=0 , b=1, col=1, lty=1, lwd=2)
readline('press enter')
par(new = F)
}
## press enter
## press enter
## press enter
## press enter
## press enter
## press enter
## press enter
## press enter
## press enter
## press enter
## press enter
## press enter
Residuals
residualsT <- c(listOfModels1[[1]]$residuals,
listOfModels1[[2]]$residuals,
listOfModels1[[3]]$residuals,
listOfModels1[[4]]$residuals,
listOfModels1[[5]]$residuals,
listOfModels1[[6]]$residuals,
listOfModels1[[7]]$residuals,
listOfModels1[[8]]$residuals,
listOfModels1[[9]]$residuals)
VariablesResiduals <- c(rep(4,23),
rep(5,23),
rep(6,23),
rep(7,23),
rep(8,23),
rep(9,23),
rep(10,23),
rep(11,23),
rep(12,23)
)
ResidualsTable <- data.frame(Residuals = residualsT, Variables = VariablesResiduals)
ResidualsTable$Variables <- as.factor(ResidualsTable$Variables)
library(ggplot2) # ensure ggplot2 is attached (it may already be loaded via hyperSpec)
dp1 <- ggplot(ResidualsTable, aes(x=Variables, y= Residuals, fill=Variables)) +
geom_violin(trim=FALSE)+
geom_boxplot(width=0.1, fill='white')+
labs(title="",x="# of variables selected", y = "Residuals (mg/kg)")
dp1 + scale_fill_brewer(palette="Greens") + theme_minimal()
# Significance of models
win.graph()
pValTable <- data.frame(matrix(nrow = 9, ncol = 12))
summary(listOfModels1[[1]])$coefficients[,4]## (Intercept)
## 3.409603e-01
## leavesSiSpectra[, gen$bestsets[i, ]]441.626
## 1.148038e-08
## leavesSiSpectra[, gen$bestsets[i, ]]516.837
## 4.028876e-07
## leavesSiSpectra[, gen$bestsets[i, ]]912.179
## 1.802496e-03
## leavesSiSpectra[, gen$bestsets[i, ]]916.036
## 7.546255e-04
## leavesSiSpectra[, gen$bestsets[i, ]]1222.67
## 7.493065e-03
win.graph()
par(mfrow = c(3,3))
for(i in 1:9){
barplot(summary(listOfModels1[[i]])$coefficients[,4],
horiz = T,
names.arg = c('intercept', colnames(leavesSiSpectra[,gen$bestsets[i,]])))
abline(v = 0.05, lty =2)
}
cross validation for MOLS and selected variables
library(tidyverse)
## -- Attaching packages --------
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts -----------------
## x dplyr::collapse() masks hyperSpec::collapse()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Registered S3 methods overwritten by 'caret':
## method from
## predict.plsda mdatools
## print.plsda mdatools
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:mdatools':
##
## plsda
## The following object is masked from 'package:pls':
##
## R2
# First we create a list of tables with the best subsets selected by the genetic algorithm
listOfTables <- vector('list', nrow(gen$bestsets))
listOfModels <- vector('list', nrow(gen$bestsets))
listOfRMSEP <- vector('list', nrow(gen$bestsets))
system.time(
for(i in 1:nrow(gen$bestsets)){
listOfTables[[i]] <- cbind(Si = metadata.leaves.Si$Si, as.data.frame(leavesSiSpectra[,gen$bestsets[i,]]))
# setting seed to generate a
# reproducible random sampling
set.seed(125)
# defining training control as
# repeated cross-validation and
# value of K is 10 and repetition is 100 times
train_control <- trainControl(method = "repeatedcv",
number = 10, repeats = 100)
# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
listOfModels[[i]] <- train(Si ~.,
data = listOfTables[[i]],
method = "lm",
trControl = train_control)
listOfRMSEP[[i]] <- listOfModels[[i]]$resample$RMSE
}
)
## user system elapsed
## 50.36 0.39 51.85
RMSEP <- c(listOfRMSEP[[1]],listOfRMSEP[[2]],listOfRMSEP[[3]],listOfRMSEP[[4]],listOfRMSEP[[5]],listOfRMSEP[[6]],listOfRMSEP[[7]],listOfRMSEP[[8]],listOfRMSEP[[9]])
RMSEPTable <- data.frame(RMSEP = RMSEP, variables = c(rep(4,1000),rep(5,1000),rep(6,1000),rep(7,1000),rep(8,1000),rep(9,1000),rep(10,1000),rep(11,1000),rep(12,1000)))
RMSEPTable$variables <- as.factor(RMSEPTable$variables)
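As a numeric companion to the violin plot below, the cross-validated RMSE can be summarised per number of selected variables; a small sketch:
aggregate(RMSEP ~ variables, data = RMSEPTable, FUN = median) # median repeated-CV RMSE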
# # R program to implement
# # repeated K-fold cross-validation
#
# # setting seed to generate a
# # reproducible random sampling
# set.seed(125)
#
# # defining training control as
# # repeated cross-validation and
# # value of K is 10 and repetation is 3 times
# train_control <- trainControl(method = "repeatedcv",
# number = 10, repeats = 3)
#
# # training the model by assigning sales column
# # as target variable and rest other column
# # as independent variable
# model <- train(Si ~., data = marketing,
# method = "lm",
# trControl = train_control)
#
# # printing model performance metrics
# # along with other details
# print(model)
dp <- ggplot(RMSEPTable, aes(x=variables, y=RMSEP, fill=variables)) +
geom_violin(trim=FALSE)+
geom_boxplot(width=0.1, fill='white')+
labs(title="CVRMSE vs # of variables ",x="# of variables selected", y = "RMSE (n = 1000)")
dp + scale_fill_brewer(palette="Blues") + theme_minimal()