Introduction
In this document, we load several spectra from the first experiment with rice husk, varying milling time, temperature and sampling fraction.
Loading data into R
First of all, we create a list called names which
contains the filenames of all .CSV files in this
folder:
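A minimal sketch of that step (assuming the spectra sit as .CSV files in the current working directory; the exact call is an assumption, since the original chunk is not shown):
names <- list.files(pattern = '\\.CSV$') # filenames of all .CSV files in this folder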
Now, we can import each of these files, telling R where each file is (the filenames saved in each position of the list names) and telling R that we want to apply the function read.csv to each file in this folder:
spectra <- lapply(names, read.csv, header = F) # column 1: wavenumber, column 2: absorbance
Now, we create a vector that will contain the frequencies of incident light. Since all spectra were recorded over the same range of frequencies, we can use the first spectrum's frequencies for all of them:
wn <- spectra[[1]][,1]
Now we can save just the absorbances in the object
absorbances by selecting the second column of each
data.frame contained in each position of the list
spectra.
absorbances <- lapply(spectra, '[' , 2)
absorbances <- as.data.frame(absorbances) # convert list of data.frames to one data.frame
absorbances <- t(absorbances) #arrange variables in columns and subjects in rows
Then we can assign proper names to the columns (frequencies) and rows (samples):
colnames(absorbances) <- wn
rownames(absorbances) <- names
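As a quick sanity check (no output shown here), the matrix should now have one row per sample and one column per frequency:
dim(absorbances) # rows = samples, columns = wavenumbers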
Estimating the mean spectra of each sample
Upper and lower part samples.
We can use a two-step search algorithm:
- differentiate samples from the upper fraction and the lower fraction, using the @ character.
- for each fraction, create a search grid with the combinations of names (treatment variables, such as temperature, milling time and day of sampling).
At this point, all samples have a @ in their name. We can search for those with an A or a B at the end of their name and remove the @ as follows:
for(i in 1:length(rownames(absorbances))){
  if(grepl('(?=.*_A|_ A|_B|_ B)',
           rownames(absorbances)[i],
           perl = T)){
    rownames(absorbances)[i] <- sub('@', '', rownames(absorbances)[i])
  }
}
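For reference, the same cleanup can be written without the explicit loop, since grepl and sub are vectorized (a sketch equivalent to the loop above):
rn <- rownames(absorbances)
sel <- grepl('_A|_ A|_B|_ B', rn) # lower-fraction files end in A or B
rn[sel] <- sub('@', '', rn[sel])
rownames(absorbances) <- rn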
rownames(absorbances)
## [1] "50C T_2 min 1_A.CSV" "50C T_2 min 1_B.CSV" "50C T_2 min 2_A.CSV"
## [4] "50C T_2 min 2_B.CSV" "50C T_2 min 3_A.CSV" "50C T_2 min 3_B.CSV"
## [7] "50C T_4 min 1_A.CSV" "50C T_4 min 1_B.CSV" "50C T_4 min 2_A.CSV"
## [10] "50C T_4 min 2_B.CSV" "50C T_4 min 3_A.CSV" "50C T_4 min 3_B.CSV"
## [13] "50C T_6 min 1_ A.CSV" "50C T_6 min 1_ B.CSV" "50C T_6 min 2_ A.CSV"
## [16] "50C T_6 min 2_ B.CSV" "50C T_6 min 3_ A.CSV" "50C T_6 min 3_ B.CSV"
## [19] "50C T_8 min 1_ B.CSV" "50C T_8 min 1_A.CSV" "50C T_8 min 2_ A.CSV"
## [22] "50C T_8 min 2_ B.CSV" "50C T_8 min 3_ A.CSV" "50C T_8 min 3_ B.CSV"
## [25] "@A 50C 2 min 1.CSV" "@A 50C 2 min 2.CSV" "@A 50C 2 min 3.CSV"
## [28] "@A 50C 4 min 1.CSV" "@A 50C 4 min 2.CSV" "@A 50C 4 min 3.CSV"
## [31] "@A 50C 6 min 1.CSV" "@A 50C 6 min 2.CSV" "@A 50C 6 min 3.CSV"
## [34] "@A 50C 8 min 1.CSV" "@A 50C 8 min 2.CSV" "@A 50C 8 min 3.CSV"
## [37] "@A Room T 2 min 1.CSV" "@A Room T 2 min 2.CSV" "@A Room T 2 min 3.CSV"
## [40] "@A Room T 4 min 1.CSV" "@A Room T 4 min 2.CSV" "@A Room T 6 min 1.CSV"
## [43] "@A Room T 6 min 2.CSV" "@A Room T 6 min 3.CSV" "@A Room T 8 min 1.CSV"
## [46] "@A Room T 8 min 2.CSV" "@A Room T 8 min 3.CSV" "@B 50C 2 min 1.CSV"
## [49] "@B 50C 2 min 2.CSV" "@B 50C 2 min 3.CSV" "@B 50C 4 min 1.CSV"
## [52] "@B 50C 4 min 2.CSV" "@B 50C 4 min 3.CSV" "@B 50C 6 min 1.CSV"
## [55] "@B 50C 6 min 2.CSV" "@B 50C 6 min 3.CSV" "@B 50C 8 min 1.CSV"
## [58] "@B 50C 8 min 2.CSV" "@B 50C 8 min 3.CSV" "@B Room T 2 min 1.CSV"
## [61] "@B Room T 2 min 2.CSV" "@B Room T 2 min 3.CSV" "@B Room T 6 min 1.CSV"
## [64] "@B Room T 6 min 2.CSV" "@B Room T 6 min 3.CSV" "@B Room T 8 min 1.CSV"
## [67] "@B Room T 8 min 2.CSV" "@B Room T 8 min 3.CSV" "Room T 2 min 1_B.CSV"
## [70] "Room T 2 min 2_B.CSV" "Room T 2 min 3_B.CSV" "Room T 4 min 1_A.CSV"
## [73] "Room T 4 min 2_A.CSV" "Room T 4 min 3_A.CSV" "Room T 8 min 1_B.CSV"
## [76] "Room T 8 min 2_B.CSV" "Room T 8 min 3_B.CSV" "Room T_2 min 1_A.CSV"
## [79] "Room T_2 min 2_A.CSV" "Room T_2 min 3_A.CSV" "Room T_4 min 1_B.CSV"
## [82] "Room T_4 min 2_B.CSV" "Room T_4 min 3_B.CSV" "Room T_6 min 1_A.CSV"
## [85] "Room T_6 min 1_B.CSV" "Room T_6 min 2_A.CSV" "Room T_6 min 2_B.CSV"
## [88] "Room T_6 min 3_A.CSV" "Room T_6 min 3_B.CSV" "Room T_8 min 1_A.CSV"
## [91] "Room T_8 min 2_A.CSV" "Room T_8 min 3_A.CSV"
Now we have a prior differentiation between samples from the upper half of the pellet and samples from the lower half of the pellet.
Calculating the mean spectra for the upper part
First we create a subset of the spectra of the samples that come from the upper part of the pellets:
indexUpper <- which(grepl('(?=.*@)', rownames(absorbances), perl = T))
Upper <- absorbances[indexUpper,]
Now we can create a search matrix, or search grid, with all the possible combinations of the names for this fraction:
searchGridUpper <- expand.grid(Temperature = c('Room T','50C'),
Time = c('2 min', '4 min' , '6 min', '8 min'),
Treatment =c('A','B'))
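The grid contains one row per combination; a quick check of its size:
nrow(searchGridUpper) # 16 = 2 temperatures x 4 milling times x 2 treatments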
Now we can create the new names for the averaged spectra and store them in a character vector:
rownames.prom.Upper <- character(16)
for (i in 1:16){
rownames.prom.Upper[i] <- paste('@',searchGridUpper[i,1],searchGridUpper[i,2],searchGridUpper[i,3])
}
rownames.prom.Upper
## [1] "@ Room T 2 min A" "@ 50C 2 min A" "@ Room T 4 min A" "@ 50C 4 min A"
## [5] "@ Room T 6 min A" "@ 50C 6 min A" "@ Room T 8 min A" "@ 50C 8 min A"
## [9] "@ Room T 2 min B" "@ 50C 2 min B" "@ Room T 4 min B" "@ 50C 4 min B"
## [13] "@ Room T 6 min B" "@ 50C 6 min B" "@ Room T 8 min B" "@ 50C 8 min B"
rownames.prom.Upper <- rownames.prom.Upper[-c(11)] # drop '@ Room T 4 min B': that combination has no samples (see below)
Once we have the names we will search for, we can find where they are in the Upper matrix and use those positions to calculate the mean:
- First, we create a matrix to store the mean spectra:
promUpper <- matrix(
ncol= ncol(Upper),
nrow = 15)
index.rep.Upper <- vector('list',15)
Now, in this set of experiments there was no sample for the variation B Room T 4 min, so we have to delete this combination from the search grid:
searchGridUpper[11,]
## Temperature Time Treatment
## 11 Room T 4 min B
searchGridUpper <- searchGridUpper[-c(11),]
dim(searchGridUpper)
## [1] 15 3
for(i in 1:15){
index.rep.Upper[[i]] <- c(which(grepl(paste0('(?=.*',
as.character(searchGridUpper[i,1]),
')(?=.*',
as.character(searchGridUpper[i,2]),
')(?=.*',
as.character(searchGridUpper[i,3]),
')'
) ,
rownames(Upper),
perl = T))
)
}
Since there are just 2 replicates for the variation Room T 4 min A, we have to use a conditional flow-control structure to calculate the means:
for(i in 1:15){
for(j in 1:length(colnames(Upper))){
if(length(index.rep.Upper[[i]])==2){
promUpper[i,j] <- mean(c(Upper[index.rep.Upper[[i]][1],j],
Upper[index.rep.Upper[[i]][2],j]
) )
}else{
promUpper[i,j] <- mean(c(Upper[index.rep.Upper[[i]][1],j],
Upper[index.rep.Upper[[i]][2],j],
Upper[index.rep.Upper[[i]][3],j]
) )
}
}
}
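For reference, the conditional can be avoided altogether, since colMeans averages over however many replicate rows are indexed; an equivalent, more compact sketch using the objects defined above:
for(i in 1:15){
promUpper[i, ] <- colMeans(Upper[index.rep.Upper[[i]], , drop = FALSE]) # mean over 2 or 3 replicates
}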
rownames(promUpper) <- rownames.prom.Upper
colnames(promUpper) <- wn
Calculating the mean spectra for the lower part
First we create a subset of the spectra of the samples that come from the lower part of the pellets:
indexLower <- which(grepl('^((?!@).)*$', rownames(absorbances), perl = T))
Lower <- absorbances[indexLower,]
Then we create a search grid for every possible combination of experimental variables:
searchGridLower <- expand.grid(Temperature = c('Room T','50C'),
Time = c('2 min', '4 min' , '6 min', '8 min'),
Treatment =c('A','B'))
Now, we create the new names for the averaged spectra and store them in a character vector:
rownames.prom.Lower <- character(16)
for (i in 1:16){
rownames.prom.Lower[i] <- paste(searchGridLower[i,1],searchGridLower[i,2],searchGridLower[i,3])
}
rownames.prom.Lower
## [1] "Room T 2 min A" "50C 2 min A" "Room T 4 min A" "50C 4 min A"
## [5] "Room T 6 min A" "50C 6 min A" "Room T 8 min A" "50C 8 min A"
## [9] "Room T 2 min B" "50C 2 min B" "Room T 4 min B" "50C 4 min B"
## [13] "Room T 6 min B" "50C 6 min B" "Room T 8 min B" "50C 8 min B"
Once we have the names we will search for, we can find where they are in the Lower matrix and use those positions to calculate the mean:
- First, we create a matrix to store the mean spectra:
promLower <- matrix(
ncol= ncol(Lower),
nrow = nrow(Lower)/3)
index.rep.Lower <- vector('list',16)
- Then, we execute the search and record where each triad of spectra corresponding to each sample is located:
for(i in 1:16){
index.rep.Lower[[i]] <- c(which(grepl(paste0('(?=.*',
as.character(searchGridLower[i,1]),
')(?=.*',
as.character(searchGridLower[i,2]),
')(?=.*',
as.character(searchGridLower[i,3]),
')'
) ,
rownames(Lower),
perl = T))
)
}
Once we know where each triad is located, we can calculate the mean spectrum for each sample:
for(i in 1:16){
for(j in 1:length(colnames(Lower))){
promLower[i,j] <- mean(c(Lower[index.rep.Lower[[i]][1],j],
Lower[index.rep.Lower[[i]][2],j],
Lower[index.rep.Lower[[i]][3],j]
) )
}
}
rownames(promLower) <- rownames.prom.Lower
colnames(promLower) <- wn
Joining the averaged spectra of the two fractions
data <- rbind(promUpper,promLower)
rownames(data)[1:15] <- rownames(promUpper) # 15 upper-fraction means
rownames(data)[16:31] <- rownames(promLower) # 16 lower-fraction means
Plotting the average spectra
class <- vector('character', length = length(rownames(data))) # plot colours: black = upper fraction (@), red = lower fraction
for(i in 1:length(rownames(data))){
  if(grepl('(?=.*@)', rownames(data)[i], perl = T)){
    class[i] <- 'black'
  }else{
    class[i] <- 'red'
  }
}
class
## [1] "black" "black" "black" "black" "black" "black" "black" "black" "black"
## [10] "black" "black" "black" "black" "black" "black" "red" "red" "red"
## [19] "red" "red" "red" "red" "red" "red" "red" "red" "red"
## [28] "red" "red" "red" "red"
for(i in 1:length(rownames(data))){
plot(wn,
data[i,],
axes=F,
xlab = '',
ylab = '',
xlim = c(4000,400),
ylim = c(0,0.3),
type = 'l',
col = class[i])
par(new = T)
}
axis(1)
axis(2)
box()
title(main="",
xlab=expression(paste("Wave number (cm"^"-1",")")),
ylab="Absorbance (a.u.)",
)
legend('topleft',
c('Lower fraction','Upper fraction'),
col = c('red','black'),
lty = 1,
bg = 'white')
Data pre-treatment: range selection and baseline correction
Once we have all the spectra in a tidy data set called data, we can get rid of instrumental noise prior to performing a statistical analysis. As noted by Liland et al. (2010), if the goal is a pretty set of spectra, visual inspection is sufficient to assess the results of baseline correction. But if the spectra are to be used in a statistical analysis, it is very important to ensure objectivity and reproducibility, so that we can make better predictions.
Range selection
We can select a spectral range containing bands due to the interaction of siloxane bonds with light, such as those around \(455\), \(799\) and \(1045 \;cm^{-1}\) (Zemnukhova 2015), and those for lignin guaiacyl and syringyl groups from \(1700\) to \(742\; cm^{-1}\) (Heitner 2010).
- First, we can check which columns contain this spectral range:
colnames(data)[c(1,676)]
## [1] "399.1989" "1700.935"
- Once we know which columns of the data frame we will use for the next steps, we can create a subset as follows:
data <- data[,1:676]
and save a vector containing the corresponding frequencies for our new region of interest:
wn <- as.numeric( colnames(data)[1:676])
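Hard-coding column 676 is brittle; an equivalent selection by wavenumber (a sketch, assuming the column names ascend as shown above) would be:
keep <- which(as.numeric(colnames(data)) <= 1701) # every column up to ~1700 cm-1
data <- data[, keep]
wn <- as.numeric(colnames(data))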
We can plot this new subset:
for(i in 1:length(rownames(data))){
plot(wn,
data[i,],
axes=F,
xlab = '',
ylab = '',
xlim = c(1700,400),
ylim = c(0,0.3),
type = 'l',
col = class[i])
par(new = T)
}
axis(1)
axis(2)
box()
title(main="",
xlab=expression(paste("Wave number (cm"^"-1",")")),
ylab="Absorbance (a.u.)",
)
legend('topleft',
c('Lower fraction','Upper fraction'),
col = c('red','black'),
lty = 1,
bg = 'white')
Initial assessment.
Based on previous random iterations, a local optimum of parameters for baseline correction using the rubberband method has been determined; this configuration can be tested on this data set as follows:
- First, we create a new hyperSpec object called spc:
library(hyperSpec)
spc <- new('hyperSpec', spc = data, wavelength = wn)
Then we can use the rubberband method to estimate the spectral baseline and perform the correction. The rubberband fits the baseline as a convex hull beneath each spectrum, so for spectra with an increasing background signal along the abscissa we first add a strongly curved 'bend' (and subtract it afterwards) to let the baseline follow broad features such as the band around \(1100 \; cm^{-1}\).
bend <- 0.1 * wl.eval(spc,
function(x) x^6+x^5+x^4+x^3+x^2,
normalize.wl=normalize01
) # auxiliary curvature added before the rubberband fit
# plot(bend, wl.reverse =T)
bl <- spc.rubberband(spc+bend, noise = 1e-4, df=20) - bend # estimated baseline
spc.bent <- spc+bend # renamed from 'sum' to avoid masking base::sum
spc3 <- spc - bl # baseline-corrected spectra
plot(spc, wl.reverse = TRUE)
plot(bl, add=TRUE, col=2,wl.reverse = TRUE)
plot(spc.bent,wl.reverse = TRUE)
plot(bend, add=TRUE, col=2,wl.reverse = TRUE)
plot(spc3,wl.reverse = TRUE)
corrected <- as.data.frame(spc3[1:31]) # hyperSpec -> data.frame; the spc column holds the absorbance matrix
corrected <- as.data.frame(corrected[,1]) # keep only the absorbance matrix
for(i in 1:length(rownames(corrected))){
plot(wn,
corrected[i,],
xlab = 'wave number cm-1',
ylab = 'absorbance a.u.',
xlim = c(1700,400),
ylim = c(0,0.18),
type = 'l',
col = class[i])
par(new = T)
}
legend('topleft',
c('Lower fraction','Upper fraction'),
col = c('red','black'),
lty = 1)
Spectra derivatization
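A second-order Savitzky-Golay derivative removes any residual baseline offset and slope and helps resolve overlapping bands, at the cost of trimming half a window width from each end of the spectrum.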
library(prospectr)
## prospectr version 0.2.5 -- 'antilla'
## check the github repository at: https://github.com/l-ramirez-lopez/prospectr/
w2 <- 40 # window width minus 1: the filter uses w = 41 points
sg <- savitzkyGolay(X = corrected,
m = 2, # order of the derivative (2nd)
p = 3, # order of the fitted polynomial
w = w2+1, # window size (must be odd)
delta.wav = 2) # sampling interval, used to scale the derivative
rownames(sg) <- rownames(corrected)
for(i in 1:length(rownames(sg))){
plot(as.numeric(colnames(sg)),
sg[i,],
xlab = 'wave number cm-1',
ylab = '2nd derivative of abs',
xlim = c(1700,400),
ylim = c(-1e-04,1e-04),
type = 'l',
col = class[i])
par(new = T)
}
legend('topleft',
c('Lower fraction','Upper fraction'),
col = c('red','black'),
lty = 1)
Statistical analysis
Hierarchical clustering analysis
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
res.hk <-hkmeans(data,
2,
hc.metric = 'minkowski')
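A caveat, assuming hkmeans forwards hc.metric to stats::dist() without extra arguments: the Minkowski metric there defaults to the exponent p = 2, which coincides with Euclidean distance unless p is supplied.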
fviz_dend(res.hk,
cex = 0.6,
rect = T,
rect_fill = TRUE
)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Classical and Robust Principal Component Analysis
Classical PCA - baseline correction
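No code accompanies this subsection in the draft. A minimal classical-PCA sketch on the baseline-corrected spectra, using base R's prcomp (centering without scaling is an assumption), could mirror the robust analysis below:
X.pca <- prcomp(corrected, center = TRUE, scale. = FALSE)
var.pca <- X.pca$sdev^2 / sum(X.pca$sdev^2) * 100 # percent variance per PC
plot(X.pca$x[,1],
X.pca$x[,2],
pch = 19,
xlab = paste('PC 1 -', round(var.pca[1], 2), '%'),
ylab = paste('PC 2 -', round(var.pca[2], 2), '%'),
col = class)
abline(h = 0, v = 0, lty = 2)
legend('bottomleft',
c('Lower fraction','Upper fraction'),
col = c('red','black'),
pch = 19,
bg = 'white')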
Classical PCA - baseline correction and derivatization
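The same prcomp sketch carries over here by substituting the derivatized matrix sg for corrected.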
Robust PCA - baseline correction
library(pcaPP)
X.rpc <- PCAgrid(corrected,
k=2,
scale=mad)
scores.rpc <- X.rpc$scores # use a name that does not mask base R's T (TRUE)
vp.rpc <- (X.rpc$sdev)^2
var.rpc <- vp.rpc/sum(vp.rpc)*100
plot(scores.rpc[,1],
scores.rpc[,2],
pch = 19,
xlab = paste('PC 1 -',as.character(round(var.rpc[1],2)), '%'),
ylab = paste('PC 2 -',as.character(round(var.rpc[2],2)), '%'),
col = class
)
abline(h = 0, v = 0, lty=2)
legend('bottomleft',
c('Lower fraction','Upper fraction'),
col = c('red','black'),
pch = 19,
bg = 'white')
Robust PCA - baseline correction and derivatization
library(pcaPP)
X.rpc <- PCAgrid(sg,
k=2,
scale=mad)
scores.rpc <- X.rpc$scores # again, avoid masking T
vp.rpc <- (X.rpc$sdev)^2
var.rpc <- vp.rpc/sum(vp.rpc)*100
plot(scores.rpc[,1],
scores.rpc[,2],
pch = 19,
xlab = paste('PC 1 -',as.character(round(var.rpc[1],2)), '%'),
ylab = paste('PC 2 -',as.character(round(var.rpc[2],2)), '%'),
col = class
)
abline(h = 0, v = 0, lty=2)
legend('bottomleft',
c('Lower fraction','Upper fraction'),
col = c('red','black'),
pch = 19,
bg = 'white')