Raman spectra data analysis

Andres Beltran

2022-07-09

Loading the data

As received the spectra is separated in different folders:

dir('./degradation of rods_Yohanna/')

## [1] "309_1" "309_4"

Two folders for two samples, inside each folder we have:

First, for 309_1:

dir('./degradation of rods_Yohanna/309_1/')

## [1] "image 309-1_10x.bmp"                 
## [2] "image 309-1_50x spectrum 1.bmp"      
## [3] "image 309-1_50x spectrum 2 and 3.bmp"
## [4] "image 309-1_50x spectrum 4.bmp"      
## [5] "spectrum 1, 50mW"                    
## [6] "spectrum 2, 50mW"                    
## [7] "spectrum 3, 5mW"                     
## [8] "spectrum 4, 5mW"

Second, for 309_2:

dir('./degradation of rods_Yohanna/309_4/')

## [1] "309_4 specturm 1"                    
## [2] "309_4 specturm 2"                    
## [3] "309_4 specturm 3"                    
## [4] "image 309-4_50x spectrum 1.bmp"      
## [5] "image 309-4_50x spectrum 2 and 3.bmp"

Now we know that there are two replicates at 50 mW and two at 5 mW for sample 309_1, and three replicates for 309_2 without further information depicted in the folder names. Nonetheless, each file name specifieas that all spectra for 309_4 is taken using 5 mW.

The names that lack extension are folders, so within each folder we will find the file of the spectrum:

dir('./degradation of rods_Yohanna/309_1/spectrum 1, 50mW/')

## [1] "1 Export File (Header).txt"                   
## [2] "1 Export File (X-Axis).txt"                   
## [3] "1 Export File (Y-Axis).txt"                   
## [4] "309-1_50x_50mW_10acc_0.1s_001_Spec.Data 1.txt"

files <- list.files('./degradation of rods_Yohanna/', pattern = 'Data',recursive = TRUE)
files

## [1] "309_1/spectrum 1, 50mW/309-1_50x_50mW_10acc_0.1s_001_Spec.Data 1.txt"
## [2] "309_1/spectrum 2, 50mW/309-1_50x_50mW_10acc_0.1s_002_Spec.Data 1.txt"
## [3] "309_1/spectrum 3, 5mW/309-1_50x_5mW_10acc_0.1s_004_Spec.Data 1.txt"  
## [4] "309_1/spectrum 4, 5mW/309-1_50x_5mW_10acc_0.1s_005_Spec.Data 1.txt"  
## [5] "309_4/309_4 specturm 1/309-4_50x_5mW_10acc_0.1s_002_Spec.Data 1.txt" 
## [6] "309_4/309_4 specturm 2/309-4_50x_5mW_10acc_0.1s_003_Spec.Data 1.txt" 
## [7] "309_4/309_4 specturm 3/309-4_50x_5mW_10acc_0.1s_004_Spec.Data 1.txt"

Once we know where the data files are, we can import them as follows:

files <- paste0('./degradation of rods_Yohanna/',files)
data <- lapply(files, read.table, header = F)

Once we have all the data we need in the object data we can transform it to examine it.

First we check if the x axis is the same for all spectra:

1 and 2:

sum(data[[1]][,1] == data[[2]][,1])

## [1] 1024

nrow(data[[1]])

## [1] 1024

matches <- vector('numeric', length = 7)
for (i in 1:7 ) {
  
 matches[i] <- sum(data[[1]][,1] == data[[i]][,1]) 
  
  
}
matches

## [1] 1024 1024 1024 1024 1024 1024 1024

Once we have verified that the x axis is the same for all of the spectra, we can extract it and save it separatelly as follows:

shift <- as.numeric(data[[1]][,1])

and then, create a data.frame with all of the y axes for all samples:

df1 <- as.data.frame(lapply(data, '[', 2))
df1 <- as.data.frame(t(df1))
colnames(df1) <- shift

to assign a shorter name for each sample, we can inspect and extract useful information from the file names files:

files

## [1] "./degradation of rods_Yohanna/309_1/spectrum 1, 50mW/309-1_50x_50mW_10acc_0.1s_001_Spec.Data 1.txt"
## [2] "./degradation of rods_Yohanna/309_1/spectrum 2, 50mW/309-1_50x_50mW_10acc_0.1s_002_Spec.Data 1.txt"
## [3] "./degradation of rods_Yohanna/309_1/spectrum 3, 5mW/309-1_50x_5mW_10acc_0.1s_004_Spec.Data 1.txt"  
## [4] "./degradation of rods_Yohanna/309_1/spectrum 4, 5mW/309-1_50x_5mW_10acc_0.1s_005_Spec.Data 1.txt"  
## [5] "./degradation of rods_Yohanna/309_4/309_4 specturm 1/309-4_50x_5mW_10acc_0.1s_002_Spec.Data 1.txt" 
## [6] "./degradation of rods_Yohanna/309_4/309_4 specturm 2/309-4_50x_5mW_10acc_0.1s_003_Spec.Data 1.txt" 
## [7] "./degradation of rods_Yohanna/309_4/309_4 specturm 3/309-4_50x_5mW_10acc_0.1s_004_Spec.Data 1.txt"

names <- c('309_1_1 50 mW',
           '309_1_2 50 mW',
           '309_1_3 5 mW',
           '309_1_4 5 mW',
           '309_4_1 5 mW',
           '309_4_2 5 mW',
           '309_4_3 5 mW'
           )

We now assing the sample names to the rows:

rownames(df1) <- names

Since we received the x axis in nm, we proceed to express it in 1/cm:

\[\bar{\nu} = \frac{1}{\lambda}\\\]

\[\bar{\nu} \;[1/cm] = \frac{1}{\lambda [nm]} \\\]

\[cm = 1 nm \: \cdot \: (\frac{1 m}{10^9 nm}) \: \cdot \: (\frac{10^2 cm}{1 m})\\\]

\[cm^{-1} = (1 nm \: \cdot \: (\frac{1 m}{10^9 nm}) \: \cdot \: (\frac{10^2 cm}{1 m}))^{-1} \\\]

\[cm^{-1} = (\frac{1}{10^7})^{-1} \: \cdot \: nm \\\]

\[cm^{-1} = 10^7 \: \cdot \: nm\]

shift2 <- 1/(shift/10^7)

The ranges of the x axes do not match with the expected values:

range(shift)# allegedly nm

## [1] 528.129 664.427

range(shift2) # 1/cm from nm

## [1] 15050.56 18934.77

So we create a new axis for sake of plotting:

shift3 <- seq(200,4000, length = 1024)

For the plot:

col <- c(rep('red',2),rep('pink',2),rep('black',3))

for(i in 1:7){
  plot(shift3,
       df1[i,],
       type = 'l',
       col = col[i],
       xlim =,
       ylim =,
       xlab = '',
       ylab = '',
       axes = FALSE
  
  )
    par(new =T)
}
axis(1)
axis(2)
box()

legend('topleft',names, col = col, lty = 1, cex= 0.5)

At this moment we can see that replicates are very different between each other.

Baseline correction

Since all spectra are very different between each other (even between replicates, a test can be found here) it will be more useful if the estimation of the baseline is applied to each spectrum separately.

To do this iteratively, we can create a list to save each data frame in one slot of the list listOfDf:

listOfDf <- vector('list', 7)

Now we create a data.frame called param in which we will save two parameters of baseline correction:

One is the noise threshold at which the function recognizes the noise.
The second one is the degrees of freedom used by the smooth.spline function available in the package stats. If not specified leave-one-out cross-validation (ordinary or ‘generalized’ as determined by cv) is used to determine smoothing parameters.

These parameters have been selected subjectively seeking for a better appearance of the spectra. If further analysis and information extraction were to be performed to this data set, these parameters should be selected via optimization using the correct objective function as optimization criteria.

library(hyperSpec)

## Loading required package: lattice

## Loading required package: grid

## Loading required package: ggplot2

## Loading required package: xml2

## Package hyperSpec, version 0.99-20201127
## 
## To get started, try
##    vignette ("hyperspec")
##    package?hyperSpec 
##    vignette (package = "hyperSpec")
## 
## If you use this package please cite it appropriately.
##    citation("hyperSpec")
## will give you the correct reference.
## 
## The project homepage is http://hyperspec.r-forge.r-project.org

param <- data.frame(noise = c(10,17,8,8,250,15,50),
                      df. = c(15,17,10,20,5,20,5))

Since each spectrum needs different parameters we can loop trough all of them and save the results in each slot of a list:

First we create empty lists for each part of the process:
- spc for each hyperSpec object (needed by the baseline correction functions)
- bend for the bending function, this is useful especially for concave-like spectra
- bl for each estimated baseline
- spc2 for each corrected spectrum

spc  <- vector('list', 7)
bend <- vector('list', 7)
bl <- vector('list', 7)
spc2 <- vector('list', 7)

Once there are objects where information can be stored, the 4 steps required for baseline correction can be performed as follows, for all 7 samples:

for(i in 1:7){

listOfDf[[i]] <- df1[i,] #each sample is saved in each slot of a list
  
spc[[i]] <-            new('hyperSpec',
                       spc = listOfDf[[i]][, 50:1023],
                       wavelength = shift3[50:1023]) # Each spectrum is converted into a hyperSpec object 

bend[[i]]   <-         wl.eval(spc[[i]],
                        function (x)
                        x^6+x^5+x^4+x^3+x^2,
                        normalize.wl =normalize01)# a bend function is estimated for each spectrum

 bl[[i]]  <-        spc.rubberband(spc[[i]]+bend[[i]],
                     noise =param$noise[i],
                     df = param$df.[i]
                     )-bend[[i]] # Baseline is estimated with custom parameters for each spectrum
 
 spc2[[i]] <- spc[[i]] - bl[[i]] # Baseline is subtracted
 


plot(spc[[i]], wl.reverse = F)
plot(bl[[i]], add=TRUE, col=2,wl.reverse =F )
plot(spc2[[i]],add = T)

}

Once baseline has been estimated and corrected, the following chunk of code allows us to extract the data from hyperSpec objects into one data.frame:

a <- matrix(nrow  = 7, ncol = 974)
 
 
for(i in  1:7){
  
 a[i,] <- t(as.data.frame(spc2[[i]][,1]$spc[1,]))
  

 
}

 colnames(a) <- as.numeric(colnames(as.data.frame(spc2[[1]][,1]$spc)))

Now we can plot them together for comparison:

 for(i in c(1:7)){
   
 plot(colnames(a),
     a[i,],
     type = 'l',
     # axes = 'F',
     ylab = '',
     xlab = '',
      ylim = c(0,200),
     # xlim = c()
     col = col[i]
     
     )  
   par(new = T)
 }

legend('topleft',names, col = col, lty = c(1,2,1,2,1,2,3), cex= 0.5)

Since spectrum #5 is has too much noise which is 309_4_1. Se can select this spectrum as an outlier and display the rest of them:

 for(i in c(1:7)[-5]){
   
 plot(colnames(a),
     a[i,],
     type = 'l',
     # axes = 'F',
     ylab = '',
     xlab = '',
      ylim = c(0,200),
     # xlim = c()
     col = col[i]
     
     )  
   par(new = T)
 }

legend('topleft',
       names[-5], 
       col = col[-5], 
       lty = c(1,2,1,2,1,2,3)[-5], 
       cex= 0.5)

Making a closer inspection ot the region of interest:

 for(i in c(1:7)[-5]){
   
 plot(colnames(a),
     a[i,],
     type = 'l',
     # axes = 'F',
     ylab = '',
     xlab = '',
      ylim = c(0,150),
      xlim = c(400,1000),
     col = col[i]
     
     )  
   par(new = T)
 }

legend('topright',
       names[-5], 
       col = col[-5], 
       lty = c(1,2,1,2,1,2,3)[-5], 
       cex= 0.5)

There is possible to differentiate the two samples via spectral comparison once baseline has been corrected.

sessionInfo()

## R version 4.0.5 (2021-03-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] hyperSpec_0.99-20201127 xml2_1.3.2              ggplot2_3.3.5          
## [4] lattice_0.20-44        
## 
## loaded via a namespace (and not attached):
##  [1] RColorBrewer_1.1-2  bslib_0.2.5.1       compiler_4.0.5     
##  [4] pillar_1.6.2        jquerylib_0.1.4     highr_0.9          
##  [7] rmdformats_1.0.2    tools_4.0.5         testthat_3.0.4     
## [10] digest_0.6.27       jsonlite_1.7.2      evaluate_0.14      
## [13] lifecycle_1.0.0     tibble_3.1.3        gtable_0.3.0       
## [16] png_0.1-7           pkgconfig_2.0.3     rlang_0.4.11       
## [19] DBI_1.1.1           yaml_2.2.1          xfun_0.25          
## [22] withr_2.4.2         stringr_1.4.0       dplyr_1.0.7        
## [25] knitr_1.33          generics_0.1.0      sass_0.4.0         
## [28] vctrs_0.3.8         tidyselect_1.1.1    glue_1.4.2         
## [31] R6_2.5.1            jpeg_0.1-9          fansi_0.5.0        
## [34] rmarkdown_2.10      bookdown_0.23       latticeExtra_0.6-29
## [37] purrr_0.3.4         magrittr_2.0.1      scales_1.1.1       
## [40] htmltools_0.5.1.1   ellipsis_0.3.2      assertthat_0.2.1   
## [43] colorspace_2.0-2    utf8_1.2.2          stringi_1.7.3      
## [46] lazyeval_0.2.2      munsell_0.5.0       crayon_1.4.1