Topic: Why can the PCA algorithm on the NanoRam 785 differentiate between calcium nitrate and magnesium nitrate, but the RVM algorithm on the NanoRam 1064 cannot?



Part One: 785 nm



Introduction



Principal Component Analysis (PCA) is a procedure that transforms a set of variables into a new set of uncorrelated variables called principal components. Dimensionality reduction techniques such as PCA are common in spectral classification and prediction because each spectrum contains a large number of variables. Since each spectral data point is a variable, reducing the number of variables is crucial for creating a meaningful model. PCA uses an approach called feature extraction, which creates new features from linear combinations of the existing variables so that a smaller number of variables can sufficiently describe the dataset.
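
As a minimal sketch of feature extraction, the base R function `prcomp` can be run on made-up data. The matrix `spectra` below is synthetic (20 scans by 50 pixels generated from two hidden factors) and is not the NanoRam data; it only illustrates that strongly correlated pixels become uncorrelated principal component scores.

```r
# Synthetic "spectra": 20 scans x 50 pixels driven by two hidden factors,
# so many pixels are strongly correlated. All values are made up.
set.seed(1)
n_scans <- 20
n_pixels <- 50
factors <- matrix(rnorm(n_scans * 2), n_scans, 2)
pixel_weights <- matrix(runif(2 * n_pixels), 2, n_pixels)
noise <- matrix(rnorm(n_scans * n_pixels, sd = 0.05), n_scans, n_pixels)
spectra <- factors %*% pixel_weights + noise

# Feature extraction: each principal component is a linear
# combination of all the original pixels.
pca <- prcomp(spectra, center = TRUE, scale. = FALSE)
scores <- pca$x

# Largest absolute off-diagonal correlation in a correlation matrix
off_diag <- function(m) max(abs(m[upper.tri(m)]))

max_pixel_cor <- off_diag(cor(spectra))        # original variables
max_score_cor <- off_diag(cor(scores[, 1:3]))  # first three PC scores
print(c(pixels = max_pixel_cor, scores = max_score_cor))
```

The pixel correlations are large, while the correlations between the scores are zero up to floating-point error.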



Results



First, a method was built on a B&W Tek NanoRam 785 handheld Raman analyzer for a single sample each of magnesium nitrate and calcium nitrate. Next, a five-scan method was created for the same samples on a B&W Tek NanoRam 1064 handheld Raman analyzer. The raw data were processed with dark subtraction, scaling, and centering. Running PCA on the processed 785 nm spectral data yields the following:





Scree Plot



The scree plot depicts the percentage of the total variance explained by each PC:





Statistical Summary

The same information as the scree plot, shown as a statistical summary:



## Importance of first k=5 (out of 40) components:
##                            PC1     PC2     PC3     PC4    PC5
## Standard deviation     22.4054 4.68831 3.55776 2.43046 2.2113
## Proportion of Variance  0.8626 0.03777 0.02175 0.01015 0.0084
## Cumulative Proportion   0.8626 0.90031 0.92206 0.93221 0.9406
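
The rows of this summary are related in a simple way: the proportion of variance for each component is its squared standard deviation divided by the sum of all squared standard deviations. The sketch below rebuilds the table by hand for a synthetic matrix `X` (an assumption, not the report's data) and then checks that the first two values reported above are consistent with that relationship.

```r
# Synthetic data: 40 observations, 6 variables with different spreads
set.seed(2)
X <- matrix(rnorm(40 * 6), 40, 6) %*% diag(c(5, 3, 2, 1, 0.5, 0.2))
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Rebuild the three rows of summary(pca) by hand
variance <- pca$sdev^2
prop_var <- variance / sum(variance)
manual <- rbind("Standard deviation"     = pca$sdev,
                "Proportion of Variance" = prop_var,
                "Cumulative Proportion"  = cumsum(prop_var))
built_in <- summary(pca)$importance
print(round(manual, 4))

# Consistency check on the values reported in the summary above:
# the ratio of two proportions equals the ratio of the squared
# standard deviations.
reported_sd   <- c(PC1 = 22.4054, PC2 = 4.68831)
reported_prop <- c(PC1 = 0.8626,  PC2 = 0.03777)
ratio_from_sd   <- unname((reported_sd["PC2"] / reported_sd["PC1"])^2)
ratio_from_prop <- unname(reported_prop["PC2"] / reported_prop["PC1"])
print(c(from_sd = ratio_from_sd, from_prop = ratio_from_prop))
```

The two ratios agree to within rounding, which is what the summary table implies.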



Confidence Ellipses

A 95% confidence ellipse was constructed for each class. If the sampling were repeated in the same way and a confidence ellipse calculated each time, 95% of the constructed ellipses would contain the true mean of the underlying distribution.
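
The repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes bivariate normal scores and builds the confidence ellipse for the mean from Hotelling's T-squared statistic; whether the plotting package used in this report constructs its ellipses in exactly this way is not confirmed, and the helper `mean_in_ellipse` and the simulated values are hypothetical.

```r
# TRUE when `mu` lies inside the `level` confidence ellipse for the mean
# of the rows of `scores` (Hotelling's T-squared region).
mean_in_ellipse <- function(scores, mu, level = 0.95) {
  n <- nrow(scores)
  p <- ncol(scores)
  d <- colMeans(scores) - mu
  dist2 <- drop(t(d) %*% solve(cov(scores)) %*% d)
  limit <- p * (n - 1) / (n * (n - p)) * qf(level, p, n - p)
  dist2 <= limit
}

# Repeat the "experiment" 2000 times with 20 simulated scans each and
# count how often the ellipse contains the true mean.
set.seed(3)
true_mean <- c(0, 0)
covered <- replicate(2000, {
  scans <- cbind(rnorm(20, true_mean[1], sd = 2),
                 rnorm(20, true_mean[2], sd = 0.5))
  mean_in_ellipse(scans, true_mean)
})
coverage <- mean(covered)
print(coverage)
```

The observed coverage should land close to 0.95, which is the statement the paragraph above makes in words.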





Conclusion

The low variance of the calcium nitrate data contrasts sharply with the high variance of the magnesium nitrate data. One of the data points in the magnesium nitrate class is a statistical outlier, and its spectrum is distinctly different from those of the other scans. Removing this rogue scan allows the PCA algorithm to increase the distance between the two classes, instead of forcing most of the scans into a small subspace to accommodate a single outlier.
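
The report does not state which criterion identified the outlier. One common option, shown here purely as an assumption, is to flag scans whose squared Mahalanobis distance in PC score space exceeds a chi-squared cutoff. The scores below are simulated, and the rogue scan is planted at row 12 only for illustration.

```r
# Simulated PC scores for 20 scans of one class (made-up values)
set.seed(4)
scores <- cbind(PC1 = rnorm(20, sd = 1), PC2 = rnorm(20, sd = 0.5))
scores[12, ] <- c(15, -8)  # planted rogue scan (hypothetical)

# Squared Mahalanobis distance of every scan from the class centre
d2 <- mahalanobis(scores, colMeans(scores), cov(scores))

# Flag scans beyond the 97.5% chi-squared quantile (an assumed cutoff)
cutoff <- qchisq(0.975, df = ncol(scores))
keep <- d2 <= cutoff
cleaned <- scores[keep, , drop = FALSE]

print(which(!keep))  # indices of flagged scans
```

Because a single extreme scan also inflates the covariance estimate, a robust estimator may be preferable in practice; this sketch uses the classical estimate for brevity.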





Part Two: 785 adjusted

Introduction



Scaling before running PCA is a seemingly contentious issue in data science and spectral literature. There is a general consensus that scaling is required when variables have different units or greatly differing standard deviations. However, spectral data variables are all just relative intensities at certain pixels, which means that the variables will have the same units and are all measured on the same scale. Given that these two conditions are met, it would be acceptable and potentially even advantageous not to scale the variables and perform the PCA based on the covariance matrix. Otherwise, scaling the variables and using the correlation matrix to perform the PCA makes more sense.



Further research on preprocessing spectral data before PCA yields polarized opinions about when scaling is appropriate. A pragmatic approach is not to scale when the variables are all measured on the same scale and have the same units. In the initial phase of this investigation (Part One), the PCA used scaled variables and the correlation matrix. All later instances of PCA in this project use unscaled variables and the covariance matrix.
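
In base R, the choice between the two approaches comes down to the `scale.` argument of `prcomp`. The sketch below uses a synthetic matrix `X` (an assumption) whose first three columns have a much larger spread than the last three, and confirms that the unscaled PCA diagonalizes the covariance matrix while the scaled PCA diagonalizes the correlation matrix.

```r
# Synthetic intensities: same units, but columns 1-3 vary far more
# than columns 4-6. All values are made up.
set.seed(5)
X <- cbind(matrix(rnorm(30 * 3, sd = 100), 30, 3),
           matrix(rnorm(30 * 3, sd = 1), 30, 3))

pca_cov <- prcomp(X, center = TRUE, scale. = FALSE)  # covariance matrix
pca_cor <- prcomp(X, center = TRUE, scale. = TRUE)   # correlation matrix

# The PC variances are the eigenvalues of the corresponding matrix
cov_eigen <- eigen(cov(X), symmetric = TRUE)$values
cor_eigen <- eigen(cor(X), symmetric = TRUE)$values

# Share of the first loading vector carried by the high-spread columns
pc1_share_cov <- sum(pca_cov$rotation[1:3, 1]^2)
pc1_share_cor <- sum(pca_cor$rotation[1:3, 1]^2)
print(c(unscaled = pc1_share_cov, scaled = pc1_share_cor))
```

Without scaling, the first component is dominated by the columns with the largest spread, which for spectra means the most intense peaks. With scaling, every column contributes one unit of variance regardless of its intensity.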



Results



First, we built a 20-scan method for a single sample each of magnesium nitrate and calcium nitrate using the B&W Tek NanoRam 785. Next, we processed the data with dark subtraction and centering. I removed magnesium nitrate scan #12 from the dataset as an outlier. I ran PCA on the processed 785 nm spectral data, which yielded the following results:





Scree Plot





Statistical Summary



## Importance of first k=5 (out of 39) components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     7784.6314 4094.0216 1.466e+03 730.90832 621.10275
## Proportion of Variance    0.7284    0.2015 2.582e-02   0.00642   0.00464
## Cumulative Proportion     0.7284    0.9299 9.557e-01   0.96215   0.96678



Confidence Ellipses







Interactive 3-Dimensional PCA Scores





Part Three: 1064 nm PCA

Introduction



Reduced Variable Multivariate Method for Spectral Identification (RVM) is another dimensionality reduction procedure. It deletes variables until only a small subset that accurately describes the dataset remains. RVM uses an approach called feature selection, which differs from feature extraction in two ways. First, feature selection removes variables instead of transforming them into new ones, which discards data from the original dataset. In contrast, feature extraction retains all of the original information as long as every component is kept. Second, feature selection offers no guarantee that the remaining variables are free of collinearity. As the number of dimensions increases, collinearity between variables becomes increasingly likely, and spectral data is known for having a high number of dimensions. Feature extraction with PCA guarantees orthogonal components with no collinearity. Before analyzing the data collected on the 1064 nm device with RVM, I will first run the data through the PCA model as a baseline for comparison.
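
The difference between the two approaches can be illustrated with a generic example. The selection rule below (keep the `k` highest-variance pixels) is a hypothetical stand-in chosen for simplicity; it is not the RVM algorithm, whose internals are not available. The data are synthetic.

```r
# Synthetic spectra: 30 scans x 40 pixels driven by two hidden factors
set.seed(7)
n <- 30
p <- 40
hidden <- matrix(rnorm(n * 2), n, 2)
weights <- matrix(runif(2 * p, 0.5, 1.5), 2, p)
X <- hidden %*% weights + matrix(rnorm(n * p, sd = 0.1), n, p)

k <- 3

# Feature selection (generic stand-in, NOT RVM): keep the k pixels
# with the largest variance and discard the rest.
keep <- order(apply(X, 2, var), decreasing = TRUE)[1:k]
selected <- X[, keep]

# Feature extraction: keep the first k principal component scores.
extracted <- prcomp(X, center = TRUE, scale. = FALSE)$x[, 1:k]

# Largest absolute correlation between any two retained variables
off_diag <- function(m) max(abs(m[upper.tri(m)]))
sel_cor <- off_diag(cor(selected))
ext_cor <- off_diag(cor(extracted))
print(c(selected = sel_cor, extracted = ext_cor))
```

The selected pixels remain correlated with one another, while the extracted scores are uncorrelated by construction.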



Results





There do not appear to be any outliers, and, as in Part Two, the variables are centered but not scaled in the preprocessing steps.
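
The preprocessing described here (dark subtraction followed by centering without scaling) can be sketched in a few lines of base R. The sketch assumes that dark subtraction means subtracting a single dark scan pixel by pixel from every raw scan; how the device acquires and applies its dark scan is not documented in this report, and all values below are made up.

```r
set.seed(6)
n_pixels <- 8

# Made-up dark scan (detector background) and true Raman signal
dark <- runif(n_pixels, 90, 110)
signal <- matrix(runif(5 * n_pixels, 0, 1000), 5, n_pixels)

# Five raw scans (rows) = signal + dark background
raw <- sweep(signal, 2, dark, "+")

# Step 1: dark subtraction, pixel by pixel
dark_subtracted <- sweep(raw, 2, dark, "-")

# Step 2: centre each pixel (column) on its mean, without scaling
centered <- scale(dark_subtracted, center = TRUE, scale = FALSE)

print(round(colMeans(centered), 10))
```

Every column mean is zero after centering, and the column variances are left untouched because `scale = FALSE`.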



Scree Plot





Statistical Summary



## Importance of first k=3 (out of 12) components:
##                              PC1       PC2       PC3
## Standard deviation     3.019e+04 1.356e+04 3.713e+03
## Proportion of Variance 8.159e-01 1.645e-01 1.234e-02
## Cumulative Proportion  8.159e-01 9.803e-01 9.927e-01



Interactive 3-Dimensional PCA Scores





Conclusion



The PCA algorithm has no trouble distinguishing the two classes of nitrates collected on the 1064 nm device, whereas the RVM method has specificity issues with the same samples. This indicates a weakness in the RVM algorithm. I will investigate further to uncover the drawbacks of the RVM algorithm.



Part Four: 1064 nm RVM



Introduction



Eventually, I will try to build an RVM algorithm from scratch. I know that RVM uses feature selection and an inflation factor. Currently, there isn’t a way to alter the default inflation factor on the NanoRam 1064, although an internal paper mentions that lowering it can assist with specificity issues for spectrally similar compounds. For now, I will discuss the potential drawbacks of the feature selection approach and what I believe to be the root cause of the specificity issues.



Hypothesis

While RVM has stellar sensitivity, the algorithm struggles with specificity, especially for spectrally similar compounds. An “inflation factor” increases the variance of each spectrally reduced variable in the RVM method, which increases sensitivity but inherently reduces specificity. As previously mentioned, lowering the inflation factor should increase specificity, but users cannot currently change it on the device.
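
The trade-off can be illustrated with a toy model. The sketch assumes that a method accepts a scan when its squared Mahalanobis distance from the method mean, computed with the inflated covariance, falls within a chi-squared limit. This is a hypothetical stand-in and not the RVM algorithm; the class means, variances, and the inflation value of 3 are all invented.

```r
# Hypothetical acceptance rule: inflate the method covariance, then
# accept the scan if it falls inside the resulting 95% region.
accepts <- function(scan, method_mean, method_cov,
                    inflation = 1, level = 0.95) {
  d2 <- mahalanobis(scan, method_mean, method_cov * inflation)
  d2 <= qchisq(level, df = length(method_mean))
}

# Two invented classes in a 2-D reduced space:
# a tight cluster and a wide cluster, five units apart.
tight <- list(mean = c(0, 0), cov = diag(0.3^2, 2))
wide  <- list(mean = c(5, 0), cov = diag(1.5^2, 2))

for (inflation in c(1, 3)) {
  wide_accepts_tight <- accepts(tight$mean, wide$mean, wide$cov, inflation)
  tight_accepts_wide <- accepts(wide$mean, tight$mean, tight$cov, inflation)
  cat("inflation =", inflation,
      "| wide method accepts tight sample:", wide_accepts_tight,
      "| tight method accepts wide sample:", tight_accepts_wide, "\n")
}
```

In this toy setup the wide method starts accepting the tight class once its variance is inflated, while the tight method still rejects the wide class. The same inflation that makes a method more forgiving of its own scans can therefore cost specificity, and the cost falls on the method with the larger spread.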



An interesting phenomenon can occur when comparing two spectrally similar compounds, A and B. The compound A method works correctly, but the compound B method is unable to tell A and B apart and gives false positives for scans of compound A. The underlying cause could be that the spread of the data differs in magnitude between the two methods. Artificially enlarging the confidence regions of two such distributions can cause them to overlap.





For well-separated clusters (above), the original confidence regions are not very close together, so it is unlikely that any false positives will occur, even after the inflation. However, if the confidence regions are closer together…





…this can cause issues. The (blue) calcium nitrate data has a smaller variance than the (orange) magnesium nitrate data and, consequently, a smaller ellipse. My educated guess is that when samples of calcium nitrate and magnesium nitrate are run against the calcium nitrate method, the variance multiplication does not enlarge the blue calcium nitrate ellipse enough to reach the orange magnesium nitrate ellipse, so the calcium nitrate method does not have specificity issues for these two compounds. After the inflation factor is applied, however, the magnesium nitrate ellipse is large enough to overlap the calcium nitrate ellipse. The overlapping confidence ellipses lead to a false positive when testing calcium nitrate. The variance will be even higher for method scans on the 1064 nm device, since only five scans comprise the method. Without access to the internals of the RVM algorithm, I can only surmise that the overlapping ellipses are the crux of the specificity problems. The Applications team provided a p-value crosstable from their results of testing each sample against each method. The results corroborate the premise of the hypothesis, as the calcium nitrate method returned a false positive for magnesium nitrate and the 1064 nm method scans showed a higher variance than the 785 nm scans:







Before advocating corrective actions, I will have to examine more pairs of spectrally similar compounds to ensure that the findings in this investigation are generalizable. Constructing an RVM pastiche to model the actual RVM algorithm in the device is a future goal.



Session Info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] factoextra_1.0.6  ggfortify_0.4.8   data.table_1.12.2
##  [4] plyr_1.8.4        forcats_0.4.0     stringr_1.4.0    
##  [7] dplyr_0.8.3       purrr_0.3.2       readr_1.3.1      
## [10] tidyr_1.0.0       tibble_2.1.3      tidyverse_1.2.1  
## [13] plotly_4.9.1      ggplot2_3.2.1    
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.8.1     Rcpp_1.0.2        lubridate_1.7.4  
##  [4] lattice_0.20-38   assertthat_0.2.1  zeallot_0.1.0    
##  [7] digest_0.6.21     mime_0.7          R6_2.4.0         
## [10] cellranger_1.1.0  backports_1.1.4   evaluate_0.14    
## [13] httr_1.4.1        pillar_1.4.2      rlang_0.4.0      
## [16] lazyeval_0.2.2    readxl_1.3.1      rstudioapi_0.10  
## [19] rmarkdown_1.15    labeling_0.3      htmlwidgets_1.3  
## [22] munsell_0.5.0     shiny_1.3.2       broom_0.5.2      
## [25] compiler_3.6.1    httpuv_1.5.2      modelr_0.1.5     
## [28] xfun_0.9          pkgconfig_2.0.3   htmltools_0.3.6  
## [31] tidyselect_0.2.5  gridExtra_2.3     viridisLite_0.3.0
## [34] crayon_1.3.4      withr_2.1.2       ggpubr_0.2.4     
## [37] later_0.8.0       grid_3.6.1        xtable_1.8-4     
## [40] nlme_3.1-140      jsonlite_1.6      gtable_0.3.0     
## [43] lifecycle_0.1.0   magrittr_1.5      scales_1.0.0     
## [46] cli_1.1.0         stringi_1.4.3     ggsignif_0.6.0   
## [49] promises_1.0.1    xml2_1.2.2        generics_0.0.2   
## [52] vctrs_0.2.0       tools_3.6.1       glue_1.3.1       
## [55] hms_0.5.1         crosstalk_1.0.0   yaml_2.2.0       
## [58] colorspace_1.4-1  rvest_0.3.4       knitr_1.25       
## [61] haven_2.1.1