Topic: How can B&W Tek LLC improve its proprietary chemometric modeling software, BWIQ, so that SIMCA models share more analytic and graphical information with end customers?
Soft independent modeling of class analogies (SIMCA) is a commonly used supervised classification technique in fields such as chemometrics and spectroscopy. The term “soft” refers to the fact that an observation can be assigned to more than one class (or to none) during classification. To build a SIMCA model, PCA is performed separately for each class to obtain a reduced-dimension representation; each class is then represented by an affine subspace spanned by its n retained principal components. Next, the mean orthogonal distance (OD) between the training data and the subspace is calculated, and this mean OD is used to set the critical distance. Once each class has a dimensionally reduced model and a classification threshold, new observations can be tested against the classifier.
The current iteration of BWIQ gives the end user only an accuracy value and a bare-bones plot after building a SIMCA model. In this exploration, I will employ and explain several analytical and graphical techniques that convey valuable information about a SIMCA model, using spectra collected with a NanoRam 785 nm device from a sample of calcium nitrate and a sample of magnesium nitrate. One of the first things we can do after building the SIMCA model is to call the summary() function in R to print some enlightening statistics about the class model for calcium nitrate:
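The summaries below were produced with the mdatools package. A minimal sketch of the model-building step might look like the following; ca_spectra is an assumed name for the matrix of preprocessed calcium nitrate spectra, and the component count and cross-validation settings shown here are illustrative rather than taken from the original analysis.

library(mdatools)

# One-class SIMCA model for calcium nitrate (sketch; ca_spectra is an assumed
# matrix of preprocessed Raman spectra with one scan per row)
ca_model <- simca(ca_spectra, classname = "calcium nitrate", ncomp = 3, cv = 1)

summary(ca_model)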
##
## SIMCA model for class "calcium nitrate" summary
##
## Info:
## Method for critical limits: jm
## Significance level (alpha): 0.05
## Selected number of components: 3
##
## Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
## Comp 1 70.66 70.66 0.94 55.89 0.88
## Comp 2 4.90 75.56 1.00 2.22 0.81
## Comp 3 3.94 79.49 1.00 2.29 0.62
Next, we call the plot() function with the calcium nitrate model object as the argument to get several informative diagnostic plots:
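In mdatools this is a one-line call, shown here for completeness:

# Diagnostic overview plots for the calcium nitrate class model
plot(ca_model)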
In terms of SIMCA, the scores (also called factor scores or component scores) are the transformed variable values corresponding to particular observations. Another way to describe a score is as the distance along a principal component axis from the origin to the projection of the observation onto that axis.
Modeling power shows how much each variable contributes to the variation described by the principal components. The default representation in the plot() function displays each variable name over the vertical bars, causing clutter when there are many variables, as is the case with spectral data.
Squared residual distance (Q) and Hotelling’s T-squared are summary statistics typically used in factor-based models; they quantify how well the model describes a given observation and help detect outliers. The Q residual is the sum of squared elements of an observation’s row in the error matrix, so it measures the difference between an observation and its projection onto the space defined by the principal component eigenvectors. For the ith observation, calculate the Q residual using the following formula, where \(e_i\) is the ith row of the error matrix E:
\[Q_i = e_ie_i^T\]
Hotelling’s T-squared statistic is the sum of an observation’s normalized squared scores, so it measures the variation of each observation within the PCA class model: the distance between the projection of the observation onto the principal component eigenvectors and the multivariate mean. For the ith observation, calculate T-squared using the following formula, where \(t_i\) is the ith row of the scores matrix and \(\Lambda\) is the diagonal matrix of eigenvalues of the retained components:
\[T_i^2 = t_i \Lambda^{-1} t_i^T\]
In summary, Q residuals measure the magnitude of variation left in each observation after it is projected through the model, while Hotelling’s T-squared measures how far each observation lies from the center of the class model (scores = 0).
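To make the two statistics concrete, here is a minimal sketch that computes \(Q_i\) and \(T_i^2\) by hand from a PCA fit obtained with prcomp(). It is illustrative only and does not reproduce the mdatools internals; ca_spectra is again the assumed name for the calcium nitrate calibration spectra.

# Illustrative calculation of Q residuals and Hotelling's T-squared for a
# PCA model with ncomp retained components (ca_spectra is a hypothetical
# matrix of calibration spectra, one observation per row)
ncomp <- 3
pca_fit <- prcomp(ca_spectra, center = TRUE, scale. = FALSE)

scores   <- pca_fit$x[, 1:ncomp, drop = FALSE]        # t_i, the component scores
loadings <- pca_fit$rotation[, 1:ncomp, drop = FALSE] # principal component eigenvectors

# Error matrix E: difference between each observation and its projection
X_centered <- scale(ca_spectra, center = pca_fit$center, scale = FALSE)
E <- X_centered - scores %*% t(loadings)

Q  <- rowSums(E^2)                                    # Q_i = e_i e_i^T
lambda <- pca_fit$sdev[1:ncomp]^2                     # eigenvalues of retained components
T2 <- rowSums(sweep(scores^2, 2, lambda, "/"))        # T_i^2 = sum of normalized squared scores

head(cbind(Q, T2))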
The cumulative variance plot is fairly straightforward; it shows the total percentage of the variation in the data explained by different numbers of principal components.
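Each panel of the overview is also available as an individual, full-size plot. Calls along the following lines should produce them in mdatools; treat this as a sketch, since plotModellingPower in particular is an assumption and may not be present in every release of the package.

# Individual versions of the overview panels (function names may differ
# slightly between mdatools versions)
plotScores(ca_model)
plotModellingPower(ca_model, show.labels = FALSE)  # assumed name; hides the crowded variable labels
plotResiduals(ca_model)                            # Q vs. Hotelling's T-squared
plotCumVariance(ca_model)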
Finally, we can check out an interactive visualization of the scores for the calcium nitrate model class mapped in \(\mathbb{R}^3\) using the first three eigenvector axes:
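One way to build such a visualization is with plotly, which is already attached. The snippet below assumes the calibration scores can be pulled from the model object, e.g. as ca_model$calres$scores in mdatools; the exact slot name is an assumption.

library(plotly)

# Interactive 3-D score plot for the calcium nitrate class (sketch)
scores3d <- as.data.frame(ca_model$calres$scores[, 1:3])  # assumed slot for calibration scores
colnames(scores3d) <- c("PC1", "PC2", "PC3")

plot_ly(scores3d, x = ~PC1, y = ~PC2, z = ~PC3,
        type = "scatter3d", mode = "markers")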
Now we repeat identical steps for the magnesium nitrate class model:
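As before, this takes only a couple of calls; the sketch below assumes mg_spectra holds the magnesium nitrate calibration spectra.

# One-class SIMCA model for magnesium nitrate (mg_spectra is an assumed name)
mg_model <- simca(mg_spectra, classname = "magnesium nitrate", ncomp = 3, cv = 1)
summary(mg_model)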
##
## SIMCA model for class "magnesium nitrate" summary
##
## Info:
## Method for critical limits: jm
## Significance level (alpha): 0.05
## Selected number of components: 3
##
## Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
## Comp 1 85.76 85.76 0.94 68.37 0.94
## Comp 2 4.07 89.83 0.94 8.43 0.94
## Comp 3 1.91 91.74 1.00 1.55 0.88
We can take the single-class SIMCA models for magnesium nitrate and calcium nitrate and combine them into a simcam object using the aptly named simcam() function in the mdatools package, which allows for SIMCA multiclass classification.
nitrates <- simcam(models = list(ca_model, mg_model), info = 'Nitrate SIMCA')
Again, we call the summary() function to get a cursory look at our model object:
summary(nitrates)
##
## SIMCA multiple classes classification (class simcam)
## Number of classes: 2
## Info: Nitrate SIMCA
##
## SIMCA model for class "calcium nitrate" summary
##
## Info:
## Method for critical limits: jm
## Significance level (alpha): 0.05
## Selected number of components: 3
##
## Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
## Comp 1 70.66 70.66 0.94 55.89 0.88
## Comp 2 4.90 75.56 1.00 2.22 0.81
## Comp 3 3.94 79.49 1.00 2.29 0.62
##
## SIMCA model for class "magnesium nitrate" summary
##
## Info:
## Method for critical limits: jm
## Significance level (alpha): 0.05
## Selected number of components: 3
##
## Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
## Comp 1 85.76 85.76 0.94 68.37 0.94
## Comp 2 4.07 89.83 0.94 8.43 0.94
## Comp 3 1.91 91.74 1.00 1.55 0.88
Calling the summary() function on the simcam model object simply concatenates the summaries of the individual SIMCA class models. A Cooman’s plot graphically displays SIMCA results through a pairwise comparison of two classes: for each observation, it plots the distance to one class model against the distance to the other. The plot has four regions with distinct meanings: observations that belong to a single class fall in the top-left or bottom-right sectors, “soft” observations that fall into both classes lie in the bottom-left sector, and observations that belong to neither class lie in the top-right sector.
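mdatools provides a Cooman’s plot for simcam objects; a call along these lines should produce it for our two-class model (the nc argument selects the class pair).

# Cooman's plot for the class pair (calcium nitrate, magnesium nitrate)
plotCooman(nitrates, nc = c(1, 2))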
Cooman’s plots are a wonderful tool for a small number of SIMCA classes, but the number of plots required to examine all class pairings grows quadratically: examining n SIMCA classes with Cooman’s plots requires \(\frac{n(n-1)}{2}\) plots. For models with many classes, reviewing every plot quickly becomes prohibitively time-consuming.
The discrimination power plot depicts how the variables in the observations contribute to class separation. Since the observations are spectra, discrimination power is greatest for the variables (pixels) that correspond to the main Raman peaks. Discrimination power plots are also created pairwise, two classes at a time, so the number of plots again grows quadratically as the number of classes increases.
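As with the Cooman’s plot, mdatools exposes this pairwise plot for simcam objects; a sketch of the call:

# Discrimination power between the two class models
plotDiscriminationPower(nitrates, nc = c(1, 2))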
We can use the versatile predict() function to run our calibration set against the SIMCA model and see a visual representation of the class predictions:
We can also do the same thing with the test set to get predictions against data not used to build the SIMCA class models:
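A sketch of both prediction calls follows. The object names cal_spectra, cal_classes, test_spectra, and test_classes are assumptions standing in for the spectra and reference class labels of the calibration and test sets; passing the reference labels lets mdatools mark correct and incorrect assignments in the plot.

# Predictions for the calibration set
res_cal <- predict(nitrates, x = cal_spectra, c.ref = cal_classes)
plotPredictions(res_cal)

# Predictions for the independent test set
res_test <- predict(nitrates, x = test_spectra, c.ref = test_classes)
plotPredictions(res_test)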
A confusion matrix is another useful tool for letting users know how their classification model performs. Confusion matrices give the number of false positives, false negatives, true positives, and true negatives obtained by testing predicted outcomes against labeled data. From these FP/FN/TP/TN counts, one can calculate several other measures of model performance, such as sensitivity, specificity, F1 score, and accuracy (see the sketch after the matrix below). Here is an example of a confusion matrix for the calibration set:
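In mdatools, the matrix can be obtained from a prediction result; a call along these lines, using the hypothetical res_cal object from the earlier sketch, should yield a table like the one shown:

# Confusion matrix for the calibration-set predictions
getConfusionMatrix(res_cal)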
## calcium nitrate magnesium nitrate None
## calcium nitrate 16 0 0
## magnesium nitrate 0 16 0
Every sample was correctly identified as either a true positive or a true negative, so there are no misclassifications in the calibration set.
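To make the derived measures mentioned above concrete, here is a minimal sketch that computes them from the four counts; the example counts are those of the calcium nitrate class in the calibration matrix (TP = 16, FP = 0, TN = 16, FN = 0).

# Performance measures derived from the four confusion-matrix counts
classification_metrics <- function(tp, fp, tn, fn) {
  sensitivity <- tp / (tp + fn)                  # true positive rate
  specificity <- tn / (tn + fp)                  # true negative rate
  precision   <- tp / (tp + fp)
  f1          <- 2 * precision * sensitivity / (precision + sensitivity)
  accuracy    <- (tp + tn) / (tp + fp + tn + fn)
  c(sensitivity = sensitivity, specificity = specificity,
    f1 = f1, accuracy = accuracy)
}

# Counts for the calcium nitrate class from the calibration confusion matrix
classification_metrics(tp = 16, fp = 0, tn = 16, fn = 0)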
The methods I have outlined in this overview should be sufficient to satisfy the needs of customers and internal employees alike. I highly recommend adding the plots and analytics discussed in this report to the BWIQ software.
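For reproducibility, the listing below is the output of sessionInfo(), which records the R version and package versions used in this analysis:

sessionInfo()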
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mdatools_0.9.4 factoextra_1.0.6 ggfortify_0.4.8
## [4] data.table_1.12.2 plyr_1.8.4 plotly_4.9.1
## [7] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
## [10] purrr_0.3.2 readr_1.3.1 tidyr_1.0.0
## [13] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.9 haven_2.1.1
## [4] lattice_0.20-38 colorspace_1.4-1 vctrs_0.2.0
## [7] generics_0.0.2 htmltools_0.3.6 viridisLite_0.3.0
## [10] yaml_2.2.0 rlang_0.4.0 later_0.8.0
## [13] pillar_1.4.2 glue_1.3.1 withr_2.1.2
## [16] modelr_0.1.5 readxl_1.3.1 lifecycle_0.1.0
## [19] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0
## [22] rvest_0.3.4 htmlwidgets_1.3 evaluate_0.14
## [25] knitr_1.25 httpuv_1.5.2 crosstalk_1.0.0
## [28] broom_0.5.2 Rcpp_1.0.2 xtable_1.8-4
## [31] promises_1.0.1 scales_1.0.0 backports_1.1.4
## [34] jsonlite_1.6 mime_0.7 gridExtra_2.3
## [37] hms_0.5.1 digest_0.6.21 stringi_1.4.3
## [40] shiny_1.3.2 ggrepel_0.8.1 grid_3.6.1
## [43] cli_1.1.0 tools_3.6.1 magrittr_1.5
## [46] lazyeval_0.2.2 crayon_1.3.4 pkgconfig_2.0.3
## [49] zeallot_0.1.0 xml2_1.2.2 lubridate_1.7.4
## [52] assertthat_0.2.1 rmarkdown_1.15 httr_1.4.1
## [55] rstudioapi_0.10 R6_2.4.0 nlme_3.1-140
## [58] compiler_3.6.1