The starting point is a set of .csv files provided by Beatriz Leal's laboratory, containing the full FTIR spectrum from 4000 to 550 \( cm^{-1} \) for different samples. Their file names follow the pattern ACUS*.csv.
For completeness, the laboratory also provided the spectrum of the unused oil. The name of the file is AcvirgAeroshell_sep12.csv.
An additional file contains a table with the name of each spectrum, its TAN (total acid number) according to ASTM D-664, its viscosity at 100 °C according to ASTM D-445, and its viscosity at 40 °C. Its file name is Espectros-AN-Vis_31-oct2012.csv.
The first step is to get the data properly loaded into R.
The spectra ACUS157.csv, ACUS161.csv, ACUS162.csv, ACUS204.csv, ACUS246.csv, ACUS260.csv and ACUS262.csv have been removed, as they cannot be considered reliable enough.
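The file selection described above can be sketched as follows. This is a minimal illustration in Python rather than the R code of the original analysis; the function name `select_spectrum_files` and the directory listing are hypothetical, and only the file pattern and the excluded names come from the text.

```python
from fnmatch import fnmatchcase

# Spectra the text reports as removed for not being reliable enough.
EXCLUDED = {
    "ACUS157.csv", "ACUS161.csv", "ACUS162.csv", "ACUS204.csv",
    "ACUS246.csv", "ACUS260.csv", "ACUS262.csv",
}


def select_spectrum_files(file_names, pattern="ACUS*.csv", excluded=EXCLUDED):
    """Keep the names that match the pattern and are not excluded."""
    return sorted(
        name for name in file_names
        if fnmatchcase(name, pattern) and name not in excluded
    )


# Hypothetical directory listing, for illustration only.
listing = [
    "ACUS156.csv", "ACUS157.csv", "ACUS158.csv",
    "AcvirgAeroshell_sep12.csv", "Espectros-AN-Vis_31-oct2012.csv",
]
kept = select_spectrum_files(listing)
```

The unused-oil spectrum and the reference table do not match the pattern, so they have to be loaded separately.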
Once the data are loaded (the measured spectra, the spectrum of the original oil, and the data for the 151 retained samples or spectra), the next step is to correct the offsets and to scale the data.
To correct the offset, the average absorbance between 1800 and 2000 \( cm^{-1} \) is estimated for the non-degraded oil, and every spectrum is then adjusted to this value by subtracting the difference between its own average in that band and the reference average.
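A minimal Python sketch of this offset correction is given below, assuming each spectrum is stored as two parallel lists (wavenumbers and absorbances). The function names and the toy values are hypothetical; the original R code is not reproduced in this report.

```python
def band_mean(wavenumbers, absorbance, lo=1800.0, hi=2000.0):
    """Average absorbance over the wavenumber band [lo, hi]."""
    values = [a for w, a in zip(wavenumbers, absorbance) if lo <= w <= hi]
    return sum(values) / len(values)


def offset_correct(wavenumbers, absorbance, reference_mean, lo=1800.0, hi=2000.0):
    """Shift a spectrum so that its band average equals the reference average."""
    shift = band_mean(wavenumbers, absorbance, lo, hi) - reference_mean
    return [a - shift for a in absorbance]


# Toy spectra, for illustration only.
wn = [2100.0, 2000.0, 1900.0, 1800.0, 1730.0]
fresh = [0.10, 0.12, 0.10, 0.08, 0.50]   # non-degraded oil
used = [0.30, 0.32, 0.30, 0.28, 0.90]    # one sample

ref_mean = band_mean(wn, fresh)
corrected = offset_correct(wn, used, ref_mean)
```

After the shift, the sample and the reference share the same mean absorbance in the 1800 to 2000 band, while the shape of the sample spectrum is unchanged.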
Once the spectra are offset-corrected, their scale can be addressed. The absorbance at 1730 \( cm^{-1} \) has been chosen as the scaling reference.
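The text does not state how the absorbance at 1730 \( cm^{-1} \) enters the scaling. One plausible reading, shown here purely as an assumption, is a multiplicative rescaling so that every spectrum takes a common reference value at the sampled wavenumber closest to 1730. The function names and values are hypothetical.

```python
def value_at(wavenumbers, absorbance, target=1730.0):
    """Absorbance at the sampled wavenumber closest to the target."""
    idx = min(range(len(wavenumbers)), key=lambda i: abs(wavenumbers[i] - target))
    return absorbance[idx]


def scale_to_reference(wavenumbers, absorbance, reference_value, target=1730.0):
    """Assumed scaling: multiply so the value near the target equals the reference."""
    factor = reference_value / value_at(wavenumbers, absorbance, target)
    return [a * factor for a in absorbance]


# Toy spectrum, for illustration only.
wn = [1800.0, 1731.0, 1729.5, 1700.0]
spectrum = [0.1, 0.6, 0.8, 0.2]
scaled = scale_to_reference(wn, spectrum, reference_value=0.4)
```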
Next, the differential spectra are determined. They are shown in blocks of ten spectra per plot.
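As a sketch, a differential spectrum is taken here to be the point-wise difference between a sample and the original oil, which is an assumption about the definition used. The grouping into blocks of ten is shown with hypothetical sample names.

```python
def differential(sample, reference):
    """Point-wise difference between a sample spectrum and the reference oil."""
    if len(sample) != len(reference):
        raise ValueError("spectra must share the same wavenumber grid")
    return [s - r for s, r in zip(sample, reference)]


def blocks(items, size=10):
    """Split a list into consecutive blocks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


diff = differential([0.5, 0.7], [0.2, 0.3])

# 151 hypothetical sample names, one plot per block of ten.
names = [f"S{i:03d}" for i in range(1, 152)]
plots = blocks(names)
```

With 151 spectra this gives sixteen plots, the last one holding a single spectrum.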
After this preprocessing step, R objects such as orig and acusn, as well as others such as trans, are available outside the running environment for the namesfespectra kept in the analysis.
Different modelling techniques can be analysed. As the dimension of the FTIR data is high, a Principal Component Analysis (PCA) can be helpful.
The number of spectra remaining after filtering out those with unsuitable offset factors (151) differs from the number of samples with measured viscosity and TAN (157). The two sets are therefore intersected to produce a valid dataset of 145 members.
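The intersection step can be sketched as follows, assuming spectra and laboratory results are both keyed by sample name. The dictionaries and the helper `align` are hypothetical.

```python
def align(spectra, lab_results):
    """Keep only the samples present in both sources, in a common order."""
    common = sorted(spectra.keys() & lab_results.keys())
    X = [spectra[name] for name in common]
    y = [lab_results[name] for name in common]
    return common, X, y


# Toy data, for illustration only.
spectra = {"ACUS001": [0.1, 0.2], "ACUS002": [0.3, 0.4], "ACUS003": [0.5, 0.6]}
tan = {"ACUS002": 1.2, "ACUS003": 0.8, "ACUS004": 2.1}

common, X, y = align(spectra, tan)
```

Sorting the common names keeps the rows of the predictor matrix and the response vector in the same order.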
PCA cannot be properly performed in the space of FTIR dimensions, as the number of samples is much lower than the dimension itself. However, a factor analysis can be performed.
Factor analysis, however, is not what this application requires, so other techniques can be explored. As it is commonly used in the NIR / FTIR field, Partial Least Squares Regression (PLSR) has been considered.
To reduce the dependence on the particular dataset used for learning, a leave-one-out (LOO) cross-validation strategy has been selected.
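The leave-one-out idea can be illustrated with a deliberately simple stand-in model. The sketch below uses a one-predictor least-squares line instead of PLSR, so it shows only the resampling scheme, not the model used in this report. All names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loo_predictions(xs, ys):
    """Predict each sample from a model fitted on all the other samples."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        predictions.append(intercept + slope * xs[i])
    return predictions


# Toy data lying exactly on y = 1 + 2x, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
held_out_predictions = loo_predictions(xs, ys)
```

Each sample is predicted by a model that never saw it, so the resulting error estimate does not reward memorising the training set.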
Accordingly, the contribution from the raw data is analysed.
In conclusion, the most suitable number of latent variables for the PLSR models is 6.
If cross-validation is conducted over the full spectral dimension, the following is found:
This essentially means that the \( R^{2} \) of the PLSR is 71.6672% during learning and 56.167% during testing.
An alternative procedure is to reduce the dimension but, instead of using the variables from the SD decomposition as PLS proposes, the authors want to test other projections such as Independent Component Analysis (ICA).
With the number of ICA projectors varied between 2 and 15, the performance of the linear regression technique can be evaluated by ten-fold cross-validation.
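The resampling behind a ten-fold cross-validation over 2 to 15 components can be sketched as below. Only the fold construction is shown; the ICA projection and the regression are left out, and the function names and the seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_pairs(n_samples, k=10, seed=0):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for held_out in k_fold_indices(n_samples, k, seed):
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        yield train, held_out


candidate_components = list(range(2, 16))   # 2, 3, ..., 15
pairs = list(train_test_pairs(145))
```

For each candidate number of components, a model would be fitted on every training part and scored on the matching held-out fold, and the scores averaged over the ten folds.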
If a Least Trimmed Squares robust (high breakdown) regression technique is used instead of the classical linear regression, the following is found:
A PLS regression can also be implemented:
Let us try a non-linear model builder with the same cross-validation approach.
The first trial uses default values:
A clear conclusion is that regressing TAN from the ICA projections produces the following assessed values (after 10-fold cross-validation):
| | Model | R^2 (%) | Ncomp |
|---|---|---|---|
| 1 | lm | 58.40 | 10 |
| 2 | ltsReg | 57.40 | 4 |
| 3 | plsr | 58.40 | 10 |
| 4 | svmr | 65.70 | 5 |
Here svmr is the most adequate technique; it must be compared against the raw PLSR, which yields 71.7% during learning and 56.167% during testing.
So the ICA technique allows a slight improvement and drastically reduces the amount of data to deal with, with no particular loss of precision. In fact, ICA produces the same latent variables as PLSR does. The linear model from ICA produces the same performance as PLSR from raw data.
It can also be seen that non-linear techniques such as SVMR outperform the linear approach. From here on, the SVM model with 10 components is selected, and different kernels, as well as different kernel parameters, are tested.
For the radial kernel, the best value is \( \gamma=0.05 \), with \( R^{2}= \) 64.534%.
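For reference, the role of \( \gamma \) can be seen in the usual libsvm-style definition of the radial kernel, \( k(x, y) = \exp(-\gamma \lVert x - y \rVert^{2}) \). Whether the R package used here parameterises the kernel in exactly this way is an assumption. A minimal Python sketch:

```python
import math


def rbf_kernel(x, y, gamma=0.05):
    """Radial basis kernel exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0])
two_apart = rbf_kernel([0.0, 0.0], [2.0, 0.0])
```

A small \( \gamma \) such as 0.05 makes the similarity decay slowly with distance, which gives a smoother regression function than larger values would.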
Next, an ICA-based model learned by SVM with a radial kernel and \( \gamma=0.05 \) is addressed.
As the full dataset includes 145 elements, 70% of them are selected at random to build the model, and the remaining samples are used for validation.
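A random 70/30 split of this kind can be sketched as below. The seed and the rounding rule (integer division, giving 101 training samples out of 145) are assumptions; the report does not state which were used.

```python
import random


def split_train_validation(n_samples, train_percent=70, seed=0):
    """Randomly assign sample indices to a training and a validation set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * train_percent // 100
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, validation_idx = split_train_validation(145)
```

Fixing the seed makes the partition reproducible, so the same validation samples can be reused when comparing models.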
The real values against the different predictions can then be seen in the included picture.
As can be seen, the density provided by the SVM method has almost the same shape as the real data. In addition, lm, ltsReg and pls produce two artificial clusters which are not in the data. Furthermore, there is a linear relationship between the linear model and the pls.
## Warning: the condition has length > 1 and only the first element will be
## used
To reduce the dependence on the particular dataset used for learning, a leave-one-out (LOO) cross-validation strategy has been selected.
Accordingly, the contribution from the raw data is analysed.
In conclusion, the most suitable number of latent variables for the PLSR models is 0, 5 × 10<sup>-1</sup>.
If cross-validation is conducted over the full spectral dimension, the following is found:
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
This essentially means that the \( R^{2} \) of the PLSR is 28.7146% during learning and -25.0036% during testing.
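A negative value is possible when \( R^{2} \) is computed as one minus the ratio of the residual sum of squares to the total sum of squares, which is assumed here to be the definition behind these figures. It means that the predictions are worse than simply predicting the mean of the observed values. A minimal Python sketch with toy numbers:

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy values, for illustration only.
observed = [1.0, 2.0, 3.0]
mean_only = r_squared(observed, [2.0, 2.0, 2.0])   # predicting the mean
reversed_trend = r_squared(observed, [3.0, 2.0, 1.0])  # wrong trend
```

Predicting the mean gives exactly zero, and predictions with the wrong trend fall below zero.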
An alternative procedure is to reduce the dimension but, instead of using the variables from the SD decomposition as PLS proposes, the authors want to test other projections such as Independent Component Analysis (ICA).
With the number of ICA projectors varied between 2 and 15, the performance of the linear regression technique can be evaluated by ten-fold cross-validation.
If a Least Trimmed Squares robust (high breakdown) regression technique is used instead of the classical linear regression, the following is found:
A PLS regression can also be implemented:
Let us try a non-linear model builder with the same cross-validation approach.
The first trial uses default values:
A clear conclusion is that regressing TAN from the ICA projections produces the following assessed values (after 10-fold cross-validation):
| | Model | R^2 (%) | Ncomp |
|---|---|---|---|
| 1 | lm | -8.30 | 4 |
| 2 | ltsReg | -10.80 | 3 |
| 3 | plsr | -8.30 | 4 |
| 4 | svmr | -11.90 | 9 |
Here lm is the most adequate technique; it must be compared against the raw PLSR, which yields 28.7% during learning and -25.0036% during testing.
So the ICA technique allows a slight improvement and drastically reduces the amount of data to deal with, with no particular loss of precision. In fact, ICA produces the same latent variables as PLSR does. The linear model from ICA produces the same performance as PLSR from raw data.
It can also be seen that non-linear techniques such as SVMR outperform the linear approach. From here on, the SVM model with 10 components is selected, and different kernels, as well as different kernel parameters, are tested.
## Warning: Removed 1 rows containing missing values (geom_path).
For the radial kernel, the best value is \( \gamma=0.05 \), with \( R^{2}= \) -8.3533%.
## Warning: Removed 55 rows containing missing values (geom_path).
Next, an ICA-based model learned by SVM with a radial kernel and \( \gamma=0.05 \) is addressed.
As the full dataset includes 145 elements, 70% of them are selected at random to build the model, and the remaining samples are used for validation.
The real values against the different predictions can then be seen in the included picture.
As can be seen, the density provided by the SVM method has almost the same shape as the real data. In addition, lm, ltsReg and pls produce two artificial clusters which are not in the data. Furthermore, there is a linear relationship between the linear model and the pls.
## Warning: the condition has length > 1 and only the first element will be
## used
To reduce the dependence on the particular dataset used for learning, a leave-one-out (LOO) cross-validation strategy has been selected.
Accordingly, the contribution from the raw data is analysed.
In conclusion, the most suitable number of latent variables for the PLSR models is 0, 5 × 10<sup>-1</sup>.
If cross-validation is conducted over the full spectral dimension, the following is found:
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
This essentially means that the \( R^{2} \) of the PLSR is 28.6626% during learning and -25.8787% during testing.
An alternative procedure is to reduce the dimension but, instead of using the variables from the SD decomposition as PLS proposes, the authors want to test other projections such as Independent Component Analysis (ICA).
With the number of ICA projectors varied between 2 and 15, the performance of the linear regression technique can be evaluated by ten-fold cross-validation.
If a Least Trimmed Squares robust (high breakdown) regression technique is used instead of the classical linear regression, the following is found:
A PLS regression can also be implemented:
Let us try a non-linear model builder with the same cross-validation approach.
The first trial uses default values:
A clear conclusion is that regressing TAN from the ICA projections produces the following assessed values (after 10-fold cross-validation):
| | Model | R^2 (%) | Ncomp |
|---|---|---|---|
| 1 | lm | -8.50 | 4 |
| 2 | ltsReg | -11.20 | 3 |
| 3 | plsr | -8.50 | 4 |
| 4 | svmr | -12.00 | 9 |
Here lm is the most adequate technique; it must be compared against the raw PLSR, which yields 28.7% during learning and -25.8787% during testing.
So the ICA technique allows a slight improvement and drastically reduces the amount of data to deal with, with no particular loss of precision. In fact, ICA produces the same latent variables as PLSR does. The linear model from ICA produces the same performance as PLSR from raw data.
It can also be seen that non-linear techniques such as SVMR outperform the linear approach. From here on, the SVM model with 10 components is selected, and different kernels, as well as different kernel parameters, are tested.
## Warning: Removed 1 rows containing missing values (geom_path).
For the radial kernel, the best value is \( \gamma=0.05 \), with \( R^{2}= \) -8.4045%.
## Warning: Removed 61 rows containing missing values (geom_path).
Next, an ICA-based model learned by SVM with a radial kernel and \( \gamma=0.05 \) is addressed.
As the full dataset includes 145 elements, 70% of them are selected at random to build the model, and the remaining samples are used for validation.
The real values against the different predictions can then be seen in the included picture.
As can be seen, the density provided by the SVM method has almost the same shape as the real data. In addition, lm, ltsReg and pls produce two artificial clusters which are not in the data. Furthermore, there is a linear relationship between the linear model and the pls.
As a first conclusion, it is hard to explain three different variables (TAN, viscosity at 100 °C and viscosity at 40 °C), with an inner variation as shown in the next figure,
based only on the full FTIR spectrum. A first suggestion would be to focus on different spectral regions for each variable, according to dependences previously known from the literature or from experience. In particular, there is a strong uncorrelated variation between the TAN values and the viscosity ones.
Another easily observed feature is the linearity between the viscosity at 40 °C and the one at 100 °C. The relationship is 100% linear, with coefficients \( V_{40} = -94.3104 + 15.2754 \, V_{100} \).
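The reported relationship can be applied directly, as in the Python sketch below. The coefficients are the ones quoted above; the function names and the input value of 10 are hypothetical, and the viscosity units are those of the source table, which the text does not state.

```python
INTERCEPT = -94.3104   # coefficient reported in the text
SLOPE = 15.2754        # coefficient reported in the text


def v40_from_v100(v100):
    """Viscosity at 40 °C estimated from the viscosity at 100 °C."""
    return INTERCEPT + SLOPE * v100


def v100_from_v40(v40):
    """Inverse of the same linear relationship."""
    return (v40 - INTERCEPT) / SLOPE


example_v40 = v40_from_v100(10.0)
round_trip = v100_from_v40(example_v40)
```

Since one viscosity determines the other, modelling both from the spectra adds no independent information, and a single viscosity model would be enough.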