Preprocessing data for used-oil viscosities (40ºC and 100ºC) and Total Acid Number (TAN)

Introduction

The starting point is a set of .csv files provided by Beatriz Leal's laboratory, each containing the full FTIR spectrum of a sample from 4000 to 550 \( cm^{-1} \). Their file names follow the pattern ACUS*.csv.

For completeness, the laboratory also provided the spectrum of the unused (virgin) oil, in the file AcvirgAeroshell_sep12.csv.

There is also an additional file containing a table with, for each spectrum, its name, its TAN according to ASTM D-664, its viscosity at 100ºC according to ASTM D-445, and its viscosity at 40ºC. Its file name is Espectros-AN-Vis_31-oct2012.csv.

Data Preprocessing

The first step is to get data properly loaded into R.

The spectra ACUS157.csv, ACUS161.csv, ACUS162.csv, ACUS204.csv, ACUS246.csv, ACUS260.csv and ACUS262.csv have been removed, as they cannot be considered reliable enough.
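A minimal loading sketch follows; the data/ directory and the two-column (wavenumber, absorbance) layout of each file are assumptions, not confirmed by the source.

```r
# Minimal loading sketch; the "data/" directory and the two-column
# (wavenumber, absorbance) layout of each ACUS*.csv are assumptions
files <- list.files("data", pattern = "^ACUS.*\\.csv$", full.names = TRUE)

# Spectra discarded as not reliable enough
bad <- c("ACUS157.csv", "ACUS161.csv", "ACUS162.csv", "ACUS204.csv",
         "ACUS246.csv", "ACUS260.csv", "ACUS262.csv")
files <- files[!basename(files) %in% bad]

# One column of absorbances per spectrum; wavenumbers taken from the first file
spectra <- sapply(files, function(f) read.csv(f)[[2]])
colnames(spectra) <- basename(files)
wavenumber <- read.csv(files[1])[[1]]

# Virgin (unused) oil reference spectrum
orig <- read.csv("data/AcvirgAeroshell_sep12.csv")[[2]]

# Reference table: spectrum name, TAN (ASTM D-664), viscosities (ASTM D-445)
ref <- read.csv("data/Espectros-AN-Vis_31-oct2012.csv")
```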

After loading the data, i.e. the measured spectra (151 samples after filtering), the spectrum of the original oil, and the reference table, it is time to correct the offsets and to scale the data.

For the offset correction, the average absorbance between 1800 and 2000 \( cm^{-1} \) is estimated for the non-degraded oil, and then all the spectra are adjusted to this value by subtracting their differential averages.
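A sketch of this offset correction, reusing wavenumber, spectra and orig from the loading step:

```r
# Region used to estimate the baseline level
idx <- wavenumber >= 1800 & wavenumber <= 2000

# Target level: mean absorbance of the non-degraded oil in that window
target <- mean(orig[idx])

# Subtract each spectrum's differential average so all baselines match
offsets <- colMeans(spectra[idx, , drop = FALSE]) - target
spectra_off <- sweep(spectra, 2, offsets, "-")
```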

After offsetting the spectra, their scale can be addressed. The absorbance at 1730 \( cm^{-1} \) has been chosen as the scaling reference.
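A matching scaling sketch; picking the measured wavenumber closest to 1730 \( cm^{-1} \) is an assumption:

```r
# Scale each offset-corrected spectrum by its absorbance at 1730 cm-1
i1730 <- which.min(abs(wavenumber - 1730))
spectra_sc <- sweep(spectra_off, 2, spectra_off[i1730, ], "/")
```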

Next, the differential spectra are computed. They are shown in blocks of ten spectra per plot, as sketched below.
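A sketch of the computation and of the ten-per-plot display; subtracting the equally preprocessed virgin spectrum is the assumed definition of "differential" here:

```r
# Virgin oil after the same treatment (its offset is zero by construction)
orig_sc <- orig / orig[i1730]

# Differential spectra: used-oil spectra minus the virgin reference
diffs <- spectra_sc - orig_sc

# Draw them in blocks of ten spectra per plot
blocks <- split(seq_len(ncol(diffs)), ceiling(seq_len(ncol(diffs)) / 10))
for (b in blocks) {
  matplot(wavenumber, diffs[, b, drop = FALSE], type = "l", lty = 1,
          xlab = expression(cm^{-1}), ylab = "Differential absorbance")
}
```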

After this preprocessing step, R objects such as orig and acusn, as well as others like trans, are available outside the running environment for the spectra kept in the analysis.

TAN modelling

Data Processing

Different modelling techniques can be analysed. As the dimensionality of the FTIR spectra is high, a Principal Component Analysis (PCA) could be helpful.

It is necessary to handle the fact that the number of spectra remaining after filtering out those with unsuitable offset factors (151) differs from the number of viscosity and TAN measurements (157 samples). Coherent intersections are carried out in order to produce a valid dataset of 145 members, as sketched below.
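A sketch of that intersection; the column names of the reference table (Espectro, TAN) are assumptions:

```r
# Keep only the samples present both in the filtered spectra and in the
# reference table; the 'ref' column names are hypothetical placeholders
common <- intersect(colnames(spectra_sc), ref$Espectro)
X <- t(spectra_sc[, common])                 # samples in rows
y <- ref$TAN[match(common, ref$Espectro)]    # matched TAN values
```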

PCA cannot be properly performed in the full space of FTIR dimensions, as the number of samples is much lower than the dimensionality. However, a factor analysis can be performed.

[Figure: plot of chunk pintapca1]

However, factor analysis is not what this application requires, so other techniques can be explored. As is common in the NIR/FTIR field, Partial Least Squares Regression (PLSR) has been considered.

In order to achieve independence from the particular dataset used for learning, a leave-one-out (LOO) cross-validation strategy has been selected.
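A minimal PLSR/LOO sketch with the pls package, using the matched X and y from the intersection step; capping the search at 15 latent components is an assumption:

```r
library(pls)

# Wrap the spectral matrix so plsr() sees it as a single predictor block
d <- data.frame(TAN = y, FTIR = I(X))

# PLSR with leave-one-out cross validation
fit <- plsr(TAN ~ FTIR, ncomp = 15, data = d, validation = "LOO")

# Cross-validated RMSEP per number of latent components
plot(RMSEP(fit), legendpos = "topright")
```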

According to this, the contribution from the raw data is analysed.

[Figure: plot of chunk plsr1]

In conclusion, the most convenient PLSR model uses 6 latent components. If a cross-validation process is conducted over the full spectral dimension, the following is found:

[Figure: plot of chunk plsr2]

This essentially means that the \( R^{2} \) performance of the PLSR is 71.6672% during learning and 56.167% during testing.

An alternative procedure is to reduce the dimensionality, but instead of using the variables from the SVD decomposition as PLS proposes, the authors want to test other projections such as Independent Component Analysis, hereinafter called ICA.

If a variable dimensionality between 2 and 15 is adopted for the ICA projections, the performance of a linear regression model can be evaluated by ten-fold cross validation.
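A sketch of this evaluation with the fastICA package; fitting the ICA projection on the full set (rather than refitting it inside each fold) is a simplification:

```r
library(fastICA)

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(X)))  # 10-fold assignment

# Cross-validated R2 of a linear model on k independent components
r2_ica_lm <- sapply(2:15, function(k) {
  S <- fastICA(X, n.comp = k)$S                   # one row of scores per sample
  pred <- numeric(nrow(X))
  for (f in 1:10) {
    tr <- folds != f
    m <- lm(y[tr] ~ ., data = as.data.frame(S[tr, , drop = FALSE]))
    pred[!tr] <- predict(m, as.data.frame(S[!tr, , drop = FALSE]))
  }
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
})
plot(2:15, 100 * r2_ica_lm, type = "b",
     xlab = "ICA components", ylab = expression(R^2~"(%)"))
```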

[Figure: plot of chunk pintaica1]

If a Least Trimmed Squares robust (high-breakdown) regression technique is used instead of classical linear regression, the following is found:

[Figure: plot of chunk pintaica2]

A PLS regression can even be implemented:

[Figure: plot of chunk pintaica3]

Non-linear Model: SVMR

Let us try a non-linear model builder with the same cross-validation approach.

The first trial uses default parameter values:

[Figure: plot of chunk svm2]
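A first-trial sketch with e1071 defaults; using 10 ICA components as inputs is an assumption here:

```r
library(e1071)
library(fastICA)

# ICA scores used as model inputs (component count assumed)
S <- fastICA(X, n.comp = 10)$S

# For a numeric target, svm() defaults to eps-regression with a radial kernel
fit_svm <- svm(S, y)
pred_in <- predict(fit_svm, S)
cat("in-sample R2 (%):",
    100 * (1 - sum((y - pred_in)^2) / sum((y - mean(y))^2)), "\n")
```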

Regressing TAN from the ICA projections produces the following assessed values (after 10-fold cross validation):

Model     R^2 (%)   Ncomp
lm          58.40      10
ltsReg      57.40       4
plsr        58.40      10
svmr        65.70       5

Here svmr is the most adequate technique; it must be compared against the raw PLSR, which produces 71.7% during learning and 56.167% during testing.

So the ICA technique allows a slight improvement and drastically reduces the amount of data to deal with, with no particular loss of precision. In fact, ICA produces latent variables comparable to those of PLSR: the linear model on the ICA projections achieves much the same performance as PLSR on the raw data.

It can also be seen that non-linear techniques such as SVMR outperform the linear approaches. Hereinafter, the SVM model with 10 components is selected, and different kernels, as well as different kernel parameters, will be tested.

[Figure: plot of chunk svm3]

For the radial kernel the best value is \( \gamma = 0.05 \), with \( R^{2} = \) 64.534%.
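A sketch of the parameter sweep behind these figures; the gamma grid is an assumption:

```r
library(e1071)

# 10-fold cross-validated sweep of the radial-kernel width on the ICA scores
tuned <- tune(svm, train.x = S, train.y = y, kernel = "radial",
              ranges = list(gamma = c(0.01, 0.05, 0.1, 0.5, 1)),
              tunecontrol = tune.control(cross = 10))
summary(tuned)   # reports the best gamma and its cross-validated error
```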

[Figure: plot of chunk svm4]

Now, an ICA-based model learned by SVM with a radial kernel and \( \gamma = 0.05 \) should be addressed.

Random prediction

As the full dataset includes 145 elements, 70% of them have been randomly selected to build the model, and the remaining samples are used for validation.
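A sketch of that split, continuing from the blocks above (S holds the 10-component ICA scores):

```r
library(e1071)

set.seed(2)
n <- nrow(X)
tr <- sample(n, round(0.7 * n))   # 70 % of the 145 samples for training

# Radial-kernel SVM with the gamma found above; hold-out predictions
fit_ho <- svm(S[tr, ], y[tr], kernel = "radial", gamma = 0.05)
pred_ho <- predict(fit_ho, S[-tr, ])

# R2 on the 30 % validation samples
1 - sum((y[-tr] - pred_ho)^2) / sum((y[-tr] - mean(y[-tr]))^2)
```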

The real values plotted against the different predictions can be seen in the included picture:

[Figure: plot of chunk mod2]

As can be seen, the density provided by the SVM method has almost the same shape as the real data. In contrast, lm, ltsReg and pls produce two artificial clusters that are not present in the data. Furthermore, there is a linear relationship between the predictions of the linear model and those of the pls.

Viscosity Reference = 100ºC

Data Processing

In order to achieve independence from the particular dataset used for learning, a leave-one-out (LOO) cross-validation strategy has been selected.

According to this, the contribution from the raw data is analysed.

[Figure: plot of chunk plsr21]

In conclusion, the most convenient PLSR model uses 0.5 latent components. If a cross-validation process is conducted over the full spectral dimension, the following is found:

[Figure: plot of chunk plsr22]

This essentially means that the \( R^{2} \) performance of the PLSR is 28.7146% during learning and -25.0036% during testing.

An alternative procedure is to reduce the dimensionality, but instead of using the variables from the SVD decomposition as PLS proposes, the authors want to test other projections such as Independent Component Analysis (ICA).

If a variable dimensionality between 2 and 15 is adopted for the ICA projections, the performance of a linear regression model can be evaluated by ten-fold cross validation.

[Figure: plot of chunk pintaica21]

If a Least Trimmed Squares robust (high-breakdown) regression technique is used instead of classical linear regression, the following is found:

[Figure: plot of chunk pintaica22]

A PLS regression can even be implemented:

[Figure: plot of chunk pintaica23]

Non-linear Model: SVMR

Let us try a non-linear model builder with the same cross-validation approach.

The first trial uses default parameter values:

[Figure: plot of chunk svm22]

Regressing the viscosity at 100ºC from the ICA projections produces the following assessed values (after 10-fold cross validation):

Model     R^2 (%)   Ncomp
lm          -8.30       4
ltsReg     -10.80       3
plsr        -8.30       4
svmr       -11.90       9

Here lm is the most adequate technique; it must be compared against the raw PLSR, which produces 28.7% during learning and -25.0036% during testing.

So the ICA technique allows a slight improvement and drastically reduces the amount of data to deal with, with no particular loss of precision. In fact, ICA produces latent variables comparable to those of PLSR: the linear model on the ICA projections achieves much the same performance as PLSR on the raw data.

In this case, unlike for TAN, the non-linear SVMR does not outperform the linear approaches. Nevertheless, the SVM model with 10 components is selected hereinafter, and different kernels, as well as different kernel parameters, are tested.

[Figure: plot of chunk svm23]

For the radial kernel the best value is \( \gamma = 0.05 \), with \( R^{2} = \) -8.3533%.

[Figure: plot of chunk svm24]

Now, an ICA-based model learned by SVM with a radial kernel and \( \gamma = 0.05 \) should be addressed.

Random prediction

As the full dataset includes 145 elements, 70% of them have been randomly selected to build the model, and the remaining samples are used for validation.

The real values plotted against the different predictions can be seen in the included picture:

[Figure: plot of chunk mod22]

As can be seen, the density provided by the SVM method has almost the same shape as the real data. In contrast, lm, ltsReg and pls produce two artificial clusters that are not present in the data. Furthermore, there is a linear relationship between the predictions of the linear model and those of the pls.

Viscosity Reference = 40ºC

Data Processing

In order to achieve independence from the particular dataset used for learning, a leave-one-out (LOO) cross-validation strategy has been selected.

According to this, the contribution from the raw data is analysed.

[Figure: plot of chunk plsr32]

In conclusion, the most convenient PLSR model uses 0.5 latent components. If a cross-validation process is conducted over the full spectral dimension, the following is found:

[Figure: plot of chunk plsr33]

This essentially means that the \( R^{2} \) performance of the PLSR is 28.6626% during learning and -25.8787% during testing.

An alternative procedure is to reduce the dimensionality, but instead of using the variables from the SVD decomposition as PLS proposes, the authors want to test other projections such as Independent Component Analysis (ICA).

If a variable dimensionality between 2 and 15 is adopted for the ICA projections, the performance of a linear regression model can be evaluated by ten-fold cross validation.

[Figure: plot of chunk pintaica31]

If a Least Trimmed Squares robust (high-breakdown) regression technique is used instead of classical linear regression, the following is found:

[Figure: plot of chunk pintaica32]

A PLS regression can even be implemented:

[Figure: plot of chunk pintaica33]

Non-linear Model: SVMR

Let us try a non-linear model builder with the same cross-validation approach.

The first trial uses default parameter values:

[Figure: plot of chunk svm32]

Regressing the viscosity at 40ºC from the ICA projections produces the following assessed values (after 10-fold cross validation):

Model     R^2 (%)   Ncomp
lm          -8.50       4
ltsReg     -11.20       3
plsr        -8.50       4
svmr       -12.00       9

Here lm is the most adequate technique; it must be compared against the raw PLSR, which produces 28.7% during learning and -25.8787% during testing.

So the ICA technique allows a slight improvement and drastically reduces the amount of data to deal with, with no particular loss of precision. In fact, ICA produces latent variables comparable to those of PLSR: the linear model on the ICA projections achieves much the same performance as PLSR on the raw data.

In this case, unlike for TAN, the non-linear SVMR does not outperform the linear approaches. Nevertheless, the SVM model with 10 components is selected hereinafter, and different kernels, as well as different kernel parameters, are tested.

[Figure: plot of chunk svm33]

For the radial kernel the best value is \( \gamma = 0.05 \), with \( R^{2} = \) -8.4045%.

[Figure: plot of chunk svm34]

Now, an ICA-based model learned by SVM with a radial kernel and \( \gamma = 0.05 \) should be addressed.

Random prediction

As the full dataset includes 145 elements, 70% of them have been randomly selected to build the model, and the remaining samples are used for validation.

The real values plotted against the different predictions can be seen in the included picture:

[Figure: plot of chunk mod32]

As can be seen, the density provided by the SVM method has almost the same shape as the real data. In contrast, lm, ltsReg and pls produce two artificial clusters that are not present in the data. Furthermore, there is a linear relationship between the predictions of the linear model and those of the pls.

Conclusion

As a first conclusion, it turns out to be hard to explain three different variables (TAN, viscosity at 100ºC and viscosity at 40ºC), with the inherent variation shown in the next figure, from the full FTIR spectrum alone.

[Figure: plot of chunk ploty1]

A first suggestion would be to focus on different spectral regions for each variable, according to the dependences previously known from the literature or from experience. In particular, there is a strong uncorrelated variation between the TAN values and the viscosity ones.

Another easily observed fact is the linear dependence of the viscosity at 40ºC on the viscosity at 100ºC. The relationship is essentially linear, with coefficients \( V_{40} = -94.3104 + 15.2754 \, V_{100} \).
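A quick sketch of that fit; the column names V40 and V100 in the reference table are hypothetical:

```r
# Linear fit of viscosity at 40ºC on viscosity at 100ºC; the column
# names V40 and V100 are hypothetical placeholders
fit_v <- lm(V40 ~ V100, data = ref)
coef(fit_v)               # intercept and slope, per the text ~ -94.31 and 15.28
summary(fit_v)$r.squared  # closeness to 1 quantifies the claimed linearity
```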