library(plyr)
library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: scales
## Loading required package: grid
library(ggplot2)
library(printr)
options(digits = 3)
setwd('~/Dropbox/cra/rose/report/')
df = read.csv('../data/rose.csv', stringsAsFactors = FALSE)

#for our purporses, NA means zero
df[is.na(df)] = 0

#extracting percentages
perc = data.frame(
  compound = df$Compound,
  class = df$class,
  MHG = df$MHG.Area/sum(df$MHG.Area),
  SFME = df$SFME.Area/sum(df$SFME.Area),
  STEAM_DIST = df$STEAM_DIST.Area/sum(df$STEAM_DIST.Area),
  HYDRO_DIST = df$HYDRO_DIST.Area/sum(df$HYDRO_DIST.Area)
)

Starting dataset

A total of 70 compounds were measured using four different extraction techniques: MHG, SFME, STEAM_DIST and HYDRO_DIST.

The compounds were also grouped in eigth classes, of which Class 8 contains all and only the “undefined” compounds.

The analysis focused on the “Area” column, transformed into relative values (percentages).

First PCA

The first PCA was done using all the available data and no rescaling.

pr = prcomp(perc[,3:6])
ggbiplot(pr, labels = perc$compound)

It appear clear that some of the compounds contain so much variability to completely flatten the other ones. It is enough to take a look to the two outliers.

subset(perc, compound %in% c('Nerol', '2-phenyl ethanol'))
compound class MHG SFME STEAM_DIST HYDRO_DIST
2 2-phenyl ethanol 4 0.600 0.424 0.253 0.255
51 Nerol 5 0.144 0.023 0.400 0.287

They contain, indeed, a lot of variability

I’m not sure if this kind of result is expected or it’s an error. We can however take a look to actual ranges and see what happens.

#for each compound, we find the max, min and delta in the four analysis
stats = ddply(perc, 'compound', .fun = function(x){
  return(data.frame(
    max = max(x[1, 3:6]),
    min = min(x[1, 3:6]),
    delta = max(x[1, 3:6]) - min(x[1, 3:6])
  ))
})

#reordering by decreasing delta, showing up to the first compounds with a delta < 5%
stats = stats[order(stats$delta, decreasing = TRUE),]
head(stats, n=10)
compound max min delta
51 Nerol 0.400 0.023 0.377
2 2-phenyl ethanol 0.600 0.253 0.347
38 Geraniol 0.167 0.016 0.151
24 Cariophyllene oxide B 0.117 0.000 0.117
22 Cariophyllene 0.102 0.000 0.102
42 Heneicosane 0.082 0.005 0.077
53 Nonadecane 0.077 0.017 0.060
12 a-terpinolen 0.046 0.000 0.046
4 4,7,10-Cycloundecatriene, 1,1,4,8-tetramethyl (4Z, 7Z, 10Z) 0.042 0.000 0.042
54 Octane 0.042 0.000 0.042

Unsurprisingly, the compounds showing the biggest differences where exactly the two known outliers. All the other compounds (not reported) show less than 5% difference between methods.

Focusing on these compounds, we see:

#only deltas greater than 5% are allowed
stats.outliers = subset(stats, delta >= 0.05)
#examining the actual percentages
perc.outliers = perc[perc$compound %in% stats.outliers$compound,]
perc.outliers
compound class MHG SFME STEAM_DIST HYDRO_DIST
2 2-phenyl ethanol 4 0.600 0.424 0.253 0.255
22 Cariophyllene 5 0.001 0.102 0.001 0.000
24 Cariophyllene oxide B 5 0.000 0.117 0.000 0.000
39 Geraniol 5 0.167 0.016 0.083 0.125
42 Heneicosane 1 0.005 0.010 0.047 0.082
51 Nerol 5 0.144 0.023 0.400 0.287
53 Nonadecane 1 0.017 0.023 0.054 0.077

Some cases are easily interpreted. E.g. Cariophyllene oxide B is detected only by SFME and not by the other techniques. So, anything found SFME will result in very high deltas (and very high variances, intercepted by PCA).

Second PCA: no outliers, rescaled

We further proceed with the analysis, removing the outliers (the compounds with delta >5%) and, since we are at it, rescaling/centering everything

#we'll end up with 63 compounds
perc.cleaned = perc[!(perc$compound %in% perc.outliers$compound),]

#redoing the PCA
pr = prcomp(perc.cleaned[,3:6], center = TRUE, scale = TRUE)
ggbiplot(pr, labels = perc.cleaned$compound)

Things are getting clearer. The two first components now explain ~70% of the found variance. There are still a few outliers, but they are comparatively close to the cluster centre. We find again some known compound (a-terpinolen, Octane) which were already known for their high delta.

Let’s now take a look to compound classes.

cl = factor(perc.cleaned$class)
ggbiplot(pr, groups = cl) + scale_color_brewer(name='Compound\nclass', palette = 'Dark2')

A few considerations: