library(plyr)
library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: scales
## Loading required package: grid
library(ggplot2)
library(printr)
options(digits = 3)
setwd('~/Dropbox/cra/rose/report/')
df = read.csv('../data/rose.csv', stringsAsFactors = FALSE)
#for our purporses, NA means zero
df[is.na(df)] = 0
#extracting percentages
perc = data.frame(
compound = df$Compound,
class = df$class,
MHG = df$MHG.Area/sum(df$MHG.Area),
SFME = df$SFME.Area/sum(df$SFME.Area),
STEAM_DIST = df$STEAM_DIST.Area/sum(df$STEAM_DIST.Area),
HYDRO_DIST = df$HYDRO_DIST.Area/sum(df$HYDRO_DIST.Area)
)
A total of 70 compounds were measured using four different extraction techniques: MHG, SFME, STEAM_DIST and HYDRO_DIST.
The compounds were also grouped in eigth classes, of which Class 8 contains all and only the “undefined” compounds.
The analysis focused on the “Area” column, transformed into relative values (percentages).
The first PCA was done using all the available data and no rescaling.
pr = prcomp(perc[,3:6])
ggbiplot(pr, labels = perc$compound)
It appear clear that some of the compounds contain so much variability to completely flatten the other ones. It is enough to take a look to the two outliers.
subset(perc, compound %in% c('Nerol', '2-phenyl ethanol'))
| compound | class | MHG | SFME | STEAM_DIST | HYDRO_DIST | |
|---|---|---|---|---|---|---|
| 2 | 2-phenyl ethanol | 4 | 0.600 | 0.424 | 0.253 | 0.255 |
| 51 | Nerol | 5 | 0.144 | 0.023 | 0.400 | 0.287 |
They contain, indeed, a lot of variability
I’m not sure if this kind of result is expected or it’s an error. We can however take a look to actual ranges and see what happens.
#for each compound, we find the max, min and delta in the four analysis
stats = ddply(perc, 'compound', .fun = function(x){
return(data.frame(
max = max(x[1, 3:6]),
min = min(x[1, 3:6]),
delta = max(x[1, 3:6]) - min(x[1, 3:6])
))
})
#reordering by decreasing delta, showing up to the first compounds with a delta < 5%
stats = stats[order(stats$delta, decreasing = TRUE),]
head(stats, n=10)
| compound | max | min | delta | |
|---|---|---|---|---|
| 51 | Nerol | 0.400 | 0.023 | 0.377 |
| 2 | 2-phenyl ethanol | 0.600 | 0.253 | 0.347 |
| 38 | Geraniol | 0.167 | 0.016 | 0.151 |
| 24 | Cariophyllene oxide B | 0.117 | 0.000 | 0.117 |
| 22 | Cariophyllene | 0.102 | 0.000 | 0.102 |
| 42 | Heneicosane | 0.082 | 0.005 | 0.077 |
| 53 | Nonadecane | 0.077 | 0.017 | 0.060 |
| 12 | a-terpinolen | 0.046 | 0.000 | 0.046 |
| 4 | 4,7,10-Cycloundecatriene, 1,1,4,8-tetramethyl (4Z, 7Z, 10Z) | 0.042 | 0.000 | 0.042 |
| 54 | Octane | 0.042 | 0.000 | 0.042 |
Unsurprisingly, the compounds showing the biggest differences where exactly the two known outliers. All the other compounds (not reported) show less than 5% difference between methods.
Focusing on these compounds, we see:
#only deltas greater than 5% are allowed
stats.outliers = subset(stats, delta >= 0.05)
#examining the actual percentages
perc.outliers = perc[perc$compound %in% stats.outliers$compound,]
perc.outliers
| compound | class | MHG | SFME | STEAM_DIST | HYDRO_DIST | |
|---|---|---|---|---|---|---|
| 2 | 2-phenyl ethanol | 4 | 0.600 | 0.424 | 0.253 | 0.255 |
| 22 | Cariophyllene | 5 | 0.001 | 0.102 | 0.001 | 0.000 |
| 24 | Cariophyllene oxide B | 5 | 0.000 | 0.117 | 0.000 | 0.000 |
| 39 | Geraniol | 5 | 0.167 | 0.016 | 0.083 | 0.125 |
| 42 | Heneicosane | 1 | 0.005 | 0.010 | 0.047 | 0.082 |
| 51 | Nerol | 5 | 0.144 | 0.023 | 0.400 | 0.287 |
| 53 | Nonadecane | 1 | 0.017 | 0.023 | 0.054 | 0.077 |
Some cases are easily interpreted. E.g. Cariophyllene oxide B is detected only by SFME and not by the other techniques. So, anything found SFME will result in very high deltas (and very high variances, intercepted by PCA).
We further proceed with the analysis, removing the outliers (the compounds with delta >5%) and, since we are at it, rescaling/centering everything
#we'll end up with 63 compounds
perc.cleaned = perc[!(perc$compound %in% perc.outliers$compound),]
#redoing the PCA
pr = prcomp(perc.cleaned[,3:6], center = TRUE, scale = TRUE)
ggbiplot(pr, labels = perc.cleaned$compound)
Things are getting clearer. The two first components now explain ~70% of the found variance. There are still a few outliers, but they are comparatively close to the cluster centre. We find again some known compound (a-terpinolen, Octane) which were already known for their high delta.
Let’s now take a look to compound classes.
cl = factor(perc.cleaned$class)
ggbiplot(pr, groups = cl) + scale_color_brewer(name='Compound\nclass', palette = 'Dark2')
A few considerations: