ROSE - exploratory study

library(plyr)
library(ggbiplot)

## Loading required package: ggplot2

## Loading required package: scales

## Loading required package: grid

library(ggplot2)
library(printr)
options(digits = 3)
setwd('~/Dropbox/cra/rose/report/')
df = read.csv('../data/rose.csv', stringsAsFactors = FALSE)

#for our purporses, NA means zero
df[is.na(df)] = 0

#extracting percentages
perc = data.frame(
  compound = df$Compound,
  class = df$class,
  MHG = df$MHG.Area/sum(df$MHG.Area),
  SFME = df$SFME.Area/sum(df$SFME.Area),
  STEAM_DIST = df$STEAM_DIST.Area/sum(df$STEAM_DIST.Area),
  HYDRO_DIST = df$HYDRO_DIST.Area/sum(df$HYDRO_DIST.Area)
)

Starting dataset

A total of 70 compounds were measured using four different extraction techniques: MHG, SFME, STEAM_DIST and HYDRO_DIST.

The compounds were also grouped in eigth classes, of which Class 8 contains all and only the “undefined” compounds.

The analysis focused on the “Area” column, transformed into relative values (percentages).

First PCA

The first PCA was done using all the available data and no rescaling.

pr = prcomp(perc[,3:6])
ggbiplot(pr, labels = perc$compound)

It appear clear that some of the compounds contain so much variability to completely flatten the other ones. It is enough to take a look to the two outliers.

subset(perc, compound %in% c('Nerol', '2-phenyl ethanol'))

	compound	class	MHG	SFME	STEAM_DIST	HYDRO_DIST
2	2-phenyl ethanol	4	0.600	0.424	0.253	0.255
51	Nerol	5	0.144	0.023	0.400	0.287

They contain, indeed, a lot of variability

Nerol varies from 2% in SFME to 40% in STEAM_DIST
2-phenyl ethanol varies from 25% in both distillations to 60% in MHG.

I’m not sure if this kind of result is expected or it’s an error. We can however take a look to actual ranges and see what happens.

#for each compound, we find the max, min and delta in the four analysis
stats = ddply(perc, 'compound', .fun = function(x){
  return(data.frame(
    max = max(x[1, 3:6]),
    min = min(x[1, 3:6]),
    delta = max(x[1, 3:6]) - min(x[1, 3:6])
  ))
})

#reordering by decreasing delta, showing up to the first compounds with a delta < 5%
stats = stats[order(stats$delta, decreasing = TRUE),]
head(stats, n=10)

	compound	max	min	delta
51	Nerol	0.400	0.023	0.377
2	2-phenyl ethanol	0.600	0.253	0.347
38	Geraniol	0.167	0.016	0.151
24	Cariophyllene oxide B	0.117	0.000	0.117
22	Cariophyllene	0.102	0.000	0.102
42	Heneicosane	0.082	0.005	0.077
53	Nonadecane	0.077	0.017	0.060
12	a-terpinolen	0.046	0.000	0.046
4	4,7,10-Cycloundecatriene, 1,1,4,8-tetramethyl (4Z, 7Z, 10Z)	0.042	0.000	0.042
54	Octane	0.042	0.000	0.042

Unsurprisingly, the compounds showing the biggest differences where exactly the two known outliers. All the other compounds (not reported) show less than 5% difference between methods.

Focusing on these compounds, we see:

#only deltas greater than 5% are allowed
stats.outliers = subset(stats, delta >= 0.05)
#examining the actual percentages
perc.outliers = perc[perc$compound %in% stats.outliers$compound,]
perc.outliers

	compound	class	MHG	SFME	STEAM_DIST	HYDRO_DIST
2	2-phenyl ethanol	4	0.600	0.424	0.253	0.255
22	Cariophyllene	5	0.001	0.102	0.001	0.000
24	Cariophyllene oxide B	5	0.000	0.117	0.000	0.000
39	Geraniol	5	0.167	0.016	0.083	0.125
42	Heneicosane	1	0.005	0.010	0.047	0.082
51	Nerol	5	0.144	0.023	0.400	0.287
53	Nonadecane	1	0.017	0.023	0.054	0.077

Some cases are easily interpreted. E.g. Cariophyllene oxide B is detected only by SFME and not by the other techniques. So, anything found SFME will result in very high deltas (and very high variances, intercepted by PCA).

Second PCA: no outliers, rescaled

We further proceed with the analysis, removing the outliers (the compounds with delta >5%) and, since we are at it, rescaling/centering everything

#we'll end up with 63 compounds
perc.cleaned = perc[!(perc$compound %in% perc.outliers$compound),]

#redoing the PCA
pr = prcomp(perc.cleaned[,3:6], center = TRUE, scale = TRUE)
ggbiplot(pr, labels = perc.cleaned$compound)

Things are getting clearer. The two first components now explain ~70% of the found variance. There are still a few outliers, but they are comparatively close to the cluster centre. We find again some known compound (a-terpinolen, Octane) which were already known for their high delta.

Let’s now take a look to compound classes.

cl = factor(perc.cleaned$class)
ggbiplot(pr, groups = cl) + scale_color_brewer(name='Compound\nclass', palette = 'Dark2')

A few considerations:

classes are unbalanced in numbers. We have 44 compounds of class 5, followed by only 5 compounds of class 1. This makes hard to distinguish between techniques (to put it simple, everything is class 5). Breaking it down could help
the two distillations keep being similar
there is a strange pattern forming a straight line of dots. It could be due to actual, correct data (especially for compounds being zero for some techniques but not for others). Or it could be an artefact. Further investigation is required

ROSE - exploratory study

Nelson Nazzicari

June 26, 2017

Starting dataset

First PCA

Second PCA: no outliers, rescaled