Data analysis for phenolics in R

Introduction

The purpose of this document is to illustrate the functions for data analysis that can be applied to our data sets.
The main steps have been summarized in convenient functions, in order to keep the analysis quick and clean.
Functions are provided to load and print chromatogram traces, and to load and visualize peak areas quantitated from
chromatograms. All the input is done through simple CSV files.

Chromatograms

Chromatograms must be exported from Chemstation through the command “Export 3-D data” in the “tools” menu.
After loading it with the read.csv() function, the result is a data frame in which the first column contains
the scan number and the following columns the intensities for each ion. The chromatogram in this way is not ready
for use yet, and we want to convert the scan number to the time elapsed. To do so, we have the ChromFix()
function. It accepts as input the data.frame of the chromatogram and the time of the first and last scan. It
automatically calculates TIC, and rescales it to an arbitrary factor in order to achieve a better visualization.
As an example we'll use a full scan chromatogram from a chicory sample.

threeD <- read.csv("3D.csv")
chrom.fixed <- ChromFix(threeD)
print(chrom.fixed[1:10, c(1:4, ncol(chrom.fixed))])
##     time  X50  X51  X52    TIC
## 1  6.009 1186 1685 2600  625.0
## 2  6.017 1376 2107 2447  728.0
## 3  6.026 1541 1678 2557  953.8
## 4  6.034 1353 2313 2653 1489.1
## 5  6.043 1639 2968 3338 2304.7
## 6  6.051 1518 3873 3336 3470.8
## 7  6.060 1571 4356 3779 4514.4
## 8  6.068 1738 4660 3206 5107.5
## 9  6.077 1915 3855 3303 4580.8
## 10 6.085 1458 2884 2857 3074.6

After fixing the chromatogram, we want to melt it in long format to be able to graph it with ggplot2. The
function \emph{ChromMelt()} does this and also removes the annoying X's in front of the ion m/z.

chrom.melted <- ChromMelt(chrom.fixed)
head(chrom.melted)
##    time ion value
## 1 6.009  50  1186
## 2 6.017  50  1376
## 3 6.026  50  1541
## 4 6.034  50  1353
## 5 6.043  50  1639
## 6 6.051  50  1518

Finally we can plot the chromatogram with the ChromPlot function. By default it plots TIC in black.
However it is very easy to extract individual ion chromatograms, change colors and zoom.

ChromPlot(chrom.fixed)

plot of chunk plot-chromatogram

ChromPlot(chrom.fixed, ions = c("219", "345"), ions.color = c("blue4", 
    "darkred"), ions.zoom = c(1, 1), ions.offset = c(0, 0), ions.rescale = FALSE)

plot of chunk plot-chromatogram

ChromPlot(chrom.fixed, ions = c("219", "249", "345"), ions.color = c("blue4", 
    "darkgreen", "darkred"), ions.zoom = c(1, 1, 1), ions.offset = c(0, 0, 0), 
    ions.rescale = FALSE, x.zoom = c(34, 35), y.zoom = c(0, 10000))

plot of chunk plot-chromatogram

ChromPlot(chrom.fixed, ions = c("219", "345"), ions.color = c("blue4", 
    "darkred"), ions.zoom = c(1, 1), ions.offset = c(0, 0), ions.rescale = TRUE, 
    x.zoom = c(34, 35))

plot of chunk plot-chromatogram

ChromPlot(chrom.fixed, ions = c("219", "249", "345"), ions.color = c("blue4", 
    "darkgreen", "darkred"), ions.zoom = c(1, 1, 1), ions.offset = c(0, 10000, 
    20000), ions.rescale = FALSE, x.zoom = c(30, 40))

plot of chunk plot-chromatogram

ChromPlot(chrom.fixed, ions = c("219", "249", "345"), ions.color = c("blue4", 
    "darkgreen", "darkred"), ions.zoom = c(1, 1, 1), ions.offset = c(0, 0.2, 
    0.4), ions.rescale = TRUE, x.zoom = c(34, 35))

plot of chunk plot-chromatogram

Data analysis

Here comes the most important part. The workflow should ideally proceed as this:

  1. Quantify individual chromatogram in Chemstation and export in excel
  2. Merge all data in one single excel file and load file in R
  3. Normalize each chromatogram for its internal standard
  4. Average technical replicates
  5. If desired rescale and mean center
  6. Analyze data

It is of uttermost importance that normalization with internal standard is performed before any other
transformation of the data, including averaging of technical replicates or rescaling.

!(workflow.pdf)

The input must be a data.table in which every column is a sample, and each row an analyte. Samples must follow
a coding to distinguish experimental conditions and technical replicates. Their name can be composed of up to 4
“blocks”, separated by an underscore. Both numbers and
letters can be used. An example of a sample name series is Treatment_Sample1_ReplicateA,
Treatment_Sample1_ReplicateB, Treatment_Sample1_ReplicateC, Treatment_Sample2_ReplicateA,
…, Control_Sample1_ReplicateA, …, Control_Sample2_ReplicateA, …
It is important that the blocks are assigned so that the rightmost one is the fastest changing,
while the leftmost one is the slowest changing. As an example, we'll load a dataset from a chicory sampling.
In the dataset we have 5 chicory varieties (RH = Rudy Pasgang hydroponic, RG = Rudy Pasgang ground, T =
Topscore, F = Focus, MB = Montblanc); for each of them 5 individual leaf samples were analyzed in two
technical replicates. In total there are 3 “blocks”:

  1. the varieties
  2. the sample number
  3. the technical repetitions We also redefine the order of the levels in a more meaningful way, so that chemically correlated compounds appear close to each other.
selecting.varieties <- read.csv("data_analysis_edited.csv")
selecting.varieties[, 1] <- factor(selecting.varieties[, 1], levels = c("quinic acid", 
    "p-coumaric acid", "caffeic acid", "caffeic acid isomer", "chlorogenic acid", 
    "neochlorogenic acid", "chlorogenic acid isomer 1", "chlorogenic acid isomer 2", 
    "caftaric acid", "ferulic acid", "feruloyl quinic acid", "feruloyl tartaric acid", 
    "phenyl-beta-d-glp"))
print(selecting.varieties[1:10, 1:5])
##                Compound.Name   RH_1_a   RH_1_b   RH_2_a   RH_2_b
## 1          phenyl-beta-d-glp   956269   620417   858300   788485
## 2                quinic acid   464858   638994   207355   327551
## 3            p-coumaric acid    28295    14756    25305    27591
## 4        caffeic acid isomer  1785242  1273867  1546105  1845096
## 5               ferulic acid    58996    34304    35073    39378
## 6               caffeic acid 31659700 19794454 20170831 21679547
## 7     feruloyl tartaric acid    53226    34571    12274    22682
## 8              caftaric acid  8626191  5307349  1169070  2770500
## 9  chlorogenic acid isomer 1    77109    51419    39319    39355
## 10 chlorogenic acid isomer 2    17559     7999     7696     4832

The function IstdNormalize() divides the areas of each analyte in a column by the area of the internal
standard, and removes the latter. By default internal standard is searched at the first row, but it can be
specified at will

normalized.varieties <- IstdNormalize(selecting.varieties, istd = 1)
print(normalized.varieties[1:10, 1:5])
##                     compound   RH_1_a   RH_1_b    RH_2_a    RH_2_b
## 2                quinic acid  0.48612  1.02994  0.241588  0.415418
## 3            p-coumaric acid  0.02959  0.02378  0.029483  0.034992
## 4        caffeic acid isomer  1.86688  2.05324  1.801357  2.340052
## 5               ferulic acid  0.06169  0.05529  0.040863  0.049941
## 6               caffeic acid 33.10753 31.90508 23.500910 27.495193
## 7     feruloyl tartaric acid  0.05566  0.05572  0.014300  0.028767
## 8              caftaric acid  9.02067  8.55449  1.362076  3.513700
## 9  chlorogenic acid isomer 1  0.08064  0.08288  0.045810  0.049912
## 10 chlorogenic acid isomer 2  0.01836  0.01289  0.008967  0.006128
## 11      feruloyl quinic acid  0.05746  0.06809  0.029695  0.038305

Then we must average the technical replicates. The inner workings of the function are a bit difficult.
Basically first we transform the data into the long format, creating new, separate columns for
experimental conditions and technical repetitions. Then we average all the rows with the same experimental
conditions, and finally we cast it back into the wide format. On the practical side, to do so we must simply
specify the number of “blocks” defining the sample. The function will average the rightmost one (generally the
technical repetitions). Of course it is also possible to repeat the process more than one time, in order
to average also the other “blocks”.

averaged.varieties <- AvgRep(normalized.varieties, n.splits = 3)
## Using compound as id variables
## Using area as value column.  Use the value argument to cast to override this choice
## Using area as value column.  Use the value argument to cast to override this choice
print(averaged.varieties[1:10, 1:5])
##                     compound      F_1      F_2      F_3      F_4
## 1                quinic acid  2.29430  1.75434  1.97473  1.61286
## 2            p-coumaric acid  0.11690  0.08704  0.09807  0.08634
## 3               caffeic acid 23.88367 19.79064 25.75655 22.56089
## 4        caffeic acid isomer  2.51853  2.66497  2.98266  2.35239
## 5           chlorogenic acid 12.86009  8.13978  7.65877  7.86788
## 6        neochlorogenic acid  0.21571  0.14859  0.14438  0.17251
## 7  chlorogenic acid isomer 1  0.45208  0.42023  0.31987  0.30237
## 8  chlorogenic acid isomer 2  0.08339  0.04747  0.04744  0.04969
## 9              caftaric acid  2.86197  1.27703  2.05852  2.40229
## 10              ferulic acid  0.03697  0.04089  0.04277  0.04214
twice.averaged.varieties <- AvgRep(averaged.varieties, n.splits = 2)
## Using compound as id variables
## Using area as value column.  Use the value argument to cast to override this choice
## Using area as value column.  Use the value argument to cast to override this choice
print(twice.averaged.varieties[1:10, 1:5])
##                     compound        F       MB       RG       RH
## 1                quinic acid  2.04676  0.68868  0.36943  0.66476
## 2            p-coumaric acid  0.10708  0.09318  0.02730  0.02656
## 3               caffeic acid 23.04432 16.61523 12.98445 25.46560
## 4        caffeic acid isomer  2.51761  0.53706  0.45941  1.98182
## 5           chlorogenic acid 10.19031  3.76665  1.24717  2.81513
## 6        neochlorogenic acid  0.19615  0.15062  0.03434  0.07969
## 7  chlorogenic acid isomer 1  0.38269  0.05691  0.03946  0.09913
## 8  chlorogenic acid isomer 2  0.05913  0.02668  0.01912  0.01377
## 9              caftaric acid  2.50988  3.62420  2.24016  3.49557
## 10              ferulic acid  0.04113  0.04056  0.01792  0.05087

We can now plot our data to see what's going on. The DataDotsPlot(), DataSimplePlot() and DataBoxPlot()
functions will take care of the tedious details of plotting for you, returning a ggplot2 object than can be
modified afterwards. Without specifying any detail the function will plot the areas versus the compound names
without any further grouping or subdivision (this means that “blocks” are ignored).

DataDotsPlot(averaged.varieties, n.splits = 2)
## Using compound as id variables

plot of chunk data-plot

DataSimplePlot(averaged.varieties, n.splits = 2)
## Using compound as id variables

plot of chunk data-plot

DataBoxPlot(averaged.varieties, n.splits = 2)
## Using compound as id variables

plot of chunk data-plot

This graph however does not give us any information about how compounds differ between samples or varieties. To see it,
we can use faceting. The function automatically chooses the parameters to be mapped for faceting starting from the
leftmost block. This is to say:

*One split: data is faceted according to the leftmost block. Facets are ordered according to the levels of the block
*Two splits: data is faceted on the vertical axis according to the leftmost “block”, and on the horizontal axis
according to the “block” before the leftmost.
*More than two splits: data is faceted on the vertical axis according to the leftmost “block”,
and on the horizontal axis according to the discrete combinations of the levels of the following blocks.

DataSimplePlot(averaged.varieties, n.splits = 2, facets = 1)
## Using compound as id variables

plot of chunk data-facetplot1

DataDotsPlot(averaged.varieties, n.splits = 2, facets = 2)
## Using compound as id variables

plot of chunk data-facetplot2

DataDotsPlot(normalized.varieties, n.splits = 3, facets = 3)
## Using compound as id variables

plot of chunk data-facetplot3

However there is a problem with all this graphs: each compound is measured at a very different amount, making comparison
really difficult. We can then re-center the data, using two approaches:

centered.varieties <- CpdCenter1(averaged.varieties)
zdist.varieties <- CpdCenter2(averaged.varieties)
DataSimplePlot(centered.varieties, n.splits = 2, facets = 1)
## Using compound as id variables

plot of chunk mean-center

DataSimplePlot(zdist.varieties, n.splits = 2, facets = 1)
## Using compound as id variables

plot of chunk mean-center

Plotting like this allows us to see that the data is log-normal rather than normal. So before proceeding
any further with statistics, we might want to do a log-transform. Remember that before doing any processing on the
data, including log-normal transformation, we have to average between the technical replicates. so instead
of taking the logarithm on the raw data, we'll take it from the averaged data.

log.averaged.varieties <- cbind(averaged.varieties[, 1], log10(averaged.varieties[, 
    -1]))
log.zdist.varieties <- CpdCenter2(log.averaged.varieties)
DataSimplePlot(log.zdist.varieties, n.splits = 2, facets = 1)
## Using compound as id variables

plot of chunk log-transform

Now the data is normally distributed and we can go on with our analysis. One interesting thing is analyzing correlation
between the amount of different compounds. The function ScatterplotMatrix() takes as input the usual dataframe,
and draws the scatterplot matrix of all the compounds. No “block” is considered here.

ScatterplotMatrix(log.averaged.varieties)

plot of chunk scatteplot-matrix

Another interesting analysis is PCA.

PCAPlot(log.zdist.varieties, 2, c(1, 2))

plot of chunk PCA