The purpose of this document is to illustrate the functions for data analysis that can be applied to our data sets.
The main steps have been summarized in convenient functions, in order to keep the analysis quick and clean.
Functions are provided to load and print chromatogram traces, and to load and visualize peak areas quantitated from
chromatograms. All the input is done through simple CSV files.
Chromatograms must be exported from Chemstation through the command “Export 3-D data” in the “tools” menu.
After loading it with the read.csv() function, the result is a data frame in which the first column contains
the scan number and the following columns the intensities for each ion. The chromatogram in this way is not ready
for use yet, and we want to convert the scan number to the time elapsed. To do so, we have the ChromFix()
function. It accepts as input the data.frame of the chromatogram and the time of the first and last scan. It
automatically calculates TIC, and rescales it to an arbitrary factor in order to achieve a better visualization.
As an example we'll use a full scan chromatogram from a chicory sample.
threeD <- read.csv("3D.csv")
chrom.fixed <- ChromFix(threeD)
print(chrom.fixed[1:10, c(1:4, ncol(chrom.fixed))])
## time X50 X51 X52 TIC
## 1 6.009 1186 1685 2600 625.0
## 2 6.017 1376 2107 2447 728.0
## 3 6.026 1541 1678 2557 953.8
## 4 6.034 1353 2313 2653 1489.1
## 5 6.043 1639 2968 3338 2304.7
## 6 6.051 1518 3873 3336 3470.8
## 7 6.060 1571 4356 3779 4514.4
## 8 6.068 1738 4660 3206 5107.5
## 9 6.077 1915 3855 3303 4580.8
## 10 6.085 1458 2884 2857 3074.6
After fixing the chromatogram, we want to melt it in long format to be able to graph it with ggplot2. The
function \emph{ChromMelt()} does this and also removes the annoying X's in front of the ion m/z.
chrom.melted <- ChromMelt(chrom.fixed)
head(chrom.melted)
## time ion value
## 1 6.009 50 1186
## 2 6.017 50 1376
## 3 6.026 50 1541
## 4 6.034 50 1353
## 5 6.043 50 1639
## 6 6.051 50 1518
Finally we can plot the chromatogram with the ChromPlot function. By default it plots TIC in black.
However it is very easy to extract individual ion chromatograms, change colors and zoom.
ChromPlot(chrom.fixed)
ChromPlot(chrom.fixed, ions = c("219", "345"), ions.color = c("blue4",
"darkred"), ions.zoom = c(1, 1), ions.offset = c(0, 0), ions.rescale = FALSE)
ChromPlot(chrom.fixed, ions = c("219", "249", "345"), ions.color = c("blue4",
"darkgreen", "darkred"), ions.zoom = c(1, 1, 1), ions.offset = c(0, 0, 0),
ions.rescale = FALSE, x.zoom = c(34, 35), y.zoom = c(0, 10000))
ChromPlot(chrom.fixed, ions = c("219", "345"), ions.color = c("blue4",
"darkred"), ions.zoom = c(1, 1), ions.offset = c(0, 0), ions.rescale = TRUE,
x.zoom = c(34, 35))
ChromPlot(chrom.fixed, ions = c("219", "249", "345"), ions.color = c("blue4",
"darkgreen", "darkred"), ions.zoom = c(1, 1, 1), ions.offset = c(0, 10000,
20000), ions.rescale = FALSE, x.zoom = c(30, 40))
ChromPlot(chrom.fixed, ions = c("219", "249", "345"), ions.color = c("blue4",
"darkgreen", "darkred"), ions.zoom = c(1, 1, 1), ions.offset = c(0, 0.2,
0.4), ions.rescale = TRUE, x.zoom = c(34, 35))
Here comes the most important part. The workflow should ideally proceed as this:
It is of uttermost importance that normalization with internal standard is performed before any other
transformation of the data, including averaging of technical replicates or rescaling.
!(workflow.pdf)
The input must be a data.table in which every column is a sample, and each row an analyte. Samples must follow
a coding to distinguish experimental conditions and technical replicates. Their name can be composed of up to 4
“blocks”, separated by an underscore. Both numbers and
letters can be used. An example of a sample name series is Treatment_Sample1_ReplicateA,
Treatment_Sample1_ReplicateB, Treatment_Sample1_ReplicateC, Treatment_Sample2_ReplicateA,
…, Control_Sample1_ReplicateA, …, Control_Sample2_ReplicateA, …
It is important that the blocks are assigned so that the rightmost one is the fastest changing,
while the leftmost one is the slowest changing. As an example, we'll load a dataset from a chicory sampling.
In the dataset we have 5 chicory varieties (RH = Rudy Pasgang hydroponic, RG = Rudy Pasgang ground, T =
Topscore, F = Focus, MB = Montblanc); for each of them 5 individual leaf samples were analyzed in two
technical replicates. In total there are 3 “blocks”:
selecting.varieties <- read.csv("data_analysis_edited.csv")
selecting.varieties[, 1] <- factor(selecting.varieties[, 1], levels = c("quinic acid",
"p-coumaric acid", "caffeic acid", "caffeic acid isomer", "chlorogenic acid",
"neochlorogenic acid", "chlorogenic acid isomer 1", "chlorogenic acid isomer 2",
"caftaric acid", "ferulic acid", "feruloyl quinic acid", "feruloyl tartaric acid",
"phenyl-beta-d-glp"))
print(selecting.varieties[1:10, 1:5])
## Compound.Name RH_1_a RH_1_b RH_2_a RH_2_b
## 1 phenyl-beta-d-glp 956269 620417 858300 788485
## 2 quinic acid 464858 638994 207355 327551
## 3 p-coumaric acid 28295 14756 25305 27591
## 4 caffeic acid isomer 1785242 1273867 1546105 1845096
## 5 ferulic acid 58996 34304 35073 39378
## 6 caffeic acid 31659700 19794454 20170831 21679547
## 7 feruloyl tartaric acid 53226 34571 12274 22682
## 8 caftaric acid 8626191 5307349 1169070 2770500
## 9 chlorogenic acid isomer 1 77109 51419 39319 39355
## 10 chlorogenic acid isomer 2 17559 7999 7696 4832
The function IstdNormalize() divides the areas of each analyte in a column by the area of the internal
standard, and removes the latter. By default internal standard is searched at the first row, but it can be
specified at will
normalized.varieties <- IstdNormalize(selecting.varieties, istd = 1)
print(normalized.varieties[1:10, 1:5])
## compound RH_1_a RH_1_b RH_2_a RH_2_b
## 2 quinic acid 0.48612 1.02994 0.241588 0.415418
## 3 p-coumaric acid 0.02959 0.02378 0.029483 0.034992
## 4 caffeic acid isomer 1.86688 2.05324 1.801357 2.340052
## 5 ferulic acid 0.06169 0.05529 0.040863 0.049941
## 6 caffeic acid 33.10753 31.90508 23.500910 27.495193
## 7 feruloyl tartaric acid 0.05566 0.05572 0.014300 0.028767
## 8 caftaric acid 9.02067 8.55449 1.362076 3.513700
## 9 chlorogenic acid isomer 1 0.08064 0.08288 0.045810 0.049912
## 10 chlorogenic acid isomer 2 0.01836 0.01289 0.008967 0.006128
## 11 feruloyl quinic acid 0.05746 0.06809 0.029695 0.038305
Then we must average the technical replicates. The inner workings of the function are a bit difficult.
Basically first we transform the data into the long format, creating new, separate columns for
experimental conditions and technical repetitions. Then we average all the rows with the same experimental
conditions, and finally we cast it back into the wide format. On the practical side, to do so we must simply
specify the number of “blocks” defining the sample. The function will average the rightmost one (generally the
technical repetitions). Of course it is also possible to repeat the process more than one time, in order
to average also the other “blocks”.
averaged.varieties <- AvgRep(normalized.varieties, n.splits = 3)
## Using compound as id variables
## Using area as value column. Use the value argument to cast to override this choice
## Using area as value column. Use the value argument to cast to override this choice
print(averaged.varieties[1:10, 1:5])
## compound F_1 F_2 F_3 F_4
## 1 quinic acid 2.29430 1.75434 1.97473 1.61286
## 2 p-coumaric acid 0.11690 0.08704 0.09807 0.08634
## 3 caffeic acid 23.88367 19.79064 25.75655 22.56089
## 4 caffeic acid isomer 2.51853 2.66497 2.98266 2.35239
## 5 chlorogenic acid 12.86009 8.13978 7.65877 7.86788
## 6 neochlorogenic acid 0.21571 0.14859 0.14438 0.17251
## 7 chlorogenic acid isomer 1 0.45208 0.42023 0.31987 0.30237
## 8 chlorogenic acid isomer 2 0.08339 0.04747 0.04744 0.04969
## 9 caftaric acid 2.86197 1.27703 2.05852 2.40229
## 10 ferulic acid 0.03697 0.04089 0.04277 0.04214
twice.averaged.varieties <- AvgRep(averaged.varieties, n.splits = 2)
## Using compound as id variables
## Using area as value column. Use the value argument to cast to override this choice
## Using area as value column. Use the value argument to cast to override this choice
print(twice.averaged.varieties[1:10, 1:5])
## compound F MB RG RH
## 1 quinic acid 2.04676 0.68868 0.36943 0.66476
## 2 p-coumaric acid 0.10708 0.09318 0.02730 0.02656
## 3 caffeic acid 23.04432 16.61523 12.98445 25.46560
## 4 caffeic acid isomer 2.51761 0.53706 0.45941 1.98182
## 5 chlorogenic acid 10.19031 3.76665 1.24717 2.81513
## 6 neochlorogenic acid 0.19615 0.15062 0.03434 0.07969
## 7 chlorogenic acid isomer 1 0.38269 0.05691 0.03946 0.09913
## 8 chlorogenic acid isomer 2 0.05913 0.02668 0.01912 0.01377
## 9 caftaric acid 2.50988 3.62420 2.24016 3.49557
## 10 ferulic acid 0.04113 0.04056 0.01792 0.05087
We can now plot our data to see what's going on. The DataDotsPlot(), DataSimplePlot() and DataBoxPlot()
functions will take care of the tedious details of plotting for you, returning a ggplot2 object than can be
modified afterwards. Without specifying any detail the function will plot the areas versus the compound names
without any further grouping or subdivision (this means that “blocks” are ignored).
DataDotsPlot(averaged.varieties, n.splits = 2)
## Using compound as id variables
DataSimplePlot(averaged.varieties, n.splits = 2)
## Using compound as id variables
DataBoxPlot(averaged.varieties, n.splits = 2)
## Using compound as id variables
This graph however does not give us any information about how compounds differ between samples or varieties. To see it,
we can use faceting. The function automatically chooses the parameters to be mapped for faceting starting from the
leftmost block. This is to say:
*One split: data is faceted according to the leftmost block. Facets are ordered according to the levels of the block
*Two splits: data is faceted on the vertical axis according to the leftmost “block”, and on the horizontal axis
according to the “block” before the leftmost.
*More than two splits: data is faceted on the vertical axis according to the leftmost “block”,
and on the horizontal axis according to the discrete combinations of the levels of the following blocks.
DataSimplePlot(averaged.varieties, n.splits = 2, facets = 1)
## Using compound as id variables
DataDotsPlot(averaged.varieties, n.splits = 2, facets = 2)
## Using compound as id variables
DataDotsPlot(normalized.varieties, n.splits = 3, facets = 3)
## Using compound as id variables
However there is a problem with all this graphs: each compound is measured at a very different amount, making comparison
really difficult. We can then re-center the data, using two approaches:
centered.varieties <- CpdCenter1(averaged.varieties)
zdist.varieties <- CpdCenter2(averaged.varieties)
DataSimplePlot(centered.varieties, n.splits = 2, facets = 1)
## Using compound as id variables
DataSimplePlot(zdist.varieties, n.splits = 2, facets = 1)
## Using compound as id variables
Plotting like this allows us to see that the data is log-normal rather than normal. So before proceeding
any further with statistics, we might want to do a log-transform. Remember that before doing any processing on the
data, including log-normal transformation, we have to average between the technical replicates. so instead
of taking the logarithm on the raw data, we'll take it from the averaged data.
log.averaged.varieties <- cbind(averaged.varieties[, 1], log10(averaged.varieties[,
-1]))
log.zdist.varieties <- CpdCenter2(log.averaged.varieties)
DataSimplePlot(log.zdist.varieties, n.splits = 2, facets = 1)
## Using compound as id variables
Now the data is normally distributed and we can go on with our analysis. One interesting thing is analyzing correlation
between the amount of different compounds. The function ScatterplotMatrix() takes as input the usual dataframe,
and draws the scatterplot matrix of all the compounds. No “block” is considered here.
ScatterplotMatrix(log.averaged.varieties)
Another interesting analysis is PCA.
PCAPlot(log.zdist.varieties, 2, c(1, 2))