This documents describes data import, exploration, preprocessing and analysis of LCMS experiments with xcms version >= 3.
xcms
supports analysis of LC/MS data from files in (AIA/ANDI) NetCDF, mzML/mzXML and mzData format. For demonstration purpose we will analyze a subset of the data from [1] in which the metabolic consequences of knocking out the fatty acid amide hydrolase (FAAH) gene in mice was investigated. The raw data files (in NetCDF format) are provided with the faahKO data package. The data set consists of samples from the spinal cords of 6 knock-out and 6 wild-type mice. Each file contains data in centroid mode acquired in positive ion mode form 200-600 m/z and 2500-4500 seconds.
Below we load all required packages, locate the raw CDF files within the faahKO package and build a phenodata data frame describing the experimental setup.
library(xcms)
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: <U+393C><U+3E31>BiocGenerics<U+393C><U+3E32>
The following objects are masked from <U+393C><U+3E31>package:parallel<U+393C><U+3E32>:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from <U+393C><U+3E31>package:stats<U+393C><U+3E32>:
IQR, mad, sd, var, xtabs
The following objects are masked from <U+393C><U+3E31>package:base<U+393C><U+3E32>:
anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: BiocParallel
Loading required package: MSnbase
Loading required package: mzR
Loading required package: Rcpp
mzR has been built against a different Rcpp version (0.12.16)
than is installed on your system (0.12.17). This might lead to errors
when loading mzR. If you encounter such issues, please send a report,
including the output of sessionInfo() to the Bioc support forum at
https://support.bioconductor.org/. For details see also
https://github.com/sneumann/mzR/wiki/mzR-Rcpp-compiler-linker-issue.Loading required package: ProtGenerics
This is MSnbase version 2.6.1
Visit https://lgatto.github.io/MSnbase/ to get started.
Attaching package: <U+393C><U+3E31>MSnbase<U+393C><U+3E32>
The following object is masked from <U+393C><U+3E31>package:stats<U+393C><U+3E32>:
smooth
The following object is masked from <U+393C><U+3E31>package:base<U+393C><U+3E32>:
trimws
This is xcms version 3.2.0
Attaching package: <U+393C><U+3E31>xcms<U+393C><U+3E32>
The following object is masked from <U+393C><U+3E31>package:stats<U+393C><U+3E32>:
sigma
library(faahKO)
library(RColorBrewer)
library(pander)
library(magrittr)
## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
recursive = TRUE)
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = c(rep("KO", 6), rep("WT", 6)),
stringsAsFactors = FALSE)
pd
Subsequently we load the raw data as an OnDiskMSnExp
object using the readMSData
method from the MSnbase
package. While the MSnbase
package was originally developed for proteomics data processing, many of its functionality, including raw data import and data representation, can be shared and reused in metabolomics data analysis.
raw_data <- readMSData(files = cdfs, pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
Polarity can not be extracted from netCDF files, please set manually the polarity with the 'polarity' method.
The OnDiskMSnExp
object contains general information about the number of spectra, retention times, the measured total ion current etc, but does not contain the full raw data (i.e. the m/z and intensity values from each measured spectrum). Its memory footprint is thus rather small making it an ideal object to represent large metabolomics experiments while still allowing to perform simple quality controls, data inspection and exploration as well as data sub-setting operations. The m/z and intensity values are imported from the raw data files on demand, hence the location of the raw data files should not be changed after initial data import.
The OnDiskMSnExp
organizes the MS data by spectrum and provides the methods intensity, mz and rtime to access the raw data from the files (the measured intensity values, the corresponding m/z and retention time values). In addition, the spectra
method could be used to return all data encapsulated in Spectrum
classes. Below we extract the retention time values from the object.
head(rtime(raw_data))
F01.S0001 F01.S0002 F01.S0003 F01.S0004 F01.S0005 F01.S0006
2501.378 2502.943 2504.508 2506.073 2507.638 2509.203
All data is returned as one-dimensional vectors (a numeric vector for rtime
and a list
of numeric vectors for mz
and intensity
, each containing the values from one spectrum), even if the experiment consists of multiple files/samples. The fromFile
function returns a numeric vector that provides the mapping of the values to the originating file. Below we use the fromFile
indices to organize the mz values by file.
mzs <- mz(raw_data)
## Split the list by file
mzs_by_file <- split(mzs, f = fromFile(raw_data))
length(mzs_by_file)
[1] 12
As a first evaluation of the data we plot below the base peak chromatogram (BPC) for each file in our experiment. We use the chromatogram
method and set the aggregationFun
to "max"
to return for each spectrum the maximal intensity and hence create the BPC from the raw data. To create a total ion chromatogram we could set aggregationFun
to sum
.
## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(raw_data, aggregationFun = "max")
## Define colors for the two groups
group_colors <- brewer.pal(3, "Set1")[1:2]
names(group_colors) <- c("KO", "WT")
## Plot all chromatograms.
plot(bpis, col = group_colors[raw_data$sample_group])
Below we extract the chromatogram of the first sample and access its retention time and intensity values.
str(bpis[2,12])
Error in x@.Data[i, j, drop = TRUE] : subscript out of bounds
The chromatogram
method supports also extraction of chromatographic data from a m/z-rt slice of the MS data. In the next section we will use this method to create an extracted ion chromatogram (EIC) for a selected peak.
Below we create boxplots representing the distribution of total ion currents per file. Such plots can be very useful to spot problematic or failing MS runs.
## Get the total ion current by file
tc <- split(tic(raw_data), f = fromFile(raw_data))
boxplot(tc, col = group_colors[raw_data$sample_group],
ylab = "intensity", main = "Total ion current")
Next we perform the chromatographic peak detection using the centWave algorithm [2]. Before running the peak detection it is however strongly suggested to visually inspect e.g. the extracted ion chromatogram of internal standards or known compounds to evaluate and adapt the peak detection settings since the default settings will not be appropriate for most LCMS experiments. The two most critical parameters for centWave are the peakwidth
(expected range of chromatographic peak widths) and ppm
(maximum expected deviation of m/z values of centroids corresponding to one chromatographic peak; this is usually much larger than the ppm specified by the manufacturer) parameters. To evaluate the typical chromatographic peak width we plot the EIC for one peak.
## Define the rt and m/z range of the peak area
rtr <- c(2700, 2900)
mzr <- c(334.9, 335.1)
## extract the chromatogram
chr_raw <- chromatogram(raw_data, mz = mzr, rt = rtr)
plot(chr_raw, col = group_colors[chr_raw$sample_group])
For the ppm
parameter we extract the full MS data (intensity, retention time and m/z values) corresponding to the above peak. To this end we first filter the raw object by retention time, then by m/z and finally plot the object with type = "XIC"
to produce the plot below. We use the pipe (%>%)
command better illustrate the corresponding workflow.
raw_data %>%
filterRt(rt = rtr) %>%
filterMz(mz = mzr) %>%
plot(type = "XIC")
For each plot: upper panel: chromatogram plotting the intensity values against the retention time, lower panel m/z against retention time plot. The individual data points are colored according to the intensity.
Below we perform the chromatographic peak detection using the findChromPeaks method. The submitted parameter object defines which algorithm will be used and allows to define the settings for this algorithm. Note that we set the argument noise to 1000 to slightly speed up the analysis by considering only signals with a value larger than 1000 in the peak detection step.
cwp <- CentWaveParam(peakwidth = c(30, 80), noise = 1000)
xdata <- findChromPeaks(raw_data, param = cwp)
The results are returned as an XCMSnExp object which extends the OnDiskMSnExp object by storing also LC/GC-MS preprocessing results. This means also that all methods to sub-set and filter the data or to access the (raw) data are inherited from the OnDiskMSnExp object. The results from the chromatographic peak detection can be accessed with the chromPeaks method.
head(chromPeaks(xdata))
mz mzmin mzmax rt rtmin rtmax into intb maxo
[1,] 236.1 236.1 236.1 2520.158 2501.378 2553.022 301289.53 268211.58 12957
[2,] 362.9 362.9 362.9 2596.840 2587.451 2604.665 19916.60 19674.79 1453
[3,] 302.0 302.0 302.0 2617.185 2587.451 2648.484 699458.52 687162.27 30552
[4,] 316.2 316.2 316.2 2668.828 2661.003 2676.653 51310.09 50854.60 4267
[5,] 370.1 370.1 370.1 2673.523 2643.789 2706.387 458247.09 423667.40 25672
[6,] 360.0 360.0 360.0 2682.913 2635.964 2718.907 9034381.61 8518838.59 317568
sn sample is_filled
[1,] 13 1 0
[2,] 12 1 0
[3,] 73 1 0
[4,] 17 1 0
[5,] 16 1 0
[6,] 20 1 0
We can also plot the location of the identified chromatographic peaks in the m/z - retention time space for one file using the plotChromPeaks function. Below we plot the chromatographic peaks for the 5th sample.
plotChromPeaks(xdata, file = 5)
To get a global overview of the peak detection we can plot the frequency of identified peaks per file along the retention time axis. This allows to identify time periods along the MS run in which a higher number of peaks was identified and evaluate whether this is consistent across files.
plotChromPeakImage(xdata)
Next we highlight the identified chromatographic peaks for the example peak from before. Evaluating such plots on a list of peaks corresponding to known peaks or internal standards helps to ensure that peak detection settings were appropriate and correctly identified the expected peaks. The rectangles indicate the identified chromatographic peaks per sample.
plot(chr_raw, col = group_colors[chr_raw$sample_group], lwd = 3)
highlightChromPeaks(xdata, border = group_colors[chr_raw$sample_group],
lty = 3, rt = rtr, mz = mzr)
pander(chromPeaks(xdata, mz = mzr, rt = rtr),
caption = paste("Identified chromatographic peaks in a selected ",
"m/z and retention time range."))
-----------------------------------------------------------------------------
mz mzmin mzmax rt rtmin rtmax into intb maxo sn
----- ------- ------- ------ ------- ------- --------- --------- ------- ----
335 335 335 2783 2756 2817 421286 392692 16856 26
335 335 335 2788 2756 2821 1529680 1490861 58736 77
335 335 335 2788 2758 2822 221355 204694 8158 19
335 335 335 2786 2756 2821 299084 281522 9154 22
335 335 335 2799 2758 2838 174587 160226 6295 13
335 335 335 2788 2756 2822 954335 933983 35856 76
335 335 335 2791 2758 2827 1668431 1635820 54640 94
335 335 335 2786 2756 2810 644912 632013 20672 54
335 335 335 2794 2769 2832 931316 904196 27200 49
-----------------------------------------------------------------------------
Table: Identified chromatographic peaks in a selected m/z and retention time range. (continued below)
--------------------
sample is_filled
-------- -----------
1 0
2 0
3 0
4 0
5 0
8 0
9 0
10 0
11 0
--------------------
Finally we plot also the distribution of peak intensity per sample. This allows to investigate whether systematic differences in peak signals between samples are present.
## Extract a list of per-sample peak intensities (in log2 scale)
ints <- split(log2(chromPeaks(xdata)[, "into"]),
f = chromPeaks(xdata)[, "sample"])
boxplot(ints, varwidth = TRUE, col = group_colors[xdata$sample_group],
ylab = expression(log[2]~intensity), main = "Peak intensities")
grid(nx = NA, ny = NULL)
The time at which analytes elute in the chromatography can vary between samples (and even compounds). Such a difference was already observable in the extracted ion chromatogram plot shown as an example in the previous section. The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment. The method to perform the alignment/retention time correction in xcms
is adjustRtime
which uses different alignment algorithms depending on the provided parameter class. In the example below we use the obiwarp method [4] to align the samples. We use a binSize = 0.6
which creates warping functions in mz bins of 0.6. Also here it is advisable to modify the settings for each experiment and evaluate if retention time correction did align internal controls or known compounds properly.
xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.6))
adjustRtime
, besides calculating adjusted retention times for each spectrum, does also adjust the reported retention times of the identified chromatographic peaks.
To extract the adjusted retention times we can use the adjustedRtime
method, or simply the rtime
method that, if present, returns by default adjusted retention times from an XCMSnExp
object.
## Or simply use the rtime method
head(rtime(xdata, adjusted = TRUE))
F01.S0001 F01.S0002 F01.S0003 F01.S0004 F01.S0005 F01.S0006
2501.378 2502.958 2504.538 2506.118 2507.699 2509.280
To evaluate the impact of the alignment we plot the BPC on the adjusted data. In addition we plot the differences of the adjusted- to the raw retention times per sample using the plotAdjustedRtime
function.
## Get the base peak chromatograms.
bpis_adj <- chromatogram(xdata, aggregationFun = "max")
par(mfrow = c(2, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(bpis_adj, col = group_colors[bpis_adj$sample_group])
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group])
Base peak chromatogram after alignment (top) and difference between adjusted and raw retention times along the retention time axis (bottom).
Too large differences between adjusted and raw retention times could indicate poorly performing samples or alignment.
At last we evaluate the impact of the alignment on the test peak.
par(mfrow = c(2, 1))
## Plot the raw data
plot(chr_raw, col = group_colors[chr_raw$sample_group])
## Extract the chromatogram from the adjusted object
chr_adj <- chromatogram(xdata, rt = rtr, mz = mzr)
plot(chr_adj, col = group_colors[chr_raw$sample_group])
The final step in the metabolomics preprocessing is the correspondence that matches detected chromatographic peaks between samples. The method to perform the correspondence in xcms
is groupChromPeaks
. We will use the peak density method to group chromatographic peaks. The algorithm combines chromatographic peaks depending on the density of peaks along the retention time axis within small slices along the mz dimension.
To illustrate this we plot below the chromatogram for an mz slice with multiple chromatographic peaks within each sample. We use below a value of 0.4 for the minFraction
parameter hence only chromatographic peaks present in at least 40% of the samples per sample group are grouped into a feature. The sample group assignment is specified with the sampleGroups
argument.
## Define the mz slice.
mzr <- c(305.05, 305.15)
## Extract and plot the chromatograms
chr_mzr <- chromatogram(xdata, mz = mzr, rt = c(2500, 4000))
par(mfrow = c(3, 1), mar = c(1, 4, 1, 0.5))
cols <- group_colors[chr_mzr$sample_group]
plot(chr_mzr, col = cols, xaxt = "n", xlab = "")
## Highlight the detected peaks in that region.
highlightChromPeaks(xdata, mz = mzr, col = cols, type = "point", pch = 16)
## Define the parameters for the peak density method
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.99, bw = 30)
par(mar = c(4, 4, 1, 0.5))
plotChromPeakDensity(xdata, mz = mzr, col = cols, param = pdp,
pch = 16, xlim = c(2500, 4000))
## Use a different bw
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.4, bw = 20)
plotChromPeakDensity(xdata, mz = mzr, col = cols, param = pdp,
pch = 16, xlim = c(2500, 4000))
The upper panel in the plot above shows the extracted ion chromatogram for each sample with the detected peaks highlighted. The middle and lower plot shows the retention time for each detected peak within the different samples. The black solid line represents the density distribution of detected peaks along the retention times. Peaks combined into features (peak groups) are indicated with grey rectangles. Different values for the bw
parameter of the PeakDensityParam
were used: bw = 30
in the middle and bw = 20
in the lower panel. With the default value for the parameter bw the two neighboring chromatographic peaks would be grouped into the same feature, while with a bw
of 20 they would be grouped into separate features. This grouping depends on the parameters for the density function and other parameters passed to the algorithm with the PeakDensityParam
.
## Perform the correspondence
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.4, bw = 20)
xdata <- groupChromPeaks(xdata, param = pdp)
The results from the correspondence can be extracted using the featureDefinitions
method, that returns a DataFrame
with the definition of the features. The featureValues
method returns a matrix with rows being features and columns samples. The content of this matrix can be defined using the value
argument. Setting value = "into"
returns a matrix with the integrated signal of the peaks corresponding to a feature in a sample. Any column name of the chromPeaks
matrix can be passed to the argument value
. Below we extract the integrated peak intensity per feature/sample.
## Extract the feature definitions
featureDefinitions(xdata)
## Extract the into column for each feature.
head(featureValues(xdata, value = "into"))
This feature matrix contains NA
for samples in which no chromatographic peak was detected in the feature’s m/z-rt region. While in many cases there might indeed be no peak signal in the respective region, it might also be that there is signal, but the peak detection algorithm failed to detect a chromatographic peak. xcms
provides the fillChromPeaks
method to fill in intensity data for such missing values from the original files. The filled in peaks are added to the chromPeaks
matrix and are flagged with an 1
in the "is_filled"
column. Below we perform such a filling-in of missing peaks.
xdata <- fillChromPeaks(xdata)
Requesting 103 missing peaks from ko15.CDF ... got 100.
Requesting 97 missing peaks from ko16.CDF ... got 94.
Requesting 82 missing peaks from ko18.CDF ... got 81.
Requesting 117 missing peaks from ko19.CDF ... got 115.
Requesting 133 missing peaks from ko21.CDF ... got 125.
Requesting 112 missing peaks from ko22.CDF ... got 106.
Requesting 122 missing peaks from wt15.CDF ... got 115.
Requesting 104 missing peaks from wt16.CDF ... got 99.
Requesting 126 missing peaks from wt18.CDF ... got 120.
Requesting 122 missing peaks from wt19.CDF ... got 118.
Requesting 137 missing peaks from wt21.CDF ... got 129.
Requesting 110 missing peaks from wt22.CDF ... got 102.
head(featureValues(xdata))
ko15.CDF ko16.CDF ko18.CDF ko19.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF
FT001 43 416 912 1242 1500 1727 1978 2350
FT002 25 3898 3992 1232 1495 1724 1968 2341
FT003 3798 399 881 4073 1482 1707 1962 2327
FT004 1 376 3993 4074 4188 4313 1943 2304
FT005 253 3899 1122 1397 4189 1880 4419 2540
FT006 46 424 916 4075 4190 1728 1984 4534
wt18.CDF wt19.CDF wt21.CDF wt22.CDF
FT001 2705 3000 3270 3508
FT002 2693 2996 3267 3501
FT003 2673 4753 4871 3482
FT004 4633 4754 3235 5000
FT005 4634 3184 4872 5001
FT006 2706 3004 4873 3519
For features without detected peaks in a sample, the method extracts all intensities in the mz-rt region of the feature, integrates the signal and adds a filled-in peak to the chromPeaks
matrix. No peak is added if no signal is measured/available for the mz-rt region of the feature. For these, even after filling in missing peak data, a NA is reported in the featureValues matrix.
Below we compare the number of missing values before and after filling in missing values. We can use the parameter filled of the featureValues method to define whether or not filled-in peak values should be returned too.
## Missing values before filling in peaks
apply(featureValues(xdata, filled = FALSE), MARGIN = 2,
FUN = function(z) sum(is.na(z)))
ko15.CDF ko16.CDF ko18.CDF ko19.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF
103 97 82 117 133 112 122 104
wt18.CDF wt19.CDF wt21.CDF wt22.CDF
126 122 137 110
## Missing values after filling in peaks
apply(featureValues(xdata), MARGIN = 2,
FUN = function(z) sum(is.na(z)))
ko15.CDF ko16.CDF ko18.CDF ko19.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF
3 3 1 2 8 6 7 5
wt18.CDF wt19.CDF wt21.CDF wt22.CDF
6 4 8 8
At last we perform a principal component analysis to evaluate the grouping of the samples in this experiment. Note that we did not perform any data normalization hence the grouping might (and will) also be influenced by technical biases.
## Extract the features and log2 transform them
ft_ints <- log2(featureValues(xdata, value = "into"))
## Perform the PCA omitting all features with an NA in any of the
## samples. Also, the intensities are mean centered.
pc <- prcomp(t(na.omit(ft_ints)), center = TRUE)
## Plot the PCA
cols <- group_colors[xdata$sample_group]
pcSummary <- summary(pc)
plot(pc$x[, 1], pc$x[,2], pch = 21, main = "",
xlab = paste0("PC1: ", format(pcSummary$importance[2, 1] * 100,
digits = 3), " % variance"),
ylab = paste0("PC2: ", format(pcSummary$importance[2, 2] * 100,
digits = 3), " % variance"),
col = "darkgrey", bg = cols, cex = 2)
grid()
text(pc$x[, 1], pc$x[,2], labels = xdata$sample_name, col = "darkgrey",
pos = 3, cex = 2)
We can see the expected separation between the KO and WT samples on PC2. On PC1 samples separate based on their ID, samples with an ID <= 18 from samples with an ID > 18. This separation might be caused by a technical bias (e.g. measurements performed on different days/weeks) or due to biological properties of the mice analyzed (sex, age, litter mates etc).