Introduction & Installation
metIdentify is a R package which can be used for metabolite identification based on in-house and public database.
Please install it via github.
Database construction
If you have in-house standards which have been acquired with MS spectra data, then you can construct the in-house MS2 spectra databases using metIdentify package.
Data preparation
Firstly, please transform your raw standard MS data (positive and negative modes) to mzXML format using ProteoWizard. The parameter setting is shown like figure below:
Data organization
Secondly, please organize your standard information as a table, and output as a csv or xlsx format. The format of stanford information can refer to our demo data in demoData package.
From column 1 to 11, the columns are “Lab.ID”, “mz”, “RT”, “CAS.ID”, “HMDB.ID”, “KEGG.ID”, “Formula”, “mz.pos”, “mz.neg”, “Submitter”, respectively. It it OK if you have another information for the standards. Like the demo data shows, there are other additional information, namely “Family”, “Sub.pathway” and “Note”.
mz: Accurate mass of compound.
RT: Retention time, second.
mz.pos: Mass to change ratio of compound in positive mode, for example, M+H.
mz.neg: Mass to change ratio of compound in negative mode, for example, M-H.
Submitter: The name of person or organization.
Then creat a folder and put you mzXML format datasets (positive mode in ‘POS’ folder and negative mode in ‘NEG’ folder) and compound information in it.
Run databaseConstruction function
We use the demo data in demoData package to show how to use metIdentify. Please install it first.
library(demoData)
library(metIdentify)
path <- system.file("database_construction", package = "demoData")
file.copy(from = path, to = ".", overwrite = TRUE, recursive = TRUE)
#> [1] TRUE
new.path <- file.path("./database_construction")
test.database <- databaseConstruction(path = new.path,
version = "0.0.1",
metabolite.info.name = "metabolite.info_RPLC.csv",
source = "Michael Snyder lab",
link = "http://snyderlab.stanford.edu/",
creater = "Xiaotao Shen",
email = "shenxt1990@163.com",
rt = TRUE,
mz.tol = 15,
rt.tol = 30,
threads = 3)
#> Reading positive MS2 data...
#> Reading MS2 data...
#> Processing...
#>
|
| | 0%
|
|====================== | 33%
|
|=========================================== | 67%
|
|=================================================================| 100%
#>
#> Reading negative MS2 data...
#> Reading MS2 data...
#> Processing...
#>
|
| | 0%
|
|====================== | 33%
|
|=========================================== | 67%
|
|=================================================================| 100%
#>
#> Matching metabolites with MS2 spectra (positive)...
#> Matching metabolites with MS2 spectra (negative)...The arguments of databaseConstruction can be found using ?databaseConstruction.
test.database is a databaseClass object, you can print it to see its information.
test.database
#> -----------Base information------------
#> Version: 0.0.1
#> Source: Michael Snyder lab
#> Link: http://snyderlab.stanford.edu/
#> Creater: Xiaotao Shen ( shenxt1990@163.com )
#> With RT information
#> -----------Spectral information------------
#> There are 14 items of metabolites in database:
#> Lab.ID; Compound.name; mz; RT; CAS.ID; HMDB.ID; KEGG.ID; Formula; mz.pos; mz.neg; Submitter; Family; Sub.pathway; Note
#> There are 170 metabolites in total
#> There are 108 metabolites in positive mode
#> There are 104 metabolites in positive mode
#> Collision energy in positive mode:
#> NCE25; NCE50
#> Collision energy in negative mode:
#> NCE25; NCE50Retention time correction
The metabolite retention time (RT) may shift in different batches. So if you spike internal standards into your standards and biological samples, you can correct the RTs in database using rtCor4database function.
Data preparation
Firstly, please prepare two internal standard (IS) tables for database and biological samples. The format of IS table is shown like figure below:
The IS table for database should be named as “database.is.table.xlsx” and the IS table for experiment should be named as “experiment.is.table.xlsx”.
Run rtCor4database function
test.database2 <- rtCor4database(experiment.is.table = "experiment.is.table.xlsx",
database.is.table = "database.is.table.xlsx",
database = test.database,
path = new.path)The database should be the database (databaseClass object) which you want to correct RTs.
Metabolite identification
Identify metabolites based on MS2 library
MS1 data preparation
The peak table must contain “name” (peak name), “mz” (mass to charge ratio) and “rt” (retention time, second). It can be from any data processing software (XCMS, MS-DIAL and so on).
MS2 data preparation
The raw MS2 data from DDA or DIA should be transfered to msp, msp or mzXML format files. You can use ProteoWizard.
Data organization
Place the MS1 peak table, MS2 data and database which you want to use in one folder like below figure shows:
Run metIdentify function
We can use the demo data from demoData package to show how to use it.
library(metIdentify)
path <- system.file("ms2_identification_demo_data1", package = "demoData")
file.copy(from = path, to = ".", overwrite = TRUE, recursive = TRUE)
#> [1] TRUE
new.path <- file.path("./ms2_identification_demo_data1")
data("msDatabase_rplc0.0.1", package = "metIdentify")
save(msDatabase_rplc0.0.1, file = file.path(new.path, "msDatabase_rplc0.0.1"))
ms2.data <- grep(pattern = "mgf", dir(new.path), value = TRUE)
result <- metIdentify(ms1.data = "ms1.peak.table.csv", ##csv format
ms2.data = ms2.data,##only msp and mgf and mz(X)ML are supported
ms1.ms2.match.mz.tol = 25,
ms1.ms2.match.rt.tol = 10,
ms1.match.ppm = 25,
ms2.match.tol = 0.4,
fraction.weight = 0.3,
dp.forward.weight = 0.6,
dp.reverse.weight = 0.1,
rt.match.tol = 60,
polarity = "positive",
ce = "all",
column = "rp",
ms1.match.weight = 0.25,
rt.match.weight = 0.25,
ms2.match.weight = 0.5,
path = new.path,
total.score.tol = 0,
candidate.num = 3,
database = "msDatabase_rplc0.0.1",
threads = 3)
#> Use old data
#> Matching peak table with MS2 spectrum...
#> 1096 out of 22193 peaks have MS2 spectra.
#> Selecting the most intense MS2 spectrum for each peak...
#> You use all CE values.
#>
#> Identifing metabolites with MS/MS database...
#>
|
| | 0%
|
|====================== | 33%
|
|=========================================== | 67%
|
|=================================================================| 100%
#>
#> All is done.The argument of metIdentify can be found using ?metIdentify.
ms1.data: The MS1 peak table name. It must be the “csv” format.
ms2.data: The MS2 data name. It can be msp, mgf or mzXML format.
ms1.ms2.match.mz.tol: The m/z tolerance for MS1 peak and MS2 spectrum match. Default is 25 ppm.
ms1.ms2.match.rt.tol: The RT tolerance for MS1 peak and MS2 spectrum match. Default is 10 s.
ms1.match.ppm: The m/z tolerance for peak and database metabolite match.
ms2.match.tol: The MS2 similarity tolerance for peak and database metabolite match. The MS2 similarity refers to the algorithm from MS-DIAl. So if you want to know more information about it, please read this publication.
\[MS2\;Simlarity\;Score\;(SS) = Fragment\;fraction*Weight_{fraction} + Dot\;product(forward) * Weight_{dp.reverse}+Dot\;product(reverse)*Weight_{dp.reverse}\]
- fraction.weight: The weight for fragment match fraction.
\[Fragment\;match\;fraction = \dfrac{Match\;fragement\;number}{All\;fragment\;number}\]
dp.forward.weight: The weight for dot product (forward)
dp.forward.weight: The weight for dot product (forward)
\[Dot\;product = \dfrac{\sum(wA_{act.}wA_{lib})^2}{\sum(wA_{act.})^2\sum(wA_{lib})^2}with\;w =1/(1+\dfrac{A}{\sum(A-0.5)})\]
result is metIdentifyClass object. You can print it out to see the identificaiton information.
result
#> -----------Identifications------------
#> (Use getIdentificationTable to get identification table)
#> There are 22193 peaks
#> 1096 peaks have MS2 spectra
#> There are 73 metabolites are identified
#> There are 56 peaks with identification
#> -----------Parameters------------
#> (Use getParam to get all the parameters of this processing)
#> Polarity: positive
#> Collision energy: all
#> database: msDatabase_rplc0.0.1
#> Total score cutoff: 0
#> Column: rp
#> Adduct table:
#> (M+H)+;(M+H-H2O)+;(M+H-2H2O)+;(M+NH4)+;(M+Na)+;(M-H+2Na)+;(M-2H+3Na)+;(M+K)+;(M-H+2K)+;(M-2H+3K)+;(M+CH3CN+H)+;(M+CH3CN+Na)+;(2M+H)+;(2M+NH4)+;(2M+Na)+;(2M+K)+;(M+HCOO+2H)+You can get processsing parameters from it.
parameters <- getParams(object = result)
tibble::as.tibble(parameters)
#> # A tibble: 17 x 3
#> Parameter Meaning Value
#> <chr> <chr> <chr>
#> 1 ms1.ms2.match.m~ MS1 features & MS spectra matchin~ 25
#> 2 ms1.ms2.match.r~ MS1 features & MS spectra matchin~ 10
#> 3 ms1.match.ppm MS1 match tolerance (ppm) 25
#> 4 ms2.match.ppm MS2 fragment match tolerance (ppm) 30
#> 5 ms2.match.tol MS2 match tolerance 0.4
#> 6 rt.match.tol RT match tolerance (s) 60
#> 7 polarity Polarity positive
#> 8 ce Collision energy all
#> 9 column Column rp
#> 10 ms1.match.weight MS1 match weight 0.25
#> 11 rt.match.weight RT match weight 0.25
#> 12 ms2.match.weight MS2 match weight 0.5
#> 13 path Work directory ./ms2_identificatio~
#> 14 total.score.tol Total score tolerance 0
#> 15 candidate.num Candidate number 3
#> 16 database MS2 database msDatabase_rplc0.0.1
#> 17 threads Thread number 3You can get the identification table from it.
Filter
You can use filterIden function to filer identification results from the metIdentifyClass object according to m/z error, rt error, MS2 spectra similarity and total score.
MS2 spectra match plot output
You can also use ms2plot function to output the MS2 specra match plot for one, multiple or all peaks.
Output one MS2 spectra match plot.
##which peaks have identification
peak.name <- whichHasIden(object = result)
head(peak.name)
#> MS1.peak.name MS2.spectra.name
#> 1 pRPLC_125 mz190.086441040039rt335.493954
#> 2 pRPLC_285 mz526.292943427598rt642.48804
#> 3 pRPLC_287 mz321.240051269531rt654.50772
#> 4 pRPLC_417 mz548.37107890314rt693.06558
#> 5 pRPLC_430 mz502.292999267578rt641.79102
#> 6 pRPLC_479 mz167.070388793945rt511.339296
ms2.plot1 <- ms2plot(object = result, database = msDatabase_rplc0.0.1,
which.peak = peak.name[1,1])
ms2.plot1You can also output interactive MS2 spectra match plot by setting
interaction.plotas TRUE.
##which peaks have identification
ms2.plot2 <- ms2plot(object = result, database = msDatabase_rplc0.0.1,
which.peak = peak.name[1,1], interaction.plot = TRUE)
ms2.plot2You can output all MS2 spectra match plots by setting
which.peakas “all”.
ms2plot(object = result, database = msDatabase_rplc0.0.1,
which.peak = "all", path = file.path(new.path, "inhouse"))Then all the MS2 spectra match plots will be output in the “inhouse” folder.
Result integration and output
You can also use other public database for metabolite identificaiton based on MS2 spectra. We provided four public database, which can be got from the demoData package.
MassBank database (massbankDatabase0.0.1)
MoNA database (monaDatabase0.0.1)
HMDB database (hmdbDatabase0.0.1)
Orbitrap database from MassBank (orbitrapDatabase0.0.1)
Here, we will use orbitrap database for metabolite identification.
library(demoData)
library(metIdentify)
path <- system.file("ms2_database", package = "demoData")
file.copy(from = file.path(path, "orbitrapDatabase0.0.1"),
to = new.path, overwrite = TRUE, recursive = TRUE)
load(file.path(new.path, "orbitrapDatabase0.0.1"))
result2 <- metIdentify(ms1.data = "ms1.peak.table.csv", ##csv format
ms2.data = ms2.data,##only msp and mgf and mz(X)ML are supported
ms1.ms2.match.mz.tol = 25,
ms1.ms2.match.rt.tol = 10,
ms1.match.ppm = 25,
ms2.match.tol = 0.4,
fraction.weight = 0.3,
dp.forward.weight = 0.6,
dp.reverse.weight = 0.1,
rt.match.tol = 60,
polarity = "positive",
ce = "all",
column = "rp",
ms1.match.weight = 0.25,
rt.match.weight = 0.25,
ms2.match.weight = 0.5,
path = new.path,
total.score.tol = 0,
candidate.num = 3,
database = "orbitrapDatabase0.0.1",
threads = 3)We can also identify metabolites only m/z match, which is level 3 identification according to MSI. We provid the HMDB metabolite database in the demoData package.
library(demoData)
library(metIdentify)
path <- system.file("hmdb_metabolite_database", package = "demoData")
file.copy(from = file.path(path, "HMDB.metabolite.data"),
to = new.path, overwrite = TRUE, recursive = TRUE)
result3 <- mzIdentify(ms1.data = "ms1.peak.table.csv", ##csv format
ms1.match.ppm = 25,
polarity = "positive",
column = "rp",
path = new.path,
candidate.num = 3,
database = "HMDB.metabolite.data",
threads = 3)For result1 using in-hosue database (m/z, RT and MS2 spectra), they are level 1 identifications. For result2 using public MS2 spectra database (m/z and MS2 spectra), they are level 2 identifications. For result3 using HMDB metabolite database (only m/z), they are level 3 identifications according to MSI. So you can integrate those three results together.
identification.table1 <- getIdentificationTable(object = result, candidate.num = 1, type = "new")
identification.table1 <- identification.table1[!is.na(identification.table1$Compound.name),,drop = FALSE]
identification.table2 <- getIdentificationTable(object = result2, candidate.num = 1, type = "new")
identification.table2 <- identification.table2[!is.na(identification.table2$Compound.name),,drop = FALSE]
identification.table2 <- identification.table2[!(identification.table2$name %in% identification.table1$name),,drop = FALSE]
identification.table3 <- getIdentificationTable2(object = result3, candidate.num = 1, type = "new")
identification.table3 <- identification.table3[!is.na(identification.table3$Compound.name),,drop = FALSE]
identification.table3 <- identification.table3[!(identification.table3$name %in% c(identification.table1$name,identification.table1$name)),,drop = FALSE]
identification.table1 <- data.frame(identification.table1, Level = 1, stringsAsFactors = FALSE)
identification.table2 <- data.frame(identification.table2, Level = 2, stringsAsFactors = FALSE)
identification.table3 <- data.frame(identification.table3[-11],
"mz.match.score" = NA,
"RT.error" = NA,
"RT.match.score" = NA,
"CE" = NA,
"SS" = NA,
"Total.score" = NA,
identification.table3[11],
Level = 3, stringsAsFactors = FALSE)
identification.table <- rbind(identification.table1,
identification.table2,
identification.table3)
write.csv(identification.table, file = file.path(new.path, "identification.table.csv"), row.names = FALSE)