metIdentify version0.1.4

Introduction & Installation

metIdentify is a R package which can be used for metabolite identification based on in-house and public database.

Please install it via github.

if(!require(devtools)){
  install.packages("devtools")
}
devtools::install_github("jaspershen/metIdentify")
##tinyTools is a dependent package for metIdentify
devtools::install_github("jaspershen/tinyTools")

Database construction

If you have in-house standards which have been acquired with MS spectra data, then you can construct the in-house MS2 spectra databases using metIdentify package.

Data preparation

Firstly, please transform your raw standard MS data (positive and negative modes) to mzXML format using ProteoWizard. The parameter setting is shown like figure below:

Data organization

Secondly, please organize your standard information as a table, and output as a csv or xlsx format. The format of stanford information can refer to our demo data in demoData package.

From column 1 to 11, the columns are “Lab.ID”, “mz”, “RT”, “CAS.ID”, “HMDB.ID”, “KEGG.ID”, “Formula”, “mz.pos”, “mz.neg”, “Submitter”, respectively. It it OK if you have another information for the standards. Like the demo data shows, there are other additional information, namely “Family”, “Sub.pathway” and “Note”.

mz: Accurate mass of compound.
RT: Retention time, second.
mz.pos: Mass to change ratio of compound in positive mode, for example, M+H.
mz.neg: Mass to change ratio of compound in negative mode, for example, M-H.
Submitter: The name of person or organization.

Then creat a folder and put you mzXML format datasets (positive mode in ‘POS’ folder and negative mode in ‘NEG’ folder) and compound information in it.

Run `databaseConstruction` function

We use the demo data in demoData package to show how to use metIdentify. Please install it first.

devtools::install_github("jaspershen/demoData")

library(demoData)
library(metIdentify)
path <- system.file("database_construction", package = "demoData")
file.copy(from = path, to = ".", overwrite = TRUE, recursive = TRUE)
#> [1] TRUE
new.path <- file.path("./database_construction")

test.database <- databaseConstruction(path = new.path, 
                                      version = "0.0.1",
                                      metabolite.info.name = "metabolite.info_RPLC.csv", 
                                      source = "Michael Snyder lab", 
                                      link = "http://snyderlab.stanford.edu/",
                                      creater = "Xiaotao Shen",
                                      email = "shenxt1990@163.com",
                                      rt = TRUE,
                                      mz.tol = 15,
                                      rt.tol = 30,
                                      threads = 3)
#> Reading positive MS2 data...
#> Reading MS2 data...
#> Processing...
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%
#> 
#> Reading negative MS2 data...
#> Reading MS2 data...
#> Processing...
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%
#> 
#> Matching metabolites with MS2 spectra (positive)...
#> Matching metabolites with MS2 spectra (negative)...

The arguments of databaseConstruction can be found using ?databaseConstruction.

test.database is a databaseClass object, you can print it to see its information.

test.database
#> -----------Base information------------
#> Version: 0.0.1 
#> Source: Michael Snyder lab 
#> Link: http://snyderlab.stanford.edu/ 
#> Creater: Xiaotao Shen ( shenxt1990@163.com )
#> With RT information
#> -----------Spectral information------------
#> There are 14 items of metabolites in database:
#> Lab.ID; Compound.name; mz; RT; CAS.ID; HMDB.ID; KEGG.ID; Formula; mz.pos; mz.neg; Submitter; Family; Sub.pathway; Note 
#> There are 170 metabolites in total
#> There are 108 metabolites in positive mode
#> There are 104 metabolites in positive mode
#> Collision energy in positive mode:
#> NCE25; NCE50 
#> Collision energy in negative mode:
#> NCE25; NCE50

Retention time correction

The metabolite retention time (RT) may shift in different batches. So if you spike internal standards into your standards and biological samples, you can correct the RTs in database using rtCor4database function.

Data preparation

Firstly, please prepare two internal standard (IS) tables for database and biological samples. The format of IS table is shown like figure below:

The IS table for database should be named as “database.is.table.xlsx” and the IS table for experiment should be named as “experiment.is.table.xlsx”.

Run `rtCor4database` function

test.database2 <- rtCor4database(experiment.is.table = "experiment.is.table.xlsx", 
                                 database.is.table = "database.is.table.xlsx", 
                                 database = test.database, 
                                 path = new.path)

The database should be the database (databaseClass object) which you want to correct RTs.

Metabolite identification

Identify metabolites based on MS2 library

MS1 data preparation

The peak table must contain “name” (peak name), “mz” (mass to charge ratio) and “rt” (retention time, second). It can be from any data processing software (XCMS, MS-DIAL and so on).

MS2 data preparation

The raw MS2 data from DDA or DIA should be transfered to msp, msp or mzXML format files. You can use ProteoWizard.

Data organization

Place the MS1 peak table, MS2 data and database which you want to use in one folder like below figure shows:

Run `metIdentify` function

We can use the demo data from demoData package to show how to use it.

library(metIdentify)
path <- system.file("ms2_identification_demo_data1", package = "demoData")
file.copy(from = path, to = ".", overwrite = TRUE, recursive = TRUE)
#> [1] TRUE
new.path <- file.path("./ms2_identification_demo_data1")

data("msDatabase_rplc0.0.1", package = "metIdentify")
save(msDatabase_rplc0.0.1, file = file.path(new.path, "msDatabase_rplc0.0.1"))
ms2.data <- grep(pattern = "mgf", dir(new.path), value = TRUE)

result <- metIdentify(ms1.data = "ms1.peak.table.csv", ##csv format
           ms2.data = ms2.data,##only msp and mgf and mz(X)ML are supported
           ms1.ms2.match.mz.tol = 25,
           ms1.ms2.match.rt.tol = 10,
           ms1.match.ppm = 25,
           ms2.match.tol = 0.4,
           fraction.weight = 0.3,
           dp.forward.weight = 0.6,
           dp.reverse.weight = 0.1,
           rt.match.tol = 60,
           polarity = "positive", 
           ce = "all",
           column = "rp",
           ms1.match.weight = 0.25,
           rt.match.weight = 0.25,
           ms2.match.weight = 0.5,
           path = new.path,
           total.score.tol = 0,
           candidate.num = 3,
           database = "msDatabase_rplc0.0.1",
           threads = 3)
#> Use old data
#> Matching peak table with MS2 spectrum...
#> 1096 out of 22193 peaks have MS2 spectra.
#> Selecting the most intense MS2 spectrum for each peak...
#> You use all CE values.
#> 
#> Identifing metabolites with MS/MS database...
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%
#> 
#> All is done.

The argument of metIdentify can be found using ?metIdentify.

ms1.data: The MS1 peak table name. It must be the “csv” format.
ms2.data: The MS2 data name. It can be msp, mgf or mzXML format.
ms1.ms2.match.mz.tol: The m/z tolerance for MS1 peak and MS2 spectrum match. Default is 25 ppm.
ms1.ms2.match.rt.tol: The RT tolerance for MS1 peak and MS2 spectrum match. Default is 10 s.
ms1.match.ppm: The m/z tolerance for peak and database metabolite match.
ms2.match.tol: The MS2 similarity tolerance for peak and database metabolite match. The MS2 similarity refers to the algorithm from MS-DIAl. So if you want to know more information about it, please read this publication.

\[MS2\;Simlarity\;Score\;(SS) = Fragment\;fraction*Weight_{fraction} + Dot\;product(forward) * Weight_{dp.reverse}+Dot\;product(reverse)*Weight_{dp.reverse}\]

fraction.weight: The weight for fragment match fraction.

\[Fragment\;match\;fraction = \dfrac{Match\;fragement\;number}{All\;fragment\;number}\]

dp.forward.weight: The weight for dot product (forward)
dp.forward.weight: The weight for dot product (forward)

\[Dot\;product = \dfrac{\sum(wA_{act.}wA_{lib})^2}{\sum(wA_{act.})^2\sum(wA_{lib})^2}with\;w =1/(1+\dfrac{A}{\sum(A-0.5)})\]

result is metIdentifyClass object. You can print it out to see the identificaiton information.

result
#> -----------Identifications------------
#> (Use getIdentificationTable to get identification table)
#> There are 22193 peaks
#> 1096 peaks have MS2 spectra
#> There are 73 metabolites are identified
#> There are 56 peaks with identification
#> -----------Parameters------------
#> (Use getParam to get all the parameters of this processing)
#> Polarity: positive 
#> Collision energy: all 
#> database: msDatabase_rplc0.0.1 
#> Total score cutoff: 0 
#> Column: rp 
#> Adduct table:
#> (M+H)+;(M+H-H2O)+;(M+H-2H2O)+;(M+NH4)+;(M+Na)+;(M-H+2Na)+;(M-2H+3Na)+;(M+K)+;(M-H+2K)+;(M-2H+3K)+;(M+CH3CN+H)+;(M+CH3CN+Na)+;(2M+H)+;(2M+NH4)+;(2M+Na)+;(2M+K)+;(M+HCOO+2H)+

You can get processsing parameters from it.

parameters <- getParams(object = result)
tibble::as.tibble(parameters)
#> # A tibble: 17 x 3
#>    Parameter        Meaning                            Value               
#>    <chr>            <chr>                              <chr>               
#>  1 ms1.ms2.match.m~ MS1 features & MS spectra matchin~ 25                  
#>  2 ms1.ms2.match.r~ MS1 features & MS spectra matchin~ 10                  
#>  3 ms1.match.ppm    MS1 match tolerance (ppm)          25                  
#>  4 ms2.match.ppm    MS2 fragment match tolerance (ppm) 30                  
#>  5 ms2.match.tol    MS2 match tolerance                0.4                 
#>  6 rt.match.tol     RT match tolerance (s)             60                  
#>  7 polarity         Polarity                           positive            
#>  8 ce               Collision energy                   all                 
#>  9 column           Column                             rp                  
#> 10 ms1.match.weight MS1 match weight                   0.25                
#> 11 rt.match.weight  RT match weight                    0.25                
#> 12 ms2.match.weight MS2 match weight                   0.5                 
#> 13 path             Work directory                     ./ms2_identificatio~
#> 14 total.score.tol  Total score tolerance              0                   
#> 15 candidate.num    Candidate number                   3                   
#> 16 database         MS2 database                       msDatabase_rplc0.0.1
#> 17 threads          Thread number                      3

You can get the identification table from it.

identification.table1 <- getIdentificationTable(object = result)

Filter

You can use filterIden function to filer identification results from the metIdentifyClass object according to m/z error, rt error, MS2 spectra similarity and total score.

result.new <- filterIden(object = result, ms1.match.ppm = 10)

MS2 spectra match plot output

You can also use ms2plot function to output the MS2 specra match plot for one, multiple or all peaks.

Output one MS2 spectra match plot.

##which peaks have identification
peak.name <- whichHasIden(object = result)
head(peak.name)
#>   MS1.peak.name               MS2.spectra.name
#> 1     pRPLC_125 mz190.086441040039rt335.493954
#> 2     pRPLC_285  mz526.292943427598rt642.48804
#> 3     pRPLC_287  mz321.240051269531rt654.50772
#> 4     pRPLC_417   mz548.37107890314rt693.06558
#> 5     pRPLC_430  mz502.292999267578rt641.79102
#> 6     pRPLC_479 mz167.070388793945rt511.339296
ms2.plot1 <- ms2plot(object = result, database = msDatabase_rplc0.0.1, 
                     which.peak = peak.name[1,1])
ms2.plot1

You can also output interactive MS2 spectra match plot by setting interaction.plot as TRUE.

##which peaks have identification
ms2.plot2 <- ms2plot(object = result, database = msDatabase_rplc0.0.1, 
                     which.peak = peak.name[1,1], interaction.plot = TRUE)
ms2.plot2

You can output all MS2 spectra match plots by setting which.peak as “all”.

ms2plot(object = result, database = msDatabase_rplc0.0.1, 
                     which.peak = "all", path = file.path(new.path, "inhouse"))

Then all the MS2 spectra match plots will be output in the “inhouse” folder.

Result integration and output

You can also use other public database for metabolite identificaiton based on MS2 spectra. We provided four public database, which can be got from the demoData package.

MassBank database (massbankDatabase0.0.1)
MoNA database (monaDatabase0.0.1)
HMDB database (hmdbDatabase0.0.1)
Orbitrap database from MassBank (orbitrapDatabase0.0.1)

Here, we will use orbitrap database for metabolite identification.

library(demoData)
library(metIdentify)
path <- system.file("ms2_database", package = "demoData")
file.copy(from = file.path(path, "orbitrapDatabase0.0.1"), 
          to = new.path, overwrite = TRUE, recursive = TRUE)
load(file.path(new.path, "orbitrapDatabase0.0.1"))

result2 <- metIdentify(ms1.data = "ms1.peak.table.csv", ##csv format
           ms2.data = ms2.data,##only msp and mgf and mz(X)ML are supported
           ms1.ms2.match.mz.tol = 25,
           ms1.ms2.match.rt.tol = 10,
           ms1.match.ppm = 25,
           ms2.match.tol = 0.4,
           fraction.weight = 0.3,
           dp.forward.weight = 0.6,
           dp.reverse.weight = 0.1,
           rt.match.tol = 60,
           polarity = "positive", 
           ce = "all",
           column = "rp",
           ms1.match.weight = 0.25,
           rt.match.weight = 0.25,
           ms2.match.weight = 0.5,
           path = new.path,
           total.score.tol = 0,
           candidate.num = 3,
           database = "orbitrapDatabase0.0.1",
           threads = 3)

We can also identify metabolites only m/z match, which is level 3 identification according to MSI. We provid the HMDB metabolite database in the demoData package.

library(demoData)
library(metIdentify)
path <- system.file("hmdb_metabolite_database", package = "demoData")
file.copy(from = file.path(path, "HMDB.metabolite.data"), 
          to = new.path, overwrite = TRUE, recursive = TRUE)

result3 <- mzIdentify(ms1.data = "ms1.peak.table.csv", ##csv format 
           ms1.match.ppm = 25,
           polarity = "positive", 
           column = "rp",
           path = new.path,
           candidate.num = 3,
           database = "HMDB.metabolite.data",
           threads = 3)

For result1 using in-hosue database (m/z, RT and MS2 spectra), they are level 1 identifications. For result2 using public MS2 spectra database (m/z and MS2 spectra), they are level 2 identifications. For result3 using HMDB metabolite database (only m/z), they are level 3 identifications according to MSI. So you can integrate those three results together.

identification.table1 <- getIdentificationTable(object = result, candidate.num = 1, type = "new")
identification.table1 <- identification.table1[!is.na(identification.table1$Compound.name),,drop = FALSE]

identification.table2 <- getIdentificationTable(object = result2, candidate.num = 1, type = "new")
identification.table2 <- identification.table2[!is.na(identification.table2$Compound.name),,drop = FALSE]
identification.table2 <- identification.table2[!(identification.table2$name %in% identification.table1$name),,drop = FALSE]

identification.table3 <- getIdentificationTable2(object = result3, candidate.num = 1, type = "new")
identification.table3 <- identification.table3[!is.na(identification.table3$Compound.name),,drop = FALSE]
identification.table3 <- identification.table3[!(identification.table3$name %in% c(identification.table1$name,identification.table1$name)),,drop = FALSE]

identification.table1 <- data.frame(identification.table1, Level = 1, stringsAsFactors = FALSE)
identification.table2 <- data.frame(identification.table2, Level = 2, stringsAsFactors = FALSE)
identification.table3 <- data.frame(identification.table3[-11], 
                                    "mz.match.score" = NA,
                                    "RT.error" = NA,
                                    "RT.match.score" = NA, 
                                    "CE" = NA,  
                                    "SS" = NA,
                                    "Total.score" = NA,
                                    identification.table3[11],
                                    Level = 3, stringsAsFactors = FALSE)

identification.table <- rbind(identification.table1, 
                              identification.table2, 
                              identification.table3)

write.csv(identification.table, file = file.path(new.path, "identification.table.csv"), row.names = FALSE)

Introduction & Installation

Database construction

Data preparation

Data organization

Run databaseConstruction function

Retention time correction

Data preparation

Run rtCor4database function

Metabolite identification

Identify metabolites based on MS2 library

MS1 data preparation

MS2 data preparation

Data organization

Run metIdentify function

Filter

MS2 spectra match plot output

Result integration and output

Run `databaseConstruction` function

Run `rtCor4database` function

Run `metIdentify` function