R4Proteomics

Mohieddin Jafari
Dec 2019

Proteomic data analysis

  • MS/MS spectra and identification
  • Database search algorithms
  • Sequential search algorithms
  • A key issue is to choose the right database
  • Decoys and false discovery rate calculation

Database search engines function

  1. spectra are filtered to reduce the number of peaks to process

  2. theoretical spectra are derived from the database sequence

  3. theoretical and experimental spectra are compared and their match is scored

  4. peptide-spectrum matches (PSMs) are exported for post-processing
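
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the function names, the top-N intensity filter, the 0.02 Da tolerance, and the shared-peak-count score are all assumptions chosen for simplicity, and only singly charged b and y ions are generated. Residue masses are standard monoisotopic values for a subset of amino acids.

```python
# Monoisotopic residue masses in Da (subset; standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "T": 101.04768, "L": 113.08406, "N": 114.04293, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "R": 156.10111,
}
WATER = 18.01056   # H2O, monoisotopic
PROTON = 1.00728   # charge carrier


def filter_peaks(peaks, top_n=50):
    """Step 1: keep the top_n most intense (m/z, intensity) peaks, sorted by m/z."""
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    return sorted(strongest)


def theoretical_spectrum(peptide):
    """Step 2: singly charged b- and y-ion m/z values for a database peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion with i residues
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion with n - i residues
    return sorted(ions)


def shared_peak_count(experimental, theoretical, tol=0.02):
    """Step 3: count theoretical ions that have an experimental peak within tol Da."""
    exp_mz = [mz for mz, _ in experimental]
    return sum(any(abs(mz - t) <= tol for mz in exp_mz) for t in theoretical)


def search(peaks, database, top_n=50, tol=0.02):
    """Step 4: return candidate PSMs as (peptide, score) pairs, best first."""
    filtered = filter_peaks(peaks, top_n)
    psms = [(pep, shared_peak_count(filtered, theoretical_spectrum(pep), tol))
            for pep in database]
    return sorted(psms, key=lambda psm: psm[1], reverse=True)


# Toy spectrum: the fragments of MARTENS plus two weak noise peaks.
spectrum = [(mz, 100.0) for mz in theoretical_spectrum("MARTENS")]
spectrum += [(150.0, 1.0), (333.3, 2.0)]
results = search(spectrum, ["LENNAR", "MARTENS", "GAVLK"], top_n=12)
print(results[0])
```

Real engines use far more elaborate scoring functions; the shared peak count is used here only because it makes the compare-and-score step easy to follow.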

Different types of masses

  • Nominal mass: the integer mass of the most abundant isotope, ignoring the mass defect/excess (e.g., H: 1, C: 12, N: 14, O: 16).

  • Monoisotopic mass: the exact mass of the most abundant isotope (e.g., 1H: 1.007825, 12C: 12.000000, 14N: 14.003074, 16O: 15.994915, 31P: 30.973762, 32S: 31.972070).

  • Average mass: the abundance-weighted average over all isotopes of each element (e.g., H: 1.00794, C: 12.011, N: 14.00674, O: 15.9994, P: 30.97376, S: 32.066).
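
As a worked example of the three definitions, the sketch below sums element masses for glycine (C2H5NO2). The element values are copied from the list above; the function name and the dictionary layout are illustrative assumptions.

```python
# (nominal, monoisotopic, average) mass per element, values from the slide.
ELEMENT_MASS = {
    "H": (1, 1.007825, 1.00794),
    "C": (12, 12.000000, 12.011),
    "N": (14, 14.003074, 14.00674),
    "O": (16, 15.994915, 15.9994),
}


def formula_masses(composition):
    """Return (nominal, monoisotopic, average) mass for an element-count dict."""
    totals = [0.0, 0.0, 0.0]
    for element, count in composition.items():
        for k, mass in enumerate(ELEMENT_MASS[element]):
            totals[k] += count * mass
    return tuple(totals)


# Glycine, C2H5NO2
nominal, mono, avg = formula_masses({"C": 2, "H": 5, "N": 1, "O": 2})
print(f"nominal={nominal:.0f}  monoisotopic={mono:.6f}  average={avg:.5f}")
```

The gap between the three values grows with the number of atoms, which is why the mass type chosen in the search settings matters for peptide-sized molecules.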

Several types of fragment ions

There are several other ion types that can be annotated, as well as internal fragments. These are harder to use for ladder sequencing, but can still be interpreted.
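
Ladder sequencing itself can be sketched as follows: the mass gap between consecutive ions of the same series corresponds to one residue. The example assumes a clean, complete, singly charged y-ion series, which real spectra rarely provide; the function name and the 0.02 Da tolerance are illustrative assumptions. Leucine and isoleucine have identical masses, so only L is listed.

```python
# Monoisotopic residue masses in Da (subset; standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "T": 101.04768, "L": 113.08406, "N": 114.04293, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "R": 156.10111,
}
WATER, PROTON = 18.01056, 1.00728


def read_ladder(ion_mz, tol=0.02):
    """Translate gaps between consecutive same-series ions into residues.

    Residues are reported in order of increasing m/z; a gap that matches
    no residue mass within tol is reported as '?'.
    """
    ordered = sorted(ion_mz)
    sequence = []
    for low, high in zip(ordered, ordered[1:]):
        gap = high - low
        hits = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - gap) <= tol]
        sequence.append(hits[0] if hits else "?")
    return "".join(sequence)


# y1..y6 ions of the peptide MARTENS (y ions grow from the C-terminus).
peptide = "MARTENS"
y_ions = [sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
          for i in range(1, len(peptide))]
ladder = read_ladder(y_ions)
print(ladder, "read from the C-terminal side;", ladder[::-1], "in peptide order")
```

Because a y-ion series extends from the C-terminus, the ladder comes out reversed relative to the peptide sequence and has to be flipped before it is compared with a database entry.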

Peptide interpretation

Ideal world

Real world

The influence of various parameters

Verheggen, Mass Spec Reviews, 2017

The main search engines

Verheggen, Mass Spec Reviews, 2017

Seems reasonable, but ...!

  1. Inappropriate database
  2. Incorrect search engine parameters
  3. Unsuitable acceptance score threshold

Knudsen, PLoS ONE, 2011

Three main types of decoy DBs

  • Reversed databases (easy)

LENNARTMARTENS -> SNETRAMTRANNEL

  • Shuffled databases (slightly more difficult)

LENNARTMARTENS -> NMERLANATERTSN (for instance)

  • Randomized databases (as difficult as you want it to be)

LENNARTMARTENS -> GFVLAEPHSEAITK (for instance)
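
The three decoy strategies can be sketched as below. The function names are illustrative, and the randomized variant draws each residue uniformly from the 20 standard amino acids, which is an assumption; a more demanding generator could, for example, follow the amino acid frequencies of the target database.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def reversed_decoy(sequence):
    """Reverse the target sequence."""
    return sequence[::-1]


def shuffled_decoy(sequence, rng):
    """Permute the residues, preserving the amino acid composition."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def randomized_decoy(sequence, rng):
    """Draw a new sequence of the same length, uniformly over AMINO_ACIDS."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in sequence)


rng = random.Random(2019)  # fixed seed so the decoy database is reproducible
target = "LENNARTMARTENS"
decoys = {
    "reversed": reversed_decoy(target),
    "shuffled": shuffled_decoy(target, rng),
    "randomized": randomized_decoy(target, rng),
}
for kind, decoy in decoys.items():
    print(f"{kind:>10}: {target} -> {decoy}")
```

Reversal is deterministic, while the shuffled and randomized variants depend on the random seed, so the seed should be recorded if the search needs to be reproduced.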

Setting a threshold inevitably leads to errors
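
A minimal sketch of how decoy hits turn a score threshold into an error estimate. It assumes a target-decoy search in which every PSM is labeled as target or decoy, and it uses the simple estimate FDR ≈ decoys / targets among PSMs at or above the threshold; other estimators exist. Function names and the toy scores are illustrative.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as (#decoy PSMs) / (#target PSMs) with score >= threshold.

    psms is a list of (score, is_decoy) pairs.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_threshold_at_fdr(psms, max_fdr=0.01):
    """Return the lowest observed score whose estimated FDR is <= max_fdr."""
    passing = [score for score, _ in psms if estimate_fdr(psms, score) <= max_fdr]
    return min(passing) if passing else None


# Toy PSM list: (score, is_decoy)
psms = [(95, False), (90, False), (85, False), (80, False), (70, True),
        (65, False), (60, False), (50, True), (40, False), (30, True)]

for threshold in (80, 60, 30):
    print(threshold, round(estimate_fdr(psms, threshold), 3))
print("lowest threshold at 20% FDR:", lowest_threshold_at_fdr(psms, max_fdr=0.2))
```

Lowering the threshold accepts more target PSMs but also more decoys, so the estimated error rate rises; no threshold removes all false matches without also discarding correct ones.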

Further Reading

[1] Using R and Bioconductor for proteomics data analysis; Laurent Gatto & Andy Christoforou, arXiv:1305.6559.

[2] Visualization of proteomics data using R and Bioconductor; Laurent Gatto et al., DOI:10.1002/pmic.201400392