Overview
In today's workshop we will:
- Use online tools to analyze proteomic data
- Interpret output related to proteomic analytical methods
MASCOT
MASCOT is an online tool for analyzing tandem MS (MS/MS) data from
proteomic experiments. As you will see, MASCOT and other proteomic tools
take MGF (Mascot Generic Format) input files. An MGF file contains the
search parameters and mass spectral data from the samples you have analyzed.
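To make the file format concrete, here is a minimal Python sketch of reading an MGF file. It assumes the common layout in which each spectrum sits between BEGIN IONS and END IONS, with KEY=VALUE header lines followed by m/z and intensity pairs. The function name parse_mgf and the example spectrum are illustrative only and are not part of MASCOT.

```python
def parse_mgf(text):
    """Parse MGF-formatted text into a list of spectra (one dict per spectrum)."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by an optional intensity
                fields = line.split()
                mz = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((mz, intensity))
    return spectra


# Hypothetical spectrum, made up for illustration.
EXAMPLE_MGF = """\
BEGIN IONS
TITLE=hypothetical spectrum 1
PEPMASS=421.76 10500.0
CHARGE=2+
175.119 320.0
262.151 110.5
END IONS
"""

spectra = parse_mgf(EXAMPLE_MGF)
for s in spectra:
    print(s["params"]["TITLE"], s["params"]["CHARGE"], len(s["peaks"]), "peaks")
```

Lines outside a BEGIN IONS / END IONS block (for example, global search parameters) are ignored in this sketch.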
Navigate to MASCOT here: http://www.matrixscience.com/cgi/search_form.pl?FORMVER=2&SEARCH=MIS
Today, we will use MASCOT to:
- identify proteins in a sample
- quantify proteins in a mixture (emPAI)
- determine differential abundance of proteins (iTRAQ)
MASCOT Parameters
- Proteolytic enzyme
- You need to select the one used for protein digestion
- Cleavages
- Acidic sites: AspN (D), GluC (E), V8 (D, E)
- Basic sites: Trypsin (R,K), Lys-C (K), Arg-C (R)
- Chemical: CNBr (M), W oxidation
- Missed-cleavage
- Accounts for peptides left by incomplete digestion; usually set to 1 or 2.
- Modifications
- Fixed or variable
- Fixed modifications are applied to every instance of the specified
residue(s) or terminus. For example, selecting Carboxymethyl (C) means
that all calculations will add 58 Da to the mass of cysteine.
- Variable modifications are those which may or may not be
present.
- Peptide and fragment mass tolerance
- Depends on the mass spectrometer used. This will be a known
instrument parameter.
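The enzyme and missed-cleavage settings can be illustrated with a small in silico digest. This is a sketch under stated assumptions, not how MASCOT itself is implemented: it handles only enzymes that cut on the C-terminal side of a residue (such as trypsin, Lys-C, or Arg-C), it assumes the usual rule that trypsin does not cut before proline, and the function name digest and the protein sequence are hypothetical.

```python
def digest(sequence, cleave_after="KR", block_before="P", missed_cleavages=1):
    """Return peptides from an in silico digest, allowing missed cleavages."""
    # Positions where the enzyme cuts (index of the first residue after the cut).
    cut_sites = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in cleave_after and sequence[i + 1] not in block_before:
            cut_sites.append(i + 1)
    cut_sites.append(len(sequence))

    peptides = []
    for start_idx in range(len(cut_sites) - 1):
        # Joining k + 1 adjacent fragments corresponds to k missed cleavages.
        for missed in range(missed_cleavages + 1):
            end_idx = start_idx + 1 + missed
            if end_idx >= len(cut_sites):
                break
            peptides.append(sequence[cut_sites[start_idx]:cut_sites[end_idx]])
    return peptides


protein = "MKWVTFISLLRPAGKDE"  # hypothetical sequence, for illustration only
print(digest(protein, missed_cleavages=0))
print(digest(protein, missed_cleavages=1))
```

Note that the R followed by P is not cut, and that each extra missed cleavage adds longer candidate peptides, which enlarges the search space.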
Confidence and quality criteria in MASCOT
Confidence Criteria
- Number of peptide sequences: minimum of 2 required
- Protein sequence coverage
- Total MASCOT score and individual ion score
- The quality of the MS/MS spectra, judged by a full-length y-ion
series for peptides of at least six consecutive amino acids with
no missed cleavages
False Discovery Rate (FDR)
- By controlling the FDR, we can find an optimum between maximizing true
positives and minimizing false positives and false negatives
- FDR = # False Discoveries/#Total Discoveries
- Common acceptance thresholds are 5% or 1%
- Adjust the significance threshold to get close to the target
FDR
- How do we calculate FDR?
- To determine the number of false discoveries we search against a
“Decoy” protein database that has nonsense proteins
- Common decoy strategies: reversed database, shuffled database and
randomized database.
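The target-decoy idea above can be sketched in a few lines of Python. This sketch assumes that the number of decoy matches above a score threshold estimates the number of false matches among the target matches at the same threshold. The scores, the function name estimate_fdr, and the thresholds are all made up for illustration.

```python
def estimate_fdr(matches, score_threshold):
    """Estimate FDR as (decoy matches) / (target matches) at a score threshold.

    `matches` is a list of (score, is_decoy) tuples.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score >= score_threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= score_threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to call false
    return decoys / targets


# Hypothetical peptide-spectrum matches: (score, came from decoy database?)
matches = [
    (72, False), (65, False), (58, False), (51, True), (47, False),
    (44, False), (40, True), (38, False), (35, True), (33, False),
]

for threshold in (30, 45, 55):
    print(threshold, round(estimate_fdr(matches, threshold), 3))
```

Raising the threshold lowers the estimated FDR but also discards true matches, which is the trade-off the significance threshold controls.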
Exercise 1: Identification
- Locate the file, “identification.mgf”
- Enter your name and email
- For this data set use
- enzyme = trypsin
- 1 missed cleavage
- oxidation (M) as a variable modification
- +/- 0.2 Da for peptide ion tolerance
- +/- 0.2 Da fragment ion tolerance
- monoisotopic values
- charge state 1+, 2+, 3+
- instrument type = ESI-Quad ToF
- check Decoy Database
- For Data File, browse to the location of “identification.mgf”
- Click “Start Search…”
Question 1: Are there any proteins identified?
- Click “Re-search” near the top of the output page. This will take
you back to your query with settings as you set them previously.
- Add “SwissProt” to the database you are searching.
- Click “Start Search…”
Question 2: What protein is the best hit?
- Click on your top search hit.
Question 3: Examine the regions of this protein that have
been matched from your data. Is the entire protein
detected?
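Sequence coverage, which Question 3 asks you to inspect, is the fraction of the protein's residues covered by at least one matched peptide. A minimal sketch, using a hypothetical protein sequence and hypothetical matched peptides (the function name sequence_coverage is also illustrative):

```python
def sequence_coverage(protein, peptides):
    """Fraction of protein residues covered by at least one matched peptide."""
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # Look for further occurrences of the same peptide.
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein) if protein else 0.0


protein = "MKWVTFISLLRPAGKDEAHK"          # hypothetical, 20 residues
matched = ["WVTFISLLRPAGK", "DEAHK"]      # hypothetical matched peptides
print(f"coverage = {sequence_coverage(protein, matched):.0%}")
```

Short peptides such as the N-terminal "MK" here are often not detected, which is one reason coverage rarely reaches 100%.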
Try changing some of the following:
- Peptide and fragment ion tolerances
- Database/taxonomy filter
- Missed cleavages
- Modifications
- Try loosening and tightening the mass tolerances
- Try decreasing the search space by adding a taxonomy filter and/or
choosing a different database
- Now let’s see if we can match more peptides to proteins by adding
possible modifications or missed cleavages
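To see why tolerances and variable modifications change what matches, it helps to compute a peptide mass by hand. The sketch below uses standard monoisotopic residue masses (rounded to five decimals) and treats oxidation (M) as an optional +15.9949 Da. The peptide "MAGK", the observed mass, and the function names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Monoisotopic residue masses in Da (rounded).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565      # added once per peptide (terminal H and OH)
OXIDATION = 15.994915  # oxidation of methionine


def peptide_mass(sequence, oxidized_met=0):
    """Neutral monoisotopic mass, with an optional number of oxidized Met."""
    return sum(MONO[aa] for aa in sequence) + WATER + oxidized_met * OXIDATION


def within_tolerance(observed, theoretical, tol_da):
    """True if the observed mass lies within +/- tol_da of the theoretical mass."""
    return abs(observed - theoretical) <= tol_da


observed = 421.20  # hypothetical observed neutral peptide mass
for n_ox in (0, 1):
    theo = peptide_mass("MAGK", oxidized_met=n_ox)
    print(n_ox, round(theo, 4), within_tolerance(observed, theo, 0.2))
```

With a tolerance of +/- 0.2 Da the observed mass matches only when oxidation (M) is allowed as a variable modification. A very wide tolerance would admit more candidates but also more false matches.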
Exercise 2: Quantifying (emPAI)
What is emPAI?
- emPAI = exponentially modified Protein Abundance Index
- \(emPAI = 10^{PAI}-1\)
- \(PAI = N_{\textrm{observed peptides}}/N_{\textrm{observable peptides}}\)
- \(\textrm{Mole fraction of protein } j = emPAI_j/\sum_{i=1}^{n}emPAI_i\)
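As a worked example of these definitions, the sketch below computes emPAI and mole fractions for two made-up proteins. The peptide counts and the names PROTEIN_A and PROTEIN_B are hypothetical, and in practice the number of observable peptides depends on the digest and on the mass range of the instrument.

```python
def empai(n_observed, n_observable):
    """emPAI = 10**PAI - 1, where PAI = observed / observable peptides."""
    pai = n_observed / n_observable
    return 10 ** pai - 1


def mole_fractions(empai_by_protein):
    """Each protein's emPAI divided by the sum of emPAI over all proteins."""
    total = sum(empai_by_protein.values())
    return {name: value / total for name, value in empai_by_protein.items()}


# Hypothetical counts: (observed peptides, observable peptides)
counts = {"PROTEIN_A": (6, 12), "PROTEIN_B": (2, 20)}

values = {name: empai(obs, possible) for name, (obs, possible) in counts.items()}
fractions = mole_fractions(values)

for name in counts:
    print(name, round(values[name], 3), f"{fractions[name]:.1%}")
```

Because of the exponent, a protein with half of its observable peptides detected gets a much larger emPAI than one with a tenth detected, not merely five times larger.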
Let’s make some measurements!
- Locate the file, “emPAI.mgf”
- For this data set use
- enzyme = trypsin
- 2 missed cleavages
- oxidation (M) (variable)
- Carbamidomethyl (C) (fixed)
- +/- 0.7 Da for peptide ion tolerance
- +/- 0.2 Da fragment ion tolerance
- monoisotopic values
- charge state 1+, 2+, 3+
- instrument type = ESI-Quad ToF
- check Decoy Database
- For Data File, browse to the location of “emPAI.mgf”
- For Database choose SwissProt; for Taxonomy choose human
- Click “Start Search…”
Question 4: What is the top hit? What is its emPAI
value?
- Go to hit YES (emPAI = 1.17) and open its protein report in a new tab (by
clicking on YES_HUMAN)
- Go to hit RPN2 (emPAI = 0.13) and open its protein report in a new tab
- Notice the difference in sequence coverage and how that is reflected
in the emPAI value
- Let’s check the peptides from RPN2, since this hit is on the
border of the confidence criteria
Question 5: What do you notice about proteins with high emPAI
vs. low emPAI values with respect to coverage?
- Go to hit #1
- Go to hit #2
- Notice the shared peptides between them.
- Notice that they both have uniquely identifying peptides
- Which protein is really present?
- Are both present?
- How do we know which peptides go with which isoform?
This is the “Protein Inference Problem”: how do we infer the
identity or quantity of a protein present in a complex mixture when some
peptides are shared among different proteins or protein forms?
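One simple first step when facing this problem is to separate peptides that map to a single protein from those shared by several. The sketch below does only that bookkeeping; it is not a full inference algorithm, and the isoform names and peptide sequences are hypothetical.

```python
from collections import defaultdict


def classify_peptides(protein_to_peptides):
    """Split peptides into unique (one protein) and shared (several proteins)."""
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    unique = {pep: next(iter(prots))
              for pep, prots in peptide_to_proteins.items() if len(prots) == 1}
    shared = {pep: sorted(prots)
              for pep, prots in peptide_to_proteins.items() if len(prots) > 1}
    return unique, shared


# Hypothetical hits: two isoforms with overlapping peptide evidence.
hits = {
    "ISOFORM_A": ["AAEK", "LLGR", "VVTK"],
    "ISOFORM_B": ["AAEK", "LLGR", "QQSK"],
}

unique, shared = classify_peptides(hits)
print("unique:", unique)
print("shared:", shared)
```

Here each isoform has one uniquely identifying peptide, so both are supported by the data, while the shared peptides cannot be assigned to either one with confidence.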
Exercise 3: Differential abundance (iTRAQ)
- Locate the file, “iTRAQ.mgf”
- Search against SwissProt with
- Drosophila as the taxonomy filter
- iTRAQ 4-plex quantitation
- +/- 0.7 Da peptide/fragment ion tolerances
- 3 missed cleavages
- 1+, 2+ and 3+ charge states
- fixed methylthio (C)
- variable oxidation (M)
- Check Decoy
- monoisotopic values
- When the search is complete, select unique peptides only and automatic
outlier removal (in the new yellow box), then press “Format As”.
- Go to hit PDI_DROME
- Notice the variation for K.SVFEGELNEENLK.K
- Notice the general trend for all peptides and proteins
Question 6: Consider VIT3_DROME. Examining the variation in
iTRAQ ratios of the peptides detected, which sample (115, 116, or 117)
is creating problems in analysis? Why?
Important considerations:
- There are variations in the measurements
- There is a problem with interference
- This is an example where we are asking many questions at once
knowing there will be many “fuzzy” answers
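To get a feel for the variation in iTRAQ measurements, the sketch below computes per-peptide reporter-ion ratios relative to the 114 channel and summarizes each channel by its median ratio and coefficient of variation (CV). It assumes 114 is the reference channel, and the intensities are invented for illustration; they are not taken from iTRAQ.mgf and say nothing about the answer to Question 6.

```python
from statistics import mean, median, stdev


def channel_ratios(reporters, reference="114"):
    """Ratio of each reporter channel to the reference channel for one peptide."""
    ref = reporters[reference]
    return {ch: inten / ref for ch, inten in reporters.items() if ch != reference}


def summarize(peptide_reporters, reference="114"):
    """Median ratio and coefficient of variation per channel across peptides."""
    per_channel = {}
    for reporters in peptide_reporters:
        for ch, ratio in channel_ratios(reporters, reference).items():
            per_channel.setdefault(ch, []).append(ratio)
    return {
        ch: {"median": median(vals), "cv": stdev(vals) / mean(vals)}
        for ch, vals in per_channel.items()
    }


# Hypothetical reporter-ion intensities for three peptides of one protein.
peptides = [
    {"114": 1000.0, "115": 1100.0, "116": 2000.0, "117": 500.0},
    {"114": 800.0, "115": 840.0, "116": 1500.0, "117": 1200.0},
    {"114": 1200.0, "115": 1260.0, "116": 2500.0, "117": 300.0},
]

summary = summarize(peptides)
for ch, stats in sorted(summary.items()):
    print(ch, "median ratio", round(stats["median"], 2), "CV", round(stats["cv"], 2))
```

A channel whose ratios disagree strongly between peptides of the same protein (a high CV) is a hint that interference or low signal may be affecting that measurement.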