PROTREC

Wilson Wen Bin Goh

2016-08-25

PROTREC is an R package containing several functions for predicting, validating and evaluating missing proteins in proteomics data 1 2.

PROTREC methods are currently divisible into three functional classes

Getting started

Protein complex data

First we need to find some network information to act as the feature vector. PROTREC works well with real complexes and this data can be obtained from CORUM (http://mips.helmholtz-muenchen.de/genre/proj/corum/).

An example dataset is available here (https://drive.google.com/file/d/0BxBcRNJvgbEPN3pzcUFMbHhTV2c/view?usp=sharing). To use the example complex dataset with ‘PROTREC’, convert it into an R list object e.g.

library(PROTREC)
complexes <- read.table("human_complexes.txt", sep="\t", header=FALSE)
complex_vector <- strsplit(as.vector(complexes[,4]), ',')
complex_vector <- setNames(complex_vector, complexes[,1]) #assignes the complex ID as the name

Alternatively, PROTREC can also use as its feature vector a list of predicted network clusters or pathways.

Two proteomics expression datasets are provided. The renal control dataset (RCC_original) is a test expression comprising 12 control samples. The renal cancer dataset (RC_original) comprises 12 normal and 12 cancer samples. Both datasets may be called by their names (in brackets).

A peptide based validation file for RC is available at (https://drive.google.com/file/d/0BxBcRNJvgbEPX2ZmMERoVVFvVkk/view?usp=sharing). This is akin to a list of unique and ambiguous PSMs that can be used for checking if there is at least one peptide that points to the presence of a predicted missing protein. Following download, the peptide file can be read into memory and processed as follows:

peptide_support <- read.table("peptides_processed.txt", sep="\t", header=T) 
rownames(peptide_support) <- make.names(peptide_support[,1], unique=TRUE)
peptide_support <- peptide_support[,5:ncol(peptide_support)]
peptide_support <- peptide_support[as.numeric(lapply(rownames(peptide_support), nchar)) <= 6,] #subset the peptide_support matrix

Functional Class Scoring (FCS) functions

fcs

FCS generates a matrix of p-values based on significant enrichment of observed proteins against a vector of complexes. It takes a data matrix (RCC_original) and a vector of complex features as its primary inputs. Sim_size is the number of simulations and should be set to 1000 typically. Threshold is the minimal complex size to consider (default is usually size 5).

  fcs(RCC_original, complex_vector,sim_size=1000, threshold=5)

Note that FCS can take a while to run, especially if there are many samples, and a large feature vector to consider. A example output is provided in fcs_rc_n.

  head(fcs_rn_n, 3)

fcs_prot_pairwise_recovery

Evaluates the significance of the proportion of verified proteins given a set of predicted proteins.

  fcs_prot_pairwise_recovery(fcs_list, original_prot_list, check_prot_list, complex_vector)

Where fcs_list is a vector of FCS significant values for a given sample, original_prot_list is the original set of proteins observed for that particular sample, check_prot_list is the set of proteins in the cross-replicate for verification, and complex_vector is the feature vector set.

For example:

  pairwise_recovery(t(fcs_rc_n)[,1], rownames(rc_prot)[!is.na(rc_prot$N1_1)],rownames(rc_prot)[!is.na(rc_prot$N1_2)], complex_vector)

fcs_prot_prob_assign

fcs_prot_prob_assign assigns probabilities to individual proteins based on the FCS probability.

  fcs_prot_prob_assign <- function(cplx, p)

Where cplx is the complex vector and p is a vector of complex-based probabilities derived from FCS. Since FCS provides p-values, then p is simply (1 - FCS p-values). For example:

  prot_prob_fcs(complex_vector, 1 - fcs_rc_n[1,])

Protein Recovery (PROTREC)

PROTREC is a bayesian scoring schema for assigning probabilities to observed and predicted missing proteins based on the false discovery rate of the proteomics screen, and the probability its constituent complex (if any) is present in the sample.

PROTREC_cplx_prob

Assigns a PROTREC-based probability based on a complex vector and a set of observed proteins

  PROTREC_cplx_prob(data, complex, fdr, threshold)

Where data is a data matrix, complex is a list of complexes, fdr is the false discovery rate of the screen, set as 0.01 or 0.05 normally. Threshold is the minimum size of the complex to consider, normally set as 5.

For example,

  PROTREC_cplx_prob(RCC_original, complex_vector, 0.01, 5)

PROTREC_protprob

Assigns a PROTREC-based probability to an individual protein based on a series of complex-based PROTREC probabilities

  PROTREC_protprob <- function(cplx, p, prot, fdr)

Where cplx is the list of complex components, p is the probability complex exists, and prot is the list of proteins in the screen

For example,

  protrec_rc_n <- as.matrix(protrec_rc_n)
  PROTREC_protprob(complex_vector, 1 - protrec_rc_n['N1_1',], rownames(RC_original), 0.05)

Performance Metrics

countgraph

Calculates the verification rates at each probability level

  countgraph(sample,names_sample,replicates,conplex_prot, observed_prot)

Where sample is the vector of PROTREC or FCS probabilities based on proteins for a given sample (includes observed and predicted missing proteins), names_sample is a supplied vector of protein names for sample. replicates is a vector of proteins for verification, complex_prot is the set of proteins potentially predictable due to the complex vector, and observed_prot is the set of proteins observed in sample.

For example:

  N1_T12 <- countgraph(normal_mat[,"N1_1"],rownames(normal_mat),union(rownames(peptide_support)[!is.na(peptide_support[,"N1_2"])], rownames(peptide_support)[!is.na(peptide_support[,"N1_1"])]),unique(as.vector(unlist(complex_vector))), rownames(RC_original)[which(!is.na(RC_original$N1_1))])
  
  rownames(N1_T12) <- N1_T12[,1]
  N1_T12 <- N1_T12[,-1]
  head(N1_T12)
  

The results may be plotted as a series of barplots

  barplot(t(N1_T12), col=c("pink", "red", "magenta"), main="FCS N1_T12", ylab="Count", xlab="Prob(Protein p is present but unreported)")

Here, magenta (original) proteins are the observed proteins, red (validated) are the verified missing proteins and pink are the predicted missing proteins not verified.

N1_T12 can be normalized by doing this:

  N1_T12_normalize <- N1_T12/rowSums(N1_T12)
  barplot(t(N1_T12_normalize), col=c("pink", "red", "magenta"), main="N1_T12", ylab="Proportion", xlab="Prob(Protein p is present but unreported)")

precision_recall

Calculates the cumulative verification rates at each probability level

  precision_recall(x,y,z, cplx, g)
  

Where x is the vector of protrec or fcs protein probabilities in the sample (including observed + predicted missing), y is the vector of names for x, z is the names of proteins to verify against, cplx are the names of proteins found in the complex vector, and g is the set of proteins observed in the sample

For example:

  protprob_N1_T12_unfiltered <- precision_recall(normal_mat[,"N1_1"],     rownames(normal_mat),union(rownames(peptide_support)[!is.na(peptide_support[,"N1_2"])], rownames(peptide_support)[!is.na(peptide_support[,"N1_1"])]),unique(as.vector(unlist(complex_vector))), rownames(RC_original)[which(!is.na(RC_original$N1_1))])
  head(protprob_N1_T12_unfiltered)
  plot(protprob_N1_T12_unfiltered[,1], protprob_N1_T12_unfiltered[,3]/(protprob_N1_T12_unfiltered[,2] + protprob_N1_T12_unfiltered[,3]), type="o", col="purple", ylim=c(0, 1), xlab="PROTREC score", ylab="% of proteins considered and verified")

  1. Goh and Wong, Recovering missing proteins based on biological complexes, Under Review

  2. Goh et al, Comparative network-based recovery analysis and proteomic profiling of neurological changes in valporic acid-treated mice. Journal of Proteome Research, 12(5):2116–2127, April 2013