PROTREC is an R package containing several functions for predicting, validating and evaluating missing proteins in proteomics data 1 2.
PROTREC methods are currently divisible into three functional classes
First we need to find some network information to act as the feature vector. PROTREC works well with real complexes and this data can be obtained from CORUM (http://mips.helmholtz-muenchen.de/genre/proj/corum/).
An example dataset is available here (https://drive.google.com/file/d/0BxBcRNJvgbEPN3pzcUFMbHhTV2c/view?usp=sharing). To use the example complex dataset with ‘PROTREC’, convert it into an R list object e.g.
library(PROTREC)
complexes <- read.table("human_complexes.txt", sep="\t", header=FALSE)
complex_vector <- strsplit(as.vector(complexes[,4]), ',')
complex_vector <- setNames(complex_vector, complexes[,1]) #assignes the complex ID as the name
Alternatively, PROTREC can also use as its feature vector a list of predicted network clusters or pathways.
Two proteomics expression datasets are provided. The renal control dataset (RCC_original) is a test expression comprising 12 control samples. The renal cancer dataset (RC_original) comprises 12 normal and 12 cancer samples. Both datasets may be called by their names (in brackets).
A peptide based validation file for RC is available at (https://drive.google.com/file/d/0BxBcRNJvgbEPX2ZmMERoVVFvVkk/view?usp=sharing). This is akin to a list of unique and ambiguous PSMs that can be used for checking if there is at least one peptide that points to the presence of a predicted missing protein. Following download, the peptide file can be read into memory and processed as follows:
peptide_support <- read.table("peptides_processed.txt", sep="\t", header=T)
rownames(peptide_support) <- make.names(peptide_support[,1], unique=TRUE)
peptide_support <- peptide_support[,5:ncol(peptide_support)]
peptide_support <- peptide_support[as.numeric(lapply(rownames(peptide_support), nchar)) <= 6,] #subset the peptide_support matrix
FCS generates a matrix of p-values based on significant enrichment of observed proteins against a vector of complexes. It takes a data matrix (RCC_original) and a vector of complex features as its primary inputs. Sim_size is the number of simulations and should be set to 1000 typically. Threshold is the minimal complex size to consider (default is usually size 5).
fcs(RCC_original, complex_vector,sim_size=1000, threshold=5)
Note that FCS can take a while to run, especially if there are many samples, and a large feature vector to consider. A example output is provided in fcs_rc_n.
head(fcs_rn_n, 3)
Evaluates the significance of the proportion of verified proteins given a set of predicted proteins.
fcs_prot_pairwise_recovery(fcs_list, original_prot_list, check_prot_list, complex_vector)
Where fcs_list is a vector of FCS significant values for a given sample, original_prot_list is the original set of proteins observed for that particular sample, check_prot_list is the set of proteins in the cross-replicate for verification, and complex_vector is the feature vector set.
For example:
pairwise_recovery(t(fcs_rc_n)[,1], rownames(rc_prot)[!is.na(rc_prot$N1_1)],rownames(rc_prot)[!is.na(rc_prot$N1_2)], complex_vector)
fcs_prot_prob_assign assigns probabilities to individual proteins based on the FCS probability.
fcs_prot_prob_assign <- function(cplx, p)
Where cplx is the complex vector and p is a vector of complex-based probabilities derived from FCS. Since FCS provides p-values, then p is simply (1 - FCS p-values). For example:
prot_prob_fcs(complex_vector, 1 - fcs_rc_n[1,])
PROTREC is a bayesian scoring schema for assigning probabilities to observed and predicted missing proteins based on the false discovery rate of the proteomics screen, and the probability its constituent complex (if any) is present in the sample.
Assigns a PROTREC-based probability based on a complex vector and a set of observed proteins
PROTREC_cplx_prob(data, complex, fdr, threshold)
Where data is a data matrix, complex is a list of complexes, fdr is the false discovery rate of the screen, set as 0.01 or 0.05 normally. Threshold is the minimum size of the complex to consider, normally set as 5.
For example,
PROTREC_cplx_prob(RCC_original, complex_vector, 0.01, 5)
Assigns a PROTREC-based probability to an individual protein based on a series of complex-based PROTREC probabilities
PROTREC_protprob <- function(cplx, p, prot, fdr)
Where cplx is the list of complex components, p is the probability complex exists, and prot is the list of proteins in the screen
For example,
protrec_rc_n <- as.matrix(protrec_rc_n)
PROTREC_protprob(complex_vector, 1 - protrec_rc_n['N1_1',], rownames(RC_original), 0.05)
Calculates the verification rates at each probability level
countgraph(sample,names_sample,replicates,conplex_prot, observed_prot)
Where sample is the vector of PROTREC or FCS probabilities based on proteins for a given sample (includes observed and predicted missing proteins), names_sample is a supplied vector of protein names for sample. replicates is a vector of proteins for verification, complex_prot is the set of proteins potentially predictable due to the complex vector, and observed_prot is the set of proteins observed in sample.
For example:
N1_T12 <- countgraph(normal_mat[,"N1_1"],rownames(normal_mat),union(rownames(peptide_support)[!is.na(peptide_support[,"N1_2"])], rownames(peptide_support)[!is.na(peptide_support[,"N1_1"])]),unique(as.vector(unlist(complex_vector))), rownames(RC_original)[which(!is.na(RC_original$N1_1))])
rownames(N1_T12) <- N1_T12[,1]
N1_T12 <- N1_T12[,-1]
head(N1_T12)
The results may be plotted as a series of barplots
barplot(t(N1_T12), col=c("pink", "red", "magenta"), main="FCS N1_T12", ylab="Count", xlab="Prob(Protein p is present but unreported)")
Here, magenta (original) proteins are the observed proteins, red (validated) are the verified missing proteins and pink are the predicted missing proteins not verified.
N1_T12 can be normalized by doing this:
N1_T12_normalize <- N1_T12/rowSums(N1_T12)
barplot(t(N1_T12_normalize), col=c("pink", "red", "magenta"), main="N1_T12", ylab="Proportion", xlab="Prob(Protein p is present but unreported)")
Calculates the cumulative verification rates at each probability level
precision_recall(x,y,z, cplx, g)
Where x is the vector of protrec or fcs protein probabilities in the sample (including observed + predicted missing), y is the vector of names for x, z is the names of proteins to verify against, cplx are the names of proteins found in the complex vector, and g is the set of proteins observed in the sample
For example:
protprob_N1_T12_unfiltered <- precision_recall(normal_mat[,"N1_1"], rownames(normal_mat),union(rownames(peptide_support)[!is.na(peptide_support[,"N1_2"])], rownames(peptide_support)[!is.na(peptide_support[,"N1_1"])]),unique(as.vector(unlist(complex_vector))), rownames(RC_original)[which(!is.na(RC_original$N1_1))])
head(protprob_N1_T12_unfiltered)
plot(protprob_N1_T12_unfiltered[,1], protprob_N1_T12_unfiltered[,3]/(protprob_N1_T12_unfiltered[,2] + protprob_N1_T12_unfiltered[,3]), type="o", col="purple", ylim=c(0, 1), xlab="PROTREC score", ylab="% of proteins considered and verified")