The PRIDE Archive Statistics

a statistics portal for MS proteomics

Jose A. Dianes
Software Engineer @EMBL-EBI

About PRIDE

The PRIDE (PRoteomics IDEntifications) database is a centralized, standards-compliant, public repository for proteomics data, including protein and peptide identifications, post-translational modifications and supporting spectral evidence. PRIDE is hosted by the EMBL European Bioinformatics Institute in Hinxton, Cambridge (UK).

Scientists around the world submit their experimental evidence to PRIDE to support their publications and to make it available to the scientific community.

This is PRIDE

PRIDE data

PRIDE offers a web service interface that programs written in other languages, such as R, can use to retrieve data. The data provided by this service looks like this:

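library(rjson)  # assumed: the fromJSON(file=, method=) call matches rjson

# pride_archive_url (base URL of the web service) and fromJsonListToDataFrame
# (turns the parsed JSON list into a data frame) are defined elsewhere.
project_list <- function(count) {
  # Ask the service for `count` projects and parse the JSON response
  prideJson <- fromJSON(file=paste0(pride_archive_url, "/project/list?show=", count), method="C")
  prideDataFrame <- fromJsonListToDataFrame(prideJson$list)
  # Make sure the assay count is numeric
  prideDataFrame$numAssays <- as.numeric(prideDataFrame$numAssays)
  prideDataFrame
}
projects <- project_list(30)
# First three projects, selected columns, dropping rows with missing values
na.omit(projects[1:3, c(1,7,8,10)])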

##   accession             species        tissues    instrumentNames
## 2 PXD000822  Ovis aries (Sheep) seminal plasma LTQ Orbitrap Velos
## 3 PXD001163 Plasmodium sp. pBT9   blood plasma       LTQ Orbitrap
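
The fromJsonListToDataFrame helper used above is not shown on this slide. As a rough idea of what such a conversion might involve, here is a minimal sketch. The function name json_list_to_data_frame, the sample records and their accessions are all made up for illustration, and it assumes each project arrives as a named list whose fields may be missing or multi-valued. It is not the original helper.

```r
# Hypothetical stand-in for the helper used above, NOT the original code.
json_list_to_data_frame <- function(records) {
  # Collect every field name that appears in at least one record
  fields <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(rec) {
    values <- lapply(fields, function(f) {
      v <- rec[[f]]
      # Missing or empty fields become NA; multi-valued fields are joined
      if (is.null(v) || length(v) == 0) {
        NA_character_
      } else {
        paste(unlist(v), collapse = ", ")
      }
    })
    names(values) <- fields
    as.data.frame(values, stringsAsFactors = FALSE)
  })
  # Stack the one-row data frames into a single data frame
  do.call(rbind, rows)
}

# Made-up records shaped like a parsed JSON project list
records <- list(
  list(accession = "EX000001", species = list("Homo sapiens (Human)"), numAssays = 2),
  list(accession = "EX000002", species = list(), numAssays = 1)
)
df <- json_list_to_data_frame(records)
df$numAssays <- as.numeric(df$numAssays)
df
```

Every column comes back as text, which is why a numeric field such as numAssays is converted afterwards, as in project_list.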

There is enough metadata here to gather some statistics.

A sample case - most utilised MS instruments

For example, we can plot the most common instrument names for 30 different projects.
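
A minimal sketch of how such a plot could be produced with base R. The instrument_names vector below is made-up sample data standing in for projects$instrumentNames, and the choice of a bar chart of the ten most frequent names is an assumption, not necessarily what the original slide showed.

```r
# Made-up sample standing in for projects$instrumentNames
instrument_names <- c("LTQ Orbitrap Velos", "LTQ Orbitrap", "LTQ Orbitrap Velos",
                      NA, "Q Exactive", "LTQ Orbitrap Velos", "LTQ Orbitrap")

# table() ignores NA by default; sort so the most common names come first
instrument_counts <- sort(table(instrument_names), decreasing = TRUE)

# Keep at most the ten most frequent instruments
top_instruments <- head(instrument_counts, 10)

# Widen the bottom margin so the rotated labels fit
par(mar = c(10, 4, 2, 1))
barplot(top_instruments, las = 2,
        ylab = "Number of projects",
        main = "Most common MS instruments")
```

With real data, replacing instrument_names with projects$instrumentNames should be enough, provided that column holds one instrument name per project.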

The Portal

To let users create their own plots, we have developed the PRIDE Archive Statistics portal.

http://jadianes.shinyapps.io/coursera-prider/

There, the user can generate live charts with the latest data.
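
The URL suggests the portal is hosted as a Shiny app. As an illustration only, a live chart of this kind might be wired up roughly as follows. The function fetch_instrument_names is a hypothetical stand-in that returns made-up data instead of calling the PRIDE web service, and the input and output names are invented for this sketch. It is not the portal's actual code.

```r
library(shiny)

# Hypothetical stand-in for a call to the PRIDE web service; returns made-up data
fetch_instrument_names <- function(count) {
  sample_names <- c("LTQ Orbitrap Velos", "LTQ Orbitrap", "Q Exactive")
  rep(sample_names, length.out = count)
}

ui <- fluidPage(
  titlePanel("Most common MS instruments"),
  sidebarLayout(
    sidebarPanel(
      # The user chooses how many projects to include
      sliderInput("count", "Number of projects:", min = 10, max = 100, value = 30)
    ),
    mainPanel(plotOutput("instrument_plot"))
  )
)

server <- function(input, output) {
  # Re-drawn whenever the slider value changes
  output$instrument_plot <- renderPlot({
    counts <- sort(table(fetch_instrument_names(input$count)), decreasing = TRUE)
    barplot(counts, las = 2, ylab = "Number of projects")
  })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app)  # uncomment to launch the app locally
```

Because the chart is built inside renderPlot, it is recomputed each time the input changes, which is what makes the chart "live".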