Building Citation Networks With Open Data

George Chacko

Why Citation Networks?



  1. Citations are semantic links between documents
  2. Citations uncover historical dependencies (Garfield 1964)
  3. Citations trace collaborations
  4. Citations reflect the social behavior of researchers

Why Open Data?



Pros

  • Transparency
  • Reproducibility
  • Benchmarking
  • Cost effectiveness

Cons

  • Data quality
  • Coverage
  • Time and effort

PI 3-Kinase: a Test Case

Whitman, M., Kaplan, D., Schaffhausen, B. et al. Association of phosphatidylinositol kinase activity with polyoma middle-T competent for transformation. Nature 315, 239–242 (1985). 10.1038/315239a0

Fruman, David A., Honyin Chiu, Benjamin D. Hopkins, Shubha Bagrodia, Lewis C. Cantley, and Robert T. Abraham. "The PI3K pathway in human disease." Cell 170, no. 4 (2017): 605–635. 10.1016/j.cell.2017.07.029

Approach



  • Use references from Fruman et al. as the seed set (Scopus)
  • Merge with Open Citations
  • Collect citing and cited publications with respect to the seed set -> S1
  • Collect citing and cited publications with respect to the union of S1
  • Generate a PubMed-restricted version
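The expansion steps above can be sketched in a few lines of Python. This is only an illustration on toy data, not the SQL actually used. The function names (`expand`, `restrict`), the identifiers, and the reading that S1 includes the seed set are all assumptions.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (citing_id, cited_id); hypothetical identifiers


def expand(nodes: Set[str], edges: Iterable[Edge]) -> Set[str]:
    """Return nodes plus every publication that cites or is cited by them."""
    grown = set(nodes)
    for citing, cited in edges:
        if citing in nodes or cited in nodes:
            grown.add(citing)
            grown.add(cited)
    return grown


def restrict(edges: Iterable[Edge], allowed: Set[str]) -> List[Edge]:
    """Keep only edges whose two endpoints are both in the allowed set."""
    return [(a, b) for a, b in edges if a in allowed and b in allowed]


# Toy citation data standing in for the merged Scopus / Open Citations table.
edges = [("A", "seed1"), ("seed1", "B"), ("C", "A"), ("B", "D"), ("E", "F")]
seed = {"seed1"}

s1 = expand(seed, edges)   # one step out from the seed set
s2 = expand(s1, edges)     # one step out from S1

# Assumed set of identifiers that have a PubMed record.
pubmed_ids = {"seed1", "A", "B", "C"}
network = restrict(restrict(edges, s2), pubmed_ids)
```

At the scale reported in this deck, the same logic would more plausibly run as joins inside the database than as in-memory Python sets.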

Code



  1. SQL scripts
  2. R parser for PubMed data | Hossein’s Python script (discuss)
  3. Export as edgelist
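For the export step, a minimal sketch of writing a deduplicated edge list as CSV follows. The column names `citing` and `cited` and the function name `write_edgelist` are assumptions made for this example; the actual export format is not specified here.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_edgelist(edges: Iterable[Tuple[str, str]], handle: TextIO) -> int:
    """Write unique (citing, cited) pairs as CSV and return the edge count."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["citing", "cited"])  # assumed header names
    count = 0
    for citing, cited in sorted(set(edges)):  # drop duplicate edges
        writer.writerow([citing, cited])
        count += 1
    return count


# An in-memory buffer stands in for a real output file.
buffer = io.StringIO()
count = write_edgelist([("10", "7"), ("10", "7"), ("12", "10")], buffer)
edgelist_text = buffer.getvalue()
```

Passing an open file handle instead of the `StringIO` buffer writes the same content to disk.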

Final Product

The pubmed_restricted_pi3k_network

  1. 11,491,369 nodes
  2. 320,577,231 edges

Cleanup issues



  1. pmid -> doi, many to one /* take latest record */
  2. doi -> pmid, many to one /* keep or delete */
  3. Verification and validation /* complicated */
  4. Coverage /* restricted */
  5. Updating PubMed baseline and Open Citations annually
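The first two cleanup items can be illustrated with a small sketch. It assumes each record carries a PMID, a DOI, and a revision date. The `Record` type, the `revised` field, and the example DOIs are hypothetical and do not reflect the actual schema. "Take latest record" is read here as keeping the most recently revised record per DOI. PMIDs that carry several DOIs are only flagged, since the keep-or-delete decision is still open.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, NamedTuple


class Record(NamedTuple):
    pmid: str
    doi: str
    revised: date  # hypothetical "last revised" field


def latest_per_doi(records: Iterable[Record]) -> Dict[str, Record]:
    """For each DOI shared by several PMIDs, keep the most recent record."""
    best: Dict[str, Record] = {}
    for rec in records:
        current = best.get(rec.doi)
        if current is None or rec.revised > current.revised:
            best[rec.doi] = rec
    return best


def pmids_with_multiple_dois(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Flag PMIDs that map to more than one DOI for manual review."""
    dois = defaultdict(set)
    for rec in records:
        dois[rec.pmid].add(rec.doi)
    return {pmid: sorted(d) for pmid, d in dois.items() if len(d) > 1}


# Made-up records for illustration only.
records = [
    Record("111", "10.1000/x", date(2019, 1, 5)),
    Record("222", "10.1000/x", date(2020, 3, 1)),
    Record("333", "10.1000/y", date(2018, 6, 1)),
    Record("333", "10.1000/z", date(2018, 6, 1)),
]
kept = latest_per_doi(records)
ambiguous = pmids_with_multiple_dois(records)
```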