Overview

For the next two classes, we will use a new R data package called nihexporter. The package provides NIH funding data from 2000-2014 in formats that make analysis with dplyr and ggplot2 very approachable.

I plan to write and submit a short paper on the nihexporter package, and I’d like all of you to help me. I propose that we all use the package to perform analyses for the next several weeks, culminating in a useful series of analyses. At that point we can submit the paper to bioRxiv or Bioinformatics (I can get this paid for, I think).

Analysis of nihexporter will solidify your understanding of dplyr and ggplot2, and we will hopefully discover some new things. The techniques are also portable to other scenarios where you have tables of data. For example, you can link sample information (a sample.info table) in a project to the specific data sets acquired from each sample (linking by sample.id). Working this way also forces you to think about how best to organize the data you have, so that analysis is as easy as possible.
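As a minimal sketch of that linking idea, the snippet below joins two small made-up tables by sample.id. The measurements table, the genotype and value columns, and all of the values are hypothetical; only the sample.info / sample.id naming comes from the text above.

```r
library(dplyr)

# Hypothetical sample metadata: one row per sample
sample.info <- data.frame(
  sample.id = c("s1", "s2", "s3"),
  genotype  = c("wt", "mutant", "wt"),
  stringsAsFactors = FALSE
)

# Hypothetical measurements: one or more rows per sample
measurements <- data.frame(
  sample.id = c("s1", "s1", "s2", "s3"),
  value     = c(1.2, 1.4, 3.1, 0.9),
  stringsAsFactors = FALSE
)

# Attach the metadata to each measurement, then summarize per genotype
genotype.means <- measurements %>%
  left_join(sample.info, by = "sample.id") %>%
  group_by(genotype) %>%
  summarize(mean.value = mean(value))
genotype.means
```

Keeping the metadata in its own table means each fact about a sample is stored once and is attached to the measurements only when an analysis needs it.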

`nihexporter` overview

The nihexporter package provides a minimal set of data from NIH EXPORTER, which contains information on NIH biomedical research funding from 1985-2014 (and is updated monthly within a given fiscal year).

The package contains the following tables:

projects: provides data on projects funded by NIH
project.pis: links project numbers (project.num) to PI IDs (pi.id), which can be used in NIH REPORTER searches
project.orgs: links DUNS numbers (org.duns) from the projects table to information on specific organizations
publinks: links PubMed IDs (pmid) to project numbers (project.num)
patents: links project numbers (project.num) to patent IDs (patent.id)

Information about specific columns in the tables is here.

There are also a few helper variables that make exploratory analysis a bit easier:

nih.institutes: 27 NIH institutes in two-letter format

Questions to ponder

Easy-ish

What are the institutes with the best record of funding impact (i.e., publications / dollar, or patents / dollar)?
Which are more productive: single-pi or multiple-pi grants? Does this vary across institute, grant type (activity) or study-section?
Related: are there single-pi grants that are as productive as multiple-pi grants? Which grants are they?

More complicated

Model the relationship between total.cost and publication / patent number. How does the model change over time?
Find out how to link PubMed IDs to the number of times they have been cited, then measure impact as publication / dollar scaled by times.cited. You’ll need to go the extra mile and develop a way to query PubMed (or obtain this data from another source, e.g., Google Scholar).

Analysis

library(dplyr)
library(ggplot2)
library(knitr)
library(nihexporter)

Identifying grant types - single and multiple PI grants

Here is how to find out how many PIs are on each grant …

project.pi.counts <- project.pis %>%
  group_by(project.num) %>%
  summarize(pi.count = n())
project.pi.counts

## Source: local data frame [259,153 x 2]
## 
##    project.num pi.count
## 1  C06CA091516        1
## 2  C06RR014469        1
## 3  C06RR014488        1
## 4  C06RR014520        1
## 5  C06RR014524        1
## 6  C06RR014527        1
## 7  C06RR014528        1
## 8  C06RR014533        1
## 9  C06RR014561        1
## 10 C06RR014577        1
## ..         ...      ...
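One caveat is that n() counts rows, not unique people. Whether project.pis can list the same PI more than once for a project is not stated here; if it can, n_distinct(pi.id) would be the safer count. The sketch below shows the difference on a toy table. The project numbers, the PI IDs, and the duplicated rows are all made up for illustration.

```r
library(dplyr)

# Toy stand-in for project.pis; the repeated (project, PI) rows are an assumption
project.pis.toy <- data.frame(
  project.num = c("X01ZZ000001", "X01ZZ000001",
                  "X01ZZ000002", "X01ZZ000002", "X01ZZ000002"),
  pi.id       = c(101, 101, 201, 202, 202),
  stringsAsFactors = FALSE
)

toy.counts <- project.pis.toy %>%
  group_by(project.num) %>%
  summarize(
    row.count = n(),               # rows per project
    pi.count  = n_distinct(pi.id)  # unique PIs per project
  )
toy.counts
```

If the two columns agree for every project in the real table, the simpler n() version is fine.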

single.pi.projects <- project.pi.counts %>%
  filter(pi.count == 1)
single.pi.projects

## Source: local data frame [61,293 x 2]
## 
##    project.num pi.count
## 1  C06CA091516        1
## 2  C06RR014469        1
## 3  C06RR014488        1
## 4  C06RR014520        1
## 5  C06RR014524        1
## 6  C06RR014527        1
## 7  C06RR014528        1
## 8  C06RR014533        1
## 9  C06RR014561        1
## 10 C06RR014577        1
## ..         ...      ...

multiple.pi.projects <- project.pi.counts %>%
  filter(pi.count > 1)
multiple.pi.projects

## Source: local data frame [197,860 x 2]
## 
##    project.num pi.count
## 1  D43TW000003       31
## 2  D43TW000004       28
## 3  D43TW000007       31
## 4  D43TW000010       40
## 5  D43TW000011       20
## 6  D43TW000013       28
## 7  D43TW000018       32
## 8  D43TW000231       34
## 9  D43TW000233       24
## 10 D43TW000237       28
## ..         ...      ...

Now you can use the project.num field of the project.pi.counts table to cross-reference with other tables via left_join() in dplyr.
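A minimal sketch of such a cross-reference is shown below, using toy stand-ins so that it runs without the package. The tables pi.counts.toy and publinks.toy and all of their values are made up; with the real data you would join project.pi.counts to publinks in the same way.

```r
library(dplyr)

# Toy stand-in for project.pi.counts (values are made up)
pi.counts.toy <- data.frame(
  project.num = c("P1", "P2", "P3"),
  pi.count    = c(1L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Toy stand-in for publinks (values are made up)
publinks.toy <- data.frame(
  pmid        = c(11, 12, 13),
  project.num = c("P1", "P1", "P2"),
  stringsAsFactors = FALSE
)

# left_join() keeps every project on the left, even those with no publications
joined.toy <- pi.counts.toy %>%
  left_join(publinks.toy, by = "project.num")
joined.toy
```

Projects without a match (P3 here) are kept with NA in the pmid column, which matters later when counting publications per project.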

Exercise

Determine which are more productive: single-pi or multiple-pi grants.

We already determined project.pi.counts above, so we just need to calculate productivity for each project.num and then categorize grants by their number of pis.
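One possible shape for this calculation is sketched below on toy data. It assumes that productivity means publications per dollar, that the projects table can hold several rows per project.num (for example one per funded year), and that total.cost is the cost column; check these against the real tables before reusing it. All table contents are made up.

```r
library(dplyr)

# Toy stand-ins for the package tables; all values are made up
pi.counts.toy <- data.frame(
  project.num = c("P1", "P2", "P3"),
  pi.count    = c(1L, 3L, 1L),
  stringsAsFactors = FALSE
)
projects.toy <- data.frame(
  project.num = c("P1", "P1", "P2", "P3"),
  total.cost  = c(100000, 150000, 500000, 200000),
  stringsAsFactors = FALSE
)
publinks.toy <- data.frame(
  pmid        = c(11, 12, 21, 22, 23, 24, 25),
  project.num = c("P1", "P1", "P2", "P2", "P2", "P2", "P2"),
  stringsAsFactors = FALSE
)

# Total dollars per project
project.costs <- projects.toy %>%
  group_by(project.num) %>%
  summarize(total.cost = sum(total.cost, na.rm = TRUE))

# Unique publications per project
project.pubs <- publinks.toy %>%
  group_by(project.num) %>%
  summarize(pub.count = n_distinct(pmid))

# Combine, treat "no publications" as zero, and label each grant
project.productivity <- pi.counts.toy %>%
  left_join(project.costs, by = "project.num") %>%
  left_join(project.pubs, by = "project.num") %>%
  mutate(
    pub.count        = ifelse(is.na(pub.count), 0, pub.count),
    pi.type          = ifelse(pi.count == 1, "single", "multiple"),
    pubs.per.million = pub.count / total.cost * 1e6
  )

# Compare the two categories
productivity.by.type <- project.productivity %>%
  group_by(pi.type) %>%
  summarize(median.pubs.per.million = median(pubs.per.million))
productivity.by.type
```

The median is used here because a handful of very productive grants could dominate a mean; that choice is a suggestion, not part of the exercise.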

More exercises

Determine the numbers of grants of each type across fiscal years. Make two plots (e.g. with geom_boxplot()); on one of them color by activity and facet by institution, on the other do vice versa.
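As a starting point, the sketch below counts grants per institute, activity, and fiscal year on simulated data and draws the first of the two plots. The column names institute, activity, and fiscal.year are assumptions about the projects table, and the simulated values carry no meaning.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the projects table (column names are assumed)
set.seed(1)
projects.toy <- data.frame(
  project.num = paste0("P", 1:200),
  institute   = sample(c("CA", "GM"), 200, replace = TRUE),
  activity    = sample(c("R01", "R21"), 200, replace = TRUE),
  fiscal.year = sample(2010:2014, 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Number of distinct grants for each institute, activity, and fiscal year
grant.counts <- projects.toy %>%
  group_by(institute, activity, fiscal.year) %>%
  summarize(grant.count = n_distinct(project.num)) %>%
  ungroup()

# Each box summarizes the yearly counts; color by activity, facet by institute
p.by.institute <- ggplot(grant.counts,
                         aes(x = activity, y = grant.count, color = activity)) +
  geom_boxplot() +
  facet_wrap(~ institute)
p.by.institute
```

For the second plot, swap the roles: map institute to x and color, and facet by activity.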

Contributing

You are all welcome to contribute to a manuscript. But to make this as painless as possible (for me), I require that you generate and edit content using RStudio linked to your github account. The package is small enough that you can also install it on a local machine, do your analyses there, and sync them to github.

As long as you submit content this way, you will get authorship on the paper. If you send me content by email, it doesn’t count.

Workflow

Create an account at github.com and enter those credentials in RStudio (Tools -> Global Options -> Git/SVN).
Log in to your github.com account, and fork the nihexporter repository on the github.com website. This creates a copy of the repository in your account.
Create a new project in RStudio by importing the github repository (File -> New Project -> Version Control -> Git).

Enter the value of the HTTPS clone URL button, which should look like: “https://github.com//nihexporter.git”. This may take a minute or two.
You will see a new project button at the top right of RStudio called nihexporter.
Now you need to log in to tesla through the terminal, cd to the directory where you checked out the repository, and run this command:

$ git checkout manuscript
Now go back to RStudio, and in the Git panel in the lower right, you will see a button that says manuscript. All of your changes will now be (and should be) saved to this branch.

Create new content in the manuscript/contrib/ directory. Name your analysis lastname-analysis.Rmd. When you are happy with the content, click Commit, click on the button next to the file to stage it, enter a commit message and press Commit, then push the content. You can keep editing and committing new content this way until you are done.
When you are done with the content, commit your final copy and submit a pull request via the github website.

nihexporter analysis

Jay Hesselberth

March 9, 2015
