Introduction

This project seeks to import the publicly available NIH data into a database for storage and analysis. This data will be matched to a dummy database of employee information, to demonstrate the types of analyses that could be performed.

The script data_scrape.R (https://github.com/hdupre/DATA607/blob/master/nih_exporter/data_scrape.R), contains the code for scraping the zip files from the NIH Exporter website and unpacking the CSVs in the local environment.

Unfortunately, persistent issues meant that some manual intervention was needed between the extraction of the CSVs and loading them into the database, so the entirety of the process was not able to be knitted together in one RMD file.

Issues:

  • Some datasets (2010-2013) had missing or misspelled fields, which caused the dataframe that all the CSVs are loaded into to add extra fields to accomodate this.
  • There are UTF8 encoding issues. The CSVs had to be opened in a text editor and saved with a UTF8 encoding in order to avoid errors during the database load. Some invisible characters in three of the fifty-three datasets from 2019 (as 2019 and 2020 have one CSV for each week of the year) caused such persistent issues that it was better to remove them from the upload entirely – three weeks of missing data should not make a large difference in the aggregate analyses we will perform.

Creating a list of files, converting to a dataframe, loading to the database.

library(RCurl)
library(XML)
library(stringr)
library(plyr)
library(readr)
library(RMySQL)

# Set the directory where the unzipped CSVs are stored (these CSVs were retrieved in data_scrape.R linked above).
mydir <- "~/nih_exporter/CSVs/data"

# Make a list of csv file names at the set location.
myfiles <- list.files(path=mydir, pattern="*.csv", full.names=TRUE)

# Ignore all datasets from before 2010 for simplicity -- a ten year lookback is sufficient for these purposes.
myfiles <- myfiles[26:95]

# Remove three files from fiscal year 2019 that were having persistent issues due to invisible UTF8 characters.
myfiles <- myfiles[-c(51,59,60)]

# Read each CSV into dat_csv. Making the column types explicit avoids the persistent issue of some columns being erroneously labeled col_logical().
dat_csv = ldply(myfiles, read_csv, col_names= TRUE, col_types=cols(
  APPLICATION_ID = col_double(),
  ACTIVITY = col_character(),
  ADMINISTERING_IC = col_character(),
  APPLICATION_TYPE = col_double(),
  ARRA_FUNDED = col_character(),
  AWARD_NOTICE_DATE = col_character(),
  BUDGET_START = col_character(),
  BUDGET_END = col_character(),
  CFDA_CODE = col_character(),
  CORE_PROJECT_NUM = col_character(),
  ED_INST_TYPE = col_character(),
  FOA_NUMBER = col_character(),
  FULL_PROJECT_NUM = col_character(),
  FUNDING_ICs = col_character(),
  FUNDING_MECHANISM = col_character(),
  FY = col_double(),
  IC_NAME = col_character(),
  NIH_SPENDING_CATS = col_character(),
  ORG_CITY = col_character(),
  ORG_COUNTRY = col_character(),
  ORG_DEPT = col_character(),
  ORG_DISTRICT = col_double(),
  ORG_DUNS = col_character(),
  ORG_FIPS = col_character(),
  ORG_IPF_CODE = col_double(),
  ORG_NAME = col_character(),
  ORG_STATE = col_character(),
  ORG_ZIPCODE = col_character(),
  PHR = col_character(),
  PI_IDS = col_character(),
  PI_NAMEs = col_character(),
  PROGRAM_OFFICER_NAME = col_character(),
  PROJECT_START = col_character(),
  PROJECT_END = col_character(),
  PROJECT_TERMS = col_character(),
  PROJECT_TITLE = col_character(),
  SERIAL_NUMBER = col_character(),
  STUDY_SECTION = col_character(),
  STUDY_SECTION_NAME = col_character(),
  SUBPROJECT_ID = col_double(),
  SUFFIX = col_character(),
  SUPPORT_YEAR = col_double(),
  DIRECT_COST_AMT = col_double(),
  INDIRECT_COST_AMT = col_double(),
  TOTAL_COST = col_double(),
  TOTAL_COST_SUB_PROJECT = col_double()
))

head(dat_csv)
##   APPLICATION_ID ACTIVITY ADMINISTERING_IC APPLICATION_TYPE ARRA_FUNDED
## 1        7000731      C06               RR                1           Y
## 2        6826709      C06               RR                1           Y
## 3        7000889      C06               RR                1           Y
## 4        7000560      C06               RR                1           Y
## 5        7000669      C06               RR                1           Y
## 6        6949832      C06               RR                1           Y
##   AWARD_NOTICE_DATE BUDGET_START BUDGET_END CFDA_CODE CORE_PROJECT_NUM
## 1        10/20/2009   10/20/2009  9/30/2013       702      C06RR020081
## 2        10/20/2009   10/20/2009 10/19/2014       702      C06RR020088
## 3         12/9/2009   12/10/2009  6/30/2014       702      C06RR020096
## 4        10/20/2009   10/20/2009 10/19/2014       702      C06RR020132
## 5         12/9/2009   12/10/2009  6/30/2015       702      C06RR020533
## 6        10/20/2009   10/20/2009  9/30/2013       702      C06RR020544
##                     ED_INST_TYPE    FOA_NUMBER  FULL_PROJECT_NUM
## 1                           <NA>    PAR-04-122 1C06RR020081-01A1
## 2 SCHOOLS OF VETERINARY MEDICINE RFA-RR-03-011   1C06RR020088-01
## 3       ORGANIZED RESEARCH UNITS    PAR-04-122 1C06RR020096-01A1
## 4                           <NA>    PAR-04-122 1C06RR020132-01A1
## 5       ORGANIZED RESEARCH UNITS    PAR-04-122 1C06RR020533-01A1
## 6       ORGANIZED RESEARCH UNITS    PAR-04-122 1C06RR020544-01A1
##      FUNDING_ICs FUNDING_MECHANISM   FY
## 1 NCRR:8000000\\      Construction 2010
## 2 NCRR:3920956\\      Construction 2010
## 3 NCRR:8000000\\      Construction 2010
## 4 NCRR:4675896\\      Construction 2010
## 5 NCRR:3978104\\      Construction 2010
## 6 NCRR:5350000\\      Construction 2010
##                                  IC_NAME NIH_SPENDING_CATS ORG_CITY
## 1 NATIONAL CENTER FOR RESEARCH RESOURCES              <NA> COLUMBUS
## 2 NATIONAL CENTER FOR RESEARCH RESOURCES          HIV/AIDS COLUMBUS
## 3 NATIONAL CENTER FOR RESEARCH RESOURCES          HIV/AIDS  LINCOLN
## 4 NATIONAL CENTER FOR RESEARCH RESOURCES              <NA>  BUFFALO
## 5 NATIONAL CENTER FOR RESEARCH RESOURCES              <NA>    BOISE
## 6 NATIONAL CENTER FOR RESEARCH RESOURCES              <NA> COLUMBIA
##     ORG_COUNTRY ORG_DEPT ORG_DISTRICT  ORG_DUNS ORG_FIPS ORG_IPF_CODE
## 1 UNITED STATES     <NA>            3 832127323       US           NA
## 2 UNITED STATES     NONE            3 832127323       US           NA
## 3 UNITED STATES     NONE            1 555456995       US           NA
## 4 UNITED STATES     <NA>           26 824771034       US           NA
## 5 UNITED STATES     NONE            2  72995848       US           NA
## 6 UNITED STATES     NONE            4 153890272       US           NA
##                             ORG_NAME ORG_STATE ORG_ZIPCODE  PHR   PI_IDS
## 1              OHIO STATE UNIVERSITY        OH   432101016 <NA> 2074044;
## 2              OHIO STATE UNIVERSITY        OH   432101016 <NA> 1893210;
## 3     UNIVERSITY OF NEBRASKA LINCOLN        NE   685830861 <NA> 9902047;
## 4 ROSWELL PARK CANCER INSTITUTE CORP        NY   142630001 <NA> 6895846;
## 5             BOISE STATE UNIVERSITY        ID   837250001 <NA> 9837979;
## 6    UNIVERSITY OF MISSOURI-COLUMBIA        MO   652111230 <NA> 6219305;
##                  PI_NAMEs PROGRAM_OFFICER_NAME PROJECT_START PROJECT_END
## 1      GREVER, MICHAEL R;   MCCULLOUGH, WILLIE    10/20/2009   9/30/2013
## 2    OGLESBEE, MICHAEL J;              LIN, TI    10/20/2009  10/19/2014
## 3       PAUL, PREM SAGAR;   MCCULLOUGH, WILLIE    12/10/2009   6/30/2014
## 4          LEE, KELVIN P;              LIN, TI    10/20/2009  10/19/2014
## 5     RUDIN, MARK JOSEPH;   MCCULLOUGH, WILLIE    12/10/2009   6/30/2015
## 6 HALL, ROBERT DICKINSON;   MCCULLOUGH, WILLIE    10/20/2009   9/30/2013
##                                                               PROJECT_TERMS
## 1                                   Extramural Activities;research facility
## 2                                              Infectious Diseases Research
## 3 Extramural Activities;Extramural Research Facilities Construction Project
## 4                       Extramural Research Facilities Construction Project
## 5 Extramural Activities;Extramural Research Facilities Construction Project
## 6                                                    Animal Experimentation
##                                              PROJECT_TITLE SERIAL_NUMBER
## 1 PAR04-122, Extramural Research Facilities Improvement P*         20081
## 2   FACILITIES IMPROVEMENT FOR INFECTIOUS DISEASE RESEARCH         20088
## 3              Extramural Research Facilities Construction         20096
## 4 Extramural Research Facility Construction Projects. Ros*         20132
## 5 PAR04-122, Extramural Research Facilities Construction:*         20533
## 6            Construction of the MU Animal Research Center         20544
##   STUDY_SECTION
## 1          ZRR1
## 2          ZRR1
## 3          STRB
## 4          STRB
## 5          STRB
## 6          STRB
##                                                                       STUDY_SECTION_NAME
## 1                                                                 Special Emphasis Panel
## 2                                                                 Special Emphasis Panel
## 3 Scientific and Technical Review Board on Biomedical and Behavioral Research Facilities
## 4 Scientific and Technical Review Board on Biomedical and Behavioral Research Facilities
## 5 Scientific and Technical Review Board on Biomedical and Behavioral Research Facilities
## 6 Scientific and Technical Review Board on Biomedical and Behavioral Research Facilities
##   SUBPROJECT_ID SUFFIX SUPPORT_YEAR DIRECT_COST_AMT INDIRECT_COST_AMT
## 1            NA     A1            1              NA                NA
## 2            NA   <NA>            1              NA                NA
## 3            NA     A1            1              NA                NA
## 4            NA     A1            1              NA                NA
## 5            NA     A1            1              NA                NA
## 6            NA     A1            1              NA                NA
##   TOTAL_COST TOTAL_COST_SUB_PROJECT
## 1    8000000                     NA
## 2    3920956                     NA
## 3    8000000                     NA
## 4    4675896                     NA
## 5    3978104                     NA
## 6    5350000                     NA
# Set the MySQL connection variable, with the local database nih_export.
conn = dbConnect(MySQL(), user = 'root', password = '', dbname = 'nih_export', host = 'localhost')

# Write the data to a table in nih_export called nih_data.
dbWriteTable(conn, "nih_data", dat_csv, overwrite = TRUE, row.names=FALSE)
## [1] TRUE

Creating a dummy employee database

Here we will take two prepared CSVs: employee_db and application_employee_id.

The former is a dummy employee table containing employee ID, name, title, department, and salary.

The latter is table containing application IDs and employee IDs, this will help join the employee_db table to the nih_data table so that analyses can be run on internal data and the NIH data.

# Set the working directory to where the CSVs are.
setwd("~/nih_exporter")

# Read in the employee CSV and load it to the database.
employee_df <- read.csv("employee_db.csv", sep=",")

head(employee_df)
##   employee_id  first       last               title        department
## 1    33981637  David    Caffrey Assistant Professor Internal Medicine
## 2    21129270 Edward      Jones Assistant Professor Internal Medicine
## 3    13832369   Erin     Gordon Assistant Professor Internal Medicine
## 4    86042845  Ethel    Webster Assistant Professor Internal Medicine
## 5    16425582  Frank      Smith Assistant Professor Internal Medicine
## 6    36328276   Fred Blackstock Assistant Professor Internal Medicine
##   salary
## 1  72516
## 2  54820
## 3  50333
## 4  52502
## 5  66733
## 6  65997
dbWriteTable(conn, "employees", employee_df, overwrite = TRUE, row.names = FALSE)
## [1] TRUE
# Read in the application/employee ID CSV and load it to the database.
app_emp_df <- read.csv("application_employee_id.csv", sep=",")

head(app_emp_df)
##   application_id employee_id
## 1        9734439    33981637
## 2        9710474    21129270
## 3        9859520    13832369
## 4        9518924    86042845
## 5        9488276    16425582
## 6        9704737    36328276
dbWriteTable(conn, "app_emp_id", app_emp_df, overwrite = TRUE, row.names = FALSE)
## [1] TRUE

Creating dataframes, joining tables

To create the dataframes we’ll use for analysis we’ll need to query the db and join tables

nyu_dept_df <- dbGetQuery(conn, statement = paste("SELECT
ORG_DEPT, TOTAL_COST 
FROM
nih_data 
WHERE
ORG_NAME = 'NEW YORK UNIVERSITY SCHOOL OF MEDICINE' AND
FY = 2019;"))

pediatrics_df <- dbGetQuery(conn, statement = paste("SELECT 
FY, TOTAL_COST
FROM
nih_data
WHERE
ORG_NAME = 'NEW YORK UNIVERSITY SCHOOL OF MEDICINE' AND
ORG_DEPT = 'PEDIATRICS';"))


ratio_df <- dbGetQuery(conn, statement = paste("SELECT nih_data.TOTAL_COST, employees.first, employees.last, employees.salary
FROM app_emp_id 
    INNER JOIN nih_data
        on app_emp_id.application_id = nih_data.APPLICATION_ID
    INNER JOIN employees
        on app_emp_id.employee_id = employees.employee_id;"))

title_df <- dbGetQuery(conn, statement = paste("SELECT nih_data.TOTAL_COST, employees.title, employees.department
FROM app_emp_id 
    INNER JOIN nih_data
        on app_emp_id.application_id = nih_data.APPLICATION_ID
    INNER JOIN employees
        on app_emp_id.employee_id = employees.employee_id;"))

head(nyu_dept_df)
##                        ORG_DEPT TOTAL_COST
## 1                    PEDIATRICS     914716
## 2                     PATHOLOGY     423750
## 3   MICROBIOLOGY/IMMUN/VIROLOGY     537588
## 4                  BIOCHEMISTRY     803630
## 5                     PATHOLOGY     211875
## 6 RADIATION-DIAGNOSTIC/ONCOLOGY     497396
head(pediatrics_df)
##     FY TOTAL_COST
## 1 2010      47606
## 2 2010     605310
## 3 2010     220545
## 4 2010     244779
## 5 2010     762666
## 6 2010     570867
head(ratio_df)
##   TOTAL_COST   first     last salary
## 1     199643    Greg      Lee 105082
## 2     286918  Andrea Martinez 253686
## 3    1251118 Jeffrey    Rosen  68927
## 4     223312 Jessica    Riolo 128572
## 5     120000    Jane    Rogan  52123
## 6     286676     Adi    Allon 267107
head(title_df)
##   TOTAL_COST               title        department
## 1     199643 Associate Professor   Plastic Surgery
## 2     286918           Professor Internal Medicine
## 3    1251118 Assistant Professor          Oncology
## 4     223312 Associate Professor          Oncology
## 5     120000 Assistant Professor          Oncology
## 6     286676           Professor Internal Medicine

Transformation

With the data loaded into dataframes, we can transform and analyze.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# Group by department, then sum total dollars awarded per department.
nyu_dept_df <- group_by(nyu_dept_df, ORG_DEPT)
nyu_dept_df <- summarize(nyu_dept_df, sum_total_cost = sum(TOTAL_COST))
nyu_dept_df <- data.frame(nyu_dept_df)
nyu_dept_df <- nyu_dept_df[1:21,]

# Plot results in a bar chart.
nyu_dept_plot <- ggplot(nyu_dept_df, aes(x=ORG_DEPT,y = sum_total_cost)) +
    geom_bar(width = .75,stat = "identity", position="dodge") +
    ggtitle("Total $ Awarded by Department, FY2019") +
    labs(x="Department",y="Total $ Awarded") +
    theme(plot.title = element_text(hjust=0.5), axis.text.x = element_text(angle = 90, hjust = 1)) +
    scale_y_continuous(labels = scales::dollar_format(prefix="$"))

# Group by fiscal year, then sum total dollars awarded per fiscal year.
pediatrics_df <- group_by(pediatrics_df, FY)
pediatrics_df <- summarize(pediatrics_df, sum_total_cost = sum(TOTAL_COST))
pediatrics_df <- data.frame(pediatrics_df)

# Plot a time-series
pediatrics_plot <- ggplot(pediatrics_df, aes(x=FY, y=sum_total_cost)) +
  geom_line(color = 'blue', size = 1) +
  ggtitle("Pediatrics Total Award $, FY 2010-20") +
  labs(x="Fiscal Year",y="Total $ Awarded") +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_y_continuous(labels = scales::dollar_format(prefix="$")) +
  scale_x_continuous(breaks = seq(2010,2020,by=1))

# Calculate the ratio of salary to award dollars for each employee.
ratio_df <- mutate(ratio_df, award_salary_ratio = round((salary/TOTAL_COST)*100,1))

# Group by title and department and sum total award $.
title_df <- group_by(title_df, title, department)
title_df <- summarize(title_df, sum_total_cost = sum(TOTAL_COST))
title_df <- data.frame(title_df)

# Plot in a bar chart where the x-axis is title and the fill is department.
title_plot <- ggplot(title_df, aes(x=title,y = sum_total_cost, fill = department)) +
    geom_bar(width = .75,stat = "identity", position="dodge") +
    ggtitle("Total $ Awarded per Title by Department") +
    labs(x="Title",y="Total $ Awarded", fill="Department") +
    theme(plot.title = element_text(hjust=0.5)) +
    scale_y_continuous(labels = scales::dollar_format(prefix="$"))

Analysis and Conclusions

nyu_dept_plot

Here we can see that in FY2019, Internal Medicine was awarded the most federal dollars by a large margin.

pediatrics_plot

Award dollars for the pediatrics department have generally increased in the past ten years, with FY2017 and FY2018 being particularly good.

ratio_df
##    TOTAL_COST     first       last salary award_salary_ratio
## 1      199643      Greg        Lee 105082               52.6
## 2      286918    Andrea   Martinez 253686               88.4
## 3     1251118   Jeffrey      Rosen  68927                5.5
## 4      223312   Jessica      Riolo 128572               57.6
## 5      120000      Jane      Rogan  52123               43.4
## 6      286676       Adi      Allon 267107               93.2
## 7      265664     Paolo Churchhill  91246               34.3
## 8      190400      Gary       Ngai 119694               62.9
## 9      297266      Kirk     Hubman 212883               71.6
## 10     208520      Jack      Smith 119495               57.3
## 11     184890 Elizabeth     Hudson  93058               50.3
## 12    2502236      Mady     Downey 115686                4.6
## 13     299317       Max      Huang 297977               99.6
## 14     155500     Ricky    Freeman  58381               37.5
## 15     298335    Martin     McCabe 251323               84.2
## 16     131265      Lisa     Thomas  54819               41.8
## 17     104260     David    Caffrey  72516               69.6
## 18     104260    Edward      Jones  54820               52.6
## 19     111796      Erin     Gordon  50333               45.0
## 20     114825      Fred Blackstock  65997               57.5
## 21     295139       Ido       Pitt 271964               92.1
## 22     285950   Tristan      Lopez 135050               47.2
## 23    1188535     Henry      Brown  71388                6.0
## 24     284557   Richard    Bartolo 132505               46.6
## 25     281408     Rajiv      Smith 116112               41.3
## 26    1563898   Russell        You  63413                4.1
## 27     156390     Ahmed       Rong  95616               61.1
## 28     285189       Tom       Wang 102343               35.9
## 29    1146858     Ethel    Webster  52502                4.6
## 30     161470      Alan    Halpert  93947               58.2
## 31     290400 Christina    Douglas 280652               96.6
## 32    2473583   Madison     Gibson 132499                5.4
## 33      11860      Gary      Evans  60741              512.2
## 34     295122    Eugene       Shah 274211               92.9
## 35     304150     Steve      Scher 296692               97.5
## 36    1303248      John      Brand  68347                5.2
## 37     295318       Joe     Cheney 292180               98.9
## 38     194452   Grayson     Potter 126112               64.9
## 39     178438      Ally   DeSantis 133734               74.9
## 40    1876677    Esther     Ciulla 121510                6.5
## 41    1407508      Paul    Thurman  71831                5.1
## 42    1146858     Frank      Smith  66733                5.8
## 43     134195      Lucy    Herbert  51527               38.4
## 44     208520     James Washington 114053               54.7
## 45     292728    Daniel        Kim 284412               97.2
## 46    1813451     Clare    Sanchez 139876                7.7
## 47     302225     Patty  Jefferson 271954               90.0
## 48     289086   Anthony      Singh 291096              100.7
## 49    2345846  Jonathan      Brown 110886                4.7
## 50     304150      Theo  Sotosanti 205339               67.5
## 51     301513      Mike       Bush 227368               75.4
## 52     262254      Nick    Lincoln 127913               48.8
## 53     301840   Patrick       Dean 225578               74.7
## 54     297484   Malcolm       Wolf 285779               96.1
## 55     217271  Jennifer   Thatcher  99688               45.9
## 56     294126     David      Brown 208223               70.8
## 57     135000      Mona     Parker  51584               38.2
## 58     254250 Nathaniel      Raviv 100287               39.4
## 59     285000   Suzanne    Beasley 121063               42.5
## 60     111796      Erin     Gordon  50333               45.0

Employees with a low score are bringing in far more dollars than their salary amount. Those with a score over 100 are costing more than they’re bringing in with grants.

title_plot

This is dummy data, so these conclusions aren’t applicable to the real world, but this type of analysis could allow for some investigation into why, for example, Associate Professors in Oncology are able to bring in so many award dollars.