This project seeks to import the publicly available NIH data into a database for storage and analysis. This data will be matched to a dummy database of employee information, to demonstrate the types of analyses that could be performed.
The script data_scrape.R (https://github.com/hdupre/DATA607/blob/master/nih_exporter/data_scrape.R), contains the code for scraping the zip files from the NIH Exporter website and unpacking the CSVs in the local environment.
Unfortunately, persistent issues meant that some manual intervention was needed between the extraction of the CSVs and loading them into the database, so the entirety of the process was not able to be knitted together in one RMD file.
library(RCurl)
library(XML)
library(stringr)
library(plyr)
library(readr)
library(RMySQL)
# Set the directory where the unzipped CSVs are stored (these CSVs were retrieved in data_scrape.R linked above).
mydir <- "~/nih_exporter/CSVs/data"
# Make a list of csv file names at the set location.
myfiles <- list.files(path=mydir, pattern="*.csv", full.names=TRUE)
# Ignore all datasets from before 2010 for simplicity -- a ten year lookback is sufficient for these purposes.
myfiles <- myfiles[26:95]
# Remove three files from fiscal year 2019 that were having persistent issues due to invisible UTF8 characters.
myfiles <- myfiles[-c(51,59,60)]
# Read each CSV into dat_csv. Making the column types explicit avoids the persistent issue of some columns being erroneously labeled col_logical().
dat_csv = ldply(myfiles, read_csv, col_names= TRUE, col_types=cols(
APPLICATION_ID = col_double(),
ACTIVITY = col_character(),
ADMINISTERING_IC = col_character(),
APPLICATION_TYPE = col_double(),
ARRA_FUNDED = col_character(),
AWARD_NOTICE_DATE = col_character(),
BUDGET_START = col_character(),
BUDGET_END = col_character(),
CFDA_CODE = col_character(),
CORE_PROJECT_NUM = col_character(),
ED_INST_TYPE = col_character(),
FOA_NUMBER = col_character(),
FULL_PROJECT_NUM = col_character(),
FUNDING_ICs = col_character(),
FUNDING_MECHANISM = col_character(),
FY = col_double(),
IC_NAME = col_character(),
NIH_SPENDING_CATS = col_character(),
ORG_CITY = col_character(),
ORG_COUNTRY = col_character(),
ORG_DEPT = col_character(),
ORG_DISTRICT = col_double(),
ORG_DUNS = col_character(),
ORG_FIPS = col_character(),
ORG_IPF_CODE = col_double(),
ORG_NAME = col_character(),
ORG_STATE = col_character(),
ORG_ZIPCODE = col_character(),
PHR = col_character(),
PI_IDS = col_character(),
PI_NAMEs = col_character(),
PROGRAM_OFFICER_NAME = col_character(),
PROJECT_START = col_character(),
PROJECT_END = col_character(),
PROJECT_TERMS = col_character(),
PROJECT_TITLE = col_character(),
SERIAL_NUMBER = col_character(),
STUDY_SECTION = col_character(),
STUDY_SECTION_NAME = col_character(),
SUBPROJECT_ID = col_double(),
SUFFIX = col_character(),
SUPPORT_YEAR = col_double(),
DIRECT_COST_AMT = col_double(),
INDIRECT_COST_AMT = col_double(),
TOTAL_COST = col_double(),
TOTAL_COST_SUB_PROJECT = col_double()
))
head(dat_csv)
## APPLICATION_ID ACTIVITY ADMINISTERING_IC APPLICATION_TYPE ARRA_FUNDED
## 1 7000731 C06 RR 1 Y
## 2 6826709 C06 RR 1 Y
## 3 7000889 C06 RR 1 Y
## 4 7000560 C06 RR 1 Y
## 5 7000669 C06 RR 1 Y
## 6 6949832 C06 RR 1 Y
## AWARD_NOTICE_DATE BUDGET_START BUDGET_END CFDA_CODE CORE_PROJECT_NUM
## 1 10/20/2009 10/20/2009 9/30/2013 702 C06RR020081
## 2 10/20/2009 10/20/2009 10/19/2014 702 C06RR020088
## 3 12/9/2009 12/10/2009 6/30/2014 702 C06RR020096
## 4 10/20/2009 10/20/2009 10/19/2014 702 C06RR020132
## 5 12/9/2009 12/10/2009 6/30/2015 702 C06RR020533
## 6 10/20/2009 10/20/2009 9/30/2013 702 C06RR020544
## ED_INST_TYPE FOA_NUMBER FULL_PROJECT_NUM
## 1 <NA> PAR-04-122 1C06RR020081-01A1
## 2 SCHOOLS OF VETERINARY MEDICINE RFA-RR-03-011 1C06RR020088-01
## 3 ORGANIZED RESEARCH UNITS PAR-04-122 1C06RR020096-01A1
## 4 <NA> PAR-04-122 1C06RR020132-01A1
## 5 ORGANIZED RESEARCH UNITS PAR-04-122 1C06RR020533-01A1
## 6 ORGANIZED RESEARCH UNITS PAR-04-122 1C06RR020544-01A1
## FUNDING_ICs FUNDING_MECHANISM FY
## 1 NCRR:8000000\\ Construction 2010
## 2 NCRR:3920956\\ Construction 2010
## 3 NCRR:8000000\\ Construction 2010
## 4 NCRR:4675896\\ Construction 2010
## 5 NCRR:3978104\\ Construction 2010
## 6 NCRR:5350000\\ Construction 2010
## IC_NAME NIH_SPENDING_CATS ORG_CITY
## 1 NATIONAL CENTER FOR RESEARCH RESOURCES <NA> COLUMBUS
## 2 NATIONAL CENTER FOR RESEARCH RESOURCES HIV/AIDS COLUMBUS
## 3 NATIONAL CENTER FOR RESEARCH RESOURCES HIV/AIDS LINCOLN
## 4 NATIONAL CENTER FOR RESEARCH RESOURCES <NA> BUFFALO
## 5 NATIONAL CENTER FOR RESEARCH RESOURCES <NA> BOISE
## 6 NATIONAL CENTER FOR RESEARCH RESOURCES <NA> COLUMBIA
## ORG_COUNTRY ORG_DEPT ORG_DISTRICT ORG_DUNS ORG_FIPS ORG_IPF_CODE
## 1 UNITED STATES <NA> 3 832127323 US NA
## 2 UNITED STATES NONE 3 832127323 US NA
## 3 UNITED STATES NONE 1 555456995 US NA
## 4 UNITED STATES <NA> 26 824771034 US NA
## 5 UNITED STATES NONE 2 72995848 US NA
## 6 UNITED STATES NONE 4 153890272 US NA
## ORG_NAME ORG_STATE ORG_ZIPCODE PHR PI_IDS
## 1 OHIO STATE UNIVERSITY OH 432101016 <NA> 2074044;
## 2 OHIO STATE UNIVERSITY OH 432101016 <NA> 1893210;
## 3 UNIVERSITY OF NEBRASKA LINCOLN NE 685830861 <NA> 9902047;
## 4 ROSWELL PARK CANCER INSTITUTE CORP NY 142630001 <NA> 6895846;
## 5 BOISE STATE UNIVERSITY ID 837250001 <NA> 9837979;
## 6 UNIVERSITY OF MISSOURI-COLUMBIA MO 652111230 <NA> 6219305;
## PI_NAMEs PROGRAM_OFFICER_NAME PROJECT_START PROJECT_END
## 1 GREVER, MICHAEL R; MCCULLOUGH, WILLIE 10/20/2009 9/30/2013
## 2 OGLESBEE, MICHAEL J; LIN, TI 10/20/2009 10/19/2014
## 3 PAUL, PREM SAGAR; MCCULLOUGH, WILLIE 12/10/2009 6/30/2014
## 4 LEE, KELVIN P; LIN, TI 10/20/2009 10/19/2014
## 5 RUDIN, MARK JOSEPH; MCCULLOUGH, WILLIE 12/10/2009 6/30/2015
## 6 HALL, ROBERT DICKINSON; MCCULLOUGH, WILLIE 10/20/2009 9/30/2013
## PROJECT_TERMS
## 1 Extramural Activities;research facility
## 2 Infectious Diseases Research
## 3 Extramural Activities;Extramural Research Facilities Construction Project
## 4 Extramural Research Facilities Construction Project
## 5 Extramural Activities;Extramural Research Facilities Construction Project
## 6 Animal Experimentation
## PROJECT_TITLE SERIAL_NUMBER
## 1 PAR04-122, Extramural Research Facilities Improvement P* 20081
## 2 FACILITIES IMPROVEMENT FOR INFECTIOUS DISEASE RESEARCH 20088
## 3 Extramural Research Facilities Construction 20096
## 4 Extramural Research Facility Construction Projects. Ros* 20132
## 5 PAR04-122, Extramural Research Facilities Construction:* 20533
## 6 Construction of the MU Animal Research Center 20544
## STUDY_SECTION
## 1 ZRR1
## 2 ZRR1
## 3 STRB
## 4 STRB
## 5 STRB
## 6 STRB
## STUDY_SECTION_NAME
## 1 Special Emphasis Panel
## 2 Special Emphasis Panel
## 3 Scientific and Technical Review Board on Biomedical and Behavioral Research Facilities
## 4 Scientific and Technical Review Board on Biomedical and Behavioral Research Facilities
## 5 Scientific and Technical Review Board on Biomedical and Behavioral Research Facilities
## 6 Scientific and Technical Review Board on Biomedical and Behavioral Research Facilities
## SUBPROJECT_ID SUFFIX SUPPORT_YEAR DIRECT_COST_AMT INDIRECT_COST_AMT
## 1 NA A1 1 NA NA
## 2 NA <NA> 1 NA NA
## 3 NA A1 1 NA NA
## 4 NA A1 1 NA NA
## 5 NA A1 1 NA NA
## 6 NA A1 1 NA NA
## TOTAL_COST TOTAL_COST_SUB_PROJECT
## 1 8000000 NA
## 2 3920956 NA
## 3 8000000 NA
## 4 4675896 NA
## 5 3978104 NA
## 6 5350000 NA
# Set the MySQL connection variable, with the local database nih_export.
conn = dbConnect(MySQL(), user = 'root', password = '', dbname = 'nih_export', host = 'localhost')
# Write the data to a table in nih_export called nih_data.
dbWriteTable(conn, "nih_data", dat_csv, overwrite = TRUE, row.names=FALSE)
## [1] TRUE
Here we will take two prepared CSVs: employee_db and application_employee_id.
The former is a dummy employee table containing employee ID, name, title, department, and salary.
The latter is table containing application IDs and employee IDs, this will help join the employee_db table to the nih_data table so that analyses can be run on internal data and the NIH data.
# Set the working directory to where the CSVs are.
setwd("~/nih_exporter")
# Read in the employee CSV and load it to the database.
employee_df <- read.csv("employee_db.csv", sep=",")
head(employee_df)
## employee_id first last title department
## 1 33981637 David Caffrey Assistant Professor Internal Medicine
## 2 21129270 Edward Jones Assistant Professor Internal Medicine
## 3 13832369 Erin Gordon Assistant Professor Internal Medicine
## 4 86042845 Ethel Webster Assistant Professor Internal Medicine
## 5 16425582 Frank Smith Assistant Professor Internal Medicine
## 6 36328276 Fred Blackstock Assistant Professor Internal Medicine
## salary
## 1 72516
## 2 54820
## 3 50333
## 4 52502
## 5 66733
## 6 65997
dbWriteTable(conn, "employees", employee_df, overwrite = TRUE, row.names = FALSE)
## [1] TRUE
# Read in the application/employee ID CSV and load it to the database.
app_emp_df <- read.csv("application_employee_id.csv", sep=",")
head(app_emp_df)
## application_id employee_id
## 1 9734439 33981637
## 2 9710474 21129270
## 3 9859520 13832369
## 4 9518924 86042845
## 5 9488276 16425582
## 6 9704737 36328276
dbWriteTable(conn, "app_emp_id", app_emp_df, overwrite = TRUE, row.names = FALSE)
## [1] TRUE
To create the dataframes we’ll use for analysis we’ll need to query the db and join tables
nyu_dept_df <- dbGetQuery(conn, statement = paste("SELECT
ORG_DEPT, TOTAL_COST
FROM
nih_data
WHERE
ORG_NAME = 'NEW YORK UNIVERSITY SCHOOL OF MEDICINE' AND
FY = 2019;"))
pediatrics_df <- dbGetQuery(conn, statement = paste("SELECT
FY, TOTAL_COST
FROM
nih_data
WHERE
ORG_NAME = 'NEW YORK UNIVERSITY SCHOOL OF MEDICINE' AND
ORG_DEPT = 'PEDIATRICS';"))
ratio_df <- dbGetQuery(conn, statement = paste("SELECT nih_data.TOTAL_COST, employees.first, employees.last, employees.salary
FROM app_emp_id
INNER JOIN nih_data
on app_emp_id.application_id = nih_data.APPLICATION_ID
INNER JOIN employees
on app_emp_id.employee_id = employees.employee_id;"))
title_df <- dbGetQuery(conn, statement = paste("SELECT nih_data.TOTAL_COST, employees.title, employees.department
FROM app_emp_id
INNER JOIN nih_data
on app_emp_id.application_id = nih_data.APPLICATION_ID
INNER JOIN employees
on app_emp_id.employee_id = employees.employee_id;"))
head(nyu_dept_df)
## ORG_DEPT TOTAL_COST
## 1 PEDIATRICS 914716
## 2 PATHOLOGY 423750
## 3 MICROBIOLOGY/IMMUN/VIROLOGY 537588
## 4 BIOCHEMISTRY 803630
## 5 PATHOLOGY 211875
## 6 RADIATION-DIAGNOSTIC/ONCOLOGY 497396
head(pediatrics_df)
## FY TOTAL_COST
## 1 2010 47606
## 2 2010 605310
## 3 2010 220545
## 4 2010 244779
## 5 2010 762666
## 6 2010 570867
head(ratio_df)
## TOTAL_COST first last salary
## 1 199643 Greg Lee 105082
## 2 286918 Andrea Martinez 253686
## 3 1251118 Jeffrey Rosen 68927
## 4 223312 Jessica Riolo 128572
## 5 120000 Jane Rogan 52123
## 6 286676 Adi Allon 267107
head(title_df)
## TOTAL_COST title department
## 1 199643 Associate Professor Plastic Surgery
## 2 286918 Professor Internal Medicine
## 3 1251118 Assistant Professor Oncology
## 4 223312 Associate Professor Oncology
## 5 120000 Assistant Professor Oncology
## 6 286676 Professor Internal Medicine
With the data loaded into dataframes, we can transform and analyze.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Group by department, then sum total dollars awarded per department.
nyu_dept_df <- group_by(nyu_dept_df, ORG_DEPT)
nyu_dept_df <- summarize(nyu_dept_df, sum_total_cost = sum(TOTAL_COST))
nyu_dept_df <- data.frame(nyu_dept_df)
nyu_dept_df <- nyu_dept_df[1:21,]
# Plot results in a bar chart.
nyu_dept_plot <- ggplot(nyu_dept_df, aes(x=ORG_DEPT,y = sum_total_cost)) +
geom_bar(width = .75,stat = "identity", position="dodge") +
ggtitle("Total $ Awarded by Department, FY2019") +
labs(x="Department",y="Total $ Awarded") +
theme(plot.title = element_text(hjust=0.5), axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(labels = scales::dollar_format(prefix="$"))
# Group by fiscal year, then sum total dollars awarded per fiscal year.
pediatrics_df <- group_by(pediatrics_df, FY)
pediatrics_df <- summarize(pediatrics_df, sum_total_cost = sum(TOTAL_COST))
pediatrics_df <- data.frame(pediatrics_df)
# Plot a time-series
pediatrics_plot <- ggplot(pediatrics_df, aes(x=FY, y=sum_total_cost)) +
geom_line(color = 'blue', size = 1) +
ggtitle("Pediatrics Total Award $, FY 2010-20") +
labs(x="Fiscal Year",y="Total $ Awarded") +
theme(plot.title = element_text(hjust=0.5)) +
scale_y_continuous(labels = scales::dollar_format(prefix="$")) +
scale_x_continuous(breaks = seq(2010,2020,by=1))
# Calculate the ratio of salary to award dollars for each employee.
ratio_df <- mutate(ratio_df, award_salary_ratio = round((salary/TOTAL_COST)*100,1))
# Group by title and department and sum total award $.
title_df <- group_by(title_df, title, department)
title_df <- summarize(title_df, sum_total_cost = sum(TOTAL_COST))
title_df <- data.frame(title_df)
# Plot in a bar chart where the x-axis is title and the fill is department.
title_plot <- ggplot(title_df, aes(x=title,y = sum_total_cost, fill = department)) +
geom_bar(width = .75,stat = "identity", position="dodge") +
ggtitle("Total $ Awarded per Title by Department") +
labs(x="Title",y="Total $ Awarded", fill="Department") +
theme(plot.title = element_text(hjust=0.5)) +
scale_y_continuous(labels = scales::dollar_format(prefix="$"))
nyu_dept_plot
Here we can see that in FY2019, Internal Medicine was awarded the most federal dollars by a large margin.
pediatrics_plot
Award dollars for the pediatrics department have generally increased in the past ten years, with FY2017 and FY2018 being particularly good.
ratio_df
## TOTAL_COST first last salary award_salary_ratio
## 1 199643 Greg Lee 105082 52.6
## 2 286918 Andrea Martinez 253686 88.4
## 3 1251118 Jeffrey Rosen 68927 5.5
## 4 223312 Jessica Riolo 128572 57.6
## 5 120000 Jane Rogan 52123 43.4
## 6 286676 Adi Allon 267107 93.2
## 7 265664 Paolo Churchhill 91246 34.3
## 8 190400 Gary Ngai 119694 62.9
## 9 297266 Kirk Hubman 212883 71.6
## 10 208520 Jack Smith 119495 57.3
## 11 184890 Elizabeth Hudson 93058 50.3
## 12 2502236 Mady Downey 115686 4.6
## 13 299317 Max Huang 297977 99.6
## 14 155500 Ricky Freeman 58381 37.5
## 15 298335 Martin McCabe 251323 84.2
## 16 131265 Lisa Thomas 54819 41.8
## 17 104260 David Caffrey 72516 69.6
## 18 104260 Edward Jones 54820 52.6
## 19 111796 Erin Gordon 50333 45.0
## 20 114825 Fred Blackstock 65997 57.5
## 21 295139 Ido Pitt 271964 92.1
## 22 285950 Tristan Lopez 135050 47.2
## 23 1188535 Henry Brown 71388 6.0
## 24 284557 Richard Bartolo 132505 46.6
## 25 281408 Rajiv Smith 116112 41.3
## 26 1563898 Russell You 63413 4.1
## 27 156390 Ahmed Rong 95616 61.1
## 28 285189 Tom Wang 102343 35.9
## 29 1146858 Ethel Webster 52502 4.6
## 30 161470 Alan Halpert 93947 58.2
## 31 290400 Christina Douglas 280652 96.6
## 32 2473583 Madison Gibson 132499 5.4
## 33 11860 Gary Evans 60741 512.2
## 34 295122 Eugene Shah 274211 92.9
## 35 304150 Steve Scher 296692 97.5
## 36 1303248 John Brand 68347 5.2
## 37 295318 Joe Cheney 292180 98.9
## 38 194452 Grayson Potter 126112 64.9
## 39 178438 Ally DeSantis 133734 74.9
## 40 1876677 Esther Ciulla 121510 6.5
## 41 1407508 Paul Thurman 71831 5.1
## 42 1146858 Frank Smith 66733 5.8
## 43 134195 Lucy Herbert 51527 38.4
## 44 208520 James Washington 114053 54.7
## 45 292728 Daniel Kim 284412 97.2
## 46 1813451 Clare Sanchez 139876 7.7
## 47 302225 Patty Jefferson 271954 90.0
## 48 289086 Anthony Singh 291096 100.7
## 49 2345846 Jonathan Brown 110886 4.7
## 50 304150 Theo Sotosanti 205339 67.5
## 51 301513 Mike Bush 227368 75.4
## 52 262254 Nick Lincoln 127913 48.8
## 53 301840 Patrick Dean 225578 74.7
## 54 297484 Malcolm Wolf 285779 96.1
## 55 217271 Jennifer Thatcher 99688 45.9
## 56 294126 David Brown 208223 70.8
## 57 135000 Mona Parker 51584 38.2
## 58 254250 Nathaniel Raviv 100287 39.4
## 59 285000 Suzanne Beasley 121063 42.5
## 60 111796 Erin Gordon 50333 45.0
Employees with a low score are bringing in far more dollars than their salary amount. Those with a score over 100 are costing more than they’re bringing in with grants.
title_plot
This is dummy data, so these conclusions aren’t applicable to the real world, but this type of analysis could allow for some investigation into why, for example, Associate Professors in Oncology are able to bring in so many award dollars.