This is an analysis of the Grant Status dataset for the Kaggle competition Predict Grant Applications. Kaggle’s Grant Applications competition challenges participants to predict the success (1 = successful, 0 = unsuccessful) of 2,176 grant applications submitted by the University of Melbourne between 2009 and 2010. The training and test datasets include 249 features. The training dataset contains 8,708 observations from applications made between 2005 and 2008. Competitors build a predictive model with the training dataset, then apply the model to the test dataset to produce a submission file consisting of the observation id and the predicted probability of application success. Kaggle evaluates submissions based on the area under the ROC curve (AUC).
This document addresses initial data management. It follows the logic used in Kuhn’s Applied Predictive Modeling. (See AppliedPredictiveModeling::scriptLocation()
.)
Each row contains the outcome (Grant.Status
) of a grant application (Grant.Application.ID
), the application date (Start.date
), several descriptive codes (Sponsor.Code
, Grant.Category.Code
, Contract.Value.Band...see.note.A
, 5 RFCD.Code.#
and proportions, and 5 SEO.Code.#
and proportions), and 15 sets of attributes for persons on the application. Those attributes include an identifier (Person.ID.#
), role (Role.#
) and 13 more fields.
This script pivots the 15 sets of persons on the application into additional rows in the dataset, then pivots each coded attribute into count predictors. Most of the predictors are binary. There are 730 RFCD______
dummies, 454 SEO_____
dummies, 292 Sponsor___
dummies, 17 ContractValueBand_
dummies, 11 mmm
application month dummies, and 6 ddd
application day of week dummies. There are frequency variables counting the number of persons per role num__
, and [role].[attribute] counts such as CI.1925
for chief investigators born between 1925 and 1930. The publication information is represented in two ways, first, as totals for each role, such as B.CI
, and second as total counts across all individuals, such as AstarTotal
or all journal types (allPub
). The calendar day of the year is stored as a numeric variable (Day
). The class outcome is contained in a factor variable Class
with levels successful
and unsuccessful
.
The script splits the data into two data frames: training
contains the 6,633 pre-2008 observations plus a holdout set of 1,557 observations from 2008 used to tune the model for a total of 8,190 observations; testing
contains the remaining 518 observations from 2008. Total rows = 8,190 + 518 = 8,708. A vector pre2008
contains the row indices of the 6,633 pre-2008 observations.
The full set of predictors, identified by name in vector fullSet
, include variables with zero variance or high (\(\rho\) >0.99) correlation to other variables. A reduced set of variables, identified by name in vector reducedSet
removes the zero varaiance and highly correlated variables.
## Rows: 8,708
## Columns: 252
## $ Grant.Application.ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
## $ Grant.Status <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, ...
## $ Sponsor.Code <chr> NA, "2B", "29A", "40D", "59...
## $ Grant.Category.Code <chr> NA, "10A", "10B", "10B", "1...
## $ Contract.Value.Band...see.note.A <chr> "A", "B", "A", "C", "A", NA...
## $ Start.date <chr> "8/11/05", "11/11/05", "14/...
## $ RFCD.Code.1 <dbl> 280199, 280103, 321004, 270...
## $ RFCD.Percentage.1 <dbl> 100, 30, 60, 50, 34, 100, 5...
## $ RFCD.Code.2 <dbl> 0, 280106, 321216, 320602, ...
## $ RFCD.Percentage.2 <dbl> 0, 30, 40, 50, 33, 0, 30, 0...
## $ RFCD.Code.3 <dbl> 0, 280203, 0, 0, 290000, 0,...
## $ RFCD.Percentage.3 <dbl> 0, 40, 0, 0, 33, 0, 20, 0, ...
## $ RFCD.Code.4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ RFCD.Percentage.4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ RFCD.Code.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ RFCD.Percentage.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SEO.Code.1 <dbl> 700299, 700103, 730105, 730...
## $ SEO.Percentage.1 <dbl> 100, 50, 60, 70, 100, 100, ...
## $ SEO.Code.2 <dbl> 0, 700102, 730207, 730201, ...
## $ SEO.Percentage.2 <dbl> 0, 50, 40, 30, 0, 0, 40, 0,...
## $ SEO.Code.3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SEO.Percentage.3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SEO.Code.4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SEO.Percentage.4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SEO.Code.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SEO.Percentage.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Person.ID.1 <dbl> 40572, 9067, 5967, 78782, 1...
## $ Role.1 <chr> "CHIEF_INVESTIGATOR", "CHIE...
## $ Year.of.Birth.1 <dbl> 1965, 1960, 1955, 1955, 196...
## $ Country.of.Birth.1 <chr> "Asia Pacific", "Australia"...
## $ Home.Language.1 <chr> "Other", NA, NA, NA, NA, NA...
## $ Dept.No..1 <dbl> 3073, 2538, 2923, 2678, 215...
## $ Faculty.No..1 <dbl> 31, 25, 25, 25, 19, 25, 31,...
## $ With.PHD.1 <chr> NA, "Yes", "Yes", "Yes", "Y...
## $ No..of.Years.in.Uni.at.Time.of.Grant.1 <chr> "Less than 0", "more than 1...
## $ Number.of.Successful.Grant.1 <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 0, ...
## $ Number.of.Unsuccessful.Grant.1 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, ...
## $ A..1 <dbl> 4, 6, 0, 0, 3, 7, 8, 0, 8, ...
## $ A.1 <dbl> 2, 12, 3, 3, 0, 27, 5, 0, 1...
## $ B.1 <dbl> 0, 2, 5, 13, 1, 27, 2, 0, 1...
## $ C.1 <dbl> 0, 2, 2, 3, 0, 6, 0, 0, 0, ...
## $ Person.ID.2 <dbl> NA, NA, 27307, 55337, NA, N...
## $ Role.2 <chr> NA, NA, "CHIEF_INVESTIGATOR...
## $ Year.of.Birth.2 <dbl> NA, NA, 1950, 1975, NA, NA,...
## $ Country.of.Birth.2 <chr> NA, NA, "Australia", "Austr...
## $ Home.Language.2 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..2 <dbl> NA, NA, 2923, 2678, NA, NA,...
## $ Faculty.No..2 <dbl> NA, NA, 25, 25, NA, NA, 31,...
## $ With.PHD.2 <chr> NA, NA, NA, "Yes", NA, NA, ...
## $ No..of.Years.in.Uni.at.Time.of.Grant.2 <chr> NA, NA, "Less than 0", ">=0...
## $ Number.of.Successful.Grant.2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ Number.of.Unsuccessful.Grant.2 <dbl> NA, NA, 0, 0, NA, NA, 1, 0,...
## $ A..2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ A.2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ B.2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ C.2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ Person.ID.3 <dbl> NA, NA, 79652, NA, NA, NA, ...
## $ Role.3 <chr> NA, NA, "CHIEF_INVESTIGATOR...
## $ Year.of.Birth.3 <dbl> NA, NA, 1950, NA, NA, NA, 1...
## $ Country.of.Birth.3 <chr> NA, NA, "Asia Pacific", NA,...
## $ Home.Language.3 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..3 <dbl> NA, NA, 2498, NA, NA, NA, 3...
## $ Faculty.No..3 <dbl> NA, NA, 25, NA, NA, NA, 31,...
## $ With.PHD.3 <chr> NA, NA, "Yes", NA, NA, NA, ...
## $ No..of.Years.in.Uni.at.Time.of.Grant.3 <chr> NA, NA, "more than 15", NA,...
## $ Number.of.Successful.Grant.3 <dbl> NA, NA, 0, NA, NA, NA, 0, N...
## $ Number.of.Unsuccessful.Grant.3 <dbl> NA, NA, 0, NA, NA, NA, 0, N...
## $ A..3 <dbl> NA, NA, 1, NA, NA, NA, 0, N...
## $ A.3 <dbl> NA, NA, 3, NA, NA, NA, 7, N...
## $ B.3 <dbl> NA, NA, 3, NA, NA, NA, 12, ...
## $ C.3 <dbl> NA, NA, 3, NA, NA, NA, 4, N...
## $ Person.ID.4 <dbl> NA, NA, 11667, NA, NA, NA, ...
## $ Role.4 <chr> NA, NA, "DELEGATED_RESEARCH...
## $ Year.of.Birth.4 <dbl> NA, NA, 1950, NA, NA, NA, N...
## $ Country.of.Birth.4 <chr> NA, NA, "Australia", NA, NA...
## $ Home.Language.4 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..4 <dbl> NA, NA, 2548, NA, NA, NA, N...
## $ Faculty.No..4 <dbl> NA, NA, 25, NA, NA, NA, NA,...
## $ With.PHD.4 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.4 <chr> NA, NA, "more than 15", NA,...
## $ Number.of.Successful.Grant.4 <dbl> NA, NA, 0, NA, NA, NA, NA, ...
## $ Number.of.Unsuccessful.Grant.4 <dbl> NA, NA, 0, NA, NA, NA, NA, ...
## $ A..4 <dbl> NA, NA, 6, NA, NA, NA, NA, ...
## $ A.4 <dbl> NA, NA, 14, NA, NA, NA, NA,...
## $ B.4 <dbl> NA, NA, 12, NA, NA, NA, NA,...
## $ C.4 <dbl> NA, NA, 2, NA, NA, NA, NA, ...
## $ Person.ID.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.5 <chr> NA, NA, "EXT_CHIEF_INVESTIG...
## $ Year.of.Birth.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..5 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.6 <chr> NA, NA, "EXT_CHIEF_INVESTIG...
## $ Year.of.Birth.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.7 <chr> NA, NA, "EXT_CHIEF_INVESTIG...
## $ Year.of.Birth.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.11 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..11 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.12 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.13 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.14 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.14 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.14 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.15 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.15 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.15 <lgl> NA, NA, NA, NA, NA, NA, NA,...
## $ X252 <lgl> NA, NA, NA, NA, NA, NA, NA,...
d_raw_2 <- d_raw_1 %>%
mutate(
X252 = NULL, # extra col loaded by read_csv
Role.1 = replace_na(Role.1, "Unk"), # Force a value for Role.1
Sponsor.Code = factor(paste0("Sponsor", replace_na(Sponsor.Code, "Unk"))),
Grant.Category.Code = factor(paste0("GrantCat", replace_na(Grant.Category.Code, "Unk"))),
Contract.Value.Band...see.note.A = factor(paste0("ContractValueBand", replace_na(Contract.Value.Band...see.note.A, "Unk"))),
Start.date = lubridate::dmy(Start.date),
) %>%
mutate_at(vars(
starts_with("Person.ID."),
starts_with("Role."),
starts_with("RFCD.Code."),
starts_with("SEO.Code."),
starts_with("Country.of.Birth"),
starts_with("Home.Language."),
starts_with("Dept.No.."),
starts_with("Faculty.No.."),
starts_with("No..of.Years.in.Uni.at.Time.of.Grant."),
starts_with("With.PHD.")
),
as.character
) %>%
mutate_at(vars(
starts_with("Year.of.Birth."),
starts_with("Number.of.Successful.Grant."),
starts_with("Number.of.Unsuccessful.Grant."),
starts_with("A.."),
starts_with("A."),
starts_with("B."),
starts_with("C.")
),
as.numeric
)
glimpse(d_raw_2)
## Rows: 8,708
## Columns: 251
## $ Grant.Application.ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
## $ Grant.Status <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, ...
## $ Sponsor.Code <fct> SponsorUnk, Sponsor2B, Spon...
## $ Grant.Category.Code <fct> GrantCatUnk, GrantCat10A, G...
## $ Contract.Value.Band...see.note.A <fct> ContractValueBandA, Contrac...
## $ Start.date <date> 2005-11-08, 2005-11-11, 20...
## $ RFCD.Code.1 <chr> "280199", "280103", "321004...
## $ RFCD.Percentage.1 <dbl> 100, 30, 60, 50, 34, 100, 5...
## $ RFCD.Code.2 <chr> "0", "280106", "321216", "3...
## $ RFCD.Percentage.2 <dbl> 0, 30, 40, 50, 33, 0, 30, 0...
## $ RFCD.Code.3 <chr> "0", "280203", "0", "0", "2...
## $ RFCD.Percentage.3 <dbl> 0, 40, 0, 0, 33, 0, 20, 0, ...
## $ RFCD.Code.4 <chr> "0", "0", "0", "0", "0", "0...
## $ RFCD.Percentage.4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ RFCD.Code.5 <chr> "0", "0", "0", "0", "0", "0...
## $ RFCD.Percentage.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SEO.Code.1 <chr> "700299", "700103", "730105...
## $ SEO.Percentage.1 <dbl> 100, 50, 60, 70, 100, 100, ...
## $ SEO.Code.2 <chr> "0", "700102", "730207", "7...
## $ SEO.Percentage.2 <dbl> 0, 50, 40, 30, 0, 0, 40, 0,...
## $ SEO.Code.3 <chr> "0", "0", "0", "0", "0", "0...
## $ SEO.Percentage.3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SEO.Code.4 <chr> "0", "0", "0", "0", "0", "0...
## $ SEO.Percentage.4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SEO.Code.5 <chr> "0", "0", "0", "0", "0", "0...
## $ SEO.Percentage.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Person.ID.1 <chr> "40572", "9067", "5967", "7...
## $ Role.1 <chr> "CHIEF_INVESTIGATOR", "CHIE...
## $ Year.of.Birth.1 <dbl> 1965, 1960, 1955, 1955, 196...
## $ Country.of.Birth.1 <chr> "Asia Pacific", "Australia"...
## $ Home.Language.1 <chr> "Other", NA, NA, NA, NA, NA...
## $ Dept.No..1 <chr> "3073", "2538", "2923", "26...
## $ Faculty.No..1 <chr> "31", "25", "25", "25", "19...
## $ With.PHD.1 <chr> NA, "Yes", "Yes", "Yes", "Y...
## $ No..of.Years.in.Uni.at.Time.of.Grant.1 <chr> "Less than 0", "more than 1...
## $ Number.of.Successful.Grant.1 <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 0, ...
## $ Number.of.Unsuccessful.Grant.1 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, ...
## $ A..1 <dbl> 4, 6, 0, 0, 3, 7, 8, 0, 8, ...
## $ A.1 <dbl> 2, 12, 3, 3, 0, 27, 5, 0, 1...
## $ B.1 <dbl> 0, 2, 5, 13, 1, 27, 2, 0, 1...
## $ C.1 <dbl> 0, 2, 2, 3, 0, 6, 0, 0, 0, ...
## $ Person.ID.2 <chr> NA, NA, "27307", "55337", N...
## $ Role.2 <chr> NA, NA, "CHIEF_INVESTIGATOR...
## $ Year.of.Birth.2 <dbl> NA, NA, 1950, 1975, NA, NA,...
## $ Country.of.Birth.2 <chr> NA, NA, "Australia", "Austr...
## $ Home.Language.2 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..2 <chr> NA, NA, "2923", "2678", NA,...
## $ Faculty.No..2 <chr> NA, NA, "25", "25", NA, NA,...
## $ With.PHD.2 <chr> NA, NA, NA, "Yes", NA, NA, ...
## $ No..of.Years.in.Uni.at.Time.of.Grant.2 <chr> NA, NA, "Less than 0", ">=0...
## $ Number.of.Successful.Grant.2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ Number.of.Unsuccessful.Grant.2 <dbl> NA, NA, 0, 0, NA, NA, 1, 0,...
## $ A..2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ A.2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ B.2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ C.2 <dbl> NA, NA, 0, 0, NA, NA, 0, 0,...
## $ Person.ID.3 <chr> NA, NA, "79652", NA, NA, NA...
## $ Role.3 <chr> NA, NA, "CHIEF_INVESTIGATOR...
## $ Year.of.Birth.3 <dbl> NA, NA, 1950, NA, NA, NA, 1...
## $ Country.of.Birth.3 <chr> NA, NA, "Asia Pacific", NA,...
## $ Home.Language.3 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..3 <chr> NA, NA, "2498", NA, NA, NA,...
## $ Faculty.No..3 <chr> NA, NA, "25", NA, NA, NA, "...
## $ With.PHD.3 <chr> NA, NA, "Yes", NA, NA, NA, ...
## $ No..of.Years.in.Uni.at.Time.of.Grant.3 <chr> NA, NA, "more than 15", NA,...
## $ Number.of.Successful.Grant.3 <dbl> NA, NA, 0, NA, NA, NA, 0, N...
## $ Number.of.Unsuccessful.Grant.3 <dbl> NA, NA, 0, NA, NA, NA, 0, N...
## $ A..3 <dbl> NA, NA, 1, NA, NA, NA, 0, N...
## $ A.3 <dbl> NA, NA, 3, NA, NA, NA, 7, N...
## $ B.3 <dbl> NA, NA, 3, NA, NA, NA, 12, ...
## $ C.3 <dbl> NA, NA, 3, NA, NA, NA, 4, N...
## $ Person.ID.4 <chr> NA, NA, "11667", NA, NA, NA...
## $ Role.4 <chr> NA, NA, "DELEGATED_RESEARCH...
## $ Year.of.Birth.4 <dbl> NA, NA, 1950, NA, NA, NA, N...
## $ Country.of.Birth.4 <chr> NA, NA, "Australia", NA, NA...
## $ Home.Language.4 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..4 <chr> NA, NA, "2548", NA, NA, NA,...
## $ Faculty.No..4 <chr> NA, NA, "25", NA, NA, NA, N...
## $ With.PHD.4 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.4 <chr> NA, NA, "more than 15", NA,...
## $ Number.of.Successful.Grant.4 <dbl> NA, NA, 0, NA, NA, NA, NA, ...
## $ Number.of.Unsuccessful.Grant.4 <dbl> NA, NA, 0, NA, NA, NA, NA, ...
## $ A..4 <dbl> NA, NA, 6, NA, NA, NA, NA, ...
## $ A.4 <dbl> NA, NA, 14, NA, NA, NA, NA,...
## $ B.4 <dbl> NA, NA, 12, NA, NA, NA, NA,...
## $ C.4 <dbl> NA, NA, 2, NA, NA, NA, NA, ...
## $ Person.ID.5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.5 <chr> NA, NA, "EXT_CHIEF_INVESTIG...
## $ Year.of.Birth.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.5 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.5 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.6 <chr> NA, NA, "EXT_CHIEF_INVESTIG...
## $ Year.of.Birth.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.6 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.6 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.7 <chr> NA, NA, "EXT_CHIEF_INVESTIG...
## $ Year.of.Birth.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.7 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.7 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.8 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.8 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.9 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.9 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.10 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.10 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.11 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.11 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.12 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.12 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.13 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.13 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.14 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.14 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.14 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.14 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.14 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..14 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..14 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.14 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.14 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.14 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.14 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..14 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.14 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.14 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.14 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Person.ID.15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Role.15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Year.of.Birth.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Country.of.Birth.15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Home.Language.15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Dept.No..15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Faculty.No..15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ With.PHD.15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ No..of.Years.in.Uni.at.Time.of.Grant.15 <chr> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Successful.Grant.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ Number.of.Unsuccessful.Grant.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A..15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ A.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ B.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
## $ C.15 <dbl> NA, NA, NA, NA, NA, NA, NA,...
There are 15 roles on the grant application with associated attributes. Transform these 15 sets of columns into 15 sets of rows.
tmp <- vector(mode = "list", length = 15)
for (i in 1:15) {
x <- select_at(d_raw_2, vars(c(1:6), ends_with(paste0(".", i))))
names(x) <- str_remove(names(x), paste0(".", i))
if(all(names(x) != "RFCD.Code")) x$RFCD.Code <- NA
if(all(names(x) != "RFCD.Percentage")) x$RFCD.Percentage <- NA
if(all(names(x) != "SEO.Code")) x$SEO.Code <- NA
if(all(names(x) != "SEO.Percentage")) x$SEO.Percentage <- NA
tmp[[i]] = x
}
d_raw_3 <- bind_rows(tmp)
d_raw_3 <- filter(d_raw_3, !is.na(Role))
rm(tmp, i, x)
glimpse(d_raw_3)
## Rows: 17,594
## Columns: 25
## $ Grant.Application.ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,...
## $ Grant.Status <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, ...
## $ Sponsor.Code <fct> SponsorUnk, Sponsor2B, Sponsor...
## $ Grant.Category.Code <fct> GrantCatUnk, GrantCat10A, Gran...
## $ Contract.Value.Band...see.note.A <fct> ContractValueBandA, ContractVa...
## $ Start.date <date> 2005-11-08, 2005-11-11, 2005-...
## $ RFCD.Code <chr> "280199", "280103", "321004", ...
## $ RFCD.Percentage <dbl> 100, 30, 60, 50, 34, 100, 50, ...
## $ SEO.Code <chr> "700299", "700103", "730105", ...
## $ SEO.Percentage <dbl> 100, 50, 60, 70, 100, 100, 60,...
## $ Person.ID <chr> "40572", "9067", "5967", "7878...
## $ Role <chr> "CHIEF_INVESTIGATOR", "CHIEF_I...
## $ Year.of.Birth <dbl> 1965, 1960, 1955, 1955, 1965, ...
## $ Country.of.Birth <chr> "Asia Pacific", "Australia", "...
## $ Home.Language <chr> "Other", NA, NA, NA, NA, NA, N...
## $ Dept.No. <chr> "3073", "2538", "2923", "2678"...
## $ Faculty.No. <chr> "31", "25", "25", "25", "19", ...
## $ With.PHD <chr> NA, "Yes", "Yes", "Yes", "Yes"...
## $ No..of.Years.in.Uni.at.Time.of.Grant <chr> "Less than 0", "more than 15",...
## $ Number.of.Successful.Grant <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, ...
## $ Number.of.Unsuccessful.Grant <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, ...
## $ A. <dbl> 4, 6, 0, 0, 3, 7, 8, 0, 8, 1, ...
## $ A <dbl> 2, 12, 3, 3, 0, 27, 5, 0, 1, 0...
## $ B <dbl> 0, 2, 5, 13, 1, 27, 2, 0, 11, ...
## $ C <dbl> 0, 2, 2, 3, 0, 6, 0, 0, 0, 0, ...
d_raw_4 <- d_raw_3 %>%
mutate(
Year.of.Birth = factor(Year.of.Birth),
Country.of.Birth = factor(str_remove_all(Country.of.Birth, " ")),
Home.Language = factor(str_replace(Home.Language, "Other", "OtherLang")),
With.PHD = str_replace(With.PHD, "Yes", "PhD"),
Dept.No. = factor(paste0("Dept", Dept.No.)),
Faculty.No. = factor(paste0("Faculty", Faculty.No.)),
RFCD.Percentage = if_else(
RFCD.Code %in% c("0", "999999"), as.numeric(NA), RFCD.Percentage),
RFCD.Code = factor(paste0("RFCD", if_else(
RFCD.Code %in% c("0", "999999"), as.character(NA), RFCD.Code))),
SEO.Percentage = if_else(
SEO.Code %in% c("0", "999999"), as.numeric(NA), SEO.Percentage),
SEO.Code = factor(paste0("SEO", if_else(
SEO.Code %in% c("0", "999999"), as.character(NA), SEO.Code))),
No..of.Years.in.Uni.at.Time.of.Grant = factor(case_when(
is.na(No..of.Years.in.Uni.at.Time.of.Grant) ~ "DurationUnk",
No..of.Years.in.Uni.at.Time.of.Grant == ">=0 to 5" ~ "Duration0to5",
No..of.Years.in.Uni.at.Time.of.Grant == ">5 to 10" ~ "Duration5to10",
No..of.Years.in.Uni.at.Time.of.Grant == ">10 to 15" ~ "Duration10to15",
No..of.Years.in.Uni.at.Time.of.Grant == "more than 15" ~ "DurationGT15",
No..of.Years.in.Uni.at.Time.of.Grant == "Less than 0" ~ "DurationLT0",
TRUE ~ as.character(NA)
)),
Role = case_when(
Role == "EXT_CHIEF_INVESTIGATOR" ~ "ECI",
Role == "STUD_CHIEF_INVESTIGATOR" ~ "SCI",
Role == "CHIEF_INVESTIGATOR" ~ "CI",
Role == "DELEGATED_RESEARCHER" ~ "DR",
Role == "EXTERNAL_ADVISOR" ~ "EA",
Role == "HONVISIT" ~ "HV",
Role == "PRINCIPAL_SUPERVISOR" ~ "PS",
Role == "STUDRES" ~ "SR",
TRUE ~ "UNK"
)
)
glimpse(d_raw_4)
## Rows: 17,594
## Columns: 25
## $ Grant.Application.ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,...
## $ Grant.Status <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, ...
## $ Sponsor.Code <fct> SponsorUnk, Sponsor2B, Sponsor...
## $ Grant.Category.Code <fct> GrantCatUnk, GrantCat10A, Gran...
## $ Contract.Value.Band...see.note.A <fct> ContractValueBandA, ContractVa...
## $ Start.date <date> 2005-11-08, 2005-11-11, 2005-...
## $ RFCD.Code <fct> RFCD280199, RFCD280103, RFCD32...
## $ RFCD.Percentage <dbl> 100, 30, 60, 50, 34, 100, 50, ...
## $ SEO.Code <fct> SEO700299, SEO700103, SEO73010...
## $ SEO.Percentage <dbl> 100, 50, 60, 70, 100, 100, 60,...
## $ Person.ID <chr> "40572", "9067", "5967", "7878...
## $ Role <chr> "CI", "CI", "CI", "PS", "CI", ...
## $ Year.of.Birth <fct> 1965, 1960, 1955, 1955, 1965, ...
## $ Country.of.Birth <fct> AsiaPacific, Australia, Austra...
## $ Home.Language <fct> OtherLang, NA, NA, NA, NA, NA,...
## $ Dept.No. <fct> Dept3073, Dept2538, Dept2923, ...
## $ Faculty.No. <fct> Faculty31, Faculty25, Faculty2...
## $ With.PHD <chr> NA, "PhD", "PhD", "PhD", "PhD"...
## $ No..of.Years.in.Uni.at.Time.of.Grant <fct> DurationLT0, DurationGT15, Dur...
## $ Number.of.Successful.Grant <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, ...
## $ Number.of.Unsuccessful.Grant <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, ...
## $ A. <dbl> 4, 6, 0, 0, 3, 7, 8, 0, 8, 1, ...
## $ A <dbl> 2, 12, 3, 3, 0, 27, 5, 0, 1, 0...
## $ B <dbl> 0, 2, 5, 13, 1, 27, 2, 0, 11, ...
## $ C <dbl> 0, 2, 2, 3, 0, 6, 0, 0, 0, 0, ...
Create role-specific count variables for home language, nationality, degree, etc.
Define a function to perform pivoting.
trim_cols <- function(dat) {
keepers <- unlist(lapply(dat, function(x) length(unique(x)) > 1))
trimmed <- dat[, keepers, drop = FALSE] %>% select_at(vars(-ends_with("NA")))
trimmed
}
pivot_role <- function(dat, col) {
dat$cnt_col <- unlist(dat[, col])
dat_2 <- dat %>% mutate(metric = paste0(Role, ".", cnt_col))
tab_df <- as.data.frame(table(dat_2$Grant.Application.ID, dat_2$metric))
names(tab_df) <- c("Grant.Application.ID", "Val", "Freq")
tab_wide <- tab_df %>% pivot_wider(names_from = Val, values_from = Freq)
tab_trimmed <- trim_cols(tab_wide)
tab_trimmed
}
people <- d_raw_4 %>% count(Grant.Application.ID, name = "numPeople")
investCount <- d_raw_4 %>% count(Grant.Application.ID, Role) %>%
pivot_wider(names_from = Role, values_from = n, values_fill = list(n = 0),
names_prefix = "Num")
investDOB <- pivot_role(d_raw_4, "Year.of.Birth")
investCountry <- pivot_role(d_raw_4, "Country.of.Birth")
investLang <- pivot_role(d_raw_4, "Home.Language")
investPhD <- pivot_role(d_raw_4, "With.PHD")
investGrants_s <- d_raw_4 %>%
group_by(Grant.Application.ID, Role) %>%
summarize(n = sum(Number.of.Successful.Grant, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(
id_cols = Grant.Application.ID, names_from = Role, values_from = n,
names_prefix = "Success.", values_fill = list(n = 0)) %>%
trim_cols()
investGrants_u <- d_raw_4 %>%
group_by(Grant.Application.ID, Role) %>%
summarize(n = sum(Number.of.Unsuccessful.Grant, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(
id_cols = Grant.Application.ID, names_from = Role, values_from = n,
names_prefix = "Unsuccess.", values_fill = list(n = 0)) %>%
trim_cols
investGrants <- inner_join(investGrants_s, investGrants_u,
by = "Grant.Application.ID")
rm(investGrants_s, investGrants_u)
investDept <- pivot_role(d_raw_4, "Dept.No.")
investFaculty <- pivot_role(d_raw_4, "Faculty.No.")
investDuration <- d_raw_4 %>%
count(Grant.Application.ID, No..of.Years.in.Uni.at.Time.of.Grant) %>%
ungroup() %>%
pivot_wider(
id_cols = Grant.Application.ID,
names_from = No..of.Years.in.Uni.at.Time.of.Grant, values_from = n,
values_fill = list(n = 0))
totalPub = d_raw_4 %>%
group_by(Grant.Application.ID) %>%
summarize(
AstarTotal = sum(A., na.rm = TRUE),
ATotal = sum(A, na.rm = TRUE),
BTotal = sum(B, na.rm = TRUE),
CTotal = sum(C, na.rm = TRUE),
allPub = sum(c(A., A, B, C), na.rm = TRUE)
)
n_astar <- d_raw_4 %>%
group_by(Grant.Application.ID, Role) %>%
summarize(x = sum(A., na.rm = TRUE)) %>%
pivot_wider(names_from = Role, values_from = x, values_fill = list(x = 0),
names_prefix = "Astar.") %>%
trim_cols()
n_a <- d_raw_4 %>%
group_by(Grant.Application.ID, Role) %>%
summarize(x = sum(A, na.rm = TRUE)) %>%
pivot_wider(names_from = Role, values_from = x, values_fill = list(x = 0),
names_prefix = "A.") %>%
trim_cols()
n_b <- d_raw_4 %>%
group_by(Grant.Application.ID, Role) %>%
summarize(x = sum(B, na.rm = TRUE)) %>%
pivot_wider(names_from = Role, values_from = x, values_fill = list(x = 0),
names_prefix = "B.") %>%
trim_cols()
n_c <- d_raw_4 %>%
group_by(Grant.Application.ID, Role) %>%
summarize(x = sum(C, na.rm = TRUE)) %>%
pivot_wider(names_from = Role, values_from = x, values_fill = list(x = 0),
names_prefix = "C.") %>%
trim_cols()
investPub <- inner_join(n_astar, n_a, by = "Grant.Application.ID") %>%
inner_join(n_b, by = "Grant.Application.ID") %>%
inner_join(n_c, by = "Grant.Application.ID")
rm(n_astar, n_a, n_b, n_c)
RFCDcount <- d_raw_4 %>%
count(Grant.Application.ID, RFCD.Code) %>%
pivot_wider(names_from = RFCD.Code, values_from = n, values_fill = list(n = 0)) %>%
trim_cols()
SEOcount <- d_raw_4 %>%
count(Grant.Application.ID, SEO.Code) %>%
pivot_wider(names_from = SEO.Code, values_from = n, values_fill = list(n = 0)) %>%
trim_cols()
grantData <- d_raw_2 %>%
mutate(
Month = factor(as.character(month(Start.date, label = TRUE))),
Weekday = factor(as.character(wday(Start.date, label = TRUE))),
Day = yday(Start.date)
) %>%
select(Sponsor.Code, Contract.Value.Band...see.note.A,
Grant.Category.Code, Month:Day)
dummies <- dummyVars(~., data = grantData, levelsOnly = TRUE)
grantData <- as.data.frame(predict(dummies, grantData))
grantData$Grant.Application.ID <- d_raw_2$Grant.Application.ID
grantData$Class <- factor(ifelse(d_raw_2$Grant.Status, "successful", "unsuccessful"))
grantData$is2008 <- year(d_raw_2$Start.date) == 2008
grantData <- trim_cols(grantData)
d_raw_5 <- merge(investCount, investDOB)
d_raw_5 <- merge(d_raw_5, investCountry)
d_raw_5 <- merge(d_raw_5, investLang)
d_raw_5 <- merge(d_raw_5, investPhD)
d_raw_5 <- merge(d_raw_5, investGrants)
d_raw_5 <- merge(d_raw_5, investDept)
d_raw_5 <- merge(d_raw_5, investFaculty)
d_raw_5 <- merge(d_raw_5, investDuration)
d_raw_5 <- merge(d_raw_5, investPub)
d_raw_5 <- merge(d_raw_5, totalPub)
d_raw_5 <- merge(d_raw_5, people)
d_raw_5 <- merge(d_raw_5, RFCDcount)
d_raw_5 <- merge(d_raw_5, SEOcount)
d_raw_5 <- merge(d_raw_5, grantData)
d_raw_5$Grant.Application.ID <- NULL
d_raw_5 <- d_raw_5 %>% select(Class, sort(colnames(d_raw_5)))
rm(investCount, investDOB, investCountry, investLang, investPhD, investGrants,
investDept, investFaculty, investDuration, investPub, totalPub, people,
RFCDcount, SEOcount, grantData, dummies)
If the grant success rate were relatively constant over the years, a reasonable data splitting strategy would be to use all data from 2005 to 2008, reserving some data for a test set, and resampling the remainder for tuning the various models. However, the grant success rate was not constant (see EDA), so random test sample is less relevant. One alternative is to fit models to pre-2008 data, then tune them on 2008 data. Essentially, 2008 would serve as a single test. However, this strategy may lead to over-fitting. The compromise taken here is to build models on the pre-2008 data and tune them on a random sample of 2,075 grants from 2008. Once the optimal parameters are determined, fit the final model using these parameters to the entire training set (pre-2008 plus 2008 grants). A small holdout set of 518 grants from 2008 will be used as the test set to ensure no gross methodology errors occur from repeatedly evaluating the 2008 data during model tuning.
dat_pre2008 <- subset(d_raw_5, !is2008)
pre2008 <- 1:nrow(dat_pre2008)
year2008 <- subset(d_raw_5, is2008)
set.seed(568)
inTrain <- createDataPartition(year2008$Class, p = 3/4)[[1]]
training <- rbind(dat_pre2008, year2008[inTrain, ])
testing <- year2008[-inTrain, ]
training$is2008 <- NULL
testing$is2008 <- NULL
training <- trim_cols(training)
testing <- testing[, names(training)]
rm(inTrain)
Create two character predictor set vectors, one with all the predictors less the highly correlated predictors (fullSet
), and one that additionally removes the sparse predictors (reducedSet
).
training
has 1,892 variables. First, pull out the 63 highly correlated (\(\rho\) >= 0.99) predictors. low_cor
is a list of 1,828 predictors with (\(\rho\) < 0.99).
all_preds <- names(training)[names(training) != "Class"]
all_cor <- cor(training[, all_preds])
high_cor <- findCorrelation(all_cor, .99)
low_cor <- all_preds[-high_cor]
rm(all_preds, all_cor, high_cor)
length(low_cor)
## [1] 1828
From these 1,828 predictors, pull out the 753 near-zero variance predictors. That leaves 1,828 - 753 = 1,075 predictors for fullSet
. From these 1,075 predictors, pull out the 819 predictors with a very high frequency ratio. That leaves 256 predictors for reducedSet
, but also remove the 4 variables that are redundant by construction (allPub
, numPeople
, Mar
, and Sun
), leaveing 252 predictors for reducedSet
.
prob_preds <- nearZeroVar(training[, low_cor], saveMetrics = TRUE,
freqCut = floor(nrow(training)/5))
fullSet <- rownames(subset(prob_preds, !nzv))
reducedSet <- rownames(subset(prob_preds, !nzv & freqRatio < floor(nrow(training)/50)))
### Perfectly collinear predictors (due to construction). March
### and Sunday have the lowest frequencies for mo/day.
reducedSet <- reducedSet[(reducedSet != "allPub") &
(reducedSet != "numPeople") &
(reducedSet != "Mar") &
(reducedSet != "Sun")
]
rm(low_cor, prob_preds)
Are we good? The function findLinearCombos
uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist). Here it is for the full set of predictors.
## [1] "CI.Dept3268" "CI.Faculty10" "DR.Dept2498" "DR.Dept2548" "DR.Dept2653"
## [6] "DR.Faculty19" "ECI.1965" "ECI.1975" "RFCD420215"
And here it is for the reduced set of predictors.
## character(0)
The classification tree uses a different set of predictors with factor encodings of some of the predictors.
factorPredictors <- names(training)[names(training) != "Class"]
factorPredictors <- factorPredictors[!grepl("Sponsor[0-9]", factorPredictors)]
factorPredictors <- factorPredictors[!grepl("SponsorUnk", factorPredictors)]
factorPredictors <- factorPredictors[!grepl("ContractValueBand[A-Z]", factorPredictors)]
factorPredictors <- factorPredictors[!grepl("GrantCat", factorPredictors)]
factorPredictors <- factorPredictors[!(factorPredictors %in% levels(training$Month))]
factorPredictors <- factorPredictors[!(factorPredictors %in% levels(training$Weekday))]
factorForm <- paste("Class ~ ", paste(factorPredictors, collapse = "+"))
factorForm <- as.formula(factorForm)