A cursory, univariate review of the newly encoded data uncovers a few interesting relationships with the response.

Univariate Analysis

What continuous variables are likely to be important predictors of the response variable?

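# Correlate every column with every other; Class is coerced to numeric first
cor_out <- training %>% 
  mutate(Class = as.numeric(Class)) %>%
  cor()
# Rank columns by absolute correlation with the response
cor_out_2 <- cor_out[, "Class"] %>% abs() %>% sort(decreasing = TRUE)
cor_out_2[1:5]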

##                Class ContractValueBandUnk          GrantCatUnk 
##            1.0000000            0.4519267            0.2547636 
##           SponsorUnk                  Jan 
##            0.2547636            0.2341108

Two continuous predictors, the number of prior successful and unsuccessful grant applications by the chief investigator, were highly associated with grant application success.

Three categorical predictors (Contract Value Band A, Sponsor Unknown, and January) had the highest univariate associations with grant application success. The associations for these three predictors were not strong, but they suggest that grant applications with a large monetary value, an unknown sponsor, or a January submission are more likely to succeed.

Three categorical predictors with highest association with application status

Contract Value Band
Band	Successful	Unsuccessful	N	Proportion	Odds	Odds Ratio
A	1,506	824	2,330	0.65	1.83	2.83
Other	2,297	3,563	5,860	0.39	0.64

Sponsor
Sponsor	Successful	Unsuccessful	N	Proportion	Odds	Odds Ratio
Known	3,064	4,233	7,297	0.42	0.72	0.15
Unknown	739	154	893	0.83	4.80

Month
Month	Successful	Unsuccessful	N	Proportion	Odds	Odds Ratio
Jan	478	47	525	0.91	10.17	13.27
Other	3,325	4,340	7,665	0.43	0.77
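
The odds and odds-ratio columns in these tables follow directly from the success and failure counts. As a minimal sketch (not necessarily how the tables were originally produced), the Contract Value Band figures can be recomputed in base R; the variable names here are illustrative only.

```r
# Counts copied from the Contract Value Band table.
succ <- c(A = 1506, Other = 2297)   # successful applications
uns  <- c(A = 824,  Other = 3563)   # unsuccessful applications

n    <- succ + uns    # applications per group
pct  <- succ / n      # proportion successful
odds <- succ / uns    # odds of success within each group

# Odds ratio: odds for Band A relative to all other bands.
odds_ratio <- unname(odds["A"] / odds["Other"])

round(pct, 2)
round(odds, 2)
round(odds_ratio, 2)
```

An odds ratio above 1 means the first group has higher odds of success than the comparison group; the Sponsor table reports a ratio below 1 because the Known group is listed first.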

The percentage of successful grants varied over the years: 45% (2005), 51.7% (2006), 47.2% (2007), and 36.6% (2008). Although 2008 had the lowest percentage, there is not necessarily a downward trend.

d_raw_2 %>%
  mutate(Start.date.year = lubridate::year(Start.date)) %>%
  count(Start.date.year, Grant.Status) %>%
  ungroup() %>%
  group_by(Start.date.year) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = Start.date.year, 
             y = pct, 
             fill = factor(Grant.Status, labels = c("Unsuccessful", "Successful")),
             label = scales::percent(pct)
             )) +
  geom_col() +
  geom_text(position = position_stack(vjust = 0.9)) +
  theme_mf() +
  theme(axis.text.y = element_blank()) +
  scale_fill_manual(values = c("#814E4A", "#334F67")) +
  labs(title = "Grant Success", fill = "", x = "", y = "")

The data splitting scheme should take into account that the purpose of the model is to quantify the likelihood of success for new grants, which is why the competition used the most recent data for testing purposes.
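
One way to respect this is to split on the application date rather than at random. The sketch below uses a small toy data frame; only the column names `Start.date` and `Grant.Status` come from the data above, while the `grants` object, its values, and the 2008 cutoff are assumptions chosen for illustration.

```r
# Toy data standing in for the grant applications (values are made up).
grants <- data.frame(
  Start.date   = as.Date(c("2005-03-01", "2006-07-15", "2007-11-30",
                           "2008-02-10", "2008-06-05", "2008-09-20")),
  Grant.Status = c(1, 0, 1, 0, 0, 1)
)

# Hypothetical cutoff: everything from 2008 onward is held out for testing.
cutoff <- as.Date("2008-01-01")

train_set <- grants[grants$Start.date <  cutoff, ]
test_set  <- grants[grants$Start.date >= cutoff, ]

# Sanity check: no training record is newer than any test record.
stopifnot(max(train_set$Start.date) < min(test_set$Start.date))

nrow(train_set)
nrow(test_set)
```

Holding out the most recent records mimics how the model would be used in practice, at the cost of training on data that may differ from the newest applications (for example, the lower 2008 success rate).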

Kaggle - Grant Prediction

Step 2: Exploratory Data Analysis

Michael Foley

4/12/2020
