A cursory, univariate review of the newly encoded data uncovers a few interesting relationships with the response.
What continuous variables are likely to be important predictors of the response variable?
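# Coerce the outcome to numeric so it can enter the correlation matrix
cor_out <- training %>%
  mutate(Class = as.numeric(Class)) %>%
  cor()
# Rank predictors by the absolute value of their correlation with Class
cor_out_2 <- cor_out[, "Class"] %>% abs() %>% sort(decreasing = TRUE)
cor_out_2[1:5]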
##                Class ContractValueBandUnk          GrantCatUnk
##            1.0000000            0.4519267            0.2547636
##           SponsorUnk                  Jan
##            0.2547636            0.2341108
Two continuous predictors, the number of prior successful and unsuccessful grant applications by the chief investigator, were highly associated with grant application success.
Three categorical predictors (Contract Value Band A, Sponsor Unknown, and January) had the highest univariate associations with grant application success. The associations for these three predictors were not strong, but they suggest that grant applications with a large monetary value, an unknown sponsor, or a January submission are associated with greater grant success.
Band | Successful | Unsuccessful | N | Proportion successful | Odds | Odds ratio |
A | 1,506 | 824 | 2,330 | 0.65 | 1.83 | 2.83 |
Other | 2,297 | 3,563 | 5,860 | 0.39 | 0.64 | |
Sponsor | Successful | Unsuccessful | N | Proportion successful | Odds | Odds ratio |
Known | 3,064 | 4,233 | 7,297 | 0.42 | 0.72 | 0.15 |
Unknown | 739 | 154 | 893 | 0.83 | 4.80 | |
Month | Successful | Unsuccessful | N | Proportion successful | Odds | Odds ratio |
January | 478 | 47 | 525 | 0.91 | 10.17 | 13.27 |
Other | 3,325 | 4,340 | 7,665 | 0.43 | 0.77 | |
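The odds and odds-ratio columns follow directly from the success and failure counts. As a minimal sketch, the contract value band figures can be reproduced in base R as follows. The counts are copied from the table; the variable names are illustrative and not taken from the original analysis.

# Counts copied from the contract value band table
succ <- c(A = 1506, Other = 2297)   # successful applications
uns  <- c(A = 824,  Other = 3563)   # unsuccessful applications

pct  <- succ / (succ + uns)         # proportion successful in each group
odds <- succ / uns                  # odds of success in each group

# Odds ratio: odds for Band A relative to all other bands
odds_ratio <- odds[["A"]] / odds[["Other"]]

round(pct, 2)
round(odds, 2)
round(odds_ratio, 2)

The same arithmetic applies to the sponsor and month tables. In each table the odds ratio is the first row's odds divided by the second row's odds, which is why the sponsor table reports a value below 1.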
The percentage of successful grants varied over the years: 45% (2005), 51.7% (2006), 47.2% (2007), and 36.6% (2008). Although 2008 had the lowest percentage, there is not necessarily a downward trend.
d_raw_2 %>%
  mutate(Start.date.year = lubridate::year(Start.date)) %>%
  count(Start.date.year, Grant.Status) %>%
  ungroup() %>%
  group_by(Start.date.year) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(
    x = Start.date.year,
    y = pct,
    fill = factor(Grant.Status, labels = c("Unsuccessful", "Successful")),
    label = scales::percent(pct)
  )) +
  geom_col() +
  geom_text(position = position_stack(vjust = 0.9)) +
  theme_mf() +
  theme(axis.text.y = element_blank()) +
  scale_fill_manual(values = c("#814E4A", "#334F67")) +
  labs(title = "Grant Success", fill = "", x = "", y = "")
The data splitting scheme should take into account that the purpose of the model is to quantify the likelihood of success for new grants, which is why the competition used the most recent data for testing purposes.
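One simple way to respect that goal is to hold out the most recent period as the test set. The sketch below illustrates the idea on a small simulated data frame. The column names and values are hypothetical, and this is not necessarily the split used in the competition or in this analysis.

# Hypothetical toy data: one row per application (not the grant data)
set.seed(1)
toy <- data.frame(
  year    = rep(2005:2008, each = 5),
  success = rbinom(20, size = 1, prob = 0.45)
)

# Hold out the most recent year so the test set resembles "new" grants
latest_year <- max(toy$year)
train_toy <- toy[toy$year <  latest_year, ]
test_toy  <- toy[toy$year == latest_year, ]

nrow(train_toy)
nrow(test_toy)

Splitting on time rather than at random keeps later applications out of the training set, so test performance is a closer proxy for how the model would fare on future grants.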