This is a minimal demonstration of fitting some machine learning models with the new tidymodels package.
options(digits = 3)
library(pacman)
p_load(tidyverse, magrittr, tidymodels, GGally)
theme_set(theme_bw())
A really simple case of classification: using the built-in iris data, we predict whether a flower belongs to a particular species from its four measurements.
#plot data to see what's up
#https://stackoverflow.com/a/12047554/3980197
ggpairs(iris, aes(colour = Species, alpha = 0.4))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#let's make binary (dummy) versions of the species
#model.matrix() gives one indicator column per species; strip the "Species" prefix and convert to factors
iris = cbind(
  iris,
  model.matrix(~ Species - 1, data = iris) %>%
    set_colnames(colnames(.) %>% str_replace("Species", "")) %>%
    as_tibble() %>%
    map_df(as.factor)
) %>% as_tibble()
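As an aside, for the single contrast we actually use below, a plain mutate() would do much the same job. A minimal alternative sketch for comparison only (the column name virginica2 is just illustrative; nothing below depends on it):
#equivalent single-column version, not assigned to anything
iris %>%
  mutate(virginica2 = factor(Species == "virginica"))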
#print data
iris
#outcome split
iris$Species %>% table()
## .
##     setosa versicolor  virginica
##         50         50         50
#make a recipe
iris_recipe <-
  recipe(virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
#make a model
iris_model <-
  logistic_reg() %>%
  set_engine("glm")
#resampling method
set.seed(1)
iris_folds <- vfold_cv(iris, v = 10)
#make workflow
iris_wf <-
  workflow() %>%
  add_model(iris_model) %>%
  add_recipe(iris_recipe)
#fit
iris_fit <-
  iris_wf %>%
  fit_resamples(iris_folds)
## ! Fold01: model: glm.fit: fitted probabilities numerically 0 or 1 occurred
## ! Fold02: model: glm.fit: fitted probabilities numerically 0 or 1 occurred
## ! Fold03: model: glm.fit: fitted probabilities numerically 0 or 1 occurred
## ! Fold04: model: glm.fit: fitted probabilities numerically 0 or 1 occurred
## ! Fold05: model: glm.fit: algorithm did not converge, glm.fit: fitted probabilitie...
## ! Fold06: model: glm.fit: fitted probabilities numerically 0 or 1 occurred
## ! Fold07: model: glm.fit: algorithm did not converge, glm.fit: fitted probabilitie...
## ! Fold08: model: glm.fit: fitted probabilities numerically 0 or 1 occurred
## ! Fold09: model: glm.fit: fitted probabilities numerically 0 or 1 occurred
## ! Fold10: model: glm.fit: fitted probabilities numerically 0 or 1 occurred
These warnings reflect (near-)complete separation: in some folds the model can predict the classes essentially perfectly, so the fitted probabilities hit 0 or 1. That is unsurprising given how cleanly the species separate in the pairs plot above.
#metrics
collect_metrics(iris_fit)
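If we want the per-fold numbers rather than the averaged ones, collect_metrics() can return them unsummarized; a small aside, assuming a tune version that supports the summarize argument:
#per-fold metrics instead of the across-fold averages
collect_metrics(iris_fit, summarize = FALSE)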
As above, but now we want to predict each species level, not a single contrast. It was a bit difficult to figure out how to do this: one has to search the list of available parsnip models to find the right function, which turns out to be multinom_reg().
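As a small aside, parsnip can also be queried directly for the engines a given model type supports. A minimal sketch, assuming a parsnip version recent enough to have show_engines():
#list the engines available for multinomial regression
show_engines("multinom_reg")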
#make a recipe
#same as above except that we swap the outcome to Species
iris2_recipe <-
  recipe(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
#make a model
#swap to multinom_reg() + nnet
iris2_model <-
  multinom_reg() %>%
  set_engine("nnet")
#resampling method
#actually the same as above
set.seed(1)
iris2_folds <- vfold_cv(iris, v = 10)
#make workflow
iris2_wf <-
  workflow() %>%
  add_model(iris2_model) %>%
  add_recipe(iris2_recipe)
#fit
iris2_fit <-
  iris2_wf %>%
  fit_resamples(iris2_folds)
#metrics
collect_metrics(iris2_fit)
So, we find that this performs slightly worse when we have to predict each species rather than just whether the species is virginica. This is expected. Looking at the plot, the red species (setosa) is very easy to separate even by eye, so any reasonable model should get it right essentially 100% of the time. The other two species are harder to tell apart, but we still do well here, with about 97% accuracy against a chance level of ~33% (the dataset is split evenly across the three species).
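To see which species actually get confused, one can fit the multinomial workflow once on the full data and cross-tabulate predictions against the truth. This is in-sample and therefore optimistic, but it is enough to show that setosa is trivially separable; a rough sketch (the object name iris2_full_fit is just illustrative):
#fit once on all the data and inspect the (in-sample) confusion matrix
iris2_full_fit <- fit(iris2_wf, data = iris)
predict(iris2_full_fit, new_data = iris) %>%
  bind_cols(iris %>% select(Species)) %>%
  conf_mat(truth = Species, estimate = .pred_class)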
Let’s try something a little more complicated: redoing an analysis I previously did with the older meta-modeling package, caret. In this situation, we have data from about 4.5k US veterans on the MMPI items, some 550 items asking about various aspects of psychopathology. We also have general intelligence as measured by the common factor of ~18 diverse cognitive tests.
#read data from file
#you can get it here: https://osf.io/dbn4k/
ves_mmpi = read_rds("data/d_nomiss2.rds") %>% as_tibble()
#print data
ves_mmpi
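A quick sanity check that the dimensions roughly match the description above (about 4.5k respondents and several hundred columns), without assuming anything about the column names:
#rows and columns should match the description above
dim(ves_mmpi)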