Recipes interface

Overview

In this document we will demonstrate the basic usage of the recipes interface in tfdatasets.

The recipes interface is a user-friendly interface to feature_columns. It lets us specify column transformations and representations when working with structured data.

We will use the hearts dataset, which can be loaded with data(hearts).

library(tfdatasets)
data(hearts)

head(hearts)

##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1  63   1  1      145  233   1       2     150     0     2.3     3  0
## 2  67   1  4      160  286   0       2     108     1     1.5     2  3
## 3  67   1  4      120  229   0       2     129     1     2.6     2  2
## 4  37   1  3      130  250   0       0     187     0     3.5     3  0
## 5  41   0  2      130  204   0       2     172     0     1.4     1  0
## 6  56   1  2      120  236   0       0     178     0     0.8     1  0
##         thal target
## 1      fixed      0
## 2     normal      1
## 3 reversible      0
## 4     normal      0
## 5     normal      0
## 6     normal      0

We want to train a model to predict the target variable using Keras, but before that we need to prepare the data. We need to transform the categorical variables into some form of dense variable, and we usually want to normalize all numeric columns too.

The recipes interface works with a TensorFlow dataset which can be created from an R data.frame using:

dataset <- hearts %>% 
  tensor_slices_dataset() %>% 
  dataset_shuffle(nrow(hearts)) %>% 
  dataset_batch(16)

Now let’s start creating our Recipe.

rec <- recipe(target ~ ., dataset)

The first thing to do after creating the recipe is to define whether each variable should be treated as categorical or numeric.

You can do this by adding steps to the recipe.

rec <- rec %>% 
  step_numeric_column(age, trestbps, chol, thalach, oldpeak, slope, ca) %>% 
  step_categorical_column_with_vocabulary_list(thal)

The following steps can be used to define the variable type:

step_numeric_column to define numeric variables
step_categorical_column_with_vocabulary_list for categorical variables with a fixed vocabulary
step_categorical_column_with_hash_bucket (TODO)
step_categorical_column_with_identity (TODO)
step_categorical_column_with_vocabulary_file (TODO)

When using step_categorical_column_with_vocabulary_list you can also provide a vocabulary argument with the fixed vocabulary. If you don't, the recipe will find all the unique values in the dataset and use them as the vocabulary.
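
As a rough illustration of what computing the vocabulary amounts to, the base R sketch below collects the distinct values of a character column. The thal_values vector is made up for illustration, and the recipe's own implementation may differ (for example, in how it orders the values).

```r
# Hypothetical sample of a categorical column (not the real hearts data).
thal_values <- c("fixed", "normal", "reversible", "normal", "normal")

# The vocabulary is the set of distinct values seen in the data.
vocabulary <- sort(unique(thal_values))
print(vocabulary)
```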

You can also pass a normalizer_fn to step_numeric_column; in this case the variable will be transformed by the feature column. Note that the transformation occurs in the TensorFlow graph, so it must use only TensorFlow ops. We plan to implement pre-made normalizer functions that obtain the normalizing constants from the dataset itself.
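
A minimal sketch of such a normalizer, assuming the normalizing constants are computed in R ahead of time. The age_sample vector and the scale_age name are illustrative. The function uses only arithmetic operators, which the tensorflow R package also defines for tensors, so it is the kind of function that could be passed as normalizer_fn.

```r
# Illustrative sample of a numeric column (not the full hearts data).
age_sample <- c(63, 67, 67, 37, 41, 56)

# Normalizing constants computed once, in R.
age_mean <- mean(age_sample)
age_sd <- sd(age_sample)

# Uses only `-` and `/`, so it does not rely on R-only functions.
scale_age <- function(x) (x - age_mean) / age_sd

# Applied to plain R numbers here just to check the arithmetic.
scaled <- scale_age(age_sample)
print(round(scaled, 2))

# Assumed usage inside a recipe (not run here):
# rec %>% step_numeric_column(age, normalizer_fn = scale_age)
```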

Also, in the above example we typed the names of all columns, but you can also use selectors like:

starts_with(), ends_with(), matches() etc. (from tidyselect)
all_numeric() to select all numeric variables (TODO)
all_nominal() to select all string variables (TODO)
all_predictors() to select all predictor variables (TODO)
all_outcomes() to select all outcomes (TODO)

After specifying the types of the columns you can add transformation steps. For example you may want to bucketize a numeric column:

rec <- rec %>% 
  step_bucketized_column(age, boundaries = c(18, 25, 30, 35, 40, 45, 50, 55, 60, 65))
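
To see what bucketizing does to the values, the base R sketch below mimics it with findInterval(), assuming buckets that include their left boundary and exclude their right one. It only illustrates the idea and is not what the step runs internally; ages is a made-up sample.

```r
boundaries <- c(18, 25, 30, 35, 40, 45, 50, 55, 60, 65)
ages <- c(17, 18, 37, 63, 67)

# findInterval() counts how many boundaries are <= each value, so
# 10 boundaries give 11 buckets, numbered 0 to 10.
bucket <- findInterval(ages, boundaries)
print(bucket)
```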

You can also specify the kind of numeric representation that you want to use for your categorical variables.

rec <- rec %>% 
  step_indicator_column(thal) %>% 
  step_embedding_column(thal, dimension = 2)
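
Conceptually, an indicator column one-hot encodes each category, while an embedding column looks up a row of a trainable matrix. A base R sketch of both ideas follows, with a made-up vocabulary and randomly initialized embedding weights standing in for values that would be learned during training.

```r
vocabulary <- c("fixed", "normal", "reversible")
thal_values <- c("normal", "fixed", "reversible", "normal")

# Position of each value in the vocabulary.
index <- match(thal_values, vocabulary)

# Indicator (one-hot): one column per vocabulary entry.
indicator <- diag(length(vocabulary))[index, , drop = FALSE]
colnames(indicator) <- vocabulary
print(indicator)

# Embedding: each category maps to a row of a (vocabulary x dimension) matrix.
set.seed(1)
embedding_weights <- matrix(rnorm(length(vocabulary) * 2), ncol = 2)
embedded <- embedding_weights[index, , drop = FALSE]
print(dim(embedded))
```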

Another common transformation is to add interactions between variables using crossed columns.

rec <- rec %>% 
  step_crossed_column(thal_and_age = c(thal, bucketized_age), hash_bucket_size = 1000) %>% 
  step_indicator_column(thal_and_age)

Note that the crossed_column is a categorical column, so we also need to specify what kind of numeric transformation we want to use. Also note that we can name the transformed variables: each step uses a default naming for columns, e.g. bucketized_age is the default name when you use step_bucketized_column with a column called age.
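
The idea behind a crossed column can be sketched in base R: combine the two feature values into a single key, then hash the key into one of hash_bucket_size buckets. The toy_hash function below is a deliberately simple stand-in written for this illustration; TensorFlow uses a different hash function, so the bucket ids will not match.

```r
hash_bucket_size <- 1000

# Toy hash for illustration only; not the hash TensorFlow uses.
toy_hash <- function(key) sum(utf8ToInt(key)) %% hash_bucket_size

# Made-up values for the two features being crossed.
thal <- c("normal", "fixed")
bucketized_age <- c(4L, 9L)

# Cross the two features into a single key, then hash it into a bucket.
crossed_key <- paste(thal, bucketized_age, sep = "_x_")
crossed_id <- vapply(crossed_key, toy_hash, numeric(1), USE.NAMES = FALSE)
print(crossed_id)
```

Because different keys can land in the same bucket, a larger hash_bucket_size reduces collisions at the cost of a wider indicator representation.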

With the above code we have created our recipe. Note we can also define the recipe by chaining a sequence of methods:

rec <- recipe(target ~ ., dataset) %>% 
  step_numeric_column(age, trestbps, chol, thalach, oldpeak, slope, ca) %>% 
  step_categorical_column_with_vocabulary_list(thal) %>% 
  step_bucketized_column(age, boundaries = c(18, 25, 30, 35, 40, 45, 50, 55, 60, 65)) %>% 
  step_indicator_column(thal) %>% 
  step_embedding_column(thal, dimension = 2)# %>% 
  # step_crossed_column(c(thal, bucketized_age), hash_bucket_size = 1000) %>%
  # step_indicator_column(crossed_thal_bucketized_age)

After defining the recipe we need to prep it. It is during preparation that we compute the vocabulary list for categorical variables or find the mean and standard deviation for the normalizing functions. Preparation involves evaluating the full dataset, so if you have provided the vocabulary list and your columns are already normalized you can skip the preparation step (TODO).

In our case, we will prepare the recipe, since we didn’t specify the vocabulary list for the categorical variables.

rec_prep <- prep(rec)

After preparing we can see the list of dense features that were defined:

str(rec_prep$dense_features())

## List of 10
##  $ indicator_thal:IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='thal', vocabulary_list=('1', '2', 'fixed', 'normal', 'reversible'), dtype=tf.string, default_value=-1, num_oov_buckets=0))
##  $ embedding_thal:EmbeddingColumn(categorical_column=VocabularyListCategoricalColumn(key='thal', vocabulary_list=('1', '2', 'fixed', 'normal', 'reversible'), dtype=tf.string, default_value=-1, num_oov_buckets=0), dimension=2.0, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True)
##  $ age           :NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
##  $ trestbps      :NumericColumn(key='trestbps', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
##  $ chol          :NumericColumn(key='chol', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
##  $ thalach       :NumericColumn(key='thalach', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
##  $ oldpeak       :NumericColumn(key='oldpeak', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
##  $ slope         :NumericColumn(key='slope', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
##  $ ca            :NumericColumn(key='ca', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
##  $ bucketized_age:BucketizedColumn(source_column=NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(18.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0))

Now we are ready to define our model in Keras. We will use a specialized layer_dense_features that knows what to do with the feature columns specification.

library(keras)

dense_features <- layer_dense_features(feature_columns = rec_prep$dense_features())

model <- keras_model_sequential(layers = list(
  dense_features,
  layer_dense(units = 32, activation = "relu"),
  layer_dense(units = 1, activation = "sigmoid")
))

model %>% compile(loss = loss_binary_crossentropy, 
                  optimizer = "adam",
                  metrics = metric_binary_accuracy)

We can finally train the model on the dataset:

# TODO we should provide a transformed dataset with all columns necessary
# for training and in the same format as keras.

dataset_2 <- dataset %>% 
  dataset_map(function(x) reticulate::tuple(x, x$target))

model %>% 
  fit(dataset_2, epochs = 5, verbose = 2)