SNAP: a brief introduction

Giancarlo Vercellino

04-July-2021

"Like the crack of the whip I snap attack

Front to back in this thing called rap

Dig it like a shovel rhyme devil

On a heavenly level

Bang the bass turn up the treble

Radical mind day and night all the time

Seven to fourteen wise divine

Maniac brainiac winning the game

I’m the lyrical Jesse James"

(Snap, The Power)

What is SNAP

SNAP is a simple tool for designing neural networks on top of a ‘Tensorflow’/‘Keras’ backend [1]. It is a wrapper that makes it easy to configure deep neural networks for regression, classification and multi-label tasks, with some tweaks and tricks (skip shortcuts [2], embedding [3], feature selection [4] and anomaly detection [5]).

The art of crunching numbers

The first application example is a regression case. We use the Peak function from the mlbench package, as you can see below.

Benchmark dataset for regression task from mlbench.peak (n = 300, d = 20)
x.1 x.2 x.3 x.4 x.5 x.6 x.7 x.8 x.9 x.10 x.11 x.12 x.13 x.14 x.15 x.16 x.17 x.18 x.19 x.20 y
0.25 0.09 -1.71 -0.06 0.43 0.30 1.08 0.05 -0.05 -0.03 -0.58 0.74 -1.16 0.33 0.31 0.26 -0.17 0.42 0.82 -0.99 0.31
-0.35 0.38 0.12 -0.52 0.37 0.66 0.30 -0.48 -0.30 0.85 -0.32 -0.89 0.27 -0.07 -0.09 0.25 -1.29 0.77 0.02 0.67 1.24
0.00 -0.07 0.00 0.04 0.08 -0.02 -0.08 0.16 0.06 0.01 0.02 0.02 0.08 0.04 -0.05 -0.04 0.01 -0.07 0.20 -0.13 23.54
0.13 0.21 0.03 -0.05 0.30 0.23 -0.30 -0.18 -0.40 -0.12 0.44 0.08 -0.67 -0.88 0.29 0.16 -0.54 0.12 -0.16 -0.32 7.23
0.09 0.02 -0.55 -0.28 0.16 -0.29 0.93 -0.43 0.24 -0.10 0.69 0.40 -0.89 -0.04 0.46 0.41 0.16 0.30 -0.76 -0.73 2.44
0.16 -0.20 -0.14 0.06 0.08 0.12 -0.01 -0.13 -0.01 -0.07 -0.10 -0.01 -0.04 0.05 -0.06 0.33 -0.03 -0.10 -0.04 -0.24 21.21
0.63 -0.33 0.48 -0.40 -0.02 0.03 -1.45 -0.30 -0.26 -0.37 -0.19 -0.05 0.36 0.18 -0.05 -0.87 -0.48 -0.47 -0.21 0.80 1.69
-0.46 -0.26 0.04 0.09 0.07 -0.23 0.08 0.11 0.48 -0.11 -0.27 -0.03 -0.14 0.31 0.68 0.05 -0.02 -0.22 -0.08 0.00 13.00
-0.01 -0.17 0.10 -0.25 -0.04 -0.06 0.26 0.08 -0.18 -0.12 -0.19 0.20 0.31 0.11 -0.02 0.04 0.20 0.09 -0.01 0.01 19.81
0.01 0.06 0.01 0.05 0.05 -0.06 0.02 -0.09 -0.03 -0.08 -0.03 -0.07 -0.01 -0.01 0.01 -0.01 -0.03 -0.03 -0.06 -0.04 24.46
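
As a rough sketch, a dataset like the one above could be generated as follows (the seed is an assumption; the original one is not given):

library(mlbench)
set.seed(2021)                                   # hypothetical seed
peak <- mlbench.peak(n = 300, d = 20)            # 300 instances, 20 numeric features
benchmark1 <- data.frame(x = peak$x, y = peak$y) # columns x.1 ... x.20 plus the target y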

With SNAP you can easily design various network configurations, such as the one below.

Example n. 1

The first step is the filtering of features and anomalies. The filtering variables select relevant features and instances, excluding noisy fields and misleading outliers. The imp_thresh variable sets the minimum feature importance required for inclusion (setting it to 0.4 means that only features scoring above 0.4 on the empirical distribution of importance scores are included in the working set; the default is 0, which includes all features). The anom_thresh variable sets the maximum anomaly score allowed for inclusion, as measured on the empirical distribution of dbscan's local outlier factor (setting it to 0.8 means that only instances scoring below 0.8 are included in the working set; the default is 1, which includes all instances).
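
To make the thresholds concrete, here is a rough sketch of how this kind of filtering could be implemented with the packages cited in the footnotes (CORElearn for importance, dbscan for the local outlier factor); the estimator, the minPts value and the exact mapping are assumptions, not snap's internal code.

library(CORElearn)
library(dbscan)

feat_names <- setdiff(names(benchmark1), "y")
imp <- attrEval(y ~ ., data = benchmark1, estimator = "RReliefFbestK")  # regression-flavoured Relief, see footnote 4
keep_feat <- feat_names[ecdf(imp)(imp) > 0.4]                           # imp_thresh = 0.4 on the empirical distribution

lofs <- lof(as.matrix(benchmark1[keep_feat]), minPts = 5)               # minPts is a hypothetical choice
keep_inst <- which(ecdf(lofs)(lofs) < 0.8)                              # anom_thresh = 0.8 on the empirical distribution

working_set <- benchmark1[keep_inst, c(keep_feat, "y")]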

The second step is a neural network configured with the parameters below (not necessarily the best configuration, just an example). With SNAP you can design different kinds of neural networks with a single line of code.

library(snap)
example1 <- snap(data = benchmark1, target = "y", task = "regr",
                 layers = 2, nodes = c(64, 32), activations = c("tanh", "sigmoid"),
                 regularization_L1 = c(0, 0), regularization_L2 = c(0, 0),
                 dropout = c(0.7, 0.4), optimizer = "rmsprop",
                 imp_thresh = 0.4, anom_thresh = 0.8,
                 reps = 5, folds = 2, normalization = FALSE)
  snap: 9.25 sec elapsed

The cross-validation scheme is two-fold, repeated five times, as specified by the folds and reps variables.
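
Just to make the resampling grid concrete, here is a rough sketch of how a two-fold scheme repeated five times could be laid out (the seed and the shuffling are assumptions, not snap's internal code); each of the 2 x 5 = 10 rounds corresponds to one row of the tables below.

set.seed(1)                                                  # hypothetical seed
n <- length(example1$selected_inst)                          # instances kept after anomaly filtering
fold_ids <- replicate(5, sample(rep(1:2, length.out = n)))   # one shuffled two-fold assignment per repetition
table(fold_ids[, 1])                                         # fold sizes for the first repetition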

example1$trials
  $trial_train_metrics
      folds  reps    rmse    mae   mdae   mape   rrse    rae    prsn
  1  fold_1 rep_1 11.1223 8.6081 5.1666 1.7075 1.2613 1.0785 -0.0549
  2  fold_2 rep_1  7.7973 6.2620 5.2499 3.2403 1.0469 0.9863 -0.0492
  3  fold_1 rep_2  8.9439 7.6183 7.5325 3.8540 1.0547 1.0239 -0.0770
  4  fold_2 rep_2  9.1078 7.5260 7.4695 4.4157 1.1240 1.0513 -0.0274
  5  fold_1 rep_3  9.6969 8.3560 8.5024 4.9401 1.1981 1.1496 -0.1269
  6  fold_2 rep_3  9.5654 8.0902 8.8777 5.2366 1.1357 1.1104  0.0097
  7  fold_1 rep_4  9.8447 8.2428 7.9454 3.8420 1.1497 1.0723 -0.0051
  8  fold_2 rep_4  9.4805 8.2230 8.9688 6.3085 1.2071 1.2210 -0.0756
  9  fold_1 rep_5  9.5626 7.9044 8.0634 4.8664 1.1744 1.1278 -0.0233
  10 fold_2 rep_5  9.8658 8.5781 8.7204 5.3642 1.1745 1.1378 -0.0818
  
  $trial_val_metrics
      folds  reps   rmse    mae   mdae   mape   rrse    rae    prsn
  1  fold_1 rep_1 8.6489 6.4152 3.4774 1.9627 1.1613 1.0104 -0.0483
  2  fold_2 rep_1 9.7713 8.1149 6.2668 2.8969 1.1081 1.0167 -0.0150
  3  fold_1 rep_2 8.7323 7.3657 7.6805 4.3962 1.0777 1.0289 -0.0348
  4  fold_2 rep_2 9.1766 7.8225 8.1604 4.0927 1.0821 1.0513 -0.0555
  5  fold_1 rep_3 9.5274 8.0984 8.7649 5.3236 1.1312 1.1115  0.0102
  6  fold_2 rep_3 9.7352 8.3737 8.3359 4.8916 1.2028 1.1521 -0.1338
  7  fold_1 rep_4 9.4571 8.2019 8.9644 6.2933 1.2041 1.2179 -0.0736
  8  fold_2 rep_4 9.8482 8.2473 7.9405 3.8550 1.1501 1.0729 -0.0050
  9  fold_1 rep_5 9.8222 8.5498 8.7522 5.3491 1.1693 1.1340 -0.0765
  10 fold_2 rep_5 9.5909 7.9484 8.2310 4.9010 1.1779 1.1341 -0.0232

The selected features and instances are reported within the list of results:

example1$selected_feat
   [1] "x.1"  "x.2"  "x.4"  "x.5"  "x.6"  "x.7"  "x.9"  "x.11" "x.13" "x.14"
  [11] "x.18" "x.19" "x.20"
example1$selected_inst
    [1]   1   2   3   4   5   6   7  10  11  12  13  14  15  16  17  19  20  21
   [19]  22  23  25  26  27  28  29  31  32  34  35  36  37  38  39  40  41  43
   [37]  44  45  46  47  48  49  50  51  52  53  54  55  56  57  59  60  62  63
   [55]  64  65  66  70  73  75  76  77  78  79  80  83  84  85  86  87  89  91
   [73]  93  94  95  96  97  98  99 100 101 102 105 106 107 109 110 111 112 113
   [91] 115 116 117 118 119 120 121 122 123 124 125 126 128 129 130 131 132 133
  [109] 134 135 136 137 139 140 142 143 144 145 146 148 149 152 153 154 156 158
  [127] 160 161 162 163 164 165 166 168 169 170 171 172 173 174 175 176 177 178
  [145] 181 182 183 184 185 186 189 190 192 195 196 197 198 199 200 202 203 204
  [163] 205 206 207 209 210 211 212 213 215 216 217 218 223 224 225 226 227 228
  [181] 229 230 231 232 233 234 236 237 239 240 241 242 243 244 246 247 249 250
  [199] 251 252 253 254 256 258 260 261 263 264 265 266 267 268 269 270 271 272
  [217] 273 277 278 279 280 281 283 284 285 286 287 288 289 290 291 292 293 294
  [235] 295 296 297 298 299 300

SNAP creates a prediction function that can be applied directly to data with the same scheme to predict new values.

example1$pred_fun(benchmark1[1:10,-21])
     predicted_y
  1    14.108350
  2    14.184744
  3    12.046779
  4     7.926267
  5    11.787643
  6    10.038385
  7     1.609311
  8    11.890808
  9    13.951631
  10   12.714508

Everything in the right box

SNAP can also be used to manage classification tasks with ease. To demonstrate, we use the Waveform function from the mlbench package, as you can see below.

Example of benchmark dataset for classification task from mlbench.waveform(n = 300)
x.1 x.2 x.3 x.4 x.5 x.6 x.7 x.8 x.9 x.10 x.11 x.12 x.13 x.14 x.15 x.16 x.17 x.18 x.19 x.20 x.21 classes
-2.04 0.07 -0.91 0.16 0.03 1.27 0.27 1.74 -0.83 2.46 2.03 2.41 3.94 6.14 5.47 4.33 3.00 2.18 2.48 1.53 0.20 3
1.49 2.35 2.15 1.92 3.26 4.09 5.36 5.85 3.04 3.71 2.44 3.64 0.77 0.48 -1.54 0.37 0.32 -0.58 -0.64 1.99 1.33 2
-1.06 1.76 3.28 2.35 4.77 4.56 5.05 3.52 3.58 3.45 3.57 1.97 0.49 1.25 -0.32 0.28 0.10 -0.02 -0.86 1.55 -0.37 2
0.94 -1.35 -0.11 1.26 1.02 1.75 3.57 3.74 3.87 4.50 6.60 2.95 4.85 4.46 2.06 4.02 -0.56 1.06 1.96 -0.84 -1.71 3
0.53 0.14 0.21 -0.68 0.95 -0.60 1.18 1.63 3.15 3.46 6.44 2.65 3.80 3.91 3.90 3.30 2.73 0.70 1.44 1.15 0.06 3
-0.90 -0.69 -0.11 3.41 2.47 6.51 7.71 4.00 1.98 4.83 3.08 2.48 0.55 0.53 0.58 0.12 2.06 0.87 -0.35 -0.15 0.79 2
-0.25 -1.64 0.72 -0.21 2.13 0.39 5.00 2.90 3.14 3.77 4.79 3.64 2.27 2.96 -0.10 0.68 -0.10 0.23 -0.20 0.27 -0.44 2
1.18 0.32 -0.71 0.15 1.01 2.34 -0.29 -0.56 0.48 2.84 4.35 2.72 4.89 6.25 5.93 4.38 3.00 1.64 2.00 -0.20 -0.01 3
1.06 -0.57 0.84 0.24 -2.06 -0.73 1.99 2.28 1.20 2.64 3.14 3.02 3.92 3.46 4.03 5.70 3.00 0.54 0.38 1.19 -1.90 3
0.00 2.00 3.79 4.25 3.22 4.42 3.94 5.68 5.25 3.56 0.73 1.44 0.75 0.89 1.40 0.46 -0.16 1.41 0.03 0.54 -0.23 2
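
As before, a similar dataset could be generated with a short snippet (the seed is again an assumption):

library(mlbench)
set.seed(2021)                                                # hypothetical seed
wave <- mlbench.waveform(n = 300)                             # 300 instances, 21 features, 3 classes
benchmark2 <- data.frame(x = wave$x, classes = wave$classes)  # columns x.1 ... x.21 plus the target classes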

Besides setting task to "classif", this time we also try a residual network. To mitigate the vanishing gradient problem in a deep neural network with many layers, you can set skip_shortcut: setting the variable to TRUE automatically adds a summation link between the input layer and the second-to-last layer before the output.

Here is a brief depiction:

Example n. 2
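
For intuition, here is a minimal sketch of that kind of summation link written directly in Keras; the shapes mirror the model summary further below (8 retained features, 3 classes), but this is illustrative code, not snap's internals.

library(keras)
inputs  <- layer_input(shape = 8)                                    # 8 features survive the filtering step
hidden  <- inputs %>% layer_dense(units = 64) %>% layer_activation("relu")
resized <- hidden %>% layer_dense(units = 8)                         # project back to the input width
skip    <- layer_add(list(resized, inputs))                          # summation link from the input layer
outputs <- skip %>% layer_dense(units = 3, activation = "softmax")   # three waveform classes
model   <- keras_model(inputs, outputs)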

As for the previous example, this is just one possible hyper-parameter configuration (and absolutely not the best one). Translated into code:

example2 <- snap(data = benchmark2, target = "classes", task = "classif",
                 layers = 4, activations = c("mish", "parametric_relu", "bent", "softplus"),
                 nodes = c(1024, 96, 1024, 512),
                 regularization_L1 = c(30, 50, 10, 40), regularization_L2 = c(80, 30, 60, 80),
                 dropout = c(0.5, 0.2, 0.2, 0.2),
                 imp_thresh = 0.66, anom_thresh = 0.95,
                 optimizer = "adam", skip_shortcut = TRUE)
  snap: 11.25 sec elapsed

You can visualize the configuration in Keras style. As you can see, this time the output of each layer is normalized (we have not changed the default value of normalization).

example2$configuration
    layers                           activations regularization_L1
  1      4 mish, parametric_relu, bent, softplus    30, 50, 10, 40
    regularization_L2               nodes            dropout
  1    80, 30, 60, 80 1024, 96, 1024, 512 0.5, 0.2, 0.2, 0.2
example2$model
  Model
  Model: "functional_3"
  ________________________________________________________________________________
  Layer (type)              Output Shape      Param #  Connected to               
  ================================================================================
  input_2 (InputLayer)      [(None, 8)]       0                                   
  ________________________________________________________________________________
  dense_3 (Dense)           (None, 1024)      9216     input_2[0][0]              
  ________________________________________________________________________________
  activation_2 (Activation) (None, 1024)      0        dense_3[0][0]              
  ________________________________________________________________________________
  dropout_2 (Dropout)       (None, 1024)      0        activation_2[0][0]         
  ________________________________________________________________________________
  batch_normalization (Batc (None, 1024)      4096     dropout_2[0][0]            
  ________________________________________________________________________________
  dense_4 (Dense)           (None, 96)        98400    batch_normalization[0][0]  
  ________________________________________________________________________________
  p_re_lu (PReLU)           (None, 96)        96       dense_4[0][0]              
  ________________________________________________________________________________
  dropout_3 (Dropout)       (None, 96)        0        p_re_lu[0][0]              
  ________________________________________________________________________________
  batch_normalization_1 (Ba (None, 96)        384      dropout_3[0][0]            
  ________________________________________________________________________________
  dense_5 (Dense)           (None, 1024)      99328    batch_normalization_1[0][0]
  ________________________________________________________________________________
  activation_3 (Activation) (None, 1024)      0        dense_5[0][0]              
  ________________________________________________________________________________
  dropout_4 (Dropout)       (None, 1024)      0        activation_3[0][0]         
  ________________________________________________________________________________
  batch_normalization_2 (Ba (None, 1024)      4096     dropout_4[0][0]            
  ________________________________________________________________________________
  dense_6 (Dense)           (None, 512)       524800   batch_normalization_2[0][0]
  ________________________________________________________________________________
  activation_4 (Activation) (None, 512)       0        dense_6[0][0]              
  ________________________________________________________________________________
  dropout_5 (Dropout)       (None, 512)       0        activation_4[0][0]         
  ________________________________________________________________________________
  batch_normalization_3 (Ba (None, 512)       2048     dropout_5[0][0]            
  ________________________________________________________________________________
  dense_7 (Dense)           (None, 8)         4104     batch_normalization_3[0][0]
  ________________________________________________________________________________
  add (Add)                 (None, 8)         0        dense_7[0][0]              
                                                       input_2[0][0]              
  ________________________________________________________________________________
  dense_8 (Dense)           (None, 3)         27       add[0][0]                  
  ================================================================================
  Total params: 746,595
  Trainable params: 741,283
  Non-trainable params: 5,312
  ________________________________________________________________________________

For classification tasks, a different set of metrics is used for training and validation (and you can sharpen the focus on a specific class by passing its label to positive):

example2$metrics
           bac    prc    sen    csi    fsc     kpp     kdl
  train 0.5275 0.3699 0.3697 0.2266 0.3663  0.0557  0.0207
  valid 0.5264 0.3586 0.3676 0.2235 0.3571  0.0539 -0.0411
  test  0.4999 0.3463 0.3336 0.2044 0.3374 -0.0004  0.0681
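
For reference, here is a minimal sketch of how per-class precision (prc), sensitivity (sen) and F-score (fsc) could be computed for a chosen positive class; the remaining metrics and snap's exact internals are not reproduced here.

class_metrics <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)   # true positives
  fp <- sum(predicted == positive & actual != positive)   # false positives
  fn <- sum(predicted != positive & actual == positive)   # false negatives
  prc <- tp / (tp + fp)                                   # precision
  sen <- tp / (tp + fn)                                   # sensitivity (recall)
  fsc <- 2 * prc * sen / (prc + sen)                      # F-score
  c(prc = prc, sen = sen, fsc = fsc)
}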

The madness of labels

When you have multiple class features, each one with its own set of labels, you have a multilabel task. The data for example3 comes from the yeast function in mldr.datasets: 2417 rows x 117 columns (too large to show a snapshot table here).
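
A heavily hedged sketch of how such a flat data frame could be obtained (assuming yeast() returns an "mldr" object whose $dataset slot holds the 103 features plus the 14 Class* labels, followed by mldr's bookkeeping columns):

library(mldr.datasets)
yeast_mldr <- yeast()                      # assumed to return an "mldr" object
benchmark3 <- yeast_mldr$dataset[, 1:117]  # keep the 103 features and 14 labels, drop any extra columns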

This time we design a network with a single layer of 1024 nodes and mish activation (no regularization, dropout or normalization). Here is the picture:

Example n. 3

In the multilabel setting you cannot filter by feature importance (any imp_thresh value above 0 is reset to zero, with a message), but you can still sift the instances with anom_thresh.

example3 <- snap(data = benchmark3, target = paste0("Class", 1:14), task = "multilabel",
                 layers = 1, activations = "mish", nodes = 1024,
                 anom_thresh = 0.9, normalization = FALSE)
  snap: 51.13 sec elapsed

As in the previous examples, the result list also includes the standard history plot from Keras. Here you can see the plot for example3 (early stopping is driven by span and min_delta, which manage the patience parameter).

example3$plot
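
In plain Keras, the same kind of early stopping is expressed through a callback like the one sketched below; the specific values, and how snap derives patience from its span argument, are assumptions.

library(keras)
stop_cb <- callback_early_stopping(monitor = "val_loss",
                                   min_delta = 0.001,             # hypothetical value
                                   patience = 10,                 # hypothetical value, e.g. derived from span
                                   restore_best_weights = TRUE)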

This time, when you use the prediction function, you get the whole set of predicted labels:

example3$pred_fun(benchmark3[1:10,-c(104:117)])
     predicted_Class1 predicted_Class2 predicted_Class3 predicted_Class4
  1                 0                0                0                0
  2                 0                0                1                1
  3                 0                1                0                0
  4                 0                0                1                1
  5                 0                1                1                0
  6                 0                0                0                0
  7                 1                0                0                0
  8                 0                0                1                1
  9                 0                0                0                0
  10                0                0                1                1
     predicted_Class5 predicted_Class6 predicted_Class7 predicted_Class8
  1                 0                0                0                0
  2                 0                0                0                0
  3                 0                0                0                0
  4                 0                0                0                0
  5                 0                0                0                0
  6                 1                0                0                0
  7                 0                0                0                0
  8                 0                0                0                0
  9                 0                0                0                0
  10                0                0                0                0
     predicted_Class9 predicted_Class10 predicted_Class11 predicted_Class12
  1                 0                 0                 0                 1
  2                 0                 0                 0                 0
  3                 0                 0                 0                 1
  4                 0                 0                 0                 1
  5                 0                 0                 0                 1
  6                 0                 0                 0                 1
  7                 0                 0                 0                 1
  8                 0                 0                 0                 1
  9                 0                 0                 0                 1
  10                0                 0                 0                 1
     predicted_Class13 predicted_Class14
  1                  1                 0
  2                  0                 0
  3                  1                 0
  4                  1                 0
  5                  1                 0
  6                  1                 0
  7                  1                 0
  8                  1                 0
  9                  1                 0
  10                 1                 0

  1. All you need to know about the current implementation of Tensorflow/Keras in R: https://keras.rstudio.com

  2. If you need a concise overview on residual networks, you can take a look here: https://en.wikipedia.org/wiki/Residual_neural_network

  3. In a future version of this article, I will add some examples on how to use global and local embeddings.

  4. For feature selection we use CORElearn's RReliefFbestK (one of the two different versions presented in the package, for numeric and factor variables respectively). If you want a better understanding of the specific Relief metric, you can take a look here: https://www.rdocumentation.org/packages/CORElearn/versions/1.56.0/topics/attrEval

  5. For anomaly detection, we use the standard lof function from the dbscan package. Here is a link to the documentation: https://rdrr.io/cran/dbscan/man/lof.html