Deep learning with Keras - using R

Sigrid Keydana, Trivadis
2017/09/11

About me & my employer

 

Trivadis

  • DACH-based IT consulting and service company, from traditional technologies to big data/machine learning/data science

My background

  • from psychology/statistics via software development and database engineering to data science and ML/DL

My passion

  • machine learning and deep learning
  • data science and (Bayesian) statistics
  • explanation/understanding over prediction accuracy

Where to find me

Welcome to the world of deep neural networks

 

 

Source: https://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

How can we do deep learning in practice?

 

Powerful frameworks:

  • TensorFlow (Python)
  • PyTorch (Python)
  • Keras (Python)
  • DL4J (Java)

But we want to use R!

 

Let's just use Keras

 

… from R!

How?

With the R Interface to Keras, developed by RStudio
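Getting set up is a one-time step; a minimal sketch (assuming the keras package is installed from CRAN and that install_keras() is allowed to set up the Python backend):

install.packages("keras")   # R interface to Keras, by RStudio
library(keras)
install_keras()             # installs Keras + TensorFlow into a dedicated Python environment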

Getting started with Keras

 

Four steps:

  1. prepare data
  2. define model
  3. train model
  4. test model

Let's see how this works on our most beloved algorithm...

 

… linear regression!

… using our most beloved dataset …

 

 

Linear regression with Keras

The task

 

We want to predict Petal.Width from Petal.Length …


Call:
lm(formula = Petal.Width ~ Petal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56515 -0.12358 -0.01898  0.13288  0.64272 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2065 on 148 degrees of freedom
Multiple R-squared:  0.9271,    Adjusted R-squared:  0.9266 
F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

[Plot: Petal.Width vs. Petal.Length with the fitted regression line]

Step 1: Prepare data

 

Not much to do in this case, just split into x/y and test & training sets:

x_train <- iris[1:120, "Petal.Length"]
y_train <- iris[1:120, "Petal.Width"]
x_test <- iris[121:150, "Petal.Length"]
y_test <- iris[121:150, "Petal.Width"]

Step 2: Define model

 

model <- keras_model_sequential()
model %>%
  layer_dense(units = 8, input_shape = 1) %>% # hidden layer with 8 neurons, 1-dimensional input
  layer_activation_leaky_relu() %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1) # output layer with linear activation

# model %>% summary() # params: num_hidden weights and biases each to hidden layer, 8 weights and 1 bias to output neuron

model %>% compile(optimizer = "adam", loss = "mean_squared_error")

 

Layer (type)                   Output Shape        Param #
dense_1 (Dense)                (None, 8)           16
leaky_re_lu_1 (LeakyReLU)      (None, 8)           0
dropout_1 (Dropout)            (None, 8)           0
dense_2 (Dense)                (None, 1)           9
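The parameter counts check out: the hidden layer has 1 × 8 weights plus 8 biases (16), and the output layer 8 × 1 weights plus 1 bias (9), for 25 trainable parameters in total.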

Step 3: Train model

 

hist <-
  model %>% fit(
    x_train,
    y_train,
    batch_size = 10,
    epochs = 100
  )
model %>% save_model_hdf5("iris.h5")

plot(hist)

[Plot: training loss history over 100 epochs]

Step 4: Test model

 

model <- load_model_hdf5("iris.h5")
model %>% evaluate(x_test, y_test)
[1] 0.428311
preds_train <- model %>% predict(x_train)
preds_test <- model %>% predict(x_test)

[Plot: model predictions on the training and test data]
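The plot itself is not shown as code on the slide; a minimal sketch with base graphics, reusing the objects defined above (purely illustrative), could be:

plot(x_train, y_train, xlab = "Petal.Length", ylab = "Petal.Width")
points(x_train, preds_train, col = "red")
points(x_test, y_test, pch = 2)
points(x_test, preds_test, col = "red", pch = 2)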

 

 

Convolutional Neural Networks (CNNs)

Going spatial: convnets

 

LeNet: First successful application of convolutional neural networks by Yann LeCun, Yoshua Bengio et al.

Source: http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf

The convolution operation

 

Source: Goodfellow et al., Deep Learning.
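Not from the slides: a minimal R sketch of what a single filter in a convolutional layer computes for one input channel (strictly speaking a cross-correlation, which is what deep learning frameworks implement), with "valid" padding and stride 1:

# slide the kernel over the input and take the elementwise product-sum at each position
conv2d <- function(input, kernel) {
  out_rows <- nrow(input) - nrow(kernel) + 1
  out_cols <- ncol(input) - ncol(kernel) + 1
  out <- matrix(0, out_rows, out_cols)
  for (i in 1:out_rows) {
    for (j in 1:out_cols) {
      patch <- input[i:(i + nrow(kernel) - 1), j:(j + ncol(kernel) - 1)]
      out[i, j] <- sum(patch * kernel)
    }
  }
  out
}

# sliding a 3x3 kernel over a 5x5 input yields a 3x3 feature map
conv2d(matrix(1:25, 5, 5), matrix(c(-1, 0, 1), 3, 3, byrow = TRUE))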

CIFAR-10

 

32×32 px RGB images, split into a training set (50,000) and a test set (10,000)

cifar10 <- dataset_cifar10()

x_train <- cifar10$train$x/255
x_test <- cifar10$test$x/255
y_train <- to_categorical(cifar10$train$y, num_classes = 10)
y_test <- to_categorical(cifar10$test$y, num_classes = 10)

[Plot: sample CIFAR-10 images]
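The example images can be displayed straight from the scaled array; a minimal sketch (the image index is chosen arbitrarily):

# each x_train[i, , , ] slice is a 32 x 32 x 3 array with values in [0, 1],
# which as.raster() turns into something plot() can render
plot(as.raster(x_train[1, , , ]))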

A (rather) simple convnet

 

model <- keras_model_sequential()

model %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), padding = "same", input_shape = c(32, 32, 3)) %>%
  layer_activation("relu") %>%

  layer_conv_2d(filters = 32, kernel_size = c(3,3)) %>% layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 
  layer_dropout(0.25) %>%

  layer_conv_2d(filters = 32, kernel_size = c(3,3), padding = "same") %>%
  layer_activation("relu") %>%

  layer_conv_2d(filters = 32, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_dropout(0.25) %>%

  layer_flatten() %>%
  layer_dense(512) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%

  layer_dense(10) %>%
  layer_activation("softmax")

Compile and train (not now)

 

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(lr = 0.0001, decay = 1e-6),
  metrics = "accuracy"
)

model %>% fit(
  x_train,
  y_train,
  batch_size = 32,
  epochs = 200,
  validation_data = list(x_test, y_test),
  shuffle = TRUE
)
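Training for 200 epochs takes a while, so the fitted model was saved to disk and is reloaded on the next slide, presumably with something like:

model %>% save_model_hdf5("cifar_10_200epochs.h5")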

Evaluate model

 

model <- load_model_hdf5("cifar_10_200epochs.h5")
(model %>% evaluate(x_test, y_test))[[2]] # accuracy
[1] 0.3802
y_test[1] # example image: true class
[1] 0
model %>% predict_proba(x_test[1, , , , drop = FALSE])
           [,1]       [,2]      [,3]      [,4]      [,5]      [,6]
[1,] 0.02724996 0.02721717 0.1336272 0.3005729 0.0846272 0.1922515
          [,7]       [,8]       [,9]      [,10]
[1,] 0.1633566 0.03343451 0.02119882 0.01646418
model %>% predict_classes(x_test[1, , , , drop = FALSE])
[1] 3
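Class 3 corresponds to "cat" in the usual CIFAR-10 label order (0 = airplane, 1 = automobile, 2 = bird, 3 = cat, 4 = deer, 5 = dog, 6 = frog, 7 = horse, 8 = ship, 9 = truck).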

Why start from scratch? Pretrained models

 

Available in Keras, with weights trained on ImageNet

  • Inception V3
  • VGG16, VGG19
  • ResNet
  • Xception

Usage:

  • as is
  • remove top fully connected layer and train for own classes (see the sketch after this list)
  • remove top fully connected layer and train for own classes, fine-tune upper layers
  • extract features (top or intermediate)
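Not code from the slides: a minimal sketch of the second option, putting a new classifier on top of the frozen VGG16 convolutional base (the class count of 5 and the layer sizes are made up for illustration):

base_model <- application_vgg16(weights = "imagenet", include_top = FALSE,
                                input_shape = c(224, 224, 3))

# keep the pretrained convolutional base fixed
freeze_weights(base_model)

predictions <- base_model$output %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 5, activation = "softmax")   # hypothetical: 5 classes of our own

model <- keras_model(inputs = base_model$input, outputs = predictions)
model %>% compile(optimizer = "rmsprop",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")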

What's this?

 

Source: Imagenet

Ask VGG16

 

model <- application_vgg16(weights = 'imagenet')

img_path <- "xc.jpg"
img <- image_load(img_path, target_size = c(224,224)) 
x <- image_to_array(img)
dim(x) <- c(1, dim(x))

x <- imagenet_preprocess_input(x)

preds <- model %>% predict(x)
imagenet_decode_predictions(preds)
[[1]]
  class_name class_description       score
1  n03888257         parachute 0.876223385
2  n02268443         dragonfly 0.066812977
3  n02879718               bow 0.008029214
4  n01770393          scorpion 0.007272749
5  n03814906          necklace 0.004784236

Ask Resnet

 

model <- application_resnet50(weights = 'imagenet')

img_path <- "xc.jpg"
img <- image_load(img_path, target_size = c(224,224)) 
x <- image_to_array(img)
dim(x) <- c(1, dim(x))

x <- imagenet_preprocess_input(x)

preds <- model %>% predict(x)
imagenet_decode_predictions(preds)
[[1]]
  class_name           class_description score
1  n02098286 West_Highland_white_terrier     1
2  n15075141               toilet_tissue     0
3  n02319095                  sea_urchin     0
4  n02391049                       zebra     0
5  n02389026                      sorrel     0

Ask Inception V3

 

model <- application_inception_v3(weights = 'imagenet')

img_path <- "xc.jpg"
img <- image_load(img_path, target_size = c(299,299)) 
x <- image_to_array(img)
dim(x) <- c(1, dim(x))

x <- inception_v3_preprocess_input(x)

preds <- model %>% predict(x)
which.max(preds)
[1] 796
imagenet_decode_predictions(preds)
[[1]]
  class_name class_description       score
1  n04228054               ski 0.762439072
2  n03888257         parachute 0.030627847
3  n09193705               alp 0.024098456
4  n04252077        snowmobile 0.007552539
5  n03792782     mountain_bike 0.003474350

 

 

Long Short-Term Memory (LSTM)

Going sequential: Recurrent Neural Networks (RNNs)

 

Source: Goodfellow et al. 2016, Deep Learning

I need to forget: Long Short-Term Memory

Let's write ourselves some R (source!)

 

… character by character (fun read: The Unreasonable Effectiveness of Recurrent Neural Networks)

Input: subset of .c files under $R_HOME/src/main

Like this, for example…

#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include <Defn.h>
#include <Internal.h>

/* JMC convinced MM that this was not a good idea: */
#undef _S4_subsettable

static R_INLINE SEXP VECTOR_ELT_FIX_NAMED(SEXP y, R_xlen_t i) {
    /* if RHS (container or element) has NAMED > 0 set NAMED = 2.
       Duplicating might be safer/more consistent (fix bug reported by
       Radford Neal; similar to PR15098) */
    SEXP val = VECTOR_ELT(y, i);
    if ((NAMED(y) || NAMED(val)))
        if (NAMED(val) < 2)
            SET_NAMED(val, 2);
    return val;
}

Preprocessing for char-rnn

 

Preprocessing steps:

  • split into characters
  • build a dictionary of characters
  • build a list of (overlapping) character sequences of a specific length, and the corresponding following char
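The first two steps are not shown on the slides; a minimal sketch (the directory name r_sources is hypothetical, and maxlen = 40 is read off the example output below) might look like this:

library(purrr)   # for map() and transpose(), used below

# hypothetical: local directory holding a subset of the .c files from $R_HOME/src/main
files <- list.files("r_sources", pattern = "\\.c$", full.names = TRUE)

# read everything into one long lowercase string, then split into single characters
text <- tolower(paste(map_chr(files, ~ paste(readLines(.x, warn = FALSE), collapse = "\n")),
                      collapse = "\n"))
text <- strsplit(text, "")[[1]]

# dictionary of all characters that occur
chars <- sort(unique(text))

# length of the input sequences
maxlen <- 40

The overlapping sequences of the third step are then built as on the slide: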
# build overlapping sequences of maxlen characters (moving forward 3 characters
# at a time), together with the character that follows each sequence
dataset <- map(
  seq(1, length(text) - maxlen - 1, by = 3), 
  function(x) list(sentence = text[x:(x + maxlen - 1)], next_char = text[x + maxlen])
)
dataset[[14]]
$sentence
 [1] "s"  "t"  "i"  "c"  "a"  "l"  " "  "d"  "a"  "t"  "a"  " "  "a" 
[14] "n"  "a"  "l"  "y"  "s"  "i"  "s"  "\n" " "  "*"  " "  " "  "c" 
[27] "o"  "p"  "y"  "r"  "i"  "g"  "h"  "t"  " "  "("  "c"  ")"  " " 
[40] "1" 

$next_char
[1] "9"

Create X and y matrices in formats required by LSTM

 

The LSTM needs a 3-dimensional array as input, with dimensions, in order:

  • number of samples (batch size)
  • number of timesteps (characters to use for prediction, in our case)
  • number of features (the size of the character dictionary, in our case)

 

dataset <- transpose(dataset)

# one-hot encode: X has shape <num samples, num timesteps, num features>,
# y has shape <num samples, num features>
X <- array(0, dim = c(length(dataset$sentence), maxlen, length(chars)))
y <- array(0, dim = c(length(dataset$sentence), length(chars)))

for (i in 1:length(dataset$sentence)) {  
  X[i, , ] <- sapply(chars, function(x) {
    as.integer(x == dataset$sentence[[i]])
  })  
  y[i, ] <- as.integer(chars == dataset$next_char[[i]])
}

Define model

 

model <- keras_model_sequential()

model %>%
  layer_lstm(128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(length(chars)) %>%
  layer_activation("softmax")

optimizer <- optimizer_rmsprop(lr = 0.01)

model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = optimizer
)

60 epochs later... let's see some output!

 

n memory */
        if (start) return (intsxp:";
        case nilsxp:
    case nilsxp:
    case = char(chk(x), start);
    return ans;
}

static rboolean r_namessxp:
        case cptr = r_print(protect(s"));
            if (ns == 0) {
                if (isnan(x)[i]) return (integer(x)));
    return chk(cadr(chk(s))); }
sexp (set_ay_string)) {
        if (size > 0) {
        if (s <= 0) return this spacelar(stack) be gens in the cur

 

Could be C to me ;-)
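How is the text actually generated? That part is not on the slides; below is a minimal sketch of the standard char-rnn sampling loop (variable names are hypothetical). The temperature parameter controls how conservative or "adventurous" the output gets, which is presumably what varies between the samples on the following slides.

# reweight the predicted distribution by a temperature and sample one character index
sample_next_char <- function(preds, temperature = 1.0) {
  preds <- log(preds) / temperature
  preds <- exp(preds) / sum(exp(preds))
  which.max(rmultinom(1, size = 1, prob = preds))
}

# repeatedly predict the next character and slide the input window forward
generate <- function(model, seed, n_chars = 400, temperature = 1.0) {
  sentence <- seed                       # character vector of length maxlen
  generated <- c()
  for (step in 1:n_chars) {
    # one-hot encode the current window
    x <- array(0, dim = c(1, maxlen, length(chars)))
    for (t in 1:maxlen) x[1, t, chars == sentence[t]] <- 1
    preds <- (model %>% predict(x))[1, ]
    next_char <- chars[sample_next_char(preds, temperature)]
    generated <- c(generated, next_char)
    sentence <- c(sentence[-1], next_char)
  }
  paste(generated, collapse = "")
}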

Another one... more adventurous this time...

 

n the -is a should neare stack to protect(shiftemplerbleve */
        {
        count = node_ised(cgised-> set_a_mayshelex)[i].for(real_const char *) valuel);
    int dit++;
    *vele = null;
}

static void r_nilsxp -= reprintf(r_isopy_primns; i++) {
    size = fr_is);
    for(i = 0; i complex * vmax);
    set_loon (_(_(dsize) == na_string) {
          error(_("argumenexity an
, experend */
    return chk(car(farge, "

... and a still more creative one...

 

>0.ribncely_troum-pubscliet(len);
  define (plise inc) (arguments,s.=irned jin = kp; (bached1,(yoenjioff_lenamus(x, size_t lastso, |glepplesxp,t)(sexp fiwelue_stefff() fmt2 to we bule 2a *lotkfientsrp++);
#if nhex) if */*xels(bece fonts)
{
    s =     pointeg > breakfrec
        size = ptroblerc=rrol);
    error(l_r_isest ans_ltaindxtes(fmtpaun narisex same * %d.--20tsmdourgite_code_nexi__relen _so_v_num__("/

 

Now we know where the “less helpful error messages” come from ;-)

Another cool thing we can do with LSTMs: Time series forecasting

 

Take this one:

[Plot: a time series with multiple seasonal patterns]

This series exhibits multiple seasonal patterns.

One-step-ahead forecasts

 

Preprocessing:

  • Scale/normalize the data (especially with extremely high values like this!)
  • Again, reshape data to the form <num samples, num timesteps, num features>
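A minimal sketch of that reshaping (series is a hypothetical numeric vector of the hourly values; min-max scaling is chosen purely for illustration):

num_steps <- 24 * 7      # 168, matching the choice below
num_features <- 1

# scale to [0, 1]
series_scaled <- (series - min(series)) / (max(series) - min(series))

# build overlapping windows of length num_steps, plus the value one step ahead
n_samples <- length(series_scaled) - num_steps
X <- array(0, dim = c(n_samples, num_steps, num_features))
y <- numeric(n_samples)
for (i in 1:n_samples) {
  X[i, , 1] <- series_scaled[i:(i + num_steps - 1)]
  y[i]      <- series_scaled[i + num_steps]   # the value one step ahead
}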

Model definition:

model <- keras_model_sequential()

model %>% 
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features)) %>% 
  layer_dense(units = 1) %>% 
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )

 

Where the number of timesteps is chosen to be 24 hours * 7 days = 168.

And after about 20 epochs of training...

 

… here are the forecasts:

[Plot: one-step-ahead forecasts]

How about multi-step-ahead forecasts?

 

We can use the Keras TimeDistributed layer for this.

lstm_units <- 24 * 7 # 24 hours * 7 days = 168

model <- keras_model_sequential()

model %>% 
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features),
             return_sequences = TRUE) %>% 
  time_distributed(layer_dense(units = 1)) %>% 
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )

 

Where the number of predictions equals the number of timesteps: 168.
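Note that with return_sequences = TRUE the LSTM emits one output per input timestep, and time_distributed(layer_dense(units = 1)) applies the same dense layer at each of them, so the targets must now also be a 3-dimensional array of shape <num samples, num timesteps, 1> rather than a vector of single next values.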

168-step-ahead forecasts!

 

And here are the forecasts, 168 steps ahead (only 3 shown):

 

[Plot: 168-step-ahead forecasts]

That's just a small glimpse of what you could do...

 

… with Keras, in R.

Go ahead and have fun :-)

 

Thanks for your attention!!