Deep learning with Keras - using R

Sigrid Keydana, Trivadis
2017/09/11

About me & my employer

 

Trivadis

  • DACH-based IT consulting and service company, from traditional technologies to big data/machine learning/data science

My background

  • from psychology/statistics via software development and database engineering to data science and ML/DL

My passion

  • machine learning and deep learning
  • data science and (Bayesian) statistics
  • explanation/understanding over prediction accuracy

Where to find me

Welcome to the world of deep neural networks

 

 

Source: https://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

How can we do deep learning in practice?

 

Powerful frameworks:

  • TensorFlow (Python)
  • PyTorch (Python)
  • Keras (Python)
  • DL4J (Java)

But we want to use R!

 

Let's just use Keras

 

… from R!

How?

With the R Interface to Keras, developed by RStudio
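Getting set up is a one-time step; a minimal sketch (assuming the keras package is installed from CRAN and that install_keras() is allowed to set up the Python backend):

install.packages("keras")   # R interface to Keras, by RStudio
library(keras)
install_keras()             # installs Keras + TensorFlow into a dedicated Python environment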

Getting started with Keras

 

Four steps:

  1. prepare data
  2. define model
  3. train model
  4. test model

Let's see how this works on our most beloved algorithm...

 

… linear regression!

… using our most beloved dataset …

 

 

Linear regression with Keras

The task

 

We want to predict Petal.Width from Petal.Length …


Call:
lm(formula = Petal.Width ~ Petal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56515 -0.12358 -0.01898  0.13288  0.64272 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2065 on 148 degrees of freedom
Multiple R-squared:  0.9271,    Adjusted R-squared:  0.9266 
F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

[Plot: Petal.Width vs. Petal.Length with the fitted regression line]

Step 1: Prepare data

 

Not much to do in this case, just split into x/y and test & training sets:

x_train <- iris[1:120, "Petal.Length"]
y_train <- iris[1:120, "Petal.Width"]
x_test <- iris[121:150, "Petal.Length"]
y_test <- iris[121:150, "Petal.Width"]

Step 2: Define model

 

model <- keras_model_sequential()
model %>%
  layer_dense(units = 8, input_shape = 1) %>% # hidden layer with 8 neurons, 1-dimensional input
  layer_activation_leaky_relu() %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1) # output layer with linear activation

# model %>% summary() # params: num_hidden weights and biases each to hidden layer, 8 weights and 1 bias to output neuron

model %>% compile(optimizer = "adam", loss = "mean_squared_error")

 

Layer (type)                   Output Shape        Param #
dense_1 (Dense)                (None, 8)           16
leaky_re_lu_1 (LeakyReLU)      (None, 8)           0
dropout_1 (Dropout)            (None, 8)           0
dense_2 (Dense)                (None, 1)           9
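The parameter counts check out: the hidden layer has 1 × 8 weights plus 8 biases (16), and the output layer 8 × 1 weights plus 1 bias (9), for 25 trainable parameters in total.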

Step 3: Train model

 

hist <-
  model %>% fit(
    x_train,
    y_train,
    batch_size = 10,
    epochs = 100
  )
model %>% save_model_hdf5("iris.h5")

plot(hist)

[Plot: training loss history over 100 epochs]

Step 4: Test model

 

model <- load_model_hdf5("iris.h5")
model %>% evaluate(x_test, y_test)
[1] 0.428311
preds_train <- model %>% predict(x_train)
preds_test <- model %>% predict(x_test)

[Plot: model predictions on the training and test data]
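The plot itself is not shown as code on the slide; a minimal sketch with base graphics, reusing the objects defined above (purely illustrative), could be:

plot(x_train, y_train, xlab = "Petal.Length", ylab = "Petal.Width")
points(x_train, preds_train, col = "red")
points(x_test, y_test, pch = 2)
points(x_test, preds_test, col = "red", pch = 2)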

 

 

Convolutional Neural Networks (CNNs)

Going spatial: convnets

 

LeNet: First successful application of convolutional neural networks by Yann LeCun, Yoshua Bengio et al.

Source: http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf

The convolution operation

 

Source: Goodfellow et al., Deep Learning.
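Not from the slides: a minimal R sketch of what a single filter in a convolutional layer computes for one input channel (strictly speaking a cross-correlation, which is what deep learning frameworks implement), with "valid" padding and stride 1:

# slide the kernel over the input and take the elementwise product-sum at each position
conv2d <- function(input, kernel) {
  out_rows <- nrow(input) - nrow(kernel) + 1
  out_cols <- ncol(input) - ncol(kernel) + 1
  out <- matrix(0, out_rows, out_cols)
  for (i in 1:out_rows) {
    for (j in 1:out_cols) {
      patch <- input[i:(i + nrow(kernel) - 1), j:(j + ncol(kernel) - 1)]
      out[i, j] <- sum(patch * kernel)
    }
  }
  out
}

# sliding a 3x3 kernel over a 5x5 input yields a 3x3 feature map
conv2d(matrix(1:25, 5, 5), matrix(c(-1, 0, 1), 3, 3, byrow = TRUE))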

CIFAR-10

 

32×32 px RGB images, split into a training set (50,000) and a test set (10,000)

cifar10 <- dataset_cifar10()

x_train <- cifar10$train$x/255
x_test <- cifar10$test$x/255
y_train <- to_categorical(cifar10$train$y, num_classes = 10)
y_test <- to_categorical(cifar10$test$y, num_classes = 10)

[Plot: sample CIFAR-10 images]
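The example images can be displayed straight from the scaled array; a minimal sketch (the image index is chosen arbitrarily):

# each x_train[i, , , ] slice is a 32 x 32 x 3 array with values in [0, 1],
# which as.raster() turns into something plot() can render
plot(as.raster(x_train[1, , , ]))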

A (rather) simple convnet

 

model <- keras_model_sequential()

model %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), padding = "same", input_shape = c(32, 32, 3)) %>%
  layer_activation("relu") %>%

  layer_conv_2d(filters = 32, kernel_size = c(3,3)) %>% layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 
  layer_dropout(0.25) %>%

  layer_conv_2d(filters = 32, kernel_size = c(3,3), padding = "same") %>%
  layer_activation("relu") %>%

  layer_conv_2d(filters = 32, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_dropout(0.25) %>%

  layer_flatten() %>%
  layer_dense(512) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%

  layer_dense(10) %>%
  layer_activation("softmax")

Compile and train (not now)

 

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(lr = 0.0001, decay = 1e-6),
  metrics = "accuracy"
)

model %>% fit(
  x_train,
  y_train,
  batch_size = 32,
  epochs = 200,
  validation_data = list(x_test, y_test),
  shuffle = TRUE
)
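Training for 200 epochs takes a while, so the fitted model was saved to disk and is reloaded on the next slide, presumably with something like:

model %>% save_model_hdf5("cifar_10_200epochs.h5")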

Evaluate model

 

model <- load_model_hdf5("cifar_10_200epochs.h5")
(model %>% evaluate(x_test, y_test))[[2]] # accuracy
[1] 0.3802
y_test[1] # example image: true class
[1] 0
model %>% predict_proba(x_test[1, , , , drop = FALSE])
           [,1]       [,2]      [,3]      [,4]      [,5]      [,6]
[1,] 0.02724996 0.02721717 0.1336272 0.3005729 0.0846272 0.1922515
          [,7]       [,8]       [,9]      [,10]
[1,] 0.1633566 0.03343451 0.02119882 0.01646418
model %>% predict_classes(x_test[1, , , , drop = FALSE])
[1] 3
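Class 3 corresponds to "cat" in the usual CIFAR-10 label order (0 = airplane, 1 = automobile, 2 = bird, 3 = cat, 4 = deer, 5 = dog, 6 = frog, 7 = horse, 8 = ship, 9 = truck).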

Why start from scratch? Pretrained models

 

Available in Keras, with weights trained on ImageNet

  • Inception V3
  • VGG16, VGG19
  • ResNet
  • Xception

Usage:

  • as is
  • remove top fully connected layer and train for own classes (see the sketch after this list)
  • remove top fully connected layer and train for own classes, fine-tune upper layers
  • extract features (top or intermediate)
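Not code from the slides: a minimal sketch of the second option, putting a new classifier on top of the frozen VGG16 convolutional base (the class count of 5 and the layer sizes are made up for illustration):

base_model <- application_vgg16(weights = "imagenet", include_top = FALSE,
                                input_shape = c(224, 224, 3))

# keep the pretrained convolutional base fixed
freeze_weights(base_model)

predictions <- base_model$output %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 5, activation = "softmax")   # hypothetical: 5 classes of our own

model <- keras_model(inputs = base_model$input, outputs = predictions)
model %>% compile(optimizer = "rmsprop",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")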

What's this?

 

Source: Imagenet

Ask VGG16

 

model <- application_vgg16(weights = 'imagenet')

img_path <- "xc.jpg"
img <- image_load(img_path, target_size = c(224,224)) 
x <- image_to_array(img)
dim(x) <- c(1, dim(x))

x <- imagenet_preprocess_input(x)

preds <- model %>% predict(x)
imagenet_decode_predictions(preds)
[[1]]
  class_name class_description       score
1  n03888257         parachute 0.876223385
2  n02268443         dragonfly 0.066812977
3  n02879718               bow 0.008029214
4  n01770393          scorpion 0.007272749
5  n03814906          necklace 0.004784236

Ask Resnet

 

model <- application_resnet50(weights = 'imagenet')

img_path <- "xc.jpg"
img <- image_load(img_path, target_size = c(224,224)) 
x <- image_to_array(img)
dim(x) <- c(1, dim(x))

x <- imagenet_preprocess_input(x)

preds <- model %>% predict(x)
imagenet_decode_predictions(preds)
[[1]]
  class_name           class_description score
1  n02098286 West_Highland_white_terrier     1
2  n15075141               toilet_tissue     0
3  n02319095                  sea_urchin     0
4  n02391049                       zebra     0
5  n02389026                      sorrel     0

Ask Inception V3

 

model <- application_inception_v3(weights = 'imagenet')

img_path <- "xc.jpg"
img <- image_load(img_path, target_size = c(299,299)) 
x <- image_to_array(img)
dim(x) <- c(1, dim(x))

x <- inception_v3_preprocess_input(x)

preds <- model %>% predict(x)
which.max(preds)
[1] 796
imagenet_decode_predictions(preds)
[[1]]
  class_name class_description       score
1  n04228054               ski 0.762439072
2  n03888257         parachute 0.030627847
3  n09193705               alp 0.024098456
4  n04252077        snowmobile 0.007552539
5  n03792782     mountain_bike 0.003474350

 

 

Long Short-Term Memory (LSTM)

Going sequential: Recurrent Neural Networks (RNNs)

 

Source: Goodfellow et al. 2016, Deep Learning

I need to forget: Long Short-Term Memory

Let's write ourselves some R (source!)

 

… character by character (fun read: The Unreasonable Effectiveness of Recurrent Neural Networks)

Input: subset of .c files under $R_HOME/src/main

Like this, for example…

#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include <Defn.h>
#include <Internal.h>

/* JMC convinced MM that this was not a good idea: */
#undef _S4_subsettable

static R_INLINE SEXP VECTOR_ELT_FIX_NAMED(SEXP y, R_xlen_t i) {
    /* if RHS (container or element) has NAMED > 0 set NAMED = 2.
       Duplicating might be safer/more consistent (fix bug reported by
       Radford Neal; similar to PR15098) */
    SEXP val = VECTOR_ELT(y, i);
    if ((NAMED(y) || NAMED(val)))
        if (NAMED(val) < 2)
            SET_NAMED(val, 2);
    return val;
}

Preprocessing for char-rnn

 

Preprocessing steps:

  • split into characters
  • build a dictionary of characters
  • build a list of (overlapping) character sequences of a specific length, and the corresponding following char
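The first two steps are not shown on the slides; a minimal sketch (the directory name r_sources is hypothetical, and maxlen = 40 is read off the example output below) might look like this:

library(purrr)   # for map() and transpose(), used below

# hypothetical: local directory holding a subset of the .c files from $R_HOME/src/main
files <- list.files("r_sources", pattern = "\\.c$", full.names = TRUE)

# read everything into one long lowercase string, then split into single characters
text <- tolower(paste(map_chr(files, ~ paste(readLines(.x, warn = FALSE), collapse = "\n")),
                      collapse = "\n"))
text <- strsplit(text, "")[[1]]

# dictionary of all characters that occur
chars <- sort(unique(text))

# length of the input sequences
maxlen <- 40

The overlapping sequences of the third step are then built as on the slide: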
# build overlapping sequences of maxlen characters (moving forward 3 characters
# at a time), together with the character that follows each sequence
dataset <- map(
  seq(1, length(text) - maxlen - 1, by = 3), 
  function(x) list(sentence = text[x:(x + maxlen - 1)], next_char = text[x + maxlen])
)
dataset[[14]]
$sentence
 [1] "s"  "t"  "i"  "c"  "a"  "l"  " "  "d"  "a"  "t"  "a"  " "  "a" 
[14] "n"  "a"  "l"  "y"  "s"  "i"  "s"  "\n" " "  "*"  " "  " "  "c" 
[27] "o"  "p"  "y"  "r"  "i"  "g"  "h"  "t"  " "  "("  "c"  ")"  " " 
[40] "1" 

$next_char
[1] "9"

Create X and y matrices in formats required by LSTM

 

The LSTM needs a 3-dimensional array as input, with dimensions, in order:

  • number of samples (batch size)
  • number of timesteps (characters to use for prediction, in our case)
  • number of features (the size of the character dictionary, in our case)

 

dataset <- transpose(dataset)

# one-hot encode: X has shape <num samples, num timesteps, num features>,
# y has shape <num samples, num features>
X <- array(0, dim = c(length(dataset$sentence), maxlen, length(chars)))
y <- array(0, dim = c(length(dataset$sentence), length(chars)))

for (i in 1:length(dataset$sentence)) {  
  X[i, , ] <- sapply(chars, function(x) {
    as.integer(x == dataset$sentence[[i]])
  })  
  y[i, ] <- as.integer(chars == dataset$next_char[[i]])
}

Define model

 

model <- keras_model_sequential()

model %>%
  layer_lstm(128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(length(chars)) %>%
  layer_activation("softmax")

optimizer <- optimizer_rmsprop(lr = 0.01)

model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = optimizer
)

60 epochs later... let's see some output!

 

n memory */
        if (start) return (intsxp:";
        case nilsxp:
    case nilsxp:
    case = char(chk(x), start);
    return ans;
}

static rboolean r_namessxp:
        case cptr = r_print(protect(s"));
            if (ns == 0) {
                if (isnan(x)[i]) return (integer(x)));
    return chk(cadr(chk(s))); }
sexp (set_ay_string)) {
        if (size > 0) {
        if (s <= 0) return this spacelar(stack) be gens in the cur

 

Could be C to me ;-)
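How is the text actually generated? That part is not on the slides; below is a minimal sketch of the standard char-rnn sampling loop (variable names are hypothetical). The temperature parameter controls how conservative or "adventurous" the output gets, which is presumably what varies between the samples on the following slides.

# reweight the predicted distribution by a temperature and sample one character index
sample_next_char <- function(preds, temperature = 1.0) {
  preds <- log(preds) / temperature
  preds <- exp(preds) / sum(exp(preds))
  which.max(rmultinom(1, size = 1, prob = preds))
}

# repeatedly predict the next character and slide the input window forward
generate <- function(model, seed, n_chars = 400, temperature = 1.0) {
  sentence <- seed                       # character vector of length maxlen
  generated <- c()
  for (step in 1:n_chars) {
    # one-hot encode the current window
    x <- array(0, dim = c(1, maxlen, length(chars)))
    for (t in 1:maxlen) x[1, t, chars == sentence[t]] <- 1
    preds <- (model %>% predict(x))[1, ]
    next_char <- chars[sample_next_char(preds, temperature)]
    generated <- c(generated, next_char)
    sentence <- c(sentence[-1], next_char)
  }
  paste(generated, collapse = "")
}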

Another one... more adventurous this time...

 

n the -is a should neare stack to protect(shiftemplerbleve */
        {
        count = node_ised(cgised-> set_a_mayshelex)[i].for(real_const char *) valuel);
    int dit++;
    *vele = null;
}

static void r_nilsxp -= reprintf(r_isopy_primns; i++) {
    size = fr_is);
    for(i = 0; i complex * vmax);
    set_loon (_(_(dsize) == na_string) {
          error(_("argumenexity an
, experend */
    return chk(car(farge, "

... and a still more creative one...

 

>0.ribncely_troum-pubscliet(len);
  define (plise inc) (arguments,s.=irned jin = kp; (bached1,(yoenjioff_lenamus(x, size_t lastso, |glepplesxp,t)(sexp fiwelue_stefff() fmt2 to we bule 2a *lotkfientsrp++);
#if nhex) if */*xels(bece fonts)
{
    s =     pointeg > breakfrec
        size = ptroblerc=rrol);
    error(l_r_isest ans_ltaindxtes(fmtpaun narisex same * %d.--20tsmdourgite_code_nexi__relen _so_v_num__("/

 

Now we know where the “less helpful error messages” come from ;-)

Another cool thing we can do with LSTMs: Time series forecasting

 

Take this one:

[Plot: a time series with multiple seasonal patterns]

This series exhibits multiple seasonal patterns.

One-step-ahead forecasts

 

Preprocessing:

  • Scale/normalize the data (especially with extremely high values like this!)
  • Again, reshape data to the form <num samples, num timesteps, num features>
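A minimal sketch of that reshaping (series is a hypothetical numeric vector of the hourly values; min-max scaling is chosen purely for illustration):

num_steps <- 24 * 7      # 168, matching the choice below
num_features <- 1

# scale to [0, 1]
series_scaled <- (series - min(series)) / (max(series) - min(series))

# build overlapping windows of length num_steps, plus the value one step ahead
n_samples <- length(series_scaled) - num_steps
X <- array(0, dim = c(n_samples, num_steps, num_features))
y <- numeric(n_samples)
for (i in 1:n_samples) {
  X[i, , 1] <- series_scaled[i:(i + num_steps - 1)]
  y[i]      <- series_scaled[i + num_steps]   # the value one step ahead
}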

Model definition:

model <- keras_model_sequential()

model %>% 
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features)) %>% 
  layer_dense(units = 1) %>% 
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )

 

Where the number of timesteps is chosen to be 24 hours * 7 days = 168.

And after about 20 epochs of training...

 

… here are the forecasts:

[Plot: one-step-ahead forecasts]

How about multi-step-ahead forecasts?

 

We can use the Keras TimeDistributed layer for this.

lstm_units <- 24 * 7 # 24 hours * 7 days = 168

model <- keras_model_sequential()

model %>% 
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features),
             return_sequences = TRUE) %>% 
  time_distributed(layer_dense(units = 1)) %>% 
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )

 

Where the number of predictions equals the number of timesteps: 168.
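Note that with return_sequences = TRUE the LSTM emits one output per input timestep, and time_distributed(layer_dense(units = 1)) applies the same dense layer at each of them, so the targets must now also be a 3-dimensional array of shape <num samples, num timesteps, 1> rather than a vector of single next values.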

168-step-ahead forecasts!

 

And here are the forecasts, 168 steps ahead (only 3 shown):

 

[Plot: 168-step-ahead forecasts]

That's just a small glimpse of what you could do...

 

… with Keras, in R.

Go ahead and have fun :-)

 

Thanks for your attention!!