5. Sequence Data Problems

Deep Learning and Natural Language

Applications

  1. Document classification
  2. Author identification
  3. Conversation constrained by context

In fact, these models do not understand text; they detect the statistical structure of text. Just as computer vision applies pattern recognition to image pixels, deep learning for natural language applies pattern recognition to words, sentences, and paragraphs.

Deep learning models can only process tensors, so they cannot handle raw text directly; the text must first be converted into tensors. There are several ways to do this:

  1. Split the text into words and convert each word into a vector.
  2. Split the text into characters and convert each character into a vector.
  3. Extract N-grams of words or characters and convert each N-gram into a vector. An N-gram is a group of N (or fewer) consecutive words or characters extracted from a sentence. For example, decomposing "I love you" into 2-grams gives {"I", "I love", "love", "love you", "you"}. Such a set is called a bag of 2-grams, and this family of tokenization methods is called bag-of-words (a small code sketch follows this list).
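
The following is a minimal sketch, in base R, of extracting the bag of 2-grams from "I love you"; the variable names (words, bigrams, bag_of_2grams) are purely illustrative.

text <- "I love you"
words <- strsplit(text, " ")[[1]]

# Unigrams plus every pair of consecutive words form the bag of 2-grams
bigrams <- if (length(words) > 1)
  paste(head(words, -1), tail(words, -1)) else character(0)
bag_of_2grams <- unique(c(words, bigrams))
bag_of_2grams
# [1] "I"        "love"     "you"      "I love"   "love you"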

Note that because a bag-of-words representation does not preserve word order, it is typically used in shallow language-processing models rather than in deep learning models.

In summary, text is broken down into units (words, characters, ...); these units are called tokens, and the process of breaking text into tokens is called tokenization.

One-hot encoding

One-hot encoding is the most common way to turn tokens into vectors. The code below shows one-hot encoding at the word level and at the character level.

Word-level one-hot encoding:

# This is our initial data; one entry per "sample"
# (in this toy example, a "sample" is just a sentence, but
# it could be an entire document).
samples <- c("The cat sat on the mat.", "The dog ate my homework.")
  
# First, build an index of all tokens in the data.
token_index <- list()
for (sample in samples)
  # Tokenizes the samples via the strsplit function. In real life, you'd also
  # strip punctuation and special characters from the samples.
  for (word in strsplit(sample, " ")[[1]])
    if (!word %in% names(token_index))
      # Assigns a unique index to each unique word. Note that you don't
      # attribute index 1 to anything.
      token_index[[word]] <- length(token_index) + 2 

# Vectorizes the samples. You'll only consider the first max_length 
# words in each sample.
max_length <- 10

# This is where you store the results.
results <- array(0, dim = c(length(samples), 
                            max_length, 
                            max(as.integer(token_index))))

for (i in 1:length(samples)) {
  sample <- samples[[i]]
  words <- head(strsplit(sample, " ")[[1]], n = max_length)
  for (j in 1:length(words)) {
    index <- token_index[[words[[j]]]]
    results[[i, j, index]] <- 1
  }
}

Character-level one-hot encoding:

samples <- c("The cat sat on the mat.", "The dog ate my homework.")

ascii_tokens <- c("", sapply(as.raw(c(32:126)), rawToChar))
token_index <- c(1:(length(ascii_tokens)))
names(token_index) <- ascii_tokens

max_length <- 50

results <- array(0, dim = c(length(samples), max_length, length(token_index)))

for (i in 1:length(samples)) {
  sample <- samples[[i]]
  characters <- strsplit(sample, "")[[1]]
  for (j in 1:length(characters)) {
    character <- characters[[j]]
    results[i, j, token_index[[character]]] <- 1
  }
}

Keras has built-in utilities for word-level and character-level one-hot encoding.

library(keras)

samples <- c("The cat sat on the mat.", "The dog ate my homework.")

# Creates a tokenizer, configured to only take into account the 1,000 
# most common words, then builds the word index.
tokenizer <- text_tokenizer(num_words = 1000) %>%
  fit_text_tokenizer(samples)
## Loaded Tensorflow version 2.9.2
# Turns strings into lists of integer indices
sequences <- texts_to_sequences(tokenizer, samples)

# You could also directly get the one-hot binary representations. Vectorization 
# modes other than one-hot encoding are supported by this tokenizer.
one_hot_results <- texts_to_matrix(tokenizer, samples, mode = "binary")

# How you can recover the word index that was computed
word_index <- tokenizer$word_index

cat("Found", length(word_index), "unique tokens.\n")
## Found 9 unique tokens.

Word embeddings

Another approach is to use dense word vectors (word embeddings). The tensors obtained from one-hot encoding are binary, sparse (mostly zeros), and very high-dimensional (the dimensionality equals the vocabulary size, possibly tens of thousands).

A word embedding maps each word to a point in a geometric space, so human language is embedded into that space. For example, we expect the distance between two word vectors to reflect the semantic relationship between the corresponding words.

For example, we can embed four words in a two-dimensional plane.

Relationships between words can then be expressed as geometric transformations; for instance, the vector from cat to tiger might correspond to going from pet to wild animal.

However, there is no single word-embedding space that perfectly captures natural language; in practice we learn a new embedding space for each task. An embedding layer is added with layer_embedding.

library(keras)

# The embedding layer takes at least two arguments:
# the number of possible tokens, here 1000 (1 + maximum word index),
# and the dimensionality of the embeddings, here 64.
embedding_layer <- layer_embedding(input_dim = 1000, output_dim = 64) 

The embedding layer takes at least two arguments:

  1. the number of possible tokens
  2. the dimensionality of the embeddings

layer_embedding can be thought of as a dictionary that maps integer indices to word vectors: it takes integers as input, looks up each integer in the dictionary, and returns the associated vector.

layer_embedding takes as input a 2D integer tensor of shape (samples, sequence_length) and returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality). Such a 3D tensor can then be processed by an RNN layer or a 1D convolution layer.
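
As a quick sanity check, here is a toy sketch (made-up sizes, not part of the IMDB example below) that verifies these shapes directly:

library(keras)

# 3 samples, each a sequence of 6 integer word indices in [0, 999]
toy_input <- matrix(sample(0:999, 3 * 6, replace = TRUE), nrow = 3)

toy_model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 1000, output_dim = 64)

# Output shape: (samples, sequence_length, embedding_dimensionality)
dim(predict(toy_model, toy_input))
# 3 6 64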

As an example, let's apply an embedding layer to the IMDB dataset. We restrict the vocabulary to the 10,000 most common words and keep only the first 20 words of each review.

First, prepare the data.

# Number of words to consider as features
max_features <- 10000
# Cut texts after this number of words 
# (among top max_features most common words)
maxlen <- 20

# Load the data as lists of integers.
imdb <- dataset_imdb(num_words = max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

# This turns our lists of integers
# into a 2D integer tensor of shape `(samples, maxlen)`
x_train <- pad_sequences(x_train, maxlen = maxlen)
x_test <- pad_sequences(x_test, maxlen = maxlen)

Build the network, then compile and train the model.

model <- keras_model_sequential() %>% 
  # We specify the maximum input length to our Embedding layer
  # so we can later flatten the embedded inputs
  layer_embedding(input_dim = 10000, output_dim = 8, 
                  input_length = maxlen) %>% 
  # We flatten the 3D tensor of embeddings 
  # into a 2D tensor of shape `(samples, maxlen * 8)`
  layer_flatten() %>% 
  # We add the classifier on top
  layer_dense(units = 1, activation = "sigmoid") 

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.2
)

The validation accuracy reaches about 76%. Considering that each sample keeps only 20 words (only 20 features) and the network is very simple, this is already a reasonable result.

We can optimize further.

Using pretrained word embeddings

There are many precomputed word-embedding databases, such as word2vec and GloVe. Let's see how to use pretrained word embeddings in a Keras model.

First, download the raw IMDB data from http://mng.bz/0tIo
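
If you want to script the download, a possible sketch (assuming the link resolves to the aclImdb_v1.tar.gz archive; if it points to a landing page instead, download the archive manually; the local paths are also assumptions):

download.file("http://mng.bz/0tIo", "/Users/milin/Keras2021/aclImdb_v1.tar.gz")
untar("/Users/milin/Keras2021/aclImdb_v1.tar.gz", exdir = "/Users/milin/Keras2021")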

Start by collecting the training reviews into a character vector, one string per review:

imdb_dir <- "/Users/milin/Keras2021/aclImdb"
train_dir <- file.path(imdb_dir, "train")

labels <- c()
texts <- c()

for (label_type in c("neg", "pos")) {
  label <- switch(label_type, neg = 0, pos = 1)
  dir_name <- file.path(train_dir, label_type)
  for (fname in list.files(dir_name, pattern = glob2rx("*.txt"), 
                           full.names = TRUE)) {
    texts <- c(texts, readChar(fname, file.info(fname)$size))
    labels <- c(labels, label)
  }
}

All the training texts are now stored in a character vector, with one element per review, together with the corresponding labels.

Tokenizing the data

We now tokenize the texts.

library(keras)

maxlen <- 100                 # We will cut reviews after 100 words
training_samples <- 200       # We will be training on 200 samples
validation_samples <- 10000   # We will be validating on 10000 samples
max_words <- 10000            # We will only consider the top 10,000 words in the dataset

tokenizer <- text_tokenizer(num_words = max_words) %>% 
  fit_text_tokenizer(texts)

sequences <- texts_to_sequences(tokenizer, texts)

word_index = tokenizer$word_index
cat("Found", length(word_index), "unique tokens.\n")
## Found 88582 unique tokens.
data <- pad_sequences(sequences, maxlen = maxlen)

labels <- as.array(labels)
cat("Shape of data tensor:", dim(data), "\n")
## Shape of data tensor: 25000 100
cat('Shape of label tensor:', dim(labels), "\n")
## Shape of label tensor: 25000
# Split the data into a training set and a validation set
# But first, shuffle the data, since we started from data
# where sample are ordered (all negative first, then all positive).
indices <- sample(1:nrow(data))
training_indices <- indices[1:training_samples]
validation_indices <- indices[(training_samples + 1): 
                              (training_samples + validation_samples)]

x_train <- data[training_indices,]
y_train <- labels[training_indices]

x_val <- data[validation_indices,]
y_val <- labels[validation_indices]

We truncate reviews after 100 words and train on only 200 samples.

Next, download the GloVe word embeddings from https://nlp.stanford.edu/projects/glove/ : get the glove.6B.zip file, which contains embedding vectors for 400,000 words (we will use the 100-dimensional version).
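
A possible download sketch (the direct-download URL and local paths are assumptions; if in doubt, copy the link from the GloVe project page):

download.file(
  "http://nlp.stanford.edu/data/glove.6B.zip",
  "/Users/milin/Keras2021/glove.6B.zip"
)
unzip("/Users/milin/Keras2021/glove.6B.zip",
      exdir = "/Users/milin/Keras2021/glove.6B")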

Preprocessing the embeddings

Build an index that maps each word to its vector representation.

glove_dir = '/Users/milin/Keras2021/glove.6B'
lines <- readLines(file.path(glove_dir, "glove.6B.100d.txt"))

embeddings_index <- new.env(hash = TRUE, parent = emptyenv())
for (i in 1:length(lines)) {
  line <- lines[[i]]
  values <- strsplit(line, " ")[[1]]
  word <- values[[1]]
  embeddings_index[[word]] <- as.double(values[-1])
}

cat("Found", length(embeddings_index), "word vectors.\n")
## Found 400000 word vectors.

Next, build the word-embedding matrix.

embedding_dim <- 100

embedding_matrix <- array(0, c(max_words, embedding_dim))

for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < max_words) {
    embedding_vector <- embeddings_index[[word]]
    if (!is.null(embedding_vector))
      # Words not found in the embedding index will be all zeros.
      embedding_matrix[index+1,] <- embedding_vector
  }
}

Define the model

model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = max_words, output_dim = embedding_dim, 
                  input_length = maxlen) %>% 
  layer_flatten() %>% 
  layer_dense(units = 32, activation = "relu") %>% 
  layer_dense(units = 1, activation = "sigmoid")

summary(model)
## Model: "sequential_1"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  embedding_2 (Embedding)            (None, 100, 100)                1000000     
##  flatten_1 (Flatten)                (None, 10000)                   0           
##  dense_2 (Dense)                    (None, 32)                      320032      
##  dense_1 (Dense)                    (None, 1)                       33          
## ================================================================================
## Total params: 1,320,065
## Trainable params: 1,320,065
## Non-trainable params: 0
## ________________________________________________________________________________

Next, load the prepared embedding matrix into the embedding layer.

get_layer(model, index = 1) %>% 
  set_weights(list(embedding_matrix)) %>% 
  freeze_weights()

To make sure the pretrained embedding weights are not modified during training, the layer is frozen.

With the embedding matrix in place, compile and train the model.

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 32,
  validation_data = list(x_val, y_val)
)

save_model_weights_hdf5(model, "pre_trained_glove_model.h5")

Visualize the training process.

plot(history)

The model overfits very quickly and the final performance is mediocre. This is expected, because we used only a tiny amount of training data. The point here is only to demonstrate the workflow; you can try training larger models on more data yourself.

The following code trains the same model without pretrained word embeddings, i.e. the embedding layer is learned from scratch.


model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = max_words, output_dim = embedding_dim, 
                  input_length = maxlen) %>% 
  layer_flatten() %>% 
  layer_dense(units = 32, activation = "relu") %>% 
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 32,
  validation_data = list(x_val, y_val)
)

Visualize the training process.

plot(history)

Tokenizing the test data

test_dir <- file.path(imdb_dir, "test")

labels <- c()
texts <- c()

for (label_type in c("neg", "pos")) {
  label <- switch(label_type, neg = 0, pos = 1)
  dir_name <- file.path(test_dir, label_type)
  for (fname in list.files(dir_name, pattern = glob2rx("*.txt"), 
                           full.names = TRUE)) {
    texts <- c(texts, readChar(fname, file.info(fname)$size))
    labels <- c(labels, label)
  }
}

sequences <- texts_to_sequences(tokenizer, texts)
x_test <- pad_sequences(sequences, maxlen = maxlen)
y_test <- as.array(labels)

Evaluate on the test data

model %>% 
  load_model_weights_hdf5("pre_trained_glove_model.h5") %>% 
  evaluate(x_test, y_test, verbose = 0)

Recurrent neural networks

All the networks we have used so far are feedforward networks: each input is processed independently, with no state kept between inputs. To handle sequence data (time series, text sequences, process sequences) with such a network, the entire sequence must first be transformed into a single large vector before being fed into the network.

A recurrent neural network (RNN) is a type of network with an internal loop.

Here is simple pseudocode for an RNN:

state_t <- 0

for(input_t in input_sequence){
  output_t <- f(input_t,state_t)
  state_t <- output_t
}

Here f is a function, for example a transformation followed by an activation. The RNN is a for loop that reuses the output computed during the previous iteration as its state. In other words, the output produced for the first element of a sequence becomes part of the input when processing the second element.
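
To make this concrete, here is a slightly more detailed sketch of the same loop with explicit weight matrices (toy dimensions and random, untrained weights, purely for illustration):

timesteps <- 100
input_features <- 32
output_features <- 64

inputs <- matrix(runif(timesteps * input_features), nrow = timesteps)
state_t <- rep(0, output_features)

# Random weights standing in for learned parameters
W <- matrix(runif(output_features * input_features), nrow = output_features)
U <- matrix(runif(output_features * output_features), nrow = output_features)
b <- runif(output_features)

outputs <- matrix(0, nrow = timesteps, ncol = output_features)
for (t in 1:timesteps) {
  input_t <- inputs[t, ]
  # Combine the current input with the previous state
  output_t <- tanh(as.numeric(W %*% input_t + U %*% state_t) + b)
  outputs[t, ] <- output_t
  state_t <- output_t  # the output becomes the state for the next timestep
}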

Recurrent layers

A simple recurrent layer is created with the layer_simple_rnn() function.

layer_simple_rnn(units = 32)

This layer takes input of shape (batch_size, timesteps, input_features). It can either return the full sequence of outputs, a 3D tensor of shape (batch_size, timesteps, output_features), or only the output for the last timestep, a 2D tensor of shape (batch_size, output_features).

The following network returns only the last output:

library(keras)
model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = 10000, output_dim = 32) %>% # word embedding layer
  layer_simple_rnn(units = 32)

summary(model)

Returning the full sequence of outputs:

model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = 10000, output_dim = 32) %>% # word embedding layer
  layer_simple_rnn(units = 32, return_sequences = TRUE)

summary(model)

When stacking several recurrent layers, all intermediate layers must return the full sequence of outputs.


model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = 10000, output_dim = 32) %>% 
  layer_simple_rnn(units = 32, return_sequences = TRUE) %>% 
  layer_simple_rnn(units = 32, return_sequences = TRUE) %>%
  layer_simple_rnn(units = 32, return_sequences = TRUE) %>%
  layer_simple_rnn(units = 32)  # This last layer only returns the last outputs.

summary(model)

Next, let's use an RNN to classify the IMDB data.

library(keras)

max_features <- 10000  # Number of words to consider as features
maxlen <- 500  # Cuts off texts after this many words (among the max_features most common words)
batch_size <- 32

cat("Loading data...\n")
## Loading data...
imdb <- dataset_imdb(num_words = max_features)
c(c(input_train, y_train), c(input_test, y_test)) %<-% imdb 
cat(length(input_train), "train sequences\n")
## 25000 train sequences
input_train <- pad_sequences(input_train, maxlen = maxlen)
input_test <- pad_sequences(input_test, maxlen = maxlen)
cat("input_train shape:", dim(input_train), "\n")
## input_train shape: 25000 500

Build the network.

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 32) %>% # word embedding layer; takes a 2D integer tensor as input
  layer_simple_rnn(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)

history <- model %>% fit(
  input_train, y_train,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)

View the training process.

plot(history)

The results are not great. Partly this is because we only consider the first 500 words of each review; you could try using the full sequences. More importantly, layer_simple_rnn is not good at processing long sequences such as text.

The LSTM layer

LSTM stands for Long Short-Term Memory. It is a variant of the simple RNN, created to address the vanishing-gradient problem of layer_simple_rnn. The LSTM adds a mechanism for carrying information across many timesteps.

LSTM is good at capturing global, long-term structure, which is why it is widely used in question answering and machine translation.

Put simply, an LSTM layer allows past information to be re-injected at a later time. Let's apply an LSTM to the IMDB classification problem.

model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = max_features, output_dim = 32) %>% # word embedding layer
  layer_lstm(units = 32) %>% 
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop", 
  loss = "binary_crossentropy", 
  metrics = c("acc")
)

history <- model %>% fit(
  input_train, y_train,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)

The change is minimal: we simply replace the earlier layer_simple_rnn(units = 32) with layer_lstm(units = 32).

Visualize the results.

plot(history)

A temperature-forecasting problem

Let's look at a new example, in which we predict the temperature 24 hours after the last data point.

In this example we use data recorded at the weather station of the Max Planck Institute for Biogeochemistry in Jena, Germany. The dataset contains many features: air temperature, atmospheric pressure, humidity, wind direction, and so on. The data are recorded every 10 minutes and cover the years 2009 to 2016.

First, download the dataset.

dir.create("/Users/milin/Keras2021/jena_climate", recursive = TRUE)
download.file(
  "https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip",
  "/Users/milin/Keras2021/jena_climate/jena_climate_2009_2016.csv.zip"
)
unzip(
  "/Users/milin/Keras2021/jena_climate/jena_climate_2009_2016.csv.zip",
  exdir = "/Users/milin/Keras2021/jena_climate"
)

Inspect the data.

library(tibble)
library(readr)

data_dir <- "jena_climate"
fname <- file.path(data_dir, "jena_climate_2009_2016.csv")
data <- read_csv(fname)
## Rows: 420451 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Date Time
## dbl (14): p (mbar), T (degC), Tpot (K), Tdew (degC), rh (%), VPmax (mbar), V...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(data)
## Rows: 420,451
## Columns: 15
## $ `Date Time`       <chr> "01.01.2009 00:10:00", "01.01.2009 00:20:00", "01.01…
## $ `p (mbar)`        <dbl> 996.52, 996.57, 996.53, 996.51, 996.51, 996.50, 996.…
## $ `T (degC)`        <dbl> -8.02, -8.41, -8.51, -8.31, -8.27, -8.05, -7.62, -7.…
## $ `Tpot (K)`        <dbl> 265.40, 265.01, 264.91, 265.12, 265.15, 265.38, 265.…
## $ `Tdew (degC)`     <dbl> -8.90, -9.28, -9.31, -9.07, -9.04, -8.78, -8.30, -8.…
## $ `rh (%)`          <dbl> 93.3, 93.4, 93.9, 94.2, 94.1, 94.4, 94.8, 94.4, 93.8…
## $ `VPmax (mbar)`    <dbl> 3.33, 3.23, 3.21, 3.26, 3.27, 3.33, 3.44, 3.44, 3.36…
## $ `VPact (mbar)`    <dbl> 3.11, 3.02, 3.01, 3.07, 3.08, 3.14, 3.26, 3.25, 3.15…
## $ `VPdef (mbar)`    <dbl> 0.22, 0.21, 0.20, 0.19, 0.19, 0.19, 0.18, 0.19, 0.21…
## $ `sh (g/kg)`       <dbl> 1.94, 1.89, 1.88, 1.92, 1.92, 1.96, 2.04, 2.03, 1.97…
## $ `H2OC (mmol/mol)` <dbl> 3.12, 3.03, 3.02, 3.08, 3.09, 3.15, 3.27, 3.26, 3.16…
## $ `rho (g/m**3)`    <dbl> 1307.75, 1309.80, 1310.24, 1309.19, 1309.00, 1307.86…
## $ `wv (m/s)`        <dbl> 1.03, 0.72, 0.19, 0.34, 0.32, 0.21, 0.18, 0.19, 0.28…
## $ `max. wv (m/s)`   <dbl> 1.75, 1.50, 0.63, 0.50, 0.63, 0.63, 0.63, 0.50, 0.75…
## $ `wd (deg)`        <dbl> 152.3, 136.1, 171.6, 198.0, 214.3, 192.7, 166.5, 118…
dim(data)
## [1] 420451     15

The dataset has 420,451 rows and 15 columns. Let's plot the temperature time series.


library(ggplot2)
ggplot(data, aes(x = 1:nrow(data), y = `T (degC)`)) + geom_line() +theme_classic()

The data are recorded every 10 minutes, so there are 144 points per day. Let's look at the first 20 days.

ggplot(data[1:(1440*2),], aes(x = 1:(1440*2), y = `T (degC)`)) + geom_line() + theme_classic()

The temperature clearly shows periodicity.

Next, prepare the data. The goal is to predict the temperature 24 hours ahead.

First, convert the data frame to a matrix (dropping the date/time column).

data <- data.matrix(data[,-1])

Use the first 200,000 rows as the training set and normalize the data using statistics computed on that training portion.

train_data <- data[1:200000,]
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
data <- scale(data, center = mean, scale = std)

Below is a data generator. Its arguments are:

  1. data: the original array of data
  2. lookback: how many timesteps back the input data should go (1 step = 10 minutes)
  3. delay: how many timesteps in the future the target should be; to predict 24 hours ahead, delay = 144 (with a 10-minute interval there are 144 steps per day)
  4. min_index, max_index: indices delimiting which part of the data to draw from
  5. shuffle: whether to shuffle the samples
  6. batch_size: the number of samples per batch
  7. step: the sampling period in timesteps; 6 means one data point per hour

generator <- function(data, lookback, delay, min_index, max_index,
                      shuffle = FALSE, batch_size = 128, step = 6) {
  if (is.null(max_index))
    max_index <- nrow(data) - delay - 1
  i <- min_index + lookback
  function() {
    if (shuffle) {
      rows <- sample(c((min_index+lookback):max_index), size = batch_size)
    } else {
      if (i + batch_size >= max_index)
        i <<- min_index + lookback
      rows <- c(i:min(i+batch_size, max_index))
      i <<- i + length(rows)
    }
    
    samples <- array(0, dim = c(length(rows), 
                                lookback / step,
                                dim(data)[[-1]]))
    targets <- array(0, dim = c(length(rows)))
                     
    for (j in 1:length(rows)) {
      indices <- seq(rows[[j]] - lookback, rows[[j]], 
                     length.out = dim(samples)[[2]])
      samples[j,,] <- data[indices,]
      targets[[j]] <- data[rows[[j]] + delay,2]
    }            
    
    list(samples, targets)
  }
}

Construct the training, validation, and test generators.

lookback <- 1440 # look back 10 days
step <- 6 # one data point per hour
delay <- 144 # predict 24 hours ahead
batch_size <- 128

train_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 1,
  max_index = 200000,
  shuffle = TRUE,
  step = step, 
  batch_size = batch_size
)

val_gen = generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 200001,
  max_index = 300000,
  step = step,
  batch_size = batch_size
)

test_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 300001,
  max_index = NULL,
  step = step,
  batch_size = batch_size
)

# This is how many steps to draw from `val_gen`
# in order to see the whole validation set:
val_steps <- (300000 - 200001 - lookback) / batch_size

  # This is how many steps to draw from `test_gen`
# in order to see the whole test set:
test_steps <- (nrow(data) - 300001 - lookback) / batch_size

Let's start with a simple, common-sense baseline: assume that the temperature 24 hours from now will be the same as it is right now, and evaluate this baseline with the mean absolute error (MAE), computed as follows.

mean(abs(preds - targets))

evaluate_naive_method <- function() {
  batch_maes <- c()
  for (step in 1:val_steps) {
    c(samples, targets) %<-% val_gen()
    preds <- samples[,dim(samples)[[2]],2]
    mae <- mean(abs(preds - targets))
    batch_maes <- c(batch_maes, mae)
  }
  print(mean(batch_maes))
}

The baseline MAE is about 0.277. Because the data were normalized, this corresponds to an absolute error of about 2.6 °C. Any model we train should at least beat this baseline.
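
As a sketch of where the Celsius figure comes from: multiply the normalized MAE by the standard deviation of the temperature column (column 2 of the matrix), which was saved in std during normalization.

celsius_mae <- 0.277 * std[[2]]  # roughly the baseline error in degrees Celsius
celsius_mae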

We build the network with a GRU layer. A GRU is similar to an LSTM but computationally cheaper, although its representational power is somewhat lower.

model <- keras_model_sequential() %>% 
  layer_gru(units = 32, input_shape = list(NULL, dim(data)[[-1]])) %>% 
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 20,
  validation_data = val_gen,
  validation_steps = val_steps
)
## Warning in fit_generator(., train_gen, steps_per_epoch = 500, epochs = 20, :
## `fit_generator` is deprecated. Use `fit` instead, it now accept generators.

View the training process.

plot(history)

To fight overfitting, we add dropout: the dropout argument sets the dropout rate for the layer's input units, and recurrent_dropout sets the dropout rate for the recurrent units.

model <- keras_model_sequential() %>% 
  layer_gru(units = 32, dropout = 0.2, recurrent_dropout = 0.2,
            input_shape = list(NULL, dim(data)[[-1]])) %>% 
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 40,
  validation_data = val_gen,
  validation_steps = val_steps
)
## Warning in fit_generator(., train_gen, steps_per_epoch = 500, epochs = 40, :
## `fit_generator` is deprecated. Use `fit` instead, it now accept generators.
plot(history)

To further improve the model, we can stack additional layer_gru layers; adding layers or units increases the network's capacity. Note that intermediate recurrent layers must return their full sequence of outputs, which is done by setting return_sequences = TRUE.

model <- keras_model_sequential() %>% 
  layer_gru(units = 32, 
            dropout = 0.1, 
            recurrent_dropout = 0.5,
            return_sequences = TRUE,
            input_shape = list(NULL, dim(data)[[-1]])) %>% 
  layer_gru(units = 64, activation = "relu",
            dropout = 0.1,
            recurrent_dropout = 0.5) %>% 
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 40,
  validation_data = val_gen,
  validation_steps = val_steps
)
## Warning in fit_generator(., train_gen, steps_per_epoch = 500, epochs = 40, :
## `fit_generator` is deprecated. Use `fit` instead, it now accept generators.

View the results.

plot(history)

Going further, we can use a bidirectional RNN, something of a Swiss Army knife for natural-language processing, i.e. widely applicable and very effective.

What is a bidirectional RNN?

It processes a sequence not only in its forward (chronological) order but also in reverse order. In Keras, a bidirectional RNN is created with the bidirectional() function.

k_clear_session()
model <- keras_model_sequential() %>% 
  bidirectional(
    layer_gru(units = 32), input_shape = list(NULL, dim(data)[[-1]])
  ) %>% 
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 40,
  validation_data = val_gen,
  validation_steps = val_steps
)
## Warning in fit_generator(., train_gen, steps_per_epoch = 500, epochs = 40, :
## `fit_generator` is deprecated. Use `fit` instead, it now accept generators.

Bidirectional RNNs are mostly used in natural language processing. For this temperature problem they are not really appropriate: we are predicting the future, so only the chronological order matters, and the reversed order contributes little.

Further improvements

  1. Try more layers and more units per layer.
  2. Adjust the parameters of the RMSprop optimizer, such as its learning rate (see the sketch below).
  3. ...

Deep learning is as much an art as a science (sometimes described as alchemy or magic): every problem is unique, no theory can tell you in advance what will work best, and you have to experiment and iterate.
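
For instance, a lower RMSprop learning rate (1e-4 here is just an illustrative value) can be passed when compiling:

model %>% compile(
  optimizer = optimizer_rmsprop(learning_rate = 1e-4),
  loss = "mae"
)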

Using convolutional networks for sequence problems

We previously used 2D convolutional networks for computer-vision problems. One-dimensional convolutional networks can rival RNNs on certain sequence problems at a much lower computational cost, making them a good alternative for text classification and time-series forecasting.

1D convnets work much like 2D convnets. Let's look at an example: classifying IMDB reviews with a 1D convnet.

First, prepare the data.

library(keras)

max_features <- 10000
max_len <- 500

cat("Loading data...\n")
## Loading data...
imdb <- dataset_imdb(num_words = max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb 
cat(length(x_train), "train sequences\n")
## 25000 train sequences
cat(length(x_test), "test sequences")
## 25000 test sequences
cat("Pad sequences (samples x time)\n")
## Pad sequences (samples x time)
x_train <- pad_sequences(x_train, maxlen = max_len)
x_test <- pad_sequences(x_test, maxlen = max_len)
cat("x_train shape:", dim(x_train), "\n")
## x_train shape: 25000 500
cat("x_test shape:", dim(x_test), "\n")
## x_test shape: 25000 500

A 1D convnet is structured like its 2D counterpart: a stack of layer_conv_1d and layer_max_pooling_1d layers, ending with a global pooling layer or a layer_flatten layer followed by dense layers.

One difference is that 1D convolutions can afford larger windows: for a 2D convolution, a 3 x 3 window already covers 9 pixels, whereas a 1D window of size 3 covers only 3 features, so larger windows remain cheap.

Build the network.

model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = max_features, output_dim = 128,
                  input_length = max_len) %>% 
  layer_conv_1d(filters = 32, kernel_size = 7, activation = "relu") %>% 
  layer_max_pooling_1d(pool_size = 5) %>% 
  layer_conv_1d(filters = 32, kernel_size = 7, activation = "relu") %>% 
  layer_global_max_pooling_1d() %>% 
  layer_dense(units = 1)

summary(model)
## Model: "sequential_1"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  embedding (Embedding)              (None, 500, 128)                1280000     
##  conv1d_1 (Conv1D)                  (None, 494, 32)                 28704       
##  max_pooling1d (MaxPooling1D)       (None, 98, 32)                  0           
##  conv1d (Conv1D)                    (None, 92, 32)                  7200        
##  global_max_pooling1d (GlobalMaxPoo  (None, 32)                     0           
##  ling1D)                                                                        
##  dense_1 (Dense)                    (None, 1)                       33          
## ================================================================================
## Total params: 1,315,937
## Trainable params: 1,315,937
## Non-trainable params: 0
## ________________________________________________________________________________

Compile and train the model.

model %>% compile(
  optimizer = optimizer_rmsprop(learning_rate = 1e-4),
  loss = "binary_crossentropy",
  metrics = c("acc")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)

View the results.

plot(history)

1D convnets are faster to train and are often a very good alternative to RNNs.

Combining RNNs and 1D convnets

# This was previously set to 6 (one point per hour).
# Now 3 (one point per 30 min).
step <- 3 
lookback <- 720  # Unchanged
delay <- 144  # Unchanged
  
train_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 1,
  max_index = 200000,
  shuffle = TRUE,
  step = step
)

val_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 200001,
  max_index = 300000,
  step = step
)

test_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 300001,
  max_index = NULL,
  step = step
)

val_steps <- (300000 - 200001 - lookback) / 128
test_steps <- (nrow(data) - 300001 - lookback) / 128

The model starts with two layer_conv_1d() layers and then adds a layer_gru() layer:

model <- keras_model_sequential() %>% 
  layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu",
                input_shape = list(NULL, dim(data)[[-1]])) %>% 
  layer_max_pooling_1d(pool_size = 3) %>% 
  layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu") %>% 
  layer_gru(units = 32, dropout = 0.1, recurrent_dropout = 0.5) %>% 
  layer_dense(units = 1)

summary(model)
## Model: "sequential_2"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  conv1d_3 (Conv1D)                  (None, None, 32)                2272        
##  max_pooling1d_1 (MaxPooling1D)     (None, None, 32)                0           
##  conv1d_2 (Conv1D)                  (None, None, 32)                5152        
##  gru_1 (GRU)                        (None, 32)                      6336        
##  dense_2 (Dense)                    (None, 1)                       33          
## ================================================================================
## Total params: 13,793
## Trainable params: 13,793
## Non-trainable params: 0
## _______________________________________________________________________________

Train the model.

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 20,
  validation_data = val_gen,
  validation_steps = val_steps
)
## Warning in fit_generator(., train_gen, steps_per_epoch = 500, epochs = 20, :
## `fit_generator` is deprecated. Use `fit` instead, it now accept generators.

Visualize the results.


plot(history)

This is a useful trick: the convolutional layers downsample the sequence before the RNN sees it, which lets us process much longer sequences more efficiently.

Summary

  1. RNNs and 1D convnets can be used for sequence regression, classification, and similar problems.
  2. If the global order of the sequence matters, an RNN (for example a bidirectional RNN) is a good choice.
  3. If the global order is not important, a 1D convnet performs comparably well and is more efficient.

6. The End Is Also a Beginning

Generating text with an LSTM

The basic idea is to train a language model based on an LSTM network; in fact, other layer types, such as 1D convolutions, can achieve the same thing.

First, prepare the data.

library(keras)
library(stringr)
k_clear_session()
path <- get_file(
  "nietzsche.txt",
  origin = "https://s3.amazonaws.com/text-datasets/nietzsche.txt"
)
text <- tolower(readChar(path, file.info(path)$size))
cat("Corpus length:", nchar(text), "\n")
## Corpus length: 335872

Then, one-hot encode the characters.

maxlen <- 60  # Length of extracted character sequences

step <- 3  # We sample a new sequence every `step` characters
  
text_indexes <- seq(1, nchar(text) - maxlen, by = step)

# This holds our extracted sequences
sentences <- str_sub(text, text_indexes, text_indexes + maxlen - 1)

# This holds the targets (the follow-up characters)
next_chars <- str_sub(text, text_indexes + maxlen, text_indexes + maxlen)

cat("Number of sequences: ", length(sentences), "\n")
## Number of sequences:  111938
# List of unique characters in the corpus
chars <- unique(sort(strsplit(text, "")[[1]]))
cat("Unique characters:", length(chars), "\n")
## Unique characters: 52
# Dictionary mapping unique characters to their index in `chars`
char_indices <- 1:length(chars) 
names(char_indices) <- chars

# Next, one-hot encode the characters into binary arrays.
cat("Vectorization...\n") 
## Vectorization...
x <- array(0L, dim = c(length(sentences), maxlen, length(chars)))
y <- array(0L, dim = c(length(sentences), length(chars)))
for (i in 1:length(sentences)) {
  sentence <- strsplit(sentences[[i]], "")[[1]]
  for (t in 1:length(sentence)) {
    char <- sentence[[t]]
    x[i, t, char_indices[[char]]] <- 1
  }
  next_char <- next_chars[[i]]
  y[i, char_indices[[next_char]]] <- 1
}

Build the LSTM network.

model <- keras_model_sequential() %>% 
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>% 
  layer_dense(units = length(chars), activation = "softmax")

Compile the model.

optimizer <- optimizer_rmsprop(learning_rate = 0.01)

model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = optimizer
)   

Next, given a trained model and a seed text, new text can be generated with the following strategy (this is not the only possible strategy):

  1. Draw from the model the probability distribution over the next character.
  2. Reweight the distribution to a given temperature.
  3. Sample the next character at random according to the reweighted distribution.
  4. Append the new character to the end of the available text.

sample_next_char <- function(preds, temperature = 1.0) {
  preds <- as.numeric(preds)
  preds <- log(preds) / temperature
  exp_preds <- exp(preds)
  preds <- exp_preds / sum(exp_preds)
  which.max(t(rmultinom(1, 1, preds)))
}
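
To see what the temperature does, here is a small sketch (toy distribution, hypothetical reweight helper) that isolates the reweighting step:

reweight <- function(preds, temperature) {
  preds <- log(preds) / temperature
  exp(preds) / sum(exp(preds))
}

p <- c(0.5, 0.3, 0.2)
round(reweight(p, 0.2), 3)  # ~ 0.919 0.071 0.009  (low temperature: nearly deterministic)
round(reweight(p, 1.0), 3)  # 0.5 0.3 0.2          (temperature 1: unchanged)
round(reweight(p, 1.5), 3)  # ~ 0.444 0.316 0.241  (high temperature: closer to uniform)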

Finally, we repeatedly train the model and generate text; one iteration of that loop is shown below.

 # Fit the model for 1 epoch on the available training data
  model %>% fit(x, y, batch_size = 128, epochs = 1) 
  
  # Select a text seed at random
  start_index <- sample(1:(nchar(text) - maxlen - 1), 1)  
  seed_text <- str_sub(text, start_index, start_index + maxlen - 1)
  
  cat("--- Generating with seed:", seed_text, "\n\n")
## --- Generating with seed: st degree
## suitable to give rise to exceptional men of the mo
  for (temperature in c(0.2, 0.5, 1.0, 1.2)) {
    
    cat("------ temperature:", temperature, "\n")
    cat(seed_text, "\n")
    
    generated_text <- seed_text
    
     # We generate 400 characters
    for (i in 1:400) {
      
      sampled <- array(0, dim = c(1, maxlen, length(chars)))
      generated_chars <- strsplit(generated_text, "")[[1]]
      for (t in 1:length(generated_chars)) {
        char <- generated_chars[[t]]
        sampled[1, t, char_indices[[char]]] <- 1
      }
        
      preds <- model %>% predict(sampled, verbose = 0)
      next_index <- sample_next_char(preds[1,], temperature)
      next_char <- chars[[next_index]]
      
      generated_text <- paste0(generated_text, next_char)
      generated_text <- substring(generated_text, 2)
      
      cat(next_char)
    }
    cat("\n\n")
  }
## ------ temperature: 0.2 
## st degree
## suitable to give rise to exceptional men of the mo 
## des and a profuen of the spise the spirit of the man and the sout and spise indeelly the self the spirit the spechation of the spilition of the spirit the spirit inder and all the spilition of the spilition of the opposence of the spilition of the spinition of the spilition of the religion of the spinition of the spilition of the spise and desine of the spiention of the spilition of the spile of t
## 
## ------ temperature: 0.5 
## st degree
## suitable to give rise to exceptional men of the mo 
## ralise a deether of with the recion and desine as for the groming is a the secientice of the groust perhaps and who with seciouss the mealicise the of the them every of the mast cand the seally and diffadiciant indingers of the indincers and for a forment of the groment stinction of supperether the propered is all the same of the comether are a doust and dose is decines and spise and prising to th
## 
## ------ temperature: 1 
## st degree
## suitable to give rise to exceptional men of the mo 
## nge pristity wirhs a pe for graud
## facued gof nowlicial
## .
## an to-that and the reacinesal a
## mearm
## progabed aang to be waketror alsot--perpare.
## trae, or like lopting, strucht the really h! wor the sole "worterd
## of i man then of lecaty re. one ofienes goom s;med6t? which to i spood at skorn the hald bebority, aci1), and feew and tall-ficsed amays indeacage of no in that conciness, thatygy if the indolu
## 
## ------ temperature: 1.2 
## st degree
## suitable to give rise to exceptional men of the mo 
## le? of pelintjeshovesise.
## 
## 10qo quhen the figest
## zory meid a regolience towce eluover.  phivosiyoun quint reseluisentansce lowing pechaply theytt refurer a moro aicausgly,
## pen to which wigh, the regurening thatse sfpleliegly of dibsely, of
## pibed amewerechy. in wore--everychecflyua spereacy preccomaxication, grorned
## in oredy
## ablet ounds.
## 
## me7 im
## art nown of and stallw!
## 
## 1z7 hy ramoks,ly. thes inssi 

By training a bigger model, for longer, on more data, we can obtain more coherent and more realistic-looking samples. Of course, the generated text does not actually mean anything; it is merely sampled from a statistical model.

There is a lot we have not covered. For example, the networks built so far are mostly simple stacks of layers; with the keras_model function you can build networks with much more complex topologies, such as multi-input models, multi-output models, and directed acyclic graphs of layers.

Deep learning can also do DeepDream, an image-modification technique, as well as neural style transfer. Images can be generated with variational autoencoders (VAEs) and generative adversarial networks (GANs), the foundations of AI painting. These topics are left for further study.

Summary

Network architectures at a glance

Building a network is like assembling building blocks: by chaining layers together, you can map almost anything to almost anything. Below is a summary of input modalities and the network architectures suited to them.

  1. Vector data: densely connected networks
  2. Image data: 2D convolutional networks
  3. Sound data: 1D convolutional networks (preferred) or RNNs
  4. Text data: 1D convolutional networks (preferred) or RNNs
  5. Other sequence data where order matters (e.g. time series): RNNs (preferred) or 1D convolutional networks
  6. Video data: 3D convolutional networks

What deep learning can do

Again, by chaining layers together you can map almost anything to almost anything:

  1. Vector data to vector data: for example, predicting patient outcomes from medical records
  2. Image data to vector data: estimating a person's age from a photo; autonomous driving
  3. Time-series data to vector data: weather forecasting; brain-computer interfaces that map MEG time series to computer commands
  4. Text to text: smart replies, question answering
  5. Images to text: generating textual descriptions of images
  6. Text to images: AI image generation
  7. ... almost anything is possible

Keep in mind that deep learning relies heavily on large amounts of human-annotated data. In essence, until the day AI no longer needs human labeling and can define problems for itself, it will not possess true intelligence.