Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to report an emergency they are observing in real time. Because of this, more and more agencies (e.g., disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.
In this project, we study tweets with NLP and classify which ones announce a disaster.
The training data looks as follows:
train <- read_csv("train.csv")
test <- read_csv("test.csv")
train
## # A tibble: 7,613 × 5
## id keyword location text target
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 <NA> <NA> Our Deeds are the Reason of this #earthquake M… 1
## 2 4 <NA> <NA> Forest fire near La Ronge Sask. Canada 1
## 3 5 <NA> <NA> All residents asked to 'shelter in place' are … 1
## 4 6 <NA> <NA> 13,000 people receive #wildfires evacuation or… 1
## 5 7 <NA> <NA> Just got sent this photo from Ruby #Alaska as … 1
## 6 8 <NA> <NA> #RockyFire Update => California Hwy. 20 closed… 1
## 7 10 <NA> <NA> #flood #disaster Heavy rain causes flash flood… 1
## 8 13 <NA> <NA> I'm on top of the hill and I can see a fire in… 1
## 9 14 <NA> <NA> There's an emergency evacuation happening now … 1
## 10 15 <NA> <NA> I'm afraid that the tornado is coming to our a… 1
## # … with 7,603 more rows
It contains 7613 observations and 5 variables, which are:
id: the index of each twitter (contains 7613 unique values)
keyword: the key word of each twitter;
location: the posting location;
text: twitter text; one of the most important features in our prediction
target: response variable; whether it announces a disaster or not
The following figure shows the missing values in each variable. The explanatory variable keyword has very few missing values; location has a large number of missing values; neither text nor target has any missing values.
train %>%
is.na() %>%
reshape2::melt() %>%
ggplot(aes(Var2, Var1, fill=value)) +
geom_raster() +
coord_flip() +
scale_y_continuous(NULL, expand = c(0, 0)) +
scale_fill_grey(name = "",
labels = c("Present",
"Missing")) +
xlab("Observation") +
theme(axis.text.y = element_text(size = 6)) +
labs(title = "Missing Values Visualization",
caption = "Figure 1")
Due to the large amount of missing values, location will not be considered in our model.
train %<>%
select(-location)
Keywords
The following bar chart shows the top 20 keywords by frequency.
train %>%
group_by(keyword) %>%
summarise(freq = n()) %>%
arrange(freq) %>%
top_n(20) %>%
ggplot(mapping = aes(x = reorder(keyword, freq), y = freq)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(caption = "Figure 2")
The most frequent keyword is NA, i.e., missing. Besides, we notice that many "different" keywords express very similar meanings, such as collided and collision, weapon and weapons, wreckage and wrecked, sinking and sunk. To shrink the number of unique values, words with similar meanings should be merged into a single keyword.
Some pairs, like collided and collision or weapon and weapons, are easy to detect: if the first \(k\) letters match, the two words are marked as one. A large \(k\) may leave obvious pairs uncombined, while a small \(k\) may accidentally merge words with different meanings. After several tries, we decide to set \(k = 4\).
simplifiedWord <- c()
k <- 4
uniqueKeywords <- unique(c(train$keyword, test$keyword))
# Walk through the keywords and reuse an already-seen keyword
# whenever the first k letters match.
for (word in rev(uniqueKeywords)) {
  if (is.na(word)) {
    newword <- "NA"
  } else {
    if (str_length(word) < k) {
      newword <- word
    } else {
      whichMatch <- substr(simplifiedWord, 1, k) %in% substr(word, 1, k)
      if (any(whichMatch)) {
        # reuse the first previously-seen keyword with the same prefix
        newword <- simplifiedWord[whichMatch][1L]
      } else {
        newword <- word
      }
    }
  }
  simplifiedWord <- c(simplifiedWord, newword)
}
simplifiedWord <- rev(simplifiedWord)
After this process, the 222 unique keywords are shrunk to 141. However, some pairs, like sunk and sinking or blew%20up and blown%20up, cannot be combined this way. Fortunately, going through 141 words by hand does not take too much time.
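A quick check of these counts, run before the manual fixes below:
length(uniqueKeywords)          # 222 unique keywords (including NA)
length(unique(simplifiedWord))  # 141 after prefix merging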
simplifiedWord[simplifiedWord == "blew%20up"] <- "blown%20up"
simplifiedWord[simplifiedWord == "bleeding"] <- "bloody"
simplifiedWord[simplifiedWord == "dead"] <- "death"
Now the 141 unique keywords are reduced to 138. For building a model, 138 unique keywords is still a lot. Some words, like fire and burn, or fatality and death, are synonyms. For now, we leave them as they are; if necessary (i.e., if the training performance is unsatisfactory), we will go back and merge synonyms via WordNet (Miller 1995).
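As a sketch of that fallback (not part of the actual pipeline), synonyms could be looked up with the wordnet R package; this assumes the WordNet database and rJava are installed locally, and the dictionary path below is only a placeholder.
library(wordnet)
# setDict("path/to/WordNet/dict")  # placeholder: point this at your WordNet installation
synonyms("death", "NOUN")          # list WordNet noun synonyms of "death"
synonyms("fire", "NOUN")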
newKeyWords <- tibble(
keyword = uniqueKeywords,
simplifiedKeywords = simplifiedWord
)
uniqueKeywords <- unique(newKeyWords$simplifiedKeywords)
train <- train %>%
left_join(newKeyWords, by = "keyword") %>%
mutate(keyword = simplifiedKeywords) %>%
select(-simplifiedKeywords)
test <- test %>%
left_join(newKeyWords, by = "keyword") %>%
mutate(keyword = simplifiedKeywords) %>%
select(-simplifiedKeywords)
train %>%
group_by(keyword) %>%
summarise(freq = n()) %>%
arrange(freq) %>%
top_n(20) %>%
ggplot(mapping = aes(x = reorder(keyword, freq), y = freq)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(caption = "Figure 3")
After merging, collision has the highest frequency, followed by fatality, wrecked, etc. Next, we are interested in which keywords announce the most disasters.
train %>%
group_by(keyword) %>%
summarise(disasterProp = sum(target)/n()) -> newData
g1 <- newData %>%
arrange(disasterProp) %>%
top_n(20) %>%
ungroup() %>%
ggplot(mapping = aes(x = reorder(keyword, disasterProp), y = disasterProp)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() +
ggtitle("Top 20")+
labs(caption = " ")
g2 <- newData %>%
arrange(disasterProp) %>%
top_n(-20) %>%
ungroup() %>%
ggplot(mapping = aes(x = reorder(keyword, disasterProp), y = disasterProp)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip(ylim = c(0, 1)) +
ggtitle("Bottom 20") +
labs(caption = "Figure 4")
grid.arrange(grobs = list(g1, g2), nrow = 1)
If the keyword is debris, outbreak, typhoon, etc., the probability that the tweet announces a disaster is approximately 1; if the keyword is aftershock, ruin, or blight, the probability is approximately 0.
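As a quick follow-up, the keywords at those extremes can be listed directly from newData; the 0.95 and 0.05 cut-offs are arbitrary choices, not part of the original analysis.
newData %>%
  filter(disasterProp >= 0.95 | disasterProp <= 0.05) %>%
  arrange(desc(disasterProp))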
Text manipulation
The text data is still fairly raw. Moreover, several tweets in the training set appear more than once with conflicting labels. This is bad news for a neural network (backpropagation may be "confused" when updating the gradients), so such tweets should be relabeled or removed.
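A minimal sketch of one possible way to drop such conflicting duplicates (not necessarily the exact step taken here): find the ids of tweets whose text appears with more than one label, then remove them.
# ids of tweets whose text appears with more than one label
conflict_ids <- train %>%
  group_by(text) %>%
  filter(n_distinct(target) > 1) %>%
  pull(id)
# drop those tweets from the training set
train <- train %>% filter(!id %in% conflict_ids)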
library(tidytext)
head(train$text)
## [1] "Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all"
## [2] "Forest fire near La Ronge Sask. Canada"
## [3] "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected"
## [4] "13,000 people receive #wildfires evacuation orders in California"
## [5] "Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school"
## [6] "#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires"
# remove certain strings
## &amp; is the HTML-escaped form of `&`
## &lt; is the HTML-escaped form of `<`
## &gt; is the HTML-escaped form of `>`
## the long pattern matches (possibly &amp;-prefixed) http/https URLs
remove_reg <- "&amp;|&lt;|&gt;|&amp;?http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
tidy_train <- train %>%
mutate(text = str_remove_all(text, remove_reg))%>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, "[a-z]")) %>%
mutate(word = gsub("#|@", "", word))
tidy_test <- test %>%
mutate(text = str_remove_all(text, remove_reg))%>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, "[a-z]")) %>%
mutate(word = gsub("#|@", "", word))
tidy_train
## # A tibble: 57,693 × 4
## id keyword target word
## <dbl> <chr> <dbl> <chr>
## 1 1 NA 1 deeds
## 2 1 NA 1 reason
## 3 1 NA 1 earthquake
## 4 1 NA 1 allah
## 5 1 NA 1 forgive
## 6 4 NA 1 forest
## 7 4 NA 1 fire
## 8 4 NA 1 la
## 9 4 NA 1 ronge
## 10 4 NA 1 sask
## # … with 57,683 more rows
An interesting approach is to compare how often each word is used by each target class (0 or 1). First, count the occurrences of each word within each class; then turn the "long form" data into "wide form" data. After that, compute the log risk ratio for word \(i\):
\[\text{log risk ratio}_i = \log(\frac{\frac{n_{1i} + 1}{\text{total}_1 + 1}}{\frac{n_{0i} + 1}{\text{total}_0 + 1}})\]
where \(n_{1i}\) is the count of word \(i\) in the disaster group (group 1) and \(n_{0i}\) is its count in the non-disaster group (group 0); \(\text{total}_1\) and \(\text{total}_0\) are the total word counts in groups 1 and 0.
If \(\text{log risk ratio}_i\) is large (much greater than 0), a tweet containing this word is more likely to belong to group 1; if it is very negative, the tweet is more likely to belong to group 0.
If \(\text{log risk ratio}_i\) is approximately 0, the word appears with roughly equal chance in both groups, so it is of little use in distinguishing the two.
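To make the formula concrete, here is a worked example with made-up counts: suppose word \(i\) appears 50 times among 30,000 disaster-group tokens and 2 times among 27,000 non-disaster tokens.
# hypothetical counts, purely for illustration
n1 <- 50; total1 <- 30000
n0 <- 2;  total0 <- 27000
log(((n1 + 1) / (total1 + 1)) / ((n0 + 1) / (total0 + 1)))
# about 2.73, i.e., the word is strongly associated with group 1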
tidy_train %>%
count(word, target) %>%
group_by(word) %>%
ungroup() %>%
pivot_wider(names_from = target, values_from = n, values_fill = 0) %>%
mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
mutate(logratio = log(`1` / `0`)) -> logRisks
logRisks %>%
arrange(desc(logratio))
## # A tibble: 17,030 × 4
## word `1` `0` logratio
## <chr> <dbl> <dbl> <dbl>
## 1 mh370 0.00265 0.0000328 4.39
## 2 northern 0.00239 0.0000328 4.29
## 3 legionnaires 0.00232 0.0000328 4.26
## 4 debris 0.00188 0.0000328 4.05
## 5 severe 0.00177 0.0000328 3.99
## 6 hiroshima 0.00320 0.0000655 3.89
## 7 derailment 0.00155 0.0000328 3.85
## 8 migrants 0.00147 0.0000328 3.81
## 9 investigators 0.00140 0.0000328 3.75
## 10 mosque 0.00132 0.0000328 3.70
## # … with 17,020 more rows
Words such as mh370, northern, legionnaires, and bomber suggest that a tweet announces a disaster.
logRisks %>%
arrange(logratio)
## # A tibble: 17,030 × 4
## word `1` `0` logratio
## <chr> <dbl> <dbl> <dbl>
## 1 bags 0.0000736 0.00141 -2.95
## 2 aftershock 0.0000368 0.000655 -2.88
## 3 bag 0.0000736 0.00131 -2.88
## 4 ruin 0.0000736 0.00124 -2.83
## 5 lmao 0.0000368 0.000557 -2.72
## 6 ebay 0.0000736 0.00108 -2.69
## 7 king 0.0000368 0.000491 -2.59
## 8 career 0.0000368 0.000459 -2.52
## 9 electrocute 0.0000736 0.000917 -2.52
## 10 louis 0.0000368 0.000426 -2.45
## # … with 17,020 more rows
Words such as bags, aftershock, bag, and ruin suggest that a tweet does not announce a disaster.
logRisks %>%
group_by(logratio < 0) %>%
slice_max(abs(logratio), n = 15) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_col() +
coord_flip() +
ylab("log risk ratio (`1`/`0`)") +
theme(axis.text.y = element_text(color = "grey20",
size = 7, angle = 0,
hjust = 1, vjust = 0,
face = "plain"),
legend.position = "none") +
labs(caption = "Figure 5")
This figure clearly shows that a tweet with words like bomber, debris, or israeli has a higher chance of announcing a disaster, while words like aftershock, bag, or ebay indicate an ordinary tweet.
Text cannot be fed into a neural network until it is converted to tensors. Two approaches are commonly used.
One-hot-encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. The first layer of the network is then a dense layer that can handle floating-point vector data. This approach is memory intensive, though, requiring a num_words \(\times\) num_reviews size matrix (Allaire 2015).
The dimension of this vector largely depends on the number of unique words in the text. In our case, 7,559 tweets are composed of 32,017 unique words (after dropping stop words, symbols, etc.). Under the one-hot-encoding rule, the dimension of the training matrix is \(7559 \times 32017\). What if there were 1 million tweets composed of 10 million unique words? It would be a nightmare for storage and computing speed!
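To make this concrete, here is a tiny illustration of the one-hot rule and a back-of-the-envelope memory estimate, using the numbers quoted above and assuming a dense matrix of 8-byte doubles.
# the sequence c(3, 5) as a 10,000-dimensional one-hot vector
v <- replace(numeric(10000), c(3, 5), 1)
which(v == 1)   # 3 5
# memory for a dense 7559 x 32017 matrix of doubles, in GB
7559 * 32017 * 8 / 1e9   # roughly 1.9 GB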
In Figure 5, we can tell that some words play a significant role in announcing a disaster, such as bomber, debris, israeli, while some words show strong evidence that the tweet is a "peace and love" one, such as aftershock, bag, ebay. An important metric for deciding whether a word is useful is the log ratio. To reduce the number of unique words, we only keep words with
\[|\log(\text{risk ratio})| > 1\]
which is approximately
\[\text{risk ratio} > 2.7 \ \ \ \text{or} \ \ \ \text{risk ratio} < 0.37\]
The margin value of 1 is fairly arbitrary: the larger it is, the fewer unique words we keep. The following table shows how the number of unique words varies with the margin.
| margin | number of unique words |
|---|---|
| 0.1 | 16972 |
| 0.3 | 15511 |
| 0.5 | 15241 |
| 0.7 | 7803 |
| 0.9 | 3288 |
| 1.1 | 2182 |
| 1.3 | 1152 |
| 1.5 | 898 |
| 1.7 | 574 |
| 1.9 | 386 |
A compromise choice is 1, which balances model complexity and information sufficiency.
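For reference, a table like the one above can be generated directly from logRisks; the grid of margin values below is an assumption about how it was produced.
margins <- seq(0.1, 1.9, by = 0.2)
tibble(margin = margins,
       n_words = sapply(margins, function(g) sum(abs(logRisks$logratio) > g)))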
uniqueWords <- logRisks %>%
filter(abs(logratio) > 1) %>%
select(word) %>%
unlist() %>%
unname()
length(uniqueWords)
## [1] 2311
# combine words into "sentence"
newtrain <- tidy_train %>%
group_by(id, keyword, target) %>%
summarise(text = paste0(word, collapse = " ")) %>%
ungroup()
newtest <- tidy_test %>%
group_by(id, keyword) %>%
summarise(text = paste0(word, collapse = " ")) %>%
ungroup()
# words to vectors
vectorize_words <- function(x, uniqueWords) {
  dim <- length(uniqueWords)
  len <- length(x)
  m <- matrix(0, nrow = len, ncol = dim)
  for (i in seq(len)) {
    w <- x[i]
    # set a 1 for every vocabulary word present in this text
    m[i, (uniqueWords %in% str_split(w, " ")[[1]])] <- 1
  }
  return(m)
}
train_text1 <- vectorize_words(newtrain$text, uniqueWords)
test_text1 <- vectorize_words(newtest$text, uniqueWords)
Do not forget that we also have approximately 140 keywords!
train_text2 <- vectorize_words(newtrain$keyword, uniqueKeywords)
test_text2 <- vectorize_words(newtest$keyword, uniqueKeywords)
train_text <- cbind(train_text1, train_text2)
test_text <- cbind(test_text1, test_text2)
dim <- ncol(train_text)
Optimizer
Optimizers are algorithms that adjust the attributes of the neural network, such as its weights and learning rate, to reduce the loss. Different optimizers use very different update rules (gradients) to optimize the target function, and for different problems their performance (convergence speed, accuracy) can vary a lot.
In keras, seven optimizers are provided: optimizer_adam(), optimizer_adadelta(), optimizer_adagrad(), optimizer_nadam(), optimizer_adamax() (Kingma and Ba 2014), optimizer_rmsprop(), and optimizer_sgd(). To select the "best" one, the following process is used:
Create a 10-fold cross-validation set;
Build a basic neural network with two hidden layers: the first with 32 nodes and the second with 16 nodes;
Apply the seven optimizers one after another on the 10 validation folds, and record the 10 validation accuracies for each.
library(keras)
KFoldValidation <- function(x, y, k = 10, seed = 123) {
  set.seed(seed)
  n <- length(y)
  resample <- sample(seq(n), n)    # shuffle the row indices
  validationSet <- floor(n/k)      # size of each fold
  lapply(seq(k),
         function(i) {
           end <- if (i == k) n else i*validationSet
           res <- resample[((i - 1)*validationSet + 1):end]
           list(
             x_train = x[-res, ],
             y_train = y[-res],
             x_validation = x[res, ],
             y_validation = y[res]
           )
         })
}
validationSet <- KFoldValidation(train_text, newtrain$target)
lr <- 0.001
optimizer <- list(optimizer_adagrad(lr = lr),
optimizer_adamax(lr = lr),
optimizer_adam(lr = lr),
optimizer_nadam(lr = lr),
optimizer_adadelta(lr = lr),
optimizer_rmsprop(lr = lr),
optimizer_sgd(lr = lr))
epochs <- 50
acc <- lapply(validationSet,
function(data) {
sapply(optimizer,
function(opt) {
model <- keras_model_sequential() %>%
layer_dense(units = 32, activation = "relu",
input_shape = c(dim)) %>%
layer_dense(units = 16,
activation = "relu") %>%
layer_dense(units = 1,
activation = "sigmoid") %>%
compile(
optimizer = opt,
loss = "binary_crossentropy",
metrics = c("accuracy")
)
history <- model %>% fit(
data$x_train,
data$y_train,
epochs = epochs,
verbose = 0,
batch_size = 512,
validation_data = list(data$x_validation,
data$y_validation)
)
history$metrics$val_accuracy[epochs]
})
})
acc %>%
as.data.frame() %>%
as.matrix() %>%
t() -> m
colnames(m) <- c("adagrad", "adamax",
"adam", "nadam", "adadelta",
"rmsprop", "sgd")
acc <- as_tibble(m)
acc %>%
pivot_longer(cols = everything(),
names_to = "method",
values_to = "accuracy") %>%
ggplot(mapping = aes(x = method, y = accuracy, fill = method)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(title = "Optimizer Performance",
caption = "Figure 6",
subtitle = "over 10 fold cross validation")
From this plot, we can tell that the optimizer adamax gives the best performance, with relatively low variation.
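As a complementary numeric check of Figure 6, the mean and standard deviation of the validation accuracy per optimizer can be computed from the acc tibble:
acc %>%
  pivot_longer(cols = everything(),
               names_to = "method",
               values_to = "accuracy") %>%
  group_by(method) %>%
  summarise(mean_acc = mean(accuracy), sd_acc = sd(accuracy)) %>%
  arrange(desc(mean_acc))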
A more complex model (i.e., more layers and nodes) has more memorization capacity. On one hand, this allows the model to learn more features and patterns in the data; on the other hand, an overly complex model will overfit the training data. Typically, we look for a model that maximizes validation performance while minimizing capacity (Boehmke and Greenwell 2019).
As an example, we assessed six model capacities, each with three layers but different numbers of nodes, while keeping all other settings fixed (i.e., the optimizer_adamax() optimizer).
| size | layer 1 nodes | layer 2 nodes |
|---|---|---|
| small | 32 | 8 |
| small to medium | 32 | 16 |
| medium | 64 | 16 |
| medium to large | 64 | 32 |
| large | 128 | 64 |
| XLarge | 256 | 128 |
Figure 7 shows that the model that performed best has a small number of nodes. The "medium to large" models underfit and would require more epochs to reach their minimum validation error, while the largest model overfits extremely fast. Preferably, we want a model that overfits more slowly, such as the small and medium models (code in the appendix).
Another training aid we consider is batch normalization, motivated by the following observation: "Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities." (Ioffe and Szegedy 2015)
Dropout is a technique for addressing overfitting. The key idea is to randomly drop units (along with their connections) from the neural network during training, which prevents units from co-adapting too much.
In our case, using dropout slightly improves the performance, and the prediction accuracy is slightly over 80% (around the top 30% in the Kaggle competition).
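As a sketch of how dropout plugs into the earlier 32-16 architecture (the full 10-fold cross-validated version, with an 8-node second layer, is in the appendix; the 0.4/0.3 rates follow that code):
model_dropout <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(dim)) %>%
  layer_dropout(rate = 0.4) %>%   # randomly zero 40% of these units during training
  layer_dense(units = 16, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid") %>%
  compile(optimizer = optimizer_adamax(lr = 0.001),
          loss = "binary_crossentropy",
          metrics = c("accuracy"))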
Alternatively, we can pad the arrays so they all have the same length, then create an integer tensor of shape num_examples \(\times\) max_length. We can use an embedding layer capable of handling this shape as the first layer in our network.
num_words <- 2000
max_length <- 100
text_vectorization <- layer_text_vectorization(
max_tokens = num_words,
output_sequence_length = max_length,
)
text_vectorization %>%
adapt(newtrain$text)
input <- layer_input(shape = c(1), dtype = "string")
output <- input %>%
text_vectorization() %>%
layer_embedding(input_dim = num_words + 1,
output_dim = 16) %>%
layer_global_average_pooling_1d() %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
model <- keras_model(input, output)
model %>% compile(
optimizer = optimizer_adamax(0.001),
loss = 'binary_crossentropy',
metrics = list('accuracy')
)
history <- model %>% fit(
newtrain$text,
newtrain$target,
epochs = 100,
batch_size = 512,
validation_split = 0.2,
verbose = 1
)
The prediction accuracy of this model on the test set is approximately 79%, almost identical to the previous one.
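For completeness, here is a sketch of producing a test-set submission with this embedding model, mirroring the BERT submission code further below; the output file name is hypothetical.
pred_embed <- model %>% predict(newtest$text)
write_csv(
  tibble(id = newtest$id,
         target = ifelse(pred_embed[, 1] < 0.5, 0, 1)),
  "out_embedding.csv")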
BERT (\(\textbf{B}\)idirectional \(\textbf{E}\)ncoder \(\textbf{R}\)epresentations from \(\textbf{T}\)ransformers) (Turc et al. 2019) is a pre-trained deep learning model introduced by Google AI Research, trained on Wikipedia and BooksCorpus. Unlike word2vec or GloVe, it is contextual: the word bank has different representations in bank deposit and river bank. Additionally, the model is bidirectional; rather than reading text only left-to-right or right-to-left, it uses context from both directions to understand a text.
Running BERT from R mixes PyPI packages with Conda; one way to handle this is to create a dedicated environment from a YAML file (thanks to merv for the solution).
bert_env.yaml:
name: bert_env
channels:
  - defaults
dependencies:
  - numpy
  - keras
  - pip
  - pip:
      - keras-bert
In your conda prompt, run
conda env create -f bert_env.yaml
Then, in R:
# make sure tensorflow is installed in the bert environment
reticulate::py_install("tensorflow")
reticulate::use_condaenv("bert_env", required=TRUE)
reticulate::py_config()
# python: .../r-miniconda/envs/r-reticulate/python.exe
# libpython: .../r-miniconda/envs/r-reticulate/python37.dll
# pythonhome: .../r-miniconda/envs/r-reticulate
# version: 3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
# Architecture: 64bit
# numpy: .../r-miniconda/envs/r-reticulate/Lib/site-packages/numpy
# numpy_version: 1.19.5
reticulate::py_module_available("keras_bert")
# [1] TRUE
For more details, please visit BERT from R.
BERT
Download BERT from here and unzip it; the archive contains three files:
bert_model.ckpt, which is for loading the weights from the TensorFlow checkpoint
bert_config.json, which is a configuration file
vocab.txt, which is for text tokenization
library(reticulate)
reticulate::use_condaenv("bert_env", required=TRUE)
k_bert <- import('keras_bert')
pretrained_path <- 'uncased_L-12_H-768_A-12'
config_path <- file.path(pretrained_path, 'bert_config.json')
checkpoint_path <- file.path(pretrained_path, 'bert_model.ckpt')
vocab_path <- file.path(pretrained_path, 'vocab.txt')
token_dict <- k_bert$load_vocabulary(vocab_path)
tokenizer <- k_bert$Tokenizer(token_dict)
BERT is a contextual NLP model, so stop words should not be removed. We only remove useless symbols and URLs.
bertTrain <- train %>%
mutate(text = str_remove_all(text, remove_reg))
bertTest <- test %>%
mutate(text = str_remove_all(text, remove_reg))
bertInput <- function(text, seq_length = 50L) {
  c(indices, segments) %<-% list(list(), list())
  for (i in 1:length(text)) {
    # tokenize each text into token indices and segment ids
    c(indices_tok, segments_tok) %<-% tokenizer$encode(text[i],
                                                       max_len = seq_length)
    indices <- c(indices, list(as.matrix(indices_tok)))
    segments <- c(segments, list(as.matrix(segments_tok)))
  }
  # bind into two matrices: one of token indices, one of segment ids
  c(list(do.call(cbind, indices) %>% t()),
    list(do.call(cbind, segments) %>% t()))
}
seq_length <- 50L
bertTrainInput <- bertInput(bertTrain$text, seq_length = seq_length)
bertTestInput <- bertInput(bertTest$text, seq_length = seq_length)
model <- k_bert$load_trained_model_from_checkpoint(
config_path,
checkpoint_path,
training = TRUE,
trainable = TRUE,
seq_len = seq_length)
inputs <- list(get_layer(model,name = 'Input-Token')$input,
get_layer(model,name = 'Input-Segment')$input)
outputs <- layer_dense(get_layer(model, name = 'NSP-Dense')$output,
units=1L,
activation='sigmoid',
kernel_initializer = initializer_truncated_normal(stddev = 0.02),
name = 'output')
model <- keras_model(inputs = inputs,
outputs = outputs)
# this is slow... may take 0.5 ~ 1 hour
model %>%
compile(
optimizer_adamax(lr = 1e-4),
loss = 'binary_crossentropy',
metrics = 'accuracy'
) %>%
fit(
bertTrainInput,
bertTrain$target,
epochs = 1,
batch_size = 70)
pred <- model %>% predict(bertTestInput)
write_csv(
tibble(id = bertTest$id,
target = ifelse(pred[, 1] < 0.5, 0, 1)),
"out.csv")
BERT gives the best prediction accuracy, 83%, which lands in the top 10% (115/1364) of the competition!
This project studies whether a posted tweet announces a disaster. The training data includes 7,613 tweets with four explanatory variables (id, keyword, location, and text) and the response target.
There are 222 unique keywords. However, some of them have the same meaning in different forms (e.g., dead and death, weapon and weapons). To reduce the number of unique keywords, some of them can be merged. An easy solution is to mark words as identical if their first \(k\) letters match. A large \(k\) leaves very few words to combine, and a small \(k\) may accidentally merge words with different meanings. Based on multiple tries, we choose \(k = 4\), so the 222 unique keywords shrink to 141.
The variable text is the most important feature for classifying a tweet. To pass it into a deep learning model, the words have to be turned into vectors. However, the text is raw and contains over 30,000 unique words. If no preprocessing is applied, the dimension of the input sparse matrix will be \(7613 \times 32017\): a heavy load requiring substantial computation and storage. To reduce the dimension, useless symbols such as amp ("&"), lt ("<"), gt (">") and URLs are removed, and each text is split into tokens. As is common in text mining, we also remove stop words, i.e., words that are not useful for the analysis, typically extremely common words such as "the," "of," "to," and so forth in English. After this, the number of unique words is reduced to 17,030. Next, for each word, a log risk ratio is computed, defined as \(\text{log risk ratio}_i = \log(\frac{\frac{n_{1i} + 1}{\text{total}_1 + 1}}{\frac{n_{0i} + 1}{\text{total}_0 + 1}})\). A positive value suggests that the word appears more often in group 1, while a negative value suggests that it appears more often in group 0. Word \(i\) is kept only if \(|\text{log risk ratio}_i| > \gamma\). To balance computing/storage cost and information sufficiency, we choose \(\gamma = 1\). With these operations, the number of unique words is reduced to 2,311.
| before | after |
|---|---|
| Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all | deeds reason earthquake allah forgive |
| Forest fire near La Ronge Sask. Canada | forest fire la ronge sask canada |
| All residents asked to ‘shelter in place’ are being notified by officers. No other evacuation or shelter in place orders are expected | residents shelter notified officers evacuation shelter expected |
In neural networks, finding the right hyper-parameters (e.g., dropout rate, learning rate) and constructing the most efficient structure (e.g., number of layers and nodes) is not easy. For most implementations, we need to predetermine the number of layers and then establish a search grid.
The optimizer is the backbone of deep learning. A proper one can make the model converge quickly, while an "improper" one may make convergence slow or even unsuccessful. Based on 10-fold cross-validation, the optimizer adamax has the best performance. Next, we construct six models, from "small" capacity to "large." From the boxplots, the "small" model fits the data well, while the "large" models tend to overfit. After settling on the optimizer and model capacity, dropout and batch normalization are also considered.
Typically, there are two ways to convert text to tensors: one-hot encoding (binary vectors) and an embedding layer (integer indices). Both give similar results, with accuracy around 80%.
Additionally, BERT is applied. BERT is a method of pre-training language representations: a general-purpose "language understanding" model is trained on a large text corpus (like Wikipedia) and then used for downstream NLP tasks. It is a contextual and bidirectional method, which understands the meaning of a text from its neighborhood (both left and right). In the end, it gives the best performance (115/1364, top 10%).
Some talented competitors reach 100% prediction accuracy, such as Gunes Evitan. I notice that he spends a large effort on text cleaning. For example, informal abbreviations are written in their long forms ("MH370" to "Malaysia Airlines Flight 370"), some words are replaced with their acronyms, and some words are grouped into one (what we did in the keyword manipulation). This process is a headache, but it is definitely worth it: cleaner text helps us build a more informative tensor matrix, resulting in better performance.
units <- list(
c(32, 8, 1),
c(32, 16, 1),
c(64, 16, 1),
c(64, 32, 1),
c(128, 64, 1),
c(256, 128, 1)
)
modelCapacity <- lapply(validationSet,
function(data) {
sapply(units,
function(unit) {
model <- keras_model_sequential() %>%
layer_dense(units = unit[1],
activation = "relu",
input_shape = c(dim)) %>%
layer_dense(units = unit[2],
activation = "relu") %>%
layer_dense(units = unit[3],
activation = "sigmoid") %>%
compile(
optimizer = optimizer_adamax(lr = 0.001),
loss = "binary_crossentropy",
metrics = c("accuracy")
)
history <- model %>% fit(
data$x_train,
data$y_train,
epochs = epochs,
verbose = 0,
batch_size = 512,
validation_data = list(data$x_validation,
data$y_validation)
)
history$metrics$val_accuracy[epochs]
})
})
modelCapacity %>%
as.data.frame() %>%
as.matrix() %>%
t() -> m
colnames(m) <- c("small", "small to medium", "medium",
"medium to large", "large", "XLarge")
modelCapacity <- as_tibble(m)
normalization <- lapply(validationSet,
function(data) {
model <- keras_model_sequential() %>%
layer_dense(units = 32,
activation = "relu",
input_shape = c(dim)) %>%
layer_batch_normalization() %>%
layer_dense(units = 8,
activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 1,
activation = "sigmoid") %>%
compile(
optimizer = optimizer_adamax(lr = 0.001),
loss = "binary_crossentropy",
metrics = c("accuracy")
)
history <- model %>% fit(
data$x_train,
data$y_train,
epochs = epochs,
verbose = 1,
batch_size = 512,
validation_data = list(data$x_validation,
data$y_validation)
)
history$metrics$val_accuracy[epochs]
})
dropout <- lapply(validationSet,
function(data) {
model <- keras_model_sequential() %>%
layer_dense(units = 32,
activation = "relu",
input_shape = c(dim)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 8,
activation = "relu") %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 1,
activation = "sigmoid") %>%
compile(
optimizer = optimizer_adamax(lr = 0.001),
loss = "binary_crossentropy",
metrics = c("accuracy")
)
history <- model %>% fit(
data$x_train,
data$y_train,
epochs = epochs,
verbose = 0,
batch_size = 512,
validation_data = list(data$x_validation,
data$y_validation)
)
history$metrics$val_accuracy[epochs]
})
normalizationDropout <- lapply(validationSet,
function(data) {
model <- keras_model_sequential() %>%
layer_dense(units = 32,
activation = "relu",
input_shape = c(dim)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 8,
activation = "relu") %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 1,
activation = "sigmoid") %>%
compile(
optimizer = optimizer_adamax(lr = 0.001),
loss = "binary_crossentropy",
metrics = c("accuracy")
)
history <- model %>% fit(
data$x_train,
data$y_train,
epochs = epochs,
verbose = 1,
batch_size = 512,
validation_data = list(data$x_validation,
data$y_validation)
)
history$metrics$val_accuracy[epochs]
})
m <- tibble(
normalization = unlist(normalization),
dropout = unlist(dropout),
normalizationDropout = unlist(normalizationDropout)
)