1. Introduction

Today, American politicians use Twitter to communicate their opinions. With this in mind, we decided to test whether an algorithm can recognise the authors of tweets. This would make it possible to verify that a tweet was actually written by its claimed author (for example, to detect a hacked account).

To do so, we trained a model to classify tweets from prominent US politicians, focusing on three of them: Barack Obama, Hillary Clinton and Donald Trump. We found several datasets freely available on Kaggle, each containing between 3000 and 8000 tweets. We had to be careful not to include the retweets present in the datasets, as they could mislead the algorithm. Each tweet contains a maximum of 280 characters.

On an additional note, Multinomial Naive Bayes has been used in the past for this kind of task, but neural networks are currently among the most accurate methods for text classification. Our approach is to use Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) with word embeddings to reduce the number of dimensions (instead of one-hot encoding each word).

1.1 The datasets

Both datasets are freely available on Kaggle.com. The Barack Obama dataset has only one column, and each instance packs together various pieces of information, such as the URL, the date, the number of retweets, the message, etc. Its 6930 tweets span November 2012 to April 2019.

The Hillary Clinton and Donald Trump dataset contains 6444 tweets from a narrower time window (January 2016 to November 2016). Both American politicians were nevertheless very prolific tweeters at the time, sending more than 3000 tweets each; 2016 was an election year. This dataset is also better structured, with 28 variables such as the source, the date, the number of retweets, etc.
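
A minimal sketch of loading the two Kaggle files; the file names below are illustrative, not the exact Kaggle file names:

library(readr)

obama         <- read_csv("data/obama_tweets.csv")
clinton_trump <- read_csv("data/clinton_trump_tweets.csv")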

2. Data Cleaning

The first part of the project was to clean and prepare the data. The dataset containing the tweets from Clinton and Trump has a binary variable flagging retweets. We removed the retweets from our analysis, as we want to recognize the writing style of each author and not be biased by third-party tweets.
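
A minimal sketch of that filter, assuming the binary indicator column in the Kaggle file is called is_retweet (stored either as a logical or as the strings "True"/"False"):

library(dplyr)

# Keep only original tweets, then drop the indicator column
clinton_trump <- clinton_trump %>%
  filter(is_retweet == FALSE | is_retweet == "False") %>%
  select(-is_retweet)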

By looking deeper into the data, we identified some elements that could potentially harm the accuracy of our future models. For example, we decided to remove URLs using regular expressions, as each of them is unique and brings no additional information about the author.

library(qdapRegex)  # rm_twitter_url() removes Twitter short URLs (t.co links)

clinton_trump$text <- rm_twitter_url(
  clinton_trump$text,
  clean = TRUE,
  pattern = "@rm_twitter_url",
  replacement = "",
  extract = FALSE,
  dictionary = getOption("regex.library")
)

The second important feature of tweets is the hashtag. Hashtags are used to reference tweets and give more visibility to the message. The decision to keep them or not was not straightforward. At first, we thought that they would not improve accuracy and would only add noise to our predictions, so we removed them:

# stringr: remove @mentions and #hashtags from the tweet text
clinton_trump$text <-
  str_replace_all(clinton_trump$text, "([@#][\\w_-]+)", "")

However, after training the neural networks, we decided to keep the hashtags, as they surprisingly increased the accuracy of our models. We have a few hypotheses as to why this happens; the most plausible is that hashtags carry author-specific information, as each author tends to reuse their own set of tags.

Finally, the tweet texts were extracted from the Obama dataset using several regular expressions (package stringr).

# Keep the text from character 50 onwards, dropping the prefix (date, Twitter account, etc.)
obama$text <- substring(obama$text, 50)

# Remove the trailing URL that leads to the tweet (everything after ";" or ":")
obama$text <- sub("\\;.*", "", obama$text)
obama$text <- sub("\\:.*", "", obama$text)

3. Data Preprocessing (Split and Tokenization)

After properly cleaning the data and removing information that could add noise to the predictions, we assigned a unique number to each author:

 1 => Barack Obama
 2 => Hillary Clinton
 3 => Donald Trump

Another issue we addressed is class imbalance. To do so, we randomly selected 2629 instances from the Trump and Obama datasets to match the number of Clinton tweets. Finally, we split the data into training and test sets using an 80-20 ratio and set a seed for reproducibility.
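
A minimal sketch of the balancing and the 80-20 split; the object and column names (tweets, author) and the seed are illustrative, not the exact ones used in the project:

library(dplyr)

set.seed(123)                            # illustrative seed, for reproducibility
n_clinton <- sum(tweets$author == 2)     # 2629 Clinton tweets

balanced <- tweets %>%
  group_by(author) %>%
  sample_n(n_clinton) %>%                # downsample each author to the same size
  ungroup()

train_idx <- sample(nrow(balanced), size = round(0.8 * nrow(balanced)))
train <- balanced[train_idx, ]
test  <- balanced[-train_idx, ]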

Regarding the maximum sequence length used for tokenization, we ran some analyses on the number of words per tweet to find a suitable value.

# Distribution of the number of tokens per tweet, to guide the choice of maxlen
summary(sapply(train_x_test_maxlen, length))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.00   16.00   15.31   20.00   54.00

We chose a maximum length of 20 tokens, as it fully covers around 75% of the tweets (the 3rd quartile is 20) and is significantly faster to process. We also tried 50 words and did not observe major differences in accuracy.
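
As a quick sanity check on that 75% figure, one can compute the share of tweets that fit entirely within 20 tokens (assuming train_x_test_maxlen is the list of tokenized tweets summarised above):

# Proportion of tweets with at most 20 tokens (i.e. not truncated by maxlen = 20)
mean(sapply(train_x_test_maxlen, length) <= 20)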

Regarding the tokenization, we chose to keep the punctuation, as we observed that Donald Trump uses it frequently to emphasize his messages. We used the function text_tokenizer to segment the texts into smaller units, here words.
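
Note that text_tokenizer strips punctuation by default through its filters argument, so keeping punctuation requires overriding it. The following is a hedged sketch; the exact filter string used in the project is not reproduced here:

# The default filters remove punctuation; passing a reduced filter string keeps
# e.g. "!", "?", "." and "," attached to the surrounding tokens (illustrative only)
tokenizer_punct <- text_tokenizer(
  num_words = num_words,
  filters = "\"#$%&()*+-/:;<=>@[\\]^_`{|}~\t\n"
)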

# Generate a tokenizer and fit it on the training texts
tokenizer <-
  text_tokenizer(num_words = num_words) %>% fit_text_tokenizer(train_x)

# Tokenize `train_x` and `test_x` into sequences of integers, then pad those sequences to `maxlen`.

train_x <-
  texts_to_sequences(tokenizer = tokenizer, texts = train_x) %>% pad_sequences(maxlen = maxlen)
test_x <-
  texts_to_sequences(tokenizer = tokenizer, texts = test_x) %>% pad_sequences(maxlen = maxlen)

# Transform `train_y` into one-hot encoded matrix.
train_y <- to_categorical(train_y)
test_y <- to_categorical(test_y)

# Creation of the word index
word_index <- tokenizer$word_index

4. Modelling

After preprocessing the datasets and tokenizing both the training and test sets, we started working on the modelling. We wrote six different models (one per script) to find which one would give us the best accuracy:

RNN_pretrain.R: Recurrent Neural Network using pretrained word embeddings from Stanford.

RNN_train.R: Recurrent Neural Network using self-trained word embeddings based on our data.

RNN_pretrain bidirectional.R: Recurrent Neural Network using pretrained word embeddings from Stanford + bidirectional layer.

RNN_train bidirectional.R: Recurrent Neural Network using self-trained word embeddings based on our data + bidirectional layer.

CNN_pretrain.R: Combined Neural Network: convolutional NN using pretrained word embeddings from Stanford with an RNN GRU layer.

CNN_train.R: Combined Neural Network: convolutional NN using self-trained word embeddings based on our data with an RNN GRU layer.

4.1 Recurrent Neural Network

Recurrent Neural Networks are particularly appropriate for text mining: they process sequences by iterating over their elements and maintain a state containing information about previously seen inputs (source: Lecture 09). As word order matters, they should theoretically give better results than Convolutional Neural Networks. We tested our models with both unidirectional and bidirectional layers. Another important difference with CNNs is the recurrent dropout, which is applied in addition to the normal dropout.

RNN_train.R with default values

num_words <- 7000     # the size of the vocabulary
maxlen <- 20          # the maximum sequence length (in tokens)
embedding_dim <- 25   # the dimension of the embeddings

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = num_words,
                  output_dim = embedding_dim,
                  input_length = maxlen) %>%
  layer_gru(
    units = 128,
    dropout = 0.1,
    recurrent_dropout = 0.4,
    return_sequences = TRUE
  ) %>%
  layer_gru(units = 128,
            dropout = 0.1) %>%
  layer_dense(units = 4, activation = "softmax")  # 4 columns: to_categorical() on labels 1-3 also creates an (unused) column for 0

model %>% compile(optimizer = optimizer_rmsprop(),
                  loss = loss_categorical_crossentropy,
                  metrics = metric_categorical_accuracy)

model %>% fit(
  train_x,
  train_y,
  batch_size = 128,
  epochs = 10,
  validation_split = 0.2
)

4.2 Combined Neural Network

Convolutional Neural Networks (CNN) are efficient at detecting patterns at different locations. They can also be used with sequential data via 1D convolutional layers. We added CNN layers in front of RNN layers, together with word embeddings, to create combined neural networks. The goal was to see whether they could outperform the "simpler" RNN models.

CNN_train.R with default values

num_words <- 7000     # the size of the vocabulary
maxlen <- 20          # the maximum sequence length (in tokens)
embedding_dim <- 25   # the dimension of the embeddings

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = num_words,
                  output_dim = embedding_dim,
                  input_length = maxlen) %>%
  layer_conv_1d(
    filters = 32,
    kernel_size = 5,
    strides = 1,
    padding = "same",
    activation = "relu"
  ) %>%
  layer_max_pooling_1d(pool_size = 3) %>%
  layer_conv_1d(
    filters = 64,
    kernel_size = 3,
    strides = 1,
    padding = "same",
    activation = "relu"
  ) %>%
  layer_gru(units = 128) %>%
  layer_dense(units = 4, activation = "softmax")


model %>% compile(optimizer = optimizer_rmsprop(),
                  loss = loss_categorical_crossentropy,
                  metrics = metric_categorical_accuracy)

model %>% fit(
  train_x,
  train_y,
  batch_size = 128,
  epochs = 10,
  validation_split = 0.2,
  shuffle = TRUE
)

4.3 Word Embedding (pretrained and self-trained)

An important component of both the RNN and CNN models is the word embedding, which represents words as vectors in a multidimensional space and is more effective than one-hot encoding at capturing writing patterns. For this project, we first used self-trained word embeddings learned from our data (see the scripts whose names include "train"). However, pretrained word embeddings can be used as well.

In this project, we used glove.twitter.27B, trained by a team from Stanford University on 2 billion tweets (27 billion tokens). Four versions exist, with 25, 50, 100 and 200 dimensions. During our preliminary trials, we observed that an embedding dimension greater than 25-30 has a negative impact on accuracy. This can be explained by the relatively small size of our dataset, but mostly by the fact that we are analyzing only three authors: increasing the number of dimensions only overcomplicates the neural network and penalizes accuracy.

RNN_pretrain.R with default values

lines <-
  readLines(here("embeddings/glove.twitter.27B/glove.twitter.27B.25d.txt"))

embeddings_index <- new.env(hash = TRUE, parent = emptyenv())

# Parse the GloVe file: each line is a word followed by its embedding values
for (i in seq_along(lines)) {
  line <- lines[[i]]
  values <- strsplit(line, " ")[[1]]
  word <- values[[1]]
  embeddings_index[[word]] <- as.double(values[-1])
}

embedding_matrix <- array(0, c(num_words, embedding_dim))

for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < num_words) {
    embedding_vector <- embeddings_index[[word]]
    if (!is.null(embedding_vector))
      embedding_matrix[index + 1, ] <- embedding_vector
  }
}

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = num_words,
                  output_dim = embedding_dim,
                  input_length = maxlen) %>%
  layer_gru(
    units = 128,
    dropout = 0.1,
    recurrent_dropout = 0.4,
    return_sequences = TRUE
  ) %>%
  layer_gru(units = 128,
            dropout = 0.1) %>%
  layer_dense(units = 4, activation = "softmax")

# assign pretrained weights to the embedding layer
get_layer(model, index = 1) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()

model %>% compile(optimizer = optimizer_rmsprop(),
                  loss = loss_categorical_crossentropy,
                  metrics = metric_categorical_accuracy)

model %>% fit(
  train_x,
  train_y,
  batch_size = 128,
  epochs = 10,
  validation_split = 0.2
)

5. Hyperparameter Tuning on the Cloud

5.1 Training on the Cloud

As training an RNN with pretrained word embeddings requires substantial computational resources, we trained our neural networks on the cloud (Google Cloud AI Platform). Each script, containing one model, was prepared for hyperparameter tuning using FLAGS. To define the search range of each flag, a YAML file containing all the tuning information was created and submitted to the server through the script control-gcp.R. The training and test sets, as well as the pretrained word embeddings, were also uploaded in RDS format.
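
A minimal sketch of what such a submission might look like with the cloudml package; the file names and machine type below are illustrative, not the exact ones used in control-gcp.R:

library(cloudml)
library(tfruns)

# Submit one training script together with its hyperparameter-tuning
# configuration (a YAML file describing the flag ranges) to Google Cloud
cloudml_train(
  file = "RNN_train.R",             # illustrative script name
  config = "tuning_rnn_train.yml",  # illustrative tuning configuration
  master_type = "standard_gpu"
)

# Once the job has finished, collect the results and list the runs
job_collect()
runs <- ls_runs()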

5.2 Tuning Strategy

Our strategy for this project was to first run preliminary tests locally. This allowed us to start with a broad search space and then narrow it down. Regarding the pretrained word embeddings, we observed that the 25-dimension version performed better than the others, so we used only this one for the hyperparameter tuning. We followed the same approach for the choice of optimizer and of the RNN layers: we used GRU layers instead of LSTM layers, as they performed better in our tests.

After training and tuning all six models on the cloud with 20 runs each, we picked the two most accurate ones and tuned additional parameters to further optimize accuracy, this time with 40 runs each.

5.3 First Round of Tuning Hyperparameters

For the RNN models, we used GRU layers as mentioned earlier and tuned the number of units in both layers. We did the same for the dropouts, naturally including a recurrent dropout in our models. The number of words in the vocabulary and the embedding dimension were tuned as well in all models (RNN and CNN) without pretrained embeddings.

Example of the tuning parameters for the RNN_train.R

FLAGS <- flags(
  flag_integer("gru_units1", 128),
  flag_numeric("dropout1", 0.1),
  flag_numeric("rec_dropout1", 0.4),
  flag_integer("gru_units2", 128),
  flag_numeric("dropout2", 0.1),
  flag_integer("num_words", 7000),
  flag_integer("dim", 25)
)

For the CNN models, we used a similar strategy.

Example of the tuning parameters for the CNN_train.R

FLAGS <- flags(
  flag_integer("gru_units1", 128),
  flag_numeric("dropout1", 0.1),
  flag_integer("num_words", 7000),
  flag_integer("dim", 25)
)

After running each model for 20 trials, the results clearly showed that the pretrained word embeddings underperformed: none of the pretrained models outperformed the self-trained ones.

FIRST ROUND: Ranking of top-accuracy models
(val_acc = validation categorical accuracy; gru1/gru2 = gru_units1/gru_units2; drop1/drop2 = dropout1/dropout2; rec_drop1 = recurrent dropout; words = num_words; dim = embedding dimension)

script                      val_acc  gru1  drop1  rec_drop1  gru2  drop2  words  dim
RNN_train.R                   0.902    60  0.111      0.431   274  0.234   8535   NA
RNN_train.R                   0.902   204  0.288      0.483    94  0.245   9676   NA
RNN_train.R                   0.900   172  0.292      0.500   171  0.233   8508   NA
RNN_train.R                   0.900   237  0.175      0.401   247  0.032   9062   NA
RNN_train.R                   0.899   120  0.084      0.491   112  0.108   7524   NA
RNN_train bidirectional.R     0.898   163  0.143      0.484   161  0.111  10981   10
CNN_train.R                   0.897    66  0.028         NA    NA     NA   7199   50
CNN_train.R                   0.897    63  0.298         NA    NA     NA   7108   50
CNN_train.R                   0.897    52  0.079         NA    NA     NA   9016   36
RNN_train.R                   0.896   147  0.023      0.308   204  0.010   9989   NA
CNN_train.R                   0.895   110  0.245         NA    NA     NA  11111   33
RNN_train bidirectional.R     0.895    52  0.010      0.600    79  0.010   8189   50
RNN_train bidirectional.R     0.895   131  0.137      0.429    63  0.030  11209   34
CNN_train.R                   0.894   246  0.032         NA    NA     NA  10036   25
RNN_train bidirectional.R     0.894   292  0.300      0.300    90  0.300   8946   10
CNN_train.R                   0.893   283  0.023         NA    NA     NA  11992   50
CNN_train.R                   0.893   293  0.026         NA    NA     NA   7178   48
CNN_train.R                   0.893    78  0.156         NA    NA     NA  10907   45
RNN_train bidirectional.R     0.893    92  0.014      0.301   109  0.298   8557   36
RNN_train bidirectional.R     0.893   272  0.065      0.395   285  0.031   7753   35

Looking at the best-ranked models, none of them performed significantly better than the others. However, the Recurrent Neural Network with bidirectional layers is slightly less accurate than the plain RNN. We therefore excluded it from further investigation and focused our resources on the following models:

  1. Recurrent Neural Network with self-trained word embedding (best validation accuracy: 0.9019)
  2. Convolutional Neural Network with self-trained word embedding (best validation accuracy: 0.8967)

5.4 Second Round of Tuning Hyperparameters

Having identified the two best models, we refined them by tuning additional hyperparameters in order to increase their accuracy: in addition to the previous hyperparameters, we also tuned the learning rate, the patience and the batch size. Each of these two models was run 40 times, as mentioned previously.

Tuning parameters for the RNN_train_2.R

FLAGS <- flags(
  flag_integer("gru_units1", 128),
  flag_numeric("dropout1", 0.1),
  flag_numeric("rec_dropout1", 0.4),
  flag_integer("gru_units2", 128),
  flag_numeric("dropout2", 0.1),
  flag_integer("num_words", 7000),
  flag_integer("dim", 25),
  flag_numeric("learningrate",0.001),
  flag_integer("patience",5),
  flag_integer("batch",128)
)

Tuning parameters for the CNN_train_2.R

FLAGS <- flags(
  flag_integer("gru_units1", 128),
  flag_numeric("dropout1", 0.1),
  flag_integer("num_words", 7000),
  flag_integer("dim",25),
  flag_numeric("learningrate",0.001),
  flag_integer("patience",5),
  flag_integer("batch",128)
)
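
The learning rate, patience and batch size flags are presumably consumed in the compile and fit calls of these scripts; the following is a hedged sketch of how that could look, not the exact code of RNN_train_2.R or CNN_train_2.R:

# Hypothetical use of the learning-rate, patience and batch-size flags
model %>% compile(
  optimizer = optimizer_rmsprop(lr = FLAGS$learningrate),
  loss = loss_categorical_crossentropy,
  metrics = metric_categorical_accuracy
)

model %>% fit(
  train_x,
  train_y,
  batch_size = FLAGS$batch,
  epochs = 30,
  validation_split = 0.2,
  callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = FLAGS$patience)
  )
)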

The results of this second round of tuning were the following:

SECOND ROUND: Ranking of top-accuracy models
(same column abbreviations as in the first-round table)

script          val_acc  gru1  drop1  rec_drop1  gru2  drop2  words  dim
CNN_train_2.R     0.904   100  0.011         NA    NA     NA   8721   48
RNN_train_2.R     0.896    65  0.116      0.566   255  0.298   7017   33
RNN_train_2.R     0.892   297  0.281      0.573   231  0.293   7148   48
CNN_train_2.R     0.892    60  0.021         NA    NA     NA   8538   11
RNN_train_2.R     0.890    50  0.300      0.600    98  0.300   8110   28
RNN_train_2.R     0.889   142  0.015      0.306   106  0.054   9583   48
CNN_train_2.R     0.888   105  0.160         NA    NA     NA   7035   25
RNN_train_2.R     0.888   114  0.300      0.600   145  0.010   9936   50
CNN_train_2.R     0.888    53  0.011         NA    NA     NA  11639   11
CNN_train_2.R     0.887    50  0.010         NA    NA     NA   8440   50
CNN_train_2.R     0.886   138  0.297         NA    NA     NA  10960   46
RNN_train_2.R     0.885   127  0.019      0.306    50  0.278   7607   37
CNN_train_2.R     0.882   279  0.018         NA    NA     NA   7459   50
RNN_train_2.R     0.881   289  0.300      0.300   171  0.300   9558   10
RNN_train_2.R     0.880   140  0.012      0.493   153  0.020   8618   43
CNN_train_2.R     0.880   292  0.270         NA    NA     NA  11732   12
CNN_train_2.R     0.874   160  0.285         NA    NA     NA   8674   14
CNN_train_2.R     0.872   294  0.246         NA    NA     NA   7212   47
RNN_train_2.R     0.870   297  0.030      0.309   288  0.030   7095   50
RNN_train_2.R     0.869    67  0.275      0.312   259  0.197   7210   27

Thus, one can clearly see that the best model is the Convolutional Neural Network with self-trained word embedding (validation accuracy: 0.9040), whose optimised hyperparameters are:

Optimised hyperparameters of the best model

FLAGS <- flags(
  flag_integer("gru_units1", 100),
  flag_numeric("dropout1", 0.011),
  flag_integer("num_words", 8721),
  flag_integer("dim",48),
  flag_numeric("learningrate",0.009),
  flag_integer("batch",73)
)

6. Best Model

Having found the best model, we retrained it using its optimised hyperparameters.

retrain.R with best values

# 1. Define the model
#-------------------------------------------------------------------------------
num_words <- FLAGS$num_words   # the size of the vocabulary
maxlen <- 20                   # the maximum sequence length (in tokens)
embedding_dim <- FLAGS$dim     # the dimension of the embeddings

best_model <- keras_model_sequential() %>%
  layer_embedding(input_dim = num_words,
                  output_dim = embedding_dim,
                  input_length = maxlen) %>%
  layer_conv_1d(filters = 32, kernel_size = 5, strides = 1, padding = "same",
                activation = "relu") %>%
  layer_max_pooling_1d(pool_size = 3) %>%
  layer_conv_1d(filters = 64, kernel_size = 3, strides = 1, padding = "same",
                activation = "relu") %>%
  layer_gru(units = FLAGS$gru_units1,dropout = FLAGS$dropout1) %>%
  layer_dense(units = 4, activation = "softmax")

# 2. Compile the model
#-------------------------------------------------------------------------------

best_model %>% compile(
  optimizer = optimizer_rmsprop(lr = FLAGS$learningrate),
  loss = "categorical_crossentropy",
  metrics = metric_categorical_accuracy
)

# 3. Fit the model
#-------------------------------------------------------------------------------

best_model %>% fit(
  train_x,
  train_y,
  batch_size = FLAGS$batch,
  epochs = 12
)
# 4. Save model
#-------------------------------------------------------------------------------

best_model %>% save_model_hdf5("results/best-model.hdf5")

Model evaluation

# 5. Evaluate the model
#-------------------------------------------------------------------------------
best_model <- load_model_hdf5(
    filepath = here("results/best-model.hdf5")) 
test_x <- readRDS(here("/rds/test_x.rds"))
test_y <- readRDS(here("/rds/test_y.rds"))
evaluation <- best_model %>%
    evaluate(test_x,test_y)

With this model, we reached a categorical accuracy of 0.9066427 on the test set, which is higher than expected.

7. Conclusion & Recommendations

First of all, we were positively surprised by the predictive ability of our best model, which is higher than we expected. Several factors explain this result.

First, Donald Trump, Barack Obama and Hillary Clinton were good candidates for prediction, as each of them has a distinctive personal writing style. Although Barack Obama and Hillary Clinton are both Democrats, their communication styles differ, which makes them easier to classify (the effect is even stronger with Donald Trump).

Another interesting result is that we obtained better accuracy with self-trained word embeddings. Although initially surprised, we realized that pretrained word embeddings are trained on billions of tweets from millions of users, each with their own writing style and vocabulary. They would probably perform better on larger datasets with many more authors. For narrow classification tasks such as this one, we would recommend not using them.

Finally, to go further, classifying tweets according to political party membership could warrant further research. However, we struggled to find other datasets of American politicians. Another interesting project could be to classify political affiliation on a larger scale, for example using the tweets of thousands of non-politician individuals with a known party affiliation.

8. Sources