Today, American politicians use Twitter to communicate their opinions. With this in mind, we decided to test whether an algorithm can recognise the different authors of tweets. This would make it possible to check whether a tweet was actually written by its claimed author (for example, to detect a hacked account).
To do so, we trained a model to classify the tweets of three prominent US politicians: Barack Obama, Hillary Clinton and Donald Trump. We found several freely available datasets on Kaggle, each containing between 3000 and 8000 tweets. We had to be careful not to include the retweets present in the datasets, as they could mislead the algorithm. Each tweet contains a maximum of 280 characters.
On an additional note, Multinomial Naive Bayes has been used in the past for this kind of task, but neural networks are among the most accurate methods for text classification. Our approach is to use Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) with word embeddings, which reduce the number of dimensions compared to one-hot encoding every single word.
Both datasets are available for free on Kaggle.com. The Barack Obama dataset has only one column, and each instance mixes various pieces of information, such as the URL, the date, the number of retweets, the message, etc. Its 6930 tweets span November 2012 to April 2019.
The Hillary Clinton and Donald Trump dataset contains 6444 tweets from a smaller time window (January 2016 to November 2016). However, both American politicians were very prolific tweeters at the time, each sending more than 3000 tweets; as a reminder, 2016 was an election year. This dataset is also better structured, as it contains 28 variables, such as the source, the date, the number of retweets, etc.
The first part of the project was to clean and prepare the data. The dataset containing the tweets from Clinton and Trump has a binary variable flagging retweets. We removed retweets from our analysis because we want to recognise the authors' writing styles without being biased by third-party tweets.
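A minimal sketch of this filtering step, assuming the flag is a column named is_retweet (the actual name in the Kaggle file may differ):
library(dplyr)
# keep only original tweets; use is_retweet == "False" instead if the flag is stored as text
clinton_trump <- clinton_trump %>%
  filter(is_retweet == FALSE)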
By looking deeper into the data, we identified some elements that could harm the accuracy of our future models. For example, we decided to remove URLs with regular expressions, as all of them are unique and bring no additional information about the authors.
# remove Twitter URLs with qdapRegex::rm_twitter_url (default arguments spelled out)
clinton_trump$text <- rm_twitter_url(
clinton_trump$text,
clean = TRUE,
pattern = "@rm_twitter_url",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library")
)
The second important feature of tweets is the hashtag. Hashtags are used to reference tweets and give more visibility to the message. The decision to keep them or not was not straightforward: at first, we thought they would not add accuracy and would only introduce noise into our predictions.
# strip hashtags and mentions with stringr::str_replace_all
clinton_trump$text <-
str_replace_all(clinton_trump$text, "([@#][\\w_-]+)", "")
However, after training the neural networks, we decided to keep the hashtags, as they surprisingly increased the accuracy of our models. We have a few hypotheses as to why this happens.
Finally, the tweet texts were extracted from the Obama dataset using different regular expressions (package stringr).
# Take out the first 50 characters, which contain the date, the twitter account, etc.
obama$text <- substring(obama$text, 50)
# remove the URL that leads to the tweet (everything after the first ';' or ':')
obama$text <- sub("\\;.*", "", obama$text)
obama$text <- sub("\\:.*", "", obama$text)
After cleaning the data and removing information that could add noise to the predictions, we assigned a unique number to identify the author of each tweet (a sketch of this step follows the mapping below):
- 1 => Barack Obama
- 2 => Hillary Clinton
- 3 => Donald Trump
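A minimal sketch of how this labelling could be implemented, assuming the Clinton/Trump dataset identifies the author through a handle column (all column and handle names here are assumptions):
# hypothetical column/handle names; adapt to the actual Kaggle variables
obama$author_id <- 1
clinton_trump$author_id <- ifelse(clinton_trump$handle == "HillaryClinton", 2, 3)
# combine both sources into a single data frame of texts and labels
tweets <- rbind(
  data.frame(text = obama$text, author_id = obama$author_id),
  data.frame(text = clinton_trump$text, author_id = clinton_trump$author_id)
)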
Another issue we had to solve is class imbalance. To address it, we randomly selected 2629 instances from the Trump and Obama subsets to match the number of Clinton tweets. Finally, we split the data into training and test sets with an 80-20 ratio and set the seed for reproducibility.
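The undersampling and the split could look like the following sketch, reusing the combined tweets data frame from the previous sketch (the seed value and helper calls are illustrative, not the exact ones from our scripts):
library(dplyr)
set.seed(123)  # illustrative seed; any fixed value makes the split reproducible
# undersample each author to the size of the Clinton subset (2629 tweets)
balanced <- tweets %>%
  group_by(author_id) %>%
  slice_sample(n = 2629) %>%
  ungroup()
# 80-20 train/test split
train_idx <- sample(seq_len(nrow(balanced)), size = floor(0.8 * nrow(balanced)))
train_x <- balanced$text[train_idx]
train_y <- balanced$author_id[train_idx]
test_x  <- balanced$text[-train_idx]
test_y  <- balanced$author_id[-train_idx]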
To choose the maximum length used for the tokenization, we analysed the distribution of the number of words per tweet.
#for the maxlen choice
summary(sapply(train_x_test_maxlen , length))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.00 16.00 15.31 20.00 54.00
We chose a maximum length of 20 words, as it fully covers about 75% of the tweets and is significantly faster to process. We also tried 50 words and did not observe any major difference in accuracy.
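A quick check of the coverage of that cutoff, using the same word counts as above:
# proportion of tweets with at most 20 words (about 0.75, in line with the 3rd quartile)
mean(sapply(train_x_test_maxlen, length) <= 20)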
Regarding the tokenization, we chose to keep punctuation, as we observed that Donald Trump uses it frequently to emphasize his messages. We used the text_tokenizer function to segment the texts into smaller units, here words.
# Generate a tokenizer (its filters argument can be tuned; see the sketch after this block)
tokenizer <-
text_tokenizer(num_words = num_words) %>% fit_text_tokenizer(train_x)
# Tokenize `train_x` and `test_x` into sequences of integers, then pad them to `maxlen`.
train_x <-
texts_to_sequences(tokenizer = tokenizer, texts = train_x) %>% pad_sequences(maxlen = maxlen)
test_x <-
texts_to_sequences(tokenizer = tokenizer, texts = test_x) %>% pad_sequences(maxlen = maxlen)
# Transform `train_y` into one-hot encoded matrix.
train_y <- to_categorical(train_y)
test_y <- to_categorical(test_y)
# Creation of the word index
word_index <- tokenizer$word_index
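Note that text_tokenizer() strips most punctuation by default, so keeping punctuation means narrowing its filters argument. A minimal sketch of that tuning, with an illustrative filter string (not necessarily the one used in our scripts):
# default Keras filter set minus "!" and "?", so those characters stay in the tokens
tokenizer <- text_tokenizer(
  num_words = num_words,
  filters = "\"#$%&()*+,-./:;<=>@[\\]^_`{|}~\t\n"
) %>% fit_text_tokenizer(train_x)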
After preprocessing the datasets and tokenizing both the training and test sets, we started working on the modelling part. We wrote six different models (one per script) to find out which one would give us the best accuracy:
- RNN_pretrain.R: Recurrent Neural Network using pretrained word embeddings from Stanford.
- RNN_train.R: Recurrent Neural Network using self-trained word embeddings based on our data.
- RNN_pretrain bidirectional.R: Recurrent Neural Network using pretrained word embeddings from Stanford + bidirectional layer.
- RNN_train bidirectional.R: Recurrent Neural Network using self-trained word embeddings based on our data + bidirectional layer.
- CNN_pretrain.R: Combined Neural Network: Convolutional NN using pretrained word embeddings from Stanford with an RNN GRU layer.
- CNN_train.R: Combined Neural Network: Convolutional NN using self-trained word embeddings based on our data with an RNN GRU layer.
Recurrent Neural Networks are particularly appropriate for text mining: they process sequences by iterating over them while maintaining a state that contains information about previously seen inputs (source: Lecture 09). As word order matters, they should theoretically give us better results than Convolutional Neural Networks. We tested our models with both unidirectional and bidirectional layers. Another important difference from CNNs is the recurrent dropout, which is combined with regular dropout.
RNN_train.R with default values
num_words <- 7000 # the size of the vocabulary
maxlen <- 20 # the size of the text
embedding_dim <- 25 # the dimension of embeddings
model <- keras_model_sequential() %>%
layer_embedding(input_dim = num_words,
output_dim = embedding_dim,
input_length = maxlen) %>%
layer_gru(
units = 128,
dropout = 0.1,
recurrent_dropout = 0.4,
return_sequences = TRUE
) %>%
layer_gru(units = 128,
dropout = 0.1) %>%
layer_dense(units = 4, activation = "softmax")
model %>% compile(optimizer = optimizer_rmsprop(),
loss = loss_categorical_crossentropy,
metrics = metric_categorical_accuracy)
model %>% fit(
train_x,
train_y,
batch_size = 128,
epochs = 10,
validation_split = 0.2
)
Convolutional Neural Networks (CNN) are efficient at detecting patterns at different locations. They can also be used with sequential data through 1D convolutional layers. We combined CNN layers with RNN layers and word embeddings to create the Combined Neural Networks. The goal was to see whether they could outperform the “simpler” RNN models.
CNN_train.R with default values
num_words <- 7000 # the size of the vocabulary
maxlen <- 20 # the size of the text
embedding_dim <- 25 # the dimension of embeddings
model <- keras_model_sequential() %>%
layer_embedding(input_dim = num_words,
output_dim = embedding_dim,
input_length = maxlen) %>%
layer_conv_1d(
filters = 32,
kernel_size = 5,
strides = 1,
padding = "same",
activation = "relu"
) %>%
layer_max_pooling_1d(pool_size = 3) %>%
layer_conv_1d(
filters = 64,
kernel_size = 3,
strides = 1,
padding = "same",
activation = "relu"
) %>%
layer_gru(units = 128) %>%
layer_dense(units = 4, activation = "softmax")
model %>% compile(optimizer = optimizer_rmsprop(),
loss = loss_categorical_crossentropy,
metrics = metric_categorical_accuracy)
model %>% fit(
train_x,
train_y,
batch_size = 128,
epochs = 10,
validation_split = 0.2,
shuffle = TRUE
)
An important feature of both our RNN and CNN models is the word embedding, which represents words as vectors in a multidimensional space and makes it easier to recognise writing patterns. For the project, we first used word embeddings trained on our own data (see the scripts whose names include “train”). However, pretrained word embeddings can be used as well.
In this project, we used glove.twitter.27B, trained by a team at Stanford University on 2 billion tweets, amounting to 27 billion tokens. Four versions exist, with 25, 50, 100 and 200 dimensions. During our preliminary trials, we observed that word embeddings with more than 25-30 dimensions hurt accuracy. This can be explained by the relatively small size of our dataset, but mostly by the fact that we only distinguish three authors: increasing the number of dimensions merely overcomplicates the network and penalizes accuracy.
RNN_pretrain.R with default values
lines <-
readLines(here("embeddings/glove.twitter.27B/glove.twitter.27B.25d.txt"))
embeddings_index <- new.env(hash = TRUE, parent = emptyenv())
for (i in 1:length(lines)) {
line <- lines[[i]]
values <- strsplit(line, " ")[[1]]
word <- values[[1]]
embeddings_index[[word]] <- as.double(values[-1])
}
embedding_matrix <- array(0, c(num_words, embedding_dim))
for (word in names(word_index)) {
index <- word_index[[word]]
if (index < num_words) {
embedding_vector <- embeddings_index[[word]]
if (!is.null(embedding_vector))
embedding_matrix[index + 1, ] <- embedding_vector
}
}
model <- keras_model_sequential() %>%
layer_embedding(input_dim = num_words,
output_dim = embedding_dim,
input_length = maxlen) %>%
layer_gru(
units = 128,
dropout = 0.1,
recurrent_dropout = 0.4,
return_sequences = TRUE
) %>%
layer_gru(units = 128,
dropout = 0.1) %>%
layer_dense(units = 4, activation = "softmax")
# assign pretrained weights to the embedding layer
get_layer(model, index = 1) %>%
set_weights(list(embedding_matrix)) %>%
freeze_weights()
model %>% compile(optimizer = optimizer_rmsprop(),
loss = loss_categorical_crossentropy,
metrics = metric_categorical_accuracy)
model %>% fit(
train_x,
train_y,
batch_size = 128,
epochs = 10,
validation_split = 0.2
)
As training the RNNs with pretrained word embeddings requires a lot of computational resources, we trained our neural networks in the cloud (Google Cloud AI Platform). Each script, containing one model, was prepared for hyperparameter tuning using FLAGS. To set the range tested for each flag, a YAML file containing all the tuning information was created and sent to the server through the script control-gcp.R. The training and test sets, as well as the pretrained word embeddings, were also sent in RDS format.
Our strategy for this project was to first run preliminary tests locally. This allowed us to start with a large search space and then narrow it down. Regarding the pretrained word embeddings, we observed that the 25-dimension version performs better than the others, so we used only that one for the hyperparameter tuning. We followed the same strategy for the choice of the optimizer and of the RNN layers: we used GRU layers instead of LSTM layers, as they performed better in our tests.
After training and tuning all six models on the cloud with 20 runs each, we picked the two most accurate ones and re-tuned them with additional parameters, in order to optimize the accuracy. This second round was done with 40 runs each.
For the RNN models, we used GRU layers, as mentioned earlier, and tuned the number of units of both layers. We did the same for the dropouts, and we naturally included a recurrent dropout in our models. The number of words used in the embedding matrix and its dimensions were also tuned in all models (RNN and CNN) without pretrained embeddings.
Example of the tuning parameters for the RNN_train.R
FLAGS <- flags(
flag_integer("gru_units1", 128),
flag_numeric("dropout1", 0.1),
flag_numeric("rec_dropout1", 0.4),
flag_integer("gru_units2", 128),
flag_numeric("dropout2", 0.1),
flag_integer("num_words", 7000),
flag_integer("dim", 25)
)
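For reference, the same kind of flag exploration can also be launched locally with tfruns before sending the job to the cloud; a minimal sketch with illustrative ranges (not the exact values of our yml file):
library(tfruns)
# sample a fraction of a hypothetical grid of flag values for RNN_train.R
tuning_run(
  "RNN_train.R",
  sample = 0.1,
  flags = list(
    gru_units1   = c(64, 128, 256),
    dropout1     = c(0.1, 0.2, 0.3),
    rec_dropout1 = c(0.3, 0.4, 0.5),
    gru_units2   = c(64, 128, 256),
    dropout2     = c(0.1, 0.2, 0.3),
    num_words    = c(7000, 9000, 11000)
  )
)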
For the CNN Models, we used a similar strategy.
Example of the tuning parameters for the CNN_train.R
FLAGS <- flags(
flag_integer("gru_units1", 128),
flag_numeric("dropout1", 0.1),
flag_integer("num_words", 7000),
flag_integer("dim", 25)
)
After running each model for 20 trials, the results clearly showed the underperformance of the pretrained word embeddings: none of them outperformed the models with self-trained embeddings.
| script | metric_val_categorical_accuracy | flag_gru_units1 | flag_dropout1 | flag_rec_dropout1 | flag_gru_units2 | flag_dropout2 | flag_num_words | flag_dim |
|---|---|---|---|---|---|---|---|---|
| RNN_train.R | 0.902 | 60 | 0.111 | 0.431 | 274 | 0.234 | 8535 | NA |
| RNN_train.R | 0.902 | 204 | 0.288 | 0.483 | 94 | 0.245 | 9676 | NA |
| RNN_train.R | 0.900 | 172 | 0.292 | 0.500 | 171 | 0.233 | 8508 | NA |
| RNN_train.R | 0.900 | 237 | 0.175 | 0.401 | 247 | 0.032 | 9062 | NA |
| RNN_train.R | 0.899 | 120 | 0.084 | 0.491 | 112 | 0.108 | 7524 | NA |
| RNN_train bidirectional.R | 0.898 | 163 | 0.143 | 0.484 | 161 | 0.111 | 10981 | 10 |
| CNN_train.R | 0.897 | 66 | 0.028 | NA | NA | NA | 7199 | 50 |
| CNN_train.R | 0.897 | 63 | 0.298 | NA | NA | NA | 7108 | 50 |
| CNN_train.R | 0.897 | 52 | 0.079 | NA | NA | NA | 9016 | 36 |
| RNN_train.R | 0.896 | 147 | 0.023 | 0.308 | 204 | 0.010 | 9989 | NA |
| CNN_train.R | 0.895 | 110 | 0.245 | NA | NA | NA | 11111 | 33 |
| RNN_train bidirectional.R | 0.895 | 52 | 0.010 | 0.600 | 79 | 0.010 | 8189 | 50 |
| RNN_train bidirectional.R | 0.895 | 131 | 0.137 | 0.429 | 63 | 0.030 | 11209 | 34 |
| CNN_train.R | 0.894 | 246 | 0.032 | NA | NA | NA | 10036 | 25 |
| RNN_train bidirectional.R | 0.894 | 292 | 0.300 | 0.300 | 90 | 0.300 | 8946 | 10 |
| CNN_train.R | 0.893 | 283 | 0.023 | NA | NA | NA | 11992 | 50 |
| CNN_train.R | 0.893 | 293 | 0.026 | NA | NA | NA | 7178 | 48 |
| CNN_train.R | 0.893 | 78 | 0.156 | NA | NA | NA | 10907 | 45 |
| RNN_train bidirectional.R | 0.893 | 92 | 0.014 | 0.301 | 109 | 0.298 | 8557 | 36 |
| RNN_train bidirectional.R | 0.893 | 272 | 0.065 | 0.395 | 285 | 0.031 | 7753 | 35 |
Among the best-ranked models, none performed significantly better than the others. However, the Recurrent Neural Network with bidirectional layers was slightly less accurate than the standard RNN. Therefore, we excluded it from further research and focused our resources on the following two models: the RNN with self-trained embeddings (RNN_train.R) and the combined CNN with self-trained embeddings (CNN_train.R).
Having identified the two best models, we refined them by tuning additional hyperparameters in order to increase their accuracy. In addition to the previous hyperparameters, we also tuned the learning rate, the patience and the batch size. Each of these two models was run 40 times, as mentioned previously.
Tuning parameters for the RNN_train_2.R
FLAGS <- flags(
flag_integer("gru_units1", 128),
flag_numeric("dropout1", 0.1),
flag_numeric("rec_dropout1", 0.4),
flag_integer("gru_units2", 128),
flag_numeric("dropout2", 0.1),
flag_integer("num_words", 7000),
flag_integer("dim", 25),
flag_numeric("learningrate",0.001),
flag_integer("patience",5),
flag_integer("batch",128)
)
Tuning parameters for the CNN_train_2.R
FLAGS <- flags(
flag_integer("gru_units1", 128),
flag_numeric("dropout1", 0.1),
flag_integer("num_words", 7000),
flag_integer("dim",25),
flag_numeric("learningrate",0.001),
flag_integer("patience",5),
flag_integer("batch",128)
)
The results of this second round of tuning were the following:
| script | metric_val_categorical_accuracy | flag_gru_units1 | flag_dropout1 | flag_rec_dropout1 | flag_gru_units2 | flag_dropout2 | flag_num_words | flag_dim |
|---|---|---|---|---|---|---|---|---|
| CNN_train_2.R | 0.904 | 100 | 0.011 | NA | NA | NA | 8721 | 48 |
| RNN_train_2.R | 0.896 | 65 | 0.116 | 0.566 | 255 | 0.298 | 7017 | 33 |
| RNN_train_2.R | 0.892 | 297 | 0.281 | 0.573 | 231 | 0.293 | 7148 | 48 |
| CNN_train_2.R | 0.892 | 60 | 0.021 | NA | NA | NA | 8538 | 11 |
| RNN_train_2.R | 0.890 | 50 | 0.300 | 0.600 | 98 | 0.300 | 8110 | 28 |
| RNN_train_2.R | 0.889 | 142 | 0.015 | 0.306 | 106 | 0.054 | 9583 | 48 |
| CNN_train_2.R | 0.888 | 105 | 0.160 | NA | NA | NA | 7035 | 25 |
| RNN_train_2.R | 0.888 | 114 | 0.300 | 0.600 | 145 | 0.010 | 9936 | 50 |
| CNN_train_2.R | 0.888 | 53 | 0.011 | NA | NA | NA | 11639 | 11 |
| CNN_train_2.R | 0.887 | 50 | 0.010 | NA | NA | NA | 8440 | 50 |
| CNN_train_2.R | 0.886 | 138 | 0.297 | NA | NA | NA | 10960 | 46 |
| RNN_train_2.R | 0.885 | 127 | 0.019 | 0.306 | 50 | 0.278 | 7607 | 37 |
| CNN_train_2.R | 0.882 | 279 | 0.018 | NA | NA | NA | 7459 | 50 |
| RNN_train_2.R | 0.881 | 289 | 0.300 | 0.300 | 171 | 0.300 | 9558 | 10 |
| RNN_train_2.R | 0.880 | 140 | 0.012 | 0.493 | 153 | 0.020 | 8618 | 43 |
| CNN_train_2.R | 0.880 | 292 | 0.270 | NA | NA | NA | 11732 | 12 |
| CNN_train_2.R | 0.874 | 160 | 0.285 | NA | NA | NA | 8674 | 14 |
| CNN_train_2.R | 0.872 | 294 | 0.246 | NA | NA | NA | 7212 | 47 |
| RNN_train_2.R | 0.870 | 297 | 0.030 | 0.309 | 288 | 0.030 | 7095 | 50 |
| RNN_train_2.R | 0.869 | 67 | 0.275 | 0.312 | 259 | 0.197 | 7210 | 27 |
The best model is therefore the Convolutional Neural Network with self-trained word embeddings (validation accuracy: 0.904), whose optimised hyperparameters are:
Optimised hyperparameters of the best model
FLAGS <- flags(
flag_integer("gru_units1", 100),
flag_numeric("dropout1", 0.011),
flag_integer("num_words", 8721),
flag_integer("dim",48),
flag_numeric("learningrate",0.009),
flag_integer("batch",73)
)
Having found the best model, we retrained it using its optimised hyperparameters.
retrain.R with best values
# 1. Define the model
#-------------------------------------------------------------------------------
num_words <- FLAGS$num_words # the size of the vocabulary
maxlen <- 20 # the size of the text
embedding_dim <- FLAGS$dim # the dimension of embeddings
best_model <- keras_model_sequential() %>%
layer_embedding(input_dim = num_words,
output_dim = embedding_dim,
input_length = maxlen) %>%
layer_conv_1d(filters = 32, kernel_size = 5, strides = 1, padding = "same",
activation = "relu") %>%
layer_max_pooling_1d(pool_size = 3) %>%
layer_conv_1d(filters = 64, kernel_size = 3, strides = 1, padding = "same",
activation = "relu") %>%
layer_gru(units = FLAGS$gru_units1,dropout = FLAGS$dropout1) %>%
layer_dense(units = 4, activation = "softmax")
# 2. Compile the model
#-------------------------------------------------------------------------------
best_model %>% compile(
optimizer = optimizer_rmsprop(lr = FLAGS$learningrate),
loss = "categorical_crossentropy",
metrics = metric_categorical_accuracy
)
# 3. Fit the model
#-------------------------------------------------------------------------------
best_model %>% fit(
train_x,
train_y,
batch_size = FLAGS$batch,
epochs = 12
)
# 4. Save model
#-------------------------------------------------------------------------------
best_model %>% save_model_hdf5("results/best-model.hdf5")
Model evaluation
# 5. Evaluate the model
#-------------------------------------------------------------------------------
best_model <- load_model_hdf5(
filepath = here("results/best-model.hdf5"))
test_x <- readRDS(here("rds/test_x.rds"))
test_y <- readRDS(here("rds/test_y.rds"))
evaluation <- best_model %>%
evaluate(test_x,test_y)
With this model, we reached a categorical accuracy of 0.9066427 on the test set, which is higher than expected.
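To see which authors get confused with each other, a per-author confusion matrix can be derived from the predictions; a minimal sketch using the objects loaded above:
# predicted vs. true labels (1 = Obama, 2 = Clinton, 3 = Trump; column 0 of the one-hot matrix is unused)
pred_class <- max.col(predict(best_model, test_x)) - 1
true_class <- max.col(test_y) - 1
table(true = true_class, predicted = pred_class)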
First of all, we were positively surprised by the predictive ability of our best model, which is higher than what we expected. Several factors can explain this result.
First, Donald Trump, Barack Obama and Hillary Clinton were good candidates for this prediction task, as each of them has a distinctive personal writing style. Although Barack Obama and Hillary Clinton are both Democrats, they have different communication styles, which makes them easier to classify (and this effect is even stronger for Donald Trump).
Another interesting result is that we obtained better accuracy with self-trained word embeddings than with pretrained ones. Although we were surprised at first, we then realized that the pretrained embeddings are trained on billions of tweets from millions of people, each with their own writing style and vocabulary. They would probably perform better on larger datasets covering many more authors; for limited classification tasks like ours, we would recommend against using them.
Finally, to go further, classifying tweets according to political party membership could justify further research. However, we struggled to find other datasets of American politicians. Another interesting project could be to classify political affiliation on a larger scale, for example by using the tweets of thousands of non-politician individuals with a clear affiliation to a political party.