Today, American politicians use Twitter to communicate their opinions. With this in mind, we decided to test whether an algorithm can recognise the different authors of tweets. This would make it possible to check whether a tweet was actually written by its claimed author (for example, to detect a hacked account).
To do so, we trained a model to classify the tweets of three prominent US politicians: Barack Obama, Hillary Clinton and Donald Trump. We found several freely available datasets on Kaggle, each containing between 3000 and 8000 tweets. We had to be careful not to include the retweets present in the datasets, as they could mislead the algorithm. Each tweet contains a maximum of 280 characters.
On an additional note, Multinomial Naive Bayes has been used in the past for this kind of task, but neural networks are among the most accurate methods for text classification. Our approach is to use Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) with word embeddings, which reduce the number of dimensions compared to one-hot encoding every single word.
Both datasets are available for free on Kaggle.com. The Barack Obama dataset has only one column, and each instance mixes various pieces of information, such as the URL, the date, the number of retweets, the message, etc. Its 6930 tweets span November 2012 to April 2019.
The Hillary Clinton and Donald Trump dataset contains 6444 tweets from a smaller time window (January 2016 to November 2016). However, both American politicians were very prolific tweeters at the time, each sending more than 3000 tweets; as a reminder, 2016 was an election year. This dataset is also better structured, as it contains 28 variables, such as the source, the date, the number of retweets, etc.
The first part of the project was to clean and prepare the data. The dataset containing the tweets from Clinton and Trump has a binary variable flagging retweets. We removed retweets from our analysis because we want to recognise the authors' writing styles without being biased by third-party tweets.
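A minimal sketch of this filtering step, assuming the flag is a column named is_retweet (the actual name in the Kaggle file may differ):
library(dplyr)
# keep only original tweets; use is_retweet == "False" instead if the flag is stored as text
clinton_trump <- clinton_trump %>%
  filter(is_retweet == FALSE)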
By looking deeper into the data, we identified some elements that could harm the accuracy of our future models. For example, we decided to remove URLs with regular expressions, as all of them are unique and bring no additional information about the authors.
# remove Twitter URLs with qdapRegex::rm_twitter_url (default arguments spelled out)
clinton_trump$text <- rm_twitter_url(
clinton_trump$text,
clean = TRUE,
pattern = "@rm_twitter_url",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library")
)
The second important feature of tweets is the hashtag. Hashtags are used to reference tweets and give more visibility to the message. The decision to keep them or not was not straightforward: at first, we thought they would not add accuracy and would only introduce noise into our predictions.
# strip hashtags and mentions with stringr::str_replace_all
clinton_trump$text <-
str_replace_all(clinton_trump$text, "([@#][\\w_-]+)", "")
However, after training the neural networks, we decided to keep the hashtags, as they surprisingly increased the accuracy of our models. We have a few hypotheses as to why this happens.
Finally, the tweet texts were extracted from the Obama dataset using different regular expressions (package stringr).
# Take out the first 50 characters, which contain the date, the twitter account, etc.
obama$text <- substring(obama$text, 50)
# remove the URL that leads to the tweet (everything after the first ';' or ':')
obama$text <- sub("\\;.*", "", obama$text)
obama$text <- sub("\\:.*", "", obama$text)
After cleaning the data and removing information that could add noise to the predictions, we assigned a unique number to identify the author of each tweet (a sketch of this step follows the mapping below):
- 1 => Barack Obama
- 2 => Hillary Clinton
- 3 => Donald Trump
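A minimal sketch of how this labelling could be implemented, assuming the Clinton/Trump dataset identifies the author through a handle column (all column and handle names here are assumptions):
# hypothetical column/handle names; adapt to the actual Kaggle variables
obama$author_id <- 1
clinton_trump$author_id <- ifelse(clinton_trump$handle == "HillaryClinton", 2, 3)
# combine both sources into a single data frame of texts and labels
tweets <- rbind(
  data.frame(text = obama$text, author_id = obama$author_id),
  data.frame(text = clinton_trump$text, author_id = clinton_trump$author_id)
)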
Another issue we had to solve is class imbalance. To address it, we randomly selected 2629 instances from the Trump and Obama subsets to match the number of Clinton tweets. Finally, we split the data into training and test sets with an 80-20 ratio and set the seed for reproducibility.
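The undersampling and the split could look like the following sketch, reusing the combined tweets data frame from the previous sketch (the seed value and helper calls are illustrative, not the exact ones from our scripts):
library(dplyr)
set.seed(123)  # illustrative seed; any fixed value makes the split reproducible
# undersample each author to the size of the Clinton subset (2629 tweets)
balanced <- tweets %>%
  group_by(author_id) %>%
  slice_sample(n = 2629) %>%
  ungroup()
# 80-20 train/test split
train_idx <- sample(seq_len(nrow(balanced)), size = floor(0.8 * nrow(balanced)))
train_x <- balanced$text[train_idx]
train_y <- balanced$author_id[train_idx]
test_x  <- balanced$text[-train_idx]
test_y  <- balanced$author_id[-train_idx]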
To choose the maximum length used for the tokenization, we analysed the distribution of the number of words per tweet.
#for the maxlen choice
summary(sapply(train_x_test_maxlen , length))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.00 16.00 15.31 20.00 54.00
We chose a maximum length of 20 words, as it fully covers about 75% of the tweets and is significantly faster to process. We also tried 50 words and did not observe any major difference in accuracy.
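A quick check of the coverage of that cutoff, using the same word counts as above:
# proportion of tweets with at most 20 words (about 0.75, in line with the 3rd quartile)
mean(sapply(train_x_test_maxlen, length) <= 20)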
Regarding the tokenization, we chose to keep punctuation, as we observed that Donald Trump uses it frequently to emphasize his messages. We used the text_tokenizer function to segment the texts into smaller units, here words.
# Generate a tokenizer (its filters argument can be tuned; see the sketch after this block)
tokenizer <-
text_tokenizer(num_words = num_words) %>% fit_text_tokenizer(train_x)
# Tokenize `train_x` and `test_x` into sequences of integers, then pad them to `maxlen`.
train_x <-
texts_to_sequences(tokenizer = tokenizer, texts = train_x) %>% pad_sequences(maxlen = maxlen)
test_x <-
texts_to_sequences(tokenizer = tokenizer, texts = test_x) %>% pad_sequences(maxlen = maxlen)
# Transform `train_y` into one-hot encoded matrix.
train_y <- to_categorical(train_y)
test_y <- to_categorical(test_y)
# Creation of the word index
word_index <- tokenizer$word_index
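Note that text_tokenizer() strips most punctuation by default, so keeping punctuation means narrowing its filters argument. A minimal sketch of that tuning, with an illustrative filter string (not necessarily the one used in our scripts):
# default Keras filter set minus "!" and "?", so those characters stay in the tokens
tokenizer <- text_tokenizer(
  num_words = num_words,
  filters = "\"#$%&()*+,-./:;<=>@[\\]^_`{|}~\t\n"
) %>% fit_text_tokenizer(train_x)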
After preprocessing the datasets and tokenizing both the training and test sets, we started working on the modelling part. We wrote six different models (one per script) to find out which one would give us the best accuracy:
- RNN_pretrain.R: Recurrent Neural Network using pretrained word embeddings from Stanford.
- RNN_train.R: Recurrent Neural Network using self-trained word embeddings based on our data.
- RNN_pretrain bidirectional.R: Recurrent Neural Network using pretrained word embeddings from Stanford + bidirectional layer.
- RNN_train bidirectional.R: Recurrent Neural Network using self-trained word embeddings based on our data + bidirectional layer.
- CNN_pretrain.R: Combined Neural Network: Convolutional NN using pretrained word embeddings from Stanford with an RNN GRU layer.
- CNN_train.R: Combined Neural Network: Convolutional NN using self-trained word embeddings based on our data with an RNN GRU layer.
Recurrent Neural Networks are particularly appropriate for text mining: they process sequences by iterating over them while maintaining a state that contains information about previously seen inputs (source: Lecture 09). As word order matters, they should theoretically give us better results than Convolutional Neural Networks. We tested our models with both unidirectional and bidirectional layers. Another important difference from CNNs is the recurrent dropout, which is combined with regular dropout.
RNN_train.R with default values
num_words <- 7000 # the size of the vocabulary
maxlen <- 20 # the size of the text
embedding_dim <- 25 # the dimension of embeddings
model <- keras_model_sequential() %>%
layer_embedding(input_dim = num_words,
output_dim = embedding_dim,
input_length = maxlen) %>%
layer_gru(
units = 128,
dropout = 0.1,
recurrent_dropout = 0.4,
return_sequences = TRUE
) %>%
layer_gru(units = 128,
dropout = 0.1) %>%
layer_dense(units = 4, activation = "softmax")
model %>% compile(optimizer = optimizer_rmsprop(),
loss = loss_categorical_crossentropy,
metrics = metric_categorical_accuracy)
model %>% fit(
train_x,
train_y,
batch_size = 128,
epochs = 10,
validation_split = 0.2
)
Convolutional Neural Networks (CNN) are efficient at detecting patterns at different locations. They can also be used with sequential data through 1D convolutional layers. We combined CNN layers with RNN layers and word embeddings to create the Combined Neural Networks. The goal was to see whether they could outperform the “simpler” RNN models.
CNN_train.R with default values
num_words <- 7000 # the size of the vocabulary
maxlen <- 20 # the size of the text
embedding_dim <- 25 # the dimension of embeddings
model <- keras_model_sequential() %>%
layer_embedding(input_dim = num_words,
output_dim = embedding_dim,
input_length = maxlen) %>%
layer_conv_1d(
filters = 32,
kernel_size = 5,
strides = 1,
padding = "same",
activation = "relu"
) %>%
layer_max_pooling_1d(pool_size = 3) %>%
layer_conv_1d(
filters = 64,
kernel_size = 3,
strides = 1,
padding = "same",
activation = "relu"
) %>%
layer_gru(units = 128) %>%
layer_dense(units = 4, activation = "softmax")
model %>% compile(optimizer = optimizer_rmsprop(),
loss = loss_categorical_crossentropy,
metrics = metric_categorical_accuracy)
model %>% fit(
train_x,
train_y,
batch_size = 128,
epochs = 10,
validation_split = 0.2,
shuffle = TRUE
)
An important feature of both our RNN and CNN models is the word embedding, which represents words as vectors in a multidimensional space and makes it easier to recognise writing patterns. For the project, we first used word embeddings trained on our own data (see the scripts whose names include “train”). However, pretrained word embeddings can be used as well.
In this project, we used glove.twitter.27B, trained by a team at Stanford University on 2 billion tweets, amounting to 27 billion tokens. Four versions exist, with 25, 50, 100 and 200 dimensions. During our preliminary trials, we observed that word embeddings with more than 25-30 dimensions hurt accuracy. This can be explained by the relatively small size of our dataset, but mostly by the fact that we only distinguish three authors: increasing the number of dimensions merely overcomplicates the network and penalizes accuracy.
RNN_pretrain.R with default values
lines <-
readLines(here("embeddings/glove.twitter.27B/glove.twitter.27B.25d.txt"))
embeddings_index <- new.env(hash = TRUE, parent = emptyenv())
for (i in 1:length(lines)) {
line <- lines[[i]]
values <- strsplit(line, " ")[[1]]
word <- values[[1]]
embeddings_index[[word]] <- as.double(values[-1])
}
embedding_matrix <- array(0, c(num_words, embedding_dim))
for (word in names(word_index)) {
index <- word_index[[word]]
if (index < num_words) {
embedding_vector <- embeddings_index[[word]]
if (!is.null(embedding_vector))
embedding_matrix[index + 1, ] <- embedding_vector
}
}
model <- keras_model_sequential() %>%
layer_embedding(input_dim = num_words,
output_dim = embedding_dim,
input_length = maxlen) %>%
layer_gru(
units = 128,
dropout = 0.1,
recurrent_dropout = 0.4,
return_sequences = TRUE
) %>%
layer_gru(units = 128,
dropout = 0.1) %>%
layer_dense(units = 4, activation = "softmax")
# assign pretrained weights to the embedding layer
get_layer(model, index = 1) %>%
set_weights(list(embedding_matrix)) %>%
freeze_weights()
model %>% compile(optimizer = optimizer_rmsprop(),
loss = loss_categorical_crossentropy,
metrics = metric_categorical_accuracy)
model %>% fit(
train_x,
train_y,
batch_size = 128,
epochs = 10,
validation_split = 0.2
)
As training the RNNs with pretrained word embeddings requires a lot of computational resources, we trained our neural networks in the cloud (Google Cloud AI Platform). Each script, containing one model, was prepared for hyperparameter tuning using FLAGS. To set the range tested for each flag, a YAML file containing all the tuning information was created and sent to the server through the script control-gcp.R. The training and test sets, as well as the pretrained word embeddings, were also sent in RDS format.
Our strategy for this project was to first run preliminary tests locally. This allowed us to start with a large search space and then narrow it down. Regarding the pretrained word embeddings, we observed that the 25-dimension version performs better than the others, so we used only that one for the hyperparameter tuning. We followed the same strategy for the choice of the optimizer and of the RNN layers: we used GRU layers instead of LSTM layers, as they performed better in our tests.
After training and tuning all six models on the cloud with 20 runs each, we picked the two most accurate ones and re-tuned them with additional parameters, in order to optimize the accuracy. This second round was done with 40 runs each.
For the RNN models, we used GRU layers, as mentioned earlier, and tuned the number of units of both layers. We did the same for the dropouts, and we naturally included a recurrent dropout in our models. The number of words used in the embedding matrix and its dimensions were also tuned in all models (RNN and CNN) without pretrained embeddings.
Example of the tuning parameters for the RNN_train.R
FLAGS <- flags(
flag_integer("gru_units1", 128),
flag_numeric("dropout1", 0.1),
flag_numeric("rec_dropout1", 0.4),
flag_integer("gru_units2", 128),
flag_numeric("dropout2", 0.1),
flag_integer("num_words", 7000),
flag_integer("dim", 25)
)
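For reference, the same kind of flag exploration can also be launched locally with tfruns before sending the job to the cloud; a minimal sketch with illustrative ranges (not the exact values of our yml file):
library(tfruns)
# sample a fraction of a hypothetical grid of flag values for RNN_train.R
tuning_run(
  "RNN_train.R",
  sample = 0.1,
  flags = list(
    gru_units1   = c(64, 128, 256),
    dropout1     = c(0.1, 0.2, 0.3),
    rec_dropout1 = c(0.3, 0.4, 0.5),
    gru_units2   = c(64, 128, 256),
    dropout2     = c(0.1, 0.2, 0.3),
    num_words    = c(7000, 9000, 11000)
  )
)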
For the CNN Models, we used a similar strategy.
Example of the tuning parameters for the CNN_train.R
FLAGS <- flags(
flag_integer("gru_units1", 128),
flag_numeric("dropout1", 0.1),
flag_integer("num_words", 7000),
flag_integer("dim", 25)
)
After running each model for 20 trials, the results clearly showed the underperformance of the pretrained word embeddings: none of them outperformed the models with self-trained embeddings.
| script | metric_val_categorical_accuracy | flag_gru_units1 | flag_dropout1 | flag_rec_dropout1 | flag_gru_units2 | flag_dropout2 | flag_num_words | flag_dim |
|---|---|---|---|---|---|---|---|---|
| RNN_train.R | 0.902 | 60 | 0.111 | 0.431 | 274 | 0.234 | 8535 | NA |
| RNN_train.R | 0.902 | 204 | 0.288 | 0.483 | 94 | 0.245 | 9676 | NA |
| RNN_train.R | 0.900 | 172 | 0.292 | 0.500 | 171 | 0.233 | 8508 | NA |
| RNN_train.R | 0.900 | 237 | 0.175 | 0.401 | 247 | 0.032 | 9062 | NA |
| RNN_train.R | 0.899 | 120 | 0.084 | 0.491 | 112 | 0.108 | 7524 | NA |
| RNN_train bidirectional.R | 0.898 | 163 | 0.143 | 0.484 | 161 | 0.111 | 10981 | 10 |
| CNN_train.R | 0.897 | 66 | 0.028 | NA | NA | NA | 7199 | 50 |
| CNN_train.R | 0.897 | 63 | 0.298 | NA | NA | NA | 7108 | 50 |
| CNN_train.R | 0.897 | 52 | 0.079 | NA | NA | NA | 9016 | 36 |
| RNN_train.R | 0.896 | 147 | 0.023 | 0.308 | 204 | 0.010 | 9989 | NA |
| CNN_train.R | 0.895 | 110 | 0.245 | NA | NA | NA | 11111 | 33 |
| RNN_train bidirectional.R | 0.895 | 52 | 0.010 | 0.600 | 79 | 0.010 | 8189 | 50 |
| RNN_train bidirectional.R | 0.895 | 131 | 0.137 | 0.429 | 63 | 0.030 | 11209 | 34 |
| CNN_train.R | 0.894 | 246 | 0.032 | NA | NA | NA | 10036 | 25 |
| RNN_train bidirectional.R | 0.894 | 292 | 0.300 | 0.300 | 90 | 0.300 | 8946 | 10 |
| CNN_train.R | 0.893 | 283 | 0.023 | NA | NA | NA | 11992 | 50 |
| CNN_train.R | 0.893 | 293 | 0.026 | NA | NA | NA | 7178 | 48 |
| CNN_train.R | 0.893 | 78 | 0.156 | NA | NA | NA | 10907 | 45 |
| RNN_train bidirectional.R | 0.893 | 92 | 0.014 | 0.301 | 109 | 0.298 | 8557 | 36 |
| RNN_train bidirectional.R | 0.893 | 272 | 0.065 | 0.395 | 285 | 0.031 | 7753 | 35 |
Among the best-ranked models, none performed significantly better than the others. However, the Recurrent Neural Network with bidirectional layers was slightly less accurate than the standard RNN. Therefore, we excluded it from further research and focused our resources on the following two models: the RNN with self-trained embeddings (RNN_train.R) and the combined CNN with self-trained embeddings (CNN_train.R).
Having identified the two best models, we refined them by tuning additional hyperparameters in order to increase their accuracy. In addition to the previous hyperparameters, we also tuned the learning rate, the patience and the batch size. Each of these two models was run 40 times, as mentioned previously.
Tuning parameters for the RNN_train_2.R
FLAGS <- flags(
flag_integer("gru_units1", 128),
flag_numeric("dropout1", 0.1),
flag_numeric("rec_dropout1", 0.4),
flag_integer("gru_units2", 128),
flag_numeric("dropout2", 0.1),
flag_integer("num_words", 7000),
flag_integer("dim", 25),
flag_numeric("learningrate",0.001),
flag_integer("patience",5),
flag_integer("batch",128)
)
Tuning parameters for the CNN_train_2.R
FLAGS <- flags(
flag_integer("gru_units1", 128),
flag_numeric("dropout1", 0.1),
flag_integer("num_words", 7000),
flag_integer("dim",25),
flag_numeric("learningrate",0.001),
flag_integer("patience",5),
flag_integer("batch",128)
)
The results of this second round of tuning were the following:
| script | metric_val_categorical_accuracy | flag_gru_units1 | flag_dropout1 | flag_rec_dropout1 | flag_gru_units2 | flag_dropout2 | flag_num_words | flag_dim |
|---|---|---|---|---|---|---|---|---|
| CNN_train_2.R | 0.904 | 100 | 0.011 | NA | NA | NA | 8721 | 48 |
| RNN_train_2.R | 0.896 | 65 | 0.116 | 0.566 | 255 | 0.298 | 7017 | 33 |
| RNN_train_2.R | 0.892 | 297 | 0.281 | 0.573 | 231 | 0.293 | 7148 | 48 |
| CNN_train_2.R | 0.892 | 60 | 0.021 | NA | NA | NA | 8538 | 11 |
| RNN_train_2.R | 0.890 | 50 | 0.300 | 0.600 | 98 | 0.300 | 8110 | 28 |
| RNN_train_2.R | 0.889 | 142 | 0.015 | 0.306 | 106 | 0.054 | 9583 | 48 |
| CNN_train_2.R | 0.888 | 105 | 0.160 | NA | NA | NA | 7035 | 25 |
| RNN_train_2.R | 0.888 | 114 | 0.300 | 0.600 | 145 | 0.010 | 9936 | 50 |
| CNN_train_2.R | 0.888 | 53 | 0.011 | NA | NA | NA | 11639 | 11 |
| CNN_train_2.R | 0.887 | 50 | 0.010 | NA | NA | NA | 8440 | 50 |
| CNN_train_2.R | 0.886 | 138 | 0.297 | NA | NA | NA | 10960 | 46 |
| RNN_train_2.R | 0.885 | 127 | 0.019 | 0.306 | 50 | 0.278 | 7607 | 37 |
| CNN_train_2.R | 0.882 | 279 | 0.018 | NA | NA | NA | 7459 | 50 |
| RNN_train_2.R | 0.881 | 289 | 0.300 | 0.300 | 171 | 0.300 | 9558 | 10 |
| RNN_train_2.R | 0.880 | 140 | 0.012 | 0.493 | 153 | 0.020 | 8618 | 43 |
| CNN_train_2.R | 0.880 | 292 | 0.270 | NA | NA | NA | 11732 | 12 |
| CNN_train_2.R | 0.874 | 160 | 0.285 | NA | NA | NA | 8674 | 14 |
| CNN_train_2.R | 0.872 | 294 | 0.246 | NA | NA | NA | 7212 | 47 |
| RNN_train_2.R | 0.870 | 297 | 0.030 | 0.309 | 288 | 0.030 | 7095 | 50 |
| RNN_train_2.R | 0.869 | 67 | 0.275 | 0.312 | 259 | 0.197 | 7210 | 27 |
The best model is therefore the Convolutional Neural Network with self-trained word embeddings (validation accuracy: 0.904), whose optimised hyperparameters are:
Optimised hyperparameters of the best model
FLAGS <- flags(
flag_integer("gru_units1", 100),
flag_numeric("dropout1", 0.011),
flag_integer("num_words", 8721),
flag_integer("dim",48),
flag_numeric("learningrate",0.009),
flag_integer("batch",73)
)
Having found the best model, we retrained it using its optimised hyperparameters.
retrain.R with best values
# 1. Define the model
#-------------------------------------------------------------------------------
num_words <- FLAGS$num_words # the size of the vocabulary
maxlen <- 20 # the size of the text
embedding_dim <- FLAGS$dim # the dimension of embeddings
best_model <- keras_model_sequential() %>%
layer_embedding(input_dim = num_words,
output_dim = embedding_dim,
input_length = maxlen) %>%
layer_conv_1d(filters = 32, kernel_size = 5, strides = 1, padding = "same",
activation = "relu") %>%
layer_max_pooling_1d(pool_size = 3) %>%
layer_conv_1d(filters = 64, kernel_size = 3, strides = 1, padding = "same",
activation = "relu") %>%
layer_gru(units = FLAGS$gru_units1,dropout = FLAGS$dropout1) %>%
layer_dense(units = 4, activation = "softmax")
# 2. Compile the model
#-------------------------------------------------------------------------------
best_model %>% compile(
optimizer = optimizer_rmsprop(lr = FLAGS$learningrate),
loss = "categorical_crossentropy",
metrics = metric_categorical_accuracy
)
# 3. Fit the model
#-------------------------------------------------------------------------------
best_model %>% fit(
train_x,
train_y,
batch_size = FLAGS$batch,
epochs = 12
)
# 4. Save model
#-------------------------------------------------------------------------------
best_model %>% save_model_hdf5("results/best-model.hdf5")
Model evaluation
# 5. Evaluate the model
#-------------------------------------------------------------------------------
best_model <- load_model_hdf5(
filepath = here("results/best-model.hdf5"))
test_x <- readRDS(here("rds/test_x.rds"))
test_y <- readRDS(here("rds/test_y.rds"))
evaluation <- best_model %>%
evaluate(test_x,test_y)
With this model, we reached a categorical accuracy of 0.9066427 on the test set, which is higher than expected.
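To see which authors get confused with each other, a per-author confusion matrix can be derived from the predictions; a minimal sketch using the objects loaded above:
# predicted vs. true labels (1 = Obama, 2 = Clinton, 3 = Trump; column 0 of the one-hot matrix is unused)
pred_class <- max.col(predict(best_model, test_x)) - 1
true_class <- max.col(test_y) - 1
table(true = true_class, predicted = pred_class)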
First of all, we were positively surprised by the predictive ability of our best model, which is higher than what we expected. Several factors can explain this result.
First, Donald Trump, Barack Obama and Hillary Clinton were good candidates for this prediction task, as each of them has a distinctive personal writing style. Although Barack Obama and Hillary Clinton are both Democrats, they have different communication styles, which makes them easier to classify (and this effect is even stronger for Donald Trump).
Another interesting result is that we obtained better accuracy with self-trained word embeddings than with pretrained ones. Although we were surprised at first, we then realized that the pretrained embeddings are trained on billions of tweets from millions of people, each with their own writing style and vocabulary. They would probably perform better on larger datasets covering many more authors; for limited classification tasks like ours, we would recommend against using them.
Finally, to go further, classifying tweets according to political party membership could justify further research. However, we struggled to find other datasets of American politicians. Another interesting project could be to classify political affiliation on a larger scale, for example by using the tweets of thousands of non-politician individuals with a clear affiliation to a political party.