The last two years have been marked by one of the most memorable events in the history of the human race: a worldwide pandemic that took everyone by surprise, from big companies and governments to the most ordinary citizens around the world.
As of today, October 3rd, 2021, the virus has taken the lives of more than 4 million people all over the globe, with more than 210 million cases registered as well.
Luckily, the efforts to stop the pandemic have produced great results in controlling the virus. With almost 34.2% of the world population fully vaccinated, it is safe to assume that before too long COVID will be neutralized, as long as the necessary actions are followed by the general population and governments.
Even though the pandemic might be coming to an end, it has left us with many terrible moments as well as some positive ones. This is reflected on Twitter, where people worldwide can share their feelings about the virus with everyone else.
The idea of this project is to study these emotions through tweets and, subsequently, build a model that can predict whether a tweet expresses a positive, negative or neutral sentiment.
For this purpose we have a dataset containing 44955 tweets expressing people's feelings about different moments of the pandemic.
The data is divided into training and testing subsets containing 41157 and 3798 tweets respectively. However, we will bind them into a single dataset and then create our own partition.
We also have information about the date when each tweet was posted, as well as its sentiment, which is the variable that we want to model.
library(tidyverse)
library(stringr)
library(quanteda)
library(caret)
library(tensorflow)
library(keras)
library(irlba)
reticulate::py_config()
## python: C:/Users/jaram/anaconda3/envs/r-reticulate/python.exe
## libpython: C:/Users/jaram/anaconda3/envs/r-reticulate/python37.dll
## pythonhome: C:/Users/jaram/anaconda3/envs/r-reticulate
## version: 3.7.10 (default, Feb 26 2021, 13:06:18) [MSC v.1916 64 bit (AMD64)]
## Architecture: 64bit
## numpy: C:/Users/jaram/anaconda3/envs/r-reticulate/Lib/site-packages/numpy
## numpy_version: 1.19.5
## tensorflow: C:\Users\jaram\ANACON~1\envs\R-RETI~1\lib\site-packages\tensorflow\__init__.p
##
## python versions found:
## C:/Users/jaram/anaconda3/envs/r-reticulate/python.exe
## C:/Users/jaram/anaconda3/python.exe
## C:/Users/jaram/anaconda3/envs/python_tensor/python.exe
tf_config()
## TensorFlow v2.4.0 (C:\Users\jaram\ANACON~1\envs\R-RETI~1\lib\site-packages\tensorflow\__init__.p)
## Python v3.7 (C:/Users/jaram/anaconda3/envs/r-reticulate/python.exe)
train = read.csv("Corona_NLP_train.csv")
test = read.csv("Corona_NLP_test.csv")
tweet = rbind(train,test)
prop.table(table(tweet$Sentiment))*100
##
## Extremely Negative Extremely Positive Negative Neutral
## 13.50906 16.06718 24.37549 18.53409
## Positive
## 27.51418
Right now, there are 5 different possible labels for our data: positive, negative, neutral, extremely positive and extremely negative. First of all, we want to merge the extreme labels into their base labels so that, for example, extremely positive and positive count as a single label and not two.
tweetBack = tweet
tweet$Sentiment = ifelse(str_detect(tweet$Sentiment,"Positive"),"Positive",
                         ifelse(tweet$Sentiment=="Neutral","Neutral","Negative"))
However, we now face a different problem: the data is quite imbalanced. Luckily, we have a relatively big dataset, which means we can undersample the majority classes in order to balance the data, hopefully improving the model's results while also reducing computational costs.
# Reconstruct the per-sentiment subsets used below
positiveTweet = tweet[tweet$Sentiment=="Positive",]
negativeTweet = tweet[tweet$Sentiment=="Negative",]
neutralTweet = tweet[tweet$Sentiment=="Neutral",]
set.seed(1122)
positiveTweetUnder = positiveTweet[sample.int(nrow(positiveTweet),nrow(neutralTweet)),]
negativeTweetUnder = negativeTweet[sample.int(nrow(negativeTweet),nrow(neutralTweet)),]
tweetUS = rbind(positiveTweetUnder,negativeTweetUnder,neutralTweet)
ggplot(tweetUS,aes(x = Sentiment,fill = Sentiment))+
geom_bar()+
theme_bw()+
theme(legend.position = "none")
We can see a huge difference compared to the previous barplot. Now every level of the sentiment variable has the same number of records, meaning our data is finally balanced.
Before creating our bag-of-words model, we want to look at some possible features in our data that might be useful for predicting the sentiment of a tweet regarding COVID-19.
First, it would be interesting to look at the distribution of tweet lengths aggregated by sentiment, since negative tweets might, for example, be longer than positive or neutral ones.
# Tweet length measured as the number of whitespace-separated words
tweetUS$TweetLength = unlist(lapply(str_split(tweetUS$OriginalTweet," "),length))
ggplot(tweetUS,aes(x = TweetLength,fill = Sentiment))+
geom_histogram(col = "black",bins = 20)+
theme_bw()
The histogram of tweet length by sentiment shows a very clear pattern: shorter tweets are usually neutral, while longer tweets usually carry either a positive or a negative message. For tweets whose length is close to the mean, the three sentiments appear about equally likely.
Next up, we want to see whether certain months or days tend to be more negative, positive or neutral. Perhaps a certain day shows a higher count of negative tweets due to the effects of the virus on that specific day.
As a general rule we do not see a clear trend in the tweets' sentiment over time. However, it is worth mentioning that on the 28th of March the great majority of tweets regarding COVID-19 were positive, by a big margin.
Even though this might seem interesting, the date on which a tweet was written does not appear to be influential when classifying tweets by sentiment.
The next step is to tokenize the original tweets in order to create a document-feature matrix, which will be used to feed the final predictive model.
However, before creating the final matrix, we need to perform some preprocessing to get the tweets into a more usable format.
Specifically, we need to remove certain types of characters, like punctuation, numbers, symbols and URLs, since none of these have actual predictive value. Then we turn the words into lowercase; remove stopwords, a group of very common words that add nothing to the model; and, finally, perform stemming, a technique where words sharing the same root are reduced to a common stem (for example, run and running both become run).
tw_tokens = tokens(tweetUS$OriginalTweet,what = "word",remove_numbers = T,
                   remove_punct = T,remove_symbols = T,split_hyphens = T,
                   remove_url = T)
tw_tokens = tokens_tolower(tw_tokens)
tw_tokens = tokens_select(tw_tokens,stopwords(),selection = "remove")
tw_tokens = tokens_wordstem(tw_tokens,language = "english")
tw_tokens_dfm = dfm(tw_tokens)
tw_tokens_mat = as.matrix(tw_tokens_dfm)
dim(tw_tokens_mat)
## [1] 24996 37250
set.seed(1134)
indexes = createDataPartition(tweetUS$Sentiment,p = 0.7,list = F)
train_tokens = tw_tokens_mat[indexes,]
test_tokens = tw_tokens_mat[-indexes,]
rm(tw_tokens_mat)
Given that we have a very big training matrix, with 17499 documents and 37250 possible features, building a model in a traditional way might be complicated since the processing power required is too high. For that reason, we will create a neural network using keras and tensorflow in order to use the GPU, rather than the CPU, as the processing unit.
# Encode sentiment as integers: Negative = 0, Neutral = 1, Positive = 2
sentiment = ifelse(tweetUS$Sentiment=="Positive",2,ifelse(tweetUS$Sentiment=="Negative",0,1))
# One-hot encode the labels for the network's softmax output
train_y = to_categorical(sentiment[indexes])
test_y = to_categorical(sentiment[-indexes])
memory.limit(size = 76000)
## [1] 76000
modelNN = load_model_tf("modelNN")
modelNN %>% evaluate(test_tokens,test_y)
## loss accuracy
## 2.2215056 0.7028145
For this case a neural network with 4 hidden layers of 50 neurons each was trained. Evaluating the model on the test set, data it had never seen before, yields an accuracy of 70.28%, meaning the model correctly classifies approximately 70% of the tweets regarding COVID-19.
This first model is decent at predicting the sentiment of COVID tweets. However, a good way to try to improve it is tf-idf weighting, which accounts both for how often a word appears within a document (term frequency) and for how informative that term is across the whole collection of documents (inverse document frequency).
train_tokensN = apply(train_tokens,1,term.frecuency)
train_tokens_idf = apply(train_tokens,2,inverse.doc.freq)
rm(train_tokens)
train_tokens_tfidf = apply(train_tokensN,2,tf.idf,idf = train_tokens_idf)
rm(train_tokens_idf,train_tokensN)
train_tokens_tfidf = as.matrix(t(train_tokens_tfidf))
dim(train_tokens_tfidf)
## [1] 17499 37250
test_tokensN = apply(test_tokens,1,term.frecuency)
test_tokens_idf = apply(test_tokens,2,inverse.doc.freq)
rm(test_tokens)
test_tokens_tfidf = apply(test_tokensN,2,tf.idf,idf = test_tokens_idf)
rm(test_tokens_idf,test_tokensN)
test_tokens_tfidf = as.matrix(t(test_tokens_tfidf))
dim(test_tokens_tfidf)
## [1] 7497 37250
# Documents left empty after preprocessing produce NaN values (division by
# zero in the term-frequency step), so we replace them with 0
train_tokens_tfidf[is.na(train_tokens_tfidf)] = 0
test_tokens_tfidf[is.na(test_tokens_tfidf)] = 0
Now that we have created the tf-idf matrix, which weights term frequency both within each document and across the collection, we want to train a new model similar to the previous one but fed with this new data. We expect better results thanks to the added value of the tf-idf weighting.
rm(train_tokens_tfidf)
model_tfidf = load_model_tf("model_tfidf")
model_tfidf %>% evaluate(test_tokens_tfidf,test_y)
## loss accuracy
## 3.1008654 0.6590636
rm(test_tokens_tfidf)
After building these two models, we want to compare their performance in order to decide which of them is more suitable for predicting the sentiment of COVID tweets.
This comparison shows that using tf-idf does not always represent an improvement when modeling text. In this specific case, both models performed really well on the training data, but the bag-of-words model outperformed the tf-idf model on the test dataset, with almost 5 percentage points more accuracy.
The great performance obtained on the training dataset might be a sign of overfitting, meaning that a less flexible model could return better values on the test dataset.
It would be recommended to try different models and techniques to obtain better results.