Packages used for this project
Here are the different packages used for our work.
library(tidyverse)
library(ggplot2)
library(textstem)
library(stringr)
library(stopwords)
library(tidytext)
library(vtable)
library(kableExtra)
library(wordcloud)
library(stringi)
library(quanteda)
library(caret)
library(tibble)
library(tree)
library(quanteda.textmodels)
library(tm)
library(SnowballC)
library(randomForest)
In this project we analyse tweets about the coronavirus. Each tweet is labelled with a sentiment. In this paper, we perform an exploratory analysis and build two classification models for sentiment analysis (Naive Bayes and Random Forest).
First, we import the dataset of tweets. The dataset is freely available at: https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification
file <- read.csv("E:/HSE Social Info/Third Year/Text Analysis/HM/archive/Corona_NLP_train.csv", header = T)
head(file)
## UserName ScreenName Location TweetAt
## 1 3799 48751 London 16-03-2020
## 2 3800 48752 UK 16-03-2020
## 3 3801 48753 Vagabonds 16-03-2020
## 4 3802 48754 16-03-2020
## 5 3803 48755 16-03-2020
## 6 3804 48756 ÜT: 36.319708,-82.363649 16-03-2020
## OriginalTweet
## 1 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8
## 2 advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order
## 3 Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P
## 4 My food stock is not the only one which is empty...\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \nStay calm, stay safe.\n\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j
## 5 Me, ready to go at supermarket during the #COVID19 outbreak.\n\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\n\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n
## 6 As news of the regionÂ’s first confirmed COVID-19 case came out of Sullivan County last week, people flocked to area stores to purchase cleaning supplies, hand sanitizer, food, toilet paper and other goods, @Tim_Dodson reports https://t.co/cfXch7a2lU
## Sentiment
## 1 Neutral
## 2 Positive
## 3 Positive
## 4 Positive
## 5 Extremely Negative
## 6 Positive
count(file)
## n
## 1 41157
The data frame has six columns: UserName, ScreenName, Location, TweetAt, OriginalTweet, and Sentiment.
The Sentiment column, our response variable, has five categories: Positive, Extremely Positive, Neutral, Negative and Extremely Negative. For this work, and to simplify the task, we removed the Neutral tweets and collapsed Positive and Extremely Positive into a single Positive class, and Negative and Extremely Negative into a single Negative class. In addition, since some tweets use a different character encoding, which complicates the cleaning procedure, we converted all text to UTF-8.
n_file <- file %>% filter(!Sentiment == "Neutral") %>% mutate(Sentiment1 = case_when(
Sentiment == "Positive" | Sentiment == "Extremely Positive" ~ "Positive",
Sentiment == "Negative" | Sentiment == "Extremely Negative" ~ "Negative"))
n_file <- n_file %>% select(-Sentiment)
n_file <- n_file %>% mutate(OriginalTweet = stri_replace_all_regex(OriginalTweet, pattern=c("\u0091", "\u0092", "\u0093", "\u0094", "\u0095", "\u0096", "\u0097", "\u0099", "\u0080", "\u0084", "\u009e", "\u009a"),replacement=c("'", "'", "'", "'", "", "", "", "", "", "", "", ""),vectorize=FALSE))
n_file$OriginalTweet <- iconv(n_file$OriginalTweet, "UTF-8", "UTF-8", sub='')
Before proceeding with the exploratory analysis, we need to tokenise the tweets. We chose ‘tweets’ as the tokenisation mode, which keeps URLs intact instead of splitting them into pieces. Next, we removed stop words, URLs, usernames, tokens starting with a number, tokens containing a dollar sign, and all special characters except # (to keep the hashtags). Finally, we lemmatised the words obtained from tokenisation and filtering.
tok_ln <- n_file %>% unnest_tokens(words, OriginalTweet, token = "tweets", to_lower = T)
tok_ln1 <- tok_ln %>% filter(!words %in% stopwords("en"), !str_detect(words, "https:."), !str_detect(words, "@."), !str_detect(words, "^[0-9]+"),!str_detect(words, "\\$"), !str_detect(words, "(?!#)[^0-9a-zA-Z]+"))
tok_ln1 <- tok_ln1 %>% select(Sentiment1, words)
#head(tok_ln1)
words_lema <- tok_ln1 %>% mutate(lem = lemmatize_words(words))
head(words_lema)
## Sentiment1 words lem
## 1 Positive advice advice
## 2 Positive talk talk
## 3 Positive neighbours neighbour
## 4 Positive family family
## 5 Positive exchange exchange
## 6 Positive phone phone
count(words_lema)
## n
## 1 624598
We ended up with a corpus of 624,598 lemmatised words, which is indeed a large dataset.
In this section, we will describe some basic statistics of the corpus, such as the distribution of the most frequent words, the size of the vocabulary and the distribution of classes. In addition, we will examine the use of words in the two sentiment groups, i.e. which words occur most often in positive and negative sentiment tweets. We will use the log-likelihood ratio to do this.
Let us start with the size of the vocabulary and the distribution of classes. We saw above that the corpus contains 624,598 lemmatised words. Now we want to know how many unique lemmatised words there are in the corpus.
words_lema %>% select(lem) %>% unique() %>% count()
## n
## 1 39225
There are 39225 unique words in our corpus.
The class distribution tells us how many words in our corpus come from negative tweets and how many from positive tweets.
#words_lema %>% filter(Sentiment1 == "Positive") %>% count()
#words_lema %>% filter(Sentiment1 == "Negative") %>% count()
sumtable(words_lema, out = 'kable') %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
| Variable | N | Percent |
|---|---|---|
| Sentiment1 | 624598 | |
| … Negative | 285286 | 45.7% |
| … Positive | 339312 | 54.3% |
Of our corpus of 624,598 words, 45.7% come from negative tweets and 54.3% from positive tweets.
Next, we identify the most frequent words in the corpus, using a frequency table, a plot of the word-frequency distribution, and a word cloud.
c_words <- words_lema %>% count(lem, sort = T)
head(c_words)
## lem n
## 1 #coronavirus 12462
## 2 price 7587
## 3 #covid19 7168
## 4 store 6540
## 5 food 6271
## 6 covid19 6237
words100 <- c_words %>% top_n(50)
## Selecting by n
ggplot(words100, aes(reorder(lem, -n), n)) +
geom_point(stat = "identity") +
geom_line(group = 1) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.7)) +
xlab("Words") +
ylab("Frequency") +
ggtitle("Distribution of the most frequent words")
words_lema %>%
dplyr::count(lem) %>%
with(wordcloud(lem, n, max.words = 80))
The hashtag #coronavirus is the most frequent word in our corpus, followed by the word “price”. To better understand the meaning of the different words, a bigram analysis would be preferable, but since our units of analysis are short tweets, bigrams are hard to apply here. In addition, the word-frequency distribution is heavily skewed (roughly logarithmic): a few words are used very often, while many others appear only rarely.
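To illustrate this skew, here is a minimal sketch (not part of the original analysis) that reuses the frequency table c_words computed above and plots word frequency against rank on logarithmic scales; a roughly straight line indicates a Zipf-like distribution.
rank_freq <- c_words %>% mutate(rank = row_number()) # c_words is already sorted by frequency
ggplot(rank_freq, aes(rank, n)) +
geom_line() +
scale_x_log10() +
scale_y_log10() +
xlab("Word rank (log scale)") +
ylab("Frequency (log scale)") +
ggtitle("Rank-frequency distribution of lemmas")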
We need a contrastive analysis to determine which words occur disproportionately often in positive and in negative tweets; we use the log-likelihood ratio for this. First, we add 1 to the positive and negative counts to avoid infinite values, since some words do not appear in one of the two classes. Second, we keep only words whose (smoothed) count exceeds ten in at least one of the two classes. Finally, we compute the Dunning G² log-likelihood statistic and the log-ratio, and show the result as a plot of the log-ratio.
sentiment <- words_lema %>% select(- words) %>% count(lem, Sentiment1) %>% spread(Sentiment1, n, fill = 0) %>% mutate(Positive = Positive + 1, Negative = Negative + 1) %>% filter(Positive > 10 | Negative > 10)
head(sentiment)
## lem Negative Positive
## 1 #advertising 4 16
## 2 #africa 12 6
## 3 #agriculture 7 13
## 4 #ai 8 13
## 5 #aldi 5 11
## 6 #alert 4 30
# Dunning's G^2 log-likelihood statistic for the association between a word and a class.
# a, b: word counts in the two classes; E1, E2: counts expected under independence.
g2 = function(a, b) {
  c = sum(a)                      # total count in class 1
  d = sum(b)                      # total count in class 2
  E1 = c * ((a + b) / (c + d))    # expected count in class 1
  E2 = d * ((a + b) / (c + d))    # expected count in class 2
  # the small constant 1e-7 guards against taking the log of zero
  return(2 * ((a * log(a / E1 + 1e-7)) + (b * log(b / E2 + 1e-7))))
}
sentiment.g2 <- sentiment %>%
mutate(g2=g2(Positive, Negative)) %>%
arrange(desc(g2)) %>%
mutate(g2 = round(g2, 2))
head (sentiment.g2)
## lem Negative Positive g2
## 1 panic 2005 427 1400.47
## 2 hand 403 2136 1019.02
## 3 crisis 1314 278 921.98
## 4 help 433 2032 869.56
## 5 thank 160 1208 739.34
## 6 sanitizer 425 1773 673.09
# log2 ratio of relative frequencies: positive values indicate words relatively more
# frequent in the first class (Positive), negative values in the second (Negative).
logratio <- function(a, b) {
  return(log2((a / sum(a)) / (b / sum(b))))
}
sentiment.lr <- sentiment.g2 %>%
mutate(logodds = logratio(Positive, Negative))
head(sentiment.lr %>% arrange(desc(abs(logodds))))
## lem Negative Positive g2 logodds
## 1 #contest 1 37 37.53 4.961693
## 2 beautiful 1 33 32.87 4.796634
## 3 #giveawayalert 1 30 29.39 4.659131
## 4 lawsuit 21 1 25.96 -4.640077
## 5 wtf 62 3 76.41 -4.616994
## 6 ruin 20 1 24.49 -4.569688
sentiment.lr %>%
filter(Positive > 0 & Negative > 0) %>%
group_by(logodds < 0) %>%
top_n(15, abs(logodds)) %>%
ungroup() %>%
mutate(lem = reorder(lem, logodds)) %>%
ggplot(aes(lem, logodds, fill = logodds > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("log odds ratio (Negative/Positive)") +
scale_fill_discrete(name = "", labels = c("Negative", "Positive"))
As a result, we can see that “#contest” and “beautiful” are the words most strongly associated with positive tweets, and that the association is statistically significant; “lawsuit” and “wtf” are the words most strongly associated with negative tweets, again with a statistically significant difference.
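As a quick worked example of how to read the log-ratio (a suggested illustration using the smoothed counts above): a value of about 4.96 for “#contest” means that, relative to class size, the word is roughly 2^4.96 ≈ 31 times more frequent in positive than in negative tweets.
sentiment.lr %>% filter(lem == "#contest") %>% mutate(freq_ratio = 2^logodds) # freq_ratio ~ 31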
In this section, we prepare the data for the classification models. The main objective is to determine whether a tweet is positive or negative from the words it contains, so the models will classify new tweets based on their word composition. To do this, we first clean the data in the same way as before, except that we also add an ID number to each tweet so that, after tokenisation, we do not lose track of which tweet each word comes from.
data_clean <- n_file %>% mutate(doc_id = row_number()) %>% unnest_tokens(words, OriginalTweet, token = "tweets", to_lower = T)%>% filter(!words %in% stopwords("en"), !str_detect(words, "https:."), !str_detect(words, "@."), !str_detect(words, "^[0-9]+"),!str_detect(words, "\\$"), !str_detect(words, "(?!#)[^0-9a-zA-Z]+"))
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
data_lama <- data_clean %>% mutate(lem = lemmatize_words(words)) %>% select(Sentiment1, doc_id, lem)
head(data_lama)
## Sentiment1 doc_id lem
## 1 Positive 1 advice
## 2 Positive 1 talk
## 3 Positive 1 neighbour
## 4 Positive 1 family
## 5 Positive 1 exchange
## 6 Positive 1 phone
#data_lama %>% count(lem) %>% arrange(n)
In addition, we count how often each word occurs within each tweet. In the following, we will often refer to tweets as documents.
lem_count <- data_lama %>% count(lem, doc_id, sort = T)
head(lem_count)
## lem doc_id n
## 1 gt 6264 10
## 2 online 27635 9
## 3 thank 30645 9
## 4 expert 26060 7
## 5 gt 3905 7
## 6 industry 3251 7
Since we want to run a classification model on our corpus, we need to transform it into a matrix where the rows represent the documents (tweets), and the columns the different words.
dfm <- lem_count %>%
cast_dfm(doc_id, lem, n)
head(dfm)
## Document-feature matrix of: 6 documents, 39,225 features (99.95% sparse) and 0 docvars.
## features
## docs gt online thank expert industry senator wena amp bill can
## 6264 10 0 0 0 0 0 0 0 0 0
## 27635 0 9 0 0 0 0 0 0 0 0
## 30645 0 0 9 0 0 0 0 0 0 0
## 26060 0 0 0 7 0 0 0 0 0 0
## 3905 7 0 0 0 0 0 0 0 0 0
## 3251 0 0 0 0 7 0 0 0 0 0
## [ reached max_nfeat ... 39,215 more features ]
Our matrix consists of 33,441 rows (documents) and 39,225 features (words). In machine learning, it is generally not recommended to fit a model with far more features than observations, and a model with 39,225 features would also require more memory than we have available. Therefore, we trim the features (terms) of the matrix: we keep only terms that appear at least 5 times in the whole corpus (term frequency) and that occur in at least 0.3% of the documents (document frequency).
dfm_final <- dfm %>% # build the matrix with raw frequencies as values (not tf-idf)
dfm_trim(min_termfreq = 5, termfreq_type = "count") %>% # keep terms appearing at least 5 times in the corpus
dfm_trim(min_docfreq = 0.003, docfreq_type = "prop") # keep terms appearing in at least 0.3% of documents
dfm_final
## Document-feature matrix of: 33,441 documents, 942 features (98.67% sparse) and 0 docvars.
## features
## docs gt online thank expert industry amp bill can case every
## 6264 10 0 0 0 0 0 0 0 0 0
## 27635 0 9 0 0 0 0 0 0 0 0
## 30645 0 0 9 0 0 0 0 0 0 0
## 26060 0 0 0 7 0 0 0 0 0 0
## 3905 7 0 0 0 0 0 0 0 0 0
## 3251 0 0 0 0 7 0 0 0 0 0
## [ reached max_ndoc ... 33,435 more documents, reached max_nfeat ... 932 more features ]
We end up with a matrix of 942 features (words), a reduction of roughly 98% compared to the previous matrix.
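As a quick sanity check (a suggested addition, assuming quanteda's nfeat() accessor), the size of the reduction can be computed directly from the two matrices:
round(1 - nfeat(dfm_final) / nfeat(dfm), 3) # ~0.976, i.e. roughly a 98% reduction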
Next, we need to divide our corpus into training data, to run the model, and test data, to test the quality of the model with unknown documents. 90% of our corpus will be sent to the training set, and the remaining 10% to the test set.
sentiment_labels <- n_file$Sentiment1[as.integer(rownames(dfm))]
set.seed(1991)
## we will take 10% of the sample for testing
split <- createDataPartition(y=sentiment_labels, p = 0.9, list = FALSE)
train.data <- dfm %>% dfm_subset(rownames(dfm) %in% rownames(dfm)[split])
test.data <- dfm %>% dfm_subset(!rownames(dfm) %in% rownames(dfm)[split])
response <- as.factor(sentiment_labels)
trainY <- response[split]
testY <- response[-split]
test.matched <- test.data %>%
dfm_match(features = featnames(train.data))
In this section, we will run the Naive Bayes algorithm with our training data, predict a new classification (positive or negative) with our test data, and then analyse and discuss the validity of our model.
Let us fit the Naive Bayes model. We chose the multinomial distribution because we work with word counts rather than simple word presence/absence.
model.nb <- textmodel_nb(train.data, trainY, distribution = "multinomial")
summary(model.nb)
##
## Call:
## textmodel_nb.dfm(x = train.data, y = trainY, distribution = "multinomial")
##
## Class Priors:
## (showing first 2 elements)
## Negative Positive
## 0.5 0.5
##
## Estimated Feature Scores:
## gt online thank expert industry senator wena
## Negative 0.0001485 0.002798 0.0004692 0.0003477 0.0006616 7.764e-05 2.7e-05
## Positive 0.0003857 0.004524 0.0031288 0.0003335 0.0007829 3.190e-05 2.9e-06
## amp bill can case dad every grateful
## Negative 0.005907 0.0003342 0.00401 0.0008877 0.0001553 0.0009552 4.051e-05
## Positive 0.006722 0.0002813 0.00535 0.0007046 0.0001102 0.0010932 2.523e-04
## paper pray price railway staff want will
## Negative 0.001681 4.388e-05 0.013022 4.051e-05 0.001009 0.001097 0.005647
## Positive 0.001577 1.740e-04 0.008664 8.699e-06 0.001786 0.001461 0.005640
## work check close combo crisis death dying
## Negative 0.003261 0.000611 0.001212 2.363e-05 0.0039763 0.0006211 1.654e-04
## Positive 0.004016 0.001070 0.001606 2.030e-05 0.0007481 0.0001479 8.699e-05
## essential fight
## Negative 0.001343 0.0011409
## Positive 0.001667 0.0007017
Then we use the model fitted above to predict the class of the documents in the test data.
predictedY <- predict(model.nb, newdata = test.matched)
head(predictedY)
## 2714 23802 19031 2180 29133 3528
## Negative Positive Positive Positive Positive Positive
## Levels: Negative Positive
We can see, for example, that our model predicted document 2714 as negative, while the other documents shown (23802, 19031, 2180, 29133 and 3528) were predicted as positive. We can run the prediction again and, instead of the final classification (positive/negative), obtain the probability that each document is positive or negative.
#probabilities
predicted.prob <- round(predict(model.nb, newdata = test.matched, type = "prob"), 2)
head(predicted.prob)
## Negative Positive
## 2714 1.00 0.00
## 23802 0.00 1.00
## 19031 0.49 0.51
## 2180 0.49 0.51
## 29133 0.46 0.54
## 3528 0.02 0.98
We can see, for example, that document 2714 is assigned a probability of 1.00 of being negative, while document 3528 has a probability of 0.98 of being positive.
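To make the link between these probabilities and the feature scores explicit, here is a minimal sketch (not part of the original analysis) that reconstructs the multinomial Naive Bayes posterior for the first test document from the fitted parameters. It assumes model.nb$param stores the per-class word probabilities with rows in the order Negative, Positive, as shown in the model summary, and takes the 0.5/0.5 class priors from that summary.
priors <- c(Negative = 0.5, Positive = 0.5) # class priors, as reported by summary(model.nb)
param <- as.matrix(model.nb$param) # per-class word probabilities (assumed rows: Negative, Positive)
counts <- as.numeric(as.matrix(test.matched[1, ])) # word counts of the first test document (doc 2714)
loglik <- log(priors) + as.vector(log(param) %*% counts) # log P(c) + sum_w n_w * log P(w | c)
posterior <- exp(loglik - max(loglik)) # exponentiate with a numerical-stability shift
round(posterior / sum(posterior), 2) # should match the first row of predicted.prob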
Let’s now evaluate how accurate our model is overall, and how precise it is at classifying documents as positive or negative.
cm.po <- confusionMatrix(data = predictedY, reference = testY, positive="Positive", mode = "prec_recall")
cm.po
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 1228 323
## Positive 311 1481
##
## Accuracy : 0.8103
## 95% CI : (0.7966, 0.8235)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6185
##
## Mcnemar's Test P-Value : 0.6622
##
## Precision : 0.8265
## Recall : 0.8210
## F1 : 0.8237
## Prevalence : 0.5396
## Detection Rate : 0.4430
## Detection Prevalence : 0.5360
## Balanced Accuracy : 0.8094
##
## 'Positive' Class : Positive
##
cm.neg <- confusionMatrix(data = predictedY, reference = testY, positive="Negative", mode = "prec_recall")
cm.neg
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 1228 323
## Positive 311 1481
##
## Accuracy : 0.8103
## 95% CI : (0.7966, 0.8235)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6185
##
## Mcnemar's Test P-Value : 0.6622
##
## Precision : 0.7917
## Recall : 0.7979
## F1 : 0.7948
## Prevalence : 0.4604
## Detection Rate : 0.3673
## Detection Prevalence : 0.4640
## Balanced Accuracy : 0.8094
##
## 'Positive' Class : Negative
##
The accuracy of our model is about 81%, which is a good result for a machine-learning model. The precision for the negative class is about 79%, while the precision for the positive class is about 83%. These results are quite satisfactory, and we can expect the model to classify roughly 81% of new tweets correctly.
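As a quick check (a suggested addition), the class precisions can be recomputed directly from the confusion-matrix counts shown above:
c(precision_positive = 1481 / (1481 + 311), # correct positives / all predicted positives, ~0.83
  precision_negative = 1228 / (1228 + 323)) # correct negatives / all predicted negatives, ~0.79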
In this section, we will analyse the most important predictors for the different classes, i.e. the predictors that have the greatest influence on increasing the probability that a document is negative or positive.
vars.nb <- t(model.nb$param) %>%
as.data.frame %>%
rownames_to_column("word") %>%
mutate(lo = log(Positive/Negative))
head(vars.nb)
## word Negative Positive lo
## 1 gt 1.485192e-04 3.856671e-04 0.9542604
## 2 online 2.798237e-03 4.523614e-03 0.4803218
## 3 thank 4.691856e-04 3.128833e-03 1.8974169
## 4 expert 3.476699e-04 3.334716e-04 -0.0416960
## 5 industry 6.615855e-04 7.829332e-04 0.1684082
## 6 senator 7.763503e-05 3.189728e-05 -0.8894981
Most important predictors for class “Positive”
head(vars.nb %>% arrange(desc(lo)))
## word Negative Positive lo
## 1 #contest 3.375436e-06 9.859159e-05 3.374461
## 2 #giveawayalert 3.375436e-06 8.409283e-05 3.215397
## 3 beautiful 3.375436e-06 8.119307e-05 3.180305
## 4 #puzzle 3.375436e-06 7.829332e-05 3.143938
## 5 success 3.375436e-06 7.249382e-05 3.066977
## 6 peace 3.375436e-06 6.669431e-05 2.983595
Most important predictors for class “Negative”
head(vars.nb %>% arrange(lo))
## word Negative Positive lo
## 1 ruin 6.413329e-05 2.899753e-06 -3.096338
## 2 fraud 3.139156e-04 1.449876e-05 -3.075061
## 3 crude 4.961891e-04 2.319802e-05 -3.062890
## 4 lawsuit 6.075785e-05 2.899753e-06 -3.042271
## 5 wtf 1.822736e-04 8.699258e-06 -3.042271
## 6 tragedy 5.400698e-05 2.899753e-06 -2.924488
We can see that “#contest”, “#giveawayalert” and “beautiful” are the words that most increase the probability of a tweet being classified as positive, while “ruin”, “fraud” and “crude” most increase the probability of a tweet being classified as negative. These words match their usage: #contest is mostly positive because it appears under marketing tweets urging people to enter a contest, whereas “ruin” and “fraud” are mostly used in negative contexts, so tweets containing them are more likely to be classified as negative.
Do these match what we found in the EDA step?
a <- vars.nb %>% arrange(desc(lo)) %>% top_n(15, lo)
b <- vars.nb %>% arrange(lo) %>% top_n(-15, lo)
c <- vars.nb %>% top_n(30, abs(lo)) %>% arrange(desc(abs(lo)))
sentiment.lr %>%
filter(Positive > 0 & Negative > 0) %>%
top_n(30, abs(logodds)) %>% ungroup() %>% filter(lem %in% c(c$word))
## lem Negative Positive g2 logodds
## 1 crude 161 10 188.06 -4.256749
## 2 fraud 105 6 125.01 -4.377043
## 3 arrest 110 8 123.46 -4.029120
## 4 blame 75 5 86.11 -4.154651
## 5 wtf 62 3 76.41 -4.616994
## 6 #contest 1 37 37.53 4.961693
## 7 fraudster 30 2 34.44 -4.154651
## 8 beautiful 1 33 32.87 4.796634
## 9 gratitude 2 37 32.56 3.961693
## 10 #giveawayalert 1 30 29.39 4.659131
## 11 #puzzle 1 28 27.08 4.559595
## 12 lawsuit 21 1 25.96 -4.640077
## 13 success 1 27 25.93 4.507128
## 14 ruin 20 1 24.49 -4.569688
## 15 brent 18 1 21.56 -4.417685
## 16 peace 1 23 21.36 4.275802
## 17 cheat 17 1 20.11 -4.335223
## 18 terroristic 17 1 20.11 -4.335223
## 19 assault 16 1 18.66 -4.247760
## 20 tragedy 16 1 18.66 -4.247760
## 21 bankrupt 13 1 14.37 -3.948200
We can see that 21 of the 30 most influential Naive Bayes predictors also appear among the 30 words with the largest log-odds ratios from the EDA step, so the two rankings largely agree. It is also worth noting that #contest was the word most strongly associated with positive tweets in the EDA step as well.
Next, let’s analyse the wrongly predicted tweets and learn from their characteristics. The wrongly predicted tweets are the tweets in the test data that our model predicts as positive when they were originally negative and vice versa.
predictions <- data.frame(doc_id=as.integer(rownames(predicted.prob)))
predictions$pred_y <- predictedY
predictions$positive_prob_pred <- predicted.prob[,2]
predictions$negative_prob_pred <- predicted.prob[,1]
predictions$true_label <- n_file[rownames(predicted.prob),]$Sentiment1
predictions$text <- n_file[rownames(predicted.prob),]$OriginalTweet
head(predictions)
## doc_id pred_y positive_prob_pred negative_prob_pred true_label
## 1 2714 Negative 0.00 1.00 Negative
## 2 23802 Positive 1.00 0.00 Positive
## 3 19031 Positive 0.51 0.49 Positive
## 4 2180 Positive 0.51 0.49 Negative
## 5 29133 Positive 0.54 0.46 Negative
## 6 3528 Positive 0.98 0.02 Positive
## text
## 1 Indonesia - 227 cases, 19 deaths\nPhilippines - 187 cases, 14 deaths\nMalaysia - 673 cases, 2 deaths\nThailand - 177 cases, 1 death\nSingapore - 226 cases , 0 deaths\n\nIndonesia now has the most #COVID19 case fatalities in Southeast Asia #coronavirus #CoronaOutbreak #SARSCoV2
## 2 i want to go shopping i want to walk w/o the thought i might get the COVID-19 virus i want a real hug i want to go eat at a dine in restaurant ohhh to be free w/o the thought of that darn virus i want to go shopping for shoes i don't need just want to go shopping\nno more online!
## 3 @Consumer_Cell Is there an acceptable level of cases that push your company to change its stand on working from WORK! We are heading there FAST! SO what's your number? 75K cases? 100K cases? 150K cases? 200K cases? #coronavirus https://t.co/CR1Ssx1PGF
## 4 @Sideshow_JaneV3 Perhaps... I think the world may be altered in other ways also by COVID-19. More people may want or be forced to work at home. People may find lots of businesses have gone bust, so more online shopping? Travel may be harder. All this may
## 5 Sports behaving badly? A US Nielsen Sports survey found 45% of fans will stop following a sport if it responds poorly to the current crisis, 57% will try a new sport that impresses with their approach. Full interview with of Nielsen Sports CEO here https://t.co/gaQZX4Q0kZ
## 6 Question why is the government not concerned about closing or limiting the # of people in these type store at one time: \nApparel stores \nGrocery stores \nSporting stores \nHome improvement stores?\nLIKE their doing to RESTAURANTS? ???? #coronavirus #COVID19 #CoronavirusOutbreak
Wrongly predicted positive tweets
head(predictions %>% filter(pred_y == 'Positive' & true_label == 'Negative'))
## doc_id pred_y positive_prob_pred negative_prob_pred true_label
## 1 2180 Positive 0.51 0.49 Negative
## 2 29133 Positive 0.54 0.46 Negative
## 3 830 Positive 0.52 0.48 Negative
## 4 6275 Positive 0.83 0.17 Negative
## 5 3250 Positive 1.00 0.00 Negative
## 6 30419 Positive 0.51 0.49 Negative
## text
## 1 @Sideshow_JaneV3 Perhaps... I think the world may be altered in other ways also by COVID-19. More people may want or be forced to work at home. People may find lots of businesses have gone bust, so more online shopping? Travel may be harder. All this may
## 2 Sports behaving badly? A US Nielsen Sports survey found 45% of fans will stop following a sport if it responds poorly to the current crisis, 57% will try a new sport that impresses with their approach. Full interview with of Nielsen Sports CEO here https://t.co/gaQZX4Q0kZ
## 3 Connectivity is essential during times of crisis. \n\nThat's why @CWAunion & allies are asking broadband CEOs to lift data caps, waive fees, & do everything within their power to help people connect to the world from home & stop the spread of COVID-19.\n\nhttps://t.co/GWWf1LBByp
## 4 So while at the grocery store, my car is in the parking lot CLOSER than 6 feet from the cars on either side. This is not #SocialDistancing I hope my car doesnÂ’t get #coronavirus
## 5 Due to the #Coronavirus outbreak, we recommend regularly using anti-static cleansing wipes or anti-static foam cleanser on the touchscreen of your till…\n\nFoam Clene Spray (300ml): £8.50\nScreen Clene Wipes (20): £10\nScreen Clene Wipes (100): £35\n\nNote: All prices exclude VAT
## 6 Shocking it's been more busy than Christmas for Royal Mail and other couriers/deliveries. If it's not essential don't send it and don't buy for delivery. Putting workers in danger that don't have protection in place #StayHomeSaveLives and only do essential online shopping & post.
Analysing these tweets suggests that our model, even though it contradicts the original labels, is often closer to reality than those labels. For example, consider tweet 15335 (not shown in the head above): only because the word “kill” is repeated several times was the tweet originally labelled negative, even though in that context “kill” was not used to express a negative sentiment. The original labelling therefore seems to consider only the presence or absence of individual words, assigning each a value and summing these values to decide whether a tweet is positive or negative, whereas our model takes into account how each word is actually used across the whole corpus of positive and negative tweets. Is the situation the same for the wrongly predicted negative tweets?
Wrongly predicted negative tweets
head(predictions %>% filter(pred_y == 'Negative' & true_label == 'Positive'))
## doc_id pred_y positive_prob_pred negative_prob_pred true_label
## 1 8267 Negative 0.09 0.91 Positive
## 2 5488 Negative 0.07 0.93 Positive
## 3 6966 Negative 0.32 0.68 Positive
## 4 411 Negative 0.03 0.97 Positive
## 5 11560 Negative 0.04 0.96 Positive
## 6 31736 Negative 0.00 1.00 Positive
## text
## 1 You may be immune to #COVID19 but ask yourself, are my mom and dad immune? Are my grandparents immune? Is that old lady I pass at the grocery store immune? Try not being selfish for the first time in your life. Stop hording. Use your brain.\n\n#coronavirus #CoronaVirusUpdate
## 2 Don't miss your FREE 72-page land report with an analysis of land prices in 2019, land trends over the last 13 years and we reveal who is buying land. Plus a surge of cattle numbers in marts and #Coronavirus advice for you and your farm. Only inside this week's @Farmersjournal
## 3 due to COVID-19 iÂ’m going to be selling pictures to help pay my bills and buy groceries, please dm me to talk prices!! \nselling feet pics, selling nudes, selling videos, sugar daddy needed, serious inquires only
## 4 Clubs shut down\nBars shut down\nRestaurants shut down\nGym shut down \nGrocery stores are nightmare fuel\nSeriously if u work at a grocery store or retail in general u have my sympathy \n#coronavirus
## 5 anhydrous ammonia is off $100 from same week last year\nprices on inputs like anhydrous ammonia could continue the trend to the downside\n“China is a large importer of ammonia for industrial purposes\n\nhttps://t.co/j6rSiRleK5 https://t.co/1hlz58haLx
## 6 Consumer Prices Plummet as Economy Continues Shrinking: The effects of the COVID-19 economic shutdown continue to be felt, but for consumers, itÂ’s not all bad. Consumer prices fell by the most since 2015, dropping 0.4% in March. [CNBC] Prices had seen aÂ… https://t.co/7mfwu8J5VD https://t.co/CEjYoG7NPV
Let’s analyse tweet 14686 (not shown in the head above). The same conclusion drawn for the wrongly predicted positive tweets applies here: just because the tweet contains “very good!”, the original labelling considered it positive, whereas on closer inspection the “very good!” is purely ironic; the hashtags used and the subject indicate that the tweet expresses a negative sentiment.
Overall, we can assume that our model performs better than the original sentiment labelling. One reason could be the use of hashtags as predictors of sentiment: the original labelling probably did not take the hashtags in the tweets into account, and the overall sentiment of a tweet was probably the sum of the values assigned to its individual words, which ignores the context in which a word is used, whereas our machine-learning model takes that usage into account.
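To see how the hashtags behave as predictors in the Naive Bayes model, a small suggested check filters the predictor table built above for features starting with “#”:
vars.nb %>%
filter(str_detect(word, "^#")) %>% # keep only hashtag features
arrange(desc(abs(lo))) %>% # strongest predictors (in either direction) first
head()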
In this section, we run another machine-learning model, a random forest. We use the same corpus of lemmatised words as for the Naive Bayes model, but this time we transform the frequency table into a document-term matrix.
xfreq <- lem_count %>%
cast_dtm(doc_id, lem, n)
xfreq
## <<DocumentTermMatrix (documents: 33441, terms: 39225)>>
## Non-/sparse entries: 593581/1311129644
## Sparsity : 100%
## Maximal term length: 191
## Weighting : term frequency (tf)
We obtain a matrix of 33,441 documents and 39,225 terms (words). The matrix is extremely sparse, i.e. most entries are zero because most words do not occur in most tweets. We will therefore remove the sparsest terms from the matrix.
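For reference (a suggested check), the sparsity reported in the printout can be reproduced from its non-sparse and sparse entry counts:
round(1 - 593581 / (593581 + 1311129644), 5) # proportion of zero entries, ~0.99955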
sparse = removeSparseTerms(xfreq, 0.995) # keep only terms that appear in at least 0.5% of documents
sparse
## <<DocumentTermMatrix (documents: 33441, terms: 590)>>
## Non-/sparse entries: 373902/19356288
## Sparsity : 98%
## Maximal term length: 20
## Weighting : term frequency (tf)
After this trimming, our final matrix has 590 terms.
Before running the model, we need to transform the document-term matrix into a data frame and attach the original sentiment of each document (tweet).
tSparse = as.data.frame(as.matrix(sparse))
colnames(tSparse) = make.names(colnames(tSparse))
tSparse$sentiment = n_file[row.names(tSparse),]$Sentiment1
Splitting the data
Next, we randomly split our dataset between training and test data, as we did in the previous model. In this case, 80% of our data will go to the training set and 20% will be sent to the test set.
library(caTools)
set.seed(100)
split = sample.split(tSparse$sentiment, SplitRatio = 0.8)
trainSparse = subset(tSparse, split==TRUE)
testSparse = subset(tSparse, split==FALSE)
Here we run our model using the training dataset and test its accuracy using our test dataset.
set.seed(100)
trainSparse$sentiment = as.factor(trainSparse$sentiment)
testSparse$sentiment = as.factor(testSparse$sentiment)
RF_model = randomForest(sentiment ~ ., data=trainSparse, ntree = 50)
predictRF = predict(RF_model, newdata=testSparse)
confusionMatrix(data = predictRF, reference = testSparse$sentiment, positive="Positive", mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 2358 663
## Positive 721 2946
##
## Accuracy : 0.7931
## 95% CI : (0.7832, 0.8027)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5829
##
## Mcnemar's Test P-Value : 0.1255
##
## Precision : 0.8034
## Recall : 0.8163
## F1 : 0.8098
## Prevalence : 0.5396
## Detection Rate : 0.4405
## Detection Prevalence : 0.5483
## Balanced Accuracy : 0.7911
##
## 'Positive' Class : Positive
##
confusionMatrix(data = predictRF, reference = testSparse$sentiment, positive="Negative", mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 2358 663
## Positive 721 2946
##
## Accuracy : 0.7931
## 95% CI : (0.7832, 0.8027)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5829
##
## Mcnemar's Test P-Value : 0.1255
##
## Precision : 0.7805
## Recall : 0.7658
## F1 : 0.7731
## Prevalence : 0.4604
## Detection Rate : 0.3526
## Detection Prevalence : 0.4517
## Balanced Accuracy : 0.7911
##
## 'Positive' Class : Negative
##
Our model achieves about 79% accuracy, with about 78% precision for the negative class and about 80% precision for the positive class. This is fairly good for a machine-learning model, although, compared to the Naive Bayes model (about 81% accuracy), the random forest (with 50 trees) is slightly less accurate. Increasing the number of trees might improve the results, but for the purposes of this project we keep 50 trees.
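One way to judge whether more trees would help (a suggested check, not part of the original analysis) is to plot the out-of-bag error that randomForest stores as a function of the number of trees; a curve that has already flattened suggests that additional trees would change little.
plot(RF_model$err.rate[, "OOB"], type = "l",
xlab = "Number of trees", ylab = "OOB error rate",
main = "Out-of-bag error vs. number of trees")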
Let us analyse the most important predictors of the model. Random forests offer two common importance measures (mean decrease in accuracy and mean decrease in Gini impurity); here importance() reports the mean decrease in Gini impurity:
importance(RF_model) %>% data.frame() %>% top_n(30, MeanDecreaseGini) %>% arrange(desc(MeanDecreaseGini))
## MeanDecreaseGini
## crisis 299.53379
## panic 271.50977
## help 244.67703
## good 208.93874
## hand 186.83794
## thank 180.13983
## like 146.51677
## scam 121.01588
## safe 117.33522
## sanitizer 110.71955
## free 107.39151
## support 103.37800
## great 102.04255
## price 94.24502
## stop 91.10413
## bad 86.71358
## low 82.14261
## please 79.20764
## food 75.49845
## buy 73.09191
## die 71.19885
## fear 67.19809
## X.coronavirus 65.31852
## X.covid19 63.81065
## kill 60.44494
## people 58.85086
## go 56.50492
## covid19 56.33730
## share 55.45966
## oil 54.56170
head(varImpPlot(RF_model))
## MeanDecreaseGini
## online 35.576536
## thank 180.139834
## expert 7.289899
## industry 11.199352
## amp 47.975771
## bill 7.878483
As we can see, words such as “crisis”, “panic”, “help” and “good” have the greatest influence in the model. Let us compare these results with what we found for the previous model.
tp_imp_words <- importance(RF_model) %>% data.frame() %>% top_n(50, MeanDecreaseGini) %>% arrange(desc(MeanDecreaseGini))
c %>% filter(word %in% row.names(tp_imp_words))
## [1] word Negative Positive lo
## <0 rows> (or 0-length row.names)
We can see that none of the 30 most influential Naive Bayes words appears among the 50 most important features of the random forest. This can be explained partly by the different feature sets of the two models (the Naive Bayes training matrix was built from the untrimmed document-feature matrix, while the random forest uses only 590 terms) and partly by the fact that the hashtags are renamed when the matrix is converted to a data frame: make.names() turns #covid19 into X.covid19, for example, so hashtag features can never match between the two importance tables. We could harmonise the feature names to allow a fairer comparison, but for the purposes (and time constraints) of this project we leave it as is.
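To illustrate the renaming issue (a suggested check): make.names() replaces characters that are not valid in R column names, which is why the hashtag features of the random forest no longer match the feature names used in the Naive Bayes table.
make.names(c("#covid19", "#coronavirus")) # mangled to X.covid19 and X.coronavirus, as in the importance table above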
Let’s analyse the misclassified tweets. Misclassified tweets are tweets in the test data that our model predicts as positive when they were initially negative and vice versa.
df <- data.frame(predictRF)
predictions <- data.frame(doc_id=as.integer(rownames(df)))
predictions$pred_y <- df$predictRF
predictions$true_label <- n_file[row.names(df),]$Sentiment1
predictions$text <- n_file[row.names(df),]$OriginalTweet
head(predictions)
## doc_id pred_y true_label
## 1 33004 Negative Positive
## 2 2714 Negative Negative
## 3 6746 Positive Positive
## 4 7642 Negative Negative
## 5 23488 Negative Negative
## 6 16311 Negative Negative
## text
## 1 So proud of members and staff during COVID 19 Culinary staff can t open cafe but can serve meals to MD workers Piecemakers quilting club can t meet but can make masks Food Pantry staff can t socially interact but can provide food to drive thru s demand 700
## 2 Indonesia - 227 cases, 19 deaths\nPhilippines - 187 cases, 14 deaths\nMalaysia - 673 cases, 2 deaths\nThailand - 177 cases, 1 death\nSingapore - 226 cases , 0 deaths\n\nIndonesia now has the most #COVID19 case fatalities in Southeast Asia #coronavirus #CoronaOutbreak #SARSCoV2
## 3 Pray for the leaders. Pray for the healthcare workers. Pray for the researchers. Pray for the grocery store workers. Pray for those affected. Pray for each other. #Pray #coronavirus
## 4 Correct terms for the coronavirus #CoronaVirusUpdate #COROVID19 #CoronaCrisis the great bogg crisis, loo roll crisis, canÂ’t buy anything I need crisis, supermarket crisis, ventilator crisis, please add anything extra......
## 5 #Northerners vs #southerners....... fight fight fight fight fight.....\n\nFFS GROW UP! IÂ’m seeing as much footage of the chavs in northern town supermarket car parks as I am in southern supermarket car parks... \n\nGuess what, youÂ’re all pricks! #StayHomeSaveLives https://t.co/RoZ3a1HZIc
## 6 An anonymous donor gave us a certificate for T&T Pet Food & Supply, allowing us to buy this FULL cart of cat food! Due to COVID-19, there is high demand for our Pet Food & Supply Pantry. If you can donate, we need dry dog food, dry cat food & cat litter: https://t.co/2EZh4KMma5. https://t.co/ib2xhK0slf
Wrongly predicted positive tweets
head(predictions %>% filter(pred_y == 'Positive' & true_label == 'Negative'))
## doc_id pred_y true_label
## 1 6118 Positive Negative
## 2 12770 Positive Negative
## 3 15661 Positive Negative
## 4 26471 Positive Negative
## 5 23629 Positive Negative
## 6 8518 Positive Negative
## text
## 1 Day 2: Went to Food4Less to stock up on food. The only frozen pizzas left were California Pizza Kitchen thin crust 4 cheese. I had to make a choice. Do I get the pizza or starve?! I got the pizza. If COVID-19 doesnÂ’t get me, this pizza might finish the job.
## 2 @ArvindKejriwal @LtGovDelhi Delhi Goods Transport Organisation reiterated its demand made to Delhi LG & Delhi CM request to declare lockdown in Delhi. ItÂ’s necessary to combat community transmission of COVID-19 & request to arrange food/help for
## 3 Hi friends, is there a list of DMV food donation facilities that are still open for donation drop offs? Or know which areas are in dire need of food donations due to larger demand? I bought $100 of food donations that I need to drop off, thank you in advance! #Covid_19
## 4 @callistoggawine You should be livid. Not only has he exposed you to Covid-19, he has exposed his buddies who will now expose more folk who will expose nurses, doctors, careworkers, refuse collectors, supermarket staff, delivery drivers, farmers etc. He i
## 5 Disruption in demand markets due to say closure of food markets as part of efforts to contain spread of virus will increase levels of food going to waste This implies an increased need in means of preserving food among small scale and informal food traders to buffer
## 6 ItÂ’s sad, so sad, itÂ’s a sad sad situation,\nAnd itÂ’s getting more and more absurd.. \n#corona #coronavirus #covid_19 #toiletpaper #toiletroll #panicbuying #darkphotography #dslrguru #supermarket #dontpanic #photography #photographyislife #lovenotlooroll https://t.co/fTJ7mAOzyU
Let’s analyse tweet 14984 (not shown in the head above). Reading the tweet, we can clearly see that it contains many “negative” words that influence its overall sentiment, and, taking the context of the tweet into account, we can agree with the original label that this particular tweet is negative. It would also be interesting to have, as in the Naive Bayes model, the probability that a tweet is positive or negative, which would give deeper insight into the classification choices of the model; we did not compute these probabilities for the random forest here (a short sketch showing how they could be obtained follows the analysis of the wrongly predicted negative tweets below).
Wrongly predicted negative tweets
head(predictions %>% filter(pred_y == 'Negative' & true_label == 'Positive'))
## doc_id pred_y true_label
## 1 33004 Negative Positive
## 2 9136 Negative Positive
## 3 5649 Negative Positive
## 4 4798 Negative Positive
## 5 8267 Negative Positive
## 6 3773 Negative Positive
## text
## 1 So proud of members and staff during COVID 19 Culinary staff can t open cafe but can serve meals to MD workers Piecemakers quilting club can t meet but can make masks Food Pantry staff can t socially interact but can provide food to drive thru s demand 700
## 2 Medical staff - Frontline workers\nEmergency responders - Frontline workers\nPharmacy techs - Frontline workers \nRetail staff inc Truckers - ARE Frontline workers \n\nAll deserve respect, retail workers are just trying to keep your food stocked!\n#StopHoarding #coronavirus #Respect
## 3 If you have food at home please don't buy other food everyday.\nPeople like me don't find nothing at the supermarket.\nDon't be selfish we all buy the food, so why i should buy something that i don't like because there is only that.\n#coronavirus #covid19UK #tesco
## 4 says system is alive and well calls workers heroes If you see empty shelves that s a demand issue not a supply issue gt gt gt gt
## 5 You may be immune to #COVID19 but ask yourself, are my mom and dad immune? Are my grandparents immune? Is that old lady I pass at the grocery store immune? Try not being selfish for the first time in your life. Stop hording. Use your brain.\n\n#coronavirus #CoronaVirusUpdate
## 6 Amid the panic and worries worldwide regarding Covid-19. We urge you to stand in faith and trust in God. Do not be moved by what you see. Do not be moved by the lack of food in stores. Do not be moved by the news reportings. But be moved by the Word of God. \n\nMatthew 6:31-34 https://t.co/lk7pDGoj9I
By analysing these tweets, in particular tweet 4465 (not shown in the head above), we can say that, by taking the context of the tweets into account, our prediction is better than the original sentiment label. As noted for the first model, the original labelling seems to have considered only the value attached to each word, ignoring the context (the mix of words in the tweet); in this case the context of the tweet is negative even though it does not contain many words with a negative connotation.
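As mentioned above, class probabilities are in fact also available for the random forest through predict() with type = "prob"; here is a minimal sketch (not run in the original analysis) of how they could be obtained:
rf_prob <- predict(RF_model, newdata = testSparse, type = "prob") # per-class probabilities (vote shares)
head(round(rf_prob, 2))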
Overall, our random forest model performed well, with an accuracy of about 79%. For a better result, we could increase the number of trees in the model. Moreover, as mentioned above, the hashtags are renamed when the matrix is converted to a data frame, which prevents a clean comparative study between the two models. For this project, we recommend using the Naive Bayes model to predict the sentiment of a tweet from the words it contains.