Packages used for this project
Here are the different packages used for our work.
library(tidyverse)
library(ggplot2)
library(textstem)
library(stringr)
library(stopwords)
library(tidytext)
library(vtable)
library(kableExtra)
library(wordcloud)
library(stringi)
library(quanteda)
library(caret)
library(tibble)
library(tree)
library(quanteda.textmodels)
library(tm)
library(SnowballC)
library(randomForest)
In this project we analyse tweets about the coronavirus. Each tweet is labelled with a sentiment. In this paper, we perform an exploratory analysis and build two classification models for sentiment analysis (Naive Bayes and Random Forest).
First, we import the dataset of tweets. The dataset is freely available at: https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification
file <- read.csv("E:/HSE Social Info/Third Year/Text Analysis/HM/archive/Corona_NLP_train.csv", header = T)
head(file)
## UserName ScreenName Location TweetAt
## 1 3799 48751 London 16-03-2020
## 2 3800 48752 UK 16-03-2020
## 3 3801 48753 Vagabonds 16-03-2020
## 4 3802 48754 16-03-2020
## 5 3803 48755 16-03-2020
## 6 3804 48756 ÜT: 36.319708,-82.363649 16-03-2020
## OriginalTweet
## 1 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8
## 2 advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order
## 3 Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P
## 4 My food stock is not the only one which is empty...\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \nStay calm, stay safe.\n\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j
## 5 Me, ready to go at supermarket during the #COVID19 outbreak.\n\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\n\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n
## 6 As news of the regionÂ’s first confirmed COVID-19 case came out of Sullivan County last week, people flocked to area stores to purchase cleaning supplies, hand sanitizer, food, toilet paper and other goods, @Tim_Dodson reports https://t.co/cfXch7a2lU
## Sentiment
## 1 Neutral
## 2 Positive
## 3 Positive
## 4 Positive
## 5 Extremely Negative
## 6 Positive
count(file)
## n
## 1 41157
The data frame has six columns: UserName, ScreenName, Location, TweetAt, OriginalTweet, and Sentiment.
The Sentiment column, our response variable, has five categories: Positive, Extremely Positive, Neutral, Negative and Extremely Negative. For this work, and to simplify the task, we removed the Neutral tweets and collapsed Positive and Extremely Positive into a single Positive class, and Negative and Extremely Negative into a single Negative class. In addition, since some tweets use a different character encoding, which complicates the cleaning procedure, we converted all text to UTF-8.
n_file <- file %>% filter(!Sentiment == "Neutral") %>% mutate(Sentiment1 = case_when(
Sentiment == "Positive" | Sentiment == "Extremely Positive" ~ "Positive",
Sentiment == "Negative" | Sentiment == "Extremely Negative" ~ "Negative"))
n_file <- n_file %>% select(-Sentiment)
n_file <- n_file %>% mutate(OriginalTweet = stri_replace_all_regex(OriginalTweet, pattern=c("\u0091", "\u0092", "\u0093", "\u0094", "\u0095", "\u0096", "\u0097", "\u0099", "\u0080", "\u0084", "\u009e", "\u009a"),replacement=c("'", "'", "'", "'", "", "", "", "", "", "", "", ""),vectorize=FALSE))
n_file$OriginalTweet <- iconv(n_file$OriginalTweet, "UTF-8", "UTF-8", sub='')
Before proceeding with the exploratory analysis, we need to tokenise the tweets. We chose ‘tweets’ as the tokenisation mode, which keeps URLs intact instead of splitting them into pieces. Next, we removed stop words, URLs, usernames, tokens starting with a number, tokens containing a dollar sign, and all special characters except # (to keep the hashtags). Finally, we lemmatised the words obtained from tokenisation and filtering.
tok_ln <- n_file %>% unnest_tokens(words, OriginalTweet, token = "tweets", to_lower = T)
tok_ln1 <- tok_ln %>% filter(!words %in% stopwords("en"), !str_detect(words, "https:."), !str_detect(words, "@."), !str_detect(words, "^[0-9]+"),!str_detect(words, "\\$"), !str_detect(words, "(?!#)[^0-9a-zA-Z]+"))
tok_ln1 <- tok_ln1 %>% select(Sentiment1, words)
#head(tok_ln1)
words_lema <- tok_ln1 %>% mutate(lem = lemmatize_words(words))
head(words_lema)
## Sentiment1 words lem
## 1 Positive advice advice
## 2 Positive talk talk
## 3 Positive neighbours neighbour
## 4 Positive family family
## 5 Positive exchange exchange
## 6 Positive phone phone
count(words_lema)
## n
## 1 624598
We ended up with a corpus of 624,598 lemmatised words, which is indeed a large dataset.
In this section, we will describe some basic statistics of the corpus, such as the distribution of the most frequent words, the size of the vocabulary and the distribution of classes. In addition, we will examine the use of words in the two sentiment groups, i.e. which words occur most often in positive and negative sentiment tweets. We will use the log-likelihood ratio to do this.
Let us start with the size of the vocabulary and the distribution of classes. We saw above that the corpus contains 624,598 lemmatised words. Now we want to know how many unique lemmatised words there are in the corpus.
words_lema %>% select(lem) %>% unique() %>% count()
## n
## 1 39225
There are 39225 unique words in our corpus.
The class distribution tells us how many words in our corpus come from negative tweets and how many from positive tweets.
#words_lema %>% filter(Sentiment1 == "Positive") %>% count()
#words_lema %>% filter(Sentiment1 == "Negative") %>% count()
sumtable(words_lema, out = 'kable') %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
| Variable | N | Percent |
|---|---|---|
| Sentiment1 | 624598 | |
| … Negative | 285286 | 45.7% |
| … Positive | 339312 | 54.3% |
Of our corpus of 624,598 words, 45.7% come from negative tweets and 54.3% from positive tweets.
Next, we identify the most frequent words in the corpus, using a frequency table, a plot of the word-frequency distribution, and a word cloud.
c_words <- words_lema %>% count(lem, sort = T)
head(c_words)
## lem n
## 1 #coronavirus 12462
## 2 price 7587
## 3 #covid19 7168
## 4 store 6540
## 5 food 6271
## 6 covid19 6237
words100 <- c_words %>% top_n(50)
## Selecting by n
ggplot(words100, aes(reorder(lem, -n), n)) +
geom_point(stat = "identity") +
geom_line(group = 1) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.7)) +
xlab("Words") +
ylab("Frequency") +
ggtitle("Distribution of the most frequent words")
words_lema %>%
dplyr::count(lem) %>%
with(wordcloud(lem, n, max.words = 80))
The hashtag #coronavirus is the most frequent word in our corpus, followed by the word “price”. To better understand the meaning of the different words, a bigram analysis would be preferable, but since our units of analysis are short tweets, bigrams are hard to apply here. In addition, the word-frequency distribution is heavily skewed (roughly logarithmic): a few words are used very often, while many others appear only rarely.
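To illustrate this skew, here is a minimal sketch (not part of the original analysis) that reuses the frequency table c_words computed above and plots word frequency against rank on logarithmic scales; a roughly straight line indicates a Zipf-like distribution.
rank_freq <- c_words %>% mutate(rank = row_number()) # c_words is already sorted by frequency
ggplot(rank_freq, aes(rank, n)) +
geom_line() +
scale_x_log10() +
scale_y_log10() +
xlab("Word rank (log scale)") +
ylab("Frequency (log scale)") +
ggtitle("Rank-frequency distribution of lemmas")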
We need a contrastive analysis to determine which words occur disproportionately often in positive and in negative tweets; we use the log-likelihood ratio for this. First, we add 1 to the positive and negative counts to avoid infinite values, since some words do not appear in one of the two classes. Second, we keep only words whose (smoothed) count exceeds ten in at least one of the two classes. Finally, we compute the Dunning G² log-likelihood statistic and the log-ratio, and show the result as a plot of the log-ratio.
sentiment <- words_lema %>% select(- words) %>% count(lem, Sentiment1) %>% spread(Sentiment1, n, fill = 0) %>% mutate(Positive = Positive + 1, Negative = Negative + 1) %>% filter(Positive > 10 | Negative > 10)
head(sentiment)
## lem Negative Positive
## 1 #advertising 4 16
## 2 #africa 12 6
## 3 #agriculture 7 13
## 4 #ai 8 13
## 5 #aldi 5 11
## 6 #alert 4 30
# Dunning's G^2 log-likelihood statistic for the association between a word and a class.
# a, b: word counts in the two classes; E1, E2: counts expected under independence.
g2 = function(a, b) {
  c = sum(a)                      # total count in class 1
  d = sum(b)                      # total count in class 2
  E1 = c * ((a + b) / (c + d))    # expected count in class 1
  E2 = d * ((a + b) / (c + d))    # expected count in class 2
  # the small constant 1e-7 guards against taking the log of zero
  return(2 * ((a * log(a / E1 + 1e-7)) + (b * log(b / E2 + 1e-7))))
}
sentiment.g2 <- sentiment %>%
mutate(g2=g2(Positive, Negative)) %>%
arrange(desc(g2)) %>%
mutate(g2 = round(g2, 2))
head (sentiment.g2)
## lem Negative Positive g2
## 1 panic 2005 427 1400.47
## 2 hand 403 2136 1019.02
## 3 crisis 1314 278 921.98
## 4 help 433 2032 869.56
## 5 thank 160 1208 739.34
## 6 sanitizer 425 1773 673.09
# log2 ratio of relative frequencies: positive values indicate words relatively more
# frequent in the first class (Positive), negative values in the second (Negative).
logratio <- function(a, b) {
  return(log2((a / sum(a)) / (b / sum(b))))
}
sentiment.lr <- sentiment.g2 %>%
mutate(logodds = logratio(Positive, Negative))
head(sentiment.lr %>% arrange(desc(abs(logodds))))
## lem Negative Positive g2 logodds
## 1 #contest 1 37 37.53 4.961693
## 2 beautiful 1 33 32.87 4.796634
## 3 #giveawayalert 1 30 29.39 4.659131
## 4 lawsuit 21 1 25.96 -4.640077
## 5 wtf 62 3 76.41 -4.616994
## 6 ruin 20 1 24.49 -4.569688
sentiment.lr %>%
filter(Positive > 0 & Negative > 0) %>%
group_by(logodds < 0) %>%
top_n(15, abs(logodds)) %>%
ungroup() %>%
mutate(lem = reorder(lem, logodds)) %>%
ggplot(aes(lem, logodds, fill = logodds > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("log odds ratio (Negative/Positive)") +
scale_fill_discrete(name = "", labels = c("Negative", "Positive"))
As a result, we can see that “#contest” and “beautiful” are the words most strongly associated with positive tweets, and that the association is statistically significant; “lawsuit” and “wtf” are the words most strongly associated with negative tweets, again with a statistically significant difference.
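As a quick worked example of how to read the log-ratio (a suggested illustration using the smoothed counts above): a value of about 4.96 for “#contest” means that, relative to class size, the word is roughly 2^4.96 ≈ 31 times more frequent in positive than in negative tweets.
sentiment.lr %>% filter(lem == "#contest") %>% mutate(freq_ratio = 2^logodds) # freq_ratio ~ 31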
In this section, we prepare the data for the classification models. The main objective is to determine whether a tweet is positive or negative from the words it contains, so the models will classify new tweets based on their word composition. To do this, we first clean the data in the same way as before, except that we also add an ID number to each tweet so that, after tokenisation, we do not lose track of which tweet each word comes from.
data_clean <- n_file %>% mutate(doc_id = row_number()) %>% unnest_tokens(words, OriginalTweet, token = "tweets", to_lower = T)%>% filter(!words %in% stopwords("en"), !str_detect(words, "https:."), !str_detect(words, "@."), !str_detect(words, "^[0-9]+"),!str_detect(words, "\\$"), !str_detect(words, "(?!#)[^0-9a-zA-Z]+"))
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
data_lama <- data_clean %>% mutate(lem = lemmatize_words(words)) %>% select(Sentiment1, doc_id, lem)
head(data_lama)
## Sentiment1 doc_id lem
## 1 Positive 1 advice
## 2 Positive 1 talk
## 3 Positive 1 neighbour
## 4 Positive 1 family
## 5 Positive 1 exchange
## 6 Positive 1 phone
#data_lama %>% count(lem) %>% arrange(n)
In addition, we count how often each word occurs within each tweet. In the following, we will often refer to tweets as documents.
lem_count <- data_lama %>% count(lem, doc_id, sort = T)
head(lem_count)
## lem doc_id n
## 1 gt 6264 10
## 2 online 27635 9
## 3 thank 30645 9
## 4 expert 26060 7
## 5 gt 3905 7
## 6 industry 3251 7
Since we want to run a classification model on our corpus, we need to transform it into a matrix where the rows represent the documents (tweets), and the columns the different words.
dfm <- lem_count %>%
cast_dfm(doc_id, lem, n)
head(dfm)
## Document-feature matrix of: 6 documents, 39,225 features (99.95% sparse) and 0 docvars.
## features
## docs gt online thank expert industry senator wena amp bill can
## 6264 10 0 0 0 0 0 0 0 0 0
## 27635 0 9 0 0 0 0 0 0 0 0
## 30645 0 0 9 0 0 0 0 0 0 0
## 26060 0 0 0 7 0 0 0 0 0 0
## 3905 7 0 0 0 0 0 0 0 0 0
## 3251 0 0 0 0 7 0 0 0 0 0
## [ reached max_nfeat ... 39,215 more features ]
Our matrix consists of 33,441 rows (documents) and 39,225 features (words). In machine learning, it is generally not recommended to fit a model with far more features than observations, and a model with 39,225 features would also require more memory than we have available. Therefore, we trim the features (terms) of the matrix: we keep only terms that appear at least 5 times in the whole corpus (term frequency) and that occur in at least 0.3% of the documents (document frequency).
dfm_final <- dfm %>% # build the matrix with raw frequencies as values (not tf-idf)
dfm_trim(min_termfreq = 5, termfreq_type = "count") %>% # keep terms appearing at least 5 times in the corpus
dfm_trim(min_docfreq = 0.003, docfreq_type = "prop") # keep terms appearing in at least 0.3% of documents
dfm_final
## Document-feature matrix of: 33,441 documents, 942 features (98.67% sparse) and 0 docvars.
## features
## docs gt online thank expert industry amp bill can case every
## 6264 10 0 0 0 0 0 0 0 0 0
## 27635 0 9 0 0 0 0 0 0 0 0
## 30645 0 0 9 0 0 0 0 0 0 0
## 26060 0 0 0 7 0 0 0 0 0 0
## 3905 7 0 0 0 0 0 0 0 0 0
## 3251 0 0 0 0 7 0 0 0 0 0
## [ reached max_ndoc ... 33,435 more documents, reached max_nfeat ... 932 more features ]
We end up with a matrix of 942 features (words), a reduction of roughly 98% compared to the previous matrix.
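As a quick sanity check (a suggested addition, assuming quanteda's nfeat() accessor), the size of the reduction can be computed directly from the two matrices:
round(1 - nfeat(dfm_final) / nfeat(dfm), 3) # ~0.976, i.e. roughly a 98% reduction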
Next, we need to divide our corpus into training data, to run the model, and test data, to test the quality of the model with unknown documents. 90% of our corpus will be sent to the training set, and the remaining 10% to the test set.
sentiment_labels <- n_file$Sentiment1[as.integer(rownames(dfm))]
set.seed(1991)
## we will take 10% of the sample for testing
split <- createDataPartition(y=sentiment_labels, p = 0.9, list = FALSE)
train.data <- dfm %>% dfm_subset(rownames(dfm) %in% rownames(dfm)[split])
test.data <- dfm %>% dfm_subset(!rownames(dfm) %in% rownames(dfm)[split])
response <- as.factor(sentiment_labels)
trainY <- response[split]
testY <- response[-split]
test.matched <- test.data %>%
dfm_match(features = featnames(train.data))
In this section, we will run the Naive Bayes algorithm with our training data, predict a new classification (positive or negative) with our test data, and then analyse and discuss the validity of our model.
Let us fit the Naive Bayes model. We chose the multinomial distribution because we work with word counts rather than simple word presence/absence.
model.nb <- textmodel_nb(train.data, trainY, distribution = "multinomial")
summary(model.nb)
##
## Call:
## textmodel_nb.dfm(x = train.data, y = trainY, distribution = "multinomial")
##
## Class Priors:
## (showing first 2 elements)
## Negative Positive
## 0.5 0.5
##
## Estimated Feature Scores:
## gt online thank expert industry senator wena
## Negative 0.0001485 0.002798 0.0004692 0.0003477 0.0006616 7.764e-05 2.7e-05
## Positive 0.0003857 0.004524 0.0031288 0.0003335 0.0007829 3.190e-05 2.9e-06
## amp bill can case dad every grateful
## Negative 0.005907 0.0003342 0.00401 0.0008877 0.0001553 0.0009552 4.051e-05
## Positive 0.006722 0.0002813 0.00535 0.0007046 0.0001102 0.0010932 2.523e-04
## paper pray price railway staff want will
## Negative 0.001681 4.388e-05 0.013022 4.051e-05 0.001009 0.001097 0.005647
## Positive 0.001577 1.740e-04 0.008664 8.699e-06 0.001786 0.001461 0.005640
## work check close combo crisis death dying
## Negative 0.003261 0.000611 0.001212 2.363e-05 0.0039763 0.0006211 1.654e-04
## Positive 0.004016 0.001070 0.001606 2.030e-05 0.0007481 0.0001479 8.699e-05
## essential fight
## Negative 0.001343 0.0011409
## Positive 0.001667 0.0007017
Then we use the model fitted above to predict the class of the documents in the test data.
predictedY <- predict(model.nb, newdata = test.matched)
head(predictedY)
## 2714 23802 19031 2180 29133 3528
## Negative Positive Positive Positive Positive Positive
## Levels: Negative Positive
We can see, for example, that our model predicted document 2714 as negative, while the other documents shown (23802, 19031, 2180, 29133 and 3528) were predicted as positive. We can run the prediction again and, instead of the final classification (positive/negative), obtain the probability that each document is positive or negative.
#probabilities
predicted.prob <- round(predict(model.nb, newdata = test.matched, type = "prob"), 2)
head(predicted.prob)
## Negative Positive
## 2714 1.00 0.00
## 23802 0.00 1.00
## 19031 0.49 0.51
## 2180 0.49 0.51
## 29133 0.46 0.54
## 3528 0.02 0.98
We can see, for example, that document 2714 is assigned a probability of 1.00 of being negative, while document 3528 has a probability of 0.98 of being positive.
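To make the link between these probabilities and the feature scores explicit, here is a minimal sketch (not part of the original analysis) that reconstructs the multinomial Naive Bayes posterior for the first test document from the fitted parameters. It assumes model.nb$param stores the per-class word probabilities with rows in the order Negative, Positive, as shown in the model summary, and takes the 0.5/0.5 class priors from that summary.
priors <- c(Negative = 0.5, Positive = 0.5) # class priors, as reported by summary(model.nb)
param <- as.matrix(model.nb$param) # per-class word probabilities (assumed rows: Negative, Positive)
counts <- as.numeric(as.matrix(test.matched[1, ])) # word counts of the first test document (doc 2714)
loglik <- log(priors) + as.vector(log(param) %*% counts) # log P(c) + sum_w n_w * log P(w | c)
posterior <- exp(loglik - max(loglik)) # exponentiate with a numerical-stability shift
round(posterior / sum(posterior), 2) # should match the first row of predicted.prob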
Let’s now evaluate how accurate our model is overall, and how precise it is at classifying documents as positive or negative.
cm.po <- confusionMatrix(data = predictedY, reference = testY, positive="Positive", mode = "prec_recall")
cm.po
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 1228 323
## Positive 311 1481
##
## Accuracy : 0.8103
## 95% CI : (0.7966, 0.8235)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6185
##
## Mcnemar's Test P-Value : 0.6622
##
## Precision : 0.8265
## Recall : 0.8210
## F1 : 0.8237
## Prevalence : 0.5396
## Detection Rate : 0.4430
## Detection Prevalence : 0.5360
## Balanced Accuracy : 0.8094
##
## 'Positive' Class : Positive
##
cm.neg <- confusionMatrix(data = predictedY, reference = testY, positive="Negative", mode = "prec_recall")
cm.neg
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 1228 323
## Positive 311 1481
##
## Accuracy : 0.8103
## 95% CI : (0.7966, 0.8235)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6185
##
## Mcnemar's Test P-Value : 0.6622
##
## Precision : 0.7917
## Recall : 0.7979
## F1 : 0.7948
## Prevalence : 0.4604
## Detection Rate : 0.3673
## Detection Prevalence : 0.4640
## Balanced Accuracy : 0.8094
##
## 'Positive' Class : Negative
##
The accuracy of our model is about 81%, which is a good result for a machine-learning model. The precision for the negative class is about 79%, while the precision for the positive class is about 83%. These results are quite satisfactory, and we can expect the model to classify roughly 81% of new tweets correctly.
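As a quick check (a suggested addition), the class precisions can be recomputed directly from the confusion-matrix counts shown above:
c(precision_positive = 1481 / (1481 + 311), # correct positives / all predicted positives, ~0.83
  precision_negative = 1228 / (1228 + 323)) # correct negatives / all predicted negatives, ~0.79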
In this section, we will analyse the most important predictors for the different classes, i.e. the predictors that have the greatest influence on increasing the probability that a document is negative or positive.
vars.nb <- t(model.nb$param) %>%
as.data.frame %>%
rownames_to_column("word") %>%
mutate(lo = log(Positive/Negative))
head(vars.nb)
## word Negative Positive lo
## 1 gt 1.485192e-04 3.856671e-04 0.9542604
## 2 online 2.798237e-03 4.523614e-03 0.4803218
## 3 thank 4.691856e-04 3.128833e-03 1.8974169
## 4 expert 3.476699e-04 3.334716e-04 -0.0416960
## 5 industry 6.615855e-04 7.829332e-04 0.1684082
## 6 senator 7.763503e-05 3.189728e-05 -0.8894981
Most important predictors for class “Positive”
head(vars.nb %>% arrange(desc(lo)))
## word Negative Positive lo
## 1 #contest 3.375436e-06 9.859159e-05 3.374461
## 2 #giveawayalert 3.375436e-06 8.409283e-05 3.215397
## 3 beautiful 3.375436e-06 8.119307e-05 3.180305
## 4 #puzzle 3.375436e-06 7.829332e-05 3.143938
## 5 success 3.375436e-06 7.249382e-05 3.066977
## 6 peace 3.375436e-06 6.669431e-05 2.983595
Most important predictors for class “Negative”
head(vars.nb %>% arrange(lo))
## word Negative Positive lo
## 1 ruin 6.413329e-05 2.899753e-06 -3.096338
## 2 fraud 3.139156e-04 1.449876e-05 -3.075061
## 3 crude 4.961891e-04 2.319802e-05 -3.062890
## 4 lawsuit 6.075785e-05 2.899753e-06 -3.042271
## 5 wtf 1.822736e-04 8.699258e-06 -3.042271
## 6 tragedy 5.400698e-05 2.899753e-06 -2.924488
We can see that “#contest”, “#giveawayalert” and “beautiful” are the words that most increase the probability of a tweet being classified as positive, while “ruin”, “fraud” and “crude” most increase the probability of a tweet being classified as negative. These words match their usage: #contest is mostly positive because it appears under marketing tweets urging people to enter a contest, whereas “ruin” and “fraud” are mostly used in negative contexts, so tweets containing them are more likely to be classified as negative.
Do these match what we found in the EDA step?
a <- vars.nb %>% arrange(desc(lo)) %>% top_n(15, lo)
b <- vars.nb %>% arrange(lo) %>% top_n(-15, lo)
c <- vars.nb %>% top_n(30, abs(lo)) %>% arrange(desc(abs(lo)))
sentiment.lr %>%
filter(Positive > 0 & Negative > 0) %>%
top_n(30, abs(logodds)) %>% ungroup() %>% filter(lem %in% c(c$word))
## lem Negative Positive g2 logodds
## 1 crude 161 10 188.06 -4.256749
## 2 fraud 105 6 125.01 -4.377043
## 3 arrest 110 8 123.46 -4.029120
## 4 blame 75 5 86.11 -4.154651
## 5 wtf 62 3 76.41 -4.616994
## 6 #contest 1 37 37.53 4.961693
## 7 fraudster 30 2 34.44 -4.154651
## 8 beautiful 1 33 32.87 4.796634
## 9 gratitude 2 37 32.56 3.961693
## 10 #giveawayalert 1 30 29.39 4.659131
## 11 #puzzle 1 28 27.08 4.559595
## 12 lawsuit 21 1 25.96 -4.640077
## 13 success 1 27 25.93 4.507128
## 14 ruin 20 1 24.49 -4.569688
## 15 brent 18 1 21.56 -4.417685
## 16 peace 1 23 21.36 4.275802
## 17 cheat 17 1 20.11 -4.335223
## 18 terroristic 17 1 20.11 -4.335223
## 19 assault 16 1 18.66 -4.247760
## 20 tragedy 16 1 18.66 -4.247760
## 21 bankrupt 13 1 14.37 -3.948200
We can see that 21 of the 30 most influential Naive Bayes predictors also appear among the 30 words with the largest log-odds ratios from the EDA step, so the two rankings largely agree. It is also worth noting that #contest was the word most strongly associated with positive tweets in the EDA step as well.
Next, let’s analyse the wrongly predicted tweets and learn from their characteristics. The wrongly predicted tweets are the tweets in the test data that our model predicts as positive when they were originally negative and vice versa.
predictions <- data.frame(doc_id=as.integer(rownames(predicted.prob)))
predictions$pred_y <- predictedY
predictions$positive_prob_pred <- predicted.prob[,2]
predictions$negative_prob_pred <- predicted.prob[,1]
predictions$true_label <- n_file[rownames(predicted.prob),]$Sentiment1
predictions$text <- n_file[rownames(predicted.prob),]$OriginalTweet
head(predictions)
## doc_id pred_y positive_prob_pred negative_prob_pred true_label
## 1 2714 Negative 0.00 1.00 Negative
## 2 23802 Positive 1.00 0.00 Positive
## 3 19031 Positive 0.51 0.49 Positive
## 4 2180 Positive 0.51 0.49 Negative
## 5 29133 Positive 0.54 0.46 Negative
## 6 3528 Positive 0.98 0.02 Positive
## text
## 1 Indonesia - 227 cases, 19 deaths\nPhilippines - 187 cases, 14 deaths\nMalaysia - 673 cases, 2 deaths\nThailand - 177 cases, 1 death\nSingapore - 226 cases , 0 deaths\n\nIndonesia now has the most #COVID19 case fatalities in Southeast Asia #coronavirus #CoronaOutbreak #SARSCoV2
## 2 i want to go shopping i want to walk w/o the thought i might get the COVID-19 virus i want a real hug i want to go eat at a dine in restaurant ohhh to be free w/o the thought of that darn virus i want to go shopping for shoes i don't need just want to go shopping\nno more online!
## 3 @Consumer_Cell Is there an acceptable level of cases that push your company to change its stand on working from WORK! We are heading there FAST! SO what's your number? 75K cases? 100K cases? 150K cases? 200K cases? #coronavirus https://t.co/CR1Ssx1PGF
## 4 @Sideshow_JaneV3 Perhaps... I think the world may be altered in other ways also by COVID-19. More people may want or be forced to work at home. People may find lots of businesses have gone bust, so more online shopping? Travel may be harder. All this may
## 5 Sports behaving badly? A US Nielsen Sports survey found 45% of fans will stop following a sport if it responds poorly to the current crisis, 57% will try a new sport that impresses with their approach. Full interview with of Nielsen Sports CEO here https://t.co/gaQZX4Q0kZ
## 6 Question why is the government not concerned about closing or limiting the # of people in these type store at one time: \nApparel stores \nGrocery stores \nSporting stores \nHome improvement stores?\nLIKE their doing to RESTAURANTS? ???? #coronavirus #COVID19 #CoronavirusOutbreak
Wrongly predicted positive tweets
head(predictions %>% filter(pred_y == 'Positive' & true_label == 'Negative'))
## doc_id pred_y positive_prob_pred negative_prob_pred true_label
## 1 2180 Positive 0.51 0.49 Negative
## 2 29133 Positive 0.54 0.46 Negative
## 3 830 Positive 0.52 0.48 Negative
## 4 6275 Positive 0.83 0.17 Negative
## 5 3250 Positive 1.00 0.00 Negative
## 6 30419 Positive 0.51 0.49 Negative
## text
## 1 @Sideshow_JaneV3 Perhaps... I think the world may be altered in other ways also by COVID-19. More people may want or be forced to work at home. People may find lots of businesses have gone bust, so more online shopping? Travel may be harder. All this may
## 2 Sports behaving badly? A US Nielsen Sports survey found 45% of fans will stop following a sport if it responds poorly to the current crisis, 57% will try a new sport that impresses with their approach. Full interview with of Nielsen Sports CEO here https://t.co/gaQZX4Q0kZ
## 3 Connectivity is essential during times of crisis. \n\nThat's why @CWAunion & allies are asking broadband CEOs to lift data caps, waive fees, & do everything within their power to help people connect to the world from home & stop the spread of COVID-19.\n\nhttps://t.co/GWWf1LBByp
## 4 So while at the grocery store, my car is in the parking lot CLOSER than 6 feet from the cars on either side. This is not #SocialDistancing I hope my car doesnÂ’t get #coronavirus
## 5 Due to the #Coronavirus outbreak, we recommend regularly using anti-static cleansing wipes or anti-static foam cleanser on the touchscreen of your till…\n\nFoam Clene Spray (300ml): £8.50\nScreen Clene Wipes (20): £10\nScreen Clene Wipes (100): £35\n\nNote: All prices exclude VAT
## 6 Shocking it's been more busy than Christmas for Royal Mail and other couriers/deliveries. If it's not essential don't send it and don't buy for delivery. Putting workers in danger that don't have protection in place #StayHomeSaveLives and only do essential online shopping & post.
Analysing these tweets suggests that our model, even though it contradicts the original labels, is often closer to reality than those labels. For example, consider tweet 15335 (not shown in the head above): only because the word “kill” is repeated several times was the tweet originally labelled negative, even though in that context “kill” was not used to express a negative sentiment. The original labelling therefore seems to consider only the presence or absence of individual words, assigning each a value and summing these values to decide whether a tweet is positive or negative, whereas our model takes into account how each word is actually used across the whole corpus of positive and negative tweets. Is the situation the same for the wrongly predicted negative tweets?
Wrongly predicted negative tweets
head(predictions %>% filter(pred_y == 'Negative' & true_label == 'Positive'))
## doc_id pred_y positive_prob_pred negative_prob_pred true_label
## 1 8267 Negative 0.09 0.91 Positive
## 2 5488 Negative 0.07 0.93 Positive
## 3 6966 Negative 0.32 0.68 Positive
## 4 411 Negative 0.03 0.97 Positive
## 5 11560 Negative 0.04 0.96 Positive
## 6 31736 Negative 0.00 1.00 Positive
## text
## 1 You may be immune to #COVID19 but ask yourself, are my mom and dad immune? Are my grandparents immune? Is that old lady I pass at the grocery store immune? Try not being selfish for the first time in your life. Stop hording. Use your brain.\n\n#coronavirus #CoronaVirusUpdate
## 2 Don't miss your FREE 72-page land report with an analysis of land prices in 2019, land trends over the last 13 years and we reveal who is buying land. Plus a surge of cattle numbers in marts and #Coronavirus advice for you and your farm. Only inside this week's @Farmersjournal
## 3 due to COVID-19 iÂ’m going to be selling pictures to help pay my bills and buy groceries, please dm me to talk prices!! \nselling feet pics, selling nudes, selling videos, sugar daddy needed, serious inquires only
## 4 Clubs shut down\nBars shut down\nRestaurants shut down\nGym shut down \nGrocery stores are nightmare fuel\nSeriously if u work at a grocery store or retail in general u have my sympathy \n#coronavirus
## 5 anhydrous ammonia is off $100 from same week last year\nprices on inputs like anhydrous ammonia could continue the trend to the downside\n“China is a large importer of ammonia for industrial purposes\n\nhttps://t.co/j6rSiRleK5 https://t.co/1hlz58haLx
## 6 Consumer Prices Plummet as Economy Continues Shrinking: The effects of the COVID-19 economic shutdown continue to be felt, but for consumers, itÂ’s not all bad. Consumer prices fell by the most since 2015, dropping 0.4% in March. [CNBC] Prices had seen aÂ… https://t.co/7mfwu8J5VD https://t.co/CEjYoG7NPV
Let’s analyse tweet 14686 (not shown in the head above). The same conclusion drawn for the wrongly predicted positive tweets applies here: just because the tweet contains “very good!”, the original labelling considered it positive, whereas on closer inspection the “very good!” is purely ironic; the hashtags used and the subject indicate that the tweet expresses a negative sentiment.
Overall, we can assume that our model performs better than the original sentiment labelling. One reason could be the use of hashtags as predictors of sentiment: the original labelling probably did not take the hashtags in the tweets into account, and the overall sentiment of a tweet was probably the sum of the values assigned to its individual words, which ignores the context in which a word is used, whereas our machine-learning model takes that usage into account.
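To see how the hashtags behave as predictors in the Naive Bayes model, a small suggested check filters the predictor table built above for features starting with “#”:
vars.nb %>%
filter(str_detect(word, "^#")) %>% # keep only hashtag features
arrange(desc(abs(lo))) %>% # strongest predictors (in either direction) first
head()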
In this section, we run another machine-learning model, a random forest. We use the same corpus of lemmatised words as for the Naive Bayes model, but this time we transform the frequency table into a document-term matrix.
xfreq <- lem_count %>%
cast_dtm(doc_id, lem, n)
xfreq
## <<DocumentTermMatrix (documents: 33441, terms: 39225)>>
## Non-/sparse entries: 593581/1311129644
## Sparsity : 100%
## Maximal term length: 191
## Weighting : term frequency (tf)
We obtain a matrix of 33,441 documents and 39,225 terms (words). The matrix is extremely sparse, i.e. most entries are zero because most words do not occur in most tweets. We will therefore remove the sparsest terms from the matrix.
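For reference (a suggested check), the sparsity reported in the printout can be reproduced from its non-sparse and sparse entry counts:
round(1 - 593581 / (593581 + 1311129644), 5) # proportion of zero entries, ~0.99955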
sparse = removeSparseTerms(xfreq, 0.995) # keep only terms that appear in at least 0.5% of documents
sparse
## <<DocumentTermMatrix (documents: 33441, terms: 590)>>
## Non-/sparse entries: 373902/19356288
## Sparsity : 98%
## Maximal term length: 20
## Weighting : term frequency (tf)
After this trimming, our final matrix has 590 terms.
Before running the model, we need to transform the document-term matrix into a data frame and attach the original sentiment of each document (tweet).
tSparse = as.data.frame(as.matrix(sparse))
colnames(tSparse) = make.names(colnames(tSparse))
tSparse$sentiment = n_file[row.names(tSparse),]$Sentiment1
Splitting the data
Next, we randomly split our dataset between training and test data, as we did in the previous model. In this case, 80% of our data will go to the training set and 20% will be sent to the test set.
library(caTools)
set.seed(100)
split = sample.split(tSparse$sentiment, SplitRatio = 0.8)
trainSparse = subset(tSparse, split==TRUE)
testSparse = subset(tSparse, split==FALSE)
Here we run our model using the training dataset and test its accuracy using our test dataset.
set.seed(100)
trainSparse$sentiment = as.factor(trainSparse$sentiment)
testSparse$sentiment = as.factor(testSparse$sentiment)
RF_model = randomForest(sentiment ~ ., data=trainSparse, ntree = 50)
predictRF = predict(RF_model, newdata=testSparse)
confusionMatrix(data = predictRF, reference = testSparse$sentiment, positive="Positive", mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 2358 663
## Positive 721 2946
##
## Accuracy : 0.7931
## 95% CI : (0.7832, 0.8027)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5829
##
## Mcnemar's Test P-Value : 0.1255
##
## Precision : 0.8034
## Recall : 0.8163
## F1 : 0.8098
## Prevalence : 0.5396
## Detection Rate : 0.4405
## Detection Prevalence : 0.5483
## Balanced Accuracy : 0.7911
##
## 'Positive' Class : Positive
##
confusionMatrix(data = predictRF, reference = testSparse$sentiment, positive="Negative", mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 2358 663
## Positive 721 2946
##
## Accuracy : 0.7931
## 95% CI : (0.7832, 0.8027)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5829
##
## Mcnemar's Test P-Value : 0.1255
##
## Precision : 0.7805
## Recall : 0.7658
## F1 : 0.7731
## Prevalence : 0.4604
## Detection Rate : 0.3526
## Detection Prevalence : 0.4517
## Balanced Accuracy : 0.7911
##
## 'Positive' Class : Negative
##
Our model achieves about 79% accuracy, with about 78% precision for the negative class and about 80% precision for the positive class. This is fairly good for a machine-learning model, although, compared to the Naive Bayes model (about 81% accuracy), the random forest (with 50 trees) is slightly less accurate. Increasing the number of trees might improve the results, but for the purposes of this project we keep 50 trees.
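One way to judge whether more trees would help (a suggested check, not part of the original analysis) is to plot the out-of-bag error that randomForest stores as a function of the number of trees; a curve that has already flattened suggests that additional trees would change little.
plot(RF_model$err.rate[, "OOB"], type = "l",
xlab = "Number of trees", ylab = "OOB error rate",
main = "Out-of-bag error vs. number of trees")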
Let us analyse the most important predictors of the model. Random forests offer two common importance measures (mean decrease in accuracy and mean decrease in Gini impurity); here importance() reports the mean decrease in Gini impurity:
importance(RF_model) %>% data.frame() %>% top_n(30, MeanDecreaseGini) %>% arrange(desc(MeanDecreaseGini))
## MeanDecreaseGini
## crisis 299.53379
## panic 271.50977
## help 244.67703
## good 208.93874
## hand 186.83794
## thank 180.13983
## like 146.51677
## scam 121.01588
## safe 117.33522
## sanitizer 110.71955
## free 107.39151
## support 103.37800
## great 102.04255
## price 94.24502
## stop 91.10413
## bad 86.71358
## low 82.14261
## please 79.20764
## food 75.49845
## buy 73.09191
## die 71.19885
## fear 67.19809
## X.coronavirus 65.31852
## X.covid19 63.81065
## kill 60.44494
## people 58.85086
## go 56.50492
## covid19 56.33730
## share 55.45966
## oil 54.56170
head(varImpPlot(RF_model))
## MeanDecreaseGini
## online 35.576536
## thank 180.139834
## expert 7.289899
## industry 11.199352
## amp 47.975771
## bill 7.878483
As we can see, words such as “crisis”, “panic”, “help” and “good” have the greatest influence in the model. Let us compare these results with what we found for the previous model.
tp_imp_words <- importance(RF_model) %>% data.frame() %>% top_n(50, MeanDecreaseGini) %>% arrange(desc(MeanDecreaseGini))
c %>% filter(word %in% row.names(tp_imp_words))
## [1] word Negative Positive lo
## <0 rows> (or 0-length row.names)
We can see that none of the 30 most influential Naive Bayes words appears among the 50 most important features of the random forest. This can be explained partly by the different feature sets of the two models (the Naive Bayes training matrix was built from the untrimmed document-feature matrix, while the random forest uses only 590 terms) and partly by the fact that the hashtags are renamed when the matrix is converted to a data frame: make.names() turns #covid19 into X.covid19, for example, so hashtag features can never match between the two importance tables. We could harmonise the feature names to allow a fairer comparison, but for the purposes (and time constraints) of this project we leave it as is.
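To illustrate the renaming issue (a suggested check): make.names() replaces characters that are not valid in R column names, which is why the hashtag features of the random forest no longer match the feature names used in the Naive Bayes table.
make.names(c("#covid19", "#coronavirus")) # mangled to X.covid19 and X.coronavirus, as in the importance table above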
Let’s analyse the misclassified tweets. Misclassified tweets are tweets in the test data that our model predicts as positive when they were initially negative and vice versa.
df <- data.frame(predictRF)
predictions <- data.frame(doc_id=as.integer(rownames(df)))
predictions$pred_y <- df$predictRF
predictions$true_label <- n_file[row.names(df),]$Sentiment1
predictions$text <- n_file[row.names(df),]$OriginalTweet
head(predictions)
## doc_id pred_y true_label
## 1 33004 Negative Positive
## 2 2714 Negative Negative
## 3 6746 Positive Positive
## 4 7642 Negative Negative
## 5 23488 Negative Negative
## 6 16311 Negative Negative
## text
## 1 So proud of members and staff during COVID 19 Culinary staff can t open cafe but can serve meals to MD workers Piecemakers quilting club can t meet but can make masks Food Pantry staff can t socially interact but can provide food to drive thru s demand 700
## 2 Indonesia - 227 cases, 19 deaths\nPhilippines - 187 cases, 14 deaths\nMalaysia - 673 cases, 2 deaths\nThailand - 177 cases, 1 death\nSingapore - 226 cases , 0 deaths\n\nIndonesia now has the most #COVID19 case fatalities in Southeast Asia #coronavirus #CoronaOutbreak #SARSCoV2
## 3 Pray for the leaders. Pray for the healthcare workers. Pray for the researchers. Pray for the grocery store workers. Pray for those affected. Pray for each other. #Pray #coronavirus
## 4 Correct terms for the coronavirus #CoronaVirusUpdate #COROVID19 #CoronaCrisis the great bogg crisis, loo roll crisis, canÂ’t buy anything I need crisis, supermarket crisis, ventilator crisis, please add anything extra......
## 5 #Northerners vs #southerners....... fight fight fight fight fight.....\n\nFFS GROW UP! IÂ’m seeing as much footage of the chavs in northern town supermarket car parks as I am in southern supermarket car parks... \n\nGuess what, youÂ’re all pricks! #StayHomeSaveLives https://t.co/RoZ3a1HZIc
## 6 An anonymous donor gave us a certificate for T&T Pet Food & Supply, allowing us to buy this FULL cart of cat food! Due to COVID-19, there is high demand for our Pet Food & Supply Pantry. If you can donate, we need dry dog food, dry cat food & cat litter: https://t.co/2EZh4KMma5. https://t.co/ib2xhK0slf
Wrongly predicted positive tweets
head(predictions %>% filter(pred_y == 'Positive' & true_label == 'Negative'))
## doc_id pred_y true_label
## 1 6118 Positive Negative
## 2 12770 Positive Negative
## 3 15661 Positive Negative
## 4 26471 Positive Negative
## 5 23629 Positive Negative
## 6 8518 Positive Negative
## text
## 1 Day 2: Went to Food4Less to stock up on food. The only frozen pizzas left were California Pizza Kitchen thin crust 4 cheese. I had to make a choice. Do I get the pizza or starve?! I got the pizza. If COVID-19 doesnÂ’t get me, this pizza might finish the job.
## 2 @ArvindKejriwal @LtGovDelhi Delhi Goods Transport Organisation reiterated its demand made to Delhi LG & Delhi CM request to declare lockdown in Delhi. ItÂ’s necessary to combat community transmission of COVID-19 & request to arrange food/help for
## 3 Hi friends, is there a list of DMV food donation facilities that are still open for donation drop offs? Or know which areas are in dire need of food donations due to larger demand? I bought $100 of food donations that I need to drop off, thank you in advance! #Covid_19
## 4 @callistoggawine You should be livid. Not only has he exposed you to Covid-19, he has exposed his buddies who will now expose more folk who will expose nurses, doctors, careworkers, refuse collectors, supermarket staff, delivery drivers, farmers etc. He i
## 5 Disruption in demand markets due to say closure of food markets as part of efforts to contain spread of virus will increase levels of food going to waste This implies an increased need in means of preserving food among small scale and informal food traders to buffer
## 6 ItÂ’s sad, so sad, itÂ’s a sad sad situation,\nAnd itÂ’s getting more and more absurd.. \n#corona #coronavirus #covid_19 #toiletpaper #toiletroll #panicbuying #darkphotography #dslrguru #supermarket #dontpanic #photography #photographyislife #lovenotlooroll https://t.co/fTJ7mAOzyU
Let’s analyse tweet 14984 (not shown in the head above). Reading the tweet, we can clearly see that it contains many “negative” words that influence its overall sentiment, and, taking the context of the tweet into account, we can agree with the original label that this particular tweet is negative. It would also be interesting to have, as in the Naive Bayes model, the probability that a tweet is positive or negative, which would give deeper insight into the classification choices of the model; we did not compute these probabilities for the random forest here (a short sketch showing how they could be obtained follows the analysis of the wrongly predicted negative tweets below).
Wrongly predicted negative tweets
head(predictions %>% filter(pred_y == 'Negative' & true_label == 'Positive'))
## doc_id pred_y true_label
## 1 33004 Negative Positive
## 2 9136 Negative Positive
## 3 5649 Negative Positive
## 4 4798 Negative Positive
## 5 8267 Negative Positive
## 6 3773 Negative Positive
## text
## 1 So proud of members and staff during COVID 19 Culinary staff can t open cafe but can serve meals to MD workers Piecemakers quilting club can t meet but can make masks Food Pantry staff can t socially interact but can provide food to drive thru s demand 700
## 2 Medical staff - Frontline workers\nEmergency responders - Frontline workers\nPharmacy techs - Frontline workers \nRetail staff inc Truckers - ARE Frontline workers \n\nAll deserve respect, retail workers are just trying to keep your food stocked!\n#StopHoarding #coronavirus #Respect
## 3 If you have food at home please don't buy other food everyday.\nPeople like me don't find nothing at the supermarket.\nDon't be selfish we all buy the food, so why i should buy something that i don't like because there is only that.\n#coronavirus #covid19UK #tesco
## 4 says system is alive and well calls workers heroes If you see empty shelves that s a demand issue not a supply issue gt gt gt gt
## 5 You may be immune to #COVID19 but ask yourself, are my mom and dad immune? Are my grandparents immune? Is that old lady I pass at the grocery store immune? Try not being selfish for the first time in your life. Stop hording. Use your brain.\n\n#coronavirus #CoronaVirusUpdate
## 6 Amid the panic and worries worldwide regarding Covid-19. We urge you to stand in faith and trust in God. Do not be moved by what you see. Do not be moved by the lack of food in stores. Do not be moved by the news reportings. But be moved by the Word of God. \n\nMatthew 6:31-34 https://t.co/lk7pDGoj9I
By analysing these tweets, in particular tweet 4465 (not shown in the head above), we can say that, by taking the context of the tweets into account, our prediction is better than the original sentiment label. As noted for the first model, the original labelling seems to have considered only the value attached to each word, ignoring the context (the mix of words in the tweet); in this case the context of the tweet is negative even though it does not contain many words with a negative connotation.
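As mentioned above, class probabilities are in fact also available for the random forest through predict() with type = "prob"; here is a minimal sketch (not run in the original analysis) of how they could be obtained:
rf_prob <- predict(RF_model, newdata = testSparse, type = "prob") # per-class probabilities (vote shares)
head(round(rf_prob, 2))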
Overall, our random forest model performed well, with an accuracy of about 79%. For a better result, we could increase the number of trees in the model. Moreover, as mentioned above, the hashtags are renamed when the matrix is converted to a data frame, which prevents a clean comparative study between the two models. For this project, we recommend using the Naive Bayes model to predict the sentiment of a tweet from the words it contains.