library(rtweet)
library(tidyverse)
library(tidytext)
library(wordcloud2)
library(DT)
library(plotly)
- The following unnests the words from usatoday’s tweets and removes stop words and leftover web words (“t.co” and “https”). A datatable and a wordcloud then show the most frequent words.
news_tweets <- get_timeline("usatoday", n = 5000)
news_words <- news_tweets %>%
  unnest_tokens(word, text) %>%
  select(screen_name, word)
news_words %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
Joining, by = "word"
news_words %>%
  anti_join(stop_words) %>%
  filter(word != "t.co") %>%
  filter(word != "https") %>%
  count(word, sort = TRUE)
Joining, by = "word"
news_words %>%
  anti_join(stop_words) %>%
  filter(word != "t.co") %>%
  filter(word != "https") %>%
  count(word, sort = TRUE) %>%
  top_n(100) %>%
  wordcloud2(size = .8)
Joining, by = "word"
Selecting by n
The first datatable shows the top words in usatoday’s tweets with only stop words removed. Its top two words are leftover web words from links, so a second datatable was made with both stop words and the web words “t.co” and “https” removed. A wordcloud of the top 100 words follows. As the datatable and wordcloud show, some of the most used words in usatoday’s tweets include “Ukraine”, “Russian”, and “president”. This makes sense because the Russian invasion of Ukraine was being heavily covered by news outlets when the tweets were collected.
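As an alternative to chaining one `filter()` per unwanted word, the web words can be appended to the stop word list so that a single `anti_join()` removes everything. The following is a minimal sketch, not the code used above: `toy_words` is a made-up stand-in for `news_words`, and `custom_stop_words` is a hypothetical name.

```r
library(dplyr)
library(tidytext)

# Made-up stand-in for news_words (one lowercase word per row)
toy_words <- tibble(
  word = c("ukraine", "https", "t.co", "the", "ukraine", "president")
)

# Hypothetical extended stop word list: tidytext's stop_words plus the web words
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("https", "t.co"), lexicon = "custom")
)

toy_top <- toy_words %>%
  anti_join(custom_stop_words, by = "word") %>%  # drop stop words and web words in one step
  count(word, sort = TRUE)                       # most frequent words first

toy_top
```

Passing `by = "word"` explicitly also silences the `Joining, by = "word"` message.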
- The following runs a sentiment analysis of usatoday’s tweets using the bing sentiment dictionary.
bing <- get_sentiments("bing")
bing
news_words %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE)
Joining, by = "word"
news_words %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(word != "supreme")
Joining, by = "word"
news_words %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(word != "supreme") %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(sentiment), scales = "free") +
  labs(y = "News headlines: Words that contribute the most to each sentiment",
       x = NULL) +
  coord_flip() +
  theme_minimal()
Joining, by = "word"
Selecting by n

The first datatable in the sentiment analysis shows the words in the bing dictionary and whether each is positive or negative. The second datatable shows the sentiment words found in usatoday’s tweets and whether they are positive or negative. The third datatable shows the same thing with the word “supreme” removed. It was removed because bing treats it as a positive word, but in the news it refers to the Supreme Court, to which a new justice had just been nominated, rather than expressing positive sentiment. Finally, the graph shows the top sentiment words in usatoday’s tweets with that word removed.
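The join-then-filter step can be seen on a small scale. This is a sketch under stated assumptions: `toy_words` and `toy_lexicon` are made-up stand-ins for `news_words` and the bing dictionary (same `word` and `sentiment` columns), not the real data.

```r
library(dplyr)

# Made-up stand-in for news_words
toy_words <- tibble(
  word = c("supreme", "court", "supreme", "crisis", "win", "crisis", "crisis")
)

# Made-up lexicon with the same shape as bing: one sentiment per word
toy_lexicon <- tibble(
  word      = c("supreme", "crisis", "win"),
  sentiment = c("positive", "negative", "positive")
)

toy_sentiment <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%  # keeps only words that appear in the lexicon
  filter(word != "supreme") %>%             # drop a word that is misleading in this context
  count(word, sentiment, sort = TRUE)

toy_sentiment
```

“court” disappears because it is not in the lexicon, while “supreme” is dropped only because of the explicit filter.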
- The following repeats the sentiment analysis of usatoday’s tweets using the nrc sentiment dictionary.
nrc <- get_sentiments("nrc")
Running get_sentiments("nrc") for the first time asks for permission to download the NRC Word-Emotion Association Lexicon (http://saifmohammad.com/WebPages/lexicons.html). A license is required for commercial use (contact Saif M. Mohammad, saif.mohammad@nrc-cnrc.gc.ca), and the lexicon should be cited when used: Saif M. Mohammad and Peter Turney (2013), "Crowdsourcing a Word-Emotion Association Lexicon." Computational Intelligence, 29(3): 436-465. doi: 10.1111/j.1467-8640.2012.00460.x
nrc
news_words %>%
  inner_join(nrc) %>%
  count(word, sentiment, sort = TRUE)
Joining, by = "word"
news_words %>%
  inner_join(nrc) %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(sentiment), scales = "free") +
  labs(y = "News headlines: Words that contribute the most to each sentiment",
       x = NULL) +
  coord_flip() +
  theme_minimal()
Joining, by = "word"
Selecting by n

The nrc dictionary is visibly different from the bing dictionary because it includes more than just positive and negative. For example, it also has categories such as anger, joy, disgust, and fear. The first datatable shows the nrc dictionary. The second datatable shows the words from usatoday’s tweets categorized according to the nrc dictionary. Finally, the graph shows each sentiment in the nrc dictionary along with the top words from usatoday’s tweets that fall under it. Note that I did not remove any words as I did with the bing dictionary, because the nrc dictionary is so much broader.
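One consequence of the broader dictionary is that a single word can sit under several sentiments, so it is counted once per category after the join. The sketch below illustrates this with made-up data: `toy_counts` and `toy_nrc` are hypothetical, and the word-to-sentiment assignments are invented for illustration rather than taken from the real nrc dictionary.

```r
library(dplyr)

# Made-up word counts, already one row per word
toy_counts <- tibble(word = c("attack", "hope"), n = c(4L, 2L))

# Made-up lexicon shaped like nrc: a word may appear under several sentiments
toy_nrc <- tibble(
  word      = c("attack", "attack", "attack", "hope", "hope"),
  sentiment = c("anger", "fear", "negative", "anticipation", "positive")
)

toy_by_sentiment <- toy_counts %>%
  inner_join(toy_nrc, by = "word") %>%        # each word is repeated once per matching sentiment
  group_by(sentiment) %>%
  summarise(total = sum(n), .groups = "drop") # total word occurrences per sentiment

toy_by_sentiment
```

The six word occurrences turn into a total of sixteen across the five sentiments, which is why the nrc panels should not be read as shares of the tweets.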
- The following unnests usatoday’s tweets into bigrams, removes the stop words and leftover link fragments, and creates a datatable and wordcloud of usatoday’s most common bigrams.
news_tweets %>%
  select(text) %>% # this selects just the text of the tweets
  unnest_tokens(words, text, token = "ngrams", n = 2)
news_tweets %>%
  select(text) %>% # this selects just the text of the tweets
  unnest_tokens(words, text, token = "ngrams", n = 2) %>%
  separate(words, c("word1", "word2"), sep = " ") %>% # separate them temporarily
  filter(!word1 %in% stop_words$word) %>% # remove if first word is a stop word
  filter(!word2 %in% stop_words$word) %>% # remove if second word is a stop word
  unite(words, word1, word2, sep = " ") # put them back together
remove_words <- c(
  "https", "t.co", "kbqavrhlqa", "dzdcdaghgr",
  "o7nt7jav5t", "9hfrnmsh6g", "m0v7cthfdq", "qplwnsfyzk",
  "19xyv9k77e", "nnnmetphih", "we45nhxa4m", "bbu9ol33be",
  "actjxv5rle", "9ktxm7bn8w", "xia1xtxqh5", "xpyndwady2",
  "wdjc9j6oba", "tcypxlzj1h", "fl0dr0i8rd", "lnhbeh4qqz",
  "aqjcmxq29b", "cm48lqfhzr", "fhht8fqip6", "lebmmzrdcr",
  "4i6jhw1tgf", "rjnis1ehqw", "lunp8vilnj", "kumv8k3rpt",
  "qzi3llehc0", "rqlftfzdh5", "ngnztytziv", "slzkjtr12p"
)
news_bigrams2 <- news_tweets %>%
  select(text) %>%
  unnest_tokens(words, text, token = "ngrams", n = 2) %>%
  separate(words, c("word1", "word2"), sep = " ") %>% # separate them temporarily
  filter(!word1 %in% stop_words$word) %>% # remove if first word is a stop word
  filter(!word2 %in% stop_words$word) %>% # remove if second word is a stop word
  filter(!word1 %in% remove_words) %>% # these two lines remove our remove_words
  filter(!word2 %in% remove_words) %>%
  unite(words, word1, word2, sep = " ") # put them back together
news_bigrams2 %>%
  count(words, sort = TRUE)
news_bigrams2 %>%
  count(words, sort = TRUE) %>%
  top_n(100) %>%
  wordcloud2(size = .5)
Selecting by n
This is a lengthy process, but it shows how the bigrams of usatoday’s tweets are built. The first datatable is usatoday’s tweets unnested into bigrams. The second datatable is usatoday’s bigrams without stop words. The third datatable is usatoday’s bigrams with neither stop words nor leftover link fragments. Finally, the wordcloud shows the third datatable in picture form, limited to the 100 most common bigrams. As the datatable and wordcloud show, some of the most common bigrams include “Covid 19”, “Ketanji Brown”, and “supreme court”. These make sense given the news at the time the tweets were collected.
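The hand-written `remove_words` list is needed because every link leaves behind “https”, “t.co”, and a random identifier as tokens. One possible alternative, shown here only as a sketch and not as the method used above, is to strip links from the text before tokenizing. The tibble `toy_tweets` and its links are made up, and the regular expression is an assumption about what the links look like.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

# Made-up stand-in for news_tweets; the text and links are invented
toy_tweets <- tibble(
  text = c(
    "Supreme Court hearing https://t.co/abc123xyz0",
    "The Supreme Court ruled https://t.co/qwe456rty1"
  )
)

toy_bigrams <- toy_tweets %>%
  mutate(text = str_remove_all(text, "https?://\\S+")) %>%  # drop links before tokenizing
  unnest_tokens(words, text, token = "ngrams", n = 2) %>%   # split into lowercase bigrams
  separate(words, c("word1", "word2"), sep = " ") %>%       # separate them temporarily
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%                   # drop bigrams containing a stop word
  unite(words, word1, word2, sep = " ") %>%                 # put them back together
  count(words, sort = TRUE)

toy_bigrams
```

With the links gone, no identifier list has to be maintained as new tweets arrive.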
- The following finds the most common words that follow the words “Russia” and “Ukraine”.
first_word <- c("russia", "ukraine") # these need to be lowercase
news_bigrams2 %>%
  count(words, sort = TRUE) %>%
  separate(words, c("word1", "word2"), sep = " ") %>% # separate the two words
  filter(word1 %in% first_word) %>% # find first words from our list
  count(word1, word2, wt = n, sort = TRUE)
first_word <- c("russia", "ukraine") # these need to be lowercase
news_bigrams2 %>%
  count(words, sort = TRUE) %>%
  separate(words, c("word1", "word2"), sep = " ") %>% # separate the two words
  filter(word1 %in% first_word) %>% # find first words from our list
  count(word1, word2, wt = n, sort = TRUE) %>%
  mutate(word2 = factor(word2, levels = rev(unique(word2)))) %>% # put the words in order
  group_by(word1) %>%
  top_n(5) %>%
  ggplot(aes(word2, n, fill = word1)) +
  scale_fill_viridis_d() + # set the color palette
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = NULL, title = "Word following:") +
  facet_wrap(~word1, scales = "free") +
  coord_flip()
Selecting by n

Both the datatable and the graph show the most common words that follow “Russia” and “Ukraine”. Common words that follow “Russia” include “invades”, “continues”, and “invaded”. The words that follow “Ukraine” are a little trickier because they involve emojis. For example, the most common word that follows “Ukraine” is the number 2 emoji, followed by “president” and then “Russia”.
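The selection of followers per first word can be sketched on made-up counts. `toy_bigram_counts` is hypothetical and only mirrors the shape of the counted bigrams. The sketch uses `slice_max()`, which dplyr recommends in place of the superseded `top_n()`, and it names the ranking column explicitly, so no `Selecting by n` message appears.

```r
library(dplyr)
library(tidyr)

# Made-up bigram counts in the same shape as count(words) on the bigrams
toy_bigram_counts <- tibble(
  words = c("russia invades", "ukraine president", "russia continues", "court ruling"),
  n     = c(3L, 2L, 1L, 5L)
)

toy_followers <- toy_bigram_counts %>%
  separate(words, c("word1", "word2"), sep = " ") %>%  # separate the two words
  filter(word1 %in% c("russia", "ukraine")) %>%        # keep only the first words of interest
  group_by(word1) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%           # most common follower per first word
  ungroup()

toy_followers
```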
