This first chunk loads the packages needed for the analysis.
library(rtweet)      # pulls tweets from the Twitter API
library(tidyverse)   # data wrangling, string handling, and plotting
library(tidytext)    # tokenizing text and sentiment lexicons
library(DT)          # interactive tables
library(plotly)      # interactive plots
library(wordcloud2)  # word clouds
The following chunk then pulls 5,000 recent tweets from the Wall Street Journal's account (@WSJ) for this analysis. Running it for the first time opens a browser window so rtweet can authorize a Twitter API token.
wsj_tweets <- get_timeline("wsj", n = 5000)
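If interactive browser authentication is not an option (for example, when knitting on a server), rtweet can also use credentials from a Twitter developer app. A minimal sketch, assuming such an app has already been created; the app name and key strings are placeholders, not real values:
my_token <- create_token(
  app = "my_app_name",                      # placeholder app name
  consumer_key = "YOUR_CONSUMER_KEY",       # placeholder credentials
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token = "YOUR_ACCESS_TOKEN",
  access_secret = "YOUR_ACCESS_SECRET"
)
wsj_tweets <- get_timeline("wsj", n = 5000, token = my_token)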
This chunk and the one after it break the tweets into individual words and then remove stop words, common words such as "the" and "of" that carry no meaning for the analysis.
wsj_words <- wsj_tweets %>%
unnest_tokens(word, text) %>%
select(screen_name, word)
wsj_words %>%
anti_join(stop_words) %>%
count(word, sort = T)
The word counts above show that "https" and "t.co", fragments of shortened tweet links, appear frequently but are irrelevant to this analysis. The following chunk removes those as well.
wsj_words %>%
anti_join(stop_words) %>%
filter(!word == "https") %>%
filter(!word == "t.co") %>%
count(word, sort = T)
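An alternative, sketched below, is to strip the shortened links from the raw text before tokenizing, so "https" and "t.co" never enter the word list at all; the object name wsj_words_clean and the regular expression are illustrative, not part of the analysis above.
wsj_words_clean <- wsj_tweets %>%
  mutate(text = str_remove_all(text, "https?://\\S+")) %>%  # drop t.co links up front
  unnest_tokens(word, text) %>%
  select(screen_name, word) %>%
  anti_join(stop_words, by = "word")

wsj_words_clean %>%
  count(word, sort = TRUE)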
This chunk then takes the filtered counts from above and draws a word cloud of the 200 most frequent words.
wsj_words %>%
anti_join(stop_words) %>%
filter(!word == "https") %>%
filter(!word == "t.co") %>%
count(word, sort = T) %>%
top_n(200) %>%
wordcloud2(size = .5)
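wordcloud2() returns an HTML widget, so the cloud renders interactively in the notebook. To keep a standalone copy, one option is to assign the widget and save it with the htmlwidgets package (a dependency of wordcloud2); the file name below is just an example.
wsj_cloud <- wsj_words %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("https", "t.co")) %>%
  count(word, sort = TRUE) %>%
  top_n(200) %>%
  wordcloud2(size = .5)

htmlwidgets::saveWidget(wsj_cloud, "wsj_wordcloud.html", selfcontained = TRUE)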
This chunk retrieves the Bing lexicon, which labels words as either positive or negative.
bing <- get_sentiments("bing")
bing
This chunk joins the Bing lexicon to the words in the Wall Street Journal tweets, keeping only the words that carry a sentiment label and counting them.
wsj_words %>%
inner_join(bing) %>%
filter(!word == "trump") %>% # exclude "trump": in these tweets it is a proper noun, not the sentiment word the lexicon scores
count(word, sentiment, sort = TRUE)
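The word-level counts can also be collapsed into one overall score for the timeline. A quick sketch, assuming the chunk above has been run; the column name net_sentiment is just illustrative:
wsj_words %>%
  inner_join(bing, by = "word") %>%
  filter(!word == "trump") %>%
  count(sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)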
This next chunk then plots the ten words that contribute most to each Bing sentiment.
wsj_words %>%
inner_join(bing) %>%
filter(!word == "trump") %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(sentiment), scales = "free") +
labs(y = "WSJ tweets: Words that contribute the most to each sentiment",
x = NULL) +
coord_flip() +
theme_minimal()
The following chunks repeat the same process using the NRC lexicon instead of Bing. NRC sorts words into eight emotions (such as anger, joy, and trust) as well as positive and negative.
nrc <- get_sentiments("nrc")
nrc
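A quick way to see how NRC differs from Bing is to count how many words fall into each category; the exact numbers depend on the lexicon version, so this is just a check:
nrc %>%
  count(sentiment, sort = TRUE)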
wsj_words %>%
inner_join(nrc) %>%
filter(!word == "trump") %>%
count(word, sentiment, sort = TRUE)
wsj_words %>%
inner_join(nrc) %>%
filter(!word == "trump") %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(sentiment), scales = "free") +
labs(y = "WSJ tweets: Words that contribute the most to each sentiment",
x = NULL) +
coord_flip() +
theme_minimal()
Next, this chunk unnests the tweets into bigrams (pairs of consecutive words), drops any pair in which either word is a stop word, and saves the result as news_bigrams for the chunks that follow.
news_bigrams <- wsj_tweets %>%
select(text) %>% # this selects just the text of the tweets
unnest_tokens(words, text, token = "ngrams", n = 2) %>%
separate(words, c("word1", "word2"), sep = " ") %>% # separate them temporarily
filter(!word1 %in% stop_words$word) %>% # remove if first word is a stop word
filter(!word2 %in% stop_words$word) %>% # remove if second word is a stop word
unite(words, word1, word2, sep = " ")
Next, the first chunk below counts the bigrams and the second turns the most frequent ones into a word cloud.
news_bigrams %>%
count(words, sort = T)
news_bigrams %>%
count(words, sort = T) %>%
top_n(100) %>%
wordcloud2(size = .5)
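For bigrams, a network plot is often easier to read than a cloud. The sketch below assumes the igraph and ggraph packages are installed (neither is loaded above) and uses an arbitrary frequency cutoff:
library(igraph)
library(ggraph)

news_bigrams %>%
  count(words, sort = TRUE) %>%
  separate(words, c("word1", "word2"), sep = " ") %>%
  filter(n > 5) %>%                  # arbitrary cutoff: keep pairs seen more than five times
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(alpha = 0.5) +
  geom_node_point(color = "steelblue", size = 2) +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_void()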
This final chunk then looks at which words most often follow a chosen set of first words, to get a sense of how those topics are framed.
first_word <- c("covid", "vaccine") # these need to be lowercase, because unnest_tokens() lowercases the text (and splits "COVID-19" into "covid" and "19")
news_bigrams %>%
count(words, sort = TRUE) %>%
separate(words, c("word1", "word2"), sep = " ") %>% # separate the two words
filter(word1 %in% first_word) %>% # find first words from our list
count(word1, word2, wt = n, sort = TRUE)
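The same idea works in the other direction. A short sketch, filtering on word2 instead of word1, shows which words most often precede "vaccine":
news_bigrams %>%
  count(words, sort = TRUE) %>%
  separate(words, c("word1", "word2"), sep = " ") %>%
  filter(word2 == "vaccine") %>% # match on the second word of each pair instead
  count(word1, word2, wt = n, sort = TRUE)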