This document is one of the Black Lives Matter projects that NYU Data Services staff members worked on in the Summer of 2020. The goal was to quantitatively analyze the reaction to George Floyd’s death and address its political consequences.
You’ll see how we use Twitter data to do sentiment analysis in R statistical programming language. The focus will be on text data wrangling, visualization, and interpretation with basic topic modeling at the end. You can download the code by clicking the “Code” button to the upper right.
First, we can install/load all of the packages that we will be using, and clear out the environment.
packages <- c("readxl","tidytext","plyr","dplyr","tidyr","ggplot2","scales",
"purrr","textdata","wordcloud","reshape2","stringr","igraph",
"ggraph","widyr","grid","arules","tm","topicmodels")
for(i in packages){
if(!require(i,character.only = T, quietly = T)){
install.packages(i)
}
library(i, character.only = T, quietly = T)
}
rm(list=ls())
#Set the seed to ensure that we get the same random numbers every time
#The seed could be any number you choose
set.seed(2020)
We will be using some twitter data web scrapped from June 2020 to August 2020. You can download the data set to your desktop by clicking here, save them in a folder called [data], then read them using the command below.
#Set working directory
setwd("~/Desktop")
#Read the data set for June
tweets1<- read_excel("data/week 2.xls")
tweets2<- read_excel("data/week 3.xls")
#Read the data set for July
tweets3<- read_excel("data/week5.xlsx")
tweets4<- read_excel("data/week6.xlsx")
#Read the data set for August
tweets5<- read_excel("data/week 11.xls")
tweets6<- read_excel("data/week12.xlsx")
The read_excel
function comes from the readxl
package and is useful for reading excel data sets.
#create a list containing all six data frames
dfList<-list(tweets1,tweets2,tweets3,tweets4,tweets5,tweets6)
#create a function to do all the data clean process
result_list <- llply(dfList, function(x) {
#only keep tweets using English as the main language
x<-subset(x,x$`19: Language`== "English")
#change the variable name for future convenience
x$tweet=x$`2: Tweet`
#drop all other variables except tweet
x<-x[,22]
#create a new variable to track the number of tweet
x$tweetnumber<-1:length(x$tweet)
#return the cleaner dataframe with 2 variables
return(x)
})
#apply the function to each dataset
twts1<-as.data.frame(result_list[1])
twts2<-as.data.frame(result_list[2])
twts3<-as.data.frame(result_list[3])
twts4<-as.data.frame(result_list[4])
twts5<-as.data.frame(result_list[5])
twts6<-as.data.frame(result_list[6])
The function gives us 6 data sets with only 2 variables: [tweet] and [tweetnumber]. We want to keep [tweetnumber] to track down every word from the tweet text later.
#stop_words is a combination of English stop words from three lexicons, as a data frame.
data(stop_words)
#customize stop words
custom_stop_words <- bind_rows(
tibble(word = c("t.co","csun","blm","rt","https",
"BLM","blacklivesmatter","black",
"georgefloyd","2",'#blm','#blacklivesmatter',
'#georgefloyd','august'),
lexicon = c("custom")), stop_words)
We want to have some stop words for our analysis so that those wouldn’t appear in our frequency table and interfere our judgment.
#store those special symbols in the variable so we can remove them later
remove_reg <- "&|<|>"
#create a list containing all six data frames
dfList2<-list(twts1,twts2,twts3,twts4,twts5,twts6)
result_list2 <-
llply(dfList2, function(x) {
y <- x %>%
#remove special symbols for the values under the tweet variable
mutate(tweet = str_remove_all(tweet, remove_reg)) %>%
#extract every word from every tweet
unnest_tokens(word, tweet, token = "tweets") %>%
#filter out all stop words
filter(!word %in% custom_stop_words$word,
!word %in%str_remove_all(custom_stop_words$word, "'"),
str_detect(word, "[a-z]"))
return(y)})
tidy1<-as.data.frame(result_list2[1])
tidy2<-as.data.frame(result_list2[2])
tidy3<-as.data.frame(result_list2[3])
tidy4<-as.data.frame(result_list2[4])
tidy5<-as.data.frame(result_list2[5])
tidy6<-as.data.frame(result_list2[6])
Each single word in each tweet is considered as a token. By doing so, we can start to count frequency of each word or even each pair of words.
#Count the Frequency for Each Word
tidy_week11 <- tidy1 %>%dplyr::count(word, sort = TRUE)
tidy_week12 <- tidy2 %>%dplyr::count(word, sort = TRUE)
tidy_week21 <- tidy3 %>%dplyr::count(word, sort = TRUE)
tidy_week22 <- tidy4 %>%dplyr::count(word, sort = TRUE)
tidy_week31 <- tidy5 %>%dplyr::count(word, sort = TRUE)
tidy_week32 <- tidy6 %>%dplyr::count(word, sort = TRUE)
#Remove all non-english tokens
tidy1_english <- tidy_week11[which(!grepl("[^\x01-\x7F]+", tidy_week11$word)),]
tidy2_english <- tidy_week12[which(!grepl("[^\x01-\x7F]+", tidy_week12$word)),]
tidy3_english <- tidy_week21[which(!grepl("[^\x01-\x7F]+", tidy_week21$word)),]
tidy4_english <- tidy_week22[which(!grepl("[^\x01-\x7F]+", tidy_week22$word)),]
tidy5_english <- tidy_week31[which(!grepl("[^\x01-\x7F]+", tidy_week31$word)),]
tidy6_english <- tidy_week32[which(!grepl("[^\x01-\x7F]+", tidy_week32$word)),]
#create a list containing all six data frames
dfList3<-list(tidy1_english,tidy2_english,tidy3_english,tidy4_english,tidy5_english,tidy6_english)
#visualize using bar plot
result_list3 <-
llply(dfList3, function(x) {
plot <- x %>%
#keep only the top 20 tokens
dplyr::top_n(20) %>%
#reorder word based on the count
dplyr::mutate(word = reorder(word, n)) %>%
#plot using ggplot2
ggplot(aes(word, n, fill=word)) +
#specify it's a bar plot
geom_bar(stat="identity")+
scale_fill_hue(c=45, l=80)+
xlab(NULL) +
coord_flip()+
theme(legend.position="none")
return(plot)})
result_list3[[1]]
result_list3[[2]]
result_list3[[3]]
result_list3[[4]]
result_list3[[5]]
result_list3[[6]]
The most common words didn’t change a lot over the 3 months. In order to dig more information, we can try to identify positive words and negative words from them.
#create a list containing all six data frames
dfList4<-list(tidy1,tidy2,tidy3,tidy4,tidy5,tidy6)
#visualize using Word clouds
result_list_wordclouds <-
llply(dfList4, function(x) {
plot <- x %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),max.words = 50)
return(plot)})
Word clouds can show both of the most common negative words and the most common positive words. It’s a better way to visualize all tokens.
#Get specific sentiment lexicons in a tidy format, with one row per word, in a form that can be joined with a one-word-per-row data set
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
get_sentiments("loughran")
## # A tibble: 4,150 x 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # … with 4,140 more rows
The [“bing”] option comes from the included sentiments data frame, and others call the relevant function in the [textdata] package.
#Find the most common positive words using nrc lexicon
nrc_positive <- get_sentiments("nrc") %>%
filter(sentiment == "positive")
result_list4 <-
llply(dfList4, function(x) {
plot <- x %>%
inner_join(nrc_positive) %>%
dplyr::count(word, sort = TRUE) %>%
dplyr::top_n(20) %>%
dplyr::mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill=word)) +
geom_bar(stat="identity")+
scale_fill_hue(c=45, l=80)+
xlab(NULL) +
coord_flip()+
theme(legend.position="none")
return(plot)})
result_list4[[1]]
result_list4[[2]]
result_list4[[3]]
result_list4[[4]]
result_list4[[5]]
result_list4[[6]]
Overall trend: At first, people called on love, unity, and justice. Then they began to call on donations. Later, they tried to appeal on voting to pursue justice again. The dataset in August was gathered right after what happened to Jacob Blake in Wisconsin. So people’s need for justice went back. They wanted the government/legal authority to act upon this.
#Find the most common negative words using nrc lexicon
nrc_negative <- get_sentiments("nrc") %>%
filter(sentiment == "negative")
result_list5 <-
llply(dfList4, function(x) {
plot <- x %>%
inner_join(nrc_negative) %>%
dplyr::count(word, sort = TRUE) %>%
dplyr::top_n(20) %>%
dplyr::mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill=word)) +
geom_bar(stat="identity")+
scale_fill_hue(c=45, l=80)+
xlab(NULL) +
coord_flip()+
theme(legend.position="none")
return(plot)})
result_list5[[1]]
result_list5[[2]]
result_list5[[3]]
result_list5[[4]]
result_list5[[5]]
result_list5[[6]]
An interesting thing is vote is both negative and positive here. People do start to mention ‘vote’ more over the three months.
Overall trend: At first, people tweeted mostly about how they felt about Geroge Floyd’s death and the police. Then, people tweeted more about how they felt regarding loathing and violence. Later, people talked more about changing the situation through voting.
#Create a visualization about how much each word contributed to each sentiment.
result_list_contribute <-
llply(dfList4, function(x) {
plot <- x %>%
inner_join(get_sentiments("bing")) %>%
dplyr::count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(30) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
return(plot)})
result_list_contribute[[1]]
result_list_contribute[[2]]
result_list_contribute[[3]]
result_list_contribute[[4]]
result_list_contribute[[5]]
result_list_contribute[[6]]
A problem about using the pre-defined lexicons is they don’t identify human names. For as you can see on the positive plot above, the machine thinks trump as a word rather than a name. And it mistakenly concluded that the word [‘trump’] contributed to positive emotions around 150 times.
#Examine how sentiment(positive/negative) changes throughout each tweet using bing lexicon
result_list6 <-
llply(dfList4, function(x) {
plot <- x %>%
inner_join(get_sentiments("bing")) %>%
count(tweetnumber, index = tweetnumber %/% 10, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
ggplot(aes(index, sentiment, fill = tweetnumber)) +
geom_bar(stat="identity")+
xlab(NULL) +
theme(legend.position="none")
return(plot)})
result_list6[[1]]
result_list6[[2]]
result_list6[[3]]
result_list6[[4]]
result_list6[[5]]
result_list6[[6]]
The zero line means neither positive nor negative. According to the plots we get, [bing] lexicon shows the overall trend is getting more and more negative since it’s moving farther and farther below zero.
#Examine how sentiment(positive/negative) changes throughout each tweet using afinn lexicon
result_list_afinn <-
llply(dfList4, function(x) {
plot <- x %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = tweetnumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")%>%
ggplot(aes(index, sentiment, fill = sentiment)) +
geom_bar(stat="identity")+
xlab(NULL) +
theme(legend.position="none")
return(plot)})
result_list_afinn[[1]]
result_list_afinn[[2]]
result_list_afinn[[3]]
result_list_afinn[[4]]
result_list_afinn[[5]]
result_list_afinn[[6]]
The zero line means neither positive nor negative. According to the plots we get, afinn lexicon also shows that the overall trend is getting more and more negative since it’s moving farther and farther below zero.
#Examine how sentiments changes using loughran lexicon
result_list7 <-
llply(dfList4, function(x) {
plot <- x %>%
count(word) %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill=word)) +
geom_bar(stat="identity")+
scale_fill_hue(c=45, l=80)+
coord_flip() +
facet_wrap(~ sentiment, scales = "free") +
theme(legend.position="none")
return(plot)})
result_list7[[1]]
result_list7[[2]]
result_list7[[3]]
result_list7[[4]]
result_list7[[5]]
result_list7[[6]]
Running sentiment analysis using “loughran” lexicon, we can get the frequency of words in 6 categories. It gave us more information about how sentiments change over time more specifically.
From the output, we can see (1) People’s litigous demand is increasing in magnitude over time. Legal complaint such as petitions starts to attract more and more attention. They called on petition for achieving justice. (2) The negative sentiments shrinked from June to July. But as soon as what happened to Jacob Blake in Wisconsin appeared on news, the negativity went back. And what they were mad the most was the ‘violence’ involved. (3) The sentiments in June is very similar with August. The first one was the reaction to George Floyd, the second was the reaction to Jacob Blake. (4) In June, the uncertainty was strong. People felt unsure and doubt so much. But over the 3 months, the uncertainty decreased. People got calmer, and their movement on seeking legal change probably helped.
#How many times do people mention about 'defund (the police)'
result_list9 <-
llply(dfList2, function(x) {
defund<- x %>%
filter(str_detect(tweet, "defund")) %>%
dplyr::select(tweet)
le<-length(defund$tweet)
return(le)})
de1<-result_list9[[1]]
de2<-result_list9[[2]]
de3<-result_list9[[3]]
de4<-result_list9[[4]]
de5<-result_list9[[5]]
de6<-result_list9[[6]]
defund_mat<-data.frame(x=c(1:6),y=c(de1,de2,de3,de4,de5,de6))
ggplot(defund_mat, aes(x,y)) +
geom_line(color="red",linetype = "dashed")+
geom_point() +
ggtitle("Frequency change of the 'defund police' topic")
Overall, the number of times people mentioned about ‘defunding the police’ is increasing.
#How many times do people mention about voting
result_list12 <-
llply(dfList2, function(x) {
defund<- x %>%
filter(str_detect(tweet, "vote")) %>%
dplyr::select(tweet)
le<-length(defund$tweet)
return(le)})
v1<-result_list12[[1]]
v2<-result_list12[[2]]
v3<-result_list12[[3]]
v4<-result_list12[[4]]
v5<-result_list12[[5]]
v6<-result_list12[[6]]
defund_mat2<-data.frame(x=c(1:6),y=c(v1,v2,v3,v4,v5,v6))
ggplot(defund_mat2, aes(x,y)) +
geom_line(color="red",linetype = "dashed")+
geom_point() +
ggtitle("Frequency change of the 'vote' topic")
Overall, the number of times that people tweet about ‘vote’ increased over the three months.
#How many times do people mention 'biden'
result_list13 <-
llply(dfList2, function(x) {
defund<- x %>%
filter(str_detect(tweet, "biden")) %>%
dplyr::select(tweet)
le<-length(defund$tweet)
return(le)})
j1<-result_list13[[1]]
j2<-result_list13[[2]]
j3<-result_list13[[3]]
j4<-result_list13[[4]]
j5<-result_list13[[5]]
j6<-result_list13[[6]]
defund_mat3<-data.frame(x=c(1:6),y=c(j1,j2,j3,j4,j5,j6))
ggplot(defund_mat3, aes(x,y)) +
geom_line(color="red",linetype = "dashed")+
geom_point() +
ggtitle("Frequency change of the 'biden' topic")
Overall, the number of times that people tweet about ‘biden’ increased from June to July, but decreased in August. It could possibly be considered as a reflection of how people felt about his reactions towards black lives movement.
#How many times do people mention about 'georgefloyd'
result_list14 <-
llply(dfList2, function(x) {
defund<- x %>%
filter(str_detect(tweet, "georgefloyd")) %>%
dplyr::select(tweet)
le<-length(defund$tweet)
return(le)})
d1<-result_list14[[1]]
d2<-result_list14[[2]]
d3<-result_list14[[3]]
d4<-result_list14[[4]]
d5<-result_list14[[5]]
d6<-result_list14[[6]]
defund_mat4<-data.frame(x=c(1:6),y=c(d1,d2,d3,d4,d5,d6))
ggplot(defund_mat4, aes(x,y)) +
geom_line(color="red",linetype = "dashed")+
geom_point() +
ggtitle("Frequency change of the 'georgefloyd' topic")
Overall, the number of times that people tweet about ‘georgefloyd’ decreased over the three months. The topic’s heat is cooling down.
#words most associated with the word "defund" and "police"
result_list10 <-
llply(dfList2, function(x) {
pair <- x %>%
unnest_tokens(word, tweet) %>%
filter(!word %in% custom_stop_words$word)%>%
group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, tweetnumber, sort = TRUE)%>%
filter(item1 %in% c("defund", "police")) %>%
group_by(item1) %>%
top_n(15) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(item2, correlation,fill=correlation)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip()
return(pair)})
result_list10[[1]]
result_list10[[2]]
result_list10[[3]]
result_list10[[4]]
result_list10[[5]]
result_list10[[6]]
Pairwise correlation plot helps us know people’s attitude about police and “defund” movement. According the plot below, we can see that on average, there is a 30% chance people would think about “brutality” when they think about the police, and the chance that people think about defunding the police when people tweet about police is also increasing over the 3 months in general.
#words most associated with the names of the candidates
result_list11 <-
llply(dfList2, function(x) {
pair <- x %>%
unnest_tokens(word, tweet) %>%
filter(!word %in% custom_stop_words$word)%>%
group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, tweetnumber, sort = TRUE)%>%
filter(item1 %in% c('trump','joebiden')) %>%
group_by(item1) %>%
top_n(15) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(item2, correlation,fill=correlation)) +
facet_wrap(~ item1, scales = "free") +
geom_bar(stat = "identity") +
coord_flip()
return(pair)})
result_list11[[1]]
result_list11[[2]]
result_list11[[3]]
result_list11[[4]]
result_list11[[5]]
result_list11[[6]]
The chance that people tweet about ‘vote’ or ‘pick’ when people tweet about ‘joybiden’ increased over the 3 months. The chance that people tweet about ‘vote’ when people tweet about ‘trump’ decreased over the 3 months.
#set seed to make sure that we get same output when we run it
set.seed(1235)
a <- arrow(angle = 20, length = unit(0.1, "inches"), ends = "last", type = "open")
result_list_network <-
llply(dfList2, function(x) {
network <- x %>%
unnest_tokens(word, tweet) %>%
filter(!word %in% custom_stop_words$word)%>%
group_by(word) %>%
filter(n() >= 100) %>%
pairwise_cor(word, tweetnumber, sort = TRUE) %>%
filter(correlation > 0.4) %>%
graph_from_data_frame() %>%
ggraph(layout = "mds") +
geom_edge_link(aes(color = correlation, width = correlation), arrow = a) +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
return(network)})
result_list_network[[1]]
result_list_network[[2]]
result_list_network[[3]]
result_list_network[[4]]
result_list_network[[5]]
result_list_network[[6]]
#convert the dataset for the last week into a DTM matrix using the cast_dtm function.
dtm <- tidy6 %>% count(tweetnumber, word) %>% cast_dtm(tweetnumber, word, n)
# assume there are 2 topics
dtm.lda <- LDA(dtm, k = 2)
# look at how each word is related to a topic
terms(dtm.lda, 20)
## Topic 1 Topic 2
## [1,] "matter" "police"
## [2,] "white" "#jacobblake"
## [3,] "lives" "people"
## [4,] "people" "#kenosha"
## [5,] "justice" "support"
## [6,] "blake" "lives"
## [7,] "watch" "#acab"
## [8,] "cops" "jacob"
## [9,] "#defundthepolice" "#antifa"
## [10,] "shot" "shooting"
## [11,] "racism" "shot"
## [12,] "#justiceforjacobblake" "matter"
## [13,] "#protest" "movement"
## [14,] "#breonnataylor" "protesters"
## [15,] "time" "stop"
## [16,] "#portlandprotests" "#defundthepolice"
## [17,] "violence" "#portland"
## [18,] "racist" "#racism"
## [19,] "jacob" "#nojusticenopeace"
## [20,] "video" "america"
With the output, we can first look at how each word is related to a topic. The top 20 most likely terms for the two topics are given above. Topic 2 here seemed to be related to people’s perspective about why the thing happened. On the other hand, Topic 1 might be more related to people’s perspective about how they should respond to it.
#re-organize the results in the tibble data format.
dtm.topics <- tidy(dtm.lda, matrix = "beta")
dtm.topics
## # A tibble: 54,370 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 #blmprotest 0.00119
## 2 2 #blmprotest 0.00181
## 3 1 #nba 0.000664
## 4 2 #nba 0.000735
## 5 1 #nbaplayoffs 0.000150
## 6 2 #nbaplayoffs 0.000276
## 7 1 https://t.co/xjetmwhuil 0.0000105
## 8 2 https://t.co/xjetmwhuil 0.00000977
## 9 1 love 0.000593
## 10 2 love 0.00231
## # … with 54,360 more rows
The function tidy re-organized the results in the tibble data format. Here we obtain a data set with three columns representing the topic, the term, and the associated probability with that topic. For example, for “#blmprotest”, it has the probability of 6.195091e-04 within the first topic and 2.383987e-03 within the second topic.
#We can plot the probability and compare them.
dtm.terms <- dtm.topics %>% group_by(topic) %>% top_n(10, beta) %>% ungroup() %>% arrange(topic, -beta)
dtm.terms
## # A tibble: 20 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 matter 0.00886
## 2 1 white 0.00818
## 3 1 lives 0.00807
## 4 1 people 0.00742
## 5 1 justice 0.00506
## 6 1 blake 0.00475
## 7 1 watch 0.00443
## 8 1 cops 0.00434
## 9 1 #defundthepolice 0.00416
## 10 1 shot 0.00394
## 11 2 police 0.0159
## 12 2 #jacobblake 0.0118
## 13 2 people 0.00962
## 14 2 #kenosha 0.00813
## 15 2 support 0.00719
## 16 2 lives 0.00630
## 17 2 #acab 0.00538
## 18 2 jacob 0.00527
## 19 2 #antifa 0.00472
## 20 2 shooting 0.00448
plot.ts(log(dtm.topics[dtm.topics$topic == 1, "beta"]))
lines(log(dtm.topics[dtm.topics$topic == 2, "beta"]), col = 2)
#build correlated topics model
ctm <- CTM(dtm, k = 2, control = list(seed = 2020))
ctm.topics <- tidy(ctm, matrix = "beta")
ctm.terms <- ctm.topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
#visualize the topics and the strongly associated words.
ctm.terms %>% mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = TRUE) + facet_wrap(~topic, scales = "free", labeller = "label_both") +
xlab("Terms") +
ylab("Topics") +
coord_flip()
The figure above shows the top 10 words for each topic. As we assumed, topic 2 here seemed to be related to people’s perspective about why the thing happened, and topic 1 might be more related to people’s perspective about how they should respond to it.
For topic 1, “abolish the police”,“defund the police”, and “protest” are highly correlated. For topic 2, “jacobblake”, “police”, and “shot” are correlated.
That’s all I have for you. Hope it was helpful! Feel free to reach out to NYU Data Services if you encounter any issues with programming in R about this subject or would like to be directed to any additional resources.