Texts can be analyzed in terms of topics, that is, in terms of what they are written about in the first place. For example, consider how the topics covered in newspapers change over time. As events unfold, different news outlets with different audiences, news formats, and business strategies will discuss different topics, choosing to emphasize some over others.
Influential news outlets, as well as powerful individuals such as politicians and celebrities, are able to influence the topics that are covered in the news. In turn, the topics covered by the media will influence the opinions and attitudes of readers and viewers. And the public’s opinions and attitudes will inform their voting behaviors and the decisions made by elected officials.
How can we know what topics are being covered in news outlets? Reading thousands of newspaper articles is too time-consuming. And because it depends on subjective interpretation of texts, reading alone is often seen as too open to biased interpretation to be considered scientifically valid. Instead of simply reading large collections of texts to determine the topics discussed in the news, researchers have in recent years turned to topic models, which are statistical methods for identifying what combinations of topics are discussed in a collection of texts and how those topics change over time. Today we talk about the theory and basic practice of topic modeling, which has recently caught on with a wide range of researchers in the social sciences and humanities.
Topic modeling involves automated procedures for coding collections of texts in terms of meaningful categories that represent the main topics being discussed in texts. Topic models assume that meanings are relational (Saussure, 1959) and that the meanings associated with a topic of conversation can be understood as a set of word clusters.
Topic models treat texts as what linguists call a bag of words, capturing word co-occurrences regardless of syntax, narrative, or location within a text. A topic can be thought of as the cluster of words that tend to come up in a discussion and, therefore, to co-occur more frequently than they otherwise would, whenever the topic is being discussed.
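To make the bag-of-words idea concrete, here is a minimal sketch using a single made-up sentence (the document and its text are hypothetical, chosen only for illustration): tokenizing it with tidytext reduces the text to unordered word counts.
library(dplyr)
library(tidytext)
# A made-up one-document example: tokenizing discards word order and syntax,
# leaving only a "bag" of word counts
tibble(doc = 1, text = "The cabinet approves a new relief package for the pandemic") %>%
  unnest_tokens(word, text) %>%    # one row per word, lowercased
  count(doc, word, sort = TRUE)    # the bag of words: word frequencies only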
Topic modeling is an instance of probabilistic modeling, and the most widely used probabilistic model for topic modeling is latent Dirichlet allocation (LDA), which is a statistical model of language introduced by Blei, Ng, and Jordan (2003). LDA is based on the idea that every text within a text collection is akin to a bag of words produced according to a mixture of topics that the author or authors intended to discuss.
Here, each topic is a distribution over all observed words in the texts, such that words strongly associated with a text’s dominant topics have a higher chance of being included within that text’s bag of words. Based on these distributions, authorship is conceptualized as an author repeatedly picking a topic, then a word from that topic, and placing it in the bag until the document is complete.
The objective of topic modeling is to find the parameters of the LDA process that generated the final text or text collection, a process referred to as “inference” in the LDA literature. Among the outputs of the inference are a set of per-topic word distributions, associating a probability with every topic-word pair, and a similar set of per-text topic distributions, describing the probability of choosing a particular topic for every specific text.
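As a brief, hedged illustration of these outputs (not part of today’s tweet analysis), the sketch below assumes the topicmodels package is installed and uses its bundled AssociatedPress document-term matrix; tidying the fitted model returns the two kinds of distributions described above.
# A minimal sketch, assuming the topicmodels package and its bundled data
library(topicmodels)
library(tidytext)
data("AssociatedPress", package = "topicmodels")  # a sample document-term matrix
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))  # fit a 2-topic LDA model
tidy(ap_lda, matrix = "beta")   # per-topic word probabilities (topic-word pairs)
tidy(ap_lda, matrix = "gamma")  # per-text (per-document) topic probabilities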
Last week we tokenized our tweet texts by bigrams to analyze the relationships between words. Tokenizing by pairs of two consecutive words (bigrams) lets us examine which words tend to follow others immediately. This is one way of revealing which words are important in texts by analyzing the extent to which they are linked to each other.
Today, we’ll analyze the relationships between words by examining which words tend to co-occur within the same documents or the same linguistic context. This measure of word co-occurrence matters because the topics in a collection of texts are often defined by the words that tend to be used together frequently. As mentioned above, a topic can be thought of as a cluster of words that are closely linked to each other. So we will further explore some methods for calculating and visualizing such relationships between words in our tweet text dataset.
We’ll also introduce two new packages: ggraph, which extends ggplot2 to construct network plots, and widyr, which calculates pairwise correlations and distances within a tidy data frame. Together these expand our toolbox for exploring text within the tidy data framework.
A bigram is a pair of two consecutive words in a text. Bigrams let us find the most common two-word pairs, which provide context that makes tokens more understandable. But sometimes we want to analyze all of the relationships among words used in the same tweet. So here I’d like to introduce the concept of word co-occurrences. By “word co-occurrences,” we refer to any pair of words appearing together in the same context, such as a tweet. The strength of the relationship between a pair of words is weighted by counting their co-occurrences.
Word co-occurrences are useful for analyzing the semantic network of a set of texts, given the extent to which influential words bridge between other words in a single utterance. So, by analyzing word co-occurrences among tweets, we can “describe the extent to which words are prominent in creating a structural pattern of coherence in a text” (Corman et al., 2002, p. 179) by locating the “in-between” position of words in a co-occurrence network. In particular, analyzing word co-occurrences seeks to find words (or bigrams) that link conceptual clusters together and thus help organize the whole. It therefore allows us to survey rich and complex structures in word networks, since the influence of certain words in tweets depends not just on their frequency but also on their location in the semantic network structure. That is, this method is structurally sensitive to a semantic network because “it accounts for all likely chains of association among words that make texts and conversation coherent” (Corman et al., 2002, p. 181).
So we will analyze the relationships among words that tend to co-occur within the same tweet, even if they do not occur next to each other. The tidy data format is a useful structure for comparing between variables or grouping by rows, but it can be challenging to compare between rows: for example, to count the number of times that two words appear within the same tweet, or to see how correlated they are. Most operations for finding pairwise counts or correlations need to turn the data into a wide matrix first.
Figure: The philosophy behind the widyr package (illustration by Julia Silge). widyr can perform operations such as counting and correlating on pairs of values in a tidy dataset: it first ‘casts’ the tidy dataset into a wide matrix, performs an operation such as a correlation on it, then re-tidies the result.
We will examine some of the ways tidy text can be turned into a wide matrix, but in this case it is not necessary. The widyr package makes operations such as computing counts and correlations easy, by simplifying the pattern of “widen data, perform an operation, then re-tidy data,” as seen in the figure above. We will focus on a set of functions that make pairwise comparisons between groups of observations (for example, between tweets).
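To see what that “widen, operate, re-tidy” pattern looks like by hand, here is a hedged sketch on a made-up three-tweet tidy table (the data and object names are hypothetical): we cast it into a wide tweet-by-word matrix with tidyr, run cor() on it, and then re-tidy the result. widyr’s pairwise_ functions wrap exactly these steps for us.
# A toy sketch of "widen data, perform an operation, re-tidy data" (made-up data)
library(tidyverse)
tidy_toy <- tribble(
  ~tweet, ~word,
  1L, "covid",
  1L, "vaccine",
  2L, "covid",
  2L, "cases",
  3L, "vaccine",
  3L, "cases"
)
wide <- tidy_toy %>%
  mutate(present = 1) %>%
  pivot_wider(names_from = word, values_from = present, values_fill = 0) %>%  # widen
  select(-tweet)
cor(wide) %>%                                                                 # operate on the wide matrix
  as_tibble(rownames = "item1") %>%
  pivot_longer(-item1, names_to = "item2", values_to = "correlation") %>%     # re-tidy
  filter(item1 != item2)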
From now on, we will analyze what words tend to appear (co-occur) within the same tweet.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(textclean)
library(stopwords)
load("covid_tweets_423.RData")
covid_tweets[1:5,]
## # A tibble: 5 x 9
## user_id status_id created_at screen_name text lang country lat
## <chr> <chr> <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 4794913~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ "@eve~ en United~ 36.0
## 2 1694802~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ "Plea~ en United~ 36.9
## 3 2155830~ 12533658~ 2020-04-23 16:51:09 KOROGLU_BA~ "@Aya~ tr Azerba~ 40.2
## 4 7445974~ 12533657~ 2020-04-23 16:51:05 FoodFocusSA "Pres~ en South ~ -26.1
## 5 1558777~ 12533657~ 2020-04-23 16:51:01 opcionsecu~ "#ATE~ es Ecuador -1.67
## # ... with 1 more variable: lng <dbl>
covid_tweets_words <- covid_tweets %>%
  filter(lang=="en") %>%                                                            # keep English-language tweets
  filter(country %in% c("United States","India","United Kingdom","Canada")) %>%     # keep four countries
  mutate(text = str_replace_all(text, "RT", " ")) %>%                               # remove retweet markers
  mutate(text = sapply(text, replace_non_ascii)) %>%                                # clean non-ASCII characters
  mutate(text = sapply(text, replace_contraction)) %>%                              # expand contractions
  mutate(text = sapply(text, replace_html)) %>%                                     # remove HTML tags/entities
  mutate(text = sapply(text, replace_url)) %>%                                      # remove URLs
  unnest_tweets(word, text) %>%                                                     # tokenize, keeping #hashtags and @mentions
  filter(!word %in% stopwords()) %>%                                                # remove stopwords
  filter(!str_detect(word, "[^[:word:]#@]")) %>%                                    # drop tokens with characters other than letters, digits, _, #, @
  dplyr::select(country, status_id, word)
## Warning: Outer names are only allowed for unnamed scalar atomic inputs
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid_tweets_words
## # A tibble: 119,035 x 3
## country status_id word
## <chr> <chr> <chr>
## 1 United States 1253365816281554946 @everythingoes99
## 2 United States 1253365816281554946 @auntvireen
## 3 United States 1253365816281554946 @johnrobertsfox
## 4 United States 1253365816281554946 wtf
## 5 United States 1253365816281554946 talking
## 6 United States 1253365816281554946 death
## 7 United States 1253365816281554946 numbers
## 8 United States 1253365816281554946 taken
## 9 United States 1253365816281554946 cdc
## 10 United States 1253365816281554946 includes
## # ... with 119,025 more rows
One useful function from widyr is the pairwise_count() function. The prefix pairwise_ means it will result in one row for each pair of words in the word column (variable). This lets us count common pairs of words co-appearing within the same tweet, given the status_id column (variable).
#install.packages("widyr")
library(widyr)
word_pairs <- covid_tweets_words %>%
pairwise_count(word, status_id, sort=TRUE) # count words co-occurring within status_id
## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
word_pairs
## # A tibble: 1,575,558 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 19 covid 464
## 2 covid 19 464
## 3 #coronavirus #covid19 415
## 4 #covid19 #coronavirus 415
## 5 #covid19 can 298
## 6 can #covid19 298
## 7 people #covid19 244
## 8 #covid19 people 244
## 9 people covid19 215
## 10 us #covid19 215
## # ... with 1,575,548 more rows
Notice that while the input has one row for each pair of a tweet and a word, the output has one row for each pair of words. This results from counting pairs of words that appear together in the same tweet (that is, share the same status_id). The output is also in a tidy format, but with a very different structure that we can use to answer questions.
For example, we can see that the most common pair of words in tweets about COVID-19 is “covid” and “19”, which is easily expected. We can also easily find the words that most often occur with “covid”:
word_pairs %>%
filter(item1 == "covid")
## # A tibble: 4,232 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 covid 19 464
## 2 covid can 71
## 3 covid #covid19 57
## 4 covid people 57
## 5 covid just 41
## 6 covid get 39
## 7 covid now 39
## 8 covid time 33
## 9 covid like 33
## 10 covid know 33
## # ... with 4,222 more rows
A pair like “covid” and “people” is the fourth most common in this list, but that is not particularly meaningful, since both are also among the most common individual words.
covid_tweets_words %>%
count(word, sort=T)
## # A tibble: 26,805 x 2
## word n
## <chr> <int>
## 1 #covid19 3447
## 2 covid19 2462
## 3 #coronavirus 1134
## 4 can 674
## 5 people 643
## 6 covid 564
## 7 us 535
## 8 19 496
## 9 now 450
## 10 new 446
## # ... with 26,795 more rows
So we may instead want to examine correlation among words, which indicates how often they appear together relative to how often they appear separately.
Here I introduce you to the phi coefficient, a common measure for binary correlation. The focus of the phi coefficient is how much more likely it is that either both word X and Y appear, or neither do, than that one appears without the other.
Consider the following table:
|            | Has word Y | No word Y | Total |
|------------|------------|-----------|-------|
| Has word X | n11        | n10       | n1.   |
| No word X  | n01        | n00       | n0.   |
| Total      | n.1        | n.0       | n     |
For example, n11 represents the number of tweets where both word X and word Y appear, n00 the number where neither appears, and n10 and n01 the cases where one appears without the other. In terms of this table, the phi coefficient is:

$$\phi = \frac{n_{11}\, n_{00} - n_{10}\, n_{01}}{\sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}}$$
The phi coefficient is equivalent to the Pearson correlation, which you may have heard of elsewhere, when it is applied to binary data.
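As a quick, hedged check of that equivalence, the sketch below uses two made-up binary vectors recording whether words X and Y appear in each of ten hypothetical tweets, computes phi from the 2x2 counts, and compares the result with cor():
# Made-up binary indicators: does word X / word Y appear in each of 10 tweets?
has_x <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 0)
has_y <- c(1, 1, 0, 0, 0, 1, 0, 1, 1, 0)
n11 <- sum(has_x == 1 & has_y == 1)   # both words appear
n10 <- sum(has_x == 1 & has_y == 0)   # only X appears
n01 <- sum(has_x == 0 & has_y == 1)   # only Y appears
n00 <- sum(has_x == 0 & has_y == 0)   # neither appears
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
phi                  # 0.6
cor(has_x, has_y)    # identical: Pearson correlation on binary data equals phi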
The pairwise_cor() function in widyr allows us to find the phi coefficient between words based on how often they appear in the same tweet. Its syntax is similar to pairwise_count().
# we need to filter for at least relatively common words first
word_cors <- covid_tweets_words %>%
group_by(word) %>%
filter(n() >= 10) %>%
pairwise_cor(word, status_id, sort = TRUE)
word_cors
## # A tibble: 3,646,190 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 #bozeman @kbzk 1.
## 2 #gallatincounty @kbzk 1.
## 3 @kbzk #bozeman 1.
## 4 #gallatincounty #bozeman 1.
## 5 @kbzk #gallatincounty 1.
## 6 #bozeman #gallatincounty 1.
## 7 #jog #motavation 1
## 8 #bulid #motavation 1
## 9 #suppo #motavation 1
## 10 #honor #motavation 1
## # ... with 3,646,180 more rows
This output format is helpful for exploration. For example, we could find the words or hashtags most correlated with a word like “new” or “positive” using a filter() operation.
word_cors %>%
filter(item1 == "new")
## # A tibble: 1,909 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 new york 0.546
## 2 new pet 0.381
## 3 new cats 0.336
## 4 new two 0.293
## 5 new positive 0.244
## 6 new test 0.232
## 7 new cases 0.148
## 8 new #healthy 0.137
## 9 new legendary 0.135
## 10 new #citylife 0.135
## # ... with 1,899 more rows
word_cors %>%
filter(item1 == "positive")
## # A tibble: 1,909 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 positive pet 0.540
## 2 positive cats 0.515
## 3 positive test 0.428
## 4 positive two 0.414
## 5 positive york 0.402
## 6 positive tested 0.319
## 7 positive new 0.244
## 8 positive cases 0.132
## 9 positive negative 0.110
## 10 positive tests 0.109
## # ... with 1,899 more rows
To analyze the semantic network of tweets about COVID-19, we can also visualize the co-occurrences among words. As one common visualization, we can arrange the words into a network, or “graph.” Here we use “graph” not in the sense of a plot, but as a combination of connected nodes. A graph can be constructed from a tidy object because it has three variables: from (the node an edge comes from), to (the node an edge goes toward), and weight (a numeric value associated with each edge, in our case the correlation).
Here we will use the igraph package, which provides many powerful functions for manipulating and analyzing networks. One way to create an igraph object from our tidy data is the graph_from_data_frame() function, which takes a data frame of edges with columns for “from”, “to”, and edge attributes or weights (in this case, correlation):
#install.packages("igraph")
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
# filter for only relatively strong co-occurrences
word_cors %>%
filter(correlation > .5) %>%
graph_from_data_frame()
## IGRAPH d5d49eb DN-- 108 646 --
## + attr: name (v/c), correlation (e/n)
## + edges from d5d49eb (vertex names):
## [1] #bozeman ->@kbzk #gallatincounty->@kbzk
## [3] @kbzk ->#bozeman #gallatincounty->#bozeman
## [5] @kbzk ->#gallatincounty #bozeman ->#gallatincounty
## [7] #jog ->#motavation #bulid ->#motavation
## [9] #suppo ->#motavation #honor ->#motavation
## [11] #succeed ->#motavation #boss ->#motavation
## [13] #selfmade ->#motavation #motavation ->#jog
## [15] #bulid ->#jog #suppo ->#jog
## + ... omitted several edges
The igraph package has plotting functions built in, but they are not what the package is primarily designed for, so many other packages have developed visualization methods for graph objects. We recommend the ggraph package (Pedersen 2017) because it implements these visualizations in terms of the grammar of graphics, which we are already familiar with from ggplot2.
We can convert an igraph object into a ggraph one with the ggraph() function, after which we add layers to it, much as layers are added in ggplot2. For example, for a basic graph we need to add three layers: nodes, edges (links), and text.
#install.packages("ggraph")
library(ggraph)
set.seed(200608) # fix the random layout so the network plot is reproducible
word_cors %>%
filter(correlation > .4) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "plum4", size = 3) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void() # white background
## Warning: ggrepel: 25 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
This network of word co-occurrences visualizes some details of the text structure, and we can observe distinctive phrases or word associations that make up important discourses or topics around COVID-19. For example, there is a cluster composed of “rs15000”, “package”, “approves”, “cabinet”, and “crore”, which reflects news from India that the Union Cabinet approved a Rs 15,000 crore package called the ‘India COVID-19 Emergency Response and Health System Preparedness Package’. We can also see another cluster of words (“new”, “york”, “two”, “cats”, “pet”, “test”, “positive”) suggesting that two pet cats in New York tested positive for COVID-19.
So, as we can see, constructing a semantic network of texts using word co-occurrences reveals conceptual clusters of words and thus helps organize the whole discourse around an issue.
We conclude with a few polishing operations to make a better looking graph:
We add the edge_alpha aesthetic to the link layer to make links more or less transparent based on how correlated a pair of words is.
We tinker with the options to the node layer to make the nodes more attractive (larger, plum-colored points).
We add a theme that’s useful for plotting networks, theme_void().
set.seed(200608)
word_cors %>%
filter(correlation > .5) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "plum4", size = 3) +
geom_node_text(aes(label = name), repel = TRUE) + # label each node with its name; labels are repelled from each other to avoid overlapping
theme_void()
## Warning: ggrepel: 23 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
So far we’ve looked at how the tidy text approach is useful not only for individual words, but also for exploring the relationships and connections between words. Such relationships can involve n-grams, which let us see which words tend to appear after others, or co-occurrences and correlations, for words that appear in proximity to each other. This class also demonstrated the ggraph package for visualizing both of these types of relationships as networks, along with some clusters of words as topics. These network visualizations are a flexible tool for exploring relationships, and will play an important role in your second major assignment.
Your second major assignment will be announced this week, so please keep paying attention to the course announcement on our eclass webpage.