Today’s lecture is based on Chapter 4 of our textbook, “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson.
Last week we treated words as individual units of language and considered their counts as a way of revealing the important issues in the tweets about COVID-19 we gathered. In particular, we applied TF-IDF to measure how important a word was to the group of tweets from the U.S. within the collection of all the tweets from the U.S., the U.K., India, and Canada. We found that TF-IDF was very useful for identifying what was particularly important to the U.S. compared with the other countries, because it decreases the weight of commonly used words and increases the weight of frequent words unique to the U.S.
However, many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately or which tend to co-occur within the same documents or the same linguistic context. This is a way of revealing which words are important in texts by analyzing the extent to which they are linked to each other. This measure of word co-occurrence matters because the meaning of a word is often shaped by the words it frequently appears with.
Today, we’ll explore some of the methods for calculating and visualizing relationships between words in our tweet text dataset. This includes the token = "ngrams" argument in the unnest_tokens function, which tokenizes by pairs of adjacent words rather than by individual ones. We’ll also introduce two new packages: ggraph, which extends ggplot2 to construct network plots, and widyr, which calculates pairwise correlations and distances within a tidy data frame. Together these expand our toolbox for exploring text within the tidy data framework.
We’ve been using the unnest_tokens() function to tokenize by word, which is useful for the kinds of sentiment and frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationship between X and Y.
We do this by adding the token = "ngrams" option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of two consecutive words, often called “bigrams”.
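To see what this produces before running it on the full dataset, here is a minimal sketch on a tiny made-up tibble (the two sentences are invented for illustration):
library(dplyr)
library(tidytext)
toy_text <- tibble(id = 1:2,
text = c("stay home stay safe",
"wash your hands often"))
toy_text %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
# Each row is now a pair of adjacent words: "stay home", "home stay", "stay safe", ...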
So we tokenize our COVID-19 tweets data set by bigrams as follows.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.3.0 --
## √ ggplot2 3.3.0 √ purrr 0.3.4
## √ tibble 3.0.0 √ dplyr 0.8.5
## √ tidyr 1.0.2 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(textclean)
load("covid_tweets_423.RData")
covid_tweets[1:5,]
## # A tibble: 5 x 9
## user_id status_id created_at screen_name text lang country lat
## <chr> <chr> <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ "@ev~ en United~ 36.0
## 2 169480~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ "Ple~ en United~ 36.9
## 3 215583~ 12533658~ 2020-04-23 16:51:09 KOROGLU_BA~ "@Ay~ tr Azerba~ 40.2
## 4 744597~ 12533657~ 2020-04-23 16:51:05 FoodFocusSA "Pre~ en South ~ -26.1
## 5 155877~ 12533657~ 2020-04-23 16:51:01 opcionsecu~ "#AT~ es Ecuador -1.67
## # ... with 1 more variable: lng <dbl>
covid_bigrams <- covid_tweets %>%
filter(lang=="en") %>% # keep English-language tweets only
filter(country %in% c("United States","India","United Kingdom","Canada")) %>% # keep the four English-speaking countries
mutate(text = str_replace_all(text, "RT", " ")) %>% # drop retweet markers
mutate(text = sapply(text, replace_non_ascii)) %>% # textclean: remove non-ASCII characters
mutate(text = sapply(text, replace_contraction)) %>% # expand contractions
mutate(text = sapply(text, replace_html)) %>% # strip HTML tags and entities
mutate(text = sapply(text, replace_url)) %>% # remove URLs
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize into bigrams
arrange(created_at)
covid_bigrams
## # A tibble: 189,344 x 9
## user_id status_id created_at screen_name lang country lat lng
## <chr> <chr> <dttm> <chr> <chr> <chr> <dbl> <dbl>
## 1 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 2 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 3 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 4 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 5 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 6 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 7 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 8 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 9 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 10 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## # ... with 189,334 more rows, and 1 more variable: bigram <chr>
As you can see in the bigram column of the resulting covid_bigrams data set, the data structure is still a variation of the tidy text format: it is structured as one token per row, where each token is a bigram consisting of two consecutive words in a tweet. The other columns contain metadata characterizing each tweet, such as user_id (who posted the tweet) and created_at (when it was posted on Twitter).
Once we tokenize tweet texts by bigrams in the tidy text format, our usual tidy tools apply equally well to n-gram analysis. We can examine the most common bigrams using dplyr’s count():
covid_bigrams %>%
count(bigram, sort= TRUE)
## # A tibble: 108,833 x 2
## bigram n
## <chr> <int>
## 1 covid 19 3029
## 2 it is 519
## 3 in the 513
## 4 of the 447
## 5 i am 422
## 6 do not 344
## 7 we are 284
## 8 on the 283
## 9 of covid 278
## 10 to the 267
## # ... with 108,823 more rows
Not surprisingly, ‘covid’ and ‘19’ form the most common bigram, since the tweets were collected precisely because they contained the hashtag “#covid-19”. In our text pre-processing the pound sign “#” was removed and the hyphen “-” was replaced by a blank, so the hashtag “#covid-19” became the two consecutive words “covid 19”. It is therefore no surprise that “covid 19” is the most frequent bigram.
But let’s take a look at the other frequent bigrams. Many of the most common bigrams are pairs of very common (and uninteresting) words such as in the, it is, i am, of the, do not, on the, we are, this is, is a, and so on: what we call “stop-words”. Bigrams containing stop-words are not of interest to us because they do not provide any meaningful information about the COVID-19 issue in these English-speaking countries. What we need to do, then, is remove such stop-word bigrams so that more content-oriented bigrams appear at the top of the frequency list. How can we handle bigrams that include stop-words?
At this point, I’d like to introduce you to the separate() function from the tidyr package. This function splits a column into multiple columns based on a delimiter (here, the blank that separates the two words of a bigram). So it allows us to separate each bigram into two columns, “word1” and “word2”, at which point we can remove cases where either word is a stop-word.
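As a quick illustration on made-up bigrams (not our tweet data), separate() splits each string at the blank into two new columns:
library(tidyr)
library(dplyr)
tibble(bigram = c("covid 19", "stay home", "in the")) %>%
separate(bigram, c("word1", "word2"), sep = " ")
# word1 = "covid", "stay", "in"; word2 = "19", "home", "the"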
library(tidyr)
covid_bigrams
## # A tibble: 189,344 x 9
## user_id status_id created_at screen_name lang country lat lng
## <chr> <chr> <dttm> <chr> <chr> <chr> <dbl> <dbl>
## 1 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 2 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 3 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 4 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 5 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 6 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 7 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 8 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 9 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 10 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## # ... with 189,334 more rows, and 1 more variable: bigram <chr>
bigrams_separated <- covid_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_separated
## # A tibble: 189,344 x 10
## user_id status_id created_at screen_name lang country lat lng
## <chr> <chr> <dttm> <chr> <chr> <chr> <dbl> <dbl>
## 1 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 2 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 3 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 4 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 5 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 6 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 7 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 8 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 9 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 10 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## # ... with 189,334 more rows, and 2 more variables: word1 <chr>, word2 <chr>
Given that the separator argument is set to a blank " ", the separate() function turns the single character column "bigram" into the two columns "word1" and "word2".
library(stopwords)
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stopwords()) %>%
filter(!word2 %in% stopwords())
The filter() function is applied to the “word1” and “word2” columns so that any row whose word matches an element of the stopwords() vector is removed from the bigrams_separated data set.
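If you have not used the stopwords package before, here is a quick sketch of what the default lexicon contains and how the %in% test behaves (the example words are just for illustration):
library(stopwords)
head(stopwords()) # the default English lexicon starts with words like "i", "me", "my"
"the" %in% stopwords() # TRUE, so a bigram containing "the" is dropped
"covid" %in% stopwords() # FALSE, so it is kept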
So we can now count the combinations of “word1” and “word2”, where neither column contains a stop-word.
bigrams_count <- bigrams_filtered %>%
count(word1, word2, sort=TRUE)
bigrams_count
## # A tibble: 54,007 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 covid 19 3029
## 2 covid1 219 220
## 3 new york 142
## 4 covid19 coronavirus 128
## 5 19 pandemic 104
## 6 let us 95
## 7 coronavirus covid19 92
## 8 social distancing 89
## 9 test positive 81
## 10 stay safe 77
## # ... with 53,997 more rows
We can now see some important words that provide meaningful context about COVID-19 in these English-speaking countries, and the most common pairs of consecutive words are clearly related to the COVID-19 issue.
In other analyses, we may want to work with the words recombined into bigrams. In that case, we can use the unite() function from the tidyr package, which is the inverse of separate(), to recombine the “word1” and “word2” columns into a single “bigram” column. Together, the separate(), filter(), count(), and unite() functions allow us to find the most common bigrams that do not contain stop-words.
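As another minimal sketch on made-up columns, unite() glues two columns back together with the chosen separator (and drops the original columns by default):
library(tidyr)
library(dplyr)
tibble(word1 = c("social", "stay"), word2 = c("distancing", "safe")) %>%
unite(bigram, word1, word2, sep = " ")
# bigram = "social distancing", "stay safe"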
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ") # The name of the new column comes first (resulting from the combination of the following two column names)
bigrams_united
## # A tibble: 71,351 x 9
## user_id status_id created_at screen_name lang country lat lng
## <chr> <chr> <dttm> <chr> <chr> <chr> <dbl> <dbl>
## 1 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 2 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 3 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 4 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 5 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 6 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 7 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en United~ 32.6 -89.9
## 8 110415~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en United~ 37.8 -122.
## 9 110415~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en United~ 37.8 -122.
## 10 110415~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en United~ 37.8 -122.
## # ... with 71,341 more rows, and 1 more variable: bigram <chr>
Our one-bigram-per-row format is helpful for exploratory analyses of the text. For example, we can extract the bigrams that are important in the context of the U.S. compared with other English-speaking countries like the U.K., India, and Canada. A bigram can be treated as a term in a tweet in the same way we treated individual words, so we can look at the tf-idf of bigrams across the English-speaking countries.
bigram_tf_idf <- bigrams_united %>%
count(country, bigram) %>%
bind_tf_idf(bigram, country, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf
## # A tibble: 56,560 x 6
## country bigram n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 India namo app 45 0.00363 1.39 0.00504
## 2 India via namo 45 0.00363 1.39 0.00504
## 3 Canada cdnpoli uspoli 9 0.00160 1.39 0.00222
## 4 Canada covid19 toronto 7 0.00125 1.39 0.00173
## 5 United Kingdom cent registered 14 0.00123 1.39 0.00171
## 6 Canada term care 13 0.00232 0.693 0.00161
## 7 United Kingdom george's day 13 0.00115 1.39 0.00159
## 8 India aarogya setu 14 0.00113 1.39 0.00157
## 9 India narendra modi 14 0.00113 1.39 0.00157
## 10 Canada british columbia 6 0.00107 1.39 0.00148
## # ... with 56,550 more rows
Let’s focus only on bigrams whose words consist entirely of alphabetic characters.
bigram_tf_idf <- bigrams_separated %>%
filter(!word1 %in% stopwords()) %>%
filter(!word2 %in% stopwords()) %>%
filter(!str_detect(word1, "[^[:alpha:]]")) %>% # drop word1 if it contains any non-letter character
filter(!str_detect(word2, "[^[:alpha:]]")) %>% # same check for word2
unite(bigram, word1, word2, sep = " ") %>%
count(country, bigram) %>%
bind_tf_idf(bigram, country, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf
## # A tibble: 45,157 x 6
## country bigram n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 India namo app 45 0.00503 1.39 0.00697
## 2 India via namo 45 0.00503 1.39 0.00697
## 3 Canada cdnpoli uspoli 9 0.00214 1.39 0.00296
## 4 United Kingdom cent registered 14 0.00162 1.39 0.00225
## 5 India aarogya setu 14 0.00156 1.39 0.00217
## 6 India narendra modi 14 0.00156 1.39 0.00217
## 7 Canada term care 13 0.00308 0.693 0.00214
## 8 India crore package 13 0.00145 1.39 0.00201
## 9 Canada british columbia 6 0.00142 1.39 0.00197
## 10 India cabinet approves 12 0.00134 1.39 0.00186
## # ... with 45,147 more rows
These tf-idf values can be visualized within each country, just as we did for individual words.
bigram_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(bigram, levels = rev(unique(bigram)))) %>%
group_by(country) %>%
top_n(10) %>%
ungroup %>%
mutate(country = as.factor(country),
bigram = reorder_within(bigram, tf_idf, country)) %>% # To order the words by tf within each country
ggplot(aes(bigram, tf_idf, fill = country)) +
geom_col(show.legend = FALSE) +
scale_x_reordered() + # This line removes the separator and country name as a suffix
labs(x = NULL, y = "TF-IDF") +
facet_wrap(~country, ncol=2, scales="free") +
coord_flip()
## Selecting by word
Much as we discovered in the previous class, the units that distinguish each country are almost exclusively names of people or organizations.
There are advantages and disadvantages to examining the tf-idf of bigrams rather than individual words. Pairs of consecutive words might capture structure that is not present when one is just counting single words, and may provide context that makes tokens more understandable (for example, “positive cases”, in the United States, is more informative than “positive”). However, the per-bigram counts are also sparser: a typical two-word pair is rarer than either of its component words. Thus, bigrams can be especially useful when you have a very large text dataset.
A bigram refers to a pair of two consecutive words in a text. This method allows us to find the most common two-word pairs that provide context that makes tokens more understandable. But sometimes we want to analyze all of the relationships among words used in the same tweet.
So here I’d like to introduce you to the concept of word co-occurrence. By “word co-occurrence,” we mean any pair of words appearing together in the same textual context, such as a tweet. The strength of the relationship between a pair of words is then weighted by counting their co-occurrences.
Word co-occurrences are useful for analyzing the semantic network of texts, given the extent to which influential words bridge between other words in a single utterance. By analyzing word co-occurrences across tweets, we can “describe the extent to which words are prominent in creating a structural pattern of coherence in a text” (Corman et al., 2002, p. 179) by locating the “in-between” position of words in a co-occurrence network. In particular, analyzing word co-occurrences seeks to find words (or bigrams) that link conceptual clusters together and thus help organize the whole; it allows us to survey rich and complex structures in word networks, in which the influence of certain words in tweets depends not just on their frequency but also on their location in the semantic network structure. That is to say, this method is structurally sensitive to a semantic network because “it accounts for all likely chains of association among words that make texts and conversation coherent” (Corman et al., 2002, p. 181).
So we will analyze the relationships between words that tend to co-occur within the same tweet, even if they do not occur next to each other. The tidy data format is a useful structure for comparing between variables or grouping by rows, but it can be challenging to compare between rows: for example, to count the number of times that two words appear within the same tweet, or to see how correlated they are. Most operations for finding pairwise counts or correlations need to turn the data into a wide matrix first.
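To make the “widen data, perform an operation, then re-tidy” idea concrete, here is a hand-rolled sketch on a made-up word-per-tweet table (hypothetical tweet ids and words); it only illustrates the widening step, not how widyr is implemented internally:
library(dplyr)
library(tidyr)
library(tibble)
toy_words <- tribble(
~status_id, ~word,
"t1", "covid",
"t1", "test",
"t1", "positive",
"t2", "covid",
"t2", "vaccine",
"t3", "test",
"t3", "positive"
)
# Widen: one row per tweet, one column per word, 1 = the word occurs in that tweet
wide <- toy_words %>%
mutate(present = 1) %>%
pivot_wider(names_from = word, values_from = present, values_fill = 0)
# Operate: a matrix cross-product gives word-by-word co-occurrence counts
m <- as.matrix(wide[, -1]) # drop the status_id column
t(m) %*% m # e.g. the ("test", "positive") cell is 2: the pair appears together in t1 and t3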
Figure: The philosophy behind the widyr package, which can perform operations such as counting and correlating on pairs of values in a tidy dataset. The widyr package first ‘casts’ a tidy dataset into a wide matrix, performs an operation such as a correlation on it, then re-tidies the result. (Figure by Julia Silge)
We will examine some of the ways tidy text can be turned into a wide matrix, but in this case it is not necessary. The widyr package makes operations such as computing counts and correlations easy by simplifying the pattern of “widen data, perform an operation, then re-tidy data,” as seen in the above figure. We will focus on a set of functions that make pairwise comparisons between groups of observations (for example, between tweets).
From now on, we will analyze what words tend to appear (co-occur) within the same tweet.
covid_tweets_words <- covid_tweets %>%
filter(lang=="en") %>% # keep English-language tweets only
filter(country %in% c("United States","India","United Kingdom","Canada")) %>% # keep the four English-speaking countries
mutate(text = str_replace_all(text, "RT", " ")) %>% # drop retweet markers
mutate(text = sapply(text, replace_non_ascii)) %>% # textclean: remove non-ASCII characters
mutate(text = sapply(text, replace_contraction)) %>% # expand contractions
mutate(text = sapply(text, replace_html)) %>% # strip HTML tags and entities
mutate(text = sapply(text, replace_url)) %>% # remove URLs
unnest_tweets(word, text) %>% # tweet-aware tokenizer that preserves hashtags and @mentions
filter(!word %in% stopwords()) %>% # drop stop-words
filter(!str_detect(word, "[^[:word:]#@]")) %>% # keep only tokens made of word characters, "#", or "@"
dplyr::select(country, status_id, word)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid_tweets_words
## # A tibble: 119,035 x 3
## country status_id word
## <chr> <chr> <chr>
## 1 United States 1253365816281554946 @everythingoes99
## 2 United States 1253365816281554946 @auntvireen
## 3 United States 1253365816281554946 @johnrobertsfox
## 4 United States 1253365816281554946 wtf
## 5 United States 1253365816281554946 talking
## 6 United States 1253365816281554946 death
## 7 United States 1253365816281554946 numbers
## 8 United States 1253365816281554946 taken
## 9 United States 1253365816281554946 cdc
## 10 United States 1253365816281554946 includes
## # ... with 119,025 more rows
One useful function from widyr is pairwise_count(). The prefix pairwise_ means that the result will have one row for each pair of words from the word column (variable). This lets us count the common pairs of words co-appearing within the same tweet, as identified by the status_id column (variable).
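As a sanity check on a tiny made-up table with the same structure as covid_tweets_words (hypothetical tweet ids and words), pairwise_count() returns one row per word pair together with its co-occurrence count:
library(dplyr)
library(tibble)
library(widyr)
toy_words <- tribble(
~status_id, ~word,
"t1", "covid",
"t1", "test",
"t1", "positive",
"t2", "covid",
"t2", "vaccine",
"t3", "test",
"t3", "positive"
)
toy_words %>%
pairwise_count(word, status_id, sort = TRUE)
# The pair "test"/"positive" comes out on top with n = 2 (it co-occurs in t1 and t3)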
#install.packages("widyr")
library(widyr)
word_pairs <- covid_tweets_words %>%
pairwise_count(word, status_id, sort = TRUE) # count words co-occurring within status_id
word_pairs
## # A tibble: 1,575,558 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 19 covid 464
## 2 covid 19 464
## 3 #coronavirus #covid19 415
## 4 #covid19 #coronavirus 415
## 5 #covid19 can 298
## 6 can #covid19 298
## 7 people #covid19 244
## 8 #covid19 people 244
## 9 people covid19 215
## 10 us #covid19 215
## # ... with 1,575,548 more rows
Notice that while the input has one row for each pair of a tweet and a word, the output has one row for each pair of words. This results from counting the pairs of words that appear together in tweets sharing the same status_id. The output is also in a tidy format, but with a very different structure that we can use to answer new questions.
For example, we can see that the most common pair of words in tweets about COVID-19 is “covid” and “19”, which is easily expected. We can also easily find the words that most often occur with “covid”:
word_pairs %>%
filter(item1 == "covid")
## # A tibble: 4,232 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 covid 19 464
## 2 covid can 71
## 3 covid #covid19 57
## 4 covid people 57
## 5 covid just 41
## 6 covid get 39
## 7 covid now 39
## 8 covid time 33
## 9 covid like 33
## 10 covid know 33
## # ... with 4,222 more rows
The pair “covid” and “people” is the fourth most common co-occurrence in this list, but that is not particularly meaningful, since these are also among the most common individual words:
covid_tweets_words %>%
count(word, sort=T)
## # A tibble: 26,805 x 2
## word n
## <chr> <int>
## 1 #covid19 3447
## 2 covid19 2462
## 3 #coronavirus 1134
## 4 can 674
## 5 people 643
## 6 covid 564
## 7 us 535
## 8 19 496
## 9 now 450
## 10 new 446
## # ... with 26,795 more rows
So we may instead want to examine the correlation among words, which indicates how often they appear together relative to how often they appear separately.
Here I introduce you to the phi coefficient, a common measure for binary correlation. The focus of the phi coefficient is how much more likely it is that either both word X and word Y appear, or neither does, than that one appears without the other.
Consider the following table:
|            | Has word Y | No word Y | Total |
|------------|------------|-----------|-------|
| Has word X | n11        | n10       | n1.   |
| No word X  | n01        | n00       | n0.   |
| Total      | n.1        | n.0       | n     |
For example, n11 represents the number of tweets where both word X and word Y appear, n00 the number where neither appears, and n10 and n01 the cases where one appears without the other. In terms of this table, the phi coefficient is:
phi = (n11 * n00 - n10 * n01) / sqrt(n1. * n0. * n.1 * n.0)
The phi coefficient is equivalent to the Pearson correlation, which you may have heard of elsewhere, when it is applied to binary data.
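To connect the 2x2 table to this formula, here is a small sketch that computes phi by hand for two made-up binary vectors (1 = the word appears in a given tweet, 0 = it does not) and checks the result against cor():
# Made-up presence/absence vectors for two hypothetical words across eight tweets
x <- c(1, 1, 1, 0, 0, 0, 1, 0) # word X
y <- c(1, 1, 0, 0, 0, 1, 1, 0) # word Y
n11 <- sum(x == 1 & y == 1) # both appear
n10 <- sum(x == 1 & y == 0) # X without Y
n01 <- sum(x == 0 & y == 1) # Y without X
n00 <- sum(x == 0 & y == 0) # neither appears
# The denominator uses the marginal totals n1., n0., n.1, n.0 from the table
phi <- (n11 * n00 - n10 * n01) /
sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
phi # 0.5 for these made-up vectors
cor(x, y) # identical: phi is the Pearson correlation computed on binary data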
The pairwise_cor() function in widyr allows us to find the phi coefficient between words based on how often they appear in the same tweet. Its syntax is similar to pairwise_count().
# we need to filter for at least relatively common words first
word_cors <- covid_tweets_words %>%
group_by(word) %>%
filter(n() >= 10) %>%
pairwise_cor(word, status_id, sort = TRUE)
word_cors
## # A tibble: 3,646,190 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 #bozeman @kbzk 1.
## 2 #gallatincounty @kbzk 1.
## 3 @kbzk #bozeman 1.
## 4 #gallatincounty #bozeman 1.
## 5 @kbzk #gallatincounty 1.
## 6 #bozeman #gallatincounty 1.
## 7 #jog #motavation 1
## 8 #bulid #motavation 1
## 9 #suppo #motavation 1
## 10 #honor #motavation 1
## # ... with 3,646,180 more rows
This output format is helpful for exploration. For example, we could find the words or hashtags most correlated with a word like “new” or “positive” using a filter operation.
word_cors %>%
filter(item1 == "new")
## # A tibble: 1,909 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 new york 0.546
## 2 new pet 0.381
## 3 new cats 0.336
## 4 new two 0.293
## 5 new positive 0.244
## 6 new test 0.232
## 7 new cases 0.148
## 8 new #healthy 0.137
## 9 new legendary 0.135
## 10 new #citylife 0.135
## # ... with 1,899 more rows
word_cors %>%
filter(item1 == "positive")
## # A tibble: 1,909 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 positive pet 0.540
## 2 positive cats 0.515
## 3 positive test 0.428
## 4 positive two 0.414
## 5 positive york 0.402
## 6 positive tested 0.319
## 7 positive new 0.244
## 8 positive cases 0.132
## 9 positive negative 0.110
## 10 positive tests 0.109
## # ... with 1,899 more rows
To analyze the semantic network of tweets about COVID-19, we can also visualize all of the co-occurrences among words. As one common visualization, we can arrange the words into a network, or “graph.” Here we are referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has three variables: “from” (the node an edge comes from), “to” (the node an edge goes toward), and a weight (a numeric value associated with each edge; here, the correlation).
Here we will use the igraph package, which provides many powerful functions for manipulating and analyzing networks. One way to create an igraph object from our tidy data is the graph_from_data_frame() function, which takes a data frame of edges with columns for “from”, “to”, and edge attributes such as weights (in this case, correlation):
#install.packages("igraph")
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
# filter for only relatively strong co-occurrences
word_cors %>%
filter(correlation > .5) %>%
graph_from_data_frame()
## IGRAPH bd09639 DN-- 108 646 --
## + attr: name (v/c), correlation (e/n)
## + edges from bd09639 (vertex names):
## [1] #bozeman ->@kbzk #gallatincounty->@kbzk
## [3] @kbzk ->#bozeman #gallatincounty->#bozeman
## [5] @kbzk ->#gallatincounty #bozeman ->#gallatincounty
## [7] #jog ->#motavation #bulid ->#motavation
## [9] #suppo ->#motavation #honor ->#motavation
## [11] #succeed ->#motavation #boss ->#motavation
## [13] #selfmade ->#motavation #motavation ->#jog
## [15] #bulid ->#jog #suppo ->#jog
## + ... omitted several edges
The igraph package has plotting functions built in, but they are not what the package is designed for, so many other packages have developed visualization methods for graph objects. We recommend the ggraph package (Pedersen 2017), because it implements these visualizations in terms of the grammar of graphics, which we are already familiar with from ggplot2.
We can convert an igraph object into a ggraph one with the ggraph() function, after which we add layers to it, much as layers are added in ggplot2. For example, for a basic graph we need to add three layers: nodes, edges (links), and text.
#install.packages("ggraph")
library(ggraph)
set.seed(200608) # fix the random layout so the network plot is reproducible
word_cors %>%
filter(correlation > .5) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "plum4", size = 3) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void() # white background
This network of word co-occurrences visualizes some details of the text structure, and we can observe some distinctive phrases or word associations that make up important discourses around COVID-19. For example, there is a cluster composed of “rs15000”, “package”, “approves”, “cabinet”, and “crore”, which reflects the Indian news that the Union Cabinet approved the Rs 15,000 crore ‘India COVID-19 Emergency Response and Health System Preparedness Package’. We can also see another cluster of words, “new”, “york”, “two”, “cats”, “pet”, “test”, “positive”, suggesting that two pet cats in New York tested positive for COVID-19.
So, as we can see, constructing a semantic network of texts from word co-occurrences reveals conceptual clusters of words and thus helps organize the whole discourse around an issue.
We conclude with a few polishing operations to make a better-looking graph:
- We add the edge_alpha aesthetic to the link layer to make links more or less transparent based on how correlated a pair of words is.
- We tinker with the options of the node layer to make the nodes more attractive (larger, plum-colored points).
- We add a theme that’s useful for plotting networks, theme_void().
set.seed(200608)
word_cors %>%
filter(correlation > .5) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "plum4", size = 3) +
geom_node_text(aes(label = name), repel = TRUE) + # Annotate nodes with text (name of each node) and text labels will be repelled from each other to avoid overlapping
theme_void()
Today we looked at how the tidy text approach is useful not only for analyzing individual words, but also for exploring the relationships and connections between words. Such relationships can involve n-grams, which enable us to see which words tend to appear after others, or co-occurrences and correlations, for words that appear in proximity to each other. This class also demonstrated the ggraph package for visualizing both of these types of relationships as networks. These network visualizations are a flexible tool for exploring relationships, and will play an important role in your second major assignment.
Your second major assignment will be announced this week, so please keep an eye on the course announcements on our eClass webpage.