Today’s lecture is based on Chapter 4 of our textbook, “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson.

Information retrieval by identifying word co-occurrences

Last week we treated words as individual units of language and considered their counts as a way of revealing the important issues in the tweets about COVID-19 we gathered. In particular, we applied TF-IDF to measure how important a word was to the group of tweets from the U.S. within a collection of tweets from the U.S., the U.K., India, and Canada. We found that TF-IDF was very useful for identifying what was particularly important to the U.S. compared with the other countries, because it decreases the weight of commonly used words and increases the weight of frequent words unique to the U.S.

However, many interesting text analyses are based on the relationships between words, whether we examine which words tend to immediately follow others, or which tend to co-occur within the same documents or the same linguistic context. This is a way of revealing which words are important in texts by analyzing the extent to which they are linked to each other. In particular, word co-occurrence is an important measure because the meaning of a word is often shaped by the words it is frequently used with.

Today, we’ll explore some of the methods for calculating and visualizing relationships between words in our tweet text dataset. This includes the token = "ngrams" argument in the unnest_tokens() function, which tokenizes by pairs of adjacent words rather than by individual ones. We’ll also introduce two new packages: ggraph, which extends ggplot2 to construct network plots, and widyr, which calculates pairwise correlations and distances within a tidy data frame. Together these expand our toolbox for exploring text within the tidy data framework.

Tokenizing by n-gram

We’ve been using the unnest_tokens() function to tokenize by word, which is useful for the kinds of sentiment and frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationship between X and Y.

We do this by adding the token = "ngrams" option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of two consecutive words, often called “bigrams”.
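Before applying this to the real data, here is a minimal toy sketch (the sentence is made up, not from our tweet data) of what bigram tokenization produces:

library(dplyr)   # provides tibble() and the pipe
library(tidytext)

tibble(text = "stay home and stay safe during the pandemic") %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# each row is now a pair of adjacent words:
# "stay home", "home and", "and stay", "stay safe", ...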

Now let’s apply this to our COVID-19 tweets dataset and tokenize it into bigrams as follows.

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.3.0 --
## √ ggplot2 3.3.0     √ purrr   0.3.4
## √ tibble  3.0.0     √ dplyr   0.8.5
## √ tidyr   1.0.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidytext)
library(textclean)

load("covid_tweets_423.RData")
covid_tweets[1:5,]
## # A tibble: 5 x 9
##   user_id status_id created_at          screen_name text  lang  country    lat
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> <chr>    <dbl>
## 1 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ "@ev~ en    United~  36.0 
## 2 169480~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ "Ple~ en    United~  36.9 
## 3 215583~ 12533658~ 2020-04-23 16:51:09 KOROGLU_BA~ "@Ay~ tr    Azerba~  40.2 
## 4 744597~ 12533657~ 2020-04-23 16:51:05 FoodFocusSA "Pre~ en    South ~ -26.1 
## 5 155877~ 12533657~ 2020-04-23 16:51:01 opcionsecu~ "#AT~ es    Ecuador  -1.67
## # ... with 1 more variable: lng <dbl>
covid_bigrams <- covid_tweets %>% 
  filter(lang=="en") %>% 
  filter(country %in% c("United States","India","United Kingdom","Canada")) %>% 
  mutate(text = str_replace_all(text, "RT", " ")) %>% 
  mutate(text = sapply(text, replace_non_ascii)) %>% 
  mutate(text = sapply(text, replace_contraction)) %>% 
  mutate(text = sapply(text, replace_html)) %>% 
  mutate(text = sapply(text, replace_url)) %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  arrange(created_at)

covid_bigrams
## # A tibble: 189,344 x 9
##    user_id status_id created_at          screen_name lang  country   lat   lng
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr>   <dbl> <dbl>
##  1 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  2 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  3 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  4 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  5 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  6 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  7 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  8 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  9 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## 10 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## # ... with 189,334 more rows, and 1 more variable: bigram <chr>

As you can see from the bigram column in the resulting covid_bigrams dataset, the data structure is still a variation of the tidy text format. It is structured as one token per row, where each token is a bigram consisting of two consecutive words in a tweet. The other columns in the dataset contain metadata that characterize each tweet, such as user_id (who posted the tweet) and created_at (when it was posted on Twitter).

Counting and filtering n-grams

Once we have tokenized the tweet texts into bigrams in the tidy text format, our usual tidy tools apply equally well to n-gram analysis. We can examine the most common bigrams using dplyr’s count():

covid_bigrams %>% 
  count(bigram, sort= TRUE)
## # A tibble: 108,833 x 2
##    bigram       n
##    <chr>    <int>
##  1 covid 19  3029
##  2 it is      519
##  3 in the     513
##  4 of the     447
##  5 i am       422
##  6 do not     344
##  7 we are     284
##  8 on the     283
##  9 of covid   278
## 10 to the     267
## # ... with 108,823 more rows

Not surprisingly, “covid 19” is the most common bigram, since the tweets were collected precisely because they contained the hashtag “#covid-19”. During tokenization the pound sign “#” was removed and the hyphen “-” was treated as a word boundary, so the hashtag “#covid-19” became the two consecutive words “covid 19”. It is therefore expected that “covid 19” is the most frequent bigram.
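We can verify this behavior on a made-up string (a hypothetical example, not from our data), using the packages already loaded above:

tibble(text = "cases rising #covid-19 everywhere") %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# the "#" is dropped and the "-" splits the word, so the result includes
# the bigrams "rising covid" and "covid 19"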

But let’s take a look at the other frequent bigrams. Many of the most common bigrams are pairs of very common (and uninteresting) words such as “in the”, “it is”, “i am”, “of the”, “do not”, “on the”, “we are”, “this is”, “is a”, and so on: what we call “stop words”. Bigrams containing stop words are not of interest because they do not provide meaningful information about the COVID-19 issue in these English-speaking countries. We therefore need to remove such bigrams so that more content-oriented bigrams rise to the top of the frequency list. How, then, can we handle bigrams that include stop words?

At this point, I’d like to introduce you to the separate() function from the tidyr package. This function splits one column into multiple columns based on a delimiter (the character by which a bigram is separated). It lets us split each bigram into two columns, “word1” and “word2”, at which point we can remove cases where either word is a stop word.

library(tidyr)

covid_bigrams
## # A tibble: 189,344 x 9
##    user_id status_id created_at          screen_name lang  country   lat   lng
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr>   <dbl> <dbl>
##  1 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  2 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  3 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  4 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  5 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  6 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  7 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  8 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  9 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## 10 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## # ... with 189,334 more rows, and 1 more variable: bigram <chr>
bigrams_separated <- covid_bigrams %>% 
  separate(bigram, c("word1", "word2"), sep = " ") 

bigrams_separated
## # A tibble: 189,344 x 10
##    user_id status_id created_at          screen_name lang  country   lat   lng
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr>   <dbl> <dbl>
##  1 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  2 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  3 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  4 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  5 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  6 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  7 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  8 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  9 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## 10 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## # ... with 189,334 more rows, and 2 more variables: word1 <chr>, word2 <chr>

With the separator argument set to a blank " ", the separate() function turns the single character column “bigram” into two columns, “word1” and “word2”.

library(stopwords)

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stopwords()) %>% 
  filter(!word2 %in% stopwords())

The filter() function is applied to the “word1” and “word2” columns so that any row whose word matches the stopwords() vector is removed from the bigrams_separated dataset.

We can then count combinations of “word1” and “word2”, where neither column contains a stop word.

bigrams_count <- bigrams_filtered %>% 
  count(word1, word2, sort=TRUE)
bigrams_count
## # A tibble: 54,007 x 3
##    word1       word2           n
##    <chr>       <chr>       <int>
##  1 covid       19           3029
##  2 covid1      219           220
##  3 new         york          142
##  4 covid19     coronavirus   128
##  5 19          pandemic      104
##  6 let         us             95
##  7 coronavirus covid19        92
##  8 social      distancing     89
##  9 test        positive       81
## 10 stay        safe           77
## # ... with 53,997 more rows

We can now see important word pairs that provide meaningful context for COVID-19 in these English-speaking countries, and the most common pairs of consecutive words are clearly related to the COVID-19 issue.

In other analyses, we may want to work with the words recombined into bigrams. In that case we can use the unite() function from the tidyr package, the inverse of separate(), to recombine the “word1” and “word2” columns into a single “bigram” column. The separate/filter/count/unite sequence thus allows us to find the most common bigrams that do not contain stop words.

bigrams_united <- bigrams_filtered %>% 
  unite(bigram, word1, word2, sep = " ") # The name of the new column comes first (resulting from the combination of the following two column names)

bigrams_united
## # A tibble: 71,351 x 9
##    user_id status_id created_at          screen_name lang  country   lat    lng
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr>   <dbl>  <dbl>
##  1 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  2 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  3 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  4 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  5 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  6 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  7 307160~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  8 110415~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en    United~  37.8 -122. 
##  9 110415~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en    United~  37.8 -122. 
## 10 110415~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en    United~  37.8 -122. 
## # ... with 71,341 more rows, and 1 more variable: bigram <chr>

Analyzing bigrams

Our one-bigram-per-row format is helpful for exploratory analyses of the text. For example, we can extract bigrams that are important in the context of the U.S. compared with other English-speaking countries such as the U.K., India, and Canada. A bigram can be treated as a term in the tweets in the same way that we treated individual words; for instance, we can look at the tf-idf of bigrams across the English-speaking countries.

bigram_tf_idf <- bigrams_united %>%
  count(country, bigram) %>%
  bind_tf_idf(bigram, country, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
## # A tibble: 56,560 x 6
##    country        bigram               n      tf   idf  tf_idf
##    <chr>          <chr>            <int>   <dbl> <dbl>   <dbl>
##  1 India          namo app            45 0.00363 1.39  0.00504
##  2 India          via namo            45 0.00363 1.39  0.00504
##  3 Canada         cdnpoli uspoli       9 0.00160 1.39  0.00222
##  4 Canada         covid19 toronto      7 0.00125 1.39  0.00173
##  5 United Kingdom cent registered     14 0.00123 1.39  0.00171
##  6 Canada         term care           13 0.00232 0.693 0.00161
##  7 United Kingdom george's day        13 0.00115 1.39  0.00159
##  8 India          aarogya setu        14 0.00113 1.39  0.00157
##  9 India          narendra modi       14 0.00113 1.39  0.00157
## 10 Canada         british columbia     6 0.00107 1.39  0.00148
## # ... with 56,550 more rows

Let’s focus only on words consisting entirely of alphabetic characters.

bigram_tf_idf <- bigrams_separated %>% 
  filter(!word1 %in% stopwords()) %>% 
  filter(!word2 %in% stopwords()) %>% 
  filter(!str_detect(word1, "[^[:alpha:]]")) %>% 
  filter(!str_detect(word2, "[^[:alpha:]]")) %>% 
  unite(bigram, word1, word2, sep = " ") %>% 
  count(country, bigram) %>%
  bind_tf_idf(bigram, country, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
## # A tibble: 45,157 x 6
##    country        bigram               n      tf   idf  tf_idf
##    <chr>          <chr>            <int>   <dbl> <dbl>   <dbl>
##  1 India          namo app            45 0.00503 1.39  0.00697
##  2 India          via namo            45 0.00503 1.39  0.00697
##  3 Canada         cdnpoli uspoli       9 0.00214 1.39  0.00296
##  4 United Kingdom cent registered     14 0.00162 1.39  0.00225
##  5 India          aarogya setu        14 0.00156 1.39  0.00217
##  6 India          narendra modi       14 0.00156 1.39  0.00217
##  7 Canada         term care           13 0.00308 0.693 0.00214
##  8 India          crore package       13 0.00145 1.39  0.00201
##  9 Canada         british columbia     6 0.00142 1.39  0.00197
## 10 India          cabinet approves    12 0.00134 1.39  0.00186
## # ... with 45,147 more rows

These tf-idf values can be visualized within each country, just as we did for individual words.

bigram_tf_idf %>% 
  arrange(desc(tf_idf)) %>% 
  mutate(word = factor(bigram, levels = rev(unique(bigram)))) %>% 
  group_by(country) %>% 
  top_n(10, tf_idf) %>% 
  ungroup %>% 
  mutate(country = as.factor(country), 
         bigram = reorder_within(bigram, tf_idf, country)) %>% # To order the bigrams by tf_idf within each country
  ggplot(aes(bigram, tf_idf, fill = country)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() + # This line removes the separator and country name as a suffix
  labs(x = NULL, y = "TF-IDF") +
  facet_wrap(~country, ncol=2, scales="free") +
  coord_flip()

Much as we discovered in the previous class, the terms that distinguish each country are almost exclusively names of people or organizations.

There are advantages and disadvantages to examining the tf-idf of bigrams rather than individual words. Pairs of consecutive words might capture structure that is not present when one is just counting single words, and may provide context that makes tokens more understandable (for example, “positive cases”, in the United States, is more informative than “positive” alone). However, the per-bigram counts are also sparser: a typical two-word pair is rarer than either of its component words. Thus, bigrams can be especially useful when you have a very large text dataset.
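As a side note, the same pipeline generalizes to longer n-grams. Here is a hedged sketch (not run in class) of counting trigrams by setting n = 3, reusing the covid_tweets data and the stop-word filtering from above; expect the counts to be even sparser than for bigrams:

covid_tweets %>% 
  filter(lang == "en") %>% 
  filter(country %in% c("United States", "India", "United Kingdom", "Canada")) %>% 
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% 
  filter(!word1 %in% stopwords(), 
         !word2 %in% stopwords(), 
         !word3 %in% stopwords()) %>% 
  count(word1, word2, word3, sort = TRUE)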

Visualizing a network of co-occurrences with ggraph

A bigram refers to a pair of consecutive words in a text. This method allows us to find the most common two-word pairs, which provide context that makes tokens more understandable. But sometimes we want to analyze all of the relationships among the words used in the same tweet.

So here I’d like to introduce the concept of word co-occurrences. By “word co-occurrences,” we refer to the incidence of any pair of words appearing together in the same textual context, such as a tweet. The strength of the relationship between a pair of words is then weighted by counting their co-occurrences.

Word co-occurrences are useful for analyzing the semantic network of texts, given the extent to which influential words bridge other words in a single utterance. By analyzing word co-occurrences across tweets, we can “describe the extent to which words are prominent in creating a structural pattern of coherence in a text” (Corman et al., 2002, p. 179) by locating the “in-between” positions of words in a co-occurrence network. In particular, analyzing word co-occurrences seeks to find words (or bigrams) that link conceptual clusters together and thus help organize the whole; it allows us to survey rich and complex structures in word networks, in that the influence of certain words in tweets depends not just on their frequency but also on their location in the semantic network structure. That is to say, this method is structurally sensitive to a semantic network because “it accounts for all likely chains of association among words that make texts and conversation coherent” (Corman et al., 2002, p. 181).

So we will analyze the relationships of words that tend to co-occur within the same tweet, even if they do not occur next to each other. The tidy data format is a useful structure for comparing between variables or grouping by rows, but it can be challenging to compare between rows; for example, to count the number of times two words appear within the same tweet, or to see how correlated they are. Most operations for finding pairwise counts or correlations need to turn the data into a wide matrix first.

Figure: The philosophy behind the widyr package, which can perform operations such as counting and correlating on pairs of values in a tidy dataset. The widyr package first “casts” a tidy dataset into a wide matrix, performs an operation such as a correlation on it, then re-tidies the result. (Figure by Julia Silge)

We will examine some of the ways tidy text can be turned into a wide matrix, but in this case it is not necessary. The widyr package makes operations such as computing counts and correlations easy, by simplifying the pattern of “widen data, perform an operation, then re-tidy data” as seen in the above figure. We will focus on a set of functions that make pairwise comparisons between groups of observations (for example between tweets).
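To make this “widen, operate, re-tidy” pattern concrete, here is a tiny sketch on hypothetical data (the doc and word values below are made up, not taken from our tweets):

library(widyr)

toy <- tibble(
  doc  = c("d1", "d1", "d1", "d2", "d2", "d3"),
  word = c("covid", "vaccine", "mask", "covid", "mask", "vaccine")
)

# count how often each pair of words appears within the same doc
toy %>% 
  pairwise_count(word, doc, sort = TRUE)
# "covid" and "mask" co-occur in d1 and d2 (n = 2); the other pairs co-occur once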

Counting and correlating among tweets

From now on, we will analyze what words tend to appear (co-occur) within the same tweet.

covid_tweets_words <- covid_tweets %>% 
  filter(lang=="en") %>% 
  filter(country %in% c("United States","India","United Kingdom","Canada")) %>% 
  mutate(text = str_replace_all(text, "RT", " ")) %>% 
  mutate(text = sapply(text, replace_non_ascii)) %>% 
  mutate(text = sapply(text, replace_contraction)) %>% 
  mutate(text = sapply(text, replace_html)) %>% 
  mutate(text = sapply(text, replace_url)) %>% 
  unnest_tweets(word, text) %>% 
  filter(!word %in% stopwords()) %>% 
  filter(!str_detect(word, "[^[:word:]#@]")) %>% 
  dplyr::select(country, status_id, word)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid_tweets_words
## # A tibble: 119,035 x 3
##    country       status_id           word            
##    <chr>         <chr>               <chr>           
##  1 United States 1253365816281554946 @everythingoes99
##  2 United States 1253365816281554946 @auntvireen     
##  3 United States 1253365816281554946 @johnrobertsfox 
##  4 United States 1253365816281554946 wtf             
##  5 United States 1253365816281554946 talking         
##  6 United States 1253365816281554946 death           
##  7 United States 1253365816281554946 numbers         
##  8 United States 1253365816281554946 taken           
##  9 United States 1253365816281554946 cdc             
## 10 United States 1253365816281554946 includes        
## # ... with 119,025 more rows

One useful function from widyr is pairwise_count(). The prefix pairwise_ means the result will have one row for each pair of words in the word column (variable). This lets us count common pairs of words co-appearing within the same tweet, as identified by the status_id column (variable).

#install.packages("widyr")
library(widyr)

word_pairs <- covid_tweets_words %>% 
  pairwise_count(word, status_id, sort = TRUE) # count words co-occurring within status_id

word_pairs
## # A tibble: 1,575,558 x 3
##    item1        item2            n
##    <chr>        <chr>        <dbl>
##  1 19           covid          464
##  2 covid        19             464
##  3 #coronavirus #covid19       415
##  4 #covid19     #coronavirus   415
##  5 #covid19     can            298
##  6 can          #covid19       298
##  7 people       #covid19       244
##  8 #covid19     people         244
##  9 people       covid19        215
## 10 us           #covid19       215
## # ... with 1,575,548 more rows

Notice that while the input has one row for each pair of a tweet and a word, the output has one row for each pair of words. This is the result of counting pairs of words that appear together in the same tweet, i.e., that share the same status_id. The output is also in a tidy format, but with a very different structure that we can use to answer new questions.

For example, we can see that the most common pair of words in tweets about COVID-19 is “covid” and “19”, which is easily expected. We can also easily find the words that most often occur with “covid”:

word_pairs %>% 
  filter(item1 == "covid")
## # A tibble: 4,232 x 3
##    item1 item2        n
##    <chr> <chr>    <dbl>
##  1 covid 19         464
##  2 covid can         71
##  3 covid #covid19    57
##  4 covid people      57
##  5 covid just        41
##  6 covid get         39
##  7 covid now         39
##  8 covid time        33
##  9 covid like        33
## 10 covid know        33
## # ... with 4,222 more rows

Pairwise correlation

Pairs like “covid” and “people” are among the most common co-occurring words, but that’s not particularly meaningful, since both are also among the most common individual words.

covid_tweets_words %>% 
  count(word, sort=T)
## # A tibble: 26,805 x 2
##    word             n
##    <chr>        <int>
##  1 #covid19      3447
##  2 covid19       2462
##  3 #coronavirus  1134
##  4 can            674
##  5 people         643
##  6 covid          564
##  7 us             535
##  8 19             496
##  9 now            450
## 10 new            446
## # ... with 26,795 more rows

So we may instead want to examine the correlation among words, which indicates how often they appear together relative to how often they appear separately.

Here I introduce you to the phi coefficient, a common measure for binary correlation. The focus of the phi coefficient is how much more likely it is that either both word X and word Y appear, or neither do, than that one appears without the other.

Consider the following table:

             Has word Y   No word Y   Total
Has word X   n11          n10         n1.
No word X    n01          n00         n0.
Total        n.1          n.0         n

For example, n11 represents the number of tweets where both word X and word Y appear, n00 the number where neither appears, and n10 and n01 the cases where one appears without the other. In terms of this table, the phi coefficient is:
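phi = (n11 * n00 - n10 * n01) / sqrt(n1. * n0. * n.1 * n.0)

where n1. = n11 + n10 is the number of tweets containing word X, and the other marginal totals are defined analogously from the table above.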

The phi coefficient is equivalent to the Pearson correlation, which you may have heard of elsewhere, when it is applied to binary data.

The pairwise_cor() function in widyr allows us to find the phi coefficient between words based on how often they appear in the same tweet. Its syntax is similar to pairwise_count().

# we need to filter for at least relatively common words first

word_cors <- covid_tweets_words %>% 
  group_by(word) %>% 
  filter(n() >= 10) %>% 
  pairwise_cor(word, status_id, sort = TRUE)

word_cors
## # A tibble: 3,646,190 x 3
##    item1           item2           correlation
##    <chr>           <chr>                 <dbl>
##  1 #bozeman        @kbzk                    1.
##  2 #gallatincounty @kbzk                    1.
##  3 @kbzk           #bozeman                 1.
##  4 #gallatincounty #bozeman                 1.
##  5 @kbzk           #gallatincounty          1.
##  6 #bozeman        #gallatincounty          1.
##  7 #jog            #motavation              1 
##  8 #bulid          #motavation              1 
##  9 #suppo          #motavation              1 
## 10 #honor          #motavation              1 
## # ... with 3,646,180 more rows

This output format is helpful for exploration. For example, we could find the words or hashtags most correlated with a word like “new” or “positive” using a filter operation.

word_cors %>% 
  filter(item1 == "new")
## # A tibble: 1,909 x 3
##    item1 item2     correlation
##    <chr> <chr>           <dbl>
##  1 new   york            0.546
##  2 new   pet             0.381
##  3 new   cats            0.336
##  4 new   two             0.293
##  5 new   positive        0.244
##  6 new   test            0.232
##  7 new   cases           0.148
##  8 new   #healthy        0.137
##  9 new   legendary       0.135
## 10 new   #citylife       0.135
## # ... with 1,899 more rows
word_cors %>% 
  filter(item1 == "positive")
## # A tibble: 1,909 x 3
##    item1    item2    correlation
##    <chr>    <chr>          <dbl>
##  1 positive pet            0.540
##  2 positive cats           0.515
##  3 positive test           0.428
##  4 positive two            0.414
##  5 positive york           0.402
##  6 positive tested         0.319
##  7 positive new            0.244
##  8 positive cases          0.132
##  9 positive negative       0.110
## 10 positive tests          0.109
## # ... with 1,899 more rows
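As a sanity check, we can compute the phi coefficient for one pair of words by hand from the 2x2 counts. The sketch below (my own illustration, not part of the original lecture code) uses the covid_tweets_words data from above and the pair “new” and “york” as an example; the result should be close to, though not necessarily identical to, the pairwise_cor() value, because pairwise_cor() only sees the tweets that remain after the n() >= 10 filter.

tweet_word <- covid_tweets_words %>% distinct(status_id, word)

n_total <- n_distinct(tweet_word$status_id)                           # tweets with at least one word
new_ids  <- tweet_word %>% filter(word == "new")  %>% pull(status_id)
york_ids <- tweet_word %>% filter(word == "york") %>% pull(status_id)

n11 <- length(intersect(new_ids, york_ids))  # tweets with both words
n10 <- length(setdiff(new_ids, york_ids))    # "new" without "york"
n01 <- length(setdiff(york_ids, new_ids))    # "york" without "new"
n00 <- n_total - n11 - n10 - n01             # tweets with neither word

(n11 * n00 - n10 * n01) / 
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))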

To analyze the semantic network of tweets about COVID-19, we can also visualize the co-occurrences among words. As one common visualization, we can arrange the words into a network, or “graph.” Here we refer to a “graph” not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object because it has three variables:

  • from: the node an edge is coming from
  • to: the node an edge is going towards
  • weight: A numeric value associated with each edge (link)

Here we will use the igraph package, which provides many powerful functions for manipulating and analyzing networks. One way to create an igraph object from our tidy data is the graph_from_data_frame() function, which takes a data frame of edges with columns for “from”, “to”, and edge attributes or weights (in this case, the correlation):

#install.packages("igraph")
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
# filter for only relatively strong co-occurrences
word_cors %>% 
  filter(correlation > .5) %>% 
  graph_from_data_frame()
## IGRAPH bd09639 DN-- 108 646 -- 
## + attr: name (v/c), correlation (e/n)
## + edges from bd09639 (vertex names):
##  [1] #bozeman       ->@kbzk           #gallatincounty->@kbzk          
##  [3] @kbzk          ->#bozeman        #gallatincounty->#bozeman       
##  [5] @kbzk          ->#gallatincounty #bozeman       ->#gallatincounty
##  [7] #jog           ->#motavation     #bulid         ->#motavation    
##  [9] #suppo         ->#motavation     #honor         ->#motavation    
## [11] #succeed       ->#motavation     #boss          ->#motavation    
## [13] #selfmade      ->#motavation     #motavation    ->#jog           
## [15] #bulid         ->#jog            #suppo         ->#jog           
## + ... omitted several edges

The igraph package has plotting functions built in, but they are not what the package is designed for, so many other packages have developed visualization methods for graph objects. We recommend the ggraph package (Pedersen 2017), because it implements these visualizations in terms of the grammar of graphics, which we are already familiar with from ggplot2.

We can convert an igraph object into a ggraph object with the ggraph() function, after which we add layers to it, much as layers are added in ggplot2. For example, for a basic graph we need to add three layers: nodes, edges (links), and text.

#install.packages("ggraph")
library(ggraph)

set.seed(200608) # Fixing the way of plotting nodes onto a network 

word_cors %>% 
  filter(correlation > .5) %>%
  graph_from_data_frame() %>% 
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "plum4", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() # white background

This network of word co-occurrences visualizes some details of the text structure, and we can observe some distinctive phrases or word associations that make up important discourses around COVID-19. For example, there is a cluster composed of “rs15000”, “package”, “approves”, “cabinet”, and “crore”, which reflects the Indian news that the Union Cabinet approved the Rs 15,000 crore ‘India COVID-19 Emergency Response and Health System Preparedness Package’. We can also see another cluster of words, “new”, “york”, “two”, “cats”, “pet”, “test”, and “positive”, suggesting the news that two pet cats in New York tested positive for COVID-19.

So as we can see, constructing a semantic network of texts using word co-occurrences reveals conceptual clusters of words and thus helps organize the whole discourse around an issue.

We conclude with a few polishing operations to make a better looking graph:

  • We add the edge_alpha aesthetic to the link layer to make links transparent based on how correlated a pair of words is to each other

  • We tinker with the options to the node layer to make the nodes more attractive (larger, plum colored points)

  • We add a theme that’s useful for plotting networks, theme_void()

set.seed(200608)
word_cors %>% 
  filter(correlation > .5) %>% 
  graph_from_data_frame() %>% 
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "plum4", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE) + # Annotate nodes with text (name of each node) and text labels will be repelled from each other to avoid overlapping
  theme_void()

Wrapping up…

Today we looked at how the tidy text approach is useful not only for individual words, but also for exploring the relationships and connections between words. Such relationships can involve n-grams, which enable us to see which words tend to appear after others, or co-occurrences and correlations, for words that appear in proximity to each other. This class also demonstrated the ggraph package for visualizing both of these types of relationships as networks. These network visualizations are a flexible tool for exploring relationships, and will play an important role in your second major assignment.

Your second major assignment will be announced this week, so please keep paying attention to the course announcement on our eclass webpage.