Today’s lecture is based on Chapter 4 of our textbook, “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson.

Information retrieval by identifying word co-occurrences

Last week we treated words as individual units of language and considered their counts as a way of revealing the important issues in the tweets about COVID-19 we gathered. In particular, we applied TF-IDF to measure how important a word was to the group of tweets from the U.S. within a collection of all the tweets from the U.S., the U.K., India, and Canada. We found the TF-IDF measure very useful for identifying what was particularly important to the U.S. compared with the other countries, because it decreases the weight of commonly used words and increases the weight of frequent words unique to the U.S.

However, many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately or which words tend to co-occur within the same documents or the same linguistic context. This is a way of revealing which words are important in texts by analyzing the extent to which they are linked to each other. Measuring word co-occurrences matters because the meaning of a word is often determined by the words with which it frequently appears.

Today, we’ll explore some of the methods for calculating and visualizing relationships between words in our tweet text dataset. This includes the token = "ngrams" argument in the unnest_tokens() function, which tokenizes by pairs of adjacent words rather than by individual ones. We’ll also introduce two new packages: ggraph, which extends ggplot2 to construct network plots, and widyr, which calculates pairwise correlations and distances within a tidy data frame. Together these expand our toolbox for exploring text within the tidy data framework.

Tokenizing by n-gram

We’ve been using the unnest_tokens() function to tokenize by word, which is useful for the kinds of sentiment and frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationship between X and Y.

We do this by adding the token = "ngrams" option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of two consecutive words, often called “bigrams”.
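To see what this produces, here is a minimal sketch on a made-up two-tweet tibble (the text is hypothetical, for illustration only):

library(dplyr)
library(tidytext)

# A hypothetical two-tweet data frame
toy <- tibble(id = 1:2,
              text = c("stay home and stay safe",
                       "wash your hands often"))

# Tokenize into bigrams: each output row holds one pair of adjacent words
toy %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# The first tweet yields "stay home", "home and", "and stay", "stay safe"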

So we tokenize our COVID-19 tweet dataset into bigrams as follows.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidytext)
library(textclean)

load("covid_tweets_423.RData")
covid_tweets[1:5,]
## # A tibble: 5 x 9
##   user_id  status_id created_at          screen_name text   lang  country    lat
##   <chr>    <chr>     <dttm>              <chr>       <chr>  <chr> <chr>    <dbl>
## 1 4794913~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ "@eve~ en    United~  36.0 
## 2 1694802~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ "Plea~ en    United~  36.9 
## 3 2155830~ 12533658~ 2020-04-23 16:51:09 KOROGLU_BA~ "@Aya~ tr    Azerba~  40.2 
## 4 7445974~ 12533657~ 2020-04-23 16:51:05 FoodFocusSA "Pres~ en    South ~ -26.1 
## 5 1558777~ 12533657~ 2020-04-23 16:51:01 opcionsecu~ "#ATE~ es    Ecuador  -1.67
## # ... with 1 more variable: lng <dbl>
covid_bigrams <- covid_tweets %>% 
  filter(lang=="en") %>% 
  filter(country %in% c("United States","India","United Kingdom","Canada")) %>% 
  mutate(text = str_replace_all(text, "RT", " ")) %>% 
  mutate(text = sapply(text, replace_non_ascii)) %>% 
  mutate(text = sapply(text, replace_contraction)) %>% 
  mutate(text = sapply(text, replace_html)) %>% 
  mutate(text = sapply(text, replace_url)) %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  arrange(created_at)
## Warning: Outer names are only allowed for unnamed scalar atomic inputs
covid_bigrams
## # A tibble: 189,344 x 9
##    user_id  status_id  created_at          screen_name lang  country   lat   lng
##    <chr>    <chr>      <dttm>              <chr>       <chr> <chr>   <dbl> <dbl>
##  1 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  2 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  3 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  4 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  5 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  6 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  7 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  8 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  9 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## 10 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## # ... with 189,334 more rows, and 1 more variable: bigram <chr>

As you can see from the bigram column in the resulting covid_bigrams data set, the data structure is still a variation of the tidy text format: it is structured as one token per row. Here each token is a bigram consisting of two consecutive words in a tweet. The other columns contain metadata that characterize each tweet, such as user_id (who posted the tweet) and created_at (when it was posted on Twitter).

Counting and filtering n-grams

Once we tokenize tweet texts by bigrams in the tidy text format, our usual tidy tools apply equally well to n-gram analysis. We can examine the most common bigrams using dplyr’s count():

covid_bigrams %>% 
  count(bigram, sort = TRUE)
## # A tibble: 108,833 x 2
##    bigram       n
##    <chr>    <int>
##  1 covid 19  3029
##  2 it is      519
##  3 in the     513
##  4 of the     447
##  5 i am       422
##  6 do not     344
##  7 we are     284
##  8 on the     283
##  9 of covid   278
## 10 to the     267
## # ... with 108,823 more rows

Not surprisingly, the words ‘covid’ and ‘19’ form the most common bigram, since the tweets were collected only when they contained the hashtag “#covid-19”. During tokenization the pound sign “#” was removed and the hyphen “-” was replaced by a blank, so our text pre-processing turned the hashtag “#covid-19” into the two consecutive words “covid 19”. It is therefore expected that “covid 19” is the most frequent bigram.
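You can verify this behavior on a one-line toy example (the text is hypothetical): the n-gram tokenizer in unnest_tokens() lowercases the text and strips punctuation such as “#” and “-” on its own.

library(dplyr)
library(tidytext)

# Hypothetical one-tweet example showing how "#covid-19" is tokenized
tibble(text = "#covid-19 cases are rising") %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# The "#" and "-" are stripped during tokenization, yielding the bigrams
# "covid 19", "19 cases", "cases are", "are rising"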

But let’s take a look at the other frequent bigrams. Many of the most common bigrams are pairs of very common (and uninteresting) words such as in the, it is, i am, of the, do not, on the, we are, this is, and is a: what we call “stop words”. These bigrams are not of interest because they do not provide any meaningful information about the COVID-19 issue in the English-speaking countries. What we need to do, then, is remove such stop-word bigrams so that more content-oriented bigrams appear in the top list. How can we handle bigrams containing stop words?

At this point, I’d like to introduce the function separate() from the tidyr package. This function splits a column into multiple columns based on a delimiter (the character by which a bigram is separated). It allows us to separate each bigram into two columns, “word1” and “word2”, at which point we can remove cases where either is a stop word.
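As a minimal sketch with hypothetical input (tidyr is also attached as part of the tidyverse), separate() works like this:

library(dplyr)
library(tidyr)

# Hypothetical bigrams, split on the blank delimiter
tibble(bigram = c("covid 19", "stay safe")) %>% 
  separate(bigram, c("word1", "word2"), sep = " ")
# Result: word1 = "covid", "stay"; word2 = "19", "safe"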

library(tidyr)

covid_bigrams
## # A tibble: 189,344 x 9
##    user_id  status_id  created_at          screen_name lang  country   lat   lng
##    <chr>    <chr>      <dttm>              <chr>       <chr> <chr>   <dbl> <dbl>
##  1 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  2 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  3 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  4 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  5 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  6 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  7 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  8 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  9 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## 10 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## # ... with 189,334 more rows, and 1 more variable: bigram <chr>
bigrams_separated <- covid_bigrams %>% 
  separate(bigram, c("word1", "word2"), sep = " ") 

bigrams_separated
## # A tibble: 189,344 x 10
##    user_id  status_id  created_at          screen_name lang  country   lat   lng
##    <chr>    <chr>      <dttm>              <chr>       <chr> <chr>   <dbl> <dbl>
##  1 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  2 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  3 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  4 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  5 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  6 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  7 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  8 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
##  9 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## 10 3071608~ 125306760~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6 -89.9
## # ... with 189,334 more rows, and 2 more variables: word1 <chr>, word2 <chr>

With the separator argument set to a blank " ", the function separate() turns the single character column "bigram" into the two columns "word1" and "word2".

library(stopwords)
stopwords()
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"       "will"
bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stopwords()) %>% 
  filter(!word2 %in% stopwords())

bigrams_filtered
## # A tibble: 71,351 x 10
##    user_id  status_id created_at          screen_name lang  country   lat    lng
##    <chr>    <chr>     <dttm>              <chr>       <chr> <chr>   <dbl>  <dbl>
##  1 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  2 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  3 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  4 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  5 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  6 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  7 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  8 1104150~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en    United~  37.8 -122. 
##  9 1104150~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en    United~  37.8 -122. 
## 10 1104150~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en    United~  37.8 -122. 
## # ... with 71,341 more rows, and 2 more variables: word1 <chr>, word2 <chr>

The filter() function is applied to the “word1” and “word2” columns so that any element matching the stopwords() vector is removed from the bigrams_separated data set.

Now we can count combinations of “word1” and “word2”, where neither column contains a stop word.

bigrams_count <- bigrams_filtered %>% 
  count(word1, word2, sort=TRUE)
bigrams_count
## # A tibble: 54,007 x 3
##    word1       word2           n
##    <chr>       <chr>       <int>
##  1 covid       19           3029
##  2 covid1      219           220
##  3 new         york          142
##  4 covid19     coronavirus   128
##  5 19          pandemic      104
##  6 let         us             95
##  7 coronavirus covid19        92
##  8 social      distancing     89
##  9 test        positive       81
## 10 stay        safe           77
## # ... with 53,997 more rows

We can now see some important word pairs that provide meaningful context for COVID-19 in these English-speaking countries, and the most common pairs of consecutive words are clearly related to the COVID-19 issue.

In other analyses, we may want to work with the words recombined into bigrams. In this case, we can use the function unite() from the tidyr package, which is the inverse of separate(): it recombines the two columns “word1” and “word2” into a single “bigram” column. The sequence of separate(), filter(), count(), and unite() thus allows us to find the most common bigrams not containing stop words.

bigrams_united <- bigrams_filtered %>% 
  unite(bigram, word1, word2, sep = " ") # The name of the new column comes first (resulting from the combination of the following two column names)

bigrams_united
## # A tibble: 71,351 x 9
##    user_id  status_id created_at          screen_name lang  country   lat    lng
##    <chr>    <chr>     <dttm>              <chr>       <chr> <chr>   <dbl>  <dbl>
##  1 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  2 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  3 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  4 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  5 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  6 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  7 3071608~ 12530676~ 2020-04-22 21:06:11 kelsey_Kea~ en    United~  32.6  -89.9
##  8 1104150~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en    United~  37.8 -122. 
##  9 1104150~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en    United~  37.8 -122. 
## 10 1104150~ 12530676~ 2020-04-22 21:06:16 PoopScoopSF en    United~  37.8 -122. 
## # ... with 71,341 more rows, and 1 more variable: bigram <chr>

Analyzing bigrams

Our one-bigram-per-row format is helpful for exploratory analyses of the text. For example, we can extract important bigrams in the context of the U.S., compared with other English-speaking countries like the U.K., India, and Canada. A bigram can be treated as a term in a tweet in the same way that we treated individual words. For instance, we can look at the tf-idf of bigrams across the English-speaking countries.

bigram_tf_idf <- bigrams_united %>%
  count(country, bigram) %>%
  bind_tf_idf(bigram, country, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
## # A tibble: 56,560 x 6
##    country        bigram               n      tf   idf  tf_idf
##    <chr>          <chr>            <int>   <dbl> <dbl>   <dbl>
##  1 India          namo app            45 0.00363 1.39  0.00504
##  2 India          via namo            45 0.00363 1.39  0.00504
##  3 Canada         cdnpoli uspoli       9 0.00160 1.39  0.00222
##  4 Canada         covid19 toronto      7 0.00125 1.39  0.00173
##  5 United Kingdom cent registered     14 0.00123 1.39  0.00171
##  6 Canada         term care           13 0.00232 0.693 0.00161
##  7 United Kingdom george's day        13 0.00115 1.39  0.00159
##  8 India          aarogya setu        14 0.00113 1.39  0.00157
##  9 India          narendra modi       14 0.00113 1.39  0.00157
## 10 Canada         british columbia     6 0.00107 1.39  0.00148
## # ... with 56,550 more rows

Let’s focus only on bigrams made up of alphabetic characters.

bigram_tf_idf <- bigrams_separated %>% 
  filter(!word1 %in% stopwords()) %>% 
  filter(!word2 %in% stopwords()) %>% 
  filter(!str_detect(word1, "[^[:alpha:]]")) %>% 
  filter(!str_detect(word2, "[^[:alpha:]]")) %>% 
  unite(bigram, word1, word2, sep = " ") %>% 
  count(country, bigram) %>%
  bind_tf_idf(bigram, country, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
## # A tibble: 45,157 x 6
##    country        bigram               n      tf   idf  tf_idf
##    <chr>          <chr>            <int>   <dbl> <dbl>   <dbl>
##  1 India          namo app            45 0.00503 1.39  0.00697
##  2 India          via namo            45 0.00503 1.39  0.00697
##  3 Canada         cdnpoli uspoli       9 0.00214 1.39  0.00296
##  4 United Kingdom cent registered     14 0.00162 1.39  0.00225
##  5 India          aarogya setu        14 0.00156 1.39  0.00217
##  6 India          narendra modi       14 0.00156 1.39  0.00217
##  7 Canada         term care           13 0.00308 0.693 0.00214
##  8 India          crore package       13 0.00145 1.39  0.00201
##  9 Canada         british columbia     6 0.00142 1.39  0.00197
## 10 India          cabinet approves    12 0.00134 1.39  0.00186
## # ... with 45,147 more rows

These tf-idf values can be visualized within each country, just as we did for individual words.

bigrams_separated %>% 
  filter(!word1 %in% stopwords()) %>% 
  filter(!word2 %in% stopwords()) %>% 
  filter(!str_detect(word1, "[^[:alpha:]]")) %>% 
  filter(!str_detect(word2, "[^[:alpha:]]")) %>% 
  unite(bigram, word1, word2, sep = " ") %>% 
  count(country, bigram) %>%
  arrange(desc(n)) %>% 
  group_by(country) %>% 
  top_n(10, n) %>% 
  ungroup %>% 
  mutate(bigram = reorder_within(bigram, n, country)) %>% # To order the words by tf within each country
  ggplot(aes(bigram, n, fill = country)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() + # This line removes the separator and country name as a suffix
  labs(x = NULL, y = "TF") +
  facet_wrap(~country, ncol=2, scales="free") +
  coord_flip()

bigram_tf_idf %>% 
  arrange(desc(tf_idf)) %>% 
  group_by(country) %>% 
  top_n(10, tf_idf) %>% 
  ungroup %>% 
  mutate(bigram = reorder_within(bigram, tf_idf, country)) %>% # To order the bigrams by tf-idf within each country
  ggplot(aes(bigram, tf_idf, fill = country)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() + # This line removes the separator and country name as a suffix
  labs(x = NULL, y = "TF-IDF") +
  facet_wrap(~country, ncol=2, scales="free") +
  coord_flip()

Much as we discovered in the previous class, the units that distinguish each country are almost exclusively names of people, places, or organizations.

There are advantages and disadvantages to examining the tf-idf of bigrams rather than individual words. Pairs of consecutive words may capture structure that is not present when one is just counting single words, and may provide context that makes tokens more understandable (for example, “positive cases”, in the United States, is more informative than “positive”). However, the per-bigram counts are also sparser: a typical two-word pair is rarer than either of its component words. Thus, bigrams can be especially useful when you have a very large text dataset.
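Nothing restricts us to pairs of words, by the way: setting n = 3 tokenizes by trigrams, sequences of three consecutive words. Here is a sketch following the same separate-then-filter pattern as above (it omits the textclean pre-processing steps for brevity, so treat it as illustrative):

# Tokenize into trigrams (n = 3) and keep only stop-word-free triples
covid_tweets %>% 
  filter(lang == "en") %>% 
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% 
  filter(!word1 %in% stopwords(),
         !word2 %in% stopwords(),
         !word3 %in% stopwords()) %>% 
  count(word1, word2, word3, sort = TRUE)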

Visualizing a network of co-occurrences with ggraph

A bigram is a pair of two consecutive words in a text. Analyzing bigrams allows us to find the most common two-word pairs, which provide context that makes tokens more understandable. But sometimes we want to analyze the relationships among all of the words used in the same tweet.

So here I’d like to introduce the concept of word co-occurrences. By “word co-occurrences,” we refer to the incidence of any pair of words appearing together in the same textual context, such as a tweet. The relationship strength of a pair of words is weighted by counting their co-occurrences.

Word co-occurrences are useful for analyzing a semantic network of texts, given the extent to which influential words bridge between other words in a single utterance. By analyzing word co-occurrences among tweets, we can “describe the extent to which words are prominent in creating a structural pattern of coherence in a text” (Corman et al., 2002, p. 179) by locating the “in-between” position of words in a co-occurrence network. In particular, analyzing word co-occurrences seeks to find words (or bigrams) that link conceptual clusters together and thus help organize the whole; it allows for surveying rich and complex structures in word networks, to the extent that the influence of certain words in tweets depends not just on their frequency but also on their location in the semantic network structure. That is to say, this method is structurally sensitive to a semantic network because “it accounts for all likely chains of association among words that make texts and conversation coherent” (Corman et al., 2002, p. 181).
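To make this “in-between” idea concrete, betweenness centrality counts how many shortest paths between other words pass through a given word. Here is a minimal sketch on a hypothetical toy edge list, using the igraph package:

library(igraph)

# Hypothetical co-occurrence edges between words
edges <- data.frame(
  from = c("covid", "covid", "pandemic", "lockdown"),
  to   = c("pandemic", "cases", "lockdown", "economy"))

g <- graph_from_data_frame(edges, directed = FALSE)

# Words in bridging positions score highest: here "pandemic" links
# the "covid"/"cases" side of the network to "lockdown"/"economy"
betweenness(g)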

So we will analyze the relationships of words that tend to co-occur within the same tweet, even if they do not occur next to each other. The tidy data format is a useful structure for comparing between variables or grouping by rows, but it can be challenging to compare between rows: for example, to count the number of times that two words appear within the same tweet, or to measure how correlated they are. Most operations for finding pairwise counts or correlations need to turn the data into a wide matrix first.

[Figure] The philosophy behind the widyr package, which can perform operations such as counting and correlating on pairs of values in a tidy dataset: widyr first ‘casts’ a tidy dataset into a wide matrix, performs an operation such as a correlation on it, then re-tidies the result. (Caption by Julia Silge)

We will examine some of the ways tidy text can be turned into a wide matrix, but in this case it is not necessary. The widyr package makes operations such as computing counts and correlations easy by simplifying the pattern of “widen data, perform an operation, then re-tidy data,” as described in the figure caption above. We will focus on a set of functions that make pairwise comparisons between groups of observations (for example, between tweets).
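As a preview of this workflow, here is a hedged sketch (not the final analysis; the threshold of 25 co-occurrences is an arbitrary choice made only to keep the plot readable): widyr’s pairwise_count() counts how often two words appear in the same tweet, identified here by status_id, and ggraph draws the resulting co-occurrence network.

library(widyr)
library(igraph)
library(ggraph)

# Count word pairs co-occurring within the same tweet (status_id)
word_pairs <- covid_tweets %>% 
  filter(lang == "en") %>% 
  unnest_tokens(word, text) %>% 
  filter(!word %in% stopwords()) %>% 
  pairwise_count(word, status_id, sort = TRUE)

# Keep frequent pairs and draw the network
word_pairs %>% 
  filter(n > 25) %>%   # arbitrary threshold for illustration
  graph_from_data_frame() %>% 
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE)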