library(tidyverse)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
library(Matrix)
library(checkdown)

What is Word Co-occurrence Analysis?

Word co-occurrence analysis looks at which words tend to show up together within the same unit of text (a sentence, tweet, headline, etc.). If two words appear together more often than we’d expect by chance, that tells us something about how people talk about a topic — which brands get paired with which adjectives, which features get paired with which products, and so on.
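To make “more often than we’d expect by chance” concrete, one common yardstick is lift: the observed number of text units containing both words, divided by the number we would expect if the two words occurred independently. The workflow below does not compute lift; it is our illustration, not part of the original pipeline. Here is a minimal base-R sketch. The four documents and the helper name lift() are made up for the example.

```r
# Hypothetical mini-corpus: each element holds the words of one text unit.
docs <- list(
  c("sweet", "red", "apple"),
  c("sweet", "red", "berry"),
  c("sour", "green", "lime"),
  c("sweet", "yellow", "banana")
)

# Illustrative helper (not from any package): observed / expected co-occurrence.
lift <- function(docs, a, b) {
  n_docs   <- length(docs)
  has_a    <- vapply(docs, function(w) a %in% w, logical(1))
  has_b    <- vapply(docs, function(w) b %in% w, logical(1))
  observed <- sum(has_a & has_b)
  expected <- n_docs * mean(has_a) * mean(has_b)  # count expected under independence
  observed / expected
}

lift(docs, "sweet", "red")    # above 1: together more often than chance predicts
lift(docs, "sweet", "green")  # 0: the two words never share a unit
```

A lift near 1 means the pair co-occurs about as often as chance alone would produce. Raw counts, which the rest of this walkthrough uses, are simpler but favor words that are frequent overall.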

The basic workflow is always the same:

  1. Break text into individual words (tokenize)
  2. Remove common “stop words” (the, and, is, etc.) that don’t carry meaning
  3. Count how often pairs of words appear together in the same unit of text (a sentence or comment)
  4. Arrange those counts into a word-by-word co-occurrence matrix
  5. Visualize the strongest pairs as a network graph

We’ll walk through this twice today: first with a tiny, easy-to-follow fruit example, then with a real-world example using investor comments about SpaceX stock.


Part 1: Warm-Up Example — Fruits and Their Features

Before working with messy real-world text, it helps to see the mechanics on something simple where we already know what the answer should look like.

fruit_sentences <- tibble(
  id = 1:5,
  text = c(
    "The apple was crisp, sweet, and red.",
    "Bananas are soft, sweet, and yellow.",
    "The lemon was sour, bright, and yellow.",
    "Strawberries are sweet, red, and juicy.",
    "The lime was sour, small, and green."
  )
)

fruit_sentences
## # A tibble: 5 × 2
##      id text                                   
##   <int> <chr>                                  
## 1     1 The apple was crisp, sweet, and red.   
## 2     2 Bananas are soft, sweet, and yellow.   
## 3     3 The lemon was sour, bright, and yellow.
## 4     4 Strawberries are sweet, red, and juicy.
## 5     5 The lime was sour, small, and green.

Notice the pattern we built in on purpose: “sweet” appears with “red” in two sentences and with “yellow” in one, while “sour” appears with “yellow” and with “green” once each. A good co-occurrence analysis should recover this pattern automatically, without us telling it the answer.

Step 1: Tokenize and Remove Stop Words

fruit_words <- fruit_sentences %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

fruit_words
## # A tibble: 19 × 2
##       id word        
##    <int> <chr>       
##  1     1 apple       
##  2     1 crisp       
##  3     1 sweet       
##  4     1 red         
##  5     2 bananas     
##  6     2 soft        
##  7     2 sweet       
##  8     2 yellow      
##  9     3 lemon       
## 10     3 sour        
## 11     3 bright      
## 12     3 yellow      
## 13     4 strawberries
## 14     4 sweet       
## 15     4 red         
## 16     4 juicy       
## 17     5 lime        
## 18     5 sour        
## 19     5 green
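It is worth checking what the stop-word filter actually removed. In the output above, “small” from sentence 5 is absent, which suggests it is on the stop_words list even though it reads like a fruit feature. The sketch below shows one way to inspect the discarded tokens, assuming the same tidytext stop_words table; the two-sentence input is a subset of the fruit data.

```r
library(dplyr)
library(tibble)
library(tidytext)

sentences <- tibble(
  id   = 1:2,
  text = c("The apple was crisp, sweet, and red.",
           "The lime was sour, small, and green.")
)

tokens <- sentences %>% unnest_tokens(word, text)

# semi_join() keeps only tokens that DO match the stop-word list,
# i.e. exactly the rows that anti_join() throws away.
removed <- tokens %>% semi_join(stop_words, by = "word")
kept    <- tokens %>% anti_join(stop_words, by = "word")

removed %>% count(word, sort = TRUE)
kept
```

If a word you care about shows up in removed, you can drop it from the stop-word table with a filter() before the anti_join().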

Step 2: Count Word Pairs Within Each Sentence

fruit_pairs <- fruit_words %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

fruit_pairs
## # A tibble: 26 × 3
##    item1   item2       n
##    <chr>   <chr>   <dbl>
##  1 sweet   red         2
##  2 apple   crisp       1
##  3 apple   sweet       1
##  4 crisp   sweet       1
##  5 apple   red         1
##  6 crisp   red         1
##  7 sweet   bananas     1
##  8 sweet   soft        1
##  9 bananas soft        1
## 10 sweet   yellow      1
## # ℹ 16 more rows

pairwise_count() from the widyr package counts, for every pair of words, how many sentences (id) they appear in together. The upper = FALSE argument keeps only one direction of each pair (apple–red instead of both apple–red and red–apple) so the table is easier to read.
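To see what that count means mechanically, here is a rough base-R equivalent on two of the fruit sentences. It is only a sketch of the idea (list every unordered word pair within each id, then tally), not a description of how widyr implements pairwise_count() internally.

```r
# Tokens from fruit sentences 1 and 4, after stop-word removal.
words <- data.frame(
  id   = c(1, 1, 1, 1, 4, 4, 4, 4),
  word = c("apple", "crisp", "sweet", "red",
           "strawberries", "sweet", "red", "juicy"),
  stringsAsFactors = FALSE
)

# For each sentence, list every unordered pair of distinct words.
pairs_by_doc <- lapply(split(words$word, words$id), function(w) {
  t(combn(sort(unique(w)), 2))  # one row per pair, alphabetical within the pair
})
all_pairs <- do.call(rbind, pairs_by_doc)

# Tally how many sentences contain each pair.
pair_counts <- as.data.frame(
  table(item1 = all_pairs[, 1], item2 = all_pairs[, 2]),
  stringsAsFactors = FALSE
)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]
pair_counts <- pair_counts[order(-pair_counts$Freq), ]
pair_counts
```

Sorting the words inside each pair plays the same role as upper = FALSE: each pair is stored in one direction only.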

Step 3: Build the Co-occurrence Matrix

A network graph is really just a picture of a matrix — a square table where every word is both a row and a column, and each cell holds the number of times that row-word and column-word appeared together. Building it explicitly makes the underlying structure concrete before we let ggraph draw it for us.

fruit_matrix <- fruit_pairs %>%
  bind_rows(fruit_pairs %>% rename(item1 = item2, item2 = item1)) %>%  # mirror to fill both triangles
  cast_sparse(item1, item2, n) %>%
  as.matrix()

fruit_matrix
##              red crisp sweet bananas soft yellow lemon sour bright strawberries
## sweet          2     1     0       1    1      1     0    0      0            1
## apple          1     1     1       0    0      0     0    0      0            0
## crisp          1     0     1       0    0      0     0    0      0            0
## bananas        0     0     1       0    1      1     0    0      0            0
## soft           0     0     1       1    0      1     0    0      0            0
## yellow         0     0     1       1    1      0     1    1      1            0
## lemon          0     0     0       0    0      1     0    1      1            0
## sour           0     0     0       0    0      1     1    0      1            0
## red            0     1     2       0    0      0     0    0      0            1
## strawberries   1     0     1       0    0      0     0    0      0            0
## lime           0     0     0       0    0      0     0    1      0            0
## bright         0     0     0       0    0      1     1    1      0            0
## juicy          1     0     1       0    0      0     0    0      0            1
## green          0     0     0       0    0      0     0    1      0            0
##              juicy lime green apple
## sweet            1    0     0     1
## apple            0    0     0     0
## crisp            0    0     0     1
## bananas          0    0     0     0
## soft             0    0     0     0
## yellow           0    0     0     0
## lemon            0    0     0     0
## sour             0    1     1     0
## red              1    0     0     1
## strawberries     1    0     0     0
## lime             0    0     1     0
## bright           0    0     0     0
## juicy            0    0     0     0
## green            0    1     0     0

cast_sparse() (from tidytext) reshapes our long pair-count table into a wide word-by-word matrix. We mirror the pairs first so the matrix is symmetric: “red” by “sweet” should show the same count as “sweet” by “red.” Note that the rows and columns print in different orders, so the symmetry is not visible along the diagonal; compare cells by name, not by position. Reading across any row tells you, at a glance, which words that term tends to appear with most often.
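The mirroring idea can also be checked on a tiny case in base R. The sketch below is an alternative construction for illustration, not the cast_sparse() route used above. It fills both triangles of a matrix from three of the fruit pairs, with rows and columns in the same alphabetical order so that symmetry can be tested directly.

```r
# Three pair counts copied from the fruit_pairs output.
pairs <- data.frame(
  item1 = c("sweet", "apple", "apple"),
  item2 = c("red", "sweet", "red"),
  n     = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# Use one shared, sorted vocabulary for both rows and columns.
vocab <- sort(unique(c(pairs$item1, pairs$item2)))
m <- matrix(0, nrow = length(vocab), ncol = length(vocab),
            dimnames = list(vocab, vocab))

# Fill each count in both directions (this is the "mirroring" step).
m[cbind(pairs$item1, pairs$item2)] <- pairs$n
m[cbind(pairs$item2, pairs$item1)] <- pairs$n

m
isSymmetric(m)                         # TRUE once both triangles are filled
sort(m["sweet", ], decreasing = TRUE)  # read one row, strongest partner first
```

Skipping the second fill line leaves one triangle empty, which is what an unmirrored pair table would give you.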

In the fruit co-occurrence matrix, what does the value in cell [‘sweet’, ‘red’] represent?



Why do we mirror the pairs (item1/item2) before casting to a matrix?



Step 4: Visualize the Network

set.seed(580)

fruit_pairs %>%
  filter(n >= 1) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), color = "steelblue") +
  geom_node_point(size = 5, color = "darkorange") +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  theme_void() +
  labs(title = "Word Co-occurrence Network: Fruit Descriptions")

Even with five short sentences, you should already see “sweet” clustering near “red” and “yellow,” while “sour” clusters near “yellow” and “green.” That’s the core idea — now we apply it to something messier.
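If you want a numeric check on what the picture suggests, igraph can report how connected each word is. This goes slightly beyond the plot above and is offered only as a sketch: it builds an undirected graph (co-occurrence has no direction) from four of the fruit pairs and compares plain degree with count-weighted strength.

```r
library(igraph)

# Four pair counts copied from the fruit_pairs output.
pairs <- data.frame(
  item1 = c("sweet", "apple", "apple", "sweet"),
  item2 = c("red", "sweet", "red", "yellow"),
  n     = c(2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Extra columns in the data frame (here n) become edge attributes.
g <- graph_from_data_frame(pairs, directed = FALSE)

degree(g)                      # number of distinct partner words
strength(g, weights = E(g)$n)  # partners weighted by co-occurrence count
```

Words with high strength are the ones the layout tends to pull toward the center of the network.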


Part 2: Real-World Example — SpaceX Investor Sentiment

This example uses a sample of investor comments about SpaceX stock ($SPCX), the kind of unstructured text you’d pull from a forum, X/Twitter thread, or comment section. The goal is to see which words investors tend to use together — for example, does “buy” pair more with “valuation” or with “dip”? Does “lockup” pair with “shares” or with “Musk”?

Step 1: Load the Raw Text

spacex_lines <- readLines("spacex_cooccurence.txt", encoding = "UTF-8")

spacex_text <- tibble(
  id = seq_along(spacex_lines),
  text = spacex_lines
)

spacex_text
## # A tibble: 13 × 2
##       id text                                                                   
##    <int> <chr>                                                                  
##  1     1 My buy level for spacex is 50-60                                       
##  2     2 Everyone of these experts said don't buy. I did. I bought and sold. It…
##  3     3 Below 100 i buy                                                        
##  4     4 Typical lockups don’t apply to SpaceX.  It has a unique, multi-phase…
##  5     5 I sold my $SPCX shares at $197. Once it crashes to ~ $75, I will buy a…
##  6     6 My personal high-conviction targets for the TOP 15 Digital Assets of 2…
##  7     7 I'd rather miss the move than invest at current valuations lol         
##  8     8 Buy one to two shares in the beginning while all the hype continues so…
##  9     9 I want to invest in spacex but not at this valuation. Will wait 6 mont…
## 10    10 Nothing is new here. I expect the SpaceX price to drop significantly i…
## 11    11 TSLA and SPCX Merger in a couple of years.                             
## 12    12 I bought 5 Calls yesterdays dip. FOMO is real. I wouldnt be surprise i…
## 13    13 I will be DCAing into this stock for the next 15 years! Its already be…

Each line in the file is treated as its own “document” (id), similar to how each sentence was its own document in the fruit example.

Step 2: Tokenize and Clean

Real-world text needs a bit more cleanup than our fruit example: stock tickers ($SPCX), numbers, and typographic (curly) apostrophes (’) all need to be handled.

custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("im", "ive", "dont", "didnt", "isnt", "lol"), lexicon = "custom")
)

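spacex_words <- spacex_text %>%
  # replace everything except letters, digits, whitespace, and "$" with a space
  mutate(text = str_replace_all(text, "[^[:alnum:][:space:]$]", " ")) %>%
  # lowercase and split into words (the tokenizer also drops the "$" prefix)
  unnest_tokens(word, text, token = "words") %>%
  filter(!str_detect(word, "^[0-9]+$")) %>%   # drop pure numbers
  anti_join(custom_stop_words, by = "word")   # remove standard + custom stop words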

spacex_words %>% count(word, sort = TRUE) %>% head(15)
## # A tibble: 15 × 2
##    word              n
##    <chr>         <int>
##  1 buy               6
##  2 shares            6
##  3 spacex            6
##  4 stock             4
##  5 price             3
##  6 spcx              3
##  7 apply             2
##  8 bought            2
##  9 invest            2
## 10 massive           2
## 11 months            2
## 12 move              2
## 13 significantly     2
## 14 sold              2
## 15 term              2

A tidyverse habit worth pointing out to students: instead of looping over each line to clean it, mutate() + str_replace_all() applies the cleanup to every row at once, because str_replace_all() is vectorized. No for loop is needed.
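Here is a minimal sketch of that vectorized style, with one deliberate difference from the pipeline above. The difference reflects an assumption about what may be wanted, not the original code: apostrophes are deleted before other punctuation is replaced. That turns “don’t” into “dont”, which can match entries such as "dont" in custom_stop_words, instead of splitting it into “don” and “t”. The helper name clean_comments() and the sample strings are hypothetical.

```r
library(stringr)

# Two made-up comments in the style of the data (not taken from the file).
comments <- c("Don\u2019t buy $SPCX at $197!", "I'm waiting... dip first")

clean_comments <- function(x) {
  x <- str_remove_all(x, "['\u2019]")                    # drop straight and curly apostrophes
  x <- str_replace_all(x, "[^[:alnum:][:space:]]", " ")  # other punctuation becomes a space
  str_to_lower(str_squish(x))                            # collapse whitespace, lowercase
}

# One call cleans the whole vector; no loop over comments is needed.
clean_comments(comments)
```

Inside a pipeline, the same helper would slot into mutate(text = clean_comments(text)) ahead of unnest_tokens().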

Step 3: Count Word Pairs

spacex_pairs <- spacex_words %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

spacex_pairs %>% head(15)
## # A tibble: 15 × 3
##    item1  item2             n
##    <chr>  <chr>         <dbl>
##  1 shares price             3
##  2 buy    sold              2
##  3 buy    shares            2
##  4 spacex shares            2
##  5 spacex term              2
##  6 shares term              2
##  7 spacex price             2
##  8 term   price             2
##  9 shares significantly     2
## 10 price  significantly     2
## 11 spacex wait              2
## 12 spacex months            2
## 13 wait   months            2
## 14 buy    level             1
## 15 buy    spacex            1
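Because upper = FALSE stores each pair in one direction only, a word of interest can sit in either item1 or item2. The sketch below shows one way to pull out every partner of a given word. The helper name partners_of() is hypothetical, and the rows are copied from the output above.

```r
library(dplyr)

# A few rows copied from the spacex_pairs output.
pairs <- data.frame(
  item1 = c("shares", "buy", "buy", "spacex", "buy", "buy"),
  item2 = c("price", "sold", "shares", "shares", "level", "spacex"),
  n     = c(3, 2, 2, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Illustrative helper: find the target in either column and report the other word.
partners_of <- function(pairs, target) {
  pairs %>%
    filter(item1 == target | item2 == target) %>%
    mutate(partner = if_else(item1 == target, item2, item1)) %>%
    select(partner, n) %>%
    arrange(desc(n))
}

partners_of(pairs, "shares")
partners_of(pairs, "buy")
```

Filtering on item1 alone would miss the buy–shares and spacex–shares rows when looking up “shares”.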

Step 4: Build the Co-occurrence Matrix

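Same idea as the fruit example, but now on real investor language. Because the SpaceX vocabulary is much larger than the fruit example, we’ll restrict the matrix to roughly the 15 most frequent words so it stays readable. Note that slice_max() keeps ties by default (with_ties = TRUE), so words tied with the 15th are all retained and the matrix below has 16 words rather than exactly 15.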

top_words <- spacex_words %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  pull(word)

spacex_matrix <- spacex_pairs %>%
  filter(item1 %in% top_words, item2 %in% top_words) %>%
  bind_rows(spacex_pairs %>% rename(item1 = item2, item2 = item1) %>%
              filter(item1 %in% top_words, item2 %in% top_words)) %>%
  cast_sparse(item1, item2, n) %>%
  as.matrix()

spacex_matrix
##               price sold shares term significantly wait months spacex bought
## shares            3    1      0    2             2    1      1      2      0
## buy               1    2      2    0             1    0      0      1      1
## spacex            2    0      2    2             1    2      2      0      0
## term              2    0      2    0             1    1      1      2      0
## price             0    0      3    2             2    1      1      2      0
## wait              1    0      1    1             1    0      2      2      0
## bought            0    1      0    0             0    0      0      0      0
## sold              0    0      1    0             0    0      0      0      1
## apply             1    0      1    1             0    0      0      1      0
## massive           1    0      1    1             0    0      0      1      0
## stock             1    0      1    1             0    0      0      1      0
## move              0    1      0    0             0    0      0      0      1
## invest            0    0      0    0             0    1      1      1      0
## significantly     2    0      2    1             0    1      1      1      0
## months            1    0      1    1             1    2      0      2      0
## spcx              0    1      1    0             0    0      0      0      1
##               move apply massive stock spcx invest buy
## shares           0     1       1     1    1      0   2
## buy              1     0       0     0    1      0   0
## spacex           0     1       1     1    0      1   1
## term             0     1       1     1    0      0   0
## price            0     1       1     1    0      0   1
## wait             0     0       0     0    0      1   0
## bought           1     0       0     0    1      0   1
## sold             1     0       0     0    1      0   2
## apply            0     0       1     1    0      0   0
## massive          0     1       0     1    0      0   0
## stock            0     1       1     0    0      0   0
## move             0     0       0     0    0      1   1
## invest           1     0       0     0    0      0   0
## significantly    0     0       0     0    0      0   1
## months           0     0       0     0    0      1   0
## spcx             0     0       0     0    0      0   1

Scanning a row of this matrix is itself a mini business insight — for example, looking at the “buy” row tells you immediately which other words investors most often used in the same breath as “buy,” without needing the graph at all. The network in the next step is simply a visual translation of this same matrix.
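One detail about how this matrix was sized: slice_max() keeps ties by default, which is why the matrix above has 16 rows even though the code asks for 15. The sketch below uses hypothetical counts with a tie at the cut-off to show the difference between the default and with_ties = FALSE.

```r
library(dplyr)

# Hypothetical word counts with a tie at the cut-off.
word_counts <- data.frame(
  word = c("buy", "shares", "stock", "price", "move"),
  n    = c(6, 6, 4, 2, 2),
  stringsAsFactors = FALSE
)

ties_kept    <- word_counts %>% slice_max(n, n = 4)                     # default: ties retained
ties_dropped <- word_counts %>% slice_max(n, n = 4, with_ties = FALSE)  # exactly 4 rows

nrow(ties_kept)     # 5, because "price" and "move" tie for 4th place
nrow(ties_dropped)  # 4
```

Keeping ties avoids an arbitrary choice among equally frequent words. Dropping them guarantees a fixed matrix size.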

Why did we restrict the SpaceX co-occurrence matrix to the top 15 words?



From a business-analytics standpoint, what is the main value of building the matrix explicitly (rather than jumping straight to the network plot)?



Step 5: Visualize the Strongest Co-occurrences

To keep the network readable, we’ll only keep pairs that occurred together more than once.

set.seed(580)

spacex_pairs %>%
  filter(n > 1) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), color = "firebrick") +
  geom_node_point(size = 4, color = "steelblue") +
  geom_node_text(aes(label = name), repel = TRUE, size = 3.5) +
  theme_void() +
  labs(
    title = "Word Co-occurrence Network: SpaceX Investor Comments",
    subtitle = "Edge thickness = number of comments containing both words"
  )

Discussion Questions for Students

  1. Which word pairs surprised you? Did “buy” cluster more with optimism words (“dip,” “DCA”) or caution words (“wait,” “valuation”)?
  2. How would the network change if we treated $SPCX and SpaceX as the same token instead of separate ones?
  3. What’s the risk of drawing conclusions from a co-occurrence network built on only 13 comments? How would you communicate that limitation in a business report?
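For question 2, one way to run the experiment is to recode the ticker token before counting pairs. This is a sketch on a hypothetical slice of a table shaped like spacex_words; the ids and words are illustrative.

```r
library(dplyr)

# Hypothetical slice of a tokenized table (one row per id-word).
words <- data.frame(
  id   = c(5, 5, 5, 11, 11, 9, 9),
  word = c("sold", "spcx", "shares", "tsla", "spcx", "spacex", "invest"),
  stringsAsFactors = FALSE
)

merged <- words %>%
  mutate(word = if_else(word == "spcx", "spacex", word)) %>%  # fold ticker into company name
  distinct(id, word)  # a comment that used both forms should count once

merged %>% count(word, sort = TRUE)
```

Passing merged to pairwise_count(word, id, ...) in place of the original table would then show how the network changes.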

Wrap-Up: From Toy Example to Business Insight

The fruit example exists purely to build intuition — once students can predict the network shape by eye, they’re ready to trust the same code on real text, where the patterns aren’t obvious in advance. The SpaceX example mirrors what a sentiment or brand-perception analysis would look like in practice: same five steps (tokenize, clean, pair, matrix, visualize), just messier inputs.


References

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/

Robinson, D. (2021). widyr: Widen, process, then re-tidy data [R package documentation]. https://CRAN.R-project.org/package=widyr

Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695. https://igraph.org

Pedersen, T. L. (2024). ggraph: An implementation of grammar of graphics for graphs and networks [R package documentation]. https://CRAN.R-project.org/package=ggraph

Moroz, G. (2023). checkdown: Check-fields and check-boxes for rmarkdown [R package documentation]. https://CRAN.R-project.org/package=checkdown