Text Analysis & NLP - Word Co-occurrence Analysis Demo 1

library(tidyverse)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
library(Matrix)
library(checkdown)

Tasks for you:

You will perform a word co-occurrence analysis using the reviews or comments you scraped in a previous exercise. We will discuss the key steps in class and brainstorm the most efficient approach to completing the analysis.

Keep in mind that this tutorial includes a few multiple choice questions, several embedded discussion questions (see the end of the file), and tasks (see the section on removing stopwords).

What is Word Co-occurrence Analysis?

Word co-occurrence analysis looks at which words tend to show up together within the same unit of text (a sentence, tweet, headline, etc.). If two words appear together more often than we’d expect by chance, that tells us something about how people talk about a topic — which brands get paired with which adjectives, which features get paired with which products, and so on.
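To make "more often than we’d expect by chance" concrete, here is a minimal sketch that compares an observed pair count with the count we would expect if the two words were unrelated. All numbers are hypothetical and invented for illustration; the tutorial itself works with raw pair counts.

```r
# Hypothetical counts, invented for illustration only (not from the SpaceX data)
n_docs <- 100  # total comments
n_buy  <- 40   # comments containing "buy"
n_dip  <- 20   # comments containing "dip"
n_both <- 15   # comments containing both words

# If the two words were independent, we would expect this many joint comments
expected_both <- n_docs * (n_buy / n_docs) * (n_dip / n_docs)

# Ratio above 1 means the pair shows up together more often than chance suggests
lift <- n_both / expected_both

expected_both
lift
```

With these made-up numbers the pair appears together almost twice as often as independence would predict, which is the kind of signal a co-occurrence analysis is looking for.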

The basic workflow is always the same:

Break text into individual words (tokenize)
Remove common “stop words” (the, and, is, etc.) that don’t carry meaning
Count how often pairs of words appear together in the same sentence
Arrange those counts into a word-by-word co-occurrence matrix
Visualize the strongest pairs as a network graph

We’ll walk through this twice today: first with a tiny, easy-to-follow fruit example, then with a real-world example using investor comments about SpaceX stock.

Part 1: Warm-Up Example — Fruits and Their Features

Before working with messy real-world text, it helps to see the mechanics on something simple where we already know what the answer should look like.

fruit_sentences <- tibble(
  id = 1:5,
  text = c(
    "The apple was crisp, sweet, and red.",
    "Bananas are soft, sweet, and yellow.",
    "The lemon was sour, bright, and yellow.",
    "Strawberries are sweet, red, and juicy.",
    "The lime was sour, small, and green."
  )
)

fruit_sentences

## # A tibble: 5 × 2
##      id text                                   
##   <int> <chr>                                  
## 1     1 The apple was crisp, sweet, and red.   
## 2     2 Bananas are soft, sweet, and yellow.   
## 3     3 The lemon was sour, bright, and yellow.
## 4     4 Strawberries are sweet, red, and juicy.
## 5     5 The lime was sour, small, and green.

Notice the pattern we built in on purpose: “sweet” appears with “red” twice and with “yellow” once, while “sour” appears once each with “yellow” and “green.” A good co-occurrence analysis should recover this pattern automatically, without us telling it the answer.

Step 1: Tokenize and Remove Stop Words

fruit_words <- fruit_sentences %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

fruit_words

## # A tibble: 19 × 2
##       id word        
##    <int> <chr>       
##  1     1 apple       
##  2     1 crisp       
##  3     1 sweet       
##  4     1 red         
##  5     2 bananas     
##  6     2 soft        
##  7     2 sweet       
##  8     2 yellow      
##  9     3 lemon       
## 10     3 sour        
## 11     3 bright      
## 12     3 yellow      
## 13     4 strawberries
## 14     4 sweet       
## 15     4 red         
## 16     4 juicy       
## 17     5 lime        
## 18     5 sour        
## 19     5 green
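To see roughly what unnest_tokens() and anti_join() are doing in this step, here is a base R sketch for a single sentence. The four-word stop list is an assumption made for illustration; the real tidytext stop_words table is far larger.

```r
sentence <- "The apple was crisp, sweet, and red."

# Tiny hand-picked stop list for illustration; not the tidytext stop_words table
mini_stop_words <- c("the", "was", "and", "are")

# Strip punctuation and lowercase, then split on whitespace
cleaned <- tolower(gsub("[[:punct:]]", "", sentence))
tokens  <- strsplit(cleaned, "\\s+")[[1]]

# Keep only the tokens that are not stop words
kept <- tokens[!tokens %in% mini_stop_words]
kept
```

The surviving tokens match the four rows with id 1 in the fruit_words output above.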

Step 2: Count Word Pairs Within Each Sentence

fruit_pairs <- fruit_words %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

fruit_pairs

## # A tibble: 26 × 3
##    item1   item2       n
##    <chr>   <chr>   <dbl>
##  1 sweet   red         2
##  2 apple   crisp       1
##  3 apple   sweet       1
##  4 crisp   sweet       1
##  5 apple   red         1
##  6 crisp   red         1
##  7 sweet   bananas     1
##  8 sweet   soft        1
##  9 bananas soft        1
## 10 sweet   yellow      1
## # ℹ 16 more rows

pairwise_count() from the widyr package counts, for every pair of words, how many sentences (id) they appear in together. The upper = FALSE argument keeps only one direction of each pair (apple–red instead of both apple–red and red–apple) so the table is easier to read.
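As a cross-check on what "appear in together" means, the same counts can be reproduced in base R from a sentence-by-word incidence matrix. This is a sketch for intuition, not how widyr is implemented; the token lists are copied by hand from the fruit_words output.

```r
# Tokens per sentence after stop-word removal (copied from the fruit_words output)
tokens <- list(
  c("apple", "crisp", "sweet", "red"),
  c("bananas", "soft", "sweet", "yellow"),
  c("lemon", "sour", "bright", "yellow"),
  c("strawberries", "sweet", "red", "juicy"),
  c("lime", "sour", "green")
)
vocab <- sort(unique(unlist(tokens)))

# Binary incidence matrix: one row per sentence, 1 if the word occurs in it
incidence <- t(sapply(tokens, function(words) as.integer(vocab %in% words)))
colnames(incidence) <- vocab

# crossprod(X) is t(X) %*% X: cell [i, j] counts sentences containing both words
cooc <- crossprod(incidence)
diag(cooc) <- 0  # a word paired with itself is not a co-occurrence

cooc["sweet", "red"]
n_pairs <- sum(cooc[upper.tri(cooc)] > 0)  # distinct pairs, one direction only
n_pairs
```

The number of distinct pairs agrees with the 26 rows reported for fruit_pairs, and the sweet and red cell holds the 2 shown in its first row.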

Step 3: Build the Co-occurrence Matrix

A network graph is really just a picture of a matrix — a square table where every word is both a row and a column, and each cell holds the number of times that row-word and column-word appeared together. Building it explicitly makes the underlying structure concrete before we let ggraph draw it for us.

fruit_matrix <- fruit_pairs %>%
  bind_rows(fruit_pairs %>% rename(item1 = item2, item2 = item1)) %>%  # mirror to fill both triangles
  cast_sparse(item1, item2, n) %>%
  as.matrix()

fruit_matrix

##              red crisp sweet bananas soft yellow lemon sour bright strawberries
## sweet          2     1     0       1    1      1     0    0      0            1
## apple          1     1     1       0    0      0     0    0      0            0
## crisp          1     0     1       0    0      0     0    0      0            0
## bananas        0     0     1       0    1      1     0    0      0            0
## soft           0     0     1       1    0      1     0    0      0            0
## yellow         0     0     1       1    1      0     1    1      1            0
## lemon          0     0     0       0    0      1     0    1      1            0
## sour           0     0     0       0    0      1     1    0      1            0
## red            0     1     2       0    0      0     0    0      0            1
## strawberries   1     0     1       0    0      0     0    0      0            0
## lime           0     0     0       0    0      0     0    1      0            0
## bright         0     0     0       0    0      1     1    1      0            0
## juicy          1     0     1       0    0      0     0    0      0            1
## green          0     0     0       0    0      0     0    1      0            0
##              juicy lime green apple
## sweet            1    0     0     1
## apple            0    0     0     0
## crisp            0    0     0     1
## bananas          0    0     0     0
## soft             0    0     0     0
## yellow           0    0     0     0
## lemon            0    0     0     0
## sour             0    1     1     0
## red              1    0     0     1
## strawberries     1    0     0     0
## lime             0    0     1     0
## bright           0    0     0     0
## juicy            0    0     0     0
## green            0    1     0     0

cast_sparse() (from the tidytext package) reshapes our long pair-count table into a wide word-by-word matrix. We mirror the pairs first so the matrix is symmetric, which means that “red” by “sweet” should show the same count as “sweet” by “red.” Reading across any row tells you, at a glance, which words that term tends to appear with most often.
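The printed matrix may not look symmetric at first glance because its rows and columns are listed in different orders. A small sketch of how one might align them and confirm the symmetry, using a hand-built three-word matrix whose counts are taken from the fruit sentences:

```r
words <- c("sweet", "red", "apple")
m <- matrix(c(0, 2, 1,
              2, 0, 1,
              1, 1, 0),
            nrow = 3, byrow = TRUE, dimnames = list(words, words))

# Mimic the printed output: same cells, but columns in a different order than rows
shuffled <- m[, c("red", "apple", "sweet")]
isSymmetric(shuffled)

# Put rows and columns in the same (alphabetical) order before checking
aligned <- shuffled[sort(rownames(shuffled)), sort(colnames(shuffled))]
isSymmetric(aligned)
```

The same two lines of reordering should work on fruit_matrix, assuming every word appears in both its row names and column names, which the mirroring step is meant to guarantee.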

Step 4: Visualize the Network

set.seed(580)

fruit_pairs %>%
  filter(n >= 1) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), color = "steelblue") +
  geom_node_point(size = 5, color = "darkorange") +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  theme_void() +
  labs(title = "Word Co-occurrence Network: Fruit Descriptions")

Even with five short sentences, you should already see “sweet” clustering near “red” and “yellow,” while “sour” clusters near “yellow” and “green.” That’s the core idea — now we apply it to something messier.
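Before plotting, it can help to inspect the graph object that graph_from_data_frame() builds. The sketch below uses four of the fruit pairs typed in by hand, and sets directed = FALSE as an illustrative choice (the chunk above uses the default).

```r
library(igraph)

# Four pairs from the fruit example, entered by hand for a self-contained demo
pairs <- data.frame(
  item1 = c("sweet", "sweet", "sour", "sour"),
  item2 = c("red", "yellow", "yellow", "green"),
  n     = c(2, 1, 1, 1)
)

# The first two columns become the edge list; remaining columns become edge attributes
g <- graph_from_data_frame(pairs, directed = FALSE)

vcount(g)   # number of distinct words (nodes)
ecount(g)   # number of word pairs (edges)
degree(g)   # how many different words each word is linked to
E(g)$n      # the co-occurrence counts carried along as an edge attribute
```

The n attribute is what aes(edge_alpha = n, edge_width = n) maps to edge transparency and thickness in the plot.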

Part 2: Real-World Example — SpaceX Investor Sentiment

This example uses a sample of investor comments about SpaceX stock ($SPCX) that I scraped from a website. The original sample has about 500 observations; the demo below works with 13 of them. The goal is to see which words investors tend to use together. For example, does “buy” pair more with “valuation” or with “dip”? Does “sell” pair with “shares”?

Step 1: Load the Raw Text

spacex_lines <- readLines("spacex_cooccurence.txt", encoding = "UTF-8")
# trim leading/trailing whitespace and store the result
spacex_lines <- gsub("^\\s+|\\s+$", "", spacex_lines, useBytes = TRUE)
spacex_lines

##  [1] "My buy level for spacex is 50-60"
##  [2] "Everyone of these experts said don't buy. I did. I bought and sold. It was a good move."
##  [3] "Below 100 i buy"
##  [4] "Typical lockups donâ€™t apply to SpaceX.  It has a unique, multi-phase lock-up release schedule that is intended to facilitate trickle selling rather than a massive dump of shares upon the expiration of a single massive lockup date.  The historical model doesnâ€™t apply.  More importantly, itâ€™s important to know the fact that SpaceX has floated only 5% of its stock has a profound impact on its future prospects, driving extreme near-term price volatility, delaying its inclusion in major passive stock indexes, and keeping absolute strategic control firmly in the hands of Elon Musk. By reserving 95% of its shares for insiders and early investors, SpaceX has created an ecosystem where the stock trades on scarcity rather than fundamental performance metrics."
##  [5] "I sold my $SPCX shares at $197. Once it crashes to ~ $75, I will buy again."
##  [6] "My personal high-conviction targets for the TOP 15 Digital Assets of 2026: $Solana, $SPCX55K, $XRP, $BTC."
##  [7] "I'd rather miss the move than invest at current valuations lol"
##  [8] "Buy one to two shares in the beginning while all the hype continues so at keast you can follow the price in your portfolio. Once it drops significantly start dollar cost averaging or buy s big chunk at once."
##  [9] "I want to invest in spacex but not at this valuation. Will wait 6 months and reassess."
## [10] "Nothing is new here. I expect the SpaceX price to drop significantly in the next 6 to 12 months. When everyone else panics and sells, that's when the opportunity arises. Collect and accumulate capital, and wait. Your grandchildren will remember the decision you will make soon enough. I have pre IPO shares and yes, my shares are locked up for 180 days. Will I sell when the day comes? NO, I am in for the long term."
## [11] "TSLA and SPCX Merger in a couple of years."
## [12] "I bought 5 Calls yesterdays dip. FOMO is real. I wouldnt be surprise if SPCX runs to $1,000."
## [13] "I will be DCAing into this stock for the next 15 years! Its already below 20% of its high of 230! Strategy applies for all stocks."

spacex_text <- tibble(
  id = seq_along(spacex_lines),
  text = spacex_lines
)

spacex_text

## # A tibble: 13 × 2
##       id text                                                                   
##    <int> <chr>                                                                  
##  1     1 My buy level for spacex is 50-60                                       
##  2     2 Everyone of these experts said don't buy. I did. I bought and sold. It…
##  3     3 Below 100 i buy                                                        
##  4     4 Typical lockups donâ€™t apply to SpaceX.  It has a unique, multi-phase…
##  5     5 I sold my $SPCX shares at $197. Once it crashes to ~ $75, I will buy a…
##  6     6 My personal high-conviction targets for the TOP 15 Digital Assets of 2…
##  7     7 I'd rather miss the move than invest at current valuations lol         
##  8     8 Buy one to two shares in the beginning while all the hype continues so…
##  9     9 I want to invest in spacex but not at this valuation. Will wait 6 mont…
## 10    10 Nothing is new here. I expect the SpaceX price to drop significantly i…
## 11    11 TSLA and SPCX Merger in a couple of years.                             
## 12    12 I bought 5 Calls yesterdays dip. FOMO is real. I wouldnt be surprise i…
## 13    13 I will be DCAing into this stock for the next 15 years! Its already be…

Each line in the file is treated as its own “document” (id), similar to how each sentence was its own document in the fruit example.

Step 2: Tokenize and Clean

Real-world text needs a bit more cleanup than our fruit example. Stock tickers ($SPCX), numbers, and encoding artifacts (â€™) all need to be handled.
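One possible way to handle the apostrophe artifact is to swap the garbled three-character sequence for a plain apostrophe before tokenizing. This is only a sketch of that idea, assuming the artifact always stands for an apostrophe; it is not part of the tutorial's own cleaning code, and re-reading the file with the correct encoding may be the cleaner fix.

```r
# The garbled sequence written with Unicode escapes so the script stays ASCII-only
garbled_apostrophe <- "\u00e2\u20ac\u2122"

raw <- paste0("Typical lockups don", garbled_apostrophe, "t apply to SpaceX.")

# fixed = TRUE treats the pattern as literal text rather than a regular expression
repaired <- gsub(garbled_apostrophe, "'", raw, fixed = TRUE)
repaired
```

Without a step like this, the leftover “â” counts as a letter and stays attached to the word, so a token such as “donâ” can slip past the stopword list.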

Tasks for you:

Add your own stopwords: Look at the word-count tibble in my tutorial. Find 1–2 words that feel too generic to be meaningful, that is, words that are frequent only because they appear in most sentences, not because they signal something meaningful.

Add those words to the tibble(word = c()) line in the chunk below, then re-run only that chunk, without publishing the document.

Reflection questions for you (add your responses below)

Which additional words did you add to the stopword list manually? Why does it make sense to remove those words as stopwords? Also, does the final network graph (see the last code chunk) change after removing these additional stopwords? If so, please briefly explain how and why.

custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("im", "ive", "dont", "didnt", "isnt", "lol"), lexicon = "custom")
)

spacex_words <- spacex_text %>%
  # replace everything except letters, digits, whitespace, and "$" with a space
  mutate(text = str_replace_all(text, "[^[:alnum:][:space:]$]", " ")) %>%
  unnest_tokens(word, text, token = "words") %>%   # lowercase, one word per row
  filter(!str_detect(word, "^[0-9]+$")) %>%   # drop pure numbers
  anti_join(custom_stop_words, by = "word")

spacex_words %>% count(word, sort = TRUE) %>% head(15)

## # A tibble: 15 × 2
##    word              n
##    <chr>         <int>
##  1 buy               6
##  2 shares            6
##  3 spacex            6
##  4 stock             4
##  5 price             3
##  6 spcx              3
##  7 apply             2
##  8 bought            2
##  9 invest            2
## 10 massive           2
## 11 months            2
## 12 move              2
## 13 significantly     2
## 14 sold              2
## 15 term              2

A tidyverse habit worth pointing out to students: instead of looping over each line to clean it, mutate() + str_replace_all() vectorizes the cleanup across every row at once, so no for loop is needed.
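The loop-versus-vectorized point can be checked with a small base R sketch. It uses gsub() as a stand-in for str_replace_all() so that it runs without any packages; the three comments are shortened from the sample above.

```r
comments <- c("I sold my $SPCX shares at $197.", "Below 100 i buy", "FOMO is real!")
pattern  <- "[^[:alnum:][:space:]$]"  # same character class as in the chunk above

# Loop version: clean one comment at a time
looped <- character(length(comments))
for (i in seq_along(comments)) {
  looped[i] <- gsub(pattern, " ", comments[i])
}

# Vectorized version: one call cleans every comment
vectorized <- gsub(pattern, " ", comments)

identical(looped, vectorized)
```

Both versions give the same result; the vectorized call is simply shorter and leaves less room for indexing mistakes.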

Step 3: Count Word Pairs

spacex_pairs <- spacex_words %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

spacex_pairs %>% head(15)

## # A tibble: 15 × 3
##    item1  item2             n
##    <chr>  <chr>         <dbl>
##  1 shares price             3
##  2 buy    sold              2
##  3 buy    shares            2
##  4 spacex shares            2
##  5 spacex term              2
##  6 shares term              2
##  7 spacex price             2
##  8 term   price             2
##  9 shares significantly     2
## 10 price  significantly     2
## 11 spacex wait              2
## 12 spacex months            2
## 13 wait   months            2
## 14 buy    level             1
## 15 buy    spacex            1

Step 4: Build the Co-occurrence Matrix

Same idea as the fruit example, but now on real investor language. Because the SpaceX vocabulary is much larger than the fruit example, we’ll restrict the matrix to roughly the 15 most frequent words so it stays readable. Note that slice_max() keeps ties by default, so the result below has 16 words rather than exactly 15.

top_words <- spacex_words %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  pull(word)

spacex_matrix <- spacex_pairs %>%
  filter(item1 %in% top_words, item2 %in% top_words) %>%
  bind_rows(spacex_pairs %>% rename(item1 = item2, item2 = item1) %>%
              filter(item1 %in% top_words, item2 %in% top_words)) %>%
  cast_sparse(item1, item2, n) %>%
  as.matrix()

spacex_matrix

##               price sold shares term significantly wait months spacex bought
## shares            3    1      0    2             2    1      1      2      0
## buy               1    2      2    0             1    0      0      1      1
## spacex            2    0      2    2             1    2      2      0      0
## term              2    0      2    0             1    1      1      2      0
## price             0    0      3    2             2    1      1      2      0
## wait              1    0      1    1             1    0      2      2      0
## bought            0    1      0    0             0    0      0      0      0
## sold              0    0      1    0             0    0      0      0      1
## apply             1    0      1    1             0    0      0      1      0
## massive           1    0      1    1             0    0      0      1      0
## stock             1    0      1    1             0    0      0      1      0
## move              0    1      0    0             0    0      0      0      1
## invest            0    0      0    0             0    1      1      1      0
## significantly     2    0      2    1             0    1      1      1      0
## months            1    0      1    1             1    2      0      2      0
## spcx              0    1      1    0             0    0      0      0      1
##               move apply massive stock spcx invest buy
## shares           0     1       1     1    1      0   2
## buy              1     0       0     0    1      0   0
## spacex           0     1       1     1    0      1   1
## term             0     1       1     1    0      0   0
## price            0     1       1     1    0      0   1
## wait             0     0       0     0    0      1   0
## bought           1     0       0     0    1      0   1
## sold             1     0       0     0    1      0   2
## apply            0     0       1     1    0      0   0
## massive          0     1       0     1    0      0   0
## stock            0     1       1     0    0      0   0
## move             0     0       0     0    0      1   1
## invest           1     0       0     0    0      0   0
## significantly    0     0       0     0    0      0   1
## months           0     0       0     0    0      1   0
## spcx             0     0       0     0    0      0   1
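A minimal sketch of how slice_max() treats ties, using made-up counts rather than the SpaceX data. It shows why asking for the top n rows can return more than n.

```r
library(dplyr)

# Made-up word counts for illustration; the last two words are tied
counts <- tibble(
  word = c("buy", "shares", "price", "term", "wait"),
  n    = c(6, 6, 3, 2, 2)
)

# Default behaviour: rows tied with the last kept row are also returned
with_ties <- counts %>% slice_max(n, n = 4)

# with_ties = FALSE cuts the result to exactly the requested number of rows
no_ties <- counts %>% slice_max(n, n = 4, with_ties = FALSE)

nrow(with_ties)
nrow(no_ties)
```

Which behaviour is preferable is a judgment call: keeping ties avoids dropping a word arbitrarily, while with_ties = FALSE gives a predictable matrix size.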

Scanning a row of this matrix is itself a mini business insight — for example, looking at the “buy” row tells you immediately which other words investors most often used in the same breath as “buy,” without needing the graph at all. The network in the next step is simply a visual translation of this same matrix.
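Reading a row is easier once it is sorted. The sketch below copies the “buy” row from the printed matrix by hand and ranks its non-zero partners; with the real object, sort(spacex_matrix["buy", ], decreasing = TRUE) should give the same view.

```r
# The "buy" row, copied by hand from the printed spacex_matrix output
buy_row <- c(price = 1, sold = 2, shares = 2, term = 0, significantly = 1,
             wait = 0, months = 0, spacex = 1, bought = 1, move = 1,
             apply = 0, massive = 0, stock = 0, spcx = 1, invest = 0, buy = 0)

# Keep only words that actually co-occurred with "buy", strongest first
partners <- sort(buy_row[buy_row > 0], decreasing = TRUE)
partners
```

In this sample, “sold” and “shares” are the strongest partners of “buy,” each appearing with it in two comments.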

Step 5: Visualize the Strongest Co-occurrences

To keep the network readable, we’ll only keep pairs that occurred together more than once.

set.seed(580)

spacex_pairs %>%
  filter(n > 1) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), color = "firebrick") +
  geom_node_point(size = 4, color = "steelblue") +
  geom_node_text(aes(label = name), repel = TRUE, size = 3.5) +
  theme_void() +
  labs(
    title = "Word Co-occurrence Network: SpaceX Investor Comments",
    subtitle = "Edge thickness = number of comments containing both words"
  )
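To get a feel for how sensitive the network is to the filter(n > 1) threshold, one can count how many pairs survive at each cutoff. The sketch below uses only the n column of the 15 pairs printed in Step 3, so the totals are for that excerpt, not the full pair table.

```r
# n column of the 15 pairs shown in the Step 3 output (an excerpt, not all pairs)
pair_counts <- c(3, rep(2, 12), 1, 1)

# For each cutoff k, count the pairs that would pass filter(n > k)
thresholds <- 0:3
edges_kept <- sapply(thresholds, function(k) sum(pair_counts > k))

data.frame(threshold = thresholds, edges_kept = edges_kept)
```

In this excerpt, raising the cutoff from 1 to 2 leaves a single edge, which suggests that small samples give little room to tune the threshold.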

Tasks for you (try to use external sources to support your responses):

See the stopword task under Part 2, Step 2 (Tokenize and Clean).

Reflection questions for you (add your responses below)

Question 1 Which additional words did you add to the stopword list manually? Why does it make sense to remove those words as stopwords? Also, does the final network graph (see the last code chunk) change after removing these additional stopwords? If so, please briefly explain how and why.

Question 2 Change the threshold in filter(n > 1) in Step 5 to a different value. What value did you settle on? Does it give you a better result?

Additional discussion Questions (try to use external sources to support your responses)

Question 3 My BlueSky post (https://bsky.app/profile/did:plc:jbmmqoxfdgpoavycm7fz4k3r/post/3mphyccr2vc2g) shows two co-occurrence network graphs. Which one do you like better? Why?

Question 4 Which word pairs surprised you? Did “buy” cluster more with optimism words (“dip”) or caution words (“wait,” “valuation”)?

Question 5 Should we treat the stock ticker $SPCX and the company name SpaceX as the same token instead of separate ones?

Question 6 What’s the risk of drawing conclusions from a co-occurrence network built on only a small sample of comments? What is a good sample size? How would you communicate that limitation in a business report?

Question 7 Is there a way to analyze comments or reviews on topics that interest you in relation to a significant event? If so, how would you identify the event? Why is that event significant, and how might it influence the comments or reviews you analyze?

Wrap-Up: From Toy Example to Business Insight

The fruit example exists purely to build intuition — once students can predict the network shape by eye, they’re ready to trust the same code on real text, where the patterns aren’t obvious in advance. The SpaceX example mirrors what a sentiment or brand-perception analysis would look like in practice: same five steps (tokenize, clean, pair, matrix, visualize), just messier inputs.

References

Boost Brand Credibility with Co Occurrence SEO. https://www.linkedin.com/posts/umer-abid-78045131a_co-occurrence-seo-is-a-powerful-concept-that-activity-7421759779892809728-MGh_/

https://seopressor.com/blog/why-co-citation-and-co-occurrence-are-such-big-deal/

Kong, J., Scott, A., & Goerg, G. M. (2016). Improving topic clustering on search queries with word co-occurrence and bipartite graph coclustering. https://research.google/pubs/improving-topic-clustering-on-search-queries-with-word-co-occurrence-and-bipartite-graph-co-clustering/

Colladon, A. F. (2018). The semantic brand score. Journal of Business Research, 88, 150-160.

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/

Robinson, D. (2021). widyr: Widen, process, then re-tidy data [R package documentation]. https://CRAN.R-project.org/package=widyr

Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695. https://igraph.org

Pedersen, T. L. (2024). ggraph: An implementation of grammar of graphics for graphs and networks [R package documentation]. https://CRAN.R-project.org/package=ggraph

Special Topics in Business Analytics

Zhenning “Jimmy” Xu, Ph.D.

June 30, 2026