library(tidyverse)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
library(Matrix)
library(checkdown)
You will perform a word co-occurrence analysis using the reviews or comments you scraped in a previous exercise. We will discuss the key steps in class and brainstorm the most efficient approach to completing the analysis.
Keep in mind that this tutorial includes a few multiple choice questions, several embedded discussion questions (see the end of the file), and tasks (see the section on removing stopwords).
Word co-occurrence analysis looks at which words tend to show up together within the same unit of text (a sentence, tweet, headline, etc.). If two words appear together more often than we’d expect by chance, that tells us something about how people talk about a topic — which brands get paired with which adjectives, which features get paired with which products, and so on.
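To make “more often than we’d expect by chance” concrete, here is a minimal sketch with made-up numbers. It assumes a hypothetical corpus of 5 sentences in which one word appears in 3 sentences, another in 2, and both together in 2. It then compares the observed joint count with the count expected if the two words were independent. The tutorial itself works with raw counts; the ratio below (often called lift) is only an illustration, and every variable name is an assumption.

```r
# Hypothetical counts, not taken from any real data set
n_docs <- 5   # total sentences
n_a    <- 3   # sentences containing word A
n_b    <- 2   # sentences containing word B
n_both <- 2   # sentences containing both words

# Expected number of joint sentences if A and B were independent
expected_both <- n_docs * (n_a / n_docs) * (n_b / n_docs)

# Ratio of observed to expected; a value above 1 means "more than chance"
lift <- n_both / expected_both

expected_both
lift
```

With a corpus this small the ratio is very noisy, so it is best read as intuition rather than evidence.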
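The basic workflow is always the same: tokenize the text, clean it, count word pairs, arrange the counts in a matrix, and visualize the result as a network.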
We’ll walk through this twice today: first with a tiny, easy-to-follow fruit example, then with a real-world example using investor comments about SpaceX stock.
Before working with messy real-world text, it helps to see the mechanics on something simple where we already know what the answer should look like.
fruit_sentences <- tibble(
id = 1:5,
text = c(
"The apple was crisp, sweet, and red.",
"Bananas are soft, sweet, and yellow.",
"The lemon was sour, bright, and yellow.",
"Strawberries are sweet, red, and juicy.",
"The lime was sour, small, and green."
)
)
fruit_sentences
## # A tibble: 5 × 2
## id text
## <int> <chr>
## 1 1 The apple was crisp, sweet, and red.
## 2 2 Bananas are soft, sweet, and yellow.
## 3 3 The lemon was sour, bright, and yellow.
## 4 4 Strawberries are sweet, red, and juicy.
## 5 5 The lime was sour, small, and green.
Notice the pattern we built in on purpose: “sweet” keeps showing up with “red” and “yellow,” while “sour” keeps showing up with “yellow” and “green.” A good co-occurrence analysis should recover this pattern automatically, without us telling it the answer.
fruit_words <- fruit_sentences %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word")
fruit_words
## # A tibble: 19 × 2
## id word
## <int> <chr>
## 1 1 apple
## 2 1 crisp
## 3 1 sweet
## 4 1 red
## 5 2 bananas
## 6 2 soft
## 7 2 sweet
## 8 2 yellow
## 9 3 lemon
## 10 3 sour
## 11 3 bright
## 12 3 yellow
## 13 4 strawberries
## 14 4 sweet
## 15 4 red
## 16 4 juicy
## 17 5 lime
## 18 5 sour
## 19 5 green
fruit_pairs <- fruit_words %>%
pairwise_count(word, id, sort = TRUE, upper = FALSE)
fruit_pairs
## # A tibble: 26 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 sweet red 2
## 2 apple crisp 1
## 3 apple sweet 1
## 4 crisp sweet 1
## 5 apple red 1
## 6 crisp red 1
## 7 sweet bananas 1
## 8 sweet soft 1
## 9 bananas soft 1
## 10 sweet yellow 1
## # ℹ 16 more rows
pairwise_count() from the widyr package
counts, for every pair of words, how many sentences (id)
they appear in together. The upper = FALSE argument keeps
only one direction of each pair (apple–red instead of both apple–red and
red–apple) so the table is easier to read.
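To see what this kind of pair counting amounts to, the sketch below reproduces the idea by hand in base R. It is not how widyr is implemented; it assumes three hypothetical, already-tokenized documents and uses combn() to list every unordered word pair per document before tallying them.

```r
# Hypothetical toy data: three tiny "documents", already split into words
docs <- list(
  c("apple", "sweet", "red"),
  c("banana", "sweet", "yellow"),
  c("strawberry", "sweet", "red")
)

# For each document, list every unordered pair of distinct words
pairs_per_doc <- lapply(docs, function(words) {
  w <- sort(unique(words))  # sorting makes "red|sweet" and "sweet|red" identical
  combn(w, 2, FUN = paste, collapse = "|")
})

# Tally how many documents each pair appears in
pair_counts <- sort(table(unlist(pairs_per_doc)), decreasing = TRUE)
pair_counts
```

Sorting the words inside each document plays the same role as upper = FALSE: each pair is stored in one direction only.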
A network graph is really just a picture of a matrix
— a square table where every word is both a row and a column, and each
cell holds the number of times that row-word and column-word appeared
together. Building it explicitly makes the underlying structure concrete
before we let ggraph draw it for us.
fruit_matrix <- fruit_pairs %>%
bind_rows(fruit_pairs %>% rename(item1 = item2, item2 = item1)) %>% # mirror to fill both triangles
cast_sparse(item1, item2, n) %>%
as.matrix()
fruit_matrix
## red crisp sweet bananas soft yellow lemon sour bright strawberries
## sweet 2 1 0 1 1 1 0 0 0 1
## apple 1 1 1 0 0 0 0 0 0 0
## crisp 1 0 1 0 0 0 0 0 0 0
## bananas 0 0 1 0 1 1 0 0 0 0
## soft 0 0 1 1 0 1 0 0 0 0
## yellow 0 0 1 1 1 0 1 1 1 0
## lemon 0 0 0 0 0 1 0 1 1 0
## sour 0 0 0 0 0 1 1 0 1 0
## red 0 1 2 0 0 0 0 0 0 1
## strawberries 1 0 1 0 0 0 0 0 0 0
## lime 0 0 0 0 0 0 0 1 0 0
## bright 0 0 0 0 0 1 1 1 0 0
## juicy 1 0 1 0 0 0 0 0 0 1
## green 0 0 0 0 0 0 0 1 0 0
## juicy lime green apple
## sweet 1 0 0 1
## apple 0 0 0 0
## crisp 0 0 0 1
## bananas 0 0 0 0
## soft 0 0 0 0
## yellow 0 0 0 0
## lemon 0 0 0 0
## sour 0 1 1 0
## red 1 0 0 1
## strawberries 1 0 0 0
## lime 0 0 1 0
## bright 0 0 0 0
## juicy 0 0 0 0
## green 0 1 0 0
cast_sparse() (from the tidytext package) reshapes our long pair-count table into a wide word-by-word matrix. We mirror the pairs first so the matrix is symmetric, which means that “red” by “sweet” should show the same count as “sweet” by “red.” Reading across any row tells you, at a glance, which words that term tends to appear with most often.
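The mirroring step can also be sketched in base R, which makes the symmetry easy to verify. The example assumes a small hypothetical pair table in the same shape as the pairwise_count() output; it is an illustration of the idea, not a replacement for cast_sparse().

```r
# Hypothetical pair counts (item1, item2, n), one direction per pair
pairs <- data.frame(
  item1 = c("sweet", "sweet", "sour"),
  item2 = c("red", "yellow", "yellow"),
  n     = c(2, 1, 1),
  stringsAsFactors = FALSE
)

words <- sort(unique(c(pairs$item1, pairs$item2)))

# All-zero square matrix with identical row and column order
m <- matrix(0, nrow = length(words), ncol = length(words),
            dimnames = list(words, words))

# Fill both triangles so that m[a, b] equals m[b, a]
m[cbind(pairs$item1, pairs$item2)] <- pairs$n
m[cbind(pairs$item2, pairs$item1)] <- pairs$n

isSymmetric(m)
m["sweet", ]
```

Because rows and columns share one sorted word order here, the symmetry is visible by eye; in a printed matrix whose rows and columns come out in different orders, indexing both dimensions with the same vector of words gives the same effect.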
set.seed(580)
fruit_pairs %>%
filter(n >= 1) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), color = "steelblue") +
geom_node_point(size = 5, color = "darkorange") +
geom_node_text(aes(label = name), repel = TRUE, size = 4) +
theme_void() +
labs(title = "Word Co-occurrence Network: Fruit Descriptions")
Even with five short sentences, you should already see “sweet” clustering near “red” and “yellow,” while “sour” clusters near “yellow” and “green.” That’s the core idea — now we apply it to something messier.
This example uses a sample of investor comments about SpaceX stock ($SPCX) that I scraped from a website. The original sample has about 500 observations; the file loaded below contains 13 of them. The goal of this demo is to see which words investors tend to use together. For example, does “buy” pair more with “valuation” or with “dip”? Does “sell” pair with “shares”?
spacex_lines <- readLines("spacex_cooccurence.txt", encoding = "UTF-8")
spacex_lines <- trimws(spacex_lines) # strip leading/trailing whitespace and keep the result
spacex_lines
## [1] "My buy level for spacex is 50-60"
## [2] "Everyone of these experts said don't buy. I did. I bought and sold. It was a good move."
## [3] "Below 100 i buy"
## [4] "Typical lockups don’t apply to SpaceX. It has a unique, multi-phase lock-up release schedule that is intended to facilitate trickle selling rather than a massive dump of shares upon the expiration of a single massive lockup date. The historical model doesn’t apply. More importantly, it’s important to know the fact that SpaceX has floated only 5% of its stock has a profound impact on its future prospects, driving extreme near-term price volatility, delaying its inclusion in major passive stock indexes, and keeping absolute strategic control firmly in the hands of Elon Musk. By reserving 95% of its shares for insiders and early investors, SpaceX has created an ecosystem where the stock trades on scarcity rather than fundamental performance metrics."
## [5] "I sold my $SPCX shares at $197. Once it crashes to ~ $75, I will buy again."
## [6] "My personal high-conviction targets for the TOP 15 Digital Assets of 2026: $Solana, $SPCX55K, $XRP, $BTC."
## [7] "I'd rather miss the move than invest at current valuations lol"
## [8] "Buy one to two shares in the beginning while all the hype continues so at keast you can follow the price in your portfolio. Once it drops significantly start dollar cost averaging or buy s big chunk at once."
## [9] "I want to invest in spacex but not at this valuation. Will wait 6 months and reassess."
## [10] "Nothing is new here. I expect the SpaceX price to drop significantly in the next 6 to 12 months. When everyone else panics and sells, that's when the opportunity arises. Collect and accumulate capital, and wait. Your grandchildren will remember the decision you will make soon enough. I have pre IPO shares and yes, my shares are locked up for 180 days. Will I sell when the day comes? NO, I am in for the long term."
## [11] "TSLA and SPCX Merger in a couple of years."
## [12] "I bought 5 Calls yesterdays dip. FOMO is real. I wouldnt be surprise if SPCX runs to $1,000."
## [13] "I will be DCAing into this stock for the next 15 years! Its already below 20% of its high of 230! Strategy applies for all stocks."
spacex_text <- tibble(
id = seq_along(spacex_lines),
text = spacex_lines
)
spacex_text
## # A tibble: 13 × 2
## id text
## <int> <chr>
## 1 1 My buy level for spacex is 50-60
## 2 2 Everyone of these experts said don't buy. I did. I bought and sold. It…
## 3 3 Below 100 i buy
## 4 4 Typical lockups don’t apply to SpaceX. It has a unique, multi-phase…
## 5 5 I sold my $SPCX shares at $197. Once it crashes to ~ $75, I will buy a…
## 6 6 My personal high-conviction targets for the TOP 15 Digital Assets of 2…
## 7 7 I'd rather miss the move than invest at current valuations lol
## 8 8 Buy one to two shares in the beginning while all the hype continues so…
## 9 9 I want to invest in spacex but not at this valuation. Will wait 6 mont…
## 10 10 Nothing is new here. I expect the SpaceX price to drop significantly i…
## 11 11 TSLA and SPCX Merger in a couple of years.
## 12 12 I bought 5 Calls yesterdays dip. FOMO is real. I wouldnt be surprise i…
## 13 13 I will be DCAing into this stock for the next 15 years! Its already be…
Each line in the file is treated as its own “document” (id), similar to how each sentence was its own document in the fruit example.
Real-world text needs a bit more cleanup than our fruit example. Stock tickers ($SPCX), numbers, and encoding artifacts (’) all need to be handled.
Add your own stopwords: Look at the word tibble in my tutorial. Find 1–2 words that feel too generic to be meaningful: words that are frequent only because they appear in most sentences, not because they signal something meaningful.
Add those words to the tibble(word = c()) line in the chunk below, then re-run only that chunk without publishing the document.
Which additional words did you add to the stopword list manually? Why does it make sense to remove those words as stopwords? Also, does the final network graph (see the last code chunk) change after removing these additional stopwords? If so, please briefly explain how and why.
# Extend the built-in stop word list with informal contractions and filler
custom_stop_words <- bind_rows(
stop_words,
tibble(word = c("im", "ive", "dont", "didnt", "isnt", "lol"), lexicon = "custom")
)
spacex_words <- spacex_text %>%
mutate(text = str_replace_all(text, "[^[:alnum:][:space:]$]", " ")) %>% # replace punctuation (except $) with spaces
unnest_tokens(word, text, token = "words") %>% # one lowercase word per row
filter(!str_detect(word, "^[0-9]+$")) %>% # drop pure numbers
anti_join(custom_stop_words, by = "word") # remove stop words
spacex_words %>% count(word, sort = TRUE) %>% head(15)
## # A tibble: 15 × 2
## word n
## <chr> <int>
## 1 buy 6
## 2 shares 6
## 3 spacex 6
## 4 stock 4
## 5 price 3
## 6 spcx 3
## 7 apply 2
## 8 bought 2
## 9 invest 2
## 10 massive 2
## 11 months 2
## 12 move 2
## 13 significantly 2
## 14 sold 2
## 15 term 2
A habit worth pointing out to students: instead of looping over each line to clean it, mutate() + str_replace_all() vectorizes the cleanup across every row at once, so no for loop is needed.
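The point about vectorization can be checked directly. The sketch below uses base R's gsub() as a stand-in for str_replace_all(), with the same character class as the chunk above, and compares a single vectorized call against an explicit loop. The three comments are shortened examples chosen for illustration.

```r
# Shortened example comments (illustrative only)
comments <- c(
  "I sold my $SPCX shares at $197.",
  "Below 100 i buy",
  "FOMO is real!"
)

# Same character class as the tutorial: keep letters, digits, spaces and "$"
pattern <- "[^[:alnum:][:space:]$]"

# Vectorized: one call cleans every element
cleaned_vec <- gsub(pattern, " ", comments)

# Loop: the same work, one element at a time
cleaned_loop <- character(length(comments))
for (i in seq_along(comments)) {
  cleaned_loop[i] <- gsub(pattern, " ", comments[i])
}

identical(cleaned_vec, cleaned_loop)
cleaned_vec
```

Both versions give the same result; the vectorized call is shorter and leaves less room for indexing mistakes.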
spacex_pairs <- spacex_words %>%
pairwise_count(word, id, sort = TRUE, upper = FALSE)
spacex_pairs %>% head(15)
## # A tibble: 15 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 shares price 3
## 2 buy sold 2
## 3 buy shares 2
## 4 spacex shares 2
## 5 spacex term 2
## 6 shares term 2
## 7 spacex price 2
## 8 term price 2
## 9 shares significantly 2
## 10 price significantly 2
## 11 spacex wait 2
## 12 spacex months 2
## 13 wait months 2
## 14 buy level 1
## 15 buy spacex 1
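Same idea as the fruit example, but now on real investor language. Because the SpaceX vocabulary is much larger than the fruit vocabulary, we’ll restrict the matrix to the 15 most frequent words so it stays readable. Note that slice_max() keeps ties by default, so words tied at the cutoff are all retained and the matrix below ends up with 16 words rather than exactly 15.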
top_words <- spacex_words %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 15) %>%
pull(word)
spacex_matrix <- spacex_pairs %>%
filter(item1 %in% top_words, item2 %in% top_words) %>%
bind_rows(spacex_pairs %>% rename(item1 = item2, item2 = item1) %>%
filter(item1 %in% top_words, item2 %in% top_words)) %>%
cast_sparse(item1, item2, n) %>%
as.matrix()
spacex_matrix
## price sold shares term significantly wait months spacex bought
## shares 3 1 0 2 2 1 1 2 0
## buy 1 2 2 0 1 0 0 1 1
## spacex 2 0 2 2 1 2 2 0 0
## term 2 0 2 0 1 1 1 2 0
## price 0 0 3 2 2 1 1 2 0
## wait 1 0 1 1 1 0 2 2 0
## bought 0 1 0 0 0 0 0 0 0
## sold 0 0 1 0 0 0 0 0 1
## apply 1 0 1 1 0 0 0 1 0
## massive 1 0 1 1 0 0 0 1 0
## stock 1 0 1 1 0 0 0 1 0
## move 0 1 0 0 0 0 0 0 1
## invest 0 0 0 0 0 1 1 1 0
## significantly 2 0 2 1 0 1 1 1 0
## months 1 0 1 1 1 2 0 2 0
## spcx 0 1 1 0 0 0 0 0 1
## move apply massive stock spcx invest buy
## shares 0 1 1 1 1 0 2
## buy 1 0 0 0 1 0 0
## spacex 0 1 1 1 0 1 1
## term 0 1 1 1 0 0 0
## price 0 1 1 1 0 0 1
## wait 0 0 0 0 0 1 0
## bought 1 0 0 0 1 0 1
## sold 1 0 0 0 1 0 2
## apply 0 0 1 1 0 0 0
## massive 0 1 0 1 0 0 0
## stock 0 1 1 0 0 0 0
## move 0 0 0 0 0 1 1
## invest 1 0 0 0 0 0 0
## significantly 0 0 0 0 0 0 1
## months 0 0 0 0 0 1 0
## spcx 0 0 0 0 0 0 1
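The matrix has 16 rows even though the code asks for 15 words, because several words are tied at the cutoff count. The base R sketch below imitates that tie-keeping behaviour with rank(); the word counts are hypothetical, and the names with_ties and without_ties are illustrative only.

```r
# Hypothetical word counts with a tie at the cutoff
counts <- c(buy = 6, shares = 6, price = 3, move = 2, sold = 2)
k <- 4  # ask for the "top 4"

# Keep every word whose rank is within the top k, ties included
ranks <- rank(-counts, ties.method = "min")
with_ties <- names(counts)[ranks <= k]

# Keep exactly k words, cutting arbitrarily within the tie
without_ties <- names(sort(counts, decreasing = TRUE))[seq_len(k)]

with_ties
without_ties
```

In dplyr, passing with_ties = FALSE to slice_max() gives the exact-k behaviour, at the cost of an arbitrary choice among tied words.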
Scanning a row of this matrix is itself a mini business insight — for example, looking at the “buy” row tells you immediately which other words investors most often used in the same breath as “buy,” without needing the graph at all. The network in the next step is simply a visual translation of this same matrix.
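Reading a row can be done in code as well as by eye. The sketch below copies the “buy” row from the printed matrix into a named vector (in the tutorial it would simply be spacex_matrix["buy", ]) and sorts its non-zero entries. The variable names are illustrative.

```r
# The "buy" row, transcribed from the printed matrix above
buy_row <- c(price = 1, sold = 2, shares = 2, term = 0, significantly = 1,
             wait = 0, months = 0, spacex = 1, bought = 1, move = 1,
             apply = 0, massive = 0, stock = 0, spcx = 1, invest = 0, buy = 0)

# Words that co-occur with "buy" at least once, most frequent first
buy_partners <- sort(buy_row[buy_row > 0], decreasing = TRUE)
buy_partners

# The strongest partners of "buy"
top_partners <- names(buy_row)[buy_row == max(buy_row)]
top_partners
```

In this sample the strongest partners of “buy” are “sold” and “shares”, each appearing with it in two comments.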
To keep the network readable, we’ll only keep pairs that occurred together more than once.
set.seed(580)
spacex_pairs %>%
filter(n > 1) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), color = "firebrick") +
geom_node_point(size = 4, color = "steelblue") +
geom_node_text(aes(label = name), repel = TRUE, size = 3.5) +
theme_void() +
labs(
title = "Word Co-occurrence Network: SpaceX Investor Comments",
subtitle = "Edge thickness = number of comments containing both words"
)
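How much the filter threshold matters can be estimated before drawing anything. The sketch below uses only the 15 pairs printed in the pair-count step (the full table has more rows, all with n = 1, so the counts for a threshold of 1 are understated) and reports how many edges and words survive each cutoff. Note that filter(n > 1) is the same as n >= 2 for whole-number counts.

```r
# The first 15 pairs, transcribed from the printed spacex_pairs output
pairs <- data.frame(
  item1 = c("shares", "buy", "buy", "spacex", "spacex", "shares", "spacex",
            "term", "shares", "price", "spacex", "spacex", "wait", "buy", "buy"),
  item2 = c("price", "sold", "shares", "shares", "term", "term", "price",
            "price", "significantly", "significantly", "wait", "months",
            "months", "level", "spacex"),
  n     = c(3, rep(2, 12), 1, 1),
  stringsAsFactors = FALSE
)

thresholds <- 1:3

# Number of edges with n >= threshold
edges_kept <- sapply(thresholds, function(k) sum(pairs$n >= k))

# Number of distinct words still attached to at least one edge
nodes_kept <- sapply(thresholds, function(k) {
  keep <- pairs$n >= k
  length(unique(c(pairs$item1[keep], pairs$item2[keep])))
})

data.frame(threshold = thresholds, edges = edges_kept, nodes = nodes_kept)
```

A cutoff that is too low clutters the graph, while one that is too high leaves almost nothing to draw, so a table like this is a quick way to pick a value worth plotting.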
Question 1 Which additional words did you add to the stopword list manually? Why does it make sense to remove those words as stopwords? Also, does the final network graph (see the last code chunk) change after removing these additional stopwords? If so, please briefly explain how and why.
Question 2 Change the cutoff in filter(n > 1) in step 5 to a different value. What is your final cutoff? Does it give you a better result?
Question 3 My BlueSky post (https://bsky.app/profile/did:plc:jbmmqoxfdgpoavycm7fz4k3r/post/3mphyccr2vc2g) shows two co-occurrence network graphs. Which one do you like better? Why?
Question 4 Which word pairs surprised you? Did “buy” cluster more with optimism words (“dip”) or caution words (“wait,” “valuation”)?
Question 5 Should we treat the stock ticker $SPCX and the company name SpaceX as the same token instead of separate ones?
Question 6 What’s the risk of drawing conclusions from a co-occurrence network built on only a small sample of comments? What is a good sample size? How would you communicate that limitation in a business report?
Question 7 Is there a way to analyze comments or reviews on topics that interest you in relation to a significant event? If so, how would you identify the event? Why is that event significant, and how might it influence the comments or reviews you analyze?
The fruit example exists purely to build intuition — once students can predict the network shape by eye, they’re ready to trust the same code on real text, where the patterns aren’t obvious in advance. The SpaceX example mirrors what a sentiment or brand-perception analysis would look like in practice: same five steps (tokenize, clean, pair, matrix, visualize), just messier inputs.
Boost Brand Credibility with Co Occurrence SEO. https://www.linkedin.com/posts/umer-abid-78045131a_co-occurrence-seo-is-a-powerful-concept-that-activity-7421759779892809728-MGh_/
https://seopressor.com/blog/why-co-citation-and-co-occurrence-are-such-big-deal/
Kong, J., Scott, A., & Goerg, G. M. (2016). Improving topic clustering on search queries with word co-occurrence and bipartite graph coclustering. https://research.google/pubs/improving-topic-clustering-on-search-queries-with-word-co-occurrence-and-bipartite-graph-co-clustering/
Colladon, A. F. (2018). The semantic brand score. Journal of Business Research, 88, 150-160.
Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/
Robinson, D. (2021). widyr: Widen, process, then re-tidy data [R package documentation]. https://CRAN.R-project.org/package=widyr
Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695. https://igraph.org
Pedersen, T. L. (2024). ggraph: An implementation of grammar of graphics for graphs and networks [R package documentation]. https://CRAN.R-project.org/package=ggraph