library(tidyverse)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
library(Matrix)
library(checkdown)
Word co-occurrence analysis looks at which words tend to show up together within the same unit of text (a sentence, tweet, headline, etc.). If two words appear together more often than we’d expect by chance, that tells us something about how people talk about a topic — which brands get paired with which adjectives, which features get paired with which products, and so on.
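The basic workflow is always the same: tokenize the text, clean it, count word pairs, build a co-occurrence matrix, and visualize the pairs as a network.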
We’ll walk through this twice today: first with a tiny, easy-to-follow fruit example, then with a real-world example using investor comments about SpaceX stock.
Before working with messy real-world text, it helps to see the mechanics on something simple where we already know what the answer should look like.
fruit_sentences <- tibble(
id = 1:5,
text = c(
"The apple was crisp, sweet, and red.",
"Bananas are soft, sweet, and yellow.",
"The lemon was sour, bright, and yellow.",
"Strawberries are sweet, red, and juicy.",
"The lime was sour, small, and green."
)
)
fruit_sentences
## # A tibble: 5 × 2
## id text
## <int> <chr>
## 1 1 The apple was crisp, sweet, and red.
## 2 2 Bananas are soft, sweet, and yellow.
## 3 3 The lemon was sour, bright, and yellow.
## 4 4 Strawberries are sweet, red, and juicy.
## 5 5 The lime was sour, small, and green.
Notice the pattern we built in on purpose: “sweet” keeps showing up with “red” and “yellow,” while “sour” keeps showing up with “yellow” and “green.” A good co-occurrence analysis should recover this pattern automatically, without us telling it the answer.
fruit_words <- fruit_sentences %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word")
fruit_words
## # A tibble: 19 × 2
## id word
## <int> <chr>
## 1 1 apple
## 2 1 crisp
## 3 1 sweet
## 4 1 red
## 5 2 bananas
## 6 2 soft
## 7 2 sweet
## 8 2 yellow
## 9 3 lemon
## 10 3 sour
## 11 3 bright
## 12 3 yellow
## 13 4 strawberries
## 14 4 sweet
## 15 4 red
## 16 4 juicy
## 17 5 lime
## 18 5 sour
## 19 5 green
fruit_pairs <- fruit_words %>%
pairwise_count(word, id, sort = TRUE, upper = FALSE)
fruit_pairs
## # A tibble: 26 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 sweet red 2
## 2 apple crisp 1
## 3 apple sweet 1
## 4 crisp sweet 1
## 5 apple red 1
## 6 crisp red 1
## 7 sweet bananas 1
## 8 sweet soft 1
## 9 bananas soft 1
## 10 sweet yellow 1
## # ℹ 16 more rows
pairwise_count() from the widyr package
counts, for every pair of words, how many sentences (id)
they appear in together. The upper = FALSE argument keeps
only one direction of each pair (apple–red instead of both apple–red and
red–apple) so the table is easier to read.
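To make the counting concrete, here is a minimal base R sketch of the idea behind pairwise_count(). It is not widyr's implementation, only an illustration under the assumption that each word counts once per sentence. The tokens are copied from the fruit_words output above; the names docs, pairs_per_doc, and pair_counts are made up for this sketch.

```r
# Tokens per sentence, copied from the fruit_words output above.
docs <- list(
  `1` = c("apple", "crisp", "sweet", "red"),
  `2` = c("bananas", "soft", "sweet", "yellow"),
  `3` = c("lemon", "sour", "bright", "yellow"),
  `4` = c("strawberries", "sweet", "red", "juicy"),
  `5` = c("lime", "sour", "green")
)

# For each sentence, list every unordered pair of distinct words.
# Sorting first means "red | sweet" and "sweet | red" share one key,
# which mirrors the effect of upper = FALSE.
pairs_per_doc <- lapply(docs, function(words) {
  combos <- combn(sort(unique(words)), 2)
  paste(combos[1, ], combos[2, ], sep = " | ")
})

# Tally how many sentences each pair appears in.
pair_counts <- sort(table(unlist(pairs_per_doc)), decreasing = TRUE)
head(pair_counts, 3)
length(pair_counts)  # number of distinct pairs
```

The only pair that reaches a count of 2 is red and sweet, which matches the first row of fruit_pairs.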
A network graph is really just a picture of a matrix
— a square table where every word is both a row and a column, and each
cell holds the number of times that row-word and column-word appeared
together. Building it explicitly makes the underlying structure concrete
before we let ggraph draw it for us.
fruit_matrix <- fruit_pairs %>%
bind_rows(fruit_pairs %>% rename(item1 = item2, item2 = item1)) %>% # mirror to fill both triangles
cast_sparse(item1, item2, n) %>%
as.matrix()
fruit_matrix
## red crisp sweet bananas soft yellow lemon sour bright strawberries
## sweet 2 1 0 1 1 1 0 0 0 1
## apple 1 1 1 0 0 0 0 0 0 0
## crisp 1 0 1 0 0 0 0 0 0 0
## bananas 0 0 1 0 1 1 0 0 0 0
## soft 0 0 1 1 0 1 0 0 0 0
## yellow 0 0 1 1 1 0 1 1 1 0
## lemon 0 0 0 0 0 1 0 1 1 0
## sour 0 0 0 0 0 1 1 0 1 0
## red 0 1 2 0 0 0 0 0 0 1
## strawberries 1 0 1 0 0 0 0 0 0 0
## lime 0 0 0 0 0 0 0 1 0 0
## bright 0 0 0 0 0 1 1 1 0 0
## juicy 1 0 1 0 0 0 0 0 0 1
## green 0 0 0 0 0 0 0 1 0 0
## juicy lime green apple
## sweet 1 0 0 1
## apple 0 0 0 0
## crisp 0 0 0 1
## bananas 0 0 0 0
## soft 0 0 0 0
## yellow 0 0 0 0
## lemon 0 0 0 0
## sour 0 1 1 0
## red 1 0 0 1
## strawberries 1 0 0 0
## lime 0 0 1 0
## bright 0 0 0 0
## juicy 0 0 0 0
## green 0 1 0 0
cast_sparse() (from tidytext) reshapes our
long pair-count table into a wide word-by-word matrix. We mirror the
pairs first so the matrix is symmetric — “red” by “sweet” should show
the same count as “sweet” by “red.” Reading across any row tells you, at
a glance, which words that term tends to appear with most often.
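The same matrix can also be reached without mirroring pairs. The sketch below is an alternative route, not what cast_sparse() does internally: it builds a sentence-by-word incidence matrix in base R and multiplies it by its own transpose. The tokens are copied from the fruit_words output above; fruit_tokens, incidence, and cooc are names chosen for this example.

```r
# Tokens copied from the fruit_words output above (sentence id, word).
fruit_tokens <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5),
  word = c("apple", "crisp", "sweet", "red",
           "bananas", "soft", "sweet", "yellow",
           "lemon", "sour", "bright", "yellow",
           "strawberries", "sweet", "red", "juicy",
           "lime", "sour", "green"),
  stringsAsFactors = FALSE
)

ids <- sort(unique(fruit_tokens$id))
words <- sort(unique(fruit_tokens$word))

# Incidence matrix: one row per sentence, one column per word,
# 1 where the word occurs in the sentence and 0 elsewhere.
incidence <- matrix(0, nrow = length(ids), ncol = length(words),
                    dimnames = list(ids, words))
incidence[cbind(match(fruit_tokens$id, ids),
                match(fruit_tokens$word, words))] <- 1

# t(incidence) %*% incidence counts, for each pair of words, the
# sentences that contain both.
cooc <- crossprod(incidence)

# The diagonal holds each word's own sentence count; zero it out so the
# matrix only describes pairs of different words.
diag(cooc) <- 0

cooc["sweet", c("red", "yellow", "green")]
all(cooc == t(cooc))  # symmetric by construction
```

Because the product is symmetric by construction, rows and columns come out in the same order, which can make the matrix easier to scan than the cast_sparse() output.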
set.seed(580)
fruit_pairs %>%
filter(n >= 1) %>%
# n >= 1 keeps every pair in this tiny example; raise the threshold on larger corpora
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), color = "steelblue") +
geom_node_point(size = 5, color = "darkorange") +
geom_node_text(aes(label = name), repel = TRUE, size = 4) +
theme_void() +
labs(title = "Word Co-occurrence Network: Fruit Descriptions")
Even with five short sentences, you should already see “sweet” clustering near “red” and “yellow,” while “sour” clusters near “yellow” and “green.” That’s the core idea — now we apply it to something messier.
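Before moving on, recall that the introduction framed co-occurrence as words appearing together more often than chance would predict, while the counts above are raw. One common way to compare the two is a lift ratio: the observed share of sentences containing both words divided by the share expected if the words were independent. This is a sketch that is not part of the original walkthrough; the lift() helper is hypothetical, and the sentence counts are read off the fruit_words output above.

```r
# Hypothetical helper: observed joint share divided by the share expected
# under independence. Values above 1 mean "together more than chance".
lift <- function(n_both, n_a, n_b, n_docs) {
  observed <- n_both / n_docs
  expected <- (n_a / n_docs) * (n_b / n_docs)
  observed / expected
}

n_docs <- 5  # five fruit sentences

# Sentence counts from fruit_words: sweet = 3, red = 2, yellow = 2,
# sour = 2, green = 1.
lift(n_both = 2, n_a = 3, n_b = 2, n_docs = n_docs)  # sweet and red
lift(n_both = 1, n_a = 3, n_b = 2, n_docs = n_docs)  # sweet and yellow
lift(n_both = 1, n_a = 2, n_b = 1, n_docs = n_docs)  # sour and green
```

With only five sentences these ratios are very noisy, so treat them as an illustration of the calculation rather than a finding.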
This example uses a sample of investor comments about SpaceX stock
($SPCX), the kind of unstructured text you’d pull from a
forum, X/Twitter thread, or comment section. The goal is to see which
words investors tend to use together — for example, does “buy”
pair more with “valuation” or with “dip”? Does “lockup” pair with
“shares” or with “Musk”?
spacex_lines <- readLines("spacex_cooccurence.txt", encoding = "UTF-8")
spacex_text <- tibble(
id = seq_along(spacex_lines),
text = spacex_lines
)
spacex_text
## # A tibble: 13 × 2
## id text
## <int> <chr>
## 1 1 My buy level for spacex is 50-60
## 2 2 Everyone of these experts said don't buy. I did. I bought and sold. It…
## 3 3 Below 100 i buy
## 4 4 Typical lockups don’t apply to SpaceX. It has a unique, multi-phase…
## 5 5 I sold my $SPCX shares at $197. Once it crashes to ~ $75, I will buy a…
## 6 6 My personal high-conviction targets for the TOP 15 Digital Assets of 2…
## 7 7 I'd rather miss the move than invest at current valuations lol
## 8 8 Buy one to two shares in the beginning while all the hype continues so…
## 9 9 I want to invest in spacex but not at this valuation. Will wait 6 mont…
## 10 10 Nothing is new here. I expect the SpaceX price to drop significantly i…
## 11 11 TSLA and SPCX Merger in a couple of years.
## 12 12 I bought 5 Calls yesterdays dip. FOMO is real. I wouldnt be surprise i…
## 13 13 I will be DCAing into this stock for the next 15 years! Its already be…
Each line in the file is treated as its own “document” (id), similar to how each sentence was its own document in the fruit example.
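If you do not have the comment file at hand, the line-per-document idea can be tried on a stand-in. The sketch below writes three comments (copied from the output above) to a temporary file and reads them back; sample_path, sample_lines, and sample_text are names invented for this example, and a plain data.frame stands in for the tibble.

```r
# Hypothetical stand-in for the real comment file: three comments copied
# from the output above, written to a temporary file.
sample_path <- tempfile(fileext = ".txt")
writeLines(
  c("My buy level for spacex is 50-60",
    "Below 100 i buy",
    "TSLA and SPCX Merger in a couple of years."),
  sample_path
)

# One line in, one row out: the line's position becomes the document id.
sample_lines <- readLines(sample_path, encoding = "UTF-8")
sample_text <- data.frame(
  id = seq_along(sample_lines),
  text = sample_lines,
  stringsAsFactors = FALSE
)
sample_text
```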
Real-world text needs a bit more cleanup than our fruit example:
stock tickers ($SPCX), numbers, and encoding artifacts such as
curly apostrophes (’) all need to be handled.
custom_stop_words <- bind_rows(
stop_words,
tibble(word = c("im", "ive", "dont", "didnt", "isnt", "lol"), lexicon = "custom")
)
spacex_words <- spacex_text %>%
mutate(text = str_replace_all(text, "[^[:alnum:][:space:]$]", " ")) %>%
unnest_tokens(word, text, token = "words") %>%
filter(!str_detect(word, "^[0-9]+$")) %>% # drop pure numbers
anti_join(custom_stop_words, by = "word")
spacex_words %>% count(word, sort = TRUE) %>% head(15)
## # A tibble: 15 × 2
## word n
## <chr> <int>
## 1 buy 6
## 2 shares 6
## 3 spacex 6
## 4 stock 4
## 5 price 3
## 6 spcx 3
## 7 apply 2
## 8 bought 2
## 9 invest 2
## 10 massive 2
## 11 months 2
## 12 move 2
## 13 significantly 2
## 14 sold 2
## 15 term 2
A vectorization habit worth pointing out to students:
instead of looping over each line to clean it, mutate() +
str_replace_all() applies the cleanup to every row
at once, so no for loop is needed.
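To see the cleanup pattern at work, the sketch below applies the same character class to the first sentences of two comments shown earlier. It uses base R's gsub() as a stand-in for str_replace_all(); both are vectorized over their input. The names comments, cleaned, and looped are made up for this example.

```r
# First sentences of two comments from the sample; the second contains a
# curly apostrophe (written as \u2019 to keep this script ASCII-only).
comments <- c(
  "I sold my $SPCX shares at $197.",
  "Typical lockups don\u2019t apply to SpaceX."
)

# Same character class as the tutorial's pattern: keep letters, digits,
# whitespace, and "$"; replace every other character with a space.
pattern <- "[^[:alnum:][:space:]$]"

# Vectorized: one call cleans every element.
cleaned <- gsub(pattern, " ", comments)
cleaned

# The loop version gives the same result with more code.
looped <- character(length(comments))
for (i in seq_along(comments)) {
  looped[i] <- gsub(pattern, " ", comments[i])
}
all(looped == cleaned)
```

One side effect to be aware of: because the apostrophe becomes a space, a contraction is split into two pieces, which tokenize separately. Entries such as "dont" in the custom stop list therefore match only comments typed without an apostrophe. Deleting apostrophes instead of replacing them with a space would be one possible variation.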
spacex_pairs <- spacex_words %>%
pairwise_count(word, id, sort = TRUE, upper = FALSE)
spacex_pairs %>% head(15)
## # A tibble: 15 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 shares price 3
## 2 buy sold 2
## 3 buy shares 2
## 4 spacex shares 2
## 5 spacex term 2
## 6 shares term 2
## 7 spacex price 2
## 8 term price 2
## 9 shares significantly 2
## 10 price significantly 2
## 11 spacex wait 2
## 12 spacex months 2
## 13 wait months 2
## 14 buy level 1
## 15 buy spacex 1
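Same idea as the fruit example, but now on real investor language. Because the SpaceX vocabulary is much larger than the fruit vocabulary, we’ll restrict the matrix to the 15 most frequent words so it stays readable. Note that slice_max() keeps ties by default, so words tied with the 15th-ranked word are kept as well; that is why the matrix below has 16 rows rather than 15.
top_words <- spacex_words %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 15) %>% # with_ties = TRUE by default, so tied words are kept
pull(word)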
spacex_matrix <- spacex_pairs %>%
filter(item1 %in% top_words, item2 %in% top_words) %>%
bind_rows(spacex_pairs %>% rename(item1 = item2, item2 = item1) %>%
filter(item1 %in% top_words, item2 %in% top_words)) %>%
cast_sparse(item1, item2, n) %>%
as.matrix()
spacex_matrix
## price sold shares term significantly wait months spacex bought
## shares 3 1 0 2 2 1 1 2 0
## buy 1 2 2 0 1 0 0 1 1
## spacex 2 0 2 2 1 2 2 0 0
## term 2 0 2 0 1 1 1 2 0
## price 0 0 3 2 2 1 1 2 0
## wait 1 0 1 1 1 0 2 2 0
## bought 0 1 0 0 0 0 0 0 0
## sold 0 0 1 0 0 0 0 0 1
## apply 1 0 1 1 0 0 0 1 0
## massive 1 0 1 1 0 0 0 1 0
## stock 1 0 1 1 0 0 0 1 0
## move 0 1 0 0 0 0 0 0 1
## invest 0 0 0 0 0 1 1 1 0
## significantly 2 0 2 1 0 1 1 1 0
## months 1 0 1 1 1 2 0 2 0
## spcx 0 1 1 0 0 0 0 0 1
## move apply massive stock spcx invest buy
## shares 0 1 1 1 1 0 2
## buy 1 0 0 0 1 0 0
## spacex 0 1 1 1 0 1 1
## term 0 1 1 1 0 0 0
## price 0 1 1 1 0 0 1
## wait 0 0 0 0 0 1 0
## bought 1 0 0 0 1 0 1
## sold 1 0 0 0 1 0 2
## apply 0 0 1 1 0 0 0
## massive 0 1 0 1 0 0 0
## stock 0 1 1 0 0 0 0
## move 0 0 0 0 0 1 1
## invest 1 0 0 0 0 0 0
## significantly 0 0 0 0 0 0 1
## months 0 0 0 0 0 1 0
## spcx 0 0 0 0 0 0 1
Scanning a row of this matrix is itself a mini business insight — for example, looking at the “buy” row tells you immediately which other words investors most often used in the same breath as “buy,” without needing the graph at all. The network in the next step is simply a visual translation of this same matrix.
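Reading a row by eye works, but sorting it is quicker. The sketch below copies the "buy" row from the matrix output above into a named vector and ranks its non-zero entries; buy_row and partners are names chosen for this example.

```r
# The "buy" row, copied from the spacex_matrix output above.
buy_row <- c(
  price = 1, sold = 2, shares = 2, term = 0, significantly = 1,
  wait = 0, months = 0, spacex = 1, bought = 1, move = 1,
  apply = 0, massive = 0, stock = 0, spcx = 1, invest = 0, buy = 0
)

# Drop words that never co-occur with "buy", then rank the rest.
partners <- sort(buy_row[buy_row > 0], decreasing = TRUE)
partners

# With the full matrix in memory, the same ranking would come from
# sort(spacex_matrix["buy", ], decreasing = TRUE).
```

In this sample, "sold" and "shares" are the strongest partners of "buy", each sharing two comments with it.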
To keep the network readable, we’ll only keep pairs that occurred together more than once.
set.seed(580)
spacex_pairs %>%
filter(n > 1) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), color = "firebrick") +
geom_node_point(size = 4, color = "steelblue") +
geom_node_text(aes(label = name), repel = TRUE, size = 3.5) +
theme_void() +
labs(
title = "Word Co-occurrence Network: SpaceX Investor Comments",
subtitle = "Edge thickness = number of comments containing both words"
)
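A quick way to anticipate which words will sit at the center of this network is to count how many edges touch each word after the filter. The sketch below does that in base R, using the thirteen pairs with n > 1 copied from the spacex_pairs output above; edges and degree are names chosen for this example, and igraph is not needed for this step.

```r
# Pairs with n > 1, copied from the spacex_pairs output above.
edges <- data.frame(
  item1 = c("shares", "buy", "buy", "spacex", "spacex", "shares", "spacex",
            "term", "shares", "price", "spacex", "spacex", "wait"),
  item2 = c("price", "sold", "shares", "shares", "term", "term", "price",
            "price", "significantly", "significantly", "wait", "months",
            "months"),
  n = c(3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Degree: the number of distinct partners each word keeps after filtering.
degree <- sort(table(c(edges$item1, edges$item2)), decreasing = TRUE)
degree
```

Here "shares" and "spacex" each keep five partners, so they are likely to appear as hubs, while "sold" hangs off the graph through a single link to "buy".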
$SPCX and SpaceX as the same token instead of separate ones?
The fruit example exists purely to build intuition: once students can predict the network shape by eye, they’re ready to trust the same code on real text, where the patterns aren’t obvious in advance. The SpaceX example mirrors what a sentiment or brand-perception analysis would look like in practice: the same five steps (tokenize, clean, pair, matrix, visualize), just with messier inputs.
Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/
Robinson, D. (2021). widyr: Widen, process, then re-tidy data [R package documentation]. https://CRAN.R-project.org/package=widyr
Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695. https://igraph.org
Pedersen, T. L. (2024). ggraph: An implementation of grammar of graphics for graphs and networks [R package documentation]. https://CRAN.R-project.org/package=ggraph
Moroz, G. (2023). checkdown: Check-fields and check-boxes for rmarkdown [R package documentation]. https://CRAN.R-project.org/package=checkdown